Perfectly Readable C++ | kojobailey.me

C++ is full of weird features, quirks, and historical artefacts. This is in part due to its desire for classic C code to be allowed under the C++ standard ever since it first released. It is also due to the fact that C++ was first published in 1985, and has naturally been filled with all sorts of features, good and bad, over the past 40+ years.

As someone interested in programming language design, I’ve taken an interest in some of these features, eager to see just how hilariously difficult to read C++ code can be.

In order to explore some of these features in this article, I present the following:

%:import "bits/stdc++.h"
auto main(register int(C), const char*(_)<::>) noexcept -> int {
  return operator<<(std::wclog, *(C ^= 074144, NULL<:_:> + 9))
    << (typeof(*_))new volatile const signed long long<%C++%>, not true;
}

I wrote this code while playing around with compiler.gg which uses Godbolt’s Compiler Explorer. This code depends at least partly on being compiled in this way, so if you want to test it yourself, do so via this link with the “x86-64 gcc 15.2” compiler.

Now, what this code does is simple: It outputs “sex” (as in “sextet”) to the console.

How it goes about it… is less simple, and is what I will attempt to explain in the rest of this article! As I do so though, I invite you to also try and figure things out yourself to test your understanding of C and C++!

Digraphs

https://en.cppreference.com/w/cpp/language/operator_alternative.html

Right away, we are met with this strange %: pair of symbols, and similar symbols like <: :> and <% %> can be seen later in the code as well.

Such symbols are called digraphs - as in “double symbol” - and act as alternative ways to write other symbols for keyboards/systems that do not support ASCII. In particular, this is to comply with the ISO 646 7-bit character set, which lacks common symbols you may be familiar with as a programmer.

C’s digraphs include:

%: = #
<: = [
:> = ]
<% = {
%> = }

Furthermore, rather than use a digraph for the not operator, !, C includes the not keyword, which can be see towards the end of this code with not true.

Therefore, replacing all these alternative representations with their “normal” ASCII counterparts already helps to make the code a lot more readable:

#import "bits/stdc++.h"
auto main(register int(C), const char*(_)[]) noexcept -> int {
  return operator<<(std::wclog, *(C ^= 074144, NULL[_] + 9))
    << (typeof(*_))new volatile const signed long long{C++}, !true;
}

Bonus: Trigraphs

C++ also used to supported trigraphs up until C++17. These functioned similar to digraphs, except were parsed before both comments and string literals, leading to some funky situations such as "Enter date ??/??/??" becoming "Enter date \\??", hence their eventual removal.

Some of these trigraphs were:

??< = {
??> = }
??( = [
??) = ]
??= = #
??/ = \

That being said, GCC includes trigraphs as an opt-in feature via the -trigraphs flag, and I must admit - I’m tempted to use them for pure style.

??=include<stdio.h>

`#import`

https://gcc.gnu.org/onlinedocs/cpp/Alternatives-to-Wrapper-_0023ifndef.html

Those who have been keeping up with C++20 and beyond should recognise the import keyword, used to import modules as an alternative to including header files.

However, #import is not the same, and goes way back to the 80s/90s when GCC implemented support for Objective-C. It acts as an alternative to #include, ensuring that a header is only included in a project once. Although, the feature was later deprecated as it became apparent that it’s better that the header file is responsible for whether or not it is included multiple times, hence developers are instead expected to either use #ifndef guards (generally recommended) or #pragma once (not standard) - or soon, perhaps, modules!

In the case of this code (and GCC compiler), there is only one file involved anyway, so #import is synonymous with #include:

#include "bits/stdc++.h"

auto main(register int(C), const char*(_)[]) noexcept -> int {
  return operator<<(std::wclog, *(C ^= 074144, NULL[_] + 9))
    << (typeof(*_))new volatile const signed long long{C++}, !true;
}

`bits/stdc++.h`

https://stackoverflow.com/questions/25311011/how-does-include-bits-stdc-h-work-in-c

bits/stdc++.h is essentially a header that includes all the C and C++ headers, useful for quick testing, and especially useful for precompiled headers. Otherwise, though, it expectedly slows down compilation, and we’d better of just using the headers we actually need. In our case, we only use iostream to print to the console, so we can use that instead:

#include "iostream"

auto main(register int(C), const char*(_)[]) noexcept -> int {
  return operator<<(std::wclog, *(C ^= 074144, NULL[_] + 9))
    << (typeof(*_))new volatile const signed long long{C++}, !true;
}

Moreover, we can replace the "" around iostream with the more typical <> as the only real difference in this case is expressing intention, and we want to express that iostream is an external header that is not part of our own project’s source code:

#include <iostream>

auto main(register int(C), const char*(_)[]) noexcept -> int {
  return operator<<(std::wclog, *(C ^= 074144, NULL[_] + 9))
    << (typeof(*_))new volatile const signed long long{C++}, !true;
}

Trailing return type

https://en.wikipedia.org/wiki/Trailing_return_type

Since C++11, trailing return types have been supported, which can make functions more readable in some cases via the syntax auto foo() -> T {}, reminiscent of Rust.

fn foo(x: u32) -> u32 { /* ... */ }

Generally-speaking, trailing return types have 3 main uses:

Return type with decltype in C++11, before decltype(auto) was around (C++14):

template<typename A, typename B>
auto add(A a, B b) -> decltype(a + b);

Lambda expressions (also introduced in C++11).

auto f = [](int x) -> double { return x; };

Readability of more complex types.

template<std::integral T>
std::expected<BinaryView::Error, T> BinaryView::read(std::endian endian);

template<std::integral T>
auto BinaryView::read(std::endian endian)
    -> std::expected<T, BinaryView::Error>;

In the case of main, the return type is just int, and so using a trailing return type is nothing but stylistic choice. Although I personally don’t enforce trailing return types in my code unless I’m using something verbose like std::expected, I do see why some programmers would and do choose to consistently use them for every function.

#include <iostream>

int main(register int(C), const char*(_)[]) noexcept {
  return operator<<(std::wclog, *(C ^= 074144, NULL[_] + 9))
    << (typeof(*_))new volatile const signed long long{C++}, !true;
}

`noexcept`

https://en.cppreference.com/w/cpp/language/noexcept_spec.html

In short, noexcept as a specifier tells the compiler that a function will not attempt to propagate exceptions outwards - i.e. any exceptions that are propagated into the function are caught. This enables the compiler to make some optimisations as a result, but if the promise is broken, std::terminate is called instead.

So is there any point in tagging main with noexcept? Well, no, because main will call std::terminate if it has a propagating exception anyway. Therefore, it isn’t useful to include it here:

#include <iostream>

int main(register int(C), const char*(_)[]) {
  return operator<<(std::wclog, *(C ^= 074144, NULL[_] + 9))
    << (typeof(*_))new volatile const signed long long{C++}, !true;
}

`register`

https://en.cppreference.com/w/cpp/language/storage_duration.html

The register keyword is an artefact of a time when compilers weren’t as intelligent as they are now, and the first time I ever saw it used was in the DOOM codebase of all places.

What it does is signal to the compiler that a declared variable is going to be heavily used, and should therefore be stored directly on a CPU register, as registers are significantly faster to access than RAM.

However, the keyword was deprecated as of C++11 and outright removed in C++17, as compilers are allowed to ignore this hint if needed, and generally can decide when a variable should be in a register more reliably than a humann can. Therefore, we have no need for it here:

#include <iostream>

int main(int(C), const char*(_)[]) {
  return operator<<(std::wclog, *(C ^= 074144, NULL[_] + 9))
    << (typeof(*_))new volatile const signed long long{C++}, !true;
}

Parenthesised declarators

https://en.cppreference.com/w/cpp/language/declarations.html

C++ supports parenthesised declarators to help with disambiguation in certain contexts, although we are also free to use them in other scenarios too. For instance, the following code is valid C++:

int((((((((a))))))))=5;

Cases where they’re actually useful include:

Function pointers

int *f();   // A function that returns int*
int (*f)(); // A pointer to a function that returns int

Array pointers

int *a[23];   // An array of 23 int*'s
int (*a)[23]; // A pointer to an array of 23 int's

There’s also this funky case where the use of a constructor can be confused for a function declaration:

Foo bar(Baz());   // Could be a function that has a parameter for a function that returns Baz
Foo bar((Baz())); // Unambiguous - simply takes the result of the function call Baz() in Foo's constructor

But C++11 added curly-brace initialisation for this exact reason, which I use extensively:

Foo bar{Baz()};

Anyway, in the context of our code, we don’t need this disambiguation. The first parameter of main is simply an int, and the second is a char* array, char**. Therefore, we can remove the parentheses for readability:

#include <iostream>

int main(int C, const char* _[]) {
  return operator<<(std::wclog, *(C ^= 074144, NULL[_] + 9))
    << (typeof(*_))new volatile const signed long long{C++}, !true;
}

Underscore identifier

https://en.cppreference.com/w/cpp/language/identifiers.html

Ignoring funky unicode stuff, C++ identfiers such as variable names can me made up of uppercase and lowercase Latin letters (A-Z and a-Z), underscores, and the digits 0-9 except for as the first character. There are also reserved identifiers, like with keywords and patterns like _Foo and __bar. However, a single underscore, _, rather than be reserved for wildcard syntax as seen in Haskell and Rust, manages to somehow slip through as a valid unique identifer. Therefore, we’re able to use _ as a variable name.

C++ function type signatures are independent of the names given to the function’s parameters, hence we are able to change the typical argc to just C (in order to have C++ later in the code!) and argv to _. Changing these names back to normal, we get:

#include <iostream>

int main(int argc, const char* argv[]) {
  return operator<<(std::wclog, *(argc ^= 074144, NULL[argv] + 9))
    << (typeof(*argv))new volatile const signed long long{argc++}, !true;
}

It is also worth noting that const can be used here without any issues as well. In fact, it is the first obfuscation so far that can actually be argued as idomatic C++! Even if it is unusual. As such, I will leave it in for this article.

Operator overloading

https://en.cppreference.com/w/cpp/language/operators.html

Anyone who’s learned C++ should remember the first time they learned how to print to the terminal, using the << operator as if constructing a conveyor belt for which to stream content into std::cout. Hopefully, with std::println coming with C++23 and improved in C++26, newcomers to C++ will never have to face this atrocity ever again - a textbook example of operator overloading gone wrong. At least, most would likely agree that overloading an operator to do something completely different from what it was originally intended to do is not a great idea, and we’d be better of defining new operators instead (>>= anyone?).

<< is originally the bitwise left shift operator, used for… well, shifting bits. The ostream header overloads this with its own custom behaviour.

A cool quirk of operator overloading, however, is that it necessitates some way to… overload operators, and C++ does that my making it possible to define operators as functions, their symbols prefixed with operator, like operator+ and operator<<.

This also means that when such an overload is defined, it can be called in this way too, hence the following code snippets are equivalent:

std::cout << "Hello, world!" << std::endl;

operator<<(std::cout, "Hello, world!").operator<<(std::endl);

With this mind, we can once again make our code significantly easier to read:

#include <iostream>

int main(int argc, const char* argv[]) {
  return std::wclog << *(argc ^= 074144, NULL[argv] + 9)
    << (typeof(*argv))new volatile const signed long long{argc++}, !true;
}

`std::wclog`

https://en.cppreference.com/w/cpp/io/clog.html

But hold on, what is std::wclog? Shouldn’t it be std::cout for writing to standard output?

First, std::clog is similar to std::cerr in that they both write to standard error (stderr). However, unlike std::cerr, std::clog is buffered, meaning it is not flushed immediately, which is important where performant logging matters.

std::wclog is just the wide char version of this. Typically, a wide char is more than 1 byte. However, std::wcout and its variants all implicitly convert char* to wchar_t*. Therefore, both these code snippets have the same visual result when a wide char is 4 bytes:

static_assert(sizeof(wchar_t) == 4);

std::wcout << (wchar_t*)"b\0\0\0a\0\0\0z\0\0\0";

std::wcout << "baz";

Now, although it’s not immediately obvious that our code uses normal char* as input to std::wclog, we can see that (typeof(*argv)) is used as a cast (more on that later), which should evaluate to (char*). Therefore, it’s fairly safe to assume that we can just use std::cout instead, and we’ll replace the cast while we’re at it:

#include <iostream>

int main(int argc, const char* argv[]) {
  return std::cout << *(argc ^= 074144, NULL[argv] + 9)
    << (char*)new volatile const signed long long{argc++}, !true;
}

Comma operator

https://en.cppreference.com/w/cpp/language/operator_other.html

One of my absolute favourite features of C is the comma operator, which I imagine many programmers aren’t aware exists. What I find so fascinating about it is that it is technically useful - potentially really useful in some contexts - but is nevertheless generally bad to use.

Not to be confused with the use of commas in other C/C++ contexts (e.g. lists), the comma operator forms an expression in which the left-hand argument(s) are evaluated with their results discarded, while the right-most argument is what the expression returns. Some seemingly useful applications of this include:

Multiple varable mutations in for loops:

// Comma operator used             ↓ here only!  
for (int i = 0, j = 10; i <= j; ++i, --j)
    std::cout << "i = " << i << " j = " << j << '\n';

Makeshift lambda expressions (not really):

int a = (a = 5, a *= 3, a += 4, a %= 8, a ^= 23);

Logging in return statements:

return LOG_ERROR("Some error."), -1;

In our code, we are incidentally making use of that last point, logging before returning !true. If we write this out as two separate lines, we get something more conventional and readable:

#include <iostream>

int main(int argc, const char* argv[]) {
  std::cout << *(argc ^= 074144, NULL[argv] + 9)
    << (char*)new volatile const signed long long{argc++};
  return !true;
}

But we’re not done yet! There’s another use of the operator within the dereference we have going on. Extracting this out to yet another line, things should be starting to make more sense:

#include <iostream>

int main(int argc, const char* argv[]) {
  argc ^= 074144;
  std::cout << *(NULL[argv] + 9)
    << (char*)new volatile const signed long long{argc++};
  return !true;
}

Implicit casting

https://en.cppreference.com/w/cpp/language/implicit_cast.html

Love it or hate it, C++ is notorious for implicit type conversion, so it’s pretty much expected that it’d turn up somewhere in this code here - and sure enough, it’s easy to spot.

The function main has a return type of int, but we are attempting to return !true, so naturally, it must be that this bool expression is being implicitly converted to type int. In particular, !true == false, and false as an integer is simply 0. Therefore, we can write:

#include <iostream>

int main(int argc, const char* argv[]) {
  argc ^= 074144;
  std::cout << *(NULL[argv] + 9)
    << (char*)new volatile const signed long long{argc++};
  return 0;
}

Array subscript commutativity

https://learn.microsoft.com/en-us/cpp/cpp/subscript-operator?view=msvc-170

The subscript operator, [], is defined such that a[b] == *(a + b). Because of how pointer arithmeic works (more on that in a bit), this means that a[b] == b[a], and the expression NULL[argv] is equivalent to argv[NULL].

#include <iostream>

int main(int argc, const char* argv[]) {
  argc ^= 074144;
  std::cout << *(argv[NULL] + 9)
    << (char*)new volatile const signed long long{argc++};
  return 0;
}

`NULL`

https://en.cppreference.com/w/c/types/NULL.html

But what even is NULL? In C, it is typically defined as a null pointer, but this is implementation defined.

Importantly, NULL can be the constant 0 or, as off C23, nullptr in order to remain compatible with C++. To be compatible with C++, it cannot be (void*)0, nor can it be an expression that evaluates to 0, like 10 * 2 - 20.

GCC defines NULL as its own magic keyword, __null and is supposed to be a null pointer. Therefore, the GCC compiler generates the “converting NULL to non-pointer type” warning when attempting to index an array with it. As such, we’d best just use 0 for clarity’s sake:

#include <iostream>

int main(int argc, const char* argv[]) {
  argc ^= 074144;
  std::cout << *(argv[0] + 9)
    << (char*)new volatile const signed long long{argc++};
  return 0;
}

Pointer arithmetic

https://www.learncpp.com/cpp-tutorial/pointer-arithmetic-and-subscripting/

As mentioned previously, the subscript operator expands to pointer arithmetic, such that a[b] == *(a + b).

However, consider the following:

int a[] = {0, 1, 2, 3, 4, 5, 6, 7};
int b = *(a + 4); // Equivalent to a[4]

It’s important to note that a + 4 does not simply mean adding 4 to the address of a. Rather, it is adding 4 * sizeof(a[0]), which in this example evaluates to 4 * sizeof(int) == 4 * 4 == 16. Thus, *(a + 4) gives the value 4 from the array.

Therefore, our code *(argv[0] + 9) is equivalent to argv[0][9]:

#include <iostream>

int main(int argc, const char* argv[]) {
  argc ^= 074144;
  std::cout << argv[0][9]
    << (char*)new volatile const signed long long{argc++};
  return 0;
}

And since argv is a char** (i.e., an array of char arrays, i.e., an array of null-terminated strings), argv[0][9] is the 10th character of the 1st string in argv.

At this point, it’s useful to actually see what this is, because the Godbolt’s Compiler Explorer does in fact feed an argument to the program, and that is “./output.s”. Do some basic counting, and we can see that the 10th character in this string is indeed 's'. Let’s make all of this more clear in our code:

#include <iostream>

int main(int argc, const char* argv[]) {
  argc ^= 074144;
  const char* input = argv[0];
  assert(input[9] == 's');
  std::cout << input[9]
    << (char*)new volatile const signed long long{argc++};
  return 0;
}

Octal

https://en.cppreference.com/w/cpp/language/integer_literal.html

Back in the 1970s, octal encoding was all the rage - that is, the base 8 numeral system (e.g., 472 = 4 * 8^2 + 7 * 8 + 2 = 314).

Unfortunately, this meant that unlike binary literals which are prefixed with 0b or hexadecimal literals which are prefixed with 0x, octal literals are only prefixed with 0 in C and by extension C++.

So really, 074144 in our everyday base-10 is 30820, which we can replace here:

#include <iostream>

int main(int argc, const char* argv[]) {
  argc ^= 30820;
  const char* input = argv[0];
  assert(input[9] == 's');
  std::cout << input[9]
    << (char*)new volatile const signed long long{argc++};
  return 0;
}

XOR

https://en.wikipedia.org/wiki/Exclusive_or

^ is C’s exclusive or bitwise operator, and essentially takes the bits of two values and compares their bits in pairs, individually setting the resultant bits to 1 if the pairs do not match, and 0 otherwise.

In our code, argc starts as 1 (because there is one argument given to the program) and then is XOR’d with 30820. To deobfuscate this, let’s just compute the result: 30821. It just so happens that the use of ^= gives the same result as if we had used +=.

#include <iostream>

int main(int argc, const char* argv[]) {
  argc = 30821;
  const char* input = argv[0];
  assert(input[9] == 's');
  std::cout << input[9]
    << (char*)new volatile const signed long long{argc++};
  return 0;
}

Post-increment operator

https://en.cppreference.com/w/cpp/language/operator_incdec.html

Later in our code, we also see that argc is being incremented via the post-increment operator (argc++), the operator responsible for C++ being called what it is today.

However, although not immediately obvious, this operator doesn’t actually do anything in this context. This is because we are passing argc++ into the constructor of long long, meaning a new long long is constructed and argc is only incremented after the fact - unlike if ++argc was used, which increments it before argc is accessed.

Therefore, since argc isn’t used used after this, the operation serves no purpose and can be removed.

#include <iostream>

int main(int argc, const char* argv[]) {
  argc = 30821;
  const char* input = argv[0];
  assert(input[9] == 's');
  std::cout << input[9]
    << (char*)new volatile const signed long long{argc};
  return 0;
}

`new`

https://en.cppreference.com/w/cpp/language/new.html

Now we’re very close to deciphering this completely!

It is usually advised to new C++ programmers to avoid using new as much as possible, especially for programmers coming from OOP languages like Java which use it extensively. This is because it allocates dynamic memory, unlike normal pointers which use automatic memory management, and dynamic memory is computationally expensive!

However, a C++ new expression also offers a way to get the address of some memory without needing to use the & operator since it returns the address of the memory it allocates, and that’s exactly what’s happening here. Therefore, since we would prefer not to use dynamic allocation anyway, let’s just use the & to get the address of the existing variable argc instead of duplicating its data:

#include <iostream>

int main(int argc, const char* argv[]) {
  argc = 30821;
  const char* input = argv[0];
  assert(input[9] == 's');
  std::cout << input[9] << (char*)&argc;
  return 0;
}

You can see that I’ve also removed the volatile const signed long long, and that’s because it is just fluff, as will become evident in our final step…

String in disguise

https://medium.com/@ryan_forrester_/c-integer-to-character-conversion-a-comprehensive-guide-ac8bf9e5940a

Finally, we are down to the last layer of obfuscation, and this just uses a neat little trick that isn’t particularly special to C, but rather a quirk of how computers represent data in general.

First, it is important to understand that the char type is just the mapping of a byte-sized number ranging from 0-255 to what is typically an ASCII character. Therefore, a typical C string is just a sequence of these characters - an array of char values.

But what if we take a number that is bigger than a byte? Say, 30821. Well, assuming we are using little endian, 30821 as an int is represented in hexadecimal as 65 78 00 00 (note that each pair of digits is 1 byte). Therefore, an int can also be thought of as an array of bytes - 4 in total if we assume that sizeof(int) == 4.

So what happens if we take this number and treat it as a char array? Well, 0x65 == 'e' and 0x78 == 'x', so we get the string "ex".

However, it’s important to note that for a char array to be a valid C string, it must be null-terminated (i.e., ending with 0x00). As a result, we are able toget a string of maximum length 3 from a single 32-bit integer. "ex" is only of length 2, so it is able to fit without the need of a larger type like a long long.

So, when we cast (char*)&argc, what we’re really doing is telling the compiler to process our integer argc as a string instead, and finally it becomes clear why our program outputs “sex” (as in “sextet”). Let’s remove the conversion and just make the value explicit:

#include <iostream>

int main(int argc, const char* argv[]) {
  const char* input = argv[0];
  assert(input[9] == 's');
  std::cout << input[9] << "ex";
  return 0;
}

And voilà, we have successfully picked apart the original obfuscated code to reveal what it is really doing.

Conclusion

So, it turns out that all this:

%:import "bits/stdc++.h"
auto main(register int(C), const char*(_)<::>) noexcept -> int {
  return operator<<(std::wclog, *(C ^= 074144, NULL<:_:> + 9))
    << (typeof(*_))new volatile const signed long long<%C++%>, not true;
}

is really just an obfuscated way of doing this:

#import <iostream>
int main() {
    std::cout << "sex"; // as in "sextet"
}

To whoever gave this a read, I hope it was as insightful to you as it was fun for me to put together. But do not be fooled! What I have covered here is but the tip of an iceberg that will lead you far into the depths of Hell if you dare to attempt traversing your way to its bottom.

Perhaps I am exaggerating, but if you consider the fact that the comma operator can be overloaded… am I really?

Digraphs¶

Bonus: Trigraphs¶

#import¶

bits/stdc++.h¶

Trailing return type¶

noexcept¶

register¶

Parenthesised declarators¶

Underscore identifier¶

Operator overloading¶

std::wclog¶

Comma operator¶

Implicit casting¶

Array subscript commutativity¶

NULL¶

Pointer arithmetic¶

Octal¶

XOR¶

Post-increment operator¶

new¶

String in disguise¶

Conclusion¶