Fun with Uninitialized Variables and Compiler (GCC)

I'm just curious to know why this sudden change in the behavior of uninitialized bool?

Disassemble the code and see what the compiler’s doing.

My guess: since the value is now only used locally, the compiler optimizes it away completely. Since the behaviour is undefined anyway, the compiler can safely just assume any value, e.g. false. This is a pretty obvious optimization since the value of b is constant as far as the compiler is concerned, and the whole logic of the switch is redundant. So why put it in the executable?

(The important point here is really that b is only ever used locally in the second code, and that in turn will trigger more optimizations even in unoptimized code. The first code has to be inlined before the compiler can do any such optimizations, or the code paths have to be traced which isn’t trivial).

Inconsistency in uninitialized boolean variable

An uninitialized variable is just a piece of raw memory and will show as its value whatever happens to be there. Your idea that "in the live system it always happens to be true" is totally wrong. All you can say is that every time you observed it in the live system it seemed to be true. Maybe next Tuesday it will happen to be false instead, because it's well known that uninitialized bools hate Tuesdays.

Note that it's even entirely possible that an uninitialized boolean may seem true for one function and false for another (normally a full byte is allocated for a bool, but only one bit is needed to represent the value: it's possible that an uninitialized bool will contain a magic fuzzy value that reads as true to one piece of code and false to another).

As for what the standard says, reading an uninitialized variable may indeed be undefined behavior, with no limits on what may happen, including crashes (for example, it's easy to make a program stop when reading an uninitialized variable: just compile with a tool that tracks this kind of problem). Always having a program crash on accessing an uninitialized variable would be wonderful, but unfortunately it's quite costly on current CPUs, and it's not going to happen unless specific tools are used.

Of course, adding even just a printf call can change the apparent behavior of code handling uninitialized variables. This kind of bug is often referred to as a "heisenbug", and random or heisenbug behavior is often an indication of an uninitialized variable or of a thread-synchronization problem.

What happens to a declared, uninitialized variable in C? Does it have a value?

Static variables (file scope and function static) are initialized to zero:

int x; // zero
int y = 0; // also zero

void foo() {
    static int x; // also zero
}

Non-static variables (local variables) are indeterminate. Reading them prior to assigning a value results in undefined behavior.

void foo() {
    int x;
    printf("%d", x); // the compiler is free to crash here
}

In practice, they tend to just have some nonsensical value in there initially - some compilers may even put in specific, fixed values to make it obvious when looking in a debugger - but strictly speaking, the compiler is free to do anything from crashing to summoning demons through your nasal passages.

As for why it's undefined behavior instead of simply "undefined/arbitrary value", there are a number of CPU architectures that have additional flag bits in their representation for various types. A modern example would be the Itanium, which has a "Not a Thing" bit in its registers; of course, the C standard drafters were considering some older architectures.

Attempting to work with a value that has these flag bits set can result in a CPU exception in an operation that really shouldn't fail (e.g., integer addition, or assignment to another variable). And if you leave a variable uninitialized, the compiler might pick up some random garbage with these flag bits set - meaning touching that uninitialized variable may be deadly.

When I use gcc with -O1 optimization, array data initialization is ignored and I end up with uninitialized data when I attempt to use the array

gcc is fine, your code is buggy.

static int populate_structs(optests_caps_t *caps)
{
    // ...
    optests_cap_t cap[CAP_COUNT];
    // ...
    caps->structs = cap;
}

cap is local to the function populate_structs, so after that function returns, any further access to the memory pointed to by caps->structs is undefined behavior.

Maybe you wanted to declare cap as static, or use malloc to allocate some memory for it.

Why does GCC warn when casting an uninitialized volatile pointer to `void`?

For C++, this is controlled by [expr]/12, which says that (void) x applies the lvalue-to-rvalue conversion to x (i.e., reads the value of x) only if x is a glvalue of volatile-qualified type. Therefore, if x is volatile-qualified, then (void) x reads its value (which yields an indeterminate value and triggers undefined behavior). If x isn't volatile-qualified, then (void) x doesn't apply the lvalue-to-rvalue conversion and the behavior is well-defined.

How to safely implement Using Uninitialized Memory For Fun And Profit?

I can see four basic approaches you can take. These are applicable not only to C++ but also most other low-level languages like C that make uninitialized access possible but not allowed, and the last is applicable even to higher-level "safe" languages.

Ignore the standard, implement it in the usual way

This is the one crazy trick language lawyers hate! Don't freak out yet though - the solutions following this one won't break the rules, so just skip this part if you are of the rules-stickler variety.

The standard makes most uses of uninitialized values undefined, and the few loopholes it does allow (e.g., copying one indeterminate value to another) don't really give you enough rope to actually implement what you want - even in C, which is slightly less restrictive (see for example this answer covering C11, which explains that while accessing an indeterminate value may not directly trigger UB, anything that results is also indeterminate, and indeed the value may appear to change from access to access).

So you just implement it anyway, with the knowledge that most or all current compilers will simply compile it to the expected code, and accept that your code is not standards compliant.

At least in my tests, none of gcc, clang, and icc took advantage of the illegal access to do anything crazy. Of course, the tests were not comprehensive, and even if you could construct a comprehensive one, the behavior could change in a new version of the compiler.

You would be safest if the implementation of the methods that access uninitialized memory was compiled, once, in a separate compilation unit - this makes it easy to check that it does the right thing (just check the assembly once) and makes it nearly impossible (outside of LTCG, i.e., link-time code generation) for the compiler to do anything tricky, since it can't prove whether uninitialized values are being accessed.

Still, this approach is theoretically unsafe and you should check the compiled output very carefully and have additional safeguards in place if you take it.

If you take this approach, tools like valgrind are fairly likely to report an uninitialized read error.

Now these tools work at the assembly level, and some uninitialized reads may be fine (see, for example, the next item on fast standard library implementations), so they don't actually report an uninitialized read immediately, but rather have a variety of heuristics to determine whether invalid values are actually used. For example, they may avoid reporting an error until they determine that the uninitialized value is used to decide the direction of a conditional jump, or in some other action that is not trackable/recoverable according to the heuristic. You may be able to get the compiler to emit code that reads uninitialized memory but is safe according to this heuristic.

More likely, you won't be able to do that (since the logic here is fairly subtle as it relies on the relationship between the values in two arrays), so you can use the suppression options in your tools of choice to hide the errors. For example, valgrind can suppress based on the stack trace - and in fact there are already many such suppression entries used by default to hide false-positives in various standard libraries.

Since it works based on stack traces, you'll probably have difficulties if the reads occur in inlined code, since the top of the stack will then be different for every call site. You could avoid this by making sure the function is not inlined.

Use assembly

What is ill-defined in the standard is usually well-defined at the assembly level. That is why the compiler and standard library can often implement things in a faster way than you could achieve with C or C++: a libc routine written in assembly already targets a specific architecture and doesn't have to worry about the various caveats in the language specification that exist to make things run fast on a variety of hardware.

Normally, implementing any serious amount of code in assembly is a costly undertaking, but here it is only a handful of functions, so it may be feasible depending on how many platforms you are targeting. You don't even really need to write the methods yourself - just compile the C++ version (or use godbolt) and copy the assembly. The is_member function, for example1, looks like:

sparse_array::is_member(unsigned long):
mov rax, QWORD PTR [rdi+16]
mov rdx, QWORD PTR [rax+rsi*8]
xor eax, eax
cmp rdx, QWORD PTR [rdi]
jnb .L1
mov rax, QWORD PTR [rdi+8]
cmp QWORD PTR [rax+rdx*8], rsi
sete al

Rely on calloc magic

If you use calloc2, you explicitly request zeroed memory from the underlying allocator. Now, a correct version of calloc may simply call malloc and then zero out the returned memory, but actual implementations rely on the fact that the OS-level memory allocation routines (sbrk and mmap, pretty much) will generally return zeroed memory on any OS with protected memory (i.e., all the big ones), and so avoid zeroing the memory again.

As a practical matter, for large allocations this is typically satisfied by implementing a call like anonymous mmap by mapping a special page of all zeros. Only when (if ever) the memory is written does copy-on-write actually allocate a new page. So the allocation of large, zeroed memory regions may be free, since the OS already needs to zero the pages.

In that case, implementing your sparse set on top of calloc could be just as fast as the nominally uninitialized version, while being safe and standards compliant.

Calloc Caveats

You should of course test to ensure that calloc is behaving as expected. The optimized behavior is usually only going to occur when your program initializes a lot of long-lived zeroed memory approximately "up front". That is, the typical logic for an optimized calloc is something like this:

calloc(N):
    if (can satisfy a request for N bytes from allocated-then-freed memory)
        memset those bytes to zero and return them
    else
        ask the OS for memory and return it directly, because it is already zeroed

Basically, the malloc infrastructure (which also underlies new and friends) has a (possibly empty) pool of memory that it has already requested from the OS and generally tries to allocate from there first. This pool is composed of memory requested from the OS in an earlier block but not yet handed out (e.g., because the user requested 32 bytes but the allocator asks the OS for chunks in 1 MB blocks, so there is a lot left over), and also of memory that was handed out to the process but subsequently returned via free or delete or whatever. The memory in that pool has arbitrary values, and if a calloc can be satisfied from that pool, you don't get your magic, since the zero-init has to occur.

On the other hand, if the memory has to be allocated from the OS, you get the magic. So it depends on your use case: if you are frequently creating and destroying sparse_set objects, you will generally just be drawing from the internal malloc pools and will pay the zeroing costs. If you have long-lived sparse_set objects which take up a lot of memory, they were likely allocated by asking the OS and you got the zeroing nearly for free.

The good news is that if you don't want to rely on the calloc behavior above (indeed, on your OS or with your allocator it may not even be optimized in that way), you can usually replicate it by mapping /dev/zero manually for your allocations. On OSes that offer it, this guarantees the "cheap" behavior.

Use Lazy Initialization

For a solution that is totally platform agnostic you could simply use yet another array which tracks the initialization state of the array.

First you choose some granule at which you will track initialization, and use a bitmap where each bit tracks the initialization state of one granule of the sparse array.

For example, let's say you choose your granule to be 4 elements, and the size of the elements in your array is 4 bytes (e.g., int32_t values): you need 1 bit to track every 4 elements * 4 bytes/element * 8 bits/byte = 128 bits of the sparse array, which is an overhead of less than 1%3 in allocated memory.

Now you simply check the corresponding bit in this array before accessing sparse. This adds some small cost to accessing the sparse array, but doesn't change the overall complexity, and the check is still quite fast.

For example, your is_member function now looks like:

bool sparse_set::is_member(size_t i) {
    bool init = is_init[i >> INIT_SHIFT] & (1UL << (i & INIT_MASK));
    return init && sparse[i] < n && dense[sparse[i]] == i;
}

The generated assembly on x86 (gcc) now starts with:

mov     rax, QWORD PTR [rdi+24]
mov rdx, rsi
shr rdx, 8
mov rdx, QWORD PTR [rax+rdx*8]
xor eax, eax
bt rdx, rsi
jnc .L2
...

.L2:
ret

That's all that is associated with the bitmap check. It's all going to be pretty quick (and often off the critical path, since it isn't part of the data flow).

In general, the cost of this approach depends on the density of your set and which functions you are calling - is_member is about the worst case for this approach, since some functions (e.g., clear) aren't affected at all, and others (e.g., iterate) can batch up the is_init check and do it only once every INIT_COVERAGE elements (meaning the overhead would again be ~1% for the example values).

Sometimes this approach will be faster than the approach suggested in the OP's link, especially when handling elements not in the set - in that case, the is_init check will fail and often short-circuit the remaining code, and then your working set is much smaller (256 times smaller, using the example granule size) than the size of the sparse array, so you may greatly reduce misses to DRAM or the outer cache levels.

The granule size itself is an important tunable for this approach. Intuitively, a larger granule size pays a larger initialization cost when an element covered by the granule is accessed for the first time, but saves on memory and up-front is_init initialization cost. You can come up with a formula that finds the optimum size in the simple case - but the behavior also depends on the "clustering" of values and other factors. Finally, it is totally reasonable to use a dynamic granule size to cover your bases under varying workloads - but it comes at the cost of variable shifts.

Really Lazy Solution

It is worth noting the similarity between the calloc and lazy-init solutions: both lazily initialize blocks of memory as they are needed, but the calloc solution tracks this implicitly in hardware, through MMU magic with page tables and TLB entries, while the lazy-init solution does it in software, with a bitmap that explicitly tracks which granules have been initialized.

The hardware approach has the advantage of being nearly free (for the "hit" case, anyway), since it uses the always-present virtual memory support in the CPU to detect misses, but the software approach has the advantage of being portable and allowing precise control over the granule size, etc.

You can actually combine these approaches, to make a lazy approach that doesn't use a bitmap, and doesn't even need the dense array at all: just allocate your sparse array with mmap as PROT_NONE, so you fault whenever you read from an un-allocated page in a sparse array. You catch the fault and allocate the page in the sparse array with a sentinel value indicating "not present" for every element.

This is the fastest of all for the "hot" case: you don't need any of the ... && dense[sparse[i]] == i checks any more.

The downsides are:

  • Your code is really not portable since you need to implement the fault-handling logic, which is usually platform specific.
  • You can't control the granule size: it must be at page granularity (or some multiple thereof). If your set is very sparse (say less than 1 out of 4096 elements occupied) and uniformly distributed, you end up paying a high initialization cost since you need to handle a fault and initialize a full page of values for every element.
  • Misses (i.e., non-insert accesses to set elements that don't exist) either need to allocate a page even if no elements will exist in that range, or will be very slow (incurring a fault) every time.

1 This implementation has no "range checking" - i.e., it doesn't check if i is greater than MAX_ELEM - depending on your use case you may want to check this. My implementation used a template parameter for MAX_ELEM, which may result in slightly faster code, but also more bloat, and you'd do fine to just make the max size a class member as well.

2 Really, the only requirement is that you use something that calls calloc under the covers or performs the equivalent zero-fill optimization; based on my tests, more idiomatic C++ approaches like new int[size]() just do the allocation followed by a memset. gcc does optimize malloc followed by memset into calloc, but that's not useful if you are trying to avoid the use of C routines anyway!

3 Precisely, you need 1 extra bit to track every 128 bits of the sparse array.

Is there a use for uninitialized pointers in C or C++?

This is a very specialized, optimized case for video games (basically an embedded system). We used to use uninitialized pointers for load-in-place data in our video games to speed up loading (and avoid fragmentation).

Basically, we would create console-side (PlayStation) objects in a PC cooker. Then, to reduce fragmentation overhead, we would pack the data objects into a contiguous buffer with a single alloc. References to the data objects in this buffer would then be converted from pointers to offsets by subtracting the base (the "unfix" call - we also had virtual fix/unfix calls that took the buffer base and could convert between offsets and pointers).

When we loaded the data, it loaded in one large block. All data was referenced off the root object. We could do an in-place "new" on the root that would initialize the proper VF tables for the object and fix up all the attached blocks (by doing in-place new and then fixing up the attached blocks, in turn).

We needed the constructors called (via placement new) to generate the proper VF tables in the objects. However, if the pointers had been automatically cleared to NULL during the constructor, we would have lost the offset data and would not have been able to recreate the pointers between the objects within the contiguous block.


FWIW, this is a common technique in the video game world. This Gamasutra article (not written by me or my coworkers) explains in detail the similar thing they did at another company.

Also, this topic of discussion on SourceForge.

There have even been several GDC (Game Developer Conference) talks on the subject.

Searching on Google for "load-in-place" will give many other examples of people using this technique that basically requires uninitialized pointers.


NOTE: Currently, this is the only response that actually answers the question asked ("Is there a use for uninitialized pointers in C or C++?") by giving a specific use for pointers that must remain uninitialized.

All the other responses are better answers for the original question referenced ("[C++] Why aren’t pointers initialized with NULL by default?") that caused the poster to ask this question.


