What Are All the Common Undefined Behaviours That a C++ Programmer Should Know About

What are all the common undefined behaviours that a C++ programmer should know about?

Pointer

  • Dereferencing a NULL pointer
  • Dereferencing a pointer returned by a "new" allocation of size zero
  • Using pointers to objects whose lifetime has ended (for instance, stack allocated objects or deleted objects)
  • Dereferencing a pointer that has not yet been definitely initialized
  • Performing pointer arithmetic that yields a result outside the boundaries (either above or below) of an array (see the sketch after this list)
  • Dereferencing the pointer at a location beyond the end of an array
  • Converting pointers to objects of incompatible types
  • Using memcpy to copy overlapping buffers (memmove is the tool for that job)
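As a hedged illustration of several of the items above, here is a minimal sketch (the function and variable names are invented for the example; every line marked undefined compiles, but a conforming implementation may do anything with it):

int pointer_ub_sketch()
{
    int arr[4] = {1, 2, 3, 4};
    int *one_past = arr + 4;    // OK: one-past-the-end may be formed and compared
    int *too_far  = arr + 5;    // undefined: arithmetic beyond one-past-the-end
    int value = *one_past;      // undefined: dereferencing the one-past-the-end pointer
    (void)too_far;

    int *dangling;
    {
        int local = 42;
        dangling = &local;
    }                           // local's lifetime ends here
    return *dangling + value;   // undefined: the pointed-to object's lifetime has ended
}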

Buffer overflows

  • Reading or writing to an object or array at an offset that is negative, or beyond the size of that object (stack/heap overflow); a short sketch follows
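A minimal sketch of the out-of-bounds case (the buffer name is hypothetical):

void buffer_overflow_sketch()
{
    char buf[8];
    buf[8]  = 'x';   // undefined: the valid indices are 0..7
    buf[-1] = 'y';   // undefined: a negative offset reaches outside the object
}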

Integer Overflows

  • Signed integer overflow
  • Evaluating an expression that is not mathematically defined
  • Shifting values left or right by a negative amount (undefined in both directions; right-shifting a negative value, by contrast, is implementation-defined); see the sketch after this list
  • Shifting values by an amount greater than or equal to the number of bits in the number (e.g. int64_t i = 1; i <<= 72 is undefined)
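A few of these cases in a compact sketch (assuming a 32-bit int, as elsewhere on this page):

#include <climits>
#include <cstdint>

void integer_ub_sketch()
{
    int i = INT_MAX;
    ++i;                  // undefined: signed integer overflow

    int q = i / 0;        // undefined: division by zero is not mathematically defined

    std::int64_t j = 1;
    j <<= 72;             // undefined: shift count >= the 64-bit width of j
    (void)q; (void)j;
}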

Types, Cast and Const

  • Casting a numeric value into a value that can't be represented by the target type (either directly or via static_cast)
  • Using an automatic variable before it has been definitely assigned (e.g., int i; i++; cout << i;)
  • Using the value of any object of a type other than volatile sig_atomic_t at the receipt of a signal
  • Attempting to modify a string literal or any other const object during its lifetime (sketched below)
  • Concatenating a narrow with a wide string literal during preprocessing
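A brief sketch of the string-literal and unrepresentable-cast cases:

void type_ub_sketch()
{
    char *s = const_cast<char *>("hello");
    s[0] = 'H';                      // undefined: modifying a string literal

    double big = 1e100;
    int n = static_cast<int>(big);   // undefined: 1e100 is not representable in int
    (void)n;
}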

Function and Template

  • Not returning a value from a value-returning function (directly, or by flowing off from a try-block); see the sketch after this list
  • Multiple different definitions for the same entity (class, template, enumeration, inline function, static member function, etc.)
  • Infinite recursion in the instantiation of templates
  • Calling a function through a declaration whose parameters or linkage are incompatible with those of the function's definition
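The first item is easy to stumble into; a minimal sketch:

int bad_flow(bool b)
{
    if (b)
        return 1;
    // undefined if b is false: control flows off the end of a
    // value-returning function without returning a value
}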

OOP

  • Cascading destructions of objects with static storage duration
  • The result of assigning to partially overlapping objects
  • Recursively re-entering a function during the initialization of its static objects
  • Making virtual function calls to pure virtual functions of an object from its constructor or destructor (see the sketch after this list)
  • Referring to nonstatic members of objects that have not been constructed or have already been destructed
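The pure-virtual case usually terminates with a "pure virtual method called" diagnostic in practice, but the standard leaves it undefined. A minimal sketch:

struct Base {
    Base() { helper(); }        // virtual dispatch during construction...
    void helper() { init(); }   // ...reaches a pure virtual function: undefined
    virtual void init() = 0;
    virtual ~Base() = default;
};

struct Derived : Base {
    void init() override {}     // not used while Base is being constructed
};

// Derived d; // construction would typically abort here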

Source file and Preprocessing

  • A non-empty source file that doesn't end with a newline, or ends with a backslash (prior to C++11)
  • A backslash followed by a character that is not part of the specified escape codes in a character or string constant (this is implementation-defined in C++11).
  • Exceeding implementation limits (number of nested blocks, number of functions in a program, available stack space ...)
  • Preprocessor numeric values that can't be represented by a long int
  • Preprocessing directive on the left side of a function-like macro definition
  • Dynamically generating the defined token in a #if expression (sketched below)
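The last item looks like this in practice (whether it "works" varies from one preprocessor to another, which is exactly the problem):

#define IS_SET(x) defined(x)
#if IS_SET(FOO)   // undefined: the defined token was produced by macro expansion
#endif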

To be classified

  • Calling exit during the destruction of an object with static storage duration

What are the common undefined/unspecified behavior for C that you run into?

A language lawyer question. Hmkay.

My personal top3:

  1. violating the strict aliasing rule

  2. violating the strict aliasing rule

  3. violating the strict aliasing rule

    :-)

Edit: Here is a little example that does it wrong twice:

(assume 32 bit ints and little endian)

float funky_float_abs (float a)
{
    /* undefined: reads the float object through an unsigned int lvalue */
    unsigned int temp = *(unsigned int *)&a;
    temp &= 0x7fffffff;   /* clear the sign bit */
    /* undefined again: reads the unsigned int through a float lvalue */
    return *(float *)&temp;
}

That code tries to get the absolute value of a float by bit-twiddling with the sign bit directly in the representation of a float.

However, accessing an object through a pointer obtained by casting a pointer to one type into a pointer to an incompatible type is not valid C. The compiler may assume that pointers to different types don't point to the same chunk of memory. This is true for all kinds of pointers except void* and char* (signedness does not matter).

In the case above I do that twice. Once to get an int-alias for the float a, and once to convert the value back to float.

There are three valid ways to do the same.

Use a char or void pointer during the cast. These are allowed to alias anything, so they are safe.

float funky_float_abs (float a)
{
    float temp_float = a;
    /* valid, because it's a char pointer. These are special. */
    unsigned char *temp = (unsigned char *)&temp_float;
    temp[3] &= 0x7f;   /* byte 3 holds the sign bit on 32-bit little endian */
    return temp_float;
}

Use memcpy. memcpy takes void pointers, so it will force aliasing as well.

#include <string.h>   /* for memcpy */

float funky_float_abs (float a)
{
    int i;
    float result;
    memcpy (&i, &a, sizeof (int));       /* assumes sizeof(int) == sizeof(float) */
    i &= 0x7fffffff;                     /* clear the sign bit */
    memcpy (&result, &i, sizeof (int));
    return result;
}

The third valid way: use unions. This is explicitly not undefined since C99:

float funky_float_abs (float a)
{
    union
    {
        unsigned int i;
        float f;
    } cast_helper;

    cast_helper.f = a;
    cast_helper.i &= 0x7fffffff;   /* clear the sign bit */
    return cast_helper.f;
}

Undefined behavior, in principle

Frama-C's value analysis, a static analyzer whose purported goal is to find all undefined behaviors in a C program, considers the assignment const int b = a; as okay. This is a deliberate design decision intended to allow memcpy() (typically implemented as a loop over the unsigned char elements of a virtual array, and which the C standard arguably allows to be re-implemented as such) to copy a struct (which can have padding and uninitialized members) to another.

The “exception” is only for lvalue = lvalue; assignments without an intervening conversion, that is, for an assignment that amounts to a copy of a slice of memory from one memory location to another.

I (as one of the authors of Frama-C's value analysis) discussed this with Xavier Leroy at a time when he was himself wondering about the definition to pick in the verified C compiler CompCert, so he may have ended up using the same definition. It is in my opinion cleaner than what the C standard tries to do with indeterminate values that can be trap representations, and the type unsigned char that is guaranteed not to have any trap representations, but both CompCert and Frama-C assume relatively non-exotic targets, and perhaps the standardization committee was trying to accommodate platforms where reading an uninitialized int can indeed abort the program.

Returning b, or passing n1, n2 or n3 to printf at the end, can at least be considered undefined behavior, because copying an uninitialized slice of memory does not make it initialized. With an oldish Frama-C version:

$ frama-c -val t.c

t.c:19:… accessing uninitialized left-value: assert \initialized(&n1);

And in an oldish version of CompCert, after minor modifications to make the program acceptable to it:

$ ccomp -interp t.c
Time 33: in function foo, expression <loc> = <undef>
ERROR: Undefined behavior

What's the reason for letting the semantics of a=a++ be undefined?

UPDATE: This question was the subject of my blog on June 18th, 2012. Thanks for the great question!


Why? I want to know if this was a design decision and if so, what prompted it?

You are essentially asking for the minutes of the meeting of the ANSI C design committee, and I don't have those handy. If your question can only be answered definitively by someone who was in the room that day, then you're going to have to find someone who was in that room.

However, I can answer a broader question:

What are some of the factors that lead a language design committee to leave the behaviour of a legal program (*) "undefined" or "implementation defined" (**)?

The first major factor is: are there two existing implementations of the language in the marketplace that disagree on the behaviour of a particular program? If FooCorp's compiler compiles M(A(), B()) as "call A, call B, call M", and BarCorp's compiler compiles it as "call B, call A, call M", and neither is the "obviously correct" behaviour, then there is a strong incentive for the language design committee to say "you're both right" and make it implementation-defined behaviour. This is particularly the case if FooCorp and BarCorp both have representatives on the committee.

The next major factor is: does the feature naturally present many different possibilities for implementation? For example, in C# the compiler's analysis of a "query comprehension" expression is specified as "do a syntactic transformation into an equivalent program that does not have query comprehensions, and then analyze that program normally". There is very little freedom for an implementation to do otherwise.

By contrast, the C# specification says that the foreach loop should be treated as the equivalent while loop inside a try block, but allows the implementation some flexibility. A C# compiler is permitted to say, for example "I know how to implement foreach loop semantics more efficiently over an array" and use the array's indexing feature rather than converting the array to a sequence as the specification suggests it should.

A third factor is: is the feature so complex that a detailed breakdown of its exact behaviour would be difficult or expensive to specify? The C# specification says very little indeed about how anonymous methods, lambda expressions, expression trees, dynamic calls, iterator blocks and async blocks are to be implemented; it merely describes the desired semantics and some restrictions on behaviour, and leaves the rest up to the implementation.

A fourth factor is: does the feature impose a high burden on the compiler to analyze? For example, in C# if you have:

Func<int, int> f1 = (int x)=>x + 1;
Func<int, int> f2 = (int x)=>x + 1;
bool b = object.ReferenceEquals(f1, f2);

Suppose we require b to be true. How are you going to determine when two functions are "the same"? Doing an "intensionality" analysis -- do the function bodies have the same content? -- is hard, and doing an "extensionality" analysis -- do the functions have the same results when given the same inputs? -- is even harder. A language specification committee should seek to minimize the number of open research problems that an implementation team has to solve!

In C# this is therefore left to be implementation-defined; a compiler can choose to make them reference equal or not at its discretion.

A fifth factor is: does the feature impose a high burden on the runtime environment?

For example, in C# dereferencing past the end of an array is well-defined; it produces an array-index-was-out-of-bounds exception. This feature can be implemented with a small -- not zero, but small -- cost at runtime. Calling an instance or virtual method with a null receiver is defined as producing a null-was-dereferenced exception; again, this can be implemented with a small, but non-zero cost. The benefit of eliminating the undefined behaviour pays for the small runtime cost.

A sixth factor is: does making the behaviour defined preclude some major optimization? For example, C# defines the ordering of side effects when observed from the thread that causes the side effects. But the behaviour of a program that observes side effects of one thread from another thread is implementation-defined except for a few "special" side effects. (Like a volatile write, or entering a lock.) If the C# language required that all threads observe the same side effects in the same order then we would have to restrict modern processors from doing their jobs efficiently; modern processors depend on out-of-order execution and sophisticated caching strategies to obtain their high level of performance.

Those are just a few factors that come to mind; there are of course many, many other factors that language design committees debate before making a feature "implementation defined" or "undefined".

Now let's return to your specific example.

The C# language does make that behaviour strictly defined (***); the side effect of the increment is observed to happen before the side effect of the assignment. So there cannot be any "well, it's just impossible" argument there, because it is possible to choose a behaviour and stick to it. Nor does this preclude major opportunities for optimizations. And there are not a multiplicity of possible complex implementation strategies.

My guess, therefore, and I emphasize that this is a guess, is that the C language committee made the ordering of side effects undefined behaviour because there were multiple compilers in the marketplace that did it differently, none was clearly "more correct", and the committee was unwilling to tell half of them that they were wrong.


(*) Or, sometimes, its compiler! But let's ignore that factor.

(**) "Undefined" behaviour means that the code can do anything, including erasing your hard disk. The compiler is not required to generate code that has any particular behaviour, and not required to tell you that it is generating code with undefined behaviour. "Implementation defined" behaviour means that the compiler author is given considerable freedom in choice of implementation strategy, but is required to pick a strategy, use it consistently, and document that choice.

(***) When observed from a single thread, of course.

Which languages have documented undefined behavior?

I was the one who wrote that text in the wiki. What I meant by "documented undefined behavior" is something which is formally undefined behavior in the language standard, but perfectly well-defined in the real world. No language has "documented undefined behavior", but the real world doesn't always care about what the language standard says.

A better term would perhaps be non-standard language extensions, or if you will "undefined as far as the programming language standard is concerned".

There are several reasons why something could be regarded as undefined behavior in the language standard:

  1. Something is simply outside the scope of the standard. Such as the behavior of memory, monitors etc. All behavior that isn't documented in the standard is in theory undefined.
  2. Something is actually well-defined on given specific hardware, but the standard doesn't want to impose restrictions on the system/hardware, so that no technology gets an unfair market advantage. Therefore it labels something undefined behavior even though it isn't in practice.
  3. Something is truly undefined behavior even in the hardware, or doesn't make any sense in any context.

Example of 1)

Where are variables stored in memory? This is outside the scope of the standard yet perfectly well-defined on any computer where programs are executed.

Similarly, if I say "my cat is black", it is undefined behavior, because the color of cats isn't covered by the programming language. This doesn't mean that my cat will suddenly start to shimmer in a kaleidoscope of mysterious colors, but rather that reality takes precedence over theoretical programming standards. We can be perfectly certain that the specific cat will always be a black cat, even though it is undefined behavior.

Example of 2)

Signed integer overflow. What happens in case of integer overflow is perfectly well-defined on the CPU level. In most cases the value itself will get treated as simple unsigned addition/subtraction, but an overflow flag will be set in a status register. As far as the C or C++ language is concerned, such overflows might in theory cause dreadful, unexplained events to happen. But in reality, the underlying hardware will produce perfectly well-defined results.
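In fact, because the language-level behavior is undefined, a compiler may assume signed overflow never happens, which can break code written for the hardware's wrapping behavior. A hedged sketch (the function name is invented):

#include <climits>

int will_increment_overflow(int x)
{
    // Intended as a wrap-around check, but since x + 1 cannot overflow
    // in a standard-conforming program, a compiler may fold this to 0.
    return x + 1 < x;
}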

Example of 3)

Division by zero. Accessing invalid addresses. The behavior upon stack overflow. Etc.

Pointer to one before first element of array

(1) Is it legal to write --p?

It's "legal" as in the C syntax allows it, but it invokes undefined behavior. For the purpose of finding the relevant section in the standard, --p is equivalent to p = p - 1 (except p is only evaluated once). Then:

C17 6.5.6/8:

If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.

The evaluation itself invokes undefined behavior, meaning it doesn't matter whether you dereference the pointer or not - you already invoked undefined behavior.

Furthermore:

C17 6.5.6/9:

When two pointers are subtracted, both shall point to elements of the same array object, or one past the last element of the array object;

If your code violates a "shall" in the ISO standard, it invokes undefined behavior.

(2) Is it legal to write p-1 in an expression?

Same as (1), undefined behavior.
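A minimal sketch of the constructs under discussion:

int arr[5];
int *p = arr;
--p;                // undefined: the result points before the first element
int *q = arr - 1;   // undefined for the same reason, even without dereferencing
int *r = arr + 5;   // OK: one past the end may be formed (but not dereferenced)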


As for examples of how this could cause problems in practice: imagine that the array is placed at the very beginning of a valid memory page. When you decrement outside that page, there could be a hardware exception or a pointer trap representation. This isn't a completely unlikely scenario for microcontrollers, particularly when they are using segmented memory maps.


