In Which Versions of the C++ Standard Does "(I+=10)+=10" Have Undefined Behaviour

In which versions of the C++ standard does (i+=10)+=10 have undefined behaviour?

tl;dr: The order of the modifications and reads performed in (i+=10)+=10 is well specified in both C++98 and C++11; however, in C++98 this alone is not sufficient to make the behavior defined.

In C++98 multiple modifications to the same object without an intervening sequence-point results in undefined behavior, even when the order of those modifications is well specified. This expression does not contain any sequence points and so the fact that it consists of two modifications is sufficient to render its behavior undefined.

C++11 doesn't have sequence points; it requires only that modifications of an object be ordered with respect to each other, and with respect to reads of the same object, for the behavior to be defined.

Therefore the behavior is undefined in C++98 but well defined in C++11.



C++98

C++98 clause [expr] 5 p4

Except where noted, the order of evaluation of operands of individual operators and subexpressions of individual expressions, and the order in which side effects take place, is unspecified.

C++98 clause [expr.ass] 5.17 p1

The result of the assignment operation is the value stored in the left operand after the assignment has taken place; the result is an lvalue

So I believe the order is specified; however, I don't see that this alone is enough to create a sequence point in the middle of an expression. And continuing on with the quote of [expr] 5 p4:

Between the previous and next sequence point a scalar object shall have its stored value modified at most once by the evaluation of an expression.

So even though the order is specified, it appears to me that this is not sufficient for defined behavior in C++98.
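For contrast, a minimal sketch of a formulation that is well defined even in C++98, because the two modifications are separated by the sequence point at the end of each full expression:

int i = 0;
i += 10; // Sequence point at the end of this full expression
i += 10; // Well defined even in C++98: i is now 20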



C++11

C++11 does away with sequence points in favour of the much clearer notions of sequenced-before and sequenced-after. The language from C++98 is replaced with:

C++11 [intro.execution] 1.9 p15

Except where noted, evaluations of operands of individual operators and of subexpressions of individual expressions are unsequenced. [...]

If a side effect on a scalar object is unsequenced relative to either another side effect on the same scalar object or a value computation using the value of the same scalar object, the behavior is undefined.

C++11 [expr.ass] 5.17 p1

In all cases, the assignment is sequenced after the value computation of the right and left operands, and before the value computation of the assignment expression.

So while being ordered was not sufficient to make the behavior defined in C++98, C++11 has changed the requirement such that being ordered (i.e., sequenced) is sufficient.
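As a concrete illustration, a minimal sketch whose assertion holds under C++11 rules (the same expression has undefined behavior when compiled as C++98):

#include <cassert>

int main()
{
    int i = 0;
    (i += 10) += 10; // The inner assignment is sequenced before the outer one
    assert(i == 20);
}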

(And it seems to me that the extra flexibility afforded by 'sequenced before' and 'sequenced after' has led to a much clearer, more consistent, and better specified language.)


It seems unlikely to me that any C++98 implementation would actually do anything surprising when the sequence of operations is well specified, even though that is insufficient to produce technically well-defined behavior. As an example, the internal representation of this expression produced by Clang in C++98 mode has well-defined behavior and does the expected thing.

How undefined is undefined behavior?

I'd say that the behavior is undefined only if the user enters a number different from 0. After all, if the offending code section is not actually run, the conditions for UB aren't met (i.e. the non-initialized pointer is neither created nor dereferenced).

A hint of this can be found in the standard, at 3.4.3:

behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements

This seems to imply that, if such "erroneous data" were instead correct, the behavior would be perfectly defined - which seems pretty much applicable to our case.


Additional example: integer overflow. Any program that performs an addition on user-provided data without doing extensive checks on it is subject to this kind of undefined behavior - but the addition is UB only when the user provides that particular data.
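A minimal sketch of the idea: the addition below is one and the same construct for every input, yet whether its behavior is defined depends entirely on the value the user supplies.

#include <iostream>

int main()
{
    int n;
    std::cin >> n;
    int m = n + 1; // Undefined behavior only when n == INT_MAX (signed overflow)
    std::cout << m << '\n';
}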

Undefined, unspecified and implementation-defined behavior

Undefined behavior is one of those aspects of the C and C++ language that can be surprising to programmers coming from other languages (other languages try to hide it better). Basically, it is possible to write C++ programs that do not behave in a predictable way, even though many C++ compilers will not report any errors in the program!

Let's look at a classic example:

#include <iostream>

int main()
{
char* p = "hello!\n"; // yes I know, deprecated conversion
p[0] = 'y';
p[5] = 'w';
std::cout << p;
}

The variable p points to the string literal "hello!\n", and the two assignments try to modify that string literal. What does this program do? According to section 2.14.5 paragraph 11 of the C++ standard, it invokes undefined behavior:

The effect of attempting to modify a string literal is undefined.

I can hear people screaming "But wait, I can compile this no problem and get the output yellow" or "What do you mean undefined, string literals are stored in read-only memory, so the first assignment attempt results in a core dump". This is exactly the problem with undefined behavior. Basically, the standard allows anything to happen once you invoke undefined behavior (even nasal demons). If there is a "correct" behavior according to your mental model of the language, that model is simply wrong; the C++ standard has the only vote, period.

Other examples of undefined behavior include accessing an array beyond its bounds, dereferencing the null pointer, accessing objects after their lifetime has ended, or writing allegedly clever expressions like i++ + ++i.
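A deliberately broken sketch condensing several of these; none of the marked lines has defined behavior:

int a[3] = {1, 2, 3};
int x = a[3];      // Access beyond the array's bounds
int* q = nullptr;
int y = *q;        // Dereferencing the null pointer
int i = 0;
int z = i++ + ++i; // Unsequenced modifications of i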

Section 1.9 of the C++ standard also mentions undefined behavior's two less dangerous brothers, unspecified behavior and implementation-defined behavior:

The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine.

Certain aspects and operations of the abstract machine are described in this International Standard as implementation-defined (for example, sizeof(int)). These constitute the parameters of the abstract machine. Each implementation shall include documentation describing its characteristics and behavior in these respects.

Certain other aspects and operations of the abstract machine are described in this International Standard as unspecified (for example, order of evaluation of arguments to a function). Where possible, this International Standard defines a set of allowable behaviors. These define the nondeterministic aspects of the abstract machine.

Certain other operations are described in this International Standard as undefined (for example, the effect of dereferencing the null pointer). [ Note: this International Standard imposes no requirements on the behavior of programs that contain undefined behavior. — end note ]

Specifically, section 1.3.24 states:

Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).

What can you do to avoid running into undefined behavior? Basically, you have to read good C++ books by authors who know what they're talking about. Avoid internet tutorials. Avoid bullschildt.

`y=++y`, is this standard compliant? [which appears in a test by Microsoft]

They're both well-defined behaviour as per C++11. C++11 doesn't even have sequence points, so a sequence-point-related warning is obviously outdated. The assignment operator imposes sequencing on its operands.
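A minimal sketch of the expression in question under C++11 rules:

int y = 0;
y = ++y; // The increment is sequenced before the assignment: y ends up 1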

Edit:

Whilst everyone can agree that i = ++i is now well-defined behaviour, it is rather non-trivial to decide the definedness of i = i++. Intuitively, I should describe it as well-defined, but the Standard clearly states that

i = i++ + 1;

is UB, and I'm not seeing the + 1 making any difference here.

The short of it is: if you wanted to answer this question, you would need to be an expert on the decidedly non-trivial C++11 sequencing rules, which it appears that I am not; but it appears that the answer is "None of the above, because UB".

Multiple compound assignments in a single statement: is it Undefined Behavior or not?

The behavior is undefined. Let's look at the slightly simpler expression:

x += (x+=1)

In C++11, the value computation of the left x is unsequenced relative to the value computation of the expression (x+=1). This means that the value computation of x is unsequenced relative to the assignment to x (due to x+=1), and therefore the behavior is undefined.

The reason for this is that the value computations of the two sides of the += operator are unsequenced relative to each other (as the standard doesn't specify otherwise). And 1.9p15 states:

If a side effect on a scalar object is unsequenced relative to either another side effect on the same scalar object or a value computation using the value of the same scalar object, the behavior is undefined.

In C++03 the behavior is undefined because x is modified twice without an intervening sequence point.
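For comparison, a rewrite that makes the intended order explicit is well defined in both dialects (a minimal sketch, assuming the 'intuitive' reading of the original):

int x = 0;
x += 1; // Sequence point (C++03) / sequencing boundary (C++11) here
x += x; // Well defined in both: x ends up 2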

Why is undefined behaviour allowed in C?

Originally, most forms of Undefined Behavior represented things which some implementations might trap, but other implementations might not. Because there was no way for the authors of the Standard to predict all the things a platform might do in case of a trap (including, literally, the possibility that a system would sound an alarm and lock up until an operator manually cleared the fault), the consequences of traps fell outside the jurisdiction of the C Standard, and thus almost every action for which some platform might conceivably cause a trap is--from the point of view of the Standard--considered "Undefined Behavior".

That should not be taken to imply that the authors of the Standard didn't believe implementations should try to behave sensibly for such things when practical. The authors of the C89 Standard noted, for example, that the majority of current systems of that era would define behavior for:

/* Assume USmall is half the size of "int", e.g. a 16-bit unsigned short with a 32-bit int */
typedef unsigned short USmall;
unsigned mult(USmall x, USmall y) { return x*y; }

which would in all cases, including those where the mathematical product of x and y was between INT_MAX+1 and UINT_MAX, be equivalent to (unsigned)x*y;. I see no reason to believe they wouldn't have expected that trend to continue.
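The spelling with fully defined behaviour promotes to unsigned before multiplying; a sketch reusing the hypothetical USmall above (unsigned multiplication simply wraps modulo UINT_MAX+1):

unsigned mult_wrapping(USmall x, USmall y) { return (unsigned)x * y; }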

Unfortunately, a new philosophy has become fashionable, based on the revisionist viewpoint that compiler writers only supported useful behaviors in cases not mandated by the Standard because they were too unsophisticated to do anything else. In gcc, for example, using optimization level 2 but no other non-default options, the above "mult" routine will sometimes generate bogus code in cases where the product would be between 0x80000000u and 0xFFFFFFFFu, even when running on platforms where such computations would historically have worked. This is supposedly being done in the name of "optimization"; it would be interesting to know how many of the "optimizations" such techniques end up performing are actually useful and could not have been achieved via safer means.

Historically, Undefined Behavior was a license for a C compiler to expose the behavior of the underlying platform; in cases where the underlying platform's behavior fit the programmer's needs, this allowed the programmer's requirements to be expressed in machine code more efficiently than if everything had to be done in ways defined by the Standard. Lately, however, it has been interpreted as license for compilers to implement behaviors which not only bear no relation to anything in the underlying platform nor to any plausible programmer expectations, but aren't even bound by laws of time and causality.

Can code that will never be executed invoke undefined behavior?

Let's look at how the C standard defines the terms "behavior" and "undefined behavior".

References are to the N1570 draft of the ISO C 2011 standard; I'm not aware of any relevant differences in any of the three published ISO C standards (1990, 1999, and 2011).

Section 3.4:

behavior
external appearance or action

Ok, that's a bit vague, but I'd argue that a given statement has no "appearance", and certainly no "action", unless it's actually executed.

Section 3.4.3:

undefined behavior
behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements

It says "upon use" of such a construct. The word "use" is not defined by the standard, so we fall back to the common English meaning. A construct is not "used" if it's never executed.

There's a note under that definition:

NOTE Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).

So a compiler is permitted to reject your program at compile time if its behavior is undefined. But my interpretation of that is that it can do so only if it can prove that every execution of the program will encounter undefined behavior. Which implies, I think, that this:

if (rand() % 2 == 0) {
    i = i / 0;
}

which certainly can have undefined behavior, cannot be rejected at compile time.

As a practical matter, programs have to be able to perform runtime tests to guard against invoking undefined behavior, and the standard has to permit them to do so.

Your example was:

if (0) {
    i = 1/0;
}

which never executes the division by 0. A very common idiom is:

int x, y;
/* set values for x and y */
if (y != 0) {
    x = x / y;
}

The division certainly has undefined behavior if y == 0, but it's never executed if y == 0. The behavior is well defined, and for the same reason that your example is well defined: because the potential undefined behavior can never actually happen.

(Unless INT_MIN < -INT_MAX && x == INT_MIN && y == -1 (yes, integer division can overflow), but that's a separate issue.)
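A sketch of a guard that rules out both hazards, assuming the same x and y as above (INT_MIN comes from <limits.h>):

if (y != 0 && !(x == INT_MIN && y == -1)) {
    x = x / y; // Divisor is nonzero and the quotient is representable
}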

In a comment (since deleted), somebody pointed out that the compiler may evaluate constant expressions at compile time. Which is true, but not relevant in this case, because in the context of

i = 1/0;

1/0 is not a constant expression.

A constant-expression is a syntactic category that reduces to conditional-expression (which excludes assignments and comma expressions). The production constant-expression appears in the grammar only in contexts that actually require a constant expression, such as case labels. So if you write:

switch (...) {
    case 1/0:
        ...
}

then 1/0 is a constant expression -- and one that violates the constraint in 6.6p4: "Each constant expression shall evaluate to a constant that is in the range of representable values for its type.", so a diagnostic is required. But the right hand side of an assignment does not require a constant-expression, merely a conditional-expression, so the constraints on constant expressions don't apply. A compiler can evaluate any expression that it's able to at compile time, but only if the behavior is the same as if it were evaluated during execution (or, in the context of if (0), not evaluated during execution).

(Something that looks exactly like a constant-expression is not necessarily a constant-expression, just as, in x + y * z, the sequence x + y is not an additive-expression because of the context in which it appears.)

Which means the footnote in N1570 section 6.6 that I was going to cite:

Thus, in the following initialization,

static int i = 2 || 1 / 0;

the expression is a valid integer constant expression with value one.

isn't actually relevant to this question.

Finally, there are a few things that are defined to cause undefined behavior that aren't about what happens during execution. Annex J, section 2 of the C standard (again, see the N1570 draft) lists things that cause undefined behavior, gathered from the rest of the standard. Some examples (I don't claim this is an exhaustive list) are:

  • A nonempty source file does not end in a new-line character which is not immediately preceded by a backslash character, or ends in a partial preprocessing token or comment
  • Token concatenation produces a character sequence matching the syntax of a universal character name
  • A character not in the basic source character set is encountered in a source file, except in an identifier, a character constant, a string literal, a header name, a comment, or a preprocessing token that is never converted to a token
  • An identifier, comment, string literal, character constant, or header name contains an invalid multibyte character or does not begin and end in the initial shift state
  • The same identifier has both internal and external linkage in the same translation unit

These particular cases are things that a compiler could detect. I think their behavior is undefined because the committee didn't want to, or couldn't, impose the same behavior on all implementations, and defining a range of permitted behaviors just wasn't worth the effort. They don't really fall into the category of "code that will never be executed", but I mention them here for completeness.

What does dereferencing a pointer mean?

Reviewing the basic terminology

It's usually good enough - unless you're programming assembly - to envisage a pointer containing a numeric memory address, with 1 referring to the second byte in the process's memory, 2 the third, 3 the fourth and so on....

  • What happened to 0 and the first byte? Well, we'll get to that later - see null pointers below.
  • For a more accurate definition of what pointers store, and how memory and addresses relate, see "More about memory addresses, and why you probably don't need to know" at the end of this answer.

When you want to access the data/value in the memory that the pointer points to - the contents of the address with that numerical index - then you dereference the pointer.

Different computer languages have different notations to tell the compiler or interpreter that you're now interested in the pointed-to object's (current) value - I focus below on C and C++.

A pointer scenario

Consider in C, given a pointer such as p below...

const char* p = "abc";

...four bytes with the numerical values used to encode the letters 'a', 'b', 'c', and a 0 byte to denote the end of the textual data, are stored somewhere in memory, and the numerical address of that data is stored in p. This way of encoding text in memory is known as ASCIIZ.

For example, if the string literal happened to be at address 0x1000 and p a 32-bit pointer at 0x2000, the memory content would be:

Memory Address (hex)    Variable name    Contents
1000                                     'a' == 97 (ASCII)
1001                                     'b' == 98
1002                                     'c' == 99
1003                                     0
...
2000-2003               p                1000 hex

Note that there is no variable name/identifier for address 0x1000, but we can indirectly refer to the string literal using a pointer storing its address: p.

Dereferencing the pointer

To refer to the characters p points to, we dereference p using one of these notations (again, for C):

assert(*p == 'a');       // The first character at address p will be 'a'
assert(p[1] == 'b');     // p[1] actually dereferences a pointer created by adding
                         // p and 1 times the size of the things to which p points:
                         // in this case they're char, which are 1 byte in C...
assert(*(p + 1) == 'b'); // Another notation for p[1]

You can also move pointers through the pointed-to data, dereferencing them as you go:

++p;  // Increment p so it's now 0x1001
assert(*p == 'b'); // p == 0x1001 which is where the 'b' is...

If you have some data that can be written to, then you can do things like this:

int x = 2;
int* p_x = &x; // Put the address of the x variable into the pointer p_x
*p_x = 4; // Change the memory at the address in p_x to be 4
assert(x == 4); // Check x is now 4

Above, you must have known at compile time that you would need a variable called x, and the code asks the compiler to arrange where it should be stored, ensuring the address will be available via &x.

Dereferencing and accessing a structure data member

In C, if you have a variable that is a pointer to a structure with data members, you can access those members using the -> dereferencing operator:

typedef struct X { int i_; double d_; } X;
X x;
X* p = &x;
p->d_ = 3.14159; // Dereference and access data member x.d_
(*p).d_ *= -1; // Another equivalent notation for accessing x.d_

Multi-byte data types

To use a pointer, a computer program also needs some insight into the type of data that is being pointed at - if that data type needs more than one byte to represent, then the pointer normally points to the lowest-numbered byte in the data.

So, looking at a slightly more complex example:

double sizes[] = { 10.3, 13.4, 11.2, 19.4 };
double* p = sizes;
assert(p[0] == 10.3); // Knows to look at all the bytes in the first double value
assert(p[1] == 13.4); // Actually looks at bytes from address p + 1 * sizeof(double)
// (sizeof(double) is almost always eight bytes)
++p; // Advance p by sizeof(double)
assert(*p == 13.4); // The double at memory beginning at address p has value 13.4
*(p + 2) = 29.8; // Change sizes[3] from 19.4 to 29.8
// Note earlier ++p and + 2 here => sizes[3]

Pointers to dynamically allocated memory

Sometimes you don't know how much memory you'll need until your program is running and sees what data is thrown at it... then you can dynamically allocate memory using malloc. It is common practice to store the address in a pointer...

int* p = (int*)malloc(sizeof(int)); // Get some memory somewhere...
*p = 10; // Dereference the pointer to the memory, then write a value in
fn(*p); // Call a function, passing it the value at address p
(*p) += 3; // Change the value, adding 3 to it
free(p); // Release the memory back to the heap allocation library

In C++, memory allocation is normally done with the new operator, and deallocation with delete:

int* p = new int(10); // Memory for one int with initial value 10
delete p;

p = new int[10]; // Memory for ten ints with unspecified initial value
delete[] p;

p = new int[10](); // Memory for ten ints that are value initialised (to 0)
delete[] p;

See also C++ smart pointers below.

Losing and leaking addresses

Often a pointer may be the only indication of where some data or buffer exists in memory. If ongoing use of that data/buffer is needed, or the ability to call free() or delete to avoid leaking the memory, then the programmer must operate on a copy of the pointer...

char* p;
asprintf(&p, "name: %s", name);  // Common but non-Standard printf-on-heap (GNU/BSD)

// Replace non-printable characters with underscores....
for (char* q = p; *q; ++q)
    if (!isprint(*q))
        *q = '_';

printf("%s\n", p); // p itself was never modified; only q was advanced
free(p);

...or carefully orchestrate reversal of any changes...

const size_t n = ...;
p += n;
...
p -= n; // Restore earlier value...
free(p);

C++ smart pointers

In C++, it's best practice to use smart pointer objects to store and manage the pointers, automatically deallocating them when the smart pointers' destructors run. Since C++11 the Standard Library provides two: unique_ptr for when there's a single owner for an allocated object...

{
    std::unique_ptr<T> p{new T(42, "meaning")};
    call_a_function(p);
    // The function above might throw, so delete here is unreliable, but...
} // p's destructor's guaranteed to run "here", calling delete

...and shared_ptr for shared ownership (using reference counting)...

{
    auto p = std::make_shared<T>(3.14, "pi");
    number_storage1.may_add(p); // Might copy p into its container
    number_storage2.may_add(p); // Might copy p into its container
} // p's destructor will only delete the T if neither may_add copied it

Null pointers

In C, NULL and 0 - and additionally in C++ nullptr - can be used to indicate that a pointer doesn't currently hold the memory address of a variable, and shouldn't be dereferenced or used in pointer arithmetic. For example:

const char* p_filename = NULL; // Or "= 0", or "= nullptr" in C++
int c;
while ((c = getopt(argc, argv, "f:")) != -1)
    switch (c) {
        case 'f': p_filename = optarg; break;
    }
if (p_filename) // Only NULL converts to false
    ...         // Only get here if -f flag specified

In C and C++, just as inbuilt numeric types don't necessarily default to 0, nor bools to false, pointers are not always set to NULL. All these are set to 0/false/NULL when they're static variables or (C++ only) direct or indirect member variables of static objects or their bases, or undergo zero initialisation (e.g. new T(); and new T(x, y, z); perform zero-initialisation on T's members including pointers, whereas new T; does not).
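A sketch of the new T; versus new T(); distinction for a simple aggregate (the struct here is hypothetical):

struct S { int* p; };

S* a = new S;   // a->p is uninitialised: reading it before assignment is undefined
S* b = new S(); // b->p is zero-initialised, i.e. a null pointer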

Further, when you assign 0, NULL and nullptr to a pointer the bits in the pointer are not necessarily all reset: the pointer may not contain "0" at the hardware level, or refer to address 0 in your virtual address space. The compiler is allowed to store something else there if it has reason to, but whatever it does - if you come along and compare the pointer to 0, NULL, nullptr or another pointer that was assigned any of those, the comparison must work as expected. So, below the source code at the compiler level, "NULL" is potentially a bit "magical" in the C and C++ languages...

More about memory addresses, and why you probably don't need to know

More strictly, initialised pointers store a bit-pattern identifying either NULL or a (often virtual) memory address.

The simple case is where this is a numeric offset into the process's entire virtual address space; in more complex cases the pointer may be relative to some specific memory area, which the CPU may select based on CPU "segment" registers or some manner of segment id encoded in the bit-pattern, and/or looking in different places depending on the machine code instructions using the address.

For example, an int* properly initialised to point to an int variable might - after casting to a float* - access memory in "GPU" memory quite distinct from the memory where the int variable is, then once cast to and used as a function pointer it might point into further distinct memory holding machine opcodes for the program (with the numeric value of the int* effectively a random, invalid pointer within these other memory regions).

3GL programming languages like C and C++ tend to hide this complexity, such that:

  • If the compiler gives you a pointer to a variable or function, you can dereference it freely (as long as the variable's not destructed/deallocated meanwhile) and it's the compiler's problem whether e.g. a particular CPU segment register needs to be restored beforehand, or a distinct machine code instruction used

  • If you get a pointer to an element in an array, you can use pointer arithmetic to move anywhere else in the array, or even to form an address one-past-the-end of the array that's legal to compare with other pointers to elements in the array (or that have similarly been moved by pointer arithmetic to the same one-past-the-end value); again in C and C++, it's up to the compiler to ensure this "just works"

  • Specific OS functions, e.g. shared memory mapping, may give you pointers, and they'll "just work" within the range of addresses that makes sense for them

  • Attempts to move legal pointers beyond these boundaries, or to cast arbitrary numbers to pointers, or use pointers cast to unrelated types, typically have undefined behaviour, so should be avoided in higher level libraries and applications, but code for OSes, device drivers, etc. may need to rely on behaviour left undefined by the C or C++ Standard, that is nevertheless well defined by their specific implementation or hardware.


