Can Branches with Undefined Behavior Be Assumed Unreachable and Optimized as Dead Code

Can branches with undefined behavior be assumed unreachable and optimized as dead code?

Does the existence of such a statement in a given program mean that
the whole program is undefined or that behavior only becomes undefined
once control flow hits this statement?

Neither. The first condition is too strong and the second is too weak.

Object accesses are sometimes sequenced, but the standard describes the behavior of the program outside of time. Danvil already quoted:

if any such execution contains an undefined operation, this
International Standard places no requirement on the implementation
executing that program with that input (not even with regard to
operations preceding the first undefined operation)

This can be interpreted as:

If the execution of the program yields undefined behavior, then the whole program has
undefined behavior.

So an unreachable statement with UB doesn't give the program UB, and neither does a reachable statement that (because of the values of inputs) is never reached. That's why your first condition is too strong.
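
For example, a minimal sketch of my own in the spirit of the question's examples:

#include <iostream>

int main() {
    int n;
    std::cin >> n;
    if (n == 3) {
        int *p = nullptr;
        *p = 0;     // null dereference: UB, but only on the executions
                    // where the input actually was 3
    }
    std::cout << "done\n";
}

For every input other than 3 this program's behavior is fully defined; the mere presence of the reachable-but-not-reached UB doesn't poison it.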

Now, the compiler cannot in general tell what has UB. So to allow the optimizer to re-order statements with potential UB that would be re-orderable should their behavior be defined, it's necessary to permit UB to "reach back in time" and go wrong prior to the preceding sequence point (or in C++11 terminology, for the UB to affect things that are sequenced before the UB thing). Therefore your second condition is too weak.

A major example of this is when the optimizer relies on strict aliasing. The whole point of the strict aliasing rules is to allow the compiler to re-order operations that could not validly be re-ordered if it were possible that the pointers in question alias the same memory. So if you use illegally aliasing pointers, and UB does occur, then it can easily affect a statement "before" the UB statement. As far as the abstract machine is concerned the UB statement has not been executed yet. As far as the actual object code is concerned, it has been partly or fully executed. But the standard doesn't try to get into detail about what it means for the optimizer to re-order statements, or what the implications of that are for UB. It just gives the implementation license to go wrong as soon as it pleases.
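
For instance, here is a minimal sketch (my own; the names are invented) of the kind of reordering strict aliasing licenses:

int store_and_read(int *i, float *f) {
    *i = 1;
    *f = 2.0f;   // under strict aliasing, assumed not to modify *i,
    return *i;   // so the compiler may fold this load to the constant 1
}

If a caller passes pointers to the same memory, say store_and_read(&x, reinterpret_cast<float*>(&x)), the store through f is UB, and that UB can appear to strike "before" the store executes, because the returned value was already fixed under the no-alias assumption.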

You can think of this as, "UB has a time machine".

Specifically to answer your examples:

  • Behavior is only undefined if 3 is read.
  • Compilers can and do eliminate code as dead if a basic block contains an operation certain to be undefined. They're permitted to do so (and I'd guess some do) even in cases which aren't a single basic block, but where all branches lead to UB. This example isn't a candidate unless PrintToConsole(3) is somehow known to be sure to return: it could throw an exception or whatever, as the sketch below illustrates.
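
To make that concrete, here is a minimal sketch (my own; it assumes PrintToConsole is an ordinary function the compiler cannot see through):

void PrintToConsole(int);    // defined elsewhere; may throw or not return

void branch_with_ub(int x) {
    if (x == 3) {
        PrintToConsole(3);   // must be kept: if it throws or never
                             // returns, the UB below is never reached
        int *p = nullptr;
        *p = 42;             // unconditional UB *if* control gets here
    }
}

Everything after the call inside the branch is dead and may be removed, but the call itself has to survive.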

A similar example to your second is the gcc option -fdelete-null-pointer-checks, which can take code like this (I haven't checked this specific example, consider it illustrative of the general idea):

void foo(int *p) {
    if (p) *p = 3;
    std::cout << *p << '\n';
}

and change it to:

*p = 3;
std::cout << "3\n";

Why? Because if p is null then the code has UB anyway, so the compiler may assume it is not null and optimize accordingly. The Linux kernel tripped over this (https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2009-1897), essentially because it operates in a mode where dereferencing a null pointer isn't supposed to be UB: it's expected to result in a defined hardware exception that the kernel can handle. When optimization is enabled, gcc requires the use of -fno-delete-null-pointer-checks in order to provide that beyond-standard guarantee.
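
Schematically, the kernel bug had this shape (a hypothetical miniature of my own, not the actual kernel source):

struct Sock { int refcnt; };
struct Tun  { Sock *sk; };

int tun_poll(Tun *tun) {
    Sock *sk = tun->sk;   // dereferences tun *before* the null check:
                          // UB whenever tun is null
    if (!tun)             // so the optimizer may treat this check as
        return -1;        // dead code and delete it
    return sk->refcnt;
}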

P.S. The practical answer to the question "when does undefined behavior strike?" is "10 minutes before you were planning to leave for the day".

Is the behaviour of a program that has undefined behaviour on an unreachable path defined?

It is not necessary to base a position on this question on the usefulness of any given code construct or practice, nor on anything written about C++, whether in its standard or in another SO answer, no matter how similar C++'s definitions may be. The key thing to consider is C's definition of undefined behavior:

behavior, upon use of a nonportable or erroneous program construct or
of erroneous data, for which this International Standard imposes no
requirements

(C2011, 3.4.3/1; emphasis added)

Thus, undefined behavior is triggered temporally ("upon use" of a construct or data), not by mere presence.* It is convenient that this is consistent both for undefined behavior arising from data and for that arising from program constructs; the standard need not have been consistent there. And as another answer describes, this "upon use" definition is a good design choice, as it allows programs to avoid executing undefined behaviors associated with erroneous data.

On the other hand, if a program does execute undefined behavior then it follows from the standard's definition that the whole behavior of the program is undefined. This consequent undefinedness is a more general kind arising from the fact that the UB associated directly with the erroneous data or construct could, in principle, include altering the behavior of other parts of the program, even retroactively (or apparently so). There are of course extra-lingual limitations on what could happen -- so no, nasal demons will not actually be making any appearances -- but those are not necessarily as strong as one might suppose.


* Caveat: some program constructs are used at translation time. These produce UB in program translation, with the result that every execution of the program has wholly-undefined behavior. For a somewhat stupid example, if your program source does not end with an unescaped newline then the program's behavior is completely undefined (see C2011, 5.1.1.2/1, point 2).

How undefined is undefined behavior?

I'd say that the behavior is undefined only if the user enters a number different from 0. After all, if the offending code section is not actually run, the conditions for UB aren't met (i.e. the uninitialized pointer is neither created nor dereferenced).

A hint of this can be found in the standard, at 3.4.3:

behavior, upon use of a nonportable or erroneous program construct or of erroneous data,
for which this International Standard imposes no requirements

This seems to imply that, if such "erroneous data" were instead correct, the behavior would be perfectly defined - which seems pretty much applicable to our case.


Additional example: integer overflow. Any program that performs an addition on user-provided data without extensive checks on it is subject to this kind of undefined behavior - but the addition is UB only when the user provides the particular data that makes it overflow.
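
A minimal sketch of my own illustrating that:

#include <iostream>

int main() {
    int a, b;
    std::cin >> a >> b;
    std::cout << a + b << '\n';   // UB only for the particular inputs
                                  // where a + b overflows int; defined
                                  // for every other pair of inputs
}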

Is it legal for source code containing undefined behavior to crash the compiler?

The normative definition of undefined behavior is as follows:

[defns.undefined]

behavior for which this International Standard imposes no requirements

[ Note: Undefined behavior may be expected when this International
Standard omits any explicit definition of behavior or when a program
uses an erroneous construct or erroneous data. Permissible undefined
behavior ranges from ignoring the situation completely with
unpredictable results, to behaving during translation or program
execution in a documented manner characteristic of the environment
(with or without the issuance of a diagnostic message), to terminating
a translation or execution (with the issuance of a diagnostic
message). Many erroneous program constructs do not engender undefined
behavior; they are required to be diagnosed. Evaluation of a constant
expression never exhibits behavior explicitly specified as undefined.
 — end note ]

While the note itself is not normative, it does describe a range of behaviors implementations are known to exhibit. So crashing the compiler (which is a translation terminating abruptly) is legitimate according to that note. But really, as the normative text says, the standard doesn't place any bounds on either execution or translation. If an implementation steals your passwords, it's not a violation of any contract laid forth in the standard.

Can guaranteed UB be rejected at compile-time?

Can the compiler therefore reject the program and not generate an executable?

Yes. The definition of undefined behaviour is:

behavior for which this International Standard imposes no requirements
[ Note: Undefined behavior may be expected when this International Standard omits any explicit definition of behavior or when a program uses an erroneous construct or erroneous data. Permissible undefined behavior
ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message). Many erroneous program constructs do not engender undefined behavior; they are required to be diagnosed. — end note ]
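
Note that "terminating a translation ... (with the issuance of a diagnostic message)" is explicitly on the list. So given a program in which every possible execution performs UB, for instance (my own sketch):

int main() {
    int *p = nullptr;
    return *p;   // unconditionally dereferences a null pointer
}

an implementation may refuse to produce an executable, provided it issues a diagnostic message.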

When can undefined behavior be considered well-known and accepted?

First of all, any compiler implementation is free to define any behavior it likes in any situation which would, from the point of view of the standard, produce Undefined Behavior.

Secondly, code which is written for a particular compiler implementation is free to make use of any behaviors which are documented by that implementation; code which does so, however, may not be usable on other implementations.

One of the longstanding shortcomings of C is that while there are many situations where constructs which could produce Undefined Behavior on some implementations are handled usefully by others, only a tiny minority of such situations provide any means by which code can specify that a compiler which won't handle them a certain way should refuse compilation. Further, there are many cases in which the Standards Committee allows full-on UB even though on most implementations the "natural" consequences would be much more constrained. Consider, for example (assume int is 32 bits):

#include <stdint.h>

int weird(uint16_t x, int64_t y, int64_t z)
{
    int r = 0;
    if (y > 0) return 1;
    if (z < 0x80000000L) return 2;
    if (x > 50000) r |= 31;
    if (x*x > z) r |= 8;
    if (x*x < y) r |= 16;
    return r;
}

If the above code were run on a machine that simply ignores integer overflow, passing 50001,0,0x80000000L should result in the code returning 31; passing 50000,0,0x80000000L could result in it returning 0, 8, 16, or 24 depending upon how the code handles the comparison operations. The C standard, however, would allow the code to do anything whatsoever in any of those cases; because of that, some compilers might determine that none of the if statements beyond the first two could ever be true in any situation which hadn't invoked Undefined Behavior, and might thus assume that r is always zero. Note that one of those inferences would affect the behavior of a statement which precedes the Undefined Behavior.
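
Concretely, a compiler reasoning that way could legally rewrite the function above as (my own rendering of the inference, not any particular compiler's output):

int weird(uint16_t x, int64_t y, int64_t z)
{
    if (y > 0) return 1;
    if (z < 0x80000000L) return 2;
    /* Past this point y <= 0 and z >= 0x80000000.  In any execution
       free of overflow, x*x fits in a 32-bit int only when x <= 46340,
       so x > 50000 "cannot" happen, x*x <= 46340*46340 < z, and
       x*x >= 0 >= y: every remaining test is false and r stays 0. */
    return 0;
}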

One thing I'd really like to see would be a concept of "Implementation Constrained" behavior, which would be something of a cross between Undefined Behavior and Implementation-Defined Behavior: compilers would be required to document all possible consequences of certain constructs which under the old rules would be Undefined Behavior, but, unlike with Implementation-Defined Behavior, an implementation would not be required to specify one specific thing that would happen. Implementations would be allowed to specify that a certain construct may have arbitrary unconstrained consequences (full UB), but would be discouraged from doing so. In the case of something like integer overflow, a reasonable compromise would be to say that the result of an expression that overflows may be a "magic" value which, if explicitly typecast, will yield an arbitrary (and "ordinary") value of the indicated type, but which may otherwise appear to have arbitrarily changing values which may or may not be representable. Compilers would be allowed to assume that the result of an operation will not be a result of overflow, but would refrain from making inferences about the operands. To use a vague analogy, the behavior would be similar to how floating-point would behave if explicitly typecasting a NaN could yield any arbitrary non-NaN result.

IMHO, C would greatly benefit from combining the above concept of "implementation-constrained" behaviors with some standard predefined macros which would allow code to test whether an implementation makes any particular promises about its behavior in various situations. Additionally, it would be helpful if there were a standard means by which a section of code could request a particular "dialect" [a combination of int size, implementation-constrained behaviors, etc.]. It would be possible to write a compiler for any platform which could, upon request, have promotion rules behave as though int were exactly 32 bits. For example, given code like:

uint64_t l1, l2; uint32_t w1, w2; uint16_t h1, h2;
...
l1 += (h1+h2);
l2 += (w2-w1);

A 16-bit compiler might be fastest if it performed the math on h1 and h2 using 16 bits, and a 64-bit compiler might be fastest if it added to l2 the 64-bit result of subtracting w1 from w2, but if the code was written for a 32-bit system, having compilers for the other two systems generate code which behaves as it did on the 32-bit system would be more helpful than having them generate code which performs some different computation, no matter how much faster the latter code would be.

Unfortunately, there is at present no standard means by which code can ask for such semantics [a fact which will likely limit the efficiency of 64-bit code in many cases]; the best one can do is probably to expressly document the code's environmental requirements somewhere and hope that whoever uses the code sees them.
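
In the meantime, the closest approximation is to spell the 32-bit semantics out by hand. A sketch of my own, reusing the fragment's names:

#include <stdint.h>

void accumulate(uint64_t *l1, uint64_t *l2,
                uint32_t w1, uint32_t w2,
                uint16_t h1, uint16_t h2)
{
    *l1 += (uint32_t)h1 + h2;    /* force the addition into at least
                                    32 bits, so a 16-bit compiler cannot
                                    overflow a 16-bit int */
    *l2 += (uint32_t)(w2 - w1);  /* wrap the difference modulo 2^32, as
                                    a 32-bit compiler would, before
                                    widening to 64 bits */
}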

Is it true that an ill-formed program does not always lead to UB (either at compile time or at run time)?

If a program is ill-formed, a conforming implementation which has adequate resources to process it would be required to issue a diagnostic. Once an implementation has done so, anything it might do after that would be outside the Standard's jurisdiction. Some possible outcomes would be:

  1. Terminate processing after outputting the diagnostic.

  2. After outputting the diagnostic, proceed as though the language were extended to allow whatever constructs would otherwise make the program ill-formed.

  3. Run the last successfully-compiled version of the program, whose behavior might bear no relationship to anything in the current source files.

If a construct is characterized as "ill-formed; no diagnostic required", all of the above remain possible, in addition to a fourth possibility: proceeding as though the language were extended to allow the construct, without bothering to output a diagnostic first.
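
The classic instance of "ill-formed; no diagnostic required" is an ODR violation across translation units, e.g. (my own sketch):

// a.cpp
inline int f() { return 1; }

// b.cpp
inline int f() { return 2; }   // same inline function, different body:
                               // ill-formed, no diagnostic required

No diagnostic is required because detecting the mismatch would force the implementation to compare definitions across separately compiled files.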

Since the Standard only seeks to describe program behavior in terms of the current contents of the source files, and not anything they might have contained at some time in the past, it cannot meaningfully describe what might happen if an implementation is fed an ill-formed source file. That doesn't imply that quality implementations shouldn't seek to avoid accidental execution of obsolete versions of programs, but any measures they might take toward that end would be outside the Standard's jurisdiction.

Note that on some implementations that target embedded or remote systems, option #3 may not be nearly as absurd as it sounds. In many cases, a build-and-execute cycle involves stopping execution on the remote system, feeding it new code, and then restarting execution on that system. If no new code image is available, restarting whatever code image was previously present may in many cases be more useful than leaving the system in a dead state, but there's no way the Standard can say anything meaningful about what that program does, which implies that its execution as a consequence of the failure to build the new program cannot be characterized as anything other than Undefined Behavior.


