How to Explain Undefined Behavior to Know-It-All Newbies

How to explain undefined behavior to know-it-all newbies?

"Congratulations, you've defined the behavior that compiler has for that operation. I'll expect the report on the behavior that the other 200 compilers that exist in the world exhibit to be on my desk by 10 AM tomorrow. Don't disappoint me now, your future looks promising!"

Why is the phrase: undefined behavior means the compiler can do anything it wants true?

Nothing "causes" this to occur. Undefined behaviour cannot "occur". There is no mystical force that descends upon your computer and suddenly makes it create black holes inside of cats.

That anything can happen when you run a program whose behaviour is undefined, is stated as fact by the C++ standard. It's a statement of leeway, a handy excuse used by compilers to make assumptions about your code so as to provide useful optimisations.

For example, if we say that dereferencing nullptr is undefined (which it is) then no compiler needs to ever check that a pointer is not nullptr: it can just assume that a dereferenced pointer will never be nullptr, and if it's not then any consequences are the programmer's problem.

Due to the astounding complexity of compilers, some of those consequences can be rather unexpected.

Of course it is not actually true that "anything can happen". Your computer has neither the necessary physical power nor the necessary legal authority to instantiate a black hole inside of a cat. But since C++ is an abstraction, it seems only fitting that we use abstractions to teach people not to write programs with undefined behaviour. If you program rigorously, assuming that "anything can happen" if your program has undefined behaviour, then you will not be surprised by said rather unexpected consequences, and you will not be tempted to try to "control" the outcome in any way.

Is undefined behavior worth it?

I think the heart of the concern comes from the C/C++ philosophy of speed above all.

These languages were created at a time when raw power was sparse and you needed to get all the optimizations you could just to have something usable.

Specifying how to deal with UB would mean detecting it in the first place and then of course specifying the handling proper. However detecting it is against the speed first philosophy of the languages!

Today, do we still need fast programs ? Yes, for those of us working either with very limited resources (embedded systems) or with very harsh constraints (on response time or transactions per second), we do need to squeeze out as much as we can.

I know the motto throw more hardware at the problem. We have an application where I work:

  • expected time for an answer ? Less than 100ms, with DB calls in the midst (say thanks to memcached).
  • number of transactions per second ? 1200 in average, peaks at 1500/1700.

It runs on about 40 monsters: 8 dual core opteron (2800MHz) with 32GB of RAM. It gets difficult to be "faster" with more hardware at this point, so we need optimized code, and a language that allows it (we did restrain to throw assembly code in there).

I must say that I don't care much for UB anyway. If you get to the point that your program invokes UB then it needs fixing whatever the behavior that actually occurred. Of course it would be easier to fix them if it was reported straight away: that's what debug builds are for.

So perhaps that instead of focusing on UB we should learn to use the language:

  • don't use unchecked calls
  • (for experts) don't use unchecked calls
  • (for gurus) are you sure you really need an unchecked call here ?

And everything is suddenly better :)

How to quickly get started at using and learning Emacs

The biggest thing about learning how to use Emacs is ... (drumroll please) learning how to use Emacs.

Okay, okay, okay. It's a silly answer, and it's a tautology, but it's true. If you start up Emacs, and think to yourself "How could I find every instance of the word 'foobar' in my source tree?" the worst thing you could do is hit Alt + Tab and visit Google.

Seriously.

Learning the help system and how it works is the best thing you can do. It's so nice to just hit C-h a find, and suddenly get all the information you need, right at your fingertips.

The next best thing you could do is install a wonderful little package called Icicles which has some seriously groovy completion functions. After you get it installed, just know that anytime the minibuffer is asking for some kind of input, you can now use regular expressions.

How would this apply to finding every file in your source tree? Well, you'd hit M-x, and then type "find". After that, you could hit (for instance) Shift + Tab and Icicles would kick in, finding every command that prefixes with "find". Alternatively, you could do M-x .find. and it would give you any command with find in it.

Build a cheat sheet. Just keep a saved buffer somewhere that has all of the keyboard shortcuts you use frequently in it. Remove the ones that you know off by heart, and pick up new ones. In most cases when you do a M-x command, the message buffer will tell you what the keyboard shortcut was for that command (if there was one).

Learn. Keyboard. Macros.

Learn. Emacs. Lisp.

Steven Huwig's idea of using some killer applications is a good one. Emacs is easier to use when you want to use it. For me, it was Planner Mode. (I've just moved to Org-mode, and it's even better.)

Unspecified, undefined and implementation defined behavior WIKI for C

C standards define UsB, UB and IDB in a way that can be summarized as follows:

Unspecified Behavior (UsB)

This is a behavior for which the standard gives some alternatives among which the implementation must choose, but it doesn't mandate how and when the choice is to be made. In other words, the implementation must accept user code triggering that behavior without erroring out and must comply with one of the alternatives given by the standard.

Be aware that the implementation is not required to document anything about the choices made. These choices may also be non-deterministic or dependent (in an undocumented way) on compiler options.

To summarize: the standard gives some possibilities among which to choose, the implementation chooses when and how the specific alternative is selected and applied.

Note that the standard may provide a really large number of alternatives. The typical example is the initial value of local variables that are not explicitly initialized. The standard says that this value is unspecified as long as it is a valid value for the variable's data type.

To be more specific consider an int variable: an implementation is free to choose any int value, and this choice can be completely random, non-deterministic or be at the mercy of the whims of the implementation, which is not required to document anything about it. As long as the implementation stays within the limits stated by the standard this is ok and the user cannot complain.

Undefined Behavior (UB)

As the naming indicates this is a situation in which the C standard doesn't impose or guarantee what the program would or should do. All bets are off. Such a situation:

  • renders a program either erroneous or nonportable

  • doesn't require absolutely anything from the implementation

This is a really nasty situation: as long as there is a piece of code that has undefined behavior, the entire program is considered erroneous and the implementation is allowed by the standard to do everything.

In other words, the presence of a cause of UB allows the implementation to completely ignore the standard, as long as the program triggering the UB is concerned.

Note that the actual behavior in this case may cover an unlimited range of possibilities, the following is by no means an exhaustive list:

  • A compile-time error may be issued.
  • A run-time error may be issued.
  • The problem is completely ignored (and this may lead to program bugs).
  • The compiler silently throws the UB-code away as an optimization.
  • Your hard disk may be formatted.
  • Your computer may erase your bank account and ask your girlfriend for a date.

I hope the last two (half-serious) items can give you the right gut-feeling about the nastiness of UB. And even though most implementations will not insert the necessary code to format you hard drive, real compilers do optimize!

Terminology Note: Sometimes people argue that some piece of code which the standard deems a source of UB in their implementation/system/environment work in a documented way, therefore it cannot be really UB. This reasoning is wrong, but it is a common (and somewhat understandable) misunderstanding: when the term UB (and also UsB and IDB) is used in a C context it is meant as a technical term whose precise meaning is defined by the standard(s). In particular the word "undefined" loses its everyday meaning. Therefore it doesn't make sense to show examples where erroneous or nonportable programs produce "well-defined" behavior as counterexamples. If you try, you really miss the point. UB means that you lose all the guarantees of the standard. If your implementation provides an extension then your guarantees are only those of your implementation. If you use that extension your program is no more a conforming C program (in a sense, it is no more a C program, since it doesn't follow the standard any longer!).

Usefulness of undefined behavior

A common question about UB is something on these lines: "If UB is so nasty, why does not the standard mandate that an implementation issues an error when faced with UB?"

First, optimizations. Allowing implementations not to check for possible causes of UB allows lots of optimizations that make a C program extremely efficient. This is one of the features of C, although it makes C a source of many pitfalls for beginners.

Second, the existence of UB in the standards allows a conforming implementation to provide extensions to C without being deemed non-conforming as a whole.

As long as an implementation behaves as mandated for a conforming program, it is itself conforming, although it may provide non-standard facilities that may be useful on specific platforms. Of course the programs using those facilities will be nonportable and will rely on documented UB, i.e. behavior that is UB according to the standard, but that an implementation documents as an extension.

Implementation-defined Behavior (IDB)

This is a behavior that can be described in a way similar to UsB: the standard provides some alternatives and the implementation choose one, but the implementation is required to document exactly how the choice is made.

This means that a user reading her compiler's documentation must be given enough information to predict exactly what will happen in the specific case.

Note that an implementation that doesn't fully document an IDB cannot be deemed conforming. A conforming implementation must document exactly what happens in any case that the standard declares IDB.



Examples of unspecified behavior

Order of evaluation

Function arguments

The order of evaluation for function arguments is unspecified EXP30-C.

For instance, in c(a(), b()); it is unspecified whether the function a is called before or after b. The only guarantee is that both are called before the c function.



Examples of undefined behavior

Pointers

Dereferencing of null pointer

Null pointers are used to signal that a pointer does not point to valid memory. As such, it does not make much sense to try to read or write to memory via a null pointer.

Technically, this is undefined behaviour. However, since this is a very common source of bugs, most C-environments ensure that most attempts to dereference a null pointer will immediately crash the program (usually killing it with a segmentation fault). This guard is not perfect due to the pointer arithmetic involved in references to arrays and/or structures, so even with modern tools, dereferencing a null pointer may format your hard drive.

Dereferencing of uninitialized pointer

Just like null pointers, dereferencing a pointer before explitely setting its value is UB. Unlike for null pointers, most environments do not provide any safety net against this sort of error, except that compiler can warn about it. If you compile your code anyway, you'll are likely to experience the whole nastiness of UB.

Dereferencing of invalid pointers

An invalid pointer is a pointer that contains an address that is not within any allocated memory area. Common ways to create invalid pointers is to call free() (after the call, the pointer will be invalid, which is pretty much the point of calling free()), or to use pointer arithmetic to get an address that is beyond the limits of an allocated memory block.

This is the most evil variant of pointer dereferencing UB: There is no safety net, there is no compiler warning, there is just the fact that the code may do anything. And commonly, it does: Most malware attacks use this kind of UB behaviour in programs to make the programs behave as they want them to behave (like installing a trojan, keylogger, encrypting your hard drive etc.). The possibility of a formatted hard drive becomes very real with this kind of UB!

Casting away constness

If we declare an object as const, we give a promise to the compiler that we will never change the value of that object. In many contexts compilers will spot such an invalid modification and shout at us. But if we cast the constness away as in this snippet:

int const a = 42;
...
int* ap0 = &a; //< error, compiler will tell us
int* ap1 = (int*)&a; //< silences the compiler
...
*ap1 = 43; //< UB ==> program crash?

the compiler might not be able to track this invalid access, compile the code to an executable and only at run time the invalid access will be detected and lead to a program crash.

category 2

put a title here!

put your explanation here!



Examples of implementation-defined behavior

category 1

put a title here!

put your explanation here!

Why does this code print 1 2 2 and not the expected 3 3 1?

The author is wrong. Not only is the order of evaluation of function arguments unspecified in C, the evaluations are unsequenced with regards to each other. Adding to the injury, reading and modifying the same object without an intervening sequence point in independent expressions (here the value of a is evaluated in 3 independent expressions and modified in 2) has undefined behaviour, so the compiler has the liberty of producing any kind of code that it sees fit.

For details, see Why are these constructs using pre and post-increment undefined behavior?



Related Topics



Leave a reply



Submit