Is Undefined Behavior Worth It

Is undefined behavior worth it?

I think the heart of the concern comes from the C/C++ philosophy of speed above all.

These languages were created at a time when raw computing power was scarce, and you needed every optimization you could get just to have something usable.

Specifying how to deal with UB would mean detecting it in the first place and then, of course, specifying the handling proper. But detecting it runs against the speed-first philosophy of the languages!

Today, do we still need fast programs? Yes. Those of us working either with very limited resources (embedded systems) or under very harsh constraints (on response time or transactions per second) still need to squeeze out as much performance as we can.

I know the motto: throw more hardware at the problem. We have an application where I work:

  • Expected time for an answer? Less than 100 ms, with DB calls along the way (say thanks to memcached).
  • Number of transactions per second? 1,200 on average, with peaks at 1,500–1,700.

It runs on about 40 monsters: 8 dual-core Opterons (2.8 GHz) with 32 GB of RAM each. It gets difficult to be "faster" with more hardware at this point, so we need optimized code, and a language that allows it (we did refrain from throwing assembly code in there).

I must say that I don't care much about UB anyway. If your program gets to the point of invoking UB, it needs fixing, whatever behavior actually occurred. Of course, it would be easier to fix if the UB were reported straight away: that's what debug builds are for.
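For instance, a minimal sketch of the "report it straight away" idea: an assertion guarding a UB-prone operation fires immediately in debug builds and compiles away (under NDEBUG) in release builds:

#include <cassert>

int divide(int num, int den) {
    assert(den != 0);   // debug build: fails loudly, right at the culprit
    return num / den;   // release build: division by zero here is UB
}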

So perhaps, instead of focusing on UB, we should learn to use the language:

  • don't use unchecked calls (see the sketch below)
  • (for experts) don't use unchecked calls
  • (for gurus) are you sure you really need an unchecked call here?

And everything is suddenly better :)
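For concreteness, here is a minimal sketch of the checked/unchecked distinction, using std::vector (the same logic applies to raw pointer arithmetic, the C string functions, and friends):

#include <cstddef>
#include <vector>

int read_element(const std::vector<int>& v, std::size_t i) {
    return v.at(i);   // checked: throws std::out_of_range on a bad index
    // return v[i];   // unchecked: a bad index is UB; reserve this for hot
    //                // paths where the index is already proven valid
}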

Why is the phrase "undefined behavior means the compiler can do anything it wants" true?

Nothing "causes" this to occur. Undefined behaviour cannot "occur". There is no mystical force that descends upon your computer and suddenly makes it create black holes inside of cats.

That anything can happen when you run a program whose behaviour is undefined is stated as fact by the C++ standard. It's a statement of leeway, a handy excuse used by compilers to make assumptions about your code so as to provide useful optimisations.

For example, if we say that dereferencing nullptr is undefined (which it is) then no compiler ever needs to check that a pointer is not nullptr: it can just assume that a dereferenced pointer is never nullptr, and if it is, any consequences are the programmer's problem.
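A minimal sketch of the kind of optimization this licenses (the function name is illustrative):

// Because *p is UB when p is null, the compiler may infer p != nullptr
// from the dereference and treat the later check as dead code.
int value_then_check(int* p) {
    int v = *p;            // compiler: "p was dereferenced, so p != nullptr"
    if (p == nullptr) {    // may be treated as always false...
        return -1;         // ...making this branch unreachable
    }
    return v;
}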

Due to the astounding complexity of compilers, some of those consequences can be rather unexpected.

Of course it is not actually true that "anything can happen". Your computer has neither the necessary physical power nor the necessary legal authority to instantiate a black hole inside of a cat. But since C++ is an abstraction, it seems only fitting that we use abstractions to teach people not to write programs with undefined behaviour. If you program rigorously, assuming that "anything can happen" if your program has undefined behaviour, then you will not be surprised by said rather unexpected consequences, and you will not be tempted to try to "control" the outcome in any way.

How to explain undefined behavior to know-it-all newbies?

"Congratulations, you've defined the behavior that compiler has for that operation. I'll expect the report on the behavior that the other 200 compilers that exist in the world exhibit to be on my desk by 10 AM tomorrow. Don't disappoint me now, your future looks promising!"

Why is undefined behavior allowed (as opposed to not compiling/crashing)?

The concept of undefined behaviour is required in languages like C and C++, because detecting the conditions that cause it would be impossible or prohibitively expensive. Take for example this code:

int * p = new int(0);
// lots of conditional code, somewhere in which we do
int * q = p;
// lots more conditional code, somewhere in which we do
delete p;
// even more conditional code, somewhere in which we do
delete q;

Here the pointer has been deleted twice, resulting in undefined behaviour. Detecting this error in the general case is too hard for a language like C or C++.

Undefined behavior, in principle

Frama-C's value analysis, a static analyzer whose purported goal is to find all undefined behaviors in a C program, considers the assignment const int b = a; (where a is uninitialized) as okay. This is a deliberate design decision, made to allow memcpy() (typically implemented as a loop over the unsigned char elements of a virtual array, and arguably allowed by the C standard to be re-implemented as such) to copy one struct (which can have padding and uninitialized members) to another.

The “exception” is only for lvalue = lvalue; assignments without an intervening conversion, that is, for an assignment that amounts to a copy of a slice of memory from one memory location to another.

I (as one of the authors of Frama-C's value analysis) discussed this with Xavier Leroy at a time when he was himself wondering about the definition to pick for the verified C compiler CompCert, so he may have ended up using the same definition. In my opinion it is cleaner than what the C standard tries to do with indeterminate values that can be trap representations and with the type unsigned char, which is guaranteed not to have any trap representations. But both CompCert and Frama-C assume relatively non-exotic targets, and perhaps the standardization committee was trying to accommodate platforms where reading an uninitialized int can indeed abort the program.
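The program under analysis is not reproduced here; a hypothetical reconstruction along these lines (names chosen to match the diagnostics below) shows the pattern being discussed:

#include <stdio.h>

int foo(void)
{
    int a;                /* never initialized */
    const int b = a;      /* lvalue = lvalue copy: accepted, per the above */
    int n1 = b, n2 = b, n3 = b;
    printf("%d %d %d\n", n1, n2, n3);  /* reads of uninitialized data */
    return b;
}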

Returning b, or passing n1, n2 or n3 to printf at the end, can at least be considered undefined behavior, because copying an uninitialized slice of memory does not make it initialized. With an oldish Frama-C version:

$ frama-c -val t.c

t.c:19:… accessing uninitialized left-value: assert \initialized(&n1);

And in an oldish version of CompCert, after minor modifications to make the program acceptable to it:

$ ccomp -interp t.c
Time 33: in function foo, expression <loc> = <undef>
ERROR: Undefined behavior

Does an expression with undefined behaviour that is never actually executed make a program erroneous?

"If a side effect on a scalar object is unsequenced relative to ..." etc.

Side effects are changes in the state of the execution environment (1.9/12). A change is a change, not an expression that, if evaluated, would potentially produce a change. If there is no change, there is no side effect. If there is no side effect, then no side effect is unsequenced relative to anything else.
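A minimal sketch of the distinction (assuming C++11 sequencing rules):

// The unsequenced modifications below are UB only in executions that
// actually take the branch; calling f(false, 0) has defined behavior,
// because no side effect on i ever occurs in that execution.
int f(bool take_branch, int i) {
    if (take_branch) {
        i = ++i + i++;   // two unsequenced side effects on i
    }
    return i;
}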

This does not mean that any code which is never executed is UB-free (though I'm pretty sure most of it is). I originally added the caveat that each occurrence of UB in the standard needs to be examined separately, but that text (struck out in the original answer) is probably overly cautious; see below.

The standard also says that

A conforming implementation executing a well-formed program shall produce the same observable behavior as one of the possible executions of the corresponding instance of the abstract machine with the same program and the same input. However, *if any such execution contains an undefined operation*, this International Standard places no requirement on the implementation executing that program with that input (not even with regard to operations preceding the first undefined operation).

(emphasis mine)

This, as far as I can tell, is the only normative reference that says what the phrase "undefined behavior" means: an undefined operation in a program execution. No execution, no UB.

How to quickly get started using and learning Emacs

The biggest thing about learning how to use Emacs is ... (drumroll please) learning how to use Emacs.

Okay, okay, okay. It's a silly answer, and it's a tautology, but it's true. If you start up Emacs, and think to yourself "How could I find every instance of the word 'foobar' in my source tree?" the worst thing you could do is hit Alt + Tab and visit Google.

Seriously.

Learning the help system and how it works is the best thing you can do. It's so nice to just hit C-h a, type "find", and suddenly get all the information you need, right at your fingertips.

The next best thing you could do is install a wonderful little package called Icicles which has some seriously groovy completion functions. After you get it installed, just know that anytime the minibuffer is asking for some kind of input, you can now use regular expressions.

How would this apply to finding every file in your source tree? Well, you'd hit M-x, and then type "find". After that, you could hit (for instance) Shift + Tab and Icicles would kick in, finding every command that starts with "find". Alternatively, you could do M-x .find. and it would give you any command with "find" in it.

Build a cheat sheet. Just keep a saved buffer somewhere that has all of the keyboard shortcuts you use frequently in it. Remove the ones that you know by heart, and pick up new ones. In most cases when you run an M-x command, the message buffer will tell you what the keyboard shortcut for that command was (if there is one).

Learn. Keyboard. Macros.

Learn. Emacs. Lisp.

Steven Huwig's idea of using some killer applications is a good one. Emacs is easier to use when you want to use it. For me, it was Planner Mode. (I've just moved to Org-mode, and it's even better.)

Is undefined behavior only an issue if you are deploying on several platforms?

OS changes, innocuous system changes (different hardware version!), or compiler changes can all cause previously "working" UB to not work.

But it is worse than that.

Sometimes a change to an unrelated compilation unit, or to far-away code in the same compilation unit, can cause previously "working" UB to not work. As an example: two inline functions or methods with different definitions but the same signature. One is silently discarded during linking, and completely innocuous code changes can change which one is discarded.
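A sketch of that inline-function case (an ODR violation; the file names are illustrative):

// a.cpp
inline int answer() { return 42; }
int from_a() { return answer(); }

// b.cpp
inline int answer() { return 7; }   // same signature, different definition
int from_b() { return answer(); }   // after linking, this may quietly call
                                    // whichever definition the linker kept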

Code that works in one context can suddenly stop working with the same compiler, OS and hardware when you use it in a different context. An example of this is violating strict aliasing: the compiled code might work when called at spot A, but when inlined (possibly at link time!) the code can change meaning.
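A minimal sketch of a strict-aliasing violation, with a defined alternative for contrast (assuming sizeof(int) == sizeof(float), as on common platforms):

#include <cstring>

int float_bits_ub(float f) {
    return *reinterpret_cast<int*>(&f);   // UB: int lvalue aliasing a float
}

int float_bits_ok(float f) {
    int i;
    std::memcpy(&i, &f, sizeof i);        // defined: copies the raw bytes
    return i;
}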

Your code, if part of a larger project, could conditionally call some 3rd party code (say, a shell extension that previews an image type in a file open dialog) that changes the state of some flags (floating point precision, locale, integer overflow flags, division by zero behavior, etc). Your code, which worked fine before, now exhibits completely different behavior.

Next, many kinds of undefined behavior are inherently non-deterministic. Accessing memory through a pointer after it is freed (even writing through it) might be safe 99 times out of 100; but 1 time in 100 the page was swapped out, or something else was written there before you got to it. Now you have memory corruption. It passes all your tests, but you lacked complete knowledge of what can go wrong.
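A sketch of that kind of latent bug:

int stale_read() {
    int* p = new int(42);
    delete p;
    // UB: this may "work" and return 42 for years, until the allocator
    // reuses the memory or the page is unmapped.
    return *p;
}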

By using undefined behavior, you commit yourself to a complete understanding of the C++ standard, everything your compiler can do in that situation, and every way the runtime environment can react. You have to audit the produced assembly, not the C++ source, possibly for the entire program, every time you build it! You also commit everyone who reads that code, or who modifies that code, to that level of knowledge.

It is sometimes still worth it.

Fastest Possible Delegates uses UB and knowledge of calling conventions to build a really fast non-owning std::function-like type.

Impossibly Fast Delegates competes. It is faster in some situations, slower in others, and is compliant with the C++ standard.

Using the UB might be worth it, for the performance boost. It is rare that you gain something other than performance (speed or memory usage) from such UB hackery.

Another example I've seen is when we had to register a callback with a poor C API that just took a function pointer. We'd create a function (compiled without optimization), copy it to another page, modify a pointer constant within that function, then mark that page as executable, allowing us to secretly pass a pointer along with the function pointer to the callback.

An alternative implementation would be to have some fixed-size set of functions (10? 100? 1000? 1 million?), all of which look up a std::function in a global array and invoke it. This puts a limit on how many such callbacks can be installed at any one time, but in practice that was sufficient.
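A minimal sketch of that standard-compliant alternative (the pool size and all names are illustrative):

#include <array>
#include <cstddef>
#include <functional>

using Callback = std::function<void()>;
using c_callback = void (*)();            // what the C API accepts

static std::array<Callback, 4> g_table;   // the "user data" lives here

template <std::size_t I>
void trampoline() { g_table[I](); }       // plain function with I baked in

static constexpr std::array<c_callback, 4> g_slots = {
    &trampoline<0>, &trampoline<1>, &trampoline<2>, &trampoline<3>,
};

// Claim slot i, store the closure, and return a plain function pointer
// suitable for registering with the C API.
c_callback make_c_callback(std::size_t i, Callback cb) {
    g_table[i] = std::move(cb);
    return g_slots[i];
}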


