How to Produce Deterministic Binary Output with G++

How to produce deterministic binary output with g++?

We also depend on bit-identical rebuilds, and are using gcc-4.7.x.

Besides setting PWD=/proc/self/cwd and using -frandom-seed=<input-file-name>, there are a handful of patches, which can be found in the svn://gcc.gnu.org/svn/gcc/branches/google/gcc-4_7 branch.
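
A minimal sketch of those two measures (foo.c is a hypothetical file; the PWD setting normalizes the working directory that gets recorded in debug info, and the seed pins otherwise random symbol names):

# Use the per-file input name as the random seed, as suggested above.
PWD=/proc/self/cwd gcc -g -c foo.c -o foo.o -frandom-seed=foo.c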

Binary object file changing in each build

Copied from the GCC man-page:

-frandom-seed=string

This option provides a seed that GCC uses when it would otherwise use random numbers. It is used to generate certain symbol names that have to be different in every compiled file. It is also used to place unique stamps in coverage data files and the object files that produce them. You can use the -frandom-seed option to produce reproducibly identical object files.

The string should be different for every file you compile.

GCC .obj file output is not deterministic (.debug_info, PROGBITS section)

Causes of non-determinism

The usual culprits are the macros __DATE__, __TIME__, and __TIMESTAMP__, which the compiler expands to values calculated from the system time.

One possibility is that the debug info generated for the binary is written in a non-deterministic manner. This could happen, for example, when the in-memory layout of the debug info in the compiler process is not deterministic. I don't know the internals of GCC, but I would guess something like this can happen when:

  • using random GUIDs in debug information output,
  • using mangling of symbols in anonymous namespaces, or
  • sequentially serializing a hash map whose keys are pointers into memory.

The latter source of non-determinism is usually considered a bug in the compiler (e.g. GCC PR65015).

Mitigation

To force reproducible expansions of the __DATE__, __TIME__, and __TIMESTAMP__ macros, one has to present a fixed, fake system time to the compiler (e.g. by using libfaketime/faketime). The -Wdate-time command-line option to GCC can be used to warn whenever these predefined macros are used.
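
A minimal sketch of both measures, assuming the faketime tool is installed and foo.c is a hypothetical source file:

gcc -Wdate-time -c foo.c                      # warn whenever __DATE__ etc. are expanded
faketime '2008-12-24 08:15:42' gcc -c foo.c   # pin the clock the compiler sees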

To force reproducible "randomness" for GUIDs and mangling, you could try to compile with -frandom-seed=<somestring> where <somestring> is a unique string for your build (e.g. the hash of the contents of the source file you're compiling should do it).
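
A content hash works well as the seed, since it only changes when the source itself changes; a minimal sketch with a hypothetical foo.c:

gcc -c foo.c -o foo.o -frandom-seed="$(sha256sum foo.c | cut -d' ' -f1)"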

Alternatively, you can try to compile without debug information (e.g. without the -ggdb etc. flags) or use a strip tool to remove the debug information sections later.
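
A sketch of both variants (foo.c is again hypothetical):

gcc -c foo.c -o foo.o       # no -g/-ggdb, so no debug sections are emitted
strip --strip-debug foo.o   # or remove the .debug_* sections from an existing object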

See also

  • https://wiki.debian.org/ReproducibleBuilds/About
  • How to produce deterministic binary output with g++? - Stack Overflow

What could cause a deterministic process to generate floating point errors

In almost any situation where there's a fast mode and a safe mode, you'll find a trade-off of some sort. Otherwise everything would run in fast-safe mode :-).

And, if you're getting different results with the same input, your process is not deterministic, no matter how much you believe it to be (in spite of the empirical evidence).

I would say your explanation is the most likely. Put it in safe mode and see if the non-determinism goes away. That will tell you for sure.
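
One concrete way to run that experiment, sketched with hypothetical program and file names: rebuild in safe mode, then run the same binary on the same input twice and compare the outputs.

./simulation input.dat > run1.txt
./simulation input.dat > run2.txt
diff run1.txt run2.txt && echo "runs are identical"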

As to whether there are other optimizations: if you're compiling on the same hardware with the same compiler/linker and the same options to those tools, it should generate identical code. I can't see any possibility other than the fast mode (or bit rot in the memory due to cosmic rays, but that's pretty unlikely).

Following your update:

Intel has a document here which explains some of the things they're not allowed to do in safe mode, including but not limited to:

  • reassociation: (a+b)+c -> a+(b+c) (illustrated in the sketch after this list).
  • zero folding: x + 0 -> x, x * 0 -> 0.
  • reciprocal multiply: a/b -> a*(1/b).
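
To see why reassociation alone breaks reproducibility, here is a minimal sketch (file and variable names are hypothetical) showing that IEEE-754 addition is not associative, so any reordering of the sums changes the result:

cat > reassoc.c <<'EOF'
#include <stdio.h>

int main(void) {
    double a = 1e16, b = -1e16, c = 1.0;
    printf("(a + b) + c = %g\n", (a + b) + c); /* prints 1 */
    printf("a + (b + c) = %g\n", a + (b + c)); /* prints 0: b + c rounds back to -1e16 */
    return 0;
}
EOF
gcc -O0 reassoc.c -o reassoc && ./reassoc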

While you state that these operations are compile-time defined, the Intel chips are pretty darned clever. They can re-order instructions to keep pipelines full in multi-CPU set-ups, so, unless the code specifically prohibits such behavior, things may change at run-time (not compile-time) to keep things going at full speed.

This is covered (briefly) on page 15 of that linked document, which talks about vectorization ("Issue: different results re-running the same binary on the same data on the same processor").

My advice would be to decide whether you need raw grunt or total reproducibility of results, and then choose the mode based on that.

How to check whether two executable binary files are generated from same source code?

In general, this is completely impossible to do.

  • You can generate different binaries from the same source.
  • Two identical binaries can be generated from different sources.

It is possible to add version information in different ways. However, you can fool all of those methods quite easily if you want.

Here is a short script that might help you. Note that it might have flaws; it's just to show the idea. Don't copy it straight into production code.

#!/bin/bash
# Embed the MD5 checksum of the source file in the compiled binary so
# the binary can later be matched against the source it was built from.

SRC="$1"
NEWNAME="$SRC.aux.c"

cp "$SRC" "$NEWNAME"
# Append a top-level asm statement that stores the checksum as a string.
echo "asm(\".ascii \\\"$(md5sum "$SRC")\\\"\");" >> "$NEWNAME"
gcc "$NEWNAME"

What it does, basically, is make sure that the md5sum of the source gets included as a string in the binary. It's gcc-specific, and you can read more about the idea here: embed string via header that cannot be optimized away
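
To check a binary later, you could search it for the checksum of the candidate source; a minimal sketch, assuming the script above compiled a hypothetical foo.c into a.out:

strings a.out | grep "$(md5sum foo.c | cut -d' ' -f1)" && echo "source matches"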

Why is the binary output not equal when compiling again?

ANOTHER UPDATE:

Since 2015 the compiler team has been making an effort to get sources of non-determinism out of the compiler toolchain, so that identical inputs really do produce identical outputs. See the "Concept-determinism" tag on the Roslyn GitHub repository for more details.


UPDATE: This question was the subject of my blog in May 2012. Thanks for the great question!


How is this possible?

Very easily.

Isn't the binary result supposed to be exactly equal for the same input?

Absolutely not. The opposite is true. Every time you run the compiler you should get a different output. Otherwise how could you know that you'd recompiled?

The C# compiler embeds a freshly generated GUID in an assembly on every compilation, thereby guaranteeing that no two compilations produce exactly the same result.

Moreover -- even without the GUID, the compiler makes no guarantees whatsoever that two "identical" compilations will produce the same results.

In particular, the order in which the metadata tables are populated is highly dependent on details of the file system; the C# compiler starts generating metadata in the order in which the files are given to it, and that can be subtly changed by a variety of factors.

due to the way our build server works, the checked-in changes trigger a rebuild, causing the once-again-modified binary files to be checked in again, in a circle.

I'd fix that if I were you.

C (or any) compilers' deterministic performance

For safety-critical embedded applications, certifying agencies require you to satisfy the "proven-in-use" requirement for the compiler. There are typically certain requirements (kind of like "hours of operation") that need to be met and proven with detailed documentation. However, most people either cannot or don't want to meet these requirements, because it can be very difficult, especially on your first project with a new target/compiler.

One other approach is basically to NOT trust the compiler's output at all. Any compiler deficiencies, and even language-dependent ones (Appendix G of the C90 standard, anyone?), need to be covered by a strict regime of static analysis, unit testing, and coverage testing, in addition to the later functional testing.

A standard like MISRA C can help to restrict the input to the compiler to a "safe" subset of the C language. Another approach is to restrict the input to the compiler to a subset of the language and test what the output is for that entire subset. If our application is built only of components from that subset, the output of the compiler is assumed to be known. This usually goes by the name "qualification of the compiler".

The goal of all of this is to be able to answer the QA representative's question with "We don't just rely on the determinism of the compiler; this is how we prove it...".


