How to produce deterministic binary output with g++?
We also depend on bit-identical rebuilds, and are using gcc-4.7.x.
Besides setting PWD=/proc/self/cwd and using -frandom-seed=<input-file-name>, there are a handful of patches, which can be found in the svn://gcc.gnu.org/svn/gcc/branches/google/gcc-4_7 branch.
Binary object file changing in each build
Copied from the GCC man-page:
-frandom-seed=string
This option provides a seed that GCC uses when it would otherwise use random numbers. It is used to generate certain symbol names that have to be different in every compiled file. It is also used to place unique stamps in coverage data files and the object files that produce them. You can use the -frandom-seed option to produce reproducibly identical object files. The string should be different for every file you compile.
GCC .obj file output is not deterministic (.debug_info, PROGBITS section)
Causes for non-determinism
The usual culprits are the macros __DATE__, __TIME__, and __TIMESTAMP__, which the compiler expands to values calculated from the system time.
One possibility is that the debug info generated for the binary is written in a non-deterministic manner. This could happen, for example, when the in-memory layout of the debug info in the compiler process is not deterministic. I don't know the internals of GCC, but I guess something like this can happen when:
- using random GUIDs in debug information output,
- using mangling of symbols in anonymous namespaces, or
- serializing a hashmap where the key is a pointer to memory, sequentially.
The latter source of non-determinism is usually considered a bug in the compiler (e.g. GCC PR65015).
Mitigation
To force reproducible expansions of the __DATE__, __TIME__ and __TIMESTAMP__ macros, one has to fake the system time presented to the compiler (e.g. by using libfaketime/faketime). The -Wdate-time command-line option to GCC can be used to warn whenever these predefined macros are used.
To force reproducible "randomness" for GUIDs and mangling, you could try to compile with -frandom-seed=<somestring>, where <somestring> is a unique string for your build (e.g. the hash of the contents of the source file you're compiling should do it).
Alternatively, you can try to compile without debug information (e.g. without the -ggdb etc. flags) or use a strip tool to remove the debug information section later.
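A sketch of the strip-it-later variant, using the usual binutils tools (file names are illustrative):

```shell
# Sketch: drop the DWARF .debug_* sections after the fact instead of
# rebuilding without -g.
cat > lib.c <<'EOF'
int twice(int x) { return 2 * x; }
EOF

gcc -c -g -o lib.o lib.c
readelf -S lib.o | grep debug_info          # .debug_info is present

strip --strip-debug lib.o
readelf -S lib.o | grep debug_info || echo "debug info removed"
```

Note this only removes one source of variation; the stripped objects can still differ for the other reasons listed above.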
See also
- https://wiki.debian.org/ReproducibleBuilds/About
- How to produce deterministic binary output with g++? - Stack Overflow
What could cause a deterministic process to generate floating point errors
In almost any situation where there's a fast mode and a safe mode, you'll find a trade-off of some sort. Otherwise everything would run in fast-safe mode :-).
And, if you're getting different results with the same input, your process is not deterministic, no matter how much you believe it to be (in spite of the empirical evidence).
I would say your explanation is the most likely. Put it in safe mode and see if the non-determinism goes away. That will tell you for sure.
As to whether there are other optimizations, if you're compiling on the same hardware with the same compiler/linker and the same options to those tools, it should generate identical code. I can't see any other possibility other than the fast mode (or bit rot in the memory due to cosmic rays, but that's pretty unlikely).
Following your update:
Intel has a document here which explains some of the things they're not allowed to do in safe mode, including but not limited to:
- reassociation: (a+b)+c -> a+(b+c)
- zero folding: x + 0 -> x, x * 0 -> 0
- reciprocal multiply: a/b -> a*(1/b)
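The reassociation point is easy to see concretely: IEEE floating-point addition is not associative, so reordering a sum changes the rounding. A quick check (awk computes in double precision, like C's double, on common platforms):

```shell
# Sketch: (a+b)+c and a+(b+c) round differently in IEEE doubles,
# which is why reassociation is forbidden in "safe" mode.
awk 'BEGIN { printf "%.17g\n", (0.1 + 0.2) + 0.3 }'
awk 'BEGIN { printf "%.17g\n", 0.1 + (0.2 + 0.3) }'
```

The two sums print different values (0.6000... with different trailing digits), so a compiler or CPU that reorders the additions changes the result.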
While you state that these operations are compile-time defined, the Intel chips are pretty darned clever. They can re-order instructions to keep pipelines full in multi-CPU set-ups, so unless the code specifically prohibits such behavior, things may change at run-time (not compile-time) to keep things going at full speed.
This is covered (briefly) on page 15 of that linked document that talks about vectorization ("Issue: different results re-running the same binary on the same data on the same processor").
My advice would be to decide whether you need raw grunt or total reproducibility of results and then choose the mode based on that.
How to check whether two executable binary files are generated from same source code?
In general, this is completely impossible to do.
- You can generate different binaries from the same source
- Two identical binaries can be generated from different sources
It is possible to add version information in different ways. However, you can fool all of those methods quite easily if you want.
Here is a short script that might help you. Note that it might have flaws; it's just to show the idea. Don't just copy this and use it in production code.
#!/bin/bash
# Append the md5 of the source file as an embedded string, then compile.
# Quoting matters: md5sum's output contains two spaces that an unquoted
# echo would collapse into one.
STR="asm(\".ascii \\\"$(md5sum "$1")\\\"\");"
NEWNAME="$1.aux.c"
cp "$1" "$NEWNAME"
echo "$STR" >> "$NEWNAME"
gcc "$NEWNAME"
What it basically does is make sure that the md5sum of the source gets included as a string in the binary. It's gcc-specific, and you can read more about the idea here: embed string via header that cannot be optimized away
Why is the binary output not equal when compiling again?
ANOTHER UPDATE:
Since 2015 the compiler team has been making an effort to get sources of non-determinism out of the compiler toolchain, so that identical inputs really do produce identical outputs. See the "Concept-determinism" tag on the Roslyn github for more details.
UPDATE: This question was the subject of my blog in May 2012. Thanks for the great question!
How is this possible?
Very easily.
Isn't the binary result supposed to be exactly equal for the same input?
Absolutely not. The opposite is true. Every time you run the compiler you should get a different output. Otherwise how could you know that you'd recompiled?
The C# compiler embeds a freshly generated GUID in an assembly on every compilation, thereby guaranteeing that no two compilations produce exactly the same result.
Moreover, even without the GUID, the compiler makes no guarantees whatsoever that two "identical" compilations will produce the same results.
In particular, the order in which the metadata tables are populated is highly dependent on details of the file system; the C# compiler starts generating metadata in the order in which the files are given to it, and that can be subtly changed by a variety of factors.
Due to the way our build server works, the checked-in changes trigger a rebuild, causing the once-again modified binary files to be checked in, in a circle.
I'd fix that if I were you.
C (or any) compilers deterministic performance
For safety-critical embedded applications, certifying agencies require satisfying the "proven-in-use" requirement for the compiler. There are typically certain requirements (kind of like "hours of operation") that need to be met and proven by detailed documentation. However, most people either cannot or do not want to meet these requirements, because doing so can be very difficult, especially on your first project with a new target/compiler.
One other approach is basically to NOT trust the compiler's output at all. Any compiler-dependent and even language-dependent deficiencies (Appendix G of the C90 standard, anyone?) need to be covered by a strict set of static analysis, unit and coverage testing, in addition to the later functional testing.
A standard like MISRA-C can help to restrict the input to the compiler to a "safe" subset of the C language. Another approach is to restrict the input to a compiler to a subset of a language and test what the output for the entire subset is. If our application is only built of components from that subset, it is assumed to be known what the output of the compiler will be. This usually goes by "qualification of the compiler".
The goal of all of this is to be able to answer the QA representative's question with "We don't just rely on determinism of the compiler but this is the way we prove it...".