Enforcing Statement Order in C++

Enforcing statement order in C++

I'd like to try to provide a somewhat more comprehensive answer after this was discussed with the C++ standards committee. In addition to being a member of the C++ committee, I'm also a developer on the LLVM and Clang compilers.

Fundamentally, there is no way to use a barrier or some operation in the sequence to achieve these transformations. The fundamental problem is that the operational semantics of something like an integer addition are totally known to the implementation. It can simulate them, it knows they cannot be observed by correct programs, and is always free to move them around.

We could try to prevent this, but it would have extremely negative results and would ultimately fail.

First, the only way to prevent this in the compiler is to tell it that all of these basic operations are observable. The problem is that this then would preclude the overwhelming majority of compiler optimizations. Inside the compiler, we have essentially no good mechanisms to model that the timing is observable but nothing else. We don't even have a good model of what operations take time. As an example, does converting a 32-bit unsigned integer to a 64-bit unsigned integer take time? It takes zero time on x86-64, but on other architectures it takes non-zero time. There is no generically correct answer here.

But even if we succeed through some heroics at preventing the compiler from reordering these operations, there is no guarantee this will be enough. Consider a valid and conforming way to execute your C++ program on an x86 machine: DynamoRIO. This is a system that dynamically evaluates the machine code of the program. One thing it can do is online optimizations, and it is even capable of speculatively executing the entire range of basic arithmetic instructions outside of the timing. And this behavior isn't unique to dynamic evaluators, the actual x86 CPU will also speculate (a much smaller number of) instructions and reorder them dynamically.

The essential realization is that the fact that arithmetic isn't observable (even at the timing level) is something that permeates the layers of the computer. It is true for the compiler, the runtime, and often even the hardware. Forcing it to be observable would both dramatically constrain the compiler, but it would also dramatically constrain the hardware.

But all of this should not cause you to lose hope. When you want to time the execution of basic mathematical operations, we have well studied techniques that work reliably. Typically these are used when doing micro-benchmarking. I gave a talk about this at CppCon2015: https://youtu.be/nXaxk27zwlk

The techniques shown there are also provided by various micro-benchmark libraries such as Google's: https://github.com/google/benchmark#preventing-optimization

The key to these techniques is to focus on the data. You make the input to the computation opaque to the optimizer and the result of the computation opaque to the optimizer. Once you've done that, you can time it reliably. Let's look at a realistic version of the example in the original question, but with the definition of foo fully visible to the implementation. I've also extracted a (non-portable) version of DoNotOptimize from the Google Benchmark library which you can find here: https://github.com/google/benchmark/blob/v1.0.0/include/benchmark/benchmark_api.h#L208

#include <chrono>

template <class T>
__attribute__((always_inline)) inline void DoNotOptimize(const T &value) {
  asm volatile("" : "+m"(const_cast<T &>(value)));
}

// The compiler has full knowledge of the implementation.
static int foo(int x) { return x * 2; }

auto time_foo() {
  using Clock = std::chrono::high_resolution_clock;

  auto input = 42;

  auto t1 = Clock::now();         // Statement 1
  DoNotOptimize(input);
  auto output = foo(input);       // Statement 2
  DoNotOptimize(output);
  auto t2 = Clock::now();         // Statement 3

  return t2 - t1;
}

Here we ensure that the input data and the output data are marked as un-optimizable around the computation foo, and only around those markers are the timings computed. Because you are using data to pincer the computation, it is guaranteed to stay between the two timings and yet the computation itself is allowed to be optimized. The resulting x86-64 assembly generated by a recent build of Clang/LLVM is:

% ./bin/clang++ -std=c++14 -c -S -o - so.cpp -O3
        .text
        .file   "so.cpp"
        .globl  _Z8time_foov
        .p2align        4, 0x90
        .type   _Z8time_foov,@function
_Z8time_foov:                           # @_Z8time_foov
        .cfi_startproc
# BB#0:                                 # %entry
        pushq   %rbx
.Ltmp0:
        .cfi_def_cfa_offset 16
        subq    $16, %rsp
.Ltmp1:
        .cfi_def_cfa_offset 32
.Ltmp2:
        .cfi_offset %rbx, -16
        movl    $42, 8(%rsp)
        callq   _ZNSt6chrono3_V212system_clock3nowEv
        movq    %rax, %rbx
        #APP
        #NO_APP
        movl    8(%rsp), %eax
        addl    %eax, %eax              # This is "foo"!
        movl    %eax, 12(%rsp)
        #APP
        #NO_APP
        callq   _ZNSt6chrono3_V212system_clock3nowEv
        subq    %rbx, %rax
        addq    $16, %rsp
        popq    %rbx
        retq
.Lfunc_end0:
        .size   _Z8time_foov, .Lfunc_end0-_Z8time_foov
        .cfi_endproc


        .ident  "clang version 3.9.0 (trunk 273389) (llvm/trunk 273380)"
        .section        ".note.GNU-stack","",@progbits

Here you can see the compiler optimizing the call to foo(input) down to a single instruction, addl %eax, %eax, but without moving it outside of the timing or eliminating it entirely despite the constant input.

Hope this helps, and the C++ standards committee is looking at the possibility of standardizing APIs similar to DoNotOptimize here.

Force order of execution of C statements?

So as I understand it the pNext = plog2sizeChunk->pNext publishes the block so that it can be seen by other threads and you have to make sure they see the correct busy flag.

That means you need a uni-directional memory barrier before publishing it (also one before reading it in another thread, although if your code runs on x86 you get those for free) to make sure that threads actually see the change. You also need one before the write to avoid reordering writes after it. No just inserting assembly or using a standard compliant volatile (MSVC volatile gives extra guarantees though that make a difference here) is not enough - yes this stops the compiler from shifting reads and writes around, but the CPU is not bound by it and can do the same reordering internally.

Both MSVC and gcc have intrinsics/macros to create memory barriers (see eg here). MSVC also gives stronger guarantees to volatiles that are good enough for your problem. Finally C++11 atomics would work as well, but I'm not sure if C itself has any portable way to guarantee memory barriers.

Enforcing order of execution

Write this critical chunk of code in assembly language.

The situation you're in is unusual. Most of the time people want the compiler to do optimizations, so compiler developers don't spend much development effort on means to avoid them. Even with the knobs you do get (pragmas, separate compilation, indirections, ...) you can never be sure something won't be optimized. Some of the undesirable optimizations you mention (constant folding, for instance) cannot be turned off by any means in modern compilers.

If you use assembly language you can be sure you're getting exactly what you wrote. If you do it any other way you won't have that level of confidence.

enforcing order of calling function in google mock

The idea is simple: if HttpStub::get is called before HttpStub::initialize, then fail the test APlaceDescriptionService.MakesHttpRequestToObtainAddress. In other words, the test insures that HttpStub::get is not called before httpStub is initialized.

If your question is how it works:

InSequence forceExpectationOrder sets new Sequence object to the global g_gmock_implicit_sequence on creation, that uses a chain of expectations in the order of their creations.

// Points to the implicit sequence introduced by a living InSequence
// object (if any) in the current thread or NULL.
GTEST_API_ ThreadLocal<Sequence*> g_gmock_implicit_sequence;

// Creates the implicit sequence if there isn't one.
InSequence::InSequence() {
  if (internal::g_gmock_implicit_sequence.get() == nullptr) {
    internal::g_gmock_implicit_sequence.set(new Sequence);
    sequence_created_ = true;
  }
  // ...
}

// Deletes the implicit sequence if it was created by the constructor
// of this object.
InSequence::~InSequence() {
  if (sequence_created_) {
    delete internal::g_gmock_implicit_sequence.get();
    internal::g_gmock_implicit_sequence.set(nullptr);
  }
}

// Adds and returns an expectation spec for this mock function.
TypedExpectation<F>& AddNewExpectation(const char* file, int line,
                                       const std::string& source_text,
                                       const ArgumentMatcherTuple& m) {
  // ...

  // Adds this expectation into the implicit sequence if there is one.
  Sequence* const implicit_sequence = g_gmock_implicit_sequence.get();
  if (implicit_sequence != nullptr) {
    implicit_sequence->AddExpectation(Expectation(untyped_expectation));
  }

  return *expectation;
}

Does statement re-ordering apply to conditional/control statements?

The "as-if" rule ([intro.abstract]) is important to note here:

The semantic descriptions in this document define a parameterized nondeterministic abstract machine. This document places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below

Anything^* can be reordered so long as the implementation can guarantee that the observable behavior of the resulting program is unchanged.

Thread synchronization constructs in general cannot be implemented properly without fences and preventing reordering. For example, the standard guarantees that lock/unlock operations on mutexes shall behave as atomic operations. Atomics also explicitly introduce fences, especially with respect to the memory_order specified. This means that statements depending on an (unrelaxed) atomic operation cannot be reordered, otherwise the observable behavior of the program may change.

[intro.races] talks a great deal about data races and ordering.

Statements that can be reordered

The only standard rule that governs reordering (and optimization in general) is the "as if" rule, meaning that the compiler can do anything that it pleases if it determines that the end result is the same, and the standard won't get in the way. However, when we get to that level, the compiler probably no longer operates on the C++ source level: it's probably dealing with an intermediate form, and thinking in statements might not actually best reflect what is going on.

For instance, in your second example:

// both a and b are non-volatile ints
a = rand();
b = rand();

The compiler could call rand twice and store the result of the second rand call to b before storing the result of the first rand call to a. In other words, the assignments have been reordered but the calls haven't. This is possible because the compiler optimizes a representation of your program that is more granular than C++.

Various compilers use various tricks to determine whether two intermediate representation instructions can be reordered, but most often, function calls are insurmountable barriers that cannot be reordered. Especially, calls to external libraries (like rand) can't even be analyzed and are certain to not be reordered since the compiler will prefer a conservative but known-correct approach.

In fact, function calls can only be reordered if the compiler determines that they cannot interfere with one another. Compilers attempt to solve this problem with alias analysis. The idea is that beyond value dependence (for instance, in a * b + c, the + operation depends on the result of a * b and thus a * b must happen first), ordering (almost) only matters when you write somewhere and read back from there later. This means that if you can correctly identify every memory operation and their impact, you can determine if two memory operations can be reordered, or even eliminated altogether. For these purposes, a call is considered a big memory operation that encompasses all the smaller loads and stores that it does.

Unfortunately, the general case of alias analysis is known to be uncomputable. While compilers get smarter and smarter, you'll probably never have a compiler that is systematically able to make the best decision on call reordering even if you had all of the source code that you link against.

Some compilers have attributes specific to themselves that govern whether they should consider the function safe for reordering, regardless of their own analysis. For instance, gcc will happily reorder, cache or even eliminate function calls with the [[gnu::pure]] attribute if it thinks that it will lead to a performance increase.

How does Google's `DoNotOptimize()` function enforce statement ordering

Summary: DoNotOptimize is ordered wrt. time() by the the "memory" clobbers, as if it were another function call to an opaque function that could modify any global state.

DoNotOptimize is ordered wrt. the computation of output from input by the data dependency of the calculation on the input, and the output on the calculation, as Chandler Carruth explained in the Q&A you linked. The "memory" clobber is irrelevant for this part.

`"memory"` clobber is like a non-inline function call

DoNotOptimize's asm statement contains a "memory" clobber. As far as the optimizer is concerned, that's equivalent to an opaque function call: it has to be assumed to read and write every globally-reachable object¹. (Even ones this compilation unit might not know about.)

Since time() itself doesn't have an inline definition in any header, it can't reorder with DoNotOptimize at compile time for the same reason that a compiler can't reorder calls to foo() and bar() when it can't see the definitions of those functions. Same reason compilers don't need any special logic to stop them from reordering puts("hi"); puts("mom");.

(A hypothetical time() that could inline and only contained an asm statement would have to use asm volatile to make sure repeated calls didn't just use the first one's output. asm volatile statements can't reorder with each other or accesses to volatile variables, so that would be ok too, for a different reason.)

Footnote 1: Globally reachable = any object that might be pointed-to by any hypothetical global variable. i.e. anything except local variables within this function, or memory freshly allocated with new, if escape analysis can prove that nothing outside this function could have pointers to them.

How the `asm` statement works

I think you're pretty seriously misunderstanding how the asm works. "+r,m" tells the compiler to materialize the value in a register (or memory if it wants), and then use the value there at the end of the (empty) asm template as the new value of that C++ object.

So it forces the compiler to actually materialize (produce) the value somewhere, which means it has to be computed. And it means has to forget what it previously knew about the value (e.g. that it was a compile time constant 5, or non-negative, or anything) because the "+" modifier declares a read/write operand.

The point of DoNotOptimize on the input is to defeat constant-propagation that would let the benchmark optimize away.

And on the output to make sure a final result is actually materialized in a register (or memory) instead of optimizing away all the computation leading to an unused result. (This is where being asm volatile is relevant; defeating constant-propagation still works with non-volatile asm.)

So the computation you want to benchmark has to happen between the two DoNotOptimize() statements, and separately those two statements can't reorder with time().

The compiler has to assume that the asm statement modifies the value like val ^= random for all it knows, along with changing the value in memory of any/every other object except for private locals that weren't operands, so e.g. the "memory" clobber doesn't stop the compiler from keeping a local loop counter in memory. (It doesn't special case an empty asm template string here; programs don't contain asm statements like this by accident so nobody wants them optimized away.)

Misconceptions about the reference arg and picking `"m"`

I only got part way into the details of your attempt to reason about the "+r,m" operand and the reference function-arg before deciding it would probably be better to just explain from scratch. The correct reason isn't that complicated. But a couple things are worth specifically correcting:

The C++ function containing the asm statement can inline, letting the by-reference function arg optimize away. (It's even declared inline __attribute__((always_inline)) to force inlining even with optimization disabled, although in that case the reference variable won't optimize away.)

The net result is as if the asm statement were used directly on the C++ variable passed to DoNotOptimize. e.g. DoNotOptimize(foo) is like asm volatile("" : "+r,m"(foo) :: "memory")

The compiler can always pick register if it wants to, e.g. choosing to load a variable's value into a register before an asm statement. (And if the C++ semantics demand updating the variable's value in memory, also emitting a store instruction after the asm statement.)

For example, we can see that GCC does choose to do that. (I guess I could have used incl %0 as the example, but I just chose nop as a way to show what the compiler picked for the operand location as an alternative to # %0 pure comment, so the Godbolt compiler explorer wouldn't filter it out.)

void foo(int *p)
{
    asm volatile("nop # operand picked %0" : "+r,m" (p[4]) );
}

# GCC 11.2 -O2
foo(int*):
        movl    16(%rdi), %eax
        nop # operand picked %eax
        movl    %eax, 16(%rdi)
        ret

vs. clang choosing to leave the value in memory, so every instruction in the asm template would be accessing memory instead of a register. (If there were any instructions).

# clang 12.0.1 -O2 -fPIE
foo(int*):                               # @foo(int*)
        nop     # operand picked 16(%rdi)
        retq

Fun fact: "r,m" is an attempt to work around a clang missed-optimization bug that makes it always pick memory for "rm" constraints, even if the value was already in a register. Spilling it first, even if it has to invent a temporary location for the value of an expression as an input.

Enforcing Statement Order in C++