LTO, Devirtualization, and Virtual Tables

Using final to reduce virtual method overhead

You're correctly understanding the effects of final (except maybe the inner loop of case 2), but your cost estimates are way off. We shouldn't expect a big effect anywhere because mt19937 is just plain slow and all 3 versions spend most of their time in it.


The only thing that's not lost / buried in noise / overhead is the effect of inlining int run_once() override final into the inner loop in FooPlus::run_multiple, which both Case 2 and Case 3 run.

But Case 1 can't inline Foo::run_once() into Foo::run_multiple(), so there's function-call overhead inside the inner loop, unlike the other 2 cases.

Case 2 has to call run_multiple repeatedly, but that's only once per 100 runs of run_once and has no measurable effect.


For all 3 cases, most of the time is spent in dist(rng);, because std::mt19937 is pretty slow compared to the extra overhead of a non-inlined function call. Out-of-order execution can probably hide a lot of that overhead, too. But not all of it, so there's still something to measure.
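For reference, here is a hedged reconstruction of the kind of classes being benchmarked. The question's actual code isn't reproduced above: the bodies, the rng/dist members and the loop count are assumptions; only the names Foo, FooPlus, run_once and run_multiple come from the discussion.

#include <random>

struct Foo {                                        // Case 1: nothing is final
    std::mt19937 rng;
    std::uniform_int_distribution<int> dist{0, 100};

    virtual ~Foo() = default;
    virtual int run_once() { return dist(rng); }    // body assumed
    virtual int run_multiple(int n) {
        int sum = 0;
        for (int i = 0; i < n; ++i)
            sum += run_once();                      // virtual call inside the inner loop
        return sum;
    }
};

struct FooPlus : Foo {                              // Cases 2 and 3
    int run_once() override final { return dist(rng); }
    int run_multiple(int n) override {
        int sum = 0;
        for (int i = 0; i < n; ++i)
            sum += run_once();                      // devirtualized: run_once is final here
        return sum;
    }
};

Case 2 then calls run_multiple through a Foo* (so the outer call stays an indirect call); in Case 3 the outer call is devirtualizable too (e.g. run_multiple also final, or the call made through a FooPlus*), which matches the asm listings below.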

Case 3 is able to inline everything to this asm loop (from your quickbench link):

# percentages are *self* time, not including time spent in the PRNG
# These are from QuickBench's perf report tab,
# presumably samples for core clock cycle perf events.
# Take them with a grain of salt: superscalar + out-of-order exec
# makes it hard to blame one instruction for a clock cycle

VirtualWithFinalCase2(benchmark::State&):   # case 3 from QuickBench link
        ... setup before the loop
        .p2align 3
.Louter:                            # do{
        xor    %ebp,%ebp            # sum = 0
        mov    $0x64,%ebx           # inner = 100
        .p2align 3                  # nopw 0x0(%rax,%rax,1)
.Linner:                            # do {
51.82%  mov    %r13,%rdi
        mov    %r15,%rsi
        mov    %r13,%rdx            # copy args from call-preserved regs
        callq  404d60               # mt PRNG for unsigned long
47.27%  add    %eax,%ebp            # sum += run_once()
        add    $0xffffffff,%ebx     # --inner
        jne    .Linner              # }while(inner);
        mov    %ebp,0x4(%rsp)       # store to volatile local: benchmark::DoNotOptimize(x);
 0.91%  add    $0xffffffffffffffff,%r12  # --outer
        jne    .Louter              # } while(outer)

Case 2 can still inline run_once into run_multiple because class FooPlus uses int run_once() override final. There's virtual-dispatch overhead in the outer loop (only), but this small extra cost for each outer loop iteration is totally dwarfed by the cost of the inner loop (identical between Case 2 and Case 3).

So the inner loop will be essentially identical, with indirect-call overhead only in the outer loop. It's unsurprising that this is unmeasurable or at least lost in noise on Quickbench.


Case 1 can't inline Foo::run_once() into Foo::run_multiple(), so there's function-call overhead there, too. (The fact that it's an indirect function call is relatively minor; in a tight loop branch prediction will do a near-perfect job.)


Case 1 and Case 2 have identical asm for their outer loop, if you look at the disassembly on your Quick-Bench link.

Neither one can devirtualize and inline run_multiple. Case 1 because it's virtual non-final, Case 2 because it's only the base class, not the derived class with a final override.

        # case 2 and case 1 *outer* loops
.loop:                              # do {
        mov    (%r15),%rax          # load vtable pointer
        mov    $0x64,%esi           # first C++ arg
        mov    %r15,%rdi            # this pointer = hidden first arg
        callq  *0x8(%rax)           # memory-indirect call through a vtable entry
        mov    %eax,0x4(%rsp)       # store the return value to a `volatile` local
        add    $0xffffffffffffffff,%rbx
        jne    4049f0 .loop         # } while(--i != 0);

This is probably a missed optimization: the compiler can prove that Base *f came from new FooPlus(), and thus is statically known to be of type FooPlus. operator new can be overridden, but the compiler still emits a separate call to FooPlus::FooPlus() (passing it a pointer to the storage from new). So this just seems to be a case of clang not taking advantage of that in Case 2, and maybe also Case 1.

Why doesn't this simple function get de-virtualized?

GCC guesses that Aint *p points to an instance of Aint itself (but does not treat this as guaranteed) and therefore it speculatively devirtualizes the call to operator+=: a check of the vtable entry guards an inlined copy of it.
-fno-devirtualize-speculatively leads to the same code as Clang and MSVC produce.
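For reference, a minimal reconstruction of the kind of source that produces the listing below. The class body and the signature of operator+= are guesses; only the names Aint, operator+= and foo_virtual (via the mangled name) come from the text.

struct Aint {
    int i = 0;
    // assumed signature and body; the real class isn't shown above
    virtual Aint& operator+=(const Aint& other) { i += other.i; return *this; }
};

// With speculative devirtualization disabled (or under Clang/MSVC), this whole
// function compiles down to "load the vtable pointer, load the slot, tail-jump",
// as in the listing that follows.
Aint& foo_virtual(Aint *p) {
    return *p += *p;
}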

_Z11foo_virtualP4Aint:
.LFB4:
        .cfi_startproc
        movq    (%rdi), %rax
        movq    %rdi, %rsi
        movq    (%rax), %rax
        jmp     *%rax

Inlining of virtual functions (Clang vs GCC)

As you already figured out, gcc is trying to optimistically inline the functions. That is, inline their bodies after checking that the functions are not in fact overridden: it does this by comparing the value of the function pointers in their vtable1. This type of optimization may be referred to as speculative devirtualization and is widely applied in JIT-compiled languages such as C# and Java, while being more difficult, less profitable and less often applied in ahead-of-time-compiled languages such as C++. Here we use speculative to distinguish it from the variant where the compiler can prove that the devirtualization is possible. In the optimistic case we instead hope that it is valid, and double-check at runtime.

This whole topic could fill a book (well, at least a paper or two), so if you want all the gory details on how the various types of devirtualization are implemented, take a look at Honza's 7-part series on devirtualization. He helped implement it in gcc4 and a lot of the content there applies directly to this question. Here I'll focus on some specifics of your example.

Overall, it's a pretty interesting optimization that could pay big dividends in certain cases. In your specific case, however, it seems a bit questionable. First, note that it falls into a class of probabilistic optimizations: optimizations whose benefit or penalty depends on the runtime behavior of the application; in particular, whether the runtime type of the base argument actually overrides the ant() and dec() methods. Whether the optimization is profitable will swing from "strictly a bad idea" to "looks somewhat OK" depending on the actual frequencies.
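For reference, here is a minimal sketch of the kind of example under discussion. The original source isn't reproduced above, so the member and the bodies are assumptions; only the names Base, ant(), dec() and function() come from the discussion.

class Base {
public:
    int i = 0;
    virtual void ant() { ++i; }   // body assumed
    virtual void dec() { --i; }   // body assumed
};

// The call site being compiled: two virtual calls through a Base*.
void function(Base *base) {
    base->ant();
    base->dec();
}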

For example, if the actual behavior of the application is to always pass a base with the default ant() and dec() methods shown, the benefits are:

  1. A somewhat lower latency code path. Ignoring common work (such as loading the function pointers from the vtable, and the actual increment), the inlined approach basically adds two cmp + jne pairs, but saves one indirect call, one indirect jmp and two rets (see the sketch of the transformed code after this list). Just counting instructions, that is likely to be a wash, but in practice the two macro-fused cmp/jne pairs are going to be very cheap, and also "out of line" with respect to the increment operations, so their latency cost is probably completely hidden. The indirect jumps and rets are less likely to be hidden: modern x86 like Haswell is still good at executing these quickly, but there are hard limits like one taken branch per cycle, etc.

    That said, in this case, the difference between the two paths is likely to be fairly small. The two RMW inc and dec operations are likely to take longer than the jumping around.

  2. The branch prediction behavior is likely to be better, most notably when this code is cold. When cold, there is no info in any of the predictors about the likelihood of branches being taken, or about their targets (in the case of indirect branches). The two jne branches in the inlined case will likely be default-predicted not-taken, and be predicted correctly2, and the indirect branches will be avoided entirely. The always-virtual-call approach, however, has no chance of correctly predicting the branch target, so you are going to suffer two consecutive branch mispredictions.
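In source terms, the transformation gcc applies looks roughly like the hand-written sketch below. Note that gcc's real check compares the function pointer loaded from the vtable slot (see footnote 1), which has no portable source-level spelling; typeid is used here purely as a compilable stand-in for that check.

#include <typeinfo>

void function_speculative(Base *base) {
    if (typeid(*base) == typeid(Base)) {    // stand-in for the cmp/jne on the vtable entry
        ++base->i;                          // inlined body of Base::ant()
    } else {
        base->ant();                        // fallback: ordinary indirect call
    }
    if (typeid(*base) == typeid(Base)) {
        --base->i;                          // inlined body of Base::dec()
    } else {
        base->dec();
    }
}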

On the other hand, when the ant() and dec() methods are always overridden, the optimization is a dead loss. You added the extra comparisons and taken jumps to the execution path, and then you had to do the indirect call anyway. Furthermore, you bloated the code size by a huge amount (47 bytes for the gcc version vs. 13 bytes for the clang one), relatively speaking. If your runtime code footprint is large, this will really hurt you in terms of I$ misses.

Peter mentions that even when the checks fail, at least the optimization reduces the number of targets (by 1) for the subsequent indirect branch. However, you still need to include the mispredicts for the cmp/jne check in the analysis. When you do that, it seems like the check-base-first approach is always either tied or worse, at least in terms of expected mispredicts.

A couple of quick examples:

Two targets (x, y) for ant() with probabilities p(x) and p(y).

Assume p(x) > p(y) without loss of generality3.

Check first:

- jne predicts x always, so expected jne misses: p(y)
- indirect call predicts y always, with zero expected misses

misses = p(y)

Virtual call only:

- BTB predicts x
- expected misses: p(y)

So both cases are exactly the same, with p(y) expected misses.

Three targets, with probabilities x, y, z (we skip the p(x) notation).

Assume for simplicity that y == z. You can work through the case where y != z; it doesn't change the qualitative conclusions.

Case x > y + z

Check first:

- jne predicts taken (x) always, so expected jne misses: y + z = 2y
- indirect call predicts y, with z (== y) expected misses

misses = x*(0 + 0) + y*(1 + 0) + z*(1 + 1)
       = y + 2z
       = 3y

Virtual call only:

- BTB predicts x

misses = x*0 + y + z
       = 2y

So in this case, check-first is dominated by virtual-only in mispredict chance, suffering 50% more mispredicts (proportional to the probability of the two less common targets). At the boundaries, when p(y) == p(z) == 0, it ties (but that means there aren't really three targets), and when p(x) == p(y) + p(z) it suffers an expected 0.75 mispredicts per call compared to 0.5 for the virtual-call-only approach.

Let's check the x < y + z case.

Check first:

- jne predicts not taken (y or z) always, so expected jne misses: x
- indirect call predicts y always, with z expected misses

misses = x*(1 + 0) + y*(0 + 0) + z*(0 + 1)
       = x + z

Virtual call only:

- BTB predicts x
- expected misses: y + z

So here again the virtual-call-only approach dominates the check-first approach. The latter suffers p(x) - p(y) additional misses. The worst case seems to be when p(x) == ~0.49... and p(y) == p(z) == ~0.25, where check-first suffers ~0.25 additional misses per call. The other boundary condition is again a tie when p(z) == 0 (as expected, since that's the two-target case).
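Plugging the worst-case numbers into the formulas above makes the gap concrete (my arithmetic):

check first:   misses = x + z ≈ 0.50 + 0.25 = 0.75
virtual only:  misses = y + z ≈ 0.25 + 0.25 = 0.50
difference:    x - y ≈ 0.25 extra expected misses per call for check-first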

Now all of the above assumes that taken vs. not-taken branch mispredicts are equal in cost to branch-target mispredicts. On modern x86 CPUs the cost does in fact seem to be about the same: each type of mispredict causes a full redirect and a penalty of 15-20 cycles. There are still second-order effects present; for example, an indirect branch may be resolved later than a conditional direct one if the jump address takes longer to calculate than the branch condition. That doesn't seem to be the case here, since both the branch decision and the branch target depend on the same thing: the address of the ant() function in the vtable.

The above also assumes that indirect branches are as well predicted as conditional branches, and that this prediction consumes an equal amount of resources. This is, in general, not true. Processors generally have fewer resources dedicated to indirect BTB entries, and even given equal resources, reaching a given prediction rate requires more resources in the indirect-BTB case than in the conditional-branch case, since the stored state (a target address) is larger than a single bit of taken/not-taken info. Smaller or older CPUs may not have any indirect branch prediction capability at all, but this difference is real even in modern x86 CPUs. It is hard to quantify, but at the limit, when this function is called a lot, the difference disappears (since there are enough resources to accurately track the indirect targets for at least the hottest call sites).

If you got this far, you might well feel that, overall, this seems like a questionable optimization in this specific case. The potential upside is fairly small, and the cost in terms of code bloat is large. Furthermore, in the intermediate cases (where base->ant() sometimes resolves to Base::ant and sometimes doesn't), the proposition is questionable, since the increased mispredicts eat into the benefit of inlining the call.

On the face of it, I would tend to agree - but there are a couple of mitigating factors:

First, gcc is actually trying to be smart about when it applies this optimization. It applies it only when it can't see that the methods in question are actually overridden anywhere. Take a look at this small modification of your example: the function is unchanged, but we add a subclass of Base (unused in this file) which does override the methods (a sketch follows below). Now gcc no longer does the speculative inlining: it sees that the methods are overridden, so it no longer considers the inlining worth it.
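A sketch of that modification, assuming the Base class shown earlier (the override bodies are made up; the point is only that an override is visible in the same translation unit):

// Never instantiated or called in this file, but its mere presence lets gcc
// see that ant()/dec() are overridden somewhere, which disables the
// speculative inlining.
class Derived : public Base {
public:
    void ant() override { i += 2; }
    void dec() override { i -= 2; }
};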

This whole idea of what gcc can see is pretty important. Usually gcc is looking only at a single compilation unit, which greatly restricts its visibility. You can make a reasonable argument that, in general, Base would be overridden in a separate compilation unit, if at all, so the behavior noted above (gcc deciding to apply the speculative inlining based on whether a same-file override exists) isn't too useful, since same-file overrides are rare.

On the other hand, you might note that godbolt is putting your class definition in a .cpp file5. It is very rare, and bad practice, for someone to include a .cpp file in another compilation unit, so the guess that Base isn't overridden elsewhere is perhaps a pretty good one in this case. Of course, I don't think gcc actually uses that info: by the time the actual compiler sees the file, the distinction between header and .cpp files has pretty much been lost.

Whether a class is likely to be overridden, when your scope is a single compilation unit, is nearly as much a philosophical question as a technical one. This is exactly the kind of question that LTO and PGO (as Peter mentioned) are supposed to solve. In the LTO case, by deferring optimization to link time, the entire set of statically available6 classes and method overrides is available for examination. In the case of PGO, the compiler can use information about which classes actually appear as targets at every call site, and optimize based on the observed overrides. It can even make different decisions for different call sites (assuming function() itself can be inlined). PGO can approach the quality of devirtualization that is usually reserved for JIT-compiled languages.

In fact, this topic is important enough that Jan gives it an entire entry in his series about devirtualization. Of particular importance, he covers cases where the compiler can be sure that there are no subclasses or method overrides, and hence devirtualization is no longer speculative: classes in anonymous namespaces, local classes, and final methods and classes.

A final note in defense of gcc's decision to speculatively inline this: the example given is more or less at one end of the risk-vs-reward spectrum for this optimization. As it turns out, the functions involved only take a single instruction to implement (aside from the required ret, one of which is even optimized away by the tail call). So the direct benefit of inlining here is very small: inlining mostly pays off as a "granddaddy" optimization that enables lots of other optimizations to work, and often eliminates a large part of the cost of the inlined function. Because the functions are so small, with zero prologue and epilogue, and because gcc can't easily optimize across the ant() and dec() calls (they end up in different basic blocks), inlining doesn't help very much.

Here's another example which gives gcc more of a chance to optimize:

class Base
{
public:
    virtual int add5(int x) { return 5 + x; }
};

int function(Base * base, int x)
{
    return base->add5(x) - base->add5(x);
}

int add5_static(int x) {
    return 5 + x;
}

int function2(int x) {
    return add5_static(x) - add5_static(x);
}

int function3(int x) {
    Base b = Base();
    return b.add5(x) - b.add5(x);
}

Here, you call the same virtual function more than once. This could allow gcc to optimize the function pointer checks (you only need one check for add5, not two as for ant and dec). It could also allow gcc to optimize between the two calls, replacing (5 + x) - (5 + x) with something as simple as 0.

Let's see what gcc does with that, via godbolt! Looks good initially. The inlined version of the call only requires one cmp/jne since the same function is called twice. Here's the optimized version, starting with the function pointer check:

        cmp     rax, OFFSET FLAT:Base::add5(int)  # _4,
        jne     .L3            #,
        add     esi, 5         # _11,
        mov     ebp, esi       # _7, _11
        mov     eax, ebp       # _7, _7
        pop     rbx            #
        sub     eax, esi       # _7, _11
        pop     rbp            #
        pop     r12            #
        ret

It's a mixed bag. There are 7 instructions after the jump, and a lot of redundancy. First, note that gcc is able to do some inter-call optimization. In particular, the single add esi, 5 shows that the common sub-expression (the 5 + x in the two identical calls) was computed only once. You then get eax = ebp = esi. The assignment to ebp is redundant: it is never used again in this function. Then you get sub eax, esi. This is totally redundant because eax == esi, so the result is always zero.

Even the pop instructions are all redundant. We simply push those three registers at the top of the function, then pop them off again on the optimized path. It is legitimate to push these registers before we call the virtual methods, but that can all be deferred until after the check. So a total of six push and pop instructions are unnecessary on the optimized path.

All that to say that the optimal implementation of these 7 instructions is simply xor eax, eax, a near-free instruction (in addition, the three earlier push instructions, not shown, could also be avoided7). Gcc missed all the easy optimizations here, and the devirtualized fast path is perhaps an order of magnitude slower than it needs to be. The reality is that even though these optimizations are "obvious", everything occurs in stages: the stage that can remove all the redundant moves, the subtraction and the pushes and pops probably occurs before the devirtualization phase. Later on, it is too late for this redundancy to be removed.

Just to show that gcc is indeed capable of optimizing this in a better way, take a look at function 2:

int add5_static(int x) {
    return 5 + x;
}

int function2(int x) {
    return add5_static(x) - add5_static(x);
}

This is the same as function, except without the complication of possibly virtual function calls. The entire function is simply:

function2(int):
        xor eax, eax    #
        ret

So everything cancels out into return 0;. The compiler could do the same thing on the speculatively devirtualized fast path of function, but it fails to.

Oddly enough, the non-speculative devirtualization does just fine:

int function3(int x) {
    Base b = Base();
    return b.add5(x) - b.add5(x);
}

reduces to exactly the same thing:

function3(int):
        xor eax, eax    #
        ret

So something about speculative devirtualization makes the result less amenable to many of the optimizations that would otherwise lead to some great simplifications of the code and some big wins. Even without those, the speculatively devirtualized version is likely to be noticeably faster than the version without devirtualization.


1It is worth noting that this is a more precise check than what typically occurs when devirtualizing in JITed languages. In Java, for example, the check is simply against the object type (i.e., the value of the vtable pointer), rather than against the particular function implementation (i.e., the value of the function pointer in the vtable). The function-specific check is more costly, since it involves dereferencing the vtable pointer, but it "works" in many more cases: it would work for any subclass of Base that didn't actually override ant() and/or dec(), while the class-type check would fail for those.

The difference is probably down to when the optimization is applied: because the JITed code knows the most common class(es) at the call site based on profiling, even the coarser (and cheaper) check is going to be effective, since it uses the most common observed type. The C++ code doesn't have this benefit (unless LTO and PGO are used), so the more careful check probably pays off.

2In fact, the actual behavior is complex and depends specifically on the exact processor version. In older processors, the branch predictors were actually much more likely to use the default predictor for cold branches (which predicts not-taken for forward branches), since the prediction mechanisms were tied tightly to the IP of the branch.

In more recent predictors (Haswell and newer), the mechanism is more complex, with a hash of several factors being used, so you may often "collide" with other existing state even in the cold case, and the behavior is not as predictable any more. I think it's safe to say you won't usually get more than a 50% mispredict rate for forward, not-taken branches, however. You can find some investigation on this blog and discussion here.

3OK, you caught me. It excludes the case p(x) == p(y), but you can work that through too and it doesn't change anything.

4A binary search on godbolt, as well as Honza's blog, confirms that this was added in gcc 4.9. Prior to that version, it is compiled in the same way as clang and icc compile it.

5In particular, the #pragma message shows that the source is all compiled in a single file called example.cpp.

6By statically available I mean the set of LTO-enabled files that are available to gcc when the LTO phase occurs. This excludes at least two important categories of code: (1) code in static or dynamic objects which are linked into the final binary but for which LTO information is not available (pretty much everything you link which isn't part of your compile process) and (2) additional code loaded at runtime, e.g., via dlopen, which LTO can't possibly account for.

7Peter Cordes points out that this "push everything up front" behavior, where everything that needs to be saved on any possible path through a function is pushed immediately on entry, seems to be a pattern in gcc. For functions with very short fast paths, this seems like a big limitation.

Is final used for optimization in C++?

Is final used for optimization in C++?

It can be, and is.

As noted, it is being used already; see here and here showing the generated code for the override with and without final.

An optimisation along these lines relates to the "de-virtualization" of virtual calls. This is not always immediately determined by final on the class or the method: although they help the compiler determine it, the normal rules of virtual functions and the class hierarchy still apply.

If the compiler can determine that at runtime a particular method will always be called (e.g. given the OP example, with an automatic object), it could apply such an optimisation anyway, irrespective of whether the method is final or not.
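For example, something along these lines (my own sketch, not the OP's code) is typically devirtualized even though nothing is final, because the dynamic type of the automatic object is known:

struct Animal {
    virtual int legs() { return 4; }
    virtual ~Animal() = default;
};

struct Bird : Animal {
    int legs() override { return 2; }
};

int count_legs() {
    Bird b;             // automatic object: its dynamic type is known at compile time
    Animal &a = b;
    return a.legs();    // typically devirtualized (and inlined) to return 2
}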

Optimisations fall under the as-if rule, which allows the compiler to apply any transformation so long as the observable behaviour is as if the original code had been executed.

Why can't that pure virtual function be inlined?

A virtual function, 99% of the time, cannot be inlined because the function pointer is resolved at runtime via the vtable.

The purpose of a virtual function is to be overridden by subclasses.
An example:

class A
{
public:
    virtual int foo() { return 1; }
};

class B : public A
{
public:
    virtual int foo() { return 2; }
};

int main(void)
{
    B b;
    A *a = &b;

    return a->foo();
}

The main function will return 2 and not 1.

If the compiler naively inlined the function, it would return 1, because at compile time it can only assume that *a has type A and not B. So the compiler will not inline the call if it cannot safely determine the dynamic type of *a.

In this example the compiler may successfully and safely determine that virtual dispatch is not needed: this really depends on the compiler and on the optimization level.

In some cases the compiler can safely determine that virtual dispatch is not needed, and only in those cases does inlining the call make sense.

Even if you declare a function with the inline keyword, the function may not actually be inlined.

Anyway, manually adding the inline keyword is usually not a good idea; compilers today are good and automatically inline functions when it is worthwhile.
By adding inline you may in some cases decrease performance, so it is good practice not to overuse it.

Does C++ final imply final in all aspects?

The reason I am asking this is because final functions become qualified for de-virtualization, which is a great optimization.

Do they? "De-virtualization" is not part of the C++ standard. Or at least, not really.

De-virtualization is merely a consequence of the "as if" rule, which states that the implementation can do whatever it likes so long as the implementation behaves "as if" it is doing what the standard says.

If the compiler can detect at compile time that a particular call to a virtual member function, through a polymorphic type, will undeniably call a specific version of that function, then it is allowed to skip the virtual dispatching logic and call the function statically. That's behaving "as if" it had used the virtual dispatching logic, since the compiler can prove that this is the function that would have been called.

As such, the standard does not define when de-virtualization is allowed or forbidden. A compiler, upon inlining a function that takes a pointer to a base class type, may find that the pointer being passed points to a local stack variable declared in the function it is being inlined into. Or the compiler may be able to trace a particular inline/call graph down to the point of origin of a particular polymorphic pointer/reference. In those cases, the compiler can de-virtualize those calls. But only if it's smart enough to do so.
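A sketch of that situation (my own illustrative example, not taken from the question):

struct Shape {
    virtual double area() const { return 0.0; }
    virtual ~Shape() = default;
};

struct Square : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

// In isolation this is a genuine virtual call.
double measure(const Shape &s) { return s.area(); }

double nine() {
    Square sq{3.0};
    // Once measure() is inlined here, the compiler can trace the reference back
    // to sq, whose dynamic type it knows, call Square::area() directly, and
    // likely fold the whole thing to the constant 9.0.
    return measure(sq);
}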

Will a compiler devirtualize all virtual function calls to a final class, regardless of whether those methods are declared final themselves? It may. It may not. It may not even devirtualize any calls to methods declared final on the polymorphic type. That's a valid (if not particularly bright) implementation.

The question you're asking is implementation specific. It can vary from compiler to compiler.

However, a class being declared final, as you pointed out, ought to be sufficient information for the compiler to devirtualize all calls to pointers/references to the final class type. If a compiler doesn't do so, then that's a quality-of-implementation issue, not a standards one.


