What Is the Performance Cost of Having a Virtual Method in a C++ Class

What is the performance cost of having a virtual method in a C++ class?

I ran some timings on a 3 GHz in-order PowerPC processor. On that architecture, a virtual function call costs 7 nanoseconds more than a direct (non-virtual) function call.

So, the cost is not really worth worrying about unless the function is something like a trivial Get()/Set() accessor, in which case anything other than an inlined call is kind of wasteful. A 7ns overhead on a function that inlines to 0.5ns is severe; a 7ns overhead on a function that takes 500ms to execute is meaningless.

The big cost of virtual functions isn't really the lookup of a function pointer in the vtable (that's usually just a single cycle), but that the indirect jump usually cannot be branch-predicted. This can cause a large pipeline bubble, as the processor cannot fetch any instructions until the indirect jump (the call through the function pointer) has retired and a new instruction pointer has been computed. So, the cost of a virtual function call is much bigger than it might seem from looking at the assembly... but still only 7 nanoseconds.

Edit: Andrew, Not Sure, and others also raise the very good point that a virtual function call may cause an instruction cache miss: if you jump to a code address that is not in cache then the whole program comes to a dead halt while the instructions are fetched from main memory. This is always a significant stall: on Xenon, about 650 cycles (by my tests).

However, this isn't a problem specific to virtual functions, because even a direct function call will cause a miss if you jump to instructions that aren't in cache. What matters is whether the function has been run recently (making it more likely to be in cache), and whether your architecture can predict static (not virtual) branches and fetch those instructions into cache ahead of time. My PPC does not, but maybe Intel's most recent hardware does.

My timings control for the influence of icache misses on execution (deliberately, since I was trying to examine the CPU pipeline in isolation), so they discount that cost.
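
A minimal sketch of that kind of micro-benchmark (my own illustration, not the original test harness; the Widget class and loop count are made up) looks something like this. Note that on a modern optimizing compiler you have to hide the object's dynamic type (e.g. construct it in another translation unit) or both loops may be devirtualized and inlined away:

#include <chrono>
#include <cstdio>

struct Widget
{
    int direct()      { return value; }   // direct (non-virtual) call
    virtual int dyn() { return value; }   // indirect call through the vtable
    virtual ~Widget() = default;
    int value = 1;
};

int main()
{
    Widget w;
    Widget* p = &w;               // call through a pointer so virtual dispatch is required
    volatile long long sink = 0;  // keep the results from being optimized out

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000000; ++i) sink += p->direct();
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000000; ++i) sink += p->dyn();
    auto t2 = std::chrono::steady_clock::now();

    using ns = std::chrono::nanoseconds;
    std::printf("direct:  %lld ns\n", (long long)std::chrono::duration_cast<ns>(t1 - t0).count());
    std::printf("virtual: %lld ns\n", (long long)std::chrono::duration_cast<ns>(t2 - t1).count());
}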

What are the performance implications of marking methods / properties as virtual?

Virtual functions only have a very small performance overhead compared to direct calls. At a low level, you're basically looking at an array lookup to get a function pointer, and then a call via a function pointer. Modern CPUs can even predict indirect function calls reasonably well in their branch predictors, so they generally won't hurt modern CPU pipelines too badly. At the assembly level, a virtual function call translates to something like the following, where I is an arbitrary immediate value.

MOV EAX, [EBP + I] ; Move pointer to class instance into register
MOV EBX, [EAX] ; Move vtbl pointer into register.
CALL [EBX + I] ; Call function

Vs. the following for a direct function call:

CALL I  ;  Call function directly

The real overhead is that virtual functions, for the most part, can't be inlined. (They can be in JIT languages if the VM realizes they're always going to the same address anyhow.) Besides the speedup you get from inlining itself, inlining enables several other optimizations such as constant folding, because the caller can know how the callee works internally. For functions that are large enough not to be inlined anyhow, the performance hit will likely be negligible. For very small functions that might be inlined, that's when you need to be careful about virtual functions.
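
As an illustration (my own sketch with a hypothetical Point class), a trivial accessor is exactly the case where the difference shows up:

class Point
{
public:
    int x() const { return x_; }          // non-virtual: trivially inlined, the call
                                          // typically disappears into a single load
    virtual int y() const { return y_; }  // virtual: when the static type isn't known,
                                          // this stays an indirect call and can't be inlined
private:
    int x_ = 0;
    int y_ = 0;
};

int sum(const Point& p)
{
    return p.x() + p.y();   // p.y() pays the vtable load plus indirect call at this site
}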

Edit: Another thing to keep in mind is that all programs require flow control, and this is never free. What would replace your virtual function? A switch statement? A series of if statements? These are still branches that may be unpredictable. Furthermore, given an N-way branch, a series of if statements will find the proper path in O(N), while a virtual function will find it in O(1). The switch statement may be O(N) or O(1) depending on whether it is optimized to a jump table.
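
For example (a sketch of my own with hypothetical Kind/Shape types), the two flavors of N-way dispatch look like this:

// Dispatch via a type tag: an if-chain is O(N) in the number of cases,
// while a switch may compile to an O(1) jump table (itself an indirect branch).
enum class Kind { Circle, Square, Triangle };

double area_tagged(Kind k, double s)
{
    switch (k)
    {
    case Kind::Circle:   return 3.14159265358979 * s * s;
    case Kind::Square:   return s * s;
    case Kind::Triangle: return 0.5 * s * s;
    }
    return 0.0;
}

// Dispatch via a virtual function: one indirect call, O(1) no matter how
// many derived types exist.
struct Shape
{
    virtual double area(double s) const = 0;
    virtual ~Shape() = default;
};
struct Circle : Shape { double area(double s) const override { return 3.14159265358979 * s * s; } };
struct Square : Shape { double area(double s) const override { return s * s; } };

double area_virtual(const Shape& shape, double s) { return shape.area(s); }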

Virtual functions and performance - C++

A good rule of thumb is:

It's not a performance problem until you can prove it.

The use of virtual functions will have a very slight effect on performance, but it's unlikely to affect the overall performance of your application. Better places to look for performance improvements are in algorithms and I/O.

An excellent article that talks about virtual functions (and more) is Member Function Pointers and the Fastest Possible C++ Delegates.

Is there any penalty/cost of virtual inheritance in C++, when calling non-virtual base method?

There may be, yes, if you call the member function via a pointer or reference and the compiler can't determine with absolute certainty what type of object that pointer or reference points or refers to. For example, consider:

struct A { void foo(); };    // assumed definitions: A is the virtual base that
struct B : virtual A {};     // declares the non-virtual foo()
struct D : B {};

void f(B* p) { p->foo(); }

void g()
{
    D bar;
    f(&bar);
}

Assuming the call to f is not inlined, the compiler needs to generate code to find the location of the A virtual base class subobject in order to call foo. Usually this lookup involves checking the vptr/vtable.

If the compiler knows the type of the object on which you are calling the function, though (as is the case in your example), there should be no overhead because the function call can be dispatched statically (at compile time). In your example, the dynamic type of bar is known to be D (it can't be anything else), so the offset of the virtual base class subobject A can be computed at compile time.
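
Concretely (a small illustration using the same assumed A/B/D hierarchy as in the snippet above), a call on an object whose complete type is known needs no lookup:

void h()
{
    D bar;
    bar.foo();   // static type == dynamic type (D), so the offset of the A subobject
                 // is a compile-time constant and no vptr lookup is needed
}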

Avoiding the overhead of C# virtual calls

You can cause the JIT to devirtualize your interface calls by using a struct with a constrained generic.

public class SomeObject<TMathFunction> where TMathFunction : struct, IMathFunction
{
    private readonly TMathFunction mathFunction_;

    public double SomeWork(double input, double step)
    {
        var f = mathFunction_.Calculate(input);
        var dv = mathFunction_.Derivate(input);
        return f - (dv * step);
    }
}

// ...

var obj = new SomeObject<CoolMathFunction>();
obj.SomeWork(x, y);

Here are the important pieces to note:

  • The implementation of the IMathFunction interface, CoolMathFunction, is known at compile-time through a generic. This limits the applicability of this optimization quite a bit.
  • The methods are called on the generic parameter type TMathFunction rather than on the interface IMathFunction.
  • The generic is constrained to implement IMathFunction so we can call those methods.
  • The generic is constrained to a struct -- not strictly a requirement, but it ensures we correctly exploit how the JIT generates code for generics: the code will still run without it, but we won't get the optimization we want.

When generics are instantiated, code generation differs depending on whether the generic parameter is a class or a struct. For classes, every instantiation actually shares the same code, and the calls go through vtables. But structs are special: each struct type gets its own instantiation, which devirtualizes the interface calls into direct calls to the struct's methods, avoiding the vtable and enabling inlining.

This feature exists to avoid boxing value types into reference types every time you call through a generic. It avoids allocations and is a key reason why List<T> etc. are an improvement over the non-generic ArrayList etc.

Some implementations:

I made a simple implementation of IMathFunction for testing:

class SomeImplementationByRef : IMathFunction
{
    public double Calculate(double input)
    {
        return input + input;
    }

    public double Derivate(double input)
    {
        return input * input;
    }
}

... as well as a struct version and an abstract version.

So, here's what happens with the interface version. You can see it is relatively inefficient because it performs two levels of indirection:

    return obj.SomeWork(input, step);
sub esp,40h
vzeroupper
vmovaps xmmword ptr [rsp+30h],xmm6
vmovaps xmmword ptr [rsp+20h],xmm7
mov rsi,rcx
vmovsd qword ptr [rsp+60h],xmm2
vmovaps xmm6,xmm1
mov rcx,qword ptr [rsi+8] ; load mathFunction_ into rcx.
vmovaps xmm1,xmm6
mov r11,7FFED7980020h ; load vtable address of the IMathFunction.Calculate function.
cmp dword ptr [rcx],ecx
call qword ptr [r11] ; call IMathFunction.Calculate function which will call the actual Calculate via vtable.
vmovaps xmm7,xmm0
mov rcx,qword ptr [rsi+8] ; load mathFunction_ into rcx.
vmovaps xmm1,xmm6
mov r11,7FFED7980028h ; load vtable address of the IMathFunction.Derivate function.
cmp dword ptr [rcx],ecx
call qword ptr [r11] ; call IMathFunction.Derivate function which will call the actual Derivate via vtable.
vmulsd xmm0,xmm0,mmword ptr [rsp+60h] ; dv * step
vsubsd xmm7,xmm7,xmm0 ; f - (dv * step)
vmovaps xmm0,xmm7
vmovaps xmm6,xmmword ptr [rsp+30h]
vmovaps xmm7,xmmword ptr [rsp+20h]
add rsp,40h
pop rsi
ret

Here's the abstract class version. It's a little more efficient, but only negligibly so:

        return obj.SomeWork(input, step);
sub esp,40h
vzeroupper
vmovaps xmmword ptr [rsp+30h],xmm6
vmovaps xmmword ptr [rsp+20h],xmm7
mov rsi,rcx
vmovsd qword ptr [rsp+60h],xmm2
vmovaps xmm6,xmm1
mov rcx,qword ptr [rsi+8] ; load mathFunction_ into rcx.
vmovaps xmm1,xmm6
mov rax,qword ptr [rcx] ; load object type data from mathFunction_.
mov rax,qword ptr [rax+40h] ; load address of vtable into rax.
call qword ptr [rax+20h] ; call Calculate via offset 0x20 of vtable.
vmovaps xmm7,xmm0
mov rcx,qword ptr [rsi+8] ; load mathFunction_ into rcx.
vmovaps xmm1,xmm6
mov rax,qword ptr [rcx] ; load object type data from mathFunction_.
mov rax,qword ptr [rax+40h] ; load address of vtable into rax.
call qword ptr [rax+28h] ; call Derivate via offset 0x28 of vtable.
vmulsd xmm0,xmm0,mmword ptr [rsp+60h] ; dv * step
vsubsd xmm7,xmm7,xmm0 ; f - (dv * step)
vmovaps xmm0,xmm7
vmovaps xmm6,xmmword ptr [rsp+30h]
vmovaps xmm7,xmmword ptr [rsp+20h]
add rsp,40h
pop rsi
ret

So both the interface and the abstract class rely heavily on branch target prediction to get acceptable performance. Even then, you can see there's quite a lot more work involved, so the best case is still relatively slow, while the worst case is a stalled pipeline due to a mispredicted branch target.

And finally here's the generic version with a struct. You can see it's massively more efficient because everything has been fully inlined so there's no branch prediction involved. It also has the nice side effect of removing most of the stack/parameter management that was in there too, so the code becomes very compact:

    return obj.SomeWork(input, step);
push rax
vzeroupper
movsx rax,byte ptr [rcx+8]
vmovaps xmm0,xmm1
vaddsd xmm0,xmm0,xmm1 ; Calculate - got inlined
vmulsd xmm1,xmm1,xmm1 ; Derivate - got inlined
vmulsd xmm1,xmm1,xmm2 ; dv * step
vsubsd xmm0,xmm0,xmm1 ; f - (dv * step)
add rsp,8
ret

Does the virtual keyword affect performance of the base class methods?


If I call a method in the base class with the virtual keyword, does this performance hit still occur?

The fact that the virtual function is called from within the base class does not prevent the virtual lookup.

Consider this trivial example:

#include <iostream>

class Base
{
public:
    virtual int get_int() { return 1; }
    void print_int()
    {
        // Calling a virtual function from the base class
        std::cout << get_int();
    }
};

class Derived : public Base
{
public:
    int get_int() override { return 2; }
};

int main()
{
    Base().print_int();
    Derived().print_int();
}

Is print_int() guaranteed to print 1? It is not.

The fact that print_int() is defined in the base class does not mean that the object it's called on is not a derived object.

AI Applications in C++: How costly are virtual functions? What are the possible optimizations?

Virtual functions are very efficient. Assuming 32-bit pointers, the memory layout is approximately:

classptr -> [vtable:4][classdata:x]
vtable -> [first:4][second:4][third:4][fourth:4][...]
first -> [code:x]
second -> [code:x]
...

The classptr points to memory that is typically on the heap, occasionally on the stack, and starts with a four-byte pointer to the vtable for that class. But the important thing to remember is that the vtable itself is not allocated per object. It's a static resource, and all objects of the same class type point to exactly the same memory location for their vtable array. Calling on different instances won't pull different memory locations into the L2 cache.
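
A quick way to see the per-object cost (a small sketch of my own, not from the quoted answer) is to compare object sizes with and without a virtual function:

#include <cstdio>

struct Plain   { int a, b; void f() {} };          // no vtable pointer
struct Virtual { int a, b; virtual void f() {} };  // one vptr per object

int main()
{
    // On a typical 32-bit target this prints 8 and 12: the only per-object
    // overhead is a single pointer to the shared, statically allocated vtable.
    std::printf("%zu %zu\n", sizeof(Plain), sizeof(Virtual));
}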

This example from MSDN shows the vtable for class A with virtual func1, func2, and func3: nothing more than 12 bytes. There is a good chance the vtables of different classes will also be physically adjacent in the compiled library (you'll want to verify this if you're especially concerned), which could increase cache efficiency microscopically.

CONST SEGMENT
??_7A@@6B@
DD FLAT:?func1@A@@UAEXXZ
DD FLAT:?func2@A@@UAEXXZ
DD FLAT:?func3@A@@UAEXXZ
CONST ENDS

The other performance concern is the instruction overhead of calling a function through the vtable. This is also very efficient, nearly identical to a non-virtual call. Again, from the MSDN example:

; A* pa;
; pa->func3();
mov eax, DWORD PTR _pa$[ebp]
mov edx, DWORD PTR [eax]
mov ecx, DWORD PTR _pa$[ebp]
call DWORD PTR [edx+8]

In this example, ebp is the stack frame base pointer and the local variable A* pa lives at offset _pa$ from it. The register eax is loaded with the value at _pa$[ebp], so it holds the A*; edx is loaded with the value at location [eax], so it holds the class A vtable pointer. Then ecx is loaded from _pa$[ebp] because ecx carries "this", so it also holds the A*. Finally, the call goes to the value at location [edx+8], which is the third function address in the vtable.

If this function call were not virtual, the mov eax and mov edx would not be needed, but the difference in performance would be immeasurably small.

Are there any costs to using a virtual function if objects are cast to their actual type?

There's always a gap between what the mythical Sufficiently Smart Compiler can do and what actual compilers end up doing. In your example, since nothing inherits from Derived, recent compilers will likely devirtualize the call to foo. However, since successful devirtualization and subsequent inlining are a difficult problem in general, help the compiler out whenever possible by using the final keyword.

class Derived : public Base { void foo() final { /* code */ } };

Now, the compiler knows that there's only one possible foo that a Derived* can call.
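
A slightly fuller sketch (hypothetical Base/Derived definitions of my own, mirroring the fragment above) of what final lets the compiler prove:

struct Base
{
    virtual void foo() {}      // may be overridden by unknown derived classes
    virtual ~Base() = default;
};

struct Derived : Base
{
    void foo() final {}        // nothing deriving from Derived can override foo
};

void call_base(Base* b)       { b->foo(); }  // must remain an indirect (virtual) call
void call_derived(Derived* d) { d->foo(); }  // can be devirtualized: 'final' pins the
                                             // target to Derived::foo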

(For an in-depth discussion of why devirtualization is hard and how GCC 4.9+ tackles it, read Jan Hubicka's series of posts on Devirtualization in C++.)


