What Is the Performance Cost of Having a Virtual Method in a C++ Class

What is the performance cost of having a virtual method in a C++ class?

I ran some timings on a 3 GHz in-order PowerPC processor. On that architecture, a virtual function call costs 7 nanoseconds more than a direct (non-virtual) function call.

So, the cost is not really worth worrying about unless the function is something like a trivial Get()/Set() accessor, in which case anything other than an inlined call is kind of wasteful. A 7ns overhead on a function that inlines to 0.5ns is severe; a 7ns overhead on a function that takes 500ms to execute is meaningless.

The big cost of virtual functions isn't really the lookup of a function pointer in the vtable (that's usually just a single cycle), but that the indirect jump usually cannot be branch-predicted. This can cause a large pipeline bubble, as the processor cannot fetch any instructions until the indirect jump (the call through the function pointer) has retired and a new instruction pointer has been computed. So, the cost of a virtual function call is much bigger than it might seem from looking at the assembly... but still only 7 nanoseconds.

Edit: Andrew, Not Sure, and others also raise the very good point that a virtual function call may cause an instruction cache miss: if you jump to a code address that is not in cache then the whole program comes to a dead halt while the instructions are fetched from main memory. This is always a significant stall: on Xenon, about 650 cycles (by my tests).

However, this isn't a problem specific to virtual functions, because even a direct function call will cause a miss if you jump to instructions that aren't in cache. What matters is whether the function has been run recently (making it more likely to be in cache), and whether your architecture can predict static (not virtual) branches and fetch those instructions into cache ahead of time. My PPC does not, but maybe Intel's most recent hardware does.

My timings control for the influence of icache misses on execution (deliberately, since I was trying to examine the CPU pipeline in isolation), so they discount that cost.
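
A minimal sketch of that kind of micro-benchmark (my own illustration, not the original test harness; the Widget class and loop count are made up) looks something like this. Note that on a modern optimizing compiler you have to hide the object's dynamic type (e.g. construct it in another translation unit) or both loops may be devirtualized and inlined away:

#include <chrono>
#include <cstdio>

struct Widget
{
    int direct()      { return value; }   // direct (non-virtual) call
    virtual int dyn() { return value; }   // indirect call through the vtable
    virtual ~Widget() = default;
    int value = 1;
};

int main()
{
    Widget w;
    Widget* p = &w;               // call through a pointer so virtual dispatch is required
    volatile long long sink = 0;  // keep the results from being optimized out

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000000; ++i) sink += p->direct();
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000000; ++i) sink += p->dyn();
    auto t2 = std::chrono::steady_clock::now();

    using ns = std::chrono::nanoseconds;
    std::printf("direct:  %lld ns\n", (long long)std::chrono::duration_cast<ns>(t1 - t0).count());
    std::printf("virtual: %lld ns\n", (long long)std::chrono::duration_cast<ns>(t2 - t1).count());
}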

What are the performance implications of marking methods / properties as virtual?

Virtual functions only have a very small performance overhead compared to direct calls. At a low level, you're basically looking at an array lookup to get a function pointer, and then a call via a function pointer. Modern CPUs can even predict indirect function calls reasonably well in their branch predictors, so they generally won't hurt modern CPU pipelines too badly. At the assembly level, a virtual function call translates to something like the following, where I is an arbitrary immediate value.

MOV EAX, [EBP + I] ; Move pointer to class instance into register
MOV EBX, [EAX] ; Move vtbl pointer into register.
CALL [EBX + I] ; Call function

Vs. the following for a direct function call:

CALL I  ;  Call function directly

The real overhead is that virtual functions, for the most part, can't be inlined. (They can be in JIT languages if the VM realizes they're always going to the same address anyhow.) Besides the speedup you get from inlining itself, inlining enables several other optimizations such as constant folding, because the caller can know how the callee works internally. For functions that are large enough not to be inlined anyhow, the performance hit will likely be negligible. For very small functions that might be inlined, that's when you need to be careful about virtual functions.
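
As an illustration (my own sketch with a hypothetical Point class), a trivial accessor is exactly the case where the difference shows up:

class Point
{
public:
    int x() const { return x_; }          // non-virtual: trivially inlined, the call
                                          // typically disappears into a single load
    virtual int y() const { return y_; }  // virtual: when the static type isn't known,
                                          // this stays an indirect call and can't be inlined
private:
    int x_ = 0;
    int y_ = 0;
};

int sum(const Point& p)
{
    return p.x() + p.y();   // p.y() pays the vtable load plus indirect call at this site
}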

Edit: Another thing to keep in mind is that all programs require flow control, and this is never free. What would replace your virtual function? A switch statement? A series of if statements? These are still branches that may be unpredictable. Furthermore, given an N-way branch, a series of if statements will find the proper path in O(N), while a virtual function will find it in O(1). The switch statement may be O(N) or O(1) depending on whether it is optimized to a jump table.
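
For example (a sketch of my own with hypothetical Kind/Shape types), the two flavors of N-way dispatch look like this:

// Dispatch via a type tag: an if-chain is O(N) in the number of cases,
// while a switch may compile to an O(1) jump table (itself an indirect branch).
enum class Kind { Circle, Square, Triangle };

double area_tagged(Kind k, double s)
{
    switch (k)
    {
    case Kind::Circle:   return 3.14159265358979 * s * s;
    case Kind::Square:   return s * s;
    case Kind::Triangle: return 0.5 * s * s;
    }
    return 0.0;
}

// Dispatch via a virtual function: one indirect call, O(1) no matter how
// many derived types exist.
struct Shape
{
    virtual double area(double s) const = 0;
    virtual ~Shape() = default;
};
struct Circle : Shape { double area(double s) const override { return 3.14159265358979 * s * s; } };
struct Square : Shape { double area(double s) const override { return s * s; } };

double area_virtual(const Shape& shape, double s) { return shape.area(s); }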

Virtual functions and performance - C++

A good rule of thumb is:

It's not a performance problem until you can prove it.

The use of virtual functions will have a very slight effect on performance, but it's unlikely to affect the overall performance of your application. Better places to look for performance improvements are in algorithms and I/O.

An excellent article that talks about virtual functions (and more) is Member Function Pointers and the Fastest Possible C++ Delegates.

Is there any penalty/cost of virtual inheritance in C++, when calling non-virtual base method?

There may be, yes, if you call the member function via a pointer or reference and the compiler can't determine with absolute certainty what type of object that pointer or reference points or refers to. For example, consider:

struct A { void foo(); };    // assumed definitions: A is the virtual base that
struct B : virtual A {};     // declares the non-virtual foo()
struct D : B {};

void f(B* p) { p->foo(); }

void g()
{
    D bar;
    f(&bar);
}

Assuming the call to f is not inlined, the compiler needs to generate code to find the location of the A virtual base class subobject in order to call foo. Usually this lookup involves checking the vptr/vtable.

If the compiler knows the type of the object on which you are calling the function, though (as is the case in your example), there should be no overhead because the function call can be dispatched statically (at compile time). In your example, the dynamic type of bar is known to be D (it can't be anything else), so the offset of the virtual base class subobject A can be computed at compile time.
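
Concretely (a small illustration using the same assumed A/B/D hierarchy as in the snippet above), a call on an object whose complete type is known needs no lookup:

void h()
{
    D bar;
    bar.foo();   // static type == dynamic type (D), so the offset of the A subobject
                 // is a compile-time constant and no vptr lookup is needed
}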

Avoiding the overhead of C# virtual calls

You can cause the JIT to devirtualize your interface calls by using a struct with a constrained generic.

public class SomeObject<TMathFunction> where TMathFunction : struct, IMathFunction
{
    private readonly TMathFunction mathFunction_;

    public double SomeWork(double input, double step)
    {
        var f = mathFunction_.Calculate(input);
        var dv = mathFunction_.Derivate(input);
        return f - (dv * step);
    }
}

// ...

var obj = new SomeObject<CoolMathFunction>();
obj.SomeWork(x, y);

Here are the important pieces to note:

  • The implementation of the IMathFunction interface, CoolMathFunction, is known at compile-time through a generic. This limits the applicability of this optimization quite a bit.
  • The methods are called on the generic parameter type TMathFunction rather than on the interface IMathFunction.
  • The generic is constrained to implement IMathFunction so we can call those methods.
  • The generic is constrained to a struct -- not strictly a requirement, but it ensures we correctly exploit how the JIT generates code for generics: the code will still run without it, but we won't get the optimization we want.

When generics are instantiated, code generation differs depending on whether the generic parameter is a class or a struct. For classes, every instantiation actually shares the same code, and the calls go through vtables. But structs are special: each struct type gets its own instantiation, which devirtualizes the interface calls into direct calls to the struct's methods, avoiding the vtable and enabling inlining.

This feature exists to avoid boxing value types into reference types every time you call through a generic. It avoids allocations and is a key reason why List<T> etc. are an improvement over the non-generic ArrayList etc.

Some implementations:

I made a simple implementation of IMathFunction for testing:

class SomeImplementationByRef : IMathFunction
{
    public double Calculate(double input)
    {
        return input + input;
    }

    public double Derivate(double input)
    {
        return input * input;
    }
}

... as well as a struct version and an abstract version.

So, here's what happens with the interface version. You can see it is relatively inefficient because it performs two levels of indirection:

    return obj.SomeWork(input, step);
sub esp,40h
vzeroupper
vmovaps xmmword ptr [rsp+30h],xmm6
vmovaps xmmword ptr [rsp+20h],xmm7
mov rsi,rcx
vmovsd qword ptr [rsp+60h],xmm2
vmovaps xmm6,xmm1
mov rcx,qword ptr [rsi+8] ; load mathFunction_ into rcx.
vmovaps xmm1,xmm6
mov r11,7FFED7980020h ; load vtable address of the IMathFunction.Calculate function.
cmp dword ptr [rcx],ecx
call qword ptr [r11] ; call IMathFunction.Calculate function which will call the actual Calculate via vtable.
vmovaps xmm7,xmm0
mov rcx,qword ptr [rsi+8] ; load mathFunction_ into rcx.
vmovaps xmm1,xmm6
mov r11,7FFED7980028h ; load vtable address of the IMathFunction.Derivate function.
cmp dword ptr [rcx],ecx
call qword ptr [r11] ; call IMathFunction.Derivate function which will call the actual Derivate via vtable.
vmulsd xmm0,xmm0,mmword ptr [rsp+60h] ; dv * step
vsubsd xmm7,xmm7,xmm0 ; f - (dv * step)
vmovaps xmm0,xmm7
vmovaps xmm6,xmmword ptr [rsp+30h]
vmovaps xmm7,xmmword ptr [rsp+20h]
add rsp,40h
pop rsi
ret

Here's the abstract class version. It's a little more efficient, but only negligibly so:

        return obj.SomeWork(input, step);
sub esp,40h
vzeroupper
vmovaps xmmword ptr [rsp+30h],xmm6
vmovaps xmmword ptr [rsp+20h],xmm7
mov rsi,rcx
vmovsd qword ptr [rsp+60h],xmm2
vmovaps xmm6,xmm1
mov rcx,qword ptr [rsi+8] ; load mathFunction_ into rcx.
vmovaps xmm1,xmm6
mov rax,qword ptr [rcx] ; load object type data from mathFunction_.
mov rax,qword ptr [rax+40h] ; load address of vtable into rax.
call qword ptr [rax+20h] ; call Calculate via offset 0x20 of vtable.
vmovaps xmm7,xmm0
mov rcx,qword ptr [rsi+8] ; load mathFunction_ into rcx.
vmovaps xmm1,xmm6
mov rax,qword ptr [rcx] ; load object type data from mathFunction_.
mov rax,qword ptr [rax+40h] ; load address of vtable into rax.
call qword ptr [rax+28h] ; call Derivate via offset 0x28 of vtable.
vmulsd xmm0,xmm0,mmword ptr [rsp+60h] ; dv * step
vsubsd xmm7,xmm7,xmm0 ; f - (dv * step)
vmovaps xmm0,xmm7
vmovaps xmm6,xmmword ptr [rsp+30h]
vmovaps xmm7,xmmword ptr [rsp+20h]
add rsp,40h
pop rsi
ret

So both the interface and the abstract class rely heavily on branch target prediction to get acceptable performance. Even then, you can see there's quite a lot more work involved, so the best case is still relatively slow, while the worst case is a stalled pipeline due to a mispredicted branch target.

And finally here's the generic version with a struct. You can see it's massively more efficient because everything has been fully inlined so there's no branch prediction involved. It also has the nice side effect of removing most of the stack/parameter management that was in there too, so the code becomes very compact:

    return obj.SomeWork(input, step);
push rax
vzeroupper
movsx rax,byte ptr [rcx+8]
vmovaps xmm0,xmm1
vaddsd xmm0,xmm0,xmm1 ; Calculate - got inlined
vmulsd xmm1,xmm1,xmm1 ; Derivate - got inlined
vmulsd xmm1,xmm1,xmm2 ; dv * step
vsubsd xmm0,xmm0,xmm1 ; f - (dv * step)
add rsp,8
ret

Does the virtual keyword affect performance of the base class methods?


If I call a method in the base class with the virtual keyword, does this performance hit still occur?

The fact that the virtual function is called from within the base class does not prevent the virtual lookup.

Consider this trivial example:

#include <iostream>

class Base
{
public:
    virtual int get_int() { return 1; }
    void print_int()
    {
        // Calling a virtual function from the base class
        std::cout << get_int();
    }
};

class Derived : public Base
{
public:
    int get_int() override { return 2; }
};

int main()
{
    Base().print_int();
    Derived().print_int();
}

Is print_int() guaranteed to print 1? It is not.

The fact that print_int() is defined in the base class does not mean that the object it's called on is not a derived object.

AI Applications in C++: How costly are virtual functions? What are the possible optimizations?

Virtual functions are very efficient. Assuming 32-bit pointers, the memory layout is approximately:

classptr -> [vtable:4][classdata:x]
vtable -> [first:4][second:4][third:4][fourth:4][...]
first -> [code:x]
second -> [code:x]
...

The classptr points to memory that is typically on the heap, occasionally on the stack, and starts with a four-byte pointer to the vtable for that class. But the important thing to remember is that the vtable itself is not allocated per object. It's a static resource, and all objects of the same class type point to exactly the same memory location for their vtable array. Calling on different instances won't pull different memory locations into the L2 cache.
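
A quick way to see the per-object cost (a small sketch of my own, not from the quoted answer) is to compare object sizes with and without a virtual function:

#include <cstdio>

struct Plain   { int a, b; void f() {} };          // no vtable pointer
struct Virtual { int a, b; virtual void f() {} };  // one vptr per object

int main()
{
    // On a typical 32-bit target this prints 8 and 12: the only per-object
    // overhead is a single pointer to the shared, statically allocated vtable.
    std::printf("%zu %zu\n", sizeof(Plain), sizeof(Virtual));
}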

This example from MSDN shows the vtable for class A with virtual func1, func2, and func3: nothing more than 12 bytes. There is a good chance the vtables of different classes will also be physically adjacent in the compiled library (you'll want to verify this if you're especially concerned), which could increase cache efficiency microscopically.

CONST SEGMENT
??_7A@@6B@
DD FLAT:?func1@A@@UAEXXZ
DD FLAT:?func2@A@@UAEXXZ
DD FLAT:?func3@A@@UAEXXZ
CONST ENDS

The other performance concern is the instruction overhead of calling a function through the vtable. This is also very efficient, nearly identical to a non-virtual call. Again, from the MSDN example:

; A* pa;
; pa->func3();
mov eax, DWORD PTR _pa$[ebp]
mov edx, DWORD PTR [eax]
mov ecx, DWORD PTR _pa$[ebp]
call DWORD PTR [edx+8]

In this example, ebp is the stack frame base pointer and the local variable A* pa lives at offset _pa$ from it. The register eax is loaded with the value at _pa$[ebp], so it holds the A*; edx is loaded with the value at location [eax], so it holds the class A vtable pointer. Then ecx is loaded from _pa$[ebp] because ecx carries "this", so it also holds the A*. Finally, the call goes to the value at location [edx+8], which is the third function address in the vtable.

If this function call were not virtual, the mov eax and mov edx would not be needed, but the difference in performance would be immeasurably small.

Are there any costs to using a virtual function if objects are cast to their actual type?

There's always a gap between what the mythical Sufficiently Smart Compiler can do and what actual compilers end up doing. In your example, since nothing inherits from Derived, recent compilers will likely devirtualize the call to foo. However, since successful devirtualization and subsequent inlining are a difficult problem in general, help the compiler out whenever possible by using the final keyword.

class Derived : public Base { void foo() final { /* code */ } };

Now, the compiler knows that there's only one possible foo that a Derived* can call.
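
A slightly fuller sketch (hypothetical Base/Derived definitions of my own, mirroring the fragment above) of what final lets the compiler prove:

struct Base
{
    virtual void foo() {}      // may be overridden by unknown derived classes
    virtual ~Base() = default;
};

struct Derived : Base
{
    void foo() final {}        // nothing deriving from Derived can override foo
};

void call_base(Base* b)       { b->foo(); }  // must remain an indirect (virtual) call
void call_derived(Derived* d) { d->foo(); }  // can be devirtualized: 'final' pins the
                                             // target to Derived::foo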

(For an in-depth discussion of why devirtualization is hard and how GCC 4.9+ tackles it, read Jan Hubicka's series of posts on Devirtualization in C++.)


