C++ Virtual Function Table Memory Cost

What is the performance cost of having a virtual method in a C++ class?

I ran some timings on a 3ghz in-order PowerPC processor. On that architecture, a virtual function call costs 7 nanoseconds longer than a direct (non-virtual) function call.

So, not really worth worrying about the cost unless the function is something like a trivial Get()/Set() accessor, in which anything other than inline is kind of wasteful. A 7ns overhead on a function that inlines to 0.5ns is severe; a 7ns overhead on a function that takes 500ms to execute is meaningless.

The big cost of virtual functions isn't really the lookup of a function pointer in the vtable (that's usually just a single cycle), but that the indirect jump usually cannot be branch-predicted. This can cause a large pipeline bubble as the processor cannot fetch any instructions until the indirect jump (the call through the function pointer) has retired and a new instruction pointer computed. So, the cost of a virtual function call is much bigger than it might seem from looking at the assembly... but still only 7 nanoseconds.

Edit: Andrew, Not Sure, and others also raise the very good point that a virtual function call may cause an instruction cache miss: if you jump to a code address that is not in cache then the whole program comes to a dead halt while the instructions are fetched from main memory. This is always a significant stall: on Xenon, about 650 cycles (by my tests).

However this isn't a problem specific to virtual functions because even a direct function call will cause a miss if you jump to instructions that aren't in cache. What matters is whether the function has been run before recently (making it more likely to be in cache), and whether your architecture can predict static (not virtual) branches and fetch those instructions into cache ahead of time. My PPC does not, but maybe Intel's most recent hardware does.

My timings control for the influence of icache misses on execution (deliberately, since I was trying to examine the CPU pipeline in isolation), so they discount that cost.

C++ : Memory overhead due to virtuality?

You might wanna look at Item 24 of Scott Meyer's More Effective C++. It is titled 'Understand the costs of virtual functions, multiple inheritance, virtual base class and RTTI '. In this item Meyers goes over the overhead involved in making use of these facilities.

virtual table and _vptr storage scheme

The first point to keep in mind is a disclaimer: none of this is actually guaranteed by the standard. The standard says what the code needs to look like and how it should work, but doesn't actually specify exactly how the compiler needs to make that happen.

That said, essentially all C++ compilers work quite similarly in this respect.

So, let's start with non-virtual functions. They come in two classes: static and non-static.

The simpler of the two are static member functions. A static member function is almost like a global function that's a friend of the class, except that it also needs the class`s name as a prefix to the function name.

Non-static member functions are a little more complex. They're still normal functions that are called directly--but they're passed a hidden pointer to the instance of the object on which they were called. Inside the function, you can use the keyword this to refer to that instance data. So, when you call something like a.func(b);, the code that's generated is pretty similar to code you'd get for func(a, b);

Now let's consider virtual functions. Here's where we get into vtables and vtable pointers. We have enough indirection going on that it's probably best to draw some diagrams to see how it's all laid out. Here's pretty much the simplest case: one instance of one class with two virtual functions:

Sample Image

So, the object contains its data and a pointer to the vtable. The vtable contains a pointer to each virtual function defined by that class. It may not be immediately apparent, however, why we need so much indirection. To understand that, let's look at the next (ever so slightly) more complex case: two instances of that class:

Sample Image

Note how each instance of the class has its own data, but they both share the same vtable and the same code--and if we had more instances, they'd still all share the one vtable among all the instances of the same class.

Now, let's consider derivation/inheritance. As an example, let's rename our existing class to "Base", and add a derived class. Since I'm feeling imaginative, I'll name it "Derived". As above, the base class defines two virtual functions. The derived class overrides one (but not the other) of those:

Sample Image

Of course, we can combine the two, having multiple instances of each of the base and/or derived class:

Sample Image

Now let's delve into that in a little more detail. The interesting thing about derivation is that we can pass a pointer/reference to an object of the derived class to a function written to receive a pointer/reference to the base class, and it still works--but if you invoke a virtual function, you get the version for the actual class, not the base class. So, how does that work? How can we treat an instance of the derived class as if it were an instance of the base class, and still have it work? To do it, each derived object has a "base class subobject". For example, lets consider code like this:

struct simple_base { 
int a;
};

struct simple_derived : public simple_base {
int b;
};

In this case, when you create an instance of simple_derived, you get an object containing two ints: a and b. The a (base class part) is at the beginning of the object in memory, and the b (derived class part) follows that. So, if you pass the address of the object to a function expecting an instance of the base class, it uses on the part(s) that exist in the base class, which the compiler places at the same offsets in the object as they'd be in an object of the base class, so the function can manipulate them without even knowing that it's dealing with an object of the derived class. Likewise, if you invoke a virtual function all it needs to know is the location of the vtable pointer. As far as it cares, something like Base::func1 basically just means it follows the vtable pointer, then uses a pointer to a function at some specified offset from there (e.g., the fourth function pointer).

At least for now, I'm going to ignore multiple inheritance. It adds quite a bit of complexity to the picture (especially when virtual inheritance gets involved) and you haven't mentioned it at all, so I doubt you really care.

As to accessing any of this, or using in any way other than simply calling virtual functions: you may be able to come up with something for a specific compiler--but don't expect it to be portable at all. Although things like debuggers often need to look at such stuff, the code involved tends to be quite fragile and compiler-specific.

sizeof class with int , function, virtual function in C++?

First off, a virtual function is not a pointer with 8 bytes. In C++ nothing but sizeof(char) is guaranteed to be any number of bytes.

Second, only the first virtual function in a class increases its size (compiler-dependent, but on most - if not all - it's like this). All subsequent methods do not. Non-virtual functions do not affect the class's size.

This happens because a class instance doesn't hold pointers to methods themselves, but to a virtual function table, which is one per class.

So if you had:

class A
{
virtual void foo();
}

and

class B
{
virtual void goo();
virtual void test();
static void m();
void x();
}

you would have sizeof(A) == sizeof(B).

And now:

Why siszeof(A) is 1 and sizeof(C) is 1 too ?

A and C have size 1 just because it's not allowed for a class to be of size 0. The functions have nothing to do with it. It's just a dummy byte.

Why siszeof(H) is 16 but sizeof(G) is 4 ?

G has only one member that accounts for memory - the int. And on your platform, sizeof(int) == 4. H, besides the int, also has a pointer to the vftable (virtual function table, see above). The size of this, size of int and allignment are compiler specific.

Why siszeof(E) is 16 but sizeof(F) is 4 ?

Explained above - non virtual methods don't take up memory in the class.

Why siszeof(D) is 8 but sizeof(E) is 16 ?

D only contains the vftable pointer which is apparently 8 bytes on your platform. E also has an int, and the vftable is aligned to 8 bytes. So it's something like:

class E

4 bytes for int | 4 padding bytes | 8 bytes for vftable pointer |
| x | x | x | x | | | | | v | v | v | v | v | v | v | v |

AI Applications in C++: How costly are virtual functions? What are the possible optimizations?

Virtual functions are very efficient. Assuming 32 bit pointers the memory layout is approximately:

classptr -> [vtable:4][classdata:x]
vtable -> [first:4][second:4][third:4][fourth:4][...]
first -> [code:x]
second -> [code:x]
...

The classptr points to memory that is typically on the heap, occasionally on the stack, and starts with a four byte pointer to the vtable for that class. But the important thing to remember is the vtable itself is not allocated memory. It's a static resource and all objects of the same class type will point to the exactly the same memory location for their vtable array. Calling on different instances won't pull different memory locations into L2 cache.

This example from msdn shows the vtable for class A with virtual func1, func2, and func3. Nothing more than 12 bytes. There is a good chance the vtables of different classes will also be physically adjacent in the compiled library (you'll want to verify this is you're especially concerned) which could increase cache efficiency microscopically.

CONST SEGMENT
??_7A@@6B@
DD FLAT:?func1@A@@UAEXXZ
DD FLAT:?func2@A@@UAEXXZ
DD FLAT:?func3@A@@UAEXXZ
CONST ENDS

The other performance concern would be instruction overhead of calling through a vtable function. This is also very efficient. Nearly identical to calling a non-virtual function. Again from the example from msdn:

; A* pa;
; pa->func3();
mov eax, DWORD PTR _pa$[ebp]
mov edx, DWORD PTR [eax]
mov ecx, DWORD PTR _pa$[ebp]
call DWORD PTR [edx+8]

In this example ebp, the stack frame base pointer, has the variable A* pa at zero offset. The register eax is loaded with the value at location [ebp], so it has the A*, and edx is loaded with the value at location [eax], so it has class A vtable. Then ecx is loaded with [ebp], because ecx represents "this" it now holds the A*, and finally the call is made to the value at location [edx+8] which is the third function address in the vtable.

If this function call was not virtual the mov eax and mov edx would not be needed, but the difference in performance would be immeasurably small.

Avoiding repeated C++ virtual table lookup

Based on some of the questions and your answers in the comments, here are a couple of considerations.

1) Your problem (if there is one, your solution might already be close to optimal, depending on details you have not mentioned) is most likely somewhere else, not in the overhead of a virtual function call.

If you really run this in a tight loop, and there's not much going on in the implementations of f() that touches a lot of memory, your vtables probably remain in the L1 cache, and the virtual function call overhead will be absolutely minimal, if any, on modern hardware.

2) You say "the functions f() themselves are very simple, for example one of them just multiplies the values at two memory addresses and stores the product in a third address" - this might not be as innocent as you expect. For reference, going to L1 cache will cost you about 3 cycles, going to RAM may cost as much as 60-200, depedning on your hardware.

If you have enough of these objects (so that keeping all of the memory they reference in L1 cache is not possible), and the memory locations they reference are basically random (so that prefetching is ineffective), and/or you touch enough things in the rest of your program (so that all the relevant data gets vacated from cache between the loops over your vector), the cost of fetching and storing the values from and to memory/lower levels of cache will outweigh the cost of the virtual function calls by orders of magnitude in the worst case.

3) You iterate over a vector of pointers to objects - not the objects themselves.

Depending on how you allocate the objects and how big they are, this might not be an issue - prefetching will do wonders for you if you allocate them in a tight loop and your allocator packs them nicely. If, however, you allocate/free a lot of other things and mix in the allocations of these objects in between, they may end up located sparsely and in basically random locations in memory; then iterating over them in the order of creation will involve a lot random reads from memory, which will again be far slower than any virtual function overhead.

4) You say "calls to f() for the vector of children has to be in order" - do they?

If they do, then you are out of luck in some ways. If, however, you can re-architect your system so that they can be called ordered by type, then there is a lot of speed to be gained in various aspects - you could probably allocate an array of each type of object (nice, dense packing in memory), iterate over them in order (prefetcher friendly), and call your f()'s in groups for a single, well known type (inlining friendly, instruction cache friendly).

5) And finally - if none of the above applies and your problem is really in virtual function calls (unlikely), then, yes, you can try storing a pointer to the exact function you need to call for each object in some fashion - either manually or by using one of the type erasure / duck typing methods others have suggested.

My main point is this - there a lot of performance benefits to be had from changing the architecture of your system in some ways.

Remember: accessing things that are already in L1/L2 cache is good, having to go to L3/RAM for data is worse; accessing memory in a sequential order is good, jumping all over memory is bad; calling the same method in a tight loop, potentially inlining it, is good, calling a lot of different methods in a tight loop is worse.

If this is a part of your program the performance of which really matters, you should consider changing the architecture of your system to allow for some of the previously mentioned optimizations. I know this may seem daunting, but that is the game we are playing. Sometimes you need to sacrifice "clean" OOP and abstractions for performance, if the problem you are solving allows for it.

Virtual functions and performance - C++

A good rule of thumb is:

It's not a performance problem until you can prove it.

The use of virtual functions will have a very slight effect on performance, but it's unlikely to affect the overall performance of your application. Better places to look for performance improvements are in algorithms and I/O.

An excellent article that talks about virtual functions (and more) is Member Function Pointers and the Fastest Possible C++ Delegates.

How many bytes of extra memory does it take for virtual functions?

Since virtual functions are typically implemented as a virtual method table (vtable), I am guessing O(number of virtual functions) per class and 1 extra pointer per object. Not as exact as you want, but should give a rough idea.
As @AlokSave mentions, there are many other subtleties which are very compiler specific and may not be consistent enough to estimate.

Typically, the compiler creates a separate vtable for each class. When an object is created, a pointer to this vtable, called the virtual table pointer, vpointer or VPTR, is added as a hidden member of this object

Source : Wikipedia

The number of functions you override (as in your examples) is (im guessing) irrelevent, except when the function to be called can safely be determined at compile time, in which case it might be treated as a static dispatch function (G() in your 2nd example). This is also up to the compiler.

compilers usually avoid using vtables whenever the call can be resolved at compile time.

Finally, The cost of virtual function is likely to be harder on speed than memory..



Related Topics



Leave a reply



Submit