Virtual Functions and Performance - C++

Virtual functions and performance - C++

A good rule of thumb is:

It's not a performance problem until you can prove it.

The use of virtual functions will have a very slight effect on performance, but it's unlikely to affect the overall performance of your application. Better places to look for performance improvements are in algorithms and I/O.

An excellent article that talks about virtual functions (and more) is Member Function Pointers and the Fastest Possible C++ Delegates.

c++ Virtual Function performance for x calls on the same object

There are several good answers to a similar question here:
Cost of a virtual function in a tight loop

In general, to understand the C++ object model and the implementation choices Stroustrup made, I recommend his book "The design and evolution of the C++ language".

Virtual Functions and Performance C++

Extending Charles' answer.

The problem here is that your loop is doing more than just testing the virtual call itself (the memory allocation probably dwarfs the virtual call overhead anyway), so his suggestion is to change the code so that only the virtual call is tested.

Here the benchmark function is template, because template may be inlined while call through function pointers are unlikely to.

template <typename Type>
double benchmark(Type const& t, size_t iterations)
{
  timeval a, b;
  gettimeofday(&a, 0);
  for (;iterations > 0; --iterations) {
    t.getArea();
  }
  gettimeofday(&b, 0);
  return (b.tv_sec * (unsigned int)1e6 + b.tv_usec) -
         (a.tv_sec * (unsigned int)1e6 + a.tv_usec);
}

Classes:

struct Regular
{
  Regular(size_t w, size_t h): _width(w), _height(h) {}

  size_t getArea() const;

  size_t _width;
  size_t _height;
};

// The following line in another translation unit
// to avoid inlining
size_t Regular::getArea() const { return _width * _height; }

struct Base
{
  Base(size_t w, size_t h): _width(w), _height(h) {}

  virtual size_t getArea() const = 0;

  size_t _width;
  size_t _height;
};

struct Derived: Base
{
  Derived(size_t w, size_t h): Base(w, h) {}

  virtual size_t getArea() const;
};

// The following two functions in another translation unit
// to avoid inlining
size_t Derived::getArea() const  { return _width * _height; }

std::auto_ptr<Base> generateDerived()
{
  return std::auto_ptr<Base>(new Derived(3,7));
}

And the measuring:

int main(int argc, char* argv[])
{
  if (argc != 2) {
    std::cerr << "Usage: %prog iterations\n";
    return 1;
  }

  Regular regular(3, 7);
  std::auto_ptr<Base> derived = generateDerived();

  double regTime = benchmark<Regular>(regular, atoi(argv[1]));
  double derTime = benchmark<Base>(*derived, atoi(argv[1]));

  std::cout << "Regular: " << regTime << "\nDerived: " << derTime << "\n";

  return 0;
}

Note: this tests the overhead of a virtual call in comparison to a regular function. The functionality is different (since you do not have runtime dispatch in the second case), but it's therefore a worst-case overhead.

EDIT:

Results of the run (gcc.3.4.2, -O2, SLES10 quadcore server) note: with the functions definitions in another translation unit, to prevent inlining

> ./test 5000000
Regular: 17041
Derived: 17194

Not really convincing.

What is the performance cost of having a virtual method in a C++ class?

I ran some timings on a 3ghz in-order PowerPC processor. On that architecture, a virtual function call costs 7 nanoseconds longer than a direct (non-virtual) function call.

So, not really worth worrying about the cost unless the function is something like a trivial Get()/Set() accessor, in which anything other than inline is kind of wasteful. A 7ns overhead on a function that inlines to 0.5ns is severe; a 7ns overhead on a function that takes 500ms to execute is meaningless.

The big cost of virtual functions isn't really the lookup of a function pointer in the vtable (that's usually just a single cycle), but that the indirect jump usually cannot be branch-predicted. This can cause a large pipeline bubble as the processor cannot fetch any instructions until the indirect jump (the call through the function pointer) has retired and a new instruction pointer computed. So, the cost of a virtual function call is much bigger than it might seem from looking at the assembly... but still only 7 nanoseconds.

Edit: Andrew, Not Sure, and others also raise the very good point that a virtual function call may cause an instruction cache miss: if you jump to a code address that is not in cache then the whole program comes to a dead halt while the instructions are fetched from main memory. This is always a significant stall: on Xenon, about 650 cycles (by my tests).

However this isn't a problem specific to virtual functions because even a direct function call will cause a miss if you jump to instructions that aren't in cache. What matters is whether the function has been run before recently (making it more likely to be in cache), and whether your architecture can predict static (not virtual) branches and fetch those instructions into cache ahead of time. My PPC does not, but maybe Intel's most recent hardware does.

My timings control for the influence of icache misses on execution (deliberately, since I was trying to examine the CPU pipeline in isolation), so they discount that cost.

virtual function vs. function pointer - performance?

I'd say most of the C++ implementations work similar to this (and probably the first
implementations, that compiled into C, produced code like this):

struct ClassVTABLE {
    void (* virtuamethod1)(Class *this);
    void (* virtuamethod2)(Class *this, int arg);
};

struct Class {
    ClassVTABLE *vtable;
};

Then, given an instance Class x, calling the method virtualmethod1 for it is like x.vtable->virtualmethod1(&x), thus one extra dereference, 1 indexed lookup from the vtable, and one extra argument (= this) pushed onto the stack / passed in registers.

However the compiler probably can optimize repeated method calls on an instance within a function: since an instance Class x cannot change its class after it is constructed, the compiler can consider the whole x.vtable->virtualmethod1 as a common sub-expression, and move it out of loops. Thus in this case the repeated virtual method calls within a single function would be equivalent in speed to calling a function via a simple function pointer.

Performance variability of C++ pure virtual function calls

The international C++ standardization committee in 2005 published a Technical Report on C++ Performance , which I believe qualifies as both article and benchmark about this topic.

The short answer is that cache misses can influence run time considerably, and in a call of a virtual function the vtable is (usually) consulted.

But in practice (as opposed to the formal) the overhead per call in terms of executed machine code, is fixed, because all extant compiled C++ implementations use vtables. You can derive classes to your heart's content without affecting the call overhead. Any call still performs (1) look up vtable pointer in known place in object, (2) look up function address in known place in vtable, (3) call that function, unless the compiler knows that the function pointer is available from e.g. an earlier call, which just, if anything, makes the call go a wee bit faster.

Virtual function performance: one large class vs many smaller subclasses

In your case, it probably does not matter. I say probably because, and I mean this constructively, the fact that you did not specify performance requirements and did not specify how often the function in question is called indicates that you may not have enough information to make a judgment now - the "don't speculate: profile" blanket response actually only intends to make sure you have all the necessary information required, because premature micro-optimizations are very common and our real goal is to help you out in the big picture.

Jeremy Friesner really hit the nail on the head with his comment on another answer here:

If you don't understand why it is slow you won't be able to speed it up.

So, with all that in mind, assuming that either A) Your performance requirements are already being met (e.g. you are pulling 4000 FPS - far higher than any display refresh rate) or B) You are struggling to meet performance requirements and this function is only called a few (say < 1000-ish) times per frame) or C) You are struggling to meet performance requirements and this function is called often but does a lot of other significant work (and thus function call overhead is negligible), then:

Using a virtual function may, at most, end up in one extra lookup in a table somewhere (and possibly some cache misses - but not so much if it is accessed repeatedly in, say, an inner loop), which is a few CPU clock cycles worst case (and most likely still less than a switch, although that is really moot here), and this is completely insignificant compared to your target frame rate, the effort required to render a frame, and any other algorithms and logic you are performing. If you'd like to prove it to yourself, profile.

What you should do is use whatever technique leads to the clearest, cleanest, most maintainable and readable code. Micro-optimizations such as this are not going to have an effect, and the cost in code maintainability, even if it is minor, is not worth the benefit, which is essentially zero.

What you should also do is sit down and get a handle on your actual situation. Do you need to improve performance? Is this function called enough to actually have a significant impact or should you be concentrating on other techniques (e.g. higher level algorithms, other design strategies, off-loading computation to the GPU or using machine-specific optimizations e.g. bulk operations with SSE, etc.)?

One thing you could do, in the absence of concrete information, is try both approaches. While performance will differ from machine to machine, you could at least get a rough idea of the impact of this particular bit of code on your overall performance (e.g. if you're shooting for 60 FPS, and these two options give you 23.2 FPS vs. 23.6 FPS, then this isn't where you want to focus, and possible sacrifices made by choosing one of these strategies over the other may not be worth it).

Also consider using call lists, vertex index buffers, etc. OpenGL provides many facilities for optimizing drawing of objects where certain aspects remain constant. For example, if you have a huge surface model with small parts whose vertex coordinates change often, divide the model into sections, using call lists, and only update the call list for the section that changed when it has changed since the last redraw. Leave e.g. coloring and texturing out of the call list (or use coordinate arrays) if they change often. That way you can avoid calling your functions altogether.

If you're curious, here is a test program (which probably does not represent your actual usage, again, this is not possible to answer with the information given - this test is the one requested in comments below). This does not mean these results will be reflected in your program and, again, you need to have concrete information about your actual requirements. But, here this is just for giggles:

This test program compares a switch-based operation vs. a virtual-function based operation vs. pointer-to-member (where member is called from another class member function) vs. pointer-to-member (where member is called directly from test loop). It also performs three types of tests: A run on a dataset with just one operator, a run that alternates back and forth between two operators, and a run that uses a random mix of two operators.

Output when compiled with gcc -O0, for 1,000,000,000 iterations:

$ g++ -O0 tester.cpp
$ ./a.out 
--------------------
Test: time=6.34 sec (switch add) [-358977076]
Test: time=6.44 sec (switch subtract) [358977076]
Test: time=6.96 sec (switch alternating) [-281087476]
Test: time=18.98 sec (switch mixed) [-314721196]
Test: time=6.11 sec (virtual add) [-358977076]
Test: time=6.19 sec (virtual subtract) [358977076]
Test: time=7.88 sec (virtual alternating) [-281087476]
Test: time=19.80 sec (virtual mixed) [-314721196]
Test: time=10.96 sec (ptm add) [-358977076]
Test: time=10.83 sec (ptm subtract) [358977076]
Test: time=12.53 sec (ptm alternating) [-281087476]
Test: time=24.24 sec (ptm mixed) [-314721196]
Test: time=6.94 sec (ptm add (direct)) [-358977076]
Test: time=6.89 sec (ptm subtract (direct)) [358977076]
Test: time=9.12 sec (ptm alternating (direct)) [-281087476]
Test: time=21.19 sec (ptm mixed (direct)) [-314721196]

Output when compiled with gcc -O3, for 1,000,000,000 iterations:

$ g++ -O3 tester.cpp ; ./a.out
--------------------
Test: time=0.87 sec (switch add) [372023620]
Test: time=1.28 sec (switch subtract) [-372023620]
Test: time=1.29 sec (switch alternating) [101645020]
Test: time=7.71 sec (switch mixed) [855607628]
Test: time=2.95 sec (virtual add) [372023620]
Test: time=2.95 sec (virtual subtract) [-372023620]
Test: time=14.74 sec (virtual alternating) [101645020]
Test: time=9.39 sec (virtual mixed) [855607628]
Test: time=4.20 sec (ptm add) [372023620]
Test: time=4.21 sec (ptm subtract) [-372023620]
Test: time=13.11 sec (ptm alternating) [101645020]
Test: time=9.32 sec (ptm mixed) [855607628]
Test: time=3.37 sec (ptm add (direct)) [372023620]
Test: time=3.37 sec (ptm subtract (direct)) [-372023620]
Test: time=13.08 sec (ptm alternating (direct)) [101645020]
Test: time=9.74 sec (ptm mixed (direct)) [855607628]

Note that -O3 does a lot, and without looking at the assembler, we cannot use this as a 100% accurate representation of the problem at hand.

In the unoptimized case, we notice:

Virtual outperforms switch in runs of a single operator.
Switch outperforms virtual in cases where multiple operators used.
Pointer-to-member when the member is called directly (object->*ptm_) is similar to, but slower than, virtual.
Pointer-to-member when the member is called through another method (object->doit() where doit() calls this->*ptm_) takes a little under twice the time.
As expected, "mixed" case performance suffers due to branch prediction failures.

In the optimized case:

Switch outperforms virtual in all cases.
Similar characteristics for pointer-to-member as unoptimized case.
All "alternating" cases that involve function pointers at some point are slower than with -O0 and slower than "mixed" for reasons I do not understand. This does not occur on my PC at home.

What is especially significant here is how much the effects of e.g. branch prediction outweigh any choice of "virtual" vs. "switch". Again, be sure you understand your code and are optimizing the right thing.

The other significant thing here is this represents time differences on the order of 1-14 nanoseconds per operation. This difference could be significant for large numbers of operations, but is likely negligible compared to other things you are doing (note that these functions perform only a single arithmetic operation, any more than that will quickly dwarf the effects of virtual vs. switch).

Note also that while calling the pointer-to-member directly showed an "improvement" over calling it through another class member, this has potentially large impacts on overall design as such an implementation (at least in this case, where something outside the class calls the member directly) could not be dropped in as a direct replacement for another implementation due to different syntax for calling pointer-to-member functions (-> vs. ->*). I had to create a whole separate set of test cases just to handle that, for example.

Conclusion

Performance difference will easily be dwarfed by even a couple extra arithmetic operations. Note also that branch prediction has a far more significant impact in all but the "virtual alternating" case with -O3. However, test is also unlikely to be representative of actual application (which the OP has kept a secret), and -O3 introduces even more variables, and so results must be taken with a grain of salt and are unlikely to be applicable to other scenarios (in other words, test may be interesting, but is not particularly meaningful).

Source:

// === begin timing ===
#ifdef __linux__
#  include <sys/time.h>
typedef struct timeval Time;
static void tick (Time &t) {
  gettimeofday(&t, 0);
}
static double delta (const Time &a, const Time &b) {
  return
    (double)(b.tv_sec - a.tv_sec) +
    (double)(b.tv_usec - a.tv_usec) / 1000000.0;
}
#else // windows; untested, working from memory; sorry for compile errors
#  include <windows.h>
typedef LARGE_INTEGER Time;
static void tick (Time &t) {
  QueryPerformanceCounter(&t);
}
static double delta (const Time &a, const Time &b) {
  LARGE_INTEGER freq;
  QueryPerformanceFrequency(&freq);
  return (double)(b.QuadPart - a.QuadPart) / (double)freq.QuadPart;
}
#endif
// === end timing

#include <cstdio>
#include <cstdlib>
#include <ctime>

using namespace std;

// Size of dataset.
static const size_t DATASET_SIZE = 10000000;

// Repetitions per test.
static const unsigned REPETITIONS = 100;


// Class performs operations with a switch statement.
class OperatorSwitch {
public:
  enum Op { Add, Subtract };
  explicit OperatorSwitch (Op op) : op_(op) { }
  int perform (int a, int b) const {
    switch (op_) {
    case Add: return a + b;
    case Subtract: return a - b;
    }
  }
private:
  Op op_;
};


// Class performs operations with pointer-to-member.
class OperatorPTM {
public:
  enum Op { Add, Subtract };
  explicit OperatorPTM (Op op) {
    perform_ = (op == Add) ? 
      &OperatorPTM::performAdd :
      &OperatorPTM::performSubtract;
  }
  int perform (int a, int b) const { return (this->*perform_)(a, b); }
  int performAdd (int a, int b) const { return a + b; }
  int performSubtract (int a, int b) const { return a - b; }
  //private:
  int (OperatorPTM::*perform_) (int, int) const;
};


// Base class for virtual-function test operator.
class OperatorBase {
public:
  virtual ~OperatorBase () { }
  virtual int perform (int a, int b) const = 0;
};

// Addition
class OperatorAdd : public OperatorBase {
public:
  int perform (int a, int b) const { return a + b; }
};

// Subtraction
class OperatorSubtract : public OperatorBase {
public:
  int perform (int a, int b) const { return a - b; }
};


// No base

// Addition
class OperatorAddNoBase {
public:
  int perform (int a, int b) const { return a + b; }
};

// Subtraction
class OperatorSubtractNoBase {
public:
  int perform (int a, int b) const { return a - b; }
};



// Processes the dataset a number of times, using 'oper'.
template <typename T>
static void test (const int *dataset, const T *oper, const char *name) {

  int result = 0;
  Time start, stop;

  tick(start);

  for (unsigned n = 0; n < REPETITIONS; ++ n)
    for (size_t i = 0; i < DATASET_SIZE; ++ i)
      result = oper->perform(result, dataset[i]);

  tick(stop);

  // result is computed and printed so optimizations do not discard it.
  printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
  fflush(stdout);

}


// Processes the dataset a number of times, alternating between 'oper[0]'
// and 'oper[1]' per element.
template <typename T>
static void testalt (const int *dataset, const T * const *oper, const char *name) {

  int result = 0;
  Time start, stop;

  tick(start);

  for (unsigned n = 0; n < REPETITIONS; ++ n)
    for (size_t i = 0; i < DATASET_SIZE; ++ i)
      result = oper[i&1]->perform(result, dataset[i]);

  tick(stop);

  // result is computed and printed so optimizations do not discard it.
  printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
  fflush(stdout);

}


// Processes the dataset a number of times, choosing between 'oper[0]'
// and 'oper[1]' randomly (based on value in dataset).
template <typename T>
static void testmix (const int *dataset, const T * const *oper, const char *name) {

  int result = 0;
  Time start, stop;

  tick(start);

  for (unsigned n = 0; n < REPETITIONS; ++ n)
    for (size_t i = 0; i < DATASET_SIZE; ++ i) {
      int d = dataset[i];
      result = oper[d&1]->perform(result, d);
    }

  tick(stop);

  // result is computed and printed so optimizations do not discard it.
  printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
  fflush(stdout);

}


// Same as test() but calls perform_() pointer directly.
static void test_ptm (const int *dataset, const OperatorPTM *oper, const char *name) {

  int result = 0;
  Time start, stop;

  tick(start);

  for (unsigned n = 0; n < REPETITIONS; ++ n)
    for (size_t i = 0; i < DATASET_SIZE; ++ i)
      result = (oper->*(oper->perform_))(result, dataset[i]);

  tick(stop);

  // result is computed and printed so optimizations do not discard it.
  printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
  fflush(stdout);

}


// Same as testalt() but calls perform_() pointer directly.
static void testalt_ptm (const int *dataset, const OperatorPTM * const *oper, const char *name) {

  int result = 0;
  Time start, stop;

  tick(start);

  for (unsigned n = 0; n < REPETITIONS; ++ n)
    for (size_t i = 0; i < DATASET_SIZE; ++ i) {
      const OperatorPTM *op = oper[i&1];
      result = (op->*(op->perform_))(result, dataset[i]);
    }

  tick(stop);

  // result is computed and printed so optimizations do not discard it.
  printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
  fflush(stdout);

}


// Same as testmix() but calls perform_() pointer directly.
static void testmix_ptm (const int *dataset, const OperatorPTM * const *oper, const char *name) {

  int result = 0;
  Time start, stop;

  tick(start);

  for (unsigned n = 0; n < REPETITIONS; ++ n)
    for (size_t i = 0; i < DATASET_SIZE; ++ i) {
      int d = dataset[i];
      const OperatorPTM *op = oper[d&1];
      result = (op->*(op->perform_))(result, d);
    }

  tick(stop);

  // result is computed and printed so optimizations do not discard it.
  printf("Test: time=%.2f sec (%s) [%i]\n", delta(start, stop), name, result);
  fflush(stdout);

}


int main () {

  int *dataset = new int[DATASET_SIZE];
  srand(time(NULL));
  for (int n = 0; n < DATASET_SIZE; ++ n)
    dataset[n] = rand();

  OperatorSwitch *switchAdd = new OperatorSwitch(OperatorSwitch::Add);
  OperatorSwitch *switchSub = new OperatorSwitch(OperatorSwitch::Subtract);
  OperatorSwitch *switchAlt[2] = { switchAdd, switchSub };
  OperatorBase *virtAdd = new OperatorAdd();
  OperatorBase *virtSub = new OperatorSubtract();
  OperatorBase *virtAlt[2] = { virtAdd, virtSub };
  OperatorPTM *ptmAdd = new OperatorPTM(OperatorPTM::Add);
  OperatorPTM *ptmSub = new OperatorPTM(OperatorPTM::Subtract);
  OperatorPTM *ptmAlt[2] = { ptmAdd, ptmSub };

  while (true) {
    printf("--------------------\n");
    test(dataset, switchAdd, "switch add");
    test(dataset, switchSub, "switch subtract");
    testalt(dataset, switchAlt, "switch alternating");
    testmix(dataset, switchAlt, "switch mixed");
    test(dataset, virtAdd, "virtual add");
    test(dataset, virtSub, "virtual subtract");
    testalt(dataset, virtAlt, "virtual alternating");
    testmix(dataset, virtAlt, "virtual mixed");
    test(dataset, ptmAdd, "ptm add");
    test(dataset, ptmSub, "ptm subtract");
    testalt(dataset, ptmAlt, "ptm alternating");
    testmix(dataset, ptmAlt, "ptm mixed");
    test_ptm(dataset, ptmAdd, "ptm add (direct)");
    test_ptm(dataset, ptmSub, "ptm subtract (direct)");
    testalt_ptm(dataset, ptmAlt, "ptm alternating (direct)");
    testmix_ptm(dataset, ptmAlt, "ptm mixed (direct)");
  }

}

Does the virtual keyword affect performance of the base class methods?

If I call a method in the base class with the virtual keyword, does this performance hit still occur?

That the virtual function is being called from the base class will not prevent virtual lookup.

Consider this trivial example:

class Base
{
public:
    virtual get_int() { return 1; }
    void print_int()
    {
        // Calling a virtual function from the base class
        std::cout << get_int();
    }
};

class Derived : public Base
{
public:
    virtual get_int() { return 2; }
};

int main()
{
    Base().print_int();
    Derived().print_int();
}

Is print_int() guaranteed to print 1? It is not.

That the function print_int() exists in the base class does not prove that the object its called from is not derived.