What Is the Performance Overhead of std::function?

What is the performance overhead of std::function?

You can find information in Boost's reference materials: "How much overhead does a call through boost::function incur?" and "Performance".

This doesn't give a flat "yes or no" verdict on boost::function. The performance cost may well be acceptable given the program's requirements. More often than not, parts of a program are not performance-critical, and even where they are, the overhead may still be acceptable. This is something only you can determine.

As to the standard library version, the standard only defines an interface; it is entirely up to individual implementations to make it work. I would expect an implementation similar to boost::function to be used.

Overhead with std::function

std::function is a type erasure class.

It takes whatever it is constructed from, and erases everything except:

  • Invoke with the signature in question (with possible implicit casting)
  • Destroy
  • Copy
  • Cast back to exact original type

and possibly

  • Move

This involves some overhead. A typical decent-quality std::function will have small object optimization (like small string optimization), avoiding a heap allocation when the amount of memory used is small.

A function pointer will fit in there.

However, there is still overhead. If you initialize a std::function with a compatible function pointer, instead of directly calling the function pointer in question, you do a virtual function table lookup, or invoke some other function, which then invokes the function pointer.

With a vtable implementation, that is a possible data cache miss, an instruction cache miss, then another instruction cache miss. With a function pointer, the pointer is probably stored locally, and it is called directly, resulting in one possible instruction cache miss.

On top of this, in practice compilers understand function pointers better than std::functions: a number of compilers can figure out that the pointer is a constant value during inlining or whole-program optimization. I have never seen one that pulls that off with std::function.

For larger objects (say, larger than sizeof(std::string) in one implementation), the std::function also performs a heap allocation. This is another cost. For function pointers and reference wrappers, small object optimization is effectively guaranteed by the standard (their construction is required not to throw).
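One rough way to observe this is a sketch like the following. It is not a guarantee: the small-buffer size and allocation behavior are implementation details of libstdc++/libc++, and replacing the global operator new with a counting version is only a measurement trick. The helper names are invented for illustration.

```cpp
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <new>

// Count heap allocations by replacing the global operator new.
// (Measurement trick only; the counts depend on the standard library.)
static int g_allocs = 0;
void* operator new(std::size_t n) {
    ++g_allocs;
    if (void* p = std::malloc(n)) return p;
    throw std::bad_alloc{};
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

int add_one(int x) { return x + 1; }

// A plain function pointer fits in the small buffer: no allocation expected.
int allocs_for_function_pointer() {
    int before = g_allocs;
    std::function<int(int)> f = add_one;
    (void)f(0);
    return g_allocs - before;
}

// A 256-byte capture is too large for any typical small buffer: heap is used.
int allocs_for_big_capture() {
    char big[256] = {};
    int before = g_allocs;
    std::function<int(int)> f = [big](int x) { return x + big[0]; };
    (void)f(0);
    return g_allocs - before;
}
```

On the mainstream implementations, the function pointer reports zero allocations while the large capture reports at least one.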


Directly storing the lambda, without wrapping it in a std::function, is even better than a function pointer: in that case, the code to run is implicit in the lambda's type. This makes it trivial for the compiler to work out what will happen when it is called, and makes inlining easy.

Only do type erasure when you need to.

Why is std::function slow? Is it because the CPU can't utilize instruction reordering?

Code wrapped in a std::function is always slower than code inlined directly at the call site, especially if the code is very short, say 3-5 CPU instructions.

If your function's code is quite big, hundreds of instructions, then there will be no noticeable difference between std::function and any other mechanism of calling/wrapping code.

std::function code is not inlined. Using a std::function wrapper has almost the same overhead as using virtual methods in a class. Moreover, the std::function mechanism looks very much like the virtual call mechanism: in both cases the code is not inlined, and a pointer to the code is used to call it with the assembler's call instruction.

If you really need speed, then use lambdas and pass them around as template parameters, as below. Lambdas are inlined whenever possible (and when the compiler decides it will improve speed).


#include <functional>

template <typename F>
void __attribute__((noinline)) use_lambda(F const & f) {
    auto volatile a = f(13); // call f
    // ....
    auto volatile b = f(7);  // call f again
}

void __attribute__((noinline)) use_func(
        std::function<int(int)> const & f) {
    auto volatile a = f(11); // call f
    // ....
    auto volatile b = f(17); // call f again
}

int main() {
    int x = 123;
    auto f = [&](int y){ return x + y; };
    use_lambda(f); // Pass lambda
    use_func(f);   // Pass function
}

If you look at the assembler code of the example above, you can see that the lambda call was inlined while the std::function call was not.

Template parameters are generally faster than the alternatives; prefer templates wherever you need polymorphism together with high performance.

Overhead of std::function vs virtual function call for type erasure

I was hesitating a bit to answer this question, since it could easily boil down to opinions. I have been using std::function in a project, so I might just as well share my two cents (and you can decide what to do with the input).

Firstly, I would like to reiterate what has already been said in the comments: if you actually want to see the performance, you have to do some benchmarking. Only after benchmarking can you derive your conclusions.

Luckily, you can use quick-bench for quick benchmarking. I fed the benchmark with your two versions, adding a state variable that is incremented on each call, and a getter for it:

// Type erasure:
struct PersistentAPI {
    std::function<void(int)> take_snapshot;
    std::function<void(int)> save;
    std::function<void(int)> load;
    std::function<int()> get;
};

// Virtual base class:
class AbstractPersistent {
public:
    virtual void take_snapshot(int version) = 0;
    virtual void save(int to_version) = 0;
    virtual void load(int to_version) = 0;
    virtual int get() = 0;
};

Each function simply increases an integer in the corresponding class, and returns it with get() (hoping that the compiler does not remove all unnecessary code).

The result is in favor of virtual functions: for both Clang and GCC, there is a speed difference of around 1.7× (https://quick-bench.com/q/wUbPp8OdtzLZv8H1VylyuDnd2pU; you can change the compiler and recheck).

Now to the analysis: why is the abstract class seemingly quicker? Well, there are more indirections with std::function, but there is also another indirection in the wrapping itself when we call std::bind! According to Scott Meyers, lambdas are preferable to std::bind, not only for their ease of syntax for people (std::placeholders is no beauty), but also for their ease of analysis by the compiler: a lambda call is easier to inline.

Inlining is very important for performance. If an explicit call can be avoided by inserting the code at the call site, we can save some cycles!
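To make the bind-versus-lambda point concrete, here is a minimal sketch. SqlitePersistent and the two helper functions are invented for illustration, not taken from the question; both helpers produce a working std::function, but the lambda version is typically easier for the compiler to see through.

```cpp
#include <functional>

// Invented stand-in for an implementation class.
struct SqlitePersistent {
    int version = 0;
    void take_snapshot(int v) { version = v; }
    int get() const { return version; }
};

// std::bind version: an extra wrapper object plus placeholders syntax.
std::function<void(int)> bind_take_snapshot(SqlitePersistent& p) {
    using namespace std::placeholders;
    return std::bind(&SqlitePersistent::take_snapshot, &p, _1);
}

// Lambda version: same behavior, easier to read and easier to inline.
std::function<void(int)> lambda_take_snapshot(SqlitePersistent& p) {
    return [&p](int v) { p.take_snapshot(v); };
}
```

Note that both wrappers keep a reference/pointer to the object, so it must outlive the callback.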

Changing std::bind to lambdas and benchmarking again, we get very similar performance for std::function and inheritance (for both Clang and GCC): https://quick-bench.com/q/HypCbzz5UMo1aHtRpRbrc9B8v44.

So, why are they similar? For Clang and GCC, std::function internally uses inheritance. Type erasure, as implemented here, simply hides the polymorphism.

(Note that this benchmark might be misleading, since the call for both cases could be completely inlined, thus no indirection is used at all. The test case might have to be a bit more tricky to trick the compiler.)

So, let's say you have either Clang or GCC as your compiler: which method should you use?

The PersistentAPI is more flexible, since take_snapshot, save and load are essentially function pointers and do not need to be bound to a single class! With

struct PersistentAPI {
    std::function<void(int)> take_snapshot;
    std::function<void(int)> save;
    std::function<void(int)> load;
};

, it is fully reasonable for a developer to believe that PersistentAPI is meant to dispatch to multiple objects, and not just a single one. take_snapshot could, for example, dispatch to a free function, whereas save and load dispatch to two different classes. Is this the flexibility you want? Then that's what you should use. Generally, I would use std::function in an API to let the user register a callback to any callable of choice.

If you want type erasure but want to hide the inheritance for some reason, you can build your own version. Just as std::function accepts any type with operator(), we can build a class that accepts any type with the interface take_snapshot, save and load. It's a good exercise!

// probably there is a better name for this class
class PersistentTypeErased {
public:
    template<typename T>
    PersistentTypeErased(T t) : t_(std::make_unique<Model<T>>(t)) {}

    void take_snapshot(int version) { t_->take_snapshot(version); }
    void save(int to_version) { t_->save(to_version); }
    void load(int to_version) { t_->load(to_version); }
private:
    struct Concept
    {
        // virtual destructor: deleting through unique_ptr<Concept>
        // would otherwise be undefined behavior
        virtual ~Concept() = default;
        virtual void take_snapshot(int version) = 0;
        virtual void save(int to_version) = 0;
        virtual void load(int to_version) = 0;
    };
    template<typename T>
    struct Model : Concept
    {
        Model(T t) : t_(t) {}
        void take_snapshot(int version) override { t_.take_snapshot(version); }
        void save(int to_version) override { t_.save(to_version); }
        void load(int to_version) override { t_.load(to_version); }
        T t_;
    };
    std::unique_ptr<Concept> t_;
};

The technique is similar to the one inside std::function, and now you can probably also see how type erasure uses polymorphism under the hood.
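As a usage sketch (the wrapper is condensed from the class above, with a virtual destructor added; MemoryStore and demo_snapshot are invented for illustration), any type with the right member functions can be wrapped without inheriting from anything:

```cpp
#include <memory>

// Condensed type-erasure wrapper, as in the answer above.
class PersistentTypeErased {
public:
    template<typename T>
    PersistentTypeErased(T t) : t_(std::make_unique<Model<T>>(t)) {}
    void take_snapshot(int version) { t_->take_snapshot(version); }
    void save(int to_version) { t_->save(to_version); }
    void load(int to_version) { t_->load(to_version); }
private:
    struct Concept {
        virtual ~Concept() = default;
        virtual void take_snapshot(int) = 0;
        virtual void save(int) = 0;
        virtual void load(int) = 0;
    };
    template<typename T>
    struct Model : Concept {
        Model(T t) : t_(t) {}
        void take_snapshot(int v) override { t_.take_snapshot(v); }
        void save(int v) override { t_.save(v); }
        void load(int v) override { t_.load(v); }
        T t_;
    };
    std::unique_ptr<Concept> t_;
};

// Invented example type: it never inherits from anything, it just
// happens to have the right member functions.
struct MemoryStore {
    int* observed;
    void take_snapshot(int v) { *observed = v; }
    void save(int v) { *observed = v; }
    void load(int v) { *observed = v; }
};

// Drive a MemoryStore through the erased interface and report the
// last snapshot version it saw.
int demo_snapshot(int version) {
    int observed = 0;
    PersistentTypeErased p(MemoryStore{&observed});
    p.take_snapshot(version);
    return observed;
}
```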

Performance of std::function compared to raw function pointer and void* this?

I have wondered about this quite frequently myself, so I wrote a very minimal benchmark that measures the performance of each function-pointer callback variant with looped atomic counters.

Keep in mind, these are bare calls to functions that do only one thing: atomically incrementing their counter.

By checking the generated assembler output, you may find that a bare C function pointer loop is compiled into 3 CPU instructions, while a C++11 std::function call adds just 2 more CPU instructions, thus 5 in our example. As a conclusion: it doesn't matter much which function pointer technique you use; the overhead differences are very small in any case.

(Confusing, however, is that the assigned lambda expression seems to run faster than the others, even the C one.)

Compile the example with: clang++ -o tests/perftest-fncb tests/perftest-fncb.cpp -std=c++11 -pthread -lpthread -lrt -O3 -march=native -mtune=native

#include <functional>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

typedef unsigned long long counter_t;

struct Counter {
    volatile counter_t bare;
    volatile counter_t cxx;
    volatile counter_t cxo1;
    volatile counter_t virt;
    volatile counter_t lambda;

    Counter() : bare(0), cxx(0), cxo1(0), virt(0), lambda(0) {}
} counter;

void bare(Counter* counter) { __sync_fetch_and_add(&counter->bare, 1); }
void cxx(Counter* counter) { __sync_fetch_and_add(&counter->cxx, 1); }

struct CXO1 {
    void cxo1(Counter* counter) { __sync_fetch_and_add(&counter->cxo1, 1); }
    virtual void virt(Counter* counter) { __sync_fetch_and_add(&counter->virt, 1); }
} cxo1;

void (*bare_cb)(Counter*) = nullptr;
std::function<void(Counter*)> cxx_cb;
std::function<void(Counter*)> cxo1_cb;
std::function<void(Counter*)> virt_cb;
std::function<void(Counter*)> lambda_cb;

void* bare_main(void* p) { while (true) { bare_cb(&counter); } }
void* cxx_main(void* p) { while (true) { cxx_cb(&counter); } }
void* cxo1_main(void* p) { while (true) { cxo1_cb(&counter); } }
void* virt_main(void* p) { while (true) { virt_cb(&counter); } }
void* lambda_main(void* p) { while (true) { lambda_cb(&counter); } }

int main()
{
    pthread_t bare_thread;
    pthread_t cxx_thread;
    pthread_t cxo1_thread;
    pthread_t virt_thread;
    pthread_t lambda_thread;

    bare_cb = &bare;
    cxx_cb = std::bind(&cxx, std::placeholders::_1);
    cxo1_cb = std::bind(&CXO1::cxo1, &cxo1, std::placeholders::_1);
    virt_cb = std::bind(&CXO1::virt, &cxo1, std::placeholders::_1);
    lambda_cb = [](Counter* counter) { __sync_fetch_and_add(&counter->lambda, 1); };

    pthread_create(&bare_thread, nullptr, &bare_main, nullptr);
    pthread_create(&cxx_thread, nullptr, &cxx_main, nullptr);
    pthread_create(&cxo1_thread, nullptr, &cxo1_main, nullptr);
    pthread_create(&virt_thread, nullptr, &virt_main, nullptr);
    pthread_create(&lambda_thread, nullptr, &lambda_main, nullptr);

    for (unsigned long long n = 1; true; ++n) {
        sleep(1);
        Counter c = counter;

        printf(
            "%15llu bare function pointer\n"
            "%15llu C++11 function object to bare function\n"
            "%15llu C++11 function object to object method\n"
            "%15llu C++11 function object to object method (virtual)\n"
            "%15llu C++11 function object to lambda expression %30llu-th second.\n\n",
            c.bare, c.cxx, c.cxo1, c.virt, c.lambda, n
        );
    }
}

std::function has performance issues, how to avoid them?

By templating FooKernel, you can avoid the need for std::function.

#include <iostream>
#include <complex>
#include <functional>

class Kernel {
public:
    /*
    Covariance function: returns the covariance between two R.V.s over the
    kernel's entire domain of definition.
    */
    virtual double covarianceFunction(
        double X,
        double Y
    ) const = 0;
    virtual ~Kernel() = default; // virtual: this class is used as a base
};

template <typename Func>
class FooKernel : public Kernel {
public:
    FooKernel(Func&& fun) : fun_(std::forward<Func>(fun)) {}

    double covarianceFunction(
        double X,
        double Y
    ) const {
        return fun_(X, Y);
    }

    template<class T>
    auto operator+(const T b) const {
        return make_foo_kernel([b, this](double X, double Y) -> double {
            return this->covarianceFunction(X, Y) + b.covarianceFunction(X, Y);
        });
    }

    FooKernel operator=(const FooKernel other) const {
        return other;
    }
private:
    Func fun_;
};

template <typename Func>
auto make_foo_kernel(Func&& fun)
{
    return FooKernel<Func>(std::forward<Func>(fun));
}

class GaussianKernel : public Kernel {
public:
    GaussianKernel(double sigma, double scale) : m_sigma(sigma), m_scale(scale) {}
    GaussianKernel(double sigma) : m_sigma(sigma), m_scale(1) {}

    /*
    A well-known covariance function that enforces smooth deformations
    Ref: Shape modeling using Gaussian process Morphable Models, Luethi et al.
    */
    double covarianceFunction(
        double X,
        double Y
    ) const
    {
        // use diagonal matrix
        double result;
        result = m_scale * exp(-std::norm(X - Y) / (m_sigma * m_sigma));
        return result;
    }

    template<class T>
    auto operator+(const T b) const {
        return make_foo_kernel([b, this](double X, double Y) -> double {
            auto debugBval = b.covarianceFunction(X, Y);
            auto debugAval = this->covarianceFunction(X, Y);
            auto test = debugBval + debugAval;
            return test;
        });
    }
private:
    double m_sigma;
    double m_scale;
};

int main()
{
    auto C = GaussianKernel(50, 60) + GaussianKernel(100, 200);
    auto result = C.covarianceFunction(30.0, 40.0);

    return 0;
}


What is the performance penalty of using std::vector in C++?

Given the abstraction it provides, the C++ std::vector is as efficient as it gets: three pointers on the stack, plus dynamically allocated data. Because the capacity expands geometrically (by a factor of 1.5 to 2) rather than linearly, appending an element costs amortized constant time: only O(log n) reallocations happen over n appends.
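This amortized behavior is easy to observe. The following sketch (count_reallocations is a hypothetical helper written for this illustration) counts capacity changes while appending n elements; the exact count depends on the implementation's growth factor:

```cpp
#include <cstddef>
#include <vector>

// Count how many times the capacity changes while pushing n elements.
// Geometric growth means the count grows like log(n), i.e. each
// individual push_back is amortized O(1).
int count_reallocations(std::size_t n) {
    std::vector<int> v;
    std::size_t cap = v.capacity();
    int reallocs = 0;
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != cap) {
            ++reallocs;
            cap = v.capacity();
        }
    }
    return reallocs;
}
```

For 100,000 elements, typical implementations report only a few dozen reallocations rather than one per element.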

The C equivalent using malloc() and realloc() would be at least as expensive, and more cumbersome (manual resizing, etc.). Moreover, std::vector allows user-defined performance tuning through special allocators (pool-based, stack-allocated, etc.), which in C++11 are not as hard to use as they were in C++98.

If you don't need dynamic resizing, you can use a static array in both C and C++ (or std::array in C++).

In general, for high-performance computing, C++ has more potential for optimization, in particular through the use of function objects that can be inlined (in contrast to regular C function pointers). The canonical example is sorting:

int comp( const void* a, const void* b ) {
    return /* your comparison here */;
}

// C style sorting
qsort( arr, LARGE_SIZE, sizeof( int ), comp );
//                                     ^^^^ <---- no inlining through function pointer

// C++11 style sorting (use a hand-made function object for C++98)
std::sort(std::begin(arr), std::end(arr), [](int a, int b) {
    return comp(&a, &b);
    //     ^^^^ <----- C++11 lambdas can be fully inlined
});

