Performance of Std::Function Compared to Raw Function Pointer and Void* This

Performance of std::function compared to raw function pointer and void* this?

I wondered myself quite frequently already, so I started writing some very minimal benchmark that attempts to simulate the performance by looped atomic counters for each function-pointer callback version.

Keep in mind, these are bare calls to functions that do only one thing, atomically incrementing its counter;

By checking the generated assembler output you may find out, that a bare C-function pointer loop is compiled into 3 CPU instructions;

a C++11's std::function call just adds 2 more CPU instructions, thus 5 in our example. As a conclusion: it absolutely doesn't matter what way of function pointer technique you use, the overhead differences are in any case very small.

((Confusing however is that the assigned lambda expression seems to run faster than the others, even than the C-one.))

Compile the example with: clang++ -o tests/perftest-fncb tests/perftest-fncb.cpp -std=c++11 -pthread -lpthread -lrt -O3 -march=native -mtune=native

#include <functional>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

typedef unsigned long long counter_t;

struct Counter {
    volatile counter_t bare;
    volatile counter_t cxx;
    volatile counter_t cxo1;
    volatile counter_t virt;
    volatile counter_t lambda;

    Counter() : bare(0), cxx(0), cxo1(0), virt(0), lambda(0) {}
} counter;

void bare(Counter* counter) { __sync_fetch_and_add(&counter->bare, 1); }
void cxx(Counter* counter) { __sync_fetch_and_add(&counter->cxx, 1); }

struct CXO1 {
    void cxo1(Counter* counter) { __sync_fetch_and_add(&counter->cxo1, 1); }
    virtual void virt(Counter* counter) { __sync_fetch_and_add(&counter->virt, 1); }
} cxo1;

void (*bare_cb)(Counter*) = nullptr;
std::function<void(Counter*)> cxx_cb;
std::function<void(Counter*)> cxo1_cb;
std::function<void(Counter*)> virt_cb;
std::function<void(Counter*)> lambda_cb;

void* bare_main(void* p) { while (true) { bare_cb(&counter); } }
void* cxx_main(void* p) { while (true) { cxx_cb(&counter); } }
void* cxo1_main(void* p) { while (true) { cxo1_cb(&counter); } }
void* virt_main(void* p) { while (true) { virt_cb(&counter); } }
void* lambda_main(void* p) { while (true) { lambda_cb(&counter); } }

int main()
{
    pthread_t bare_thread;
    pthread_t cxx_thread;
    pthread_t cxo1_thread;
    pthread_t virt_thread;
    pthread_t lambda_thread;

    bare_cb = &bare;
    cxx_cb = std::bind(&cxx, std::placeholders::_1);
    cxo1_cb = std::bind(&CXO1::cxo1, &cxo1, std::placeholders::_1);
    virt_cb = std::bind(&CXO1::virt, &cxo1, std::placeholders::_1);
    lambda_cb = [](Counter* counter) { __sync_fetch_and_add(&counter->lambda, 1); };

    pthread_create(&bare_thread, nullptr, &bare_main, nullptr);
    pthread_create(&cxx_thread, nullptr, &cxx_main, nullptr);
    pthread_create(&cxo1_thread, nullptr, &cxo1_main, nullptr);
    pthread_create(&virt_thread, nullptr, &virt_main, nullptr);
    pthread_create(&lambda_thread, nullptr, &lambda_main, nullptr);

    for (unsigned long long n = 1; true; ++n) {
        sleep(1);
        Counter c = counter;

        printf(
            "%15llu bare function pointer\n"
            "%15llu C++11 function object to bare function\n"
            "%15llu C++11 function object to object method\n"
            "%15llu C++11 function object to object method (virtual)\n"
            "%15llu C++11 function object to lambda expression %30llu-th second.\n\n",
            c.bare, c.cxx, c.cxo1, c.virt, c.lambda, n
        );
    }
}

Should I use std::function or a function pointer in C++?

In short, use std::function unless you have a reason not to.

Function pointers have the disadvantage of not being able to capture some context. You won't be able to for example pass a lambda function as a callback which captures some context variables (but it will work if it doesn't capture any). Calling a member variable of an object (i.e. non-static) is thus also not possible, since the object (this-pointer) needs to be captured.⁽¹⁾

std::function (since C++11) is primarily to store a function (passing it around doesn't require it to be stored). Hence if you want to store the callback for example in a member variable, it's probably your best choice. But also if you don't store it, it's a good "first choice" although it has the disadvantage of introducing some (very small) overhead when being called (so in a very performance-critical situation it might be a problem but in most it should not). It is very "universal": if you care a lot about consistent and readable code as well as don't want to think about every choice you make (i.e. want to keep it simple), use std::function for every function you pass around.

Think about a third option: If you're about to implement a small function which then reports something via the provided callback function, consider a template parameter, which can then be any callable object, i.e. a function pointer, a functor, a lambda, a std::function, ... Drawback here is that your (outer) function becomes a template and hence needs to be implemented in the header. On the other hand you get the advantage that the call to the callback can be inlined, as the client code of your (outer) function "sees" the call to the callback will the exact type information being available.

Example for the version with the template parameter (write & instead of && for pre-C++11):

template <typename CallbackFunction>
void myFunction(..., CallbackFunction && callback) {
    ...
    callback(...);
    ...
}

As you can see in the following table, all of them have their advantages and disadvantages:

	function ptr	std::function	template param
can capture context variables	no¹	yes	yes
no call overhead (see comments)	yes	no	yes
can be inlined (see comments)	no	no	yes
can be stored in a class member	yes	yes	no²
can be implemented outside of header	yes	yes	no
supported without C++11 standard	yes	no³	yes
nicely readable (my opinion)	no	yes	(yes)

Difference between std::function and a standard function pointer?

A function pointer is the address of an actual function defined in C++. An std::function is a wrapper that can hold any type of callable object (objects that can be used like functions).

struct FooFunctor
{
    void operator()(int i) {
        std::cout << i;
    }
};

// Since `FooFunctor` defines `operator()`, it can be used as a function
FooFunctor func;
std::function<void (int)> f(func);

Here, std::function allows you to abstract away exactly what kind of callable object it is you are dealing with — you don't know it's FooFunctor, you just know that it returns void and has one int parameter.

A real-world example where this abstraction is useful is when you are using C++ together with another scripting language. You might want to design an interface that can deal with both functions defined in C++, as well as functions defined in the scripting language, in a generic way.

Edit: Binding

Alongside std::function, you will also find std::bind. These two are very powerful tools when used together.

void func(int a, int b) {
    // Do something important
}

// Consider the case when you want one of the parameters of `func` to be fixed
// You can used `std::bind` to set a fixed value for a parameter; `bind` will
// return a function-like object that you can place inside of `std::function`.

std::function<void (int)> f = std::bind(func, _1, 5);

In that example, the function object returned by bind takes the first parameter, _1, and passes it to func as the a parameter, and sets b to be the constant 5.

Why std::function is too slow is CPU can't utilize instruction reordering?

Code wrapped into std::function is always slower than inlining code directly into calling place. Especially if your code is very short, like 3-5 CPU instructions.

If your function's code is quite big, hundreds of instructions then there will be no difference whether to use std::function or some other mechanism of calling/wrapping code.

std::function code is not inlined. Using std::function wrapper has almost same speed overhead like using virtual methods in class. More than that std::function mechanism looks very much like virtual call mechanism, in both cases code is not inlined and pointer to code is used to call it with assembler's call instruction.

If you really need speed then use lambdas and pass them around as templated parameters, like below. Lambdas are always inlined if possible (and if compiler decides that it will improve speed).

Try it online!

#include <functional>

template <typename F>
void __attribute__((noinline)) use_lambda(F const & f) {
    auto volatile a = f(13); // call f
    // ....
    auto volatile b = f(7); // call f again
}

void __attribute__((noinline)) use_func(
        std::function<int(int)> const & f) {
    auto volatile a = f(11); // call f
    // ....
    auto volatile b = f(17); // call f again
}

int main() {
    int x = 123;
    auto f = [&](int y){ return x + y; };
    use_lambda(f); // Pass lambda
    use_func(f); // Pass function
}

If you look at assembler code of above example (click Try-it-online link above) then you can see that lambda code was inlined while std::function code wasn't.

Template params are always faster than other solutions, you should always use templates everywhere where you need polymorphism while having high performance.

Why does the implementation of std::any use a function pointer + function op codes, instead of a pointer to a virtual table + virtual calls?

Consider a typical use case of a std::any: You pass it around in your code, move it dozens of times, store it in a data structure and fetch it again later. In particular, you'll likely return it from functions a lot.

As it is now, the pointer to the single "do everything" function is stored right next to the data in the any. Given that it's a fairly small type (16 bytes on GCC x86-64), any fits into a pair of registers. Now, if you return an any from a function, the pointer to the "do everything" function of the any is already in a register or on the stack! You can just jump directly to it without having to fetch anything from memory. Most likely, you didn't even have to touch memory at all: You know what type is in the any at the point you construct it, so the function pointer value is just a constant that's loaded into the appropriate register. Later, you use the value of that register as your jump target. This means there's no chance for misprediction of the jump because there is nothing to predict, the value is right there for the CPU to consume.

In other words: The reason that you get the jump target for free with this implementation is that the CPU must have already touched the any in some way to obtain it in the first place, meaning that it already knows the jump target and can jump to it with no additional delay.

That means there really is no indirection to speak of with the current implementation if the any is already "hot", which it will be most of the time, especially if it's used as a return value.

On the other hand, if you use a table of function pointers somewhere in a read-only section (and let the any instance point to that instead), you'll have to go to memory (or cache) every single time you want to move or access it. The size of an any is still 16 bytes in this case but fetching values from memory is much, much slower than accessing a value in a register, especially if it's not in a cache. In a lot of cases, moving an any is as simple as copying its 16 bytes from one location to another, followed by zeroing out the original instance. This is pretty much free on any modern CPU. However, if you go the pointer table route, you'll have to fetch from memory every time, wait for the reads to complete, and then do the indirect call. Now consider that you'll often have to do a sequence of calls on the any (i.e. move, then destruct) and this will quickly add up. The problem is that you don't just get the address of the function you want to jump to for free every time you touch the any, the CPU has to fetch it explicitly. Indirect jumps to a value read from memory are quite expensive since the CPU can only retire the jump operation once the entire memory operation has finished. That doesn't just include fetching a value (which is potentially quite fast because of caches) but also address generation, store forwarding buffer lookup, TLB lookup, access validation, and potentially even page table walks. So even if the jump address is computed quickly, the jump won't retire for quite a long while. In general, "indirect-jump-to-address-from-memory" operations are among the worst things that can happen to a CPU's pipeline.

TL;DR: As it is now, returning an any doesn't stall the CPU's pipeline (the jump target is already available in a register so the jump can retire pretty much immediately). With a table-based solution, returning an any will stall the pipeline twice: Once to fetch the address of the move function, then another time to fetch the destructor. This delays retirement of the jump quite a bit since it'll have to wait not only for the memory value but also for the TLB and access permission checks.

Code memory accesses, on the other hand, aren't affected by this since the code is kept in microcode form anyway (in the µOp cache). Fetching and executing a few conditional branches in that switch statement is therefore quite fast (and even more so when the branch predictor gets things right, which it almost always does).

Is a unique_ptr equipped with a function pointer as custom deleter the same size as a shared_ptr?

It is true (I doubt that this is required by the standard though) that

static_assert(sizeof(std::unique_ptr<int, void(*)(int*)>) == sizeof(std::shared_ptr<int>));

but I don't agree with the conclusion that storing a function pointer as the custom deleter renders the advantages of std::unique_ptr over std::shared_ptr useless. Both types model very different ownership semantics, and choosing one over the other has not so much to do with performance, but rather with how you intend to handle the pointee instances.

Performance-wise, std::unique_ptr will always be more efficient than std::shared_ptr. While this is primarily due to thread-safe reference counting of the latter, it is also true for custom deleters:

std::unique_ptr stores the deleter in-place, i.e., on the stack. Invoking this deleter is likely to be faster than one that lives in a heap-allocated block together with pointee and reference count.
std::unique_ptr also doesn't type-erase the deleter. A lambda that is baked into the type is certainly more efficient than the indirection required to hide the deleter type as in std::shared_ptr.

C function pointers with C++11 lambdas

Using a void* is typical of C callback interfaces to pass some "state" to the function. However, std::function does not need this because std::function supports "stateful functions". So, you could do something like this:

double Integrate(
          std::function<double(double)> func,
          double a, double b)
{
    typedef std::function<double(double)> fun_type;
    :::
    F.function = [](double x, void* p){
        return (*static_cast<fun_type*>(p))(x);
    };
    F.params = &func;
    :::
}

and store a reference to the parameter vector as part of the functor that will be encapsulated in the std::function object or do something like this:

void Another_function()
{
    double m = 2;
    double b = 3;
    auto func = [&](double x){return m*x+b};
    auto r1 = Integrate(func,0,3);
    :::
}

However, this solution would use rather many indirections. GSL would invoke your lambda. Your lambda would invoke the std::function<>::operator() which in turn would inwoke some kind of virtual function that is used for type erasure which would in turn invoke the actual computation.

So, if you care about performance, you could get rid of a couple of layers there, specifically std::function. Here's a another approach with a function template:

template<class Func>
double Integrate(
          Func func,
          double a, double b)
{
    :::
    F.function = [](double x, void* p)->double{
        return (*static_cast<Func*>(p))(x);
    };
    F.params = &func;
    :::
}

I guess I would prefer this over the std::function solution.

Whether to pass shared pointer or raw pointer to a function

I would suggest the following approach to looking at code something like this:

Sample Image

There are some other options like weak_ptr, but for this it is probably not worth looking at.

So for your example, we can see that ThirdParty_DoStuff does not take ownership, so we won't either, so you can choose between a reference and a pointer depending on if the argument is mandatory or not respectively.

Performance of Std::Function Compared to Raw Function Pointer and Void* This