5 Years Later, Is There Something Better Than the "Fastest Possible C++ Delegates"

Update: An article with the complete source code and a more detailed discussion has been posted on The Code Project.

Well, the problem with pointers to methods is that they're not all the same size. So instead of storing pointers to methods directly, we need to "standardize" them so that they are of a constant size. This is what Don Clugston attempts to achieve in his Code Project article. He does so using intimate knowledge of the most popular compilers. I assert that it's possible to do it in "normal" C++ without requiring such knowledge.

Consider the following code:

void DoSomething(int)
{
}

void InvokeCallback(void (*callback)(int))
{
callback(42);
}

int main()
{
InvokeCallback(&DoSomething);
return 0;
}

This is one way to implement a callback using a plain old function pointer. However, this doesn't work for methods in objects. Let's fix this:

class Foo
{
public:
void DoSomething(int) {}

static void DoSomethingWrapper(void* obj, int param)
{
static_cast<Foo*>(obj)->DoSomething(param);
}
};

void InvokeCallback(void* instance, void (*callback)(void*, int))
{
callback(instance, 42);
}

int main()
{
Foo f;
InvokeCallback(static_cast<void*>(&f), &Foo::DoSomethingWrapper);
return 0;
}

Now we have a system of callbacks that works for both free and member functions. This, however, is clumsy and error-prone. But there is a pattern - the use of a wrapper function to "forward" the static function call to a method call on the proper instance. We can use templates to help with this - let's try generalizing the wrapper function:

template<typename R, class T, typename A1, R (T::*Func)(A1)>
R Wrapper(void* o, A1 a1)
{
return (static_cast<T*>(o)->*Func)(a1);
}

class Foo
{
public:
void DoSomething(int) {}
};

void InvokeCallback(void* instance, void (*callback)(void*, int))
{
callback(instance, 42);
}

int main()
{
Foo f;
InvokeCallback(static_cast<void*>(&f),
&Wrapper<void, Foo, int, &Foo::DoSomething> );
return 0;
}

This is still extremely clumsy, but at least now we don't have to write out a wrapper function every single time (at least for the one-argument case). Another thing we can generalize is the fact that we're always passing the instance around as a void* alongside the function pointer. Instead of passing them as two separate values, why not put them together? And while we're doing that, why not generalize it as well? Hey, let's throw in an operator()() so we can call it like a function!

template<typename R, typename A1>
class Callback
{
public:
typedef R (*FuncType)(void*, A1);

Callback(void* o, FuncType f) : obj(o), func(f) {}
R operator()(A1 a1) const
{
return (*func)(obj, a1);
}

private:
void* obj;
FuncType func;
};

template<typename R, class T, typename A1, R (T::*Func)(A1)>
R Wrapper(void* o, A1 a1)
{
return (static_cast<T*>(o)->*Func)(a1);
}

class Foo
{
public:
void DoSomething(int) {}
};

void InvokeCallback(Callback<void, int> callback)
{
callback(42);
}

int main()
{
Foo f;
Callback<void, int> cb(static_cast<void*>(&f),
&Wrapper<void, Foo, int, &Foo::DoSomething>);
InvokeCallback(cb);
return 0;
}

We're making progress! But now our problem is that the syntax is absolutely horrible. It also appears redundant; can't the compiler figure out the types from the pointer to method itself? Unfortunately no, but we can help it along. Remember that a compiler can deduce types via template argument deduction in a function call. So why don't we start with that?

template<typename R, class T, typename A1>
void DeduceMemCallback(R (T::*)(A1)) {}

Inside the function, we know what R, T and A1 are. So what if we construct a struct that can "hold" these types, and return it from the function?

template<typename R, class T, typename A1>
struct DeduceMemCallbackTag
{
};

template<typename R, class T, typename A1>
DeduceMemCallbackTag<R, T, A1> DeduceMemCallback(R (T::*)(A1))
{
return DeduceMemCallbackTag<R, T, A1>();
}

And since DeduceMemCallbackTag knows about the types, why not put our wrapper function as a static function in it? And since the wrapper function is in it, why not put the code to construct our Callback object in it?

template<typename R, typename A1>
class Callback
{
public:
typedef R (*FuncType)(void*, A1);

Callback(void* o, FuncType f) : obj(o), func(f) {}
R operator()(A1 a1) const
{
return (*func)(obj, a1);
}

private:
void* obj;
FuncType func;
};

template<typename R, class T, typename A1>
struct DeduceMemCallbackTag
{
template<R (T::*Func)(A1)>
static R Wrapper(void* o, A1 a1)
{
return (static_cast<T*>(o)->*Func)(a1);
}

template<R (T::*Func)(A1)>
inline static Callback<R, A1> Bind(T* o)
{
return Callback<R, A1>(o, &DeduceMemCallbackTag::Wrapper<Func>);
}
};

template<typename R, class T, typename A1>
DeduceMemCallbackTag<R, T, A1> DeduceMemCallback(R (T::*)(A1))
{
return DeduceMemCallbackTag<R, T, A1>();
}

The C++ standard allows us to call static functions on instances (!):

class Foo
{
public:
void DoSomething(int) {}
};

void InvokeCallback(Callback<void, int> callback)
{
callback(42);
}

int main()
{
Foo f;
InvokeCallback(
DeduceMemCallback(&Foo::DoSomething)
.Bind<&Foo::DoSomething>(&f)
);
return 0;
}

Still, it's a lengthy expression, but we can put that into a simple macro (!):

template<typename R, typename A1>
class Callback
{
public:
typedef R (*FuncType)(void*, A1);

Callback(void* o, FuncType f) : obj(o), func(f) {}
R operator()(A1 a1) const
{
return (*func)(obj, a1);
}

private:
void* obj;
FuncType func;
};

template<typename R, class T, typename A1>
struct DeduceMemCallbackTag
{
template<R (T::*Func)(A1)>
static R Wrapper(void* o, A1 a1)
{
return (static_cast<T*>(o)->*Func)(a1);
}

template<R (T::*Func)(A1)>
inline static Callback<R, A1> Bind(T* o)
{
return Callback<R, A1>(o, &DeduceMemCallbackTag::Wrapper<Func>);
}
};

template<typename R, class T, typename A1>
DeduceMemCallbackTag<R, T, A1> DeduceMemCallback(R (T::*)(A1))
{
return DeduceMemCallbackTag<R, T, A1>();
}

#define BIND_MEM_CB(memFuncPtr, instancePtr) \
(DeduceMemCallback(memFuncPtr).Bind<(memFuncPtr)>(instancePtr))

class Foo
{
public:
void DoSomething(int) {}
};

void InvokeCallback(Callback<void, int> callback)
{
callback(42);
}

int main()
{
Foo f;
InvokeCallback(BIND_MEM_CB(&Foo::DoSomething, &f));
return 0;
}

We can enhance the Callback object by adding a safe-bool conversion. It's also a good idea to disable the equality operators, since it's not possible to meaningfully compare two Callback objects. Even better is to use partial specialization to allow for a "preferred syntax". This gives us:

template<typename FuncSignature>
class Callback;

template<typename R, typename A1>
class Callback<R (A1)>
{
public:
typedef R (*FuncType)(void*, A1);

Callback() : obj(0), func(0) {}
Callback(void* o, FuncType f) : obj(o), func(f) {}

R operator()(A1 a1) const
{
return (*func)(obj, a1);
}

typedef void* Callback::*SafeBoolType;
operator SafeBoolType() const
{
return func != 0? &Callback::obj : 0;
}

bool operator!() const
{
return func == 0;
}

private:
void* obj;
FuncType func;
};

template<typename R, typename A1> // Undefined on purpose
void operator==(const Callback<R (A1)>&, const Callback<R (A1)>&);
template<typename R, typename A1>
void operator!=(const Callback<R (A1)>&, const Callback<R (A1)>&);

template<typename R, class T, typename A1>
struct DeduceMemCallbackTag
{
template<R (T::*Func)(A1)>
static R Wrapper(void* o, A1 a1)
{
return (static_cast<T*>(o)->*Func)(a1);
}

template<R (T::*Func)(A1)>
inline static Callback<R (A1)> Bind(T* o)
{
return Callback<R (A1)>(o, &DeduceMemCallbackTag::Wrapper<Func>);
}
};

template<typename R, class T, typename A1>
DeduceMemCallbackTag<R, T, A1> DeduceMemCallback(R (T::*)(A1))
{
return DeduceMemCallbackTag<R, T, A1>();
}

#define BIND_MEM_CB(memFuncPtr, instancePtr) \
(DeduceMemCallback(memFuncPtr).Bind<(memFuncPtr)>(instancePtr))

Usage example:

class Foo
{
public:
float DoSomething(int n) { return n / 100.0f; }
};

float InvokeCallback(int n, Callback<float (int)> callback)
{
if(callback) { return callback(n); }
return 0.0f;
}

int main()
{
Foo f;
float result = InvokeCallback(97, BIND_MEM_CB(&Foo::DoSomething, &f));
// result == 0.97
return 0;
}

I have tested this on the Visual C++ compiler (version 15.00.30729.01, the one that comes with VS 2008), and you do need a rather recent compiler to use the code. By inspection of the disassembly, the compiler was able to optimize away the wrapper function and the DeduceMemCallback call, reducing down to simple pointer assignments.

It's simple to use for both sides of the callback, and uses only (what I believe to be) standard C++. The code I've shown above works for member functions with 1 argument, but can be generalized to more arguments. It can also be further generalized by allowing support for static functions.

Note that the Callback object requires no heap allocation - it is of constant size thanks to this "standardization" procedure. Because of this, and because it has a default constructor, a Callback object can be a member of a larger class. It is also assignable (the compiler-generated copy assignment functions are sufficient). It is also typesafe, thanks to the templates.

FastDelegate and lambdas - can't get them to work (Don Clugston's fastest possible delegates)

You have diagnosed the situation well: you need to store the state.

Since the lambda is a temporary object, you are actually allowed to move from it (normally), which should be preferred to a copy where possible (because a move is more general than a copy).

Now, all you need to do is reserve some storage for it, and if this requires a dynamic allocation you might indeed see a performance degradation. On the other hand, the object needs a fixed footprint, so what can be done?

One possible solution is to offer a configurable (but limited) storage capacity:

static size_t const Size = 32;
static size_t const Alignment = alignof(std::max_align_t);

typedef std::aligned_storage<Size, Alignment>::type Storage;
Storage storage;

Now you can (using reinterpret_cast as necessary) store your lambda within storage, provided its size fits (which can be checked with a static_assert).

I finally managed to get a working example (I had to restart from scratch, because god is that fast delegate code verbose!); you can see it in action here (and the code is below).

I have only scratched the surface, notably because it lacks copy and move operators. To do so properly, those operations need to be added to the handler, following the same pattern as the two existing operations.

Code:

#include <cstddef>

#include <iostream>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>

template <typename, size_t> class FastFunc;

template <typename R, typename... Args, size_t Size>
class FastFunc<R(Args...), Size> {
public:
template <typename F>
FastFunc(F f): handler(&Get<F>()) {
new (&storage) F(std::move(f));
}

~FastFunc() {
handler->destroy(&storage);
}

R operator()(Args&&... args) {
return handler->apply(&storage, std::forward<Args>(args)...);
}

private:
using Storage = typename std::aligned_storage<Size, alignof(std::max_align_t)>::type;

struct Handler {
R (*apply)(void*, Args&&...);
void (*destroy)(void*);
}; // struct Handler

template <typename F>
static R Apply(void* f, Args&&... args) {
return (*reinterpret_cast<F*>(f))(std::forward<Args>(args)...);
}

template <typename F>
static void Destroy(void* f) {
reinterpret_cast<F*>(f)->~F();
}

template <typename F>
Handler const& Get() {
static Handler const H = { &Apply<F>, &Destroy<F> };
return H;
} // Get

Handler const* handler;
Storage storage;
}; // class FastFunc

int main() {
FastFunc<void(), 32> stateless = []() { std::cout << "stateless\n"; };
stateless();

bool b = true;
FastFunc<void(), 32> stateful = [&b]() { std::cout << "stateful: " << b << "\n"; };
stateful();

b = false;
stateful();

return 0;
}

Have the ideas behind the Fast Delegate (et al) been used to optimize std::function?

In libstdc++'s std::function we use a union type that is suitably sized and aligned to store pointers, function pointers or pointers to member functions. We avoid a heap allocation for any function object that can be stored in that size and alignment, but only if it is "location invariant":

/**
* Trait identifying "location-invariant" types, meaning that the
* address of the object (or any of its members) will not escape.
* Also implies a trivial copy constructor and assignment operator.
*/

The code is based on the std::tr1::function implementation and that part hasn't changed significantly. I think that could be simplified using std::aligned_storage and could be improved by specializing the trait so that more types are identified as location invariant.

Invoking the target object is done without any virtual function calls, the type erasure is done by storing a single function pointer in the std::function which is the address of a function template specialization. All operations are done by calling that function template through the stored pointer and passing in an enum identifying what operation it is being asked to perform. This means no vtable and only a single function pointer needs to be stored in the object.

This design was contributed by the original boost::function author and I believe it is close to the boost implementation. See the Performance docs for Boost.Function for some rationale. That means it's pretty unlikely that GCC's std::function is any faster than boost::function, because it's a similar design by the same person.

N.B. our std::function doesn't support construction with an allocator yet; any allocations it needs to do will be done using new.


In response to Emile's comment expressing a desire to avoid a heap allocation for a std::function which holds a pointer to member function and an object, here's a little hack to do it (but you didn't hear it from me ;-)

struct A {
int i = 0;
int foo() const { return 0; }
};

struct InvokeA
{
int operator()() const { return a->foo(); }
A* a;
};

namespace std
{
template<> struct __is_location_invariant<InvokeA>
{ static const bool value = true; };
}

int main()
{
A a;
InvokeA inv{ &a };

std::function<int()> f2(inv);

return f2();
}

The trick is that InvokeA is small enough to fit in the function's small object buffer, and the trait specialization says it's safe to store in there, so the function holds a copy of that object directly, not on the heap. This requires a to persist as long as the pointer to it persists, but that would be the case anyway if the function's target was bind(&A::foo, &a).

Why is one delegate faster than the other?

Using a decompiler, we can discover the following:

Implementation of GetAction1<T>():

IL_0000: ldnull
IL_0001: ldftn void ConsoleApp1.UnderTest::ActionInt(int32)
IL_0007: newobj instance void class [System.Runtime]System.Action`1<int32>::.ctor(object, native int)
IL_000c: castclass class [System.Runtime]System.Action`1<!!0/*T*/>
IL_0011: ldftn instance void class [System.Runtime]System.Action`1<!!0/*T*/>::Invoke(!0/*T*/)
IL_0017: newobj instance void class ConsoleApp1.UnderTest/MyAction`1<!!0/*T*/>::.ctor(object, native int)
IL_001c: ret

Implementation of GetAction2<T>():

IL_0000: ldnull
IL_0001: ldftn void ConsoleApp1.UnderTest::ActionInt(int32)
IL_0007: newobj instance void class ConsoleApp1.UnderTest/MyAction`1<int32>::.ctor(object, native int)
IL_000c: castclass class ConsoleApp1.UnderTest/MyAction`1<!!0/*T*/>
IL_0011: ret

You can see in the first case that it is actually creating two delegates, and chaining one to the other.

In the second case it is only creating one delegate.

I can't explain the exact reason for this, but I would think that it's because of the extra cast to object in GetAction1.


There appears to be an even faster implementation, namely:

public static MyAction<T> GetAction3<T>()
    => x => ActionInt((int)(object)x);

This generates much longer IL code:

IL_0000: ldsfld       class ConsoleApp1.UnderTest/MyAction`1<!0/*T*/> class ConsoleApp1.UnderTest/'<>c__4`1'<!!0/*T*/>::'<>9__4_0'
IL_0005: dup
IL_0006: brtrue.s IL_0021
IL_0008: pop
IL_0009: ldsfld class ConsoleApp1.UnderTest/'<>c__4`1'<!0/*T*/> class ConsoleApp1.UnderTest/'<>c__4`1'<!!0/*T*/>::'<>9'
IL_000e: ldftn instance void class ConsoleApp1.UnderTest/'<>c__4`1'<!!0/*T*/>::'<GetAction3>b__4_0'(!0/*T*/)
IL_0014: newobj instance void class ConsoleApp1.UnderTest/MyAction`1<!!0/*T*/>::.ctor(object, native int)
IL_0019: dup
IL_001a: stsfld class ConsoleApp1.UnderTest/MyAction`1<!0/*T*/> class ConsoleApp1.UnderTest/'<>c__4`1'<!!0/*T*/>::'<>9__4_0'
IL_001f: stloc.0 // V_0
IL_0020: ldloc.0 // V_0
IL_0021: ret

And yet it is faster for both the call to GetAction3() and executing the action that it returns.

Here's the benchmark program I tested with:

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

namespace ConsoleApp1;

public static class Program
{
public static void Main()
{
var summary = BenchmarkRunner.Run<UnderTest>();
}
}

public class UnderTest
{
public delegate void MyAction<T>(T val);
public static int counter;

public static MyAction<T> GetAction1<T>()
=> new MyAction<T>((Action<T>)(object)ActionInt);

// Enigmativity's suggestion
public static MyAction<T> GetAction2<T>()
=> (MyAction<T>)(Delegate)(MyAction<int>)ActionInt;

public static MyAction<T> GetAction3<T>()
=> x => ActionInt((int)(object)x);

public static MyAction<int> Act1 = GetAction1<int>();
public static MyAction<int> Act2 = GetAction2<int>();
public static MyAction<int> Act3 = GetAction3<int>();

static void ActionInt(int val) { counter++; }

[Benchmark]
public void Action1()
{
_ = GetAction1<int>();
}

[Benchmark]
public void Action2()
{
_ = GetAction2<int>();
}

[Benchmark]
public void Action3()
{
_ = GetAction3<int>();
}

[Benchmark]
public void RunAction1()
{
Act1(0);
}

[Benchmark]
public void RunAction2()
{
Act2(0);
}

[Benchmark]
public void RunAction3()
{
Act3(0);
}
}

And the results:

|     Method |       Mean |     Error |    StdDev |
|----------- |-----------:|----------:|----------:|
| Action1 | 13.3355 ns | 0.1670 ns | 0.1480 ns |
| Action2 | 6.9685 ns | 0.1313 ns | 0.1228 ns |
| Action3 | 1.3437 ns | 0.0321 ns | 0.0285 ns |
| RunAction1 | 2.4100 ns | 0.0454 ns | 0.0425 ns |
| RunAction2 | 1.6493 ns | 0.0594 ns | 0.0527 ns |
| RunAction3 | 0.8347 ns | 0.0295 ns | 0.0276 ns |

Of course, none of the actions actually use the int passed to them, since they all just call ActionInt() which ignores its argument.

I suppose you could also implement it as:

public static MyAction<T> GetAction3<T>()
=> _ => ActionInt(0);

which might be even faster, but I haven't tried that.

Performance of C++/CLI function pointers versus .NET delegates

You are seeing the cost of "double thunking". The core problem with your DoIt() function is that it is being compiled as managed code. The delegate call is very fast; it is uncomplicated to go from managed code to managed code through a delegate. The function pointer is slow, however: the compiler automatically generates code that first switches from managed to unmanaged code and then makes the call through the function pointer, which ends up in a stub that switches from unmanaged code back to managed code and calls DoIt().

Presumably what you really meant to measure was a call to native code. Use a #pragma to force DoIt() to be generated as machine code, like this:

#pragma managed(push, off)
__int64 DoIt(int n, __int64 sum)
{
if ((n % 3) == 0)
return sum + n;
else
return sum + 1;
}
#pragma managed(pop)

You'll now see that the function pointer is faster than a delegate.

What is a C++ delegate?

You have an incredible number of choices to achieve delegates in C++. Here are the ones that came to my mind.


Option 1: functors

A function object may be created by implementing operator()

struct Functor
{
// Normal class/struct members

int operator()(double d) // Arbitrary return types and parameter list
{
return (int) d + 1;
}
};

// Use:
Functor f;
int i = f(3.14);

Option 2: lambda expressions (C++11 only)

// Syntax is roughly: [capture](parameter list) -> return type {block}
// Some shortcuts exist
auto func = [](int i) -> double { return 2*i/1.15; };
double d = func(1);

Option 3: function pointers

int f(double d) { ... }
typedef int (*MyFuncT) (double d);
MyFuncT fp = &f;
int a = fp(3.14);

Option 4: pointer to member functions (fastest solution)

See Fast C++ Delegate (on The Code Project).

struct DelegateList
{
int f1(double d) { return (int) d + 1; }
int f2(double d) { return (int) d + 2; }
};

typedef int (DelegateList::* DelegateType)(double d);

DelegateType d = &DelegateList::f1;
DelegateList list;
int a = (list.*d)(3.14);

Option 5: std::function

(or boost::function if your standard library doesn't support it). It is slower, but it is the most flexible.

#include <functional>
std::function<int(double)> f = [can be set to about anything in this answer]
// Usually more useful as a parameter to another functions

Option 6: binding (using std::bind)

Allows setting some parameters in advance, convenient to call a member function for instance.

struct MyClass
{
int DoStuff(double d); // actually a DoStuff(MyClass* this, double d)
};

std::function<int(double d)> f = std::bind(&MyClass::DoStuff, this, std::placeholders::_1);
// auto f = std::bind(...); in C++11

Option 7: templates

Accept anything as long as it matches the argument list.

template <class FunctionT>
int DoSomething(FunctionT func)
{
return func(3.14);
}

Performance of calling delegates vs methods

I haven't seen that effect - I've certainly never encountered it being a bottleneck.

Here's a very rough-and-ready benchmark which shows (on my box anyway) delegates actually being faster than interfaces:

using System;
using System.Diagnostics;

interface IFoo
{
int Foo(int x);
}

class Program : IFoo
{
const int Iterations = 1000000000;

public int Foo(int x)
{
return x * 3;
}

static void Main(string[] args)
{
int x = 3;
IFoo ifoo = new Program();
Func<int, int> del = ifoo.Foo;
// Make sure everything's JITted:
ifoo.Foo(3);
del(3);

Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < Iterations; i++)
{
x = ifoo.Foo(x);
}
sw.Stop();
Console.WriteLine("Interface: {0}", sw.ElapsedMilliseconds);

x = 3;
sw = Stopwatch.StartNew();
for (int i = 0; i < Iterations; i++)
{
x = del(x);
}
sw.Stop();
Console.WriteLine("Delegate: {0}", sw.ElapsedMilliseconds);
}
}

Results (.NET 3.5; .NET 4.0b2 is about the same):

Interface: 5068
Delegate: 4404

Now I don't have particular faith that that means delegates are really faster than interfaces... but it makes me fairly convinced that they're not an order of magnitude slower. Additionally, this is doing almost nothing within the delegate/interface method. Obviously the invocation cost is going to make less and less difference as you do more and more work per call.

One thing to be careful of is that you're not creating a new delegate several times where you'd only use a single interface instance. This could cause an issue as it would provoke garbage collection etc. If you're using an instance method as a delegate within a loop, you will find it more efficient to declare the delegate variable outside the loop, create a single delegate instance and reuse it. For example:

Func<int, int> del = myInstance.MyMethod;
for (int i = 0; i < 100000; i++)
{
MethodTakingFunc(del);
}

is more efficient than:

for (int i = 0; i < 100000; i++)
{
MethodTakingFunc(myInstance.MyMethod);
}

Could this have been the problem you were seeing?

Events in C#

Firing an event is a delegate invocation, which is a bit slower than a virtual call.

(chart: microsoft.com)

But dealing with interfaces for the subscriber/publisher/observer/observable scenario is more painful than using events.


