Why Can Lambdas Be Better Optimized by the Compiler Than Plain Functions

Why can lambdas be better optimized by the compiler than plain functions?

The reason is that lambdas are function objects so passing them to a function template will instantiate a new function specifically for that object. The compiler can thus trivially inline the lambda call.

For functions, on the other hand, the old caveat applies: a function pointer gets passed to the function template, and compilers traditionally have a lot of problems inlining calls via function pointers. They can theoretically be inlined, but only if the surrounding function is inlined as well.

As an example, consider the following function template:

template <typename Iter, typename F>
void map(Iter begin, Iter end, F f) {
    for (; begin != end; ++begin)
        *begin = f(*begin);
}

Calling it with a lambda like this:

int a[] = { 1, 2, 3, 4 };
map(begin(a), end(a), [](int n) { return n * 2; });

Results in this instantiation (created by the compiler):

template <>
void map<int*, _some_lambda_type>(int* begin, int* end, _some_lambda_type f) {
    for (; begin != end; ++begin)
        *begin = f.operator()(*begin);
}

… the compiler knows _some_lambda_type::operator () and can inline calls to it trivially. (And invoking the function map with any other lambda would create a new instantiation of map since each lambda has a distinct type.)

But when called with a function pointer, the instantiation looks as follows:

template <>
void map<int*, int (*)(int)>(int* begin, int* end, int (*f)(int)) {
    for (; begin != end; ++begin)
        *begin = f(*begin);
}

… and here f points to a different address for each call to map and thus the compiler cannot inline calls to f unless the surrounding call to map has also been inlined so that the compiler can resolve f to one specific function.

How does a compiler treat lambdas differently than regular functions?

Removing some extraneous stuff from your snippet, we get

template<typename T>
constexpr void foo(T t)
{
    constexpr int i = t();
}

constexpr int f() { return 42; }
auto l = []{ return 42; }

Remarks:

You are attempting to use t() as a constexpr within foo. foo could in fact be a normal function and still behave the same.
A lambda isn't a function. It is an anonymous struct with an operator().
```
struct L { constexpr int operator()() const { return 42; } };
```
T is deduced as a type equivalent to L in the call foo(l).
T is deduced as int(*)() in the call foo(f).
In both cases, t isn't a constexpr within foo.

From [expr.const]

An expression e is a core constant expression unless the evaluation of e, following the rules of the abstract machine, would evaluate one of the following expressions: [...]

an lvalue-to-rvalue conversion unless [...]

a non-volatile glvalue that refers to a non-volatile object defined with constexpr, or that refers to a non-mutable subobject of such an object, or

a non-volatile glvalue of literal type that refers to a non-volatile object whose lifetime began within the evaluation of e;

Calling through a int(*)() requires a lvalue-to-rvalue conversion. t isn't an object defined with constexpr, neither did it begin its lifetime within the evaluation of t(). Therefore t() isn't a constexpr in foo(f).

Calling operator() doesn't require a lvalue-to-rvalue conversion. Since there is no member access, it is simply a constexpr function call. Therefore t() is a constexpr in foo(l).

_{There is one more step from core constant expressions to constant expressions, but isn't important for the discussion here.}

C++11 Performance: Lambda inlining vs Function template specialization

If the compiler can track "this function pointer points to this function", the compiler can inline the call through the function pointer.

Sometimes compilers can do this. Sometimes they cannot.

Unless you store a lambda in a function pointer, std::function, or similar type-erasing wrapper, the compiler at the point where the lambda is called knows the type of the lambda, so knows the body of the lambda. The compiler can trivially inline the function call.

Nothing about using a function template changes this, except if the argument is constexpr like a function non-type template parameter:

template <int func(int, int)>

this is an example of that. Here, the function template, in the body of the function, is guaranteed to be known at compile time.

Pass that func anywhere else, however, and the compiler can lose track of it.

In any case, any speed difference is going to be highly context dependent. And sometimes the larger binary size caused by inlining of a lambda will cause more slowdown than the inability to inline a function pointer, so performance can go the other way.

Any universal claims like you are trying to make is going to be wrong sometimes.

How to help the compiler to optimize lambda calls?

Do not use std::function, use templated argument for body so the compiler can actually see the lambda when compiling forXY. Otherwise it resorts to type erasure and virtual calls inside std::function. – yeputons

Using this idea, here is the improved code:

template <typename F>
static inline void forXY(int v, F body) noexcept {
  for(int y = 0 ; y<v ; y++) {
    for(int x = 0 ; x<v ; x++) {
      body(x, y);
    }
  }
}

Why would a C++ compiler fail to inline a lambda passed to a function template?

It is only a matter of filling enough stuff into the lambda and using it at least twice (otherwise there is no good reason not to inline).

Here for GCC 9.2 and Clang 9 with -O3:

#include<iostream>

int a, b, c;

template <typename F>
static inline void
hof(const F fun) {
    fun(a, b, c);
    fun(a, b, c);
}
void caller() {
    hof([&](int a, int b, int c) {
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
        std::cout << "Hello!";
    });
}

And the assembly for caller looks like this:

GCC:

caller():
        sub     rsp, 8
        call    caller()::{lambda(int, int, int)#1}::operator()(int, int, int) const [clone .isra.0]
        call    caller()::{lambda(int, int, int)#1}::operator()(int, int, int) const [clone .isra.0]
        add     rsp, 8
        ret

Clang:

caller():                             # @caller()
        push    rax
        call    caller()::$_0::operator()(int, int, int) const
        pop     rax
        jmp     caller()::$_0::operator()(int, int, int) const # TAILCALL

See godbolt here.

These are exactly as many repetitions in the lambda as I needed to convince GCC that inlining twice is not worth it.

Clang stopped inlining already with fewer repetitions.

How effectively can function-local lambdas be inlined by C++ compilers?

Yes.

Modern compilers use "static single assignment" (SSA) as an optimization pass.

Each time you assign to a value or modify it, a conceptually different value is created. Sometimes these conceptually different values share identity (for the purpose of pointers-to).

Identity, when you take the address of something, is the thing that gets in the way of this.

Simple references are turned into aliases for the value they reference; they have no identity. This is part of the original design intent for references, and why you cannot have a pointer to a reference.

Concretely:

std::string printSomeNumbers(void)
{
  std::ostringstream ss;
  const auto appendNumber = [&ss](auto number) {
    ss << number << "\n"; // Pretend this is something non-trivial
  };

  printf("hello\n");
  appendNumber(1);
  printf("world\n");
  appendNumber(2.0);
  printf("today\n");

  return ss.str();
}

compiles to:

printSomeNumbers[abi:cxx11]():           # @printSomeNumbers[abi:cxx11]()
        push    r14
        push    rbx
        sub     rsp, 376
        mov     r14, rdi
        mov     rbx, rsp
        mov     rdi, rbx
        mov     esi, 16
        call    std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >::basic_ostringstream(std::_Ios_Openmode)
        mov     edi, offset .Lstr
        call    puts
        mov     rdi, rbx
        mov     esi, 1
        call    std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
        mov     esi, offset .L.str.3
        mov     edx, 1
        mov     rdi, rax
        call    std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
        mov     edi, offset .Lstr.8
        call    puts
        mov     rdi, rsp
        movsd   xmm0, qword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero
        call    std::basic_ostream<char, std::char_traits<char> >& std::basic_ostream<char, std::char_traits<char> >::_M_insert<double>(double)
        mov     esi, offset .L.str.3
        mov     edx, 1
        mov     rdi, rax
        call    std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
        mov     edi, offset .Lstr.9
        call    puts
        lea     rsi, [rsp + 8]
        mov     rdi, r14
        call    std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::str() const
        mov     rax, qword ptr [rip + VTT for std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >]
        mov     qword ptr [rsp], rax
        mov     rcx, qword ptr [rip + VTT for std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >+24]
        mov     rax, qword ptr [rax - 24]
        mov     qword ptr [rsp + rax], rcx
        mov     qword ptr [rsp + 8], offset vtable for std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >+16
        mov     rdi, qword ptr [rsp + 80]
        lea     rax, [rsp + 96]
        cmp     rdi, rax
        je      .LBB0_7
        call    operator delete(void*)
.LBB0_7:
        mov     qword ptr [rsp + 8], offset vtable for std::basic_streambuf<char, std::char_traits<char> >+16
        lea     rdi, [rsp + 64]
        call    std::locale::~locale() [complete object destructor]
        lea     rdi, [rsp + 112]
        call    std::ios_base::~ios_base() [base object destructor]
        mov     rax, r14
        add     rsp, 376
        pop     rbx
        pop     r14
        ret

Godbolt

Notice that between the printf calls (in the assembly they are puts) there is no call other than directly to a operator<< of ostringstream.

Helper functions: lambdas vs normal functions

Your first and foremost concern should be readability (and maintainability)!
Which of regular or lambda functions is more readable strongly depends on the given problem (and a bit on the taste of the reader/maintainer).

Don't be concerned about performance until you find that performance actually is an issue! If performance is an issue, start by benchmarking, not by guessing which implementation you think is faster (in many situations compilers are pretty good at optimizing).

Lambda VS Function

I think lambda is elegant when using stl functions like , or you want quick throw away functions without naming them .

sort(v.begin(),
     v.end(),
     [](int a, int b){ return a > b; }
);

but it's not faster from the function .

Disassembly of both .

    helloWorld1();
008112A0  mov         ecx,dword ptr ds:[813054h]  
008112A6  push        8119A0h  
008112AB  call        std::operator<<<std::char_traits<char> > (0811780h)  
008112B0  mov         ecx,eax  
008112B2  call        dword ptr ds:[813038h]  
    helloWorld2();
008112B8  mov         ecx,dword ptr ds:[813054h]  
008112BE  push        8119A0h  
008112C3  call        std::operator<<<std::char_traits<char> > (0811780h)  
008112C8  mov         ecx,eax  
008112CA  call        dword ptr ds:[813038h]

both have the same disassembly.

Does the compiler really optimize to make these two functions the same assembly?

This function:

float a() {
    A a{55,67,12};
    return a.bar();
}

Has exactly the same observable behavior as this one:

float a() {
    return sqrt(55+67+12);
}

The same is true for b(). Further, sqrt(55+67+12) == sqrt(134) == 11.5758369028.

Binary representation of the IEEE-754 floating point value 11.5758369028 is 01000001001110010011011010100001. And that binary as integer is 1094268577.

The compiler applied the so-called as if rule to replace both functions with assembly that has the exact same observable behavior as the original code: Both functions return a float with value 11.5758369028.

Why Can Lambdas Be Better Optimized by the Compiler Than Plain Functions