Why Can Lambdas Be Better Optimized by the Compiler Than Plain Functions

Why can lambdas be better optimized by the compiler than plain functions?

The reason is that lambdas are function objects so passing them to a function template will instantiate a new function specifically for that object. The compiler can thus trivially inline the lambda call.

For functions, on the other hand, the old caveat applies: a function pointer gets passed to the function template, and compilers traditionally have a lot of problems inlining calls via function pointers. They can theoretically be inlined, but only if the surrounding function is inlined as well.

As an example, consider the following function template:

template <typename Iter, typename F>
void map(Iter begin, Iter end, F f) {
for (; begin != end; ++begin)
*begin = f(*begin);
}

Calling it with a lambda like this:

int a[] = { 1, 2, 3, 4 };
map(begin(a), end(a), [](int n) { return n * 2; });

Results in this instantiation (created by the compiler):

template <>
void map<int*, _some_lambda_type>(int* begin, int* end, _some_lambda_type f) {
for (; begin != end; ++begin)
*begin = f.operator()(*begin);
}

… the compiler knows _some_lambda_type::operator () and can inline calls to it trivially. (And invoking the function map with any other lambda would create a new instantiation of map since each lambda has a distinct type.)

But when called with a function pointer, the instantiation looks as follows:

template <>
void map<int*, int (*)(int)>(int* begin, int* end, int (*f)(int)) {
for (; begin != end; ++begin)
*begin = f(*begin);
}

… and here f points to a different address for each call to map and thus the compiler cannot inline calls to f unless the surrounding call to map has also been inlined so that the compiler can resolve f to one specific function.

How does a compiler treat lambdas differently than regular functions?

Removing some extraneous stuff from your snippet, we get

template<typename T>
constexpr void foo(T t)
{
constexpr int i = t();
}

constexpr int f() { return 42; }
auto l = []{ return 42; }

Remarks:

  1. You are attempting to use t() as a constexpr within foo. foo could in fact be a normal function and still behave the same.

  2. A lambda isn't a function. It is an anonymous struct with an operator().

    struct L { constexpr int operator()() const { return 42; } };

    T is deduced as a type equivalent to L in the call foo(l).

  3. T is deduced as int(*)() in the call foo(f).

  4. In both cases, t isn't a constexpr within foo.


From [expr.const]

An expression e is a core constant expression unless the evaluation of e, following the rules of the abstract machine, would evaluate one of the following expressions: [...]

  • an lvalue-to-rvalue conversion unless [...]

    • a non-volatile glvalue that refers to a non-volatile object defined with constexpr, or that refers to a non-mutable subobject of such an object, or

    • a non-volatile glvalue of literal type that refers to a non-volatile object whose lifetime began within the evaluation of e;

Calling through a int(*)() requires a lvalue-to-rvalue conversion. t isn't an object defined with constexpr, neither did it begin its lifetime within the evaluation of t(). Therefore t() isn't a constexpr in foo(f).

Calling operator() doesn't require a lvalue-to-rvalue conversion. Since there is no member access, it is simply a constexpr function call. Therefore t() is a constexpr in foo(l).

There is one more step from core constant expressions to constant expressions, but isn't important for the discussion here.

C++11 Performance: Lambda inlining vs Function template specialization

If the compiler can track "this function pointer points to this function", the compiler can inline the call through the function pointer.

Sometimes compilers can do this. Sometimes they cannot.

Unless you store a lambda in a function pointer, std::function, or similar type-erasing wrapper, the compiler at the point where the lambda is called knows the type of the lambda, so knows the body of the lambda. The compiler can trivially inline the function call.

Nothing about using a function template changes this, except if the argument is constexpr like a function non-type template parameter:

template <int func(int, int)>

this is an example of that. Here, the function template, in the body of the function, is guaranteed to be known at compile time.

Pass that func anywhere else, however, and the compiler can lose track of it.

In any case, any speed difference is going to be highly context dependent. And sometimes the larger binary size caused by inlining of a lambda will cause more slowdown than the inability to inline a function pointer, so performance can go the other way.

Any universal claims like you are trying to make is going to be wrong sometimes.

How to help the compiler to optimize lambda calls?

Do not use std::function, use templated argument for body so the compiler can actually see the lambda when compiling forXY. Otherwise it resorts to type erasure and virtual calls inside std::function. – yeputons

Using this idea, here is the improved code:

template <typename F>
static inline void forXY(int v, F body) noexcept {
for(int y = 0 ; y<v ; y++) {
for(int x = 0 ; x<v ; x++) {
body(x, y);
}
}
}

Why would a C++ compiler fail to inline a lambda passed to a function template?

It is only a matter of filling enough stuff into the lambda and using it at least twice (otherwise there is no good reason not to inline).

Here for GCC 9.2 and Clang 9 with -O3:

#include<iostream>

int a, b, c;

template <typename F>
static inline void
hof(const F fun) {
fun(a, b, c);
fun(a, b, c);
}
void caller() {
hof([&](int a, int b, int c) {
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
std::cout << "Hello!";
});
}

And the assembly for caller looks like this:

GCC:

caller():
sub rsp, 8
call caller()::{lambda(int, int, int)#1}::operator()(int, int, int) const [clone .isra.0]
call caller()::{lambda(int, int, int)#1}::operator()(int, int, int) const [clone .isra.0]
add rsp, 8
ret

Clang:

caller():                             # @caller()
push rax
call caller()::$_0::operator()(int, int, int) const
pop rax
jmp caller()::$_0::operator()(int, int, int) const # TAILCALL

See godbolt here.

These are exactly as many repetitions in the lambda as I needed to convince GCC that inlining twice is not worth it.

Clang stopped inlining already with fewer repetitions.

How effectively can function-local lambdas be inlined by C++ compilers?

Yes.

Modern compilers use "static single assignment" (SSA) as an optimization pass.

Each time you assign to a value or modify it, a conceptually different value is created. Sometimes these conceptually different values share identity (for the purpose of pointers-to).

Identity, when you take the address of something, is the thing that gets in the way of this.

Simple references are turned into aliases for the value they reference; they have no identity. This is part of the original design intent for references, and why you cannot have a pointer to a reference.

Concretely:

std::string printSomeNumbers(void)
{
std::ostringstream ss;
const auto appendNumber = [&ss](auto number) {
ss << number << "\n"; // Pretend this is something non-trivial
};

printf("hello\n");
appendNumber(1);
printf("world\n");
appendNumber(2.0);
printf("today\n");

return ss.str();
}

compiles to:

printSomeNumbers[abi:cxx11]():           # @printSomeNumbers[abi:cxx11]()
push r14
push rbx
sub rsp, 376
mov r14, rdi
mov rbx, rsp
mov rdi, rbx
mov esi, 16
call std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >::basic_ostringstream(std::_Ios_Openmode)
mov edi, offset .Lstr
call puts
mov rdi, rbx
mov esi, 1
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
mov esi, offset .L.str.3
mov edx, 1
mov rdi, rax
call std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
mov edi, offset .Lstr.8
call puts
mov rdi, rsp
movsd xmm0, qword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero
call std::basic_ostream<char, std::char_traits<char> >& std::basic_ostream<char, std::char_traits<char> >::_M_insert<double>(double)
mov esi, offset .L.str.3
mov edx, 1
mov rdi, rax
call std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
mov edi, offset .Lstr.9
call puts
lea rsi, [rsp + 8]
mov rdi, r14
call std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::str() const
mov rax, qword ptr [rip + VTT for std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >]
mov qword ptr [rsp], rax
mov rcx, qword ptr [rip + VTT for std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >+24]
mov rax, qword ptr [rax - 24]
mov qword ptr [rsp + rax], rcx
mov qword ptr [rsp + 8], offset vtable for std::__cxx11::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >+16
mov rdi, qword ptr [rsp + 80]
lea rax, [rsp + 96]
cmp rdi, rax
je .LBB0_7
call operator delete(void*)
.LBB0_7:
mov qword ptr [rsp + 8], offset vtable for std::basic_streambuf<char, std::char_traits<char> >+16
lea rdi, [rsp + 64]
call std::locale::~locale() [complete object destructor]
lea rdi, [rsp + 112]
call std::ios_base::~ios_base() [base object destructor]
mov rax, r14
add rsp, 376
pop rbx
pop r14
ret

Godbolt

Notice that between the printf calls (in the assembly they are puts) there is no call other than directly to a operator<< of ostringstream.

Helper functions: lambdas vs normal functions

Your first and foremost concern should be readability (and maintainability)!
Which of regular or lambda functions is more readable strongly depends on the given problem (and a bit on the taste of the reader/maintainer).

Don't be concerned about performance until you find that performance actually is an issue! If performance is an issue, start by benchmarking, not by guessing which implementation you think is faster (in many situations compilers are pretty good at optimizing).

Lambda VS Function

I think lambda is elegant when using stl functions like , or you want quick throw away functions without naming them .

sort(v.begin(),
v.end(),
[](int a, int b){ return a > b; }
);

but it's not faster from the function .

Disassembly of both .

    helloWorld1();
008112A0 mov ecx,dword ptr ds:[813054h]
008112A6 push 8119A0h
008112AB call std::operator<<<std::char_traits<char> > (0811780h)
008112B0 mov ecx,eax
008112B2 call dword ptr ds:[813038h]
helloWorld2();
008112B8 mov ecx,dword ptr ds:[813054h]
008112BE push 8119A0h
008112C3 call std::operator<<<std::char_traits<char> > (0811780h)
008112C8 mov ecx,eax
008112CA call dword ptr ds:[813038h]

both have the same disassembly.

Does the compiler really optimize to make these two functions the same assembly?

This function:

float a() {
A a{55,67,12};
return a.bar();
}

Has exactly the same observable behavior as this one:

float a() {
return sqrt(55+67+12);
}

The same is true for b(). Further, sqrt(55+67+12) == sqrt(134) == 11.5758369028.

Binary representation of the IEEE-754 floating point value 11.5758369028 is 01000001001110010011011010100001. And that binary as integer is 1094268577.

The compiler applied the so-called as if rule to replace both functions with assembly that has the exact same observable behavior as the original code: Both functions return a float with value 11.5758369028.



Related Topics



Leave a reply



Submit