Do C++ Templates Make Programs Slow?

Do C++ templates make programs slow?

The short answer is no. For the longer answer please read on.

As others have already noted, templates don't have a direct run-time penalty -- i.e. all their tricks happen at compile time. Indirectly, however, they can slow things down under a few circumstances. In particular, each instantiation of a template (normally) produces code that's separate and unique from other instantiations. Under a few circumstances, this can lead to slow execution, by simply producing enough object code that it no longer fits in the cache well.

With respect to code size: yes, most compilers can and will fold together the code for identical instantiations -- but that's normally the case only when the instantiations are truly identical. The compiler will not insert code to do even the most trivial conversions to get two minutely different instantiations to match each other. For example, on a normal function call the compiler can and will convert T * to T const *, so calls that use either const or non-const arguments will use the same code (unless you've chosen to overload the function on constness, in which case you've probably done so specifically to provide different behavior for the two cases). With a template, that won't happen -- instantiations over T * and T const * will result in two entirely separate pieces of code being generated. It's possible the compiler (or linker) may be able to merge the two after the fact, but it's by no means guaranteed (I've certainly used compilers that didn't).

But in the end, templates have positive effects on speed far more often than negative.

C++ templates for performance?

A common example is sorting.

In C, qsort takes a pointer to a comparison function. Generally speaking, there will be one copy of the qsort code, which is not inlined. It will make a call through the pointer to the comparison routine -- this of course is also not inlined.

In C++, std::sort is a template, and it can take a functor object as comparator. There is a different copy of std::sort for each different type used as a comparator. Assuming you use a functor class with overloaded operator(), then the call to the comparator can easily be inlined into this copy of std::sort.

So, templates give you more inlining because there are more copies of the sort code, each of which can inline a different comparator. Inlining is quite a good optimization, and sort routines do a lot of comparisons, so you can often measure std::sort running faster than an equivalent qsort. The cost of this is the chance of much larger code -- if your program uses a lot of different comparators then you get a lot of different copies of the sort routine, each with a different comparator baked into it.

In principle there's no reason why a C implementation can't inline qsort into the place it is called. Then if it was called with the name of the function, the optimizer could in theory observe that at the point it is used, the function pointer must still point to that same function. Then it can inline the call to the function, and the result would be similar to the result with std::sort. But in practice, compilers tend not to take the first step, inlining qsort. That's because (a) it's large, and (b) it's in a different translation unit, usually compiled into some library that your program is linked against, and (c) to do it this way, you'd have an inlined copy of qsort for every call to it, not just a copy for every different comparator. So it would be even more bloated than the C++, unless the implementation could also find a way to common up the code in cases where qsort is called in different places with the same comparator.

So, general-purpose functions like qsort in C tend to have some overheads on account of calls through function pointers, or other indirection[*]. Templates in C++ are a common way of keeping the source code generic, but ensuring that it compiles to a special-purpose function (or several such functions). The special-purpose code hopefully is faster.

It's worth noting that templates are not by any means just about performance. std::sort is itself more general-purpose than qsort in some ways. For example qsort only sorts arrays, whereas std::sort can sort anything that provides a random-access iterator. It can for example sort a deque, which under the covers is several disjoint arrays allocated separately. So the use of templates doesn't necessarily provide any performance benefit, it might be done for other reasons. It just happens that templates do affect performance.

[*] another example with sorting - qsort takes an integer parameter saying how big each element of the array is, and when it moves elements it therefore must call memcpy or similar with the value of this variable. std::sort knows at compile-time the exact type of the elements, and hence the exact size. It can inline a copy constructor call that in turn might translate to instructions to copy that number of bytes. As with the inlined comparator, it's often possible to copy exactly 4 (or 8, or 16, or whatever) bytes faster than you'd get by calling a routine that copies a variable number of bytes, passing it the value 4 (or 8, or 16, or whatever). As before, if you called qsort with a literal value for the size, and that call to qsort was inlined, then the compiler could perform the exact same optimization in C. But in practice you don't see that.

Why are templates so slow to compile?

Template code is best thought of as a separate language that generates C++ code.

In this view, template code must first be parsed and evaluated; only then can the compiler generate the concrete C++ code, add it to the current translation unit, and compile the whole thing.

I've heard that not all compilers do it exactly this way, but that's the main idea: a lot has to happen before the code is actually compiled, because some of it must be generated first.

Is Template Metaprogramming faster than the equivalent C code?

First, a disclaimer:
What I think you're asking about is not just template metaprogramming, but also generic programming. The two concepts are closely related, and there's no exact definition of what each encompasses. But in short, template metaprogramming is essentially writing a program using templates, which is evaluated at compile-time. That makes it entirely free at runtime: nothing happens. The value (or more commonly, type) has already been computed by the compiler, and is available as a constant (either a const variable or an enum), or as a typedef nested in a class (if you've used it to "compute" a type).

Generic programming is using templates and, when necessary, template metaprogramming, to create generic code which works the same (and with no loss in performance) with any type. I'm going to use examples of both in the following.

A common use for template metaprogramming is to enable types to be used in generic programming, even if they were not designed for it.

Since template metaprogramming technically takes place entirely at compile-time, your question is a bit more relevant for generic programming, which still takes place at runtime, but is efficient because it can be specialized for the precise types it's used with at compile-time.

Anyway...


Depends on how you define "the equivalent C code".

The trick about template metaprogramming (or generic programming in general) is that it allows a lot of computation to be moved to compile-time, and it enables flexible, parametrized code that is just as efficient as hardcoded values.

The code displayed here for example computes a number in the fibonacci sequence at compile-time.

The C++ code 'unsigned long fib11 = fibonacci<11uL>::value', relies on the template metaprogram defined in that link, and is as efficient as the C code 'unsigned long fib11 = 89uL'. The templates are evaluated at compile-time, yielding a constant that can be assigned to a variable. So at runtime, the code is actually identical to a simple assignment.

So if that is the "equivalent C code", the performance is the same.
If the equivalent C code is "a program that can compute arbitrary fibonacci numbers, applied to find the 11th number in the sequence", then the C version will be much slower, because it has to be implemented as a function, which computes the value at runtime. But this is the "equivalent C code" in the sense that it is a C program that exhibits the same flexibility (it is not just a hardcoded constant, but an actual function that can return any number in the fibonacci sequence).

Of course, this isn't often useful. But it's pretty much the canonical example of template metaprogramming.

A more realistic example of generic programming is sorting.

In C, you have the qsort standard library function taking an array and a comparator function pointer. The call to this function pointer can not be inlined (except in trivial cases), because at compile-time, it is not known which function is going to be called.

Of course the alternative is a hand-written sorting function designed for your specific datatype.

In C++, the equivalent is the function template std::sort. It too takes a comparator, but instead of this being a function pointer, it is a function object, looking like this:

struct MyComp {
    bool operator()(const MyType& lhs, const MyType& rhs) const {
        // return true if lhs < rhs, however that operation is defined for MyType objects
    }
};

and this can be inlined. The std::sort function is passed a template argument, so it knows the exact type of the comparator, and so it knows that the comparator function is not just an unknown function pointer, but MyComp::operator().

The end result is that the C++ function std::sort is exactly as efficient as your hand-coded implementation in C of the same sorting algorithm.

So again, if that is "the equivalent C code", then the performance is the same.
But if the "equivalent C code" is "a generalized sorting function which can be applied to any type, and allows user-defined comparators", then the generic programming-version in C++ is vastly more efficient.

That's really the trick. Generic programming and template metaprogramming are not "faster than C". They are methods for achieving general, reusable code which is as fast as handcoded, hardcoded C.

It is a way to get the best of both worlds. The performance of hardcoded algorithms, and the flexibility and reusability of general, parameterized ones.

Does C++20's 'requires' keyword slow your program?


No. It specifies a constant expression on template parameters that evaluates a requirement -- or, in a template declaration, specifies an associated constraint.

Template parameters (and constraints) must be compile time constants by nature and won't affect the speed of your compiled program.

Why does concepts make C++ compile slower?

Note: the following answer (and the question it answers) pertains to the old C++0x version of concepts and has little relation to the version of the feature added to C++20.


First of all, Herb didn't say that concepts themselves made compiling slower. He said that conceptizing the C++ standard library made any code using the C++ standard library compile slower.

The reason for that comes down to several things.

1: Constraining templates takes compile time.

When you declare a class like this:

template<typename T> class Foo {...};

The compiler simply parses Foo and does very little. Even with two-phase lookup, the compiler simply doesn't do a whole lot in the compilation of class Foo. It stores it for later, of course, but the initial pass is relatively fast.

When you do constrain the template with a concept:

template<ConceptName C> class Foo {...};

The compiler must do some things. It must check up front that every use of the type C conforms to the concept ConceptName. That's extra work that the compiler would have deferred until instantiation time.

The more concept checking you have, the more compile time you spend to verify that the types match the concepts.

2: The standard C++ library uses a lot of concepts.

Look at the number of iterator concepts: input, output, forward, bidirectional, random access, contiguous. And the committee was considering breaking them down into many more than that. Many algorithms would have multiple versions for different iterator concepts.

And this doesn't include range concepts (of which there is one for every kind of iterator concept except output), character concepts for std::string, and various other kinds of things. All of these have to be compiled and checked.


What concepts really needed to make it fast is modules. The ability for the compiler to generate a module file that contains a sequence of pre-checked symbols, and then load that file directly without having to go through the standard compilation process. Straight from parsing to symbol creation.

Remember: for every file you #include into a .cpp file, the compiler must read that file and compile it. Even though the file is the same thing every time it does this, it still must dutifully read the file and process it. If we're talking about a concept-ized std::vector, it has to do all of the concept checking of the template. It still has to do all of the standard symbol lookup you do when compiling. And so forth.

Imagine if the compiler didn't have to do this. Imagine if it could just load a bunch of symbols and definitions directly from the disk. No compiling at all; just bringing in symbols and definitions for other code to use.

It would be like precompiled headers, only better. Precompiled headers are restricted to one per .cpp file, whereas you can use as many modules as you like.

Sadly, modules were yanked pretty early in the C++0x process. And without modules, constraining the standard library with concepts will always compile more slowly than the unconstrained version.

Note that Herb misunderstands the purpose of modules (not hard, since most of the initial concepts of the feature were the things he talked about: cross-platform DLLs and such). Their core fundamental purpose is to help compile times, not to make cross-platform DLLs work. Nor is it intended that modules themselves be cross-platform.


