C++ Templates for Performance

C++ templates for performance?

A common example is sorting.

In C, qsort takes a pointer to a comparison function. Generally speaking, there will be one copy of the qsort code, which is not inlined. It will make a call through the pointer to the comparison routine -- this of course is also not inlined.

In C++, std::sort is a template, and it can take a functor object as comparator. There is a different copy of std::sort for each different type used as a comparator. Assuming you use a functor class with overloaded operator(), then the call to the comparator can easily be inlined into this copy of std::sort.

So, templates give you more inlining because there are more copies of the sort code, each of which can inline a different comparator. Inlining is quite a good optimization, and sort routines do a lot of comparisons, so you can often measure std::sort running faster than an equivalent qsort. The cost of this is the chance of much larger code -- if your program uses a lot of different comparators then you get a lot of different copies of the sort routine, each with a different comparator baked into it.
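The two calling styles can be sketched as follows (a toy example with my own names, cmp_int and Less; it demonstrates the mechanics only, not the measured speed difference):

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// C style: one shared qsort body, with an indirect (non-inlined)
// call through this function pointer for every comparison.
int cmp_int(const void* a, const void* b) {
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

// C++ style: a functor comparator. std::sort is instantiated once per
// comparator type, so this operator() can be inlined into that copy.
struct Less {
    bool operator()(int a, int b) const { return a < b; }
};

std::vector<int> sort_both() {
    int a[] = {3, 1, 2};
    std::qsort(a, 3, sizeof(int), cmp_int);

    std::vector<int> v = {3, 1, 2};
    std::sort(v.begin(), v.end(), Less{});

    // both are now sorted ascending; append the C array for inspection
    v.insert(v.end(), a, a + 3);
    return v;
}
```

Both calls produce the same ordering; the difference is entirely in what the compiler can see and inline at the call sites inside the sort routine.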

In principle there's no reason why a C implementation can't inline qsort into the place it is called. Then if it was called with the name of the function, the optimizer could in theory observe that at the point it is used, the function pointer must still point to that same function. Then it can inline the call to the function, and the result would be similar to the result with std::sort. But in practice, compilers tend not to take the first step, inlining qsort. That's because (a) it's large, and (b) it's in a different translation unit, usually compiled into some library that your program is linked against, and (c) to do it this way, you'd have an inlined copy of qsort for every call to it, not just a copy for every different comparator. So it would be even more bloated than the C++, unless the implementation could also find a way to common up the code in cases where qsort is called in different places with the same comparator.

So, general-purpose functions like qsort in C tend to have some overheads on account of calls through function pointers, or other indirection[*]. Templates in C++ are a common way of keeping the source code generic, but ensuring that it compiles to a special-purpose function (or several such functions). The special-purpose code hopefully is faster.

It's worth noting that templates are not by any means just about performance. std::sort is itself more general-purpose than qsort in some ways. For example qsort only sorts arrays, whereas std::sort can sort anything that provides a random-access iterator. It can for example sort a deque, which under the covers is several disjoint arrays allocated separately. So the use of templates doesn't necessarily provide any performance benefit, it might be done for other reasons. It just happens that templates do affect performance.

[*] another example with sorting - qsort takes an integer parameter saying how big each element of the array is, and when it moves elements it therefore must call memcpy or similar with the value of this variable. std::sort knows at compile-time the exact type of the elements, and hence the exact size. It can inline a copy constructor call that in turn might translate to instructions to copy that number of bytes. As with the inlined comparator, it's often possible to copy exactly 4 (or 8, or 16, or whatever) bytes faster than you'd get by calling a routine that copies a variable number of bytes, passing it the value 4 (or 8, or 16, or whatever). As before, if you called qsort with a literal value for the size, and that call to qsort was inlined, then the compiler could perform the exact same optimization in C. But in practice you don't see that.
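The footnote's point can be sketched like this (hypothetical helper names):

```cpp
#include <cstddef>
#include <cstring>

// What qsort has to do: copy a number of bytes known only at run time.
void copy_runtime(void* dst, const void* src, std::size_t size) {
    std::memcpy(dst, src, size);   // size is a run-time value
}

// What std::sort effectively does: the element type (and hence its
// size) is a compile-time fact, so the copy can become a fixed move.
template <typename T>
void copy_static(T& dst, const T& src) {
    dst = src;                     // sizeof(T) is fixed at compile time
}

int demo() {
    int a = 0, b = 42;
    copy_runtime(&a, &b, sizeof a);
    int c = 0;
    copy_static(c, a);
    return c;
}
```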

Do C++ templates make programs slow?

The short answer is no. For the longer answer please read on.

As others have already noted, templates don't have a direct run-time penalty -- i.e. all their tricks happen at compile time. Indirectly, however, they can slow things down under a few circumstances. In particular, each instantiation of a template (normally) produces code that's separate and unique from other instantiations. Under a few circumstances, this can lead to slow execution, by simply producing enough object code that it no longer fits in the cache well.

With respect to code size: yes, most compilers can and will fold together the code for identical instantiations -- but that's normally the case only when the instantiations are truly identical. The compiler will not insert code to do even the most trivial conversions to get two minutely different instantiations to match each other. For example, a normal function call can and will convert T * to T const *, so calls that use either const or non-const arguments will use the same code (unless you've chosen to overload the function on constness, in which case you've probably done so specifically to provide different behavior for the two cases). With a template, that won't happen -- instantiations over T * and T const * will result in two entirely separate pieces of code being generated. It's possible the compiler (or linker) may be able to merge the two after the fact, but it's by no means certain (e.g., I've certainly used compilers that didn't).
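That the two instantiations really are separate functions can be demonstrated with a static local variable, which is per-instantiation state (toy example, my own names):

```cpp
// Each instantiation of this template gets its *own* copy of the
// static counter, because each instantiation is a separate function.
template <typename T>
int calls(T) {
    static int n = 0;
    return ++n;
}

int demo() {
    int x = 0;
    calls(&x);          // calls<int*>: its counter becomes 1
    calls(&x);          // calls<int*>: its counter becomes 2
    const int* p = &x;
    return calls(p);    // calls<const int*>: a *separate* counter -> 1
}
```

If int* and const int* shared one body (as they would with a single plain function taking const int*), demo() would return 3; because they are distinct instantiations, it returns 1.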

But in the end, templates have positive effects on speed far more often than negative.

Template class and possible performance issues

As it stands, the classes instantiated from your template are distinct classes with distinct code. And yes, that would cause the code to be reloaded again and again at execution.

You can mitigate this the way Walter suggested in the comments: by having the implementation in some common base class that all the other classes derive from. I went into some detail about this technique in my answer to the question “How to define different types for the same class in C++”.
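A minimal sketch of that technique (hypothetical class names): the type-independent logic lives once in a non-template base, and each instantiation adds only a thin type-specific layer.

```cpp
#include <cstddef>

// Compiled once, shared by every Buffer<T>: no per-type duplication.
class BufferBase {
protected:
    explicit BufferBase(std::size_t n) : count_(n) {}
    std::size_t count_;
public:
    std::size_t count() const { return count_; }
};

// Thin template layer: only this small amount of code is stamped
// out per element type.
template <typename T>
class Buffer : public BufferBase {
public:
    explicit Buffer(std::size_t n) : BufferBase(n) {}
    std::size_t bytes() const { return count_ * sizeof(T); }
};
```

The more logic you can push down into the non-template base, the less code is duplicated per instantiation.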

C++: template class vs two classes: efficiency

The problem is probably the indirection you get by using pointers to the distribution classes. Try to use the classes directly without pointers, or, perhaps better, do something like

typename std::conditional<std::is_integral<Type>::value
                        , std::uniform_int_distribution<Type>
                        , std::uniform_real_distribution<Type> >::type _dist;

in order to pick the distribution type you need. (This is just a hint; the type check could surely be improved.)


Explanation: The code above works as follows: std::conditional<(1),(2),(3)> acts as a static if-statement for types. If the check in the first field (1) evaluates to true, it takes the type in the second field (2); otherwise it picks the type in the third field (3).

In case the template parameter Type is an integer type, std::is_integral<Type>::value will evaluate to true. Thus, the type of your distribution will be std::uniform_int_distribution<Type> which is usually desired.

In case Type is not an integral type (but rather, say, a floating-point type, which however isn't checked here), std::uniform_real_distribution<Type> is used instead for the type of the distribution.

Example:

#include <iostream>
#include <random>
#include <type_traits>

template<typename Type>
struct UniformDistribution
{
    std::mt19937_64 mt_;
    typename std::conditional<std::is_integral<Type>::value
                            , std::uniform_int_distribution<Type>
                            , std::uniform_real_distribution<Type> >::type dist_;

    Type get()
    {
        return dist_(mt_);
    }
};

int main()
{
    // produces a uniformly distributed integer in [0, numeric_limits<int>::max()]
    std::cout << UniformDistribution<int>().get() << std::endl;

    // produces a uniformly distributed double in [0, 1)
    std::cout << UniformDistribution<double>().get() << std::endl;
}

Expression templates: improving performance in evaluating expressions?

Evaluation of an expression template typically happens when you save the result to some special type, like:

Result D = A*B+sin(C)+3.;

Result type of expression:

A*B+sin(C)+3.

is not Result, but something convertible to Result. The evaluation happens during that conversion.
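A toy sketch of this mechanism (not a real library; Result and Sum here are stand-ins):

```cpp
#include <cstddef>
#include <vector>

using Result = std::vector<double>;

// A deferred elementwise sum: constructing it does no arithmetic.
struct Sum {
    const Result& a;
    const Result& b;
    // The actual evaluation happens here, during conversion to Result.
    operator Result() const {
        Result r(a.size());
        for (std::size_t i = 0; i < a.size(); ++i)
            r[i] = a[i] + b[i];
        return r;
    }
};

Sum operator+(const Result& a, const Result& b) { return Sum{a, b}; }
```

Writing Result D = A + B; triggers the conversion and hence the evaluation, while auto&& D = A + B; merely keeps the unevaluated Sum node around.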


My question is: is there any technique (for example, using placeholders?) to recognize that the values of D are actually unused

This kind of "transformation":

Result D = A*B+sin(C)+3.;
Result F = D*E;

to

Result F = (A*B+sin(C)+3.)*E;

is possible when you do not evaluate D. To do this, you should typically capture D as its real expression type, for instance with the help of auto:

auto &&D = A*B+sin(C)+3.;
Result F = D*E;

However, you should be careful -- sometimes an expression template captures references to its operands, and you may have an rvalue which expires at the end of its full expression:

auto &&D = A*get_large_rvalue();
// At this point, the result of get_large_rvalue is destroyed,
// and D holds an expired reference
Result F = D*E;

Where get_large_rvalue is:

LargeMatrix get_large_rvalue();

Its result is an rvalue; it expires at the end of the full expression in which get_large_rvalue was called. If something within the expression stores a pointer/reference to it (for later evaluation) and you "defer" evaluation, the pointer/reference will outlive the pointed-to/referenced object.

In order to prevent this, you should do:

auto &&intermediate = get_large_rvalue(); // lives till the end of the scope
auto &&D = A*intermediate;
Result F = D*E;

I'm not familiar with C++11 but, as I understand, auto asks the compiler to determine the type of a variable from its initialization

Yes, exactly. This is called Type Inference/Deduction.

C++98/03 had type deduction only for template functions, in C++11 there is auto.

Do you know how CUDA and C++11 interact with each other?

I haven't used CUDA (though I have used OpenCL), but I guess there will be no problems with C++11 in host code. Some C++11 features may not be supported in device code, but for your purpose you need auto only in host code.

Finally, is there any possibility with only C++?

Do you mean pre-C++11, i.e. C++98/C++03?
Yes, it is possible, but it has more syntactic noise, which may be a reason to reject it:

template<typename Expression>
void use_D(Expression D) // depending on your expression template library,
                         // it may be better to take (const Expression &D)
{
    Result F = D*E;
}
// ...
// somewhere:
{
    use_D(A*B+sin(C)+3.);
}

I'm now using CUDA/Visual Studio 2010 under Windows. Could you please recommend a compiler/toolset/environment for both OSes to use C++11 in the framework of my interest (GPGPU and CUDA), if you know any?

MSVC 2010 does support some parts of C++11. In particular, it supports auto. So if you need only auto from C++11, MSVC 2010 is OK.

But if you can use MSVC 2012, I would recommend sticking with it -- it has much better C++11 support.

Also, the trick auto &&intermediate = get_large_rvalue(); seems to be not "transparent" to a third party user (which is not supposed to know such an issue). Am I right? Any alternative?

If an expression template stores references to some values and you defer its evaluation, you should be sure that all its references are alive at the place of evaluation. Use any method you want -- it can be done without auto, like:

LargeMatrix temp = get_large_rvalue();

Or maybe even a global/static variable (a less preferred method).

A last comment/question: to use auto &&D = A*B+sin(C)+3.; it seems that I should overload the operator= for assignments between two expressions, right?

No, that form requires neither a copy/move assignment operator nor a copy/move constructor.

Basically it just names the temporary value and prolongs its lifetime to the end of the scope. See the related SO question on that.
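A small demonstration of that lifetime extension (make_name is a made-up stand-in):

```cpp
#include <cstddef>
#include <string>

std::string make_name() { return "temporary"; }

std::size_t demo() {
    // The returned temporary is bound to auto&&; its lifetime is
    // extended to match the reference, with no copy and no assignment.
    auto&& name = make_name();
    return name.size();  // the temporary is still alive here
}
```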

But, if you would use another form:

auto D = A*B+sin(C)+3.;

In that case a copy/move/conversion constructor may be required in order to compile (though the actual copy can be optimized away by the compiler via copy elision).

Also, switching between using auto (for the intermediate expressions) and Result to force calculation seems to be non-transparent to a third party user. Any alternative?

I am not sure there is any alternative; this is the nature of expression templates. While you use them in expressions, they return internal intermediate types, but when you store to some "special" type, evaluation is triggered.

Is Template Metaprogramming faster than the equivalent C code?

First, a disclaimer:
What I think you're asking about is not just template metaprogramming, but also generic programming. The two concepts are closely related, and there's no exact definition of what each encompasses. But in short, template metaprogramming is essentially writing a program using templates, which is evaluated at compile time. This makes it entirely free at runtime: nothing happens then, because the value (or, more commonly, type) has already been computed by the compiler, and is available either as a constant (a const variable or an enum) or as a typedef nested in a class (if you've used it to "compute" a type).

Generic programming is using templates and, when necessary, template metaprogramming to create generic code which works the same (and with no loss in performance) with any and all types. I'm going to use examples of both in the following.

A common use for template metaprogramming is to enable types to be used in generic programming, even if they were not designed for it.

Since template metaprogramming technically takes place entirely at compile-time, your question is a bit more relevant for generic programming, which still takes place at runtime, but is efficient because it can be specialized for the precise types it's used with at compile-time.

Anyway...


Depends on how you define "the equivalent C code".

The trick about template metaprogramming (or generic programming in general) is that it allows a lot of computation to be moved to compile-time, and it enables flexible, parametrized code that is just as efficient as hardcoded values.

The code displayed here for example computes a number in the fibonacci sequence at compile-time.
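The linked code isn't reproduced in the text, but the canonical version looks roughly like this:

```cpp
// Computes Fibonacci numbers at compile time: fibonacci<N>::value is a
// constant the compiler works out by recursive template instantiation.
template <unsigned long N>
struct fibonacci {
    static const unsigned long value =
        fibonacci<N - 1>::value + fibonacci<N - 2>::value;
};

// Base cases, as explicit specializations.
template <> struct fibonacci<0> { static const unsigned long value = 0; };
template <> struct fibonacci<1> { static const unsigned long value = 1; };

// At runtime this is just an assignment of the constant 89.
unsigned long fib11 = fibonacci<11uL>::value;
```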

The C++ code 'unsigned long fib11 = fibonacci<11uL>::value', relies on the template metaprogram defined in that link, and is as efficient as the C code 'unsigned long fib11 = 89uL'. The templates are evaluated at compile-time, yielding a constant that can be assigned to a variable. So at runtime, the code is actually identical to a simple assignment.

So if that is the "equivalent C code", the performance is the same.
If the equivalent C code is "a program that can compute arbitrary fibonacci numbers, applied to find the 11th number in the sequence", then the C version will be much slower, because it has to be implemented as a function, which computes the value at runtime. But this is the "equivalent C code" in the sense that it is a C program that exhibits the same flexibility (it is not just a hardcoded constant, but an actual function that can return any number in the fibonacci sequence).

Of course, this isn't often useful. But it's pretty much the canonical example of template metaprogramming.

A more realistic example of generic programming is sorting.

In C, you have the qsort standard library function taking an array and a comparator function pointer. The call to this function pointer can not be inlined (except in trivial cases), because at compile-time, it is not known which function is going to be called.

Of course the alternative is a hand-written sorting function designed for your specific datatype.

In C++, the equivalent is the function template std::sort. It too takes a comparator, but instead of this being a function pointer, it is a function object, looking like this:

struct MyComp {
    bool operator()(const MyType& lhs, const MyType& rhs) const {
        // return true if lhs < rhs, however this operation is defined for MyType objects
    }
};

and this can be inlined. The std::sort function is passed a template argument, so it knows the exact type of the comparator, and so it knows that the comparator function is not just an unknown function pointer, but MyComp::operator().

The end result is that the C++ function std::sort is exactly as efficient as your hand-coded implementation in C of the same sorting algorithm.

So again, if that is "the equivalent C code", then the performance is the same.
But if the "equivalent C code" is "a generalized sorting function which can be applied to any type, and allows user-defined comparators", then the generic programming-version in C++ is vastly more efficient.

That's really the trick. Generic programming and template metaprogramming are not "faster than C". They are methods to achieve general, reusable code which is as fast as handcoded, hardcoded C.

It is a way to get the best of both worlds. The performance of hardcoded algorithms, and the flexibility and reusability of general, parameterized ones.

Speed comparison - Template specialization vs. Virtual Function vs. If-Statement

(1) Not that the size of the generated assembly matters anymore these days, but this is what it generates (approximately, assuming MSVC on x86):

mov eax, [ecx+12]     ; 'this' pointer stored in ecx, eax is scratch
cmp eax, 0            ; test for 0
jz  .somewhereElse    ; jump if the bool isn't set

The optimizing compiler will intersperse other instructions there, making it more pipeline-friendly. The contents of your class will most likely be in your cache anyway, and if it's not, it will be needed a few cycles later anyway. So, in retrospect, that's maybe a few cycles, and for something that will be called at most a few times per frame, that's nothing.

(2) This is approximately the assembly that will be generated every time your play() method is called:

mov  eax, [ebp+4]     ; pointer to your Animation* somewhere on the stack, eax is scratch
mov  eax, [eax+12]    ; dereference the vtable
call eax              ; call it

Then you'll have some duplicate code, or another function call inside your specialized play() function, since there'll definitely be some common stuff, so that incurs some overhead (in code size and/or execution speed). So this is definitely slower.

Also, this makes it a lot harder to load generic animations. Your graphics department won't be happy.

(3) To use this effectively, you'll end up making a base class for your templated version anyway, with virtual functions (in that case, see (2)), OR you'll do it manually by checking types in places where you call your animation thing, in which case also see (2).

This also makes it MUCH harder to load generic animations. Your graphics department will be even less happy.

(4) What you need to worry about is not some microoptimization for tiny things done at most a few times a frame. From reading your post, I actually identified another problem that's commonly overlooked. You're mentioning std::vector<Animation>. Nothing against the STL, but that's bad voodoo. A single memory allocation will cost you more cycles than all the boolean checks in your play() or update() methods for probably the entire time your application is running. Putting Animations in and out of std::vectors (especially if you're putting in instances, not pointers (smart or dumb) to instances) will cost you way more.

You need to look at different places to optimize. This is such a ridiculous microoptimization that will bring you no benefit except make it harder to generalize and make your graphics department happy. What will matter, however, is worrying about memory allocation, and THEN, when you're done programming that part, starting a profiler and looking where the hot spots are.

If keeping your animations is actually becoming a bottleneck, the std::vector (nice as it is) is where you might want to look. Have you looked at, say, an intrusive linked list? That will actually be more benefit than worrying about this.


