Should We Still Be Optimizing "In the Small"?

Should we still be optimizing in the small?

If there is no cost to the optimization, do it. When writing the code, ++i is just as easy to write as i++, so prefer the former. There is no cost to it.

On the other hand, going back and making this change afterwards takes time, and it most likely won't make a noticeable difference, so you probably shouldn't bother with it.

But yes, it can make a difference. On built-in types, probably not, but for complex classes, the compiler is unlikely to be able to optimize it away. The reason is that the increment operation is no longer an intrinsic operation built into the compiler, but a function defined in the class. The compiler may be able to optimize it like any other function, but it cannot, in general, assume that pre-increment can be used in place of post-increment. The two functions may do entirely different things.

So when determining which optimizations can be left to the compiler, consider whether it has enough information to perform them. In this case, the compiler doesn't know that post-increment and pre-increment perform the same modifications to the object, so it cannot assume that one can be replaced with the other. But you have this knowledge, so you can safely perform the optimization yourself.
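To make this concrete, here is a minimal sketch (the class name is made up) of why post-increment on a class type can cost more than pre-increment:

    // Typical operator pair for a user-defined iterator-like class.
    struct BigIter {
        int state[64];

        BigIter& operator++() {      // pre-increment: modify in place
            ++state[0];
            return *this;
        }

        BigIter operator++(int) {    // post-increment: must return the old value
            BigIter old = *this;     // extra copy the optimizer may not remove
            ++state[0];
            return old;
        }
    };

In a loop where the return value is discarded, ++it never pays for that copy, while it++ relies on the optimizer proving the copy is dead.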

Many of the others you mention can usually be done very efficiently by the compiler:
Inlining can be done by the compiler, and it's usually better at it than you. All it needs to know is how large a proportion of the function consists of call overhead, and how often it is called. A big function that is called often probably shouldn't be inlined, because you end up copying a lot of code, resulting in a larger executable and more instruction cache misses. Inlining is always a tradeoff, and often the compiler is better at weighing all the factors than you are.

Loop unrolling is a purely mechanical operation, and the compiler can do it easily. The same goes for strength reduction. Swapping inner and outer loops is trickier, because the compiler has to prove that the changed order of traversal won't affect the result, which is difficult to do automatically. So here is an optimization you should do yourself.
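For example, here is the kind of loop interchange I mean; a toy sketch where you, not the compiler, know the traversal order doesn't affect the result:

    constexpr int N = 1024;
    double a[N][N];

    // Column-major traversal of a row-major array: large strides, poor cache use.
    double sum_slow() {
        double total = 0.0;
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                total += a[i][j];
        return total;
    }

    // Interchanged loops: contiguous accesses, same result for this reduction.
    double sum_fast() {
        double total = 0.0;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                total += a[i][j];
        return total;
    }

For a simple sum like this a good compiler may manage the interchange itself; the cases where you have to step in are the ones where proving the reordering safe requires knowledge the compiler doesn't have.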

But even in the simple ones that the compiler is able to do, you sometimes have information your compiler doesn't. If you know that a function is going to be called extremely often, even if it's only called from one place, it may be worth checking whether the compiler automatically inlines it, and do it manually if not.

Sometimes you may also know more about a loop than the compiler does (for example, that the number of iterations will always be a multiple of 4, so you can safely unroll it 4 times). The compiler may not have this information, so if it were to unroll the loop, it would have to insert an epilogue to ensure that the last few iterations are performed correctly.
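A hedged sketch of what that manual unrolling might look like, assuming you know n is always a multiple of 4 (the compiler doesn't):

    // Unrolled by hand; no epilogue needed because n % 4 == 0 by assumption.
    void scale(float* v, int n, float k) {
        for (int i = 0; i < n; i += 4) {
            v[i]     *= k;
            v[i + 1] *= k;
            v[i + 2] *= k;
            v[i + 3] *= k;
        }
    }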

So such "small-scale" optimizations can still be necessary, if 1) you actually need the performance, and 2) you have information that the compiler doesn't.

You can't outperform the compiler on purely mechanical optimizations. But you may be able to make assumptions that the compiler can't, and that is when you're able to optimize better than the compiler.

Is premature optimization really the root of all evil?

It's important to keep in mind the full quote (see below):

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3%.

What this means is that, in the absence of measured performance issues, you shouldn't optimize because you think you will get a performance gain. There are obvious optimizations (like not doing string concatenation inside a tight loop), but anything that isn't a trivially clear optimization should be avoided until it can be measured.
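For instance, the "obvious" kind looks something like this in C++ (an illustrative sketch, not anyone's actual code):

    #include <string>
    #include <vector>

    // Build the string once instead of repeatedly concatenating in a tight loop.
    std::string join(const std::vector<std::string>& parts) {
        std::string out;
        std::size_t total = 0;
        for (const auto& p : parts) total += p.size();
        out.reserve(total);              // one allocation instead of many
        for (const auto& p : parts) out += p;
        return out;
    }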

The biggest problems with "premature optimization" are that it can introduce unexpected bugs and can be a huge time waster.


There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgements about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. After working with such tools for seven years, I've become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off.

When is optimisation premature?

Don Knuth started the literate programming movement because he believed that the most important function of computer code is to communicate the programmer's intent to a human reader. Any coding practice that makes your code harder to understand in the name of performance is a premature optimization.

Certain idioms that were introduced in the name of optimization have become so popular that everyone understands them and they have become expected, not premature. Examples include

  • Using pointer arithmetic instead of array notation in C, including the use of such idioms as

    for (p = q; p < lim; p++)
  • Rebinding global variables to local variables in Lua, as in

    local table, io, string, math
    = table, io, string, math

Beyond such idioms, take shortcuts at your peril.

All optimization is premature unless

  • A program is too slow (many people forget this part).

  • You have a measurement (profile or similar) showing that the optimization could improve things.

(It's also permissible to optimize for memory.)

Direct answer to question:

  • If your "different" technique makes the program harder to understand, then it's a premature optimization.

EDIT: In response to comments, using quicksort instead of a simpler algorithm like insertion sort is another example of an idiom that everyone understands and expects. (Although if you write your own sort routine instead of using the library sort routine, one hopes you have a very good reason.)
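In C++ terms, the expected idiom is simply the library call (illustrative):

    #include <algorithm>
    #include <vector>

    // Reach for the library sort rather than hand-rolling quicksort.
    void sort_values(std::vector<int>& v) {
        std::sort(v.begin(), v.end());   // O(N log N); typically introsort
    }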

Is micro-optimization worth the time?

Micro-optimisation is worth it when you have evidence that you're optimising a bottleneck.

Usually it's not worth it - write the most readable code you can, and use realistic benchmarks to check the performance. If and when you find you've got a bottleneck, micro-optimise just that bit of code (measuring as you go). Sometimes a small amount of micro-optimisation can make a huge difference.

But don't micro-optimise all your code... it will end up being far harder to maintain, and you'll quite possibly find you've either missed the real bottleneck, or that your micro-optimisations are harming performance instead of helping.

Ideas for good performance optimisations in C++

CString MyFunc(double x, double y)

is less efficient than

void MyFunc(double x, double y, CString &Result)

If MyFunc is written cleanly, the two should be about the same: the compiler should be able to take advantage of NRVO (named return value optimization). It sounds like you've profiled and found that it's not - I'm just saying it may better fit your criteria, e.g. "minimal obfuscation to the code", to rearrange the function itself a little to allow NRVO to occur.
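As a rough illustration of the shape that lets NRVO kick in (using std::string here rather than CString, and made-up names):

    #include <string>

    // Every return path returns the same named local, so the compiler can
    // construct 'result' directly in the caller's storage (NRVO), making
    // return-by-value about as cheap as the out-parameter version.
    std::string MakeLabel(double x, double y) {
        std::string result;
        result += std::to_string(x);
        result += ",";
        result += std::to_string(y);
        return result;                   // single named return value
    }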

A few more things to try:

  1. Memoization (similar to your caching of repeated searches, but more focused on tree/graph resolution, if you have any).
  2. Using floats instead of doubles, if you don't need the extra precision (or maybe even ints if you can).
  3. Use types to mark assumptions (you mentioned sorted arrays; another common one is lowercased strings). Create a derived or wrapper type (e.g. Sorted<T>) that makes such assumptions explicit. That way, if you have a method that takes a Sorted<vector<T> >, for example, and you give it a sorted vector, it passes straight through; but if you give it a vector<T>, it will have to construct a Sorted<vector<T> >, at which point it will sort it. You can manually assert that it is sorted using an alternative constructor, but this makes it much easier to carry your assumptions around and maybe catch places you might otherwise have missed. (See the sketch after this list.)
  4. Don't give up on inlines etc. Make sure you're fully aware of when they should help and when they should hinder. In some cases they really can make a difference, but probably not if you just deal with them arbitrarily.
  5. You might benefit from flyweight, or pooled object allocators.
  6. When threading, try to minimise any interactions so you can reduce the amount of code that requires locking. Often, taking copies of even fairly expensive objects can be less of an overhead than a mutex. Obviously, take advantage of atomic types where you can.
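A rough sketch of the Sorted<T> wrapper from point 3 (names and details are my own invention):

    #include <algorithm>
    #include <utility>
    #include <vector>

    template <typename Container>
    class Sorted {
        Container data_;
    public:
        // Converting constructor: accepts a plain container and sorts it, so a
        // vector<T> passed where Sorted<vector<T>> is expected gets sorted here.
        Sorted(Container c) : data_(std::move(c)) {
            std::sort(data_.begin(), data_.end());
        }

        struct assume_sorted_t {};
        // Alternative constructor: the caller asserts the data is already sorted.
        Sorted(Container c, assume_sorted_t) : data_(std::move(c)) {}

        const Container& get() const { return data_; }
    };

    // A function that needs sorted input now states that in its signature.
    template <typename T>
    bool contains(const Sorted<std::vector<T>>& sorted, const T& value) {
        return std::binary_search(sorted.get().begin(), sorted.get().end(), value);
    }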

small object optimization useless when using std::function

Older versions of libstdc++, like the one shipped by gcc 4.8.5, seem to only optimise function pointers to not allocate (as seen here).

Since the std::function implementation does not have the small object optimisation that you want, you will have to use an alternative implementation. Either upgrade your compiler or use boost::function, which is essentially the same as std::function.
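If you want to check what your particular implementation does, one way (a sketch, not a guarantee; the details vary by standard library) is to count global allocations around the assignment:

    #include <cstdio>
    #include <cstdlib>
    #include <functional>
    #include <new>

    static int g_allocs = 0;

    void* operator new(std::size_t n) {
        ++g_allocs;
        if (void* p = std::malloc(n)) return p;
        throw std::bad_alloc();
    }
    void operator delete(void* p) noexcept { std::free(p); }

    int main() {
        int x = 42;
        g_allocs = 0;
        std::function<int()> f = [x] { return x; };  // small capturing lambda
        std::printf("allocations: %d\n", g_allocs);  // 0 if it fit in the small buffer
        std::printf("f() = %d\n", f());
    }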

link-time optimization vs. project inlining; limitations on each approach

The technical name, approaching minor buzzword status, for that approach is unity build.

See for example:

The benefits / disadvantages of unity builds?

The downside is best described here:

http://leewinder.co.uk/blog/?p=394

The short version is that it is more or less a choice between two languages: you either write in regular C++ or in unity-build C++. The 'correct' way of writing virtually any piece of code will differ between the two.
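For reference, a unity build boils down to a translation unit like this (file names are hypothetical):

    // unity.cpp - the build compiles only this file; the compiler sees the whole
    // module at once and can inline across what would otherwise be separate TUs.
    #include "lexer.cpp"
    #include "parser.cpp"
    #include "codegen.cpp"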

Does optimizing an algorithm from O(2N) down to O(N) make it twice as fast?

Time complexity is not the same as speed. For a given size of data, a program with O(N) might be slower, faster or the same speed as O(2N). Also, for a given size of data O(N) might be slower, faster or the same speed as O(N^2).

So if Big-O doesn't mean anything, why are we talking about it anyway?

Big-O notation describes the behaviour of a program as the size of the data increases. This behaviour is always relative. In other words, Big-O tells you the shape of the asymptotic curve, but not its scale or dimension.

Let's say you have a program A that is O(N). This means that processing time will be linearly proportional to data size (ignoring real-world complications like cache sizes that might make the run-time more like piecewise-linear):

  • for 1000 rows it will take 3 seconds
  • for 2000 rows it will take 6 seconds
  • for 3000 rows it will take 9 seconds

And for another program B which is also O(N):

  • for 1000 rows it will take 1 second
  • for 2000 rows it will take 2 seconds
  • for 3000 rows it will take 3 seconds

Obviously, the second program is 3 times faster per row, even though they are both O(N). Intuitively, this tells you that both programs go through every row and spend some fixed amount of time processing it. The difference in time from 2000 to 1000 rows is the same as the difference from 3000 to 2000, which means the time grows linearly: the time needed for one record does not depend on the total number of records. This is equivalent to the program doing some kind of for-loop, as for example when calculating a sum of numbers.

And, since the programs are different and do different things, it doesn't make any sense to compare 1 second of program A's time to 1 second of program B's time anyway. You would be comparing apples and oranges. That's why we don't care about the constant factor and we say that O(3n) is equivalent to O(n).

Now imagine a third program C, which is O(N^2).

  • for 1000 rows it will take 1 second
  • for 2000 rows it will take 4 seconds
  • for 3000 rows it will take 9 seconds

The difference in time here between 3000 and 2000 rows is bigger than the difference between 2000 and 1000. The more data there is, the bigger the increase. This is equivalent to a program doing a for loop inside a for loop - as, for example, when searching for pairs in the data.
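In code, the two shapes being described look roughly like this (illustrative only):

    #include <cstddef>
    #include <vector>

    // Program A's shape: one pass over the rows, O(N).
    long long sum(const std::vector<int>& rows) {
        long long total = 0;
        for (int r : rows) total += r;
        return total;
    }

    // Program C's shape: a pass inside a pass, O(N^2).
    std::size_t count_equal_pairs(const std::vector<int>& rows) {
        std::size_t pairs = 0;
        for (std::size_t i = 0; i < rows.size(); ++i)
            for (std::size_t j = i + 1; j < rows.size(); ++j)
                if (rows[i] == rows[j]) ++pairs;
        return pairs;
    }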

When your data is small, you might not care about a 1-2 second difference. If you compare programs A and C just from the timings above, without understanding the underlying behaviour, you might be tempted to say that A is faster. But look what happens with more records:

  • for 10000 rows program A will take 30 seconds
  • for 10000 rows program C will take 1000 seconds
  • for 20000 rows program A will take 60 seconds
  • for 20000 rows program C will take 4000 seconds

What started as comparable performance on small data quickly becomes a painfully obvious gap - a factor of almost 100x. There is no way that running C on a faster CPU could ever keep up with A, and the bigger the data, the more this is true. The thing that makes all the difference is scalability. This means answering questions like: how big a machine are we going to need in a year's time, when the database has grown to twice its size? With O(N), you are generally OK - you can buy more servers, add more memory, use replication, etc. With O(N^2) you are generally OK up to a certain size, at which point buying any number of new machines will not be enough to solve your problems any more, and you will need to find a different approach in software, or run it on massively parallel hardware such as GPU clusters. With O(2^N) you are pretty much out of luck unless you can somehow limit the maximum size of the data to something that is still usable.

Note that the above examples are theoretical and intentionally simplified; as @PeterCordes pointed out, the times on a real CPU might differ because of caching, branch misprediction, data alignment issues, vector operations and a million other implementation-specific details. Please see his links in the comments below.

Am I understanding premature optimization correctly?

This is the right strategy in general. Get the code to work, thoroughly covered with automated tests.

You can then run the automated tests while the program is under control of a profiler, to find out where the program is spending time and/or memory. That will show you where to optimize.

And it will show you how to optimize working code, not code that may or may not work when it's all put together.

You don't want code that fails optimally.


The quote I was failing to remember is from Mich Ravera:

If it doesn't work, it doesn't matter how fast it doesn't work.


