Optimizing a Branch for a Known More-Common Path

The underlying hardware already performs this optimization. It will "fail" to predict the branch the first few times, but after that it will usually hit the correct option; see en.wikipedia.org/wiki/Branch_predictor.

You can try applying the GCC extension and check whether it is faster with it or not, but I think you will barely see any difference either way. Branch prediction is always applied in hardware; it is not something you enable.
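If you do want to check, a minimal benchmark like the sketch below (names and numbers are made up for illustration) typically shows nearly identical times for the hinted and unhinted versions of a heavily biased branch; compile with e.g. `gcc -O2 bench.c`:

#include <stdio.h>
#include <time.h>

#define N 100000000L

static long long run_plain(void) {
    long long sum = 0;
    for (long i = 0; i < N; i++) {
        if (i % 1024 == 0)      /* true only ~0.1% of the time */
            sum -= i;
        else
            sum += i;
    }
    return sum;
}

static long long run_hinted(void) {
    long long sum = 0;
    for (long i = 0; i < N; i++) {
        if (__builtin_expect(i % 1024 == 0, 0))  /* same branch, hinted */
            sum -= i;
        else
            sum += i;
    }
    return sum;
}

int main(void) {
    clock_t t0 = clock();
    long long a = run_plain();
    clock_t t1 = clock();
    long long b = run_hinted();
    clock_t t2 = clock();
    printf("plain:  %.3f s (%lld)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, a);
    printf("hinted: %.3f s (%lld)\n", (double)(t2 - t1) / CLOCKS_PER_SEC, b);
    return 0;
}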

optimizing branching by re-ordering

It's not portable, but many versions of GCC support a function called __builtin_expect() that can be used to tell the compiler what we expect a value to be:

if (__builtin_expect(condition, 0)) {
    // We expect condition to be false (0), so we're less likely to get here
} else {
    // We expect to get here more often, so GCC produces better code
}

The Linux kernel uses these as macros to make them more intuitive, cleaner, and more portable (i.e. redefine the macros on non-GCC systems):

#ifdef __GNUC__
/* !!(x) normalizes any non-zero value to exactly 1, matching the hint */
# define likely(x)   __builtin_expect(!!(x), 1)
# define unlikely(x) __builtin_expect(!!(x), 0)
#else
# define likely(x)   (x)
# define unlikely(x) (x)
#endif

With this, we can rewrite the above:

if (unlikely(condition)) {
    // we're less likely to get here
} else {
    // we expect to get here more often
}

Of course, this is probably unnecessary unless you're aiming for raw speed and/or you've profiled and found that this is a problem.

Optimizing branch predictions: how to generalize code that could run with different compilers, interpreters, and hardware predictors?

Short answer:

To help improve the performance of the branch predictor try to structure your program so that conditional statements don't depend on apparently random data.

Details

One of the other answers to this question claims:

There is no way to do anything at the high level language to optimize for branch prediction, caching sure, sometimes you can, but branch prediction, no not at all.

However, this is simply not true. A good illustration of this fact comes from one of the most famous questions on Stack Overflow: "Why is processing a sorted array faster than processing an unsorted array?"

All branch predictors work by identifying patterns of repeated code execution and using this information to predict the outcome and/or target of branches as necessary.

When writing code in a high-level language it's typically not necessary for an application programmer to worry about trying to optimize conditional branches. For instance, GCC has the __builtin_expect function, which allows the programmer to specify the expected outcome of a conditional branch. But even if an application programmer is certain they know the typical outcome of a specific branch, it's usually not necessary to use the annotation. In a hot loop this directive is unlikely to help improve performance. If the branch really is strongly biased, then the predictor will be able to correctly predict the outcome most of the time even without the annotation.

On most modern processors branch predictors perform incredibly well (better than 95% accurate even on complex workloads). So as a micro-optimization, trying to improve branch prediction accuracy is probably not something that an application programmer would want to focus on. Typically the compiler is going to do a better job of generating optimal code that works for the specific hardware platform it is targeting.

But branch predictors rely on identifying patterns, and if an application is written in such a way that patterns don't exist, then the branch predictor will perform poorly. If the application can be modified so that there is a pattern then the branch predictor has a chance to do better. And that is something you might be able to consider at the level of a high-level language, if you find a situation where a branch really is being poorly predicted.
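Here is a sketch of the pattern from that famous question (details simplified; the exact constants are mine): the branch on data[i] >= 128 is essentially random on unsorted data and almost perfectly predictable once the data is sorted:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    enum { COUNT = 32768 };
    static int data[COUNT];
    for (int i = 0; i < COUNT; i++)
        data[i] = rand() % 256;

    /* Comment this line out to see the unsorted (poorly predicted) case. */
    qsort(data, COUNT, sizeof data[0], cmp_int);

    long long sum = 0;
    clock_t t0 = clock();
    for (int pass = 0; pass < 10000; pass++)
        for (int i = 0; i < COUNT; i++)
            if (data[i] >= 128)   /* predictable only if data is sorted */
                sum += data[i];
    printf("%.3f s, sum = %lld\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC, sum);
    return 0;
}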

Is it possible to tell the branch predictor how likely it is to follow the branch?

Yes. http://kerneltrap.org/node/4705

The __builtin_expect is a method that GCC (versions >= 2.96) offers for programmers to indicate branch prediction information to the compiler. The return value of __builtin_expect is the first argument (which can only be an integer) passed to it.

if (__builtin_expect (x, 0))
    foo ();

[This] would indicate that we do not expect to call `foo', since we
expect `x' to be zero.

What is the advantage of GCC's __builtin_expect in if else statements?

Imagine the assembly code that would be generated from:

if (__builtin_expect(x, 0)) {
    foo();
    ...
} else {
    bar();
    ...
}

I guess it should be something like:

    cmp   $0, x
    jne   _foo
_bar:
    call  bar
    ...
    jmp   after_if
_foo:
    call  foo
    ...
after_if:

You can see that the instructions are arranged in such an order that the bar case precedes the foo case (as opposed to the C code). This can utilise the CPU pipeline better, since a jump thrashes the already fetched instructions.

Before the jump is executed, the instructions below it (the bar case) have already been pushed into the pipeline. Since the foo case is unlikely, taking the jump is unlikely too, hence thrashing the pipeline is unlikely.
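To see this layout effect yourself, you can compile a small file like the hypothetical one below with `gcc -O2 -S` and inspect the generated assembly; with the hint, GCC typically keeps the bar() path on the fall-through side and moves the foo() path out of the hot path:

/* Hypothetical test file: compile with `gcc -O2 -S expect.c` and read the
 * generated assembly to see how the branch is laid out. */
void foo(void);
void bar(void);

void dispatch(int x) {
    if (__builtin_expect(x, 0)) {
        foo();   /* expected cold: usually moved away from the hot path */
    } else {
        bar();   /* expected hot: falls through without a taken jump */
    }
}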

How to suggest the more probable branch to the GCC compiler

If you want to suggest to GCC that a condition is likely to have a given value as a hint for arranging code flow, you should use __builtin_expect( ):

if (__builtin_expect(almost_always_false_condition, 0)) {
    // do something
}

However, it sounds like you want to find a way to avoid evaluating the condition, which __builtin_expect( ) won't do. Is there a way that you can quickly approximate the condition, and only do the full check when the approximation is true?

if (__builtin_expect(fastCheckThatIsTrueIfFullConditionIsTrue, 0)) {
    // most of the time, we don't even get here, so we don't need
    // to evaluate the full condition
    if (almostAlwaysFalseCondition) {
        // do something
    }
}

Can you tell us more about what the condition is?
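As a concrete (hypothetical) instance of that pattern, a cheap bounding-box test can guard an expensive exact hit test; all types and names below are invented for illustration. The guard is valid because any point inside the polygon is certainly inside its bounding box:

#include <stdbool.h>

typedef struct { double x, y; } Point;
typedef struct { double minx, miny, maxx, maxy; /* ...plus vertex data */ } Shape;

/* Stand-in for the expensive exact test; a real one would walk the edges. */
static bool exact_hit_test(const Shape *s, Point p) {
    (void)s; (void)p;
    return true;  /* placeholder */
}

static bool in_bbox(const Shape *s, Point p) {
    return p.x >= s->minx && p.x <= s->maxx &&
           p.y >= s->miny && p.y <= s->maxy;
}

bool hit_test(const Shape *s, Point p) {
    /* Most points miss the bounding box, so most calls never pay for the
     * full test; the hint also keeps the cold path out of line. */
    if (__builtin_expect(in_bbox(s, p), 0))
        return exact_hit_test(s, p);
    return false;
}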

Branch-aware programming

people often ... and state that programmers usually can predict where a branch could go

(*) Experienced programmers often point out that human programmers are very bad at predicting that.

1- Is it possible to avoid branch mispredictions using some high level programming technique (i.e. no assembly)?

Not in standard C++ or C. At least not for a single branch. What you can do is minimize the depth of your dependency chains so that a branch mis-prediction has less effect. Modern CPUs speculatively execute down the predicted path of a branch and drop the work if the prediction turns out wrong. There's a limit to this, however, which is why branch prediction mainly matters in deep dependency chains.
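One related high-level technique that can help is writing a selection without a branch at all, so there is nothing to mispredict. This is a minimal sketch (function names are mine); at -O2, compilers commonly lower the ternary form to a conditional move:

#include <stdio.h>

/* Branch-free maximum: a ternary like this is typically compiled to a
 * conditional move (cmov) at -O2, so there is no branch to mispredict. */
static inline int max_branchless(int a, int b) {
    return a > b ? a : b;
}

/* Arithmetic variant for when the compiler keeps the branch anyway:
 * mask is all-ones when a > b, all-zeros otherwise. */
static inline int max_masked(int a, int b) {
    int mask = -(a > b);
    return (a & mask) | (b & ~mask);
}

int main(void) {
    printf("%d %d\n", max_branchless(3, 7), max_masked(3, 7));
    return 0;
}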

Some compilers provide extensions for suggesting the prediction manually, such as __builtin_expect in GCC. Here is a stackoverflow question about it. Even better, some compilers (such as GCC) support profiling the code and automatically detecting the optimal predictions. It's smart to use profiling rather than manual annotation because of (*).

2- What should I keep in mind to produce branch-friendly code in a high level programming language (I'm mostly interested in C and C++)?

Primarily, you should keep in mind that branch mis-prediction is only going to affect you in the most performance-critical parts of your program, so don't worry about it until you've measured and found a problem.

But what can I do when some profiler (valgrind, VTune, ...) tells that on line n of foo.cpp I got a branch prediction penalty?

Lundin gave very sensible advice:

  1. Measure to find out whether it matters.
  2. If it matters, then

    • Minimize the depth of dependency chains of your calculations. How to do that can be quite complicated, is beyond my expertise, and there's not much you can do without diving into assembly. What you can do in a high-level language is to minimize the number of conditional checks (**). Otherwise you're at the mercy of compiler optimizations. Avoiding deep dependency chains also allows more efficient use of out-of-order superscalar processors.
    • Make your branches consistently predictable. The effect of that can be seen in this stackoverflow question. In the question, there is a loop over an array, and the loop contains a branch that depends on the size of the current element. When the data was sorted, the loop was demonstrably much faster when compiled with a particular compiler and run on a particular CPU. Of course, keeping all your data sorted also costs CPU time, possibly more than the branch mis-predictions do, so measure.
  3. If it's still a problem, use profile-guided optimization (if available); a typical GCC invocation is sketched below.

Order of 2. and 3. may be switched. Optimizing your code by hand is a lot of work. On the other hand, gathering the profiling data can be difficult for some programs as well.
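For step 3, the usual GCC profile-guided workflow looks like the sketch below (the file and workload names are placeholders); the instrumented run records branch outcomes that the second compilation uses to arrange the code:

    gcc -O2 -fprofile-generate foo.c -o foo    # build an instrumented binary
    ./foo typical_workload                     # run it on representative input
    gcc -O2 -fprofile-use foo.c -o foo         # rebuild using the recorded profile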

(**) One way to do that is to transform your loops, for example by unrolling them. You can also let the optimizer do it automatically. You must measure, though, because unrolling affects the way you interact with the cache and may well end up being a pessimization.
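To illustrate (**), here is a hedged sketch of manual 4-way unrolling: the loop condition is checked once per four elements, and the four independent accumulators also shorten the dependency chain mentioned above. Measure before keeping it; at -O2/-O3 compilers often unroll or vectorize the plain loop on their own:

/* Sketch: summing an array with the loop unrolled 4x, so the loop branch
 * is evaluated once per four elements instead of once per element. */
long long sum_unrolled(const int *a, long n) {
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    long i = 0;
    for (; i + 4 <= n; i += 4) {   /* one branch per four elements */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)             /* remainder */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}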


