How to Measure Mispredictions for a Single Branch on Linux

How to count branch mispredictions?

It depends on the processor you're using. In general, cpuid can be used to obtain a lot of information about the processor, and what cpuid does not provide is typically accessible via SMBIOS or other regions of memory.

Doing this in code at a general level, without the processor's support functions and manuals, will not tell you as much as you want with a great degree of certainty, but it may be useful as an estimate depending on what you're looking for and how your code is compiled, e.g. which flags you use during compilation.

In general, this falls under what is referred to as speculative execution, which is typically not observable by programs: work that transitions through the pipeline and is later determined to be unneeded is simply discarded.

Depending on how you use specific instructions in your program, you may be able to exploit such stale cache state, for better or worse, but the logic involved varies greatly depending on the CPU in use.

See also Spectre and Rowhammer for interesting examples of using such techniques to cross privilege boundaries.

See the comments below for links to code covering the use of cpuid, as well as rdrand, rdseed, rdtsc, and a few others.
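As a starting point, GCC and Clang ship a small <cpuid.h> wrapper around the cpuid instruction; a minimal sketch (x86 only, using the standard leaf-0 vendor/max-leaf query):

/* Minimal sketch of querying CPU identity with GCC/Clang's <cpuid.h>.
   Leaf 0 returns the maximum supported leaf in eax and packs the
   12-byte vendor string into ebx, edx, ecx (in that order). */
#include <cpuid.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    char vendor[13] = {0};

    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 1;   /* cpuid not supported at this leaf */

    memcpy(vendor,     &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    printf("max leaf: %u, vendor: %s\n", eax, vendor);
    return 0;
}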

It's not completely clear what you're looking for, but the above should get you started and provide some useful examples.

See also Branch mispredictions

Measure how often a branch is mispredicted

If you are using an AMD CPU, AMD's CodeAnalyst is just what you need (it works on Windows and Linux)*.

If you're not, then you may need to fork out for a VTune licence, or build something using the on-CPU performance registers and counters detailed in the instruction manuals.

You can also check out gprof and OProfile (Linux only) and see how well they work for you (I've never used these, but I see them referred to quite a bit).

*CodeAnalyst should work on an Intel CPU, you just don't get all the nice CPU-level analysis.
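If you'd rather roll your own on Linux, the kernel's perf events interface exposes the hardware counters directly. A minimal sketch of counting branch misses around a region of code (error handling is trimmed, and the measured region is a placeholder you would replace):

/* Read the hardware branch-miss counter around a code region via the
   Linux perf_event_open syscall (there is no glibc wrapper, so we call
   syscall(2) directly). */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_BRANCH_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;   /* count user-space misses only */
    attr.exclude_hv = 1;

    /* pid 0, cpu -1: measure this process on whichever CPU it runs on */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the code whose branches you want to measure here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("branch misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}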

Is it possible to tell the branch predictor how likely it is to follow the branch?

Yes. http://kerneltrap.org/node/4705

__builtin_expect is a method that gcc (versions >= 2.96) offers for programmers to indicate branch prediction information to the compiler. The return value of __builtin_expect is its first argument (which must be an integer).

if (__builtin_expect (x, 0))
    foo ();

[This] would indicate that we do not expect to call `foo', since we expect `x' to be zero.
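For readability, this builtin is usually hidden behind macros, as the Linux kernel does with likely() and unlikely(); a minimal sketch (handle_error is a hypothetical function):

/* Wrappers in the style of the Linux kernel's likely()/unlikely().
   The !!(x) normalizes any nonzero value to exactly 1. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Marking the error path as unlikely lets the compiler lay out the
   common path as straight-line fall-through code. */
if (unlikely(err != 0))
    handle_error(err);   /* hypothetical error handler */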

How can I pinpoint if the slowness in my program is a CPU cache issue (on Linux)?

The Key Metric

Your culprit is branch misses:

440,171,927      branch-misses             #   13.85% of all branches

vs.

71,034,122      branch-misses             #    2.24% of all branches

I'm not sure which processor you're running, but suppose hypothetically that it is a 2.5 GHz Haswell part: each branch misprediction costs around 15 cycles (usually a bit more, because other things stall too), and each cycle is 0.4 ns. So, 0.4 ns/cycle * 15 cycles/missed branch * (440,171,927 - 71,034,122) branch misses ≈ 2.2 seconds. It will depend on your exact processor and the machine code, but this explains the bulk of the difference right there.

Cause

The branch prediction algorithms of different chips are proprietary, but if you read Agner Fog's microarchitecture guide (http://www.agner.org/optimize/microarchitecture.pdf) you can learn more about different processors and their limitations. Essentially, what you'll find is that certain processors do a better job of avoiding collisions in the branch prediction tables they maintain to predict whether branches are taken or not.

So, why is that relevant? Well, what has happened (99% chance) is that by rearranging your program very slightly, you changed where exactly the different branches sit in memory. Too many of them now map to the same bucket in the processor's branch prediction tables, so they evict each other's history. By changing the executable slightly, this problem goes away. You have to have a very specific distance between the two branch points to trigger this issue, and you have unluckily managed to do just that.

Simple Solution

As you found, many changes will in fact cause the program not to hit this degraded performance. Essentially, anything that changes the distance between the two critical branch points will fix the problem. You can accomplish this by literally inserting 16 bytes of machine-code nops (or enough to move the branch points to different buckets) somewhere between the two places, as sketched below. You can also do what you did: change something else that shifts this distance to a non-pathological value.
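As a crude illustration (assuming GCC or Clang on x86; 16 bytes is only a guess, since the distance that matters depends on how the particular processor indexes its prediction tables):

/* Hypothetical padding: emit 16 one-byte nops into the instruction
   stream to shift every later branch address by 16 bytes. */
/* ... first hot branch somewhere above ... */
__asm__ __volatile__ (".rept 16\n\tnop\n\t.endr");
/* ... second hot branch somewhere below ... */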

Digging Deeper

If you want to truly understand what causes it in this situation, you'll need to get your hands dirty. Fun! You will need to select one of many tools to find the specific branch that is being mispredicted. Here is one way: How to measure mispredictions for a single branch on Linux?

After you identify the mispredicted branch, you can figure out whether there is a way to remove the branch entirely (almost always a good idea for performance anyway), as sketched below. If not, you can place a hint for it or, worst case, just move things around so that the same predictor entry isn't shared, as suggested above.
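For example, a simple data-dependent branch can often be removed outright; a minimal sketch (the array and predicate are invented for illustration):

/* Branchy version: the conditional jump mispredicts whenever the
   data is unpredictable. */
if (a[i] < pivot)
    count++;

/* Branchless version: add the comparison result (0 or 1) directly.
   Compilers typically emit setcc or cmov here, with no jump at all. */
count += (a[i] < pivot);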

Broader lessons

Programmers often underestimate the cost of branching (when the compiler is unable to remove the branches at compile time). As you've discovered, each branch puts more strain on a processor's branch prediction buffers and increases the chance of mispredictions. So even branches that are 100% predictable to the processor can degrade performance by reducing the resources available for predicting other branches. Further, when a branch is mispredicted, it costs a minimum of 12-15 cycles, and can cost much more if the needed instructions aren't in the L1 cache, L2 cache, L3 cache, or, heaven help you, main memory. Additionally, the compiler cannot apply some optimizations across branches.


