How to count branch mispredictions?
It depends on the processor your using, in general cpuid can be used to obtain a lot of information about the processor and what cpuid does not provide is typically accessible via smbios or other regions of memory.
Doing this in code on a general level without the processor support functions and manual will not tell you as much as you want to a great degree of certainty but may be useful as an estimate depending on what your looking for and how you have your code compiled e.g. the flags you use during compilation etc.
In general, what is referred to as specular or speculative execution and is typically not observed by programs as their logic which transitions through the pipeline is determined to be not used is then discarded.
Depending on how you use specific instructions in your program you may be able to use such stale cache information for better or worse but the logic therein would vary greatly depending on the CPU in use.
See also Spectre and RowHammer for interesting examples of using such techniques for privileged execution.
See the comments below for links which have code related to the use of cpuid as well as rdrand, rdseed and a few others. (rdtsc)
It's not completely clear what your looking for perhaps but will surely get you started and provide some useful examples.
See also Branch mispredictions
Measure how often a branch is mispredicted
If you are using an AMD CPU, AMD's CodeAnalyst is just what you need (works on windows and Linux)*.
if your not, then you may need to fork out for a VTune licence or build something using the on CPU performance registers and counters details in the instruction manuals.
You can also check out gperf & OProfile (linux only), see how well they perform (I've never used these, but I see them referred to quite a bit).
*CodeAnalyst should work on an Intel CPU, you just don't get all then nice CPU level analysis.
Is it possible to tell the branch predictor how likely it is to follow the branch?
Yes. http://kerneltrap.org/node/4705
The
__builtin_expect
is a method that
gcc (versions >= 2.96) offer for
programmers to indicate branch
prediction information to the
compiler. The return value of
__builtin_expect
is the first argument (which could only be an integer)
passed to it.
if (__builtin_expect (x, 0))
foo ();
[This] would indicate that we do not expect to call `foo', since we
expect `x' to be zero.
How can I pinpoint if the slowness in my program is a CPU cache issue (on Linux)?
The Key Metric
Your culprit is branch misses:
440,171,927 branch-misses # 13.85% of all branches
vs.
71,034,122 branch-misses # 2.24% of all branches
I'm not sure which processor your'e running, but if you are hypothetically running a 2.5 GHz processor on Haswell, you'll see that each branch prediction miss costs around 15 cycles (usually a bit more because other stuff stalls too), and each cycle is 0.4ns. So, 0.4ns/cycle * 15 cycles/missed branch * (440,171,927 - 71,034,122) branch misses = 2.2 seconds. It will depend on your exact processor and the machine code, but this explains the bulk of the difference right there.
Cause
The branch prediction algorithms of different chips are proprietary, but if you research here ( http://www.agner.org/optimize/microarchitecture.pdf) you can learn more about different processors and there limitations. Essentially, what you'll find is that certain processors do a better job of avoiding collisions in the branch prediction tables that they store to make predictions about branches taken or not.
So, why is that relevant? Well, what has happened (99% chance) is that by rearranging your program very slightly, you change where exactly different branches are in terms of memory locations. There are too many that map to the same bucket in the branch prediction tables for the processor. By changing the executable slightly, this problem goes away. You have to have a very specific distance between the two branch points to trigger this issue. You have unluckily managed to do that.
Simple Solution
As you found, many changes will in fact cause the program to not hit this degraded performance. Essentially, anything that changes the distance between the two critical branch points will fix the problem. You can accomplish this by literally inserting 16 bytes (or enough to move the branch points to different buckets) of machine code nops somewhere between the two places. You can also do as you did and change something that will disrupt this distance to a non-pathological case.
Digging Deeper
If you want to truly understand what causes it in this situation, you'll need to get your hands dirty. Fun! You will need to select one of many tools to find the specific branch that is being mispredicted. Here is one way: How to measure mispredictions for a single branch on Linux?
After you identify the mispredicted branch, you can figure out if there is a way to remove the branch (almost always a good idea for performance anyway). If not, you can place a hint for it or, worst case, just move things around to ensure that the same entry isn't shared as previously suggested.
Broader lessons
Programmers underestimate the cost of branching (when the compiler is unable to remove the branches at compile time). As you've discovered, each branch puts more strain on the branch prediction buffers of a processor and increases the chance of mispredictions. So, even branches that are 100% predictable to the processor can degrade performance by reducing the resources available for predicting other branches. Further, when a branch is mispredicted, it costs a minimum of 12-15 cycles and can be much more if the needed instructions aren't in the L1 cache, L2 cache, L3 cache, or heaven help you, main memory. Additionally, the compiler can't make all optimizations across branches.
Related Topics
What's The Purpose of Mmap Memory Protection Prot_None
Set Environment Variables in an Aws Instance
Linux Kernel - How to Get Physical Address (Memory Management)
Can't Remove, Purge, Unistall Mongodb from Debian
Autoconf Complains "C Compiler Cannot Create Executables" on Linux Mint
Stdin into Zip Command, How to Specify a File Name
Can Upstart Expect/Respawn Be Used on Processes That Fork More Than Twice
Any Good Tools to Solve Integer Programs on Linux
Package Tar.Gz into a Shell Script
Difference Between "Machine Hardware" and "Hardware Platform"
Linux - Check If There Is an Empty Line at The End of a File
Qemu on Raspberry Pi Arch Linux Latest Sd Image
Checking If a Screen of The Specified Name Exists
Sed Regex Problem on Mac, Works Fine on Linux