How Does Branch Prediction Affect Performance in R

How does Branch Prediction affect performance in R?

Interpreter overhead (the cost of simply being an interpreter) explains most of the average difference. I don't have an explanation for the higher variance.


R is an interpreted language, not JIT compiled to machine code like Java, or ahead-of-time like C. (I don't know much about R internals, just CPUs and performance, so I'm making a lot of assumptions here.)

The code that's running on the actual CPU hardware is the R interpreter, not exactly your R program.

Control dependencies in the R program (like an if()) become data dependencies in the interpreter. The current thing being executed is just data for the interpreter running on a real CPU.

Different operations in the R program become control dependencies in the interpreter. For example, evaluating myvec[i] then the + operator would probably be done by two different functions in the interpreter. And a separate function for > and for if() statements.

The classic interpreter loop is based around an indirect branch that dispatches through a table of function pointers. Instead of a taken/not-taken choice, the CPU needs a prediction for one of many recently-used target addresses. I don't know whether R uses a single indirect branch like that, or whether it tries to be fancier, like having the end of each interpreter block dispatch to the next one instead of returning to a main dispatch loop.
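
To make that concrete, here is a minimal C++ sketch of such a dispatch loop. It is not R's actual interpreter; the opcodes, the Insn struct, and run() are all hypothetical, just to show how the interpreted program's control flow becomes data feeding a single indirect branch:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bytecode; R's real internals differ.
enum Op : std::uint8_t { OP_PUSH, OP_ADD, OP_GT, OP_JUMP_IF_FALSE, OP_HALT };

struct Insn { Op op; std::int32_t arg; };

// Classic dispatch loop: the switch compiles to an indirect jump through a table
// of handler addresses. The interpreted program's control flow (its if(), its
// operators) is just data (the op field) feeding that one indirect branch.
std::int64_t run(const std::vector<Insn>& code, std::vector<std::int64_t>& stack)
{
    std::size_t pc = 0;
    for (;;)
    {
        const Insn& ins = code[pc++];
        switch (ins.op)                          // the indirect dispatch branch
        {
        case OP_PUSH:
            stack.push_back(ins.arg);
            break;
        case OP_ADD: {
            std::int64_t b = stack.back(); stack.pop_back();
            stack.back() += b;
            break;
        }
        case OP_GT: {
            std::int64_t b = stack.back(); stack.pop_back();
            stack.back() = (stack.back() > b);
            break;
        }
        case OP_JUMP_IF_FALSE: {                 // an interpreted if(): a data-driven pc update
            std::int64_t cond = stack.back(); stack.pop_back();
            if (!cond) pc = static_cast<std::size_t>(ins.arg);
            break;
        }
        case OP_HALT:
            return stack.empty() ? 0 : stack.back();
        }
    }
}

Threaded-code variants instead replicate that indirect jump at the end of every handler, which gives the predictor more distinct branch sites to learn from.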

Modern Intel CPUs (like Haswell and later) have an ITTAGE-like (Indirect Target TAgged GEometric history length) branch predictor. The taken/not-taken state of previous branches along the path of execution is used as an index into a table of predictions. This mostly solves the interpreter branch-prediction problem, allowing it to do a surprisingly good job, especially when the interpreted code (the R code in your case) does the same thing repeatedly.

  • Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore (2015) - Haswell's ITTAGE-like predictor is a huge improvement for interpreters, invalidating the previous wisdom that a single indirect branch for interpreter dispatch was a disaster. I don't know what R actually uses; there are dispatch tricks that used to be useful on older predictors.
  • X86 prefetching optimizations: "computed goto" threaded code - another Q&A with more links.
  • https://comparch.net/2013/06/30/why-tage-is-the-best/
  • https://danluu.com/branch-prediction/ has some links about that at the bottom. Also mentions that AMD has used Perceptron predictors in Bulldozer-family and Zen: like a neural net.

The if() being taken does result in needing to do different operations, so some branching in the R interpreter is still more or less predictable depending on the data. But of course, as an interpreter, it's doing much more work at each step than a simple machine-code loop over an array.

So extra branch mispredicts are a much smaller fraction of the total time because of interpreter overhead.


Of course, both your tests are with the same interpreter on the same hardware. I don't know what kind of CPU you have.

If it's Intel older than Haswell or AMD older than Zen, you might be getting a lot of mispredicts even with the sorted array, unless the pattern is simple enough for an indirect branch history predictor to lock onto. That would bury the difference in more noise.

Since you do see a pretty clear difference, I'm guessing that the CPU doesn't mispredict too much in the sorted case, so there is room for it to get worse in the unsorted case.

How does branch prediction speed up anything?

Due to the pipelined nature of modern CPUs, new instructions begin to be processed before previous instructions have finished processing. The exact number in flight varies depending on the CPU architecture and the type of instruction. The reason for pipelining is to make the CPU more efficient in the utilisation of its components, improving instruction throughput. For example, without pipelining, the circuitry designed to fetch the next instruction would lie idle for at least a few cycles whilst the previous instruction carries out its stages (things like source register read, data cache access, arithmetic execution, etc.).

Pipelining introduces its own challenges though: one example is how the instruction-fetch stage should know which instruction to fetch next when there is a conditional jump in the pipeline. A conditional jump (such as the one necessitated by your if above) requires the evaluation of a condition to determine which instruction to fetch next; however, that evaluation happens several stages later in the pipeline. While the jump moves through the pipeline stages, the pipeline must keep going and new instructions must keep being loaded, otherwise you would lose efficiency by waiting until the outcome of the condition is known (a pipeline stall: a condition CPUs try to avoid). Without knowing for sure where the next instructions should be coming from, the CPU has to guess: this is known as branch prediction. If it guesses correctly, the pipeline can keep going at full tilt after the condition has been evaluated and the target jump address confirmed. If it guesses wrong, the pipeline must be cleared of all instructions started after the conditional jump and restarted from the correct target jump address: an expensive event that efficient branch prediction algorithms try to minimize.

Applying this to your example above: if branch prediction correctly guesses the outcome of condition() a large percentage of the time, the following execution (of either doA() or doB()) continues without a pipeline flush; otherwise the conditional statement imposes a performance hit. This can occur if the evaluation of condition() is random from call to call, or otherwise follows a pattern that the branch prediction algorithm finds hard to predict.
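
As a concrete illustration, here is a self-contained C++ sketch (not the code from your question: data[i] >= 128 stands in for condition(), and doA()/doB() collapse into adding or skipping). The same branch is timed on hard-to-predict versus easy-to-predict data:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Same branch, fed hard-to-predict vs. easy-to-predict data.
int main()
{
    std::vector<int> data(1 << 20);
    std::mt19937 rng(42);
    for (int& x : data) x = static_cast<int>(rng() % 256);

    auto run = [&](const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int rep = 0; rep < 100; ++rep)
            for (int x : data)
                if (x >= 128)          // the stand-in for condition(); taking it is "doA", skipping it is "doB"
                    sum += x;
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("%-28s sum=%lld  %lld ms\n", label, sum, static_cast<long long>(ms));
    };

    run("random order (mispredicts):");
    std::sort(data.begin(), data.end());   // now the branch outcome forms one long run
    run("sorted order (predictable):");
}

Note that at higher optimisation levels a compiler may turn this particular if into a branchless conditional move, which shrinks or removes the gap.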

How to deal with branch prediction when using a switch case in CPU emulation

On the contrary, switch statements are likely to be converted to jump tables, which means they perform possibly a few ifs (for range checking) and a single indirect jump. The ifs shouldn't cause a problem with branch prediction because it is unlikely you will have a bad op-code. The jump is not so friendly to the pipeline, but in the end it's only one for the whole switch statement.

I don't believe you can convert a long switch statement of op-codes into any other form that would result in better performance. This is, of course, assuming your compiler is smart enough to convert it to a jump table. If not, you can do so manually.
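
As a rough sketch of what doing it manually might look like (hypothetical opcodes and handlers, not any real emulator), a table of handler function pointers indexed by op-code gives the same one-range-check-plus-one-indirect-jump shape as a compiler-generated jump table:

#include <cstdint>

// Hypothetical emulator state and opcodes; a real instruction set is far larger.
struct Cpu { std::int64_t a = 0, b = 0; };

using Handler = void (*)(Cpu&);

static void op_add(Cpu& c) { c.a += c.b; }
static void op_sub(Cpu& c) { c.a -= c.b; }
static void op_nop(Cpu&)   { }

// Manual "jump table": one range check plus one indirect call per op-code,
// the same shape a compiler-generated jump table for a dense switch has.
static const Handler dispatch_table[3] = { op_add, op_sub, op_nop };

void step(Cpu& cpu, std::uint8_t opcode)
{
    if (opcode < 3)                   // range check: easy to predict for valid code
        dispatch_table[opcode](cpu);  // the single indirect branch for the whole dispatch
}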

If in doubt, implement other methods and measure performance.

Edit

First of all, make sure you don't confuse branch prediction and branch target prediction.

Branch prediction works solely on conditional branches: it decides whether the branch condition will fail or succeed. It has nothing to do with predicting where a jump lands.

Branch target prediction, on the other hand, tries to guess where the jump will end up.

So, your statement "there's no way the branch predictor can predict the jump" should be "there's no way the branch target predictor can predict the jump".

In your particular case, I don't think you can actually avoid this. If you had a very small set of operations, perhaps you could come up with a formula that covers all your operations, like those made in logic circuits. However, with an instruction set as big as a CPU's, even if it were RISC, the cost of that computation is much higher than the penalty of a single jump.
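
As a toy illustration of that "formula" idea (a hypothetical two-operation set, nothing like a real instruction set), the dispatch can be folded into arithmetic with no branch at all:

#include <cstdint>

// Toy example only: op-code 0 = add, op-code 1 = subtract.
// A single arithmetic formula replaces the dispatch branch entirely,
// but this clearly does not scale to a realistically sized instruction set.
std::int64_t alu(std::uint8_t opcode, std::int64_t a, std::int64_t b)
{
    std::int64_t sign = 1 - 2 * static_cast<std::int64_t>(opcode);  // 0 -> +1, 1 -> -1
    return a + sign * b;                                            // no branch at all
}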

Is branch prediction still significantly speeding up array processing?

You're a victim of the as-if rule:

... conforming implementations are required to emulate (only) the observable behavior of the abstract machine ...

Consider the function under test ...

const size_t arraySize = 32768;
int *data;

long long test()
{
    long long sum = 0;
    for (size_t i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (size_t c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return sum;
}

And the generated assembly (VS 2017, x86_64 /O2 mode)

The machine does not execute your loops; instead, it executes a similar program that does this:

long long test()
{
    long long sum = 0;
    // Primary loop
    for (size_t c = 0; c < arraySize; ++c)
    {
        for (size_t i = 0; i < 20000; ++i)
        {
            if (data[c] >= 128)
                sum += data[c] * 5;
        }
    }
    return sum;
}

Observe how the optimizer reversed the order of the loops and defeated your benchmark.

Obviously the latter version is much more branch-predictor-friendly.

You can in turn defeat the loop hoisting optimization by introducing a dependency in the outer loop:

long long test()
{
    long long sum = 0;
    for (size_t i = 0; i < 100000; ++i)
    {
        sum += data[sum % 15];  // <== dependency!
        // Primary loop
        for (size_t c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return sum;
}

Now this version again exhibits a massive difference between sorted and unsorted data. On my system (i7-7700) it runs in 1.6 s vs. 11 s (roughly 700%).

Conclusion: the branch predictor is more important than ever these days, when we are facing unprecedented pipeline depths and instruction-level parallelism.

> vs. >= in bubble sort causes significant performance difference

I think it may indeed be due to branch prediction. If you count the number of swaps compared to the number of inner sort iterations you find:

Limit = 10

  • A = 560M swaps / 1250M loops
  • B = 1250M swaps / 1250M loops (0.02% fewer swaps than loops)

Limit = 50000

  • A = 627M swaps / 1250M loops
  • B = 850M swaps / 1250M loops

So in the Limit == 10 case the swap is performed 99.98% of the time in the B sort, which is obviously favourable for the branch predictor. In the Limit == 50000 case the swap is only hit about 68% of the time, essentially at random, so the branch predictor is less beneficial.
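
For reference, a minimal sketch of what the two inner loops presumably look like (mapping A to > and B to >= is my assumption based on the swap counts above):

#include <cstddef>
#include <utility>

// Variant A swaps on '>', variant B on '>=' (my assumed mapping to the question's labels).
// With values limited to a small range (Limit == 10) many neighbours are equal, so
// B's swap branch is taken almost every iteration (very predictable), while A's
// outcome is far less regular.
void bubble_sort_a(int* a, std::size_t n)
{
    for (std::size_t i = 0; i + 1 < n; ++i)
        for (std::size_t j = 0; j + 1 < n - i; ++j)
            if (a[j] > a[j + 1])            // variant A: equal neighbours are left alone
                std::swap(a[j], a[j + 1]);
}

void bubble_sort_b(int* a, std::size_t n)
{
    for (std::size_t i = 0; i + 1 < n; ++i)
        for (std::size_t j = 0; j + 1 < n - i; ++j)
            if (a[j] >= a[j + 1])           // variant B: equal neighbours are swapped too
                std::swap(a[j], a[j + 1]);
}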


