Why is processing a sorted array faster than processing an unsorted array?
You are a victim of branch prediction fail.
What is Branch Prediction?
Consider a railroad junction:
Image by Mecanismo, via Wikimedia Commons. Used under the CC-By-SA 3.0 license.
Now for the sake of argument, suppose this is back in the 1800s - before long-distance or radio communication.
You are the operator of a junction and you hear a train coming. You have no idea which way it is supposed to go. You stop the train to ask the driver which direction they want. And then you set the switch appropriately.
Trains are heavy and have a lot of inertia, so they take forever to start up and slow down.
Is there a better way? You guess which direction the train will go!
- If you guessed right, it continues on.
- If you guessed wrong, the captain will stop, back up, and yell at you to flip the switch. Then it can restart down the other path.
If you guess right every time, the train will never have to stop.
If you guess wrong too often, the train will spend a lot of time stopping, backing up, and restarting.
Consider an if-statement: At the processor level, it is a branch instruction:
You are a processor and you see a branch. You have no idea which way it will go. What do you do? You halt execution and wait until the previous instructions are complete. Then you continue down the correct path.
Modern processors are complicated and have long pipelines. This means they take forever to "warm up" and "slow down".
Is there a better way? You guess which direction the branch will go!
- If you guessed right, you continue executing.
- If you guessed wrong, you need to flush the pipeline and roll back to the branch. Then you can restart down the other path.
If you guess right every time, the execution will never have to stop.
If you guess wrong too often, you spend a lot of time stalling, rolling back, and restarting.
This is branch prediction. I admit it's not the best analogy since the train could just signal the direction with a flag. But in computers, the processor doesn't know which direction a branch will go until the last moment.
How would you strategically guess to minimize the number of times that the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every three times, you guess the same...
In other words, you try to identify a pattern and follow it. This is more or less how branch predictors work.
Most applications have well-behaved branches. Therefore, modern branch predictors will typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.
Further reading: "Branch predictor" article on Wikipedia.
As hinted from above, the culprit is this if-statement:
if (data[c] >= 128)
sum += data[c];
if (data[c] >= 128)
sum += data[c];
Notice that the data is evenly distributed between 0 and 255. When the data is sorted, roughly the first half of the iterations will not enter the if-statement. After that, they will all enter the if-statement.
This is very friendly to the branch predictor since the branch consecutively goes the same direction many times. Even a simple saturating counter will correctly predict the branch except for the few iterations after it switches direction.
T = branch taken
N = branch not taken
data = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N N N N N ... N N T T T ... T T T ...
= NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT (easy to predict)
However, when the data is completely random, the branch predictor is rendered useless, because it can't predict random data. Thus there will probably be around 50% misprediction (no better than random guessing).
data = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118, 14, 150, 177, 182, ...
branch = T, T, N, T, T, T, T, N, T, N, N, T, T, T ...
= TTNTTTTNTNNTTT ... (completely random - impossible to predict)
What can be done?
If the compiler isn't able to optimize the branch into a conditional move, you can try some hacks if you are willing to sacrifice readability for performance.
if (data[c] >= 128)
sum += data[c];
int t = (data[c] - 128) >> 31;
sum += ~t & data[c];
This eliminates the branch and replaces it with some bitwise operations.
(Note that this hack is not strictly equivalent to the original if-statement. But in this case, it's valid for all the input values of
Benchmarks: Core i7 920 @ 3.5 GHz
C++ - Visual Studio 2010 - x64 Release
|Branching - Random data||11.777|
|Branching - Sorted data||2.352|
|Branchless - Random data||2.564|
|Branchless - Sorted data||2.587|
Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?
Several of the answers in the question you link talk about rewriting the code to be branchless and thus avoiding any branch prediction issues. That's what your updated compiler is doing.
Specifically, clang++ 10 with
-O3 vectorizes the inner loop. See the code on godbolt, lines 36-67 of the assembly. The code is a little bit complicated, but one thing you definitely don't see is any conditional branch on the
data[c] >= 128 test. Instead it uses vector compare instructions (
pcmpgtd) whose output is a mask with 1s for matching elements and 0s for non-matching. The subsequent
pand with this mask replaces the non-matching elements by 0, so that they do not contribute anything when unconditionally added to the sum.
The rough C++ equivalent would be
sum += data[c] & -(data[c] >= 128);
The code actually keeps two running 64-bit
sums, for the even and odd elements of the array, so that they can be accumulated in parallel and then added together at the end of the loop.
Some of the extra complexity is to take care of sign-extending the 32-bit
data elements to 64 bits; that's what sequences like
pxor xmm5, xmm5 ; pcmpgtd xmm5, xmm4 ; punpckldq xmm4, xmm5 accomplish. Turn on
-mavx2 and you'll see a simpler
vpmovsxdq ymm5, xmm5 in its place.
The code also looks long because the loop has been unrolled, processing 8 elements of
data per iteration.
Why is processing a sorted array not faster than an unsorted array in Python?
I may be wrong, but I see a fundamental difference between the linked question and your example: Python interprets bytecode, C++ compiles to native code.
In the C++ code that
if translates directly to a
jl sequence, that can be considered by the CPU branch predictor as a single "prediction spot", specific to that cycle.
In Python that comparison is actually several function calls, so there's (1) more overhead and (2) I suppose the code that performs that comparison is a function into the interpreter used for every other integer comparison - so it's a "prediction spot" not specific to the current block, which gives the branch predictor a much harder time to guess correctly.
Edit: also, as outlined in this paper, there are way more indirect branches inside an interpreter, so such an optimization in your Python code would probably be buried anyway by the branch mispredictions in the interpreter itself.
Why is processing a sorted array slower than an unsorted array?
When you are using the unsorted list all tuples are accessed in memory-order. They have been allocated consecutively in RAM. CPUs love accessing memory sequentially because they can speculatively request the next cache line so it will always be present when needed.
When you are sorting the list you put it into random order because your sort keys are randomly generated. This means that the memory accesses to tuple members are unpredictable. The CPU cannot prefetch memory and almost every access to a tuple is a cache miss.
This is a nice example for a specific advantage of GC memory management: data structures which have been allocated together and are used together perform very nicely. They have great locality of reference.
The penalty from cache misses outweighs the saved branch prediction penalty in this case.
Try switching to a
struct-tuple. This will restore performance because no pointer-dereference needs to occur at runtime to access tuple members.
Chris Sinclair notes in the comments that "for TotalCount around 10,000 or less, the sorted version does perform faster". This is because a small list fits entirely into the CPU cache. The memory accesses might be unpredictable but the target is always in cache. I believe there is still a small penalty because even a load from cache takes some cycles. But that seems not to be a problem because the CPU can juggle multiple outstanding loads, thereby increasing throughput. Whenever the CPU hits a wait for memory it will still speed ahead in the instruction stream to queue as many memory operations as it can. This technique is used to hide latency.
This kind of behavior shows how hard it is to predict performance on modern CPUs. The fact that we are only 2x slower when going from sequential to random memory access tell me how much is going on under the covers to hide memory latency. A memory access can stall the CPU for 50-200 cycles. Given that number one could expect the program to become >10x slower when introducing random memory accesses.
Why is processing a sorted array *slower* than an unsorted array? (Java's ArrayList.indexOf)
It looks like caching / prefetching effect.
The clue is that you compare Doubles (objects), not doubles (primitives). When you allocate objects in one thread, they are typically allocated sequentially in memory. So when
indexOf scans a list, it goes through sequential memory addresses. This is good for CPU cache prefetching heuristics.
But after you sort the list, you still have to do the same number of memory lookups in average, but this time memory access will be in random order.
Here is the benchmark to prove that the order of allocated objects matters.
Benchmark (generator) (length) (postprocess) Mode Cnt Score Error Units
ListIndexOf.indexOf random 1000000 none avgt 10 1,243 ± 0,031 ms/op
ListIndexOf.indexOf random 1000000 sort avgt 10 6,496 ± 0,456 ms/op
ListIndexOf.indexOf random 1000000 shuffle avgt 10 6,485 ± 0,412 ms/op
ListIndexOf.indexOf sequential 1000000 none avgt 10 1,249 ± 0,053 ms/op
ListIndexOf.indexOf sequential 1000000 sort avgt 10 1,247 ± 0,037 ms/op
ListIndexOf.indexOf sequential 1000000 shuffle avgt 10 6,579 ± 0,448 ms/op
Is == in sorted array not faster than unsorted array?
One thing that immediately comes to mind is CPU's branch prediction algorithm.
In case of
> comparison, in sorted array the branching behavior is very consistent: first, the
if condition is consistently false, then it is consistently true. This aligns very well with even the simplest branch prediction.
In unsorted array the result of
> condition is essentially random, thus thwarting any branch prediction.
This is what makes the sorted version faster.
In case of
== comparison, most of the time the condition is false and only very rarely it is true. This works well with branch prediction regardless of whether the array is sorted or not. The timings are essentially the same.
DIfference between sorted and unsorted array time complexity
Let's take a look at your sample problem: find two numbers whose sum equals a given number.
Let's say you have an unsorted array:
[2, 8, 1, 3, 6, 7, 5, 4], and the target is 11.
So you look at the first item, 2, and you know that you have to find the number 9 in the array, if it exists. With the unsorted array, you have to do a linear search to determine if 9 exists in the array.
But if you have a sorted array,
[1, 2, 3, 4, 5, 6, 7, 8], you have an advantage. When you see the value 2, you know you need to find 9 in the array. But because the list is sorted, you can use binary search. Instead of having to look at every item in the list, you only have to look at 3 of them:
Binary search will look at the 4th item, then the 6th, then the 8th, and finally determine that 9 isn't in the array.
In short, searching in an unsorted array takes O(n) time: you potentially have to look at every item to find out if what you're looking for is there. A sorted array lets you speed up the search. Instead of having to examine every item, you only have to examine at most log2(n) items. That makes a huge difference when numbers get large. For example, if your list contains a million items, binary search only has to examine 20 of them. Sequential search has to look at all million.
How to "Scan" a Website (Or Page) For Info, and Bring It into My Program
Inconsistent Performance Applying Foregroundactions in a Jeditorpane When Reading Html
How to Download Image from Any Web Page in Java
Convert Word Doc to HTML Programmatically in Java
Example Images For Code and Mark-Up Q&As
How to Interact With the Elements Within #Shadow-Root (Open) While Clearing Browsing Data of Chrome Browser Using Cssselector
How to Style the Progressbar Component in Javafx
Find Div Element by Multiple Class Names
Unfortunately Myapp Has Stopped. How to Solve This
How to Fix 'Android.Os.Networkonmainthread
How to Return Datasnapshot Value as a Result of a Method