Why Is Processing an Unsorted Array the Same Speed as Processing a Sorted Array with Modern X86-64 Clang

Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?

Several of the answers in the question you link talk about rewriting the code to be branchless and thus avoiding any branch prediction issues. That's what your updated compiler is doing.

Specifically, clang++ 10 with -O3 vectorizes the inner loop. See the code on godbolt, lines 36-67 of the assembly. The code is a little bit complicated, but one thing you definitely don't see is any conditional branch on the data[c] >= 128 test. Instead it uses vector compare instructions (pcmpgtd) whose output is a mask with 1s for matching elements and 0s for non-matching. The subsequent pand with this mask replaces the non-matching elements by 0, so that they do not contribute anything when unconditionally added to the sum.

The rough C++ equivalent would be

sum += data[c] & -(data[c] >= 128);

The code actually keeps two running 64-bit sums, for the even and odd elements of the array, so that they can be accumulated in parallel and then added together at the end of the loop.

Some of the extra complexity is to take care of sign-extending the 32-bit data elements to 64 bits; that's what sequences like pxor xmm5, xmm5 ; pcmpgtd xmm5, xmm4 ; punpckldq xmm4, xmm5 accomplish. Turn on -mavx2 and you'll see a simpler vpmovsxdq ymm5, xmm5 in its place.

The code also looks long because the loop has been unrolled, processing 8 elements of data per iteration.

Why is processing a sorted array faster than processing an unsorted array?

You are a victim of branch prediction fail.

What is Branch Prediction?

Consider a railroad junction:

Image showing a railroad junction
_{Image by Mecanismo, via Wikimedia Commons. Used under the CC-By-SA 3.0 license.}

Now for the sake of argument, suppose this is back in the 1800s - before long-distance or radio communication.

You are a blind operator of a junction and you hear a train coming. You have no idea which way it is supposed to go. You stop the train to ask the driver which direction they want. And then you set the switch appropriately.

Trains are heavy and have a lot of inertia, so they take forever to start up and slow down.

Is there a better way? You guess which direction the train will go!

If you guessed right, it continues on.
If you guessed wrong, the captain will stop, back up, and yell at you to flip the switch. Then it can restart down the other path.

If you guess right every time, the train will never have to stop.

If you guess wrong too often, the train will spend a lot of time stopping, backing up, and restarting.

Consider an if-statement: At the processor level, it is a branch instruction:

Screenshot of compiled code containing an if statement

You are a processor and you see a branch. You have no idea which way it will go. What do you do? You halt execution and wait until the previous instructions are complete. Then you continue down the correct path.

Modern processors are complicated and have long pipelines. This means they take forever to "warm up" and "slow down".

Is there a better way? You guess which direction the branch will go!

If you guessed right, you continue executing.
If you guessed wrong, you need to flush the pipeline and roll back to the branch. Then you can restart down the other path.

If you guess right every time, the execution will never have to stop.

If you guess wrong too often, you spend a lot of time stalling, rolling back, and restarting.

This is branch prediction. I admit it's not the best analogy since the train could just signal the direction with a flag. But in computers, the processor doesn't know which direction a branch will go until the last moment.

How would you strategically guess to minimize the number of times that the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every three times, you guess the same...

In other words, you try to identify a pattern and follow it. This is more or less how branch predictors work.

Most applications have well-behaved branches. Therefore, modern branch predictors will typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.

Further reading: "Branch predictor" article on Wikipedia.

As hinted from above, the culprit is this if-statement:

if (data[c] >= 128)
    sum += data[c];

Notice that the data is evenly distributed between 0 and 255. When the data is sorted, roughly the first half of the iterations will not enter the if-statement. After that, they will all enter the if-statement.

This is very friendly to the branch predictor since the branch consecutively goes the same direction many times. Even a simple saturating counter will correctly predict the branch except for the few iterations after it switches direction.

Quick visualization:

T = branch taken
N = branch not taken

data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N  N  N  N  N  ...   N    N    T    T    T  ...   T    T    T  ...

       = NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT  (easy to predict)

However, when the data is completely random, the branch predictor is rendered useless, because it can't predict random data. Thus there will probably be around 50% misprediction (no better than random guessing).

data[] = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118,  14, 150, 177, 182, ...
branch =   T,   T,   N,   T,   T,   T,   T,  N,   T,   N,   N,   T,   T,   T  ...

       = TTNTTTTNTNNTTT ...   (completely random - impossible to predict)

What can be done?

If the compiler isn't able to optimize the branch into a conditional move, you can try some hacks if you are willing to sacrifice readability for performance.

Replace:

if (data[c] >= 128)
    sum += data[c];

with:

int t = (data[c] - 128) >> 31;
sum += ~t & data[c];

This eliminates the branch and replaces it with some bitwise operations.

_{(Note that this hack is not strictly equivalent to the original if-statement. But in this case, it's valid for all the input values of data[].)}

Benchmarks: Core i7 920 @ 3.5 GHz

C++ - Visual Studio 2010 - x64 Release

Scenario	Time (seconds)
Branching - Random data	11.777
Branching - Sorted data	2.352
Branchless - Random data	2.564
Branchless - Sorted data	2.587

Why is processing a sorted array slower than an unsorted array?

When you are using the unsorted list all tuples are accessed in memory-order. They have been allocated consecutively in RAM. CPUs love accessing memory sequentially because they can speculatively request the next cache line so it will always be present when needed.

When you are sorting the list you put it into random order because your sort keys are randomly generated. This means that the memory accesses to tuple members are unpredictable. The CPU cannot prefetch memory and almost every access to a tuple is a cache miss.

This is a nice example for a specific advantage of GC memory management: data structures which have been allocated together and are used together perform very nicely. They have great locality of reference.

The penalty from cache misses outweighs the saved branch prediction penalty in this case.

Try switching to a struct-tuple. This will restore performance because no pointer-dereference needs to occur at runtime to access tuple members.

Chris Sinclair notes in the comments that "for TotalCount around 10,000 or less, the sorted version does perform faster". This is because a small list fits entirely into the CPU cache. The memory accesses might be unpredictable but the target is always in cache. I believe there is still a small penalty because even a load from cache takes some cycles. But that seems not to be a problem because the CPU can juggle multiple outstanding loads, thereby increasing throughput. Whenever the CPU hits a wait for memory it will still speed ahead in the instruction stream to queue as many memory operations as it can. This technique is used to hide latency.

This kind of behavior shows how hard it is to predict performance on modern CPUs. The fact that we are only 2x slower when going from sequential to random memory access tell me how much is going on under the covers to hide memory latency. A memory access can stall the CPU for 50-200 cycles. Given that number one could expect the program to become >10x slower when introducing random memory accesses.

Is branch prediction still significantly speeding up array processing?

You're a victim of the as-if rule:

... conforming implementations are required to emulate (only) the observable behavior of the abstract machine ...

Consider the function under test ...

const size_t arraySize = 32768;
int *data;

long long test()
{
    long long sum = 0;
    for (size_t i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (size_t c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return sum;
}

And the generated assembly (VS 2017, x86_64 /O2 mode)

The machine does not execute your loops, instead it executes a similar program that does this:

long long test()
{
    long long sum = 0;
    // Primary loop
    for (size_t c = 0; c < arraySize; ++c)
    {
        for (size_t i = 0; i < 20000; ++i)
        {
            if (data[c] >= 128)
                sum += data[c] * 5;
        }
    }
    return sum;
}

Observe how the optimizer reversed the order of the loops and defeated your benchmark.

Obviously the latter version is much more branch-predictor-friendly.

You can in turn defeat the loop hoisting optimization by introducing a dependency in the outer loop:

long long test()
{
    long long sum = 0;
    for (size_t i = 0; i < 100000; ++i)
    {
        sum += data[sum % 15];  // <== dependency!
        // Primary loop
        for (size_t c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    return sum;
}

Now this version again exhibits a massive difference between sorted/unsorted data. On my system (i7-7700) 1.6s vs 11s (or 700%).

Conclusion: branch predictor is more important than ever these days when we are facing unprecedented pipeline depths and instruction-level parallelism.

Processing records speed

As other have mentioned, LINQ is not inherently slow. But it can be slower than equivalent non-LINQ code (this is why Roslyn team has "Avoid LINQ" guide under coding conventions).

If this is your hot path and you need every microsecond than you should probably implement logic in such a way:

public void Process()
{
    try
    {
        int ptr = PayloadOffset + 1;
        var cPair = MessageData.GetString(ref ptr, 7);
        var orderID = MessageData.GetString(ref ptr, 15);

        if (Book.CPairs.TryGetValue(cPair, out CPairGroup cPairGroup) && cPairGroup != null)
        {
            for (int i = cPairGroup.BPrices.Count - 1; i >= 0; i--)
            {
                var x = cPairGroup.BPrices[i];
                for (int j = x.BOrders.Count - 1; j >= 0; j--)
                {
                    var y = x.BOrders[j];
                    if (y.OrderID.Equals(orderID))
                    {
                        x.BOrders.RemoveAt(j);
                    }
                }
                if (x.BOrders.Count == 0)
                {
                    cPairGroup.BPrices.RemoveAt(i);
                }
            }
        }
    }
}

Main points:

Avoid double dictionary lookup by using TryGetValue
Single iteration over cPairGroup.BPrices
In place modification of structures by iterating backwards
This code should not contain any additional heap allocations

Why Is Processing an Unsorted Array the Same Speed as Processing a Sorted Array with Modern X86-64 Clang