Why Is My Program Slow When Looping Over Exactly 8192 Elements

The difference is caused by the same super-alignment issue from the following related questions:

  • Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
  • Matrix multiplication: Small difference in matrix size, large difference in timings

But that's only part of the story; there's another problem with the code itself.

Starting from the original loop:

for(i=1; i<SIZE-1; i++) {
    for(j=1; j<SIZE-1; j++) {
        res[j][i] = 0;
        for(k=-1; k<2; k++)
            for(l=-1; l<2; l++)
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
    }
}

First notice that the two inner loops are trivial. They can be unrolled as follows:

for(i=1; i<SIZE-1; i++) {
    for(j=1; j<SIZE-1; j++) {
        res[j][i] = 0;
        res[j][i] += img[j-1][i-1];
        res[j][i] += img[j  ][i-1];
        res[j][i] += img[j+1][i-1];
        res[j][i] += img[j-1][i  ];
        res[j][i] += img[j  ][i  ];
        res[j][i] += img[j+1][i  ];
        res[j][i] += img[j-1][i+1];
        res[j][i] += img[j  ][i+1];
        res[j][i] += img[j+1][i+1];
        res[j][i] /= 9;
    }
}

That leaves the two outer loops that we're interested in.

Now we can see that the problem is the same as in this question: Why does the order of the loops affect performance when iterating over a 2D array?

You are iterating over the matrix column-wise instead of row-wise. C stores 2D arrays in row-major order, so when j varies in the inner loop, each access to img[j][i] lands a full row (SIZE elements) away from the previous one.


To solve this problem, interchange the two loops:

for(j=1; j<SIZE-1; j++) {
    for(i=1; i<SIZE-1; i++) {
        res[j][i] = 0;
        res[j][i] += img[j-1][i-1];
        res[j][i] += img[j  ][i-1];
        res[j][i] += img[j+1][i-1];
        res[j][i] += img[j-1][i  ];
        res[j][i] += img[j  ][i  ];
        res[j][i] += img[j+1][i  ];
        res[j][i] += img[j-1][i+1];
        res[j][i] += img[j  ][i+1];
        res[j][i] += img[j+1][i+1];
        res[j][i] /= 9;
    }
}

This eliminates the non-sequential accesses entirely, so you no longer get dramatic slowdowns at large powers of two.
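
Why exactly 8192? Here's a back-of-the-envelope sketch; the cache parameters (a 32 KiB, 8-way L1 with 64-byte lines, and 4-byte elements) are assumptions for illustration, not measurements from the benchmark machine. One row is 8192 * 4 B = 32 KiB, so walking down a column steps through addresses that all map to the same cache set:

#include <stdio.h>

int main(void) {
    const unsigned line   = 64;                     /* assumed bytes per cache line */
    const unsigned ways   = 8;                      /* assumed associativity        */
    const unsigned sets   = 32768 / (line * ways);  /* assumed 32 KiB L1 -> 64 sets */
    const unsigned stride = 8192 * sizeof(int);     /* one row = 32 KiB             */

    /* Set index = (address / line) % sets. A stride that is a multiple of
       line * sets (4 KiB) lands every access in the same set. */
    for (unsigned row = 0; row < 4; row++)
        printf("row %u -> L1 set %u\n", row, (row * stride / line) % sets);
    return 0;
}

Every access down a column therefore competes for the same handful of cache lines (the 8 ways of one set), while with SIZE = 8191 or 8193 the stride is no longer a multiple of 4 KiB and the accesses spread across all the sets.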


Benchmarks (Core i7 920 @ 3.5 GHz):

Original code:

8191: 1.499 seconds
8192: 2.122 seconds
8193: 1.582 seconds

Interchanged outer loops:

8191: 0.376 seconds
8192: 0.357 seconds
8193: 0.351 seconds

PC going unresponsive when looping over 10^8 elements in an array

What's happening here is very likely paging / swapping[1]. Due to virtual address space, each process on your system can address a huge amount of memory - way more than you have physically in your computer. If all processes together use more memory than is physically available, the operating system is in trouble - one approach is paging: moving data belonging to some processes from memory to disk.

Since your disk, even if it's an SSD, is several orders of magnitude slower than RAM, the system gets unresponsive. Say, for example, the OS decides to move the block of memory which contains your mouse cursor position onto the disk. Every time it updates the cursor, this introduces a huge delay. Even after the process which consumed all the memory finishes, it will take some time to load all the data back from disk into RAM.

To illustrate: on my system with a comparable processor (i5-3320M), your example code finishes in a mere 20 seconds with no impact on overall system responsiveness - that is because I have 16 GiB of RAM. So clearly it is not about the "CPU [being saturated] with billions of operations". Given that you have a quad-core processor and that the code uses only one thread, you have plenty of spare compute cycles. Even if you were to use up all the CPU cycles, the system would usually remain quite responsive, because the OS scheduler does a good job of balancing CPU cycles between your compute task and the process moving your mouse cursor.

Python is particularly prone to this issue because it uses far more memory than necessary. Python 3.6.1 on my system uses ~4 GiB for the data in arr - even though 10^8 64-bit integers would only need 800 MB. That's because everything in Python is an object. You can be more memory-efficient if you don't permanently store anything in memory in the first place, or if you use numpy. But discussing that would require a more problem-oriented code example.
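
For contrast, here's a minimal C sketch of the same workload (hypothetical code, just to illustrate the footprint): the 10^8 integers occupy one flat ~800 MB allocation, with no per-element object overhead:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t n = 100000000;                      /* 10^8 elements            */
    long long *arr = malloc(n * sizeof *arr);  /* one flat ~800 MB buffer  */
    if (!arr) { perror("malloc"); return 1; }

    long long sum = 0;
    for (size_t i = 0; i < n; i++) {           /* touch every element once */
        arr[i] = (long long)i;
        sum += arr[i];
    }
    printf("sum = %lld\n", sum);
    free(arr);
    return 0;
}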

[1]: There are differences between paging and swapping, but nowadays the terms are used interchangeably.

Why are far pointers slow?

Segment register loads are too complex for a core's front-end to convert into micro-ops directly. Instead, they're "emulated" by micro-ops stored in a little ROM. This begins with some branches (which CPU mode is it?), and typically those branches can't benefit from the CPU's branch prediction, causing stalls.

To avoid segment register loads (e.g. when the same far pointer is used multiple times), software tends to spread data across more segment registers (e.g. ES, FS and GS); and this adds more prefixes (segment override prefixes) to instructions. These extra prefixes can also slow down instruction decoding.

"I guess if every pointer you were using were in a different segment this could add up, but data is usually grouped."

Compilers aren't that smart. If a small piece of code uses 4 far pointers that all happen to use the same segment, the compiler won't know that they're all in the same segment and will do expensive segment register loads regardless. To work around that you could describe the data as a structure (e.g. 1 pointer to a structure that has 4 fields instead of 4 different pointers); but that requires the programmer to write software differently.
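
A sketch of that workaround in 16-bit C - hedged, since the far qualifier is a compiler extension (Turbo C, Open Watcom and friends), and quad, sum_separate and sum_struct are made-up names:

/* Four independent far pointers: the compiler must assume four different
   segments, so each dereference may trigger its own segment register load. */
int far *pa;
int far *pb;
int far *pc;
int far *pd;

int sum_separate(void) {
    return *pa + *pb + *pc + *pd;      /* up to four segment loads */
}

/* One far pointer to a struct: the segment register is loaded once and
   every field is then reached with offset-only addressing. */
struct quad { int a, b, c, d; };

int sum_struct(struct quad far *q) {
    return q->a + q->b + q->c + q->d;  /* one segment load, four offsets */
}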

For example, if you do something like "int foo(void) { int a; return bar(&a); }" then the far pointer's segment part (the value of ss) will probably be passed on the stack, and then the callee (bar) will load that into another segment register, because bar() has to assume that the pointer could point anywhere.

The other problem is that sometimes data is larger than a segment (e.g. an array of 100000 bytes that doesn't fit in a 64 KiB segment); so someone (the programmer or the compiler) has to calculate and load a different segment to access (parts of) the same data. All pointer arithmetic may need to take this into account, e.g. a trivial-looking pointer++; may become something more like offset++; segment += (offset >> 4); offset &= 0x000F; which causes a segment register load.
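
As a hedged C sketch of that bookkeeping, mirroring what real-mode "huge" pointer arithmetic does (huge_ptr and huge_add are illustrative names, not a real compiler API):

/* A real-mode address is segment * 16 + offset. Keeping the pair
   normalized (offset in 0..15) lets one object span 64 KiB boundaries. */
typedef struct {
    unsigned short seg;   /* segment (paragraph) part */
    unsigned short off;   /* offset part              */
} huge_ptr;

static void huge_add(huge_ptr *p, unsigned short bytes) {
    /* assumes bytes <= 0xFFF0 so the 16-bit offset can't wrap */
    p->off += bytes;
    p->seg += p->off >> 4;  /* carry whole 16-byte paragraphs into the segment */
    p->off &= 0x000F;       /* keep the offset in 0..15 */
    /* The new p->seg must then be loaded into a segment register before the
       next dereference - that load is the expensive part. */
}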

"If you have to hit DRAM, that should dominate the cost, right?"

For real mode, you're limited to about 640 KiB of RAM, and the caches in modern CPUs are typically much larger, so you can expect that every memory access will be a cache hit. In some cases (e.g. Cascade Lake CPUs with 1 MiB of L2 cache per core) you won't even use the L3 cache (it'll be all L2 hits).

You can also expect that a segment register load is more expensive than a cache hit.

"Hoping an 8086 wizard is hanging around who remembers this stuff."

When people say "segmentation is awful and should be avoided", they're not thinking of a CPU that's been obsolete for 40 years (the 8086); they're thinking of CPUs that were relevant during this century. Some of them may also be thinking of more than performance alone (especially for assembly language programmers, segmentation is an annoyance/extra burden).


