Why is std::fill(0) slower than std::fill(1)?

Why is std::fill(0) slower than std::fill(1)?

From your question + the compiler-generated asm from your answer:

  • fill(0) is an ERMSB rep stosb, which will use 256b stores in an optimized microcoded loop. (Works best if the buffer is aligned, probably to at least 32B or maybe 64B.)
  • fill(1) is a simple 128-bit movaps vector store loop. Only one store can execute per core clock cycle regardless of width, up to 256b AVX. So 128b stores can only fill half of Haswell's L1D cache write bandwidth. This is why fill(0) is about 2x as fast for buffers up to ~32kiB. Compile with -march=haswell or -march=native to fix that.

    Haswell can just barely keep up with the loop overhead, but it can still run 1 store per clock even though it's not unrolled at all. But with 4 fused-domain uops per clock, that's a lot of filler taking up space in the out-of-order window. Some unrolling would maybe let TLB misses start resolving farther ahead of where stores are happening, since there is more throughput for store-address uops than for store-data. Unrolling might help make up the rest of the difference between ERMSB and this vector loop for buffers that fit in L1D. (A comment on the question says that -march=native only helped fill(1) for L1.)
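
For reference, a wider store loop roughly like the one -march=haswell enables can be written by hand with intrinsics. This is only a minimal sketch of the idea, not the compiler's actual output; fill_ones_avx2, buf and n are made-up names, buf is assumed to be 32-byte aligned, and n a multiple of 16 ints:

#include <immintrin.h>
#include <cstddef>

// Two 32-byte aligned stores per iteration: full-width stores plus a little
// unrolling, so the single store port is the main remaining bottleneck.
void fill_ones_avx2(int* buf, std::size_t n)
{
    __m256i ones = _mm256_set1_epi32(1);
    for (std::size_t i = 0; i < n; i += 16) {
        _mm256_store_si256(reinterpret_cast<__m256i*>(buf + i), ones);
        _mm256_store_si256(reinterpret_cast<__m256i*>(buf + i + 8), ones);
    }
}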

Note that rep stosd (which could be used to implement fill(1) for int elements) will probably perform the same as rep stosb on Haswell.
Although the official documentation only guarantees that ERMSB gives fast rep stosb (but not rep stosd), actual CPUs that support ERMSB use similarly efficient microcode for rep stosd. There is some doubt about IvyBridge, where maybe only rep stosb is fast. See @BeeOnRope's excellent ERMSB answer for updates on this.

gcc has some x86 tuning options for string ops (like -mstringop-strategy=alg and -mmemset-strategy=strategy), but IDK if any of them will get it to actually emit rep stosd for fill(1). Probably not, since I assume the code starts out as a loop rather than as a memset.
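
If you do want rep stosd regardless of what the compiler picks, a minimal inline-asm sketch looks like the following. This assumes GCC/Clang extended asm on x86-64, and fill_ones_rep_stosd is just an illustrative name:

#include <cstddef>

// rep stosl (the AT&T mnemonic for rep stosd) stores EAX to [RDI], RCX times.
void fill_ones_rep_stosd(int* dst, std::size_t count)
{
    asm volatile("rep stosl"
                 : "+D"(dst), "+c"(count)  // RDI = destination, RCX = element count (both clobbered)
                 : "a"(1)                  // EAX = value to store
                 : "memory");
}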


With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):

A normal movaps store to a cold cache line triggers a Read For Ownership (RFO). A lot of real DRAM bandwidth is spent on reading cache lines from memory when movaps writes the first 16 bytes. ERMSB uses a no-RFO protocol for its stores, so the memory controllers are only writing. (Except for miscellaneous reads, like page tables if any page-walks miss even in L3 cache, and maybe some load misses in interrupt handlers or whatever.)

@BeeOnRope explains in comments that the difference between regular RFO stores and the RFO-avoiding protocol used by ERMSB has downsides for some ranges of buffer sizes on server CPUs where there's high latency in the uncore/L3 cache. See also the linked ERMSB answer for more about RFO vs non-RFO, and the high latency of the uncore (L3/memory) in many-core Intel CPUs being a problem for single-core bandwidth.


movntps (_mm_stream_ps()) stores are weakly-ordered, so they can bypass the cache and go straight to memory a whole cache-line at a time without ever reading the cache line into L1D. movntps avoids RFOs, like rep stos does. (rep stos stores can reorder with each other, but not outside the boundaries of the instruction.)

Your movntps results in your updated answer are surprising.

For a single thread with large buffers, your results are movnt >> regular RFO > ERMSB. So it's really weird that the two non-RFO methods are on opposite sides of the plain old stores, and that ERMSB is so far from optimal. I don't currently have an explanation for that. (Edits welcome with an explanation + good evidence.)

As we expected, movnt allows multiple threads to achieve high aggregate store bandwidth, like ERMSB. movnt always goes straight into line-fill buffers and then memory, so it is much slower for buffer sizes that fit in cache. One 128b vector per clock is enough to easily saturate a single core's no-RFO bandwidth to DRAM. Probably vmovntps ymm (256b) is only a measurable advantage over vmovntps xmm (128b) when storing the results of a CPU-bound AVX 256b-vectorized computation (i.e. only when it saves the trouble of unpacking to 128b).

movnti bandwidth is low because storing in 4B chunks bottlenecks on 1 store uop per clock adding data to the line fill buffers, not on sending those line-full buffers to DRAM (until you have enough threads to saturate memory bandwidth).
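
For completeness, here is a minimal sketch of an NT-store fill (an illustration only, not the benchmark code from the question). _mm256_stream_si256 is the integer-vector counterpart of movntps; buf is assumed 32-byte aligned and n a multiple of 8 ints:

#include <immintrin.h>
#include <cstddef>

void fill_ones_nt(int* buf, std::size_t n)
{
    __m256i ones = _mm256_set1_epi32(1);
    for (std::size_t i = 0; i < n; i += 8)
        _mm256_stream_si256(reinterpret_cast<__m256i*>(buf + i), ones);  // no RFO, bypasses the cache
    _mm_sfence();  // NT stores are weakly ordered: fence before other threads read the data
}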


@osgx posted some interesting links in comments:

  • Agner Fog's asm optimization guide, instruction tables, and microarch guide: http://agner.org/optimize/
  • Intel optimization guide: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

  • NUMA snooping: http://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/

  • Intel Memory Latency Checker: https://software.intel.com/en-us/articles/intelr-memory-latency-checker
  • Cache Coherence Protocol and Memory Performance of the Intel Haswell-EP Architecture

See also other stuff in the x86 tag wiki.

Which is faster/preferred: memset or for loop to zero out an array of doubles?

Note that for memset you have to pass the number of bytes, not the number of elements, because memset is an old C function:

memset(d, 0, sizeof(double)*length);

memset can be faster since it is usually written in hand-optimized assembly, whereas std::fill is a template function that simply does a loop internally (although optimizing compilers often recognize such loops and emit a call to memset anyway).

But for type safety and more readable code I would recommend std::fill() - it is the C++ way of doing things - and consider memset only if a performance optimization is needed at this place in the code.
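
As a minimal sketch of the two options (zero_both_ways and the parameter names are only illustrative), note how the size arguments differ:

#include <algorithm>
#include <cstring>
#include <cstddef>

void zero_both_ways(double* d, std::size_t length)
{
    std::memset(d, 0, sizeof(double) * length);  // size in BYTES; works for 0.0 because it is all-zero bits
    std::fill(d, d + length, 0.0);               // range in ELEMENTS; works for any value and any type
}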

Why is std::vector slower than an array?

Let us observe how GCC optimizes this test program:

#include <vector>

int main()
{
    int len = 800000;
    int* Data = new int[len];

    int arr[3] = { 255, 0, 0 };
    std::vector<int> vec = { 255, 0, 0 };

    for (int i = 0; i < len; i++) {
        Data[i] = vec[0];
    }
    for (int i = 0; i < len; i++) {
        Data[i] = arr[0];
    }
    delete[] Data;
}

The compiler rightly notices that the vector is constant and eliminates it. Exactly the same code is generated for both loops, so it should be irrelevant whether the first loop uses the array or the vector.

.L2:
        movups  XMMWORD PTR [rcx], xmm0
        add     rcx, 16
        cmp     rsi, rcx
        jne     .L2

What makes the difference in your test program is the order of the loops. The comments point out that when a third loop is added at the beginning, both loops take the same time.
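
A warm-up pass along these lines (my guess at what the commenters added, not their exact code) touches every element of Data once, likely because that way neither timed loop pays the page-fault cost of the first write to the freshly allocated memory:

// Hypothetical warm-up loop inserted before the two timed loops:
for (int i = 0; i < len; i++)
    Data[i] = 0;  // first touch happens here, so page faults don't land in the timed loops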

I would expect that with a modern compiler, accessing a vector is approximately as fast as accessing an array when optimization is enabled and debug checks are disabled. If there is an observable difference in your actual program, the problem lies somewhere else.

Parallel fill std::vector with zero

You can split the vector into chunks for each thread to be filled with std::fill:

// needs <omp.h> for the omp_* calls and <algorithm> for std::fill
#pragma omp parallel
{
    auto tid = omp_get_thread_num();
    auto chunksize = v.size() / omp_get_num_threads();
    auto begin = v.begin() + chunksize * tid;
    // the last thread takes whatever remainder is left over
    auto end = (tid == omp_get_num_threads() - 1) ? v.end() : begin + chunksize;
    std::fill(begin, end, 0);
}

You can further improve this by rounding chunksize to the nearest cache-line / memory-word boundary (128 bytes = 32 ints), assuming that v.data() is aligned similarly. That way you avoid false-sharing issues.
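
A sketch of that rounding (an illustration under the same assumptions; it needs <omp.h> and <algorithm>) could look like this, with std::min keeping the last chunks from running past the end:

#pragma omp parallel
{
    const std::size_t granularity = 128 / sizeof(int);                      // 32 ints = 128 bytes
    std::size_t chunksize = (v.size() + omp_get_num_threads() - 1)
                            / omp_get_num_threads();                        // ceiling division
    chunksize = (chunksize + granularity - 1) / granularity * granularity;  // round up to a 32-int multiple

    std::size_t tid   = omp_get_thread_num();
    std::size_t first = std::min(chunksize * tid, v.size());
    std::size_t last  = std::min(first + chunksize, v.size());
    std::fill(v.begin() + first, v.begin() + last, 0);
}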

On a dual-socket 24-core Haswell system, I get a speedup of somewhere near 9x: from 3.6 s for 1 thread to 0.4 s for 24 threads (4.8B ints = ~48 GB/s). The results vary a bit and this is not a scientific analysis, but it is not too far off the memory bandwidth of the system.

For general performance, you should be concerned about dividing your vector the same way not only for this operation, but also for further operations (be they reads or writes), if possible. That way, you increase the chance that the data is actually in cache when you need it, or at least on the same NUMA node.
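
One simple way to get that (my own sketch, not from the original answer) is to use the same schedule(static) partitioning for the fill and for later passes, so each thread keeps touching the same index range:

const std::size_t n = v.size();

// Fill pass and a later read pass with identical static partitioning:
#pragma omp parallel for schedule(static)
for (std::size_t i = 0; i < n; ++i)
    v[i] = 0;

long long sum = 0;
#pragma omp parallel for schedule(static) reduction(+:sum)
for (std::size_t i = 0; i < n; ++i)
    sum += v[i];  // same threads re-touch the same elements: better cache and NUMA locality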

Oddly enough, on my system std::fill(..., 1) is faster than std::fill(..., 0) for a single thread, but slower for 24 threads, with both gcc 6.1.0 and icc 17.0.1. I guess I'll post that as a separate question.


