SSE-copy, AVX-copy and std::copy performance

The problem is that your test does a poor job of mitigating the hardware factors that make benchmarking hard. To test this, I made my own test case, something like this:

for blah blah:
    sleep(500ms)
    std::copy
    sse
    avx

output:

SSE: 1.11753x faster than std::copy
AVX: 1.81342x faster than std::copy

So in this case, AVX is a bunch faster than std::copy. What happens when I change the test case to...

for blah blah:
    sleep(500ms)
    sse
    avx
    std::copy

Notice that absolutely nothing changed, except the order of the tests.

SSE: 0.797673x faster than std::copy
AVX: 0.809399x faster than std::copy

Whoa! How is that possible? The CPU takes a while to ramp up to full speed, so tests that run later have an advantage. This question has three answers now, including an 'accepted' answer, but only the one with the fewest upvotes was on the right track.

This is one of the reasons benchmarking is hard, and why you should never trust anyone's micro-benchmarks unless they include detailed information about their setup. It isn't just the code that can go wrong: power-saving features and weird drivers can completely mess up your benchmark. I once measured a factor-of-7 difference in performance just by toggling a switch in the BIOS that less than 1% of notebooks offer.
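
For illustration, here is a minimal sketch of how such a test can be structured: a warm-up phase first, then the variants interleaved over many rounds so that no single variant benefits from running last. The names copy_std, copy_sse and copy_avx are placeholders for the routines under test, not code from the question.

#include <chrono>
#include <cstddef>

void copy_std(); // placeholders for the actual copy routines being compared
void copy_sse();
void copy_avx();

template <typename F>
double seconds(F&& f, std::size_t iterations)
{
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
        f();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

void benchmark()
{
    // Warm-up: give turbo/power management time to settle before measuring.
    seconds([] { copy_std(); }, 100000);

    double t_std = 0.0, t_sse = 0.0, t_avx = 0.0;
    for (int round = 0; round < 10; ++round) // interleave so ramp-up effects average out
    {
        t_std += seconds([] { copy_std(); }, 1000);
        t_sse += seconds([] { copy_sse(); }, 1000);
        t_avx += seconds([] { copy_avx(); }, 1000);
    }
    // Speedups are then t_std / t_sse and t_std / t_avx.
}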

How to copy X bytes or bits from an __m128i into standard memory

There is a _mm_maskmoveu_si128 intrinsic, which translates to maskmovdqu (in SSE) or vmaskmovdqu (in AVX).

// Store masks. The highest bit in each byte indicates the byte to store.
alignas(16) const unsigned char masks[16][16] =
{
    { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00 }
};

void store_n(__m128i mm, unsigned int n, void* storage)
{
    assert(n < 16u);
    _mm_maskmoveu_si128(mm, reinterpret_cast< const __m128i& >(masks[n]), static_cast< char* >(storage));
}

The problem with this code is that maskmovdqu (and, presumably, vmaskmovdqu) instructions have an associated hint for non-temporal access to the target memory, which makes the instruction expensive and also requires a fence afterwards.
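
If you do go this route, the store should be followed by a store fence before the destination is read through ordinary loads. A minimal sketch of a call site (store_n is the function above; the wrapper name is made up for illustration):

#include <emmintrin.h> // SSE2: _mm_maskmoveu_si128, _mm_sfence

void store_n_and_publish(__m128i mm, unsigned int n, void* storage)
{
    store_n(mm, n, storage); // masked, weakly-ordered (non-temporal) store
    _mm_sfence();            // order it before subsequent accesses observe the data
}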

AVX adds the new instructions vmaskmovps/vmaskmovpd (and AVX2 also adds vpmaskmovd/vpmaskmovq), which work similarly to vmaskmovdqu but do not have the non-temporal hint and only operate at 32- and 64-bit granularity.

// Store masks. The highest bit in each 32-bit element indicates the element to store.
alignas(16) const unsigned char masks[4][16] =
{
    { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00 }
};

void store_n(__m128i mm, unsigned int n, void* storage)
{
    assert(n < 4u);
    _mm_maskstore_epi32(static_cast< int* >(storage), reinterpret_cast< const __m128i& >(masks[n]), mm);
}

AVX-512 adds masked stores, and you could use vmovdqu8/vmovdqu16 with an appropriate mask to store 8 or 16-bit elements.

void store_n(__m128i mm, unsigned int n, void* storage)
{
    assert(n < 16u);
    _mm_mask_storeu_epi8(storage, static_cast< __mmask16 >((1u << n) - 1u), mm);
}

Note that the above requires AVX-512BW and VL extensions.
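
If you only want that path compiled in when the target actually supports it, one option (a sketch; GCC and Clang define these feature macros when the corresponding instruction sets are enabled, other compilers may differ) is to gate on the predefined macros:

#include <immintrin.h>

#if defined(__AVX512BW__) && defined(__AVX512VL__)
// AVX-512BW + AVX-512VL available: use the masked byte store from above.
void store_n(__m128i mm, unsigned int n, void* storage)
{
    assert(n < 16u);
    _mm_mask_storeu_epi8(storage, static_cast< __mmask16 >((1u << n) - 1u), mm);
}
#else
// Otherwise fall back to a piecewise implementation such as the one below.
#endif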

If you require 8 or 16-bit granularity and don't have AVX-512 then you're better off with a function that manually stores the vector register piece by piece.

void store_n(__m128i mm, unsigned int n, void* storage)
{
    assert(n < 16u);

    unsigned char* p = static_cast< unsigned char* >(storage);
    if (n >= 8u)
    {
        _mm_storel_epi64(reinterpret_cast< __m128i* >(p), mm);
        mm = _mm_unpackhi_epi64(mm, mm); // move high 8 bytes to the low 8 bytes
        n -= 8u;
        p += 8;
    }

    if (n >= 4u)
    {
        std::uint32_t data = _mm_cvtsi128_si32(mm);
        std::memcpy(p, &data, sizeof(data)); // typically generates movd
        mm = _mm_srli_si128(mm, 4);
        n -= 4u;
        p += 4;
    }

    if (n >= 2u)
    {
        std::uint16_t data = _mm_extract_epi16(mm, 0); // or _mm_cvtsi128_si32
        std::memcpy(p, &data, sizeof(data));
        mm = _mm_srli_si128(mm, 2);
        n -= 2u;
        p += 2;
    }

    if (n > 0u)
    {
        std::uint32_t data = _mm_cvtsi128_si32(mm);
        *p = static_cast< std::uint8_t >(data);
    }
}
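
For reference, a typical use of store_n is handling the tail of a block copy. This is only a sketch: note that the final 16-byte load deliberately over-reads the source, which is only safe if the source buffer is padded accordingly.

#include <cstddef>
#include <emmintrin.h>

void copy_bytes(const unsigned char* src, unsigned char* dst, std::size_t len)
{
    std::size_t i = 0;
    for (; i + 16u <= len; i += 16u)
    {
        __m128i v = _mm_loadu_si128(reinterpret_cast< const __m128i* >(src + i));
        _mm_storeu_si128(reinterpret_cast< __m128i* >(dst + i), v);
    }
    if (i < len)
    {
        // Assumes src is padded so that reading a full 16 bytes here is safe.
        __m128i v = _mm_loadu_si128(reinterpret_cast< const __m128i* >(src + i));
        store_n(v, static_cast< unsigned int >(len - i), dst + i);
    }
}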

AVX vs. SSE: expect to see a larger speedup

The problem is that your data doesn't fit in the L1 cache.
The L1 bandwidth of Broadwell is much larger than its L2 bandwidth:
the L1 is large enough to load two 32-byte vectors every CPU cycle, so a better AVX vs. SSE speedup
might be expected if your data set were much smaller. However, note that
the combined L1 read/write bandwidth is less than 2*32(r)+32(w)=96 bytes per cycle;
in practice 75 bytes per cycle is achievable, see here.

The second graph on this page shows that the L2 bandwidth is indeed much smaller:
at Test_block_size=128KB (=32KB per core) the bandwidth is 900GB/s;
at Test_block_size=1MB (=256KB per core) the bandwidth is only 300GB/s.
(Note that the Haswell 4770K has more or less the same L1 and L2 cache architecture as Broadwell.)

Try reducing AR to 2000 and increasing NREP to 1000000 and see what happens to the SSE vs. AVX speedup.
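
To see why those values make the working set L1-resident, here is the back-of-the-envelope arithmetic, assuming single-precision floats and three arrays (a triad-style a[i] = b[i]*c[i] + a[i] kernel; adjust for the actual element type and array count in your code):

#include <cstddef>

// AR = 2000  -> 3 arrays * 2000 * 4 B = 24 KB, which fits the 32 KB per-core L1.
// Much larger AR -> the working set spills to L2/L3/DRAM and the kernel becomes
//                   bandwidth-bound, so the wider AVX ALUs gain little.
constexpr std::size_t AR   = 2000;     // reduced array length
constexpr std::size_t NREP = 1000000;  // more repetitions so total work (and timing
                                       // accuracy) stays comparable
constexpr std::size_t working_set_bytes = 3u * AR * sizeof(float);
static_assert(working_set_bytes <= 32u * 1024u, "working set should fit in a 32 KB L1");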

Performance of SSE and AVX when both are memory-bandwidth limited

That's an interesting observation. I was able to reproduce your results. I managed to improve your SSE code speed quite a bit by unrolling the loop (see the code below). Now SSE at dataLen=2864 is clearly faster, and for the smaller values it's nearly as fast as AVX; for larger values it's faster still. This is due to the loop-carried dependency in your SSE code (i.e. unrolling the loop increases the instruction-level parallelism (ILP)). I did not try unrolling any further. Unrolling the AVX code did not help.

I don't have a clear answer to your question, though. My hunch is that it's related to ILP and the fact that AVX processors such as Sandy Bridge can only load two 128-bit words (SSE width) simultaneously, not two 256-bit words. So in the SSE code the processor can do one SSE addition, one SSE multiplication, two SSE loads, and one SSE store simultaneously. For AVX it can do one AVX load (through two 128-bit loads on ports 2 and 3), one AVX multiplication, one AVX addition, and one 128-bit store (half the AVX width) simultaneously. In other words, although with AVX the multiplications and additions do twice as much work as SSE, the loads and stores are still 128 bits wide. Maybe this sometimes leads to lower ILP with AVX compared to SSE in code dominated by loads and stores?

For more info on the ports and ILP, see this comparison of the Haswell, Sandy Bridge and Nehalem ports.

__m128 p1, p2, p3, p1_v2, p2_v2, p3_v2;
for (int j = 0; j < N; j++)
{
    for (int i = 0; i < dataLen; i += 8)
    {
        p1    = _mm_load_ps(&buf1[i]);
        p1_v2 = _mm_load_ps(&buf1[i+4]);
        p2    = _mm_load_ps(&buf2[i]);
        p2_v2 = _mm_load_ps(&buf2[i+4]);
        p3    = _mm_load_ps(&buf3[i]);
        p3_v2 = _mm_load_ps(&buf3[i+4]);
        p3    = _mm_add_ps(_mm_mul_ps(p1, p2), p3);
        p3_v2 = _mm_add_ps(_mm_mul_ps(p1_v2, p2_v2), p3_v2);
        _mm_store_ps(&buf3[i], p3);
        _mm_store_ps(&buf3[i+4], p3_v2);
    }
}

SSE copy data to variables

There's no reason why you couldn't put x and y side by side in one __m128, shortening the code somewhat:

for (unsigned int i = 0; i < PARTICLES; i += 2) {
    // Particle position/velocity x & y.
    // Note: _mm_set_ps fills lanes from the highest down, so after the add
    // pnew[3] = particle i's x, pnew[2] = particle i+1's x,
    // pnew[1] = particle i's y, pnew[0] = particle i+1's y.
    __m128 pos = _mm_set_ps(m_Particle[i]->x, m_Particle[i+1]->x,
                            m_Particle[i]->y, m_Particle[i+1]->y);
    __m128 vel = _mm_set_ps(m_Particle[i]->vx, m_Particle[i+1]->vx,
                            m_Particle[i]->vy, m_Particle[i+1]->vy);

    union { float pnew[4]; __m128 pnew4; };
    pnew4 = _mm_add_ps(pos, vel);

    m_Particle[i+0]->x = pnew[3]; // Particle i + 0
    m_Particle[i+0]->y = pnew[1];
    m_Particle[i+1]->x = pnew[2]; // Particle i + 1
    m_Particle[i+1]->y = pnew[0];
}

But really, you've encountered the "Array of structs" vs. "Struct of arrays" issue. SSE code works better with a "Struct of arrays" like:

struct Particles
{
    float x[PARTICLES];
    float y[PARTICLES];
    float xv[PARTICLES];
    float yv[PARTICLES];
};

Another option is a hybrid approach:

struct Particles4
{
    __m128 x;
    __m128 y;
    __m128 xv;
    __m128 yv;
};

Particles4 particles[PARTICLES / 4];

Either way will give simpler and faster code than your example.
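
For example, with the struct-of-arrays layout the update loop becomes plain contiguous loads and stores. A sketch (assuming PARTICLES is a multiple of 4 so every _mm_load_ps address stays 16-byte aligned; ParticlesSoA is just the struct above under a different name):

#include <xmmintrin.h>

struct alignas(16) ParticlesSoA
{
    float x[PARTICLES];
    float y[PARTICLES];
    float xv[PARTICLES];
    float yv[PARTICLES];
};

void update(ParticlesSoA& p)
{
    for (unsigned int i = 0; i < PARTICLES; i += 4)
    {
        __m128 x  = _mm_load_ps(&p.x[i]);
        __m128 y  = _mm_load_ps(&p.y[i]);
        __m128 xv = _mm_load_ps(&p.xv[i]);
        __m128 yv = _mm_load_ps(&p.yv[i]);
        _mm_store_ps(&p.x[i], _mm_add_ps(x, xv)); // x += xv for four particles at once
        _mm_store_ps(&p.y[i], _mm_add_ps(y, yv)); // y += yv for four particles at once
    }
}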

SSE and AVX intrinsics mixture

You can mix SSE and AVX intrinsics all you want.

The only thing you need to make sure of is that you specify the correct compiler flag to enable AVX:

  • GCC: -mavx
  • Visual Studio: /arch:AVX

Failing to do so will either result in the code not compiling (GCC) or, in the case of Visual Studio, in problems like this one:

  • Using AVX CPU instructions: Poor performance without "/arch:AVX"

What the flag does is force all SIMD instructions to use VEX encoding, which avoids the state-switching penalties described in the question linked above.
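
For illustration, with the flag enabled a function like this freely mixes 128-bit and 256-bit intrinsics, because the compiler emits VEX-encoded instructions for both (a sketch; the function itself is made up):

#include <immintrin.h>

__m256 scale_and_offset(__m256 v, __m128 offset_lo)
{
    __m256 scaled = _mm256_mul_ps(v, _mm256_set1_ps(2.0f)); // 256-bit multiply
    __m128 lo     = _mm256_castps256_ps128(scaled);         // 128-bit view of the low half
    lo            = _mm_add_ps(lo, offset_lo);              // 128-bit add, VEX-encoded under -mavx
    return _mm256_insertf128_ps(scaled, lo, 0);             // merge the low half back in
}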


