SIMD prefix sum on Intel CPU

The fastest parallel prefix sum algorithm I know of runs over the data in two passes in parallel and also uses SSE in the second pass.

In the first pass you calculate partial sums in parallel and store the total of each partial sum. In the second pass you add the running total of the preceding partial sums to each partial sum. You can run both passes in parallel using multiple threads (e.g. with OpenMP). In the second pass you can also use SIMD, since a constant value is being added to each element of a partial sum.
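Here is a minimal scalar OpenMP sketch of the two-pass idea, before any SIMD is added (the names scan_omp and partial are mine, not from the final code further down; it relies on both omp for loops using the same static schedule so each thread gets the same block in both passes):

#include <omp.h>

void scan_omp(const float *a, float *s, const int n) {
    float *partial;                       // partial[t+1] = block total of thread t
    #pragma omp parallel
    {
        const int t  = omp_get_thread_num();
        const int nt = omp_get_num_threads();
        #pragma omp single
        {
            partial = new float[nt + 1];
            partial[0] = 0.0f;
        }
        // pass 1: each thread scans its own block and records the block total
        float sum = 0.0f;
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; i++) {
            sum += a[i];
            s[i] = sum;
        }
        partial[t + 1] = sum;
        #pragma omp barrier
        // each thread's offset is the sum of all preceding block totals
        float offset = 0.0f;
        for (int k = 0; k <= t; k++) offset += partial[k];
        // pass 2: add the constant offset to every element of this thread's block
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++) s[i] += offset;
    }
    delete[] partial;
}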

Assuming n elements of an array, m cores, and a SIMD width of w, the time cost should be

n/m + n/(m*w) = (n/m)*(1+1/w)

Since the first pass does not use SIMD, the time cost will always be greater than n/m.

For example, for four cores with a SIMD width of 4 (four 32-bit floats with SSE) the cost would be 5n/16, or about 3.2 times faster than sequential code, which has a time cost of n. Using hyper-threading the speedup will be greater still.

In special cases it's possible to use SIMD on the first pass as well. Then the time cost is simply

2*n/(m*w)

I posted the code for the general case, which uses OpenMP for the threading and intrinsics for the SSE code, and discuss details about the special case in the following answer:
Parallel prefix (cumulative) sum with SSE

Edit:
I managed to find a SIMD version for the first pass which is about twice as fast as sequential code. Now I get a total boost of about 7 on my four-core Ivy Bridge system.

Edit:
For larger arrays one problem is that after the first pass most values have been evicted from the cache. I came up with a solution which runs in parallel inside a chunk but runs each chunk serially. The chunk_size is a value that should be tuned. For example I set it to 1MB = 256K floats. Now the second pass is done while the values are still inside the level-2 cache. Doing this gives a big improvement for large arrays.

Here is the code for SSE. The AVX code is about the same speed so I did not post it here. The function which does the prefix sum is scan_omp_SSEp2_SSEp1_chunk. Pass it an array a of floats and it fills the array s with the cumulative sum.

__m128 scan_SSE(__m128 x) {
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
    x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
    return x;
}

float pass1_SSE(float *a, float *s, const int n) {
    __m128 offset = _mm_setzero_ps();
    #pragma omp for schedule(static) nowait
    for (int i = 0; i < n / 4; i++) {
        __m128 x = _mm_load_ps(&a[4 * i]);
        __m128 out = scan_SSE(x);
        out = _mm_add_ps(out, offset);
        _mm_store_ps(&s[4 * i], out);
        offset = _mm_shuffle_ps(out, out, _MM_SHUFFLE(3, 3, 3, 3));
    }
    float tmp[4];
    _mm_storeu_ps(tmp, offset);  // unaligned store: tmp is not guaranteed to be 16-byte aligned
    return tmp[3];
}

void pass2_SSE(float *s, __m128 offset, const int n) {
    #pragma omp for schedule(static)
    for (int i = 0; i < n / 4; i++) {
        __m128 tmp1 = _mm_load_ps(&s[4 * i]);
        tmp1 = _mm_add_ps(tmp1, offset);
        _mm_store_ps(&s[4 * i], tmp1);
    }
}

void scan_omp_SSEp2_SSEp1_chunk(float a[], float s[], int n) {
    float *suma;
    const int chunk_size = 1 << 18;
    const int nchunks = n % chunk_size == 0 ? n / chunk_size : n / chunk_size + 1;
    //printf("nchunks %d\n", nchunks);
    #pragma omp parallel
    {
        const int ithread = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();

        #pragma omp single
        {
            suma = new float[nthreads + 1];
            suma[0] = 0;
        }

        float offset2 = 0.0f;
        for (int c = 0; c < nchunks; c++) {
            const int start = c * chunk_size;
            const int chunk = (c + 1) * chunk_size < n ? chunk_size : n - c * chunk_size;
            // pass 1: each thread scans its part of the chunk and records its partial total
            suma[ithread + 1] = pass1_SSE(&a[start], &s[start], chunk);
            #pragma omp barrier
            // scan the per-thread totals so suma[ithread] holds each thread's offset
            #pragma omp single
            {
                float tmp = 0;
                for (int i = 0; i < (nthreads + 1); i++) {
                    tmp += suma[i];
                    suma[i] = tmp;
                }
            }
            // pass 2: add the thread offset plus the carry from the previous chunk
            __m128 offset = _mm_set1_ps(suma[ithread] + offset2);
            pass2_SSE(&s[start], offset, chunk);
            #pragma omp barrier
            offset2 = s[start + chunk - 1];  // carry the running total into the next chunk
        }
    }
    delete[] suma;
}

Parallel prefix (cumulative) sum with SSE

This is the first time I'm answering my own question, but it seems appropriate. Based on hirschhornsalz's
answer for a prefix sum on 16 bytes (SIMD prefix sum on Intel CPU), I have come up with a solution for using SIMD on the first pass for 4, 8, and 16 32-bit words.

The general theory goes as follows. A sequential scan of n words takes n additions (n-1 to scan the n words and one more to carry in the sum from the previous set of words scanned). However, using SIMD, n words can be scanned in log2(n) additions and an equal number of shifts, plus one more addition and a broadcast to carry in the sum from the previous SIMD scan. So for some value of n the SIMD method will win.
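To make that concrete, here is a trace of the in-register scan for a single 4-element SSE vector (element 0 on the left; this is just an illustration of what the scan_SSE function below does):

x                            = [ a,  b,    c,      d       ]
x += x shifted up by 1 elem  = [ a,  a+b,  b+c,    c+d     ]
x += x shifted up by 2 elems = [ a,  a+b,  a+b+c,  a+b+c+d ]

Two shifts and two adds produce the in-vector prefix sum; one more add and a broadcast of the last element carry the total into the next vector.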

Let's look at 32-bit words with SSE, AVX, and AVX-512:

4 32-bit words (SSE):      2 shifts, 3 adds, 1 broadcast    sequential: 4 adds
8 32-bit words (AVX):      3 shifts, 4 adds, 1 broadcast    sequential: 8 adds
16 32-bit words (AVX-512): 4 shifts, 5 adds, 1 broadcast    sequential: 16 adds

Based on that it appears SIMD won't be useful for a scan for 32-bit words until AVX-512. This also assumes that the shifts and broadcast can be done in only 1 instruction. This is true for SSE but not for AVX and maybe not even for AVX2.

In any case I put together some working and tested code which does a prefix sum using SSE.

inline __m128 scan_SSE(__m128 x) {
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4)));
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
    return x;
}

void prefix_sum_SSE(float *a, float *s, const int n) {
    __m128 offset = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 x = _mm_load_ps(&a[i]);
        __m128 out = scan_SSE(x);
        out = _mm_add_ps(out, offset);
        _mm_store_ps(&s[i], out);
        offset = _mm_shuffle_ps(out, out, _MM_SHUFFLE(3, 3, 3, 3));
    }
}

Notice that the scan_SSE function has two additions (_mm_add_ps) and two shifts (_mm_slli_si128). The casts are only used to make the compiler happy and don't get converted to instructions. Then, inside the main loop over the array in prefix_sum_SSE, another addition and one shuffle are used. That's 6 operations total compared to only 4 additions with the sequential sum.

Here is a working solution for AVX:

inline __m256 scan_AVX(__m256 x) {
    __m256 t0, t1;
    //shift1_AVX + add
    t0 = _mm256_permute_ps(x, _MM_SHUFFLE(2, 1, 0, 3));
    t1 = _mm256_permute2f128_ps(t0, t0, 41);
    x = _mm256_add_ps(x, _mm256_blend_ps(t0, t1, 0x11));
    //shift2_AVX + add
    t0 = _mm256_permute_ps(x, _MM_SHUFFLE(1, 0, 3, 2));
    t1 = _mm256_permute2f128_ps(t0, t0, 41);
    x = _mm256_add_ps(x, _mm256_blend_ps(t0, t1, 0x33));
    //shift3_AVX + add
    x = _mm256_add_ps(x, _mm256_permute2f128_ps(x, x, 41));
    return x;
}

void prefix_sum_AVX(float *a, float *s, const int n) {
    __m256 offset = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 x = _mm256_loadu_ps(&a[i]);
        __m256 out = scan_AVX(x);
        out = _mm256_add_ps(out, offset);
        _mm256_storeu_ps(&s[i], out);
        //broadcast last element
        __m256 t0 = _mm256_permute2f128_ps(out, out, 0x11);
        offset = _mm256_permute_ps(t0, 0xff);
    }
}

The three shifts need 7 intrinsics. The broadcast needs 2 intrinsics. So with the 4 additions that's 13 intrinsics. For AVX2 only 5 intrinsics are needed for the shifts, so 11 intrinsics total. The sequential sum only needs 8 additions. Therefore likely neither AVX nor AVX2 will be useful for the first pass.

Edit:

So I finally benchmarked this and the results are unexpected. The SSE and AVX code are both about twice as fast as the following sequential code:

void scan(float a[], float s[], int n) {
    float sum = 0;
    for (int i = 0; i < n; i++) {
        sum += a[i];
        s[i] = sum;
    }
}

I guess this is due to instruction-level parallelism.

So that answers my own question. I succeeded in using SIMD for pass1 in the general case. When I combine this with OpenMP on my four-core Ivy Bridge system the total speedup is about seven for 512k floats.

Horizontal running diff and conditional update using SIMD/SSE?

This can be done with SIMD. The solution is similar to the solution for the prefix sum with SIMD.

Within a SIMD register the number of iterations goes as O(log2(simd_width)). Each iteration requires one shift, one subtraction, and one max. For example, with SSE it requires log2(4) = 2 iterations. You can apply your function to four elements like this:

__m128i foo_SSE(__m128i x, int c) {
    __m128i t, c1, c2;
    c1 = _mm_set1_epi32(c);
    c2 = _mm_set1_epi32(2*c);

    t = _mm_slli_si128(x, 4);
    t = _mm_sub_epi32(t, c1);
    x = _mm_max_epi32(x, t);

    t = _mm_slli_si128(x, 8);
    t = _mm_sub_epi32(t, c2);
    x = _mm_max_epi32(x, t);
    return x;
}

Once you have the result in a SIMD register you need to apply the "carry" to the next register. For example, let's say you have an array a of eight elements. You load SSE registers x1 and x2 like this:

__m128i x1 = _mm_loadu_si128((__m128i*)&a[0]);
__m128i x2 = _mm_loadu_si128((__m128i*)&a[4]);

Then to apply your function to all eight elements you would do

__m128i t, s;
s = _mm_setr_epi32(c, 2*c, 3*c, 4*c);

x1 = foo_SSE(x1,c);
x2 = foo_SSE(x2,c);
t = _mm_shuffle_epi32(x1, 0xff);
t = _mm_sub_epi32(t,s);
x2 = _mm_max_epi32(x2,t);

Note that c1, c2, and s are all constants within a loop so they only need to be calculated once.

In general you could apply your function to an unsigned int array a like this with SSE (with n a multiple of 4):

void fill_SSE(int *a, int n, int c) {
    __m128i offset = _mm_setzero_si128();
    __m128i s = _mm_setr_epi32(c, 2*c, 3*c, 4*c);
    for(int i=0; i<n/4; i++) {
        __m128i x = _mm_loadu_si128((__m128i*)&a[4*i]);
        __m128i out = foo_SSE(x, c);
        out = _mm_max_epi32(out, offset);
        _mm_storeu_si128((__m128i*)&a[4*i], out);
        offset = _mm_shuffle_epi32(out, 0xff);
        offset = _mm_sub_epi32(offset, s);
    }
}

I went ahead and profiled this SSE code. It's about 2.5 times faster than the serial version.

Another major advantage of this method, besides going as log2(simd_width), is that it breaks the dependency chain so that multiple SIMD operations can run at the same time (using multiple ports) instead of waiting for the previous result. The serial code is latency bound.

The current code works for unsigned integers but you could generalize it to signed integers as well as floats.

Here is the general code I used to test this. I created a bunch of abstract SIMD functions to emulate SIMD hardware before I implemented the SSE version.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>
#include <omp.h>

__m128i foo_SSE(__m128i x, int c) {
    __m128i t, c1, c2;
    c1 = _mm_set1_epi32(c);
    c2 = _mm_set1_epi32(2*c);

    t = _mm_slli_si128(x, 4);
    t = _mm_sub_epi32(t, c1);
    x = _mm_max_epi32(x, t);

    t = _mm_slli_si128(x, 8);
    t = _mm_sub_epi32(t, c2);
    x = _mm_max_epi32(x, t);
    return x;
}

void foo(int *a, int n, int c) {
    for(int i=0; i<n-1; i++) {
        if(a[i]-c > a[i+1]) a[i+1] = a[i]-c;
    }
}

void broad(int *a, int n, int k) {
    for(int i=0; i<n; i++) a[i] = k;
}

void shiftr(int *a, int *b, int n, int m) {
    int i;
    for(i=0; i<m; i++) b[i] = a[i];
    for(; i<n; i++) b[i] = a[i-m];
}

/*
void shiftr(int *a, int *b, int n, int m) {
    int i;
    for(i=0; i<m; i++) b[i] = 0;
    for(; i<n; i++) b[i] = a[i-m];
}
*/

void sub(int *a, int n, int c) {
    for(int i=0; i<n; i++) a[i] -= c;
}

void max(int *a, int *b, int n) {
    for(int i=0; i<n; i++) if(b[i]>a[i]) a[i] = b[i];
}

void step(int *a, int n, int c) {
    for(int i=0; i<n; i++) {
        a[i] -= (i+1)*c;
    }
}

void foo2(int *a, int n, int c) {
    int b[n];
    for(int m=1; m<n; m*=2) {
        shiftr(a,b,n,m);
        sub(b, n, m*c);
        max(a,b,n);
        //printf("n %d, m %d; ", n,m ); for(int i=0; i<n; i++) printf("%2d ", b[i]); puts("");
    }
}

void fill(int *a, int n, int w, int c) {
    int b[w], offset[w];
    broad(offset, w, -1000);
    for(int i=0; i<n/w; i++) {
        for(int m=1; m<w; m*=2) {
            shiftr(&a[w*i],b,w,m);
            sub(b, w, m*c);
            max(&a[w*i],b,w);
        }
        max(&a[w*i],offset,w);
        broad(offset,w,a[w*i+w-1]);
        step(offset, w, c);
    }
}

void fill_SSE(int *a, int n, int c) {
    __m128i offset = _mm_setzero_si128();
    __m128i s = _mm_setr_epi32(c, 2*c, 3*c, 4*c);
    for(int i=0; i<n/4; i++) {
        __m128i x = _mm_loadu_si128((__m128i*)&a[4*i]);
        __m128i out = foo_SSE(x, c);
        out = _mm_max_epi32(out,offset);
        _mm_storeu_si128((__m128i*)&a[4*i], out);
        offset = _mm_shuffle_epi32(out, 0xff);
        offset = _mm_sub_epi32(offset,s);
    }
}

void fill_SSEv2(int *a, int n, int c) {
    __m128i offset = _mm_setzero_si128();
    __m128i s = _mm_setr_epi32(1*c, 2*c, 3*c, 4*c);
    __m128i c1 = _mm_set1_epi32(1*c);
    __m128i c2 = _mm_set1_epi32(2*c);
    for(int i=0; i<n/4; i++) {
        __m128i x1 = _mm_loadu_si128((__m128i*)&a[4*i]);
        __m128i t1;

        t1 = _mm_slli_si128(x1, 4);
        t1 = _mm_sub_epi32 (t1, c1);
        x1 = _mm_max_epi32 (x1, t1);

        t1 = _mm_slli_si128(x1, 8);
        t1 = _mm_sub_epi32 (t1, c2);
        x1 = _mm_max_epi32 (x1, t1);

        x1 = _mm_max_epi32(x1,offset);
        _mm_storeu_si128((__m128i*)&a[4*i], x1);
        offset = _mm_shuffle_epi32(x1, 0xff);
        offset = _mm_sub_epi32(offset,s);
    }
}

void fill_SSEv3(int *a, int n, int c) {
    __m128i offset = _mm_setzero_si128();
    __m128i s = _mm_setr_epi32(1*c, 2*c, 3*c, 4*c);
    __m128i c1 = _mm_set1_epi32(1*c);
    __m128i c2 = _mm_set1_epi32(2*c);
    for(int i=0; i<n/8; i++) {
        __m128i x1 = _mm_loadu_si128((__m128i*)&a[8*i]);
        __m128i x2 = _mm_loadu_si128((__m128i*)&a[8*i+4]);
        __m128i t1, t2;

        t1 = _mm_slli_si128(x1, 4);
        t1 = _mm_sub_epi32 (t1, c1);
        x1 = _mm_max_epi32 (x1, t1);

        t1 = _mm_slli_si128(x1, 8);
        t1 = _mm_sub_epi32 (t1, c2);
        x1 = _mm_max_epi32 (x1, t1);

        t2 = _mm_slli_si128(x2, 4);
        t2 = _mm_sub_epi32 (t2, c1);
        x2 = _mm_max_epi32 (x2, t2);

        t2 = _mm_slli_si128(x2, 8);
        t2 = _mm_sub_epi32 (t2, c2);
        x2 = _mm_max_epi32 (x2, t2);

        x1 = _mm_max_epi32(x1,offset);
        _mm_storeu_si128((__m128i*)&a[8*i], x1);
        offset = _mm_shuffle_epi32(x1, 0xff);
        offset = _mm_sub_epi32(offset,s);

        x2 = _mm_max_epi32(x2,offset);
        _mm_storeu_si128((__m128i*)&a[8*i+4], x2);
        offset = _mm_shuffle_epi32(x2, 0xff);
        offset = _mm_sub_epi32(offset,s);
    }
}

int main(void) {
    int n = 8, a[n], a1[n], a2[n];
    for(int i=0; i<n; i++) a[i] = i;

    /*
    a[0] = 1, a[1] = 0;
    a[2] = 2, a[3] = 0;
    a[4] = 3, a[5] = 13;
    a[6] = 4, a[7] = 0;
    */

    a[0] = 5, a[1] = 6;
    a[2] = 7, a[3] = 8;
    a[4] = 1, a[5] = 2;
    a[6] = 3, a[7] = 4;

    for(int i=0; i<n; i++) printf("%2d ", a[i]); puts("");
    for(int i=0; i<n; i++) a1[i] = a[i], a2[i] = a[i];

    int c = 1;
    foo(a1,n,c);
    foo2(a2,n,c);
    for(int i=0; i<n; i++) printf("%2d ", a1[i]); puts("");
    for(int i=0; i<n; i++) printf("%2d ", a2[i]); puts("");

    __m128i x1 = _mm_loadu_si128((__m128i*)&a[0]);
    __m128i x2 = _mm_loadu_si128((__m128i*)&a[4]);
    __m128i t, s;
    s = _mm_setr_epi32(c, 2*c, 3*c, 4*c);

    x1 = foo_SSE(x1,c);
    x2 = foo_SSE(x2,c);
    t = _mm_shuffle_epi32(x1, 0xff);
    t = _mm_sub_epi32(t,s);
    x2 = _mm_max_epi32(x2,t);

    int a3[8];
    _mm_storeu_si128((__m128i*)&a3[0], x1);
    _mm_storeu_si128((__m128i*)&a3[4], x2);
    for(int i=0; i<8; i++) printf("%2d ", a3[i]); puts("");

    int w = 8;
    n = w*1000;
    int f1[n], f2[n];
    for(int i=0; i<n; i++) f1[i] = rand()%1000;

    for(int i=0; i<n; i++) f2[i] = f1[i];
    //for(int i=0; i<n; i++) printf("%2d ", f1[i]); puts("");
    foo(f1, n, c);
    //fill(f2, n, 8, c);
    fill_SSEv3(f2, n, c);
    printf("%d\n", memcmp(f1,f2,sizeof(int)*n));
    for(int i=0; i<n; i++) {
        // if(f1[i] != f2[i]) printf("%d\n", i);
    }
    //for(int i=0; i<n; i++) printf("%2d ", f1[i]); puts("");
    //for(int i=0; i<n; i++) printf("%2d ", f2[i]); puts("");

    int r = 200000;
    double dtime;
    dtime = -omp_get_wtime();
    for(int i=0; i<r; i++) fill_SSEv2(f2, n, c);
    //for(int i=0; i<r; i++) foo(f1, n, c);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    for(int i=0; i<r; i++) fill_SSEv3(f2, n, c);
    //for(int i=0; i<r; i++) foo(f1, n, c);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    for(int i=0; i<r; i++) foo(f1, n, c);
    //for(int i=0; i<r; i++) fill_SSEv2(f2, n, c);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);
}

Based on a comment by Paul R I was able to fix my function to work with signed integers. However, it requires c>=0. I am sure it could be fixed to work for c<0.

void fill_SSEv2(int *a, int n, int c) {
    __m128i offset = _mm_set1_epi32(0xf0000000);
    __m128i s = _mm_setr_epi32(1*c, 2*c, 3*c, 4*c);
    __m128i c1 = _mm_set1_epi32(1*c);
    __m128i c2 = _mm_set1_epi32(2*c);
    for(int i=0; i<n/4; i++) {
        __m128i x1 = _mm_loadu_si128((__m128i*)&a[4*i]);
        __m128i t1;

        t1 = _mm_shuffle_epi32(x1, 0x90);
        t1 = _mm_sub_epi32 (t1, c1);
        x1 = _mm_max_epi32 (x1, t1);

        t1 = _mm_shuffle_epi32(x1, 0x44);
        t1 = _mm_sub_epi32 (t1, c2);
        x1 = _mm_max_epi32 (x1, t1);

        x1 = _mm_max_epi32(x1,offset);
        _mm_storeu_si128((__m128i*)&a[4*i], x1);
        offset = _mm_shuffle_epi32(x1, 0xff);
        offset = _mm_sub_epi32(offset,s);
    }
}

This method should be easily extended to floats now.

SIMD optimization of a curve computed from the second derivative

So you're just looking for a way to generate vectors of 4 phase_current values, which you can pass as an arg to an arbitrary function.

TL:DR: set up the initial vectors for increment and step so each vector element strides through the sequence by 4, giving you vectors of phase_current[i+0..i+3] with still only two vector ADD operations (vertical, not horizontal). This serial dependency is one you can factor out with algebra / math.


This is a bit like a prefix sum (which you can SIMD with log2(vector_width) shuffle+add operations for vectors with vector_width elements). You can also parallelize prefix sums with multiple threads using a two-step calculation where each thread prefix-sums a region of an array, then the results are combined and each thread offsets its region of the destination array by a constant (the sum total leading up to the first element of that region). See the linked question for multi-threading, too.


But you have the huge simplification that phase_increment_step (the 2nd derivative of the value you want) is constant. I'm assuming that USEFUL_FUNC(phase_current); takes its arg by value, not by non-const reference, so the only modification to phase_current is by the += in the loop. And that useful_func can't somehow mutate the increment or increment_step.

One option to implement this is just to run the scalar algorithm independently in 4 separate elements of SIMD vectors, offset by 1 iteration each time. With integer adds, especially on Intel CPUs where vector-integer add latency is only 1 cycle, running 4 iterations of the running-total is cheap, and we could just do that between calls to USEFUL_FUNC. That would be a way to generate vector inputs to USEFUL_FUNC doing exactly as much work as scalar code (assuming SIMD integer add is as cheap as scalar integer add, which is mostly true if we're limited by the data dependency to 2 adds per clock).
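A sketch of that option (my code, not from the original post; it assumes the scalar loop does increment += step; phase += increment; each iteration, which is the recurrence the design notes below work through, and the function name is hypothetical):

#include <stdint.h>
#include <immintrin.h>

void USEFUL_FUNC(__m128i);

// Brute-force version: each lane carries one of 4 interleaved scalar sequences,
// and every lane advances by 4 scalar iterations between calls to USEFUL_FUNC.
void double_integral_bruteforce(uint32_t p0, uint32_t inc0, uint32_t S, unsigned blockSize)
{
    // lanes hold iterations 0..3 of the scalar sequence
    __m128i phase = _mm_setr_epi32(p0,
                                   p0 + 1*inc0 + 1*S,
                                   p0 + 2*inc0 + 3*S,
                                   p0 + 3*inc0 + 6*S);
    // the increment each lane will add on its next step
    __m128i incr = _mm_setr_epi32(inc0 + 1*S, inc0 + 2*S, inc0 + 3*S, inc0 + 4*S);
    __m128i step = _mm_set1_epi32(S);

    for (unsigned i = 0; i < blockSize; ++i) {
        USEFUL_FUNC(phase);
        // 4 scalar iterations per lane: cheap 1-cycle-latency vector-integer adds
        for (int k = 0; k < 4; ++k) {
            phase = _mm_add_epi32(phase, incr);
            incr  = _mm_add_epi32(incr, step);
        }
    }
}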

The above method is somewhat more general and could be useful for variations on this problem where there's a true serial dependency that we can't eliminate cheaply with simple math.


If we're clever, we can do even better than a prefix sum or brute-force running 4 sequences one step at a time. Ideally we can derive a closed-form way to step by 4 in the sequence of values (or whatever the SIMD vector width is, times whatever unroll factor you want for multiple accumulators for USEFUL_FUNC).

Summing a sequence of step, step*2, step*3, ... will give us a constant times Gauss's closed-form formula for the sum of integers up to n: sum(1..n) = n*(n+1)/2. This sequence goes 0, 1, 3, 6, 10, 15, 21, 28, ... (https://oeis.org/A000217). (I've factored out the initial phase_increment).

The trick is going by 4 in this sequence. (n+4)*(n+5)/2 - n*(n+1)/2 simplifies down to 4*n + 10. Taking the derivative of that again, we get 4. But to go 4 steps in the 2nd integral, we have 4*4 = 16. So we can maintain a vector phase_increment which we increment with a SIMD add with a vector of 16*phase_increment_step.

I'm not totally sure I have the step-counting reasoning right (the extra factor of 4 to give 16 is a bit surprising). Working out the right formulas and taking first and second-differences in the sequence of vectors makes it very clear how this works out:

 // design notes, working through the first couple vectors
// to prove this works correctly.

S = increment_step (constant)
inc0 = increment initial value
p0 = phase_current initial value

// first 8 step-increases:
[ 0*S, 1*S, 2*S, 3*S ]
[ 4*S, 5*S, 6*S, 7*S ]

// first vector of 4 values:
[ p0, p0+(inc0+S), p0+(inc0+S)+(inc0+2*S), p0+(inc0+S)+(inc0+2*S)+(inc0+3*S) ]
[ p0, p0+inc0+S, p0+2*inc0+3*S, p0+3*inc0+6*S ] // simplified

// next 4 values:
[ p0+4*inc0+10*S, p0+5*inc0+15*S, p0+6*inc0+21*S, p0+7*inc0+28*S ]

Using this and the earlier 4*n + 10 formula:

// first 4 vectors of phase_current
[ p0, p0+1*inc0+ 1*S, p0+2*inc0+3*S, p0+ 3*inc0+ 6*S ]
[ p0+4*inc0+10*S, p0+5*inc0+15*S, p0+6*inc0+21*S, p0+ 7*inc0+28*S ]
[ p0+8*inc0+36*S, p0+9*inc0+45*S, p0+10*inc0+55*S, p0+11*inc0+66*S ]
[ p0+12*inc0+78*S, p0+13*inc0+91*S, p0+14*inc0+105*S, p0+15*inc0+120*S ]

first 3 vectors of phase_increment (subtract consecutive phase_current vectors):
[ 4*inc0+10*S, 4*inc0 + 14*S, 4*inc0 + 18*S, 4*inc0 + 22*S ]
[ 4*inc0+26*S, 4*inc0 + 30*S, 4*inc0 + 34*S, 4*inc0 + 38*S ]
[ 4*inc0+42*S, 4*inc0 + 46*S, 4*inc0 + 50*S, 4*inc0 + 54*S ]

first 2 vectors of phase_increment_step:
[ 16*S, 16*S, 16*S, 16*S ]
[ 16*S, 16*S, 16*S, 16*S ]
Yes, as expected, a constant vector works for phase_increment_step

So we can write code like this with Intel's SSE/AVX intrinsics:

#include <stdint.h>
#include <immintrin.h>

void USEFUL_FUNC(__m128i);

// TODO: more efficient generation of initial vector values
void double_integral(uint32_t phase_start, uint32_t phase_increment_start, uint32_t phase_increment_step, unsigned blockSize)
{
    __m128i pstep1 = _mm_set1_epi32(phase_increment_step);

    // each vector element steps by 4
    uint32_t inc0 = phase_increment_start, S = phase_increment_step;
    __m128i pincr = _mm_setr_epi32(4*inc0 + 10*S, 4*inc0 + 14*S, 4*inc0 + 18*S, 4*inc0 + 22*S);

    __m128i phase = _mm_setr_epi32(phase_start, phase_start + 1*inc0 + 1*S, phase_start + 2*inc0 + 3*S, phase_start + 3*inc0 + 6*S);
    //_mm_set1_epi32(phase_start); and add.
    // shuffle to do a prefix-sum initializer for the first vector?  Or SSE4.1 pmullo by a vector constant?

    __m128i pstep_stride = _mm_slli_epi32(pstep1, 4);  // stride by pstep * 16
    for (unsigned i = 0; i < blockSize; ++i) {
        USEFUL_FUNC(phase);
        pincr = _mm_add_epi32(pincr, pstep_stride);
        phase = _mm_add_epi32(phase, pincr);
    }
}

Further reading: for more about SIMD in general, but mostly x86 SSE/AVX, see https://stackoverflow.com/tags/sse/info, especially slides from SIMD at Insomniac Games (GDC 2015) which have some good stuff about how to think about SIMD in general, and how to lay out your data so you can use it.

Is it possible to use SIMD on a serial dependency in a calculation, like an exponential moving average filter?

If there's a closed-form formula for n steps ahead, you can use that to sidestep serial dependencies. If it can be computed with just different coefficients with the same operations as for 1 step, a broadcast is all you need.

Like in this case, z1 = c + z1 * b, so applying that twice we get

# I'm using Z0..n as the elements in the series your loop calculates
Z2 = c + (c+Z0*b)*b
= c + c*b + Z0*b^2

c + c*b and b^2 are both constants, if I'm understanding your code correctly that all the C variables are really just C variables, not pseudocode for an array reference. (So everything except your z1 is loop invariant.)


So if we have a SIMD vector of 2 elements, starting with Z0 and Z1, we can step each of them forward by 2 to get Z2 and Z3.

void SmoothBlock(int blockSize, double b, double c, double z_init) {

    // z1 = inputA0 + z1 * b1;
    __m128d zv = _mm_setr_pd(z_init, z_init*b + c);

    __m128d step2_mul = _mm_set1_pd(b*b);
    __m128d step2_add = _mm_set1_pd(c + c*b);

    for (int i = 0; i < blockSize-1; i += 2) {
        _mm_storeu_pd(mSmoothedValues + i, zv);
        zv = _mm_mul_pd(zv, step2_mul);
        zv = _mm_add_pd(zv, step2_add);
        // compile with FMA + fast-math for the compiler to fold the mul/add into one FMA
    }
    // handle final odd element if necessary
    if (blockSize%2 != 0)
        _mm_store_sd(mSmoothedValues + blockSize-1, zv);
}

With float + AVX (8 elements per vector), you'd have

__m256 zv = _mm256_setr_ps(z_init, c + z_init*b,
                           c + c*b + z_init*b*b,
                           c + c*b + c*b*b + z_init*b*b*b, ...);

// Z2 = c + c*b + Z0*b^2
// Z3 = c + c*b + (c + Z0*b) * b^2
// Z3 = c + c*b + c*b^2 + Z0*b^3

and the add/mul factors would be for 8 steps.
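For reference, a small sketch of how those 8-step constants could be computed (make_step8 is my helper name, under the same recurrence z = c + z*b):

#include <immintrin.h>

// Z[n+8] = b^8 * Z[n] + c * (1 + b + b^2 + ... + b^7)
static inline void make_step8(float b, float c, __m256 *mul8, __m256 *add8) {
    float bk = 1.0f, geo = 0.0f;
    for (int k = 0; k < 8; ++k) { geo += bk; bk *= b; }   // geo = 1+b+...+b^7, then bk = b^8
    *mul8 = _mm256_set1_ps(bk);
    *add8 = _mm256_set1_ps(c * geo);
}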

Normally people use float for SIMD because you get twice as many elements per vector, and half the memory bandwidth / cache footprint, so you typically get a factor-of-2 speedup over double. (Same number of vectors / bytes per clock.)


The above loop on a Haswell or Sandybridge CPU, for example, will run at one vector per 8 cycles, bottlenecked on the latency of mulpd (5 cycles) + addpd (3 cycles). We generate 2 double results per vector, but that's still a huge bottleneck compared to 1 mul and 1 add per clock throughput. We're missing out on a factor of 8 throughput.

(Or if compiled with one FMA instead of mul->add, then we have 5 cycle latency).

Sidestepping the serial dependency is useful not just for SIMD: avoiding bottlenecks on FP add/mul (or FMA) latency will give further speedups, up to the ratio of FP add+mul latency to add+mul throughput.

Simply unroll more, and use multiple vectors, like zv0, zv1, zv2, zv3. This increases the number of steps you make at once, too. So for example, 16-byte vectors of float, with 4 vectors, would be 4x4 = 16 steps.
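As a sketch of that unrolling idea for the double version above (my code; mSmoothedValues is passed in as a parameter here, and blockSize is assumed to be a multiple of 4 just to keep tail handling out of the way):

#include <immintrin.h>

void SmoothBlock_unroll2(int blockSize, double b, double c, double z_init,
                         double *mSmoothedValues) {
    // two independent dependency chains, each stepping 4 iterations at a time
    double z0 = z_init;
    double z1 = c + b*z0;
    double z2 = c + b*z1;
    double z3 = c + b*z2;
    __m128d zv0 = _mm_setr_pd(z0, z1);   // holds Z[i],   Z[i+1]
    __m128d zv1 = _mm_setr_pd(z2, z3);   // holds Z[i+2], Z[i+3]

    // Z[n+4] = b^4 * Z[n] + c*(1 + b + b^2 + b^3)
    __m128d step4_mul = _mm_set1_pd(b*b*b*b);
    __m128d step4_add = _mm_set1_pd(c*(1.0 + b + b*b + b*b*b));

    for (int i = 0; i < blockSize; i += 4) {
        _mm_storeu_pd(mSmoothedValues + i,     zv0);
        _mm_storeu_pd(mSmoothedValues + i + 2, zv1);
        zv0 = _mm_add_pd(_mm_mul_pd(zv0, step4_mul), step4_add);
        zv1 = _mm_add_pd(_mm_mul_pd(zv1, step4_mul), step4_add);
    }
}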

Fastest way to do horizontal SSE vector sum (or other reduction)

In general for any kind of vector horizontal reduction, extract / shuffle the high half to line up with the low half, then vertical add (or min/max/or/and/xor/multiply/whatever); repeat until there's just a single element (with high garbage in the rest of the vector).

If you start with vectors wider than 128-bit, narrow in half until you get to 128 (then you can use one of the functions in this answer on that vector). But if you need the result broadcast to all elements at the end, then you can consider doing full-width shuffles all the way.
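For example, here is a minimal sketch of that pattern for summing the 4 floats in an SSE vector (one common way, using SSE3 movehdup to copy the odd elements; other shuffles work too):

#include <immintrin.h>

static inline float hsum_ps_sse3(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);      // [ v1, v1, v3, v3 ]
    __m128 sums = _mm_add_ps(v, shuf);     // [ v0+v1, ---, v2+v3, --- ]
    shuf = _mm_movehl_ps(shuf, sums);      // bring the high pair of sums down to the low half
    sums = _mm_add_ss(sums, shuf);         // element 0 = v0+v1+v2+v3
    return _mm_cvtss_f32(sums);
}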



