Parallel for VS Omp Simd: When to Use Each

Parallel for vs omp simd: when to use each?

The linked-to standard is relatively clear (p 13, lines 19+20):

When any thread encounters a simd construct, the iterations of the
loop associated with the construct can be executed by the SIMD lanes
that are available to the thread.

SIMD is a sub-thread thing. To make it more concrete, on a CPU you could imagine using simd directives to specifically request vectorization of chunks of loop iterations that individually belong to the same thread. It exposes the multiple levels of parallelism that exist within a single multicore processor in a platform-independent way. See for instance the discussion (along with the accelerator stuff) on this Intel blog post.

So basically, you'll want to use omp parallel to distribute work onto different threads, which can then run on multiple cores; and you'll want to use omp simd to make use of the vector pipelines (say) within each core. Normally omp parallel goes on the "outside" to deal with coarser-grained parallel distribution of work, and omp simd goes around tight loops inside of that to exploit fine-grained parallelism.
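A minimal sketch of that layering (the function and array names here are placeholders, not from any particular question):

void scale_rows(float *a, int rows, int cols, float factor)
{
    // Coarse-grained: each thread gets whole rows of the matrix.
    #pragma omp parallel for
    for (int r = 0; r < rows; r++)
    {
        // Fine-grained: the tight inner loop is vectorized within the
        // thread that owns this row, using its SIMD lanes.
        #pragma omp simd
        for (int c = 0; c < cols; c++)
            a[r * cols + c] *= factor;
    }
}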

How to use omp parallel for and omp simd together?

From OpenMP 4.5 Specification:

2.11.4 Parallel Loop SIMD Construct

The parallel loop SIMD construct is a shortcut for specifying a parallel
construct containing one loop SIMD construct and no other statement.

The syntax of the parallel loop SIMD construct is as follows:

#pragma omp parallel for simd
...

You can also write:

#pragma omp parallel
{
    #pragma omp for simd
    for ...
}

Why OpenMP 'simd' has better performance than 'parallel for simd'?

If sSize = 129, as you have in your edit, then the overhead of parallelizing the loop doesn't pay off. This would be easier to confirm if you'd show us the numbers of the sequential implementation (no SIMD) and the pure parallel implementation (i.e. with #pragma omp parallel for but no SIMD).

What is likely happening is that even the pure parallel version is slower than the sequential one. Not only is the loop small, but you also launch/create a parallel region for every iteration of the outer-most loop.

As for the SIMD version, this problem is essentially tailored for that: you have a highly vectorizable kernel that would be too small to distribute among threads.
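To make the shape of the problem concrete, here is a hedged sketch with placeholder names (not the asker's actual code): the inner loop has roughly sSize = 129 iterations, so the contrast looks like this:

// SIMD-only version: cheap, and ~129 iterations vectorize well.
void variant_simd(const float *a, const float *b, float *out, int outerSize, int sSize)
{
    for (int i = 0; i < outerSize; i++)
    {
        #pragma omp simd
        for (int j = 0; j < sSize; j++)
            out[i * sSize + j] = a[i * sSize + j] * b[j];
    }
}

// "parallel for simd" version: a parallel region is forked and joined on
// every outer iteration, and that overhead dwarfs ~129 iterations of work.
void variant_parallel(const float *a, const float *b, float *out, int outerSize, int sSize)
{
    for (int i = 0; i < outerSize; i++)
    {
        #pragma omp parallel for simd
        for (int j = 0; j < sSize; j++)
            out[i * sSize + j] = a[i * sSize + j] * b[j];
    }
}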

The difference between simd construct and for simd construct in OpenMP 4.0

#pragma omp simd (the SIMD construct) instructs the OpenMP compiler to vectorise the loop that follows without worksharing, that is without distributing the loop iterations among multiple threads (if any).
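For example (a minimal sketch with placeholder arrays):

void add_arrays(const float *a, const float *b, float *c, int n)
{
    // Vectorized by whichever thread encounters it; no worksharing,
    // no new threads are created.
    #pragma omp simd
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}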

#pragma omp for (the loop construct) instructs the compiler to perform the following loop while distributing the work among the threads of the current team. Therefore, the loop construct is only useful when placed within the lexical or dynamic extent of a parallel region, e.g.

#pragma omp parallel
{
    ...
    #pragma omp for
    for (i = 0; i < 100; i++) { ... }
    ...
}

#pragma omp for simd (also called loop SIMD construct) combines the two constructs above, i.e. it both distributes the iteration space among the threads in the team and further vectorises the partial loop that each thread performs. If not used within the scope of a parallel region, the for simd construct is equivalent to the simd construct.
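A sketch of the combined behaviour (placeholder saxpy-like kernel):

void saxpy(float *y, const float *x, float alpha, int n)
{
    #pragma omp parallel
    {
        // The iteration space is first split among the threads of the
        // team; each thread then vectorizes its own chunk.
        #pragma omp for simd
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }
}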

It is possible to combine the loop SIMD construct with the parallel construct:

#pragma omp parallel for simd
for (i = 0; i < 100; i++) { ... }

This combined construct creates a parallel region, distributes the iterations of the loop among the threads, and vectorises the partial loops.

Note that sometimes vectorisation and multithreading are not orthogonal with respect to performance. For example, if the loop is memory-bound, then both vectorisation and multithreading alone could lead to exhaustion of the available memory bandwidth and combining them won't bring any further speedup.
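A typical memory-bound case would be a streaming kernel like this sketch, where each element is touched exactly once and the arithmetic is trivial; once the memory bus is saturated, adding vectorization on top of threading (or the other way around) buys nothing:

void triad(float *a, const float *b, const float *c, float s, int n)
{
    // One multiply-add per ~12 bytes moved: bandwidth, not compute,
    // is the limiting factor on most machines.
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}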

Also, when comparing the speedup with #pragma omp simd and with #pragma omp [parallel] for simd, keep in mind that multithreading alone usually delivers better speedup than vectorisation for the same amount of "multiplicity", i.e. a four-way SIMD-ised loop might (and most likely would) execute slower than when the same loop is computed with scalar instructions but split among four threads.

Differences between `#pragma omp parallel for collapse` and `#pragma omp parallel for`

TL;DR: both implementations are quite inefficient. The second one will likely be slower than the first in practice, although it could theoretically scale better.

The first implementation is unlikely to be vectorized because the accesses are not contiguous in memory. Both GCC 10 and Clang 11 generate inefficient code.
The point is that OpenMP provides no high-level SIMD construct to deal with data transposition! Thus, if you want to do it efficiently, you probably need to get your hands dirty and do it yourself (or use an external library that does it for you).

The second implementation could be significantly slower than the first because the loop iterator is linearized, often resulting in more instructions being executed in the hot path. Some implementations (e.g. Clang 11 and ICC 19, but not GCC 10) even use a very slow modulus operation (i.e. a div instruction) to do so, resulting in a much slower loop.

The second implementation should also theoretically scale better than the first because the collapse clause provides more parallelism. Indeed, in the first implementation there are only rows iterations to share among the n threads. So, if you work on massively parallel machines or on wide rectangular matrices where n is not so small compared to rows, this could cause some work imbalance, or even thread starvation.
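(The question's code is not reproduced here, but the two variants being compared are presumably of roughly this shape, with input of size rows*columns and output of size columns*rows, both row-major:)

// "First implementation": parallelize the outer loop only.
#pragma omp parallel for
for (int i = 0; i < rows; i++)
    for (int j = 0; j < columns; j++)
        output[j * rows + i] = input[i * columns + j];

// "Second implementation": collapse both loops into a single iteration
// space; more parallelism, but the linearized index must be decomposed
// back into (i, j), sometimes with a costly div/modulus.
#pragma omp parallel for collapse(2)
for (int i = 0; i < rows; i++)
    for (int j = 0; j < columns; j++)
        output[j * rows + i] = input[i * columns + j];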



Why both implementations are inefficient

The two implementations are inefficient because of the memory access pattern. Indeed, on big matrices, writes to output are not contiguous and will cause many cache misses. A full cache line (64 bytes on most common architectures) is transferred even though only a few bytes of it are actually written. If columns is a power of two, cache thrashing will occur and further decrease performance.

One solution to mitigate these issues is to use tiling. Here is an example:

// Assume rows and columns are nice for sake of clarity ;)
constexpr int tileSize = 8;
assert(rows % tileSize == 0);
assert(columns % tileSize == 0);

// Note the collapse clause is needed here for scalability and
// the collapse overhead is mitigated by the inner loop.
#pragma omp parallel for collapse(2)
for (int i = 0; i < rows; i += tileSize)
{
    for (int j = 0; j < columns; j += tileSize)
    {
        for (int ti = i; ti < i + tileSize; ++ti)
        {
            for (int tj = j; tj < j + tileSize; ++tj)
            {
                output[tj * rows + ti] = input[ti * columns + tj];
            }
        }
    }
}

The above code should be faster, but it is not optimal. Writing fast transposition code is challenging. Here is some advice to improve it:

  • use a temporary tile buffer to improve the memory access pattern (so the compiler can use fast SIMD instructions); a sketch of this is shown after this list
  • use square tiles to improve the use of the cache
  • use multi-level tiling to improve the use of the L2/L3 cache or use a Z-tiling approach
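A minimal sketch of the first suggestion (staging each tile in a temporary buffer), reusing the hypothetical rows, columns, tileSize, input and output from the example above:

#pragma omp parallel for collapse(2)
for (int i = 0; i < rows; i += tileSize)
{
    for (int j = 0; j < columns; j += tileSize)
    {
        // Small buffer that stays in the L1 cache.
        float tile[tileSize][tileSize];

        // Load: reads from input are contiguous (stride-1 in tj); the
        // transposition happens while writing into the cached tile.
        for (int ti = 0; ti < tileSize; ++ti)
            for (int tj = 0; tj < tileSize; ++tj)
                tile[tj][ti] = input[(i + ti) * columns + (j + tj)];

        // Store: writes to output are contiguous (stride-1 in ti), so
        // the compiler can emit wide vector stores.
        for (int tj = 0; tj < tileSize; ++tj)
        {
            #pragma omp simd
            for (int ti = 0; ti < tileSize; ++ti)
                output[(j + tj) * rows + (i + ti)] = tile[tj][ti];
        }
    }
}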

Alternatively, you can simply use a fast BLAS implementation that provides well-optimized matrix transposition functions (not all do, but AFAIK OpenBLAS and the MKL do).

PS: I assumed matrices are stored in a row-major order.

SIMD vs OMP in vector multiplication

With compilers supporting OpenMP4.x you may want to start from something like this:

void multiply_singular_omp_for_simd(float *a, float *b, float *d)
{
    #pragma omp parallel for simd schedule(static, 16)
    for (int i = 0; i < SIZE; i++)
        d[i] = a[i] * b[i];
}

It will give you both SIMD and thread parallelism. The parallel decomposition is done automatically: first, chunks of iterations are spread across threads/cores; then, within every chunk, the individual iterations are spread across the SIMD lanes.

Read the following couple of articles if you want more background: Threading and SIMD in OpenMP4, and the ICC documentation.

Formally, the way you phrased your question is slightly ambiguous, because starting from 4.0 an OpenMP loop can be SIMD-parallel, thread-parallel, or both. So it's not about OpenMP vs. SIMD anymore; instead it's about OpenMP SIMD vs. OpenMP threading.

Not sure how good your particular GCC implementation is, but ICC/IFORT have been able to deal with omp parallel for simd for a relatively long time now. GCC should also support it starting from 5.x (#pragma omp simd has been supported by GCC for some time, but that is not necessarily the case for #pragma omp parallel for simd).

For an optimal compiler-driven implementation you may ideally prefer to do cache blocking and manually split the iteration space, so that the outer loop is driven by omp parallel for while the innermost loop is driven by omp simd. But this is maybe slightly out of the scope of the original question.
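A rough sketch of what that manual split could look like for the element-wise multiply above (the block size is only an illustrative guess, and for a purely streaming kernel the blocking itself buys little; the point is how the two constructs are separated):

void multiply_blocked(const float *a, const float *b, float *d)
{
    const int blockSize = 4096;  // illustrative; tune for your cache sizes

    // Outer loop: coarse-grained distribution of blocks across threads.
    #pragma omp parallel for
    for (int ib = 0; ib < SIZE; ib += blockSize)
    {
        const int end = (ib + blockSize < SIZE) ? ib + blockSize : SIZE;

        // Innermost loop: vectorized within the owning thread.
        #pragma omp simd
        for (int i = ib; i < end; i++)
            d[i] = a[i] * b[i];
    }
}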

Intel's pragma simd vs OpenMP's pragma omp simd

For all intents and purposes they should be identical. The difference is that the OpenMP 4.0 #pragma omp simd directive is portable and should work with other compilers that support OpenMP 4.0 as well as Intel's.

Furthermore, there are several clauses in the OpenMP version which allow you to vectorize instructions in a more robust manner (safelen(), linear(), aligned(), reduction(), and collapse() come to mind).
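For instance, here is a minimal sketch of a dot product using two of those clauses (reduction() to make the accumulation vector-safe, and aligned() under the assumption that the buffers are 32-byte aligned):

float dot(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    // reduction() handles the loop-carried accumulation; aligned()
    // promises 32-byte alignment so aligned vector loads can be used.
    #pragma omp simd reduction(+:sum) aligned(a, b : 32)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}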


