OpenMP and Reduction on std::vector

OpenMP and reduction on std::vector?

It is fairly straightforward to write a user-declared reduction for C++ vectors of a specific element type:

#include <algorithm>
#include <vector>

#pragma omp declare reduction(vec_float_plus : std::vector<float> : \
std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<float>())) \
initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))

std::vector<float> res(n, 0);
#pragma omp parallel for reduction(vec_float_plus : res)
for (size_t i = 0; i < m; i++) {
    res[...] += ...;
}

1a) Knowing m at compile time is not a requirement; the loop bound can be a runtime value.

1b) You cannot use the array-section reduction syntax on std::vectors, because they are not arrays (and std::vector::data is not an identifier). If it were possible, you would have to use n, as that is the number of elements in the array section. (A sketch of the pointer-based alternative follows point 2 below.)

2) As long as you are only reading indexes and vals, there is no issue.
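
Regarding 1b: for contrast, here is a hedged sketch of what an array-section reduction looks like when you go through the vector's data() pointer instead of the std::vector itself (this needs OpenMP 4.5 array-section reductions; n, m and the update are placeholders):

#include <cstddef>
#include <vector>

void section_reduction_sketch(std::size_t n, std::size_t m) {
    std::vector<float> res(n, 0.0f);
    float* p = res.data();                 // plain pointer to the contiguous buffer

    // The array section names the pointer, not the std::vector, and its length
    // is n, the number of elements being reduced.
    #pragma omp parallel for reduction(+ : p[0:n])
    for (std::size_t i = 0; i < m; ++i)
        p[i % n] += 1.0f;                  // placeholder for the real update
}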

Edit: The original initializer clause was simpler: initializer(omp_priv = omp_orig). However, if the original vector is not filled with zeros, the result will be wrong. Therefore, I suggest the more complicated initializer, which always creates a correctly sized vector of zero-valued elements.
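
To see why this matters, here is a small, hedged illustration; the starting values and iteration count are made up, and the vec_float_plus declaration is the one from above, repeated so the snippet is self-contained:

#include <algorithm>
#include <cstdio>
#include <vector>

#pragma omp declare reduction(vec_float_plus : std::vector<float> : \
    std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<float>())) \
    initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))

int main() {
    // res deliberately starts out nonzero.
    std::vector<float> res(4, 1.0f);

    #pragma omp parallel for reduction(vec_float_plus : res)
    for (int i = 0; i < 100; ++i)
        res[i % 4] += 1.0f;

    // With the size-only initializer every private copy starts at zero, so the
    // original 1.0f per element is counted exactly once: every element is 26.
    // With initializer(omp_priv = omp_orig) each private copy would also start
    // at 1.0f, adding one spurious 1.0f per thread to every element.
    std::printf("%g %g %g %g\n", res[0], res[1], res[2], res[3]);
}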

OpenMP reduction on container elements

I'll answer your question in three parts:

1. What is the best way to perform OpenMP reductions in your example above with a std::vector?

i) Use your approach, i.e. create a pointer int* val { &vec[0] }; (a sketch of this and of option (ii) follows this list).

ii) Declare a new shared variable like @1201ProgramAlarm answered.

iii) Declare a user-defined reduction (which is not really applicable in your simple case, but see 3. below for a more efficient pattern).
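
For completeness, a hedged sketch of what (i) and (ii) might look like; vec and n are assumed from the question, and the array-section form in (i) is my interpretation of the pointer approach (it needs OpenMP 4.5):

#include <vector>

// (i) Reduce through a plain pointer to the element, via a length-1 array
//     section (one reading of "create a pointer", not necessarily the exact
//     code from the question).
void option_i(std::vector<int>& vec, int n) {
    int* val { &vec[0] };
    #pragma omp parallel for reduction(+ : val[0:1])
    for (int j = 0; j < n; ++j)
        val[0] += j;
}

// (ii) Reduce into a separate shared scalar, then fold it back once.
void option_ii(std::vector<int>& vec, int n) {
    int sum = 0;
    #pragma omp parallel for reduction(+ : sum)
    for (int j = 0; j < n; ++j)
        sum += j;
    vec[0] += sum;
}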

2. Why doesn't the third loop work and why does it work with Eigen ?

As the previous answer states, you are telling OpenMP to perform a reduction sum on memory address X while actually performing additions on memory address Y. The reduction declaration is therefore effectively ignored, and your additions are subject to the usual thread race conditions.

You don't really provide much detail into your Eigen venture, but here are some possible explanations:

i) You're not really using multiple threads (check n = Eigen::nbThreads())

ii) You didn't disable Eigen's own parallelism, which can interfere with your own use of OpenMP, e.g. via the EIGEN_DONT_PARALLELIZE preprocessor definition (a sketch for checking (i) and (ii) follows this list).

iii) The race condition is there, but you're not seeing it: Eigen operations take longer, you're using a low number of threads and writing only a small number of values, so threads interfere with each other less often and the wrong result shows up more rarely.
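
If you want to check points (i) and (ii), a minimal sketch (assuming Eigen 3 and an OpenMP-enabled compiler):

#define EIGEN_DONT_PARALLELIZE   // (ii) define before including Eigen so it
                                 //      does not spawn its own threads
#include <Eigen/Dense>
#include <iostream>
#include <omp.h>

int main() {
    // (i) confirm how many threads each runtime will actually use
    std::cout << "Eigen threads:  " << Eigen::nbThreads()    << '\n'
              << "OpenMP threads: " << omp_get_max_threads() << '\n';
}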

3. How should I parallelize this scenario using OpenMP (technically not a question you asked explicitly)?

Instead of parallelizing only the inner loop, you should parallelize both at the same time. The less serial code you have, the better. In this scenario each thread has its own private copy of the vec vector, which gets reduced after all the elements have been summed by their respective thread. This solution is optimal for your presented example, but might run into RAM problems if you're using a very large vector and very many threads (or have very limited RAM).

#pragma omp parallel for collapse(2) reduction(vsum : vec)
for (unsigned int i {0}; i < vec.size(); ++i) {
    for (int j = 0; j < n; ++j) {
        vec[i] += j;
    }
}

where vsum is a user defined reduction, i.e.

#pragma omp declare reduction(vsum : std::vector<int> : \
    std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<int>())) \
    initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))

Declare the reduction before the function where you use it, and you'll be good to go.
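
Putting both pieces together, a self-contained sketch; the vector size and inner trip count are made-up values, and the vsum declaration is the one above, repeated so the example compiles on its own:

#include <algorithm>
#include <cstdio>
#include <vector>

// Element-wise sum of std::vector<int>, with each private copy initialized to
// a zero-filled vector of the right size (same declaration as above).
#pragma omp declare reduction(vsum : std::vector<int> : \
    std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<int>())) \
    initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))

int main() {
    const int n = 1000;             // made-up inner trip count
    std::vector<int> vec(8, 0);     // made-up size

    #pragma omp parallel for collapse(2) reduction(vsum : vec)
    for (unsigned int i = 0; i < vec.size(); ++i)
        for (int j = 0; j < n; ++j)
            vec[i] += j;

    std::printf("vec[0] = %d (expected %d)\n", vec[0], n * (n - 1) / 2);
}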

How to use OpenMP reduction with a two-dimensional vector

I can't tell you much about the error, but I can tell you that what you're doing is not what you should be doing:

Using the bounds-checking A.at(i).at(j) instead of the raw A[i][j] is a bad idea here: you know the lengths stay constant on entry to this loop, so check them once and then use the faster []. It is especially bad because with at() OpenMP potentially has to take precautions that would otherwise be unnecessary (exceptions in multi-threaded code are always a bit ... special), and you get concurrent read access to the vector lengths all the time instead of independent access to the raw memory. Most importantly, however:

the inner loop changes the l-th row. This means the order in which the rows are processed matters, so you cannot simply accept parallelization: if OpenMP were allowed to fully parallelize the operations on the rows, this would invariably lead to situations where one thread accesses memory that another thread is still reading.

I don't see a good solution for this loop. Maybe you want to make two loops out of it: One that just adds the frq values, and one that transposes your matrix. Or, write to a separate matrix instead of doing this in-place in A (whether that works is a question of memory sizes).

My general recommendation is that you never actually transpose matrices (unless you really need to ensure a memory access pattern); that's just a waste of CPU. Whatever uses the transposed matrix could also swap its row and column indices at no extra computational cost and you get the transposition "for free".
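
As a tiny, hedged illustration of the index-swap idea (A is assumed to be the std::vector<std::vector<float>> from the question):

#include <cstddef>
#include <vector>

// Reading element (i, j) of the "transposed" matrix is just reading A[j][i]:
// no copy, no extra work.
float transposed_at(const std::vector<std::vector<float>>& A,
                    std::size_t i, std::size_t j) {
    return A[j][i];
}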

Another recommendation: there are good C++ libraries for linear algebra. Use them. You get well-parallelized (or at least parallelizable) functions and, often, things like views on data, where transposing changes nothing but the way addressing works (i.e., it is basically free).
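
For example, with Eigen (used here purely as an illustration of such a library), a transposed view costs nothing:

#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::MatrixXf A = Eigen::MatrixXf::Random(3, 4);
    // A.transpose() is an expression (a view): no data is copied, indexing is
    // simply remapped, so reading the "transposed" matrix is essentially free.
    std::cout << A.transpose()(1, 2) << " == " << A(2, 1) << '\n';
}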

std::vector<std::vector<float>> implies two memory indirections per element access, which is comparatively expensive. Your CPU needs much less time to multiply two floating-point numbers than to fetch an address from memory that's not already in cache; hence, you really want your complete matrix to be in one contiguous piece of memory, not scattered around. If you have a use case for OpenMP, you're investing a lot of effort into making your code run faster. Instead of doing the hard thing (parallelizing) first, you should start with a better way of representing matrices than std::vector<std::vector<float>> accessed with .at(i).at(j).
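
A hedged sketch of the contiguous representation; Matrix and its members are made-up names, not from the question:

#include <cstddef>
#include <vector>

// Row-major matrix stored in one contiguous buffer: one indirection per
// element access instead of two, and far friendlier to the cache.
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;   // rows * cols elements

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}

    float&       operator()(std::size_t i, std::size_t j)       { return data[i * cols + j]; }
    const float& operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};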

Intel compiler (C++) issue with OpenMP reduction on std::vector

This appears to be a bug in the Intel compiler; I can reliably reproduce it with a C example that does not involve vectors:

#include <stdio.h>

void my_sum_fun(int* outp, int* inp) {
    printf("%d @ %p += %d @ %p\n", *outp, (void*)outp, *inp, (void*)inp);
    *outp = *outp + *inp;
}

int my_init(int* orig) {
    printf("orig: %d @ %p\n", *orig, (void*)orig);
    return *orig;
}

#pragma omp declare reduction(my_sum : int : my_sum_fun(&omp_out, &omp_in)) \
    initializer(omp_priv = my_init(&omp_orig))

int main()
{
    int s = 0;
    #pragma omp parallel for reduction(my_sum : s)
    for (int i = 0; i < 2; i++)
        s += 1;

    printf("sum: %d\n", s);
}

Output:

orig: 0 @ 0x7ffee43ccc80
0 @ 0x7ffee43ccc80 += 1 @ 0x7ffee43cc780
orig: 1 @ 0x7ffee43ccc80
1 @ 0x7ffee43ccc80 += 2 @ 0x2b56d095ca80
sum: 3

It applies the reduction operation to the original variable before initializing the private copy from the original value: in the trace above, the second private copy is initialized to 1, a value that already contains the first contribution, instead of 0. This leads to the wrong result, 3 instead of the expected 2.

You can manually add a barrier as a workaround:

#pragma omp parallel reduction(vec_double_plus : w)
{
    #pragma omp for
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < w.size(); ++j)
            w[j] += 1;
    #pragma omp barrier
}
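
For completeness, the workaround above assumes a vec_double_plus reduction for std::vector<double>, declared along the same lines as the float version at the top of this page (a sketch, not necessarily the exact declaration from the original question):

#pragma omp declare reduction(vec_double_plus : std::vector<double> : \
    std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<double>())) \
    initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))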

OpenMP for loop with std::vector and scalar variable with reduction

After three days I have an answer myself, along with an explanation for what I found.

I have created my own reduction function like this:

#pragma omp declare reduction(* : scalar : omp_out *= omp_in) initializer (omp_priv (1))

The trick was omp_priv: apparently the initialization of the reduction's private value is important, something I learned here.

I made the code much simpler by applying OpenMP for loops like this:

#pragma omp parallel for reduction (* : like)

Very simple and clean. This way the code gets parallelized and runs faster than what I had in the question body. Unfortunately, it is still slower than the serial version. Maybe that is because of the std::vector usage, or because the overloaded arithmetic operations are slow; I do not know. The code is too large to copy-paste here in a way that would be understandable and not a pain in the neck for others to read.
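
Since the original code is too large to post, here is a minimal, hedged reconstruction of the pattern; the scalar type, its operator*=, and the assignment-style initializer (omp_priv = scalar(1)) are my assumptions, and only the reduction mechanics follow the answer:

#include <cstdio>

// Minimal stand-in for the user-defined type: it only needs operator*= for
// the combiner and a constructor for the multiplicative identity 1.
struct scalar {
    double v;
    explicit scalar(double x = 0.0) : v(x) {}
    scalar& operator*=(const scalar& o) { v *= o.v; return *this; }
};

// The combiner multiplies private copies together; each private copy starts
// at the multiplicative identity 1, not at a default-constructed 0.
#pragma omp declare reduction(* : scalar : omp_out *= omp_in) initializer(omp_priv = scalar(1))

int main() {
    scalar like(1.0);

    #pragma omp parallel for reduction(* : like)
    for (int i = 0; i < 10; ++i)
        like *= scalar(1.1);

    std::printf("like = %f\n", like.v);   // 1.1^10, regardless of thread count
}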

C++ OpenMP Parallel For Loop - Alternatives to std::vector

The question you link to was about the claim that the "STL vector container is not thread-safe in the situation where multiple threads write to a single container". This is only true, as correctly stated there, if you call methods that can cause reallocation of the underlying array that the std::vector holds. push_back(), pop_back() and insert() are examples of such dangerous methods.

If you need thread-safe growth, then the Intel Threading Building Blocks (TBB) library offers you a concurrent vector container. You should not use tbb::concurrent_vector in single-threaded programs, because random element access is slower than in std::vector (both are O(1), but the concurrent container pays extra for its segmented, thread-safe storage). However, tbb::concurrent_vector can be grown from several threads at once, e.g. via push_back(), in a thread-safe way, even when its storage has to be expanded.
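
A minimal sketch of concurrent growth with tbb::concurrent_vector; the classic tbb/concurrent_vector.h header is assumed (newer oneTBB installs still provide it):

#include <tbb/concurrent_vector.h>
#include <cstdio>

int main() {
    tbb::concurrent_vector<int> v;

    // Concurrent push_back is safe even when the container has to grow.
    #pragma omp parallel for
    for (int i = 0; i < 1000; ++i)
        v.push_back(i);

    std::printf("size: %zu\n", v.size());   // always 1000, order unspecified
}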

EDIT 1: Slides 46 and 47 of the linked Intel presentation give an illustrative example of concurrent reallocation using tbb::concurrent_vector.

EDIT 2: By the way, if you start using Intel Threading Building Blocks (it is open source, works with most compilers, and is much better integrated with C++/C++11 features than OpenMP), then you don't need OpenMP to create a parallel_for. Here is a nice example of parallel_for using TBB.
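
Since the linked example is not reproduced here, a hedged sketch of the lambda form of tbb::parallel_for:

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <cstddef>
#include <vector>

int main() {
    std::vector<float> data(1 << 20, 1.0f);

    // Process [0, data.size()) in chunks; TBB chooses the chunking and threads.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, data.size()),
                      [&](const tbb::blocked_range<std::size_t>& r) {
                          for (std::size_t i = r.begin(); i != r.end(); ++i)
                              data[i] *= 2.0f;
                      });
}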


