C++ OpenMP Parallel For Loop - Alternatives to std::vector

The question you link was talking about the fact that the STL vector container is not thread-safe in the situation where multiple threads write to a single container. This is only true, as correctly stated there, if you call methods that can cause reallocation of the underlying array that std::vector holds. push_back(), pop_back() and insert() are examples of these dangerous methods.

If you need thread-safe reallocation, then the Intel Threading Building Blocks (TBB) library offers concurrent vector containers. You should not use tbb::concurrent_vector in single-threaded programs, because its random element access is slower than std::vector's (which is O(1)). However, concurrent_vector handles push_back() and grow_by() in a thread-safe way, even when reallocation happens.

EDIT 1: Slides 46 and 47 of the following Intel presentation give an illustrative example of concurrent reallocation using tbb::concurrent_vector.

EDIT 2: By the way, if you start using Intel Threading Building Blocks (it is open source, works with most compilers, and is much better integrated with C++/C++11 features than OpenMP), then you don't need OpenMP to create a parallel for loop. Here is a nice example of parallel_for using TBB.

No speed up with openmp for emplace_back a vector in a loop

The current code is ill-formed with respect to the OpenMP specification, since the parallelized code contains mostly-implicit dependencies (the shared generator and distribution). As a result, an OpenMP implementation is free to generate a fast but completely "wrong" program, or a slow "correct" one.

To get a correct implementation and a not-too-bad speedup using OpenMP, one solution is to replicate the generator/distribution in each worker (by moving the variable declarations into a #pragma omp parallel section) and to add critical sections (using #pragma omp critical) for the (sequential) emplace_back calls.

Due to possible false sharing and lock contention, the resulting parallel implementation may scale poorly. It is probably better to generate thread-private arrays and then merge the sub-arrays into a big shared one at the end, rather than using a naive critical section (note, however, that this is still not ideal, since the computation will likely be limited by the speed of the shared memory).

Please note that the result can differ from the sequential implementation when a specific seed needs to be used (here it is not a problem, since the seed is extracted from random_device).

Omp for loop initialize vector

What you are seeing here is the effect of a thread-unsafe function being called by multiple threads. Internally, push_back does something like the following in pseudocode:

if reallocation needed:
    reallocate
construct new object at &data[size]
++size

Now try to imagine different threads running the above code at the same time. What happens if both threads see the need to reallocate and attempt to do so at the same time? What if they both construct an object at &data[size] because they both reached that point before ++size? Note that even putting the increment on the same line as the construction won't work, because they are still separate non-atomic operations.

What you really want to do is create a loop of strictly thread-safe operations, like the following:

#include <iostream>
#include <vector>

int main() {
    std::vector<int> vec(301);

    #pragma omp parallel for
    for (int i = 0; i <= 300; i++) {
        vec[i] = i;
    }
    std::cout << vec.size() << std::endl;
}

In this case, every thread accesses vec[i] with a unique i. So no operations ever happen concurrently to the same objects. This is absolutely safe.

To answer your follow-up question: there is no way to push_back concurrently into a vector. You would have to synchronize your push_back calls, which would make them slower than the non-parallel version. Another solution is to populate thread-local containers and then merge them. But whenever the simple solution shown above is applicable, it will also be faster than the alternatives.

Using OpenMP to parallelize a loop over a vector of C++ objects

The speed of your program mainly depends on the speed of memory reads/writes (including cache utilization, etc.). Depending on the hardware, you may or may not observe a speed increase. For more details please read e.g. this.

On my laptop (i7-8550U, g++ -fopenmp -O3 -mavx2 saxpy.cpp) I got similar results, but on a Xeon server I got a significant speed improvement:

nthreads=1     
TIME: 13.0372
real 0m14.303s
user 0m13.206s
sys 0m1.096s

nthreads=4
TIME: 5.1537
real 0m5.921s
user 0m18.473s
sys 0m0.615s

nthreads=8
TIME: 3.43479
real 0m4.237s
user 0m27.337s
sys 0m0.608s

OpenMP - store results in vector

The "naive" way:
You can init several vectors (call omp_get_max_threads() to know the thread count inside the current parallel region), then call omp_get_thread_num() inside the parallel region to know the current thread ID, and let each thread write into its own vector.
Then, outside the parallel region, merge the vectors together. This can be worth it or not, depending on how "heavy" your processing is compared to the time required to merge the vectors.

If you know the maximum final size of the vector, you can reserve it before processing (so that push_back calls won't resize the vector and you gain processing time), then call push_back from inside a critical section (#pragma omp critical). But critical sections are horribly slow, so this is worth it only if the processing inside the loop is time-consuming. In your case the "processing" looks to be only the if-condition check, so it's probably not worth it.

Finally, it's a fairly well-known problem. You should read this for more detailed information:
C++ OpenMP Parallel For Loop - Alternatives to std::vector

Vector filling across OpenMP threads

I am going to assume that you need to use a vector and cannot use an array (otherwise your question is not very interesting). With t = omp_get_num_threads(), you fill the vectors in parallel and then merge them in log2(t) operations instead of t operations (as you are doing now), like this:

void reduce(std::vector<BasicType> *v1, std::vector<BasicType> *v2, int begin, int end) {
    if (end - begin == 1) return;
    int pivot = (begin + end) / 2;
    #pragma omp task
    reduce(v1, v2, begin, pivot);
    #pragma omp task
    reduce(v1, v2, pivot, end);
    #pragma omp taskwait
    v1[begin].insert(v1[begin].end(), v1[pivot].begin(), v1[pivot].end());
    v2[begin].insert(v2[begin].end(), v2[pivot].begin(), v2[pivot].end());
}

and

std::vector<BasicType> v1, v2;
std::vector<BasicType> *v1p, *v2p;
#pragma omp parallel
{
    #pragma omp single
    {
        v1p = new std::vector<BasicType>[omp_get_num_threads()];
        v2p = new std::vector<BasicType>[omp_get_num_threads()];
    }
    #pragma omp for
    for(...) {
        // Do some intensive stuff to compute val1 and val2
        // ...
        v1p[omp_get_thread_num()].push_back(val1);
        v2p[omp_get_thread_num()].push_back(val2);
    }
    #pragma omp single
    reduce(v1p, v2p, 0, omp_get_num_threads());
}
v1 = v1p[0], v2 = v2p[0];
delete[] v1p;
delete[] v2p;

For example, with 8 threads this will join the vectors for threads

(0,1) (2,3) (4,5) (6,7)
(0,2) (4,6)
(0,4)

For more information about filling vectors in parallel see this. For more information about merging threads in log2(t) operations see the answer to this question.

Parallel OpenMP loop with continue as a break alternative

After compilation, they are identical at least for the trivial case.

Without continue

With continue

If you compare the resulting assembly, there is no difference between the two. I have taken the liberty of adding a junk condition halting before 2000.

Parallel push back to a vector of vector

Yes, this is thread-safe, since each inner vector is modified by only one thread. You can omit the schedule(dynamic) clause and still be safe.

This becomes a bit clearer if you get rid of the inner loop using std::iota.

vector<vector<int>> global_vec(10);

#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < 10; i++)
{
    global_vec[i].resize(i * 5);
    std::iota(global_vec[i].begin(), global_vec[i].end(), 0);
}

PS: If your outer vector has a fixed size, consider using a std::array<vector<int>, 10> instead.


