Iteration Through std Containers in OpenMP

Loop parallelization over STL iterators only works since OpenMP 3.0, and only for random-access iterators (e.g. std::vector and std::deque). For other containers you should be able to do something like this:

#pragma omp parallel
{
    for (std::set<A>::const_iterator i = s.begin(); i != s.end(); ++i) {
        #pragma omp single nowait
        {
            operate(*i);
        }
    }
}

The overhead is quite big, though, because each thread iterates over the whole sequence but only executes operate on some of it. Your method using an int index is more efficient.

As an alternative, take a look at GCC's parallel-mode implementation of std::for_each.

EDIT: The STL Parallelism TS, which will most likely be part of C++17, might be a good option in the future for iterating over standard containers.

How do I parallelize a for loop through a C++ std::list using OpenMP?

If you decide to use OpenMP 3.0, you can use the task feature:

#pragma omp parallel
#pragma omp single
{
    for (auto it = l.begin(); it != l.end(); ++it)
        #pragma omp task firstprivate(it)
        it->process();
    #pragma omp taskwait
}

This will execute the loop in one thread, but delegate the processing of elements to others.

Without OpenMP 3.0, the easiest way would be writing all pointers to the elements in the list (or iterators) into a vector and iterating over that one. That way you wouldn't have to copy anything back and would avoid the overhead of copying the elements themselves, so it shouldn't have too much overhead:

std::vector<my_element*> elements; // my_element is whatever is in the list
for (auto it = list.begin(); it != list.end(); ++it)
    elements.push_back(&(*it));

#pragma omp parallel shared(elements)
{
    #pragma omp for
    for (size_t i = 0; i < elements.size(); ++i) // or use iterators in newer OpenMP
        elements[i]->process();
}

If you want to avoid copying even the pointers, you can always create a parallelized for loop by hand. You can either have the threads access interleaved elements of the list (as proposed by KennyTM) or split the range into roughly equal contiguous parts before iterating, and iterate over those. The latter seems preferable, since the threads avoid accessing list nodes currently processed by other threads (even if only the next pointer), which could lead to false sharing. This would look roughly like this:

#pragma omp parallel
{
    int thread_count = omp_get_num_threads();
    int thread_num = omp_get_thread_num();
    size_t chunk_size = list.size() / thread_count;
    auto begin = list.begin();
    std::advance(begin, thread_num * chunk_size);
    auto end = begin;
    if (thread_num == thread_count - 1) // last thread iterates the remaining sequence
        end = list.end();
    else
        std::advance(end, chunk_size);
    #pragma omp barrier
    for (auto it = begin; it != end; ++it)
        it->process();
}

The barrier is not strictly needed; however, if process mutates the processed element (meaning it is not a const method), there might be some sort of false sharing without it, as threads would iterate over a sequence that is already being mutated. This way will iterate 3*n times over the sequence (where n is the number of threads), so scaling might be less than optimal for a high number of threads.

To reduce the overhead you could put the generation of the ranges outside of the #pragma omp parallel; however, you will need to know how many threads will form the parallel section. So you'd probably have to manually set num_threads, or use omp_get_max_threads() and handle the case where the number of threads created is less than omp_get_max_threads() (which is only an upper bound). The latter case could be handled by possibly assigning each thread several chunks (using #pragma omp for should do that):

int max_threads = omp_get_max_threads();
std::vector<std::pair<std::list<...>::iterator, std::list<...>::iterator> > chunks;
chunks.reserve(max_threads);
size_t chunk_size = list.size() / max_threads;
auto cur_iter = list.begin();
for (int i = 0; i < max_threads - 1; ++i)
{
    auto last_iter = cur_iter;
    std::advance(cur_iter, chunk_size);
    chunks.push_back(std::make_pair(last_iter, cur_iter));
}
chunks.push_back(std::make_pair(cur_iter, list.end()));

#pragma omp parallel shared(chunks)
{
    #pragma omp for
    for (int i = 0; i < max_threads; ++i)
        for (auto it = chunks[i].first; it != chunks[i].second; ++it)
            it->process();
}

This will take only three iterations over the list (two, if you can get the size of the list without iterating). I think that is about the best you can do for non-random-access iterators without using tasks or iterating over some out-of-place data structure (like a vector of pointers).

use openmp in iterating over a map

It's likely your implementation of OpenMP is incompatible with STL iterators. While there have been some changes to the standard to make OpenMP more compatible with the STL, I think you'll find your implementation doesn't support such behaviour. Most OpenMP implementations I've encountered are at most version 2.5; Microsoft's C++ compiler supports 2.0. The only compiler I'm aware of that supports 3.0 is the Intel C++ compiler.

A few other points: you should use std::begin and std::end. Also, you either need to declare your loop variable as private, or let OpenMP figure that out by itself, like so:

#pragma omp parallel for
for (map<int, string>::iterator datIt = std::begin(dat);
     datIt != std::end(dat);
     ++datIt)
{
    // construct the distance matrix...
}

But without 3.0 support, this is beside the point.

loop for iterator openMP

Your example is very simple and hardly needs any performance optimization: it just copies memory (which could be done with std::copy instead).

The code you have written is correct, and there is hardly any other way to write it to gain performance. However, to keep the code cleaner, I try to keep loop iterators private to each thread (personal preference).

vector<float> v = {1, 2, 3, 4};
vector<float> d = {0, 0, 0, 0};
#pragma omp parallel for
for (auto it_v = v.begin(), it_d = d.begin(); it_v != v.end(); ++it_v, ++it_d)
{
    *it_d = *it_v;
}

EDIT

OpenMP 3.1 doesn't allow multiple loop-variable initializations like the above; it doesn't conform to the canonical loop form in the specification. One way to work around this is as follows:

#pragma omp parallel
{
    auto it_d = d.begin();
    #pragma omp for
    for (auto it_v = v.begin(); it_v != v.end(); ++it_v)
    {
        // derive the destination from the source position, since only
        // the loop variable it_v is managed by the worksharing construct
        *(it_d + (it_v - v.begin())) = *it_v;
    }
}

openMp : parallelize std::map iteration

Your method will work, since you access and advance the shared iterator inside a critical section. Whether or not this is good for performance you will have to test. Here is an alternative method you may want to consider; let me call it the "fast-forward" method.

Let's assume you want to do this in parallel

for (auto element = myMap.begin(); element != myMap.end(); ++element) {
    foo(element->first, element->second);
}

You can do this with OpenMP 2.0:

#pragma omp parallel
{
    size_t cnt = 0;
    int ithread = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    for (auto element = myMap.begin(); element != myMap.end(); ++element, cnt++) {
        if (cnt % nthreads != ithread) continue;
        foo(element->first, element->second);
    }
}

Every thread runs through myMap.size() iterators. However, each thread only calls foo myMap.size()/nthreads times. Your method only runs through myMap.size()/nthreads iterators, but it requires a critical section every iteration.

The fast-forward method is efficient as long as the time to "fast-forward" through nthreads iterators is much less than the time for foo, i.e.:

nthreads*time(++elements) << time(foo)

If, however, the time for foo is on the order of the time to iterate, and foo reads/writes memory, then foo is likely memory-bandwidth bound and won't scale with the number of threads anyway.


