How to Parallelize a for Loop Through a C++ std::list Using OpenMP

How do I parallelize a for loop through a C++ std::list using OpenMP?

If you decide to use OpenMP 3.0, you can use the task feature:

#pragma omp parallel
#pragma omp single
{
    for (auto it = l.begin(); it != l.end(); ++it)
    {
        #pragma omp task firstprivate(it)
        it->process();
    }
    #pragma omp taskwait
}

This will execute the loop in one thread, but delegate the processing of elements to others.

Without OpenMP 3.0 the easiest way would be writing pointers to all elements of the list (or iterators) into a vector and iterating over that. This way you wouldn't have to copy anything back and you avoid the overhead of copying the elements themselves, so it shouldn't have too much overhead:

std::vector<my_element*> elements; // my_element is whatever is in the list
for (auto it = list.begin(); it != list.end(); ++it)
    elements.push_back(&(*it));

#pragma omp parallel shared(elements)
{
    #pragma omp for
    for (size_t i = 0; i < elements.size(); ++i) // or use iterators in newer OpenMP
        elements[i]->process();
}

If you want to avoid copying even the pointers, you can always create a parallelized for loop by hand. You can either have the threads access interleaved elements of the list (as proposed by KennyTM) or split the range into roughly equal contiguous parts before iterating and iterate over those. The latter seems preferable, since the threads avoid accessing list nodes currently processed by other threads (even if only the next pointer), which could lead to false sharing. This would look roughly like this:

#pragma omp parallel
{
    int thread_count = omp_get_num_threads();
    int thread_num = omp_get_thread_num();
    size_t chunk_size = list.size() / thread_count;
    auto begin = list.begin();
    std::advance(begin, thread_num * chunk_size);
    auto end = begin;
    if (thread_num == thread_count - 1) // last thread iterates the remaining sequence
        end = list.end();
    else
        std::advance(end, chunk_size);
    #pragma omp barrier
    for (auto it = begin; it != end; ++it)
        it->process();
}

The barrier is not strictly needed; however, if process mutates the processed element (meaning it is not a const method), there might be some sort of false sharing without it, since threads would iterate over a sequence which is already being mutated. This way will iterate 3*n times over the sequence (where n is the number of threads), so scaling might be less than optimal for a high number of threads.

To reduce the overhead you could put the generation of the ranges outside of the #pragma omp parallel; however, you will need to know how many threads will form the parallel section. So you'd probably have to manually set num_threads, or use omp_get_max_threads() and handle the case that the number of threads created is less than omp_get_max_threads() (which is only an upper bound). That last case could be handled by assigning each thread several chunks (using #pragma omp for should do that):

int max_threads = omp_get_max_threads();
std::vector<std::pair<std::list<...>::iterator, std::list<...>::iterator> > chunks;
chunks.reserve(max_threads);
size_t chunk_size = list.size() / max_threads;
auto cur_iter = list.begin();
for (int i = 0; i < max_threads - 1; ++i)
{
    auto last_iter = cur_iter;
    std::advance(cur_iter, chunk_size);
    chunks.push_back(std::make_pair(last_iter, cur_iter));
}
chunks.push_back(std::make_pair(cur_iter, list.end()));

#pragma omp parallel shared(chunks)
{
    #pragma omp for
    for (int i = 0; i < max_threads; ++i)
        for (auto it = chunks[i].first; it != chunks[i].second; ++it)
            it->process();
}

This will take only three iterations over the list (two, if you can get the size of the list without iterating). I think that is about the best you can do for non-random-access iterators without using tasks or iterating over some out-of-place data structure (like a vector of pointers).

With OpenMP in C, how can I parallelize a for loop that contains a nested comparison function for qsort?

I realize this has been self-answered, but here are some standard C and OpenMP options. The qsort_r function is a good classic choice, but it's worth noting that qsort_s is part of the C11 standard, and thus is portable wherever C11 is offered (which does not include Windows; they don't quite offer C99 yet).
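For reference, here is a minimal sketch of the qsort_r route. It assumes the glibc prototype, where the comparator receives the context pointer as its third argument (BSD/macOS use a different argument order, which is exactly the portability issue mentioned above); the names simcmp_r and the ascending sort order are just illustrative:

#include <stdlib.h>   // glibc declares qsort_r here when compiling with GNU extensions
#include <stdio.h>

// Comparator with an explicit context argument instead of a global variable.
static int simcmp_r(const void *a, const void *b, void *ctx)
{
    const double *values = (const double *)ctx;
    double aa = values[*(const int *)a];
    double bb = values[*(const int *)b];
    return (aa > bb) - (aa < bb);   // -1, 0 or 1 without relying on globals
}

int main(void)
{
    int index[10];
    double tmp_array[10];
    for (int j = 0; j < 10; j++) {
        tmp_array[j] = rand();
        index[j] = j;
    }
    // Sort the index array by the values in tmp_array, passed as context.
    qsort_r(index, 10, sizeof(*index), simcmp_r, tmp_array);
    for (int j = 0; j < 10; j++)
        printf("%d ", index[j]);
    printf("\n");
    return 0;
}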

As to doing it in OpenMP without the nested comparison function, still using original qsort, there are two ways that come to mind. First is to use the classic global variable in combination with OpenMP threadprivate:

#include <stdlib.h>

static int *index = NULL;
static double *tmp_array = NULL;

#pragma omp threadprivate(index, tmp_array)

int simcmp(const void *a, const void *b){
    int ia = *(const int *)a;
    int ib = *(const int *)b;
    double aa = tmp_array[ia];
    double bb = tmp_array[ib];
    if ((aa - bb) > 1e-12){
        return -1;
    }else{
        return 1;
    }
}

int main(){
    int i;
    #pragma omp parallel for
    for(i = 0; i < 100; i++){
        index = (int *) malloc(sizeof(int)*10);
        tmp_array = (double*) malloc(sizeof(double)*10);
        int j;
        for(j = 0; j < 10; j++){
            tmp_array[j] = rand();
            index[j] = j;
        }
        // QuickSort the index array based on tmp_array:
        qsort(index, 10, sizeof(*index), simcmp);
        free(index);
        free(tmp_array);
    }
    return 0;
}

The version above causes every thread in the parallel region to use a private copy of the global variables index and tmp_array, which takes care of the issue. This is probably the most portable version you can write in standard C and OpenMP, with the only likely incompatible platforms being those that do not implement thread local memory (some microcontrollers, etc.).

If you want to avoid the global variable and still have portability and use OpenMP, then I would recommend using C++11 and the std::sort algorithm with a lambda:

std::sort(index, index + 10, [=](const int& a, const int& b){
    // std::sort expects a bool "a comes before b" predicate, not a -1/0/1 result
    return (tmp_array[a] - tmp_array[b]) > 1e-12;
});

How can I execute two for loops in parallel in C++?

As already mentioned in the comments, OpenMP may not be the best solution for this, but if you wish to do it with OpenMP, I suggest the following:

Use sections to start 2 threads, and communicate between the threads by using shared variables. The important thing is to use atomic operations to read (#pragma omp atomic read seq_cst) and to write (#pragma omp atomic write seq_cst) these variables. Here is an example:

#pragma omp parallel num_threads(2)
#pragma omp sections
{
    #pragma omp section
    {
        // This is the sensor controlling part

        while(exit_condition)
        {
            sensor_state = read_sensor();

            // Read the current state of the motor from the other thread
            #pragma omp atomic read seq_cst
            motor_state = shared_motor_state;

            // Based on the motor_state and sensor_state send
            // a command to the other thread to control the motor
            // or wait for the motor to be ready in a loop, etc.

            #pragma omp atomic write seq_cst
            shared_motor_command = /* whatever you wish */;
        }
    }

    #pragma omp section
    {
        // This is the motor controlling part
        while(exit_condition)
        {
            // Read the motor command from the other thread
            #pragma omp atomic read seq_cst
            motor_command = shared_motor_command;

            // Do whatever you have to based on motor_command and
            // you can set the state of the motor with the following line

            #pragma omp atomic write seq_cst
            shared_motor_state = /* what you need to pass to the other thread */;
        }
    }
}

Can I iterate over a C++11 std::tuple with OpenMP?

The C++11 template syntax is highly alien to me, but recursive problems such as this one are best made parallel using explicit OpenMP tasks:

template<std::size_t I = 0, typename FuncT, typename... Tp>
inline typename std::enable_if<I < sizeof...(Tp), void>::type
for_each(std::tuple<Tp...>& t, FuncT& f)
{
    #pragma omp task firstprivate(I) shared(t,f)
    {
        f(std::get<I>(t));
    }
    for_each<I + 1, FuncT, Tp...>(t, f);
}

...

// Proper usage
#pragma omp parallel
{
    #pragma omp single
    for_each(...);
}

The important part is to have the top-level call to for_each in a single construct inside a parallel region. Thus only a single thread will call for_each, which in turn will result in f(std::get<I>(t)); being queued for later execution as an explicit task. The other threads, while waiting at the implicit barrier at the end of the single construct, will start pulling tasks from the task queue and execute them in parallel until the queue is empty. The data-sharing attributes of all variables used by the task are given explicitly for clarity.

The objects that t and f reference should be shared and the references themselves (basically the pointers that implement the references) should be firstprivate. On the other hand, the OpenMP standard prohibits reference types from being firstprivate, and different compiler vendors tend to implement the standard differently. The Intel C++ compiler accepts the following code and gives the correct results inside the task, but the referenced variable is privatised (which is wrong):

#include <iostream>
using std::cout; using std::endl;

void f(int& p)
{
    #pragma omp task
    {
        cout << "p = " << p << endl;
        p = 3;
        cout << "p' = " << p << endl;
    }
}

void f1()
{
    int i = 5;

    #pragma omp parallel
    {
        #pragma omp single
        f(i);
    }
    cout << "i = " << i << endl;
}

PGI's compiler gives the correct result and does not privatise i. On the other hand, GCC correctly determines that p should be firstprivate but then runs into the prohibition in the standard and gives a compile-time error.

If one modifies the task to read:

#pragma omp task shared(p)
{
...
}

it works correctly with GCC, but the task prints the wrong initial value of p and then causes a segmentation fault with both the Intel and PGI C++ compilers.

Go figure!

Parallelize inner loops

omp will not parallelize the c and d loops meaning each thread will execute a c loop in its entirety.

This is correct.

some of the threads might get stuck with a lot more work than other threads

You are right: the work imbalance between threads is a performance issue in the first code. A schedule(dynamic) clause helps a bit to fix this, but there is not much more you can do on this version.

I don't know enough to know if this helpful at all. What I saw indicates this might be parallelizing the c and d loops but leaving the a and b loops serial?

Technically, the a and b loops are executed in parallel too (since they are in a parallel section), but all the threads will completely execute all the iterations in lockstep (because the omp parallel for contains an implicit synchronization). You should not use a second omp parallel: depending on the runtime, this can create new threads 100 times, and even when no new threads are created, it results in inefficient code (for example because of bad default thread pinning). Moreover, schedule(guided) is not needed here and should be less efficient than schedule(static). Thus, use omp for collapse(2) schedule(static).
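For illustration, here is a minimal sketch of that advice, assuming the loop structure described in the question (10x10 a/b loops, inner bounds f(a) and f(b), contribution comp(c, d)); the names are placeholders, and the refined version further below removes the atomic update by accumulating into a per-thread local sum:

#pragma omp parallel
for (int a = 0; a < 10; a++)
{
    for (int b = 0; b < 10; b++)
    {
        // One work-sharing loop inside a single parallel region,
        // instead of opening a new parallel region 100 times.
        #pragma omp for collapse(2) schedule(static)
        for (int c = 0; c < f(a); c++)
            for (int d = 0; d < f(b); d++)
            {
                #pragma omp atomic
                result[a][b] += comp(c, d);
            }
    }
}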

how can I parallelize the inner loops to share the work better.

The last code is not so bad in terms of work balancing, although it introduces some unwanted overheads:

  • The implicit synchronization of the omp for can be skipped using nowait since all threads are working on thread-private data.
  • The access to result_local[a][b] can be replaced by a fast thread-private variable access.
  • The conditional increment can be replaced by a branch-less boolean increment.
  • f(a) and f(b) can be pre-computed, although optimizing compilers should already do this.
  • When f(a) * f(b) is very small, it may be better not to execute the loop in parallel (because of the expensive cost of communicating between cores). However, this is highly dependent on whether cond is expensive or not.
  • When f(a) is big, there is no need to use a costly collapse(2), as there will be enough work for all threads (collapse(2) usually slows down the execution, since compilers often generate a slow modulus instruction to find the value of the loop iterators at runtime).

Here is the resulting code taking into account most of these fixes:

#pragma omp parallel
{
    // Create a result_local array to avoid critical sections in the loop

    // Arbitrary threshold (this may not be optimal)
    const int threshold = 4 * omp_get_num_threads();

    for (int a = 0; a < 10; a++)
    {
        const int c_lim = f(a);

        for (int b = 0; b < 10; b++)
        {
            const int d_lim = f(b);
            int64_t local_sum = 0;

            if (c_lim < threshold)
            {
                #pragma omp for collapse(2) schedule(static) nowait
                for (int c = 0; c < c_lim; c++)
                    for (int d = 0; d < d_lim; d++)
                        local_sum += comp(c, d);
            }
            else
            {
                #pragma omp for schedule(static) nowait
                for (int c = 0; c < c_lim; c++)
                    for (int d = 0; d < d_lim; d++)
                        local_sum += comp(c, d);
            }

            result_local[a][b] += local_sum;
        }
    }

    // Add the result_local to result
}

Another more efficient strategy is to redesign the sequential algorithm to significantly reduce the amount of work.



Redesigning the algorithm

One can note that comp(c, d) is recomputed with the same value several times (up to 100 times), and the same goes for result_local[a][b]++ or even f(b) (up to 1,000,000 times). In such cases, the generic solution is to memoize the results so that the expensive parts of the algorithm are not recomputed over and over.

Note that you cannot pre-compute all the needed comp(a, b) values: this solution would be too expensive in terms of memory usage (up to 10 GiB needed). Thus, the trick is to split the 2D space into tiles. Here is how the algorithm works:

  • compute all the f(a) and f(b) values sequentially (100 values);
  • split the iteration space into tiles of reasonable size (e.g. 100x100) and pre-compute all the required tiles that should be completely computed (possibly in parallel, although this is tedious);
  • compute the sum of all comp(a, b) for each tile (i.e. for a in [a_tile_begin, a_tile_end) and b in [b_tile_begin, b_tile_end)) in parallel (each thread should work on several tiles) and write the sums to a shared array;
  • compute the final result using the tile sums (partial tiles are computed on the fly in this last step), in parallel.

This algorithm is definitely much more complex, but it should be up to 100 times faster than the above one since most operations are computed only once.
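As a rough illustration, here is a minimal sketch of the tiling idea. It assumes the loop structure described above (10x10 a/b loops, inner bounds f(a) and f(b), contribution comp(c, d)); f, comp, N, TILE and the result layout are placeholders, not the original code:

// Hypothetical sketch of the tiling/memoization strategy described above.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr int N    = 10;   // range of a and b (from the question)
constexpr int TILE = 100;  // tile edge length (tunable)

int     f(int x)           { return 1000 + 137 * x; }  // placeholder inner-loop bound
int64_t comp(int c, int d) { return (c ^ d) & 1; }     // placeholder contribution

int main()
{
    // 1) Pre-compute all f(a) values sequentially (f(b) uses the same table).
    int fv[N];
    int maxF = 0;
    for (int a = 0; a < N; a++) { fv[a] = f(a); maxF = std::max(maxF, fv[a]); }

    // 2) Pre-compute the sum of comp over every full TILE x TILE tile, in parallel.
    //    Each comp(c, d) is now evaluated once instead of up to N*N times.
    const int nTiles = maxF / TILE;
    std::vector<int64_t> tileSum((size_t)nTiles * nTiles, 0);

    #pragma omp parallel for collapse(2) schedule(static)
    for (int ti = 0; ti < nTiles; ti++)
        for (int tj = 0; tj < nTiles; tj++)
        {
            int64_t s = 0;
            for (int c = ti * TILE; c < (ti + 1) * TILE; c++)
                for (int d = tj * TILE; d < (tj + 1) * TILE; d++)
                    s += comp(c, d);
            tileSum[(size_t)ti * nTiles + tj] = s;
        }

    // 3) Assemble each result[a][b] from full tiles plus the two partial stripes
    //    (the partial tiles are computed on the fly, as described above).
    int64_t result[N][N];

    #pragma omp parallel for collapse(2) schedule(static)
    for (int a = 0; a < N; a++)
        for (int b = 0; b < N; b++)
        {
            const int cLim = fv[a], dLim = fv[b];
            const int fullC = cLim / TILE, fullD = dLim / TILE;
            int64_t sum = 0;

            for (int ti = 0; ti < fullC; ti++)
                for (int tj = 0; tj < fullD; tj++)
                    sum += tileSum[(size_t)ti * nTiles + tj];

            // Right stripe: c in [fullC*TILE, cLim), all d in [0, dLim)
            for (int c = fullC * TILE; c < cLim; c++)
                for (int d = 0; d < dLim; d++)
                    sum += comp(c, d);

            // Top stripe: c in [0, fullC*TILE), d in [fullD*TILE, dLim)
            for (int c = 0; c < fullC * TILE; c++)
                for (int d = fullD * TILE; d < dLim; d++)
                    sum += comp(c, d);

            result[a][b] = sum;
        }

    printf("result[0][0] = %lld\n", (long long)result[0][0]);
    return 0;
}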

How to parallelize an array shift with OpenMP?

This is an example of a loop with loop-carried dependencies, and so it can't be easily parallelized as written, because the tasks (each iteration of the loop) aren't independent. Breaking the dependency can vary from a trivial modification to the completely impossible (e.g., an iteration loop).

Here, the case is somewhat in between. The issue with doing this in parallel is that you need to find out what your rightmost value is going to be before your neighbour changes it. The OMP for construct doesn't expose to you which loop iterations will be "yours", so I don't think you can use the OpenMP for work-sharing construct to break up the loop. However, you can do it yourself; it requires a lot more code, and it won't nicely reduce to the serial case any more.

But still, an example of how to do this is shown below. You have to break the loop up yourself, and then get your rightmost value. An OpenMP barrier ensures that no one starts modifying values until all the threads have cached their new rightmost value.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv) {
    int i;
    char *array;
    const int n = 27;

    array = malloc(n * sizeof(char));
    for (i = 0; i < n-1; i++)
        array[i] = 'A' + i;

    array[n-1] = '\0';

    printf("Array pre-shift = <%s>\n", array);

    #pragma omp parallel default(none) shared(array) private(i)
    {
        int nthreads = omp_get_num_threads();
        int tid = omp_get_thread_num();

        int blocksize = (n-2)/nthreads;
        int start = tid*blocksize;
        int end = start + blocksize - 1;
        if (tid == nthreads-1) end = n-2;

        /* we are responsible for values start...end */

        char rightval = array[end+1];
        #pragma omp barrier

        for (i = start; i < end; i++)
            array[i] = array[i+1];

        array[end] = rightval;
    }
    printf("Array post-shift = <%s>\n", array);

    return 0;
}

OpenMP on nested for() loops in creating 3D Grid

Your first code variant, with parallelism on the inner loops, works slowly because the overhead of managing threads and merging vectors takes too much time compared to the actual code being run. Another reason for the slowdown is the extensive use of shared variables: while OpenMP synchronizes access to them, that synchronization takes a lot of time. Generally, you should try to minimize the use of shared variables. I believe good style would be specifying default(none) for your parallel section and explicitly specifying the variables you want to make shared.

To improve the proportion between actual code and management code, you should make parallel regions as large as possible and use as little synchronization between threads as possible. So you should parallelize your outermost loop for the best effect.

In your second approach, however, you forgot about synchronization. You cannot insert into the same boxes vector from different threads. You could apply the synchronization mechanism which you used in your first approach in exactly the same manner.
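Here is a minimal sketch of what that could look like, assuming the setup from the question (a Box type, a shared boxes vector, and a triple-nested loop over grid coordinates); all names and dimensions are placeholders:

#include <cstdio>
#include <vector>

struct Box { int x, y, z; };   // placeholder for the question's box type

void buildGrid(int nx, int ny, int nz, std::vector<Box>& boxes)
{
    // Parallelize only the outermost loop; keep the sharing explicit.
    #pragma omp parallel default(none) shared(boxes) firstprivate(nx, ny, nz)
    {
        std::vector<Box> local;    // thread-private buffer, no locking needed

        #pragma omp for schedule(static) nowait
        for (int i = 0; i < nx; i++)
            for (int j = 0; j < ny; j++)
                for (int k = 0; k < nz; k++)
                    local.push_back(Box{i, j, k});

        // Merge into the shared vector once per thread, under synchronization.
        #pragma omp critical
        boxes.insert(boxes.end(), local.begin(), local.end());
    }
}

int main()
{
    std::vector<Box> boxes;
    buildGrid(50, 50, 50, boxes);
    printf("created %zu boxes\n", boxes.size());
    return 0;
}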

Nested OpenMP parallelisation in combination with std::thread

This produces undefined behavior in terms of the OpenMP standard. Most implementations I have tested will create 24 threads for each of those two parallel regions in your first example, for a total of 48. The second example should not create too many threads, but since it relies on undefined behavior it may do anything from crashing to turning your computer into a jelly-like substance without warning.

Since you're already using OpenMP, I would recommend making it standards compliant OpenMP by simply removing the std::thread, and using nested OpenMP parallel regions instead. You can do so like this:

#include <algorithm>
#include <omp.h>

void IterativeSolution();
int  GetImageFromApproximateSolution();
void RegisterImages(int a, int b);

int main()
{
    // The max is to avoid having less than 1 thread
    int numThreads = std::max(omp_get_max_threads() - 2, 1);
    #pragma omp parallel num_threads(2)
    {
        if(omp_get_thread_num() > 0){
            IterativeSolution();
        }else{
            #pragma omp parallel for num_threads(numThreads)
            for(int a = 0; a < 100; a++)
            {
                int b = GetImageFromApproximateSolution();
                RegisterImages(a, b);
                // Inform IterativeSolution about result of registration
            }
        }
    }
}

void IterativeSolution()
{
    #pragma omp parallel for num_threads(2)
    for(int i = 0; i < 2; i++)
    {
        //SolveColumn(i);
    }
}
void RegisterImages(int a, int b)
{
    // Do Registration
}

Chances are that you will need to add the environment variable definitions OMP_NESTED=true and OMP_MAX_ACTIVE_LEVELS=2, or more, to your environment to enable nested regions. This version has the advantage of being completely defined in OpenMP, and should work portably on any environment that supports nested parallel regions. If you have a version that does not support nested OpenMP parallel regions, then your suggested solution may be the best remaining option.
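If setting environment variables is inconvenient, the equivalent can usually be requested from code with the standard OpenMP runtime routines. A minimal sketch (omp_set_nested is deprecated since OpenMP 5.0 in favour of omp_set_max_active_levels, but both are widely available):

#include <omp.h>

int main()
{
    omp_set_max_active_levels(2);  // allow two nested levels of active parallelism
    omp_set_nested(1);             // older runtimes: explicitly enable nesting (deprecated in OpenMP 5.0)

    // ... the nested parallel regions from the example above go here ...
    return 0;
}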


