Simplest TBB Example

Simple way to execute two parallel tasks in TBB

For a small number of tasks, you can use tbb::task_group:

tbb::task_group g;

g.run([&] { task0(); });
g.run([&] { task1(); });

// do other work while tasks run ...

g.wait(); // wait for all tasks to finish

As a small optimization, you can run one of the tasks in the current thread:

tbb::task_group g;

g.run([&] { task1(); }); // spawn a parallel task

task0(); // run one task in current thread

g.wait(); // wait for parallel task(s) to finish
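
Putting the snippets above together, a minimal complete sketch (task0/task1 are just placeholders for your own work):

#include <tbb/task_group.h>
#include <cstdio>

// Placeholder tasks; substitute your own work here.
void task0() { std::puts("task0 running"); }
void task1() { std::puts("task1 running"); }

int main() {
    tbb::task_group g;
    g.run([&] { task1(); }); // spawn a parallel task
    task0();                 // run the other task on the current thread
    g.wait();                // wait for the spawned task to finish
    return 0;
}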

TBB parallel_for example in C++ without a lambda

parallel_for will take any functor, which can be a lambda, a functor class or a plain old function; the following should work just fine too:

#include "tbb/tbb.h"
using namespace tbb;
...
void print( size_t n) {
printf("hellow world %d\n", n);
}
void print_range( const blocked_range<size_t> & r ){
for( size_t i = r.begin(); i != r.end(); ++i )
printf("hello from range: %d\n", i);
}
void doit() {
parallel_for<size_t>( 1, 10, 1, print );
parallel_for( blocked_range<size_t>(1,10), print_range );
}

tbb::concurrent_hash_map<K,V>: sample code for Intel Threading Building Blocks (TBB)

Intel TBB is open source, and on GitHub:

https://github.com/intel/tbb

To install TBB, I used vcpkg, which is compatible with Linux, Windows, and Mac. Yes, vcpkg is from Microsoft, but it is 100% cross-platform, open source, and very popular.

Linux:

./vcpkg search tbb              # Find the package.
./vcpkg install tbb:x64-linux # Install the package.

Windows:

vcpkg search tbb                # Find the package.
vcpkg install tbb:x64-windows # Install the package.

Compile:

  • Compatible with any modern compiler, including MSVC, GCC, Clang/LLVM, the Intel Compiler (ICC), etc. I used CMake with GCC.

You can also download the source and extract the headers and libraries into your source tree; this works just as well.

Code.

#include "tbb/concurrent_hash_map.h" // For concurrent hash map.

tbb::concurrent_hash_map<int, string> dict;
typedef tbb::concurrent_hash_map<int, string>::accessor dictAccessor; // See notes on accessor below.

print(" - Insert key, method 1:\n");
dict.insert({1,"k1"});
print(" - 1: k1\n");

print(" - Insert key, method 2:\n");
dict.emplace(2,"k2");
print(" - 2: k2\n");

string result;

{
    print(" - Read an existing key:\n");
    dictAccessor accessor;
    const auto isFound = dict.find(accessor, 2);
    // The accessor functions as:
    // (a) a fine-grained per-key lock (released when it goes out of scope).
    // (b) a method to read the value.
    // (c) a method to insert or update the value.
    if (isFound) {
        print(" - {}: {}\n", accessor->first, accessor->second);
    }
}

{
    print(" - Atomically insert or update a key:\n");
    dictAccessor accessor;
    const auto itemIsNew = dict.insert(accessor, 4);
    // The accessor functions as:
    // (a) a fine-grained per-key lock (released when it goes out of scope).
    // (b) a method to read the value.
    // (c) a method to insert or update the value.
    if (itemIsNew) {
        print(" - Insert.\n");
        accessor->second = "k4";
    } else {
        print(" - Update.\n");
        accessor->second = accessor->second + "+update";
    }
    print(" - {}: {}\n", accessor->first, accessor->second);
}

{
    // The same insert-or-update block, run a second time: the key now
    // exists, so this time the update branch is taken.
    print(" - Atomically insert or update a key:\n");
    dictAccessor accessor;
    const auto itemIsNew = dict.insert(accessor, 4);
    if (itemIsNew) {
        print(" - Insert.\n");
        accessor->second = "k4";
    } else {
        print(" - Update.\n");
        accessor->second = accessor->second + "+update";
    }
    print(" - {}: {}\n", accessor->first, accessor->second);
}

{
    print(" - Read the final state of the key:\n");
    dictAccessor accessor;
    const auto isFound = dict.find(accessor, 4);
    if (isFound) {
        print(" - {}: {}\n", accessor->first, accessor->second);
    }
}

Printing uses {fmtlib}; you can replace it with cout << if you prefer.

Output:

- Insert key, method 1:
- 1: k1
- Insert key, method 2:
- 2: k2
- Read an existing key:
- 2: k2
- Atomically insert or update a key:
- Insert.
- 4: k4
- Atomically insert or update a key:
- Update.
- 4: k4+update
- Read the final state of the key:
- 4: k4+update

Other hash maps

  • See: https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html
  • See: std::unordered_map. This has a more standard API; concurrent reads are safe, but any writes need external synchronization (see: unordered_map thread safety, and the sketch after this list). I suggest using this, if possible, as it has a simpler API.
  • There is also the concurrent_unordered_map from Intel TBB. It is essentially the same thing, a key/value map. However, it is much older, much lower level, and more difficult to use. One has to supply a hasher, an equality operator, and an allocator. There is no sample code anywhere, even in the official Intel docs. I never got it working, despite months of occasional attempts. It may be obsolete, as it is not mentioned in said free book (which only covers concurrent_hash_map). Not recommended.
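
As a hedged illustration of the std::unordered_map point above: if you need concurrent writes, the usual approach is to add your own external lock, for example a std::shared_mutex (the class and method names here are illustrative, not from the original answer):

#include <shared_mutex>
#include <string>
#include <unordered_map>

// A minimal externally-locked map: many concurrent readers, one writer.
// Note that the lock covers the whole map rather than a single key.
class locked_map {
public:
    void set(int key, const std::string& value) {
        std::unique_lock<std::shared_mutex> lock(mutex_); // exclusive for writes
        map_[key] = value;
    }
    bool get(int key, std::string& out) const {
        std::shared_lock<std::shared_mutex> lock(mutex_); // shared for reads
        auto it = map_.find(key);
        if (it == map_.end()) return false;
        out = it->second;
        return true;
    }
private:
    mutable std::shared_mutex mutex_;
    std::unordered_map<int, std::string> map_;
};

Compared to concurrent_hash_map, this is simpler to reason about, but the lock is map-wide rather than per-key.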

Update: Reader/Writer Locks

There are actually two accessors, one is a read lock, one is a write lock:

  • const_accessor
  • accessor

If using find, use const_accessor which is a read lock. If using insert or erase, use accessor which is a write lock (i.e. it will wait until any reads are done, and block further reads until it is done).

This is effectively equivalent to a reader/writer lock, but on a single key in the dictionary, rather than on the entire dictionary.
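
For example, a read-only lookup with const_accessor on the dict from the sample above might look like this (a sketch, not part of the original code):

{
    tbb::concurrent_hash_map<int, string>::const_accessor reader;
    if (dict.find(reader, 2)) {
        // Read lock on key 2 only: other readers of this key proceed,
        // while writers to this key wait until 'reader' goes out of scope.
        print(" - {}: {}\n", reader->first, reader->second);
    }
} // read lock released here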

Update

Final part of the learning curve: for key writes, nothing is committed until the accessor goes out of scope, which is also when the per-key lock is released. If you keep the accessor's scope tight, any lock is held for no more than a handful of machine instructions, probably implemented with CAS (Compare And Swap).

Comparing this to a database, the scope of the accessor is like a transaction. When the accessor goes out of scope, the entire transaction is committed to the hashmap.

Update

The free book mentioned above has fantastic performance tips in the chapter on concurrent_hash_map.

Conclusion

The API for this hash map is powerful but somewhat awkward. However, it supports fine-grained, per-key locks on insert/update, and with tight accessor scopes those locks are only held for a handful of machine instructions, probably using CAS. This is something that few other hashmaps, in any language, can offer. I recommend starting with std::unordered_map for simplicity; concurrent reads are safe, but writes need external synchronization. If blazingly fast concurrent performance is required, there is the option to either refactor, or to write a compatible wrapper on top with [] accessors and insert_or_update().
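
A hedged sketch of what such a wrapper could look like (the class and method names are hypothetical, not an existing API):

#include <tbb/concurrent_hash_map.h>
#include <string>

template <typename K, typename V>
class concurrent_dict {
public:
    // Insert the key if missing, otherwise overwrite the value.
    // The per-key write lock is held only for the lifetime of the accessor.
    void insert_or_update(const K& key, const V& value) {
        typename tbb::concurrent_hash_map<K, V>::accessor a;
        map_.insert(a, key);
        a->second = value;
    }

    // Read a copy of the value; returns false if the key is absent.
    bool try_get(const K& key, V& out) const {
        typename tbb::concurrent_hash_map<K, V>::const_accessor a;
        if (!map_.find(a, key)) return false;
        out = a->second;
        return true;
    }

private:
    tbb::concurrent_hash_map<K, V> map_;
};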

TBB parallel_reduce - need better understanding of example from the Developer Reference

It will work when applied directly, because each instance of the body class is used by only one thread at a time.

But using this member variable directly might prevent the compiler from vectorizing the loop. I'd go one step further with this logic and cache r.end() in a local as well, because if it is not inlined properly it can also break vectorization. This is not directly related to TBB itself, just a general C++ optimization trick.
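
As a hedged sketch of that tip, using a plain summation (not the exact Developer Reference example), the body can cache the end index in a local before the loop:

#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <cstddef>
#include <functional>

float parallel_sum(const float* a, std::size_t n) {
    return tbb::parallel_reduce(
        tbb::blocked_range<std::size_t>(0, n),
        0.0f,
        [=](const tbb::blocked_range<std::size_t>& r, float running) {
            const std::size_t end = r.end(); // cache end() so the inner loop can vectorize
            for (std::size_t i = r.begin(); i != end; ++i)
                running += a[i];
            return running;
        },
        std::plus<float>());
}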

Very Basic for loop using TBB

First of all, for the following code I assume your arrays are of type mytype*; otherwise the code needs some modifications. Furthermore, I assume that your ranges don't overlap; otherwise parallelization attempts won't work correctly (at least not without more work).

Since you asked for it in tbb:

First you need to initialize the library somewhere (typically in your main). For the code, assume I put a using namespace tbb somewhere.

int main(int argc, char *argv[]){
task_scheduler_init init;
...
}

Then you will need a functor which captures your arrays and executes the body of the forloop:

struct apply_func {
    const mytype* songin; // whatever type you are operating on
    mytype* sli;
    mytype* sri;
    apply_func(const mytype* sin, mytype* sl, mytype* sr)
        : songin(sin), sli(sl), sri(sr) {}
    void operator()(const blocked_range<size_t>& range) const {
        for (size_t n = range.begin(); n != range.end(); ++n) {
            sli[n] = songin[n * 2];
            sri[n] = songin[n * 2 + 1];
        }
    }
};

Now you can use parallel_for to parallelize this loop:

size_t grainsize = 1000; //or whatever you decide on (testing required for best performance);
apply_func func(songin, sli, sri);
parallel_for(blocked_range<size_t>(0, songinfo.frames, grainsize), func);

That should do it (if I remember correctly; I haven't looked at TBB in a while, so there might be small mistakes).
If you use C++11, you can simplify the code by using a lambda:

size_t grainsize = 1000; // or whatever you decide on (testing required for best performance)
parallel_for(blocked_range<size_t>(0, songinfo.frames, grainsize),
    [&](const blocked_range<size_t>& range) {
        for (size_t n = range.begin(); n != range.end(); ++n) {
            sli[n] = songin[n * 2];
            sri[n] = songin[n * 2 + 1];
        }
    });

That being said, TBB is not exactly what I would recommend for a new programmer. I would really suggest parallelizing only code which is trivial to parallelize until you have a very firm grip on threading. For this I would suggest using OpenMP, which is quite a bit simpler to start with than TBB while still being powerful enough to parallelize a lot of stuff (it depends on the compiler supporting it, though). For your loop it would look like the following:

#pragma omp parallel for
for(size_t n = 0; n < songinfo.frames; ++n) {
sli[n]=songin[n*2];
sri[n]=songin[n*2+1];
}

Then you have to tell your compiler to compile and link with OpenMP (-fopenmp for gcc, /openmp for Visual C++). As you can see, it is quite a bit simpler to use than TBB (for such easy use cases; more complex scenarios are a different matter), and it has the added benefit of still compiling on platforms which don't support OpenMP or TBB (since unknown #pragmas are ignored by the compiler). Personally, I'm using OpenMP instead of TBB for some projects, since I couldn't use TBB's open source license and buying TBB was a bit too steep for those projects.

Now that we have how to parallelize the loop out of the way, let's get to the question of whether it's worth it. That can't really be answered easily, since it completely depends on how many elements you process and what kind of platform your program is expected to run on. Your problem is very bandwidth-heavy, so I wouldn't count on too much of an increase in performance.

  • If you are only processing 1000 elements, the parallel version of the loop is very likely to be slower than the single-threaded version due to overhead.
  • If your data is not in the cache (because it doesn't fit) and your system is very bandwidth-starved, you might not see much of a benefit (although it's likely that you will see some benefit, just don't be surprised if it's on the order of 1.X even if you use a lot of processors).
  • If your system is ccNUMA (likely for multi-socket systems), your performance might decrease regardless of the number of elements, due to additional transfer costs.
  • The compiler might miss optimizations regarding pointer aliasing (since the loop body is moved to a different function). Using __restrict (for gcc; I'm not sure about Visual C++) might help with that problem.
  • ...

Personally, I think the situation where you are most likely to see a significant performance increase is if your system has a single multi-core CPU and the dataset fits into the L3 cache (but not the individual L2 caches). For bigger datasets your performance will probably still increase, but not by much (and correct use of prefetching might get similar gains). Of course, this is pure speculation.

Tasks on Thread Building Blocks

Yes, this is the expected behavior.

TBB is a library designed to parallelize code for PERFORMANCE. It is not designed for asynchronous tasks; the official documentation states that you should use another library, e.g. pthreads (or boost::thread, in your case), for such tasks.

For maximum performance, it does not make sense to have more threads than you have cores, as there are some significant overheads involved (not just context switching, but also things like flushing the cache).
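
If you do want to cap how many worker threads TBB uses, here is a minimal sketch, assuming a recent oneTBB where tbb::global_control is available (older releases used task_scheduler_init for this instead):

#include <tbb/global_control.h>

int main() {
    // Limit TBB to at most 4 worker threads for as long as 'limit' is alive.
    tbb::global_control limit(tbb::global_control::max_allowed_parallelism, 4);

    // ... run parallel_for / task_group work here ...
    return 0;
}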

EDIT:
You can read about it in the Tutorial. Specifically, in section 1.2 "Benefits", it states:

Intel® Threading Building Blocks targets threading for performance. Most general-purpose threading packages support many different kinds of threading, such as threading for asynchronous events in graphical user interfaces. As a result, general-purpose packages tend to be low-level tools that provide a foundation, not a solution. Instead, Intel® Threading Building Blocks focuses on the particular goal of parallelizing computationally intensive work, delivering higher-level, simpler solutions.

and

Intel® Threading Building Blocks is compatible with other threading packages. Because the library is not designed to address all threading problems, it can coexist seamlessly with other threading packages.


