How does Intel TBB's scalable_allocator work?

How does Intel TBB's scalable_allocator work?

There is a good paper on the allocator: "The Foundations for Scalable Multi-core Software in Intel Threading Building Blocks".

My limited experience: I overloaded the global new/delete with tbb::scalable_allocator for my AI application, but there was little change in the time profile. I didn't compare memory usage, though.

Thread-safe TBB scalable allocator

As per the manual you linked:

Unless otherwise stated, the thread safety rules for the library are
as follows:

Two threads can invoke a method or function concurrently on different
objects, but not the same object. It is unsafe for two threads to
invoke concurrently methods or functions on the same object.
Descriptions of the classes note departures from this convention. For
example, the concurrent containers are more liberal. By their nature,
they do permit some concurrent operations on the same container
object.

With the scalable allocator this means two threads cannot free the same memory at the same time, which should not be surprising.

performance of Intel TBB memory allocator?

You can probably get the best answers on the TBB forums; they have excellent support.

I have been using TBB for a little over a year and have been quite satisfied with it in general, including its allocator.

You will need to provide more information, e.g. your use case, numbers, etc.; otherwise it is impossible to tell what is causing your problems.

intel tbb memory overhead

Heap-allocated memory is not normally returned to the OS after a call to delete or free. You need to call malloc_trim or your allocator-specific function to do that.

Scalable allocation of large (8MB) memory regions on NUMA architectures

Second Update (closing the question):

Just profiled the example application again with a 3.10 kernel.

Results for parallel allocation and memsetting of 16GB of data:

small pages:

  • 1 socket: 3112.29 ms
  • 2 socket: 2965.32 ms
  • 3 socket: 3000.72 ms
  • 4 socket: 3211.54 ms

huge pages:

  • 1 socket: 3086.77 ms
  • 2 socket: 1568.43 ms
  • 3 socket: 1084.45 ms
  • 4 socket: 852.697 ms

The scalable allocation problem seems to be fixed now - at least for huge pages.

Is there any performance comparison between TBB::scalable_allocator, tcmalloc and jemalloc?

'I want to use a memory allocator in a multithreaded environment and each thread eats a lot of memory.' - why? What for?

'Which should I choose?' - two possibilities:

1) Profile/analyze your application and match its memory-requirement characteristics against the specs of each allocator.

2) Test your app with each allocator to find out empirically which matches your application best.

'Is there any performance between these allocators?' - I guess you mean 'performance difference'. I'm almost 100% sure that there is a difference, yes.


