Numa Aware Cache Aligned Memory Allocation

If you're just looking to add alignment on top of a NUMA allocator, you can easily build your own.

The idea is to call the unaligned malloc() with a little extra space, then return the first aligned address inside that block. To be able to free it later, you store the original base address at a known location just before the returned pointer.

Here's an example. Just substitute the names with whatever is appropriate:

pint        // An unsigned integer type that is large enough to store a pointer.
NUMA_malloc // The NUMA malloc function.
NUMA_free   // The NUMA free function.
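
For example, pint could simply be a typedef for uintptr_t from <stdint.h> (my suggestion, not part of the original answer; any unsigned integer type wide enough to hold a pointer will do):

#include <stdint.h>

typedef uintptr_t pint;   // unsigned integer type wide enough to hold a pointer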

void* my_NUMA_malloc(size_t bytes, size_t align, /* NUMA parameters */ ){

    // Over-allocate: extra room for the alignment plus space to stash the base pointer.
    void *ptr = NUMA_malloc(
        (size_t)(bytes + align + sizeof(pint)),
        /* NUMA parameters */
    );

    if (ptr == NULL)
        return NULL;

    // Get the aligned return address (align must be a power of two).
    pint *ret = (pint*)((((pint)ptr + sizeof(pint)) & ~(pint)(align - 1)) + align);

    // Save the base pointer just before the returned address so the free wrapper can find it.
    ret[-1] = (pint)ptr;

    return ret;
}

void my_NUMA_free(void *ptr){
    if (ptr == NULL)
        return;

    // Recover the base pointer stored just before the aligned address.
    ptr = (void*)(((pint*)ptr)[-1]);

    // Pass the original base pointer to the NUMA free function.
    NUMA_free(ptr);
}

When you use this, you need to call my_NUMA_free for anything allocated with my_NUMA_malloc.
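
A quick usage sketch (the 64-byte alignment and the buffer size are just illustrative, and the NUMA parameters stay omitted as above):

// Allocate 1024 doubles aligned to a 64-byte cache line, then release them
// through the matching wrapper so the stored base pointer is used.
double *buf = (double*)my_NUMA_malloc(1024 * sizeof(double), 64 /*, NUMA parameters */);
if (buf != NULL) {
    buf[0] = 3.14;
    my_NUMA_free(buf);
}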

Does thread affinity also restrict memory allocation?

I wouldn't count on it. On Windows, you have to use special functions to allocate memory on a particular NUMA node. Linux is pretty similar: look for a memory-mapping function that takes the NUMA node.
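
For illustration, a minimal sketch using libnuma on Linux (the Windows counterpart is VirtualAllocExNuma; node 0 and the 4 KiB size here are just placeholders):

#include <numa.h>      // link with -lnuma
#include <stddef.h>

void allocate_on_node0(void) {
    if (numa_available() < 0)
        return;                            // no NUMA support on this system

    size_t size = 4096;
    void *p = numa_alloc_onnode(size, 0);  // memory placed on NUMA node 0
    if (p != NULL)
        numa_free(p, size);                // numa_free also needs the allocation size
}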

EDIT: numa_malloc. See this link: NUMA aware cache aligned memory allocation.

C++ NUMA Optimization

When you allocate dynamic memory (as std::vector does), you effectively get some range of pages from the virtual memory space. When the program first accesses a particular page, a page fault is triggered and a page of physical memory is mapped in. Usually, that page comes from the physical memory local to the core that generated the page fault; this is called the first-touch policy.

In your code, if the pages of your std::vector buffers are first touched by a single (e.g., the main) thread, it may happen that all elements of these vectors end up in the local memory of a single NUMA node. If you then split your program into threads that run on all NUMA nodes, some of the threads will access remote memory when working with these vectors.

The solution is thus to allocate "raw memory" and then "touch" it first with all threads, in the same way it will later be accessed by those threads during the processing phase. Unfortunately, this is not easy to achieve with std::vector, at least with standard allocators. Can you switch to ordinary dynamic arrays? I would try this first, to find out whether initializing them with respect to the first-touch policy helps:

int* data = new int[N];
int* res = new int[N];

// initialization with respect to first touch policy
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++) {
    data[i] = ...;
    res[i] = ...;
}

#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
    res[i] = doExtremeComplexStuff(data[i]);

With static scheduling, the mapping of elements to threads should be the very same in both loops.


However, I am not convinced that your problem is caused by NUMA effects when accessing these two vectors. Since you called the function doExtremeComplexStuff, it seems to be very expensive in terms of runtime. If that is true, even accessing remote NUMA memory will likely be negligible in comparison with the cost of the function invocation. The whole problem may be hidden inside this function, but we don't know what it does.

How to instantiate C++ objects on specific NUMA memory nodes?

Placement new is what you are looking for. Example:

void *blob = numa_alloc_onnode(sizeof(Object), ...);
Object *object = new(blob) Object;
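
Keep in mind that an object constructed this way is not released with delete. A minimal cleanup sketch (my addition, assuming libnuma's numa_free):

object->~Object();               // placement new requires an explicit destructor call
numa_free(blob, sizeof(Object)); // numa_free takes the allocation size as well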

Scalable allocation of large (8MB) memory regions on NUMA architectures

Second Update (closing the question):

Just profiled the example application again with a 3.10 kernel.

Results for parallel allocation and memsetting of 16GB of data:

small pages:

  • 1 socket: 3112.29 ms
  • 2 socket: 2965.32 ms
  • 3 socket: 3000.72 ms
  • 4 socket: 3211.54 ms

huge pages:

  • 1 socket: 3086.77 ms
  • 2 socket: 1568.43 ms
  • 3 socket: 1084.45 ms
  • 4 socket: 852.697 ms

The scalable allocation problem seems to be fixed now - at least for huge pages.
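
For context, here is a minimal sketch of the kind of workload measured above: an anonymous huge-page mapping touched in parallel by all threads. This is my reconstruction, not the author's exact benchmark; it assumes Linux with huge pages reserved (e.g. via /proc/sys/vm/nr_hugepages) and compilation with -fopenmp.

#include <sys/mman.h>
#include <cstring>
#include <cstdio>

int main() {
    const size_t bytes = 1ULL << 30;  // 1 GiB for illustration (the question used 16 GB)

    // Anonymous mapping backed by huge pages.
    void *buf = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    // Touch the whole region in parallel so pages are faulted in by the
    // threads (and NUMA nodes) that will later use them.
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < (long long)(bytes / 4096); ++i)
        std::memset((char*)buf + i * 4096, 0, 4096);

    munmap(buf, bytes);
    return 0;
}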


