NUMA aware cache aligned memory allocation
If you're just looking to add alignment on top of a NUMA allocator, you can easily build your own.
The idea is to call the unaligned malloc() with a little extra space, then return the first aligned address. To be able to free it later, you store the original base address at a known location.
Here's an example. Just substitute the names with whatever is appropriate:
pint        // An unsigned integer type large enough to hold a pointer (e.g. uintptr_t).
NUMA_malloc // The NUMA allocation function.
NUMA_free   // The NUMA free function.
void* my_NUMA_malloc(size_t bytes, size_t align /*, NUMA parameters */){
    // 'align' must be a power of two.
    // Over-allocate: 'align' extra bytes of alignment slack plus
    // sizeof(pint) to stash the original pointer.
    void *ptr = NUMA_malloc(
        bytes + align + sizeof(pint)
        /*, NUMA parameters */
    );
    if (ptr == NULL)
        return NULL;

    // Compute the aligned return address.
    pint *ret = (pint*)((((pint)ptr + sizeof(pint)) & ~(pint)(align - 1)) + align);

    // Save the original pointer just before the returned block.
    ret[-1] = (pint)ptr;

    return ret;
}
void my_NUMA_free(void *ptr){
    if (ptr == NULL)
        return;

    // Recover the original pointer stored by my_NUMA_malloc.
    ptr = (void*)(((pint*)ptr)[-1]);

    NUMA_free(ptr);
}
Then, when you use this, you must call my_NUMA_free for anything allocated with my_NUMA_malloc.
Does thread affinity also restrict memory allocation?
I wouldn't count on it. On Windows, you have to use special functions to allocate memory on a particular NUMA node. Linux is similar: look for a memory-mapping function that takes the NUMA node.
EDIT: see numa_malloc.
C++ NUMA Optimization
When you allocate dynamic memory (as std::vector does), you effectively get some range of pages from the virtual memory space. When a program first accesses a particular page, a page fault is triggered and a page of physical memory is mapped in. Usually, this page comes from the physical memory local to the core that generated the page fault; this is called the first-touch policy.
In your code, if the pages of your std::vector's buffers are first touched by a single (e.g., main) thread, then all elements of these vectors may end up in the local memory of a single NUMA node. If you then split your program into threads that run on all NUMA nodes, some of the threads access remote memory when working with these vectors.
The solution is thus to allocate "raw memory" and then "touch" it first with all threads, the same way it will later be accessed by those threads during the processing phase. Unfortunately, this is not easy to achieve with std::vector, at least with standard allocators. Can you switch to ordinary dynamic arrays? I would try this first, to find out whether initializing them with respect to the first-touch policy helps:
int* data = new int[N];
int* res = new int[N];

// Initialization with respect to the first-touch policy:
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++) {
    data[i] = ...;
    res[i] = ...;
}

#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
    res[i] = doExtremeComplexStuff(data[i]);
With static scheduling, the mapping of elements to threads should be exactly the same in both loops.
However, I am not convinced that your problem is caused by NUMA effects when accessing these two vectors. Since you called the function doExtremeComplexStuff, it seems to be very expensive in terms of runtime. If that is true, even accesses to remote NUMA memory will likely be negligibly cheap compared with the function invocation itself. The whole problem may be hidden inside this function, but we don't know what it does.
How to instantiate C++ objects on specific NUMA memory nodes?
Placement new is what you are looking for. Example:
void *blob = numa_alloc_onnode(sizeof(Object), ...);
Object *object = new(blob) Object;
Scalable allocation of large (8MB) memory regions on NUMA architectures
Second Update (closing the question):
Just profiled the example application again with a 3.10 kernel.
Results for parallel allocation and memsetting of 16GB of data:
small pages:
- 1 socket: 3112.29 ms
- 2 socket: 2965.32 ms
- 3 socket: 3000.72 ms
- 4 socket: 3211.54 ms
huge pages:
- 1 socket: 3086.77 ms
- 2 socket: 1568.43 ms
- 3 socket: 1084.45 ms
- 4 socket: 852.697 ms
The scalable allocation problem seems to be fixed now, at least for huge pages.