Are mlock()-ed pages static, or can they be moved in physical RAM?
No — they are not static. Pages that have been mlocked are managed on the kernel's unevictable LRU list. As the name suggests (and as mlock() guarantees), these pages cannot be evicted from RAM. However, they can be migrated from one physical page frame to another. Here is an excerpt from the kernel's Unevictable LRU Infrastructure documentation (formatting added for clarity):
MIGRATING MLOCKED PAGES
A page that is being migrated has been isolated from the LRU lists and is held locked across unmapping of the page, updating the page's address space entry and copying the contents and state, until the page table entry has been replaced with an entry that refers to the new page. Linux supports migration of mlocked pages and other unevictable pages. This involves simply moving the PG_mlocked and PG_unevictable states from the old page to the new page.
Repeated Minor Pagefaults at Same Address After Calling mlockall()
@fche's mention of Transparent Huge Pages put me onto the right track.
A more careful read of the kernel documentation I linked to in the question shows that mlock() does not prevent the kernel from migrating a page to a new page frame; indeed, there is an entire section devoted to migrating mlocked pages. Thus, simply calling mlock() does not guarantee that you will not experience any minor page faults.
Somewhat belatedly, I see that this answer quotes the same passage and partially answers my question.
One of the reasons the kernel might move pages around is memory compaction, whereby the kernel frees up a large contiguous block of pages so a "huge page" can be allocated. Transparent huge pages can be easily disabled; see e.g. this answer.
My particular test case was the result of some NUMA balancing changes introduced in the 3.13 kernel.
Quoting the LWN article linked therein:
The scheduler will periodically scan through each process's address
space, revoking all access permissions to the pages that are currently
resident in RAM. The next time the affected process tries to access
that memory, a page fault will result. The scheduler will trap that
fault and restore access to the page in question...
This behavior of the scheduler can be disabled by setting the NUMA policy of the process to explicitly use a certain node. This can be done with numactl at the command line (e.g. numactl --membind=0) or with a call into the libnuma library.
EDIT: The sysctl documentation explicitly states regarding NUMA balancing:
If the target workload is already bound to NUMA nodes then this feature should be disabled.
This can be done with sysctl -w kernel.numa_balancing=0.
There may still be other causes for page migration, but this sufficed for my purposes.
VmLck (locked memory) vs VmPin (pinned memory) in /proc/pid/status
After some research I am now able to answer my own question, hope it helps future visitors.
Locked memory is never swapped out of main memory: a page locked into physical memory is guaranteed to be resident in RAM at all times. However, there is no guarantee that a page fault will never happen, since the kernel is still free to move the page within physical memory.
Pinned memory is locked memory that is additionally fixed to a particular physical page frame. A pinned page can neither be swapped out of main memory nor moved within physical RAM, so it is guaranteed that a page fault will never occur. This is an ideal property for hard real-time applications.
Read more: https://lwn.net/Articles/600502/
CUDA and pinned (page locked) memory not page locked at all?
It would seem that the pinned allocator in CUDA 6.5 is, under the hood, using mmap() with MAP_FIXED. Although I am not an OS expert, this apparently has the effect of "pinning" the mapping, i.e. ensuring that its virtual address never changes. However, that is not a complete explanation. Refer to the answer by @Jeff, which points out what is almost certainly the "missing piece".
Let's consider a short test program:
#include <stdio.h>
#include <stdlib.h>        // for system()
#include <cuda_runtime.h>

#define DSIZE (1048576*1024)

int main(){
  int *data;
  cudaFree(0);             // force CUDA context creation
  system("cat /proc/meminfo > out1.txt");
  printf("*$*before alloc\n");
  cudaHostAlloc(&data, DSIZE, cudaHostAllocDefault);
  printf("*$*after alloc\n");
  system("cat /proc/meminfo > out2.txt");
  cudaFreeHost(data);
  system("cat /proc/meminfo > out3.txt");
  return 0;
}
If we run this program with strace and excerpt the output between the printf statements, we have:
write(1, "*$*before alloc\n", 16*$*before alloc) = 16
mmap(0x204500000, 1073741824, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x204500000
ioctl(11, 0xc0304627, 0x7fffcf72cce0) = 0
ioctl(3, 0xc0384657, 0x7fffcf72cd70) = 0
write(1, "*$*after alloc\n", 15*$*after alloc) = 15
(note that 1073741824 is exactly one gigabyte, i.e. the same as the requested 1048576*1024)
Reviewing the description of mmap, we have:
address gives a preferred starting address for the mapping. NULL expresses no preference. Any previous mapping at that address is automatically removed. The address you give may still be changed, unless you use the MAP_FIXED flag.
Therefore, assuming the mmap call succeeds, the requested virtual address will be fixed, which is probably useful, but not the whole story.
As I mentioned, I am not an OS expert, and it's not obvious to me what exactly about this system call creates a "pinned" mapping/allocation. It may be that the combination of MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS somehow creates a pinned underlying allocation, but I've not found any evidence to support that.
Based on this article it seems that even mlock()-ed pages would not meet the needs of DMA activity, which is one of the key goals of pinned host pages in CUDA. Therefore, it seems that something else is providing the actual "pinning" (i.e. guaranteeing that the underlying physical pages are always memory-resident, and that their virtual-to-physical mapping doesn't change — the latter part is possibly accomplished by MAP_FIXED along with whatever mechanism guarantees that the underlying physical pages don't move in any way).
This mechanism apparently does not use mlock(), and accordingly the Mlocked statistic does not change between before and after. However, we would expect a change in the Mapped statistic, and if we diff the out1.txt and out2.txt produced by the above program, we see (excerpted):
< Mapped: 87488 kB
---
> Mapped: 1135904 kB
The difference is approximately a gigabyte, the amount of "pinned" memory requested.
Problem of sorting OpenMP threads into NUMA nodes by experiment
After more investigation, I note the following:
- work-load managers on clusters can and will disregard/reset OMP_PLACES/OMP_PROC_BIND,
- memory page migration is a thing on modern NUMA systems.
Following this, I started using the workload manager's own thread binding/pinning system and adapted my benchmark to lock the memory pages on which my data lie. Furthermore, giving in to my programmer's paranoia, I ditched std::unique_ptr for fear that it might perform its own first touch after allocating the memory.
#include <cstddef>
#include <cstdlib>
#include <map>
#include <random>
#include <stdexcept>
#include <omp.h>
#include <sys/mman.h> // mlock, munlock
#include <unistd.h>   // sysconf

// create data which will be shared by multiple threads
const auto size_per_thread = std::size_t{50 * 1024 * 1024 / sizeof(double)}; // 50 MB
const auto total_size = thread_count * size_per_thread;
double* data = nullptr;
// posix_memalign reports failure through its return value;
// data is not guaranteed to be nullptr on error
if (posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE),
                   total_size * sizeof(double)) != 0)
{
    throw std::runtime_error("could_not_allocate_memory_error");
}

// perform first touch using thread 0
#pragma omp parallel num_threads(thread_count)
{
    if (omp_get_thread_num() == 0)
    {
        #pragma omp simd safelen(8)
        for (auto d_index = std::size_t{}; d_index < total_size; ++d_index)
        {
            data[d_index] = -1.0;
        }
    }
} // #pragma omp parallel
// page migration is a real thing... note that mlock takes a size in bytes
mlock(data, total_size * sizeof(double));

// open a parallel section
auto thread_id_avg_latency = std::multimap<double, int>{};
auto generator = std::mt19937(); // heavy object can be created outside parallel
#pragma omp parallel num_threads(thread_count) private(generator)
{
    // access the data using all threads individually
    #pragma omp for schedule(static, 1)
    for (auto thread_counter = std::size_t{}; thread_counter < thread_count; ++thread_counter)
    {
        // seed each thread's generator
        generator.seed(thread_counter + 1);
        // calculate the average access latency of this thread
        auto this_thread_avg_latency = 0.0;
        const auto experiment_count = 250;
        for (auto experiment_counter = std::size_t{}; experiment_counter < experiment_count; ++experiment_counter)
        {
            const auto start_timestamp = omp_get_wtime() * 1E+6;
            for (auto counter = std::size_t{}; counter < size_per_thread / 100; ++counter)
            {
                const auto index = std::uniform_int_distribution<std::size_t>(0, size_per_thread - 1)(generator);
                auto& datapoint = data[thread_counter * size_per_thread + index];
                datapoint += index;
            }
            const auto end_timestamp = omp_get_wtime() * 1E+6;
            this_thread_avg_latency += end_timestamp - start_timestamp;
        }
        this_thread_avg_latency /= experiment_count;
        #pragma omp critical
        {
            thread_id_avg_latency.insert(std::make_pair(this_thread_avg_latency, omp_get_thread_num()));
        }
    }
} // #pragma omp parallel
munlock(data, total_size * sizeof(double));
std::free(data);
With these changes, I am noticing the difference I expected.
Further notes:
- this experiment shows that the latency of non-local access is 1.09 - 1.15 times that of local access on the cluster that I'm using,
- there is no reliable cross-platform way of doing this (it requires kernel APIs),
- OpenMP seems to number the threads exactly as hwloc/lstopo, numactl and lscpu number them (logical ID?)
The most astonishing things are that the difference in latencies is very low and that memory page migration may happen, which raises the question: why should we care about first touch and all the rest of the NUMA concerns at all?
How do I calculate someone's age based on a DateTime type birthday?
A simple, easy-to-understand solution:
// Save today's date.
var today = DateTime.Today;
// Calculate the age.
var age = today.Year - birthdate.Year;
// Subtract a year if the birthday hasn't occurred yet this year.
if (birthdate.Date > today.AddYears(-age)) age--;
However, this assumes you are looking for the western idea of the age and not using East Asian reckoning.