Are mlock()-ed Pages Static, or Can They Be Moved in Physical RAM?


No. Pages that have been mlocked are managed on the kernel's unevictable LRU list. As the name suggests (and as mlock() guarantees), these pages cannot be evicted from RAM. However, they can be migrated from one physical page frame to another. Here is an excerpt from the kernel's Unevictable LRU Infrastructure documentation (formatting added for clarity):

MIGRATING MLOCKED PAGES


A page that is being migrated has been isolated from the LRU lists and is held locked across unmapping of the page, updating the page's address space entry and copying the contents and state, until the page table entry has been replaced with an entry that refers to the new page. Linux supports migration of mlocked pages and other unevictable pages. This involves simply moving the PG_mlocked and PG_unevictable states from the old page to the new page.
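One way to observe this is to watch a page's physical frame number (PFN) through /proc/self/pagemap. The sketch below is illustrative only, not part of the quoted documentation: reading PFNs requires CAP_SYS_ADMIN on recent kernels (they read back as zero otherwise), and whether a migration actually occurs depends on kernel activity such as compaction.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return the physical frame number backing vaddr, or 0 if not present. */
static uint64_t pfn_of(const void *vaddr) {
    uint64_t entry = 0;
    int fd = open("/proc/self/pagemap", O_RDONLY);
    off_t offset = (off_t)((uintptr_t)vaddr / (size_t)sysconf(_SC_PAGESIZE)) * 8;
    pread(fd, &entry, sizeof entry, offset);
    close(fd);
    /* bit 63: page present; bits 0-54: PFN (reads as 0 without CAP_SYS_ADMIN) */
    return (entry >> 63) ? (entry & ((1ULL << 55) - 1)) : 0;
}

int main(void) {
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    mlock(p, len);      /* residency is guaranteed; the frame is not */
    memset(p, 1, len);  /* fault the page in */
    printf("PFN: 0x%llx\n", (unsigned long long)pfn_of(p));
    /* A later call to pfn_of(p) may report a different frame number if
       the kernel migrates the page (e.g. during compaction), even though
       the page remained resident the entire time. */
    return 0;
}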

Repeated Minor Pagefaults at Same Address After Calling mlockall()

@fche's mention of Transparent Huge Pages put me onto the right track.

A less careless read of the kernel documentation I linked to in the question shows that mlock() does not prevent the kernel from migrating the page to a new page frame; indeed, there's an entire section devoted to migrating mlocked pages. Thus, simply calling mlock() does not guarantee that you will not experience any minor page faults.

Somewhat belatedly, I see that this answer quotes the same passage and partially answers my question.

One of the reasons the kernel might move pages around is memory compaction, whereby the kernel frees up a large contiguous block of pages so a "huge page" can be allocated. Transparent huge pages can be easily disabled; see e.g. this answer.
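(For example, on most kernels this amounts to echo never > /sys/kernel/mm/transparent_hugepage/enabled, and similarly for .../transparent_hugepage/defrag.)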

My particular test case was the result of some NUMA balancing changes introduced in the 3.13 kernel.

Quoting the LWN article linked therein:

The scheduler will periodically scan through each process's address space, revoking all access permissions to the pages that are currently resident in RAM. The next time the affected process tries to access that memory, a page fault will result. The scheduler will trap that fault and restore access to the page in question...

This behavior of the scheduler can be disabled by setting the NUMA policy of the process to explicitly use a certain node. This can be done using numactl at the command line (e.g. numactl --membind=0) or a call to the libnuma library.
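For instance, with libnuma the equivalent of numactl --membind=0 looks roughly like the sketch below (node 0 is just an example, and you must link with -lnuma):

#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    /* Restrict this process's memory allocations to node 0,
       equivalent to `numactl --membind=0`. */
    struct bitmask *mask = numa_allocate_nodemask();
    numa_bitmask_setbit(mask, 0);
    numa_set_membind(mask);
    numa_free_nodemask(mask);
    /* ... run the latency-sensitive workload here ... */
    return 0;
}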

EDIT: The sysctl documentation explicitly states, regarding NUMA balancing:

If the target workload is already bound to NUMA nodes then this feature should be disabled.

This can be done with sysctl -w kernel.numa_balancing=0

There may still be other causes for page migration, but this sufficed for my purposes.

VmLck (locked memory) vs VmPin (pinned memory) in /proc/pid/status

After some research I am now able to answer my own question; I hope it helps future visitors.

Locked memory is never swapped out of main memory: a page locked in physical memory is guaranteed to be resident in RAM at all times. However, there is no guarantee that a page fault will never happen, since the kernel is still free to move the page within physical memory.

Pinned memory is locked memory that is additionally fixed to a particular page frame. A pinned page can neither be swapped out of main memory nor moved within physical RAM, so it is guaranteed that a page fault will never happen. This is an ideal requirement for hard realtime applications.
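To see the distinction in /proc/pid/status, you can lock a buffer and dump the two counters; a minimal sketch:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    size_t len = 4 * 1024 * 1024;   /* 4 MiB */
    char *buf = malloc(len);
    memset(buf, 0, len);            /* fault the pages in */
    mlock(buf, len);                /* counted in VmLck */
    /* /proc/self inside system() would refer to the grep child process,
       so address our own status file by PID instead. */
    char cmd[64];
    snprintf(cmd, sizeof cmd, "grep -E 'VmLck|VmPin' /proc/%d/status",
             (int)getpid());
    /* VmLck should now show ~4096 kB; VmPin stays 0, because mlock()
       locks pages but does not pin them. */
    system(cmd);
    munlock(buf, len);
    free(buf);
    return 0;
}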

Read more: https://lwn.net/Articles/600502/

CUDA and pinned (page locked) memory not page locked at all?

It would seem that the pinned allocator in CUDA 6.5 uses mmap() with MAP_FIXED under the hood. Although I am not an OS expert, this apparently has the effect of "pinning" memory, i.e. ensuring that its virtual address never changes. However, this is not a complete explanation. Refer to the answer by @Jeff, which points out what is almost certainly the "missing piece".

Let's consider a short test program:

#include <stdio.h>
#include <stdlib.h>   /* for system() */
#define DSIZE (1048576*1024)

int main(){

  int *data;
  cudaFree(0);   /* force CUDA context creation */
  system("cat /proc/meminfo > out1.txt");
  printf("*$*before alloc\n");
  cudaHostAlloc(&data, DSIZE, cudaHostAllocDefault);
  printf("*$*after alloc\n");
  system("cat /proc/meminfo > out2.txt");
  cudaFreeHost(data);
  system("cat /proc/meminfo > out3.txt");
  return 0;
}

If we run this program under strace and excerpt the portion of the output between the printf statements, we have:

write(1, "*$*before alloc\n", 16*$*before alloc)       = 16
mmap(0x204500000, 1073741824, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x204500000
ioctl(11, 0xc0304627, 0x7fffcf72cce0) = 0
ioctl(3, 0xc0384657, 0x7fffcf72cd70) = 0
write(1, "*$*after alloc\n", 15*$*after alloc) = 15

(note that 1073741824 is exactly one gigabyte, i.e. the same as the requested 1048576*1024)

Reviewing the description of mmap, we have:

address gives a preferred starting address for the mapping. NULL expresses no preference. Any previous mapping at that address is automatically removed. The address you give may still be changed, unless you use the MAP_FIXED flag.

Therefore, assuming the mmap command is successful, the virtual address requested will be fixed, which is probably useful, but not the whole story.

As I mentioned, I am not an OS expert, and it's not obvious to me exactly what about this system call creates a "pinned" mapping/allocation. It may be that the combination of MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS somehow creates a pinned underlying allocation, but I've not found any evidence to support that.

Based on this article it seems that even mlock()-ed pages would not meet the needs of DMA activity, which is one of the key goals of pinned host pages in CUDA. Therefore, it seems that something else is providing the actual "pinning" (i.e. guaranteeing that the underlying physical pages are always memory-resident, and that their virtual-to-physical mapping doesn't change -- the latter part of this is possibly accomplished by MAP_FIXED along with whatever mechanism guarantees that the underlying physical pages don't move in any way).
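(On the kernel side, drivers typically achieve this kind of pinning by taking long-term references on the pages, e.g. via get_user_pages(); this is the same locked-vs-pinned distinction discussed in the VmLck/VmPin answer above.)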

This mechanism apparently does not use mlock(), and so the Mlocked figure in /proc/meminfo does not change before and after the allocation. However, we would expect a change in the Mapped statistic, and if we diff the out1.txt and out2.txt files produced by the above program, we see (excerpted):

< Mapped:            87488 kB
---
> Mapped:          1135904 kB

The difference is approximately a gigabyte, the amount of "pinned" memory requested.

Problem of sorting OpenMP threads into NUMA nodes by experiment

After more investigation, I note the following:

  1. work-load managers on clusters can and will disregard/reset OMP_PLACES/OMP_PROC_BIND,
  2. memory page migration is a thing on modern NUMA systems.

Following this, I started using the work-load manager's own thread binding/pinning system, and adapted my benchmark to lock the memory page(s) on which my data lie. Furthermore, giving in to my programmer's paranoia, I ditched the std::unique_ptr for fear that it might perform its own first touch after allocating the memory.

// excerpt from the benchmark; thread_count is set elsewhere by the harness
#include <cstddef>
#include <cstdlib>
#include <map>
#include <random>
#include <stdexcept>
#include <omp.h>
#include <sys/mman.h>   // mlock
#include <unistd.h>     // sysconf

// create data which will be shared by multiple threads
const auto size_per_thread = std::size_t{50 * 1024 * 1024 / sizeof(double)}; // 50 MiB of doubles per thread
const auto total_size = thread_count * size_per_thread;
double* data = nullptr;
const auto rc = posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), total_size * sizeof(double));
if (rc != 0 || data == nullptr)
{
    throw std::runtime_error("could_not_allocate_memory_error");
}

// perform first touch using thread 0
#pragma omp parallel num_threads(thread_count)
{
    if (omp_get_thread_num() == 0)
    {
#pragma omp simd safelen(8)
        for (auto d_index = std::size_t{}; d_index < total_size; ++d_index)
        {
            data[d_index] = -1.0;
        }
    }
} // #pragma omp parallel
mlock(data, total_size * sizeof(double)); // page migration is a real thing... (note: length is in bytes)

// open a parallel section
auto thread_id_avg_latency = std::multimap<double, int>{};
auto generator = std::mt19937(); // heavy object can be created outside parallel
#pragma omp parallel num_threads(thread_count) private(generator)
{
    // access the data using all threads individually
#pragma omp for schedule(static, 1)
    for (auto thread_counter = std::size_t{}; thread_counter < thread_count; ++thread_counter)
    {
        // seed each thread's generator
        generator.seed(thread_counter + 1);

        // calculate the average access latency of this thread
        auto this_thread_avg_latency = 0.0;
        const auto experiment_count = std::size_t{250};
        for (auto experiment_counter = std::size_t{}; experiment_counter < experiment_count; ++experiment_counter)
        {
            const auto start_timestamp = omp_get_wtime() * 1E+6;
            for (auto counter = std::size_t{}; counter < size_per_thread / 100; ++counter)
            {
                const auto index = std::uniform_int_distribution<std::size_t>(0, size_per_thread - 1)(generator);
                auto& datapoint = data[thread_counter * size_per_thread + index];
                datapoint += index;
            }
            const auto end_timestamp = omp_get_wtime() * 1E+6;
            this_thread_avg_latency += end_timestamp - start_timestamp;
        }
        this_thread_avg_latency /= experiment_count;
#pragma omp critical
        {
            thread_id_avg_latency.insert(std::make_pair(this_thread_avg_latency, omp_get_thread_num()));
        }
    }
} // #pragma omp parallel
std::free(data);

With these changes, I am noticing the difference I expected.

[Image: per-thread average access latencies, showing the expected difference between local and non-local accesses]

Further notes:

  1. this experiment shows that the latency of non-local access is 1.09 - 1.15 times that of local access on the cluster that I'm using,
  2. there is no reliable cross-platform way of doing this (requires kernel-APIs),
  3. OpenMP seems to number the threads exactly as hwloc/lstopo, numactl, and lscpu do (i.e. by logical ID?)

The most astonishing things are that the difference in latencies is very low and that memory page migration may happen, which raises the question: why should we care about first touch and all the rest of the NUMA concerns at all?

How do I calculate someone's age based on a DateTime type birthday?

A simple, easy-to-understand solution:

// Save today's date.
var today = DateTime.Today;

// Calculate the age.
var age = today.Year - birthdate.Year;

// Subtract a year if the birthday hasn't occurred yet this year
// (this also handles people born on Feb. 29 of a leap year).
if (birthdate.Date > today.AddYears(-age)) age--;
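For example, for someone born on 2000-02-29: on 2024-02-28 the subtraction gives 24, and since 2000-02-29 > today.AddYears(-24) = 2000-02-28, the age is corrected to 23; on 2024-02-29 the comparison fails and the age stays 24.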

However, this assumes you are looking for the western idea of the age and not using East Asian reckoning.


