Mmap and Memory Usage

mmap and memory usage

mmap can help you in several ways; I'll explain with some hypothetical examples:

First thing: let's say you're running out of memory. If your application has a 100MB chunk of malloc'ed memory and 50% of it gets swapped out, that means the OS had to write 50MB to the swapfile, and if you ever need that data again it has to read it back; you have written, occupied and then read back 50MB of your swapfile.

If the memory was just mmap'ed instead, the operating system will not write that data to the swapfile (it knows the data is identical to the file itself); it will simply discard those 50MB of information (again: assuming you have not written anything for now) and that's that. If you ever need that memory to be read again, the OS will fetch the contents not from the swapfile but from the original file you mmap'ed, so if any other program needs 50MB of swap, it's available. There is also no swapfile manipulation overhead at all.

Let's say you read a 100MB chunk of data, and according to the initial 1MB of header data, the information you want is located at offset 75MB, so you don't need anything between 1MB and 74.9MB! You have read it for nothing but to make your code simpler. With mmap, you will only read the data you have actually accessed (rounded up to the OS page size, which is usually 4KB), so it would only read the first and the 75th MB. I think it's very hard to find a simpler and more effective way to avoid disk reads than mmap'ing files.
And if for some reason you need the data at offset 37MB, you can just use it! You don't have to mmap it again, as the whole file is accessible in memory (limited, of course, by your process's address space).

All mmap'ed files are backed by themselves, not by the swapfile. The swapfile exists to back data that has no file of its own, which is usually malloc'ed data, or data that is backed by a file but has been altered and cannot/should not be written back to it until the program tells the OS to do so via an msync call.

Note that you don't need to map the whole file into memory; you can map any amount (the 2nd argument is "size_t length") starting from any place (the 6th argument, "off_t offset"). But unless your file is likely to be enormous, you can safely map 1GB of data with no fear, even if the system only packs 64MB of physical memory. That's for reading; if you plan on writing, you should be more conservative and map only what you need.

Mapping files will help you make your code simpler (you already have the file contents in memory, ready to use, with much less memory overhead since it's not anonymous memory) and faster (you will only read the data your program actually accesses).
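
As a rough illustration of the above (my own sketch, not part of the original answer), the following program maps a file read-only and touches only a single byte near its end; only the pages actually accessed are faulted in from disk, no matter how large the file is:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) {
        perror("open failed");
        return 1;
    }

    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size == 0) {
        perror("fstat failed (or file is empty)");
        return 1;
    }

    // Map the whole file read-only; nothing is read from disk yet.
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) {
        perror("mmap failed");
        return 1;
    }
    close(fd);   // the mapping stays valid after the descriptor is closed

    // Touching one byte near the end only faults in the page around it;
    // the megabytes before it are never read.
    printf("last byte: 0x%02x\n", (unsigned char)data[st.st_size - 1]);

    munmap(data, st.st_size);
    return 0;
}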

mmap problem, allocates huge amounts of memory

No, what you're doing is mapping the file into memory. This is different to actually reading the file into memory.

Were you to read it in, you would have to transfer the entire contents into memory. By mapping it, you let the operating system handle it. If you attempt to read or write to a location in that memory area, the OS will load the relevant section for you first. It will not load the entire file unless the entire file is needed.

That is where you get your performance gain. If you map the entire file but only change one byte then unmap it, you'll find that there's not much disk I/O at all.

Of course, if you touch every byte in the file, then yes, it will all be loaded at some point but not necessarily in physical RAM all at once. But that's the case even if you load the entire file up front. The OS will swap out parts of your data if there's not enough physical memory to contain it all, along with that of the other processes in the system.

The main advantages of memory mapping are:

  • you defer reading the file sections until they're needed (and, if they're never needed, they don't get loaded). So there's no big upfront cost of loading the entire file; it amortises the cost of loading.
  • Writes are automated: you don't have to write out every byte yourself. Just close it and the OS will write out the changed sections. I think this also happens when the memory is swapped out (in low physical-memory situations), since your buffer is simply a window onto the file (see the sketch below).
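
To make the second point concrete, here is a minimal sketch (mine, not from the original answer; the filename data.bin is just a placeholder) that changes one byte of a file through a MAP_SHARED mapping and never calls write() itself:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void) {
    // "data.bin" is a placeholder; it must exist and be at least one byte long.
    int fd = open("data.bin", O_RDWR);
    if (fd == -1) { perror("open failed"); return 1; }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat failed"); return 1; }
    if (st.st_size == 0) { fprintf(stderr, "file is empty\n"); return 1; }

    char *map = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap failed"); return 1; }

    map[0] ^= 0xff;               // change a single byte in place, no write() call

    // Optional: push the dirty page out now instead of waiting for the OS.
    msync(map, st.st_size, MS_SYNC);

    munmap(map, st.st_size);      // any remaining dirty pages are written back by the OS
    close(fd);
    return 0;
}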

Keep in mind that there is most likely a disconnect between your address space usage and your physical memory usage. You can allocate an address space of 4G (ideally, though there may be OS, BIOS or hardware limitations) in a 32-bit machine with only 1G of RAM. The OS handles the paging to and from disk.

And to answer your further request for clarification:

Just to clarify. So If I need the entire file, mmap will actually load the entire file?

Yes, but it may not be in physical memory all at once. The OS will swap out bits back to the filesystem in order to bring in new bits.

But it will also do that if you've read the entire file in manually. The difference between those two situations is as follows.

With the file read into memory manually, the OS will swap parts of your address space (which may or may not include that data) out to the swap file. And you will need to manually rewrite the file when you're finished with it.

With memory mapping, you have effectively told it to use the original file as an extra swap area for that file/memory only. And, when data is written to that swap area, it affects the actual file immediately. So no having to manually rewrite anything when you're done and no affecting the normal swap (usually).

It really is just a window to the file:

[image: memory-mapped file]

mmap's worst case memory usage when using MAP_PRIVATE vs MAP_SHARED

MAP_SHARED creates a mapping that is backed by the original file. Any changes to the data are written back to that file (assuming a read/write mapping).

MAP_PRIVATE creates a mapping that is backed by the original file for reads only. If you change bytes in the mapping, the OS creates a private copy of the page, which occupies physical memory and is backed by swap (if any).

The impact on resident set size is not dependent on the mapping type: pages will be in your resident set if they're actively accessed (read or write). If the OS needs physical memory, then pages that are not actively accessed are dropped (if clean), or written to either the original file or swap (if dirty, and depending on mapping type).

Where the two types differ is in total commitment against physical memory and swap. A shared mapping doesn't increase this commitment; a private mapping does. If you don't have enough combined memory and swap to hold every page of the private mapping, and you write to every page, then you (or possibly some other process) will be killed by the out-of-memory (OOM) killer.
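
If you want to see the commitment difference, one rough way (my own sketch, not part of the original answer; exact numbers depend on your kernel's overcommit accounting, and "testfile" is just a placeholder name) is to watch Committed_AS in /proc/meminfo while creating the two kinds of mappings; the private, writable mapping is the one that should bump the commit charge:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

// Print the kernel's total commit charge (the Committed_AS line of /proc/meminfo).
static void print_committed(const char *label) {
    char line[256];
    FILE *fp = fopen("/proc/meminfo", "r");
    if (!fp) return;
    while (fgets(line, sizeof(line), fp)) {
        if (strncmp(line, "Committed_AS:", 13) == 0) {
            printf("%-18s %s", label, line);
            break;
        }
    }
    fclose(fp);
}

int main(void) {
    // "testfile" is a placeholder; any reasonably large existing file will do.
    int fd = open("testfile", O_RDWR);
    struct stat st;
    if (fd == -1 || fstat(fd, &st) == -1) { perror("open/fstat failed"); return 1; }

    print_committed("before mappings:");

    void *shared = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (shared == MAP_FAILED) { perror("mmap MAP_SHARED failed"); return 1; }
    print_committed("after MAP_SHARED:");

    void *priv = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (priv == MAP_FAILED) { perror("mmap MAP_PRIVATE failed"); return 1; }
    print_committed("after MAP_PRIVATE:");

    munmap(shared, st.st_size);
    munmap(priv, st.st_size);
    close(fd);
    return 0;
}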

Update: what I wrote above applies to memory-mapped files. You can map an anonymous block (MAP_ANONYMOUS) with MAP_SHARED, in which case the memory is backed by swap, not a file.
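
A small sketch of that case (mine, not part of the original answer): an anonymous MAP_SHARED block has no file behind it, so its pages would go to swap under memory pressure, but it can still be shared, for example with a forked child:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    // Anonymous shared memory: no backing file, so dirty pages would go to swap.
    int *counter = mmap(NULL, sizeof(*counter), PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (counter == MAP_FAILED) { perror("mmap failed"); return 1; }

    *counter = 0;

    pid_t pid = fork();
    if (pid == -1) { perror("fork failed"); return 1; }
    if (pid == 0) {               // child: writes through the shared mapping
        *counter = 42;
        _exit(0);
    }
    waitpid(pid, NULL, 0);

    // The parent sees the child's write because the mapping is MAP_SHARED.
    printf("counter after child exited: %d\n", *counter);

    munmap(counter, sizeof(*counter));
    return 0;
}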

mmap memory backed by other memory?

General case - no control over first mapping

/proc/[PID]/pagemap + /dev/mem

The only way I can think of to make this work without any copying is by manually opening and checking /proc/[PID]/pagemap to get the Page Frame Number of the physical page corresponding to the page you want to "alias", and then opening and mapping /dev/mem at the corresponding offset. While this would work in theory, it requires root privileges, and is most likely not possible on any reasonable Linux distribution since the kernel is usually configured with CONFIG_STRICT_DEVMEM=y, which puts strict restrictions on the use of /dev/mem. For example, on x86 it disallows reading RAM through /dev/mem (only memory-mapped PCI regions can be read). Note that in order for this to work, the page you want to "alias" needs to be locked to keep it in RAM.

In any case, here's an example of how this would work if you were able/willing to do this (I am assuming x86 64bit here):

#include <stdio.h>
#include <errno.h>
#include <limits.h>
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>

/* Get the physical address of an existing virtual memory page and map it. */

int main(void) {
    FILE *fp;
    unsigned long addr, info, physaddr, val;
    long off;
    int fd;
    void *mem;
    void *orig_mem;

    // Suppose that this is the existing page you want to "alias"
    orig_mem = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
    if (orig_mem == MAP_FAILED) {
        perror("mmap orig_mem failed");
        return 1;
    }

    // Write a dummy value just for testing
    *(unsigned long *)orig_mem = 0x1122334455667788UL;

    // Lock the page to prevent it from being swapped out
    if (mlock(orig_mem, 0x1000)) {
        perror("mlock orig_mem failed");
        return 1;
    }

    fp = fopen("/proc/self/pagemap", "rb");
    if (!fp) {
        perror("Failed to open \"/proc/self/pagemap\"");
        return 1;
    }

    addr = (unsigned long)orig_mem;
    off = addr / 0x1000 * 8;

    if (fseek(fp, off, SEEK_SET)) {
        perror("fseek failed");
        return 1;
    }

    // Get its information from /proc/self/pagemap
    if (fread(&info, sizeof(info), 1, fp) != 1) {
        perror("fread failed");
        return 1;
    }

    physaddr = (info & ((1UL << 55) - 1)) << 12;

    printf("Value: %016lx\n", info);
    printf("Physical address: 0x%016lx\n", physaddr);

    // Ensure page is in RAM, should be true since it was mlock'd
    if (!(info & (1UL << 63))) {
        fputs("Page is not in RAM? Strange! Aborting.\n", stderr);
        return 1;
    }

    fd = open("/dev/mem", O_RDONLY);
    if (fd == -1) {
        perror("open(\"/dev/mem\") failed");
        return 1;
    }

    mem = mmap(NULL, 0x1000, PROT_READ, MAP_PRIVATE, fd, physaddr);
    if (mem == MAP_FAILED) {
        perror("Failed to mmap \"/dev/mem\"");
        return 1;
    }

    // Now `mem` is effectively referring to the same physical page that
    // `orig_mem` refers to.

    // Try reading 8 bytes (note: this will just return 0 if
    // CONFIG_STRICT_DEVMEM=y).
    val = *(unsigned long *)mem;

    printf("Read 8 bytes at physaddr 0x%016lx: %016lx\n", physaddr, val);

    return 0;
}

userfaultfd(2)

Other than what I described above, AFAIK there isn't a way to do what you want from userspace without copying, i.e. there is no way to simply tell the kernel "map this second virtual address to the same memory as an existing one". You can however register a userspace handler for page faults through the userfaultfd(2) syscall and ioctl_userfaultfd(2), and I think this is overall your best shot.

The whole mechanism is similar to what the kernel would do with a real memory page, except that the faults are handled by a user-defined userspace handler thread. This is still pretty much an actual copy, but it is atomic to the faulting thread and gives you more control. It could also potentially perform better in general, since the copying is controlled by you and can therefore be done only if/when needed (i.e. at the first read fault), while with a normal mmap + copy you always do the copying regardless of whether the page will ever be accessed later.

There is a pretty good example program in the manual page for userfaultfd(2) which I linked above, so I'm not going to copy-paste it here. It deals with one or more pages and should give you an idea about the whole API.

Simpler case - control over the first mapping

If you do have control over the first mapping which you want to "alias", then you can simply create a shared mapping. What you are looking for is memfd_create(2). You can use it to create an anonymous file which can then be mmap'ed multiple times with different permissions.

Here's a simple example:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>

int main(void) {
    int memfd;
    void *mem_ro, *mem_rw;

    // Create a memfd
    memfd = memfd_create("something", 0);
    if (memfd == -1) {
        perror("memfd_create failed");
        return 1;
    }

    // Give the file a size, otherwise reading/writing will fail
    if (ftruncate(memfd, 0x1000) == -1) {
        perror("ftruncate failed");
        return 1;
    }

    // Map the fd as read only and private
    mem_ro = mmap(NULL, 0x1000, PROT_READ, MAP_PRIVATE, memfd, 0);
    if (mem_ro == MAP_FAILED) {
        perror("mmap failed");
        return 1;
    }

    // Map the fd as read/write and shared (shared is needed if we want
    // write operations to be propagated to the other mappings)
    mem_rw = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, memfd, 0);
    if (mem_rw == MAP_FAILED) {
        perror("mmap failed");
        return 1;
    }

    printf("ro mapping @ %p\n", mem_ro);
    printf("rw mapping @ %p\n", mem_rw);

    // This write can now be read from both mem_ro and mem_rw
    *(char *)mem_rw = 123;

    // Test reading
    printf("read from ro mapping: %d\n", *(char *)mem_ro);
    printf("read from rw mapping: %d\n", *(char *)mem_rw);

    return 0;
}

Does mmap or malloc allocate RAM?

This is very OS/machine dependent.

In most OSes neither allocates RAM. They both allocate VM space. They make a certain range of your processes virtual memory valid for use. RAM is normally allocated later by the OS on first write. Until then those allocations do not use RAM (aside from the page table that lists them as valid VM space).

If you want to allocate physical RAM then you have to make each page (sysconf(_SC_PAGESIZE) gives you the system pagesize) dirty.

In Linux you can see your VM mappings with all details in /proc/self/smaps. Rss is the resident set of that mapping (how much of it is resident in RAM); anything else that is dirty will have been swapped out, and all non-dirty memory will be available for use but won't actually exist until it is touched.

You can make all pages dirty with something like

size_t mem_length = 256 * 1024 * 1024;   // example size; a multiple of the page size
char (*my_memory)[sysconf(_SC_PAGESIZE)] = mmap(
NULL
, mem_length
, PROT_READ | PROT_WRITE
, MAP_PRIVATE | MAP_ANONYMOUS
, -1
, 0
);
if ((void *)my_memory == MAP_FAILED) {
    perror("mmap failed");
    return 1;
}

// Writing one byte per page dirties every page, forcing RAM to be allocated
size_t i;
for (i = 0; i * sizeof(*my_memory) < mem_length; i++) {
    my_memory[i][0] = 1;
}

On some implementations this can also be achieved by passing the MAP_POPULATE flag to mmap, but (depending on your system) it may just fail mmap with ENOMEM if you try to map more than you have RAM available.
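
Here is a rough sketch (my own, not part of the original answer) that makes the difference visible by reading the resident-set size from /proc/self/statm (its second field is RSS in pages) before mapping, after mapping, and after dirtying every page:

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

// Print this process's resident set size, taken from the second field of
// /proc/self/statm (which is measured in pages).
static void print_rss(const char *label) {
    long vm_pages, rss_pages;
    FILE *fp = fopen("/proc/self/statm", "r");
    if (!fp) return;
    if (fscanf(fp, "%ld %ld", &vm_pages, &rss_pages) == 2)
        printf("%-13s RSS = %ld pages\n", label, rss_pages);
    fclose(fp);
}

int main(void) {
    size_t page_size = sysconf(_SC_PAGESIZE);
    size_t length = 4096 * page_size;           // 16 MB with 4 KB pages

    print_rss("before mmap:");

    char *mem = mmap(NULL, length, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) { perror("mmap failed"); return 1; }

    print_rss("after mmap:");                   // address space grew, RSS barely moved

    for (size_t off = 0; off < length; off += page_size)
        mem[off] = 1;                            // dirty every page

    print_rss("after dirty:");                  // now the pages are resident

    munmap(mem, length);
    return 0;
}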

Why does dereferencing pointer from mmap cause memory usage reported by top to increase?

When you first map the file, all it does is reserve address space; it doesn't necessarily read anything from the file if you don't pass MAP_POPULATE (the OS might do a little prefetch, but it's not required to, and often doesn't until you begin reading or writing).

When you read from a given page of memory for the first time, this triggers a page fault. That's not the "invalid page fault" most people think of when they hear the name; it's either:

  1. A minor fault - The data is already loaded in the kernel, but the userspace mapping for that address to the loaded data needs to be established (fast)
  2. A major fault - The data is not loaded at all, and the kernel needs to allocate a page for the data, populate it from the disk (slow), then perform the same mapping to userspace as in the minor fault case

The behavior you're seeing is likely due to the mapped file being too large to fit in memory alongside everything else that wants to stay resident, so:

  1. When first mapped, the initial pages aren't already mapped to the process (some of them might be in the kernel cache, but they're not charged to the process unless they're linked to the process's address space by minor page faults)
  2. You read from the file, causing minor and major faults until you fill main RAM
  3. Once you fill main RAM, faulting in a new page typically leads to one of the older pages being dropped (you're not using all the pages as much as the OS and other processes are using theirs, so the low activity pages, especially ones that can be dropped for free rather than written to the page/swap file, are ideal pages to discard), so your memory usage steadies (for every page read in, you drop another)
  4. When you munmap, the accounting against your process is dropped. Many of the pages are likely still in the kernel cache, but unless they're remapped and accessed again soon, they're likely first on the chopping block to discard if something else requests memory

And as commenters noted, shared memory-mapped file accounting gets weird; every process is "charged" for the memory, but they'll all report it as shared even if no other processes map it, so it's not practical to distinguish "shared because it's MAP_SHARED and backed by kernel cache, but no one else has it mapped so it's effectively uniquely owned by this process" from "shared because N processes are mapping the same data, reporting shared_amount * N usage cumulatively, but actually only consuming shared_amount memory total (plus a trivial amount to maintain the per-process page tables for each mapping)". There's no reason to be worried if the tallies don't line up.
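
If you want to watch this page-in behaviour directly, mincore(2) reports which pages of a mapping are currently resident. Here is a rough sketch (my own, not part of the answer above) that maps a file read-only and compares residency before and after touching every page:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

// Count how many pages of the mapping are currently resident in RAM.
static long count_resident(void *addr, size_t length, size_t page_size) {
    size_t pages = (length + page_size - 1) / page_size;
    unsigned char vec[pages];       // one byte per page; bit 0 set if resident
    long resident = 0;

    if (mincore(addr, length, vec) == -1)
        return -1;
    for (size_t i = 0; i < pages; i++)
        if (vec[i] & 1)
            resident++;
    return resident;
}

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    size_t page_size = sysconf(_SC_PAGESIZE);
    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd == -1 || fstat(fd, &st) == -1 || st.st_size == 0) {
        perror("open/fstat failed (or file is empty)");
        return 1;
    }

    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap failed"); return 1; }

    printf("resident pages right after mmap: %ld\n",
           count_resident(map, st.st_size, page_size));

    // Touch one byte per page: each first access is a (minor or major) fault.
    volatile char sink;
    for (off_t off = 0; off < st.st_size; off += page_size)
        sink = map[off];
    (void)sink;

    printf("resident pages after touching them: %ld\n",
           count_resident(map, st.st_size, page_size));

    munmap(map, st.st_size);
    close(fd);
    return 0;
}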


