In Linux, How to Create a File Descriptor for a Memory Region

In Linux, how to create a file descriptor for a memory region

You cannot easily create a file descriptor (other than a C standard library one, which is not helpful) from "some memory region". However, you can create a shared memory region, getting a file descriptor in return.

From shm_overview(7):

shm_open(3)

Create and open a new object, or open an existing object. This is analogous to open(2). The call returns a file descriptor for use by the other interfaces listed below.

Among the listed interfaces is mmap, which means that you can "memory map" the shared memory the same as you would memory map a regular file.

Thus, using mmap for both situations (file or memory buffer) should work seamlessly, if only you control creation of that "memory buffer".
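As a rough sketch of the approach described above (not a definitive implementation), the two helpers below create a POSIX shared memory object, size it, and map it like a regular file. The function names make_shm_fd and map_fd and the object name passed in are hypothetical, chosen only for illustration. On older glibc versions shm_open may require linking with -lrt.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a POSIX shared memory object of `size` bytes and return its file
 * descriptor, or -1 on error. `name` is a caller-chosen name such as
 * "/my_region" (hypothetical). It is unlinked right away, so the object
 * lives only as long as the descriptor and its mappings stay open. */
int make_shm_fd(const char *name, size_t size) {
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd == -1)
        return -1;
    shm_unlink(name);
    /* A new object has size 0; give it a size before mapping it. */
    if (ftruncate(fd, (off_t)size) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map `size` bytes of any mmap-able descriptor (regular file or shared
 * memory object) read/write and shared. Returns MAP_FAILED on error. */
void *map_fd(int fd, size_t size) {
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```

Because map_fd only sees a descriptor, the same call works whether the descriptor came from open(2) on a regular file or from make_shm_fd; data written through the mapping can also be read back through the descriptor with read(2) or pread(2).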

How to get file descriptor of buffer in memory?

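I wrote a simple example of how to get a file descriptor for a memory area. It uses a pipe: a child process writes the buffer into one end, and the parent reads it from the other end: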

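#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

char buff[] = "qwer\nasdf\n";

int main(void) {
  int p[2];
  if (pipe(p) == -1) { perror("pipe"); return 1; }

  pid_t pid = fork();
  if (pid == -1) { perror("fork"); return 1; }

  if (pid == 0) {
    // Child: copy the memory buffer into the write end of the pipe.
    close(p[0]);
    size_t size = strlen(buff), len = 0;
    while (len < size) {
      ssize_t n = write(p[1], buff + len, size - len);
      if (n == -1) _exit(1);
      len += (size_t)n;
    }
    _exit(0);
  }

  // Parent: the read end of the pipe is a real file descriptor.
  close(p[1]);
  FILE *f = fdopen(p[0], "r");
  if (!f) { perror("fdopen"); return 1; }
  char line[100];
  while (fgets(line, sizeof line, f)) {
    printf("from child: '%s'\n", line);
  }
  puts("");
  fclose(f);
  waitpid(pid, NULL, 0);
  return 0;
}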

munmap() when processes share file descriptor table, but not virtual memory

Using the program below I was able to draw some conclusions empirically (even though I have no guarantee that they are correct):

mmap() takes approximately the same time regardless of the size of the mapping (this is due to efficient memory management by the Linux kernel: mapped memory doesn't take up space unless it is written to).
mmap() takes longer depending on the number of already-existing mappings. The first 1000 mmaps take around 0.05 seconds; 1000 mmaps after already having 64000 mappings take around 34 seconds. I haven't checked the Linux kernel source, but inserting a mapped region into the index probably takes O(n) instead of the O(1) that is feasible with some structures. A kernel patch would be possible, but it's probably not a problem for anyone but me :-)
munmap() needs to be issued on ALL processes mapping the same MAP_ANONYMOUS region for it to be reclaimed by the kernel. This correctly frees the shared memory region.

#include <cassert>
#include <cinttypes>
#include <thread>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <stddef.h>
#include <signal.h>
#include <time.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sched.h>

// Macro bodies are parenthesized so they expand safely inside expressions.
#define NUM_ITERATIONS 100000
#define ALLOC_SIZE (1ul << 30)
#define CLOCK_TYPE CLOCK_PROCESS_CPUTIME_ID
#define NUM_ELEMS (1024 * 1024 / 4)

struct timespec start_time;

int main() {
    clock_gettime(CLOCK_TYPE, &start_time);
    printf("iterations = %d\n", NUM_ITERATIONS);
    printf("alloc size = %lu\n", ALLOC_SIZE);
    assert(ALLOC_SIZE >= NUM_ELEMS * sizeof(int));
    bool *written = (bool*) mmap(NULL, sizeof(bool), PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    for(int i=0; i < NUM_ITERATIONS; i++) {
        if(i % (NUM_ITERATIONS / 100) == 0) {
            struct timespec now;
            struct timespec elapsed;
            printf("[%3d%%]", i / (NUM_ITERATIONS / 100));
            clock_gettime(CLOCK_TYPE, &now);
            if (now.tv_nsec < start_time.tv_nsec) {
                elapsed.tv_sec = now.tv_sec - start_time.tv_sec - 1;
                elapsed.tv_nsec = now.tv_nsec - start_time.tv_nsec + 1000000000;
            } else {
                elapsed.tv_sec = now.tv_sec - start_time.tv_sec;
                elapsed.tv_nsec = now.tv_nsec - start_time.tv_nsec;
            }
            printf("%05" PRIdMAX ".%09ld\n", (intmax_t) elapsed.tv_sec, elapsed.tv_nsec);
        }
        int *value = (int*) mmap(NULL, ALLOC_SIZE, PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        *value = 0;
        *written = 0;
        if (int rv = syscall(SYS_clone, CLONE_FS | CLONE_FILES | SIGCHLD, nullptr)) {
            while(*written == 0) std::this_thread::yield();
            assert(*value == i);
            munmap(value, ALLOC_SIZE);
            waitpid(-1, NULL, 0);
        } else {
            for(int j=0; j<NUM_ELEMS; j++)
                value[j] = i;
            *written = 1;
            //munmap(value, ALLOC_SIZE);
            return 0;
        }
    }
    return 0;
}

Setting an fmemopen-ed file descriptor to be the standard input for a child process

This is not possible. Inheriting stdin/out/err is based purely on file descriptors, not stdio FILE streams. Since fmemopen does not create a file descriptor, it cannot become a new process's stdin/out/err or be used for inter-process communication in any way. What you're looking for is a pipe, unless you need seeking, in which case you need a temporary file. The tmpfile function could be used to create one without having to worry about making a visible name in the filesystem.
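As a rough illustration of the temporary-file route, the sketch below copies a memory buffer into a tmpfile() stream and hands the underlying descriptor to a child process as its standard input. The function name run_with_stdin is hypothetical, and error handling is kept minimal.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] (looked up in PATH) with `len` bytes of `data` as its
 * standard input. Returns the child's exit status, or -1 on failure. */
int run_with_stdin(const void *data, size_t len, char *const argv[]) {
    FILE *tmp = tmpfile();              /* unnamed temporary file */
    if (!tmp)
        return -1;

    /* Copy the buffer into the file, then rewind the descriptor itself so
     * the child starts reading at offset 0. */
    int fd = fileno(tmp);
    if (fwrite(data, 1, len, tmp) != len || fflush(tmp) != 0 ||
        lseek(fd, 0, SEEK_SET) == -1) {
        fclose(tmp);
        return -1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        fclose(tmp);
        return -1;
    }
    if (pid == 0) {
        /* Child: make the temporary file its stdin, then exec. */
        if (dup2(fd, STDIN_FILENO) == -1)
            _exit(127);
        execvp(argv[0], argv);
        _exit(127);                     /* exec failed */
    }

    /* Parent: wait first, then close the stream. */
    int status;
    pid_t waited = waitpid(pid, &status, 0);
    fclose(tmp);
    if (waited == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Unlike a pipe, the child can seek within this input, because its stdin refers to a real (though unnamed) file. The cost is one copy of the buffer into the file.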

mmap memory backed by other memory?

General case - no control over first mapping

`/proc/[PID]/pagemap` + `/dev/mem`

The only way I can think of to make this work without any copying is to manually open and read /proc/[PID]/pagemap to get the Page Frame Number of the physical page corresponding to the page you want to "alias", and then open and map /dev/mem at the corresponding offset. While this would work in theory, it requires root privileges and is most likely not possible on any reasonable Linux distribution, since the kernel is usually configured with CONFIG_STRICT_DEVMEM=y, which puts strict restrictions on the use of /dev/mem. For example, on x86 it disallows reading RAM from /dev/mem (it only allows reading memory-mapped PCI regions). Note that for this to work, the page you want to "alias" needs to be locked to keep it in RAM.

In any case, here's an example of how this would work if you were able/willing to do this (I am assuming x86 64bit here):

#include <stdio.h>
#include <errno.h>
#include <limits.h>
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>

/* Get the physical address of an existing virtual memory page and map it. */

int main(void) {
    FILE *fp;
    char *endp;
    unsigned long addr, info, physaddr, val;
    long off;
    int fd;
    void *mem;
    void *orig_mem;

    // Suppose that this is the existing page you want to "alias"
    orig_mem = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
    if (orig_mem == MAP_FAILED) {
        perror("mmap orig_mem failed");
        return 1;
    }

    // Write a dummy value just for testing
    *(unsigned long *)orig_mem = 0x1122334455667788UL;

    // Lock the page to prevent it from being swapped out
    if (mlock(orig_mem, 0x1000)) {
        perror("mlock orig_mem failed");
        return 1;
    }

    fp = fopen("/proc/self/pagemap", "rb");
    if (!fp) {
        perror("Failed to open \"/proc/self/pagemap\"");
        return 1;
    }

    addr = (unsigned long)orig_mem;
    off  = addr / 0x1000 * 8;

    if (fseek(fp, off, SEEK_SET)) {
        perror("fseek failed");
        return 1;
    }

    // Get its information from /proc/self/pagemap
    if (fread(&info, sizeof(info), 1, fp) != 1) {
        perror("fread failed");
        return 1;
    }

    physaddr = (info & ((1UL << 55) - 1)) << 12;

    printf("Value: %016lx\n", info);
    printf("Physical address: 0x%016lx\n", physaddr);

    // Ensure page is in RAM, should be true since it was mlock'd
    if (!(info & (1UL << 63))) {
        fputs("Page is not in RAM? Strange! Aborting.\n", stderr);
        return 1;
    }

    fd = open("/dev/mem", O_RDONLY);
    if (fd == -1) {
        perror("open(\"/dev/mem\") failed");
        return 1;
    }

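    // Map the physical page through /dev/mem. This must be a file-backed
    // mapping: with MAP_ANONYMOUS the fd would be ignored.
    mem = mmap(NULL, 0x1000, PROT_READ, MAP_SHARED, fd, physaddr);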
    if (mem == MAP_FAILED) {
        perror("Failed to mmap \"/dev/mem\"");
        return 1;
    }

    // Now `mem` is effectively referring to the same physical page that
    // `orig_mem` refers to.

    // Try reading 8 bytes (note: this will just return 0 if
    // CONFIG_STRICT_DEVMEM=y).
    val = *(unsigned long *)mem;

    printf("Read 8 bytes at physaddr 0x%016lx: %016lx\n", physaddr, val);

    return 0;
}

`userfaultfd(2)`

Other than what I described above, AFAIK there isn't a way to do what you want from userspace without copying, i.e. there is no way to simply tell the kernel "map this second virtual address to the same memory as an existing one". You can, however, register a userspace handler for page faults through the userfaultfd(2) syscall and ioctl_userfaultfd(2), and I think this is overall your best shot.

The whole mechanism is similar to what the kernel would do with a real memory page, except that the faults are handled by a user-defined userspace handler thread. This is still pretty much an actual copy, but it is atomic to the faulting thread and gives you more control. It could potentially also perform better in general, since the copying is controlled by you and can therefore be done only if and when needed (i.e. at the first read fault), while with a normal mmap + copy you always do the copying, regardless of whether the page will ever be accessed later.

There is a pretty good example program in the manual page for userfaultfd(2), so I'm not going to copy-paste it here. It deals with one or more pages and should give you an idea of the whole API.

Simpler case - control over the first mapping

In the case you do have control over the first mapping which you want to "alias", then you can simply create a shared mapping. What you are looking for is memfd_create(2). You can use it to create an anonymous file which can then be mmaped multiple times with different permissions.

Here's a simple example:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>

int main(void) {
        int memfd;
        void *mem_ro, *mem_rw;

        // Create a memfd
        memfd = memfd_create("something", 0);
        if (memfd == -1) {
                perror("memfd_create failed");
                return 1;
        }

        // Give the file a size, otherwise reading/writing will fail
        if (ftruncate(memfd, 0x1000) == -1) {
                perror("ftruncate failed");
                return 1;
        }

        // Map the fd as read only and private
        mem_ro = mmap(NULL, 0x1000, PROT_READ, MAP_PRIVATE, memfd, 0);
        if (mem_ro == MAP_FAILED) {
                perror("mmap failed");
                return 1;
        }

        // Map the fd as read/write and shared (shared is needed if we want
        // write operations to be propagated to the other mappings)
        mem_rw = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, memfd, 0);
        if (mem_rw == MAP_FAILED) {
                perror("mmap failed");
                return 1;
        }

        printf("ro mapping @ %p\n", mem_ro);
        printf("rw mapping @ %p\n", mem_rw);

        // This write can now be read from both mem_ro and mem_rw
        *(char *)mem_rw = 123;

        // Test reading
        printf("read from ro mapping: %d\n", *(char *)mem_ro);
        printf("read from rw mapping: %d\n", *(char *)mem_rw);

        return 0;
}

Linux Implement open file descriptors C

How you count this depends on what information you are interested in.

Looking through /proc/PID/fd/* will give you the number of open file descriptors. One caveat is that two processes may share an open file: after fork, the child process inherits the file descriptors of its parent, so this method counts the same open file twice, once for each process.
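A minimal sketch of this first method, assuming /proc is mounted: the hypothetical helper count_open_fds lists /proc/self/fd for the calling process and leaves out the descriptor that opendir() itself needs for the listing.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdlib.h>

/* Count this process's open file descriptors by listing /proc/self/fd.
 * Returns -1 if the directory cannot be opened (e.g. /proc not mounted). */
int count_open_fds(void) {
    DIR *dir = opendir("/proc/self/fd");
    if (!dir)
        return -1;

    int self = dirfd(dir);  /* descriptor used by opendir() itself */
    int count = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        /* Entries are named after descriptor numbers; skip "." and "..". */
        if (!isdigit((unsigned char)ent->d_name[0]))
            continue;
        if (atoi(ent->d_name) != self)
            count++;
    }
    closedir(dir);
    return count;
}
```

For another process, the same loop would read /proc/PID/fd instead, which requires permission to inspect that process.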

/proc/PID/maps will show you the memory map of the process, which can include the loaded executable itself and dynamically linked libraries, but also includes things that don't correspond to files like the heap, the stack, the vdso section which is a virtual shared object exported by the kernel, and so on.

lsof will list a variety of ways that files can be in use, which includes more than just file descriptors; it also includes the executable and shared libraries, but does not include the memory regions that don't correspond to files that show up in /proc/PID/maps like the stack, heap, vdso section, etc.

/proc/sys/fs/file-nr will report the number of open kernel file handles. A kernel file handle is different than a file descriptor; there can be more than one file descriptor open that point to the same file handle, for instance, by calling dup or dup2.
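The difference between a descriptor and the kernel file handle behind it can be observed through the file offset, which belongs to the handle. The sketch below (the helper name dup_shares_offset is hypothetical) duplicates a descriptor, seeks through the original, and checks whether the copy sees the new offset.

```c
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if `fd` and a dup() of it share one file offset, 0 if they do
 * not, and -1 on error. `fd` must refer to a seekable file. Note that this
 * moves the file offset of `fd` to 5. */
int dup_shares_offset(int fd) {
    int copy = dup(fd);
    if (copy == -1)
        return -1;

    off_t moved = lseek(fd, 5, SEEK_SET);   /* move via the original */
    off_t seen = lseek(copy, 0, SEEK_CUR);  /* query via the duplicate */
    close(copy);

    if (moved == -1 || seen == -1)
        return -1;
    return seen == moved;
}
```

Two descriptors obtained from two separate open(2) calls on the same path would each get their own handle and therefore their own offset, which is why counting descriptors and counting handles can give different numbers.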

These differences explain why you're getting different numbers from these different ways of counting. The question is, what purpose are you using this count for? That will help answer which way of counting you should actually use.