Sharing Executable Memory Pages in Linux

Sharing executable memory pages in Linux?

As geekosaur said, Linux already does this.

At application startup the dynamic linker (ld.so) mmap()s the shared libraries.
It performs several calls to mmap() for each library:

mmap(PROT_READ|PROT_EXEC) for the executable section (i.e. .text)
mmap(PROT_READ|PROT_WRITE) for the data (i.e. .data and .bss)

(You can check this for yourself using strace.)
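Besides strace, the result of those mmap() calls can be inspected from inside a process. The following is a minimal sketch, assuming a Linux system with /proc mounted; the function name count_exec_file_mappings is made up for this illustration. It prints every file-backed mapping of the calling process that carries the execute permission, which should include the .text mappings described above.

```c
#include <stdio.h>

/* Print each file-backed, executable mapping of the calling process and
 * return how many were found, or -1 if /proc/self/maps cannot be opened.
 * (Hypothetical helper name, for illustration only.) */
int count_exec_file_mappings(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL)
        return -1;

    char line[4096];
    int count = 0;
    while (fgets(line, sizeof line, f) != NULL) {
        char perms[5] = "";
        char path[4096] = "";
        /* Line format: start-end perms offset dev inode [path] */
        int n = sscanf(line, "%*s %4s %*s %*s %*s %4095[^\n]", perms, path);
        /* Keep only mappings with 'x' set that are backed by a real file
         * (this skips pseudo-entries such as [vdso] and [heap]). */
        if (n == 2 && perms[2] == 'x' && path[0] == '/') {
            fputs(line, stdout);
            count++;
        }
    }
    fclose(f);
    return count;
}
```

Calling it from main() in a dynamically linked program would typically list the program itself, ld.so, and each shared library, all with r-xp permissions.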

The kernel, being a clever little bit of code, realises that the executable section, identified by offset and the inode (known through the fd), is already mapped. As it's read-only there's no point in allocating more memory for it.

This also means that if you mmap() any other file read-only from several applications, its pages will likewise occupy physical memory only once.

Linux shared library loading and sharing the code with other process

To be precise, it's not ld.so's job to reserve physical memory or to manage or choose the mapping between virtual and physical memory, it's the kernel's job. When ld.so loads a shared library, it does so through the mmap syscall, and the kernel allocates the needed physical memory⁽¹⁾ and creates a virtual mapping between the library file and the physical memory. What is then returned by mmap is the virtual base address of the mapped library, which will then be used by the dynamic loader as a base to service calls to functions of that library.

Is ld.so going to identify that this shared library is already loaded to the physical memory? How does it work to understand that?

It's not ld.so, but the kernel that is going to identify this. It's a complicated process, but to make it simple, the kernel keeps track of which file is mapped where, and can detect when a request is made to map an already mapped file again, avoiding physical memory allocation if possible.

If the same file (i.e. the same inode, not merely the same path string) is mapped multiple times, the kernel will look at the pages it already holds for that file, and if possible it will reuse the same physical pages to avoid wasting memory. So ideally, if a shared library is loaded multiple times, it is physically allocated only once.

In practice it's not that simple though. Since memory can also be written to, this "sharing" of physical pages can obviously only occur if the page that needs to be shared is unchanged from the original content of the file (otherwise different processes mapping the same file or library would interfere with each other). This is basically always true for code sections (.text) since they are usually read-only, and for other similar sections (like read-only data). It can also happen for RW sections if they are not modified⁽²⁾. So in short, the .text segments of already loaded libraries are usually only allocated into physical memory once.

(1) Actually, the kernel creates the mapping first, and then only allocates physical memory if the process tries to read or write to it through the mapping. This prevents wasting memory when it's not needed.

(2) This technique of sharing physical memory is managed through a copy-on-write mechanism where the kernel initially maps "clean" pages and marks them as "dirty" when they are written to, duplicating them as needed.
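Footnote (1) can be observed from user space. The sketch below is only an illustration under stated assumptions: it uses an anonymous mapping rather than a library, relies on the Linux mincore() call to ask whether a page is resident, and the names page_is_resident and demand_paging_demo are invented here. On a typical Linux kernel the page is reported as not resident before the first access and resident afterwards, though some sandboxed environments report every page as resident.

```c
#define _DEFAULT_SOURCE 1
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return 1 if the page starting at addr is resident in RAM, 0 if it is
 * not, and -1 on error. addr must be page-aligned. */
int page_is_resident(void *addr)
{
    unsigned char vec = 0;
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || mincore(addr, (size_t)page, &vec) != 0)
        return -1;
    return vec & 1;
}

/* Map one anonymous page, record its residency before and after the
 * first write, then unmap it. Returns 0 on success, -1 on error. */
int demand_paging_demo(int *before, int *after)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    void *mem = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    *before = page_is_resident(mem);   /* mapping exists, page not yet touched */
    *(volatile char *)mem = 1;         /* first write triggers a page fault */
    *after = page_is_resident(mem);    /* the kernel has now allocated the page */

    munmap(mem, (size_t)page);
    return 0;
}
```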

Can i force linux kernel to use particular memory pages for new executable

No, not without making heavy changes to the kernel. New anonymous pages are always zero-filled, and even if you could fill them with something else, there would be no reasonable way you could make them carry over data from old processes. Doing so would be a huge security hole in itself.
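The zero-fill guarantee is easy to check. This is a small sketch, assuming Linux and MAP_ANONYMOUS; the function name fresh_anonymous_memory_is_zero is hypothetical. It maps fresh anonymous memory and scans it for any non-zero byte.

```c
#define _DEFAULT_SOURCE 1
#include <stddef.h>
#include <sys/mman.h>

/* Map len bytes of fresh anonymous memory and scan it.
 * Returns 1 if every byte is zero, 0 if a non-zero byte is found,
 * and -1 if the mapping could not be created. */
int fresh_anonymous_memory_is_zero(size_t len)
{
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    int all_zero = 1;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            all_zero = 0;
            break;
        }
    }
    munmap(p, len);
    return all_zero;
}
```

Whatever a previous process left in the underlying physical pages is never visible through such a mapping, which is why old data cannot be carried over this way.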

How to allocate an executable page in a Linux kernel module?

#include <linux/vmalloc.h>
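#include <asm/pgtable_types.h>  /* PAGE_KERNEL_EXEC on x86 */
...
/* Note: this three-argument form of __vmalloc() only exists in kernels
 * before 5.8, which removed the pgprot_t parameter. */
char *p = __vmalloc(byte_size, GFP_KERNEL, PAGE_KERNEL_EXEC);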
...
if (p != NULL) vfree(p);

How to keep executable code in memory even under memory pressure ? in Linux

To answer the question, here's a simple, preliminary patch that stops the kernel from evicting Active(file) pages (as shown in /proc/meminfo) while they total less than 256 MiB. It seems to work OK (no disk thrashing) with linux-stable 5.2.4:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index dbdc46a84f63..7a0b7e32ff45 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2445,6 +2445,13 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
            BUG();
        }

+    if (NR_ACTIVE_FILE == lru) {
+      long long kib_active_file_now=global_node_page_state(NR_ACTIVE_FILE) * MAX_NR_ZONES;
+      if (kib_active_file_now <= 256*1024) {
+        nr[lru] = 0; //don't reclaim any Active(file) (see /proc/meminfo) if they are under 256MiB
+        continue;
+      }
+    }
        *lru_pages += size;
        nr[lru] = scan;
    }

Note that a regression in kernel 5.3.0-rc4-gd45331b00ddb will cause a system freeze (without disk thrashing, and sysrq will still work) even without this patch.


Executable Object Files and Virtual Memory

In general (not specifically for Linux)...

When an executable file is started, the OS (kernel) creates a virtual address space and an (initially empty) process, and examines the executable file's header. The executable file's header describes "sections" (e.g. .text, .rodata, .data, .bss, etc) where each section has different attributes - if the contents of the section should be put in the virtual address space or not (e.g. is a symbol table or something that isn't used at run-time), if the contents are part of the file or not (e.g. .bss), and if the area should be executable, read-only or read/write.
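For a concrete Linux instance of the above: in the ELF format, the kernel reads the program headers (segments that group sections with the same permissions) rather than the individual sections. The sketch below is an illustration only, assuming glibc or musl headers and a file of the same ELF class as the running program; the function name print_load_segments is invented for this example. It prints each loadable segment with its size in the file, its size in memory, and its permissions.

```c
#include <elf.h>
#include <link.h>    /* ElfW(): selects Elf32_* or Elf64_* for this platform */
#include <stdio.h>
#include <string.h>

/* Print every PT_LOAD program header of the ELF file at path.
 * Returns the number of PT_LOAD entries, or -1 on error.
 * Assumes the file has the same ELF class (32/64-bit) as this program. */
int print_load_segments(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    ElfW(Ehdr) eh;
    if (fread(&eh, sizeof eh, 1, f) != 1 ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0) {
        fclose(f);
        return -1;
    }

    int count = 0;
    for (unsigned i = 0; i < eh.e_phnum; i++) {
        ElfW(Phdr) ph;
        long pos = (long)eh.e_phoff + (long)i * eh.e_phentsize;
        if (fseek(f, pos, SEEK_SET) != 0 ||
            fread(&ph, sizeof ph, 1, f) != 1) {
            fclose(f);
            return -1;
        }
        if (ph.p_type != PT_LOAD)
            continue;   /* not mapped into the address space as a segment */
        /* memsz > filesz means the tail (e.g. .bss) is not stored in the file. */
        printf("LOAD offset=0x%llx filesz=0x%llx memsz=0x%llx flags=%c%c%c\n",
               (unsigned long long)ph.p_offset,
               (unsigned long long)ph.p_filesz,
               (unsigned long long)ph.p_memsz,
               (ph.p_flags & PF_R) ? 'r' : '-',
               (ph.p_flags & PF_W) ? 'w' : '-',
               (ph.p_flags & PF_X) ? 'x' : '-');
        count++;
    }
    fclose(f);
    return count;
}
```

Passing "/proc/self/exe" inspects the running program itself; the output is comparable to the LOAD lines of readelf -l.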

Typically, (used parts of) the executable file are cached by the virtual file system; and pieces of the file that are already in the VFS cache can be mapped (as "copy on write") into the new process' virtual address space. For parts that aren't already in the VFS cache, those pieces of the file can be mapped as "need fetching" into the new process' virtual address space.

Then the process is started (given CPU time).

If the process reads data from a page that hasn't been loaded yet, the OS (kernel) pauses the process, fetches the page from the file on disk into the VFS cache, maps the page as "copy on write" into the process, and then allows the process to continue (the process retries the read, which now works because the page is loaded).

If the process writes to a page that is still "copy on write", the OS (kernel) pauses the process, allocates a new page, copies the original page's data into it, replaces the original page with the process' own copy, and then allows the process to continue (the process retries the write, which now works because the process has its own copy).

If the process writes to a page that hasn't been loaded yet, the OS (kernel) combines both of the previous steps (fetches the original page from disk into the VFS cache, creates a copy, and maps the process' copy into the process' virtual address space).

If the OS starts to run out of free RAM, then:

Pages of file data that are in the VFS cache but aren't shared as "copy on write" with any process can be freed in the VFS without doing anything else. Next time the file is used those pages will be fetched from the file on disk into the VFS cache.
Pages of file data that are in the VFS cache and are also shared as "copy on write" with any process can be freed in the VFS, and the copies in any/all processes marked as "not fetched yet". Next time the file is used (including when a process accesses the "not fetched yet" page/s) those pages will be fetched from the file on disk into the VFS cache and then mapped as "copy on write" in the process/es.
Pages of data that have been modified (either because they were originally "copy on write" but got copied, or because they weren't part of the executable file at all, e.g. the .bss section, the executable's heap space, etc.) can be saved to swap space and then freed. When the process accesses the page/s again they will be fetched from swap space.

Note: If the executable file is stored on unreliable media (e.g. a potentially scratched CD), a "smarter than average" OS may load the entire executable file into the VFS cache and/or swap space up front. There is no sane way to handle a read error from a memory-mapped file while the process is using it, other than making the process crash (e.g. SIGSEGV), which makes it look like the executable was buggy when it was not. Loading everything first improves reliability, because the process then depends on more reliable swap rather than on a less reliable scratched CD.

Also, if the OS guards against file corruption or malware (e.g. has a CRC or digital signature built into executable files), then the OS may (should) load everything into memory (VFS cache) to check the CRC or digital signature before allowing the executable to be executed. On secure systems, in case the file on disk is modified while the executable is running, the OS may also store unmodified pages in "more trusted" swap space when freeing RAM (the same as it would if the page was modified) to avoid fetching the data from the original "less trusted" file, partly because you don't want to repeat the whole digital signature check every time a page is loaded from the file.

My question is: suppose the program changes the value of a global variable from 2018 to 2019 at run time. It seems that the virtual page containing the global variable will eventually be paged out to disk, which would mean the .data section now holds 2019. Have we then changed the executable object file, which is not supposed to be changed?

The page containing 2018 will begin as "not fetched", then (when it's accessed) be loaded into the VFS cache and mapped into the process as "copy on write". At either of these points the OS may free the memory and fetch the data (which hasn't been changed) from the executable file on disk if it's needed again.

When the process modifies the global variable (changes it to contain 2019) the OS creates a copy of it for the process. After this point, if the OS wants to free the memory the OS needs to save the page's data in swap space, and load the page's data back from swap space if it's accessed again. The executable file is not modified and (for that page, for that process) the executable file isn't used again.
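The same behaviour can be reproduced with an ordinary file standing in for the executable. This is a sketch under stated assumptions: Linux, a writable /tmp, a private (MAP_PRIVATE) file mapping playing the role of the .data page, and a made-up function name private_write_demo. The process overwrites "2018" with "2019" through the mapping and then re-reads the file, which still holds the original bytes.

```c
#define _DEFAULT_SOURCE 1
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write "2018" to a temporary file, map it privately, change the mapped
 * copy to "2019", then read the file again. On success returns 0 and
 * fills in_memory with the mapped bytes and on_disk with the file bytes
 * (both NUL-terminated). Returns -1 on any error. */
int private_write_demo(char on_disk[5], char in_memory[5])
{
    char path[] = "/tmp/cow-demo-XXXXXX";   /* hypothetical scratch file */
    char *p = MAP_FAILED;
    int rc = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                           /* file disappears once fd is closed */

    if (write(fd, "2018", 4) != 4)
        goto out;

    /* MAP_PRIVATE: writes go to a private copy of the page, not to the file. */
    p = mmap(NULL, 4, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        goto out;

    memcpy(p, "2019", 4);                   /* triggers the copy-on-write fault */
    memcpy(in_memory, p, 4);
    in_memory[4] = '\0';

    if (pread(fd, on_disk, 4, 0) == 4) {    /* read what the file itself contains */
        on_disk[4] = '\0';
        rc = 0;
    }
    munmap(p, 4);
out:
    close(fd);
    return rc;
}
```

The private copy is the page that would go to swap under memory pressure; the file is never written through this mapping.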
