How to Find How Much Memory Is Shared Between Forked Processes with Copy-On-Write in Linux

How to find how much memory is shared between forked processes with copy-on-write in Linux?

I don't know of a tool that would give you this information, but you can probably compute this based on /proc/[pid]/smaps:

   /proc/[pid]/smaps (since Linux 2.6.14)
          This file shows memory consumption for each of  the  process’s  mappings.
          For each of mappings there is a series of lines such as the following:

              08048000-080bc000 r-xp 00000000 03:02 13130      /bin/bash
              Size:               464 kB
              Rss:                424 kB
              Shared_Clean:       424 kB
              Shared_Dirty:         0 kB
              Private_Clean:        0 kB
              Private_Dirty:        0 kB

          The  first  of these lines shows the same information as is displayed for
          the mapping in /proc/[pid]/maps.  The remaining lines show  the  size  of
          the mapping, the amount of the mapping that is currently resident in RAM,
          the number of clean and dirty shared pages in the mapping, and the number
          of clean and dirty private pages in the mapping.

For details, see Getting information about a process' memory usage from /proc/pid/smaps.
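As a rough sketch of that computation, the following Python sums the per-mapping fields of smaps. The function names (sum_smaps, smaps_totals, shared_kb) are made up for illustration, and the code assumes every field line has the "Name:   N kB" shape shown in the excerpt above.

```python
import re
from collections import defaultdict

# Matches field lines such as "Shared_Clean:       424 kB".
FIELD_RE = re.compile(r"^(\w+):\s+(\d+) kB$")


def sum_smaps(text):
    """Sum every 'Name: N kB' field over all mappings in smaps-formatted text."""
    totals = defaultdict(int)
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            totals[match.group(1)] += int(match.group(2))
    return dict(totals)


def smaps_totals(pid="self"):
    """Read /proc/[pid]/smaps (Linux only) and return the summed fields in kB."""
    with open(f"/proc/{pid}/smaps") as f:
        return sum_smaps(f.read())


def shared_kb(totals):
    """Resident memory that is currently mapped by more than one process."""
    return totals.get("Shared_Clean", 0) + totals.get("Shared_Dirty", 0)
```

Calling smaps_totals(pid) for the parent and for each child and comparing the Shared_* totals with Private_Dirty gives an estimate. Pages that are still shared after the fork are counted under Shared_Clean or Shared_Dirty, while pages that have been copied show up as private. The Shared_* figures also include pages shared with unrelated processes (shared libraries, for example), so treat the result as an upper bound on copy-on-write sharing rather than an exact number.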

How does copy-on-write in fork() handle multiple fork?

If fork is called multiple times from the original parent process, the shared pages are marked read-only in the parent and in each of the children. When a child process attempts to write to such a page, the page is copied into the child's address space and the copy is marked writable in the child, but not in the parent.

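If fork is called from the child process, the parent, the child, and the grandchild all share the same read-only page for as long as none of them writes to it. When the grandchild attempts to write, it receives its own private copy, which is marked writable; the page is not copied into the first child unless the first child itself writes to it.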
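The visible effect of this across a chain of forks can be checked with a small sketch. It does not show the page tables themselves, only that a write in the grandchild never becomes visible in the child or the parent. The function name fork_chain_demo is invented for this example, and the buffer is deliberately mapped with MAP_PRIVATE, since a MAP_SHARED mapping would be shared for real and not be subject to copy-on-write.

```python
import mmap
import os


def fork_chain_demo():
    """Return (parent, child, grandchild) views of buf[0] after the grandchild writes."""
    # One private anonymous page; MAP_PRIVATE makes it copy-on-write across fork.
    buf = mmap.mmap(-1, mmap.PAGESIZE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    buf[0] = 1
    r, w = os.pipe()

    pid = os.fork()
    if pid == 0:  # first child
        gpid = os.fork()
        if gpid == 0:  # grandchild: this write triggers the copy
            buf[0] = 3
            os.write(w, bytes([buf[0]]))
            os._exit(0)
        os.waitpid(gpid, 0)
        # The child reports what it sees after the grandchild has written.
        os.write(w, bytes([buf[0]]))
        os._exit(0)

    os.waitpid(pid, 0)
    os.close(w)
    data = os.read(r, 2)  # byte 0 from the grandchild, byte 1 from the child
    os.close(r)
    return buf[0], data[1], data[0]
```

On Linux, fork_chain_demo() returns (1, 1, 3): only the grandchild sees its own write.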

Determine if memory after fork is copy-on-write

In general, in the sense of being portable to all POSIX conforming or POSIX-like systems, no, there is no way to observe COW, especially not at the individual page level (you might be able to observe it on a broader level just by "available" memory if the system provides such a figure). But on Linux you can observe it via /proc/[pid]/pagemap for the potentially-sharing processes. /proc/kpagecount and /proc/kpageflags may also contain relevant information but you need root to access them. See:

https://www.kernel.org/doc/Documentation/vm/pagemap.txt
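As a hedged sketch of what reading pagemap looks like, the Python below decodes one 64-bit entry using the bit layout described in that kernel document. The helper names decode_pagemap_entry and read_pagemap_entry are invented for this example. Two assumptions to keep in mind: the "exclusively mapped" bit (bit 56) only exists on Linux 4.2 and later, and on recent kernels the PFN field reads as zero unless the caller has CAP_SYS_ADMIN.

```python
import os
import struct

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
PFN_MASK = (1 << 55) - 1  # bits 0-54


def decode_pagemap_entry(entry):
    """Split one 64-bit pagemap entry into its documented flag bits."""
    return {
        "present": bool((entry >> 63) & 1),
        "swapped": bool((entry >> 62) & 1),
        "file_or_shared_anon": bool((entry >> 61) & 1),
        "exclusive": bool((entry >> 56) & 1),  # mapped by exactly one process
        "soft_dirty": bool((entry >> 55) & 1),
        # Page frame number if present (swap type and offset if swapped).
        "pfn": entry & PFN_MASK,
    }


def read_pagemap_entry(vaddr, pid="self"):
    """Read the pagemap entry covering virtual address vaddr (Linux only)."""
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek((vaddr // PAGE_SIZE) * 8)  # one 8-byte entry per virtual page
        data = f.read(8)
    if len(data) != 8:
        raise ValueError("address is outside the range covered by pagemap")
    (entry,) = struct.unpack("=Q", data)  # native byte order, as the kernel writes it
    return decode_pagemap_entry(entry)
```

A page that is present but not "exclusive" right after a fork is still being shared; once a process has written to it and received its own copy, the entry for that address in that process should report "exclusive". The virtual address has to come from somewhere else, for instance ctypes.addressof() on a ctypes buffer.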

Is it possible to monitor copy on write for forked linux processes? (specifically python)

To answer your specific question "is there a way to verify this?", if I understand it correctly, you can do the following if you want to see whether there are any changes associated with the pages that contain the large object.

1) Determine the address of your "large shared object" and the address of where that object ends.

2) If the start address is not on a 4K page boundary, round the start address down to the page boundary before where the object starts.

3) If the end address is not on a 4K boundary, round the end address up to the page boundary after where the object ends.

4) Dump that memory range for the process and all its children to separate files and compare them.
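Steps 2 to 4 can be sketched roughly as follows. The names page_range and dump_range are invented for this example; the sketch takes the page size from mmap.PAGESIZE instead of hard-coding 4K, and it assumes you are allowed to read /proc/[pid]/mem for the target process (this normally requires ptrace-level access, which a parent usually has over its own children).

```python
import mmap


def page_range(start, length, page_size=mmap.PAGESIZE):
    """Round [start, start + length) outward to page boundaries."""
    mask = ~(page_size - 1)
    first = start & mask                             # round start down
    last = (start + length + page_size - 1) & mask   # round end up
    return first, last


def dump_range(pid, first, last):
    """Read the bytes in [first, last) from another process's memory (Linux only)."""
    # Unbuffered, so Python does not read ahead past the requested range.
    with open(f"/proc/{pid}/mem", "rb", buffering=0) as f:
        f.seek(first)
        return f.read(last - first)
```

Comparing dump_range() output for the parent and each child page by page shows which pages have diverged. Note that this only reveals differing contents; a page that was copied but rewritten with identical bytes would still compare equal.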

However, I think that the question "Will os.fork() use copy on write or do a full copy of the parent-process in Python?" may already provide an explanation for at least some of the writing that is requiring copying. Specifically, Python objects are reference counted and your child processes will be altering reference counts.

Have you considered using Python's threading rather than creating child processes?

Is a call to free() in the forked process causing a copy-on-write?

Yes, it certainly does.

Memory copy-on-write (CoW) happens on a different layer than malloc()/free().

When a process is forked, the child process has all its mapped pages marked as shared from the parent (and thus read-only). When the child modifies a shared page, it triggers a page fault and only then does the operating system copy the data to another area in the physical RAM (and change the mapping for the process).

malloc() and free() do not allocate physical RAM. They are memory management functions, with memory defined as "the (virtual) address space of a process". Thus, these C library functions keep track of an internal state of allocated memory chunks, and malloc() and free() only modify these libc-internal data structures (with the exception of requesting more address space from the OS when malloc()-ing). Physical RAM allocation only happens at page fault, most commonly when a process accesses newly assigned memory for the first time.

In this respect, yes. As free() must modify memory to mark a region as freed, it will write to the relevant region, and at the lower level cause a remapping (i.e. CoW).

Which segments are affected by a copy-on-write?

The OS can set whatever "copy on write" policy it wishes, but generally, they all do the same thing (i.e. what makes the most sense).

Loosely, for a POSIX-like system (Linux, BSD, OSX), there are four areas (what you were calling segments) of interest: data (where int x = 1; goes), bss (where int y; goes), sbrk (this is heap/malloc), and stack.

When a fork is done, the OS sets up a new page map for the child that shares all the pages of the parent. Then, in the page maps of the parent and the child, all the pages are marked read-only.

Each page also has a reference count that indicates how many processes are sharing it. Before the fork, the refcount will be 1 and, after, it will be 2.

Now, when either process tries to write to a R/O page, it will get a page fault. The OS will see that this is for "copy on write", will create a private page for the process, copy in the data from the shared page, mark the new page as writable for that process and resume it.

It will also bump down the refcount. If the refcount is now [again] 1, the OS will mark the page in the other process as writable and non-shared [this eliminates a second page fault in the other process--a speedup only because at this point the OS knows that the other process should be free to write unmolested again]. This speedup could be OS dependent.
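The bookkeeping described above can be modeled with a toy simulation. This is purely illustrative Python, not how any kernel is written; the Frame and Process classes are invented for this sketch. It also takes the lazy variant of the speedup: the last remaining sharer is made writable on its own next write fault, rather than eagerly when the other process copies.

```python
class Frame:
    """A simulated physical page with a sharing reference count."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.refcount = 1


class Process:
    """A simulated process: page table maps page number -> [frame, writable]."""

    def __init__(self, table=None):
        self.table = table if table is not None else {}

    def map_page(self, vpn, data=b"\0"):
        self.table[vpn] = [Frame(data), True]

    def fork(self):
        child_table = {}
        for vpn, entry in self.table.items():
            entry[1] = False               # parent's mapping becomes read-only
            entry[0].refcount += 1         # one more process shares the frame
            child_table[vpn] = [entry[0], False]
        return Process(child_table)

    def read(self, vpn):
        return bytes(self.table[vpn][0].data)

    def write(self, vpn, data):
        entry = self.table[vpn]
        frame, writable = entry
        if not writable:                   # simulated write-protection fault
            if frame.refcount > 1:         # still shared: take a private copy
                frame.refcount -= 1
                frame = Frame(frame.data)
                entry[0] = frame
            entry[1] = True                # sole owner now: just make it writable
        frame.data[:] = data
```

For example, after p = Process(); p.map_page(0, b"A"); c = p.fork(), both processes point at one frame with refcount 2. A call to c.write(0, b"B") gives the child its own frame and drops the original's refcount back to 1, so a later p.write(0, b"C") makes the parent's page writable without any copying.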

Actually, the bss section gets even more special treatment. In the initial page mapping for it, all pages are mapped to a single page that contains all zeroes (aka the "zero page"). The mapping is marked R/O. So, the bss area could be gigabytes in size and it will only occupy a single physical page. This single, special, zero page is shared amongst all bss sections of all processes, regardless of whether they have any relationship to one another at all.

Thus, a process can read from any page in the area and gets what it expects: zero. Only when the process tries to write to such a page does the same copy on write mechanism kick in: the process gets a private page, the mapping is adjusted, and the process is resumed. It is now free to write to the page as it sees fit.

Once again, an OS can choose its policy. For example, after the fork, it might be more efficient to share most of the stack pages, but start off with private copies of the "current" page, as determined by the value of the stack pointer register.

When an exec syscall is done [on the child], the kernel has to undo much of the mapping done during the fork [bumping down refcounts], releasing the child's mapping, etc., and restoring the parent's original page protections (i.e. the parent will no longer be sharing its pages unless it does another fork).

Although not part of your original question, there are related activities that may be of interest, such as on demand loading [of pages] and on demand linking [of symbols] after an exec syscall.

When a process does an exec, the kernel does the cleanup above, and reads a small portion of the executable file to determine its object format. The dominant format is ELF, but any format that a kernel understands can be used (e.g. OSX uses its own Mach-O format rather than ELF).

For ELF, the executable has a special section that gives a full FS path to what's known as the "ELF interpreter", which is a shared library, and is usually /lib64/ld-linux-x86-64.so.2 on x86-64 Linux.
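To make this concrete, here is a minimal sketch that extracts the interpreter path from the raw bytes of an executable by walking its program headers and looking for the PT_INTERP entry. The function name elf_interpreter is invented for this example, and the sketch only handles 64-bit little-endian ELF files; a real tool such as readelf covers far more cases.

```python
import struct

PT_INTERP = 3  # program header type that holds the interpreter path


def elf_interpreter(data):
    """Return the PT_INTERP path of a 64-bit little-endian ELF image, or None."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")

    # ELF64 header fields: e_phoff at 0x20, e_phentsize at 0x36, e_phnum at 0x38.
    (e_phoff,) = struct.unpack_from("<Q", data, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 0x36)

    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        # ELF64 program header: p_type, p_flags, p_offset, ..., p_filesz at +0x20.
        p_type, _p_flags, p_offset = struct.unpack_from("<IIQ", data, off)
        (p_filesz,) = struct.unpack_from("<Q", data, off + 0x20)
        if p_type == PT_INTERP:
            path = data[p_offset:p_offset + p_filesz]
            return path.rstrip(b"\0").decode()
    return None  # e.g. a statically linked executable
```

Passing in the contents of a dynamically linked binary, e.g. elf_interpreter(open("/bin/ls", "rb").read()), should return the interpreter path recorded in that binary.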

The kernel, using an internal form of mmap, will map this into the application space, and set up a mapping for the executable file itself. Most things are marked as R/O pages and "not present".

Before we go further, we need to talk about the "backing store" for a page. That is, if a page fault occurs and we need to load the page from disk, where it comes from. For heap/malloc, this is generally the swap disk [aka paging disk].

Under Linux, it's generally the partition of the type "linux swap" that was added when the system was installed. When a page that has been written to has to be flushed to disk to free up some physical memory, it gets written there. Note that the page sharing algorithm in the first section still applies.

Anyway, when an executable is first mapped into memory, its backing store is the executable file in the filesystem.

So, the kernel sets the app's program counter to point to the starting location of the ELF interpreter, and transfers control to it.

The ELF interpreter goes about its business. Every time it tries to execute a portion of itself [a "code" page] that is mapped but not loaded, a page fault occurs and the kernel loads that page from the backing store (e.g. the ELF interpreter's file) and changes the mapping to R/O but present.

This occurs for the ELF interpreter, shared libraries, and the executable itself.

The ELF interpreter will now use mmap to map libc into the app space [again, subject to the demand loading]. If the ELF interpreter has to modify a code page to relocate a symbol [or tries to write to any page that has the file as the backing store, like a data page], a protection fault occurs, the kernel changes the backing store for the page from the on-disk file to a page on the swap disk, adjusts the protections, and resumes the app.

The kernel must also handle the case where the ELF interpreter (e.g.) is trying to write to [say] a data page that has never yet been loaded (i.e. it has to load it first and then change the backing store to the swap disk).

The ELF interpreter then uses portions of libc to help it complete initial linking activities. It relocates the minimum necessary to allow it to do its job.

However, the ELF interpreter does not relocate anywhere near all the symbols for most other shared libraries. It will look through the executable and, again using mmap, create a mapping for the shared libraries the executable needs (i.e. what you see when you do ldd executable).

These mappings to shared libraries and executables can be thought of as "segments".

There is a symbol jump table that points back to the interpreter in each shared library. But, the ELF interpreter makes minimal changes.

[Note: this is a loose explanation] Only when the application tries to call a given function's jump entry [this is that GOT et al. stuff you may have seen] does a relocation occur. The jump entry transfers control to the interpreter, which locates the real address of the symbol, adjusts the GOT so that it now points directly to the final address for the symbol, and redoes the call, which will now call the real function. On a subsequent call to the same function, it goes direct.

This is called "on demand linking".

A by-product of all this mmap activity is that the classical sbrk syscall is of little to no use. It would soon collide with one of the shared library memory mappings.

So, modern libc relies on it far less. When malloc needs more memory from the OS, it can request it from an anonymous mmap and keep track of which allocations belong to which mmap mapping (i.e. if enough memory got freed to comprise an entire mapping, free could do an munmap). glibc, for instance, still uses brk for small allocations but serves large requests from anonymous mmap.

So, to sum up, we have "copy on write", "on demand loading", and "on demand linking" all going on at the same time. This adds some complexity, but it makes fork and exec go quickly and smoothly, because the extra overhead is incurred only when needed ("on demand").

Thus, instead of a large lurch/delay at the beginning launch of a program, the overhead activity gets spread out over the lifetime of the program, as needed.
