The Only Overhead Incurred by Fork Is Page Table Duplication and Process Id Creation

Why does fork() flag each page in both processes as read-only?

If each page in the parent process is read-only then the parent process will never be able to modify some uninitialised global variables

That would only be true if the pages stayed read-only. But they don't, as the next part of the sentence says:

and flags each area struct in both processes as private copy-on-write

Each page starts off as read-only so that a single copy can be shared by both parent and child. If either process tries to modify such a page, only at that point will a writable copy be made (if the page is indeed meant to be writable). After the copy, the writing process can make any changes it likes without affecting the other process's original (still read-only) page.

This can save memory for pages that neither parent nor child actually ends up changing.
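A minimal sketch of what this looks like from user space, assuming a POSIX system. The function name cow_demo and the variable counter are made up for illustration. The child overwrites a global after fork(), and the parent still sees the old value because the write went to the child's private copy of the page:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 1;   /* lives in a private, writable data page */

/* Hypothetical helper: forks, lets the child overwrite `counter`, and
 * returns the value the parent sees afterwards. The value the child saw
 * is passed back through its exit status. Returns -1 on error. */
int cow_demo(int *child_saw)
{
    assert(child_saw != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        counter = 42;      /* the write lands in the child's own copy */
        _exit(counter);    /* report the child's value to the parent */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;

    *child_saw = WEXITSTATUS(status);
    return counter;        /* the parent's copy was never modified */
}
```

Calling cow_demo(&c) from main should give a return value of 1 with c set to 42. The program cannot observe when the kernel actually copies the page, only that the two processes end up isolated from each other.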

Sharing the address space versus duplicating the page table entries

Let's say your process has a variable X at virtual address 100, backed by physical address 200.
The PTE holds the mapping from virtual 100 to physical 200.

After the fork, each process (parent and child) has its own private PTE. At this point both PTEs map virtual 100 to physical 200.

As long as both processes only read from that address, they both read from physical address 200.

When one of them first tries to write there, the data at physical address 200 is copied to a new physical location, let's say 300, and that process's PTE (and only its PTE) is updated so that virtual 100 maps to physical 300. This is transparent to the process, because it still uses the same (virtual) address.

Note: this is a simplification; the real mechanism works at page granularity.
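The 100/200/300 example can be played through with a toy model. This is only a sketch: struct toy_pte, phys_mem, next_free and the toy_* functions are invented for illustration and have nothing to do with real kernel data structures. Like the description above, the toy always copies on the first write after a fork.

```c
#include <assert.h>

#define PHYS_SIZE 512

static int phys_mem[PHYS_SIZE];   /* toy physical memory, one int per address */
static int next_free = 300;       /* hypothetical allocator: next unused address */

struct toy_pte {
    int virt;       /* virtual address, e.g. 100 */
    int phys;       /* physical address it maps to, e.g. 200 */
    int writable;   /* 0 while the mapping is write-protected after fork */
};

/* fork(): the child gets its own PTE pointing at the same physical
 * address, and both PTEs are write-protected. */
struct toy_pte toy_fork(struct toy_pte *parent)
{
    parent->writable = 0;
    return *parent;
}

int toy_read(const struct toy_pte *pte)
{
    return phys_mem[pte->phys];
}

/* A write through a protected PTE first copies the data to a fresh
 * physical address and remaps only this PTE. */
void toy_write(struct toy_pte *pte, int value)
{
    if (!pte->writable) {
        assert(next_free < PHYS_SIZE);
        phys_mem[next_free] = phys_mem[pte->phys];
        pte->phys = next_free++;
        pte->writable = 1;
    }
    phys_mem[pte->phys] = value;
}
```

Starting from a parent PTE { 100, 200, 1 }, a toy_fork followed by toy_write on the child leaves the child at virtual 100 and physical 300, while the parent still reads the old value from physical 200.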

How does fork() mark the parent's PTEs as read-only?

Linux implements the fork syscall by iterating over all memory ranges (mmaps, stack and heap) of the parent process. These ranges (VMAs, virtual memory areas) are copied by the function copy_page_range (mm/memory.c), which loops over the page table levels:

  • copy_page_range will iterate over pgd and call
  • copy_pud_range to iterate over pud and call
  • copy_pmd_range to iterate over pmd and call
  • copy_pte_range to iterate over pte and call
  • copy_one_pte, which does memory usage accounting (RSS) and has several code segments to handle the COW case:
    /*
     * If it's a COW mapping, write protect it both
     * in the parent and the child
     */
    if (is_cow_mapping(vm_flags)) {
        ptep_set_wrprotect(src_mm, addr, src_pte);
        pte = pte_wrprotect(pte);
    }

where is_cow_mapping is true for private and potentially writable mappings: the flags bitfield is checked for the shared and maywrite bits, and only the maywrite bit may be set.

#define VM_SHARED   0x00000008
#define VM_MAYWRITE 0x00000020

static inline bool is_cow_mapping(vm_flags_t flags)
{
    return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
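To see which flag combinations count as COW, the check can be lifted into a small user-space program. This is a sketch: the typedef for vm_flags_t is a stand-in (assumed to be an unsigned integer type), and cow_mapping_examples is a made-up helper. The two flag values and the check itself are the ones quoted above.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long vm_flags_t;   /* stand-in for the kernel typedef */

#define VM_SHARED   0x00000008
#define VM_MAYWRITE 0x00000020

static inline bool is_cow_mapping(vm_flags_t flags)
{
    return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* Hypothetical helper that walks through the four combinations. */
void cow_mapping_examples(void)
{
    /* private + may be written: the only COW case */
    assert(is_cow_mapping(VM_MAYWRITE));

    /* private but can never be written: nothing to copy */
    assert(!is_cow_mapping(0));

    /* shared mappings are never COW, writable or not */
    assert(!is_cow_mapping(VM_SHARED));
    assert(!is_cow_mapping(VM_SHARED | VM_MAYWRITE));

    /* other bits (here 0x1, picked arbitrarily) do not affect the result */
    assert(is_cow_mapping(VM_MAYWRITE | 0x1));
}
```

Writes to a shared mapping are supposed to be visible to both processes, which is why the VM_SHARED bit rules out copy-on-write.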

PUD, PMD, and PTE are described in books like https://www.kernel.org/doc/gorman/html/understand/understand006.html and in articles like LWN 2005: "Four-level page tables merged".

How the fork implementation reaches copy_page_range:

  • The fork syscall implementation (sys_fork, or SYSCALL_DEFINE0(fork), depending on the kernel version) is do_fork (kernel/fork.c), which calls
  • copy_process which will call many copy_* functions, including
  • copy_mm which calls
  • dup_mm to allocate and fill new mm struct, where most work is done by
  • dup_mmap (still kernel/fork.c), which checks what was mmapped and how. (I was unable to trace the exact path to the COW implementation here, so I used a web search for something like "fork+COW+dup_mm" to get hints like [1] or [2] or [3].) After checking the mmap types, the line retval = copy_page_range(mm, oldmm, mpnt); does the real work.

Child-runs-first semantics in old Linux kernels

fork() creates a copy of the parent's memory address space where all memory pages are initially shared between the parent and the child. All pages are marked as read-only, and on the first write to such a page, the page is copied so that parent and child each have their own. (This is what COW is about.)

exec() throws away the entire current address space and creates a new one for the new program.

  • If the child executes first and calls exec(), then none of the shared pages needs to be unshared.
  • If the parent executes first and modifies some data, then these pages are unshared. If the child then starts executing and calls exec(), the copied pages will be thrown away, i.e., the unsharing was not actually necessary.
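The fork-then-exec pattern those two cases are about looks roughly like this. It is a sketch assuming a POSIX system; run_program is a made-up name, and the exit code 127 for a failed exec is just the usual shell convention. The child does almost nothing between fork() and exec(), so it barely touches the shared pages before its address space is replaced.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs `prog` with no arguments in a child process
 * and returns its exit status, or -1 if fork/wait failed. */
int run_program(const char *prog)
{
    assert(prog != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* exec() discards the address space inherited from the parent. */
        execlp(prog, prog, (char *)NULL);
        _exit(127);   /* only reached if exec failed */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For example, run_program("true") should return 0 on a system where the true utility is on the PATH. Which process runs first after fork() is decided by the scheduler; nothing in this code controls it.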

If you fork() and the forked (child) process exits, are all the VM pages still marked COW in the parent?

Apart from the copy-on-write marking, there is also a reference count for each physical page. So when a process forks, all private writable pages in the parent are marked COW, and their reference counts are incremented.

Then, while the child process is running, if the parent writes to a page, it gets a page fault, the page is copied as you would expect, and the reference count is decreased. When the child exits, it decreases all its page references by one, and pages whose reference count reaches zero are freed.

Now, when the parent writes to a page that is still marked COW but has a reference count of one, no copy is needed: the COW marking is simply ignored and the page becomes writable again.
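The reference-count logic can be sketched with another toy model. Everything here (struct rc_page, struct rc_pte, the rc_* functions, the copies_made counter) is invented for illustration and is much simpler than what the kernel really does; it only shows the "copy if shared, reuse if sole owner" decision.

```c
#include <assert.h>

#define NPAGES 4

struct rc_page { int data; int refs; };               /* toy physical page */
struct rc_pte  { int page; int write_protected; };    /* toy page table entry */

static struct rc_page pages[NPAGES];
static int copies_made;   /* counts how many real copies happened */

/* fork(): share the page, write-protect both PTEs, bump the count. */
void rc_fork(struct rc_pte *parent, struct rc_pte *child)
{
    parent->write_protected = 1;
    *child = *parent;
    pages[parent->page].refs++;
}

/* exit(): drop this process's reference; refs == 0 means "free". */
void rc_exit(struct rc_pte *pte)
{
    pages[pte->page].refs--;
    pte->page = -1;
}

static int rc_alloc(void)
{
    for (int i = 0; i < NPAGES; i++)
        if (pages[i].refs == 0)
            return i;
    return -1;
}

/* Write fault: copy only if somebody else still references the page. */
void rc_write(struct rc_pte *pte, int value)
{
    if (pte->write_protected) {
        struct rc_page *old = &pages[pte->page];
        if (old->refs > 1) {
            int n = rc_alloc();
            assert(n >= 0);
            pages[n].data = old->data;
            pages[n].refs = 1;
            old->refs--;
            pte->page = n;
            copies_made++;
        }
        pte->write_protected = 0;   /* sole owner: just allow the write */
    }
    pages[pte->page].data = value;
}
```

If the child calls rc_exit before the parent calls rc_write, the count is back at one and copies_made stays at zero, which is the case described in the last paragraph.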


