How does copy-on-write in fork() handle multiple fork?
If fork is called multiple times from the original parent process, the parent and each of the children have their pages marked read-only. When a child process attempts to write data, the page is copied into the child's address space, and the copy is marked writeable in the child but not in the parent.
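If fork is called from the child process and the grandchild attempts to write, the fault is handled for the grandchild only: the shared page is copied into a new page for the grandchild, and that copy is marked writeable. The original parent and the first child keep sharing the original read-only page until one of them writes to it.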
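A minimal sketch of the first case (several forks from the same parent), assuming a POSIX system. The names demo_multiple_forks and shared_value are made up for illustration. Each child reports the value it sees through its exit status, so values have to stay below 256.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Value that lives in a private, writable page of the parent. */
static int shared_value = 10;

/*
 * Fork `nchildren` children from the same parent. Each child adds its
 * 1-based index to its own view of shared_value and reports the result
 * through its exit status. child_results[i] receives what child i saw.
 * Returns the parent's shared_value afterwards, or -1 on error.
 * (Sketch only: children already started are not reaped on a fork error.)
 */
int demo_multiple_forks(int nchildren, int *child_results)
{
    pid_t pids[16];

    if (nchildren < 1 || nchildren > 16)
        return -1;

    for (int i = 0; i < nchildren; i++) {
        pids[i] = fork();
        if (pids[i] < 0)
            return -1;
        if (pids[i] == 0) {
            /* Child: this store lands in a page private to the child,
             * typically created by a COW fault on the first write. */
            shared_value += i + 1;
            _exit(shared_value);
        }
    }

    for (int i = 0; i < nchildren; i++) {
        int status = 0;
        if (waitpid(pids[i], &status, 0) < 0 || !WIFEXITED(status))
            return -1;
        child_results[i] = WEXITSTATUS(status);
    }

    /* The parent's copy was never modified by any child. */
    return shared_value;
}
```

Called as demo_multiple_forks(3, results) from main, the children should report 11, 12 and 13, while the parent still sees 10.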
Does parent process lose write ability during copy on write?
Right, if either process writes a COW page, it triggers a page fault.
In the page fault handler, if the page is supposed to be writeable, it allocates a new physical page and does a memcpy(newpage, shared_page, pagesize), then updates the page table of whichever process faulted to map the new page to that virtual address. It then returns to user-space so the store instruction can re-run.
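The fault path described above can be modelled in plain user-space C. This is only a toy model, not kernel code: struct page, struct pte and the toy_* functions are hypothetical names, and a reference count stands in for whatever the kernel uses to decide whether a page is still shared.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Toy "physical page" with a count of how many mappings share it. */
struct page {
    int refcount;
    unsigned char data[TOY_PAGE_SIZE];
};

/* Toy page-table entry: which page is mapped, and whether stores are allowed. */
struct pte {
    struct page *page;
    int writable;
};

/* Model of fork(): the child maps the same page, both sides become read-only. */
void toy_fork_pte(struct pte *parent, struct pte *child)
{
    parent->writable = 0;
    child->page = parent->page;
    child->writable = 0;
    parent->page->refcount++;
}

/* Model of the write-fault handler. Returns 0 on success, -1 if out of memory. */
int toy_write_fault(struct pte *pte)
{
    if (pte->page->refcount > 1) {
        /* Still shared: copy into a fresh page and remap only the faulter. */
        struct page *newpage = malloc(sizeof(*newpage));
        if (newpage == NULL)
            return -1;
        memcpy(newpage->data, pte->page->data, TOY_PAGE_SIZE);
        newpage->refcount = 1;
        pte->page->refcount--;
        pte->page = newpage;
    }
    /* Sole owner now, either after the copy or because the others left. */
    pte->writable = 1;
    return 0;
}

/* Model of a store instruction: fault first if read-only, then do the store. */
int toy_store(struct pte *pte, size_t offset, unsigned char value)
{
    if (offset >= TOY_PAGE_SIZE)
        return -1;
    if (!pte->writable && toy_write_fault(pte) != 0)
        return -1;
    pte->page->data[offset] = value;
    return 0;
}
```

Note that in this model the last remaining owner of a page gets write access back without any copy, which is why the parent does not permanently lose the ability to write.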
This is a win for something like fork, because one process typically makes an execve system call right away, after touching typically one page (of stack memory). execve destroys all memory mappings for that process, effectively replacing it with a new process. The parent once again has the only copy of every page. (Except pages that were already copy-on-write, e.g. memory allocated with mmap is typically COW-mapped to a single physical page of zeros, so reads can hit in L1d cache.)
A smart optimization would be for fork to actually copy the page containing the top of the stack, but still do lazy COW for all the other pages, on the assumption that the child process will normally execve right away and thus drop its references to all the other pages. It still costs a TLB invalidation in the parent to temporarily flip all the pages to read-only and back, though.
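For reference, the fork-then-exec pattern this reasoning relies on looks roughly like the sketch below. It assumes a POSIX system where /bin/sh exists, and it uses execl (a libc wrapper around execve) for brevity. run_in_child is a made-up helper name.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork and immediately exec `/bin/sh -c <command>` in the child.
 * Returns the command's exit status, or -1 on error.
 */
int run_in_child(const char *command)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: touches little more than its stack before the exec
         * replaces all of its memory mappings. */
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127); /* Only reached if execl failed. */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For example, run_in_child("exit 7") should return 7. Between fork and exec the child writes almost nothing, so almost none of the shared pages ever need to be copied.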
Does fork() in linux copy all the parent's memory pages to the child?
As long as no write is performed on a memory page, the child's view of it is identical to the parent's, as far as a user process can tell. Therefore, as long as the page is not written to, the same physical page can be used for both the parent and the child.
If, however, a write is performed, the parent's and child's versions must differ. At this point the page is copied, and the writing process gets the copy in place of the shared page. This is called "copy on write" because the copy is performed when the page is written to.
Note that "copy on write" is just an optimization of the forking operation. A naive implementation simply duplicates the pages of the parent for the child instantly. By noticing that pages not yet written to do not require that duplication, the copy is postponed until a page is actually written to (the term "laziness" is often used for that delay), which might not happen at all.
How does copy-on-write work in fork()?
It depends on the operating system, hardware architecture and libc. But yes, on recent Linux with an MMU, fork(2) works with copy-on-write. It only (allocates and) copies a few system structures and the page table; the heap pages actually point to those of the parent until written.
More control over this can be exercised with the clone(2) call. vfork(2) is a special variant which does not expect the pages to be used; it is typically used right before exec().
As for the allocation: malloc() keeps meta information about the requested memory blocks (address and size), and the C variable is a pointer (both live in process memory, on the heap and the stack). Those two look the same for the child (same values, because the same underlying memory page is seen in the address space of both processes). So from the C program's point of view the array is already allocated and the variable initialized when the process comes into existence. The underlying memory pages, however, point to the original physical ones of the parent process, so no extra memory pages are needed until they are modified.
If the child allocates a new array it depends if it fits into the already existing heap pages or if the brk of the process needs to be increased. In both cases only the modified pages get copied and the new pages get allocated only for the child.
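A small sketch of this situation, assuming a POSIX system and a single-threaded process. demo_heap_after_fork is a made-up name. The child inherits the array at the same address, modifies it, and allocates a second array of its own; the parent's array stays untouched.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Allocate an array in the parent, fork, and let the child both modify the
 * inherited array and allocate a new one. Returns 0 if the child saw the
 * parent's values and the parent's array is unchanged afterwards, else -1.
 */
int demo_heap_after_fork(void)
{
    enum { N = 4 };
    int *arr = malloc(N * sizeof(*arr));
    if (arr == NULL)
        return -1;
    for (int i = 0; i < N; i++)
        arr[i] = i + 1;                 /* 1, 2, 3, 4 */

    pid_t pid = fork();
    if (pid < 0) {
        free(arr);
        return -1;
    }

    if (pid == 0) {
        /* Child: same pointer value, same contents, nothing to allocate. */
        int sum = 0;
        for (int i = 0; i < N; i++)
            sum += arr[i];              /* reads do not force a copy */
        arr[0] = 100;                   /* write: only the child's view changes */

        int *extra = malloc(N * sizeof(*extra)); /* new memory is child-only */
        if (extra == NULL)
            _exit(1);
        extra[0] = arr[0];
        _exit(sum == 10 && extra[0] == 100 ? 0 : 1);
    }

    int status = 0;
    int ok = waitpid(pid, &status, 0) == pid
          && WIFEXITED(status) && WEXITSTATUS(status) == 0
          && arr[0] == 1;               /* parent never sees the child's write */
    free(arr);
    return ok ? 0 : -1;
}
```

Whether the child's second malloc() fits into existing heap pages or grows the heap is up to the allocator; either way the parent is not affected.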
This also means that physical memory might run out after malloc() has already succeeded, at the moment a page is first written. (Which is bad, as the program cannot check an error return code for "an operation in a random code line".) Some operating systems do not allow this form of overcommit: if you fork a process, they do not allocate the pages, but they require them to be available at that moment (they kind of reserve them), just in case. In Linux this is configurable and called overcommit accounting.
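On Linux the current policy can be read from /proc/sys/vm/overcommit_memory. The following is a minimal sketch; read_overcommit_mode is a made-up helper name, and on systems without that file it simply returns -1.

```c
#include <stdio.h>

/*
 * Read the Linux overcommit policy from procfs.
 * Returns 0 (heuristic overcommit), 1 (always overcommit),
 * 2 (strict accounting), or -1 if the file is missing or unreadable.
 */
int read_overcommit_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    if (f == NULL)
        return -1;

    int mode = -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}
```

With mode 2 a fork() of a large process can fail up front instead of the system running out of memory later, when the pages are written.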
How does fork() process mark parent's PTE's as read only?
Linux implements the fork syscall by iterating over all memory ranges (mmaps, stack and heap) of the parent process. Each of those ranges (VMAs, virtual memory areas) is copied by the function copy_page_range (mm/memory.c), which loops over the page table entries:
- copy_page_range will iterate over pgd and call
- copy_pud_range to iterate over pud and call
- copy_pmd_range to iterate over pmd and call
- copy_pte_range to iterate over pte and call
- copy_one_pte, which does memory usage accounting (RSS) and has several code segments to handle the COW case:
    /*
     * If it's a COW mapping, write protect it both
     * in the parent and the child
     */
    if (is_cow_mapping(vm_flags)) {
        ptep_set_wrprotect(src_mm, addr, src_pte);
        pte = pte_wrprotect(pte);
    }
where is_cow_mapping will be true for private and potentially writable pages (the flags bitfield is checked for the shared and maywrite bits, and only the maywrite bit should be set):
    #define VM_SHARED   0x00000008
    #define VM_MAYWRITE 0x00000020

    static inline bool is_cow_mapping(vm_flags_t flags)
    {
        return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
    }
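The flag test is easy to try out in user space. In the sketch below the two flag values and the function body are copied from the excerpt above; the vm_flags_t typedef is an assumption made here so that the snippet compiles outside the kernel.

```c
#include <stdbool.h>

/* Assumed stand-in for the kernel type, for user-space experiments only. */
typedef unsigned long vm_flags_t;

#define VM_SHARED   0x00000008
#define VM_MAYWRITE 0x00000020

/*
 * True only when the mapping may be written (VM_MAYWRITE set) and is
 * private (VM_SHARED clear). Other flag bits are ignored by the mask.
 */
static inline bool is_cow_mapping(vm_flags_t flags)
{
    return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
```

So a private mapping that may become writable is treated as COW, while a shared mapping is not: writes to shared mappings are meant to be visible to every process mapping them, so there is nothing to copy.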
PUD, PMD, and PTE are described in books like https://www.kernel.org/doc/gorman/html/understand/understand006.html and in articles like LWN 2005: "Four-level page tables merged".
How the fork implementation calls copy_page_range:
- The fork syscall implementation (sys_fork? or SYSCALL_DEFINE0(fork)) is do_fork (kernel/fork.c), which will call copy_process, which will call many copy_* functions, including copy_mm, which calls dup_mm to allocate and fill a new mm struct. Most of the work there is done by dup_mmap (still kernel/fork.c), which will check what was mmaped and how. (Here I was unable to get the exact path to the COW implementation, so I used the Internet Search Machine with something like "fork+COW+dup_mm" to get hints like [1] or [2] or [3].) After checking mmap types there is the retval = copy_page_range(mm, oldmm, mpnt); line to do the real work.