Which Segments Are Affected by a Copy-On-Write

Which segments are affected by a copy-on-write?

The OS can set whatever "copy on write" policy it wishes, but generally, they all do the same thing (i.e. what makes the most sense).

Loosely, for a POSIX-like system (linux, BSD, OSX), there are four areas (what you were calling segments) of interest: data (where int x = 1; goes), bss (where int y goes), sbrk (this is heap/malloc), and stack

When a fork is done, the OS sets up a new page map for the child that shares all the pages of the parent. Then, in the page maps of the parent and the child, all the pages are marked readonly.

Each page map also has a reference count that indicates how many processes are sharing the page. Before the fork, the refcount will be 1 and, after, it will be 2.

Now, when either process tries to write to a R/O page, it will get a page fault. The OS will see that this is for "copy on write", will create a private page for the process, copy in the data from the shared, mark the page as writable for that process and resume it.

It will also bump down the refcount. If the refcount is now [again] 1, the OS will mark the page in the other process as writable and non-shared [this eliminates a second page fault in the other process--a speedup only because at this point the OS knows that the other process should be free to write unmolested again]. This speedup could be OS dependent.

Actually, the bss section get even more special treatment. In the initial page mapping for it, all pages are mapped to a single page that contains all zeroes (aka the "zero page"). The mapping is marked R/O. So, the bss area could be gigabytes in size and it will only occupy a single physical page. This single, special, zero page is shared amongst all bss sections of all processes, regardless whether they have any relationship to one another at all.

Thus, a process can read from any page in the area and gets what it expects: zero. It's only when the process tries to write to such a page, the same copy on write mechanism kicks in, the process gets a private page, the mapping is adjusted, and the process is resumed. It is now free to write to the page as it sees fit.

Once again, an OS can choose its policy. For example, after the fork, it might be more efficient to share most of the stack pages, but start off with private copies of the "current" page, as determined by the value of the stack pointer register.

When an exec syscall is done [on the child], the kernel has to undo much of the mapping done during the fork [bumping down refcounts], releasing the child's mapping, etc and restoring the parent's original page protections (i.e. it will no longer be sharing its pages unless it does another fork)

Although not part of your original question, there are related activities that may be of interest, such as on demand loading [of pages] and on demand linking [of symbols] after an exec syscall.

When a process does an exec, the kernel does the cleanup above, and reads a small portion of the executable file to determine its object format. The dominate format is ELF, but any format that a kernel understands can be used (e.g. OSX can use ELF [IIRC], but it also has others].

For ELF, the executable has a special section that gives a full FS path to what's known as the "ELF interpreter", which is a shared library, and is usually /lib64/ld.linux.so.

The kernel, using an internal form of mmap, will map this into the application space, and set up a mapping for the executable file itself. Most things are marked as R/O pages and "not present".

Before we go further, we need to talk about the "backing store" for a page. That is, if a page fault occurs and we need to load the page from disk, where it comes from. For heap/malloc, this is generally the swap disk [aka paging disk].

Under linux, it's generally the partition that is of the type "linux swap" that was added when the system was installed. When a page is written to that has to flushed to disk to free up some physical memory, it gets written there. Note that the page sharing algorithm in the first section still applies.

Anyway, when an executable is first mapped into memory, its backing store is the executable file in the filesystem.

So, the kernel sets the app's program counter to point to the starting location of the ELF interpreter, and transfers control to it.

The ELF interpreter goes about its business. Every time it tries to execute a portion of itself [a "code" page] that is mapped but not loaded, a page fault occurs and the loads that page from the backing store (e.g. the ELF interpreter's file) and changes the mapping to R/O but present.

This occurs for the ELF interpreter, shared libraries, and the executable itself.

The ELF interpreter will now use mmap to map libc into the app space [again, subject to the demand loading]. If the ELF interpreter has to modify a code page to relocate a symbol [or tries to write to any that has the file as the backing store, like a data page], a protection fault occurs, the kernel changes the backing store for the page from the on disk file to a page on the swap disk, adjusts the protections, and resumes the app.

The kernel must also handle the case where the ELF interpreter (e.g.) is trying to write to [say] a data page that had never yet been loaded (i.e. it has to load it first and then change the backing store to the swap disk)

The ELF interpreter then uses portions of libc to help it complete initial linking activities. It relocates the minimum necessary to allow it to do its job.

However, the ELF interpreter does not relocate anywhere near all the symbols for most other shared libraries. It will look through the executable and, again using mmap, create a mapping for the shared libraries the executable needs (i.e. what you see when you do ldd executable).

These mappings to shared libraries and executables, can be thought of as "segments".

There is a symbol jump table that points back to the interpreter in each shared library. But, the ELF interpreter makes minimal changes.

[Note: this is a loose explanation] Only when the application tries to call a given function's jump entry [this is that GOT et. al. stuff you may have seen] does a relocation occur. The jump entry transfers control to the interpreter, which locates the real address of the symbol and adjusts the GOT so that it now points directly to the final address for the symbol and redoes the call, which will now call the real function. On a subsequent call to the same given function, it now goes direct.

This is called "on demand linking".

A by-product of all this mmap activity is the the classical sbrk syscall is of little to no use. It would soon collide with one of the shared library memory mappings.

So, modern libc doesn't use it. When malloc needs more memory from the OS, it requests more memory from an anonymous mmap and keeps track of which allocations belong to which mmap mapping. (i.e. if enough memory got freed to comprise an entire mapping, free could do an munmap).

So, to sum up, we have "copy on write", "on demand loading", and "on demand linking" all going on at the same time. It seems complex, but makes fork and exec go quickly, smoothly. This adds some complexity, but extra overhead is done only when needed ("on demand").

Thus, instead of a large lurch/delay at the beginning launch of a program, the overhead activity gets spread out over the lifetime of the program, as needed.

Copy-on-write during fork

Not so.

The addresses seen by both the child and the parent are relative to their own address spaces, not relative to the system as a whole.

The operating system maps the memory used by each process to a different location in physical (or virtual) memory. But that mapping is not visible to the processes.

Does parent process lose write ability during copy on write?

Right, if either process writes a COW page, it triggers a page fault.

In the page fault handler, if the page is supposed to be writeable, it allocates a new physical page and does a memcpy(newpage, shared_page, pagesize), then updates the page table of whichever process faulted to map the newpage to that virtual address. Then returns to user-space for the store instruction to re-run.

This is a win for something like fork, because one process typically makes an execve system call right away, after touching typically one page (of stack memory). execve destroys all memory mappings for that process, effectively replacing it with a new process. The parent once again has the only copy of every page. (Except pages that were already copy-on-write, e.g. memory allocated with mmap is typically COW-mapped to a single physical page of zeros, so reads can hit in L1d cache).

A smart optimization would be for fork to actually copy the page containing the top of the stack, but still do lazy COW for all the other pages, on the assumption that the child process will normally execve right away and thus drop its references to all the other pages. It still costs a TLB invalidation in the parent to temporarily flip all the pages to read-only and back, though.

Use of Readonly Memory in Data Segment

You are right, almost... there are three kinds of global static data in a program:

Initialized data, read only. Not only reserved for constant string, but for all types of const global data. It is not necessarily in the data section, it can be in the text section of the program (normally the .rodata segment), as it is normally not modifiable by a program.
Initialized data, read write. Normally in the data section of the program (.data segment).
Uninitialized data, read write. Normally in the data section of the program. The difference with the previous is that the executable doesn't include its contents, only the size, as they are initialized to a fixed known value (zero) and there is not a table of initializers in the executable file. Normally, the compiler/linker constructs a segment for this purpose in which it accumulates only the size required by the component modules (the .bss segment).

What resources are shared between threads?

You're pretty much correct, but threads share all segments except the stack. Threads have independent call stacks, however the memory in other thread stacks is still accessible and in theory you could hold a pointer to memory in some other thread's local stack frame (though you probably should find a better place to put that memory!).

Duration and scope of a forked process in large scale Unix C applications

You are completely right with your understanding of fork() - it is supposed to do an exact copy of the calling process, effectively doubling memory requirements of the program.

In practice (when wisely used), most of this overhead is optimized away with the help of most modern OS's virtual memory implementation: first, the OS will only consider writable segments for copy (text segments - your code - isn't considered writeable on most operating segments), so both processes will share the same code segments. With the remaining data segments the OS will do copy on write, so data segments will only be duplicated when the child first attempts to write them.

Copies of the data segments will be done on a page-by-page basis, so with a "not so smart mix" of parent and child data scattered across pages you will still be able to force duplicating most of the parents data unused to the child's memory space. With the classic fork()- exec() pair most of the data will be effectively be separated.

Which Segments Are Affected by a Copy-On-Write