How to Know Whether a Copy-On-Write Page Is an Actual Copy


Good: following MarkR's advice, I gave the pagemap and kpageflags interfaces a shot. Below is a quick test that checks whether a page is in memory and 'SWAPBACKED', as it is called. One problem of course remains: kpageflags is only accessible to root.

#define _GNU_SOURCE   /* for lseek64 */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char* argv[])
{
    unsigned long long pagesize = getpagesize();
    assert(pagesize > 0);
    int pagecount = 4;
    int filesize = pagesize * pagecount;
    int fd = open("test.dat", O_RDWR);
    if (fd < 0)
    {
        fd = open("test.dat", O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
        printf("Created test.dat testfile\n");
    }
    assert(fd >= 0);
    int err = ftruncate(fd, filesize);
    assert(!err);

    char* M = (char*)mmap(NULL, filesize, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    assert(M != MAP_FAILED);
    printf("Successfully created private mapping\n");

The test setup contains four pages. Pages 0 and 2 are made dirty:

    strcpy(M, "I feel so dirty\n");
    strcpy(M + pagesize * 2, "Christ on crutches\n");

Page 3 is only read from:

    char t = M[pagesize * 3];
    (void)t; /* silence the unused-variable warning */

Page 1 is never accessed.

The pagemap file maps the process's virtual memory to physical page frame numbers, which can then be looked up in the global kpageflags file. See Documentation/vm/pagemap.txt in the kernel source tree.

    int mapfd = open("/proc/self/pagemap", O_RDONLY);
    assert(mapfd >= 0);
    unsigned long long target = (unsigned long)M / pagesize;
    long long off = lseek64(mapfd, target * 8, SEEK_SET);
    assert(off == (long long)(target * 8));
    assert(sizeof(long long) == 8);

Here we read the pagemap entry for each of our virtual pages:

    unsigned long long page2pfn[pagecount];
    err = read(mapfd, page2pfn, sizeof(long long) * pagecount);
    if (err < 0)
        perror("Reading pagemap");
    else if (err != pagecount * 8)
        printf("Could only read %d bytes\n", err);

Now, for each present virtual page, we read the actual page flags:

    int pageflags = open("/proc/kpageflags", O_RDONLY);
    assert(pageflags >= 0);
    for (int i = 0; i < pagecount; i++)
    {
        unsigned long long entry = page2pfn[i];
        printf("Page: %d, pagemap entry %llx\n", i, entry);

        if (entry & 0x8000000000000000ULL) /* Is the virtual page present? */
        {
            unsigned long long pfn = entry & 0x7fffffffffffffULL; /* PFN is bits 0-54 */
            off = lseek64(pageflags, pfn * 8, SEEK_SET);
            assert(off == (long long)(pfn * 8));
            unsigned long long pf;
            err = read(pageflags, &pf, 8);
            assert(err == 8);
            printf("pageflags are %llx with SWAPBACKED: %d\n", pf, (int)((pf >> 14) & 1));
        }
    }
}

All in all, I'm not particularly happy with this approach, since it requires access to a file that we generally can't read, and it is bloody complicated (how about a simple kernel call to retrieve the page flags?).

Determine if memory after fork is copy-on-write

In general, in the sense of being portable to all POSIX-conforming or POSIX-like systems, no: there is no way to observe COW, especially not at the level of individual pages (you might be able to observe it at a coarser level just by watching "available" memory, if the system provides such a figure). But on Linux you can observe it via /proc/[pid]/pagemap for the potentially sharing processes. /proc/kpagecount and /proc/kpageflags may also contain relevant information, but you need root to access them. See:

https://www.kernel.org/doc/Documentation/vm/pagemap.txt

Which segments are affected by a copy-on-write?

The OS can set whatever "copy on write" policy it wishes, but generally, they all do the same thing (i.e. what makes the most sense).

Loosely, for a POSIX-like system (Linux, BSD, OSX), there are four areas (what you were calling segments) of interest: data (where int x = 1; goes), bss (where int y; goes), heap (sbrk/malloc), and stack.

When a fork is done, the OS sets up a new page map for the child that shares all the pages of the parent. Then, in the page maps of the parent and the child, all the pages are marked readonly.

Each page map also has a reference count that indicates how many processes are sharing the page. Before the fork, the refcount will be 1 and, after, it will be 2.

Now, when either process tries to write to an R/O page, it will get a page fault. The OS will see that this is for "copy on write", will create a private page for the process, copy in the data from the shared page, mark the new page writable for that process, and resume it.

It will also bump down the refcount. If the refcount is now [again] 1, the OS will mark the page in the other process as writable and non-shared [this eliminates a second page fault in the other process--a speedup only because at this point the OS knows that the other process should be free to write unmolested again]. This speedup could be OS dependent.

Actually, the bss section gets even more special treatment. In the initial page mapping for it, all pages are mapped to a single page that contains all zeroes (aka the "zero page"). The mapping is marked R/O. So, the bss area could be gigabytes in size and it will only occupy a single physical page. This single, special, zero page is shared amongst all bss sections of all processes, regardless of whether they have any relationship to one another at all.

Thus, a process can read from any page in the area and get what it expects: zeroes. It's only when the process tries to write to such a page that the same copy-on-write mechanism kicks in: the process gets a private page, the mapping is adjusted, and the process is resumed. It is now free to write to the page as it sees fit.

Once again, an OS can choose its policy. For example, after the fork, it might be more efficient to share most of the stack pages, but start off with private copies of the "current" page, as determined by the value of the stack pointer register.

When an exec syscall is done [in the child], the kernel has to undo much of the mapping done during the fork [bumping down refcounts], release the child's mappings, etc., and restore the parent's original page protections (i.e. the parent will no longer be sharing its pages unless it does another fork).


Although not part of your original question, there are related activities that may be of interest, such as on demand loading [of pages] and on demand linking [of symbols] after an exec syscall.

When a process does an exec, the kernel does the cleanup above and reads a small portion of the executable file to determine its object format. The dominant format is ELF, but any format that the kernel understands can be used (e.g. OSX can use ELF [IIRC], but it also has others).

For ELF, the executable has a special section that gives a full FS path to what's known as the "ELF interpreter", which is a shared library, usually something like /lib64/ld-linux-x86-64.so.2 on x86-64 Linux.

The kernel, using an internal form of mmap, will map this into the application space, and set up a mapping for the executable file itself. Most things are marked as R/O pages and "not present".

Before we go further, we need to talk about the "backing store" for a page. That is, if a page fault occurs and we need to load the page from disk, where does the data come from? For heap/malloc, this is generally the swap disk [aka paging disk].

Under Linux, it's generally the partition of type "linux swap" that was added when the system was installed. When a dirty page has to be flushed to disk to free up some physical memory, it gets written there. Note that the page-sharing algorithm in the first section still applies.

Anyway, when an executable is first mapped into memory, its backing store is the executable file in the filesystem.

So, the kernel sets the app's program counter to point to the starting location of the ELF interpreter, and transfers control to it.

The ELF interpreter goes about its business. Every time it tries to execute a portion of itself [a "code" page] that is mapped but not loaded, a page fault occurs, the kernel loads that page from the backing store (e.g. the ELF interpreter's file), and the mapping is changed to R/O but present.

This occurs for the ELF interpreter, shared libraries, and the executable itself.

The ELF interpreter will now use mmap to map libc into the app space [again, subject to the demand loading]. If the ELF interpreter has to modify a code page to relocate a symbol [or tries to write to any page that has the file as its backing store, like a data page], a protection fault occurs, the kernel changes the backing store for the page from the on-disk file to a page on the swap disk, adjusts the protections, and resumes the app.

The kernel must also handle the case where the ELF interpreter (e.g.) is trying to write to [say] a data page that had never yet been loaded (i.e. it has to load it first and then change the backing store to the swap disk).

The ELF interpreter then uses portions of libc to help it complete initial linking activities. It relocates the minimum necessary to allow it to do its job.

However, the ELF interpreter does not relocate anywhere near all the symbols for most other shared libraries. It will look through the executable and, again using mmap, create a mapping for the shared libraries the executable needs (i.e. what you see when you do ldd executable).

These mappings to shared libraries and executables can be thought of as "segments".

Each shared library has a symbol jump table that points back to the interpreter. But the ELF interpreter makes minimal changes.

[Note: this is a loose explanation] Only when the application tries to call a given function's jump entry [this is that GOT et al. stuff you may have seen] does a relocation occur. The jump entry transfers control to the interpreter, which locates the real address of the symbol, adjusts the GOT so that it now points directly to the final address for the symbol, and redoes the call, which will now call the real function. On a subsequent call to the same given function, it goes direct.

This is called "on demand linking".

A by-product of all this mmap activity is that the classical sbrk syscall is of little to no use: it would soon collide with one of the shared-library memory mappings.

So, modern libc doesn't use it. When malloc needs more memory from the OS, it requests more memory from an anonymous mmap and keeps track of which allocations belong to which mmap mapping. (i.e. if enough memory got freed to comprise an entire mapping, free could do an munmap).

So, to sum up, we have "copy on write", "on demand loading", and "on demand linking" all going on at the same time. It seems complex, but makes fork and exec go quickly and smoothly. This adds some complexity, but extra overhead is incurred only when needed ("on demand").

Thus, instead of a large lurch/delay at the beginning launch of a program, the overhead activity gets spread out over the lifetime of the program, as needed.

How does copy-on-write in fork() handle multiple fork?

If fork is called multiple times from the original parent process, then the parent and each of the children will have their pages marked read-only. When a child process attempts to write, the page is copied into its address space and the copy is marked writeable in the child but not in the parent.

If fork is called again from the child and the grandchild then attempts to write, the grandchild simply gets its own private, writeable copy of the shared page; the parent and the first child continue to share the original read-only page until one of them writes to it.

What is copy-on-write?

I was going to write up my own explanation but this Wikipedia article pretty much sums it up.

Here is the basic concept:

Copy-on-write (sometimes referred to as "COW") is an optimization strategy used in computer programming. The fundamental idea is that if multiple callers ask for resources which are initially indistinguishable, you can give them pointers to the same resource. This fiction can be maintained until a caller tries to modify its "copy" of the resource, at which point a true private copy is created to prevent the changes from becoming visible to everyone else. All of this happens transparently to the callers. The primary advantage is that if a caller never makes any modifications, no private copy need ever be created.

Also here is an application of a common use of COW:

The COW concept is also used in maintaining instant snapshots on database servers like Microsoft SQL Server 2005. Instant snapshots preserve a static view of a database by storing a pre-modification copy of data when the underlying data is updated. Instant snapshots are used for testing purposes or moment-dependent reports and should not be used to replace backups.

Does parent process lose write ability during copy on write?

Right, if either process writes a COW page, it triggers a page fault.

In the page fault handler, if the page is supposed to be writeable, it allocates a new physical page, does a memcpy(newpage, shared_page, pagesize), then updates the page table of whichever process faulted to map the new page at that virtual address. Then it returns to user space so the store instruction can re-run.

This is a win for something like fork, because one process typically makes an execve system call right away, after touching typically one page (of stack memory). execve destroys all memory mappings for that process, effectively replacing it with a new process. The parent once again has the only copy of every page. (Except pages that were already copy-on-write, e.g. memory allocated with mmap is typically COW-mapped to a single physical page of zeros, so reads can hit in L1d cache).

A smart optimization would be for fork to actually copy the page containing the top of the stack, but still do lazy COW for all the other pages, on the assumption that the child process will normally execve right away and thus drop its references to all the other pages. It still costs a TLB invalidation in the parent to temporarily flip all the pages to read-only and back, though.


