Faster Forking of Large Processes on Linux

Faster forking of large processes on Linux?

Outcome: I was going to go down the early-spawned helper subprocess route as suggested by other answers here, but then I came across this re using huge page support to improve fork performance.

Having tried it myself using libhugetlbfs to simply make all my app's mallocs allocate huge pages, I'm now getting around 2400 forks/s regardless of the process size (over the range I'm interested in anyway). Amazing.

Time waste of execv() and fork()

What is the advantage that is achieved by using this combo (instead of some other solution) that makes people still use this even though we have waste?

You have to create a new process somehow. There are very few ways for a userspace program to accomplish that. POSIX used to have vfork() alognside fork(), and some systems may have their own mechanisms, such as Linux-specific clone(), but since 2008, POSIX specifies only fork() and the posix_spawn() family. The fork + exec route is more traditional, is well understood, and has few drawbacks (see below). The posix_spawn family is designed as a special purpose substitute for use in contexts that present difficulties for fork(); you can find details in the "Rationale" section of its specification.

This excerpt from the Linux man page for vfork() may be illuminating:

Under Linux, fork(2) is implemented using copy-on-write pages, so the only penalty incurred by fork(2) is the time and memory required to duplicate the parent’s page tables, and to create a unique task structure for the child. However, in the bad old days a fork(2) would require making a complete copy of the caller’s data space, often needlessly, since usually immediately afterwards an exec(3) is done. Thus, for greater efficiency, BSD introduced the vfork() system call, which did not fully copy the address space of the parent process, but borrowed the parent’s memory and thread of control until a call to execve(2) or an exit occurred. The parent process was suspended while the child was using its resources. The use of vfork() was tricky: for example, not modifying data in the parent process depended on knowing which variables are held in a register.

(Emphasis added)

Thus, your concern about waste is not well-founded for modern systems (not limited to Linux), but it was indeed an issue historically, and there were indeed mechanisms designed to avoid it. These days, most of those mechanisms are obsolete.

fork() leaking? Taking longer and longer to fork a simple process

The slowdown is caused by an accumulation of anonymous vmas, and is a known problem. The problem is evident when there are a large number of fork() calls and the parent exits before the children. The following code recreates the problem (source Daniel Forrest):

#include <unistd.h>

int main(int argc, char *argv[])
{
  pid_t pid;
  while (1) {
    pid = fork();
    if (pid == -1) {
      /* error */
      return 1;
    }
    if (pid) {
      /* parent */
      sleep(2);
      break;
    }
    else {
      /* child */
      sleep(1);
    }
  }
  return 0;
}

The behavior can be confirmed by checking anon_vma in /proc/slabinfo.

There is a patch (source) which limits the length of copied anon_vma_chain to five. I can confirm that the patch fixes the problem.

As for how I eventually found the problem, I finally just started putting printk calls throughout the fork code, checking the times shown in dmesg. Eventually I saw that it was the call to anon_vma_fork which was taking longer and longer. Then it was a quick matter of google searching.

It took a rather long time, so I would still appreciate any suggestions for a better way to have gone about tracking down the problem. And to all of those that already spent time trying to assist me, Thank You.

Is the unix fork exec sequence really as expensive as it sounds?

Usually the fork does not actually copy all the memory, but uses a "copy on write" which means that as long as the memory is not modified the same pages are used. However, to avoid not having enough memory later on (should the process write to the memory) enough memory must be allocated.

This means that forking from large process on systems that do not allow overcommitting memory the memory must be available. So, if you have a 8 GB process forking, then for at least a short period of time 16 GB must be available.

See also vfork and posix_spawn for other solutions.

forking() and CreateProcess()

They do different things, and on different systems. CreateProcess is a Windows-only function, while fork is only on POSIX (e.g. Linux and Mac OSX) systems.

The fork system call creates a new process and continue execution in both the parent and the child from the point where the fork function was called. CreateProcess creates a new process and load a program from disk. The only similarity is that the end result is a new process is created.

For more information, read the respective manual page on CreateProcess and fork.

To summarize: CreateProcess is similar to fork() followed by one of the exec() functions.

Forking vs Threading

The main difference between forking and threading approaches is one of operating system architecture. Back in the days when Unix was designed, forking was an easy, simple system that answered the mainframe and server type requirements best, as such it was popularized on the Unix systems. When Microsoft re-architected the NT kernel from scratch, it focused more on the threading model. As such there is today still a notable difference with Unix systems being efficient with forking, and Windows more efficient with threads. You can most notably see this in Apache which uses the prefork strategy on Unix, and thread pooling on Windows.

Specifically to your questions:

When should you prefer fork() over threading and vice-verse?

On a Unix system where you're doing a far more complex task than just instantiating a worker, or you want the implicit security sandboxing of separate processes.

If I want to call an external application as a child, then should I use fork() or threads to do it?

If the child will do an identical task to the parent, with identical code, use fork. For smaller subtasks use threads. For separate external processes use neither, just call them with the proper API calls.

While doing google search I found people saying it is bad thing to call a fork() inside a thread. why do people want to call a fork() inside a thread when they do similar things?

Not entirely sure but I think it's computationally rather expensive to duplicate a process and a lot of subthreads.

Is it True that fork() cannot take advantage of multiprocessor system because parent and child process don't run simultaneously?

This is false, fork creates a new process which then takes advantage of all features available to processes in the OS task scheduler.

Why is forking slowing down my application

You are probably observing the cost of the copy-on-write mechanism, as you hypothesize. That's actually quite expensive -- it is the reason vfork still exists. (The main cost is not the extra page faults themselves, but the memcpy of each page as it is touched, and the associated cache and TLB flushes.) It's not showing up as a cost of fork because the page faults don't happen inside the system call.

You can confirm the hypothesis by looking at the times reported by getrusage -- if this is correct, the extra time elapsed should be nearly all "system" time (CPU burnt inside the kernel). oprofile or perf will let you pin down the problem more specifically... if you can get them to work at all, which is nontrivial, alas.

Unfortunately, copy-on-write is also the reason why your checkpoint mechanism works in the first place. Can you get away with taking checkpoints at longer intervals? That's the only quick fix I can think of.

Faster Forking of Large Processes on Linux