Forking VS Threading


The main difference between the forking and threading approaches is one of operating system architecture. Back when Unix was designed, forking was a simple mechanism that suited mainframe and server workloads well, so it became the popular model on Unix systems. When Microsoft architected the NT kernel from scratch, it focused more on the threading model. As a result there is still a notable difference today: Unix systems are efficient at forking, and Windows is more efficient with threads. You can see this most clearly in Apache, which traditionally uses the prefork strategy on Unix and thread pooling on Windows.

Specifically to your questions:

When should you prefer fork() over threading and vice versa?

Prefer fork() on a Unix system when the task is far more complex than just instantiating a worker, or when you want the implicit security sandboxing of separate processes.

If I want to call an external application as a child, then should I use fork() or threads to do it?

If the child will do an identical task to the parent, with identical code, use fork. For smaller subtasks use threads. For separate external processes use neither, just call them with the proper API calls.
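As a rough sketch of what "the proper API calls" can look like on a POSIX system, the helper below starts an external program with posix_spawnp() and waits for it with waitpid(). The name run_command is made up for this illustration, and returning -1 for every failure is a simplification.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;  /* the caller's environment, handed on to the child */

/* Hypothetical helper: run an external program and wait for it to finish.
 * argv[0] is looked up in PATH and argv must end with a NULL pointer.
 * Returns the program's exit status, or -1 if it could not be started
 * or did not exit normally. */
int run_command(char *const argv[])
{
    pid_t pid;
    int status;

    /* posix_spawnp() returns 0 on success and an error number otherwise. */
    if (posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ) != 0)
        return -1;

    /* Reap the child so it does not linger as a zombie. */
    if (waitpid(pid, &status, 0) == -1)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would build a NULL-terminated argument array, for example {"ls", "-l", NULL}, and pass it in. Neither a hand-written fork() nor a thread appears in the calling code.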

While searching on Google, I found people saying it is a bad thing to call fork() inside a thread. Why do people want to call fork() inside a thread when they do similar things?

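Not entirely sure, but I think the concern is less the cost of duplication than what gets duplicated: fork() copies only the calling thread into the child, so a mutex held by any other thread at that moment stays locked in the child with no thread left to release it.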

Is it true that fork() cannot take advantage of a multiprocessor system because the parent and child processes don't run simultaneously?

This is false. fork() creates a new process, and the OS task scheduler can run that process on any available processor, just like any other process.

What is the difference between fork and thread?

A fork gives you a brand new process, which is a copy of the current process, with the same code segments. As the memory image changes (typically because the two processes behave differently), the memory images are separated page by page (copy-on-write); the executable code remains the same. The two processes do not share memory unless they use some inter-process communication (IPC) primitive.

One process can have multiple threads, each executing in parallel within the same context of the process. Memory and other resources are shared among threads, so shared data must be protected with synchronization primitives (such as mutexes, condition variables, and semaphores) to avoid data corruption.
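To make the shared-memory point concrete, here is a minimal sketch using POSIX threads: several threads increment one global counter, and a mutex keeps the increments from interleaving. The names counter, add_many and count_with_threads are invented for this example.

```c
#include <pthread.h>
#include <stdlib.h>

static long counter = 0;  /* shared by every thread in the process */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread body: add 1 to the shared counter `iterations` times. */
static void *add_many(void *arg)
{
    long iterations = *(long *)arg;
    for (long i = 0; i < iterations; i++) {
        pthread_mutex_lock(&counter_lock);    /* one thread at a time */
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Hypothetical helper: run nthreads (> 0) threads and return the final
 * counter value, or -1 if a thread could not be created. */
long count_with_threads(int nthreads, long iterations)
{
    pthread_t *threads = malloc(sizeof *threads * (size_t)nthreads);
    int started = 0;

    if (threads == NULL)
        return -1;
    counter = 0;

    for (; started < nthreads; started++)
        if (pthread_create(&threads[started], NULL, add_many, &iterations) != 0)
            break;

    /* Join whatever was started; `iterations` stays valid until here. */
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    free(threads);
    return started == nthreads ? counter : -1;
}
```

With the lock in place, count_with_threads(4, 10000) adds up to 40000. Without it, the unsynchronized counter++ is a data race and updates can be lost.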

What is the difference between threads and forked processes in Unix?

fork() copies the current process. Without any special preparation, almost no data is exchanged between child and parent. The new process starts out identical to the old one, but as soon as either process writes to a variable, a copy of the written region is created, so the two processes end up with separate physical memory for that data. This means that setting a variable in the child will not be visible to the parent, and vice versa.
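A small sketch of that behavior: the child overwrites a variable and hands its value back through its exit status, while the parent's copy stays untouched. The function name value_after_child_write is invented for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork once and let the child overwrite `value`.
 * Stores what the child saw in *child_value and returns the parent's
 * own value afterwards, or -1 on error. */
int value_after_child_write(int *child_value)
{
    int value = 1;
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        value = 42;    /* changes only the child's private copy */
        _exit(value);  /* the exit status carries it back to the parent */
    }

    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;

    *child_value = WEXITSTATUS(status);
    return value;      /* the parent never saw the child's write */
}
```

The exit status is used here only because it needs no extra setup. It holds a single byte, so real data exchange needs one of the IPC methods.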

You can use shared memory, pipes, files, sockets, signals, and probably other IPC methods to communicate between child and parent. For your special case you can use the wait() or waitpid() function to wait till your child exits. But I assume you want to know how to exchange data.
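As one possible illustration of the pipe option combined with waitpid(), the sketch below forks a child that writes a short message into a pipe. The parent reads until end-of-file and then reaps the child. The function name read_from_child and the message text are assumptions for the example.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that sends one message through a pipe.
 * Fills buf with a NUL-terminated copy and returns the number of bytes
 * read, or -1 on error. */
ssize_t read_from_child(char *buf, size_t len)
{
    int fds[2];  /* fds[0] is the read end, fds[1] is the write end */
    size_t total = 0;
    ssize_t n = 0;
    pid_t pid;

    if (len == 0 || pipe(fds) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        const char *msg = "hello from child";
        close(fds[0]);  /* the child only writes */
        n = write(fds[1], msg, strlen(msg));
        _exit(n == (ssize_t)strlen(msg) ? 0 : 1);
    }

    close(fds[1]);  /* the parent only reads; needed to ever see EOF */
    while (total < len - 1 &&
           (n = read(fds[0], buf + total, len - 1 - total)) > 0)
        total += (size_t)n;
    close(fds[0]);
    buf[total] = '\0';

    if (waitpid(pid, NULL, 0) == -1)  /* wait until the child has exited */
        return -1;
    return (ssize_t)total;
}
```

Closing the unused end in each process matters: the parent's read() only reports end-of-file once every copy of the write end has been closed.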

Shared memory

You can use the mmap() call to reserve memory that is shared between parent and child.

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

You can pass the flag MAP_SHARED | MAP_ANONYMOUS to flags to create a memory region that is shared. There you can place the shared variable and both can access it. Here is an example.

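#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

//creates a region of shared memory to store a bool
//returns NULL if the mapping could not be created
static bool *reserveSharedMemory(void)
  {
    void *data = mmap(NULL, sizeof(bool), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if(MAP_FAILED==data)
      {
        //do some error handling here
        return NULL;
      }
    bool *p=data;
    *p=false; //initial value, seen by both processes if mapped before fork()
    return p;
  }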
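To show such a region actually being shared, here is a self-contained sketch that maps a flag the same way, forks, lets the child set it, and checks it in the parent after waitpid(). The function name child_sets_shared_flag is invented, and the mapping must be created before fork() for the child to inherit it.

```c
#define _DEFAULT_SOURCE  /* exposes MAP_ANONYMOUS on glibc in strict -std= modes */
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the parent observed the child's write
 * to the shared flag, 0 if it did not, and -1 on error. */
int child_sets_shared_flag(void)
{
    int result;
    pid_t pid;
    bool *flag = mmap(NULL, sizeof *flag, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (flag == MAP_FAILED)
        return -1;
    *flag = false;

    pid = fork();
    if (pid == -1) {
        munmap(flag, sizeof *flag);
        return -1;
    }

    if (pid == 0) {
        *flag = true;  /* lands in the shared page, not in a private copy */
        _exit(0);
    }

    /* Waiting for the child to exit guarantees its write has happened. */
    if (waitpid(pid, NULL, 0) == -1)
        result = -1;
    else
        result = *flag ? 1 : 0;

    munmap(flag, sizeof *flag);
    return result;
}
```

Here waitpid() doubles as the synchronization. If both processes kept running and touching the flag, they would need something like a process-shared semaphore or atomics instead.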

Sockets

Sockets allow you to send and receive data with something else. With socketpair() you can create two connected socket file descriptors and communicate by writing to one of them and reading from the other, or vice versa. This way communication with the child process becomes almost the same as communicating over a network socket.
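A hedged sketch of the socketpair() approach: the parent sends a short message to a forked child, the child upper-cases it and sends it back over the same socket. The function name shout_via_child, the 256-byte limit and the upper-casing are all choices made for this example.

```c
#include <ctype.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: send msg to a forked child over a socket pair and
 * collect the upper-cased reply. Returns 0 on success, -1 on failure. */
int shout_via_child(const char *msg, char *reply, size_t len)
{
    int sv[2];  /* two connected, bidirectional sockets */
    char buf[256];
    size_t n = strlen(msg), got = 0;
    ssize_t r = 0;
    int ok;
    pid_t pid;

    if (n >= sizeof buf || n >= len)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    if (pid == 0) {
        close(sv[0]);  /* the child keeps sv[1] */
        while (got < sizeof buf &&
               (r = read(sv[1], buf + got, sizeof buf - got)) > 0)
            got += (size_t)r;  /* read until the parent shuts down writing */
        for (size_t i = 0; i < got; i++)
            buf[i] = (char)toupper((unsigned char)buf[i]);
        _exit(write(sv[1], buf, got) == (ssize_t)got ? 0 : 1);
    }

    close(sv[1]);  /* the parent keeps sv[0] */
    ok = write(sv[0], msg, n) == (ssize_t)n;
    shutdown(sv[0], SHUT_WR);  /* signals end of request to the child */

    while (ok && got < len - 1 &&
           (r = read(sv[0], reply + got, len - 1 - got)) > 0)
        got += (size_t)r;
    reply[got] = '\0';

    close(sv[0]);
    waitpid(pid, NULL, 0);
    return (ok && got == n) ? 0 : -1;
}
```

Unlike a pipe, each descriptor here is used for both reading and writing, which is why one pair is enough for a request and its reply.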

What is the difference between forking and threading in a background process?

Threading means you run the code in another thread in the same process whereas forking means you fork a separate process.

Threading in general means that you'll use less memory, since you won't have a separate application instance (this advantage is lessened if you have a copy-on-write friendly Ruby such as REE, Ruby Enterprise Edition). Communication between threads is also a little easier.

Depending on your Ruby interpreter, Ruby may not use extra cores efficiently (JRuby is good at this, MRI much worse), so spawning a bunch of extra threads will impact the performance of your web app and won't make full use of your resources: MRI only runs one thread at a time.

Forking creates separate Ruby instances, so you'll make better use of multiple cores. You're also less likely to adversely affect your main application. You need to be a tiny bit careful when forking, as the child shares the parent's open file descriptors, so you usually want to reopen database connections, memcache connections etc.

With MRI I'd use forking; with JRuby there's more of a case to be made for threading.

Threading vs Forking (with explanation of what I want to do)

Without more detail on your problem, there's not much help that can be given. You want to parallelize a process. Threads and forks in Perl have advantages and disadvantages.

One of the key things that makes Perl threads different from other threads is that data is not shared by default. This makes threads much easier and safer to work with: you don't have to worry about thread safety of libraries or most of your code, just the threaded bit. However, it can be a performance drag and memory hungry, as Perl must put a copy of the interpreter and all loaded modules into each thread.

When it comes to forking I will only be talking about Unix. Perl emulates fork on Windows using threads; it works, but it can be slow and buggy.

Forking Advantages

Very fast to create a fork
Very robust

Forking Disadvantages

Communicating between the processes can be slow and awkward

Thread Advantages

Thread coordination and data interchange is fairly easy
Threads are fairly easy to use

Thread Disadvantages

Each thread takes a lot of memory
Threads can be slow to start
Threads can be buggy (better the more recent your perl)
Database connections are not shared across threads

That last one is a bit of a doozy if the documentation is up to date. If you're going to be doing a lot of SQL, don't use threads.

In general, to get good performance out of Perl threads it's best to start a pool of threads and reuse them. Forks can more easily be created, used and discarded.

Really what it comes down to is what fits your way of thinking and your particular problem.

For either case, you're likely going to want something to manage your pool of workers. For forking you're going to want to use Parallel::ForkManager or Child. Child is particularly nice as it has built in inter-process communication.

For threads you're going to want to use threads::shared, Thread::Queue and read perlthrtut.

When reading articles about Perl threads, keep in mind they were a bit crap when they were introduced in 5.8.0 in 2002, and only serviceable by 5.10.1. Since then they've firmed up considerably. Information and opinions about their efficiency and robustness tend to fall rapidly out of date.

What happens when a thread forks?

"The new process will be the child of the main thread that created the thread. I think."

fork creates a new process. The parent of a process is another process, not a thread. So the parent of the new process is the old process.

Note that the child process will only have one thread because fork only duplicates the (stack for the) thread that calls fork. (This is not entirely true: the entire memory is duplicated, but the child process will only have one active thread.)
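One way to observe this, sketched under the assumption of POSIX threads and C11 atomics: a background thread keeps incrementing a counter, the process forks, and the child checks whether its copy of the counter still moves. The names ticks, ticker and ticker_is_absent_in_child are invented for the example.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static atomic_long ticks;  /* incremented only by the ticker thread */
static atomic_bool stop;

static void *ticker(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop))
        atomic_fetch_add(&ticks, 1);
    return NULL;
}

/* Hypothetical helper: returns 1 if the counter stood still in the child
 * (no ticker thread there), 0 if it moved, and -1 on error. */
int ticker_is_absent_in_child(void)
{
    pthread_t thread;
    struct timespec pause = { 0, 50 * 1000 * 1000 };  /* 50 ms */
    int status, result = -1;
    pid_t pid;

    atomic_store(&ticks, 0);
    atomic_store(&stop, false);
    if (pthread_create(&thread, NULL, ticker, NULL) != 0)
        return -1;

    pid = fork();
    if (pid == 0) {
        /* Only the thread that called fork() exists in the child. */
        long before = atomic_load(&ticks);
        nanosleep(&pause, NULL);
        _exit(atomic_load(&ticks) == before ? 0 : 1);
    }
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = (WEXITSTATUS(status) == 0);

    /* The ticker is still alive in the parent; stop and join it. */
    atomic_store(&stop, true);
    pthread_join(thread, NULL);
    return result;
}
```

The child's memory still contains the counter and the thread's stack, but nothing is running the ticker there, so the value it reads never changes.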

"If its parent finishes first, the new process will be attached to init process."

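If the parent finishes first, the child is orphaned and gets init (or another designated reaper process) as its new parent. A SIGHUP is sent only in particular cases, for example when the exiting parent is the session leader of a controlling terminal, or when the exit orphans a process group that contains stopped processes. See also the man pages for nohup and signal(7) for a bit more information on SIGHUP.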

"And its parent is main thread, not the thread that created it."

The parent of a process is a process, not a specific thread, so it is not meaningful to say that the main or child thread is the parent. The entire process is the parent.

One final note: Mixing threads and fork must be done with care. Some of the pitfalls are discussed here.
