Using Process.spawn as a replacement for Process.fork

EDIT: There is one common use of fork() that can be replaced with spawn(): the fork()/exec() combination. Many UNIX applications, old and new, start another program by first forking and then calling exec in the child (exec replaces the current process image with another program). That pattern never relies on the child being a copy of the parent, which is why it can be replaced with spawn(). So, this:

# fork returns nil in the child, so only the child calls exec.
if !fork
  exec("dir")
end

can be replaced with:

Process.spawn("dir")

If any of the gems are using fork() like this, the fix is easy. Otherwise, it is almost impossible.


EDIT: The reason why win32-process' implementation of fork() doesn't work is that (as far as I can tell from the docs), it basically is spawn(), which isn't fork() at all.


No, I don't think it can be done. Process.spawn creates a new process that starts from a blank state and runs native code. So while something like Process.spawn('dir') will start a new, blank process running dir, it won't clone any of the current process's state. Its only connection to your program is the parent-child relationship.

You see, fork() is a very low-level call. On Linux, for example, fork() basically does this: first, a new process is created with an exact copy of the register state. Then Linux gives the child copy-on-write references to all of the parent process's pages and clones some other process flags. All of these operations can only be done by the kernel, and the Windows kernel doesn't have the facilities to do that (and can't be patched to add them).

Technically, only native programs need the OS for fork()-like support. Any layer of code needs the cooperation of the layer it runs on to do something like fork(). So while native C code needs the cooperation of the kernel to fork, Ruby theoretically only needs the cooperation of the interpreter to do a fork. However, the Ruby interpreter does not have a snapshot/restore feature, which would be necessary to implement a fork. Because of this, a normal Ruby fork is achieved by forking the interpreter itself, not the Ruby program.

So if you could patch the Ruby interpreter to add stop/start and snapshot/restore features, you could do it. Otherwise? I don't think so.

So what are your options? This is what I can think of:

  • Patch the Ruby interpreter
  • Patch the code that uses fork() to maybe use threads or spawn
  • Get a UNIX (I suggest this one)
  • Use Cygwin

Edit 1:
I wouldn't suggest using Cygwin's fork. It relies on special Cygwin process tables and has no copy-on-write, which makes it very inefficient, and it involves a lot of jumping back and forth and a lot of copying. Also, because Windows provides no facilities to copy address spaces, Cygwin forks can fail, and quite often do. Avoid it if possible.

Alternative for spawning a process with 'fork' in jRuby?

I found a solution for this. We can use the FFI library bundled with JRuby to 'simulate' MRI's Process.fork.

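# Mimic MRI's Process.fork by binding libc's fork() through FFI.
require 'ffi'

module JRubyProcess
  extend FFI::Library
  ffi_lib FFI::Library::LIBC
  attach_function :fork, [], :int
end

# libc's fork() takes no block: it returns 0 in the child,
# the child's pid in the parent, and -1 on failure.
pid = JRubyProcess.fork
if pid == 0
  # internal_server.run
end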

More details:

https://github.com/ffi/ffi

http://blog.headius.com/2008/10/ffi-for-ruby-now-available.html

multiprocessing fork() vs spawn()

  1. Is fork much quicker because it does not try to identify which resources to copy?

Yes, it's much quicker. The kernel can clone the whole process and copies a memory page only when one side modifies it. There is no need to pipe resources to a new process or boot the interpreter from scratch.
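As a rough illustration of the cloning described above, here is a minimal Python sketch (Unix only) built on os.fork. The function name child_view_after_fork and the sample strings are made up for this example. The child starts with a copy of the parent's memory, so it can read data without anything being passed to it, and its changes stay in its own copy.

```python
import os


def child_view_after_fork():
    """Fork once; return (what the child saw, what the parent still has)."""
    data = ["set by parent"]
    read_fd, write_fd = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: `data` is already here because the address space was cloned.
        exit_code = 1
        try:
            os.close(read_fd)
            data.append("added by child")  # touches only the child's copy
            os.write(write_fd, ", ".join(data).encode())
            exit_code = 0
        finally:
            # Leave immediately so the child never runs the parent's code.
            os._exit(exit_code)

    # Parent: read the child's report, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        child_report = reader.read()
    os.waitpid(pid, 0)
    return child_report, ", ".join(data)
```

Calling child_view_after_fork() should show the child's appended entry only in the child's report, while the parent's list is unchanged.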


  2. Since fork duplicates everything, would it "waste" many more resources compared to spawn()?

Fork on modern kernels only does "copy-on-write", so it only affects memory pages that actually change. The caveat is that in CPython even merely iterating over an object counts as a "write", because the object's reference count gets incremented.
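The reference-count effect can be seen without forking at all. This is a small sketch that assumes CPython; the function name is invented for illustration. Storing one more reference to an object, without modifying it, changes the count kept inside the object itself, and after a fork that kind of write is what forces the kernel to copy the page holding the object.

```python
import sys


def refcount_delta_when_referenced():
    """Return how much an object's refcount grows when it is only referenced."""
    payload = object()
    before = sys.getrefcount(payload)
    holder = [payload]  # a new reference; payload itself is not mutated
    after = sys.getrefcount(payload)
    del holder
    return after - before
```

The function returns the difference between the two counts, which reflects the single extra reference held by the list.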

If you have long-running processes with lots of small objects in use, this can mean you waste more memory than with spawn. Anecdotally, I recall Facebook claiming to have reduced memory usage considerably by switching from "fork" to "spawn" for their Python processes.

What's the difference between Process.fork and Process.spawn in Ruby 1.9.2

What's the difference between Process.fork and the new Process.spawn methods in Ruby 1.9.2?

Process.fork allows you to run ruby code in another process. Process.spawn allows you to run another program in another process. Basically Process.spawn is like using Process.fork and then calling exec in the forked process, except that it gives you more options.

and which one is better to run another program in a subprocess?

If you need backwards compatibility, use fork + exec as spawn is not available in 1.8. Otherwise use spawn since running another program in a subprocess is exactly what spawn is made for.

As far as I understand, Process.fork accepts a block of code and Process.spawn takes a system command plus some other parameters.

Exactly.

When should I use one instead of the other?

Use fork if you need to run arbitrary ruby code in a separate process (you can't do that with spawn). Use spawn if you need to invoke an application in a subprocess.

node.js child process - difference between spawn & fork

Spawn is designed to run system commands. When you call spawn, you give it a system command that runs in its own process, but no further code from your Node process is executed in it. You can add listeners for the spawned process so that your code can interact with it, but no new V8 instance is created (unless, of course, your command is another Node command, but in that case you should use fork!), and only one copy of your Node module is active on the processor.

Fork is a special case of spawn that runs a fresh instance of the V8 engine. This means you can create multiple workers running on the exact same Node code base, or perhaps a different module for a specific task. This is most useful for creating a worker pool. While Node's async event model allows a single core of a machine to be used fairly efficiently, it doesn't allow a Node process to make use of multi-core machines. The easiest way to accomplish that is to run multiple copies of the same program on a single processor.

A good rule of thumb is one to two node processes per core, perhaps more for machines with a good ram clock/cpu clock ratio, or for node processes heavy on I/O and light on CPU work, to minimize the down time the event loop is waiting for new events. However, the latter suggestion is a micro-optimization, and would need careful benchmarking to ensure your situation suits the need for many processes/core. You can actually decrease performance by spawning too many workers for your machine/scenario.

Ultimately you could use spawn in a way that did the above, by sending spawn a Node command. But this would be silly, because fork does some things to optimize the process of creating V8 instances. Just making it clear, that ultimately spawn encompasses fork. Fork is just optimal for this particular, and very useful, use case.

http://nodejs.org/api/child_process.html#child_process_child_process_exec_command_options_callback

what's multiprocessing spawn? Process memory is not replicated as what fork does

Python relies on the exec primitive to implement the spawn start method on UNIX platforms.

When a new process is forked, exec loads a new Python interpreter and points it at the module and function you gave as the target of your Process object. When the module is loaded in the child, the if __name__ == "__main__": check evaluates to False. This prevents your logic from entering an endless loop that would end up spawning infinite processes.
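The re-import described above can be observed with a small experiment. This is only a sketch: DEMO_SCRIPT, spawn_demo.py and run_spawn_demo are names invented for the example, and it relies on CPython re-importing the main module under the name __mp_main__ in a spawned child. The script is written to a temporary file and run in a fresh interpreter, and it prints the module name each time it is loaded.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical demo script; it is written to a temporary file and run below.
DEMO_SCRIPT = textwrap.dedent(
    '''
    import multiprocessing as mp

    print("module loaded as", __name__, flush=True)

    def work():
        print("work() ran in the child", flush=True)

    if __name__ == "__main__":
        # Only the original process gets here; the child skips this block.
        mp.set_start_method("spawn")
        child = mp.Process(target=work)
        child.start()
        child.join()
    '''
)


def run_spawn_demo():
    """Run DEMO_SCRIPT in a fresh interpreter and return its output lines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "spawn_demo.py")
        with open(path, "w") as handle:
            handle.write(DEMO_SCRIPT)
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, check=True,
        )
    return result.stdout.splitlines()
```

The output should contain one "module loaded as" line for the parent and a second one for the child, where the name is no longer __main__, so the guarded block does not run again.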

Assuming you are executing this code on a UNIX machine, this is the correct behaviour based on POSIX specifications.

This volume of POSIX.1-2017 specifies that signals set to SIG_IGN remain set to SIG_IGN, and that the new process image inherits the signal mask of the thread that called exec in the old process image.

This works only for SIG_IGN. In fact, on test3 you can observe how your handler is reset.
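A quick way to check this behaviour is sketched below, assuming a UNIX system. PARENT_CODE, CHILD_CODE and dispositions_after_exec are illustrative names, and SIGUSR1/SIGUSR2 are arbitrary choices, not taken from the question. A helper process ignores one signal, installs a Python handler for the other, then calls exec; the new process image reports what it inherited.

```python
import subprocess
import sys

# Runs after exec: report the disposition this process image started with.
CHILD_CODE = """
import signal
for sig in (signal.SIGUSR1, signal.SIGUSR2):
    handler = signal.getsignal(sig)
    if handler == signal.SIG_IGN:
        state = "ignored"
    elif handler == signal.SIG_DFL:
        state = "default"
    else:
        state = "custom handler"
    print(sig.name, state)
"""

# Runs before exec: ignore one signal, install a handler for the other,
# then replace this process image with a new interpreter.
PARENT_CODE = """
import os, signal, sys
signal.signal(signal.SIGUSR1, signal.SIG_IGN)
signal.signal(signal.SIGUSR2, lambda signum, frame: None)
os.execv(sys.executable, [sys.executable, "-c", sys.argv[1]])
"""


def dispositions_after_exec():
    """Return the lines printed by the process image that exec started."""
    result = subprocess.run(
        [sys.executable, "-c", PARENT_CODE, CHILD_CODE],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

The ignored signal should still be reported as ignored after exec, while the one that had a Python handler should be back to its default disposition, since handler code cannot survive the replacement of the process image.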

Spawn a process in Python without forking

I think you misunderstand; since PyMongo's documentation warns you that a single MongoClient is not fork-safe, you interpret that to mean that PyMongo prohibits your whole program from ever creating subprocesses.

Any single MongoClient is not fork-safe, meaning you must not create it before forking and use the same MongoClient object after forking. Using PyMongo in your program overall, or using one MongoClient before a fork and a different one after, are all safe.

That's why subprocess.Popen is ok: you fork, then exec (to replace your program with a different one in the child process), and therefore you cannot possibly use the same MongoClient in the child afterward.
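The fork-then-exec sequence mentioned above can be written out by hand. This is a minimal sketch for UNIX with invented function names, not how subprocess is actually implemented internally. Once exec succeeds, the child's Python objects are gone, so nothing created in the parent can be reused there.

```python
import os
import subprocess
import sys


def fork_then_exec(argv):
    """Start argv the manual way and return its exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: exec replaces this process image, so nothing after the
        # call runs unless exec itself fails.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)
    _, status = os.waitpid(pid, 0)
    # Assumes the program exited normally rather than being killed by a signal.
    return os.WEXITSTATUS(status)


def run_with_subprocess(argv):
    """Same result through subprocess, which handles process creation for us."""
    return subprocess.run(argv).returncode
```

For example, both fork_then_exec([sys.executable, "-c", "raise SystemExit(3)"]) and the subprocess variant report the exit status of the program that replaced the child.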

To quote the PyMongo FAQ:

On Unix systems the multiprocessing module spawns processes using fork(). Care must be taken when using instances of MongoClient with fork(). Specifically, instances of MongoClient must not be copied from a parent process to a child process. Instead, the parent process and each child process must create their own instances of MongoClient. For example:

# Each process creates its own instance of MongoClient.
def func():
    db = pymongo.MongoClient().mydb
    # Do something with db.

proc = multiprocessing.Process(target=func)
proc.start()

Never do this:

client = pymongo.MongoClient()

# Each child process attempts to copy a global MongoClient
# created in the parent process. Never do this.
def func():
    db = client.mydb
    # Do something with db.

proc = multiprocessing.Process(target=func)
proc.start()

Instances of MongoClient copied from the parent process have a high probability of deadlock in the child process due to inherent incompatibilities between fork(), threads, and locks. PyMongo will attempt to issue a warning if there is a chance of this deadlock occurring.
