Why Does Python's Multiprocessing Module Import __main__ When Starting a New Process on Windows

Compulsory usage of if __name__ == '__main__' on Windows while using multiprocessing

Expanding a bit on the good answer you already got, it helps if you understand what Linux-y systems do. They create new processes using fork(), which has two good consequences:

  1. All data structures existing in the main program are visible to the child processes, which actually work on copy-on-write copies of the data.
  2. The child processes start executing at the instruction immediately following the fork() in the main program, so any module-level code already executed in the module will not be executed again. (A minimal sketch of both points follows this list.)
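
A minimal sketch of both points, with illustrative names (note that multiprocessing.get_context("fork") is only available on POSIX systems):

import multiprocessing as mp

# Module-level state, built once before the fork.
shared_config = {"mode": "fast", "retries": 3}

def child():
    # Under fork, the child sees a copy-on-write copy of the parent's
    # data even though nothing was passed to it explicitly.
    print("child sees:", shared_config)

if __name__ == "__main__":
    ctx = mp.get_context("fork")  # raises ValueError on Windows
    p = ctx.Process(target=child)
    p.start()
    p.join()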

fork() isn't possible on Windows, so each module is imported anew by each child process. So:

  1. On Windows, no data structures existing in the main program are visible to the child processes; and,
  2. All module-level code is executed in each child process.

So you need to think a bit about which code you want executed only in the main program. The most obvious example is that you want the code that creates child processes to run only in the main program - so that should be protected by __name__ == '__main__'. For a subtler example, consider code that builds a gigantic list, which you intend to pass out to worker processes to crawl over. You probably want to protect that too, because there's no point in making each worker process waste RAM and time building its own useless copy of the gigantic list, as sketched below.
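
A sketch of that pattern, with illustrative names - the list is built once, under the guard, so Windows children never rebuild it during their re-import:

from multiprocessing import Pool

def crawl(item):
    # Workers receive only the items they are handed, never the
    # whole list as a side effect of importing this module.
    return item * 2

if __name__ == '__main__':
    gigantic_list = list(range(1_000_000))  # built in the main process only

    with Pool(4) as pool:
        results = pool.map(crawl, gigantic_list, chunksize=10_000)
    print(results[:5])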

Note that it's a Good Idea to use __name__ == "__main__" appropriately even on Linux-y systems, because it makes the intended division of work clearer. Parallel programs can be confusing - every little bit helps ;-)

Workaround for using __name__=='__main__' in Python multiprocessing

The main module is imported (but with __name__ != '__main__', because Windows is trying to simulate forking-like behavior on a system that doesn't have forking). multiprocessing has no way to know that you didn't do anything important in your main module, so the import is done "just in case" to create an environment similar to the one in your main process. If it didn't do this, all sorts of stuff that happens by side-effect in main (e.g. imports, configuration calls with persistent side-effects, etc.) might not be properly performed in the child processes.

As such, if they're not protecting their __main__, the code is not multiprocessing safe (nor is it unittest safe, import safe, etc.). The if __name__ == '__main__': protective wrapper should be part of all correct main modules. Go ahead and distribute it, with a note about requiring multiprocessing-safe main module protection.
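
A minimal shape for such a multiprocessing-safe main module might look like the following sketch (names are illustrative):

import multiprocessing as mp

def worker():
    print("hello from", mp.current_process().name)

def main():
    # Anything with side effects - configuration, process creation -
    # lives here, so a "just in case" import of this module is harmless.
    procs = [mp.Process(target=worker) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == '__main__':
    main()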

Why does importing module in '__main__' not allow multiprocessing to use module?

The situation is different on Unix-like systems and on Windows. On the unixy systems, multiprocessing uses fork to create child processes that share a copy-on-write view of the parent memory space. The child sees the imports from the parent, including anything the parent imported under if __name__ == "__main__":.

On Windows, there is no fork; a new process has to be started. But simply rerunning the parent process doesn't work - it would run the whole program again. Instead, multiprocessing runs its own Python program that imports the parent main script and then pickles/unpickles a view of the parent object space that is, hopefully, sufficient for the child process.

That program is the __main__ for the child process and the __main__ of the parent script doesn't run. The main script was just imported like any other module. The reason is simple: running the parent __main__ would just run the full parent program again, which mp must avoid.

Here is a test to show what is going on: a main module called testmp.py and a second module test2.py that is imported by the first.

testmp.py

import os
import multiprocessing as mp

print("importing test2")
import test2

def worker():
    print('worker pid: {}, module name: {}, file name: {}'.format(
        os.getpid(), __name__, __file__))

if __name__ == "__main__":
    print('main pid: {}, module name: {}, file name: {}'.format(
        os.getpid(), __name__, __file__))
    print("running process")
    proc = mp.Process(target=worker)
    proc.start()
    proc.join()

test2.py

import os

print('test2 pid: {}, module name: {}, file name: {}'.format(
    os.getpid(), __name__, __file__))

When run on Linux, test2 is imported once and the worker runs in the main module.

importing test2
test2 pid: 17840, module name: test2, file name: /media/td/USB20FD/tmp/test2.py
main pid: 17840, module name: __main__, file name: testmp.py
running process
worker pid: 17841, module name: __main__, file name: testmp.py

Under Windows, notice that "importing test2" is printed twice - testmp.py was run twice. But "main pid" was only printed once - its __main__ wasn't run. That's because multiprocessing changed the module name to __mp_main__ during the import.

E:\tmp>py testmp.py
importing test2
test2 pid: 7536, module name: test2, file name: E:\tmp\test2.py
main pid: 7536, module name: __main__, file name: testmp.py
running process
importing test2
test2 pid: 7544, module name: test2, file name: E:\tmp\test2.py
worker pid: 7544, module name: __mp_main__, file name: E:\tmp\testmp.py
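
A practical consequence of this import-and-pickle scheme: the child shares no memory with the parent, so anything a worker needs must either be importable at module level or passed explicitly via args= - and whatever is passed must be picklable. A small sketch with illustrative names:

import multiprocessing as mp

def worker(settings):
    # 'settings' arrived via pickle, not via shared memory.
    print("worker got:", settings)

if __name__ == "__main__":
    settings = {"threads": 4, "verbose": True}  # must be picklable
    p = mp.Process(target=worker, args=(settings,))
    p.start()
    p.join()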

RuntimeError on Windows trying Python multiprocessing

On Windows the subprocesses will import (i.e. execute) the main module at start. You need to insert an if __name__ == '__main__': guard in the main module to avoid creating subprocesses recursively.

Modified testMain.py:

import parallelTestModule

if __name__ == '__main__':
    extractor = parallelTestModule.ParallelExtractor()
    extractor.runInParallel(numProcesses=2, numThreads=4)

Python Multiprocessing Looping Python File Instead of Starting Process

That if __name__ == '__main__': guard is important. On systems that don't use fork, it simulates a fork by importing the main script in each worker process without naming it __main__ (it's named __mp_main__ IIRC). Any code that should only run in the "main" script needs to be protected by that guard (the protection can be indirect: define a function and call it within the guarded segment; the function will be defined in the workers, but not run).

So to fix this, all you need to do is indent the test_input = input("test input") so it's protected by the if __name__ == '__main__': guard. In real code, I try to keep the guarded section clean (so I can't accidentally write functions that rely on global state that doesn't exist when it's not run as the main script, and for the mild performance benefits of using function locals over globals), so I'd write it like:

from multiprocessing import Process

def f(name):
    print('hello', name)

def main():
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()

    test_input = input("test input")

if __name__ == '__main__':
    main()

but that's not strictly necessary.

In Python multiprocessing.Process, do we have to use `__name__ == '__main__'`?

As described in the multiprocessing guidelines under the heading "Safe importing of main module", some forms of multiprocessing need to import your main module, and thus your program may run amok in a fork bomb if the __name__ == '__main__' check is missing. In particular, this is the case on Windows, where CPython cannot fork. So it is not safe to skip it. The test belongs at the top (global) level of your module, not inside some class. Its purpose is to stop the module from automatically running tasks (as opposed to defining classes, functions, etc.) when it is imported, as opposed to being run directly.
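
To make that failure mode concrete, here is a sketch of an unguarded main module (illustrative only - modern CPython detects the attempt during the child's bootstrap on Windows and raises RuntimeError rather than letting processes multiply without bound):

import multiprocessing as mp

def task():
    print("working")

# BROKEN: no guard. When a child re-imports this module on Windows,
# execution reaches these lines again and tries to start yet another
# child; CPython's spawn bootstrap aborts this with a RuntimeError.
p = mp.Process(target=task)
p.start()
p.join()

The fix is simply to move the last three lines under if __name__ == '__main__':.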

multiprocessing issue with Windows

First, in Script1.py, where you have placed the if __name__ == "__main__": check is not the correct place. It should be placed as follows:

if __name__ == "__main__":
    Test([2, 3]).run()

This is for two reasons. First, when the new processes are created, any statements at global scope will be executed by those processes. If you do not put the check as I have above, you will needlessly create instances of Test in every child. It's true that when run is invoked on those objects it will immediately return, because of where you did place the check - but why create the objects to begin with?

But the real reason for moving the check as I have done is that you only want to execute the statement Test([2, 3]).run() when you are executing Script1.py as the "main" script and not when it is being imported by some other script. By placing the check as I have done, when it is imported its name will not be "__main__" any more and therefore that statement will not be executed, which gives you more flexibility.

This now allows you in Script2.py to add your own if __name__ == '__main__': check as follows:

from Script1 import Test

class Test2(object):
    def __init__(self, y):
        self.y = y

    def run(self):
        z = Test(self.y).run()

if __name__ == '__main__':
    Test2([3, 6]).run()

Prints:

<Process name='Process-2' pid=9200 parent=4492 started>
<Process name='Process-3' pid=16428 parent=4492 started>
result [9, 36]

So when Script2.py is the "main" script being executed, you have control over which object gets created and run.

Explanation

The important thing to remember with Windows is that when a script launches a new process, that process starts execution of the source from the top, so all statements at global scope (import statements, function declarations, variable assignments, etc.) are executed. Thus you want to avoid having things at global scope that don't need to be there, since they will be re-executed by the new process; you might, for instance, be doing a calculation or building a large data structure that the newly created process never uses, wasting CPU cycles or memory for nothing. More importantly, you absolutely must not have any statements at global scope that, when executed, end up recursively re-creating the process you just created. That is why we need the if __name__ == "__main__": check around such statements (__name__ will not be "__main__" in the newly created process). So there is no need to have such a check in the run method, which is not at global scope. But eventually, in whatever script you run to start things off, you will need that check around any code at global scope that creates a process or invokes a function or method that creates a process.

Note that when Script2.py imports Script1.py, Script1.py is now a module and its __name__ value will be "Script1", and again the code Test([2, 3]).run() will not execute. So that also explains why, when we create a module, we can place testing code within an if __name__ == "__main__": block -- it will not be executed when the module is imported.
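
The same idea makes a module self-testable. A tiny sketch, with a hypothetical module name:

# mymodule.py - importable as a library, runnable as its own test
def double(x):
    return x * 2

if __name__ == "__main__":
    # Runs only via "python mymodule.py"; never on "import mymodule",
    # and never in a multiprocessing child (where __name__ is __mp_main__).
    assert double(3) == 6
    print("self-test passed")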

Python multiprocessing on Windows, if __name__ == '__main__'

You do not have to call Process() from the "top level" of the module.
It is perfectly fine to call Process from a class method.

The only caveat is that you cannot allow Process() to be called when the module is imported.

Since Windows has no fork, the multiprocessing module starts a new Python process and imports the calling module. If Process() gets called upon import, then this sets off an infinite succession of new processes (or until your machine runs out of resources). This is the reason for hiding calls to Process() inside

if __name__ == "__main__":

since statements inside this if-statement will not get called upon import.
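
A sketch of that arrangement, with illustrative names - Process() is called inside a method, and the guard surrounds only the top-level call that sets things in motion:

from multiprocessing import Process

def work(n):
    # A module-level function pickles cleanly under spawn.
    print("squared:", n * n)

class Launcher:
    def run(self, n):
        # Starting a Process from a method is fine; this method only
        # runs when something calls it, and that call sits under the
        # guard below, so a bare import never reaches Process().
        p = Process(target=work, args=(n,))
        p.start()
        p.join()

if __name__ == "__main__":
    Launcher().run(7)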


