Multiprocessing: Sharing a Large Read-Only Object Between Processes

Share Large, Read-Only Numpy Array Between Multiprocessing Processes

@Velimir Mlaker gave a great answer. I thought I could add a few comments and a tiny example.

(I couldn't find much documentation on sharedmem - these are the results of my own experiments.)

  1. Do you need to pass the handles when the subprocess is starting, or after it has started? If it's just the former, you can use the target and args arguments of Process. This is potentially better than using a global variable.
  2. From the discussion page you linked, it appears that support for 64-bit Linux was added to sharedmem a while back, so it could be a non-issue.
  3. I don't know about this one.
  4. No. Refer to the example below.

Example

#!/usr/bin/env python
from multiprocessing import Process
import sharedmem
import numpy

def do_work(data, start):
    data[start] = 0

def split_work(num):
    n = 20
    width = n / num
    shared = sharedmem.empty(n)
    shared[:] = numpy.random.rand(1, n)[0]
    print "values are %s" % shared

    processes = [Process(target=do_work, args=(shared, i * width)) for i in xrange(num)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

    print "values are %s" % shared
    print "type is %s" % type(shared[0])

if __name__ == '__main__':
    split_work(4)

Output

values are [ 0.81397784  0.59667692  0.10761908  0.6736734   0.46349645  0.98340718
0.44056863 0.10701816 0.67167752 0.29158274 0.22242552 0.14273156
0.34912309 0.43812636 0.58484507 0.81697513 0.57758441 0.4284959
0.7292129 0.06063283]
values are [ 0. 0.59667692 0.10761908 0.6736734 0.46349645 0.
0.44056863 0.10701816 0.67167752 0.29158274 0. 0.14273156
0.34912309 0.43812636 0.58484507 0. 0.57758441 0.4284959
0.7292129 0.06063283]
type is <type 'numpy.float64'>
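
Note that the example above is Python 2 and relies on the third-party sharedmem package. As an aside (not part of the original answer), a roughly equivalent sketch on Python 3.8+ can use the standard library's multiprocessing.shared_memory; the function and variable names below are illustrative only.

#!/usr/bin/env python3
# Sketch: share a numpy array between processes via multiprocessing.shared_memory
# (standard library, Python 3.8+). Names are illustrative, not from the answer above.
from multiprocessing import Process, shared_memory
import numpy as np

def do_work(shm_name, shape, dtype, start):
    # Attach to the existing shared block by name and view it as an ndarray.
    shm = shared_memory.SharedMemory(name=shm_name)
    data = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    data[start] = 0      # writes are visible to the parent; skip this if the data is truly read-only
    shm.close()          # detach from the block without destroying it

def split_work(num):
    n = 20
    src = np.random.rand(n)
    shm = shared_memory.SharedMemory(create=True, size=src.nbytes)
    shared = np.ndarray(src.shape, dtype=src.dtype, buffer=shm.buf)
    shared[:] = src
    print("values are %s" % shared)

    width = n // num
    processes = [Process(target=do_work, args=(shm.name, shared.shape, shared.dtype, i * width))
                 for i in range(num)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

    print("values are %s" % shared)
    shm.close()
    shm.unlink()         # release the shared block once all processes are done

if __name__ == '__main__':
    split_work(4)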

This related question might be useful.

multiprocessing.Pool sharing large lists of lists read-only in memory across child process

It seems to work fine for me using mp.Manager, with an mp.Manager.list of mp.Manager.lists. I believe this will not copy the lists to every process.

The important line is:

big_list_of_lists_proxy = manager.list([manager.list(sublist) for sublist in big_list_of_lists])

Depending on your use case, you may want to use this instead:

big_list_of_lists_proxy = manager.list(big_list_of_lists)

Whether every sublist should be a proxy depends on whether each sublist is large and whether it is read in its entirety. If a sublist is large, it is expensive to transfer the whole list object to each process that needs it (O(n) in its length), so a proxy list from a manager should be used. However, if every element is going to be needed anyway, there is no advantage to using a proxy.

import multiprocessing as mp
from operator import itemgetter
import numpy as np
from functools import partial

def foo(indexes, big_list_of_lists):
    # Read access to big_list_of_lists must be guaranteed in every child process.
    # With a single child process this would work via a global variable, but it
    # would fail with larger data.
    store_tuples = itemgetter(*indexes)(big_list_of_lists)
    return np.mean([item for sublista in store_tuples for item in sublista])

def main():
    # big_list_of_lists is the variable to share across the child processes
    big_list_of_lists = [[1, 3], [3, 1, 3], [1, 2], [2, 0]]
    ctx = mp.get_context('spawn')
    with ctx.Manager() as manager:
        big_list_of_lists_proxy = manager.list([manager.list(sublist) for sublist in big_list_of_lists])
        # big_list_of_lists elements are also passed as args
        pool = ctx.Pool(mp.cpu_count())
        res = list(pool.map(partial(foo, big_list_of_lists=big_list_of_lists_proxy), big_list_of_lists))
        pool.close()
        pool.join()

    return res

if __name__ == '__main__':
    print(main())
    # desired output is equivalent to:
    # a = []
    # for i in big_list_of_lists:
    #     store_tuples = itemgetter(*i)(big_list_of_lists)
    #     a.append(np.mean([item for sublista in store_tuples for item in sublista]))
    # 'a' would be equal to [1.8, 1.5714285714285714, 2.0, 1.75]

Sharing a complex object between processes?

You can do this using Python's multiprocessing "Manager" classes and a proxy class that you define. See Proxy Objects in the Python docs.

What you want to do is define a proxy class for your custom object, and then share the object using a "Remote Manager" -- look at the examples on the same linked docs page, in the "Using a remote manager" section, where the docs show how to share a remote queue. You're going to be doing the same thing, but your call to your_manager_instance.register() will include your custom proxy class in its argument list.

In this manner, you're setting up a server to share the custom object with a custom proxy. Your clients need access to the server (again, see the excellent documentation examples of how to set up client/server access to a remote queue; instead of sharing a Queue, you are sharing access to your specific class).
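
Since that answer is prose only, here is a minimal sketch of the BaseManager.register() pattern it describes, assuming a made-up SharedConfig class and relying on the default auto-generated proxy; a hand-written proxy class, or a remote manager with address and authkey, would slot into the same register() call as in the docs' queue example.

from multiprocessing import Process
from multiprocessing.managers import BaseManager

class SharedConfig:
    # Illustrative custom object; the single instance lives in the manager's server process.
    def __init__(self):
        self._data = {}
    def set(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

class MyManager(BaseManager):
    pass

# Registering the class lets the manager serve it; a custom proxy class could be
# supplied here via the proxytype argument instead of the default auto-generated proxy.
MyManager.register('SharedConfig', SharedConfig)

def worker(config_proxy):
    # Method calls on the proxy are forwarded to the one shared instance.
    print(config_proxy.get('mode'))

if __name__ == '__main__':
    with MyManager() as manager:
        config = manager.SharedConfig()
        config.set('mode', 'read-only')
        p = Process(target=worker, args=(config,))
        p.start()
        p.join()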

Providing shared read-only resources to parallel processes

You can define and compute output_1 as a global variable before creating your process pool; that way each worker process will have access to the data. With the fork start method (the default on Linux), this won't duplicate the memory, because child processes inherit the parent's address space copy-on-write, as long as you don't modify that data.

from multiprocessing import Pool

_output_1 = serial_computation()

def parallel_computation(input_2):
    # here you can read _output_1
    # you must not modify it, as that would create a new copy in the child process
    ...

def main():
    input_2 = ...
    with Pool() as pool:
        output_2 = pool.map(parallel_computation, input_2)

What is the right way to share a read only configuration to multiple processes?

The correct way to share static information with a multiprocessing.Pool is to use the initializer function and set it via its initargs.

The initializer and initargs are in fact passed to the Pool workers as Process constructor parameters, thus following the recommendation of the multiprocessing programming guidelines:

Explicitly pass resources to child processes

On Unix using the fork start method, a child process can make use of a shared resource created in a parent process using a global resource. However, it is better to pass the object as an argument to the constructor for the child process.

import multiprocessing

variable = None

def initializer(*initargs):
    """The initializer function is executed once on each worker process
    when it starts.

    """
    global variable

    variable = initargs

def function(*args):
    """The function is executed on each parameter of `map`."""
    print(variable)

with multiprocessing.Pool(initializer=initializer, initargs=[1, 2, 3]) as pool:
    pool.map(function, (1, 2, 3))

