What Can Multiprocessing and Dill Do Together

What can multiprocessing and dill do together?

multiprocessing makes some bad choices about pickling. Don't get me wrong, it makes some good choices that enable it to pickle certain types so they can be used in a pool's map function. However, since we have dill that can do the pickling, multiprocessing's own pickling becomes a bit limiting. Actually, if multiprocessing were to use pickle instead of cPickle... and also drop some of it's own pickling overrides, then dill could take over and give a much more full serialization for multiprocessing.

Until that happens, there's a fork of multiprocessing called pathos (the release version is a bit stale, unfortunately) that removes the above limitations. Pathos also adds some nice features that multiprocessing doesn't have, like multi-args in the map function. Pathos is due for a release, after some mild updating -- mostly conversion to python 3.x.

Python 2.7.5 (default, Sep 30 2013, 20:15:49) 
[GCC 4.2.1 (Apple Inc. build 5566)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> from pathos.multiprocessing import ProcessingPool    
>>> pool = ProcessingPool(nodes=4)
>>> result = pool.map(lambda x: x**2, range(10))
>>> result
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

and just to show off a little of what pathos.multiprocessing can do...

>>> def busy_add(x,y, delay=0.01):
...     for n in range(x):
...        x += n
...     for n in range(y):
...        y -= n
...     import time
...     time.sleep(delay)
...     return x + y
... 
>>> def busy_squared(x):
...     import time, random
...     time.sleep(2*random.random())
...     return x*x
... 
>>> def squared(x):
...     return x*x
... 
>>> def quad_factory(a=1, b=1, c=0):
...     def quad(x):
...         return a*x**2 + b*x + c
...     return quad
... 
>>> square_plus_one = quad_factory(2,0,1)
>>> 
>>> def test1(pool):
...     print pool
...     print "x: %s\n" % str(x)
...     print pool.map.__name__
...     start = time.time()
...     res = pool.map(squared, x)
...     print "time to results:", time.time() - start
...     print "y: %s\n" % str(res)
...     print pool.imap.__name__
...     start = time.time()
...     res = pool.imap(squared, x)
...     print "time to queue:", time.time() - start
...     start = time.time()
...     res = list(res)
...     print "time to results:", time.time() - start
...     print "y: %s\n" % str(res)
...     print pool.amap.__name__
...     start = time.time()
...     res = pool.amap(squared, x)
...     print "time to queue:", time.time() - start
...     start = time.time()
...     res = res.get()
...     print "time to results:", time.time() - start
...     print "y: %s\n" % str(res)
... 
>>> def test2(pool, items=4, delay=0):
...     _x = range(-items/2,items/2,2)
...     _y = range(len(_x))
...     _d = [delay]*len(_x)
...     print map
...     res1 = map(busy_squared, _x)
...     res2 = map(busy_add, _x, _y, _d)
...     print pool.map
...     _res1 = pool.map(busy_squared, _x)
...     _res2 = pool.map(busy_add, _x, _y, _d)
...     assert _res1 == res1
...     assert _res2 == res2
...     print pool.imap
...     _res1 = pool.imap(busy_squared, _x)
...     _res2 = pool.imap(busy_add, _x, _y, _d)
...     assert list(_res1) == res1
...     assert list(_res2) == res2
...     print pool.amap
...     _res1 = pool.amap(busy_squared, _x)
...     _res2 = pool.amap(busy_add, _x, _y, _d)
...     assert _res1.get() == res1
...     assert _res2.get() == res2
...     print ""
... 
>>> def test3(pool): # test against a function that should fail in pickle
...     print pool
...     print "x: %s\n" % str(x)
...     print pool.map.__name__
...     start = time.time()
...     res = pool.map(square_plus_one, x)
...     print "time to results:", time.time() - start
...     print "y: %s\n" % str(res)
... 
>>> def test4(pool, maxtries, delay):
...     print pool
...     m = pool.amap(busy_add, x, x)
...     tries = 0
...     while not m.ready():
...         time.sleep(delay)
...         tries += 1
...         print "TRY: %s" % tries
...         if tries >= maxtries:
...             print "TIMEOUT"
...             break
...     print m.get()
... 
>>> import time
>>> x = range(18)
>>> delay = 0.01
>>> items = 20
>>> maxtries = 20
>>> from pathos.multiprocessing import ProcessingPool as Pool
>>> pool = Pool(nodes=4)
>>> test1(pool)
<pool ProcessingPool(ncpus=4)>
x: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]

map
time to results: 0.0553691387177
y: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289]

imap
time to queue: 7.91549682617e-05
time to results: 0.102381229401
y: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289]

amap
time to queue: 7.08103179932e-05
time to results: 0.0489699840546
y: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289]

>>> test2(pool, items, delay)
<built-in function map>
<bound method ProcessingPool.map of <pool ProcessingPool(ncpus=4)>>
<bound method ProcessingPool.imap of <pool ProcessingPool(ncpus=4)>>
<bound method ProcessingPool.amap of <pool ProcessingPool(ncpus=4)>>

>>> test3(pool)
<pool ProcessingPool(ncpus=4)>
x: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]

map
time to results: 0.0523059368134
y: [1, 3, 9, 19, 33, 51, 73, 99, 129, 163, 201, 243, 289, 339, 393, 451, 513, 579]

>>> test4(pool, maxtries, delay)
<pool ProcessingPool(ncpus=4)>
TRY: 1
TRY: 2
TRY: 3
TRY: 4
TRY: 5
TRY: 6
TRY: 7
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34]

Python: Efficient workaround for multiprocessing a function that is a data member of a class, from within that class

Steven Bethard has posted a way to allow methods to be pickled/unpickled. You could use it like this:

import multiprocessing as mp
import copy_reg
import types

def _pickle_method(method):
    # Author: Steven Bethard
    # http://bytes.com/topic/python/answers/552476-why-cant-you-pickle-instancemethods
    func_name = method.im_func.__name__
    obj = method.im_self
    cls = method.im_class
    cls_name = ''
    if func_name.startswith('__') and not func_name.endswith('__'):
        cls_name = cls.__name__.lstrip('_')
    if cls_name:
        func_name = '_' + cls_name + func_name
    return _unpickle_method, (func_name, obj, cls)

def _unpickle_method(func_name, obj, cls):
    # Author: Steven Bethard
    # http://bytes.com/topic/python/answers/552476-why-cant-you-pickle-instancemethods
    for cls in cls.mro():
        try:
            func = cls.__dict__[func_name]
        except KeyError:
            pass
        else:
            break
    return func.__get__(obj, cls)

# This call to copy_reg.pickle allows you to pass methods as the first arg to
# mp.Pool methods. If you comment out this line, `pool.map(self.foo, ...)` results in
# PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup
# __builtin__.instancemethod failed

copy_reg.pickle(types.MethodType, _pickle_method, _unpickle_method)

class MyClass(object):

    def __init__(self):
        self.my_args = [1,2,3,4]
        self.output  = {}

    def my_single_function(self, arg):
        return arg**2

    def my_parallelized_function(self):
        # Use map or map_async to map my_single_function onto the
        # list of self.my_args, and append the return values into
        # self.output, using each arg in my_args as the key.

        # The result should make self.output become
        # {1:1, 2:4, 3:9, 4:16}
        self.output = dict(zip(self.my_args,
                               pool.map(self.my_single_function, self.my_args)))

Then

pool = mp.Pool()   
foo = MyClass()
foo.my_parallelized_function()

yields

print foo.output
# {1: 1, 2: 4, 3: 9, 4: 16}

Python multiprocessing with dill.load

SOLUTION: when dumping use this setting

    dill.settings['recurse']=True

Python multiprocessing PicklingError: Can't pickle type 'function'

Here is a list of what can be pickled. In particular, functions are only picklable if they are defined at the top-level of a module.

This piece of code:

import multiprocessing as mp

class Foo():
    @staticmethod
    def work(self):
        pass

if __name__ == '__main__':   
    pool = mp.Pool()
    foo = Foo()
    pool.apply_async(foo.work)
    pool.close()
    pool.join()

yields an error almost identical to the one you posted:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 315, in _handle_tasks
    put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed

The problem is that the pool methods all use a mp.SimpleQueue to pass tasks to the worker processes. Everything that goes through the mp.SimpleQueue must be pickable, and foo.work is not picklable since it is not defined at the top level of the module.

It can be fixed by defining a function at the top level, which calls foo.work():

def work(foo):
    foo.work()

pool.apply_async(work,args=(foo,))

Notice that foo is pickable, since Foo is defined at the top level and foo.__dict__ is picklable.

Multiprocessing: How to use Pool.map on a function defined in a class?

I also was annoyed by restrictions on what sort of functions pool.map could accept. I wrote the following to circumvent this. It appears to work, even for recursive use of parmap.

from multiprocessing import Process, Pipe
from itertools import izip

def spawn(f):
    def fun(pipe, x):
        pipe.send(f(x))
        pipe.close()
    return fun

def parmap(f, X):
    pipe = [Pipe() for x in X]
    proc = [Process(target=spawn(f), args=(c, x)) for x, (p, c) in izip(X, pipe)]
    [p.start() for p in proc]
    [p.join() for p in proc]
    return [p.recv() for (p, c) in pipe]

if __name__ == '__main__':
    print parmap(lambda x: x**x, range(1, 5))

Interaction between pathos.ProcessingPool and pickle

Straight from the Python docs.

12.1.4. What can be pickled and unpickled? The following types can be pickled:

None, True, and False

integers, floating point numbers, complex

strings, bytes, bytearrays

tuples, lists, sets, and

dictionaries containing only picklable objects functions defined at the top level of a module (using def, not lambda)

built-in functions defined at the top level of a module

classes that are defined at the top level of a module

instances of such classes whose __dict__ or the result of calling __getstate__() is picklable (see section Pickling Class Instances for details).

Everything else can't be pickled. In your case, though it's very hard to say given the excerpt of your code, I believe the problem is that the class Parameters is not defined at the top level of the module, hence its instances can't be pickled.

The whole point of using pathos.multiprocessing (or its actively developing fork multiprocess) instead of the built-in multiprocessing is to avoid pickle, because there are far too many things the later can't dump. pathos.multiprocessing and multiprocess use dill instead of pickle. And if you want to debug a worker, you can use trace.

NOTE As Mike McKerns (the main contributor of multiprocess) rightfully noticed, there are cases that even dill can't handle, though it will be hard to formulate some universal rules on that matter.

Replace pickle in Python multiprocessing lib

Try multiprocess. It's a fork of multiprocessing that uses the dill serializer instead of pickle -- there are no other changes in the fork.

I'm the author. I encountered the same problem as you several years ago, and ultimately I decided that that hacking the standard library was my only choice, as some of the pickle code in multiprocessing is in C++.

>>> import multiprocess as mp
>>> p = mp.Pool()
>>> p.map(lambda x:x**2, range(4))
[0, 1, 4, 9]
>>>

Passing a function that accepts class member functions as variables into python multiprocess pool.map()

The multiprocess module uses the pickle module to serialize the arguments passed to the function (f), which is executed in another process.

Many of the built-in types can be pickled, but instance methods cannot be pickled. So .05 is fine, but x.some_func1 isn't. See What can be pickled and unpickled? for more details.

There's no simple solution. You'll need to restructure your program so instance methods don't need to be passed as arguments (or avoid using multiprocess).

What Can Multiprocessing and Dill Do Together