Writing to Shared Memory in Python Is Very Slow

After more research I've found that Python actually creates folders in /tmp whose names start with pymp-, and though no files are visible inside them with file viewers, it looks exactly like /tmp/ is used by Python for shared memory. Performance seems to drop when file caches are flushed.

The working solution in the end was to mount /tmp as tmpfs:

sudo mount -t tmpfs tmpfs /tmp

And, if using a recent Docker, by passing the --tmpfs /tmp argument to the docker run command, for example:
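docker run --tmpfs /tmp my-image

(Here my-image is a placeholder for your image name.)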

After doing this, read/write operations are done in RAM, and performance is fast and stable.

I still wonder why /tmp is used for shared memory rather than /dev/shm, which is already mounted as tmpfs and is supposed to be used for shared memory.
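If remounting /tmp isn't an option, one possible alternative (an untested sketch; it relies on multiprocessing picking its temp directory through the standard tempfile rules, which honor TMPDIR) is to point TMPDIR at /dev/shm before any shared memory is created:

import os
os.environ['TMPDIR'] = '/dev/shm'  # must be set before multiprocessing first picks a temp dir

import multiprocessing as mp

arr = mp.Array('d', 1000)  # the pymp-* backing directory should now land under /dev/shm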

Why is communication via shared memory so much slower than via queues?

This is because multiprocessing.Array uses a lock by default to prevent multiple processes from accessing it at once:

multiprocessing.Array(typecode_or_type, size_or_initializer, *, lock=True)

...

If lock is True (the default) then a new lock object is created to synchronize access to the value. If lock is a Lock or RLock object then that will be used to synchronize access to the value. If lock is False then access to the returned object will not be automatically protected by a lock, so it will not necessarily be “process-safe”.

This means you're not really writing to the array concurrently: only one process can access it at a time. Since your example workers do almost nothing but array writes, constantly waiting on this lock badly hurts performance. If you use lock=False when you create the array, performance is much better:

With lock=True:

Now starting process 0
Now starting process 1
Now starting process 2
Now starting process 3
4000000 random numbers generated in 4.811205 seconds

With lock=False:

Now starting process 0
Now starting process 3
Now starting process 1
Now starting process 2
4000000 random numbers generated in 0.192473 seconds

Note that using lock=False means you need to manually protect access to the Array whenever you do something that isn't process-safe. Your example has processes write to unique parts, so it's fine. But if you were reading from it at the same time, or had different processes write to overlapping parts, you would need to manually acquire a lock.
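A minimal sketch of that pattern (the worker function and sizes here are made up, not the original example): each process fills only its own disjoint slice of an unsynchronized array, so no lock is needed.

import multiprocessing as mp
import random

N_PER_WORKER = 1_000_000  # made-up slice size

def fill(arr, start):
    # each worker writes only to its own disjoint slice, so no lock is needed
    for i in range(start, start + N_PER_WORKER):
        arr[i] = random.random()

if __name__ == '__main__':
    n_workers = 4
    # lock=False returns a raw, unsynchronized shared array
    arr = mp.Array('d', n_workers * N_PER_WORKER, lock=False)
    procs = [mp.Process(target=fill, args=(arr, w * N_PER_WORKER))
             for w in range(n_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()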

Using numpy array in shared memory slow when synchronizing access

You are essentially locking out any parallelism you might otherwise get, because your data is locked for the entire time you are processing it.

When this method

def calc_sync(i):
    with histo_shared.get_lock():
        calc_histo(i)

is executing, you place a lock on the entire shared dataset while you're processing the histogram. Notice also that

def calc_histo(i):
    # create new array 'd_new' by doing some math on DATA using argument i
    histo = to_np_array(histo_shared)
    histo += np.histogram(d_new, bins=BINS,
                          range=(DMIN, DMAX))[0].astype(np.int32)

isn't doing anything with i, so it just looks like you're processing the same data over again. What is d_new? I don't see it in your listing.

Ideally, what you should be doing is taking your large dataset, slicing it into some number of chunks, processing the chunks individually, and then combining the results. Lock only the shared data, not the processing steps. That might look something like this:

def calc_histo(slice):
    # process the slice asynchronously
    return np.histogram(slice, bins=BINS,
                        range=(DMIN, DMAX))[0].astype(np.int32)

def calc_sync(start, stop):
    # grab a chunk of data; you likely don't need to lock this
    chunk = raw_data[start:stop]

    # the actual calculation is async
    result = calc_histo(chunk)

    # lock only while merging into the shared histogram
    with histo_shared.get_lock():
        histo = to_np_array(histo_shared)
        histo += result

For pairwise data:

def calc_sync(part1, part2):
    output = []  # or a numpy array

    # the actual calculation is async
    for i in part1:
        for j in part2:
            # do whatever computation you need on (i, j) and append it to output
            ...

    result = calc_histo(output)

    with histo_shared.get_lock():
        histo = to_np_array(histo_shared)
        histo += result

And now:

p = Pool(initializer=init, initargs=(histo_shared,))
for i in range(1, it + 1, slice_size):
    for j in range(1, it + 1, slice_size):
        p.apply_async(calc_sync, [raw_data[j:j+slice_size],
                                  raw_data[i:i+slice_size]])

In words: we take pairwise cuts of the data, generate the relevant values, and then put them in a histogram. The only real synchronization you need is when you're combining results into the shared histogram.
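Putting the pieces together, a self-contained sketch of the lock-only-on-merge pattern might look like this (BINS, DMIN, DMAX, the chunk count, and the data are made-up stand-ins for the question's setup):

import numpy as np
from multiprocessing import Pool, Array

BINS, DMIN, DMAX = 100, 0.0, 1.0  # assumed histogram parameters

def init(shared):
    # hand each worker the shared histogram via a global
    global histo_shared
    histo_shared = shared

def calc_chunk(chunk):
    # pure computation happens with no lock held
    local = np.histogram(chunk, bins=BINS, range=(DMIN, DMAX))[0].astype(np.int64)
    # lock only for the brief merge into the shared array
    with histo_shared.get_lock():
        view = np.frombuffer(histo_shared.get_obj(), dtype=np.int64)
        view += local

if __name__ == '__main__':
    data = np.random.random(1_000_000)
    shared = Array('q', BINS)  # 'q' = signed 64-bit, zero-initialized
    with Pool(initializer=init, initargs=(shared,)) as pool:
        pool.map(calc_chunk, np.array_split(data, 8))
    total = np.frombuffer(shared.get_obj(), dtype=np.int64)
    assert total.sum() == data.size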

It takes longer to access shared memory than to load from file?

You're not using shared memory. That would be multiprocessing.Value, not multiprocessing.Manager().Value. You're storing the string in the manager's server process and sending pickles over a connection every time you access the value. Also, the server process is limited by its own GIL when serving requests.

I don't know how much each of those aspects contributes to the overhead, but it's overall more expensive than reading shared memory.
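A quick sketch of the distinction (the types and values here are illustrative):

from multiprocessing import Manager, Value

if __name__ == '__main__':
    # Manager().Value: the data lives in a separate server process; every
    # read or write is a pickle round-trip over a connection to that process.
    mgr = Manager()
    proxied = mgr.Value('i', 0)  # proxy object: slow per-access

    # multiprocessing.Value: the data lives in shared memory that child
    # processes map directly; access is a plain memory read/write
    # (wrapped in a lock by default).
    shared = Value('i', 0)

    proxied.value += 1  # round-trip to the manager process
    shared.value += 1   # direct memory access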

Writing to Shared Memory Making Copies?

Your actual data has 4 dimensions, but I'm going to work through a simpler 2D example.

Imagine you have this array (renderbuffer):

 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
16 17 18 19 20

Now imagine your startx/endx/starty/endy parameters select this slice in one process:

 8  9
13 14
18 19

The entire array is 4x5 elements at 8 bytes each, so 160 bytes. The "window" is 3x2 elements at 8 bytes each, so 48 bytes.

Your expectation seems to be that accessing this 48 byte slice would require 48 bytes of memory in this one process. But actually it requires closer to the full 160 bytes. Why?

The reason is that memory is mapped in pages, which are commonly 4096 bytes each. So when you access the first element (here, the number 8), you will map the entire page of 4096 bytes containing that element.

Memory mapping on many systems is guaranteed to start on a page boundary, so the first element in your array will be at the beginning of a page. So if your array is 4096 bytes or smaller, accessing any element of it will map the entire array into memory in each process.

In your actual use case, each element you access in the slice will result in mapping the entire page which contains it. Elements adjacent in memory (meaning either the first or the last index in the slice increments by one, depending on your array order) will usually reside in the same page, but elements which are adjacent in other dimensions will likely be in separate pages.

But take heart: the memory mappings are shared between processes, so if your entire array is 200 MB, even though each process will end up mapping most or all of it, the total memory usage is still 200 MB combined for all processes. Many memory measurement tools will report that each process uses 200 MB, which is sort of true but useless: they are sharing a 200 MB view of the same memory.
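To make the geometry concrete, here is a small sketch using multiprocessing.shared_memory (Python 3.8+; the array contents mirror the example above):

from multiprocessing import shared_memory
import numpy as np

# a 4x5 array of 8-byte integers backed by shared memory (160 bytes total)
shm = shared_memory.SharedMemory(create=True, size=4 * 5 * 8)
arr = np.ndarray((4, 5), dtype=np.int64, buffer=shm.buf)
arr[:] = np.arange(1, 21).reshape(4, 5)

window = arr[1:4, 2:4]  # the 3x2 "window": [[8, 9], [13, 14], [18, 19]]

# 'window' is only a view, but touching any element faults in the whole
# 4096-byte page containing it, so per-process mapped memory is counted
# in pages, not in sliced bytes.
print(window)

shm.close()
shm.unlink()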

Why are multiprocessing.sharedctypes assignments so slow?

This is slow for the reasons given in your second link, and the solution is actually pretty simple: bypass the (slow) RawArray slice assignment code. In this case it inefficiently reads one raw C value at a time from the source array to create a Python object, converts it straight back to raw C for storage in the shared array, discards the temporary Python object, and repeats 1e8 times.

But you don't need to do it that way; like most C-level things, RawArray implements the buffer protocol, which means you can convert it to a memoryview: a view of the underlying raw memory that implements most operations in C-like ways, using raw memory operations where possible. So instead of doing:

# assign memory, very slow
%time temp[:] = np.arange(1e8, dtype = np.uint16)
Wall time: 9.75 s # Updated to what my machine took, for valid comparison

use memoryview to manipulate it as a raw bytes-like object and assign that way (np.arange already implements the buffer protocol, and memoryview's slice assignment operator seamlessly uses it):

# C-like memcpy effectively, very fast
%time memoryview(temp)[:] = np.arange(1e8, dtype = np.uint16)
Wall time: 74.4 ms # Takes 0.76% of original time!!!

Note that the time for the latter is milliseconds, not seconds; copying via a memoryview wrapper that performs raw memory transfers takes less than 1% of the time of the plodding way RawArray does it by default!
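For reference, a self-contained version of the comparison outside IPython might look like this (N is reduced to 10**7 so the slow path finishes quickly; exact timings will vary by machine):

import ctypes
import time
import numpy as np
from multiprocessing.sharedctypes import RawArray

N = 10**7  # smaller than the question's 1e8 so the slow path stays bearable
temp = RawArray(ctypes.c_uint16, N)
src = np.arange(N, dtype=np.uint16)

t0 = time.perf_counter()
temp[:] = src                # element-by-element slice assignment (slow)
t1 = time.perf_counter()
memoryview(temp)[:] = src    # raw buffer-to-buffer copy (fast)
t2 = time.perf_counter()

print(f"RawArray slice assignment: {t1 - t0:.3f} s")
print(f"memoryview assignment:     {t2 - t1:.3f} s")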


