Writing to shared memory in Python is very slow
After more research I've found that Python actually creates folders in /tmp whose names start with pymp-. Although no files are visible inside them in a file viewer, it looks exactly like /tmp is used by Python for shared memory, and performance seems to drop when the file caches are flushed.
The working solution in the end was to mount /tmp as tmpfs:
sudo mount -t tmpfs tmpfs /tmp
And, if using a recent Docker, by passing the --tmpfs /tmp argument to the docker run command.
After doing this, read/write operations are done in RAM, and performance is fast and stable.
I still wonder why /tmp is used for shared memory rather than /dev/shm, which is already mounted as tmpfs and is supposed to be used for shared memory.
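The pymp- prefix comes from CPython's private multiprocessing.util.get_temp_dir() helper, which backs shared-memory arenas with files under tempfile.gettempdir() rather than /dev/shm. A small sketch (relying on that private, CPython-specific helper) to confirm where the backing files land:

```python
import tempfile
import multiprocessing.util

# multiprocessing backs its shared-memory arenas with files inside a
# per-process "pymp-<random>" directory created under tempfile.gettempdir()
# (normally /tmp), not under /dev/shm -- hence the tmpfs remount trick.
pymp_dir = multiprocessing.util.get_temp_dir()  # private, CPython-specific API
print(pymp_dir)  # e.g. /tmp/pymp-dg8cvth2

# tempfile honors the TMPDIR environment variable, so another option is
# pointing it at the existing tmpfs mount without remounting /tmp:
#   TMPDIR=/dev/shm python your_script.py
```

Either approach, remounting /tmp or setting TMPDIR, keeps the backing files in RAM.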
Why is communication via shared memory so much slower than via queues?
This is because multiprocessing.Array uses a lock by default to prevent multiple processes from accessing it at once:
multiprocessing.Array(typecode_or_type, size_or_initializer, *, lock=True)
If lock is True (the default) then a new lock object is created to synchronize access to the value. If lock is a Lock or RLock object then that will be used to synchronize access to the value. If lock is False then access to the returned object will not be automatically protected by a lock, so it will not necessarily be "process-safe".
This means you're not really concurrently writing to the array - only one process can access it at a time. Since your example workers are doing almost nothing but array writes, constantly waiting on this lock badly hurts performance. If you use lock=False when you create the array, the performance is much better:
With lock=True:
Now starting process 0
Now starting process 1
Now starting process 2
Now starting process 3
4000000 random numbers generated in 4.811205 seconds
With lock=False:
Now starting process 0
Now starting process 3
Now starting process 1
Now starting process 2
4000000 random numbers generated in 0.192473 seconds
Note that using lock=False means you need to manually protect access to the Array whenever you're doing something that isn't process-safe. Your example has each process writing to a unique region, so it's fine. But if you were trying to read from the array while that was happening, or had different processes writing to overlapping regions, you would need to manually acquire a lock.
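A minimal sketch of that pattern (the worker and names here are hypothetical, not from the question): an unsynchronized Array for the disjoint writes, plus an explicit Lock taken manually around any access that isn't process-safe:

```python
import multiprocessing as mp

def fill(arr, start, stop, value):
    # Each worker writes only to its own region, so with lock=False
    # these writes need no synchronization.
    for i in range(start, stop):
        arr[i] = value

if __name__ == "__main__":
    arr = mp.Array('i', 8, lock=False)   # raw shared array, no implicit lock
    guard = mp.Lock()                    # explicit lock for unsafe accesses

    workers = [mp.Process(target=fill, args=(arr, w * 4, (w + 1) * 4, w + 1))
               for w in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    # Reading while writers might still be active (or overlapping writes)
    # is not process-safe; protect it with the explicit lock.
    with guard:
        print(list(arr))   # [1, 1, 1, 1, 2, 2, 2, 2]
```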
Using numpy array in shared memory slow when synchronizing access
You are essentially locking out any parallelism you might otherwise get, because a lock is held on your data the entire time you are processing it.
When this method
def calc_sync(i):
    with histo_shared.get_lock():
        calc_histo(i)
is executing, you hold a lock on the entire shared dataset for the whole histogram computation. Notice also that
def calc_histo(i):
    # create new array 'd_new' by doing some math on DATA using argument i
    histo = to_np_array(histo_shared)
    histo += np.histogram(d_new, bins=BINS,
                          range=(DMIN, DMAX))[0].astype(np.int32)
isn't doing anything with i, so it just looks like you're processing the same data over and over. What is d_new? I don't see it in your listing.
Ideally, what you should be doing is taking your large dataset, slicing it into some number of chunks, processing each chunk individually, and then combining the results. Only lock the shared data, not the processing steps. That might look something like this:
def calc_histo(slice):
    # process the slice asynchronously
    return np.histogram(slice, bins=BINS,
                        range=(DMIN, DMAX))[0].astype(np.int32)

def calc_sync(start, stop):
    # grab a chunk of data; you likely don't need to lock this read
    chunk = raw_data[start:stop]
    # the actual calculation runs without the lock
    result = calc_histo(chunk)
    with histo_shared.get_lock():
        histo_shared += result
For pairwise data:
def calc_sync(part1, part2):
    output = []  # or a numpy array
    # the actual calculation runs without the lock
    for i in range(part1):
        for j in range(part2):
            # do whatever computation you need and add it to output
            ...
    result = calc_histo(output)
    with histo_shared.get_lock():
        histo_shared += result
And now:
p = Pool(initializer=init, initargs=(histo_shared,))
for i in range(1, it + 1, slice_size):
    for j in range(1, it + 1, slice_size):
        p.apply_async(calc_sync, [histo_shared[j:j + slice_size],
                                  histo_shared[i:i + slice_size]])
In words, we take pairwise cuts of the data, generate the relevant values, and then put them in a histogram. The only real synchronization you need is when you're combining data into the shared histogram.
It takes longer to access shared memory than to load from file?
You're not using shared memory. That would be multiprocessing.Value, not multiprocessing.Manager().Value. You're storing the string in the manager's server process and sending pickles over a connection to access the value. Also, the server process is limited by its own GIL when serving requests.
I don't know how much each of those aspects contributes to the overhead, but it's overall more expensive than reading shared memory.
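The difference is easy to see side by side; this sketch assumes nothing from the original question:

```python
import multiprocessing as mp

# multiprocessing.Value lives in real shared memory: every process
# reads and writes the same bytes directly.
shared = mp.Value('i', 0)
shared.value = 42

# Manager().Value lives inside a separate server process; every access
# is a pickled request/response over a pipe or socket, serialized by
# that server's own GIL.
with mp.Manager() as manager:
    proxied = manager.Value('i', 0)
    proxied.value = 42
    print(type(proxied))   # a proxy object, not shared memory
```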
Writing to Shared Memory Making Copies?
Your actual data has 4 dimensions, but I'm going to work through a simpler 2D example.
Imagine you have this array (renderbuffer):
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
Now imagine your startx/endx/starty/endy parameters select this slice in one process:
8 9
13 14
18 19
The entire array is 4x5 times 8 bytes, so 160 bytes. The "window" is 3x2 times 8 bytes, so 48 bytes.
Your expectation seems to be that accessing this 48 byte slice would require 48 bytes of memory in this one process. But actually it requires closer to the full 160 bytes. Why?
The reason is that memory is mapped in pages, which are commonly 4096 bytes each. So when you access the first element (here, the number 8), you will map the entire page of 4096 bytes containing that element.
Memory mapping on many systems is guaranteed to start on a page boundary, so the first element in your array will be at the beginning of a page. So if your array is 4096 bytes or smaller, accessing any element of it will map the entire array into memory in each process.
In your actual use case, each element you access in the slice will result in mapping the entire page which contains it. Elements adjacent in memory (meaning either the first or the last index in the slice increments by one, depending on your array order
) will usually reside in the same page, but elements which are adjacent in other dimensions will likely be in separate pages.
But take heart: the memory mappings are shared between processes, so if your entire array is 200 MB, even though each process will end up mapping most or all of it, the total memory usage is still 200 MB combined for all processes. Many memory measurement tools will report that each process uses 200 MB, which is sort of true but useless: they are sharing a 200 MB view of the same memory.
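The numbers in this example can be reproduced with Python 3.8+'s multiprocessing.shared_memory; this is a sketch that assumes renderbuffer is a NumPy view over the shared block, which the question's 4D case would be as well:

```python
import mmap
import numpy as np
from multiprocessing import shared_memory   # Python 3.8+

print(mmap.PAGESIZE)   # mapping granularity, typically 4096 bytes

shm = shared_memory.SharedMemory(create=True, size=4 * 5 * 8)  # 160 bytes
renderbuffer = np.ndarray((4, 5), dtype=np.int64, buffer=shm.buf)
renderbuffer[:] = np.arange(1, 21).reshape(4, 5)

# The "window" is a zero-copy view: slicing allocates nothing, but
# touching any element faults in the whole page containing it.
window = renderbuffer[1:4, 2:4]
print(window.tolist())   # [[8, 9], [13, 14], [18, 19]]
assert np.shares_memory(window, renderbuffer)

del window, renderbuffer   # release the views before closing the mapping
shm.close()
shm.unlink()
```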
Why are multiprocessing.sharedctypes assignments so slow?
This is slow for the reasons given in your second link, and the solution is actually pretty simple: bypass the (slow) RawArray slice assignment code, which in this case inefficiently reads one raw C value at a time from the source array to create a Python object, converts it straight back to raw C for storage in the shared array, discards the temporary Python object, and repeats 1e8 times.
But you don't need to do it that way; like most C-level things, RawArray implements the buffer protocol, which means you can convert it to a memoryview, a view of the underlying raw memory that implements most operations in C-like ways, using raw memory operations where possible. So instead of doing:
# assign memory, very slow
%time temp[:] = np.arange(1e8, dtype = np.uint16)
Wall time: 9.75 s # Updated to what my machine took, for valid comparison
use memoryview to manipulate it as a raw bytes-like object and assign that way (np.arange already implements the buffer protocol, and memoryview's slice-assignment operator seamlessly uses it):
# C-like memcpy effectively, very fast
%time memoryview(temp)[:] = np.arange(1e8, dtype = np.uint16)
Wall time: 74.4 ms # Takes 0.76% of original time!!!
Note that the time for the latter is in milliseconds, not seconds; copying via a memoryview wrapper to perform raw memory transfers takes less than 1% of the time RawArray needs to do it the plodding way it does by default!
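A self-contained version of the fast path, scaled down from the question's 1e8 elements; the byte-casts are an assumption added here to sidestep format-string mismatches between the ctypes and NumPy buffer exports:

```python
import numpy as np
from multiprocessing.sharedctypes import RawArray

N = 50_000                  # the question used 1e8; same pattern, smaller
temp = RawArray('H', N)     # 'H' = C unsigned short, matches np.uint16
src = np.arange(N, dtype=np.uint16)

# Slow path, one Python object boxed and unboxed per element:
#     temp[:] = src
# Fast path: both sides implement the buffer protocol, so a memoryview
# slice assignment collapses into a single raw memory copy. Casting both
# views to 'B' (plain bytes) avoids format mismatches ('<H' vs 'H').
memoryview(temp).cast('B')[:] = memoryview(src).cast('B')

print(temp[12345])   # 12345: the data landed in the shared array
```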