Working with Big Data in Python and Numpy, Not Enough Ram, How to Save Partial Results on Disc

Working with big data in python and numpy, not enough ram, how to save partial results on disc?

Using numpy.memmap you create arrays directly mapped into a file:

import numpy
a = numpy.memmap('test.mymemmap', dtype='float32', mode='w+', shape=(200000,1000))
# here you will see a 762MB file created in your working directory

You can treat it as a conventional array:
a += 1000.

It is even possible to map more than one array onto the same file, controlling it from different sources if needed. But I've experienced some tricky things here. To open the full array again you have to "close" the previous one first, using del:

del a    
b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(200000,1000))

But opening only part of the array makes it possible to achieve simultaneous control:

b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(2,1000))
b[1,5] = 123456.
print(a[1,5])
#123456.0

Great! a was changed together with b. And the changes are already written on disk.
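Strictly speaking the writes go through the operating system's page cache, so they may not reach the physical disk immediately; memmap has a flush() method to force them out at a given point (deleting the object flushes as well):

b.flush()   # push any pending changes in b out to test.mymemmap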

The other important thing worth mentioning is the offset. Suppose you want to take not the first 2 rows of b, but rows 150000 and 150001.

b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(2,1000),
                 offset=150000*1000*32//8)
b[1,2] = 999999.
print(a[150001,2])
#999999.0

Now you can access and update any part of the array in simultaneous operations. Note the byte-size going in the offset calculation. So for a 'float64' this example would be 150000*1000*64/8.
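For example, a minimal sketch of the float64 case (assuming a hypothetical file test64.mymemmap created with dtype='float64'); computing the offset from the dtype's itemsize avoids getting the byte size wrong:

b = numpy.memmap('test64.mymemmap', dtype='float64', mode='r+', shape=(2,1000),
                 offset=150000 * 1000 * numpy.dtype('float64').itemsize)  # 8 bytes per element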

Other references:

  • Is it possible to map a discontiuous data on disk to an array with python?

  • The numpy.memmap documentation.

Building a Numpy array by appending data (without knowing the full size in advance)

Use a native list of numpy arrays, then np.concatenate.

The native list will grow (by a factor of ~1.125) when needed, so not too many reallocations will occur; moreover, it only holds pointers to the scattered (non-contiguous in memory) np.arrays that hold the actual data.

Calling concatenate only once will solve your problem.

Pseudocode

import glob
import numpy as np

dataset = []
for f in glob.glob('*.png'):
    x = read_as_numpyarray(f)  # custom function; x is a matrix of shape (n, 1000)
    dataset.append(x)

dataset_np = np.concatenate(dataset)

Notice that np.vstack internally uses concatenate.
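In other words, for these row-wise blocks the two calls below are equivalent:

dataset_np = np.concatenate(dataset, axis=0)   # explicit axis
dataset_np = np.vstack(dataset)                # delegates to concatenate along axis 0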



Edit to address the edited question:

Let's say the total size of data is 20 GB. When concatenating, the
system will have to still keep 20 GB (for each individual array) and
also allocate 20 GB for the new concatenated array, thus requiring 40
GB of RAM (double of the dataset). How to do this without requiring
the double of RAM? (Example: is there a solution if we only have 32 GB
of RAM?)

I would attack this problem in phases, doing the same as proposed in the current answer: dataset_np1 = np.concatenate(half1_of_data) followed by dataset_np2 = np.concatenate(half2_of_data) only needs 150% RAM (not 200%). This can be extended recursively, at the expense of speed, until in the limit it becomes the proposition in the question. I can only assume the likes of dask can handle this better, but I haven't tested it myself.

Just to clarify: after you have dataset_np1 you no longer need the list of all the sharded small arrays, and can free it. Only then do you start loading the other half. Thus, you only ever need to hold an extra 50% of the data in memory.

Pseudocode:


import glob
import os
import numpy as np

def load_half(buffer: np.ndarray, shard_path: str, shard_ind: int):
    half_dataset = []
    for f in glob.glob(f'{shard_path}/*.png'):
        x = read_as_numpyarray(f)  # custom function; x is a matrix of shape (n, 1000)
        half_dataset.append(x)

    half_dataset_np = np.concatenate(half_dataset)  # see comment *
    half = buffer.shape[0] // 2
    buffer[shard_ind * half:(shard_ind + 1) * half, ...] = half_dataset_np

half1_path = r"half1"  # preprocess the shards to be found by glob or otherwise
half2_path = r"half2"
assert os.path.isdir(half1_path)
assert os.path.isdir(half2_path)

buffer = np.zeros(size_shape)  # size_shape: shape of the full dataset, known in advance
load_half(buffer, half1_path, 0)  # only 50% of data temporarily loaded, then freed [can be done manually if needed]
load_half(buffer, half2_path, 1)  # only 50% of data temporarily loaded, then freed

One could (easily, or not so easily) generalize this to quarters, eighths, or recursively any required fraction to reduce memory costs at the expense of speed, with the limit at infinity being the original proposition in the question.
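A sketch of that generalization to n shards, under the same assumptions as the pseudocode above (read_as_numpyarray is still the custom reader, and the total row count is assumed to divide evenly by the number of shards):

def load_shards(buffer: np.ndarray, shard_paths: list):
    rows_per_shard = buffer.shape[0] // len(shard_paths)
    for i, path in enumerate(shard_paths):
        shard = [read_as_numpyarray(f) for f in glob.glob(f'{path}/*.png')]
        chunk = np.concatenate(shard)   # only ~1/n of the data is alive at once
        buffer[i * rows_per_shard:(i + 1) * rows_per_shard, ...] = chunk
        del shard, chunk                # free this shard before loading the next one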

  • Important comment (see comment * in the code):

    One might notice half_dataset_np = np.concatenate(half_dataset)
    actually allocates 50% of the dataset, with the other 50% allocated
    in shards, apparently saving us nothing. That is correct, and I could
    not find a way to concat into a buffer. However, implementing this
    recursively as suggested (and not shown in pseudocode) will save
    memory, as a quarter will only use 2* 25% every time. This is just an
    implementation detail, but I hope the gist is clear.
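For what it's worth, newer NumPy versions do accept an out= argument to np.concatenate, which, if the shapes and dtypes line up exactly, may let you skip the intermediate half_dataset_np entirely; untested here, but roughly:

half = buffer.shape[0] // 2
np.concatenate(half_dataset, out=buffer[shard_ind * half:(shard_ind + 1) * half, ...])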

On a different note, another approach would ask: what if the dataset is 1000 GB? Then no in-memory numpy array will do. This is why databases exist, and they can be queried quite efficiently with the right tools. But again, this is somewhat of a research question and depends heavily on your specific needs. As a very uninformed hunch, I would check out dask (a small sketch follows below).

Such libraries will obviously tackle problems like this one as a subset of what they do, and I recommend not implementing these things yourself, as the total time you will spend will far outweigh the time spent choosing and learning a library.
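Just for orientation, a minimal sketch of what the dask.array route could look like, with the same hypothetical read_as_numpyarray helper and a hypothetical rows_in_file helper (dask needs the chunk shapes up front):

import glob
import dask
import dask.array as da

lazy_chunks = [
    da.from_delayed(dask.delayed(read_as_numpyarray)(f),
                    shape=(rows_in_file(f), 1000),   # rows_in_file: hypothetical helper
                    dtype='float32')
    for f in sorted(glob.glob('*.png'))
]
dataset = da.concatenate(lazy_chunks, axis=0)   # lazy: nothing is read yet
mean = dataset.mean(axis=0).compute()           # files are loaded chunk by chunk on demand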


On another different note, I wonder whether this really has to be such a huge array, and whether a slightly different design or formulation of the problem could spare us this technical issue altogether.

Memory error regarding operations on numpy arrays

No mystery, these are really large arrays. At 64-bit precision, an array of shape (15239,329960) needs...

>>> np.prod((15239,329960)) * 8 / 2**30
37.46345967054367

...about 37GiB! Things to try:

  • Reduce the bit-depth, e.g. use np.float16, requiring 25% of the memory (see the sketch after this list).
  • Is the data actually dense, or can you use scipy.sparse?
  • Maybe it's time for dask?
  • Get more RAM!
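A quick sketch of the first two suggestions, assuming a is the big float64 array (shrunk to a toy shape here so the example actually fits in memory):

import numpy as np
import scipy.sparse as sp

a = np.zeros((15239, 330), dtype=np.float64)   # toy stand-in for the (15239, 329960) array

a16 = a.astype(np.float16)                     # 2 bytes/element instead of 8 -> ~25% of the memory
a_sparse = sp.csr_matrix(a)                    # only worthwhile if most entries are zero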

Python: numpy memmap in parallel

Q : "Does it make sense to use numpy's memmap across multiple cores (MPI)?"

Yes ( ... even without MPI, using just Python native { thread- | process-}-based forms of concurrent-processing )

Q : "Can I create a separate memmap-object on each core, and use it to read different slices from the file?"

Yes.

Q : "What about writing to it?"

The same ( sure, if having been opened in write-able mode ... )
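A minimal sketch of the process-based version (no MPI), reusing the test.mymemmap file from the first answer; each worker opens its own memmap and reads/writes only its own block of rows, so no two workers touch the same region:

import numpy as np
from multiprocessing import Pool

SHAPE = (200000, 1000)
N_BLOCKS = 4

def process_block(block_ind):
    rows = SHAPE[0] // N_BLOCKS
    start = block_ind * rows
    m = np.memmap('test.mymemmap', dtype='float32', mode='r+', shape=SHAPE)
    m[start:start + rows] += 1.0   # read and update only this worker's slice
    m.flush()                      # push this worker's changes to the file
    return block_ind

if __name__ == '__main__':
    with Pool(N_BLOCKS) as pool:
        print(pool.map(process_block, range(N_BLOCKS)))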

Numpy memory error creating huge matrix

Assuming each floating-point number takes 4 bytes, you'd have

(10000000000 * 4) / (2**30.0) = 37.25290298461914

or about 37.3 GiB (40 GB) that you need to store in memory. So I don't think 24 GB of RAM is enough.


