Fastest Way to Grow a Numpy Numeric Array

I tried a few different things, with timing.

import numpy as np
  1. The method you mention as slow: (32.094 seconds)

    class A:
        def __init__(self):
            self.data = np.array([])

        def update(self, row):
            self.data = np.append(self.data, row)

        def finalize(self):
            return np.reshape(self.data, newshape=(self.data.shape[0] // 5, 5))
  2. Regular ol' Python list: (0.308 seconds)

    class B:
        def __init__(self):
            self.data = []

        def update(self, row):
            for r in row:
                self.data.append(r)

        def finalize(self):
            return np.reshape(self.data, newshape=(len(self.data) // 5, 5))
  3. Trying to implement an arraylist in numpy: (0.362 seconds)

    class C:
        def __init__(self):
            self.data = np.zeros((100,))
            self.capacity = 100
            self.size = 0

        def update(self, row):
            for r in row:
                self.add(r)

        def add(self, x):
            if self.size == self.capacity:
                self.capacity *= 4
                newdata = np.zeros((self.capacity,))
                newdata[:self.size] = self.data
                self.data = newdata

            self.data[self.size] = x
            self.size += 1

        def finalize(self):
            data = self.data[:self.size]
            return np.reshape(data, newshape=(len(data) // 5, 5))

And this is how I timed it:

x = C()
for i in range(100000):
    x.update([i])

So it looks like regular old Python lists are pretty good ;)

Optimal way to append to numpy array

Appending to numpy arrays is very inefficient. Because a numpy array has a fixed size, every call to np.append has to allocate memory for an entirely new array and copy all of the existing data into it. Depending on the application, there are much better strategies.

If you know the length in advance, it is best to pre-allocate the array using a function like np.ones, np.zeros, or np.empty.

desired_length = 500
results = np.empty(desired_length)
for i in range(desired_length):
    results[i] = i**2

If you don't know the length, it's probably more efficient to keep your results in a regular list and convert it to an array afterwards.

results = []
while condition:
    a = do_stuff()
    results.append(a)
results = np.array(results)

Here are some timings on my computer.

def pre_allocate():
    results = np.empty(5000)
    for i in range(5000):
        results[i] = i**2
    return results

def list_append():
    results = []
    for i in range(5000):
        results.append(i**2)
    return np.array(results)

def numpy_append():
    results = np.array([])
    for i in range(5000):
        results = np.append(results, i**2)
    return results

%timeit pre_allocate()
# 100 loops, best of 3: 2.42 ms per loop

%timeit list_append()
# 100 loops, best of 3: 2.5 ms per loop

%timeit numpy_append()
# 10 loops, best of 3: 48.4 ms per loop

So you can see that both pre-allocating and building a list then converting are much faster than repeatedly calling np.append.

Fastest way to concatenate large numpy arrays

Every time you append a new array, new memory has to be allocated for a bigger array and all of the existing data copied into it. This is very expensive. A better solution is to collect the pieces first and then record your data with a single call to np.concatenate:

np.concatenate([np.zeros(arraySize) for i in range(100)])
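
If the pieces really do arrive one at a time, an equivalent sketch is to pre-allocate the full result once and fill it slice by slice (arraySize and the chunk count here are just placeholders mirroring the line above):

import numpy as np

arraySize = 1000
n_chunks = 100

# allocate the final array once, then copy each chunk into its slot
result = np.empty(n_chunks * arraySize)
for i in range(n_chunks):
    chunk = np.zeros(arraySize)  # stand-in for whatever produces each piece
    result[i * arraySize:(i + 1) * arraySize] = chunk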

Fastest way to create 2D numpy array which starts at 0, and increases by 1 across the rows, and continues into the columns?

You can use

np.arange(nrows*ncols).reshape(nrows,ncols)
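
For example, with nrows = 3 and ncols = 4, just to show the shape of the output:

>>> import numpy as np
>>> np.arange(3 * 4).reshape(3, 4)
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])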

Incidentally, this is how 90% of example 2D arrays are created in SO numpy posts.

Building a Numpy array by appending data (without knowing the full size in advance)

Use a native list of numpy arrays, then np.concatenate.

The native list grows by a factor of roughly 1.125 when it needs more room, so not many reallocations will occur; moreover, it only holds pointers to the scattered (non-contiguous in memory) arrays that hold the actual data.

Calling concatenate only once will solve your problem.

Pseudocode

dataset = []
for f in glob.glob('*.png'):
    x = read_as_numpyarray(f) # custom function; x is a matrix of shape (n, 1000)
    dataset.append(x)

dataset_np = np.concatenate(dataset)

Notice that np.vstack internally uses np.concatenate.
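
A quick sanity check of that equivalence on two small blocks (the shapes here are arbitrary):

>>> import numpy as np
>>> parts = [np.ones((2, 3)), np.zeros((1, 3))]
>>> np.array_equal(np.vstack(parts), np.concatenate(parts, axis=0))
True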



Edit to address the edited question:

Let's say the total size of data is 20 GB. When concatenating, the
system will have to still keep 20 GB (for each individual array) and
also allocate 20 GB for the new concatenated array, thus requiring 40
GB of RAM (double of the dataset). How to do this without requiring
the double of RAM? (Example: is there a solution if we only have 32 GB
of RAM?)

I would attack this problem first by doing the same as proposed in this current answer, in phases: dataset_np1 = np.concatenate(half1_of_data) followed by dataset_np2 = np.concatenate(half2_of_data) only needs 150% RAM (not 200%). This can be extended recursively, at the expense of speed, up to the limit at which it becomes the proposition in the question. I can only assume the likes of dask can handle this better, but I haven't tested it myself.

Just to clarify: after you have dataset_np1 you no longer need the list of all the sharded small arrays and can free it. Only then do you start loading the other half. Thus, you only ever need to hold an extra 50% of the data in memory.

Pseudocode:


def load_half(buffer: np.ndarray, shard_path: str, shard_ind: int):
    half_dataset = []
    for f in glob.glob(f'{shard_path}/*.png'):
        x = read_as_numpyarray(f)  # custom function; x is a matrix of shape (n, 1000)
        half_dataset.append(x)

    half_dataset_np = np.concatenate(half_dataset)  # see comment *
    half_len = buffer.shape[0] // 2
    buffer[half_len * shard_ind : half_len * (shard_ind + 1), ...] = half_dataset_np

half1_path = r"half1"  # preprocess the shards to be found by glob or otherwise
half2_path = r"half2"
assert os.path.isdir(half1_path)
assert os.path.isdir(half2_path)

buffer = np.zeros(size_shape)  # size_shape is the shape of the full dataset
load_half(buffer, half1_path, 0)  # only 50% of data temporarily loaded, then freed [can be done manually if needed]
load_half(buffer, half2_path, 1)  # only 50% of data temporarily loaded, then freed

One could (easily, or not so easily) generalize this to quarters, eighths, or recursively any required fraction to reduce memory costs at the expense of speed, with the limit at infinity being the original proposition in the question.

  • Important comment (regarding the "see comment *" marker in the code above):

    One might notice that half_dataset_np = np.concatenate(half_dataset)
    actually allocates 50% of the dataset, with the other 50% still
    allocated in shards, apparently saving us nothing. That is correct,
    and I could not find a way to concatenate directly into a buffer.
    However, implementing this recursively as suggested (and not shown in
    the pseudocode above) will save memory, since with quarters only
    2 × 25% of the data is in flight at any time. This is just an
    implementation detail, but I hope the gist is clear; a generalized
    sketch follows below.
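
A minimal sketch of that generalization to N shards (the function name, the shard directory layout, and the assumption that every shard holds the same number of rows are all mine; read_as_numpyarray and size_shape are the same placeholders as above):

import glob
import numpy as np

def load_shard_into(buffer: np.ndarray, shard_path: str, shard_ind: int, n_shards: int):
    # concatenate one shard's files and copy them into the matching slice of the buffer
    shard = [read_as_numpyarray(f) for f in glob.glob(f'{shard_path}/*.png')]
    shard_np = np.concatenate(shard)  # temporarily holds ~1/n_shards of the data, twice
    rows_per_shard = buffer.shape[0] // n_shards
    start = rows_per_shard * shard_ind
    buffer[start:start + shard_np.shape[0], ...] = shard_np

n_shards = 4  # quarters; raise this to trade speed for memory
buffer = np.zeros(size_shape)  # size_shape is the shape of the full dataset, as above
for i in range(n_shards):
    load_shard_into(buffer, f'shard{i}', i, n_shards)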

On a different note, another approach would ask: what if the dataset is 1000 GB? Then no in-memory numpy array will do. This is why databases and out-of-core tools exist, and they can be queried quite efficiently. But again, this is somewhat a research question, and depends heavily on your specific needs. As a very uninformed hunch, I would check out dask.

Such libraries obviously tackle problems like this one as a subset of what they do, and I recommend not implementing these things yourself, as the total time you will spend will far outweigh the time spent choosing and learning a library.


On another note, I wonder whether this really has to be such a huge array; maybe a slightly different design or formulation of the problem could spare us this technical issue altogether.

Good ways to expand a numpy ndarray?

There are the index tricks np.r_ and np.c_.

>>> import numpy as np
>>> a = np.array([[1, 2], [3, 4]])
>>> z = np.zeros((2, 3), dtype=a.dtype)
>>> np.c_[a, z]
array([[1, 2, 0, 0, 0],
       [3, 4, 0, 0, 0]])

If this is performance critical code, you might prefer to use the equivalent np.concatenate rather than the index tricks.

>>> np.concatenate((a, z), axis=1)
array([[1, 2, 0, 0, 0],
       [3, 4, 0, 0, 0]])

There are also np.resize and np.ndarray.resize, but they have some limitations (due to the way numpy lays out data in memory), so read their docstrings. You will probably find that simply concatenating is better.
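
For instance, continuing with the a defined above, the two behave quite differently when enlarging:

>>> np.resize(a, (2, 5))              # returns a new array, padded by repeating the data
array([[1, 2, 3, 4, 1],
       [2, 3, 4, 1, 2]])
>>> b = a.copy()
>>> b.resize((2, 5), refcheck=False)  # resizes in place; new entries are zero-filled
>>> b
array([[1, 2, 3, 4, 0],
       [0, 0, 0, 0, 0]])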

By the way, when I've needed to do this I usually just do it the basic way you've already mentioned (create an array of zeros and assign the smaller array inside it); I don't see anything wrong with that!
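
For completeness, a minimal sketch of that basic approach, padding the a from above out to shape (2, 5):

>>> expanded = np.zeros((2, 5), dtype=a.dtype)
>>> expanded[:a.shape[0], :a.shape[1]] = a
>>> expanded
array([[1, 2, 0, 0, 0],
       [3, 4, 0, 0, 0]])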

Fastest way to left-cycle a numpy array (like pop, push for a queue)

After some experiments, it is clear that:

  • copying is required,
  • and the fastest and simplest way to do that for numpy arrays is slicing and copying.

So the solution is: x[:-1] = x[1:]; x[-1] = newvalue.

Here is a small benchmark:

>>> x = np.random.randint(0, 1e6, 10**8); newvalue = -100
>>> %timeit x[:-1] = x[1:]; x[-1] = newvalue
1000 loops, best of 3: 73.6 ms per loop
>>> %timeit np.concatenate((x[1:], np.array(newvalue).reshape(1,)), axis=0)
1 loop, best of 3: 339 ms per loop

But if you don't need fast access to all values in the array, only to the first or last ones, using a deque is smarter.
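
For reference, a minimal sketch of that deque alternative (the maxlen of 4 is arbitrary); with maxlen set, appending on the right drops the oldest element on the left in O(1):

>>> from collections import deque
>>> q = deque([1, 2, 3, 4], maxlen=4)
>>> q.append(-100)  # left-cycles: the 1 falls off the front
>>> q
deque([2, 3, 4, -100], maxlen=4)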


