Concatenate Numpy Arrays Without Copying

Concatenate Numpy arrays without copying

The memory belonging to a Numpy array must be contiguous. If you allocated the arrays separately, they are randomly scattered in memory, and there is no way to represent them as a view Numpy array.

If you know beforehand how many arrays you need, you can instead start with one big array that you allocate beforehand, and have each of the small arrays be a view to the big array (e.g. obtained by slicing).

How to combine two huge numpy arrays without concat, stack, or append?

Numpy numpy.memmap() allows for the creation of memory mapped data stored as a binary on disk that can be accessed and interfaced with as if it were a single array. This solution saves the individual arrays you are working with as separate .npy files and then combines them into a single binary file.

import numpy as np
import os

size = (7,960000,200)

# We are assuming arrays a and b share the same shape, if they do not 
# see https://stackoverflow.com/questions/50746704/how-to-merge-very-large-numpy-arrays
# for an explanation on how to create the new shape

a = np.ones(size) # uses ~16 GB RAM
a = np.transpose(a, (1,0,2))
shape = a.shape
shape[0] *= 2
dtype = a.dtype

np.save('a.npy', a)
a = None # allows for data to be deallocated by garbage collector

b = np.ones(size) # uses ~16 GB RAM
b = np.transpose(b, (1,0,2))
np.save('b.npy', a)
b = None

# Once the size is know create memmap and write chunks
data_files = ['a.npy', 'b.npy']
merged = np.memmap('merged.dat', dtype=dtype, mode='w+', shape=shape)
i = 0
for file in data_files:
    chunk = np.load(file, allow_pickle=True)
    merged[i:i+len(chunk)] = chunk
    i += len(chunk)

merged = np.transpose(merged, (1,0,2))

# Delete temporary numpy .npy files
os.remove('a.npy')
os.remove('b.npy')

Based on: this stackoverflow answer
also check out hdf5 and combining two hdf5 files here. It's another good way of storing large datasets

Combining two views of same numpy array into single view without copying the array?

Since memory-views can only be created using a fixed set of strides, you will have to create a copy in your case, where mat.shape[0] > j > i.

That means views will only work, if you want to have a view to every x-th element in the array:

mat = np.arange(20)
view = mat[slice(0, 20, 4)]
view
# Out[41]: array([ 0,  4,  8, 12, 16])

So this only works for views to equally spaced cells. But if you want to have a view to one contiguous slice(0, i) and another contiguous slice(j, mat.shape[0]), it won't work. You'll have to make a copy.

Concatenate Numpy arrays with least memory

Your problem is that you need to have 2 copies of the same data in memory.
If you build the array as in test1 you'll need far less memory at once, but at the cost of losing the dictionary.

import numpy as np
import time    

def test1(n):
    a = {x:(x, x, x) for x in range(n)} # Build sample data
    b = np.array([a.pop(i) for i in range(n)]).reshape(-1)
    return b

def test2(n):
    a = {x:(x, x, x) for x in range(n)} # Build sample data
    b = np.concatenate(list(a.values()))
    return b

x1 = test1(1000000)
del x1

time.sleep(1)

x2 = test2(1000000)

Results:

Sample Image

test1 : 0.71 s
test2 : 1.39 s

The first peek is for test1, it's not exactly in place but it reduces the memory usage quite a bit.

numpy - Append to array without making a copy

Based on your updated question, it looks like you can handily solve the problem by keeping a dictionary of numpy arrays:

x = np.array([])
y = np.array([])
Arrays = {"x": x, "y": y}

with open("./data.txt", "r") as f:
    for line in f:
        if re.match('x values', line):
            print "reading x values"
            key = "x"
        elif re.match('y', line):
            print "reading y values"
            key = "y"
        else:
            values = re.match("^\s+((?:[0-9.E+-]+\s*)*)", line)
            if values:
                Arrays[key] = np.append(Arrays[key], values.groups()[0].split())

As Sven Marnach points out in comments both here and your question, this is an inefficient use of numpy arrays.

A better approach (again, as Sven points out) would be:

Arrays = {"x": [], "y": []}

with open("./data.txt", "r") as f:
    for line in f:
        if re.match('x values', line):
            print "reading x values"
            key = "x"
        elif re.match('y', line):
            print "reading y values"
            key = "y"
        else:
            values = re.match("^\s+((?:[0-9.E+-]+\s*)*)", line)
            if values:
                Arrays[key].append(values.groups()[0].split())

Arrays = {key: np.array(Arrays[key]) for key in Arrays}

Concatenating views in NumPy

You can only concatenate views if they are contiguous in terms of dtypes, strides and offsets. Here is one way to check. This way is likely incomplete, but it illustrates the gist of it. Basically, if the views share a base, and the strides and offsets are aligned so that they are on the same grid, you can concatenate.

In the spirit of TDD, I will work with the following example:

x = np.arange(24).reshape(4, 6)

We (or at least I) want the following to be concatenatable:

a, b = x[:, :4], x[:, 4:]        # Basic case
a, b = x[:, :4:2], x[:, 4::2]    # Strided
a, b = x[:, :4:2], x[:, 2::2]    # Strided overlapping
a, b = x[1:2, 1:4], x[2:4, 1:4]  # Stacked

# Completely reshaped:
a, b = x.ravel()[:12].reshape(3, 4), x.ravel()[12:].reshape(3, 4)
# Equivalent to
a, b = x[:2, :].reshape(3, 4), x[2:, :].reshape(3, 4)

We do not want the following to be concatenatable:

a, b = x, np.arange(12).reshape(2, 6)   # Buffer mismatch
a, b = x[0, :].view(np.uint), x[1:, :]  # Dtype mismatch
a, b = x[:, ::2], x[:, ::3]             # Stride mismatch
a, b = x[:, :4], x[:, 4::2]             # Stride mismatch
a, b = x[:, :3], x[:, 4:]               # Overlap mismatch
a, b = x[:, :4:2], x[:, 3::2]           # Overlap mismatch
a, b = x[:-1, :-1], x[1:, 1:]           # Overlap mismatch
a, b = x[:-1, :4], x[:, 4:]             # Shape mismatch

The following could be interpreted as concatenatable, but won't be in this case:

a, b = x, x[1:-1, 1:-1]

The idea is that everything (dtype, strides, offsets) has to match exactly. Only one axis offset is allowed to be different between the views, as long as it is no more than one stride away from the edge of the other view. The only possible exception is when one view is fully contained in another, but we will ignore this scenario here. Generalizing to multiple dimensions should be pretty simple if we use array operations on the offsets and strides.

def cat_slices(a, b):
    if a.base is not b.base:
        raise ValueError('Buffer mismatch')
    if a.dtype != b.dtype:  # I don't thing you can use `is` here in general
        raise ValueError('Dtype mismatch')

    sa = np.array(a.strides)
    sb = np.array(b.strides)

    if (sa != sb).any():
        raise ValueError('Stride mismatch')

    oa = np.byte_bounds(a)[0]
    ob = np.byte_bounds(b)[0]

    if oa > ob:
        a, b = b, a
        oa, ob = ob, oa

    offset = ob - oa

    # Check if you can get to `b` from a by moving along exactly one axis
    # This part works consistently for arrays with internal overlap
    div = np.zeros_like(sa)
    mod = np.ones_like(sa)  # Use ones to auto-flag divide-by zero
    np.divmod(offset, sa, where=sa.astype(bool), out=(div, mod))

    zeros = np.flatnonzero((mod == 0) & (div >= 0) & (div <= a.shape))

    if not zeros.size:
        raise ValueError('Overlap mismatch')

    axis = zeros[0]

    check_shape = np.equal(a.shape, b.shape)
    check_shape[axis] = True

    if not check_shape.all():
        raise ValueError('Shape mismatch')

    shape = list(a.shape)
    shape[axis] = b.shape[axis] + div[axis]

    start = np.byte_bounds(a)[0] - np.byte_bounds(a.base)[0]

    return np.ndarray(shape, dtype=a.dtype, buffer=a.base, offset=start, strides=a.strides)

Some things that this function does not handle:

Merging flags
Broadcasting
Handling arrays that are fully contained within each other but with multi-axis offsets
Negative strides

You can, however, check that it returns the expected views (and errors) for all the cases shown above. In a more production-y version, I could envision this enhancing np.concatenate, so for failed cases, it would just copy data instead of raising an error.

Python: Concatenate (or clone) a numpy array N times

You are close, you want to use np.tile, but like this:

a = np.array([0,1,2])
np.tile(a,(3,1))

Result:

array([[0, 1, 2],
   [0, 1, 2],
   [0, 1, 2]])

If you call np.tile(a,3) you will get concatenate behavior like you were seeing

array([0, 1, 2, 0, 1, 2, 0, 1, 2])

http://docs.scipy.org/doc/numpy/reference/generated/numpy.tile.html

Concatenate numpy arrays to a single array

Your question has several problems.

What is embs; we can guess, such as a list of dict, or maybe pandas dataframe?

What is the block labeled input? The result of this loop:

for emb in embs: 
   print(emb['name'])

What is output? That actual output of your code, or the desired output? If the latter, what did your code do? What was wrong with it?

Often we see naive attempts to replicate this list iteration:

alist = []
for x in another_list:
    alist.append(x)

But I'm not sure that's what you have in mind. You don't define any array to 'collect' the loop values in. And your loop just has a print(), which doesn't return anything (or rather returns None). And concatenate doesn't work in in-place.

You've read enough of the concatenate docs to use the axis parameter, but your single first argument looks nothing like the documented tuple of arrays, (a1, a2, ...)

Since embs is apparently a list of 2 somethings with a name value, and the values are matching 2d arrays, this concatenate should produce your desired output:

np.concatenate( (embs[0]['name'], embs[1]['name']), axis=0 )

We could make that argument tuple with a list comprehension, or my earlier list append, but I think you need to see this concatenation spelled out in detail. Your understanding of python loops is still weak.

How to concatenate numpy arrays to create a 2d numpy array

You can use vstack to concatenate your master_list.

master_list = []
for array in formatted_list:
    master_list.append(array)

master_array = np.vstack(master_list)

Alternatively, if you know the length of your formatted_list containing the arrays and array length you can just preallocate the master_array.

import numpy as np

formatted_list = [np.random.rand(20)]*292
master_array = np.zeros((len(formatted_list), len(formatted_list[0])))
for i, array in enumerate(formatted_list):
    master_array[i,:] = array

** Edit **

As mentioned by hpaulj in the comments, np.array(), np.stack() and np.vstack() worked with this input and produced a numpy array with shape (7,20).

Concatenate Numpy Arrays Without Copying