Concatenate Numpy arrays without copying
The memory belonging to a Numpy array must be contiguous. If you allocated the arrays separately, they are randomly scattered in memory, and there is no way to represent them as a view Numpy array.
If you know beforehand how many arrays you need, you can instead start with one big array that you allocate beforehand, and have each of the small arrays be a view to the big array (e.g. obtained by slicing).
How to combine two huge numpy arrays without concat, stack, or append?
Numpy numpy.memmap() allows for the creation of memory mapped data stored as a binary on disk that can be accessed and interfaced with as if it were a single array. This solution saves the individual arrays you are working with as separate .npy files and then combines them into a single binary file.
import numpy as np
import os
size = (7,960000,200)
# We are assuming arrays a and b share the same shape, if they do not
# see https://stackoverflow.com/questions/50746704/how-to-merge-very-large-numpy-arrays
# for an explanation on how to create the new shape
a = np.ones(size) # uses ~16 GB RAM
a = np.transpose(a, (1,0,2))
shape = a.shape
shape[0] *= 2
dtype = a.dtype
np.save('a.npy', a)
a = None # allows for data to be deallocated by garbage collector
b = np.ones(size) # uses ~16 GB RAM
b = np.transpose(b, (1,0,2))
np.save('b.npy', a)
b = None
# Once the size is know create memmap and write chunks
data_files = ['a.npy', 'b.npy']
merged = np.memmap('merged.dat', dtype=dtype, mode='w+', shape=shape)
i = 0
for file in data_files:
chunk = np.load(file, allow_pickle=True)
merged[i:i+len(chunk)] = chunk
i += len(chunk)
merged = np.transpose(merged, (1,0,2))
# Delete temporary numpy .npy files
os.remove('a.npy')
os.remove('b.npy')
- Based on: this stackoverflow answer
- also check out hdf5 and combining two hdf5 files here. It's another good way of storing large datasets
Combining two views of same numpy array into single view without copying the array?
Since memory-views can only be created using a fixed set of strides
, you will have to create a copy in your case, where mat.shape[0] > j > i
.
That means views will only work, if you want to have a view to every x-th element in the array:
mat = np.arange(20)
view = mat[slice(0, 20, 4)]
view
# Out[41]: array([ 0, 4, 8, 12, 16])
So this only works for views to equally spaced cells. But if you want to have a view to one contiguous slice(0, i)
and another contiguous slice(j, mat.shape[0])
, it won't work. You'll have to make a copy.
Concatenate Numpy arrays with least memory
Your problem is that you need to have 2 copies of the same data in memory.
If you build the array as in test1
you'll need far less memory at once, but at the cost of losing the dictionary.
import numpy as np
import time
def test1(n):
a = {x:(x, x, x) for x in range(n)} # Build sample data
b = np.array([a.pop(i) for i in range(n)]).reshape(-1)
return b
def test2(n):
a = {x:(x, x, x) for x in range(n)} # Build sample data
b = np.concatenate(list(a.values()))
return b
x1 = test1(1000000)
del x1
time.sleep(1)
x2 = test2(1000000)
Results:
test1 : 0.71 s
test2 : 1.39 s
The first peek is for test1, it's not exactly in place but it reduces the memory usage quite a bit.
numpy - Append to array without making a copy
Based on your updated question, it looks like you can handily solve the problem by keeping a dictionary of numpy arrays:
x = np.array([])
y = np.array([])
Arrays = {"x": x, "y": y}
with open("./data.txt", "r") as f:
for line in f:
if re.match('x values', line):
print "reading x values"
key = "x"
elif re.match('y', line):
print "reading y values"
key = "y"
else:
values = re.match("^\s+((?:[0-9.E+-]+\s*)*)", line)
if values:
Arrays[key] = np.append(Arrays[key], values.groups()[0].split())
As Sven Marnach points out in comments both here and your question, this is an inefficient use of numpy arrays.
A better approach (again, as Sven points out) would be:
Arrays = {"x": [], "y": []}
with open("./data.txt", "r") as f:
for line in f:
if re.match('x values', line):
print "reading x values"
key = "x"
elif re.match('y', line):
print "reading y values"
key = "y"
else:
values = re.match("^\s+((?:[0-9.E+-]+\s*)*)", line)
if values:
Arrays[key].append(values.groups()[0].split())
Arrays = {key: np.array(Arrays[key]) for key in Arrays}
Concatenating views in NumPy
You can only concatenate views if they are contiguous in terms of dtypes, strides and offsets. Here is one way to check. This way is likely incomplete, but it illustrates the gist of it. Basically, if the views share a base, and the strides and offsets are aligned so that they are on the same grid, you can concatenate.
In the spirit of TDD, I will work with the following example:
x = np.arange(24).reshape(4, 6)
We (or at least I) want the following to be concatenatable:
a, b = x[:, :4], x[:, 4:] # Basic case
a, b = x[:, :4:2], x[:, 4::2] # Strided
a, b = x[:, :4:2], x[:, 2::2] # Strided overlapping
a, b = x[1:2, 1:4], x[2:4, 1:4] # Stacked
# Completely reshaped:
a, b = x.ravel()[:12].reshape(3, 4), x.ravel()[12:].reshape(3, 4)
# Equivalent to
a, b = x[:2, :].reshape(3, 4), x[2:, :].reshape(3, 4)
We do not want the following to be concatenatable:
a, b = x, np.arange(12).reshape(2, 6) # Buffer mismatch
a, b = x[0, :].view(np.uint), x[1:, :] # Dtype mismatch
a, b = x[:, ::2], x[:, ::3] # Stride mismatch
a, b = x[:, :4], x[:, 4::2] # Stride mismatch
a, b = x[:, :3], x[:, 4:] # Overlap mismatch
a, b = x[:, :4:2], x[:, 3::2] # Overlap mismatch
a, b = x[:-1, :-1], x[1:, 1:] # Overlap mismatch
a, b = x[:-1, :4], x[:, 4:] # Shape mismatch
The following could be interpreted as concatenatable, but won't be in this case:
a, b = x, x[1:-1, 1:-1]
The idea is that everything (dtype, strides, offsets) has to match exactly. Only one axis offset is allowed to be different between the views, as long as it is no more than one stride away from the edge of the other view. The only possible exception is when one view is fully contained in another, but we will ignore this scenario here. Generalizing to multiple dimensions should be pretty simple if we use array operations on the offsets and strides.
def cat_slices(a, b):
if a.base is not b.base:
raise ValueError('Buffer mismatch')
if a.dtype != b.dtype: # I don't thing you can use `is` here in general
raise ValueError('Dtype mismatch')
sa = np.array(a.strides)
sb = np.array(b.strides)
if (sa != sb).any():
raise ValueError('Stride mismatch')
oa = np.byte_bounds(a)[0]
ob = np.byte_bounds(b)[0]
if oa > ob:
a, b = b, a
oa, ob = ob, oa
offset = ob - oa
# Check if you can get to `b` from a by moving along exactly one axis
# This part works consistently for arrays with internal overlap
div = np.zeros_like(sa)
mod = np.ones_like(sa) # Use ones to auto-flag divide-by zero
np.divmod(offset, sa, where=sa.astype(bool), out=(div, mod))
zeros = np.flatnonzero((mod == 0) & (div >= 0) & (div <= a.shape))
if not zeros.size:
raise ValueError('Overlap mismatch')
axis = zeros[0]
check_shape = np.equal(a.shape, b.shape)
check_shape[axis] = True
if not check_shape.all():
raise ValueError('Shape mismatch')
shape = list(a.shape)
shape[axis] = b.shape[axis] + div[axis]
start = np.byte_bounds(a)[0] - np.byte_bounds(a.base)[0]
return np.ndarray(shape, dtype=a.dtype, buffer=a.base, offset=start, strides=a.strides)
Some things that this function does not handle:
- Merging flags
- Broadcasting
- Handling arrays that are fully contained within each other but with multi-axis offsets
- Negative strides
You can, however, check that it returns the expected views (and errors) for all the cases shown above. In a more production-y version, I could envision this enhancing np.concatenate
, so for failed cases, it would just copy data instead of raising an error.
Python: Concatenate (or clone) a numpy array N times
You are close, you want to use np.tile
, but like this:
a = np.array([0,1,2])
np.tile(a,(3,1))
Result:
array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])
If you call np.tile(a,3)
you will get concatenate
behavior like you were seeing
array([0, 1, 2, 0, 1, 2, 0, 1, 2])
http://docs.scipy.org/doc/numpy/reference/generated/numpy.tile.html
Concatenate numpy arrays to a single array
Your question has several problems.
What is embs
; we can guess, such as a list of dict
, or maybe pandas dataframe?
What is the block labeled input
? The result of this loop:
for emb in embs:
print(emb['name'])
What is output
? That actual output of your code, or the desired output? If the latter, what did your code do? What was wrong with it?
Often we see naive attempts to replicate this list iteration:
alist = []
for x in another_list:
alist.append(x)
But I'm not sure that's what you have in mind. You don't define any array to 'collect' the loop values in. And your loop just has a print()
, which doesn't return anything (or rather returns None
). And concatenate
doesn't work in in-place.
You've read enough of the concatenate
docs to use the axis
parameter, but your single first argument looks nothing like the documented tuple of arrays, (a1, a2, ...)
Since embs
is apparently a list of 2 somethings with a name
value, and the values are matching 2d arrays, this concatenate
should produce your desired output
:
np.concatenate( (embs[0]['name'], embs[1]['name']), axis=0 )
We could make that argument tuple with a list comprehension, or my earlier list append, but I think you need to see this concatenation spelled out in detail. Your understanding of python loops is still weak.
How to concatenate numpy arrays to create a 2d numpy array
You can use vstack
to concatenate your master_list
.
master_list = []
for array in formatted_list:
master_list.append(array)
master_array = np.vstack(master_list)
Alternatively, if you know the length of your formatted_list
containing the arrays and array length you can just preallocate the master_array
.
import numpy as np
formatted_list = [np.random.rand(20)]*292
master_array = np.zeros((len(formatted_list), len(formatted_list[0])))
for i, array in enumerate(formatted_list):
master_array[i,:] = array
** Edit **
As mentioned by hpaulj in the comments, np.array()
, np.stack()
and np.vstack()
worked with this input and produced a numpy array with shape (7,20).
Related Topics
Find Longest Repetitive Sequence in a String
List of Dicts To/From Dict of Lists
Django Aggregation: Summation of Multiplication of Two Fields
Removing Unicode \U2026 Like Characters in a String in Python2.7
Should I Call Close() After Urllib.Urlopen()
How to Set Folder Permissions in Windows
What Does the Term "Broadcasting" Mean in Pandas Documentation
Asyncio.Sleep() VS Time.Sleep()
Python String 'Join' Is Faster () Than '+', But What's Wrong Here
Plotting Grouped Data in Same Plot Using Pandas
How Can Strings Be Concatenated
Search in Lists of Lists by Given Index
Python String 'In' Operator Implementation Algorithm and Time Complexity
How to Format Date String via Multiple Formats in Python