How do I stack vectors of different lengths in NumPy?
Short answer: you can't. NumPy does not support jagged arrays natively.
Long answer:
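>>> import numpy as np
>>> from numpy import ma
>>> a = np.ones((3,))
>>> b = np.ones((2,))
>>> # newer NumPy releases raise ValueError for ragged input unless dtype=object is given
>>> c = np.array([a, b], dtype=object)
>>> c
array([array([1., 1., 1.]), array([1., 1.])], dtype=object)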
gives an array that may or may not behave as you expect. For example, it doesn't support basic methods like sum or reshape, and you should treat it much as you'd treat the ordinary Python list [a, b] (iterate over it to perform operations instead of using vectorized idioms).
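To make the "treat it like a list" advice concrete, here is a minimal sketch that keeps the ragged rows in a plain Python list and loops over them. The names rows, row_sums and row_lengths are illustrative and not from the original answer.

```python
import numpy as np

a = np.ones(3)
b = np.ones(2)

# Keep the ragged rows in an ordinary list instead of an object array.
rows = [a, b]

# Per-row operations are done with an explicit loop (here a comprehension).
row_sums = np.array([row.sum() for row in rows])
row_lengths = np.array([len(row) for row in rows])

print(row_sums)     # one sum per row
print(row_lengths)  # one length per row
```

Each individual row is still a regular ndarray, so vectorized operations remain available within a row; only the operations across rows need the loop.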
Several possible workarounds exist; the easiest is to coerce a and b to a common length, perhaps using masked arrays or NaN to signal that some indices are invalid in some rows. For example, here's b as a masked array:
>>> ma.array(np.resize(b, a.shape[0]), mask=[False, False, True])
masked_array(data = [1.0 1.0 --],
mask = [False False True],
fill_value = 1e+20)
This can be stacked with a as follows:
>>> ma.vstack([a, ma.array(np.resize(b, a.shape[0]), mask=[False, False, True])])
masked_array(data =
[[1.0 1.0 1.0]
[1.0 1.0 --]],
mask =
[[False False False]
[False False True]],
fill_value = 1e+20)
(For some purposes, scipy.sparse may also be interesting.)
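The NaN variant of the same workaround can be sketched as follows. The helper pad_to_length is hypothetical (it is not a NumPy function), and the sketch assumes floating-point data, since NaN cannot be stored in an integer array.

```python
import numpy as np

a = np.ones(3)
b = np.ones(2)

def pad_to_length(v, length, fill=np.nan):
    """Hypothetical helper: copy v into a float array of `length`, padding with `fill`."""
    out = np.full(length, fill, dtype=float)
    out[:len(v)] = v
    return out

# Stack a with the NaN-padded version of b.
stacked = np.vstack([a, pad_to_length(b, len(a))])

# NaN-aware reductions skip the padded entries.
col_sums = np.nansum(stacked, axis=0)

# The same data can be turned into a masked array after the fact.
masked = np.ma.masked_invalid(stacked)
```

NaN padding keeps everything in a plain ndarray, while the masked array carries an explicit validity mask; which one is more convenient probably depends on the operations you need afterwards.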
Stacking Numpy arrays of different length using padding
If you don't want to use itertools and column_stack, numpy.ndarray.resize will also do the job. As mentioned by jtweeder, you just need to know the resulting size of each row. The advantage of using resize is that a numpy.ndarray is contiguous in memory. Resizing is faster when the rows differ a lot in size. The performance difference between the two approaches is measurable:
import numpy as np
import timeit
import itertools
def stack_padding(it):
    def resize(row, size):
        # copy the row, then zero-pad the copy in place to the target length
        new = np.array(row)
        new.resize(size)
        return new
    # find longest row length
    row_length = max(len(row) for row in it)
    mat = np.array([resize(row, row_length) for row in it])
    return mat
def stack_padding1(l):
    # column_stack needs a sequence, so materialize the zip_longest iterator first
    return np.column_stack(list(itertools.zip_longest(*l, fillvalue=0)))
if __name__ == "__main__":
    n_rows = 200
    row_lengths = np.random.randint(30, 50, size=n_rows)
    mat = [np.random.randint(0, 100, size=s) for s in row_lengths]
    def test_stack_padding():
        global mat
        stack_padding(mat)
    def test_itertools():
        global mat
        stack_padding1(mat)
    t1 = timeit.timeit(test_stack_padding, number=1000)
    t2 = timeit.timeit(test_itertools, number=1000)
    print('With ndarray.resize: ', t1)
    print('With itertools and column_stack: ', t2)
The resize method wins in the above comparison:
With ndarray.resize:  0.30080295499647036
With itertools and column_stack:  1.0151802329928614
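A third padding variant, not part of the timing above, preallocates the padded matrix and copies each row into it. The sketch below uses a hypothetical helper name (stack_padding_prealloc) and tiny hand-written rows, and checks the result against the zip_longest approach; no claim is made here about its relative speed.

```python
import itertools
import numpy as np

def stack_padding_prealloc(rows, fill=0):
    """Hypothetical helper: pad ragged 1-D rows with `fill` into one 2-D array."""
    rows = [np.asarray(r) for r in rows]
    width = max(len(r) for r in rows)
    # Allocate the full matrix once, already filled with the padding value.
    out = np.full((len(rows), width), fill, dtype=np.result_type(*rows))
    for i, r in enumerate(rows):
        out[i, :len(r)] = r
    return out

rows = [[1, 2, 3], [4], [5, 6]]
padded = stack_padding_prealloc(rows)

# Reference result from the itertools-based approach.
reference = np.column_stack(list(itertools.zip_longest(*rows, fillvalue=0)))
print(padded)
```

Because the fill value is a parameter, the same helper could pad with np.nan for float data instead of zero.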
How to stack 2 numpy arrays with Different Lengths in python
Padding with np.nan is a valid option. Create an empty target array, and just assign your arrays into its sub-indexes:
import numpy as np
x1 = np.random.normal( 0, 1, ( 10, 1 ) )
x2 = np.random.normal( 0, 2, ( 5, 1 ) )
x3 = np.random.normal( 0, 3, ( 7, 1 ) )
x4 = np.random.normal( 0, 4, ( 9, 1 ) )
arrs = [x1,x2,x3,x4]
a = np.empty((max(x.shape[0] for x in arrs), len(arrs)))
a[:] = np.nan
for i, x in enumerate(arrs):
    a[0:len(x), i] = x.T
print(a)
Output:
[[ -1.5521545 -1.82217348 -3.28589422 -1.59646125]
[ 0.54409311 2.53585401 -2.15704799 2.1590175 ]
[ 0.24202617 -1.62680388 0.58507172 4.24671516]
[ 1.21341942 -2.09405961 1.94415747 -1.21781288]
[ -0.53110862 1.47037056 2.37113853 -10.01200676]
[ 0.50884432 nan -2.56881482 -3.52164926]
[ -0.37551321 nan 0.67952001 -0.5523079 ]
[ 0.5943706 nan nan -6.25704491]
[ -0.37893229 nan nan -6.28029336]
[ -0.34746679 nan nan nan]]
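Once the columns are NaN-padded like this, ordinary reductions propagate the NaN, so the NaN-aware variants are usually what you want. A small sketch with hand-written values (not the random data above):

```python
import numpy as np

# Two columns of different effective length, padded with NaN.
a = np.array([[1.0, 4.0],
              [2.0, np.nan],
              [3.0, np.nan]])

col_means = np.nanmean(a, axis=0)          # ignores the NaN padding
col_counts = np.sum(~np.isnan(a), axis=0)  # number of valid entries per column
plain_mean = a.mean(axis=0)                # NaN propagates into the second column

print(col_means, col_counts, plain_mean)
```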
Python how to add multiple arrays with different length into one
Here are some stats for different solutions to the problem. I was able to squeeze out a little more performance by vectorizing the computation of maxlen, but beyond that, I think you will have to try Cython or another programming language.
import numpy as np
from numba import jit
from time import time
np.random.seed(42)
def mixing_function(sig, onset):
    maxlen = np.max([o + len(s) for o, s in zip(onset, sig)])
    result = np.zeros(maxlen)
    for i in range(len(onset)):
        result[onset[i]:onset[i] + len(sig[i])] += sig[i]
    return result
def mix(sig, onset):
    siglengths = np.vectorize(len)(sig)
    maxlen = max(onset + siglengths)
    result = np.zeros(maxlen)
    for i in range(len(sig)):
        result[onset[i]: onset[i]+siglengths[i]] += sig[i]
    return result
@jit(nopython=True)
def mixnumba(sig, onset):
    # maxlen = np.max([onset[i] + len(sig[i]) for i in range(len(sig))])
    maxlen = -1
    for i in range(len(sig)):
        maxlen = max(maxlen, sig[i].size + onset[i])
    result = np.zeros(maxlen)
    for i in range(len(sig)):
        result[onset[i]: onset[i] + sig[i].size] += sig[i]
    return result
def signal_adder_with_onset(data, onset):
    data = np.array(data)
    # Get lengths of each row of data
    lens = np.array([len(i) for i in data])
    # adjust with offset for max possible lengths
    max_size = lens + onset
    # Mask of valid places in each row
    mask = ((np.arange(max_size.max()) >= onset.reshape(-1, 1))
            & (np.arange(max_size.max()) < (lens + onset).reshape(-1, 1)))
    # Setup output array and put elements from data into masked positions
    out = np.zeros(mask.shape, dtype=data.dtype)  # could perhaps change dtype here
    out[mask] = np.concatenate(data)
    return out.sum(axis=0)
sigbig = [np.random.randn(np.random.randint(1000, 10000)) for _ in range(10000)]
onsetbig = np.random.randint(0, 10000, size=10000)
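# sig and onset are not defined in this snippet; presumably they are the small example inputs from the original question
sigrepeat = np.repeat(sig, 500000).tolist()
onsetrepeat = np.repeat(onset, 500000)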
assert all(mixing_function(sigbig, onsetbig) == mix(sigbig, onsetbig))
assert all(mixing_function(sigbig, onsetbig) == mixnumba(sigbig, onsetbig))
assert all(mixing_function(sigbig, onsetbig) == signal_adder_with_onset(sigbig, onsetbig))
%timeit result = mixing_function(sigbig, onsetbig)
%timeit result = mix(sigbig, onsetbig)
%timeit result = mixnumba(sigbig, onsetbig)
%timeit result = signal_adder_with_onset(sigbig, onsetbig)
# Output
114 ms ± 1.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
108 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
368 ms ± 8.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
13.4 s ± 211 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit result = mixing_function(sigrepeat, onsetrepeat)
%timeit result = mix(sigrepeat, onsetrepeat)
%timeit result = mixnumba(sigrepeat, onsetrepeat)
%timeit result = signal_adder_with_onset(sigrepeat, onsetrepeat)
# Output
933 ms ± 6.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
803 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.07 s ± 85.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
254 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
TL;DR: Marginal performance improvement (around 10% faster) by using np.vectorize to get maxlen for long signals of random length. Note that for many small signals, @Paritosh Singh's answer performs faster than the others.
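For readers who have not seen the original question, the operation being benchmarked can be illustrated on a tiny input. The values of sig and onset below are made up for this sketch, and mix_small is just the plain-loop version under a hypothetical name.

```python
import numpy as np

def mix_small(sig, onset):
    """Add each signal into a zero buffer, starting at its onset index."""
    maxlen = max(o + len(s) for o, s in zip(onset, sig))
    result = np.zeros(maxlen)
    for o, s in zip(onset, sig):
        result[o:o + len(s)] += s
    return result

# Hypothetical inputs: two short signals, the second starting at index 2.
sig = [np.array([1.0, 1.0, 1.0]), np.array([2.0, 2.0])]
onset = np.array([0, 2])

result = mix_small(sig, onset)
print(result)  # the signals overlap only at index 2
```

The output length is the largest value of onset plus signal length, and overlapping samples are summed.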
Comparing two numpy arrays with different lengths line-wise
Thanks to @MichaelSzczesny I got the solution I was looking for. Simply adding [:, None] to the comparison worked:
smaller = np.array(A[:, 0][:, None] > B[:, 0]).astype('int')
As explained in Use of None in Array indexing in Python, it adds an axis to the array. Thus from
[ 15 200]
We get
[[ 15]
[200]]
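A self-contained sketch of this broadcasting comparison follows. The arrays A and B are hypothetical stand-ins (the originals are not shown here); only the first column of A matches the [ 15 200] example above.

```python
import numpy as np

# Hypothetical inputs with different numbers of rows.
A = np.array([[15, 0],
              [200, 0]])
B = np.array([[10, 0],
              [100, 0],
              [300, 0]])

col = A[:, 0][:, None]       # shape (2, 1): the first column of A as a column vector
row = B[:, 0]                # shape (3,)

# Broadcasting (2, 1) against (3,) compares every A value with every B value.
smaller = (col > row).astype(int)   # shape (2, 3)
print(smaller)
```

Row i of the result tells you, for A[i, 0], which entries of B[:, 0] it exceeds.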
How to append different sizes of numpy arrays and put na's in empty location
You can only concatenate arrays of the same number of dimensions (this can be resolved by broadcasting) and the same number of elements except for the concatenating axis.
Thus you need to append/concatenate an empty array of the correct shape and fill it with the values of arr2 afterwards.
# concatenate an array of the correct shape filled with np.nan
arr1_arr2 = np.concatenate((arr1, np.full((1, arr1.shape[1]), np.nan)))
# fill concatenated row with values from arr2
arr1_arr2[-1, :3] = arr2
But in general it is good advice NOT to append or concatenate arrays repeatedly. If possible, work out the shape of the final array in advance and create an empty array (or one filled with np.nan) of that shape, which is then filled in as you go. For example:
arr1_arr2 = np.full((3, 5), np.nan)
arr1_arr2[:-1, :] = arr1
arr1_arr2[-1, :arr2.shape[0]] = arr2
If it is only one appending/concatenating operation and it is not performance critical, it is ok to concat/append; otherwise a full preallocation in advance should be preferred.
If many arrays are to be concatenated and they all have the same shape, this is the best way to do it:
arr1 = np.array([[9, 4, 2, 6, 7],
[8, 5, 4, 1, 3]])
# some arrays to concatenate:
arr2 = np.array([3, 1, 5])
arr3 = np.array([5, 7, 9])
arr4 = np.array([54, 1, 99])
# make array of all arrays to concatenate:
arrs_to_concat = np.vstack((arr2, arr3, arr4))
# preallocate result array filled with nan
arr1_arr2 = np.full((arr1.shape[0] + arrs_to_concat.shape[0], 5), np.nan)
# fill with values:
arr1_arr2[:arr1.shape[0], :] = arr1
arr1_arr2[arr1.shape[0]:, :arrs_to_concat.shape[1]] = arrs_to_concat
Performance-wise, for large arrays it may be a good idea to use np.empty for preallocating the final array and to fill only the remaining region with np.nan.
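The np.empty suggestion can be sketched as follows, using the same arr1 and arr2 values as above. The float dtype is an assumption made here so that NaN can be stored; whether this is measurably faster than np.full would need to be timed for your array sizes.

```python
import numpy as np

arr1 = np.array([[9, 4, 2, 6, 7],
                 [8, 5, 4, 1, 3]])
arr2 = np.array([3, 1, 5])

# Allocate without initializing; every cell is written exactly once below.
out = np.empty((arr1.shape[0] + 1, arr1.shape[1]), dtype=float)
out[:-1, :] = arr1                 # existing rows
out[-1, :arr2.shape[0]] = arr2     # the shorter row
out[-1, arr2.shape[0]:] = np.nan   # only the leftover cells get NaN
print(out)
```

The key point is that np.empty leaves memory uninitialized, so you have to make sure every cell is assigned before the array is used.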
Numpy Adding two vectors with different sizes
This could be what you are looking for:
if len(a) < len(b):
    c = b.copy()
    c[:len(a)] += a
else:
    c = a.copy()
    c[:len(b)] += b
Basically, you copy the longer one and then add the shorter one in place.
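The same idea can be wrapped in a function. The name add_uneven is hypothetical, and the dtype promotion is an addition of this sketch: with the in-place += above, adding a float vector into a copy of an integer vector raises a casting error, so the copy here is made in the common result type.

```python
import numpy as np

def add_uneven(a, b):
    """Hypothetical helper: add two 1-D vectors, treating the shorter as zero-padded."""
    a = np.asarray(a)
    b = np.asarray(b)
    if len(a) < len(b):
        a, b = b, a                      # make a the longer vector
    # astype returns a copy, promoted so the in-place add cannot fail on dtype.
    c = a.astype(np.result_type(a, b))
    c[:len(b)] += b
    return c

print(add_uneven([1, 2, 3], [10, 20]))
print(add_uneven([1.5], [1, 2]))
```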
Fastest way to sum several NumPy vectors that have uneven lengths
Approach #1
With such huge input arrays and an even larger number of arrays, we need to be memory efficient, so I would suggest a loop that adds up one array at a time:
many_vectors = [v1, v2, v3, v4] # list of all vectors
lens = [len(i) for i in many_vectors]
L = max(lens)
out = np.zeros(L)
for l, v in zip(lens, many_vectors):
    out[:l] += v
Approach #2
Another, almost vectorized, approach uses masking to generate a regular 2D array from the list of irregularly shaped vectors/arrays and then sums along columns for the final output:
# Inspired by https://stackoverflow.com/a/38619350/ @Divakar
def stack1Darrs(v):
    lens = np.array([len(item) for item in v])
    mask = lens[:,None] > np.arange(lens.max())
    out_dtype = np.result_type(*[i.dtype for i in v])
    out = np.zeros(mask.shape,dtype=out_dtype)
    out[mask] = np.concatenate(v)
    return out
out = stack1Darrs(many_vectors).sum(0)
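As a quick sanity check, both approaches can be run on a few small vectors to confirm they agree. The input values below are made up for illustration; the code is a self-contained restatement of the two approaches.

```python
import numpy as np

many_vectors = [np.array([1.0, 2.0, 3.0]),
                np.array([10.0, 20.0]),
                np.array([100.0])]

# Approach 1: accumulate one vector at a time into a zero buffer.
lens = [len(v) for v in many_vectors]
out_loop = np.zeros(max(lens))
for l, v in zip(lens, many_vectors):
    out_loop[:l] += v

# Approach 2: build a zero-padded 2D array via a boolean mask, then sum the columns.
lens_arr = np.array(lens)
mask = lens_arr[:, None] > np.arange(lens_arr.max())
padded = np.zeros(mask.shape)
padded[mask] = np.concatenate(many_vectors)
out_mask = padded.sum(axis=0)

print(out_loop, out_mask)
```

Approach 2 materializes a full (number of vectors) x (maximum length) array, which is the memory cost that Approach 1 avoids.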
Numpy stack with unequal shapes
Numpy arrays have to be rectangular, so what you are trying to get is not possible with a numpy array.
You need a different data structure. Which one is suitable depends on what you want to do with that data.
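One possible alternative structure, sketched here under the assumption that the rows are 1-D and non-empty, is a single flat array plus an offsets array marking where each row starts. The names values, lengths and offsets are illustrative only.

```python
import numpy as np

rows = [np.array([1.0, 2.0, 3.0]),
        np.array([4.0]),
        np.array([5.0, 6.0])]

# Flat storage: all values back to back, plus the start/end index of each row.
values = np.concatenate(rows)
lengths = np.array([len(r) for r in rows])
offsets = np.concatenate(([0], np.cumsum(lengths)))

# Per-row sums without a Python loop (assumes no empty rows).
row_sums = np.add.reduceat(values, offsets[:-1])

# Any single row can be recovered as a slice of the flat array.
second_row = values[offsets[1]:offsets[2]]
print(row_sums, second_row)
```

A plain Python list of arrays is the simpler choice when you mostly iterate; the flat layout tends to pay off when you want vectorized per-row reductions.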