How to Stack Vectors of Different Lengths in Numpy

How do I stack vectors of different lengths in NumPy?

Short answer: you can't. NumPy does not support jagged arrays natively.

Long answer:

>>> import numpy as np
>>> a = np.ones((3,))
>>> b = np.ones((2,))
>>> c = np.array([a, b], dtype=object)  # recent NumPy versions require dtype=object explicitly
>>> c
array([array([1., 1., 1.]), array([1., 1.])], dtype=object)

gives an object array that may or may not behave as you expect. For example, it doesn't support basic methods like sum or reshape, and you should treat it much as you'd treat the ordinary Python list [a, b]: iterate over it to perform operations instead of relying on vectorized idioms.
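For instance, a minimal sketch of the iterate-instead-of-vectorize approach (per-row sums, since c.sum(axis=1) is unavailable for ragged rows):

import numpy as np

a = np.ones((3,))
b = np.ones((2,))
c = np.array([a, b], dtype=object)

# Per-row operations work by ordinary iteration, exactly as with a list.
row_sums = [row.sum() for row in c]
print(row_sums)  # [3.0, 2.0]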

Several possible workarounds exist; the easiest is to coerce a and b to a common length, perhaps using masked arrays or NaN to signal that some indices are invalid in some rows. E.g. here's b as a masked array:

>>> from numpy import ma
>>> ma.array(np.resize(b, a.shape[0]), mask=[False, False, True])
masked_array(data = [1.0 1.0 --],
             mask = [False False  True],
       fill_value = 1e+20)

This can be stacked with a as follows:

>>> ma.vstack([a, ma.array(np.resize(b, a.shape[0]), mask=[False, False, True])])
masked_array(data =
[[1.0 1.0 1.0]
[1.0 1.0 --]],
mask =
[[False False False]
[False False True]],
fill_value = 1e+20)

(For some purposes, scipy.sparse may also be interesting.)
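If you need this for more than two vectors, here is a minimal sketch that generalizes the idea to any list of 1-D arrays; the helper name stack_masked is mine, not a NumPy function:

import numpy as np
from numpy import ma

def stack_masked(rows):
    """Stack 1-D arrays of different lengths into a 2-D masked array,
    masking the padded tail of each short row."""
    width = max(len(r) for r in rows)
    out = ma.masked_all((len(rows), width))  # start with everything masked
    for i, r in enumerate(rows):
        out[i, :len(r)] = r                  # assignment unmasks these slots
    return out

stacked = stack_masked([np.ones(3), np.ones(2)])
print(stacked.sum(axis=1))  # [3.0 2.0] -- masked entries are ignored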

Stacking Numpy arrays of different length using padding

If you don't want to use itertools and column_stack, numpy.ndarray.resize will also do the job perfectly. As mentioned by jtweeder, you just need to know the resulting size of each row. The advantage of using resize is that numpy.ndarray is contiguous in memory, so resizing is faster when the rows differ a lot in size. The performance difference between the two approaches is observable.

import numpy as np
import timeit
import itertools

def stack_padding(it):

    def resize(row, size):
        new = np.array(row)
        new.resize(size)
        return new

    # find longest row length
    row_length = len(max(it, key=len))
    mat = np.array([resize(row, row_length) for row in it])

    return mat

def stack_padding1(l):
    return np.column_stack((itertools.zip_longest(*l, fillvalue=0)))

if __name__ == "__main__":
    n_rows = 200
    row_lengths = np.random.randint(30, 50, size=n_rows)
    mat = [np.random.randint(0, 100, size=s) for s in row_lengths]

    def test_stack_padding():
        global mat
        stack_padding(mat)

    def test_itertools():
        global mat
        stack_padding1(mat)

    t1 = timeit.timeit(test_stack_padding, number=1000)
    t2 = timeit.timeit(test_itertools, number=1000)
    print('With ndarray.resize: ', t1)
    print('With itertool and vstack: ', t2)

The resize method wins in the above comparison:

With ndarray.resize:  0.30080295499647036
With itertool and vstack: 1.0151802329928614
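As a quick sanity check of stack_padding (ndarray.resize zero-fills the enlarged part, so short rows come back zero-padded):

rows = [np.array([1, 2, 3]), np.array([4, 5])]
print(stack_padding(rows))
# [[1 2 3]
#  [4 5 0]]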

How to stack 2 numpy arrays with Different Lengths in python

Padding with np.nan is a valid option.

Create an empty target array, and just assign your arrays into its sub-indices:

import numpy as np

x1 = np.random.normal( 0, 1, ( 10, 1 ) )
x2 = np.random.normal( 0, 2, ( 5, 1 ) )
x3 = np.random.normal( 0, 3, ( 7, 1 ) )
x4 = np.random.normal( 0, 4, ( 9, 1 ) )

arrs = [x1,x2,x3,x4]

a = np.empty((max(x.shape[0] for x in arrs), len(arrs)))
a[:] = np.nan

for i, x in enumerate(arrs):
    a[0:len(x), i] = x.T
print(a)

Output:

[[ -1.5521545   -1.82217348  -3.28589422  -1.59646125]
[ 0.54409311 2.53585401 -2.15704799 2.1590175 ]
[ 0.24202617 -1.62680388 0.58507172 4.24671516]
[ 1.21341942 -2.09405961 1.94415747 -1.21781288]
[ -0.53110862 1.47037056 2.37113853 -10.01200676]
[ 0.50884432 nan -2.56881482 -3.52164926]
[ -0.37551321 nan 0.67952001 -0.5523079 ]
[ 0.5943706 nan nan -6.25704491]
[ -0.37893229 nan nan -6.28029336]
[ -0.34746679 nan nan nan]]

Python how to add multiple arrays with different length into one

Here are some stats for different solutions to the problem. I was able to squeeze out a little more performance by vectorizing the computation of maxlen, but beyond that, I think you will have to try Cython or another programming language.

import numpy as np
from numba import jit
from time import time
np.random.seed(42)

def mixing_function(sig, onset):
    maxlen = np.max([o + len(s) for o, s in zip(onset, sig)])
    result = np.zeros(maxlen)
    for i in range(len(onset)):
        result[onset[i]:onset[i] + len(sig[i])] += sig[i]
    return result

def mix(sig, onset):
    siglengths = np.vectorize(len)(sig)
    maxlen = max(onset + siglengths)
    result = np.zeros(maxlen)
    for i in range(len(sig)):
        result[onset[i]: onset[i]+siglengths[i]] += sig[i]
    return result

@jit(nopython=True)
def mixnumba(sig, onset):
    # maxlen = np.max([onset[i] + len(sig[i]) for i in range(len(sig))])
    maxlen = -1
    for i in range(len(sig)):
        maxlen = max(maxlen, sig[i].size + onset[i])
    result = np.zeros(maxlen)
    for i in range(len(sig)):
        result[onset[i]: onset[i] + sig[i].size] += sig[i]
    return result

def signal_adder_with_onset(data, onset):
    data = np.array(data)
    # Get lengths of each row of data
    lens = np.array([len(i) for i in data])
    # adjust with offset for max possible lengths
    max_size = lens + onset
    # Mask of valid places in each row
    mask = ((np.arange(max_size.max()) >= onset.reshape(-1, 1))
            & (np.arange(max_size.max()) < (lens + onset).reshape(-1, 1)))

    # Setup output array and put elements from data into masked positions
    out = np.zeros(mask.shape, dtype=data.dtype)  # could perhaps change dtype here
    out[mask] = np.concatenate(data)
    return out.sum(axis=0)

sigbig = [np.random.randn(np.random.randint(1000, 10000)) for _ in range(10000)]
onsetbig = np.random.randint(0, 10000, size=10000)
# `sig` and `onset` below are the small example inputs from the original question
sigrepeat = np.repeat(sig, 500000).tolist()
onsetrepeat = np.repeat(onset, 500000)

assert all(mixing_function(sigbig, onsetbig) == mix(sigbig, onsetbig))
assert all(mixing_function(sigbig, onsetbig) == mixnumba(sigbig, onsetbig))
assert all(mixing_function(sigbig, onsetbig) == signal_adder_with_onset(sigbig, onsetbig))

%timeit result = mixing_function(sigbig, onsetbig)
%timeit result = mix(sigbig, onsetbig)
%timeit result = mixnumba(sigbig, onsetbig)
%timeit result = signal_adder_with_onset(sigbig, onsetbig)
# Output
114 ms ± 1.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
108 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
368 ms ± 8.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
13.4 s ± 211 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit result = mixing_function(sigrepeat, onsetrepeat)
%timeit result = mix(sigrepeat, onsetrepeat)
%timeit result = mixnumba(sigrepeat, onsetrepeat)
%timeit result = signal_adder_with_onset(sigrepeat.tolist(), onsetrepeat)
# Output
933 ms ± 6.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
803 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.07 s ± 85.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
254 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

TL;DR:
Marginal performance improvement (around 10% faster) from using np.vectorize to compute maxlen for long signals of random length. Note that for many small signals, @Paritosh Singh's answer performs faster than the others.

Comparing two numpy arrays with different lengths line-wise

Thanks to @MichaelSzczesny I got the solution I was looking for. Simply adding [:, None] to the comparison worked:

smaller = np.array(A[:, 0][:, None] > B[:, 0]).astype('int')

As explained in Use of None in Array indexing in Python, it adds an axis to the array. Thus from

[ 15 200]

we get

[[ 15]
[200]]
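A self-contained sketch of the same idea; A and B here are made-up arrays, not the ones from the original question:

import numpy as np

A = np.array([[15, 1], [200, 2]])             # hypothetical (2, 2) array
B = np.array([[100, 0], [150, 0], [250, 0]])  # hypothetical (3, 2) array

# A[:, 0][:, None] has shape (2, 1); comparing it against B[:, 0] of shape
# (3,) broadcasts to a (2, 3) matrix of line-wise comparisons.
smaller = (A[:, 0][:, None] > B[:, 0]).astype(int)
print(smaller)
# [[0 0 0]
#  [1 1 0]]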

How to append different sizes of numpy arrays and put na's in empty location

You can only concatenate arrays with the same number of dimensions (this can be resolved by broadcasting) and the same number of elements along every axis except the concatenation axis.

Thus you need to append/concatenate an empty array of the correct shape and fill it with the values of arr2 afterwards.

# assuming, as in the example further down, that arr1 has shape (2, 5)
# and arr2 is a 1-D array of length 3
# concatenate a row of the correct shape filled with np.nan
arr1_arr2 = np.concatenate((arr1, np.full((1, arr1.shape[1]), np.nan)))
# fill the concatenated row with the values from arr2
arr1_arr2[-1, :3] = arr2

But in general it is always good advice NOT to append/concatenate arrays if you can avoid it. If possible, try to work out the correct shape of the final array in advance and create an empty array (or one filled with np.nan) of that final shape, which is then filled in the process. For example:

arr1_arr2 = np.full((3, 5), np.nan)
arr1_arr2[:-1, :] = arr1
arr1_arr2[-1, :arr2.shape[0]] = arr2

If it is only one append/concatenate operation and it is not performance critical, it is OK to concat/append; otherwise full preallocation in advance should be preferred.

If many arrays shall be concatenated and all arrays to concatenate have the same shape, this will be the best way to do it:

arr1 = np.array([[9, 4, 2, 6, 7],
                 [8, 5, 4, 1, 3]])
# some arrays to concatenate:
arr2 = np.array([3, 1, 5])
arr3 = np.array([5, 7, 9])
arr4 = np.array([54, 1, 99])
# make array of all arrays to concatenate:
arrs_to_concat = np.vstack((arr2, arr3, arr4))
# preallocate result array filled with nan
arr1_arr2 = np.full((arr1.shape[0] + arrs_to_concat.shape[0], 5), np.nan)
# fill with values:
arr1_arr2[:arr1.shape[0], :] = arr1
arr1_arr2[arr1.shape[0]:, :arrs_to_concat.shape[1]] = arrs_to_concat

Performance-wise it may be a good idea for large arrays to use np.empty for preallocating the final array and filling only the remaining shape with np.nan.
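A minimal sketch of that variant, reusing the shapes from the example above; only the trailing slots that are never written to get an explicit np.nan:

import numpy as np

arr1 = np.array([[9, 4, 2, 6, 7],
                 [8, 5, 4, 1, 3]])
arr2 = np.array([3, 1, 5])

# np.empty skips initializing the whole buffer ...
arr1_arr2 = np.empty((3, 5))
arr1_arr2[:arr1.shape[0], :] = arr1
arr1_arr2[arr1.shape[0]:, :arr2.shape[0]] = arr2
# ... so only the positions that were never assigned need np.nan
arr1_arr2[arr1.shape[0]:, arr2.shape[0]:] = np.nan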

Numpy Adding two vectors with different sizes

This could be what you are looking for

if len(a) < len(b):
    c = b.copy()
    c[:len(a)] += a
else:
    c = a.copy()
    c[:len(b)] += b

Basically, you copy the longer one and then add the shorter one in place.
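For example, with two made-up vectors:

import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20])

# copy the longer array, add the shorter one onto its leading slots
if len(a) < len(b):
    c = b.copy()
    c[:len(a)] += a
else:
    c = a.copy()
    c[:len(b)] += b

print(c)  # [11 22  3]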

Fastest way to sum several NumPy vectors that have uneven lengths

Approach #1

With such huge input array sizes and an even larger number of arrays, we need to be memory efficient, and hence would suggest a loopy one that iteratively adds up one array at a time -

many_vectors = [v1, v2, v3, v4]  # list of all vectors

lens = [len(i) for i in many_vectors]
L = max(lens)
out = np.zeros(L)
for l, v in zip(lens, many_vectors):
    out[:l] += v

Approach #2

Another almost-vectorized one with masking to generate a regular 2D array from the list of those irregular shaped vectors/arrays and then summing along columns for the final output -

# Inspired by https://stackoverflow.com/a/38619350/ @Divakar
def stack1Darrs(v):
    lens = np.array([len(item) for item in v])
    mask = lens[:, None] > np.arange(lens.max())
    out_dtype = np.result_type(*[i.dtype for i in v])
    out = np.zeros(mask.shape, dtype=out_dtype)
    out[mask] = np.concatenate(v)
    return out

out = stack1Darrs(many_vectors).sum(0)
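For instance, with two short made-up vectors (the padded positions stay zero, so they don't affect the column sums):

import numpy as np

v1 = np.array([1, 2])
v2 = np.array([3, 4, 5])

print(stack1Darrs([v1, v2]))
# [[1 2 0]
#  [3 4 5]]
print(stack1Darrs([v1, v2]).sum(0))  # [4 6 5]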

Numpy stack with unequal shapes

Numpy arrays have to be rectangular, so what you are trying to get is not possible with a numpy array.

You need a different data structure. Which one is suitable depends on what you want to do with that data.
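In practice, the simplest such structure is often a plain Python list of arrays, processed row by row as discussed at the top of this page; a minimal sketch:

import numpy as np

rows = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])]

# per-row reductions still work; there is just no 2-D vectorization
means = [row.mean() for row in rows]
print(means)  # [2.0, 4.5]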


