How to make a multidimensional numpy array with a varying row size?
While Numpy knows about arrays of arbitrary objects, it's optimized for homogeneous arrays of numbers with fixed dimensions. If you really need arrays of arrays, it's better to use a nested list. But depending on the intended use of your data, a different data structure might be even better, e.g. a masked array if you have some invalid data points.
If you really want flexible Numpy arrays, use something like this:
numpy.array([[0,1,2,3], [2,3,4]], dtype=object)
However this will create a one-dimensional array that stores references to lists, which means that you will lose most of the benefits of Numpy (vector processing, locality, slicing, etc.).
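Since the answer mentions masked arrays as an alternative, here is a minimal sketch (with made-up data) of that route: pad the ragged rows into a rectangle and mask the padding, so vectorized operations still work while the padding slots are ignored.

```python
import numpy as np

# Hypothetical ragged data; pad to a rectangle and mask the padding.
rows = [[0, 1, 2, 3], [2, 3, 4]]
width = max(len(r) for r in rows)
padded = np.zeros((len(rows), width))
mask = np.ones((len(rows), width), dtype=bool)  # True = masked (invalid)
for i, r in enumerate(rows):
    padded[i, :len(r)] = r
    mask[i, :len(r)] = False  # mark real entries as valid

m = np.ma.masked_array(padded, mask=mask)
print(m.sum(axis=1))  # padding is ignored in the sums
```

Unlike the object-dtype array, this keeps the data in one contiguous block, so slicing and reductions stay fast.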
numpy array containing multi-dimensional numpy arrays with variable shape
np.array(alist) will make an object-dtype array if the arrays in the list differ in their first dimension. But in your case they differ in the 3rd, producing this error. In effect, np.array can't unambiguously determine where the containing dimensions end and where the objects begin.
In [270]: alist = [np.ones((10,4,4,20),int), np.zeros((10,4,6,20),int)]
In [271]: arr = np.array(alist)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-271-3fd8e9bd05a9> in <module>
----> 1 arr = np.array(alist)
ValueError: could not broadcast input array from shape (10,4,4,20) into shape (10,4)
Instead we need to make an object array of the right size, and copy the list to it. Sometimes this copy still produces broadcasting errors, but here it seems to be ok:
In [272]: arr = np.empty(2, object)
In [273]: arr
Out[273]: array([None, None], dtype=object)
In [274]: arr[:] = alist
In [275]: arr
Out[275]:
array([array([[[[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
...
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]]]])], dtype=object)
In [276]: arr[0].shape
Out[276]: (10, 4, 4, 20)
In [277]: arr[1].shape
Out[277]: (10, 4, 6, 20)
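The empty-then-assign trick above can be wrapped in a small reusable helper (the function name is my own):

```python
import numpy as np

def to_object_array(arrays):
    # Build an object array of the right length first, then copy the
    # list in; this sidesteps np.array's broadcasting attempt.
    out = np.empty(len(arrays), dtype=object)
    out[:] = arrays
    return out

alist = [np.ones((10, 4, 4, 20), int), np.zeros((10, 4, 6, 20), int)]
arr = to_object_array(alist)
print(arr.shape, arr[0].shape, arr[1].shape)
```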
Storing multidimensional variable length array with h5py
The essence of your code is:
phn_mfccs = []
<loop several layers>
    phn_mfcc = <some sort of array expanded by one dimension>
    phn_mfccs.append(phn_mfcc)
At the end of the loops, phn_mfccs is a list of arrays. I can't tell from the code what the dtype and shape are, or whether they differ for each element of the list.
I'm not entirely sure what create_dataset does when given a list of arrays. It may wrap it in np.array.
mfccs_out.create_dataset('phn_mfccs', data=phn_mfccs, dtype=dt)
What does np.array(phn_mfccs) produce? What shape and dtype? If all the elements are arrays of the same shape and dtype, it will produce a higher-dimensional array. If they differ in shape, it will produce a 1d array with object dtype. Given the error message, I suspect the latter.
I've answered a few vlen questions but haven't worked with it a lot: http://docs.h5py.org/en/latest/special.html
I vaguely recall that the 'ragged' dimension of an h5 array can only be 1d. So a phn_mfccs object array that contains 1d float arrays of varying lengths might work.
I might come up with a simple example, and I suggest you construct a simpler problem that we can copy-n-paste and experiment with. We don't need to know how you read the data from your directory; we just need to understand the content of the array (or list) that you are trying to write.
A 2015 post on vlen arrays: Inexplicable behavior when using vlen with h5py
See also: H5PY - How to store many 2D arrays of different dimensions
1d ragged arrays example
In [24]: f = h5py.File('vlen.h5','w')
In [25]: dt = h5py.special_dtype(vlen=np.dtype('float64'))
In [26]: dataset = f.create_dataset('vlen',(4,), dtype=dt)
In [27]: dataset.value
Out[27]:
array([array([], dtype=float64), array([], dtype=float64),
array([], dtype=float64), array([], dtype=float64)], dtype=object)
In [28]: for i in range(4):
...: dataset[i]=np.arange(i+3)
In [29]: dataset.value
Out[29]:
array([array([ 0., 1., 2.]), array([ 0., 1., 2., 3.]),
array([ 0., 1., 2., 3., 4.]),
array([ 0., 1., 2., 3., 4., 5.])], dtype=object)
If I try to write 2d arrays to dataset I get an error:
OSError: Can't prepare for writing data (Src and dest data spaces have different sizes)
The dataset itself may be multidimensional, but the vlen object has to be a 1d array of floats.
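Given that restriction, one common workaround is to flatten each 2d array before writing and keep its shape alongside (e.g. in a second dataset) so it can be reconstructed after reading back. Here is a plain-NumPy sketch of that round trip (the h5py calls themselves are omitted; the arrays are made up):

```python
import numpy as np

# Hypothetical ragged 2d arrays of differing shapes.
arrays = [np.arange(6.0).reshape(2, 3), np.arange(8.0).reshape(4, 2)]

flat = [a.ravel() for a in arrays]    # 1d views: what you'd write to the vlen dataset
shapes = [a.shape for a in arrays]    # store these too, e.g. in a second dataset

# On read-back, restore each original 2d array from its flat data and shape.
restored = [f.reshape(s) for f, s in zip(flat, shapes)]
print(all(np.array_equal(a, r) for a, r in zip(arrays, restored)))
```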
Convert list of lists with different lengths to a numpy array
You could make a NumPy array with np.zeros and fill it with your list elements, as shown below.
import numpy as np

a = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
b = np.zeros([len(a), max(len(row) for row in a)])
for i, row in enumerate(a):
    b[i, :len(row)] = row
results in
[[ 1. 2. 3. 0.]
[ 4. 5. 0. 0.]
[ 6. 7. 8. 9.]]
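If real zeros can occur in your data, a variation (my suggestion, not part of the original answer) is to pad with NaN instead, so padding can't be confused with actual values; this requires a float array:

```python
import numpy as np

a = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
# Pad with NaN instead of 0 so padding is distinguishable from real zeros.
b = np.full((len(a), max(len(row) for row in a)), np.nan)
for i, row in enumerate(a):
    b[i, :len(row)] = row
print(b)
```

Functions like np.nansum and np.nanmean then skip the padding automatically.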
Getting indices of different lengths to slice a multidimensional numpy array
I think this should work:
m = np.arange(F.shape[1]) < K
Rnew = R.copy()
Rnew[np.nonzero(m)[0], np.argsort(F)[m]] += 1
Since the first line uses broadcasting, np.tile() is not needed.
Notice that there is a possible ambiguity in the results: since each row of F has values that are repeated several times (e.g. 0.1 in the first row and -0.4 in the second), np.argsort() may give different orderings of the elements of F, depending on how these equal values get sorted. This may change which entries of the matrix R get incremented. For example, instead of incrementing R[0, 7], R[0, 8], and R[1, 7], the code may increment R[0, 2], R[0, 9] and R[1, 1]. To get unambiguous results, you can specify that np.argsort() must use a stable sorting algorithm, which will preserve the relative order of elements with equal values:
m = np.arange(F.shape[1]) < K
Rnew = R.copy()
Rnew[np.nonzero(m)[0], np.argsort(F, kind="stable")[m]] += 1
In this particular example this will increment the entries R[0, 2], R[0, 7] and R[1, 1]. You need to decide if this is the result that meets your needs.
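Since the asker's F and R aren't shown, here is a self-contained toy version (shapes and values are my own; note K must be a per-row column vector so the comparison in the first line broadcasts to a 2d mask):

```python
import numpy as np

# Toy data: mark the K[i] smallest entries of each row of F in R.
F = np.array([[0.1, 0.3, 0.1, 0.2],
              [0.5, 0.2, 0.2, 0.4]])
R = np.zeros(F.shape, dtype=int)
K = np.array([[2], [1]])  # per-row counts, shaped (2, 1) for broadcasting

m = np.arange(F.shape[1]) < K  # broadcasts to a (2, 4) boolean mask
Rnew = R.copy()
# Stable sort keeps the tie 0.1/0.1 in index order, so index 0 wins over 2.
Rnew[np.nonzero(m)[0], np.argsort(F, kind="stable")[m]] += 1
print(Rnew)
# [[1 0 1 0]
#  [0 1 0 0]]
```

Row 0 has a tie between indices 0 and 2 (both 0.1); the stable sort guarantees index 0 is listed first, making the result reproducible.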
How to access specific row of multidimensional NumPy array with different dimension?
It does not make sense to have different numbers of elements in different rows of the same matrix. To work around this, it is better to first fill the missing elements in each row with 0 or NA, so that all rows have the same number of elements.
Please also look at the answers in Numpy: Fix array with rows of different lengths by filling the empty elements with zeros. Below is an implementation of one of the best solutions mentioned there, applied to your problem.
import numpy as np

def numpy_fillna(data):
    # Pad a sequence of 1d sequences with zeros into a rectangular array.
    lens = np.array([len(i) for i in data])
    mask = np.arange(lens.max()) < lens[:, None]
    flat = np.concatenate([np.asarray(row) for row in data])
    out = np.zeros(mask.shape, dtype=flat.dtype)
    out[mask] = flat
    return out

a = [range(1, 50), range(50, 150)]
data = numpy_fillna(a)
print(data[1, :])