Efficiently Convert Uneven List of Lists to Minimal Containing Array Padded with Nan


This seems to be a close duplicate of this question, where the padding was done with zeros instead of NaNs. Interesting approaches were posted there, along with mine based on broadcasting and boolean indexing. So, I would just modify one line from my post there to solve this case, like so -

import numpy as np

def boolean_indexing(v, fillval=np.nan):
    lens = np.array([len(item) for item in v])
    # mask of valid positions: row i is True for the first lens[i] columns
    mask = lens[:,None] > np.arange(lens.max())
    out = np.full(mask.shape, fillval)
    out[mask] = np.concatenate(v)
    return out

Sample run -

In [32]: l
Out[32]: [[1, 2, 3], [1, 2], [3, 8, 9, 7, 3]]

In [33]: boolean_indexing(l)
Out[33]: 
array([[  1.,   2.,   3.,  nan,  nan],
       [  1.,   2.,  nan,  nan,  nan],
       [  3.,   8.,   9.,   7.,   3.]])

In [34]: boolean_indexing(l,-1)
Out[34]: 
array([[ 1,  2,  3, -1, -1],
       [ 1,  2, -1, -1, -1],
       [ 3,  8,  9,  7,  3]])

I have posted a few runtime results for all of the posted approaches on that Q&A, which could be useful.
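To see what the one modified line does, here is the intermediate mask for the sample list above: each row of the mask is True for exactly as many columns as the corresponding sublist is long, so boolean-indexing with it scatters np.concatenate(l) into the right slots row by row.

```python
import numpy as np

l = [[1, 2, 3], [1, 2], [3, 8, 9, 7, 3]]
lens = np.array([len(item) for item in l])    # [3 2 5]
# Broadcast (3,1) against (5,) to get a (3,5) boolean mask
mask = lens[:, None] > np.arange(lens.max())
# mask:
# [[ True  True  True False False]
#  [ True  True False False False]
#  [ True  True  True  True  True]]
```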

How to append different sizes of numpy arrays and put na's in empty location

You can only concatenate arrays with the same number of dimensions (which can be arranged beforehand by reshaping) and the same size along every axis except the one you concatenate along.

Thus you need to concatenate a placeholder array of the correct shape (here filled with np.nan) and fill it with the values of arr2 afterwards.

# concatenate an array of the correct shape filled with np.nan
arr1_arr2 = np.concatenate((arr1, np.full((1, arr1.shape[1]), np.nan)))
# fill concatenated row with values from arr2
arr1_arr2[-1, :3] = arr2
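For reference, here is a self-contained version of that snippet, with hypothetical arr1 and arr2 chosen to match the shapes the code assumes (a 2-D array and a shorter 1-D row):

```python
import numpy as np

# Hypothetical inputs matching the shapes assumed above
arr1 = np.array([[9., 4., 2., 6., 7.],
                 [8., 5., 4., 1., 3.]])
arr2 = np.array([3., 1., 5.])

# concatenate a row of the correct shape filled with np.nan
arr1_arr2 = np.concatenate((arr1, np.full((1, arr1.shape[1]), np.nan)))
# fill the new row with the values from arr2; the rest stays NaN
arr1_arr2[-1, :arr2.shape[0]] = arr2
```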

But in general it is always good advice NOT to append/concatenate arrays repeatedly. If possible, try to determine the correct shape of the final array in advance, create an empty array (or one filled with np.nan) of that shape, and fill it in the process. For example:

arr1_arr2 = np.full((3, 5), np.nan)
arr1_arr2[:-1, :] = arr1
arr1_arr2[-1, :arr2.shape[0]] = arr2

If it is only a single append/concatenate operation and it is not performance-critical, it is fine to concat/append; otherwise full preallocation in advance should be preferred.
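As a small sketch of that advice (the rows here are hypothetical), compare growing an array by repeated concatenation, which copies the whole array on every call, with filling a preallocated one:

```python
import numpy as np

# Hypothetical stream of equally-shaped rows
rows = [np.arange(5) for _ in range(4)]

# Repeated concatenation: each call copies everything so far (O(n^2) total)
grown = np.empty((0, 5))
for r in rows:
    grown = np.concatenate((grown, r[None, :]))

# Preallocation: one allocation, then cheap slice assignments
prealloc = np.empty((len(rows), 5))
for i, r in enumerate(rows):
    prealloc[i] = r
```

Both produce the same array; only the second scales well with the number of rows.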

If many arrays are to be concatenated and they all have the same shape, this is the best way to do it:

arr1 = np.array([[9, 4, 2, 6, 7],
                 [8, 5, 4, 1, 3]])
# some arrays to concatenate:
arr2 = np.array([3, 1, 5])
arr3 = np.array([5, 7, 9])
arr4 = np.array([54, 1, 99])
# make array of all arrays to concatenate:
arrs_to_concat = np.vstack((arr2, arr3, arr4))
# preallocate result array filled with nan
arr1_arr2 = np.full((arr1.shape[0] + arrs_to_concat.shape[0], 5), np.nan)
# fill with values:
arr1_arr2[:arr1.shape[0], :] = arr1
arr1_arr2[arr1.shape[0]:, :arrs_to_concat.shape[1]] = arrs_to_concat

Performance-wise, for large arrays it may be a good idea to preallocate the final array with np.empty and fill only the remaining padding positions with np.nan.
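A sketch of that variant, reusing the arrays from the example above: np.full writes the fill value into every cell, while np.empty skips that pass, so only the padding cells need to be set to np.nan afterwards.

```python
import numpy as np

arr1 = np.array([[9, 4, 2, 6, 7],
                 [8, 5, 4, 1, 3]])
arrs_to_concat = np.vstack((np.array([3, 1, 5]),
                            np.array([5, 7, 9]),
                            np.array([54, 1, 99])))

# np.empty allocates without initializing; avoids np.full's fill pass
out = np.empty((arr1.shape[0] + arrs_to_concat.shape[0], 5))
out[:arr1.shape[0], :] = arr1
out[arr1.shape[0]:, :arrs_to_concat.shape[1]] = arrs_to_concat
# only the padding columns need np.nan
out[arr1.shape[0]:, arrs_to_concat.shape[1]:] = np.nan
```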

Convert array of arrays of different size into a structured array

Here is a slightly faster version of your code:

def alt(a):
    A = np.full((len(a), max(map(len, a))), np.nan)
    for i, aa in enumerate(a):
        A[i, :len(aa)] = aa
    return A

The for-loops are unavoidable: given that a is a Python list, there is no getting around iterating through its items. Sometimes the loop can be hidden (behind calls to max and map, for instance), but speed-wise such hidden loops are essentially equivalent to explicit Python loops.
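To make that concrete, here is a tiny sketch (the example data is mine, not from the answer) showing that the hidden loop behind max(map(len, a)) traverses the list just like the explicit version does:

```python
a = [[1, 2, 3], [4], [5, 6]]

# Hidden loop: max and map iterate over a internally
m1 = max(map(len, a))

# Explicit loop: the same traversal, spelled out
m2 = 0
for row in a:
    if len(row) > m2:
        m2 = len(row)
```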


Here is a benchmark using a with resultant shape (100, 100):

In [197]: %timeit orig(a)
10000 loops, best of 3: 125 µs per loop

In [198]: %timeit alt(a)
10000 loops, best of 3: 84.1 µs per loop

In [199]: %timeit using_pandas(a)
100 loops, best of 3: 4.8 ms per loop

This was the setup used for the benchmark:

import numpy as np
import pandas as pd

def make_array(h, w):
    a = []
    for i in np.arange(h):
        a += [np.random.rand(np.random.randint(1, w+1))]
    # dtype=object is required for ragged rows in recent NumPy versions
    a = np.array(a, dtype=object)
    return a

def orig(a):
    max_len_of_array = 0

    for aa in a:
        len_of_array = aa.shape[0]
        if len_of_array > max_len_of_array:
            max_len_of_array = len_of_array

    n = a.shape[0]

    A = np.zeros((n, max_len_of_array)) * np.nan
    for i, aa in enumerate(zip(a)):
        A[i][:aa[0].shape[0]] = aa[0]

    return A

def alt(a):
    A = np.full((len(a), max(map(len, a))), np.nan)
    for i, aa in enumerate(a):
        A[i, :len(aa)] = aa
    return A

def using_pandas(a):
    return pd.DataFrame.from_records(a).values

a = make_array(100,100)

How to repeat a numpy array along a new dimension with padding?

Here's one with masking based on this idea -

import numpy as np

to_repeat = np.array([1, 2, 3, 4, 5, 6])  # example inputs consistent
repeats = np.array([1, 2, 2, 3, 3, 1])    # with the sample output below

m = repeats[:,None] > np.arange(repeats.max())
out = np.zeros(m.shape, dtype=to_repeat.dtype)
out[m] = np.repeat(to_repeat, repeats)

Sample output -

In [44]: out
Out[44]: 
array([[1, 0, 0],
       [2, 2, 0],
       [3, 3, 0],
       [4, 4, 4],
       [5, 5, 5],
       [6, 0, 0]])

Or with broadcasted-multiplication -

In [67]: m*to_repeat[:,None]
Out[67]: 
array([[1, 0, 0],
       [2, 2, 0],
       [3, 3, 0],
       [4, 4, 4],
       [5, 5, 5],
       [6, 0, 0]])

For large datasets/sizes, we can leverage multiple cores and be more memory-efficient by doing that broadcasting with the numexpr module -

In [64]: import numexpr as ne

# Re-using mask `m` from previous method
In [65]: ne.evaluate('m*R',{'m':m,'R':to_repeat[:,None]})
Out[65]: 
array([[1, 0, 0],
       [2, 2, 0],
       [3, 3, 0],
       [4, 4, 4],
       [5, 5, 5],
       [6, 0, 0]])

