Efficiently convert uneven list of lists to minimal containing array padded with NaN
This seems to be a close relative of this question, where the padding was with zeros instead of NaNs. Interesting approaches were posted there, along with mine based on broadcasting and boolean indexing. So, I would just modify one line from my post there to solve this case, like so -
import numpy as np

def boolean_indexing(v, fillval=np.nan):
    lens = np.array([len(item) for item in v])
    # mask of valid (non-padded) positions per row
    mask = lens[:, None] > np.arange(lens.max())
    out = np.full(mask.shape, fillval)
    out[mask] = np.concatenate(v)
    return out
Sample run -
In [32]: l
Out[32]: [[1, 2, 3], [1, 2], [3, 8, 9, 7, 3]]
In [33]: boolean_indexing(l)
Out[33]:
array([[ 1., 2., 3., nan, nan],
[ 1., 2., nan, nan, nan],
[ 3., 8., 9., 7., 3.]])
In [34]: boolean_indexing(l,-1)
Out[34]:
array([[ 1, 2, 3, -1, -1],
[ 1, 2, -1, -1, -1],
[ 3, 8, 9, 7, 3]])
I have posted a few runtime results for all the posted approaches on that Q&A, which could be useful.
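To make the mechanics of the boolean-indexing approach easier to follow, here is a minimal step-by-step sketch of the intermediate mask, using the same example list as the sample run above:

```python
import numpy as np

l = [[1, 2, 3], [1, 2], [3, 8, 9, 7, 3]]

lens = np.array([len(item) for item in l])     # [3, 2, 5]
# compare each row length against column indices 0..max-1
mask = lens[:, None] > np.arange(lens.max())
# mask marks, per row, the slots that will hold real data:
# [[ True,  True,  True, False, False],
#  [ True,  True, False, False, False],
#  [ True,  True,  True,  True,  True]]
out = np.full(mask.shape, np.nan)
# boolean assignment fills True slots in row-major order,
# which lines up exactly with the concatenated values
out[mask] = np.concatenate(l)
```

The key observation is that `out[mask] = ...` walks the `True` cells row by row, left to right, which is the same order `np.concatenate` produces the values in.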
How to append different sizes of numpy arrays and put na's in empty location
You can only concatenate arrays that have the same number of dimensions (a missing dimension can be added with reshape or np.newaxis) and the same size along every axis except the concatenation axis. Thus you need to concatenate a row of the correct shape, filled with np.nan, and write the values of arr2 into it afterwards.
import numpy as np

# example arrays (shapes assumed from the question: arr1 is 2x5, arr2 has 3 elements)
arr1 = np.array([[9, 4, 2, 6, 7],
                 [8, 5, 4, 1, 3]])
arr2 = np.array([3, 1, 5])

# concatenate a row of the correct shape filled with np.nan
arr1_arr2 = np.concatenate((arr1, np.full((1, arr1.shape[1]), np.nan)))
# fill the concatenated row with the values from arr2
arr1_arr2[-1, :arr2.shape[0]] = arr2
But in general it is good advice NOT to append/concatenate arrays repeatedly. If possible, work out the final shape of the array in advance and create an empty array (or one filled with np.nan) of that shape, which is then filled in the process. For example:
arr1_arr2 = np.full((3, 5), np.nan)
arr1_arr2[:-1, :] = arr1
arr1_arr2[-1, :arr2.shape[0]] = arr2
If it is only a single append/concatenate operation and it is not performance-critical, it is OK to concat/append; otherwise full preallocation in advance should be preferred.
If many arrays are to be concatenated and they all have the same shape, this is the best way to do it:
arr1 = np.array([[9, 4, 2, 6, 7],
                 [8, 5, 4, 1, 3]])
# some arrays to concatenate:
arr2 = np.array([3, 1, 5])
arr3 = np.array([5, 7, 9])
arr4 = np.array([54, 1, 99])
# make array of all arrays to concatenate:
arrs_to_concat = np.vstack((arr2, arr3, arr4))
# preallocate result array filled with nan
arr1_arr2 = np.full((arr1.shape[0] + arrs_to_concat.shape[0], 5), np.nan)
# fill with values:
arr1_arr2[:arr1.shape[0], :] = arr1
arr1_arr2[arr1.shape[0]:, :arrs_to_concat.shape[1]] = arrs_to_concat
Performance-wise, for large arrays it may be a good idea to preallocate the final array with np.empty and fill only the remaining (padding) region with np.nan, instead of NaN-initializing the whole thing.
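A minimal sketch of that np.empty variant, reusing the example arrays from the block above: the uninitialized buffer is written over completely, and np.nan is assigned only to the padding columns.

```python
import numpy as np

arr1 = np.array([[9, 4, 2, 6, 7],
                 [8, 5, 4, 1, 3]])
arrs_to_concat = np.vstack(([3, 1, 5], [5, 7, 9], [54, 1, 99]))

# uninitialized preallocation: cheaper than np.full for large shapes
out = np.empty((arr1.shape[0] + arrs_to_concat.shape[0], 5))
out[:arr1.shape[0], :] = arr1
out[arr1.shape[0]:, :arrs_to_concat.shape[1]] = arrs_to_concat
# only the padding region needs explicit NaNs
out[arr1.shape[0]:, arrs_to_concat.shape[1]:] = np.nan
```

Every cell is assigned exactly once, so no uninitialized memory leaks into the result.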
Convert array of arrays of different size into a structured array
Here is a slightly faster version of your code:
def alt(a):
    A = np.full((len(a), max(map(len, a))), np.nan)
    for i, aa in enumerate(a):
        A[i, :len(aa)] = aa
    return A
The for-loops are unavoidable. Given that a is a Python list, there is no getting around the need to iterate through its items. Sometimes the loop can be hidden (behind calls to max and map, for instance), but speed-wise such hidden loops are essentially equivalent to explicit Python loops.
Here is a benchmark using an a with resultant shape (100, 100):
In [197]: %timeit orig(a)
10000 loops, best of 3: 125 µs per loop
In [198]: %timeit alt(a)
10000 loops, best of 3: 84.1 µs per loop
In [199]: %timeit using_pandas(a)
100 loops, best of 3: 4.8 ms per loop
This was the setup used for the benchmark:
import numpy as np
import pandas as pd

def make_array(h, w):
    a = []
    for i in np.arange(h):
        a += [np.random.rand(np.random.randint(1, w + 1))]
    # ragged rows require an explicit object dtype in recent NumPy versions
    a = np.array(a, dtype=object)
    return a

def orig(a):
    max_len_of_array = 0
    for aa in a:
        len_of_array = aa.shape[0]
        if len_of_array > max_len_of_array:
            max_len_of_array = len_of_array
    n = a.shape[0]
    A = np.zeros((n, max_len_of_array)) * np.nan
    for i, aa in enumerate(zip(a)):
        A[i][:aa[0].shape[0]] = aa[0]
    return A

def alt(a):
    A = np.full((len(a), max(map(len, a))), np.nan)
    for i, aa in enumerate(a):
        A[i, :len(aa)] = aa
    return A

def using_pandas(a):
    return pd.DataFrame.from_records(a).values

a = make_array(100, 100)
How to repeat a numpy array along a new dimension with padding?
Here's one with masking, based on this idea -
m = repeats[:, None] > np.arange(repeats.max())
out = np.zeros(m.shape, dtype=to_repeat.dtype)
out[m] = np.repeat(to_repeat, repeats)
Sample output -
In [44]: out
Out[44]:
array([[1, 0, 0],
[2, 2, 0],
[3, 3, 0],
[4, 4, 4],
[5, 5, 5],
[6, 0, 0]])
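For reference, here is a self-contained version of the masking approach; the inputs to_repeat and repeats below are inferred from the sample output above and are only an illustrative reconstruction, not necessarily the original data:

```python
import numpy as np

# hypothetical inputs, reconstructed from the sample output
to_repeat = np.array([1, 2, 3, 4, 5, 6])
repeats = np.array([1, 2, 2, 3, 3, 1])

# mask of valid slots: row i gets repeats[i] True cells
m = repeats[:, None] > np.arange(repeats.max())
out = np.zeros(m.shape, dtype=to_repeat.dtype)
# np.repeat yields the values in exactly the row-major order
# that boolean assignment fills the True cells in
out[m] = np.repeat(to_repeat, repeats)
```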
Or with broadcasted-multiplication -
In [67]: m*to_repeat[:,None]
Out[67]:
array([[1, 0, 0],
[2, 2, 0],
[3, 3, 0],
[4, 4, 4],
[5, 5, 5],
[6, 0, 0]])
For large datasets/sizes, we can leverage multiple cores and be more memory-efficient by evaluating that broadcasted multiplication with the numexpr module -
In [64]: import numexpr as ne
# Re-using mask `m` from previous method
In [65]: ne.evaluate('m*R',{'m':m,'R':to_repeat[:,None]})
Out[65]:
array([[1, 0, 0],
[2, 2, 0],
[3, 3, 0],
[4, 4, 4],
[5, 5, 5],
[6, 0, 0]])
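If numexpr is not available, a pure-NumPy way to keep the memory footprint down is to write the broadcasted product directly into a preallocated buffer via np.multiply's out= argument; the inputs below are the same hypothetical reconstruction from the sample output:

```python
import numpy as np

# hypothetical inputs, reconstructed from the sample output
to_repeat = np.array([1, 2, 3, 4, 5, 6])
repeats = np.array([1, 2, 2, 3, 3, 1])
m = repeats[:, None] > np.arange(repeats.max())

# preallocate once, then compute the broadcasted product in-place,
# avoiding a separate temporary for the result
out = np.empty(m.shape, dtype=to_repeat.dtype)
np.multiply(m, to_repeat[:, None], out=out)
```

This avoids one temporary array per evaluation, which matters when the result is large or the operation runs in a loop.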