Convert Python sequence to NumPy array, filling missing values

You can use itertools.zip_longest:

import itertools
import numpy as np

v = [[1], [1, 2]]  # example ragged input, reconstructed from the output below
np.array(list(itertools.zip_longest(*v, fillvalue=0))).T
Out:
array([[1, 0],
       [1, 2]])

Note: For Python 2, it is itertools.izip_longest.

Fill missing values with mean until getting a certain shape in numpy

You could use the pad() function:

import numpy as np
A = np.array([[365.],
              [173.],
              [389.],
              [173.],
              [342.],
              [173.],
              [294.],
              [165.],
              [246.],
              [142.],
              [254.],
              [142.],
              [357.],
              [260.],
              [389.],
              [339.],
              [389.],
              [339.],
              [381.],
              [410.],
              [381.],
              [410.]])

...

B = np.pad(A, ((0, 24 - A.shape[0]), (0, 0)), 'mean')  # append rows filled with the column mean
print(B)

[[365.        ]
 [173.        ]
 [389.        ]
 [173.        ]
 [342.        ]
 [173.        ]
 [294.        ]
 [165.        ]
 [246.        ]
 [142.        ]
 [254.        ]
 [142.        ]
 [357.        ]
 [260.        ]
 [389.        ]
 [339.        ]
 [389.        ]
 [339.        ]
 [381.        ]
 [410.        ]
 [381.        ]
 [410.        ]
 [296.04545455]
 [296.04545455]]
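The second argument to np.pad is a (before, after) pair for each axis, so ((0, 24 - A.shape[0]), (0, 0)) appends rows and leaves the columns untouched. As a minimal sketch of how this generalizes (the helper name pad_rows_to is our own, not from the answer):

import numpy as np

def pad_rows_to(a, n_rows, mode='mean'):
    # pad a 2-D array with extra rows (filled according to `mode`) until it has n_rows
    missing = n_rows - a.shape[0]
    if missing < 0:
        raise ValueError("array already has more than n_rows rows")
    return np.pad(a, ((0, missing), (0, 0)), mode)

A = np.array([[365.], [173.], [389.]])
print(pad_rows_to(A, 5))  # the two appended rows hold the column mean, 309.0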

Fill missing values in 3D list with zeros to create 3D numpy array

Here's a way that allocates the NumPy array up front then copies the data over. Assuming you don't actually need the expanded ll, this should use less memory than appending the 0-triples to ll before creating a1:

import numpy as np

# input implied by the output below
ll = [[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
      [[6, 7, 8], [12, 13, 14]],
      [[10, 20, 30], [40, 50, 60], [70, 80, 90]]]

a1 = np.zeros((len(ll), max([len(k) for k in ll]), 3))
for ctr, k in enumerate(ll):
    a1[ctr, :len(k), :] = k

a1
array([[[ 1.,  2.,  3.],
        [ 4.,  5.,  6.],
        [ 7.,  8.,  9.],
        [10., 11., 12.]],

       [[ 6.,  7.,  8.],
        [12., 13., 14.],
        [ 0.,  0.,  0.],
        [ 0.,  0.,  0.]],

       [[10., 20., 30.],
        [40., 50., 60.],
        [70., 80., 90.],
        [ 0.,  0.,  0.]]])

max([len(k) for k in ll]) tells us the maximum number of triples in any member of ll. We allocate a zero-initialized NumPy array of the desired size; then, in the loop, slice indexing tells us where in a1 to copy each member of ll.
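The same allocate-then-copy idea works for any fill value. A small sketch wrapping it in a reusable function (the name pad_ragged is our own):

import numpy as np

def pad_ragged(ll, width, fill=0.0):
    # pack a ragged list of lists of `width`-long rows into one 3-D array
    out = np.full((len(ll), max(len(k) for k in ll), width), fill)
    for i, k in enumerate(ll):
        out[i, :len(k), :] = k
    return out

ll = [[[1, 2, 3]], [[4, 5, 6], [7, 8, 9]]]
print(pad_ragged(ll, 3).shape)  # (2, 2, 3); the short member is zero-padded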

Handling missing data with numpy to concatenate different shape arrays

Try this instead of the np.append:

  1. Create np.zeros((arr1.shape[0] - arr.shape[0], arr.shape[1])) to cover the difference in rows
  2. np.vstack the arr and the zeros
  3. np.concatenate arr and arr1 along axis 1

arr = np.vstack([arr, np.zeros((arr1.shape[0] - arr.shape[0], arr.shape[1]))])  # pad arr with zero rows

arr = np.concatenate((arr, arr1), axis=1)

print(arr)
# [[  1   2]
#  [  0  20]
#  [  6  10]
#  [  2   4]
#  [  0   1]
#  [  0   6]
#  [  0 348]]
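For reference, inputs consistent with that printed result would look like the following (our reconstruction; the question's actual arrays may differ, and NumPy will actually print the result as floats, since the zero padding is float64):

import numpy as np

arr = np.array([[1], [0], [6], [2]])                      # shorter array: 4 rows
arr1 = np.array([[2], [20], [10], [4], [1], [6], [348]])  # longer array: 7 rows

arr = np.vstack([arr, np.zeros((arr1.shape[0] - arr.shape[0], arr.shape[1]))])
arr = np.concatenate((arr, arr1), axis=1)
print(arr.shape)  # (7, 2)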

Convert list of lists with different lengths to a numpy array

You can make a NumPy array with np.zeros and fill it with your list elements as shown below.

import numpy as np

a = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
b = np.zeros([len(a), len(max(a, key=len))])
for i, j in enumerate(a):
    b[i][0:len(j)] = j

results in

[[ 1.  2.  3.  0.]
 [ 4.  5.  0.  0.]
 [ 6.  7.  8.  9.]]
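If zero is a meaningful value in your data, the same pattern works with np.full and NaN as the placeholder (a variant we add here, not part of the original answer):

import numpy as np

a = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
b = np.full((len(a), max(len(x) for x in a)), np.nan)
for i, row in enumerate(a):
    b[i, :len(row)] = row
# short rows are now padded with nan instead of 0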

numpy time series merge and fill missing values with earlier values

Intersection

Your best bet for a fast intersection is probably np.searchsorted. It will do a binary search in filled_timestamp for the elements of raw_timestamp:

idx = np.searchsorted(filled_timestamp, raw_timestamp)

This will only be accurate if every element of raw_timestamp actually occurs in filled_timestamp, because np.searchsorted returns an insertion index whether or not an exact match is found.
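A cheap sanity check for that assumption (our addition, reusing the names above):

# every raw timestamp must actually be present in filled_timestamp
assert np.array_equal(filled_timestamp[idx], raw_timestamp)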

Non-vectorized Solution

You want to set a slice of filled_sensor from idx[n] to idx[n + 1] to the value of raw_sensor[n]:

from itertools import zip_longest

for start, end, row in zip_longest(idx, idx[1:], raw_sensor):
    filled_sensor[start:end] = row

I am using zip_longest here so that the last value coming from idx[1:] is None, making the last slice equivalent to filled_sensor[idx[-1]:] without requiring a special condition.
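To see what that pairing looks like with the idx from the example further down (our illustration):

from itertools import zip_longest

idx = [0, 3, 5, 6]
print(list(zip_longest(idx, idx[1:])))
# [(0, 3), (3, 5), (5, 6), (6, None)] -- the final slice runs to the end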

Vectorized Solution

You can create filled_sensor in one shot directly from raw_sensor if you know which indices to repeat from raw_sensor. You can get that information by applying np.cumsum to idx converted to a boolean array:

idx_mask = np.zeros(filled_timestamp.shape, bool)
idx_mask[idx] = True

Basically, we start with a boolean array of the same size as filled_timestamp that is True (1) wherever an entry from raw_timestamp matches. We can convert that to an index in raw_timestamp by counting how many total matches have occurred up to that point:

indexes = np.cumsum(idx_mask) - 1

Keep in mind that indexes is an array of integers, not booleans. It will increment whenever a new match is found. The - 1 converts from count to index because the first match will have a count of 1 instead of 0.

Now you can just make filled_sensor:

filled_sensor = raw_sensor[indexes]

The only possible caveat here is if filled_sensor[0] does not come from raw_sensor[0]: it will then be replaced with raw_sensor[-1]. Given how you construct the times in filled based on raw, I am not sure that can ever actually be an issue.
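If you want to guard against that edge case anyway, a check such as this would catch it (our addition):

# before the first match, np.cumsum(idx_mask) - 1 is -1, which wraps to raw_sensor[-1]
assert idx[0] == 0, "filled_timestamp starts before the first raw timestamp"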

Example

Here is an example of the Intersection and Vectorized Solution steps with the data that you show in your question.

We start with

raw_timestamp = np.array(['2009-01-01T18:41:00',
                          '2009-01-01T18:44:00',
                          '2009-01-01T18:46:00',
                          '2009-01-01T18:47:00'], dtype='datetime64[s]')
raw_sensor = np.array([(755, 855, 755, 855, 743, 843, 743, 843, 2),
                       (693, 793, 693, 793, 693, 793, 693, 793, 1),
                       (755, 855, 755, 855, 743, 843, 743, 843, 2),
                       (693, 793, 693, 793, 693, 793, 693, 793, 1)],
                      dtype=[('sensorA', '<u4'), ('sensorB', '<u4'),
                             ('sensorC', '<u4'), ('sensorD', '<u4'),
                             ('sensorE', '<u4'), ('sensorF', '<u4'),
                             ('sensorG', '<u4'), ('sensorH', '<u4'),
                             ('signal', '<u4')])

We can generate filled_timestamp as

filled_timestamp = np.arange('2009-01-01T18:41:00',
                             '2009-01-01T18:48:00', 60, dtype='datetime64[s]')

Which yields, as expected:

array(['2009-01-01T18:41:00', '2009-01-01T18:42:00', '2009-01-01T18:43:00',
       '2009-01-01T18:44:00', '2009-01-01T18:45:00', '2009-01-01T18:46:00',
       '2009-01-01T18:47:00'], dtype='datetime64[s]')

I have taken a slight liberty with the dtypes by making timestamps plain arrays instead of structured arrays, but I think that should make no difference for your purpose.

  1. idx = np.searchsorted(filled_timestamp, raw_timestamp) yields

    idx = np.array([0, 3, 5, 6])

    This means that indices 0, 3, 5, 6 in filled_timestamp match values from raw_timestamp.

  2. idx_mask then becomes

    idx_mask = np.array([True, False, False, True, False, True, True], dtype=bool)

    This is basically synonymous with idx, except expanded to a boolean mask the same size as filled_timestamp.

  3. Now the tricky part: indexes = np.cumsum(idx_mask) - 1:

    indexes = np.array([0, 0, 0, 1, 1, 2, 3])

    This can be interpreted as follows: filled_sensor[0:3] should come from raw_sensor[0]. filled_sensor[3:5] should come from raw_sensor[1], filled_sensor[5] should come from raw_sensor[2], filled_sensor[6] should come from raw_sensor[3].

  4. So now we use indexes to directly extract the correct elements of raw_sensor using filled_sensor = raw_sensor[indexes]:

    np.array([(755, 855, 755, 855, 743, 843, 743, 843, 2),
              (755, 855, 755, 855, 743, 843, 743, 843, 2),
              (755, 855, 755, 855, 743, 843, 743, 843, 2),
              (693, 793, 693, 793, 693, 793, 693, 793, 1),
              (693, 793, 693, 793, 693, 793, 693, 793, 1),
              (755, 855, 755, 855, 743, 843, 743, 843, 2),
              (693, 793, 693, 793, 693, 793, 693, 793, 1)],
             dtype=[('sensorA', '<u4'), ('sensorB', '<u4'),
                    ('sensorC', '<u4'), ('sensorD', '<u4'),
                    ('sensorE', '<u4'), ('sensorF', '<u4'),
                    ('sensorG', '<u4'), ('sensorH', '<u4'),
                    ('signal', '<u4')])
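Putting the Intersection and Vectorized Solution steps together, here is a condensed, runnable sketch of the whole pipeline (we substitute a plain 1-D raw_sensor for the structured array to keep it short):

import numpy as np

raw_timestamp = np.array(['2009-01-01T18:41:00', '2009-01-01T18:44:00',
                          '2009-01-01T18:46:00', '2009-01-01T18:47:00'],
                         dtype='datetime64[s]')
raw_sensor = np.array([1, 2, 3, 4])  # stand-in for the structured sensor rows
filled_timestamp = np.arange('2009-01-01T18:41:00', '2009-01-01T18:48:00',
                             60, dtype='datetime64[s]')

idx = np.searchsorted(filled_timestamp, raw_timestamp)  # [0, 3, 5, 6]
idx_mask = np.zeros(filled_timestamp.shape, bool)
idx_mask[idx] = True
indexes = np.cumsum(idx_mask) - 1                       # [0, 0, 0, 1, 1, 2, 3]
filled_sensor = raw_sensor[indexes]
print(filled_sensor)                                    # [1 1 1 2 2 3 4]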

How to replace `0`s with missing values according to a series of numbers in a numpy array?

This is straightforward with pandas. You can apply a linear interpolation using interpolate() (its default interpolation method is linear):

import numpy as np
import pandas as pd

a = np.array([15, 25, 0, 45, 0, 0, 75, 85])
s = pd.Series(a)
s.mask(s.eq(0)).interpolate()

0    15.0
1    25.0
2    35.0
3    45.0
4    55.0
5    65.0
6    75.0
7    85.0
dtype: float64
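If you would rather stay in pure NumPy, np.interp can do the same linear fill (our alternative, not part of the original answer):

import numpy as np

a = np.array([15, 25, 0, 45, 0, 0, 75, 85], dtype=float)
missing = a == 0  # treat zeros as missing
a[missing] = np.interp(np.flatnonzero(missing),   # positions to fill
                       np.flatnonzero(~missing),  # positions of known values
                       a[~missing])               # the known values
print(a)  # [15. 25. 35. 45. 55. 65. 75. 85.]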

