Convert Python sequence to NumPy array, filling missing values
You can use itertools.zip_longest, which pads the shorter inner sequences with a fill value:
import itertools
import numpy as np
np.array(list(itertools.zip_longest(*v, fillvalue=0))).T
Out:
array([[1, 0],
       [1, 2]])
Note: For Python 2, it is itertools.izip_longest.
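The input v is not shown in the answer above; here is a self-contained version with one hypothetical input that reproduces the printed output:

```python
import itertools
import numpy as np

# Hypothetical ragged input (assumed; any list of uneven lists works).
v = [[1], [1, 2]]

# zip_longest pads the shorter rows with fillvalue; transposing restores row order.
arr = np.array(list(itertools.zip_longest(*v, fillvalue=0))).T
print(arr)
# [[1 0]
#  [1 2]]
```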
Fill missing values with mean until getting a certain shape in numpy
You could use the np.pad() function with the 'mean' mode:
import numpy as np
A = np.array([[365.],
[173.],
[389.],
[173.],
[342.],
[173.],
[294.],
[165.],
[246.],
[142.],
[254.],
[142.],
[357.],
[260.],
[389.],
[339.],
[389.],
[339.],
[381.],
[410.],
[381.],
[410.]])
...
B = np.pad(A,((0,24-A.shape[0]),(0,0)),'mean')
print(B)
[[365. ]
[173. ]
[389. ]
[173. ]
[342. ]
[173. ]
[294. ]
[165. ]
[246. ]
[142. ]
[254. ]
[142. ]
[357. ]
[260. ]
[389. ]
[339. ]
[389. ]
[339. ]
[381. ]
[410. ]
[381. ]
[410. ]
[296.04545455]
[296.04545455]]
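'mean' is only one of np.pad's statistic modes; here is a minimal sketch (with assumed toy values) comparing it to 'edge', which repeats the boundary row instead:

```python
import numpy as np

# Toy column vector (assumed values for illustration).
A = np.array([[1.], [2.], [3.]])

# Pad two extra rows at the bottom with the column mean ((1+2+3)/3 = 2.0).
B_mean = np.pad(A, ((0, 2), (0, 0)), 'mean')

# 'edge' repeats the last row instead of using a statistic.
B_edge = np.pad(A, ((0, 2), (0, 0)), 'edge')

print(B_mean.ravel())  # [1. 2. 3. 2. 2.]
print(B_edge.ravel())  # [1. 2. 3. 3. 3.]
```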
fill missing values in 3D list with zeros to create 3D numpy array
Here's a way that allocates the NumPy array up front, then copies the data over. Assuming you don't actually need the expanded ll, this should use less memory than appending the 0-triples to ll before creating a1:
import numpy as np

# ll as implied by the output below
ll = [[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
      [[6, 7, 8], [12, 13, 14]],
      [[10, 20, 30], [40, 50, 60], [70, 80, 90]]]

a1 = np.zeros((len(ll), max([len(k) for k in ll]), 3))
for ctr, k in enumerate(ll):
    a1[ctr, :len(k), :] = k
a1
array([[[ 1.,  2.,  3.],
        [ 4.,  5.,  6.],
        [ 7.,  8.,  9.],
        [10., 11., 12.]],

       [[ 6.,  7.,  8.],
        [12., 13., 14.],
        [ 0.,  0.,  0.],
        [ 0.,  0.,  0.]],

       [[10., 20., 30.],
        [40., 50., 60.],
        [70., 80., 90.],
        [ 0.,  0.,  0.]]])
max([len(k) for k in ll]) tells us the maximum number of triples in any member of ll. We allocate a 0-initialized NumPy array of the desired size. Then in the loop, smart indexing tells us where in a1 to copy each member of ll.
Handling missing data with numpy to concatenate different shape arrays
Try this instead of np.append:
- Create np.zeros((difference in shape[0], arr.shape[1]))
- np.vstack the arr and the zeros
- Concatenate arr and arr1
arr = np.vstack([arr, np.zeros((arr1.shape[0] - arr.shape[0], arr.shape[1]))]) #<--------
arr = np.concatenate((arr, arr1), axis=1)
print(arr)
# [[1 2]
# [0 20]
# [6 10]
# [2 4]
# [0 1]
# [0 6]
# [0 348]]
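arr and arr1 are not shown above; here is a runnable sketch of the same steps, with hypothetical arrays chosen to be consistent with the printed result:

```python
import numpy as np

# Hypothetical inputs of different lengths (the original arrays are not shown).
arr = np.array([[1], [0], [6], [2]])                       # shape (4, 1)
arr1 = np.array([[2], [20], [10], [4], [1], [6], [348]])   # shape (7, 1)

# Pad arr with zero rows so both arrays have the same number of rows.
arr = np.vstack([arr, np.zeros((arr1.shape[0] - arr.shape[0], arr.shape[1]),
                               dtype=arr.dtype)])

# Join the two equal-length columns side by side.
out = np.concatenate((arr, arr1), axis=1)
print(out)
# [[  1   2]
#  [  0  20]
#  [  6  10]
#  [  2   4]
#  [  0   1]
#  [  0   6]
#  [  0 348]]
```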
Convert list of lists with different lengths to a numpy array
You could make a NumPy array with np.zeros and fill it with your list elements, as shown below.
import numpy as np

a = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
b = np.zeros((len(a), max(len(x) for x in a)))
for i, j in enumerate(a):
    b[i, :len(j)] = j
results in
[[ 1.  2.  3.  0.]
 [ 4.  5.  0.  0.]
 [ 6.  7.  8.  9.]]
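If 0 is a legitimate data value, a variant (not from the answer above) is to fill with NaN via np.full, so padding stays distinguishable from real zeros:

```python
import numpy as np

a = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# np.full lets you choose the fill value; NaN marks padding unambiguously.
b = np.full((len(a), max(len(row) for row in a)), np.nan)
for i, row in enumerate(a):
    b[i, :len(row)] = row
print(b)
# [[ 1.  2.  3. nan]
#  [ 4.  5. nan nan]
#  [ 6.  7.  8.  9.]]
```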
numpy time series merge and fill missing values with earlier values
Intersection
Your best bet for a fast intersection is probably np.searchsorted. It will do a binary search in filled_timestamp for the elements of raw_timestamp:
idx = np.searchsorted(filled_timestamp, raw_timestamp)
This will only be accurate if every element of raw_timestamp actually occurs in filled_timestamp, because np.searchsorted will return an insertion index regardless.
Non-vectorized Solution
You want to set a slice of filled_sensor from idx[n] to idx[n + 1] to the value of raw_sensor[n]:
from itertools import zip_longest
for start, end, row in zip_longest(idx, idx[1:], raw_sensor):
    filled_sensor[start:end] = row
I am using zip_longest here so that the last value coming from idx[1:] will be None, making the last slice equivalent to filled_sensor[idx[-1]:] without requiring a special condition.
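As a sanity check, here is the loop run on small assumed integer timestamps rather than the question's datetimes:

```python
import numpy as np
from itertools import zip_longest

# Assumed toy data: 7 regular timestamps, readings at 4 of them.
filled_timestamp = np.arange(0, 7)      # e.g. minutes 0..6
raw_timestamp = np.array([0, 3, 5, 6])  # subset that has readings
raw_sensor = np.array([10, 20, 30, 40])

idx = np.searchsorted(filled_timestamp, raw_timestamp)  # -> [0, 3, 5, 6]
filled_sensor = np.empty(filled_timestamp.shape, dtype=raw_sensor.dtype)

# Each reading covers the gap until the next one; zip_longest yields
# end=None for the final reading, so the last slice runs to the end.
for start, end, row in zip_longest(idx, idx[1:], raw_sensor):
    filled_sensor[start:end] = row
print(filled_sensor)
# [10 10 10 20 20 30 40]
```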
Vectorized Solution
You can create filled_sensor in one shot directly from raw_sensor if you know which indices to repeat from raw_sensor. You can get that information by applying np.cumsum to idx converted to a boolean array:
idx_mask = np.zeros(filled_timestamp.shape, bool)
idx_mask[idx] = True
Basically, we start with a boolean array of the same size as filled_timestamp that is True (1) wherever an entry from raw_timestamp matches. We can convert that to an index into raw_timestamp by counting how many total matches have occurred up to that point:
indexes = np.cumsum(idx_mask) - 1
Keep in mind that indexes is an array of integers, not booleans. It increments whenever a new match is found. The - 1 converts from count to index, because the first match will have a count of 1 instead of 0.
Now you can just make filled_sensor:
filled_sensor = raw_sensor[indexes]
The only possible caveat here is if filled_sensor[0] does not come from raw_sensor[0]; it would then be filled with raw_sensor[-1] instead. Given how you construct the times in filled based on raw, I am not sure that can ever be an issue.
Example
Here is an example of the Intersection and Vectorized Solution steps with the data that you show in your question.
We start with
raw_timestamp = np.array(['2009-01-01T18:41:00',
'2009-01-01T18:44:00',
'2009-01-01T18:46:00',
'2009-01-01T18:47:00',], dtype='datetime64[s]')
raw_sensor = np.array([(755, 855, 755, 855, 743, 843, 743, 843, 2),
(693, 793, 693, 793, 693, 793, 693, 793, 1),
(755, 855, 755, 855, 743, 843, 743, 843, 2),
(693, 793, 693, 793, 693, 793, 693, 793, 1),],
dtype=[('sensorA', '<u4'), ('sensorB', '<u4'),
('sensorC', '<u4'), ('sensorD', '<u4'),
('sensorE', '<u4'), ('sensorF', '<u4'),
('sensorG', '<u4'), ('sensorH', '<u4'),
('signal', '<u4')])
We can generate filled_timestamp as
filled_timestamp = np.arange('2009-01-01T18:41:00',
'2009-01-01T18:48:00', 60, dtype='datetime64[s]')
Which yields, as expected:
array(['2009-01-01T18:41:00', '2009-01-01T18:42:00', '2009-01-01T18:43:00',
'2009-01-01T18:44:00', '2009-01-01T18:45:00', '2009-01-01T18:46:00',
'2009-01-01T18:47:00'], dtype='datetime64[s]')
I have taken a slight liberty with the dtypes by making timestamps plain arrays instead of structured arrays, but I think that should make no difference for your purpose.
idx = np.searchsorted(filled_timestamp, raw_timestamp)
yields
idx = np.array([0, 3, 5, 6])
This means that indices 0, 3, 5, 6 in filled_timestamp match values from raw_timestamp. idx_mask then becomes
idx_mask = np.array([True, False, False, True, False, True, True])
This is basically synonymous with idx, except expanded into a boolean mask the same size as filled_timestamp.
Now the tricky part: indexes = np.cumsum(idx_mask) - 1 gives
indexes = np.array([0, 0, 0, 1, 1, 2, 3])
This can be interpreted as follows: filled_sensor[0:3] should come from raw_sensor[0], filled_sensor[3:5] from raw_sensor[1], filled_sensor[5] from raw_sensor[2], and filled_sensor[6] from raw_sensor[3].
So now we use indexes to directly extract the correct elements of raw_sensor with filled_sensor = raw_sensor[indexes]:
:np.array([(755, 855, 755, 855, 743, 843, 743, 843, 2),
(755, 855, 755, 855, 743, 843, 743, 843, 2),
(755, 855, 755, 855, 743, 843, 743, 843, 2),
(693, 793, 693, 793, 693, 793, 693, 793, 1),
(693, 793, 693, 793, 693, 793, 693, 793, 1),
(755, 855, 755, 855, 743, 843, 743, 843, 2),
(693, 793, 693, 793, 693, 793, 693, 793, 1)],
dtype=[('sensorA', '<u4'), ('sensorB', '<u4'),
('sensorC', '<u4'), ('sensorD', '<u4'),
('sensorE', '<u4'), ('sensorF', '<u4'),
('sensorG', '<u4'), ('sensorH', '<u4'),
('signal', '<u4')])
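Putting the pieces together, here is a condensed runnable version of the example above (keeping only the signal column of raw_sensor for brevity):

```python
import numpy as np

raw_timestamp = np.array(['2009-01-01T18:41:00', '2009-01-01T18:44:00',
                          '2009-01-01T18:46:00', '2009-01-01T18:47:00'],
                         dtype='datetime64[s]')
raw_sensor = np.array([2, 1, 2, 1])  # just the 'signal' values

# Regular one-minute grid covering the same span.
filled_timestamp = np.arange('2009-01-01T18:41:00', '2009-01-01T18:48:00',
                             60, dtype='datetime64[s]')

# Where each raw timestamp sits in the filled grid.
idx = np.searchsorted(filled_timestamp, raw_timestamp)

# Boolean mask of match positions, then cumulative count -> repeat indices.
idx_mask = np.zeros(filled_timestamp.shape, bool)
idx_mask[idx] = True
indexes = np.cumsum(idx_mask) - 1

filled_sensor = raw_sensor[indexes]
print(filled_sensor)
# [2 2 2 1 1 2 1]
```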
How to replace `0`s with missing values according to a series of numbers in a numpy array?
This is straightforward with pandas. You can apply a linear interpolation using interpolate (its default interpolation method is linear):
import numpy as np
import pandas as pd

a = np.array([15, 25, 0, 45, 0, 0, 75, 85])
s = pd.Series(a)
s.mask(s.eq(0)).interpolate()
0 15.0
1 25.0
2 35.0
3 45.0
4 55.0
5 65.0
6 75.0
7 85.0
dtype: float64
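Since the question asks about NumPy, a pure-NumPy alternative (an assumption, not part of the answer above) is np.interp over the nonzero positions:

```python
import numpy as np

# Treat zeros as missing and linearly interpolate across them.
a = np.array([15, 25, 0, 45, 0, 0, 75, 85], dtype=float)
mask = a == 0
x = np.arange(a.size)

# Interpolate the missing x-positions from the known (nonzero) samples.
a[mask] = np.interp(x[mask], x[~mask], a[~mask])
print(a)
# [15. 25. 35. 45. 55. 65. 75. 85.]
```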