How to Fill in Arbitrary Missing Dates in Pandas Dataframe

Creating and filling empty dates with zeroes

Use:

import pandas as pd

# parse_dates converts the Date column to datetimes while loading
df = pd.read_csv('https://raw.githubusercontent.com/amanaroratc/hello-world/master/x_restock.csv',
                 parse_dates=['Date'])

The first solution adds the complete range of dates, from the minimum to the maximum date in the data, for every product. Build the full index with MultiIndex.from_product and pass it to DataFrame.reindex:

mux = pd.MultiIndex.from_product([df['Product_ID'].unique(),
                                  pd.date_range(df.Date.min(), df.Date.max())],
                                 names=['Product_ID','Dates'])

df1 = df.set_index(['Product_ID','Date']).reindex(mux, fill_value=0).reset_index()
print (df1)
Product_ID Dates restocking_events
0 1004746 2021-11-13 0
1 1004746 2021-11-14 0
2 1004746 2021-11-15 0
3 1004746 2021-11-16 1
4 1004746 2021-11-17 0
... ... ...
3379 976460 2021-11-26 1
3380 976460 2021-11-27 0
3381 976460 2021-11-28 0
3382 976460 2021-11-29 0
3383 976460 2021-11-30 0

[3384 rows x 3 columns]
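
To make the reindexing step easier to check by hand, here is a minimal sketch of the same idea on a tiny frame. The `toy` data and its values are made up for illustration and are not rows from x_restock.csv.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from x_restock.csv)
toy = pd.DataFrame({
    "Product_ID": [1, 1, 2],
    "Date": pd.to_datetime(["2021-11-13", "2021-11-15", "2021-11-14"]),
    "restocking_events": [1, 2, 1],
})

# Pair every product with every day between the overall min and max date
full_index = pd.MultiIndex.from_product(
    [toy["Product_ID"].unique(),
     pd.date_range(toy["Date"].min(), toy["Date"].max(), freq="D")],
    names=["Product_ID", "Date"],
)

# Rows missing from the original data are created with restocking_events = 0
toy_filled = (toy.set_index(["Product_ID", "Date"])
                 .reindex(full_index, fill_value=0)
                 .reset_index())
print(toy_filled)
```

Note that reindex expects each (Product_ID, Date) pair to appear only once; if the data can contain repeated pairs, they would need to be aggregated first.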

Another idea is to build a helper DataFrame with every Product_ID and date combination, then left-merge the original data onto it:

from itertools import product

dfdate = pd.DataFrame(product(df['Product_ID'].unique(),
                              pd.date_range(df.Date.min(), df.Date.max())),
                      columns=['Product_ID','Date'])
print (dfdate)
Product_ID Date
0 1004746 2021-11-13
1 1004746 2021-11-14
2 1004746 2021-11-15
3 1004746 2021-11-16
4 1004746 2021-11-17
... ...
3379 976460 2021-11-26
3380 976460 2021-11-27
3381 976460 2021-11-28
3382 976460 2021-11-29
3383 976460 2021-11-30

[3384 rows x 2 columns]
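
# store the result under a new name so the original df stays available;
# fillna's downcast argument is deprecated in recent pandas, so cast explicitly
df3 = (dfdate.merge(df, how='left')
             .fillna({'restocking_events': 0})
             .astype({'restocking_events': int}))
print (df3)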
Product_ID Date restocking_events
0 1004746 2021-11-13 0
1 1004746 2021-11-14 0
2 1004746 2021-11-15 0
3 1004746 2021-11-16 1
4 1004746 2021-11-17 0
... ... ...
3379 976460 2021-11-26 1
3380 976460 2021-11-27 0
3381 976460 2021-11-28 0
3382 976460 2021-11-29 0
3383 976460 2021-11-30 0

[3384 rows x 3 columns]
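
The merge variant can be sketched on a small frame as well. The `toy` data, the `grid` helper and the `merged` result are hypothetical names used only for this illustration.

```python
from itertools import product

import pandas as pd

# Toy data (hypothetical values, not taken from x_restock.csv)
toy = pd.DataFrame({
    "Product_ID": [1, 1, 2],
    "Date": pd.to_datetime(["2021-11-13", "2021-11-15", "2021-11-14"]),
    "restocking_events": [1, 2, 1],
})

# Helper frame holding every (Product_ID, Date) combination
days = pd.date_range(toy["Date"].min(), toy["Date"].max(), freq="D")
grid = pd.DataFrame(list(product(toy["Product_ID"].unique(), days)),
                    columns=["Product_ID", "Date"])

# A left merge keeps every row of the grid; unmatched rows get NaN,
# which is then replaced by 0 and cast back to an integer dtype
merged = grid.merge(toy, on=["Product_ID", "Date"], how="left")
merged["restocking_events"] = merged["restocking_events"].fillna(0).astype(int)
print(merged)
```

Compared with reindex, the merge route does not require the key pairs to be unique, although repeated pairs in the original data would then produce repeated rows in the result.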

Or, if you only need consecutive dates within each group (from each product's own first date to its own last date), apply asfreq to the original data per group:

df2 = (df.set_index('Date')
         .groupby('Product_ID')['restocking_events']
         .apply(lambda x: x.asfreq('D', fill_value=0))
         .reset_index())
print (df2)
Product_ID Date restocking_events
0 112714 2021-11-15 1
1 112714 2021-11-16 1
2 112714 2021-11-17 0
3 112714 2021-11-18 1
4 112714 2021-11-19 0
... ... ...
2209 3630918 2021-11-25 0
2210 3630918 2021-11-26 0
2211 3630918 2021-11-27 0
2212 3630918 2021-11-28 0
2213 3630918 2021-11-29 1

[2214 rows x 3 columns]
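
The difference from the global-range solutions is easiest to see on a small example. This is a minimal sketch on made-up data; `toy` and `per_group` are hypothetical names.

```python
import pandas as pd

# Toy data (hypothetical): product 1 has a gap on 2021-11-14,
# product 2 covers a different, gap-free period
toy = pd.DataFrame({
    "Product_ID": [1, 1, 2, 2],
    "Date": pd.to_datetime(["2021-11-13", "2021-11-15",
                            "2021-11-20", "2021-11-21"]),
    "restocking_events": [1, 2, 1, 3],
})

# asfreq fills gaps only between each group's own first and last date
per_group = (toy.set_index("Date")
                .groupby("Product_ID")["restocking_events"]
                .apply(lambda s: s.asfreq("D", fill_value=0))
                .reset_index())
print(per_group)
```

Here product 1 gains one row for the missing day, while product 2 is left unchanged, so the result has five rows rather than one row per product for every day from 2021-11-13 to 2021-11-21.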

Pandas filling missing dates and values within group with duplicate index values

Here is one way, reindexing each user to have a date range from your minimum date to your maximum date:

# set up the DataFrame as in the question:
import pandas as pd

x = pd.DataFrame({'user': ['a','a','b','b','a'], 'dt': ['2016-01-01','2016-01-02', '2016-01-05','2016-01-06','2016-01-06'], 'val': [1,33,2,1,2]})
x['dt'] = pd.to_datetime(x['dt'])

# full daily range from the overall minimum to the overall maximum date
all_days = pd.date_range(x['dt'].min(), x['dt'].max(), freq='D')

# reindex each user's rows to the full range, then fill the gaps with 0:
filled_df = (x.set_index('dt')
              .groupby('user')[['val']]
              .apply(lambda d: d.reindex(all_days))
              .reset_index('user')
              .fillna(0))


>>> filled_df
user val
2016-01-01 a 1.0
2016-01-02 a 33.0
2016-01-03 a 0.0
2016-01-04 a 0.0
2016-01-05 a 0.0
2016-01-06 a 2.0
2016-01-01 b 0.0
2016-01-02 b 0.0
2016-01-03 b 0.0
2016-01-04 b 0.0
2016-01-05 b 2.0
2016-01-06 b 1.0
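
The per-user reindex above works because each date appears only once per user. If the same user could have several rows on one date, one possible approach is to aggregate those rows first. The sketch below is an illustration on made-up data, and summing the duplicates is an assumption; another aggregation may suit your data better.

```python
import pandas as pd

# Hypothetical data: user 'a' has two rows on 2016-01-01
x = pd.DataFrame({
    "user": ["a", "a", "a", "b"],
    "dt": pd.to_datetime(["2016-01-01", "2016-01-01",
                          "2016-01-03", "2016-01-02"]),
    "val": [1, 4, 2, 5],
})

# Collapse duplicate (user, dt) rows; summing is an assumption
daily = x.groupby(["user", "dt"])["val"].sum()

# Every user paired with every day in the overall date range
all_days = pd.date_range(x["dt"].min(), x["dt"].max(), freq="D")
full_index = pd.MultiIndex.from_product(
    [daily.index.get_level_values("user").unique(), all_days],
    names=["user", "dt"],
)

filled = daily.reindex(full_index, fill_value=0).reset_index()
print(filled)
```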

Pandas filling missing date values with a constant date

Convert the column to datetimes with errors='coerce', so that values which cannot be parsed become NaT; fillna can then replace them with the constant date:

df['termination_date'] = (pd.to_datetime(df['termination_date'], errors='coerce')
.fillna(pd.to_datetime('2020-07-31')))

# the time part is not displayed because every value is at 00:00:00
print (df)
termination_date
0 2020-06-28
1 2020-07-31
2 2020-07-13
3 2020-08-11
4 2020-07-31
5 2020-08-11

print(df['termination_date'].tolist())
[Timestamp('2020-06-28 00:00:00'), Timestamp('2020-07-31 00:00:00'),
Timestamp('2020-07-13 00:00:00'), Timestamp('2020-08-11 00:00:00'),
Timestamp('2020-07-31 00:00:00'), Timestamp('2020-08-11 00:00:00')]

print (df.termination_date.dtypes)
datetime64[ns]
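
Since the input column is not shown above, here is a self-contained sketch of the same technique. The `raw` frame, its placeholder string and the `n_replaced` counter are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw column mixing valid dates, a placeholder string
# and a missing value
raw = pd.DataFrame(
    {"termination_date": ["2020-06-28", "not available", None, "2020-08-11"]}
)

# Unparseable and missing entries become NaT
parsed = pd.to_datetime(raw["termination_date"], errors="coerce")

# Counting the NaT values first shows how many rows will get the default
n_replaced = int(parsed.isna().sum())

# Replace every NaT with the constant date
raw["termination_date"] = parsed.fillna(pd.Timestamp("2020-07-31"))
print(n_replaced)
print(raw)
```

Checking the count before filling is a cheap way to notice when errors='coerce' is silently discarding more values than expected.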

Pandas fill missing values in dataframe from another dataframe

If you have two DataFrames of the same shape, then:

df[df.isnull()] = d2

Will do the trick.

Only locations where df.isnull() evaluates to True are eligible for assignment; every other cell keeps its current value.

In practice, the DataFrames aren't always the same size / shape, and transforming methods (especially .shift()) are useful.
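
As a hedged sketch of both situations: DataFrame.fillna accepts another DataFrame and fills by matching index and column labels, and DataFrame.combine_first does the same while also taking the union of rows and columns when the shapes differ. The frames `df`, `d2` and `d3` below are made-up examples.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with the same index and columns
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})
d2 = pd.DataFrame({"a": [10.0, 20.0, 30.0], "b": [40.0, 50.0, 60.0]})

# Same shape: missing cells are taken from the matching cells of d2
same_shape = df.fillna(d2)

# Different shape: d3 has an extra row (index 3) and no column "b"
d3 = pd.DataFrame({"a": [100.0, 200.0, 300.0, 400.0]})

# combine_first keeps df's values where present, falls back to d3,
# and returns the union of both indexes and both column sets
combined = df.combine_first(d3)

print(same_shape)
print(combined)
```

In `combined`, column "b" still contains NaN at index 0 and 3 because `d3` has no values to offer there.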

Data coming in is invariably dirty, incomplete, or inconsistent. Par for the course. There's a pretty extensive pandas tutorial and associated cookbook for dealing with these situations.


