Pandas Filling Missing Dates and Values Within Group

Pandas filling missing dates and values within group

Initial Dataframe:

          dt user  val
0 2016-01-01    a    1
1 2016-01-02    a   33
2 2016-01-05    b    2
3 2016-01-06    b    1

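For reproducibility, the frame above can be built like this (a minimal sketch using the values shown):

import pandas as pd

x = pd.DataFrame({
    'dt': ['2016-01-01', '2016-01-02', '2016-01-05', '2016-01-06'],
    'user': ['a', 'a', 'b', 'b'],
    'val': [1, 33, 2, 1]
})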
First, convert the dates to datetime:

x['dt'] = pd.to_datetime(x['dt'])

Then, generate the dates and unique users:

dates = x.set_index('dt').resample('D').asfreq().index

>> DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
                  '2016-01-05', '2016-01-06'],
                 dtype='datetime64[ns]', name='dt', freq='D')

users = x['user'].unique()

>> array(['a', 'b'], dtype=object)

This will allow you to create a MultiIndex:

idx = pd.MultiIndex.from_product((dates, users), names=['dt', 'user'])

>> MultiIndex(levels=[[2016-01-01 00:00:00, 2016-01-02 00:00:00, 2016-01-03 00:00:00, 2016-01-04 00:00:00, 2016-01-05 00:00:00, 2016-01-06 00:00:00], ['a', 'b']],
              labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
              names=['dt', 'user'])

You can use that to reindex your DataFrame:

x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index()
Out:
           dt user  val
0  2016-01-01    a    1
1  2016-01-01    b    0
2  2016-01-02    a   33
3  2016-01-02    b    0
4  2016-01-03    a    0
5  2016-01-03    b    0
6  2016-01-04    a    0
7  2016-01-04    b    0
8  2016-01-05    a    0
9  2016-01-05    b    2
10 2016-01-06    a    0
11 2016-01-06    b    1

which can then be sorted by user:

x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index().sort_values(by='user')
Out:
           dt user  val
0  2016-01-01    a    1
2  2016-01-02    a   33
4  2016-01-03    a    0
6  2016-01-04    a    0
8  2016-01-05    a    0
10 2016-01-06    a    0
1  2016-01-01    b    0
3  2016-01-02    b    0
5  2016-01-03    b    0
7  2016-01-04    b    0
9  2016-01-05    b    2
11 2016-01-06    b    1
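An equivalent route, not from the original answer but a sketch of the same idea: pivot the users into columns with unstack, let asfreq insert the missing days, then stack back into long format:

out = (x.set_index(['dt', 'user'])['val']
        .unstack(fill_value=0)      # users become columns, missing combinations -> 0
        .asfreq('D', fill_value=0)  # insert the missing calendar days as 0
        .stack()                    # back to long (dt, user) format
        .rename('val')
        .reset_index()
        .sort_values(['user', 'dt']))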

Fill missing dates in 2 level of groups in pandas

Use GroupBy.apply with a lambda function calling Series.asfreq:

df['date'] = pd.to_datetime(df['date'])

df = (df.set_index('date')
        .groupby(['country','county'])['sales']
        .apply(lambda x: x.asfreq('d', fill_value=0))
        .reset_index()
        [['date','country','county','sales']])
print(df)
        date country county  sales
0 2021-01-01       a      c      1
1 2021-01-02       a      c      2
2 2021-01-01       a      d      1
3 2021-01-02       a      d      0
4 2021-01-03       a      d     45
5 2021-01-01       b      e      2
6 2021-01-02       b      e    341
7 2021-01-05       b      f     14
8 2021-01-06       b      f      0
9 2021-01-07       b      f     25
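For reference, an input frame along these lines reproduces the output above; it is reconstructed from the result (the zero-sales rows are the ones asfreq inserts), so treat the exact values as an assumption:

df = pd.DataFrame({
    'date': ['2021-01-01', '2021-01-02', '2021-01-01', '2021-01-03',
             '2021-01-01', '2021-01-02', '2021-01-05', '2021-01-07'],
    'country': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'county': ['c', 'c', 'd', 'd', 'e', 'e', 'f', 'f'],
    'sales': [1, 2, 1, 45, 2, 341, 14, 25]
})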

Pandas fill missing dates and values simultaneously for each group

Let's try:

  1. Get the minimum date per group using groupby.min
  2. Add a column called max to the aggregated mins, holding the overall maximum of Dt (Series.max on the whole frame)
  3. Create an individual date_range per group based on those min and max values
  4. Series.explode the ranges into rows, giving a frame that represents the new index
  5. Build a MultiIndex.from_frame from that frame to reindex the DataFrame with
  6. reindex with midx and set fill_value=0
# Get Min Per Group
dates = mydf.groupby('Id')['Dt'].min().to_frame(name='min')
# Get max from Frame
dates['max'] = mydf['Dt'].max()

# Create MultiIndex with separate Date ranges per Group
midx = pd.MultiIndex.from_frame(
    dates.apply(
        lambda x: pd.date_range(x['min'], x['max'], freq='MS'), axis=1
    ).explode().reset_index(name='Dt')[['Dt', 'Id']]
)

# Reindex
mydf = (
    mydf.set_index(['Dt', 'Id'])
        .reindex(midx, fill_value=0)
        .reset_index()
)
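To make steps 1 and 2 concrete, the intermediate dates frame holds each group's own minimum and the global maximum of Dt:

print(dates)
#            min        max
# Id
# A   2020-10-01 2021-06-01
# B   2021-03-01 2021-06-01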

mydf:

           Dt Id  Sales
0  2020-10-01  A     47
1  2020-11-01  A     67
2  2020-12-01  A     46
3  2021-01-01  A      0
4  2021-02-01  A      0
5  2021-03-01  A      0
6  2021-04-01  A      0
7  2021-05-01  A      0
8  2021-06-01  A      0
9  2021-03-01  B      2
10 2021-04-01  B     42
11 2021-05-01  B     20
12 2021-06-01  B      4

DataFrame:

import pandas as pd

mydf = pd.DataFrame({
    'Dt': ['2021-03-01', '2021-04-01', '2021-05-01', '2021-06-01', '2020-10-01',
           '2020-11-01', '2020-12-01'],
    'Id': ['B', 'B', 'B', 'B', 'A', 'A', 'A'],
    'Sales': [2, 42, 20, 4, 47, 67, 46]
})
mydf['Dt'] = pd.to_datetime(mydf['Dt'])

Filling missing dates within group with duplicate date pandas python

>>> df.set_index("day") \
       .groupby("ID")["val"] \
       .resample("D") \
       .first() \
       .fillna(0) \
       .reset_index()

   ID        day    val
0  AA 2020-01-26  100.0
1  AA 2020-01-27    0.0
2  AA 2020-01-28  200.0
3  BB 2020-01-26  100.0
4  BB 2020-01-27  100.0
5  BB 2020-01-28    0.0
6  BB 2020-01-29   40.0

Note: the first() call is not superfluous. Resampler.fillna() only works with the method keyword; you cannot pass a scalar value the way you can with DataFrame.fillna(), so first() is needed to turn the resampled groups back into a regular Series before filling with 0.
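To illustrate that constraint, a small sketch (behaviour and deprecation status of Resampler.fillna vary across pandas versions):

r = df.set_index("day").groupby("ID")["val"].resample("D")

r.fillna("ffill")  # fine: Resampler.fillna only accepts a method name such as 'ffill'
# r.fillna(0)      # raises: a scalar fill value is rejected here, which is why
#                  # .first() is used above to get a regular Series before .fillna(0)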

Add missing dates to pandas dataframe

You could use Series.reindex:

import pandas as pd

idx = pd.date_range('09-01-2013', '09-30-2013')

s = pd.Series({'09-02-2013': 2,
               '09-03-2013': 10,
               '09-06-2013': 5,
               '09-07-2013': 1})
s.index = pd.DatetimeIndex(s.index)

s = s.reindex(idx, fill_value=0)
print(s)

yields

2013-09-01     0
2013-09-02     2
2013-09-03    10
2013-09-04     0
2013-09-05     0
2013-09-06     5
2013-09-07     1
2013-09-08     0
...
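The same pattern works for a DataFrame via DataFrame.reindex; here is a minimal sketch with an assumed column name value:

df = pd.DataFrame({'value': [2, 10, 5, 1]},
                  index=pd.DatetimeIndex(['2013-09-02', '2013-09-03',
                                          '2013-09-06', '2013-09-07']))
df = df.reindex(idx, fill_value=0)  # same idx as above; new rows get value 0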

Filling missing dates on a DataFrame across different groups

Let's try it with pivot + date_range + reindex + stack:

tmp = df.pivot(index='date', columns='customer', values='attended')
tmp.index = pd.to_datetime(tmp.index)
out = (tmp.reindex(pd.date_range(tmp.index[0], tmp.index[-1]))
          .fillna(False)
          .stack()
          .reset_index()
          .rename(columns={0: 'attended'}))

Output:

     level_0 customer  attended
0 2022-01-01     John      True
1 2022-01-01     Mark     False
2 2022-01-02     John      True
3 2022-01-02     Mark     False
4 2022-01-03     John     False
5 2022-01-03     Mark     False
6 2022-01-04     John      True
7 2022-01-04     Mark     False
8 2022-01-05     John     False
9 2022-01-05     Mark      True
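If the autogenerated level_0 column name bothers you, it can simply be renamed afterwards (a small sketch):

out = out.rename(columns={'level_0': 'date'})  # give the reset date index a proper name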

Pandas fill in missing date within each group with information in the previous row

First, make sure the date column is an actual datetime:

x.dt = pd.to_datetime(x.dt)

Then this:

cols = ['dt', 'sub_id']

pd.concat([
    d.asfreq('D').ffill(downcast='infer')
    for _, d in x.drop_duplicates(cols, keep='last')
                 .set_index('dt').groupby('sub_id')
]).reset_index()

          dt  amount  sub_id
0 2016-01-01      10       1
1 2016-01-02      10       1
2 2016-01-03      30       1
3 2016-01-04      40       1
4 2016-01-01      80       2
5 2016-01-02      80       2
6 2016-01-03      80       2
7 2016-01-04      82       2
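One caveat as an aside: on recent pandas releases the downcast keyword of ffill is deprecated. If you hit that warning, a sketch of an equivalent (assuming amount and sub_id are meant to stay integers):

out = pd.concat([
    d.asfreq('D').ffill()
    for _, d in x.drop_duplicates(cols, keep='last')
                 .set_index('dt').groupby('sub_id')
]).reset_index()
out[['amount', 'sub_id']] = out[['amount', 'sub_id']].astype(int)  # undo the float upcast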

Expanding and filling the dataframe for missing dates by each group

I would set the DataFrame index to Date, group by ID, and then reindex each group from its oldest date (replaced with the first day of that month) through its most recent date:

import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 1, 2, 2, 2],
                   "Date": ["29.12.2020", "05.01.2021", "15.02.2021",
                            "11.04.2021", "27.05.2021", "29.05.2021"],
                   "Amount": [6, 5, 7, 9, 8, 7]})
df["Date"] = pd.to_datetime(df["Date"], format="%d.%m.%Y")
df = df.set_index("Date")

new_df = pd.DataFrame()
for id_val, obs_period in df.groupby("ID"):
    date_range = pd.date_range(min(obs_period.index).replace(day=1), max(obs_period.index))
    obs_period = obs_period.reindex(date_range, fill_value=pd.NA)
    obs_period["ID"] = id_val
    if pd.isna(obs_period.at[obs_period.index[0], "Amount"]):
        obs_period.at[obs_period.index[0], "Amount"] = 0  # add 0 at the beginning of the period if undefined
    obs_period = obs_period.ffill()  # fill Amount with the last known value
    new_df = pd.concat([new_df, obs_period])

print(new_df)

By the way, you should specify the date format explicitly when converting df["Date"], as done above with format="%d.%m.%Y".

Output:

            ID  Amount
2020-12-01   1     0.0
2020-12-02   1     0.0
2020-12-03   1     0.0
2020-12-04   1     0.0
2020-12-05   1     0.0
...         ..     ...
2021-05-25   2     9.0
2021-05-26   2     9.0
2021-05-27   2     8.0
2021-05-28   2     8.0
2021-05-29   2     7.0

[136 rows x 2 columns]
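The loop can also be folded into a single groupby.apply; this is not from the original answer, just a sketch of the same logic (ffill followed by fillna(0) reproduces the "start at 0 if undefined" rule, and newer pandas may warn about applying over the grouping column):

new_df = (
    df.groupby("ID", group_keys=False)
      .apply(lambda g: g.reindex(pd.date_range(g.index.min().replace(day=1),
                                               g.index.max()))
               .assign(ID=g["ID"].iloc[0])  # restore the ID on the inserted rows
               .ffill()
               .fillna(0))                  # any leading gap starts at 0, as above
)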

Fill in missing dates for a pandas dataframe with multiple series

Group by Item and Category, then generate a time series from the min to the max date:

result = (
    df.groupby(["Item", "Category"])["Date"]
      .apply(lambda s: pd.date_range(s.min(), s.max()))
      .explode()
      .reset_index()
)
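result now only holds the expanded (Item, Category, Date) skeleton. To attach the measure column, a sketch of a follow-up merge, assuming the value column is called Value (a hypothetical name, adjust to your data):

filled = (result.merge(df, on=["Item", "Category", "Date"], how="left")
                .fillna({"Value": 0}))  # only the assumed Value column gets filled with 0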

