Fill missing dates by group
tidyr::complete() fills missing values; add id and date as the columns (...) to expand for.
library(tidyverse)
complete(dat, id, date)
# A tibble: 16 x 3
id date value
<dbl> <date> <dbl>
1 1.00 2017-01-01 30.0
2 1.00 2017-02-01 30.0
3 1.00 2017-03-01 NA
4 1.00 2017-04-01 25.0
5 2.00 2017-01-01 NA
6 2.00 2017-02-01 25.0
7 2.00 2017-03-01 NA
8 2.00 2017-04-01 NA
9 3.00 2017-01-01 25.0
10 3.00 2017-02-01 25.0
11 3.00 2017-03-01 25.0
12 3.00 2017-04-01 NA
13 4.00 2017-01-01 20.0
14 4.00 2017-02-01 20.0
15 4.00 2017-03-01 NA
16 4.00 2017-04-01 20.0
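For readers working in pandas rather than R, a rough equivalent of complete(dat, id, date) can be sketched as below. The input frame is hypothetical (the source does not show dat), and building the full id/date grid with MultiIndex.from_product followed by reindex is one common way to do this, not the only one.

```python
import pandas as pd

# Hypothetical input: the source does not show `dat`, so these rows are made up.
dat = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "date": pd.to_datetime(["2017-01-01", "2017-02-01", "2017-02-01", "2017-03-01"]),
    "value": [30.0, 30.0, 25.0, 25.0],
})

# Every combination of the observed ids and dates, similar to complete(dat, id, date).
full_index = pd.MultiIndex.from_product(
    [dat["id"].unique(), dat["date"].unique()],
    names=["id", "date"],
)

completed = (
    dat.set_index(["id", "date"])
       .reindex(full_index)   # combinations absent from dat get NaN in value
       .reset_index()
)
print(completed)
```

Note that reindex needs each (id, date) pair to appear at most once in the input.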
Pandas fill missing dates and values simultaneously for each group
Let's try:
- Get the minimum value per group using groupby.min.
- Add a new column to the aggregated mins called max, which stores the maximum value from the frame using Series.max on Dt.
- Create an individual date_range per group based on the min and max values.
- Series.explode into rows to have a DataFrame that represents the new index.
- Create a MultiIndex.from_frame to reindex the DataFrame with.
- reindex with midx and set fill_value=0.
# Get Min Per Group
dates = mydf.groupby('Id')['Dt'].min().to_frame(name='min')
# Get max from Frame
dates['max'] = mydf['Dt'].max()
# Create MultiIndex with separate Date ranges per Group
midx = pd.MultiIndex.from_frame(
    dates.apply(
        lambda x: pd.date_range(x['min'], x['max'], freq='MS'), axis=1
    ).explode().reset_index(name='Dt')[['Dt', 'Id']]
)
# Reindex
mydf = (
    mydf.set_index(['Dt', 'Id'])
    .reindex(midx, fill_value=0)
    .reset_index()
)
mydf:
Dt Id Sales
0 2020-10-01 A 47
1 2020-11-01 A 67
2 2020-12-01 A 46
3 2021-01-01 A 0
4 2021-02-01 A 0
5 2021-03-01 A 0
6 2021-04-01 A 0
7 2021-05-01 A 0
8 2021-06-01 A 0
9 2021-03-01 B 2
10 2021-04-01 B 42
11 2021-05-01 B 20
12 2021-06-01 B 4
DataFrame:
import pandas as pd
mydf = pd.DataFrame({
    'Dt': ['2021-03-01', '2021-04-01', '2021-05-01', '2021-06-01', '2020-10-01',
           '2020-11-01', '2020-12-01'],
    'Id': ['B', 'B', 'B', 'B', 'A', 'A', 'A'],
    'Sales': [2, 42, 20, 4, 47, 67, 46]
})
mydf['Dt'] = pd.to_datetime(mydf['Dt'])
Fill missing dates in 2 level of groups in pandas
Use GroupBy.apply with a lambda function and DataFrame.asfreq:
df['date'] = pd.to_datetime(df['date'])
df = (df.set_index('date')
        .groupby(['country','county'])['sales']
        .apply(lambda x: x.asfreq('D', fill_value=0))
        .reset_index()
        [['date','country','county','sales']])
print (df)
date country county sales
0 2021-01-01 a c 1
1 2021-01-02 a c 2
2 2021-01-01 a d 1
3 2021-01-02 a d 0
4 2021-01-03 a d 45
5 2021-01-01 b e 2
6 2021-01-02 b e 341
7 2021-01-05 b f 14
8 2021-01-06 b f 0
9 2021-01-07 b f 25
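The answer does not show its input, so here is a self-contained sketch of the same asfreq idea. The input rows are an assumption, reconstructed from the output above by dropping the zero-filled rows. It uses an explicit loop over the groups instead of GroupBy.apply, which is a swapped-in variant chosen only because it is easy to follow step by step.

```python
import pandas as pd

# Assumed input, reconstructed from the printed output (rows with sales == 0 removed).
df = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-03",
             "2021-01-01", "2021-01-02", "2021-01-05", "2021-01-07"],
    "country": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "county": ["c", "c", "d", "d", "e", "e", "f", "f"],
    "sales": [1, 2, 1, 45, 2, 341, 14, 25],
})
df["date"] = pd.to_datetime(df["date"])

parts = []
for (country, county), grp in df.groupby(["country", "county"]):
    # asfreq builds a daily range from the group's first to last date
    # and fills the days that were missing with 0.
    filled = grp.set_index("date")["sales"].asfreq("D", fill_value=0).reset_index()
    filled["country"] = country
    filled["county"] = county
    parts.append(filled)

out = pd.concat(parts, ignore_index=True)[["date", "country", "county", "sales"]]
print(out)
```

Each group is only filled between its own first and last date, which is why county f starts on 2021-01-05 rather than 2021-01-01.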
Fill missing dates in group and convert data to weekly
The code works using the latest version of pandas.
Update your pandas version.
(It's good code, by the way!)
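To see which pandas version is actually installed before upgrading, a quick check like this should be enough:

```python
import pandas as pd

# Print the version of pandas that this interpreter imports.
installed_version = pd.__version__
print(installed_version)
```

If it is old, upgrading with your package manager (for example pip install --upgrade pandas) is the usual route.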
Filling missing dates on a DataFrame across different groups
Let's try it with pivot + date_range + reindex + stack:
tmp = df.pivot(index='date', columns='customer', values='attended')
tmp.index = pd.to_datetime(tmp.index)
out = (tmp.reindex(pd.date_range(tmp.index[0], tmp.index[-1]))
          .fillna(False)
          .stack()
          .reset_index()
          .rename(columns={0:'attended'}))
Output:
level_0 customer attended
0 2022-01-01 John True
1 2022-01-01 Mark False
2 2022-01-02 John True
3 2022-01-02 Mark False
4 2022-01-03 John False
5 2022-01-03 Mark False
6 2022-01-04 John True
7 2022-01-04 Mark False
8 2022-01-05 John False
9 2022-01-05 Mark True
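Here is a self-contained variant of the same pivot + date_range + reindex + stack idea that also names the date column instead of leaving it as level_0. The input frame is an assumption (the question's data is not shown): it records only the days a customer attended. The cast to the nullable "boolean" dtype is my own addition so that fillna(False) works on a proper boolean column.

```python
import pandas as pd

# Assumed input: only attended days are recorded.
df = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-05"],
    "customer": ["John", "John", "John", "Mark"],
    "attended": [True, True, True, True],
})
df["date"] = pd.to_datetime(df["date"])

# One column per customer; missing (date, customer) pairs become <NA>.
wide = df.pivot(index="date", columns="customer", values="attended").astype("boolean")
full_days = pd.date_range(wide.index.min(), wide.index.max(), freq="D")

out = (
    wide.reindex(full_days)          # add the calendar days nobody attended
        .fillna(False)               # anything not recorded counts as not attended
        .rename_axis(index="date")   # name the index so reset_index yields "date"
        .stack()
        .rename("attended")
        .reset_index()
)
out["attended"] = out["attended"].astype(bool)
print(out)
```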
Fill in missing dates with NAs by group in R - with NA at end of date range as well
I think you're close with your second attempt. If you want to manually enforce the limits of the expansion in the complete call, you can do it there. It wasn't clear what limits you were after but perhaps the below can get you there. Note that I used two date ranges because it seemed like you wanted to hit two time ranges. But adjust if I misunderstood. Can also be called programmatically if you have those dates stored somewhere. Also, I converted your date column to an actual date format using as.Date() during import.
library(tidyverse)
table <- "ID Date dist.km\n 1 1 2007-10-15 15147\n 2 1 2007-10-16 15156\n 3 1 2007-10-17 15173\n 4 1 2007-10-18 15185\n 5 1 2007-10-19 15194\n 6 1 2007-10-25 15202\n 7 1 2007-10-26 15216\n 8 1 2007-10-27 15240\n 9 1 2007-10-28 15270\n10 1 2007-10-29 15290\n11 2 2008-10-15 15147\n12 2 2008-10-16 15156\n13 2 2008-10-17 15173\n14 2 2008-10-18 15185\n15 2 2008-10-19 15194\n16 2 2008-10-20 15202\n17 2 2008-10-21 15216\n18 2 2008-10-29 15240\n19 2 2008-10-30 15270\n20 2 2008-10-31 15290"
#Create a dataframe with the above table
df <- read.table(text=table, header = TRUE) %>%
mutate(Date = as.Date(Date))
# expand by feeding the limits of the date ranges to cover
newdat2 <- df %>%
  group_by(ID) %>%
  complete(Date = c(
    seq.Date(
      from = as.Date("2007-10-15"),
      to = as.Date("2008-02-15"),
      by = "day"
    ),
    seq.Date(
      from = as.Date("2008-10-15"),
      to = as.Date("2009-02-15"),
      by = "day"
    )
  ))
newdat2
#> # A tibble: 496 x 3
#> # Groups: ID [2]
#> ID Date dist.km
#> <int> <date> <int>
#> 1 1 2007-10-15 15147
#> 2 1 2007-10-16 15156
#> 3 1 2007-10-17 15173
#> 4 1 2007-10-18 15185
#> 5 1 2007-10-19 15194
#> 6 1 2007-10-20 NA
#> 7 1 2007-10-21 NA
#> 8 1 2007-10-22 NA
#> 9 1 2007-10-23 NA
#> 10 1 2007-10-24 NA
#> # ... with 486 more rows
Created on 2021-03-15 by the reprex package (v1.0.0)
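If you need the same thing in pandas, explicit limits can be enforced in a similar way. This is only a sketch: it uses a four-row subset of the table above for brevity, and it builds the ID/Date grid with MultiIndex.from_product, which is my choice here rather than anything taken from the R answer.

```python
import pandas as pd

# A small subset of the table above, used only for illustration.
df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Date": pd.to_datetime(["2007-10-15", "2007-10-29", "2008-10-15", "2008-10-31"]),
    "dist.km": [15147, 15290, 15147, 15290],
})

# The two explicit date windows used in the complete() call.
dates = pd.date_range("2007-10-15", "2008-02-15", freq="D").append(
    pd.date_range("2008-10-15", "2009-02-15", freq="D")
)

# Every ID gets every date in both windows; unobserved days get NaN.
full_index = pd.MultiIndex.from_product(
    [df["ID"].unique(), dates], names=["ID", "Date"]
)
expanded = df.set_index(["ID", "Date"]).reindex(full_index).reset_index()
print(expanded.shape)
```

The row count matches the tibble above: 248 dates for each of the 2 IDs.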
Filling missing dates within group with duplicate date pandas python
>>> df.set_index("day") \
.groupby("ID")["val"] \
.resample("D") \
.first() \
.fillna(0) \
.reset_index()
ID day val
0 AA 2020-01-26 100.0
1 AA 2020-01-27 0.0
2 AA 2020-01-28 200.0
3 BB 2020-01-26 100.0
4 BB 2020-01-27 100.0
5 BB 2020-01-28 0.0
6 BB 2020-01-29 40.0
Note: the function first() is otherwise useless here. It is only there because Resampler.fillna() only works with the method keyword; you cannot pass a value, unlike DataFrame.fillna().
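To try this end to end, here is a runnable sketch. The input frame is an assumption, since the question's data is not shown: BB has two rows for 2020-01-27, and first() keeps the first of them when the days are resampled.

```python
import pandas as pd

# Assumed input: BB has a duplicated day (2020-01-27) with two different values.
df = pd.DataFrame({
    "ID": ["AA", "AA", "BB", "BB", "BB", "BB"],
    "day": pd.to_datetime(["2020-01-26", "2020-01-28", "2020-01-26",
                           "2020-01-27", "2020-01-27", "2020-01-29"]),
    "val": [100, 200, 100, 100, 999, 40],
})

filled = (
    df.set_index("day")
      .groupby("ID")["val"]
      .resample("D")     # one bin per calendar day, per ID
      .first()           # first value in each bin; empty bins become NaN
      .fillna(0)         # plain Series.fillna, so a value can be passed
      .reset_index()
)
print(filled)
```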
Expanding and filling the dataframe for missing dates by each group
I would set the df index to Date, then group by ID and finally reindex depending on the oldest (replacing it with the first day of the month) and most recent dates:
import pandas as pd
df = pd.DataFrame({"ID":[1,1,1,2,2,2],
"Date":["29.12.2020","05.01.2021","15.02.2021","11.04.2021","27.05.2021","29.05.2021"],
"Amount":[6,5,7,9,8,7]})
df["Date"] = pd.to_datetime(df["Date"], format="%d.%m.%Y")
df = df.set_index("Date")
new_df = pd.DataFrame()
for id_val, obs_period in df.groupby("ID"):
    date_range = pd.date_range(min(obs_period.index).replace(day=1), max(obs_period.index))
    obs_period = obs_period.reindex(date_range, fill_value=pd.NA)
    obs_period["ID"] = id_val
    if pd.isna(obs_period.at[obs_period.index[0], "Amount"]):
        obs_period.at[obs_period.index[0], "Amount"] = 0 # adding 0 at the beginning of the period if undefined
    obs_period = obs_period.ffill() # filling Amount with last value
    new_df = pd.concat([new_df, obs_period])
print(new_df)
BTW you should specify your date format while converting df["Date"]
Output:
ID Amount
2020-12-01 1 0.0
2020-12-02 1 0.0
2020-12-03 1 0.0
2020-12-04 1 0.0
2020-12-05 1 0.0
... .. ...
2021-05-25 2 9.0
2021-05-26 2 9.0
2021-05-27 2 8.0
2021-05-28 2 8.0
2021-05-29 2 7.0
[136 rows x 2 columns]
Fill in missing dates across multiple partitions (Snowflake)
WITH fake_data AS (
SELECT * FROM VALUES
('A','USD','2020-01-01'::date,3)
,('A','USD','2020-01-03'::date,4)
,('A','USD','2020-01-04'::date,2)
,('A','CAD','2021-01-04'::date,5)
,('A','CAD','2021-01-06'::date,6)
,('A','CAD','2020-01-07'::date,1)
,('B','USD','2019-01-01'::date,3)
,('B','USD','2019-01-03'::date,4)
,('B','USD','2019-01-04'::date,5)
,('B','CAD','2017-01-04'::date,3)
,('B','CAD','2017-01-06'::date,2)
,('B','CAD','2017-01-07'::date,2)
d(Name,Currency,Date,Amount)
), partition_ranges AS (
SELECT name,
currency,
min(date) as min_date,
max(date) as max_date,
datediff('days', min_date, max_date) as span
FROM fake_data
GROUP BY 1,2
), huge_range as (
SELECT ROW_NUMBER() OVER(order by true)-1 as rn
FROM table(generator(ROWCOUNT => 10000000))
), in_fill as (
SELECT pr.name,
pr.currency,
dateadd('day', hr.rn, pr.min_date) as date
FROM partition_ranges as pr
JOIN huge_range as hr ON pr.span >= hr.rn
)
SELECT
i.name,
i.currency,
i.date,
nvl(d.amount, 0) as amount
from in_fill as i
left join fake_data as d on d.name = i.name and d.currency = i.currency and d.date = i.date
order by 1,2,3;
NAME | CURRENCY | DATE | AMOUNT |
---|---|---|---|
A | CAD | 2020-01-07 | 1 |
A | CAD | 2020-01-08 | 0 |
A | CAD | 2020-01-09 | 0 |
A | CAD | 2020-01-10 | 0 |
A | CAD | 2020-01-11 | 0 |
A | CAD | 2020-01-12 | 0 |
A | CAD | 2020-01-13 | 0 |
A | CAD | 2020-01-14 | 0 |
A | CAD | 2020-01-15 | 0 |
A | CAD | 2020-01-16 | 0 |
A | CAD | 2020-01-17 | 0 |
A | CAD | 2020-01-18 | 0 |
A | CAD | 2020-01-19 | 0 |
A | CAD | 2020-01-20 | 0 |
A | CAD | 2020-01-21 | 0 |
A | CAD | 2020-01-22 | 0 |
A | CAD | 2020-01-23 | 0 |
A | CAD | 2020-01-24 | 0 |
A | CAD | 2020-01-25 | 0 |
A | CAD | 2020-01-26 | 0 |
A | CAD | 2020-01-27 | 0 |
A | CAD | 2020-01-28 | 0 |
A | CAD | 2020-01-29 | 0 |
A | CAD | 2020-01-30 | 0 |
A | CAD | 2020-01-31 | 0 |
A | CAD | 2020-02-01 | 0 |
A | CAD | 2020-02-02 | 0 |
A | CAD | 2020-02-03 | 0 |
A | CAD | 2020-02-04 | 0 |
A | CAD | 2020-02-05 | 0 |
A | CAD | 2020-02-06 | 0 |
A | CAD | 2020-02-07 | 0 |
A | CAD | 2020-02-08 | 0 |
A | CAD | 2020-02-09 | 0 |
A | CAD | 2020-02-10 | 0 |
A | CAD | 2020-02-11 | 0 |
A | CAD | 2020-02-12 | 0 |
A | CAD | 2020-02-13 | 0 |
A | CAD | 2020-02-14 | 0 |
A | CAD | 2020-02-15 | 0 |
A | CAD | 2020-02-16 | 0 |
A | CAD | 2020-02-17 | 0 |
A | CAD | 2020-02-18 | 0 |
A | CAD | 2020-02-19 | 0 |
A | CAD | 2020-02-20 | 0 |
A | CAD | 2020-02-21 | 0 |
A | CAD | 2020-02-22 | 0 |
A | CAD | 2020-02-23 | 0 |
A | CAD | 2020-02-24 | 0 |
A | CAD | 2020-02-25 | 0 |
A | CAD | 2020-02-26 | 0 |
A | CAD | 2020-02-27 | 0 |
A | CAD | 2020-02-28 | 0 |
A | CAD | 2020-02-29 | 0 |
A | CAD | 2020-03-01 | 0 |
A | CAD | 2020-03-02 | 0 |
A | CAD | 2020-03-03 | 0 |
A | CAD | 2020-03-04 | 0 |
A | CAD | 2020-03-05 | 0 |
A | CAD | 2020-03-06 | 0 |
A | CAD | 2020-03-07 | 0 |
A | CAD | 2020-03-08 | 0 |
A | CAD | 2020-03-09 | 0 |
A | CAD | 2020-03-10 | 0 |
A | CAD | 2020-03-11 | 0 |
A | CAD | 2020-03-12 | 0 |
A | CAD | 2020-03-13 | 0 |
A | CAD | 2020-03-14 | 0 |
A | CAD | 2020-03-15 | 0 |
A | CAD | 2020-03-16 | 0 |
A | CAD | 2020-03-17 | 0 |
A | CAD | 2020-03-18 | 0 |
A | CAD | 2020-03-19 | 0 |
A | CAD | 2020-03-20 | 0 |
A | CAD | 2020-03-21 | 0 |
A | CAD | 2020-03-22 | 0 |
A | CAD | 2020-03-23 | 0 |
A | CAD | 2020-03-24 | 0 |
A | CAD | 2020-03-25 | 0 |
A | CAD | 2020-03-26 | 0 |
A | CAD | 2020-03-27 | 0 |
A | CAD | 2020-03-28 | 0 |
A | CAD | 2020-03-29 | 0 |
A | CAD | 2020-03-30 | 0 |
A | CAD | 2020-03-31 | 0 |
A | CAD | 2020-04-01 | 0 |
A | CAD | 2020-04-02 | 0 |
A | CAD | 2020-04-03 | 0 |
A | CAD | 2020-04-04 | 0 |
A | CAD | 2020-04-05 | 0 |
A | CAD | 2020-04-06 | 0 |
A | CAD | 2020-04-07 | 0 |
A | CAD | 2020-04-08 | 0 |
A | CAD | 2020-04-09 | 0 |
A | CAD | 2020-04-10 | 0 |
A | CAD | 2020-04-11 | 0 |
A | CAD | 2020-04-12 | 0 |
A | CAD | 2020-04-13 | 0 |
A | CAD | 2020-04-14 | 0 |
A | CAD | 2020-04-15 | 0 |
A | CAD | 2020-04-16 | 0 |
A | CAD | 2020-04-17 | 0 |
A | CAD | 2020-04-18 | 0 |
A | CAD | 2020-04-19 | 0 |
A | CAD | 2020-04-20 | 0 |
A | CAD | 2020-04-21 | 0 |
A | CAD | 2020-04-22 | 0 |
A | CAD | 2020-04-23 | 0 |
A | CAD | 2020-04-24 | 0 |
A | CAD | 2020-04-25 | 0 |
A | CAD | 2020-04-26 | 0 |
A | CAD | 2020-04-27 | 0 |
A | CAD | 2020-04-28 | 0 |
A | CAD | 2020-04-29 | 0 |
A | CAD | 2020-04-30 | 0 |
A | CAD | 2020-05-01 | 0 |
A | CAD | 2020-05-02 | 0 |
A | CAD | 2020-05-03 | 0 |
A | CAD | 2020-05-04 | 0 |
A | CAD | 2020-05-05 | 0 |
A | CAD | 2020-05-06 | 0 |
A | CAD | 2020-05-07 | 0 |
A | CAD | 2020-05-08 | 0 |
A | CAD | 2020-05-09 | 0 |
A | CAD | 2020-05-10 | 0 |
A | CAD | 2020-05-11 | 0 |
A | CAD | 2020-05-12 | 0 |
A | CAD | 2020-05-13 | 0 |
A | CAD | 2020-05-14 | 0 |
A | CAD | 2020-05-15 | 0 |
A | CAD | 2020-05-16 | 0 |
A | CAD | 2020-05-17 | 0 |
A | CAD | 2020-05-18 | 0 |
A | CAD | 2020-05-19 | 0 |
A | CAD | 2020-05-20 | 0 |
A | CAD | 2020-05-21 | 0 |
A | CAD | 2020-05-22 | 0 |
A | CAD | 2020-05-23 | 0 |
A | CAD | 2020-05-24 | 0 |
A | CAD | 2020-05-25 | 0 |
A | CAD | 2020-05-26 | 0 |
A | CAD | 2020-05-27 | 0 |
A | CAD | 2020-05-28 | 0 |
A | CAD | 2020-05-29 | 0 |
A | CAD | 2020-05-30 | 0 |
A | CAD | 2020-05-31 | 0 |
A | CAD | 2020-06-01 | 0 |
A | CAD | 2020-06-02 | 0 |
A | CAD | 2020-06-03 | 0 |
A | CAD | 2020-06-04 | 0 |
A | CAD | 2020-06-05 | 0 |
A | CAD | 2020-06-06 | 0 |
A | CAD | 2020-06-07 | 0 |
A | CAD | 2020-06-08 | 0 |
A | CAD | 2020-06-09 | 0 |
A | CAD | 2020-06-10 | 0 |
A | CAD | 2020-06-11 | 0 |
A | CAD | 2020-06-12 | 0 |
A | CAD | 2020-06-13 | 0 |
A | CAD | 2020-06-14 | 0 |
A | CAD | 2020-06-15 | 0 |
A | CAD | 2020-06-16 | 0 |
A | CAD | 2020-06-17 | 0 |
A | CAD | 2020-06-18 | 0 |
A | CAD | 2020-06-19 | 0 |
A | CAD | 2020-06-20 | 0 |
A | CAD | 2020-06-21 | 0 |
A | CAD | 2020-06-22 | 0 |
A | CAD | 2020-06-23 | 0 |
A | CAD | 2020-06-24 | 0 |
A | CAD | 2020-06-25 | 0 |
A | CAD | 2020-06-26 | 0 |
A | CAD | 2020-06-27 | 0 |
A | CAD | 2020-06-28 | 0 |
A | CAD | 2020-06-29 | 0 |
A | CAD | 2020-06-30 | 0 |
A | CAD | 2020-07-01 | 0 |
A | CAD | 2020-07-02 | 0 |
A | CAD | 2020-07-03 | 0 |
A | CAD | 2020-07-04 | 0 |
A | CAD | 2020-07-05 | 0 |
A | CAD | 2020-07-06 | 0 |
A | CAD | 2020-07-07 | 0 |
A | CAD | 2020-07-08 | 0 |
A | CAD | 2020-07-09 | 0 |
A | CAD | 2020-07-10 | 0 |
A | CAD | 2020-07-11 | 0 |
A | CAD | 2020-07-12 | 0 |
A | CAD | 2020-07-13 | 0 |
A | CAD | 2020-07-14 | 0 |
A | CAD | 2020-07-15 | 0 |
A | CAD | 2020-07-16 | 0 |
A | CAD | 2020-07-17 | 0 |
A | CAD | 2020-07-18 | 0 |
A | CAD | 2020-07-19 | 0 |
A | CAD | 2020-07-20 | 0 |
A | CAD | 2020-07-21 | 0 |
A | CAD | 2020-07-22 | 0 |
A | CAD | 2020-07-23 | 0 |
A | CAD | 2020-07-24 | 0 |
A | CAD | 2020-07-25 | 0 |
A | CAD | 2020-07-26 | 0 |
A | CAD | 2020-07-27 | 0 |
A | CAD | 2020-07-28 | 0 |
A | CAD | 2020-07-29 | 0 |
A | CAD | 2020-07-30 | 0 |
A | CAD | 2020-07-31 | 0 |
A | CAD | 2020-08-01 | 0 |
A | CAD | 2020-08-02 | 0 |
A | CAD | 2020-08-03 | 0 |
A | CAD | 2020-08-04 | 0 |
A | CAD | 2020-08-05 | 0 |
A | CAD | 2020-08-06 | 0 |
A | CAD | 2020-08-07 | 0 |
A | CAD | 2020-08-08 | 0 |
A | CAD | 2020-08-09 | 0 |
A | CAD | 2020-08-10 | 0 |
A | CAD | 2020-08-11 | 0 |
A | CAD | 2020-08-12 | 0 |
A | CAD | 2020-08-13 | 0 |
A | CAD | 2020-08-14 | 0 |
A | CAD | 2020-08-15 | 0 |
A | CAD | 2020-08-16 | 0 |
A | CAD | 2020-08-17 | 0 |
A | CAD | 2020-08-18 | 0 |
A | CAD | 2020-08-19 | 0 |
A | CAD | 2020-08-20 | 0 |
A | CAD | 2020-08-21 | 0 |
A | CAD | 2020-08-22 | 0 |
A | CAD | 2020-08-23 | 0 |
A | CAD | 2020-08-24 | 0 |
A | CAD | 2020-08-25 | 0 |
A | CAD | 2020-08-26 | 0 |
A | CAD | 2020-08-27 | 0 |
A | CAD | 2020-08-28 | 0 |
A | CAD | 2020-08-29 | 0 |
A | CAD | 2020-08-30 | 0 |
A | CAD | 2020-08-31 | 0 |
A | CAD | 2020-09-01 | 0 |
A | CAD | 2020-09-02 | 0 |
A | CAD | 2020-09-03 | 0 |
A | CAD | 2020-09-04 | 0 |
A | CAD | 2020-09-05 | 0 |
A | CAD | 2020-09-06 | 0 |
A | CAD | 2020-09-07 | 0 |
A | CAD |