Fill Missing Dates by Group

tidyr::complete() fills in the missing combinations: list id and date as the columns (...) to expand over.

library(tidyverse)

complete(dat, id, date)


# A tibble: 16 x 3
      id date       value
   <dbl> <date>     <dbl>
 1  1.00 2017-01-01  30.0
 2  1.00 2017-02-01  30.0
 3  1.00 2017-03-01  NA
 4  1.00 2017-04-01  25.0
 5  2.00 2017-01-01  NA
 6  2.00 2017-02-01  25.0
 7  2.00 2017-03-01  NA
 8  2.00 2017-04-01  NA
 9  3.00 2017-01-01  25.0
10  3.00 2017-02-01  25.0
11  3.00 2017-03-01  25.0
12  3.00 2017-04-01  NA
13  4.00 2017-01-01  20.0
14  4.00 2017-02-01  20.0
15  4.00 2017-03-01  NA
16  4.00 2017-04-01  20.0
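For comparison, the same expansion can be sketched in pandas. This is a hedged equivalent of complete(dat, id, date), using a small illustrative frame (the data here is made up for the example):

```python
import pandas as pd

# Illustrative data: some (id, date) combinations are missing
dat = pd.DataFrame({
    "id": [1, 1, 2],
    "date": pd.to_datetime(["2017-01-01", "2017-02-01", "2017-01-01"]),
    "value": [30.0, 30.0, 25.0],
})

# Build the full id x date grid and reindex onto it, like tidyr::complete()
full = pd.MultiIndex.from_product(
    [dat["id"].unique(), dat["date"].unique()], names=["id", "date"]
)
out = dat.set_index(["id", "date"]).reindex(full).reset_index()
# Missing combinations appear with value = NaN, matching complete()'s default
```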

Pandas fill missing dates and values simultaneously for each group

Let's try:

  1. Get the minimum date per group using groupby.min.
  2. Add a column called max to the aggregated mins, holding the overall maximum of Dt from the frame (via Series.max).
  3. Create an individual date_range per group based on its min and the shared max.
  4. Series.explode the ranges into rows to get a DataFrame that represents the new index.
  5. Create a MultiIndex.from_frame to reindex the DataFrame with.
  6. reindex with midx and set fill_value=0.
# Get min date per group
dates = mydf.groupby('Id')['Dt'].min().to_frame(name='min')
# Get the overall max date from the frame
dates['max'] = mydf['Dt'].max()

# Create a MultiIndex with a separate date range per group
midx = pd.MultiIndex.from_frame(
    dates.apply(
        lambda x: pd.date_range(x['min'], x['max'], freq='MS'), axis=1
    ).explode().reset_index(name='Dt')[['Dt', 'Id']]
)

# Reindex
mydf = (
    mydf.set_index(['Dt', 'Id'])
        .reindex(midx, fill_value=0)
        .reset_index()
)

mydf:

           Dt Id  Sales
0  2020-10-01  A     47
1  2020-11-01  A     67
2  2020-12-01  A     46
3  2021-01-01  A      0
4  2021-02-01  A      0
5  2021-03-01  A      0
6  2021-04-01  A      0
7  2021-05-01  A      0
8  2021-06-01  A      0
9  2021-03-01  B      2
10 2021-04-01  B     42
11 2021-05-01  B     20
12 2021-06-01  B      4

DataFrame:

import pandas as pd

mydf = pd.DataFrame({
    'Dt': ['2021-03-01', '2021-04-01', '2021-05-01', '2021-06-01',
           '2020-10-01', '2020-11-01', '2020-12-01'],
    'Id': ['B', 'B', 'B', 'B', 'A', 'A', 'A'],
    'Sales': [2, 42, 20, 4, 47, 67, 46]
})
mydf['Dt'] = pd.to_datetime(mydf['Dt'])
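Putting the pieces together, here is a consolidated, runnable version of the steps above, using the same data as the answer:

```python
import pandas as pd

mydf = pd.DataFrame({
    'Dt': ['2021-03-01', '2021-04-01', '2021-05-01', '2021-06-01',
           '2020-10-01', '2020-11-01', '2020-12-01'],
    'Id': ['B', 'B', 'B', 'B', 'A', 'A', 'A'],
    'Sales': [2, 42, 20, 4, 47, 67, 46]
})
mydf['Dt'] = pd.to_datetime(mydf['Dt'])

# Per-group min date, shared overall max date
dates = mydf.groupby('Id')['Dt'].min().to_frame(name='min')
dates['max'] = mydf['Dt'].max()

# One month-start date range per group, exploded into a (Dt, Id) index
midx = pd.MultiIndex.from_frame(
    dates.apply(lambda x: pd.date_range(x['min'], x['max'], freq='MS'), axis=1)
         .explode()
         .reset_index(name='Dt')[['Dt', 'Id']]
)

# Reindex onto the full grid, filling missing months with 0
mydf = mydf.set_index(['Dt', 'Id']).reindex(midx, fill_value=0).reset_index()
```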

Fill missing dates in 2 levels of groups in pandas

Use GroupBy.apply with a lambda function calling Series.asfreq:

df['date'] = pd.to_datetime(df['date'])

df = (df.set_index('date')
        .groupby(['country','county'])['sales']
        .apply(lambda x: x.asfreq('d', fill_value=0))
        .reset_index()
        [['date','country','county','sales']])
print(df)
        date country county  sales
0 2021-01-01       a      c      1
1 2021-01-02       a      c      2
2 2021-01-01       a      d      1
3 2021-01-02       a      d      0
4 2021-01-03       a      d     45
5 2021-01-01       b      e      2
6 2021-01-02       b      e    341
7 2021-01-05       b      f     14
8 2021-01-06       b      f     25
9 2021-01-07       b      f     25
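The input frame isn't shown in the answer; the following reconstruction is an assumption, consistent with the printed result (the zero-sales rows above are the ones asfreq filled in, so they are absent here):

```python
import pandas as pd

# Assumed input, reconstructed from the printed result: the rows filled with 0
# (2021-01-02 for a/d and 2021-01-06 for b/f) do not appear in the raw data.
df = pd.DataFrame({
    'date': ['2021-01-01', '2021-01-02', '2021-01-01', '2021-01-03',
             '2021-01-01', '2021-01-02', '2021-01-05', '2021-01-07'],
    'country': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'county': ['c', 'c', 'd', 'd', 'e', 'e', 'f', 'f'],
    'sales': [1, 2, 1, 45, 2, 341, 14, 25],
})
```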

Fill missing dates in group and convert data to weekly

The code works using the latest version of pandas.

Update your pandas version.

(It's good code, by the way!)

Filling missing dates on a DataFrame across different groups

Let's try it with pivot + date_range + reindex + stack:

tmp = df.pivot(index='date', columns='customer', values='attended')
tmp.index = pd.to_datetime(tmp.index)
out = (tmp.reindex(pd.date_range(tmp.index[0], tmp.index[-1]))
          .fillna(False)
          .stack()
          .reset_index()
          .rename(columns={0: 'attended'}))

Output:

     level_0 customer  attended
0 2022-01-01     John      True
1 2022-01-01     Mark     False
2 2022-01-02     John      True
3 2022-01-02     Mark     False
4 2022-01-03     John     False
5 2022-01-03     Mark     False
6 2022-01-04     John      True
7 2022-01-04     Mark     False
8 2022-01-05     John     False
9 2022-01-05     Mark      True
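A minimal input consistent with the output above would look like this (an assumption: here the raw data contains only the attended visits, and every False is produced by the fillna step):

```python
import pandas as pd

# Hypothetical input, reconstructed from the output: one row per attended visit
df = pd.DataFrame({
    'date': ['2022-01-01', '2022-01-02', '2022-01-04', '2022-01-05'],
    'customer': ['John', 'John', 'John', 'Mark'],
    'attended': [True, True, True, True],
})

# pivot + date_range + reindex + stack, as in the answer
tmp = df.pivot(index='date', columns='customer', values='attended')
tmp.index = pd.to_datetime(tmp.index)
out = (tmp.reindex(pd.date_range(tmp.index[0], tmp.index[-1]))
          .fillna(False)
          .stack()
          .reset_index()
          .rename(columns={0: 'attended'}))
```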

Fill in missing dates with NAs by group in R - with NA at end of date range as well

I think you're close with your second attempt. If you want to manually enforce the limits of the expansion, you can do it inside the complete() call. It wasn't clear exactly which limits you were after, but the code below should get you there. Note that I used two date ranges because it seemed like you wanted to cover two time windows; adjust if I misunderstood. The limits can also be supplied programmatically if you have those dates stored somewhere. Also, I converted your date column to an actual date format using as.Date() during import.

library(tidyverse)

table <- "ID Date dist.km\n 1 1 2007-10-15 15147\n 2 1 2007-10-16 15156\n 3 1 2007-10-17 15173\n 4 1 2007-10-18 15185\n 5 1 2007-10-19 15194\n 6 1 2007-10-25 15202\n 7 1 2007-10-26 15216\n 8 1 2007-10-27 15240\n 9 1 2007-10-28 15270\n10 1 2007-10-29 15290\n11 2 2008-10-15 15147\n12 2 2008-10-16 15156\n13 2 2008-10-17 15173\n14 2 2008-10-18 15185\n15 2 2008-10-19 15194\n16 2 2008-10-20 15202\n17 2 2008-10-21 15216\n18 2 2008-10-29 15240\n19 2 2008-10-30 15270\n20 2 2008-10-31 15290"

# Create a dataframe with the above table
df <- read.table(text = table, header = TRUE) %>%
  mutate(Date = as.Date(Date))

# expand by feeding the limits of the date ranges to cover
newdat2 <- df %>%
  group_by(ID) %>%
  complete(Date = c(
    seq.Date(
      from = as.Date("2007-10-15"),
      to = as.Date("2008-02-15"),
      by = "day"
    ),
    seq.Date(
      from = as.Date("2008-10-15"),
      to = as.Date("2009-02-15"),
      by = "day"
    )
  ))

newdat2

#> # A tibble: 496 x 3
#> # Groups:   ID [2]
#>       ID Date       dist.km
#>    <int> <date>       <int>
#>  1     1 2007-10-15   15147
#>  2     1 2007-10-16   15156
#>  3     1 2007-10-17   15173
#>  4     1 2007-10-18   15185
#>  5     1 2007-10-19   15194
#>  6     1 2007-10-20      NA
#>  7     1 2007-10-21      NA
#>  8     1 2007-10-22      NA
#>  9     1 2007-10-23      NA
#> 10     1 2007-10-24      NA
#> # ... with 486 more rows

Created on 2021-03-15 by the reprex package (v1.0.0)

Filling missing dates within group with duplicate date pandas python

>>> (df.set_index("day")
...    .groupby("ID")["val"]
...    .resample("D")
...    .first()
...    .fillna(0)
...    .reset_index())

   ID        day    val
0  AA 2020-01-26  100.0
1  AA 2020-01-27    0.0
2  AA 2020-01-28  200.0
3  BB 2020-01-26  100.0
4  BB 2020-01-27  100.0
5  BB 2020-01-28    0.0
6  BB 2020-01-29   40.0

Note: first() does no real aggregation here; it is only needed because Resampler.fillna() works solely with the method keyword. You cannot pass a fill value to it, unlike DataFrame.fillna().
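For reference, an input consistent with the output above can be reconstructed as follows (an assumption: the second value for the duplicated AA day is invented for illustration, since first() keeps only the first one):

```python
import pandas as pd

# Assumed input, reconstructed from the output. The duplicate (ID, day) pair
# for AA on 2020-01-26 is hypothetical: first() keeps the first value (100).
df = pd.DataFrame({
    "ID": ["AA", "AA", "AA", "BB", "BB", "BB"],
    "day": pd.to_datetime(["2020-01-26", "2020-01-26", "2020-01-28",
                           "2020-01-26", "2020-01-27", "2020-01-29"]),
    "val": [100, 150, 200, 100, 100, 40],
})

# Same pipeline as the answer: per-group daily resample, then fill gaps with 0
out = (df.set_index("day")
         .groupby("ID")["val"]
         .resample("D")
         .first()
         .fillna(0)
         .reset_index())
```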

Expanding and filling the dataframe for missing dates by each group

I would set the df index to Date, then group by ID and finally reindex depending on the oldest (replacing it with the first day of the month) and most recent dates:

import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "Date": ["29.12.2020", "05.01.2021", "15.02.2021",
             "11.04.2021", "27.05.2021", "29.05.2021"],
    "Amount": [6, 5, 7, 9, 8, 7]
})
df["Date"] = pd.to_datetime(df["Date"], format="%d.%m.%Y")
df = df.set_index("Date")

new_df = pd.DataFrame()
for id_val, obs_period in df.groupby("ID"):
    date_range = pd.date_range(min(obs_period.index).replace(day=1), max(obs_period.index))
    obs_period = obs_period.reindex(date_range, fill_value=pd.NA)
    obs_period["ID"] = id_val
    if pd.isna(obs_period.at[obs_period.index[0], "Amount"]):
        obs_period.at[obs_period.index[0], "Amount"] = 0  # start the period at 0 if undefined
    obs_period = obs_period.ffill()  # fill Amount forward with the last known value
    new_df = pd.concat([new_df, obs_period])

print(new_df)

By the way, you should specify the date format when converting df["Date"].

Output:

            ID  Amount
2020-12-01   1     0.0
2020-12-02   1     0.0
2020-12-03   1     0.0
2020-12-04   1     0.0
2020-12-05   1     0.0
...         ..     ...
2021-05-25   2     9.0
2021-05-26   2     9.0
2021-05-27   2     8.0
2021-05-28   2     8.0
2021-05-29   2     7.0

[136 rows x 2 columns]

Fill in missing dates across multiple partitions (Snowflake)

WITH fake_data AS (
    SELECT * FROM VALUES
         ('A','USD','2020-01-01'::date, 3)
        ,('A','USD','2020-01-03'::date, 4)
        ,('A','USD','2020-01-04'::date, 2)
        ,('A','CAD','2021-01-04'::date, 5)
        ,('A','CAD','2021-01-06'::date, 6)
        ,('A','CAD','2020-01-07'::date, 1)
        ,('B','USD','2019-01-01'::date, 3)
        ,('B','USD','2019-01-03'::date, 4)
        ,('B','USD','2019-01-04'::date, 5)
        ,('B','CAD','2017-01-04'::date, 3)
        ,('B','CAD','2017-01-06'::date, 2)
        ,('B','CAD','2017-01-07'::date, 2)
    d (Name, Currency, Date, Amount)
), partition_ranges AS (
    SELECT name,
           currency,
           min(date) AS min_date,
           max(date) AS max_date,
           datediff('days', min_date, max_date) AS span
    FROM fake_data
    GROUP BY 1, 2
), huge_range AS (
    SELECT ROW_NUMBER() OVER (ORDER BY true) - 1 AS rn
    FROM table(generator(ROWCOUNT => 10000000))
), in_fill AS (
    SELECT pr.name,
           pr.currency,
           dateadd('day', hr.rn, pr.min_date) AS date
    FROM partition_ranges AS pr
    JOIN huge_range AS hr ON pr.span >= hr.rn
)
SELECT i.name,
       i.currency,
       i.date,
       nvl(d.amount, 0) AS amount
FROM in_fill AS i
LEFT JOIN fake_data AS d
       ON d.name = i.name AND d.currency = i.currency AND d.date = i.date
ORDER BY 1, 2, 3;

Sample of the query output (truncated here; the gap rows produced by the left join carry AMOUNT = 0):

NAME  CURRENCY  DATE        AMOUNT
A     CAD       2020-01-07       1
A     CAD       2020-01-08       0
A     CAD       2020-01-09       0
A     CAD       2020-01-10       0
...   ...       ...            ...