Efficient way to Fill Time-Series per group
It appears that data.table is much faster than the tidyverse option here. Merely translating the above into data.table (compliments of @Frank) completed the operation in a little under 3 minutes.
library(data.table)
mDT = setDT(d1)[, .(grp = seq(min(grp), max(grp), by = "hour")), by = source]
new_D <- d1[mDT, on = names(mDT)]
new_D <- new_D[, cnt := replace(cnt, is.na(cnt), 0)] #If needed
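For anyone doing the same in pandas, the per-group expand-and-join can be sketched as follows. The toy data is assumed, not from the question; the column names source/grp/cnt mirror the snippet above.

```python
import pandas as pd

# Toy data mirroring d1: hourly timestamps with gaps per source.
d1 = pd.DataFrame({
    "source": ["a", "a", "b"],
    "grp": pd.to_datetime(["2020-01-01 00:00", "2020-01-01 03:00",
                           "2020-01-01 01:00"]),
    "cnt": [1, 2, 5],
})

# For each source, reindex onto the full hourly range between its min
# and max timestamp -- the pandas analogue of the data.table cross join.
def fill_hours(g):
    full = pd.date_range(g["grp"].min(), g["grp"].max(), freq="h")
    return g.set_index("grp").reindex(full).rename_axis("grp")

out = (d1.groupby("source", group_keys=True)[["grp", "cnt"]]
         .apply(fill_hours)
         .reset_index())
out["cnt"] = out["cnt"].fillna(0)
```

The final fillna(0) mirrors the replace(cnt, is.na(cnt), 0) step; drop it if the NAs should be kept.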
Fill in missing rows for dates by group
We can do a join. Create the sequence of 'time' from min to max by '1 month', expand the dataset grouped by 'customer_id' and 'account_id', and join on those columns together with 'time'.
ts1 <- seq(min(my_data$time), max(my_data$time), by = "1 month")
my_data[my_data[, .(time = ts1), .(customer_id, account_id)],
on = .(customer_id, account_id, time)]
# customer_id account_id time tenor variable_x
# 1: 1 11 2017-01-01 1 87
# 2: 1 11 2017-02-01 NA NA
# 3: 1 11 2017-03-01 NA NA
# 4: 1 11 2017-04-01 NA NA
# 5: 1 11 2017-05-01 2 90
# 6: 1 11 2017-06-01 3 100
# 7: 2 55 2017-01-01 NA NA
# 8: 2 55 2017-02-01 1 120
# 9: 2 55 2017-03-01 NA NA
#10: 2 55 2017-04-01 2 130
#11: 2 55 2017-05-01 3 150
#12: 2 55 2017-06-01 4 12
#13: 3 38 2017-01-01 1 13
#14: 3 38 2017-02-01 NA NA
#15: 3 38 2017-03-01 NA NA
#16: 3 38 2017-04-01 2 15
#17: 3 38 2017-05-01 3 14
#18: 3 38 2017-06-01 NA NA
Or using tidyverse
library(tidyverse)
distinct(my_data, customer_id, account_id) %>%
mutate(time = list(ts1)) %>%
unnest(time) %>%
left_join(my_data)
Or with complete from tidyr
my_data %>%
complete(nesting(customer_id, account_id), time = ts1)
Filling missing dates in a grouped time series - a tidyverse-way?
tidyr has some great tools for these sorts of problems. Take a look at complete.
library(dplyr)
library(tidyr)
library(lubridate)
want <- df.missing %>%
ungroup() %>%
complete(nesting(d1, d2), date = seq(min(date), max(date), by = "day"))
want %>% filter(d1 == "A" & d2 == 5)
#> # A tibble: 10 x 5
#> d1 d2 date v1 v2
#> <fctr> <dbl> <date> <dbl> <dbl>
#> 1 A 5 2017-01-01 NA NA
#> 2 A 5 2017-01-02 0.21879954 0.1335497
#> 3 A 5 2017-01-03 0.32977018 0.9802127
#> 4 A 5 2017-01-04 0.23902573 0.1206089
#> 5 A 5 2017-01-05 0.19617465 0.7378315
#> 6 A 5 2017-01-06 0.13373890 0.9493668
#> 7 A 5 2017-01-07 0.48613541 0.3392834
#> 8 A 5 2017-01-08 0.35698708 0.3696965
#> 9 A 5 2017-01-09 0.08498474 0.8354756
#> 10 A 5 2017-01-10 NA NA
Fill gaps in time series pandas dataframe in specific time intervall
In case anyone is interested, I think I found a solution:
# Group the time series by date (using the FillDate column) and then apply
# the reindex_by_date function to generate the index for each date in the
# given time frame and fill missing time stamps
df_process = df_process.groupby('FillDate').apply(reindex_by_date, intervall=update_interval)
# drop the helper index "FillDate"
df_process = df_process.reset_index('FillDate', drop=True).drop(columns='FillDate')
# since we reindexed each date separately, a value can be missing at the
# border of the index (e.g. the first or last entry) and end up as NaN;
# we fix this here by forward filling (taking the last value from the previous day)
df_process_out = df_process.ffill()
# Helper function for closing data gaps with pandas groupby and resample
def reindex_by_date(df, intervall):
    start_range = df.index.date.min().strftime('%Y-%m-%d') + " 09:30:00"
    end_range = df.index.date.max().strftime('%Y-%m-%d') + " 16:00:00"
    dates = pd.date_range(start_range, end_range, freq=intervall)
    return df.reindex(dates).ffill()
Comments are very welcome; if someone has a more efficient solution, I would be very interested. Thanks.
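In case a self-contained illustration helps, here is a minimal runnable sketch of the same reindex-per-trading-day approach. The toy prices, column name, and 30-minute interval are assumptions, not taken from the question.

```python
import pandas as pd

# Toy intraday data: two trading days, each with sparse observations.
idx = pd.to_datetime(["2021-03-01 09:30", "2021-03-01 10:30",
                      "2021-03-02 09:30", "2021-03-02 11:30"])
df = pd.DataFrame({"price": [10.0, 11.0, 12.0, 13.0]}, index=idx)
df["FillDate"] = df.index.date

def reindex_by_date(day_df, interval):
    # Build the full 09:30-16:00 grid for this calendar day and
    # forward-fill observations onto it.
    day = day_df.index.date.min().strftime("%Y-%m-%d")
    grid = pd.date_range(day + " 09:30:00", day + " 16:00:00", freq=interval)
    return day_df.reindex(grid).ffill()

out = (df.groupby("FillDate")[["price"]]
         .apply(reindex_by_date, interval="30min")
         .reset_index("FillDate", drop=True))
```

Each day expands to 14 half-hour rows (09:30 through 16:00), with gaps forward-filled within the day.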
Filling missing dates by group
You can use data.table from R. Assuming that the 'time' column is of 'Date' class,
library(data.table) # v1.9.5+
DT1 <- setDT(df1)[, list(time=seq(min(time), max(time), by ='day')),
by =.(shop_id, item_id)]
setkeyv(df1, names(df1)[1:3])[DT1][is.na(value), value:=0]
# shop_id item_id time value
#1: 150 1 2015-07-10 3
#2: 150 1 2015-07-11 5
#3: 150 1 2015-07-12 0
#4: 150 1 2015-07-13 2
#5: 150 2 2015-07-10 15
#6: 150 2 2015-07-11 0
#7: 150 2 2015-07-12 12
In the devel version, you can also do this without setting the 'key' (see the data.table GitHub page for instructions on installing the devel version).
df1[DT1, on =c('shop_id', 'item_id', 'time')][is.na(value), value:=0]
# shop_id item_id time value
#1: 150 1 2015-07-10 3
#2: 150 1 2015-07-11 5
#3: 150 1 2015-07-12 0
#4: 150 1 2015-07-13 2
#5: 150 2 2015-07-10 15
#6: 150 2 2015-07-11 0
#7: 150 2 2015-07-12 12
Or as @Arun suggested, a more efficient option would be
DT1[, value := 0L][df1, value := i.value, on = c('shop_id', 'item_id', 'time')]
DT1
Efficiently fill NAs by group
This is the code I have used: your code vs. akrun's vs. mine. zoo is sometimes not the fastest, but it is the cleanest. Anyway, you can test it.
UPDATE:
Tested with more data (100,000 rows), Process 03 (subset and merge) wins by far.
Last UPDATE:
Function comparison with rbenchmark:
library(dplyr)
library(tidyr)
library(base)
library(data.table)
library(zoo)
library(rbenchmark)
# data.frame of 10,000 individuals with 10 observations each
data <- data.frame(group = rep(1:10000, each = 10), value = NA)
data$value[seq(5, 5000, 10)] <- rnorm(50) # the first 500 individuals get a value at the fifth observation (rnorm(50), recycled); the others don't
#Process01
P01 <- function (data){
data01 <- data %>%
group_by(group) %>% #by group
fill(value) %>% #default direction down
fill(value, .direction = "up") #also fill NAs upwards
return(data01)
}
#Process02
P02 <- function (data){
data02 <- setDT(data)[, value := na.locf(na.locf(value, na.rm = FALSE),
fromLast = TRUE), group]
return(data02)
}
#Process03
P03 <- function (data){
dataU <- subset(unique(data), !is.na(value)) # keep only groups that have a value
dataM <- merge(data, dataU, by = "group", all=T) #merge tables
data03 <- data.frame(group=dataM$group, value = dataM$value.y) #idem shape of data
return(data03)
}
benchmark("P01_dplyr" = {data01 <- P01(data)},
"P02_zoo" = {data02 <- P02(data)},
"P03_data.table" = {data03 <- P03(data)},
replications = 10,
columns = c("test", "replications", "elapsed")
)
Results with 10,000 groups, 10 replications, on an i5-7400:
test replications elapsed
1 P01_dplyr 10 257.78
2 P02_zoo 10 10.35
3 P03_data.table 10 0.09
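For comparison, the same fill-down-then-up within groups can be sketched in pandas (toy data; the column names are assumptions mirroring the R example):

```python
import pandas as pd
import numpy as np

# Three groups of three observations; only some groups have a value.
df = pd.DataFrame({
    "group": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "value": [np.nan, 7.0, np.nan,
              np.nan, np.nan, np.nan,
              5.0, np.nan, np.nan],
})

# Fill NAs downwards then upwards within each group, like
# fill(value) + fill(value, .direction = "up") or the nested na.locf above.
df["value"] = df.groupby("group")["value"].transform(lambda s: s.ffill().bfill())
```

Groups with no observed value at all (group 2 here) stay NaN, just as in the R versions.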
How can I fill in missing rows for monthly time series data?
You can use tidyr::complete for this, but you have the additional wrinkle that the dates fall on different days in each month. First, then, you need to make a column to count months on, which we can do with the day(x) <- setter from lubridate.
Here's an example using the provided data, truncated to 2014 for conciseness. Note that you should use seq.Date to specify the full range of dates that you want included in the month column, and you will also have NAs in the date column (you can replace them with the first of the month if you want).
library(tidyverse)
library(lubridate)
tbl <- structure(list(date = structure(c(16076, 16129, 16143, 16184, 16276, 16303, 16343, 16374, 16400, 16417), tzone = "UTC", tclass = "Date", class = "Date"), AverageTemp = c(18.3387983091787, 18.41875, 19.7, 20.2508333333333, 17.6387426900585, 18.1770897832817, 17.5400297619048, 17.7547246376812, 17.246412037037, 17.0846153846154)), row.names = c(NA, -10L), class = "data.frame")
tbl %>%
mutate(month = date %>% `day<-`(1)) %>%
complete(month = seq.Date(min(month), max(month), by = "month"))
#> # A tibble: 12 x 3
#> month date AverageTemp
#> <date> <date> <dbl>
#> 1 2014-01-01 2014-01-06 18.3
#> 2 2014-02-01 2014-02-28 18.4
#> 3 2014-03-01 2014-03-14 19.7
#> 4 2014-04-01 2014-04-24 20.3
#> 5 2014-05-01 NA NA
#> 6 2014-06-01 NA NA
#> 7 2014-07-01 2014-07-25 17.6
#> 8 2014-08-01 2014-08-21 18.2
#> 9 2014-09-01 2014-09-30 17.5
#> 10 2014-10-01 2014-10-31 17.8
#> 11 2014-11-01 2014-11-26 17.2
#> 12 2014-12-01 2014-12-13 17.1
As an alternative, you can instead just get the year and month components and use complete
on the combination of the two:
tbl %>%
mutate(year = year(date), month = month(date)) %>%
complete(year = min(year):max(year), month = 1:12)
#> # A tibble: 12 x 4
#> year month date AverageTemp
#> <dbl> <dbl> <date> <dbl>
#> 1 2014 1 2014-01-06 18.3
#> 2 2014 2 2014-02-28 18.4
#> 3 2014 3 2014-03-14 19.7
#> 4 2014 4 2014-04-24 20.3
#> 5 2014 5 NA NA
#> 6 2014 6 NA NA
#> 7 2014 7 2014-07-25 17.6
#> 8 2014 8 2014-08-21 18.2
#> 9 2014 9 2014-09-30 17.5
#> 10 2014 10 2014-10-31 17.8
#> 11 2014 11 2014-11-26 17.2
#> 12 2014 12 2014-12-13 17.1
Created on 2019-03-20 by the reprex package (v0.2.1)
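The same month-completion idea can be sketched in pandas with a monthly PeriodIndex (toy values approximating the table above; names are assumptions):

```python
import pandas as pd

# Observations on arbitrary days within some months; Mar, May, Jun missing.
tbl = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-06", "2014-02-28",
                            "2014-04-24", "2014-07-25"]),
    "AverageTemp": [18.3, 18.4, 20.3, 17.6],
})

# Count on months, then reindex over the full monthly range --
# analogous to complete(month = seq.Date(..., by = "month")).
tbl["month"] = tbl["date"].dt.to_period("M")
full = pd.period_range(tbl["month"].min(), tbl["month"].max(), freq="M")
out = tbl.set_index("month").reindex(full).rename_axis("month").reset_index()
```

Missing months get NaT in the date column and NaN in the value column, matching the complete() output.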
Pandas filling missing dates and values within group
Initial DataFrame:
dt user val
0 2016-01-01 a 1
1 2016-01-02 a 33
2 2016-01-05 b 2
3 2016-01-06 b 1
First, convert the dates to datetime:
x['dt'] = pd.to_datetime(x['dt'])
Then, generate the dates and unique users:
dates = x.set_index('dt').resample('D').asfreq().index
>> DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
'2016-01-05', '2016-01-06'],
dtype='datetime64[ns]', name='dt', freq='D')
users = x['user'].unique()
>> array(['a', 'b'], dtype=object)
This will allow you to create a MultiIndex:
idx = pd.MultiIndex.from_product((dates, users), names=['dt', 'user'])
>> MultiIndex(levels=[[2016-01-01 00:00:00, 2016-01-02 00:00:00, 2016-01-03 00:00:00, 2016-01-04 00:00:00, 2016-01-05 00:00:00, 2016-01-06 00:00:00], ['a', 'b']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
names=['dt', 'user'])
You can use that to reindex your DataFrame:
x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index()
Out:
dt user val
0 2016-01-01 a 1
1 2016-01-01 b 0
2 2016-01-02 a 33
3 2016-01-02 b 0
4 2016-01-03 a 0
5 2016-01-03 b 0
6 2016-01-04 a 0
7 2016-01-04 b 0
8 2016-01-05 a 0
9 2016-01-05 b 2
10 2016-01-06 a 0
11 2016-01-06 b 1
which can then be sorted by user:
x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index().sort_values(by='user')
Out:
dt user val
0 2016-01-01 a 1
2 2016-01-02 a 33
4 2016-01-03 a 0
6 2016-01-04 a 0
8 2016-01-05 a 0
10 2016-01-06 a 0
1 2016-01-01 b 0
3 2016-01-02 b 0
5 2016-01-03 b 0
7 2016-01-04 b 0
9 2016-01-05 b 2
11 2016-01-06 b 1