Compute rolling sum by id variables, with missing timepoints
I'm almost embarrassed to post this. I'm usually pretty good at these, but there's got to be a better way.
This first uses zoo's as.yearmon to get the dates in terms of just month and year, then reshapes the data to get one column for each id/class combination, fills in zeros before, after, and for missing months, uses zoo's rollsumr to get the rolling sum, and finally pulls out just the desired months and merges back with the original data frame.
library(reshape2)
library(zoo)
df$yearmon <- as.yearmon(df$t)
dfa <- dcast(id + class ~ yearmon, data=df, value.var="count")
ida <- dfa[,1:2]
dfa <- t(as.matrix(dfa[,-c(1:2)]))
months <- with(df, seq(min(yearmon)-3/12, max(yearmon)+3/12, by=1/12))
dfb <- array(dim=c(length(months), ncol(dfa)),
dimnames=list(paste(months), colnames(dfa)))
dfb[rownames(dfa),] <- dfa
dfb[is.na(dfb)] <- 0
dfb <- rollsumr(dfb,4, fill=0)
rownames(dfb) <- paste(months)
dfb <- dfb[rownames(dfa),]
dfc <- cbind(ida, t(dfb))
dfc <- melt(dfc, id.vars=c("class", "id"))
names(dfc)[3:4] <- c("yearmon", "desired2")
dfc$yearmon <- as.yearmon(dfc$yearmon)
out <- merge(df,dfc)
> out
id class yearmon t count desired desired2
1 1 A Feb 2010 2010-02-15 2 3 3
2 1 A Jan 2010 2010-01-15 1 1 1
3 1 B Apr 2010 2010-04-15 3 3 3
4 1 B Sep 2010 2010-09-15 4 4 4
5 2 A Jan 2010 2010-01-15 5 5 5
6 2 B Aug 2010 2010-08-15 7 13 13
7 2 B Jun 2010 2010-06-15 6 6 6
8 2 B Sep 2010 2010-09-15 8 21 21
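The zero-filling idea can also be sketched in base R for a single id/class group. This is only an illustration, not the answer's method: the variable names are hypothetical, the three months and counts are the id 2 / class B rows from the output above, and the trailing 4-month sum is computed from cumulative sums instead of zoo.

```r
# Toy data for one id/class group (values taken from id 2, class B above)
obs_month <- as.Date(c("2010-06-01", "2010-08-01", "2010-09-01"))
obs_count <- c(6, 7, 8)

# Complete monthly grid spanning the observed range
grid <- seq(min(obs_month), max(obs_month), by = "month")

# Place the counts on the grid; months with no observation stay 0
full <- numeric(length(grid))
full[match(obs_month, grid)] <- obs_count

# Trailing k-month sum as a difference of cumulative sums: cs[i] - cs[i - k]
k <- 4
cs <- cumsum(full)
lagged <- c(rep(0, k), cs)[seq_along(cs)]
roll <- cs - lagged

# Keep only the months that were actually observed
result <- data.frame(month = obs_month, count = obs_count,
                     roll4 = roll[match(obs_month, grid)])
result
```

For these three rows the roll4 column should agree with the desired2 values shown above (6, 13, 21).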
r - compute rolling sum by id within specific time frame
Not sure this will scale well to the size of your data.
First, create a running row index to handle duplicate dates, so that each row's rolling sum counts only rows that come before it, and create a column with the date one year earlier (I would argue that 365 days is better, but it seems the OP wants 366).
Then perform a non-equi self-join that keeps only earlier rows (rn < rn) of the same id whose dates fall within the preceding year.
df[, c("rn", "oneYrAgo") := .(.I, date - 366)]
df[df,
.(roll_sum=.N, flag_sum=sum(flag, na.rm=TRUE)),
on=.(date >= oneYrAgo, rn < rn, id, date <= date),
by=.EACHI][,
-seq_len(2L)]
result:
id date roll_sum flag_sum
1: 1 2012-03-26 0 0
2: 1 2012-04-26 1 1
3: 1 2015-06-27 0 0
4: 1 2016-06-07 1 0
5: 2 2012-06-22 0 0
6: 2 2012-06-22 1 0
7: 2 2012-10-12 2 0
8: 2 2012-10-22 3 1
9: 2 2012-11-05 4 2
10: 2 2012-11-19 5 3
11: 2 2012-11-26 6 4
12: 2 2013-12-12 0 0
13: 2 2013-12-13 1 1
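To make the join conditions easier to read, here is a rough base R sketch of the same window logic written as an explicit loop. It is quadratic in the number of rows and is meant only to show which rows are counted. The dates are those of id 1 in the result above; the flag values are made up, since the original data are not shown.

```r
# Toy data: dates copied from id 1 above; flag values are hypothetical
df <- data.frame(
  id   = c(1, 1, 1, 1),
  date = as.Date(c("2012-03-26", "2012-04-26", "2015-06-27", "2016-06-07")),
  flag = c(1, 0, 0, 1)
)

window_days <- 366
rn <- seq_len(nrow(df))

# For each row, select earlier rows of the same id within the window
stats <- t(sapply(rn, function(i) {
  in_window <- df$id == df$id[i] &
    rn < i &                                  # only rows that come before row i
    df$date >= df$date[i] - window_days &     # not older than the window start
    df$date <= df$date[i]
  c(roll_sum = sum(in_window),
    flag_sum = sum(df$flag[in_window], na.rm = TRUE))
}))

out <- cbind(df[c("id", "date")], stats)
out
```

The roll_sum column reproduces 0, 1, 0, 1 for id 1; flag_sum depends on the made-up flags.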
Elegant, Fast Way to Perform Rolling Sum By List of Variables
I think I stumbled upon an answer that is fairly efficient.
set.seed(1)
Trans_Dates <- as.Date(c(31,33,65,96,150,187,210,212,240,273,293,320,
32,34,66,97,151,188,211,213,241,274,294,321,
33,35,67,98,152,189,212,214,242,275,295,322),origin="2010-01-01")
Cust_ID <- c(rep(1,12),rep(2,12),rep(3,12))
Target <- rpois(36,3)
##Make simulated data into a data.table
library(data.table)
data <- data.table(Cust_ID,Trans_Dates,Target)
##Assign each customer a number that ranks them
data[,Cust_No:=.GRP,by=c("Cust_ID")]
##Create "list" of comparison dates
Ref <- data[,list(Compare_Value=list(I(Target)),Compare_Date=list(I(Trans_Dates))), by=c("Cust_No")]
##Compare two lists and see if the compare date is within N days
data$Roll.Val <- mapply(FUN = function(RD, NUM) {
d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
sum((d <= 0 & d >= -180)*Ref$Compare_Value[[NUM]])
}, RD = data$Trans_Dates,NUM=data$Cust_No)
##Print out data
data <- data[,list(Cust_ID,Trans_Dates,Target,Roll.Val)][order(Cust_ID,Trans_Dates)]
data
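The core of the mapply call is the indicator trick inside it: a logical vector multiplied by the values keeps only the in-window entries. A small worked example for a single row, with made-up dates and values (ref_dates, ref_values and row_date are hypothetical names):

```r
# Hypothetical reference dates and values for one customer
ref_dates  <- as.Date(c("2010-02-01", "2010-03-07", "2010-10-01"))
ref_values <- c(3, 2, 5)

# The transaction date we are computing the rolling value for
row_date <- as.Date("2010-08-15")

# Day offsets relative to the row: negative means earlier
d <- as.numeric(ref_dates - row_date)

# TRUE for dates on or before row_date and at most 180 days earlier
in_window <- d <= 0 & d >= -180

# Logical * numeric zeroes out everything outside the window
roll_val <- sum(in_window * ref_values)
roll_val
```

Here only 2010-03-07 (161 days earlier) falls in the window, so the result is 2.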
Work out rolling sums for variables with non-consecutive days in a dataframe in R
The problem is that the arguments to findInterval should be numeric and ordered.
To address this convert the dates to Date class and then numeric so that d below is the number of days since the Epoch. Now we can use it with findInterval as shown. If the data were already sorted the arrange line could be omitted.
library(dplyr, exclude = c("filter", "lag"))
library(zoo)
DF %>%
arrange(Trainer, Date) %>%
group_by(Trainer) %>%
mutate(d = as.numeric(as.Date(Date)),
Wins14 = rollapplyr(Wins, 1:n() - findInterval(d - 14, d), sum)) %>%
ungroup %>%
select(-d)
giving:
# A tibble: 101 x 4
Trainer Wins Date Wins14
<chr> <dbl> <dttm> <dbl>
1 Appleby, Charlie 1 2017-10-15 00:00:00 1
2 Appleby, Charlie 1 2017-10-18 00:00:00 2
3 Appleby, Charlie 0 2017-10-18 00:00:00 2
4 Appleby, Charlie 0 2017-10-23 00:00:00 2
5 Appleby, Charlie 1 2017-10-25 00:00:00 3
6 Appleby, Charlie 0 2017-10-25 00:00:00 3
7 Appleby, Charlie 0 2017-10-25 00:00:00 3
8 Appleby, Charlie 1 2017-10-25 00:00:00 4
9 Appleby, Charlie 0 2017-10-27 00:00:00 4
10 Appleby, Charlie 0 2017-10-27 00:00:00 4
# ... with 91 more rows
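How the window widths come out of findInterval can be seen on a tiny made-up vector in base R (no zoo needed for this part). The numbers below are hypothetical day counts and win flags, already sorted:

```r
# Hypothetical sorted day numbers and wins for one trainer
d    <- c(1, 4, 4, 9, 20)
wins <- c(1, 1, 0, 0, 1)

# findInterval(d - 14, d) counts, for each row, how many rows have a day
# number <= d - 14, i.e. rows that are too old to be in the window.
# Subtracting that from the row position gives the window width.
widths <- seq_along(d) - findInterval(d - 14, d)

# Sum the last widths[i] values ending at row i
wins14 <- sapply(seq_along(d), function(i) sum(wins[(i - widths[i] + 1):i]))

widths
wins14
```

Note that a row exactly 14 days old is counted as too old, so the window covers days strictly after d - 14 up to and including d.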
Using zoo's rollsum within data.table on timestamped transactions
Here's one way. First, add a column with the last date you care about, and an index to keep track of things:
d[, old.date := purch_dt - 365]
d[, idx := .I]
Then do a rolling join (assumes version 1.9.5+) on that date, and extract the range of indices for each match (i.e. by .EACHI):
res = d[d, .(idx = i.idx, seq = idx:i.idx), by = .EACHI, roll = -Inf,
on = c(cust_id = 'cust_id', purch_dt = 'old.date')]
Finally, subset the original data.table with the appropriate range of indices, and compute the sums:
d[, purch_365 := d[res$seq, sum(purch_amt), by = res$idx]$V1][]
# cust_id purch_dt purch_amt idx old.date purch_365
# 1: 123 1980-01-08 24.63 1 1979-01-08 24.63
# 2: 123 1980-09-03 96.27 2 1979-09-04 120.90
# 3: 123 1981-02-24 60.54 3 1980-02-25 156.81
# 4: 123 1981-04-01 51.99 4 1980-04-01 208.80
# 5: 123 1981-04-02 40.85 5 1980-04-02 249.65
# ---
#196: 456 2006-01-29 24.72 196 2005-01-29 187.81
#197: 456 2006-02-15 27.78 197 2005-02-15 215.59
#198: 456 2006-09-22 11.00 198 2005-09-22 74.94
#199: 456 2006-09-27 12.67 199 2005-09-27 87.61
#200: 456 2006-11-18 99.13 200 2005-11-18 186.74
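As a cross-check on the numbers, the same 365-day window can be sketched in base R for a single customer with findInterval and cumulative sums. This is a different technique from the rolling join above, shown only for comparison; the five rows are the first five rows of the printed output.

```r
# First five rows for cust_id 123, copied from the output above
purch_dt  <- as.Date(c("1980-01-08", "1980-09-03", "1981-02-24",
                       "1981-04-01", "1981-04-02"))
purch_amt <- c(24.63, 96.27, 60.54, 51.99, 40.85)

d <- as.numeric(purch_dt)

# Number of purchases strictly before (purch_dt - 365), i.e. outside the window
n_before <- findInterval(d - 365, d, left.open = TRUE)

# Window sum = cumulative sum up to the row minus cumulative sum of excluded rows
cs <- c(0, cumsum(purch_amt))
purch_365 <- cs[seq_along(d) + 1] - cs[n_before + 1]
purch_365
```

This assumes the dates are sorted and belong to one customer; for several customers it would have to be applied per group.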
Cross Prod Rolling Values
If the aim is to calculate a rolling sum of 3 values such that there are implicitly 0s added to ensure that the output has 5 elements even though the input has 3 then try these:
1) rollapply Multiply x and y and insert 0's depending on whether right, center or left alignment is used and depending on whether partial= is used. align = "center" is the default of rollapply and align = "right" is the default of rollapplyr.
library(zoo)
rollapply(c(0, x*y, 0), 3, sum, partial = TRUE)
## [1] 2 5 9 7 4
rollapplyr(c(x*y, 0, 0), 3, sum, partial = TRUE)
## [1] 2 5 9 7 4
rollapplyr(c(0, 0, x*y), 3, sum, align = "left", partial = TRUE)
## [1] 2 5 9 7 4
rollapply(c(0, 0, x*y, 0, 0), 3, sum)
## [1] 2 5 9 7 4
rollsum(c(0, 0, x*y, 0, 0), 3) # this solution has the lowest character count
## [1] 2 5 9 7 4
2) Base R A base solution can be written using embed:
rowSums(embed(c(0, 0, x*y, 0, 0), 3))
## [1] 2 5 9 7 4
2a) or take the cumulative sum and subtract the cumulative sum 3 back:
cumsum(c(x*y,0,0)) - cumsum(c(0, 0, 0, (x*y)[-3]))
## [1] 2 5 9 7 4
2b) If the idea is that a circular calculation is to be done then:
c(filter(c(0, x*y, 0), c(1,1,1), circular = TRUE))
## [1] 2 5 9 7 4
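The snippets above do not show x and y. The following sketch assumes hypothetical inputs whose product is 2, 3, 4 (consistent with the printed results) and shows what embed builds before the row sums are taken:

```r
# Hypothetical inputs; any x and y with x * y equal to c(2, 3, 4) would do
x <- c(1, 3, 2)
y <- c(2, 1, 2)

padded <- c(0, 0, x * y, 0, 0)

# Each row of embed(padded, 3) holds three consecutive values of padded
# (most recent first), so the row sums are the rolling sums of width 3
m <- embed(padded, 3)
roll3 <- rowSums(m)

m
roll3
```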
Vectorize loops when calculating rolling means with variable amounts of data
I used tidyverse and runner and did it in a single pipe. Explanation of the syntax:
- First, I collected seven days (as per the logic provided) of DQL and MAX values into a list using runner.
- Before doing that, I converted DQL into an ordered factor, which is used in the last step.
- Second, I used purrr::map to modify each list according to the given conditions:
  - Windows with fewer than six values are not counted.
  - If there is exactly one E in 7 values, it is not counted.
- Finally, I unnested the list using unnest_wider.
library(tidyverse)
library(runner)
daily_data %>% mutate(dyDQL = factor(dyDQL, levels = c("A", "B", "E"), ordered = T),
d = runner(x = data.frame(a = dyMax, b= dyDQL),
k = "7 days",
lag = 0,
idx = date,
f = function(x) list(x))) %>%
mutate(d = map(d, ~ .x %>% group_by(b) %>%
mutate(c = n()) %>%
ungroup() %>%
filter(!n() < 6) %>%
filter(!(b == 'E' & c == 1 & n() == 7)) %>%
summarise(ma.max7 = ifelse(n() == 0, NA, mean(a)), ma.max7.DQL = max(b))
)
) %>%
unnest_wider(d)
# A tibble: 15 x 7
Monitoring.Location.ID date dyMax dyMin dyDQL ma.max7 ma.max7.DQL
<chr> <date> <dbl> <dbl> <ord> <dbl> <ord>
1 River 1 2018-07-01 24.2 22.5 A NA NA
2 River 1 2018-07-02 24.6 20.4 A NA NA
3 River 1 2018-07-03 24.8 20.1 A NA NA
4 River 1 2018-07-04 25.3 20.7 A NA NA
5 River 1 2018-07-05 25.5 20.9 A NA NA
6 River 1 2018-07-06 25.0 21.0 A 24.9 A
7 River 1 2018-07-07 24.8 20.7 A 24.9 A
8 River 1 2018-07-08 23.4 20.8 B 24.8 B
9 River 1 2018-07-09 22.7 18.9 E 24.8 B
10 River 1 2018-07-10 22.3 18.2 A 24.4 B
11 River 1 2018-07-12 22.9 19.0 A 23.5 E
12 River 1 2018-07-13 24.0 19.5 A 23.4 E
13 River 1 2018-07-14 24.5 19.9 A 23.3 E
14 River 1 2018-07-15 25.1 20.6 A 23.6 E
15 River 1 2018-07-19 24.9 20.7 A NA NA
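The date-based window and the "at least six values" rule can be illustrated in base R on the first eight rows of the table. This simplified sketch deliberately leaves out the DQL handling (the single-E rule and the worst-grade column), so it is not a replacement for the pipeline above:

```r
# First eight rows of the table above (dates and dyMax only)
date  <- as.Date("2018-07-01") + 0:7
dyMax <- c(24.2, 24.6, 24.8, 25.3, 25.5, 25.0, 24.8, 23.4)

ma7 <- sapply(seq_along(date), function(i) {
  # Rows within the 7 calendar days ending at date[i]
  in_win <- date > date[i] - 7 & date <= date[i]
  # Require at least six observations, otherwise return NA
  if (sum(in_win) < 6) NA_real_ else mean(dyMax[in_win])
})

round(ma7, 1)
```

For these rows the rounded values agree with the ma.max7 column (NA for the first five rows, then 24.9, 24.9, 24.8); later rows differ once the E rule comes into play.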
Count timepoints in R
First of all, you are putting vectors in the variables df1 and df2, not data frames. This is also what causes the error in the colnames(df2) <- "timepoints" assignment. A vector does not have columns, so it throws an error when you give colnames a vector as input.
I don't know exactly what you mean, but I guess you want something like a histogram. To do this, you could first convert the timepoints to numeric values.
library(tidyverse)
First you load the tidyverse package bundle. It contains a lot of useful packages that make it easier to manipulate and visualize data. If you haven't installed the tidyverse packages yet, just run install.packages("tidyverse").
timepoints <- c("01:00","04:15","07:15","10:30","12:45","16:30","17:15","21:30",
"22:50","02:20","07:00","10:20","11:50","13:50","15:00","19:00",
"20:30","22:00","02:50","07:00","10:15","11:30","15:00","18:45",
"21:30","01:40","05:20","08:30","11:30","12:30","13:50","15:40",
"17:40","18:40","20:15","01:30","06:00","09:30","11:00","13:00",
"15:45","18:00","19:00","20:00","21:20","21:50")
Df <- data.frame(timepoints = timepoints)
First I take the timepoints vector and put it in a data frame with the column name "timepoints".
Df_new <- Df %>%
# We separate the column into hours and minutes
separate(timepoints, into = c("hh", "mm"), ":") %>%
# We convert the time into fractions of an hour
mutate(hh = as.numeric(hh),
mm = as.numeric(mm) / 60) %>%
# lastly we reassemble the times
mutate(timepoints = hh + mm)
Then I convert the timepoint values to numeric values. First I split the hour and minute values using separate. Then I change the minute values to decimal hours using the mutate function. Lastly, I use the mutate function again to recombine the hour and decimal-hour values. This bit of code has mostly been copied from @Seb's comment answer in How to create histogram in R with CSV time data?, so credits to her/him.
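If you prefer not to load the tidyverse for this step, the same conversion can be sketched in base R with strsplit. The three strings below are a small subset of the timepoints vector, used only for illustration:

```r
# A few of the timepoints from above
timepoints <- c("01:00", "04:15", "22:50")

# Split each "hh:mm" string at the colon
parts <- strsplit(timepoints, ":", fixed = TRUE)

# Hours plus minutes expressed as a fraction of an hour
hours <- sapply(parts, function(p) as.numeric(p[1]) + as.numeric(p[2]) / 60)
hours
```

The result is a plain numeric vector that can be passed to hist in the same way as Df_new$timepoints.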
hist(x = Df_new$timepoints, breaks = 12)
Next, you can plot the timepoints in a histogram like this. The breaks argument allows you to change the number of bars in the histogram.
ggplot(Df_new, aes(x = timepoints)) + geom_histogram(bins = 12)
Alternatively, you can use the ggplot function from the tidyverse package ggplot2. This looks nicer, but might be difficult to understand for people not familiar with ggplot2.
If you want to read up on tidyverse commands and packages, you could go to https://r4ds.had.co.nz/index.html, which offers nice and understandable explanations.
Hope this was useful