Compute Rolling Sum by Id Variables, with Missing Timepoints

I'm almost embarrassed to post this. I'm usually pretty good at these, but there's got to be a better way.

This first uses zoo's as.yearmon to reduce the dates to just month and year, then reshapes the data to get one column for each id/class combination. It then fills in zeros before, after, and for missing months, uses zoo to get the rolling sum, pulls out just the desired months, and merges the result back with the original data frame.

library(reshape2)
library(zoo)
df$yearmon <- as.yearmon(df$t)
dfa <- dcast(id + class ~ yearmon, data=df, value.var="count")
ida <- dfa[,1:2]
dfa <- t(as.matrix(dfa[,-c(1:2)]))
months <- with(df, seq(min(yearmon)-3/12, max(yearmon)+3/12, by=1/12))
dfb <- array(dim=c(length(months), ncol(dfa)), 
             dimnames=list(paste(months), colnames(dfa)))
dfb[rownames(dfa),] <- dfa
dfb[is.na(dfb)] <- 0
dfb <- rollsumr(dfb,4, fill=0)
rownames(dfb) <- paste(months)
dfb <- dfb[rownames(dfa),]
dfc <- cbind(ida, t(dfb))
dfc <- melt(dfc, id.vars=c("class", "id"))
names(dfc)[3:4] <- c("yearmon", "desired2")
dfc$yearmon <- as.yearmon(dfc$yearmon)
out <- merge(df,dfc)

> out
  id class  yearmon          t count desired desired2
1  1     A Feb 2010 2010-02-15     2       3        3
2  1     A Jan 2010 2010-01-15     1       1        1
3  1     B Apr 2010 2010-04-15     3       3        3
4  1     B Sep 2010 2010-09-15     4       4        4
5  2     A Jan 2010 2010-01-15     5       5        5
6  2     B Aug 2010 2010-08-15     7      13       13
7  2     B Jun 2010 2010-06-15     6       6        6
8  2     B Sep 2010 2010-09-15     8      21       21
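
As a cross-check on the desired2 column, the same four-month window can be computed directly in base R. This is only a sketch: the input rows are reconstructed from the output above, and the names toy, m and check are made up for illustration. It compares every row with every other row, so it is meant for verifying small cases rather than replacing the zoo approach.

```r
# Input rows reconstructed from the output shown above (an assumption)
toy <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 2, 2),
  class = c("A", "A", "B", "B", "A", "B", "B", "B"),
  t     = as.Date(c("2010-01-15", "2010-02-15", "2010-04-15", "2010-09-15",
                    "2010-01-15", "2010-06-15", "2010-08-15", "2010-09-15")),
  count = 1:8
)

# Month index: consecutive calendar months differ by exactly 1
m <- as.integer(format(toy$t, "%Y")) * 12L + as.integer(format(toy$t, "%m"))

# For each row, sum the counts of the same id/class in the current month
# and the three months before it (window width 4, right-aligned)
toy$check <- sapply(seq_len(nrow(toy)), function(i) {
  same_group <- toy$id == toy$id[i] & toy$class == toy$class[i]
  in_window  <- m <= m[i] & m >= m[i] - 3L
  sum(toy$count[same_group & in_window])
})

toy$check
# 1 3 3 4 5 6 13 21
```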

r - compute rolling sum by id within specific time frame

Not sure this will be helpful with the dimension of your data.

First, create a running row index so that rows sharing a date are handled by row order (each row only counts the rows that come before it). Also create the date one year earlier (I would argue that 365 is better, but it seems the OP wants 366).

Then, perform a non-equi self-join that keeps only earlier rows (rn < rn) with the same id whose dates fall within the past year.

library(data.table)

# assumes df is a data.table with columns id, date and flag
df[, c("rn", "oneYrAgo") := .(.I, date - 366)]

df[df, 
    .(roll_sum=.N, flag_sum=sum(flag, na.rm=TRUE)), 
    on=.(date >= oneYrAgo, rn < rn, id, date <= date), 
    by=.EACHI][, 
        -seq_len(2L)]

result:

    id       date roll_sum flag_sum
 1:  1 2012-03-26        0        0
 2:  1 2012-04-26        1        1
 3:  1 2015-06-27        0        0
 4:  1 2016-06-07        1        0
 5:  2 2012-06-22        0        0
 6:  2 2012-06-22        1        0
 7:  2 2012-10-12        2        0
 8:  2 2012-10-22        3        1
 9:  2 2012-11-05        4        2
10:  2 2012-11-19        5        3
11:  2 2012-11-26        6        4
12:  2 2013-12-12        0        0
13:  2 2013-12-13        1        1
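
To make the join conditions easier to follow, here is a rough base R sketch of the same window logic on a tiny data set. The ids and dates are taken from the first rows of the result above, but the flag values are hypothetical, and the helper earlier_in_window is a made-up name. It is a slow row-by-row check, not a replacement for the non-equi join.

```r
toy <- data.frame(
  id   = c(1, 1, 1, 2, 2),
  date = as.Date(c("2012-03-26", "2012-04-26", "2015-06-27",
                   "2012-06-22", "2012-06-22")),
  flag = c(1, 0, 1, 0, 1)   # hypothetical values
)
rn <- seq_len(nrow(toy))

# Rows that come before row i, share its id, and are dated within 366 days
earlier_in_window <- function(i) {
  toy$id == toy$id[i] & rn < i &
    toy$date >= toy$date[i] - 366 & toy$date <= toy$date[i]
}

toy$roll_sum <- sapply(rn, function(i) sum(earlier_in_window(i)))
toy$flag_sum <- sapply(rn, function(i) sum(toy$flag[earlier_in_window(i)]))

toy$roll_sum
# 0 1 0 0 1
toy$flag_sum
# 0 1 0 0 0
```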

Elegant, Fast Way to Perform Rolling Sum By List of Variables

I think I stumbled upon an answer that is fairly efficient.

set.seed(1)
Trans_Dates <- as.Date(c(31,33,65,96,150,187,210,212,240,273,293,320,
                         32,34,66,97,151,188,211,213,241,274,294,321,
                         33,35,67,98,152,189,212,214,242,275,295,322),origin="2010-01-01")
Cust_ID <- c(rep(1,12),rep(2,12),rep(3,12))
Target <- rpois(36,3)

##Make simulated data into a data.table
library(data.table)
data <- data.table(Cust_ID,Trans_Dates,Target)

##Assign each customer a number that ranks them
data[,Cust_No:=.GRP,by=c("Cust_ID")]

##Create "list" of comparison dates
Ref <- data[,list(Compare_Value=list(I(Target)),Compare_Date=list(I(Trans_Dates))), by=c("Cust_No")]

##Compare the two lists and see if the compare date is within N days
data$Roll.Val <- mapply(FUN = function(RD, NUM) {
  d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
  sum((d <= 0 & d >= -180)*Ref$Compare_Value[[NUM]])
}, RD = data$Trans_Dates,NUM=data$Cust_No)

##Print out data
data <- data[,list(Cust_ID,Trans_Dates,Target,Roll.Val)][order(Cust_ID,Trans_Dates)]
data
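
The core of the mapply call is the 0/1 mask (d <= 0 & d >= -180). A minimal sketch of just that step, on three hypothetical dates and values for a single customer (dates, vals and roll are illustrative names, not part of the code above):

```r
dates <- as.Date(c("2010-01-01", "2010-03-01", "2010-09-01"))  # hypothetical
vals  <- c(10, 20, 30)                                         # hypothetical

roll <- sapply(seq_along(dates), function(i) {
  # Day offsets of every transaction relative to transaction i
  d <- as.numeric(dates - dates[i])
  # Keep transactions from the 180 days up to and including date i
  sum((d <= 0 & d >= -180) * vals)
})

roll
# 10 30 30
```

The last value is 30 rather than 60 because both earlier transactions are more than 180 days before 2010-09-01.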

Work out rolling sums for variables with non-consecutive days in a dataframe in R

The problem is that the arguments to findInterval should be numeric and ordered.

To address this convert the dates to Date class and then numeric so that d below is the number of days since the Epoch. Now we can use it with findInterval as shown. If the data were already sorted the arrange line could be omitted.

library(dplyr, exclude = c("filter", "lag"))
library(zoo)

DF %>%
  arrange(Trainer, Date) %>%
  group_by(Trainer) %>%
  mutate(d = as.numeric(as.Date(Date)), 
         Wins14 = rollapplyr(Wins, 1:n() - findInterval(d - 14, d), sum)) %>%
  ungroup %>%
  select(-d)

giving:

# A tibble: 101 x 4
   Trainer           Wins Date                Wins14
   <chr>            <dbl> <dttm>               <dbl>
 1 Appleby, Charlie     1 2017-10-15 00:00:00      1
 2 Appleby, Charlie     1 2017-10-18 00:00:00      2
 3 Appleby, Charlie     0 2017-10-18 00:00:00      2
 4 Appleby, Charlie     0 2017-10-23 00:00:00      2
 5 Appleby, Charlie     1 2017-10-25 00:00:00      3
 6 Appleby, Charlie     0 2017-10-25 00:00:00      3
 7 Appleby, Charlie     0 2017-10-25 00:00:00      3
 8 Appleby, Charlie     1 2017-10-25 00:00:00      4
 9 Appleby, Charlie     0 2017-10-27 00:00:00      4
10 Appleby, Charlie     0 2017-10-27 00:00:00      4
# ... with 91 more rows
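
To see what 1:n() - findInterval(d - 14, d) produces, here is a small base R sketch for a single group. The day numbers and wins are hypothetical, and the rolling sum is written with sapply instead of rollapplyr so that it runs without zoo.

```r
d    <- c(1, 4, 4, 9, 20)   # hypothetical day numbers, already sorted
wins <- c(1, 0, 1, 1, 1)    # hypothetical wins

# findInterval(d - 14, d) counts the rows dated d - 14 or earlier,
# so the difference is the number of rows in the window ending at each row
width <- seq_along(d) - findInterval(d - 14, d)
width
# 1 2 3 4 2

# Sum the last width[i] values ending at row i
wins14 <- sapply(seq_along(d), function(i) sum(wins[(i - width[i] + 1):i]))
wins14
# 1 1 2 3 2
```

Note that a row dated exactly 14 days earlier falls outside the window with this construction.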

Using zoo's rollsum within data.table on timestamped transactions

Here's one way. First, add a column with the last date you care about, and an index to keep track of things:

d[, old.date := purch_dt - 365]
d[, idx := .I]

Then do a rolling join (assumes version 1.9.5+) on that date, and extract the range of indices for each match (i.e. by .EACHI):

res = d[d, .(idx = i.idx, seq = idx:i.idx), by = .EACHI, roll = -Inf,
        on = c(cust_id = 'cust_id', purch_dt = 'old.date')]

Finally, subset original data.table with appropriate range, and compute the sums:

d[, purch_365 := d[res$seq, sum(purch_amt), by = res$idx]$V1][]
#     cust_id   purch_dt purch_amt idx   old.date purch_365
#  1:     123 1980-01-08     24.63   1 1979-01-08     24.63
#  2:     123 1980-09-03     96.27   2 1979-09-04    120.90
#  3:     123 1981-02-24     60.54   3 1980-02-25    156.81
#  4:     123 1981-04-01     51.99   4 1980-04-01    208.80
#  5:     123 1981-04-02     40.85   5 1980-04-02    249.65
# ---                                                      
#196:     456 2006-01-29     24.72 196 2005-01-29    187.81
#197:     456 2006-02-15     27.78 197 2005-02-15    215.59
#198:     456 2006-09-22     11.00 198 2005-09-22     74.94
#199:     456 2006-09-27     12.67 199 2005-09-27     87.61
#200:     456 2006-11-18     99.13 200 2005-11-18    186.74
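
As a sanity check, the first five purch_365 values can be reproduced with a plain base R loop. This sketch uses only the five rows printed above for customer 123; d5 and check_365 are made-up names, and the window is taken to run from old.date up to and including the purchase date.

```r
d5 <- data.frame(
  cust_id   = 123,
  purch_dt  = as.Date(c("1980-01-08", "1980-09-03", "1981-02-24",
                        "1981-04-01", "1981-04-02")),
  purch_amt = c(24.63, 96.27, 60.54, 51.99, 40.85)
)

# For each purchase, add up that customer's purchases in the preceding 365 days
d5$check_365 <- sapply(seq_len(nrow(d5)), function(i) {
  keep <- d5$cust_id == d5$cust_id[i] &
    d5$purch_dt >= d5$purch_dt[i] - 365 &
    d5$purch_dt <= d5$purch_dt[i]
  sum(d5$purch_amt[keep])
})

d5$check_365
# 24.63 120.90 156.81 208.80 249.65
```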

Cross Prod Rolling Values

If the aim is to calculate a rolling sum of 3 values such that there are implicitly 0s added to ensure that the output has 5 elements even though the input has 3 then try these:

1) rollapply Multiply x and y and insert 0's depending on whether right, center or left alignment is used and depending on whether partial= is used. align="center" is the default of rollapply and align = "right" is the default of rollapplyr.

library(zoo)

rollapply(c(0, x*y, 0), 3, sum, partial = TRUE)
## [1] 2 5 9 7 4

rollapplyr(c(x*y, 0, 0), 3, sum, partial = TRUE)
## [1] 2 5 9 7 4

rollapply(c(0, 0, x*y), 3, sum, align = "left", partial = TRUE)
## [1] 2 5 9 7 4

rollapply(c(0, 0, x*y, 0, 0), 3, sum)
## [1] 2 5 9 7 4

rollsum(c(0, 0, x*y, 0, 0), 3) #  this solution has the lowest character count
## [1] 2 5 9 7 4

2) Base R A base solution can be written using embed:

rowSums(embed(c(0, 0, x*y, 0, 0), 3))
## [1] 2 5 9 7 4

2a) or take the cumulative sum and subtract the cumulative sum 3 back:

cumsum(c(x*y,0,0)) - cumsum(c(0, 0, 0, (x*y)[-3]))
## [1] 2 5 9 7 4

2b) If the idea is that a circular calculation is to be done then:

c(filter(c(0, x*y, 0), c(1,1,1), circular = TRUE))
## [1] 2 5 9 7 4
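
The embed idea generalises to any window width k by padding with k - 1 zeros on each side. The sketch below wraps that in a helper; padded_rollsum is a made-up name, and x and y are hypothetical inputs chosen so that x*y is 2, 3, 4, which is what the outputs above imply.

```r
x <- c(1, 3, 2)   # hypothetical
y <- c(2, 1, 2)   # hypothetical, so that x * y is 2 3 4

# Rolling sum of width k with implicit zeros before and after v
padded_rollsum <- function(v, k) {
  pad <- rep(0, k - 1)
  rowSums(embed(c(pad, v, pad), k))
}

padded_rollsum(x * y, 3)
# 2 5 9 7 4
```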

Vectorize loops when calculating rolling means with variable amounts of data

I used tidyverse and runner and did it in a single pipe. An explanation of the syntax:

First, I collected the seven days (as per the logic provided) of DQL and MAX values into a list using runner.
Before doing that, I converted DQL into an ordered factor, which is used in the final summarise step.
Second, I used purrr::map to modify each list element according to the given conditions:
- Windows with fewer than six values are not counted.
- If there is exactly one E among 7 values, that E is not counted.
Finally, I unnested the list using unnest_wider.

library(tidyverse)
library(runner)
daily_data %>% mutate(dyDQL = factor(dyDQL, levels = c("A", "B", "E"), ordered = TRUE),
                      d = runner(x = data.frame(a = dyMax, b= dyDQL),
                                   k = "7 days",
                                   lag = 0,
                                   idx = date,
                                   f = function(x) list(x))) %>%
  mutate(d = map(d, ~ .x %>% group_by(b) %>%
                     mutate(c = n()) %>%
                     ungroup() %>%
                     filter(!n() < 6) %>%
                     filter(!(b == 'E' & c == 1 & n() == 7)) %>%
                     summarise(ma.max7 = ifelse(n() == 0, NA, mean(a)), ma.max7.DQL = max(b))
                   )
         ) %>%
  unnest_wider(d)

# A tibble: 15 x 7
   Monitoring.Location.ID date       dyMax dyMin dyDQL ma.max7 ma.max7.DQL
   <chr>                  <date>     <dbl> <dbl> <ord>   <dbl> <ord>      
 1 River 1                2018-07-01  24.2  22.5 A        NA   NA         
 2 River 1                2018-07-02  24.6  20.4 A        NA   NA         
 3 River 1                2018-07-03  24.8  20.1 A        NA   NA         
 4 River 1                2018-07-04  25.3  20.7 A        NA   NA         
 5 River 1                2018-07-05  25.5  20.9 A        NA   NA         
 6 River 1                2018-07-06  25.0  21.0 A        24.9 A          
 7 River 1                2018-07-07  24.8  20.7 A        24.9 A          
 8 River 1                2018-07-08  23.4  20.8 B        24.8 B          
 9 River 1                2018-07-09  22.7  18.9 E        24.8 B          
10 River 1                2018-07-10  22.3  18.2 A        24.4 B          
11 River 1                2018-07-12  22.9  19.0 A        23.5 E          
12 River 1                2018-07-13  24.0  19.5 A        23.4 E          
13 River 1                2018-07-14  24.5  19.9 A        23.3 E          
14 River 1                2018-07-15  25.1  20.6 A        23.6 E          
15 River 1                2018-07-19  24.9  20.7 A        NA   NA
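
The two counting rules can be cross-checked without runner. The following base R sketch rebuilds the date, dyMax and dyDQL columns from the table above and recomputes only the ma.max7 column (not ma.max7.DQL). It assumes the "7 days" window covers the current date and the six calendar days before it, which is consistent with the printed results.

```r
dates <- as.Date("2018-07-01") + c(0:9, 11:14, 18)
dyMax <- c(24.2, 24.6, 24.8, 25.3, 25.5, 25.0, 24.8, 23.4,
           22.7, 22.3, 22.9, 24.0, 24.5, 25.1, 24.9)
dyDQL <- c(rep("A", 7), "B", "E", rep("A", 6))

ma_max7 <- sapply(seq_along(dates), function(i) {
  in_win <- dates > dates[i] - 7 & dates <= dates[i]
  vals <- dyMax[in_win]
  dql  <- dyDQL[in_win]
  # Rule 1: windows with fewer than six values give NA
  if (length(vals) < 6) return(NA_real_)
  # Rule 2: a single E in a full seven-value window is dropped
  if (length(vals) == 7 && sum(dql == "E") == 1) vals <- vals[dql != "E"]
  mean(vals)
})

ma_max7
```

Printed to three significant digits, these values should agree with the ma.max7 column above.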

Count timepoints in R

First of all, you are putting vectors in the variables df1 and df2, not dataframes. This is also what is causing the error in the colnames(df2) <- "timepoints" call. A vector does not have columns, so R throws an error when you try to set colnames on it.
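
A minimal sketch of that difference, using a short hypothetical vector v in place of your data (outcome and df_ok are illustrative names):

```r
v <- c("01:00", "04:15", "07:15")   # a plain character vector

# Setting colnames on a vector fails because it has no dimensions
outcome <- tryCatch({
  colnames(v) <- "timepoints"
  "no error"
}, error = function(e) "error")
outcome
# "error"

# Wrapping the vector in a data frame gives it a named column
df_ok <- data.frame(timepoints = v)
colnames(df_ok)
# "timepoints"
```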

I don't know exactly what you mean, but I guess you want something like a histogram. To do this, you could first convert the timepoints to numeric values.

library(tidyverse)

First you load the tidyverse package bundle. It contains a lot of useful packages that make it easier to manipulate and visualize data. If you haven't installed the tidyverse packages yet, just run install.packages("tidyverse").

timepoints <- c("01:00","04:15","07:15","10:30","12:45","16:30","17:15","21:30",
                "22:50","02:20","07:00","10:20","11:50","13:50","15:00","19:00",
                "20:30","22:00","02:50","07:00","10:15","11:30","15:00","18:45",
                "21:30","01:40","05:20","08:30","11:30","12:30","13:50","15:40",
                "17:40","18:40","20:15","01:30","06:00","09:30","11:00","13:00",
                "15:45","18:00","19:00","20:00","21:20","21:50")

Df <- data.frame(timepoints = timepoints)

First I take the timepoints vector and put it in a data frame with the column name "timepoints".

Df_new <- Df %>% 
  # We separate the column into hours and minutes
  separate(timepoints, into = c("hh", "mm"), sep = ":") %>% 
  # We convert the time into fractions of an hour
  mutate(hh = as.numeric(hh),
         mm = as.numeric(mm) / 60) %>% 
  # lastly we reassemble the times
  mutate(timepoints = hh + mm)

Then I convert the timepoint values to numeric values. First I split the hour and minute values using separate. Then I change the minute values to decimal hours using the mutate function. Lastly, I use the mutate function again to recombine the hour and decimal-hour values. This bit of code has mostly been copied from @Seb's answer in How to create histogram in R with CSV time data?, so credits to her/him.
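
If you would rather not depend on tidyr for this step, the same conversion can be sketched in base R. The helper name to_decimal_hours is made up for illustration, and it assumes every value has the form "hh:mm".

```r
# Convert "hh:mm" strings to decimal hours
to_decimal_hours <- function(hhmm) {
  parts <- strsplit(hhmm, ":", fixed = TRUE)
  sapply(parts, function(p) as.numeric(p[1]) + as.numeric(p[2]) / 60)
}

hours <- to_decimal_hours(c("01:00", "04:15", "10:30"))
hours
# 1.00 4.25 10.50

# Number of timepoints falling in each clock hour
table(floor(hours))
```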

hist(x = Df_new$timepoints, breaks = 12)

Next, you can plot the timepoints in a histogram like this. The breaks argument allows you to change the number of bars in the histogram.

ggplot(Df_new, aes(x = timepoints)) + geom_histogram(bins = 12)

Alternatively, you can use the ggplot function from the tidyverse package ggplot2. This looks nicer, but might be difficult to understand for people not familiar with ggplot2.

If you want to read up on tidyverse commands and packages, you could go to https://r4ds.had.co.nz/index.html, which offers nice and understandable explanations.

Hope this was useful
