Is there a function like rollapply for data.frame
Try this:
> library(zoo)
> DF <- data.frame(a = 1:10, b = 21:30, c = letters[1:10])
> replace(DF, 1:2, rollapply(DF[1:2], 3, sum, fill = NA))
a b c
1 NA NA a
2 6 66 b
3 9 69 c
4 12 72 d
5 15 75 e
6 18 78 f
7 21 81 g
8 24 84 h
9 27 87 i
10 NA NA j
Regarding the example that was added to the question after some discussion, such functionality can be layered on top of rollapply
by applying it to the row indexes:
> lapply(as.data.frame(t(rollapply(1:nrow(test), 2, c))), function(ix)test[ix, ])
$V1
Name Points
1 bob 4
2 jane 9
$V2
Name Points
2 jane 9
3 joe 1
and here it is wrapped up a bit better:
rollapply.data.frame <- function(data, ..., fill = NULL, FUN,
simplify = function(x) do.call(rbind, x)) {
fill0 <- if (!is.null(fill)) NA
result <- lapply(
as.data.frame(t(rollapply(1:nrow(data), ..., fill = fill0, FUN = c))),
function(ix) {if (all(is.na(ix))) fill else FUN(data[ix, ])}
)
simplify(result)
}
> rollapply(test, 2, FUN = identity, simplify = identity)
$V1
Name Points
a bob 4
b jane 9
$V2
Name Points
b jane 9
c joe 1
> rollapply(test, 2, FUN = identity, fill = NA, simplify = identity)
$V1
Name Points
a bob 4
b jane 9
$V2
Name Points
b jane 9
c joe 1
$V3
[1] NA
R: Grouped rolling window linear regression with rollapply and ddply
1) rollapply
works on data frames too so it is not necessary to convert df
to zoo.
2) lm
uses na.action
, not na.rm
, and its default is na.omit
so we can just drop this argument.
3) rollapplyr
is a more concise way to write rollapply(..., align = "right")
.
Assuming that rolled
otherwise does what you want and incorporating these changes into rolled
, the ddply
statement in the question should work or we could use by
from the base of R which we show below:
rolled <- function(df) {
rollapplyr(df, width = 6, function(m) {
coef(lm(formula = y ~ x, data = as.data.frame(m)))[2]
}, by = 3, by.column = FALSE
)
}
do.call("rbind", by(dat, dat[c("w", "z")], rolled))
Simple working example of ddply() in parallel on Windows
Here's a simple working example:
> df <- data.frame(val=1:10, ind=c(rep(2, 5), rep(3, 5)))
> library(doSNOW)
> registerDoSNOW(makeCluster(2, type = "SOCK"))
> system.time(print(ddply(df, .(ind), function(x) { Sys.sleep(2); sum(x) }, .parallel=FALSE)))
ind V1
1 2 25
2 3 55
user system elapsed
0.00 0.00 4.01
> system.time(print(ddply(df, .(ind), function(x) { Sys.sleep(2); sum(x) }, .parallel=TRUE)))
ind V1
1 2 25
2 3 55
user system elapsed
0.02 0.00 2.02
Apply / Plyr like functionality without without performance impact in Rollapply or apply.rolling
This answer is an expanded version of my earlier comments which I have now deleted.
zoo's rollapply
already supports plain vectors and matrices. Furthermore its rollapply
routine extracts the plain vectors or matrices from a zoo object before operating on it so there is no reason for a zoo object to take materially longer than a non-zoo object. The slowness you observed was a bug in rollapply
(the extraction was not taking place properly) that was fixed in early November in the development version. This version is on R-Forge and installed like this:
install.packages("zoo", repo = "http://r-forge.r-project.org")
On the other hand, the generality of rollapply
means its going to be much slower than special purpose routines or vectorized operations.
zoo does have some specialized versions of rollapply
(rollmean
, rollmedian
, rollmax
) that are optimized for particular operations and will be much faster. If you can manufacture something out of those, e.g. a rolling sum of k terms is the same as k
times a rolling mean, then you can get substantial speedups. Faster still will be manufacturing the rolling result from plain operations such as +
.
The post indicated that the function in question was just an example but the particular function could make a big difference in terms of speed since it will affect whether the sorts of speedups discussed are available.
For example, running 3 replications of each of rollapply
, 2 * rollmean
and a simple vectorized addition shows this:
> library(zoo)
> library(rbenchmark)
> n <- 10^4
> set.seed(123)
> a <- rnorm(n)
> library(rbenchmark)
> benchmark(rollapply = a1 <- rollapplyr(a, 2, sum, fill = 0),
+ rollmean = a2 <- 2 * rollmeanr(a, 2, fill = 0),
+ add = a3 <- c(0, a[-1] + a[-n]), replications = 3, order = "relative")
test replications elapsed relative user.self sys.self user.child sys.child
3 add 3 0.00 0.00000 0.00 0 NA NA
2 rollmean 3 0.07 1.00000 0.08 0 NA NA
1 rollapply 3 1.85 26.42857 1.84 0 NA NA
>
> all.equal(a1, a2)
[1] TRUE
> all.equal(a1, a3)
[1] TRUE
Rollapply with different rolling window on each vector of time series
1) Convert to a wide form zoo object z and then to a list of zoo objects L, one per column of z, apply rollfun to each component of L creating a list of zoo objects and then merge back into a wide form zoo object zroll and either use that or optionally convert to long form data frame droll.
library(zoo)
z <- read.zoo(dat, split = "fact")
L <- as.list(z)
rollfun <- function(x) rollapplyr(x, length(na.omit(x)) - 252, mean)
zroll <- do.call("merge", Map(rollfun, L))
droll <- fortify.zoo(zroll, melt = TRUE)
2) This could also be expressed as a pipeline where rollfun is from above.
droll2 <- dat |>
read.zoo(split = "fact") |>
as.list() |>
Map(f = rollfun) |>
do.call(what = "merge") |>
fortify.zoo(melt = TRUE)
3) With dplyr
library(dplyr, exclude = c("lag", "filter"))
library(zoo)
dat %>%
group_by(fact) %>%
mutate(roll = rollapplyr(value, n() - 252, mean, fill = NA)) %>%
ungroup
How to produce simple sliding window features over irregular time series in R?
You can represent each type of event (e.g., "user A clicks item i")
as a time series x
, with value 1 each time it occurs.
The quantities you are interested can be computed from cumsum(x)
(the number of events until today) and its translations
(the number of events until k days in the past or the future).
# Sample data
set.seed(0)
k <- 100
users <- LETTERS[1:4]
files <- letters[1:4]
items <- letters[24:26]
clicks <- data.frame(
time = Sys.time() + runif(k, 0, k * 24 * 3600),
user = sample( users, k, replace=TRUE ),
item = sample( items, k, replace=TRUE )
)
clicks <- unique(clicks)
For a single time series:
x <- subset( clicks, user=="C" & item=="x" )
xts( rep(1,nrow(x)), x$time )
x <- xts( rep(1,nrow(x)), x$time )
y <- xts( coredata(x), index(x)+7*3600*24 )
z <- cbind(y, x)
z[ is.na(z) ] <- 0
cumsum(z[,2]) - cumsum(z[,1])
# cbind(x,z,cumsum(z[,2]) - cumsum(z[,1]))
For the whole dataset, you can use ddply
.
Rolling window slider::slide() with grouped data
You have to first tidyr::nest
the cases. Within the nested tibbles
(accessed via purrr::map
) you can then apply slide
(same technique as with purrr::map
). The important point is that you do not want to slide
across cases, but only within cases.
library(dplyr)
library(tidyr)
library(purrr)
library(slider)
get_coef1 <- function(data) {
coef1 <- lm(data = data, r1 ~ r2 + r3) %>%
coef() %>%
.["r2"] %>%
unname()
return(coef1)
}
data <- tibble(t = rep(1:10, 3),
case = c(rep("a", 10), rep("b", 10), rep("c", 10)),
r1 = rnorm(30),
r2 = rnorm(30),
r3 = rnorm(30))
data %>%
# ungroup() %>%
group_by(case) %>% nest() %>%
mutate(rollreg = map(data, ~ .x %>% mutate(coef1 = slider::slide_dbl(., ~get_coef1(.x), .before = Inf, .complete = TRUE)))) %>%
select(-data) %>% unnest(rollreg)
I have been trying for a while to use the new dplyr::nest_by()
from dplyr 1.0.0 trying to use summarise
in combination with the rowwise
cases but couldn't get that to work.
How to feed in the arguments of a function used in rollapplyr FUN
I think there are a few issues here. Let me walk you through the most critical ones:
# Simple Regression
beta <- function(x, indepvar, depvar) {
a <- coef(lm(formula = indepvar ~ depvar, data = x))
return(a)
}
The way you have written your function beta
means that you have to input the data x
, the indepvar
column and the depvar
column. But this would not work for lm
because what indepvar
and depvar
contains are fed in, instead of the variable names. For instance, the following would not work:
beta(input, y, x1)
Error in eval(expr, envir, enclos) : object 'y' not found
This is because y
and x1
do not exist outside of input
. Your rollapplyr
has the same issue. One way to get around this is to write:
beta <- function(indepvar, depvar) {
a <- coef(lm(indepvar ~ depvar))
return(a)
}
And explicitly input the columns like:
# > beta(input[,3],input[,4])
# (Intercept) depvar
# 0.1308993 0.2373399
Now this works:
rollapplyr(input[3:4], width = 6,
FUN = function(x) beta(x[,1], x[,2]),
by.column = FALSE)
# (Intercept) depvar
# [1,] -0.04987909 0.6433585022
# [2,] -0.23739671 0.7527017129
# [3,] -0.40483456 0.5833452315
# [4,] -0.28191172 0.6660916836
# [5,] 0.02886934 0.5334114615
# [6,] 0.17284232 0.8126499211
# [7,] 0.01236415 0.3194661428
# [8,] 0.48156300 -0.1532216150
# [9,] 0.75965765 -0.1993015431
# [10,] 0.80509109 -0.1822009137
# [11,] 0.55055694 -0.0005880675
# [12,] 0.53963291 -0.0262970723
# [13,] 0.46509011 0.0570725348
# [14,] 0.33227459 0.1598345855
# [15,] -0.20316429 0.2757045612
If you want to be able to call the columns by name, you can write your beta
function as:
library(zoo)
beta <- function(x, indepvar, depvar) {
a <- coef(lm(as.formula(paste(indepvar, "~", depvar)),
data = x))
return(a)
}
rollapplyr(input[3:4], width = 6,
FUN = function(x) beta(as.data.frame(x), "y", "x1"),
by.column = FALSE)
# (Intercept) x1
# [1,] -0.04987909 0.6433585022
# [2,] -0.23739671 0.7527017129
# [3,] -0.40483456 0.5833452315
# [4,] -0.28191172 0.6660916836
# [5,] 0.02886934 0.5334114615
# [6,] 0.17284232 0.8126499211
# [7,] 0.01236415 0.3194661428
# [8,] 0.48156300 -0.1532216150
# [9,] 0.75965765 -0.1993015431
# [10,] 0.80509109 -0.1822009137
# [11,] 0.55055694 -0.0005880675
# [12,] 0.53963291 -0.0262970723
# [13,] 0.46509011 0.0570725348
# [14,] 0.33227459 0.1598345855
# [15,] -0.20316429 0.2757045612
Notice I have to supply input[3:4]
instead of just input
to rollapplyr
because apparently rollapplyr
only takes matrix as input. If input
has mixed types, rollapplyr
coerce it to a matrix of characters, which is not desirable. So I have to both supply the numeric only columns and coerce it back to data.frame with as.data.frame
for lm
to work.
Here are two link that discuss this issue with rollapplyr
:
Is there a function like rollapply for data.frame
Can `ddply` (or similar) do a sliding window?
ddply for regression in R
By using ddply
, you apply lm
to the subsets of your data frame, where each subset corresponds to a certain city. It seems to be the case that some cities in the full data set have only one record. For such cases, the statistical analysis is obviously meaningless, however lm
will return you some answer, but if you have a factor variable in the model, it'll throw an error.
As a workaround, you could check the number of rows inside your anonymous function:
ddply(d,'city',function(x) if (nrow(x)==1) return() else coefficients(lm(output~temperature+humidity+time, data=x)))
where d
is slightly modified version of your sample set, in which I changed the id of the city in the last row to make sure that some cities have only one record:
d <- structure(list(city = c(11, 11, 11, 11, 22, 22, 22, 22, 5, 7), temperature = c(51L, 43L, 55L, 64L, 21L, 43L, 51L, 51L, 45L, 51L), humidity = c(34L, 30L, 50L, 54L, 52L, 65L, 66L, 78L, 70L, 54L), time = structure(c(1L, 2L, 3L, 6L, 7L, 4L, 5L, 8L, 1L, 6L), .Label = c("1", "2", "3", "4", "9", "10", "11", "16"), class = "factor"), output = c(201L, 232L, 253L, 280L, 321L, 201L, 211L, 199L, 202L, 213L)), .Names = c("city", "temperature", "humidity", "time", "output"), row.names = c(NA, -10L), class = "data.frame")
You could also use this base R code instead of ddply
:
L <- split(d,d$city)
L2 <- lapply(L,function(x) {
if (nrow(x)==1)
return()
else
coefficients(lm(output~temperature+humidity+time, data=x))
})
M <- do.call(rbind,L2)
df <- as.data.frame(M)
This code is more wordy but it is much easier to inspect and analyze it in case of problematic behavior.
Related Topics
Convert List to Data Frame While Keeping List-Element Names
Using R Convert Data.Frame to Simple Vector
How to Pass "Nothing" as an Argument to '[' for Subsetting
Changing Format of Some Axis Labels in Ggplot2 According to Condition
Skip Some Rows in Read.CSV in R
Passing Along Ellipsis Arguments to Two Different Functions
Dealing with Spaces and "Weird" Characters in Column Names with Dplyr::Rename()
How to Get the Nth Element of Each Item of a List, Which Is Itself a Vector of Unknown Length
Show Content for Menuitem When Menusubitems Exist in Shiny Dashboard
How to Calculate Mean of All Columns, by Group
Extract the Coefficients for the Best Tuning Parameters of a Glmnet Model in Caret
Ggplot2: Using Gtable to Move Strip Labels to Top of Panel for Facet_Grid
Calculating Peaks in Histograms or Density Functions
How to Access Dimensions of Labels Plotted by 'Geom_Text' in 'Ggplot2'
R Shiny, How to Make Datatable React to Checkboxes in Datatable