Can 'Ddply' (Or Similar) Do a Sliding Window

Is there a function like rollapply for data.frame

Try this:

> library(zoo)
> DF <- data.frame(a = 1:10, b = 21:30, c = letters[1:10])
> replace(DF, 1:2, rollapply(DF[1:2], 3, sum, fill = NA))
a b c
1 NA NA a
2 6 66 b
3 9 69 c
4 12 72 d
5 15 75 e
6 18 78 f
7 21 81 g
8 24 84 h
9 27 87 i
10 NA NA j

Regarding the example that was added to the question after some discussion, such functionality can be layered on top of rollapply by applying it to the row indexes:

> lapply(as.data.frame(t(rollapply(1:nrow(test), 2, c))), function(ix)test[ix, ])
$V1
Name Points
1 bob 4
2 jane 9

$V2
Name Points
2 jane 9
3 joe 1

and here it is wrapped up a bit better:

rollapply.data.frame <- function(data, ..., fill = NULL, FUN, 
simplify = function(x) do.call(rbind, x)) {
fill0 <- if (!is.null(fill)) NA
result <- lapply(
as.data.frame(t(rollapply(1:nrow(data), ..., fill = fill0, FUN = c))),
function(ix) {if (all(is.na(ix))) fill else FUN(data[ix, ])}
)
simplify(result)
}

> rollapply(test, 2, FUN = identity, simplify = identity)
$V1
Name Points
a bob 4
b jane 9

$V2
Name Points
b jane 9
c joe 1

> rollapply(test, 2, FUN = identity, fill = NA, simplify = identity)
$V1
Name Points
a bob 4
b jane 9

$V2
Name Points
b jane 9
c joe 1

$V3
[1] NA

R: Grouped rolling window linear regression with rollapply and ddply

1) rollapply works on data frames too so it is not necessary to convert df to zoo.

2) lm uses na.action, not na.rm, and its default is na.omit so we can just drop this argument.

3) rollapplyr is a more concise way to write rollapply(..., align = "right").

Assuming that rolled otherwise does what you want and incorporating these changes into rolled, the ddply statement in the question should work or we could use by from the base of R which we show below:

rolled <- function(df) {
rollapplyr(df, width = 6, function(m) {
coef(lm(formula = y ~ x, data = as.data.frame(m)))[2]
}, by = 3, by.column = FALSE
)
}
do.call("rbind", by(dat, dat[c("w", "z")], rolled))

Simple working example of ddply() in parallel on Windows

Here's a simple working example:

> df <- data.frame(val=1:10, ind=c(rep(2, 5), rep(3, 5)))
> library(doSNOW)
> registerDoSNOW(makeCluster(2, type = "SOCK"))
> system.time(print(ddply(df, .(ind), function(x) { Sys.sleep(2); sum(x) }, .parallel=FALSE)))
ind V1
1 2 25
2 3 55
user system elapsed
0.00 0.00 4.01
> system.time(print(ddply(df, .(ind), function(x) { Sys.sleep(2); sum(x) }, .parallel=TRUE)))
ind V1
1 2 25
2 3 55
user system elapsed
0.02 0.00 2.02

Apply / Plyr like functionality without without performance impact in Rollapply or apply.rolling

This answer is an expanded version of my earlier comments which I have now deleted.

zoo's rollapply already supports plain vectors and matrices. Furthermore its rollapply routine extracts the plain vectors or matrices from a zoo object before operating on it so there is no reason for a zoo object to take materially longer than a non-zoo object. The slowness you observed was a bug in rollapply (the extraction was not taking place properly) that was fixed in early November in the development version. This version is on R-Forge and installed like this:

install.packages("zoo", repo = "http://r-forge.r-project.org")

On the other hand, the generality of rollapply means its going to be much slower than special purpose routines or vectorized operations.

zoo does have some specialized versions of rollapply (rollmean, rollmedian, rollmax) that are optimized for particular operations and will be much faster. If you can manufacture something out of those, e.g. a rolling sum of k terms is the same as k times a rolling mean, then you can get substantial speedups. Faster still will be manufacturing the rolling result from plain operations such as + .

The post indicated that the function in question was just an example but the particular function could make a big difference in terms of speed since it will affect whether the sorts of speedups discussed are available.

For example, running 3 replications of each of rollapply, 2 * rollmean and a simple vectorized addition shows this:

> library(zoo)
> library(rbenchmark)
> n <- 10^4
> set.seed(123)
> a <- rnorm(n)
> library(rbenchmark)
> benchmark(rollapply = a1 <- rollapplyr(a, 2, sum, fill = 0),
+ rollmean = a2 <- 2 * rollmeanr(a, 2, fill = 0),
+ add = a3 <- c(0, a[-1] + a[-n]), replications = 3, order = "relative")
test replications elapsed relative user.self sys.self user.child sys.child
3 add 3 0.00 0.00000 0.00 0 NA NA
2 rollmean 3 0.07 1.00000 0.08 0 NA NA
1 rollapply 3 1.85 26.42857 1.84 0 NA NA
>
> all.equal(a1, a2)
[1] TRUE
> all.equal(a1, a3)
[1] TRUE

Rollapply with different rolling window on each vector of time series

1) Convert to a wide form zoo object z and then to a list of zoo objects L, one per column of z, apply rollfun to each component of L creating a list of zoo objects and then merge back into a wide form zoo object zroll and either use that or optionally convert to long form data frame droll.

library(zoo)

z <- read.zoo(dat, split = "fact")
L <- as.list(z)

rollfun <- function(x) rollapplyr(x, length(na.omit(x)) - 252, mean)
zroll <- do.call("merge", Map(rollfun, L))
droll <- fortify.zoo(zroll, melt = TRUE)

2) This could also be expressed as a pipeline where rollfun is from above.

droll2 <- dat |>
read.zoo(split = "fact") |>
as.list() |>
Map(f = rollfun) |>
do.call(what = "merge") |>
fortify.zoo(melt = TRUE)

3) With dplyr

library(dplyr, exclude = c("lag", "filter"))
library(zoo)

dat %>%
group_by(fact) %>%
mutate(roll = rollapplyr(value, n() - 252, mean, fill = NA)) %>%
ungroup

How to produce simple sliding window features over irregular time series in R?

You can represent each type of event (e.g., "user A clicks item i")
as a time series x, with value 1 each time it occurs.
The quantities you are interested can be computed from cumsum(x)
(the number of events until today) and its translations
(the number of events until k days in the past or the future).

# Sample data
set.seed(0)
k <- 100
users <- LETTERS[1:4]
files <- letters[1:4]
items <- letters[24:26]
clicks <- data.frame(
time = Sys.time() + runif(k, 0, k * 24 * 3600),
user = sample( users, k, replace=TRUE ),
item = sample( items, k, replace=TRUE )
)
clicks <- unique(clicks)

For a single time series:

x <- subset( clicks, user=="C" & item=="x" )
xts( rep(1,nrow(x)), x$time )
x <- xts( rep(1,nrow(x)), x$time )

y <- xts( coredata(x), index(x)+7*3600*24 )
z <- cbind(y, x)
z[ is.na(z) ] <- 0
cumsum(z[,2]) - cumsum(z[,1])
# cbind(x,z,cumsum(z[,2]) - cumsum(z[,1]))

For the whole dataset, you can use ddply.

Rolling window slider::slide() with grouped data

You have to first tidyr::nest the cases. Within the nested tibbles (accessed via purrr::map) you can then apply slide (same technique as with purrr::map). The important point is that you do not want to slide across cases, but only within cases.

library(dplyr)
library(tidyr)
library(purrr)
library(slider)

get_coef1 <- function(data) {
coef1 <- lm(data = data, r1 ~ r2 + r3) %>%
coef() %>%
.["r2"] %>%
unname()

return(coef1)
}

data <- tibble(t = rep(1:10, 3),
case = c(rep("a", 10), rep("b", 10), rep("c", 10)),
r1 = rnorm(30),
r2 = rnorm(30),
r3 = rnorm(30))

data %>%
# ungroup() %>%
group_by(case) %>% nest() %>%
mutate(rollreg = map(data, ~ .x %>% mutate(coef1 = slider::slide_dbl(., ~get_coef1(.x), .before = Inf, .complete = TRUE)))) %>%
select(-data) %>% unnest(rollreg)

I have been trying for a while to use the new dplyr::nest_by() from dplyr 1.0.0 trying to use summarise in combination with the rowwise cases but couldn't get that to work.

How to feed in the arguments of a function used in rollapplyr FUN

I think there are a few issues here. Let me walk you through the most critical ones:

# Simple Regression
beta <- function(x, indepvar, depvar) {
a <- coef(lm(formula = indepvar ~ depvar, data = x))
return(a)
}

The way you have written your function beta means that you have to input the data x, the indepvar column and the depvar column. But this would not work for lm because what indepvar and depvar contains are fed in, instead of the variable names. For instance, the following would not work:

beta(input, y, x1)

Error in eval(expr, envir, enclos) : object 'y' not found

This is because y and x1 do not exist outside of input. Your rollapplyr has the same issue. One way to get around this is to write:

beta <- function(indepvar, depvar) {
a <- coef(lm(indepvar ~ depvar))
return(a)
}

And explicitly input the columns like:

# > beta(input[,3],input[,4])
# (Intercept) depvar
# 0.1308993 0.2373399

Now this works:

rollapplyr(input[3:4], width = 6,
FUN = function(x) beta(x[,1], x[,2]),
by.column = FALSE)

# (Intercept) depvar
# [1,] -0.04987909 0.6433585022
# [2,] -0.23739671 0.7527017129
# [3,] -0.40483456 0.5833452315
# [4,] -0.28191172 0.6660916836
# [5,] 0.02886934 0.5334114615
# [6,] 0.17284232 0.8126499211
# [7,] 0.01236415 0.3194661428
# [8,] 0.48156300 -0.1532216150
# [9,] 0.75965765 -0.1993015431
# [10,] 0.80509109 -0.1822009137
# [11,] 0.55055694 -0.0005880675
# [12,] 0.53963291 -0.0262970723
# [13,] 0.46509011 0.0570725348
# [14,] 0.33227459 0.1598345855
# [15,] -0.20316429 0.2757045612

If you want to be able to call the columns by name, you can write your beta function as:

library(zoo)
beta <- function(x, indepvar, depvar) {
a <- coef(lm(as.formula(paste(indepvar, "~", depvar)),
data = x))
return(a)
}

rollapplyr(input[3:4], width = 6,
FUN = function(x) beta(as.data.frame(x), "y", "x1"),
by.column = FALSE)

# (Intercept) x1
# [1,] -0.04987909 0.6433585022
# [2,] -0.23739671 0.7527017129
# [3,] -0.40483456 0.5833452315
# [4,] -0.28191172 0.6660916836
# [5,] 0.02886934 0.5334114615
# [6,] 0.17284232 0.8126499211
# [7,] 0.01236415 0.3194661428
# [8,] 0.48156300 -0.1532216150
# [9,] 0.75965765 -0.1993015431
# [10,] 0.80509109 -0.1822009137
# [11,] 0.55055694 -0.0005880675
# [12,] 0.53963291 -0.0262970723
# [13,] 0.46509011 0.0570725348
# [14,] 0.33227459 0.1598345855
# [15,] -0.20316429 0.2757045612

Notice I have to supply input[3:4] instead of just input to rollapplyr because apparently rollapplyr only takes matrix as input. If input has mixed types, rollapplyr coerce it to a matrix of characters, which is not desirable. So I have to both supply the numeric only columns and coerce it back to data.frame with as.data.frame for lm to work.

Here are two link that discuss this issue with rollapplyr:

Is there a function like rollapply for data.frame

Can `ddply` (or similar) do a sliding window?

ddply for regression in R

By using ddply, you apply lm to the subsets of your data frame, where each subset corresponds to a certain city. It seems to be the case that some cities in the full data set have only one record. For such cases, the statistical analysis is obviously meaningless, however lm will return you some answer, but if you have a factor variable in the model, it'll throw an error.

As a workaround, you could check the number of rows inside your anonymous function:

ddply(d,'city',function(x) if (nrow(x)==1) return() else coefficients(lm(output~temperature+humidity+time, data=x)))

where d is slightly modified version of your sample set, in which I changed the id of the city in the last row to make sure that some cities have only one record:

d <- structure(list(city = c(11, 11, 11, 11, 22, 22, 22, 22, 5, 7), temperature = c(51L, 43L, 55L, 64L, 21L, 43L, 51L, 51L, 45L,     51L), humidity = c(34L, 30L, 50L, 54L, 52L, 65L, 66L, 78L,     70L, 54L), time = structure(c(1L, 2L, 3L, 6L, 7L, 4L, 5L,     8L, 1L, 6L), .Label = c("1", "2", "3", "4", "9", "10", "11",     "16"), class = "factor"), output = c(201L, 232L, 253L, 280L,     321L, 201L, 211L, 199L, 202L, 213L)), .Names = c("city", "temperature", "humidity", "time", "output"), row.names = c(NA, -10L), class = "data.frame")

You could also use this base R code instead of ddply:

L <- split(d,d$city)

L2 <- lapply(L,function(x) {
if (nrow(x)==1)
return()
else
coefficients(lm(output~temperature+humidity+time, data=x))
})

M <- do.call(rbind,L2)
df <- as.data.frame(M)

This code is more wordy but it is much easier to inspect and analyze it in case of problematic behavior.



Related Topics



Leave a reply



Submit