Sliding Window in R

Sliding window of mean across dataframe, based on alternate column in R

Here are several alternatives. (1) has no package dependencies, (2) uses rollapply and so is the most similar to the code in the question and (3) uses SQL and is the shortest code-wise.

1) Base R If the problem is that dist does not contain every number between 1 and its maximum value then we can iterate over the intervals like this:

Fun <- function(st, width, df, fun) {
  # apply fun to the values whose dist lies in [st, st + width - 1]
  fun(subset(df, dist >= st & dist <= st + width - 1)$value)
}

width <- 50
step <- 25

starts <- seq(1, max(df$dist), step)

data.frame(starts,
  ends = starts + width - 1,
  mean = sapply(starts, Fun, width, df, mean),
  N = sapply(starts, Fun, width, df, length))

giving:

  starts ends     mean  N
1      1   50 5.200910 50
2     26   75 4.710030 50
3     51  100 4.770270 50
4     76  125 4.880030 38
5    101  150 5.318415 25
6    126  175 5.575938 25
7    151  200 4.989383 25
8    176  225 3.918574 12

2) rollapply Another approach is to expand the input data frame in which case we can use rollapply.

library(zoo)

roll <- function(x, width, fun, step) {
  # drop the NA padding rows before applying fun
  fun2 <- function(x) fun(na.omit(x))
  rollapply(x, width, by = step, fun2, partial = TRUE, align = "left")
}

width <- 50
step <- 25

starts <- seq(1, max(df$dist), step)

# expand df to one row per integer dist, with NA where no value was observed
m <- merge(df, data.frame(dist = 1:max(df$dist)), all = TRUE)
data.frame(starts,
  ends = starts + width - 1,
  mean = roll(m$value, width, mean, step),
  N = roll(m$value, width, length, step)
)

giving:

  starts ends     mean  N
1      1   50 5.200910 50
2     26   75 4.710030 50
3     51  100 4.770270 50
4     76  125 4.880030 38
5    101  150 5.318415 25
6    126  175 5.575938 25
7    151  200 4.989383 25
8    176  225 3.918574 12
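
To see why the data frame is expanded first, here is a minimal sketch on a toy input (df_small and m_small are illustrative names, not part of the answer above). With all = TRUE, merge adds a row with an NA value for every missing dist, so a fixed number of rows corresponds to a fixed range of dist:

```r
# Toy input with gaps in dist (illustrative data, not the df used above)
df_small <- data.frame(dist = c(1, 2, 4, 7), value = c(10, 20, 40, 70))

# Outer join against the complete sequence 1..max(dist); each missing
# dist gets a row whose value is NA
m_small <- merge(df_small, data.frame(dist = 1:max(df_small$dist)), all = TRUE)

print(m_small)
```

Once every integer dist has a row, a window of 50 rows covers exactly 50 units of dist, which is what rollapply needs.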

3) sqldf This can be formulated compactly using SQL with the indicated left join.

library(sqldf)

width <- 50
step <- 25

starts <- data.frame(starts = seq(1, max(df$dist), step))
fn$sqldf("select starts, starts+$width-1 ends, avg(value) mean, count(value) N
from starts
left join df on dist between starts and ends
group by starts.rowid")

giving:

  starts ends     mean  N
1      1   50 5.200910 50
2     26   75 4.710030 50
3     51  100 4.770270 50
4     76  125 4.880030 38
5    101  150 5.318415 25
6    126  175 5.575938 25
7    151  200 4.989383 25
8    176  225 3.918574 12

Note

For the input to be reproducible we must set the seed before generating any random numbers, so the examples above used this:

set.seed(123)
dist <- c(seq(1, 100, by = 1), seq(101, 200, by = 2))
value <- runif(150, min = 0, max = 10)
df <- data.frame(dist, value)

Sliding Windows for Rows in R

The following code implements an R function that takes a vector and an integer giving the size of the sliding window. The function returns a vector with the mean of the elements inside each chunk of the window (you can adapt the code to do any other type of computation).

# Function that computes a trailing sliding-window mean
compute_mean_feat <- function(vector, window) {
  means <- numeric(length(vector))
  for (i in seq_along(vector)) {
    # Until `window` elements are available, average the first i elements;
    # afterwards, average the last `window` elements ending at position i
    start <- max(1, i - window + 1)
    means[i] <- mean(vector[start:i])
  }
  return(means)
}

You can see the result of applying the compute_mean_feat function with a sliding window of 10 to a dummy vector.

# Manual example for testing
window <- 10
vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
compute_mean_feat(vector, window)

For a data frame, you just need to apply this function to the desired column and it will give you the sliding window vector for its rows.
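
As a minimal sketch of that last step, the following applies a trailing window mean to one column of a data frame. It assumes the built-in iris data and uses a compact base R helper, window_mean, which is a hypothetical name introduced here rather than the function above; windows near the start are shortened to the elements available.

```r
# Hypothetical helper: mean of the last `window` elements ending at each position
window_mean <- function(x, window) {
  sapply(seq_along(x), function(i) mean(x[max(1, i - window + 1):i]))
}

df_iris <- iris[1:10, ]
# Store the sliding-window mean of one column as a new column
df_iris$Sepal.Length.win3 <- window_mean(df_iris$Sepal.Length, 3)
head(df_iris[, c("Sepal.Length", "Sepal.Length.win3")])
```

The result has one value per row, so it can be stored next to the original column.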

Coding a simple loop for a sliding window

Use rollapplyr with the indicated function.

library(zoo)
pv <- function(xx) runs.test(xx, threshold = mean(xx))$p.value
out <- rollapplyr(x, 256, pv, fill = NA)

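Note: runs.test is provided by the randtests package, and the input x used above was generated as follows: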

library(randtests)
set.seed(123)
x <- ts(rnorm(2659, mean = 0.0001, sd = 0.0001))

Sliding normalising window in R

1) If DF is the input data.frame, calculate the rolling means, subtract those from the original data frame and then divide each column by the corresponding sd value. If you don't want the NA rows then use na.omit(out).

Note that the answer to this question is relevant here: How to divide each row of a matrix by elements of a vector in R

library(zoo)

out <- t( t(DF - rollmean(DF, 3, fill = NA, align = "left")) / sapply(DF, sd))

giving:

> out
           W1          W2         W3           W4         W5        W6        W7
1   2.0571604 -0.46799047 -0.3798546 -0.782516058  0.7559711 0.3162800 0.4320913
2  -0.7668684  0.03065979 -0.5079677 -0.656126126  0.4270853 0.3599383 0.4083388
3  -0.7839578  0.82502267 -0.4947466 -0.466405606  0.1438538 0.3990324 0.3966334
4   0.7080855  1.03647378 -0.2435920 -0.236471919 -0.1148815 0.4020498 0.3856112
5  -0.3229973 -0.30756238  0.1618686 -0.000389918 -0.3137854 0.3680621 0.3629682
6  -0.3046393 -1.66132459  0.6238737  0.297421141 -0.4903858 0.3136170 0.3091448
7   1.0105062 -0.16328686  0.9294159  0.662844512 -0.6631908 0.2474401 0.2128288
8  -0.3830338  1.59900097  0.8471133  0.979199212 -0.8212911 0.1795721 0.1020336
9          NA          NA         NA           NA         NA        NA        NA
10         NA          NA         NA           NA         NA        NA        NA

Correcting the formulas in the question the first 3 values in column 1 are:

(1.3785 - (1.378+(-0.7303)+(-0.5213))/3)/sd(DF[, 1])
## [1] 2.057361
(-0.7303 - (-0.7303+(-0.5213)+0.555)/3)/sd(DF[, 1])
## [1] -0.7668342
(-0.5213 - (-0.5213+0.555+(-0.0699))/3)/sd(DF[, 1])
## [1] -0.7839742

2) An alternate solution would be to define a function which performs the required operation on a single column then sapply it to each column.

sapply(DF, function(x) (x - rollmean(x, 3, align = "left", fill = NA))/sd(x))

Note: The input in reproducible form is:

Lines <-  " W1          W2         W3        W4         W5         W6         W7
1 1.37853716 0.01316304 -0.1363012 0.6895341 -0.7230930 -0.1310321 -0.4109521
2 -0.73032998 0.31212925 0.1654731 0.9187255 -0.8017260 -0.1619631 -0.4243575
3 -0.52130420 0.43831484 0.6088623 1.1183964 -0.8486971 -0.1970389 -0.4368820
4 0.55501096 0.13850401 1.1221211 1.2708212 -0.8701385 -0.2372061 -0.4490060
5 -0.06995122 -0.53842548 1.4592013 1.3581935 -0.8661200 -0.2791726 -0.4608654
6 -0.19984548 -0.78829431 1.4564180 1.3823090 -0.8431200 -0.3184653 -0.4722506
7 0.68935525 0.18733222 1.0158497 1.3344059 -0.8043461 -0.3526886 -0.4825229
8 -0.49540738 0.80663376 0.1774945 1.1800970 -0.7494087 -0.3803636 -0.4901212
9 -0.09501622 -0.17931684 -0.7074083 0.9312984 -0.6801124 -0.4008524 -0.4942994
10 -0.14939548 -0.68153738 -1.2723772 0.6054420 -0.5968207 -0.4149125 -0.4952316"
DF <- read.table(text = Lines)

Sliding window in a data frame r

This code will slide p1 from 0 to 6990 in steps of 10 while p2 slides from 10 to 7000 in steps of 10:

output = apply(data.frame(seq(0,6990,10), seq(10,7000,10)), MARGIN=1,
               function(x) roh_island(M1, 1, x[1], x[2]))
plot(output, col="blue")
grid(5, 5)
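
The same pattern can be sketched with a stand-in for roh_island, whose definition is not shown here. window_count below is a hypothetical function that counts how many positions fall in each window, and pos is made-up data; starts and ends mirror the two sequences above on a smaller range, paired up with mapply.

```r
# Hypothetical stand-in for roh_island: count positions inside [p1, p2)
window_count <- function(pos, p1, p2) sum(pos >= p1 & pos < p2)

pos <- c(3, 8, 12, 15, 27, 31, 39)   # illustrative positions
starts <- seq(0, 30, 10)             # p1: 0, 10, 20, 30
ends <- seq(10, 40, 10)              # p2: 10, 20, 30, 40

# Evaluate the function once per (start, end) pair
counts <- mapply(function(p1, p2) window_count(pos, p1, p2), starts, ends)
counts
```

mapply walks the two vectors in parallel, which avoids building a data frame just to iterate over its rows.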


Generating sliding window to subset data for prediction task

The error/warning comes from using == when the right-hand side has length greater than 1. Use %in% instead:

pred <- vector('list', 8)
names(pred) <- 2000:2007
for(i in 2000:2007){
df_sub1 <- subset(df, year %in% c(i, i+1, i+2))
mod <- glm(y~var1+var2, data=df_sub1, family=binomial())
df_sub2 <- subset(df, year == (i+3))
pred[[as.character(i)]] <- tryCatch(predict(mod,
newdata=df_sub2, type = "response"), error = function(e) NA_real_)
}

Output:

> pred
$`2000`
4
1

$`2001`
5
1

$`2002`
6
1

$`2003`
7
2.220446e-16

$`2004`
8
0.1467543

$`2005`
9
0.001408577

$`2006`
10
2.220446e-16

$`2007`
[1] NA
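
The difference between == and %in% noted above can be seen on a small vector. This is a minimal sketch with made-up years: == compares element by element and recycles the right-hand side, while %in% tests membership in the set.

```r
year <- c(2000, 2001, 2002, 2003, 2001, 2000)
target <- c(2000, 2001, 2002)

# == compares position by position, recycling `target`
eq_result <- year == target
# %in% asks whether each year is one of the targets
in_result <- year %in% target

print(eq_result)
print(in_result)
```

Here the last element, 2000, is compared with the recycled 2002 and so == reports FALSE even though 2000 is one of the targets. When the longer length is not a multiple of the shorter one, R additionally warns about it.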

Sliding window sample from a data frame

Say that your dataset consists of the first 20 rows of iris:

df <- iris[1:20, ]

Then you can try:

nr <- nrow(df)
windowSize <- 5
lapply(seq_len(nr-windowSize+1), function(i) df[i:(i+windowSize-1),])

Sliding window on date-time field in data frame R

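Here is an option using data.table (the sample data dt is defined under "data:" below):

# start of the 24-hour look-back window for each order
dt[, dayago := date - 24 * 60 * 60]
# non-equi self-join: for each order, match the same customer's orders in [dayago, date),
# then drop the three join key columns, keeping n and avg
dt[, c("n", "avg") :=
    dt[dt, on=.(customer_id, date>=dayago, date<date),
        by=.EACHI, .(n=.N, avg=mean(amount))][, (1L:3L) := NULL]
]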

data:

library(data.table)
dt <- data.table(
order_id = 1:10,
customer_id = c(1, rep(2, 2), rep(3, 3), rep(4, 4)),
amount = seq(10, 100, by = 10),
date = as.POSIXct(c("2020-10-07 12:00", # 1st customer
"2020-10-07 12:00", "2020-10-08 11:00", # 2nd customer
"2020-10-07 12:00", "2020-10-08 11:00", "2020-10-08 20:00", # 3rd customer
"2020-10-07 12:00", "2020-10-08 11:00", "2020-10-08 20:00", "2020-10-08 21:00" # 4th customer
), format=("%Y-%m-%d %H:%M")))
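
As a rough cross-check of the windowing logic, the same per-order statistics can be sketched in base R. This is an illustration under assumptions, not the answer's method: orders is made-up data, the time zone is fixed to UTC for reproducibility, and orders with no earlier order in the window get n = 0 and avg = NaN here.

```r
# Illustrative data: two customers, times in UTC
orders <- data.frame(
  customer_id = c(1, 1, 1, 2),
  amount = c(10, 20, 30, 40),
  date = as.POSIXct(c("2020-10-07 12:00", "2020-10-08 11:00",
                      "2020-10-08 11:30", "2020-10-08 11:00"),
                    format = "%Y-%m-%d %H:%M", tz = "UTC")
)

day <- 24 * 60 * 60  # window length in seconds

# For each order, select the same customer's orders in [date - 24h, date)
stats <- t(sapply(seq_len(nrow(orders)), function(i) {
  in_win <- orders$customer_id == orders$customer_id[i] &
    orders$date >= orders$date[i] - day &
    orders$date < orders$date[i]
  c(n = sum(in_win), avg = mean(orders$amount[in_win]))
}))

cbind(orders, stats)
```

This loop is quadratic in the number of rows, so the data.table join is likely the better choice for large inputs.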

