R Data.Table Grouping for Lagged Regression

A few additional notes prompted by Alex's comment. The reason it is hard to see what's going on here is that a lot happens within a single line, so it's always a good idea to break things down.
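
For reference, here is a minimal sketch of the kind of DT the answer assumes. The column names stock_id, date, and logret come from the question; the values below are made up to match the printed output:

```r
library(data.table)

# Hypothetical panel: two stocks with six days of log returns each
# (values made up to match the output shown below)
DT <- data.table(
  date     = rep(seq(as.Date("2011-01-01"), by = "day", length.out = 6), 2),
  stock_id = rep(1:2, each = 6),
  logret   = seq(0.001, by = 0.002, length.out = 12)
)
```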

What do we actually want? We want a new column lagret and the syntax to add a new column in data.table is the following:

DT[, lagret := xxx]

where xxx has to be filled up with whatever you want to have in column lagret. So if we just want a new column that gives us the rows, we could just call

DT[, lagret := seq(from=1, to=nrow(DT))]

Here we actually want the lagged value of logret, but we have to consider that there are many stocks in the table. That's why we do a self-join, i.e. we join the data.table DT with itself by the columns stock_id and date; since we want the previous value for each stock, we use date-1. Note that we have to set the keys first to do such a join:

setkeyv(DT,c('stock_id','date'))
DT[list(stock_id,date-1)]
stock_id date logret
1: 1 2010-12-31 NA
2: 1 2011-01-01 0.001
3: 1 2011-01-02 0.003
4: 1 2011-01-03 0.005
5: 1 2011-01-04 0.007
6: 1 2011-01-05 0.009
...

As you can see, we now have what we want: logret is lagged by one period. But we actually want that in a new column lagret in DT, so we take just that column by calling [[3L]] (which means nothing other than "give me the third column") and name the new column lagret:

DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-04 1 0.007 0.005
5: 2011-01-05 1 0.009 0.007
...

This is already the correct solution. In this simple case, we do not need roll=TRUE because there are no gaps in the dates. However, in a more realistic example (as mentioned above, for instance when we have weekends), there might be gaps. So let's make such a realistic example by just deleting two days in the DT for the first stock:

DT <- DT[-c(4, 5)]
setkeyv(DT,c('stock_id','date'))
DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-06 1 0.011 NA
5: 2011-01-01 2 0.013 NA
...

As you can see, the problem now is that we don't have a lagged value for the 6th of January, because the 4th and 5th were deleted. That's why we use roll=TRUE:

DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-06 1 0.011 0.005
5: 2011-01-01 2 0.013 NA
...

Just have a look at the documentation for how roll=TRUE works exactly. In a nutshell: if it can't find the previous value (here, logret for the 5th of January), it takes the last available one (here, from the 3rd of January).

How to create a lag variable within each group?

You could do this within data.table:

library(data.table)
data[, lag.value := c(NA, value[-.N]), by = groups]
data
# time groups value lag.value
#1: 1 a 0.02779005 NA
#2: 2 a 0.88029938 0.02779005
#3: 3 a -1.69514201 0.88029938
#4: 1 b -1.27560288 NA
#5: 2 b -0.65976434 -1.27560288
#6: 3 b -1.37804943 -0.65976434
#7: 4 b 0.12041778 -1.37804943
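
Inside each group, .N is the group size, so c(NA, value[-.N]) drops the last element and prepends an NA, shifting the values down by one. A quick base-R illustration of the idiom:

```r
# c(NA, x[-length(x)]) shifts a vector down by one position:
# drop the last element, prepend NA
x <- c(10, 20, 30)
lagged <- c(NA, x[-length(x)])
lagged
# [1] NA 10 20
```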

For multiple columns:

nm1 <- grep("^value", colnames(data), value=TRUE)
nm2 <- paste("lag", nm1, sep=".")
data[, (nm2):=lapply(.SD, function(x) c(NA, x[-.N])), by=groups, .SDcols=nm1]
data
# time groups value value1 value2 lag.value lag.value1
#1: 1 b -0.6264538 0.7383247 1.12493092 NA NA
#2: 2 b 0.1836433 0.5757814 -0.04493361 -0.6264538 0.7383247
#3: 3 b -0.8356286 -0.3053884 -0.01619026 0.1836433 0.5757814
#4: 1 a 1.5952808 1.5117812 0.94383621 NA NA
#5: 2 a 0.3295078 0.3898432 0.82122120 1.5952808 1.5117812
#6: 3 a -0.8204684 -0.6212406 0.59390132 0.3295078 0.3898432
#7: 4 a 0.4874291 -2.2146999 0.91897737 -0.8204684 -0.6212406
# lag.value2
#1: NA
#2: 1.12493092
#3: -0.04493361
#4: NA
#5: 0.94383621
#6: 0.82122120
#7: 0.59390132

Update

From data.table v1.9.5 on, we can use shift() with type = "lag" or type = "lead". The default type is "lag".

data[, (nm2) :=  shift(.SD), by=groups, .SDcols=nm1]
# time groups value value1 value2 lag.value lag.value1
#1: 1 b -0.6264538 0.7383247 1.12493092 NA NA
#2: 2 b 0.1836433 0.5757814 -0.04493361 -0.6264538 0.7383247
#3: 3 b -0.8356286 -0.3053884 -0.01619026 0.1836433 0.5757814
#4: 1 a 1.5952808 1.5117812 0.94383621 NA NA
#5: 2 a 0.3295078 0.3898432 0.82122120 1.5952808 1.5117812
#6: 3 a -0.8204684 -0.6212406 0.59390132 0.3295078 0.3898432
#7: 4 a 0.4874291 -2.2146999 0.91897737 -0.8204684 -0.6212406
# lag.value2
#1: NA
#2: 1.12493092
#3: -0.04493361
#4: NA
#5: 0.94383621
#6: 0.82122120
#7: 0.59390132

If you need the reverse, use type = "lead":

nm3 <- paste("lead", nm1, sep=".")

Using the original dataset

data[, (nm3) := shift(.SD, type = "lead"), by = groups, .SDcols = nm1]
# time groups value value1 value2 lead.value lead.value1
#1: 1 b -0.6264538 0.7383247 1.12493092 0.1836433 0.5757814
#2: 2 b 0.1836433 0.5757814 -0.04493361 -0.8356286 -0.3053884
#3: 3 b -0.8356286 -0.3053884 -0.01619026 NA NA
#4: 1 a 1.5952808 1.5117812 0.94383621 0.3295078 0.3898432
#5: 2 a 0.3295078 0.3898432 0.82122120 -0.8204684 -0.6212406
#6: 3 a -0.8204684 -0.6212406 0.59390132 0.4874291 -2.2146999
#7: 4 a 0.4874291 -2.2146999 0.91897737 NA NA
# lead.value2
#1: -0.04493361
#2: -0.01619026
#3: NA
#4: 0.82122120
#5: 0.59390132
#6: 0.91897737
#7: NA

data

set.seed(1)
data <- data.table(time = c(1:3, 1:4), groups = rep(c("b", "a"), c(3, 4)),
                   value = rnorm(7), value1 = rnorm(7), value2 = rnorm(7))

Need a data.table method to use a lagged computed value by group

DT[, val2 := nafill(val, "locf") / 2^(1 + cumsum(is.na(val))), grp]
DT
yr grp val val2
1: 1 a 25 12.5
2: 2 a 20 10.0
3: 3 a NA 5.0
4: 4 a NA 2.5
5: 1 b 10 5.0
6: 2 b 12 6.0
7: 3 b NA 3.0
8: 4 b NA 1.5
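
The input DT is not shown in the answer; the sketch below reconstructs it from the printed output (this assumes data.table >= 1.12.4, where nafill() was introduced):

```r
library(data.table)

# Reconstructed from the printed output above
DT <- data.table(
  yr  = rep(1:4, 2),
  grp = rep(c("a", "b"), each = 4),
  val = c(25, 20, NA, NA, 10, 12, NA, NA)
)

# nafill(val, "locf") carries the last observed value forward within each
# group; 1 + cumsum(is.na(val)) counts how many halvings to apply, so each
# missing year halves the previous (already halved) value again.
DT[, val2 := nafill(val, "locf") / 2^(1 + cumsum(is.na(val))), grp]
```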

Create lagged variable in unbalanced panel data in R

Using a function tlag within groups defined by id:

library(dplyr)
tlag <- function(x, n = 1L, time) {
  index <- match(time - n, time, incomparables = NA)
  x[index]
}

df %>% group_by(id) %>% mutate(value_lagged = tlag(value, 1, time = date))
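
To see why tlag() is safe on gaps: match(time - n, time) looks the lag up by value rather than by row, so a missing period yields NA instead of the wrong row. A sketch with a made-up unbalanced panel (id 1 has no observation at date 3):

```r
library(dplyr)

tlag <- function(x, n = 1L, time) {
  index <- match(time - n, time, incomparables = NA)
  x[index]
}

# Made-up unbalanced panel: id 1 skips date 3
df <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  date  = c(1, 2, 4, 1, 2),
  value = c(10, 11, 13, 20, 21)
)

out <- df %>% group_by(id) %>% mutate(value_lagged = tlag(value, 1, time = date))
# id 1 at date 4 gets NA (there is no date-3 observation), not 11
```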

Grouped count aggregation in R data.table

## by date
DT[, list(total_buys = sum(buy > 0), total_sells = sum(sell > 0)), by = date]
## date total_buys total_sells
## 1: 2011-01-01 1 0
## 2: 2011-01-02 0 2
## 3: 2011-01-03 2 1
## 4: 2011-01-04 1 0
## 5: 2011-01-05 0 0
## 6: 2011-01-06 0 1

DT[, list(total_buys = sum(buy > 0), total_sells = sum(sell > 0))]
## total_buys total_sells
## 1: 4 4
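
The aggregation works because sum() over a logical vector counts the TRUE values, so sum(buy > 0) counts the rows with at least one buy. In base R (made-up vector for illustration):

```r
# sum() over a logical counts the TRUEs
buy <- c(0, 3, 0, 1, 2, 0)
sum(buy > 0)
# [1] 3
```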

Adding multiple columns from a function to data.table using := within groups - without specifying the LHS

To get the same result as IShouldBuyABoat's answer, try:

data.table(dt, dt[, my.fun(.SD), by = grp][, grp := NULL])
grp period y1 y2 y1.lag y2.lag y1y2
1: a 1 1.36677 -0.81025 NA NA -1.107425
2: a 2 0.43528 1.04277 1.36677 -0.8102 0.453895
3: a 3 -1.40229 0.66223 0.43528 1.0428 -0.928633
4: a 4 1.43362 0.10293 -1.40229 0.6622 0.147560
5: a 5 0.46713 0.72508 1.43362 0.1029 0.338705
6: b 1 -0.04418 -0.20014 NA NA 0.008843
7: b 2 1.32390 0.19651 -0.04418 -0.2001 0.260160
8: b 3 -0.82543 1.11483 1.32390 0.1965 -0.920215
9: b 4 -1.26415 0.53213 -0.82543 1.1148 -0.672698
10: b 5 0.14549 0.04128 -1.26415 0.5321 0.006005

Note: you can use .SDcols to specify the columns passed to my.fun. The [, grp := NULL] suppresses the duplicate grp column.

R - How do I lag/lead multiple columns in a data.table by multiple periods most efficiently

How's this in a nice little for loop:

cols <- grep("Var", names(dt), value = TRUE)
for (i in 1:2) { # update for the number of shifts

  lag_names <- paste(cols, "Lag", i, sep = "_")
  dt[, (lag_names) := lapply(.SD, shift, i, type = "lag"), .SDcols = cols]

  lead_names <- paste(cols, "Lead", i, sep = "_")
  dt[, (lead_names) := lapply(.SD, shift, i, type = "lead"), .SDcols = cols]

}
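
As an alternative to the loop, shift() also accepts a vector of shift amounts, so each direction needs only one call. When n is a vector, shift() returns one shifted vector per (column, n) combination, ordered column by column. A sketch with made-up data (assumes data.table >= 1.9.6):

```r
library(data.table)

dt <- data.table(Var1 = 1:5, Var2 = 11:15)  # made-up example data
cols <- grep("Var", names(dt), value = TRUE)

# One call for all lags; names must follow shift()'s output order
# (all shifts of the first column, then all shifts of the second, ...)
lag_names <- paste(rep(cols, each = 2), "Lag", 1:2, sep = "_")
dt[, (lag_names) := shift(.SD, n = 1:2, type = "lag"), .SDcols = cols]
```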

Rolling multi regression in R data table

If I understand correctly you want a quarterly auto-regression.

There's a related thread on time-series with data.table here.

You can set up a rolling date in data.table like this (see the link above for more context):

#Example for quarterly data
quarterly[, rollDate:=leftBound]
storeData[, rollDate:=date]

setkey(quarterly,"rollDate")
setkey(storeData,"rollDate")

Since you only provided a few rows of example data, I extended the series through 2019 and made up random return values.

First, get your data set up:

require(forecast)
require(xts)
DT <- read.table(file("clipboard"))
dput(DT) # the dput was too long to display here
DT[, 1] <- as.POSIXct(strptime(DT[, 1], "%m/%d/%Y"))
DT[, 2] <- as.double(DT[, 2])
dat <- xts(DT$V2, order.by = DT$V1)

x.ts <- to.quarterly(dat) # 120 days

dat.Open dat.High dat.Low dat.Close
2016 Q1 1292 1292 1 698
2016 Q2 138 1290 3 239
2016 Q3 451 1285 5 780
2016 Q4 355 1243 27 1193
2017 Q1 878 1279 4 687
2017 Q2 794 1283 12 411
2017 Q3 858 1256 9 1222
2017 Q4 219 1282 15 117
2018 Q1 554 1286 32 432
2018 Q2 630 1272 30 46
2018 Q3 310 1288 18 979
2019 Q1 143 1291 10 184
2019 Q2 250 1289 8 441
2019 Q3 110 1220 23 571

Then you can do a rolling ARIMA model with or without re-estimation like this:

fit <- auto.arima(x.ts)
order <- arimaorder(fit)
h <- 1                                 # forecast horizon (undefined in the original)
n <- nrow(x.ts)
fcmat <- matrix(0, nrow = n, ncol = 1)
for (i in 1:n)
{
  x <- window(x.ts, end = 2017.99 + (i - 1) / 4)
  refit <- Arima(x, order = order[1:3], seasonal = order[4:6])
  fcmat[i, ] <- forecast(refit, h = h)$mean
}

Here's a good related resource with several examples of different ways you might construct this: http://robjhyndman.com/hyndsight/rolling-forecasts/

Calculating the lag for aggregated data

First try

I think the key is that lag() looks at the previous row of the dataframe, or at least the previous row of your group. It doesn't look up by value.

So one way to get what (I think) you want is just to arrange() the dataframe by your groups.

samp %>%
  group_by(shop_id, item_id) %>%
  arrange(shop_id, item_id) %>%
  mutate(lag_item_price2 = lag(item_price2))

Second try

You are mixing together data that has different groupings. Some columns (e.g. item_cnt_month and item_price) you want grouped by both item_id and shop_id. Other columns (item_cnt_month1 and item_price1) you want grouped by shop_id and item_category_id, and yet other columns have a third grouping.

This is an unnatural format for your data. As a result the solution to do what you want will be very complicated. We will have to (a) reformat the data to long format using gather(), (b) filter the data three separate times to isolate each set of factors that shares the same grouping, (c) regroup and rearrange the long data so that we can call lag() and get the expected output, which comes in the form of three separate data frames of unequal size. Finally, (d) we must reformat the data to match your input format.

library(tidyverse)

by_shop <- c('item_cnt_month2', 'item_price2')
by_shop_and_category <- c('item_cnt_month1', 'item_price1')
by_shop_and_item <- c('item_cnt_month', 'item_price')

long <-
  samp %>%
  gather(factor, value, contains("month"), contains("price"))

by_shop_df <-
  long %>%
  filter(factor %in% by_shop) %>%
  group_by(shop_id, factor, date_block_num)

by_shop_and_category_df <-
  long %>%
  filter(factor %in% by_shop_and_category) %>%
  group_by(shop_id, item_category_id, factor, date_block_num)

by_shop_and_item_df <-
  long %>%
  filter(factor %in% by_shop_and_item) %>%
  group_by(shop_id, item_id, factor, date_block_num)

lag_my_value <- function(df){
  df %>%
    summarize(value = first(value)) %>%
    arrange(date_block_num) %>%
    mutate(value = lag(value, 1)) %>%
    spread(factor, value)
}

my_dfs <- list(by_shop_df, by_shop_and_category_df, by_shop_and_item_df)

my_lagged_dfs <- map(my_dfs, lag_my_value)

final_answer <-
  samp %>%
  select(date_block_num, shop_id, item_category_id, item_id) %>%
  left_join(., my_lagged_dfs[[1]], by = c('shop_id', 'date_block_num')) %>%
  left_join(., my_lagged_dfs[[2]], by = c('shop_id', 'item_category_id', 'date_block_num')) %>%
  left_join(., my_lagged_dfs[[3]], by = c('shop_id', 'item_id', 'date_block_num'))

final_answer

