R Data.Table Grouping for Lagged Regression

A few additional notes prompted by Alex's comment. The reason it is hard to see what's going on here is that a lot happens within a single line, so it's always a good idea to break things down.
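
For reference, here is a minimal sketch of the kind of DT the answer assumes. The column names stock_id, date, and logret come from the question; the values below are made up to match the printed output:

```r
library(data.table)

# Hypothetical panel: two stocks with six days of log returns each
# (values made up to match the output shown below)
DT <- data.table(
  date     = rep(seq(as.Date("2011-01-01"), by = "day", length.out = 6), 2),
  stock_id = rep(1:2, each = 6),
  logret   = seq(0.001, by = 0.002, length.out = 12)
)
```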

What do we actually want? We want a new column lagret and the syntax to add a new column in data.table is the following:

DT[, lagret := xxx]

where xxx has to be filled up with whatever you want to have in column lagret. So if we just want a new column that gives us the rows, we could just call

DT[, lagret := seq(from=1, to=nrow(DT))]

Here we actually want the lagged value of logret, but we have to consider that there are many stocks in the table. That's why we do a self-join, i.e. we join the data.table DT with itself by the columns stock_id and date; since we want the previous value for each stock, we use date-1. Note that we have to set the keys first to do such a join:

setkeyv(DT,c('stock_id','date'))
DT[list(stock_id,date-1)]
stock_id date logret
1: 1 2010-12-31 NA
2: 1 2011-01-01 0.001
3: 1 2011-01-02 0.003
4: 1 2011-01-03 0.005
5: 1 2011-01-04 0.007
6: 1 2011-01-05 0.009
...

As you can see, we now have what we want: logret is lagged by one period. But we actually want that in a new column lagret in DT, so we take just that column by calling [[3L]] (which means nothing other than "give me the third column") and name the new column lagret:

DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-04 1 0.007 0.005
5: 2011-01-05 1 0.009 0.007
...

This is already the correct solution. In this simple case, we do not need roll=TRUE because there are no gaps in the dates. However, in a more realistic example (as mentioned above, for instance when we have weekends), there might be gaps. So let's make such a realistic example by just deleting two days in the DT for the first stock:

DT <- DT[-c(4, 5)]
setkeyv(DT,c('stock_id','date'))
DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-06 1 0.011 NA
5: 2011-01-01 2 0.013 NA
...

As you can see, the problem now is that we don't have a lagged value for the 6th of January, because the 4th and 5th were deleted. That's why we use roll=TRUE:

DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]
date stock_id logret lagret
1: 2011-01-01 1 0.001 NA
2: 2011-01-02 1 0.003 0.001
3: 2011-01-03 1 0.005 0.003
4: 2011-01-06 1 0.011 0.005
5: 2011-01-01 2 0.013 NA
...

Just have a look at the documentation for how roll=TRUE works exactly. In a nutshell: if it can't find the previous value (here, logret for the 5th of January), it takes the last available one (here, from the 3rd of January).

How to create a lag variable within each group?

You could do this within data.table:

library(data.table)
data[, lag.value := c(NA, value[-.N]), by = groups]
data
# time groups value lag.value
#1: 1 a 0.02779005 NA
#2: 2 a 0.88029938 0.02779005
#3: 3 a -1.69514201 0.88029938
#4: 1 b -1.27560288 NA
#5: 2 b -0.65976434 -1.27560288
#6: 3 b -1.37804943 -0.65976434
#7: 4 b 0.12041778 -1.37804943
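
Inside each group, .N is the group size, so c(NA, value[-.N]) drops the last element and prepends an NA, shifting the values down by one. A quick base-R illustration of the idiom:

```r
# c(NA, x[-length(x)]) shifts a vector down by one position:
# drop the last element, prepend NA
x <- c(10, 20, 30)
lagged <- c(NA, x[-length(x)])
lagged
# [1] NA 10 20
```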

For multiple columns:

nm1 <- grep("^value", colnames(data), value=TRUE)
nm2 <- paste("lag", nm1, sep=".")
data[, (nm2):=lapply(.SD, function(x) c(NA, x[-.N])), by=groups, .SDcols=nm1]
data
# time groups value value1 value2 lag.value lag.value1
#1: 1 b -0.6264538 0.7383247 1.12493092 NA NA
#2: 2 b 0.1836433 0.5757814 -0.04493361 -0.6264538 0.7383247
#3: 3 b -0.8356286 -0.3053884 -0.01619026 0.1836433 0.5757814
#4: 1 a 1.5952808 1.5117812 0.94383621 NA NA
#5: 2 a 0.3295078 0.3898432 0.82122120 1.5952808 1.5117812
#6: 3 a -0.8204684 -0.6212406 0.59390132 0.3295078 0.3898432
#7: 4 a 0.4874291 -2.2146999 0.91897737 -0.8204684 -0.6212406
# lag.value2
#1: NA
#2: 1.12493092
#3: -0.04493361
#4: NA
#5: 0.94383621
#6: 0.82122120
#7: 0.59390132

Update

From data.table v1.9.5 on, we can use shift() with type = "lag" or type = "lead". The default type is "lag".

data[, (nm2) :=  shift(.SD), by=groups, .SDcols=nm1]
# time groups value value1 value2 lag.value lag.value1
#1: 1 b -0.6264538 0.7383247 1.12493092 NA NA
#2: 2 b 0.1836433 0.5757814 -0.04493361 -0.6264538 0.7383247
#3: 3 b -0.8356286 -0.3053884 -0.01619026 0.1836433 0.5757814
#4: 1 a 1.5952808 1.5117812 0.94383621 NA NA
#5: 2 a 0.3295078 0.3898432 0.82122120 1.5952808 1.5117812
#6: 3 a -0.8204684 -0.6212406 0.59390132 0.3295078 0.3898432
#7: 4 a 0.4874291 -2.2146999 0.91897737 -0.8204684 -0.6212406
# lag.value2
#1: NA
#2: 1.12493092
#3: -0.04493361
#4: NA
#5: 0.94383621
#6: 0.82122120
#7: 0.59390132

If you need the reverse, use type = "lead":

nm3 <- paste("lead", nm1, sep=".")

Using the original dataset

data[, (nm3) := shift(.SD, type = "lead"), by = groups, .SDcols = nm1]
# time groups value value1 value2 lead.value lead.value1
#1: 1 b -0.6264538 0.7383247 1.12493092 0.1836433 0.5757814
#2: 2 b 0.1836433 0.5757814 -0.04493361 -0.8356286 -0.3053884
#3: 3 b -0.8356286 -0.3053884 -0.01619026 NA NA
#4: 1 a 1.5952808 1.5117812 0.94383621 0.3295078 0.3898432
#5: 2 a 0.3295078 0.3898432 0.82122120 -0.8204684 -0.6212406
#6: 3 a -0.8204684 -0.6212406 0.59390132 0.4874291 -2.2146999
#7: 4 a 0.4874291 -2.2146999 0.91897737 NA NA
# lead.value2
#1: -0.04493361
#2: -0.01619026
#3: NA
#4: 0.82122120
#5: 0.59390132
#6: 0.91897737
#7: NA

data

set.seed(1)
data <- data.table(time = c(1:3, 1:4), groups = rep(c("b", "a"), c(3, 4)),
                   value = rnorm(7), value1 = rnorm(7), value2 = rnorm(7))

Need a data.table method to use a lagged computed value by group

DT[, val2 := nafill(val, "locf") / 2^(1 + cumsum(is.na(val))), grp]
DT
yr grp val val2
1: 1 a 25 12.5
2: 2 a 20 10.0
3: 3 a NA 5.0
4: 4 a NA 2.5
5: 1 b 10 5.0
6: 2 b 12 6.0
7: 3 b NA 3.0
8: 4 b NA 1.5
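
The input DT is not shown in the answer; the sketch below reconstructs it from the printed output (this assumes data.table >= 1.12.4, where nafill() was introduced):

```r
library(data.table)

# Reconstructed from the printed output above
DT <- data.table(
  yr  = rep(1:4, 2),
  grp = rep(c("a", "b"), each = 4),
  val = c(25, 20, NA, NA, 10, 12, NA, NA)
)

# nafill(val, "locf") carries the last observed value forward within each
# group; 1 + cumsum(is.na(val)) counts how many halvings to apply, so each
# missing year halves the previous (already halved) value again.
DT[, val2 := nafill(val, "locf") / 2^(1 + cumsum(is.na(val))), grp]
```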

Create lagged variable in unbalanced panel data in R

Using a function tlag within groups defined by id:

library(dplyr)
tlag <- function(x, n = 1L, time) {
  index <- match(time - n, time, incomparables = NA)
  x[index]
}

df %>% group_by(id) %>% mutate(value_lagged = tlag(value, 1, time = date))
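
To see why tlag() is safe on gaps: match(time - n, time) looks the lag up by value rather than by row, so a missing period yields NA instead of the wrong row. A sketch with a made-up unbalanced panel (id 1 has no observation at date 3):

```r
library(dplyr)

tlag <- function(x, n = 1L, time) {
  index <- match(time - n, time, incomparables = NA)
  x[index]
}

# Made-up unbalanced panel: id 1 skips date 3
df <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  date  = c(1, 2, 4, 1, 2),
  value = c(10, 11, 13, 20, 21)
)

out <- df %>% group_by(id) %>% mutate(value_lagged = tlag(value, 1, time = date))
# id 1 at date 4 gets NA (there is no date-3 observation), not 11
```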

Grouped count aggregation in R data.table

## by date
DT[, list(total_buys = sum(buy > 0), total_sells = sum(sell > 0)), by = date]
## date total_buys total_sells
## 1: 2011-01-01 1 0
## 2: 2011-01-02 0 2
## 3: 2011-01-03 2 1
## 4: 2011-01-04 1 0
## 5: 2011-01-05 0 0
## 6: 2011-01-06 0 1

DT[, list(total_buys = sum(buy > 0), total_sells = sum(sell > 0))]
## total_buys total_sells
## 1: 4 4
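
The aggregation works because sum() over a logical vector counts the TRUE values, so sum(buy > 0) counts the rows with at least one buy. In base R (made-up vector for illustration):

```r
# sum() over a logical counts the TRUEs
buy <- c(0, 3, 0, 1, 2, 0)
sum(buy > 0)
# [1] 3
```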

Adding multiple columns from a function to data.table using := within groups - without specifying the LHS

To get the same result as IShouldBuyABoat's answer, try:

data.table(dt, dt[, my.fun(.SD), by = grp][, grp := NULL])
grp period y1 y2 y1.lag y2.lag y1y2
1: a 1 1.36677 -0.81025 NA NA -1.107425
2: a 2 0.43528 1.04277 1.36677 -0.8102 0.453895
3: a 3 -1.40229 0.66223 0.43528 1.0428 -0.928633
4: a 4 1.43362 0.10293 -1.40229 0.6622 0.147560
5: a 5 0.46713 0.72508 1.43362 0.1029 0.338705
6: b 1 -0.04418 -0.20014 NA NA 0.008843
7: b 2 1.32390 0.19651 -0.04418 -0.2001 0.260160
8: b 3 -0.82543 1.11483 1.32390 0.1965 -0.920215
9: b 4 -1.26415 0.53213 -0.82543 1.1148 -0.672698
10: b 5 0.14549 0.04128 -1.26415 0.5321 0.006005

Note: you can use .SDcols to specify the columns passed to my.fun. The [, grp := NULL] suppresses the duplicate grp column.

R - How do I lag/lead multiple columns in a data.table by multiple periods most efficiently

How's this in a nice little for loop:

cols <- grep("Var", names(dt), value = TRUE)
for (i in 1:2) { # update for the number of shifts

  lag_names <- paste(cols, "Lag", i, sep = "_")
  dt[, (lag_names) := lapply(.SD, shift, i, type = "lag"), .SDcols = cols]

  lead_names <- paste(cols, "Lead", i, sep = "_")
  dt[, (lead_names) := lapply(.SD, shift, i, type = "lead"), .SDcols = cols]

}
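
As an alternative to the loop, shift() also accepts a vector of shift amounts, so each direction needs only one call. When n is a vector, shift() returns one shifted vector per (column, n) combination, ordered column by column. A sketch with made-up data (assumes data.table >= 1.9.6):

```r
library(data.table)

dt <- data.table(Var1 = 1:5, Var2 = 11:15)  # made-up example data
cols <- grep("Var", names(dt), value = TRUE)

# One call for all lags; names must follow shift()'s output order
# (all shifts of the first column, then all shifts of the second, ...)
lag_names <- paste(rep(cols, each = 2), "Lag", 1:2, sep = "_")
dt[, (lag_names) := shift(.SD, n = 1:2, type = "lag"), .SDcols = cols]
```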

Rolling multi regression in R data table

If I understand correctly you want a quarterly auto-regression.

There's a related thread on time-series with data.table here.

You can set up a rolling date in data.table like this (see the link above for more context):

#Example for quarterly data
quarterly[, rollDate:=leftBound]
storeData[, rollDate:=date]

setkey(quarterly,"rollDate")
setkey(storeData,"rollDate")

Since you only provided a few rows of example data, I extended the series through 2019 and made up random return values.

First, get your data set up:

require(forecast)
require(xts)
DT <- read.table(file("clipboard"))
dput(DT) # the dput was too long to display here
DT[, 1] <- as.POSIXct(strptime(DT[, 1], "%m/%d/%Y"))
DT[, 2] <- as.double(DT[, 2])
dat <- xts(DT$V2, order.by = DT$V1)

x.ts <- to.quarterly(dat) # 120 days

dat.Open dat.High dat.Low dat.Close
2016 Q1 1292 1292 1 698
2016 Q2 138 1290 3 239
2016 Q3 451 1285 5 780
2016 Q4 355 1243 27 1193
2017 Q1 878 1279 4 687
2017 Q2 794 1283 12 411
2017 Q3 858 1256 9 1222
2017 Q4 219 1282 15 117
2018 Q1 554 1286 32 432
2018 Q2 630 1272 30 46
2018 Q3 310 1288 18 979
2019 Q1 143 1291 10 184
2019 Q2 250 1289 8 441
2019 Q3 110 1220 23 571

Then you can do a rolling ARIMA model with or without re-estimation like this:

fit <- auto.arima(x.ts)
order <- arimaorder(fit)
h <- 1                                 # forecast horizon (undefined in the original)
n <- nrow(x.ts)
fcmat <- matrix(0, nrow = n, ncol = 1)
for (i in 1:n)
{
  x <- window(x.ts, end = 2017.99 + (i - 1) / 4)
  refit <- Arima(x, order = order[1:3], seasonal = order[4:6])
  fcmat[i, ] <- forecast(refit, h = h)$mean
}

Here's a good related resource with several examples of different ways you might construct this: http://robjhyndman.com/hyndsight/rolling-forecasts/

Calculating the lag for aggregated data

First try

I think the key is that lag() looks at the previous row of the dataframe, or at least the previous row of your group. It doesn't look up by value.

So one way to get what (I think) you want is just to arrange() the dataframe by your groups.

samp %>%
  group_by(shop_id, item_id) %>%
  arrange(shop_id, item_id) %>%
  mutate(lag_item_price2 = lag(item_price2))

Second try

You are mixing together data that has different groupings. Some columns (e.g. item_cnt_month and item_price) you want grouped by both item_id and shop_id. Other columns (item_cnt_month1 and item_price1) you want grouped by shop_id and item_category_id, and yet other columns have a third grouping.

This is an unnatural format for your data. As a result the solution to do what you want will be very complicated. We will have to (a) reformat the data to long format using gather(), (b) filter the data three separate times to isolate each set of factors that shares the same grouping, (c) regroup and rearrange the long data so that we can call lag() and get the expected output, which comes in the form of three separate data frames of unequal size. Finally, (d) we must reformat the data to match your input format.

library(tidyverse)

by_shop <- c('item_cnt_month2', 'item_price2')
by_shop_and_category <- c('item_cnt_month1', 'item_price1')
by_shop_and_item <- c('item_cnt_month', 'item_price')

long <-
  samp %>%
  gather(factor, value, contains("month"), contains("price"))

by_shop_df <-
  long %>%
  filter(factor %in% by_shop) %>%
  group_by(shop_id, factor, date_block_num)

by_shop_and_category_df <-
  long %>%
  filter(factor %in% by_shop_and_category) %>%
  group_by(shop_id, item_category_id, factor, date_block_num)

by_shop_and_item_df <-
  long %>%
  filter(factor %in% by_shop_and_item) %>%
  group_by(shop_id, item_id, factor, date_block_num)

lag_my_value <- function(df){
  df %>%
    summarize(value = first(value)) %>%
    arrange(date_block_num) %>%
    mutate(value = lag(value, 1)) %>%
    spread(factor, value)
}

my_dfs <- list(by_shop_df, by_shop_and_category_df, by_shop_and_item_df)

my_lagged_dfs <- map(my_dfs, lag_my_value)

final_answer <-
  samp %>%
  select(date_block_num, shop_id, item_category_id, item_id) %>%
  left_join(., my_lagged_dfs[[1]], by = c('shop_id', 'date_block_num')) %>%
  left_join(., my_lagged_dfs[[2]], by = c('shop_id', 'item_category_id', 'date_block_num')) %>%
  left_join(., my_lagged_dfs[[3]], by = c('shop_id', 'item_id', 'date_block_num'))

final_answer

