Adding Lagged Variables to an Lm Model

linear regression model with lagging effect/interaction

I work in econometrics. I use this

df <- data.frame(X1=(seq(1,10,by=1))) 

library(dplyr)
df <- 
  df %>%
  mutate(X1_lag1 = dplyr::lag(X1, n = 1, default = NA))
df

> df
   X1 X1_lag1
1   1      NA
2   2       1
3   3       2
4   4       3
5   5       4
6   6       5
7   7       6
8   8       7
9   9       8
10 10       9

whereby df is dataframe, and X1 is the variable you want to lag, and X1_lag1 is the new lagged variable.
You can put a decay on the 0.5*X1_lag1 on the lagged variable, create permutations. In SAS I can put a coefficient on the decay factor but it gets very messy. I rather create multiple permutations of decay and see which has best fit.

Making a long-term forecast with multiple linear regression and lagged variables in R

Using my_df from the question convert it to zoo and then run dyn$lm. The original problem was that there was a datetime field that was character so zoo(my_df) converted it to a character object. If we use read.zoo and use it to convert the datetime field to POSIXct and tell it that that column is the index -- read.zoo assumes that the first column is the index unless specified otherwise -- then it works. Also note that there are date times in the data that conflict with daylight savings time so the time zone must be specified as UTC.

Next follow the code at the end of ?dyn to get predictions. Since this can be very slow with large amounts of data we have just shown 3 predictions below.

As per my comment be sure that dplyr is not loaded since it clobbers lag.

library(dyn)
    
z <- read.zoo(my_df, format = "%d/%m/%Y %H:%M", tz = "UTC")
fo <- LOAD ~ TEMPERATURE + MONTH + WEEKDAY + HOUR + lag(LOAD, -(1:24))
fm <- dyn$lm(fo, z)

fc.orig <- read.zoo(forecast_df, format = "%d/%m/%Y %H:%M", tz = "UTC")

# fc <- fc.orig
fc <- head(fc.orig, 3)

LOAD0 <- zoo(cbind(LOAD = 0), time(fc))
both <- rbind(z, cbind(LOAD0, fc))

for(i in seq(nrow(z) + 1, nrow(both))) {
   fit <- dyn$lm(fo, both, subset = seq_len(i-1))
   both[i, "LOAD"] <- tail(predict(fit, both[1:i, ]), 1)
}

# extract the forecast rows    
fc_new <- both[seq(nrow(z) + 1, nrow(both)), ]

Note

my_df <- read.csv(file = 
"https://raw.githubusercontent.com/Argiro1983/Load/main/my_df.csv", sep=";")
forecast_df <- read.csv(file = "https://raw.githubusercontent.com/Argiro1983/Load/main/forecast_df.csv", sep=";")

Using updating a linear model with lagged new variables

A possible solution for your problem can be:

# Data generating process
yX <- as.data.frame(matrix(rnorm(1000),ncol=5))
names(yX) <- c("y", paste("x",1:4,sep=""))

# Start with a linear model with x1 and x2 as explanatory variables
f1 <- as.formula(y ~ x1 + x2)
fit <- lm(f1, data=yX)

# Add lagged x3 and x4 variables
fmla <- as.formula(paste('~.+',paste("lag(",addvars,",2)", collapse = '+')))
update(fit,  fmla)

# Call:
# lm(formula = y ~ x1 + x2 + lag(x3, 2) + lag(x4, 2), data = yX)
# 
# Coefficients:
# (Intercept)           x1           x2   lag(x3, 2)   lag(x4, 2)  
#   -0.083180     0.015753     0.041998     0.000612    -0.093265

Below an example with the dynlm package.

data("USDistLag", package = "lmtest")

# Start with a dynamic linear model with gnp as explanatory variables
library(dynlm)
f1 <- as.formula(consumption ~ gnp)
( fit <- dynlm(f1, data=USDistLag) )

# Time series regression with "ts" data:
# Start = 1963, End = 1982
# 
# Call:
# dynlm(formula = f1, data = USDistLag)
# 
# Coefficients:
# (Intercept)          gnp  
#    -24.0889       0.6448

# Add lagged gnp 
addvars <- c("gnp")
fmla <- as.formula(paste('~.+',paste("lag(",addvars,",2)", collapse = '+')))
update(fit,  fmla)

# Time series regression with "ts" data:
# Start = 1963, End = 1980
# 
# Call:
# dynlm(formula = consumption ~ gnp + lag(gnp, 2), data = USDistLag)
# 
# Coefficients:
# (Intercept)          gnp  lag(gnp, 2)  
#    -31.1437       0.5366       0.1067

How to include lagged variables in statsmodel ols regression

I don't think you can call it on the fly in the formula. Maybe using the shift method? Do clarify if this is not what you need

import statsmodels.api as sm
df['xlag'] = df['x'].shift()
df

   y  x  v  xlag
0  2  8  4   NaN
1  3  6  3   8.0
2  7  2  1   6.0
3  8  1  3   2.0
4  1  9  8   1.0

sm.formula.ols(formula = 'y ~ xlag + v', data=df).fit()

i am confused with the R implementation of lag in Regression analysis

The question starts out with:

look at this linear regression: Y ~ X + lag(X,1) ,the meaning is very clear
that it is trying to do a linear regression. and the lag(X,1) means the first
lag of X

Actually that is not the case. It does not refer to this model:

Y[i] = a + b * X[i] + c * X[i-1] + error[i]

It actually refers to this model:

Y[i] = a + b * X[i] + c * X[i+1] + error[i]

which is not likely what you intended.

It is likely that you wanted lag(X, -1) rather than lag(X, 1). Lagging a series in R means that the lagged series starts earlier which implies that the series itself moves forward.

The other item to be careful of is that lm does not align series. It knows nothing about the time index. You will need to align the series yourself or use a package which does it for you.

More on these points below.

First let us consider lag.ts from the core of R since lag.zoo and lag.zooreg are based on it and consistent with it. lag.ts lags the times of the series so that the lagged series starts earlier. That is if we have a series whose values are 11, 12, 13 and 14 at times 1, 2, 3 and 4 respectively lag.ts lags each time so that the lagged series has the same values 11, 12, 13 and 14 but at the times 0, 1, 2, 3. The original series started at 1 but the lagged series starts at 0. Originally the value 12 was at time 2 but in the lagged series the value 13 is at time 2. In code, we have:

tt <- ts(11:14)
cbind(tt, lag(tt), lag(tt, 1), lag(tt, -1))

gives:

Time Series:
Start = 0 
End = 5 
Frequency = 1 
  tt lag(tt) lag(tt, 1) lag(tt, -1)
0 NA      11         11          NA
1 11      12         12          NA
2 12      13         13          11
3 13      14         14          12
4 14      NA         NA          13
5 NA      NA         NA          14

zoo

lag.zoo is consistent with lag.ts. Note that since zoo represents irrelgularly spaced series it cannot assume that time 0 comes before time 1. We could only make such an assumption if we knew the series were regularly spaced. Thus if time 1 is the earliest time in a series the value at this time is dropped since there is no way to determine what earlier time to lag it to. The new lagged series now starts at the second time value in the original series. This is similar to the lag.ts example except in the lag.ts there was a 0 time and in this example there is no such time. Similarly we cannot extend the time scale forward in time either.

library(zoo)
z <- zoo(11:14)
merge(z, lag(z), lag(z, 1), lag(z,-1))

giving:

   z lag(z) lag(z, 1) lag(z, -1)
1 11     12        12         NA
2 12     13        13         11
3 13     14        14         12
4 14     NA        NA         13

zooreg

The zoo package does have a zooreg class which assumes regularly spaced series except for some missing values and it can deduce what comes before just as ts can. With zooreg it can deduce that time 0 comes before and time 5 comes after.

library(zoo)
zr <- zooreg(11:14)
merge(zr, lag(zr), lag(zr, 1), lag(zr,-1))

giving:

  zr lag(zr) lag(zr, 1) lag(zr, -1)
0 NA      11         11          NA
1 11      12         12          NA
2 12      13         13          11
3 13      14         14          12
4 14      NA         NA          13
5 NA      NA         NA          14

lm does not know anything about zoo and will ignore the time index entirely. If you want to not ignore it, i.e. you want to align the series involved prior to running the regression, use the dyn (or dynlm) package. Using the former:

library(dyn)
set.seed(123)
zr <- zooreg(rnorm(10))
y <- 1 + 2 * zr + 3 * lag(zr, -1)
dyn$lm(y ~ zr + lag(zr, -1))

giving:

Call:
lm(formula = dyn(y ~ zr + lag(zr, -1)))

Coefficients:
(Intercept)           zr  lag(zr, -1)  
          1            2            3

Note 1: Be sure to read the documentation in the help files: ?lag.ts , ?lag.zoo , ?lag.zooreg and help(package = dyn)

Note 2: If the direction of the lag seems confusing you could define your own function and use that in place of lag. For example, this gives the same coefficients as the lm output shown above:

Lag <- function(x, k = 1) lag(x, -k)
dyn$lm(y ~ zr + Lag(zr))

An additional word of warning is that unlike lag.zoo and lag.zooreg which are consistent with the core of R, lag.xts from the xts package is inconsistent. Also the lag in dplyr is also inconsistent (and to make things worse if you load dplyr then dplyr will mask lag in R with its own inconsistent version of lag. Also note that L in dynlm works the same as Lag but wisely used a different name to avoid confusion.

Lagged linear model to identify correlation in age frequency data

I am adding another answer upon your comment on 7.27.2020. The plot does not have numbers, but it gives some idea about the numbers that I should have in the IVS matrix. Please try the following code and see if it makes sense.

tmp = wi.age.count[order(wi.age.count$Age), ]
ivs = reshape(tmp[which(tmp$Age != 0), -4], direction = "wide", idvar = "Year", timevar = "Age")
ivs[is.na(ivs)] = 0

> ivs
   Year n.1 n.2 n.3 n.4 n.5 n.6 n.7 n.8 n.9
13 2007   8  17   0   0   1   0   0   0   0
16 2008   4  19   1   0   0   0   0   0   0
20 2009  46  37  52   5   1   1   0   0   0
26 2010  19  41  15  16   0   0   0   0   1
32 2011  13   4  26  12  11   1   1   1   0
41 2012  87  15  13  27  13  17   1   0   0
49 2013  32  30   3   4   1   1   1   0   0
57 2014  24  15  23   6   2   1   2   2   0
66 2015  18  13  31  28   3   3   6   0   1
74 2016   4   6   1   5   9   1   0   0   1
82 2017  16  16   8   1   1   4   0   0   0
89 2018  12   4   7   2   1   2   1   0   0

This is your ivs matrix. Does that look correct?

Everything else is the same. Here is your dv matrix:

dv = wi.age.count[which(wi.age.count$id == "YOY"), c(1, 3)]

> dv
   Year   n
1  2008 166
2  2009  28
3  2010  34
4  2011  77
5  2012 170
6  2013  18
7  2014   3
8  2015  22
9  2016  43
10 2017  50
11 2018 151

And your formula with three lags.

formula = ""
for (i in 2:4) formula = paste(formula, "+",  names(ivs)[i])
formula = paste("n ~", substr(formula, 4, nchar(formula)))

> formula
[1] "n ~ n.1 + n.2 + n.3"

And here are the results:

l.fit = lm(formula, merge(dv, ivs))
AIC.l.fit <- signif(AIC(l.fit), digits = 3)
summary(l.fit)

Call:
lm(formula = formula, data = merge(dv, ivs))

Residuals:
    Min      1Q  Median      3Q     Max 
-60.367 -38.028   8.698  23.763  96.257 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  99.8976    36.1761   2.761    0.028 *
n.1           1.1059     0.8388   1.318    0.229  
n.2          -1.7339     1.5773  -1.099    0.308  
n.3          -1.6346     1.2932  -1.264    0.247  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 59.48 on 7 degrees of freedom
Multiple R-squared:  0.3731,    Adjusted R-squared:  0.1044 
F-statistic: 1.389 on 3 and 7 DF,  p-value: 0.3233

> AIC.l.fit
[1] 126

Adding Lagged Variables to an Lm Model

linear regression model with lagging effect/interaction

Making a long-term forecast with multiple linear regression and lagged variables in R

Note

Using updating a linear model with lagged new variables

How to include lagged variables in statsmodel ols regression

i am confused with the R implementation of lag in Regression analysis

Lagged linear model to identify correlation in age frequency data

Related Topics

Leave a reply