Adding Lagged Variables to an Lm Model

linear regression model with lagging effect/interaction

I work in econometrics. I use this

df <- data.frame(X1=(seq(1,10,by=1))) 

library(dplyr)
df <-
df %>%
mutate(X1_lag1 = dplyr::lag(X1, n = 1, default = NA))
df

> df
X1 X1_lag1
1 1 NA
2 2 1
3 3 2
4 4 3
5 5 4
6 6 5
7 7 6
8 8 7
9 9 8
10 10 9

whereby df is dataframe, and X1 is the variable you want to lag, and X1_lag1 is the new lagged variable.
You can put a decay on the 0.5*X1_lag1 on the lagged variable, create permutations. In SAS I can put a coefficient on the decay factor but it gets very messy. I rather create multiple permutations of decay and see which has best fit.

Making a long-term forecast with multiple linear regression and lagged variables in R

Using my_df from the question convert it to zoo and then run dyn$lm. The original problem was that there was a datetime field that was character so zoo(my_df) converted it to a character object. If we use read.zoo and use it to convert the datetime field to POSIXct and tell it that that column is the index -- read.zoo assumes that the first column is the index unless specified otherwise -- then it works. Also note that there are date times in the data that conflict with daylight savings time so the time zone must be specified as UTC.

Next follow the code at the end of ?dyn to get predictions. Since this can be very slow with large amounts of data we have just shown 3 predictions below.

As per my comment be sure that dplyr is not loaded since it clobbers lag.

library(dyn)

z <- read.zoo(my_df, format = "%d/%m/%Y %H:%M", tz = "UTC")
fo <- LOAD ~ TEMPERATURE + MONTH + WEEKDAY + HOUR + lag(LOAD, -(1:24))
fm <- dyn$lm(fo, z)

fc.orig <- read.zoo(forecast_df, format = "%d/%m/%Y %H:%M", tz = "UTC")

# fc <- fc.orig
fc <- head(fc.orig, 3)

LOAD0 <- zoo(cbind(LOAD = 0), time(fc))
both <- rbind(z, cbind(LOAD0, fc))

for(i in seq(nrow(z) + 1, nrow(both))) {
fit <- dyn$lm(fo, both, subset = seq_len(i-1))
both[i, "LOAD"] <- tail(predict(fit, both[1:i, ]), 1)
}

# extract the forecast rows
fc_new <- both[seq(nrow(z) + 1, nrow(both)), ]

Note

my_df <- read.csv(file = 
"https://raw.githubusercontent.com/Argiro1983/Load/main/my_df.csv", sep=";")
forecast_df <- read.csv(file = "https://raw.githubusercontent.com/Argiro1983/Load/main/forecast_df.csv", sep=";")

Using updating a linear model with lagged new variables

A possible solution for your problem can be:

# Data generating process
yX <- as.data.frame(matrix(rnorm(1000),ncol=5))
names(yX) <- c("y", paste("x",1:4,sep=""))

# Start with a linear model with x1 and x2 as explanatory variables
f1 <- as.formula(y ~ x1 + x2)
fit <- lm(f1, data=yX)

# Add lagged x3 and x4 variables
fmla <- as.formula(paste('~.+',paste("lag(",addvars,",2)", collapse = '+')))
update(fit, fmla)

# Call:
# lm(formula = y ~ x1 + x2 + lag(x3, 2) + lag(x4, 2), data = yX)
#
# Coefficients:
# (Intercept) x1 x2 lag(x3, 2) lag(x4, 2)
# -0.083180 0.015753 0.041998 0.000612 -0.093265

Below an example with the dynlm package.

data("USDistLag", package = "lmtest")

# Start with a dynamic linear model with gnp as explanatory variables
library(dynlm)
f1 <- as.formula(consumption ~ gnp)
( fit <- dynlm(f1, data=USDistLag) )

# Time series regression with "ts" data:
# Start = 1963, End = 1982
#
# Call:
# dynlm(formula = f1, data = USDistLag)
#
# Coefficients:
# (Intercept) gnp
# -24.0889 0.6448

# Add lagged gnp
addvars <- c("gnp")
fmla <- as.formula(paste('~.+',paste("lag(",addvars,",2)", collapse = '+')))
update(fit, fmla)

# Time series regression with "ts" data:
# Start = 1963, End = 1980
#
# Call:
# dynlm(formula = consumption ~ gnp + lag(gnp, 2), data = USDistLag)
#
# Coefficients:
# (Intercept) gnp lag(gnp, 2)
# -31.1437 0.5366 0.1067

How to include lagged variables in statsmodel ols regression

I don't think you can call it on the fly in the formula. Maybe using the shift method? Do clarify if this is not what you need

import statsmodels.api as sm
df['xlag'] = df['x'].shift()
df

y x v xlag
0 2 8 4 NaN
1 3 6 3 8.0
2 7 2 1 6.0
3 8 1 3 2.0
4 1 9 8 1.0

sm.formula.ols(formula = 'y ~ xlag + v', data=df).fit()

i am confused with the R implementation of lag in Regression analysis

The question starts out with:

look at this linear regression: Y ~ X + lag(X,1) ,the meaning is very clear
that it is trying to do a linear regression. and the lag(X,1) means the first
lag of X

Actually that is not the case. It does not refer to this model:

Y[i] = a + b * X[i] + c * X[i-1] + error[i]

It actually refers to this model:

Y[i] = a + b * X[i] + c * X[i+1] + error[i]

which is not likely what you intended.

It is likely that you wanted lag(X, -1) rather than lag(X, 1). Lagging a series in R means that the lagged series starts earlier which implies that the series itself moves forward.

The other item to be careful of is that lm does not align series. It knows nothing about the time index. You will need to align the series yourself or use a package which does it for you.

More on these points below.

ts

First let us consider lag.ts from the core of R since lag.zoo and lag.zooreg are based on it and consistent with it. lag.ts lags the times of the series so that the lagged series starts earlier. That is if we have a series whose values are 11, 12, 13 and 14 at times 1, 2, 3 and 4 respectively lag.ts lags each time so that the lagged series has the same values 11, 12, 13 and 14 but at the times 0, 1, 2, 3. The original series started at 1 but the lagged series starts at 0. Originally the value 12 was at time 2 but in the lagged series the value 13 is at time 2. In code, we have:

tt <- ts(11:14)
cbind(tt, lag(tt), lag(tt, 1), lag(tt, -1))

gives:

Time Series:
Start = 0
End = 5
Frequency = 1
tt lag(tt) lag(tt, 1) lag(tt, -1)
0 NA 11 11 NA
1 11 12 12 NA
2 12 13 13 11
3 13 14 14 12
4 14 NA NA 13
5 NA NA NA 14

zoo

lag.zoo is consistent with lag.ts. Note that since zoo represents irrelgularly spaced series it cannot assume that time 0 comes before time 1. We could only make such an assumption if we knew the series were regularly spaced. Thus if time 1 is the earliest time in a series the value at this time is dropped since there is no way to determine what earlier time to lag it to. The new lagged series now starts at the second time value in the original series. This is similar to the lag.ts example except in the lag.ts there was a 0 time and in this example there is no such time. Similarly we cannot extend the time scale forward in time either.

library(zoo)
z <- zoo(11:14)
merge(z, lag(z), lag(z, 1), lag(z,-1))

giving:

   z lag(z) lag(z, 1) lag(z, -1)
1 11 12 12 NA
2 12 13 13 11
3 13 14 14 12
4 14 NA NA 13

zooreg

The zoo package does have a zooreg class which assumes regularly spaced series except for some missing values and it can deduce what comes before just as ts can. With zooreg it can deduce that time 0 comes before and time 5 comes after.

library(zoo)
zr <- zooreg(11:14)
merge(zr, lag(zr), lag(zr, 1), lag(zr,-1))

giving:

  zr lag(zr) lag(zr, 1) lag(zr, -1)
0 NA 11 11 NA
1 11 12 12 NA
2 12 13 13 11
3 13 14 14 12
4 14 NA NA 13
5 NA NA NA 14

lm

lm does not know anything about zoo and will ignore the time index entirely. If you want to not ignore it, i.e. you want to align the series involved prior to running the regression, use the dyn (or dynlm) package. Using the former:

library(dyn)
set.seed(123)
zr <- zooreg(rnorm(10))
y <- 1 + 2 * zr + 3 * lag(zr, -1)
dyn$lm(y ~ zr + lag(zr, -1))

giving:

Call:
lm(formula = dyn(y ~ zr + lag(zr, -1)))

Coefficients:
(Intercept) zr lag(zr, -1)
1 2 3

Note 1: Be sure to read the documentation in the help files: ?lag.ts , ?lag.zoo , ?lag.zooreg and help(package = dyn)

Note 2: If the direction of the lag seems confusing you could define your own function and use that in place of lag. For example, this gives the same coefficients as the lm output shown above:

Lag <- function(x, k = 1) lag(x, -k)
dyn$lm(y ~ zr + Lag(zr))

An additional word of warning is that unlike lag.zoo and lag.zooreg which are consistent with the core of R, lag.xts from the xts package is inconsistent. Also the lag in dplyr is also inconsistent (and to make things worse if you load dplyr then dplyr will mask lag in R with its own inconsistent version of lag. Also note that L in dynlm works the same as Lag but wisely used a different name to avoid confusion.

Lagged linear model to identify correlation in age frequency data

I am adding another answer upon your comment on 7.27.2020. The plot does not have numbers, but it gives some idea about the numbers that I should have in the IVS matrix. Please try the following code and see if it makes sense.

tmp = wi.age.count[order(wi.age.count$Age), ]
ivs = reshape(tmp[which(tmp$Age != 0), -4], direction = "wide", idvar = "Year", timevar = "Age")
ivs[is.na(ivs)] = 0

> ivs
Year n.1 n.2 n.3 n.4 n.5 n.6 n.7 n.8 n.9
13 2007 8 17 0 0 1 0 0 0 0
16 2008 4 19 1 0 0 0 0 0 0
20 2009 46 37 52 5 1 1 0 0 0
26 2010 19 41 15 16 0 0 0 0 1
32 2011 13 4 26 12 11 1 1 1 0
41 2012 87 15 13 27 13 17 1 0 0
49 2013 32 30 3 4 1 1 1 0 0
57 2014 24 15 23 6 2 1 2 2 0
66 2015 18 13 31 28 3 3 6 0 1
74 2016 4 6 1 5 9 1 0 0 1
82 2017 16 16 8 1 1 4 0 0 0
89 2018 12 4 7 2 1 2 1 0 0

This is your ivs matrix. Does that look correct?

Everything else is the same. Here is your dv matrix:

dv = wi.age.count[which(wi.age.count$id == "YOY"), c(1, 3)]

> dv
Year n
1 2008 166
2 2009 28
3 2010 34
4 2011 77
5 2012 170
6 2013 18
7 2014 3
8 2015 22
9 2016 43
10 2017 50
11 2018 151

And your formula with three lags.

formula = ""
for (i in 2:4) formula = paste(formula, "+", names(ivs)[i])
formula = paste("n ~", substr(formula, 4, nchar(formula)))

> formula
[1] "n ~ n.1 + n.2 + n.3"

And here are the results:

l.fit = lm(formula, merge(dv, ivs))
AIC.l.fit <- signif(AIC(l.fit), digits = 3)
summary(l.fit)

Call:
lm(formula = formula, data = merge(dv, ivs))

Residuals:
Min 1Q Median 3Q Max
-60.367 -38.028 8.698 23.763 96.257

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 99.8976 36.1761 2.761 0.028 *
n.1 1.1059 0.8388 1.318 0.229
n.2 -1.7339 1.5773 -1.099 0.308
n.3 -1.6346 1.2932 -1.264 0.247
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 59.48 on 7 degrees of freedom
Multiple R-squared: 0.3731, Adjusted R-squared: 0.1044
F-statistic: 1.389 on 3 and 7 DF, p-value: 0.3233

> AIC.l.fit
[1] 126


Related Topics



Leave a reply



Submit