linear regression model with lagging effect/interaction
I work in econometrics. I use this
df <- data.frame(X1=(seq(1,10,by=1)))
library(dplyr)
df <-
df %>%
mutate(X1_lag1 = dplyr::lag(X1, n = 1, default = NA))
df
> df
X1 X1_lag1
1 1 NA
2 2 1
3 3 2
4 4 3
5 5 4
6 6 5
7 7 6
8 8 7
9 9 8
10 10 9
whereby df is dataframe, and X1 is the variable you want to lag, and X1_lag1 is the new lagged variable.
You can put a decay on the 0.5*X1_lag1 on the lagged variable, create permutations. In SAS I can put a coefficient on the decay factor but it gets very messy. I rather create multiple permutations of decay and see which has best fit.
Making a long-term forecast with multiple linear regression and lagged variables in R
Using my_df from the question convert it to zoo and then run dyn$lm. The original problem was that there was a datetime field that was character so zoo(my_df) converted it to a character object. If we use read.zoo and use it to convert the datetime field to POSIXct and tell it that that column is the index -- read.zoo assumes that the first column is the index unless specified otherwise -- then it works. Also note that there are date times in the data that conflict with daylight savings time so the time zone must be specified as UTC.
Next follow the code at the end of ?dyn
to get predictions. Since this can be very slow with large amounts of data we have just shown 3 predictions below.
As per my comment be sure that dplyr is not loaded since it clobbers lag.
library(dyn)
z <- read.zoo(my_df, format = "%d/%m/%Y %H:%M", tz = "UTC")
fo <- LOAD ~ TEMPERATURE + MONTH + WEEKDAY + HOUR + lag(LOAD, -(1:24))
fm <- dyn$lm(fo, z)
fc.orig <- read.zoo(forecast_df, format = "%d/%m/%Y %H:%M", tz = "UTC")
# fc <- fc.orig
fc <- head(fc.orig, 3)
LOAD0 <- zoo(cbind(LOAD = 0), time(fc))
both <- rbind(z, cbind(LOAD0, fc))
for(i in seq(nrow(z) + 1, nrow(both))) {
fit <- dyn$lm(fo, both, subset = seq_len(i-1))
both[i, "LOAD"] <- tail(predict(fit, both[1:i, ]), 1)
}
# extract the forecast rows
fc_new <- both[seq(nrow(z) + 1, nrow(both)), ]
Note
my_df <- read.csv(file =
"https://raw.githubusercontent.com/Argiro1983/Load/main/my_df.csv", sep=";")
forecast_df <- read.csv(file = "https://raw.githubusercontent.com/Argiro1983/Load/main/forecast_df.csv", sep=";")
Using updating a linear model with lagged new variables
A possible solution for your problem can be:
# Data generating process
yX <- as.data.frame(matrix(rnorm(1000),ncol=5))
names(yX) <- c("y", paste("x",1:4,sep=""))
# Start with a linear model with x1 and x2 as explanatory variables
f1 <- as.formula(y ~ x1 + x2)
fit <- lm(f1, data=yX)
# Add lagged x3 and x4 variables
fmla <- as.formula(paste('~.+',paste("lag(",addvars,",2)", collapse = '+')))
update(fit, fmla)
# Call:
# lm(formula = y ~ x1 + x2 + lag(x3, 2) + lag(x4, 2), data = yX)
#
# Coefficients:
# (Intercept) x1 x2 lag(x3, 2) lag(x4, 2)
# -0.083180 0.015753 0.041998 0.000612 -0.093265
Below an example with the dynlm
package.
data("USDistLag", package = "lmtest")
# Start with a dynamic linear model with gnp as explanatory variables
library(dynlm)
f1 <- as.formula(consumption ~ gnp)
( fit <- dynlm(f1, data=USDistLag) )
# Time series regression with "ts" data:
# Start = 1963, End = 1982
#
# Call:
# dynlm(formula = f1, data = USDistLag)
#
# Coefficients:
# (Intercept) gnp
# -24.0889 0.6448
# Add lagged gnp
addvars <- c("gnp")
fmla <- as.formula(paste('~.+',paste("lag(",addvars,",2)", collapse = '+')))
update(fit, fmla)
# Time series regression with "ts" data:
# Start = 1963, End = 1980
#
# Call:
# dynlm(formula = consumption ~ gnp + lag(gnp, 2), data = USDistLag)
#
# Coefficients:
# (Intercept) gnp lag(gnp, 2)
# -31.1437 0.5366 0.1067
How to include lagged variables in statsmodel ols regression
I don't think you can call it on the fly in the formula. Maybe using the shift method? Do clarify if this is not what you need
import statsmodels.api as sm
df['xlag'] = df['x'].shift()
df
y x v xlag
0 2 8 4 NaN
1 3 6 3 8.0
2 7 2 1 6.0
3 8 1 3 2.0
4 1 9 8 1.0
sm.formula.ols(formula = 'y ~ xlag + v', data=df).fit()
i am confused with the R implementation of lag in Regression analysis
The question starts out with:
look at this linear regression: Y ~ X + lag(X,1) ,the meaning is very clear
that it is trying to do a linear regression. and the lag(X,1) means the first
lag of X
Actually that is not the case. It does not refer to this model:
Y[i] = a + b * X[i] + c * X[i-1] + error[i]
It actually refers to this model:
Y[i] = a + b * X[i] + c * X[i+1] + error[i]
which is not likely what you intended.
It is likely that you wanted lag(X, -1)
rather than lag(X, 1)
. Lagging a series in R means that the lagged series starts earlier which implies that the series itself moves forward.
The other item to be careful of is that lm
does not align series. It knows nothing about the time index. You will need to align the series yourself or use a package which does it for you.
More on these points below.
ts
First let us consider lag.ts
from the core of R since lag.zoo
and lag.zooreg
are based on it and consistent with it. lag.ts
lags the times of the series so that the lagged series starts earlier. That is if we have a series whose values are 11, 12, 13 and 14 at times 1, 2, 3 and 4 respectively lag.ts
lags each time so that the lagged series has the same values 11, 12, 13 and 14 but at the times 0, 1, 2, 3. The original series started at 1 but the lagged series starts at 0. Originally the value 12 was at time 2 but in the lagged series the value 13 is at time 2. In code, we have:
tt <- ts(11:14)
cbind(tt, lag(tt), lag(tt, 1), lag(tt, -1))
gives:
Time Series:
Start = 0
End = 5
Frequency = 1
tt lag(tt) lag(tt, 1) lag(tt, -1)
0 NA 11 11 NA
1 11 12 12 NA
2 12 13 13 11
3 13 14 14 12
4 14 NA NA 13
5 NA NA NA 14
zoo
lag.zoo
is consistent with lag.ts
. Note that since zoo represents irrelgularly spaced series it cannot assume that time 0 comes before time 1. We could only make such an assumption if we knew the series were regularly spaced. Thus if time 1 is the earliest time in a series the value at this time is dropped since there is no way to determine what earlier time to lag it to. The new lagged series now starts at the second time value in the original series. This is similar to the lag.ts
example except in the lag.ts there was a 0 time and in this example there is no such time. Similarly we cannot extend the time scale forward in time either.
library(zoo)
z <- zoo(11:14)
merge(z, lag(z), lag(z, 1), lag(z,-1))
giving:
z lag(z) lag(z, 1) lag(z, -1)
1 11 12 12 NA
2 12 13 13 11
3 13 14 14 12
4 14 NA NA 13
zooreg
The zoo package does have a zooreg class which assumes regularly spaced series except for some missing values and it can deduce what comes before just as ts can. With zooreg it can deduce that time 0 comes before and time 5 comes after.
library(zoo)
zr <- zooreg(11:14)
merge(zr, lag(zr), lag(zr, 1), lag(zr,-1))
giving:
zr lag(zr) lag(zr, 1) lag(zr, -1)
0 NA 11 11 NA
1 11 12 12 NA
2 12 13 13 11
3 13 14 14 12
4 14 NA NA 13
5 NA NA NA 14
lm
lm
does not know anything about zoo
and will ignore the time index entirely. If you want to not ignore it, i.e. you want to align the series involved prior to running the regression, use the dyn (or dynlm) package. Using the former:
library(dyn)
set.seed(123)
zr <- zooreg(rnorm(10))
y <- 1 + 2 * zr + 3 * lag(zr, -1)
dyn$lm(y ~ zr + lag(zr, -1))
giving:
Call:
lm(formula = dyn(y ~ zr + lag(zr, -1)))
Coefficients:
(Intercept) zr lag(zr, -1)
1 2 3
Note 1: Be sure to read the documentation in the help files: ?lag.ts
, ?lag.zoo
, ?lag.zooreg
and help(package = dyn)
Note 2: If the direction of the lag seems confusing you could define your own function and use that in place of lag
. For example, this gives the same coefficients as the lm output shown above:
Lag <- function(x, k = 1) lag(x, -k)
dyn$lm(y ~ zr + Lag(zr))
An additional word of warning is that unlike lag.zoo
and lag.zooreg
which are consistent with the core of R, lag.xts
from the xts package is inconsistent. Also the lag
in dplyr is also inconsistent (and to make things worse if you load dplyr then dplyr will mask lag
in R with its own inconsistent version of lag
. Also note that L
in dynlm works the same as Lag
but wisely used a different name to avoid confusion.
Lagged linear model to identify correlation in age frequency data
I am adding another answer upon your comment on 7.27.2020. The plot does not have numbers, but it gives some idea about the numbers that I should have in the IVS matrix. Please try the following code and see if it makes sense.
tmp = wi.age.count[order(wi.age.count$Age), ]
ivs = reshape(tmp[which(tmp$Age != 0), -4], direction = "wide", idvar = "Year", timevar = "Age")
ivs[is.na(ivs)] = 0
> ivs
Year n.1 n.2 n.3 n.4 n.5 n.6 n.7 n.8 n.9
13 2007 8 17 0 0 1 0 0 0 0
16 2008 4 19 1 0 0 0 0 0 0
20 2009 46 37 52 5 1 1 0 0 0
26 2010 19 41 15 16 0 0 0 0 1
32 2011 13 4 26 12 11 1 1 1 0
41 2012 87 15 13 27 13 17 1 0 0
49 2013 32 30 3 4 1 1 1 0 0
57 2014 24 15 23 6 2 1 2 2 0
66 2015 18 13 31 28 3 3 6 0 1
74 2016 4 6 1 5 9 1 0 0 1
82 2017 16 16 8 1 1 4 0 0 0
89 2018 12 4 7 2 1 2 1 0 0
This is your ivs matrix. Does that look correct?
Everything else is the same. Here is your dv matrix:
dv = wi.age.count[which(wi.age.count$id == "YOY"), c(1, 3)]
> dv
Year n
1 2008 166
2 2009 28
3 2010 34
4 2011 77
5 2012 170
6 2013 18
7 2014 3
8 2015 22
9 2016 43
10 2017 50
11 2018 151
And your formula with three lags.
formula = ""
for (i in 2:4) formula = paste(formula, "+", names(ivs)[i])
formula = paste("n ~", substr(formula, 4, nchar(formula)))
> formula
[1] "n ~ n.1 + n.2 + n.3"
And here are the results:
l.fit = lm(formula, merge(dv, ivs))
AIC.l.fit <- signif(AIC(l.fit), digits = 3)
summary(l.fit)
Call:
lm(formula = formula, data = merge(dv, ivs))
Residuals:
Min 1Q Median 3Q Max
-60.367 -38.028 8.698 23.763 96.257
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 99.8976 36.1761 2.761 0.028 *
n.1 1.1059 0.8388 1.318 0.229
n.2 -1.7339 1.5773 -1.099 0.308
n.3 -1.6346 1.2932 -1.264 0.247
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 59.48 on 7 degrees of freedom
Multiple R-squared: 0.3731, Adjusted R-squared: 0.1044
F-statistic: 1.389 on 3 and 7 DF, p-value: 0.3233
> AIC.l.fit
[1] 126
Related Topics
How to Identify the Distribution of the Given Data Using R
Add Author Affiliation in R Markdown Beamer Presentation
Download Attachment from an Outlook Email Using R
What's the Difference Between Hex Code (\X) and Unicode (\U) Chars
Control Alignment of Two Side-By-Side Plots in Knitr
Error: Could Not Find Function "Unit"
Building a Box Plot from All Columns of Data Frame with Column Names on X in Ggplot2
R: Insert a Vector as a Row in Data.Frame
How to Calculate Wind Direction from U and V Wind Components in R
R: How Does a Foreach Loop Find a Function That Should Be Invoked
Dply: Order Columns Alphabetically in R
Loop Over Rows of Dataframe Applying Function with If-Statement
How to Control the Igraph Plot Layout with Fixed Positions
Plot Every Column in a Data Frame as a Histogram on One Page Using Ggplot
How to Remove the Legend Title in Ggplot2
Aggregating Sub Totals and Grand Totals with Data.Table
Why Is This Naive Matrix Multiplication Faster Than Base R'S