How to Automatically Create N Lags in a Timeseries

How can I automatically create n lags in a timeseries?

If you are looking for efficiency, try the shift function from data.table (v1.9.5+):

library(data.table) # V >= 1.9.5
n <- 2
setDT(df)[, paste("t", 1:n) := shift(t, 1:n)][]
#    t t 1 t 2
# 1: 1  NA  NA
# 2: 2   1  NA
# 3: 3   2   1
# 4: 4   3   2
# 5: 5   4   3
# 6: 6   5   4

Here you can set any names you like for the new columns (within paste), and you don't need to bind the result back to the original data, because the := operator updates your data set by reference.

How to auto-discover a lagging of time-series data in scikit-learn and classify using time-series data

TL;DR, but with a few quantitative-finance gems put in between the lines for those who care.

Prologue:

"There is no dinner for free" so we will have to pay a cost for having the wished result, but you know it is pretty worth doing it, so, get advanced creativity, numpy utilities knowledge and scikit-learn tools ready and turn the volume button of imagination on max.

Next, do not expect the process to deliver results in just few seconds.

From experience with AI/ML hyperparameter-search job schedules on example DataSETs spanning just about X.shape ~ ( 400, 200000 ), cross-validating the best-generalising ML-engine hyperparametrisation regularly takes several days on a distributed multi-processing cluster.

As a bonus for directing further Quant research efforts:
here is a sample from a similar feature-engineering study, with LDF()/GDF() indicators of the varying predictive power of the respective features elaborated into an extendedDataSET.
As written below, one may realise that
just the top 1. feature is responsible for 43% per se,
the next 27. features account for a further +17%,
and the "rest", some 360+ features, added the remaining ~40% into the decisions, as the importances report shows:

(
individual feature and pre-lagging detail$ are not publi$hed here for obviou$ rea$on$
and are free to be discussed separately
)

|>>> aFeatureImportancesMAP_v4( loc_PREDICTOR, X_v412 )

ID.|LDF( fI ) | GDF|HUMAN_READABLE_FEATURE_NAME[COL] min() | MAX() | var()
___|__________|____|___________________________[___]___________|__________|____________
| | | [ ] | |
0. 0.4360231 | 43%| __________xxxxxxxxxxxxxCE [216] min: ... | MAX: ... | var(): ...
1. 0.0464851 | 48%| __________xxxxxxxxxxxxx_0 [215] min: ... | MAX: ... | var(): ...
2. 0.0104704 | 49%| __________xxxxxxxxxxxxx_1 [251] min: ... | MAX: ... | var(): ...
3. 0.0061596 | 49%| __________xxxxxxxxxxxxx_3 [206] min: ... | MAX: ... | var(): ...
4. 0.0055069 | 50%| __________xxxxxxxxxxxxx_2 [203] min: ... | MAX: ... | var(): ...
5. 0.0053235 | 50%| __________xxxxxxxxxxxxx_3 [212] min: ... | MAX: ... | var(): ...
6. 0.0050404 | 51%| ________f_xxxxxxxxxxxxx_7 [261] min: ... | MAX: ... | var(): ...
7. 0.0049998 | 52%| ________f_xxxxxxxxxxxxx_7 [253] min: ... | MAX: ... | var(): ...
8. 0.0048721 | 52%| __________xxxxxxxxxxxxx_4 [113] min: ... | MAX: ... | var(): ...
9. 0.0047981 | 52%| __________xxxxxxxxxxxxx_4 [141] min: ... | MAX: ... | var(): ...
10. 0.0043784 | 53%| __________xxxxxxxxxxxxx_3 [142] min: ... | MAX: ... | var(): ...
11. 0.0043257 | 53%| __________xxxxxxxxxxxxx_4 [129] min: ... | MAX: ... | var(): ...
12. 0.0042124 | 54%| __________xxxxxxxxxxxxx_1 [144] min: ... | MAX: ... | var(): ...
13. 0.0041864 | 54%| ________f_xxxxxxxxxxxxx_8 [260] min: ... | MAX: ... | var(): ...
14. 0.0039645 | 55%| __________xxxxxxxxxxxxx_1 [140] min: ... | MAX: ... | var(): ...
15. 0.0037486 | 55%| ________f_xxxxxxxxxxxxx_8 [252] min: ... | MAX: ... | var(): ...
16. 0.0036820 | 55%| ________f_xxxxxxxxxxxxx_8 [268] min: ... | MAX: ... | var(): ...
17. 0.0036384 | 56%| __________xxxxxxxxxxxxx_1 [108] min: ... | MAX: ... | var(): ...
18. 0.0036112 | 56%| __________xxxxxxxxxxxxx_2 [207] min: ... | MAX: ... | var(): ...
19. 0.0035978 | 56%| __________xxxxxxxxxxxxx_1 [132] min: ... | MAX: ... | var(): ...
20. 0.0035812 | 57%| __________xxxxxxxxxxxxx_4 [248] min: ... | MAX: ... | var(): ...
21. 0.0035558 | 57%| __________xxxxxxxxxxxxx_3 [130] min: ... | MAX: ... | var(): ...
22. 0.0035105 | 57%| _______f_Kxxxxxxxxxxxxx_1 [283] min: ... | MAX: ... | var(): ...
23. 0.0034851 | 58%| __________xxxxxxxxxxxxx_4 [161] min: ... | MAX: ... | var(): ...
24. 0.0034352 | 58%| __________xxxxxxxxxxxxx_2 [250] min: ... | MAX: ... | var(): ...
25. 0.0034146 | 59%| __________xxxxxxxxxxxxx_2 [199] min: ... | MAX: ... | var(): ...
26. 0.0033744 | 59%| __________xxxxxxxxxxxxx_1 [ 86] min: ... | MAX: ... | var(): ...
27. 0.0033624 | 59%| __________xxxxxxxxxxxxx_3 [202] min: ... | MAX: ... | var(): ...
28. 0.0032876 | 60%| __________xxxxxxxxxxxxx_4 [169] min: ... | MAX: ... | var(): ...
...
62. 0.0027483 | 70%| __________xxxxxxxxxxxxx_8 [117] min: ... | MAX: ... | var(): ...
63. 0.0027368 | 70%| __________xxxxxxxxxxxxx_2 [ 85] min: ... | MAX: ... | var(): ...
64. 0.0027221 | 70%| __________xxxxxxxxxxxxx_1 [211] min: ... | MAX: ... | var(): ...
...
104. 0.0019674 | 80%| ________f_xxxxxxxxxxxxx_3 [273] min: ... | MAX: ... | var(): ...
105. 0.0019597 | 80%| __________xxxxxxxxxxxxx_6 [ 99] min: ... | MAX: ... | var(): ...
106. 0.0019199 | 80%| __________xxxxxxxxxxxxx_1 [104] min: ... | MAX: ... | var(): ...
...
169. 0.0012095 | 90%| __________xxxxxxxxxxxxx_4 [181] min: ... | MAX: ... | var(): ...
170. 0.0012017 | 90%| __________xxxxxxxxxxxxx_3 [ 9] min: ... | MAX: ... | var(): ...
171. 0.0011984 | 90%| __________xxxxxxxxxxxxx_4 [185] min: ... | MAX: ... | var(): ...
172. 0.0011926 | 90%| __________xxxxxxxxxxxxx_1 [ 19] min: ... | MAX: ... | var(): ...
...
272. 0.0005956 | 99%| __________xxxxxxxxxxxxx_2 [ 33] min: ... | MAX: ... | var(): ...
273. 0.0005844 | 99%| __________xxxxxxxxxxxxx_2 [127] min: ... | MAX: ... | var(): ...
274. 0.0005802 | 99%| __________xxxxxxxxxxxxx_3 [ 54] min: ... | MAX: ... | var(): ...
275. 0.0005663 | 99%| __________xxxxxxxxxxxxx_3 [ 32] min: ... | MAX: ... | var(): ...
276. 0.0005534 | 99%| __________xxxxxxxxxxxxx_1 [ 83] min: ... | MAX: ... | var(): ...
...
391. 0.0004347 |100%| __________xxxxxxxxxxxxx_2 [ 82] min: ... | MAX: ... | var(): ...

So rather plan for & reserve a bit more vCPU-core capacity
than expect to run such a search on a laptop just during a forthcoming lunchtime...


Let's have a working plan

The intended auto-discovery service is, for many reasons, not a part of scikit-learn; however, the goal is achievable.

We will use the following adaptation steps to make it work:

  1. we'll rely on scikit-learn's ability to search for the best tandem of [learner + hyperparameters] for a well-defined AI/ML problem

  2. we'll rely on numpy's powers, for obvious reasons, to support the scikit-learn phase

  3. we'll rely on the proper scikit-learn AI/ML-engine processing & process controls ( pipeline, GridSearchCV et al ), which are by far better optimised at the low level for such massive-scale attacks than an "externally" orchestrated for-loop ( which loses all the valuable cache/data-locality advantages ) and is known to carry a remarkable performance disadvantage

  4. we'll substitute the wished-for autodiscovery with a fast, one-step, a-priori DataSET adaptation

  5. we'll let scikit-learn decide ( quantitatively indicate ) which of the pre-lagged features, artificially synthesised into the compound DataSET elaborated in [4], finally have the best predictive powers


[4] DataSET adaptation with smart numpy aids:

Your DataSET consists of an unspecified number of time series. For each of them, you assume that some pre-lagged version may have better predictive power, and you would like to find it ( i.e. to quantitatively support the selection of such a lag for the final ML-predictor ).

So let's first construct, for each time series DataSET[i,:] in the source part of the DataSET, an extended part of the DataSET that contains the respective pre-lagged versions of that time series:

>>> import numpy as np
>>>
>>> def generate_TimeSERIE_preLAGs( aTimeSERIE, pre_lag_window_depth ):
...     #
...     # COURTESY & THANKS TO:
...     # Nicolas P. Rougier, INRIA
...     # Author: Joe Kington / Erik Rigtorp
...     #
...     # returns a strided view ( no copy ): each row is a sliding window
...     # of length pre_lag_window_depth over the 1D vector aTimeSERIE
...     shape   = ( aTimeSERIE.size - pre_lag_window_depth + 1,
...                 pre_lag_window_depth
...                 )
...     strides = ( aTimeSERIE.itemsize,
...                 aTimeSERIE.itemsize
...                 )
...     return np.lib.stride_tricks.as_strided( aTimeSERIE,
...                                             shape,
...                                             strides = strides
...                                             )
...
>>> xVECTOR = np.arange( 10 )
>>>
>>> pre_laggz_on_xVECTOR = generate_TimeSERIE_preLAGs( xVECTOR, 4 )
>>>
>>> pre_laggz_on_xVECTOR
array([[0, 1, 2, 3],
       [1, 2, 3, 4],
       [2, 3, 4, 5],
       [3, 4, 5, 6],
       [4, 5, 6, 7],
       [5, 6, 7, 8],
       [6, 7, 8, 9]])
>>>

With such an extended ( wider, and you know by how much ) but static extendedDataSET, now containing both the original TimeSERIE vector and all the wished-to-test pre-lagged versions thereof, your ML-search starts.
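To turn these per-series windows into one compound feature matrix, one may stack them side by side and trim the target so that both stay aligned. This is a minimal sketch only, not part of the original answer: the helper name build_extendedDataSET and the assumption that all source series and the target y share the same length and time alignment are illustrative; it reuses the generate_TimeSERIE_preLAGs helper defined above.

import numpy as np

def build_extendedDataSET( list_of_series, y, pre_lag_window_depth ):
    # each row of a strided-window block holds
    # [ x[t-depth+1], ..., x[t-1], x[t] ] for one time step t,
    # i.e. the current value plus ( depth - 1 ) pre-lagged copies of it
    blocks = [ generate_TimeSERIE_preLAGs( aSERIE, pre_lag_window_depth )
               for aSERIE in list_of_series ]
    X_ext  = np.hstack( blocks )              # widen: one block of columns per series
    y_ext  = y[ pre_lag_window_depth - 1: ]   # drop targets that lack a full window
    return X_ext, y_ext

For example, build_extendedDataSET( [ xVECTOR ], someTARGET, 4 ) would return a 7 x 4 feature matrix aligned with the last 7 entries of someTARGET.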

Phase [1.A]
Initially use scikit-learn tools for the feature selection that best supports your hypothesis,
+
Phase [1.B]
then start the hyperparameter optimisation for the best cross-validation results, supporting a maximum generalisation ability of the learner.

Phase [1.B] naturally ought to run on a sub-set of the extendedDataSET ( which was intentionally widened for the sake of the scikit-learn evaluation in the feature-selection phase [1.A] ).
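A minimal sketch of how phases [1.A] and [1.B] might be wired together in scikit-learn, fed with the X_ext, y_ext pair assembled in the sketch above: the estimator, the cross-validation splitter and the parameter grid below are illustrative assumptions ( a classification target is assumed ), not the setup behind the importance table shown earlier.

from sklearn.pipeline          import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble          import RandomForestClassifier
from sklearn.model_selection   import GridSearchCV, TimeSeriesSplit

pipe = Pipeline( [ ( 'select', SelectKBest( score_func = f_classif ) ),      # [1.A] feature selection
                   ( 'learn',  RandomForestClassifier( random_state = 0 ) )  # [1.B] learner
                   ] )

param_grid = { 'select__k':           [ 10, 50, 100 ],   # assumes X_ext has at least 100 columns
               'learn__n_estimators': [ 100, 300 ],
               'learn__max_depth':    [ 3, 7, None ],
               }

search = GridSearchCV( pipe,
                       param_grid,
                       cv     = TimeSeriesSplit( n_splits = 5 ),  # respect temporal ordering
                       n_jobs = -1
                       )
search.fit( X_ext, y_ext )

print( search.best_params_ )   # which pre-lagged features & hyperparameters generalised best

TimeSeriesSplit is used here instead of a shuffled K-fold so that the cross-validation never trains on data from the future of its test fold.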


Epilogue:
Memento mori: Quants do not like this ... but

For the sake of your further interest in time-series analyses & quantitative modelling, you might also be interested in the best answer on this topic >>>


Correlation does not imply causation, so even more care is needed when making decisions that will actually be executed ( paper can always handle much more than the markets can :o) ).

Basic lag in R vector/dataframe

Another way to deal with this is to use the zoo package, whose lag method can pad the result with NAs:

> require(zoo)
> set.seed(123)
> x <- zoo(sample(c(1:9), 10, replace = T))
> y <- lag(x, -1, na.pad = TRUE)
> cbind(x, y)
   x  y
1  3 NA
2  8  3
3  4  8
4  8  4
5  9  8
6  1  9
7  5  1
8  9  5
9  5  9
10 5  5

The result is a multivariate zoo object (which is an enhanced matrix), but easily converted to a data.frame via

> data.frame(cbind(x, y))

How to Use Lagged Time-Series Variables in a Python Pandas Regression Model?

pandas allows you to shift your data without moving the index. For example,

df.shift(1)

will lag the values by 1 index position ( row t then holds the value from t-1 ),

or

df.shift(-1)

will create a lead of 1 index position ( row t then holds the value from t+1 ).

So if you have a daily time series, you could use df.shift(1) to create a 1-day lag in your price values, such as

df['lagprice'] = df['price'].shift(1)

After that, if you want to run an OLS regression, you can look at the scipy module here:

http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html
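
For example, a minimal sketch of such an OLS fit ( assuming df is an existing pandas DataFrame with a 'price' column; the column names are illustrative ):

from scipy import stats

df['lagprice'] = df['price'].shift(1)                 # 1-day lagged price

clean = df.dropna(subset=['price', 'lagprice'])       # shift(1) leaves NaN in the first row
slope, intercept, r_value, p_value, std_err = stats.linregress(clean['lagprice'],
                                                               clean['price'])
print(slope, intercept, r_value)

statsmodels.api.OLS is a common alternative if you need a fuller regression summary.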

Lag time series with new rows

Somewhat prolix, but this does the job. This requires dplyr and magrittr.

library(dplyr)
library(magrittr)

# Original data frame
df <- data.frame(date = seq(as.Date("2017-07-01"), length = 10, by = "1 month") - 1, n = 1:10)

#          date  n
# 1  2017-06-30  1
# 2  2017-07-31  2
# 3  2017-08-31  3
# 4  2017-09-30  4
# 5  2017-10-31  5
# 6  2017-11-30  6
# 7  2017-12-31  7
# 8  2018-01-31  8
# 9  2018-02-28  9
# 10 2018-03-31 10

Next, I define the lag length:

# Length of lag
lag_length <- 2

Here, I create the extra rows to be added:

# Extra rows to add
extra <- data.frame(date = (seq(tail(df$date, 1) + 1, length = lag_length + 1, by = "1 month") - 1)[-1], n = NA)

Finally, I bind them to the original data frame and lag the variable n:

# Bind extra rows and lag 'n' by 'lag_length'
df %<>%
  bind_rows(extra) %>%
  mutate(n = lag(n, lag_length))

# New data frame
#          date  n
# 1  2017-06-30 NA
# 2  2017-07-31 NA
# 3  2017-08-31  1
# 4  2017-09-30  2
# 5  2017-10-31  3
# 6  2017-11-30  4
# 7  2017-12-31  5
# 8  2018-01-31  6
# 9  2018-02-28  7
# 10 2018-03-31  8
# 11 2018-04-30  9
# 12 2018-05-31 10

avoiding for loops when generating lagged variables in R (and using zoo)

k in lag.zoo can be a vector. See ?lag.zoo.

library(zoo)

x <- zoo(11:21)
lag(x, 1:3)

Time series lags and correlations (autocorrelations)

Probably all you need is the acf function in the stats package. It will compute correlations for as many lags as you prefer.

library(stats) # for the "acf" function
library(dplyr) # for %>%, group_by and mutate

health.d <- health.d[!duplicated(health.d), ]
health.d$lnincome <- log(health.d$Income + 1)
health.d <- health.d[(health.d$sex == 1 & health.d$married == 0), ]

# First difference for each individual
health.d <- health.d %>%
  group_by(ID) %>%
  mutate(Dy = lnincome - lag(lnincome, 1))

acf.results <- acf(health.d$Dy, lag.max = 5, type = "correlation",
                   plot = TRUE, na.action = na.pass)
plot(acf.results, main = "Auto-correlation")

This will give you a plot of the auto-correlations at the 5 lags specified in the acf call.


If you want to access the acf results you can use:

print(acf.results)

and you will get the following

Autocorrelations of series ‘health.d$Dy’, by lag

     0      1      2      3      4      5
 1.000 -0.225  0.016 -0.030 -0.002  0.002

