How can I automatically create n lags in a time series?
If you are looking for efficiency, try data.table's new shift function:
library(data.table) # V >= 1.9.5
n <- 2
setDT(df)[, paste("t", 1:n) := shift(t, 1:n)][]
# t t 1 t 2
# 1: 1 NA NA
# 2: 2 1 NA
# 3: 3 2 1
# 4: 4 3 2
# 5: 5 4 3
# 6: 6 5 4
Here you can set any name for your new columns (within paste), and you don't need to bind the result back to the original data, because the := operator updates your data set by reference.
How to auto-discover a lagging of time-series data in scikit-learn and classify using time-series data
TL;DR, but with a few quant-finance (QF) gems put in between the lines for those who care.
Prologue:
"There is no free lunch," so we will have to pay a cost for the wished-for result, but you know it is well worth doing: get advanced creativity, numpy utilities knowledge and scikit-learn tools ready, and turn the volume button of imagination to max.
Next, do not expect the process to deliver results in just a few seconds.
From experience with AI/ML hyperparameter-search job schedules on example DataSETs spanning just about X.shape ~ ( 400, 200000 ), the cross-validation for the best-generalising ML-engine hyperparametrisation regularly takes several days on a distributed multi-processing cluster.
As a bonus for directing further Quant research efforts, here is a sample from a similar feature-engineering research, with LDF()/GDF() indicators about the varying predictive power of the respective features elaborated into an extendedDataSET:
As written below, one may realise that just the top feature ( 1. ) is responsible for 43% per se, the next 27 features account for a further +17%, and the "rest" of the 360+ features added the remaining 40% into decisions, as the importances report shows
( individual feature and pre-lagging detail$ are not publi$hed here for obviou$ rea$on$ and are free to be discussed separately ):
|>>> aFeatureImportancesMAP_v4( loc_PREDICTOR, X_v412 )
ID.|LDF( fI ) | GDF|HUMAN_READABLE_FEATURE_NAME[COL] min() | MAX() | var()
___|__________|____|___________________________[___]___________|__________|____________
| | | [ ] | |
0. 0.4360231 | 43%| __________xxxxxxxxxxxxxCE [216] min: ... | MAX: ... | var(): ...
1. 0.0464851 | 48%| __________xxxxxxxxxxxxx_0 [215] min: ... | MAX: ... | var(): ...
2. 0.0104704 | 49%| __________xxxxxxxxxxxxx_1 [251] min: ... | MAX: ... | var(): ...
3. 0.0061596 | 49%| __________xxxxxxxxxxxxx_3 [206] min: ... | MAX: ... | var(): ...
4. 0.0055069 | 50%| __________xxxxxxxxxxxxx_2 [203] min: ... | MAX: ... | var(): ...
5. 0.0053235 | 50%| __________xxxxxxxxxxxxx_3 [212] min: ... | MAX: ... | var(): ...
6. 0.0050404 | 51%| ________f_xxxxxxxxxxxxx_7 [261] min: ... | MAX: ... | var(): ...
7. 0.0049998 | 52%| ________f_xxxxxxxxxxxxx_7 [253] min: ... | MAX: ... | var(): ...
8. 0.0048721 | 52%| __________xxxxxxxxxxxxx_4 [113] min: ... | MAX: ... | var(): ...
9. 0.0047981 | 52%| __________xxxxxxxxxxxxx_4 [141] min: ... | MAX: ... | var(): ...
10. 0.0043784 | 53%| __________xxxxxxxxxxxxx_3 [142] min: ... | MAX: ... | var(): ...
11. 0.0043257 | 53%| __________xxxxxxxxxxxxx_4 [129] min: ... | MAX: ... | var(): ...
12. 0.0042124 | 54%| __________xxxxxxxxxxxxx_1 [144] min: ... | MAX: ... | var(): ...
13. 0.0041864 | 54%| ________f_xxxxxxxxxxxxx_8 [260] min: ... | MAX: ... | var(): ...
14. 0.0039645 | 55%| __________xxxxxxxxxxxxx_1 [140] min: ... | MAX: ... | var(): ...
15. 0.0037486 | 55%| ________f_xxxxxxxxxxxxx_8 [252] min: ... | MAX: ... | var(): ...
16. 0.0036820 | 55%| ________f_xxxxxxxxxxxxx_8 [268] min: ... | MAX: ... | var(): ...
17. 0.0036384 | 56%| __________xxxxxxxxxxxxx_1 [108] min: ... | MAX: ... | var(): ...
18. 0.0036112 | 56%| __________xxxxxxxxxxxxx_2 [207] min: ... | MAX: ... | var(): ...
19. 0.0035978 | 56%| __________xxxxxxxxxxxxx_1 [132] min: ... | MAX: ... | var(): ...
20. 0.0035812 | 57%| __________xxxxxxxxxxxxx_4 [248] min: ... | MAX: ... | var(): ...
21. 0.0035558 | 57%| __________xxxxxxxxxxxxx_3 [130] min: ... | MAX: ... | var(): ...
22. 0.0035105 | 57%| _______f_Kxxxxxxxxxxxxx_1 [283] min: ... | MAX: ... | var(): ...
23. 0.0034851 | 58%| __________xxxxxxxxxxxxx_4 [161] min: ... | MAX: ... | var(): ...
24. 0.0034352 | 58%| __________xxxxxxxxxxxxx_2 [250] min: ... | MAX: ... | var(): ...
25. 0.0034146 | 59%| __________xxxxxxxxxxxxx_2 [199] min: ... | MAX: ... | var(): ...
26. 0.0033744 | 59%| __________xxxxxxxxxxxxx_1 [ 86] min: ... | MAX: ... | var(): ...
27. 0.0033624 | 59%| __________xxxxxxxxxxxxx_3 [202] min: ... | MAX: ... | var(): ...
28. 0.0032876 | 60%| __________xxxxxxxxxxxxx_4 [169] min: ... | MAX: ... | var(): ...
...
62. 0.0027483 | 70%| __________xxxxxxxxxxxxx_8 [117] min: ... | MAX: ... | var(): ...
63. 0.0027368 | 70%| __________xxxxxxxxxxxxx_2 [ 85] min: ... | MAX: ... | var(): ...
64. 0.0027221 | 70%| __________xxxxxxxxxxxxx_1 [211] min: ... | MAX: ... | var(): ...
...
104. 0.0019674 | 80%| ________f_xxxxxxxxxxxxx_3 [273] min: ... | MAX: ... | var(): ...
105. 0.0019597 | 80%| __________xxxxxxxxxxxxx_6 [ 99] min: ... | MAX: ... | var(): ...
106. 0.0019199 | 80%| __________xxxxxxxxxxxxx_1 [104] min: ... | MAX: ... | var(): ...
...
169. 0.0012095 | 90%| __________xxxxxxxxxxxxx_4 [181] min: ... | MAX: ... | var(): ...
170. 0.0012017 | 90%| __________xxxxxxxxxxxxx_3 [ 9] min: ... | MAX: ... | var(): ...
171. 0.0011984 | 90%| __________xxxxxxxxxxxxx_4 [185] min: ... | MAX: ... | var(): ...
172. 0.0011926 | 90%| __________xxxxxxxxxxxxx_1 [ 19] min: ... | MAX: ... | var(): ...
...
272. 0.0005956 | 99%| __________xxxxxxxxxxxxx_2 [ 33] min: ... | MAX: ... | var(): ...
273. 0.0005844 | 99%| __________xxxxxxxxxxxxx_2 [127] min: ... | MAX: ... | var(): ...
274. 0.0005802 | 99%| __________xxxxxxxxxxxxx_3 [ 54] min: ... | MAX: ... | var(): ...
275. 0.0005663 | 99%| __________xxxxxxxxxxxxx_3 [ 32] min: ... | MAX: ... | var(): ...
276. 0.0005534 | 99%| __________xxxxxxxxxxxxx_1 [ 83] min: ... | MAX: ... | var(): ...
...
391. 0.0004347 |100%| __________xxxxxxxxxxxxx_2 [ 82] min: ... | MAX: ... | var(): ...
So rather plan & reserve a bit more vCPU-core capacity than expect to run such a search on a laptop during a forthcoming lunchtime...
Let's have a working plan
The intended auto-discovery service is, for many reasons, not a part of scikit-learn; the goal, however, is achievable. We will use the following adaptation steps to make it work:
- we'll rely on scikit-learn abilities to search for the best tandem of a [ learner + hyperparameters ] for a well-defined AI/ML problem
- we'll rely on numpy powers, for obvious reasons, to support the scikit-learn phase
- we'll rely on the proper scikit-learn AI/ML-engine processing & process-controls ( pipeline, GridSearchCV et al., sketched right after this list ), which are by far better optimised at the low level for such massive-scale attacks than any "externally" orchestrated for-looping ( which loses all the valuable cache/data-locality advantages and is known to carry a remarkable performance disadvantage )
- we'll substitute the wished-for auto-discovery with a fast, one-step, a-priori DataSET adaptation
- we'll let scikit-learn decide ( quantitatively indicate ) which pre-lagged features, artificially synthesised into the compound DataSET elaborated in [4], finally have the best predictive powers
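To make the third point concrete, here is a minimal sketch of the pipeline + GridSearchCV kind of process-control meant above ( the estimator, the parameter grid and the step names are illustrative assumptions only, not the setup used for the report above ):
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.feature_selection import SelectKBest, f_classif
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import GridSearchCV
>>>
>>> aPIPELINE = Pipeline( [ ( 'select', SelectKBest( f_classif ) ),   # feature-selection step
...                         ( 'learn',  RandomForestClassifier() )    # learner step
...                         ]
...                       )
>>> aPARAM_GRID = { 'select__k':           [ 10, 50, 100 ],           # how many features to keep
...                 'learn__n_estimators': [ 100, 500 ]               # learner hyperparameter
...                 }
>>> aSEARCH = GridSearchCV( aPIPELINE, aPARAM_GRID, cv = 5 )          # .fit( X, y ) runs the search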
[4] DataSET adaptation with smart numpy aids:
Your DataSET consists of an unspecified count of TimeSeries data. For each such series, you assume that some pre-lagging may have better predictive powers, and you would like to find it ( to quantitatively support the selection of such lags for the final ML-predictor ).
So let's first construct, for each TimeSerie DataSET[i,:] in the source part of the DataSET, an extended part of the DataSET which contains the respectively pre-lagged versions of this TimeSerie:
>>> def generate_TimeSERIE_preLAGs( aTimeSERIE, pre_lag_window_depth ):
... #
... # COURTESY & THANKS TO:
... # Nicolas P. Rougier, INRIA
... # Author: Joe Kington / Erik Rigtorp
... #
... shape = ( aTimeSERIE.size - pre_lag_window_depth + 1,
... pre_lag_window_depth
... )
... strides = ( aTimeSERIE.itemsize,
... aTimeSERIE.itemsize
... )
... return np.lib.stride_tricks.as_strided( aTimeSERIE,
... shape,
... strides = strides
... )
...
>>> xVECTOR = np.arange( 10 )
>>>
>>> pre_laggz_on_xVECTOR = generate_TimeSERIE_preLAGs( xVECTOR, 4 )
>>>
>>> pre_laggz_on_xVECTOR
array([[0, 1, 2, 3],
[1, 2, 3, 4],
[2, 3, 4, 5],
[3, 4, 5, 6],
[4, 5, 6, 7],
[5, 6, 7, 8],
[6, 7, 8, 9]])
>>>
With such an extended ( wider, and you know by how much ) but static extendedDataSET, containing now both the original TimeSERIE vector and all the wished-to-test pre-lagged versions thereof, your ML-search starts.
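To make the assembly step explicit, a minimal sketch ( the helper name build_extendedDataSET and the row-wise source layout are assumptions of this sketch, not part of the code above ) that stacks the pre-lagged windows of every source TimeSerie side by side; all windows share the same row count, so np.hstack keeps the samples aligned:
>>> import numpy as np
>>>
>>> def build_extendedDataSET( aSourceDataSET, pre_lag_window_depth ):
...     # generate pre-lagged windows for each TimeSerie ( a row of the
...     # source DataSET ) and stack them column-wise into one static 2D array
...     # NOTE: assumes each row is a contiguous 1D view ( as with np.vstack )
...     return np.hstack( [ generate_TimeSERIE_preLAGs( aSourceDataSET[i,:],
...                                                     pre_lag_window_depth
...                                                     )
...                         for i in range( aSourceDataSET.shape[0] )
...                         ]
...                       )
...
>>> X_src = np.vstack( ( np.arange( 10 ), np.arange( 10, 20 ) ) )
>>> build_extendedDataSET( X_src, 4 ).shape
(7, 8)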
Phase [1.A]: initially use scikit-learn tools for the best feature-selection supporting your hypothesis,
+
Phase [1.B]: next, start the hyperparameter optimisation for the best cross-validation results, supporting a maximum generalisation ability of the learner.
The phase [1.B] naturally ought to run on a sub-set of the extendedDataSET ( which was intentionally extended for the sake of the scikit-learn evaluation in the feature-selection phase [1.A] ).
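A minimal sketch of how the two phases could be wired together ( the RandomForestClassifier, the grid, the placeholder shapes and the top-10 cut are illustrative assumptions; any estimator exposing feature_importances_ would do ):
>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
>>>
>>> # X_ext is the static extendedDataSET from [4], y the labels to classify
>>> # ( placeholder data here, just to keep the sketch self-contained )
>>> X_ext = np.random.rand( 100, 28 )
>>> y     = np.random.randint( 0, 2, 100 )
>>>
>>> # [1.A] rank the ( pre-lagged ) features by their predictive power
>>> fI     = RandomForestClassifier( n_estimators = 100 ).fit( X_ext, y ).feature_importances_
>>> ranked = np.argsort( fI )[::-1]
>>>
>>> # [1.B] hyperparameter search on the best-ranked sub-set of columns,
>>> #       with splits that respect the time-ordering of the samples
>>> aSEARCH = GridSearchCV( RandomForestClassifier(),
...                         { 'n_estimators': [ 50, 100, 200 ],
...                           'max_depth':    [ 3, 5, None ]
...                           },
...                         cv = TimeSeriesSplit( n_splits = 5 )
...                         )
>>> _ = aSEARCH.fit( X_ext[:, ranked[:10]], y )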
Epilogue:
Memento mori: Quants do not like this ... but, for the sake of your further interests in TimeSeries analyses & quantitative modelling, you might be interested in the best answer on this topic:
Correlation does not imply Causation, so even more care is needed in making decisions that are to be executed ( paper can always handle much more than the markets :o) ).
Basic lag in R vector/dataframe
Another way to deal with this is using the zoo package, which has a lag method that will pad the result with NA:
> require(zoo)
> set.seed(123)
> x <- zoo(sample(c(1:9), 10, replace = T))
> y <- lag(x, -1, na.pad = TRUE)
> cbind(x, y)
x y
1 3 NA
2 8 3
3 4 8
4 8 4
5 9 8
6 1 9
7 5 1
8 9 5
9 5 9
10 5 5
The result is a multivariate zoo object (which is an enhanced matrix), but easily converted to a data.frame via
> data.frame(cbind(x, y))
How to Use Lagged Time-Series Variables in a Python Pandas Regression Model?
pandas allows you to shift your data without moving the index, as follows:
df.shift(-1)
will shift the values one index backwards ( a lead ), while
df.shift(1)
will shift the values one index forwards ( a lag of 1 ).
So if you have a daily time series, you can use df.shift(1) to create a 1-day lag in your values of price, such as:
df['lagprice'] = df['price'].shift(1)
After that, if you want to run an OLS regression, you can look at scipy.stats.linregress here:
http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html
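For instance, a minimal sketch ( the column names and sample values are illustrative only ) regressing price on its own 1-day lag with linregress, after dropping the NaN row that shift introduces:
import pandas as pd
from scipy.stats import linregress

df = pd.DataFrame({'price': [10.0, 10.5, 10.3, 10.8, 11.0, 10.9]})
df['lagprice'] = df['price'].shift(1)   # 1-day lag; the first row becomes NaN

clean = df.dropna()                     # linregress cannot handle NaN values
fit = linregress(clean['lagprice'], clean['price'])
print(fit.slope, fit.intercept, fit.rvalue)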
Lag time series with new rows
Somewhat prolix, but this does the job. It requires dplyr and magrittr.
# Original data frame
df <- data.frame(date = seq(as.Date("2017-07-01"), length=10, by="1 month") - 1, n = 1:10)
# date n
# 1 2017-06-30 1
# 2 2017-07-31 2
# 3 2017-08-31 3
# 4 2017-09-30 4
# 5 2017-10-31 5
# 6 2017-11-30 6
# 7 2017-12-31 7
# 8 2018-01-31 8
# 9 2018-02-28 9
# 10 2018-03-31 10
Next, I define the lag length:
# Length of lag
lag_length <- 2
Here, I create the extra rows to be added:
# Extra rows to add
extra <- data.frame(date = (seq(tail(df$date, 1) + 1, length = lag_length + 1, by = "1 month") - 1)[-1], n = NA)
Finally, I bind them to the original data frame and lag the variable n:
# Bind extra rows and lag 'n' by 'lag_length'
df %<>%
bind_rows(extra) %>%
mutate(n = lag(n, lag_length))
# New data frame
# date n
# 1 2017-06-30 NA
# 2 2017-07-31 NA
# 3 2017-08-31 1
# 4 2017-09-30 2
# 5 2017-10-31 3
# 6 2017-11-30 4
# 7 2017-12-31 5
# 8 2018-01-31 6
# 9 2018-02-28 7
# 10 2018-03-31 8
# 11 2018-04-30 9
# 12 2018-05-31 10
avoiding for loops when generating lagged variables in R (and using zoo)
k in lag.zoo can be a vector; see ?lag.zoo. ( Note that, as in the zoo answer above, positive k produces leads; use negative values, e.g. lag(x, -(1:3)), if you want lags. )
x <- zoo(11:21)
lag(x,1:3)
Time series lags and correlations (autocorrelations)
Probably all you need is the acf function in the stats package. It will compute correlations for as many lags as you prefer.
library(stats) # for the use of "acf" function
health.d <- health.d[!duplicated(health.d),]
health.d$lnincome <- log(health.d$Income + 1)
health.d <- health.d[(health.d$sex == 1 & health.d$married == 0),]
#First Difference for each individual ( %>% , group_by and mutate are functions in dplyr package)
health.d <- health.d %>%
group_by(ID) %>%
mutate(Dy = lnincome - lag(lnincome, 1))
acf.results <- acf(health.d$Dy, lag.max = 5, type = "correlation", plot = TRUE, na.action = na.pass)
plot(acf.results, main="Auto-correlation")
This will give you a plot of the auto-correlations at the 5 lags specified via the lag.max argument of acf.
If you want to access the acf results you can use:
print(acf.results)
and you will get the following
Autocorrelations of series ‘health.d$Dy’, by lag
0 1 2 3 4 5
1.000 -0.225 0.016 -0.030 -0.002 0.002