Time-series - data splitting and model evaluation
Note that the trainControl setup in your original question already takes care of the time slicing, so you don't have to create the time slices by hand.
However, here is how to use createTimeSlices for splitting the data and then using the slices to train and test a model.
Step 0: Setting up the data and trainControl (from your question):
library(caret)
library(ggplot2)
library(pls)
data(economics)
Step 1: Creating the timeSlices for the index of the data:
timeSlices <- createTimeSlices(1:nrow(economics),
initialWindow = 36, horizon = 12, fixedWindow = TRUE)
This creates a list of training and testing timeSlices.
> str(timeSlices,max.level = 1)
## List of 2
## $ train:List of 431
## .. [list output truncated]
## $ test :List of 431
## .. [list output truncated]
For ease of understanding, I am saving them in separate variables:
trainSlices <- timeSlices[[1]]
testSlices <- timeSlices[[2]]
Step 2: Training on the first of the trainSlices:
plsFitTime <- train(unemploy ~ pce + pop + psavert,
data = economics[trainSlices[[1]],],
method = "pls",
preProc = c("center", "scale"))
Step 3: Testing on the first of the testSlices:
pred <- predict(plsFitTime,economics[testSlices[[1]],])
Step 4: Plotting:
true <- economics$unemploy[testSlices[[1]]]
plot(true, col = "red", ylab = "true (red) , pred (blue)", ylim = range(c(pred,true)))
points(pred, col = "blue")
You can then do this for all the slices:
for(i in 1:length(trainSlices)){
plsFitTime <- train(unemploy ~ pce + pop + psavert,
data = economics[trainSlices[[i]],],
method = "pls",
preProc = c("center", "scale"))
pred <- predict(plsFitTime,economics[testSlices[[i]],])
true <- economics$unemploy[testSlices[[i]]]
plot(true, col = "red", ylab = "true (red) , pred (blue)",
main = i, ylim = range(c(pred,true)))
points(pred, col = "blue")
}
As mentioned earlier, this sort of timeSlicing is done by your original function in one step:
> myTimeControl <- trainControl(method = "timeslice",
+ initialWindow = 36,
+ horizon = 12,
+ fixedWindow = TRUE)
>
> plsFitTime <- train(unemploy ~ pce + pop + psavert,
+ data = economics,
+ method = "pls",
+ preProc = c("center", "scale"),
+ trControl = myTimeControl)
> plsFitTime
Partial Least Squares
478 samples
5 predictors
Pre-processing: centered, scaled
Resampling: Rolling Forecasting Origin Resampling (12 held-out with a fixed window)
Summary of sample sizes: 36, 36, 36, 36, 36, 36, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared RMSE SD Rsquared SD
1 1080 0.443 796 0.297
2 1090 0.43 845 0.295
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 1.
Hope this helps!!
splitting data for time series prediction
I found a way to use createTimeSlices inspired by Shambho's SO answer.
library(caret)
dates <- seq(as.Date('2017-01-01'), as.Date('2019-12-31'), by = 'days')
df <- data.frame(date = dates)
df$x <- 1
df$y <- 42
timeSlices <- createTimeSlices(1:nrow(df), initialWindow = 365 * 2, horizon = 30, fixedWindow = TRUE, skip = 30)
#str(timeSlices, max.level = 1)
trainSlices <- timeSlices[[1]]
testSlices <- timeSlices[[2]]
for (i in 1:length(trainSlices)) {
train <- df[trainSlices[[i]],]
test <- df[testSlices[[i]],]
# fit and calculate performance on test to ultimately get average etc.
print(paste0(min(train$date), " - ", max(train$date)))
print(paste0(min(test$date), " - ", max(test$date)))
print("")
}
The key for me was to specify skip, as otherwise the window would only move by 1 day at a time and one would end up with too many "folds".
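To make the effect of skip concrete, here is a minimal Python sketch of the same rolling-origin idea. The function name rolling_origin_slices and its 0-based indexing are my own choices for illustration; it is meant to mirror how createTimeSlices advances the window (by skip + 1 positions per fold), not to reproduce caret's implementation.

```python
def rolling_origin_slices(n, initial_window, horizon, fixed_window=True, skip=0):
    """Return (train, test): two lists of 0-based index lists, one pair per fold.

    Hypothetical helper for illustration; it is not part of caret.
    """
    train, test = [], []
    # 'stop' is the exclusive end of the training window; it advances by skip + 1.
    for stop in range(initial_window, n - horizon + 1, skip + 1):
        # A fixed window slides forward; otherwise the window grows from index 0.
        start = stop - initial_window if fixed_window else 0
        train.append(list(range(start, stop)))
        test.append(list(range(stop, stop + horizon)))
    return train, test


# Same settings as the data frame above: 1095 daily rows, 2 years of training, 30-day horizon.
train, test = rolling_origin_slices(1095, 365 * 2, 30, skip=30)
no_skip_train, _ = rolling_origin_slices(1095, 365 * 2, 30)
print(len(train), "folds with skip = 30;", len(no_skip_train), "folds without skip")
```

Each test slice starts right after its training slice ends, so no fold trains on rows that come after the rows it is tested on.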
Splitting data to training, testing and validation when making Keras model
Generally, at training time (model.fit), you have two sets: the training set and the validation/tuning/development set. With the training set, you train the model, and with the validation set, you find the best set of hyperparameters. When you're done, you can test your model on an unseen data set, one that was completely hidden from the model, unlike the training or validation sets.
Now, consider the split you used:
X_train, X_test, y_train, y_test = train_test_split(features, results, test_size=0.33)
This splits features and results into 33% of the data for testing and 67% for training. Now, you can do one of two things:
- use X_test and y_test as the validation set in model.fit(...), or
- use them for final prediction in model.predict(...).
So, if you choose to use these test sets as a validation set (the first option), you would do as follows:
model.fit(x=X_train, y=y_train,
validation_data = (X_test, y_test), ...)
In the training log, you will get the validation results along with the training score. The validation results should match what you get if you later compute model.evaluate(X_test, y_test).
Now, if you keep the test set for final prediction or final evaluation (the second option), then you need to create a separate validation set or use the validation_split argument as follows:
model.fit(x=X_train, y=y_train,
validation_split = 0.2, ...)
The Keras API will hold out 20% of the training data (X_train and y_train) and use it for validation. Lastly, for the final evaluation of your model, you can do as follows:
y_pred = model.predict(X_test, batch_size=50)
Now, you can compare y_test and y_pred using relevant metrics.
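To see how the three sets relate under the second option, here is a small index-only sketch. The helper three_way_split is hypothetical and uses only the standard library; it imitates the idea of train_test_split followed by validation_split (validation rows taken from the end of the training part) and does not call scikit-learn or Keras, so exact row counts may differ slightly from what those libraries produce.

```python
import random


def three_way_split(n_samples, test_size=0.33, validation_split=0.2, seed=0):
    """Return (fit_idx, val_idx, test_idx) as disjoint lists of row indices.

    Hypothetical helper for illustration only.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # shuffle once, as train_test_split does by default
    n_test = round(n_samples * test_size)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    # Carve the validation rows from the end of the training part.
    n_val = int(len(train_idx) * validation_split)
    cut = len(train_idx) - n_val
    return train_idx[:cut], train_idx[cut:], test_idx


fit_idx, val_idx, test_idx = three_way_split(1000)
print(len(fit_idx), len(val_idx), len(test_idx))
```

The point of the sketch is that the test indices never overlap with the rows used for fitting or for validation, which is what makes the final evaluation meaningful.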
Is test data used in PyCaret time series (beta) completely unseen by the model(s)?
If you look at the CV splits, they do not use the test data at all. So any step that uses cross-validation, such as create_model, tune_model, blend_model, or compare_models, will not use the test data for training.
Once you are happy with the models from these steps, you can finalize the model using finalize_model. In this case, whatever model you pass to finalize_model is trained on the complete dataset (train + test) so that you can make true future predictions.
Using Random Forest for time series dataset
We live in a world where "future-to-past-causality" only occurs in cool scifi movies. Thus, when modeling time series we like to avoid explaining past events with future events. Also, we like to verify that our models, strictly trained on past events, can explain future events.
To model a time series T with RF, rolling is used. For day t, the value T[t] is the target, and the values T[t-k] for k = {1, 2, ..., h}, where h is the past horizon, are used to form features. For nonstationary time series, T is converted to, e.g., the relative change Trel = (T[t+1]-T[t]) / T[t].
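As a rough sketch of that rolling step, the two helpers below (relative_change and make_lag_features, both names of my own choosing) build the relative-change series and the lagged feature rows in plain Python. The toy prices list is made up for illustration, and the resulting X and y would then be handed to whatever random forest implementation you use.

```python
def relative_change(series):
    """Trel[t] = (T[t+1] - T[t]) / T[t]; the result is one element shorter than the input."""
    return [(series[t + 1] - series[t]) / series[t] for t in range(len(series) - 1)]


def make_lag_features(series, h):
    """For each t >= h, the features are [T[t-1], ..., T[t-h]] and the target is T[t]."""
    X, y = [], []
    for t in range(h, len(series)):
        X.append([series[t - k] for k in range(1, h + 1)])
        y.append(series[t])
    return X, y


prices = [100.0, 110.0, 121.0, 108.9, 119.79, 125.0]  # made-up values
rel = relative_change(prices)
X, y = make_lag_features(rel, h=2)
for features, target in zip(X, y):
    print(features, "->", target)
```

Every feature row only contains values that precede its target, which is the property the rolling is supposed to guarantee.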
To evaluate performance, I advise checking the out-of-bag cross-validation measure of RF. Be aware that some pitfalls can render this measure overoptimistic:
- Unknown future-to-past contamination: the rolling is somehow faulty and the model uses future events to explain the same future within the training set.
- Non-independent sampling: if the time interval you want to forecast ahead is shorter than the time interval the relative change is computed over, your samples are not independent.
- Possible other mistakes I don't know of yet.
In the end, everyone can make the above mistakes in some latent way. To check that this is not happening, you need to validate your model with backtesting, where each day is forecasted by a model strictly trained on past events only.
When OOB-CV and backtesting wildly disagree, this may hint at a bug in the code.
To backtest, do rolling on T[t-1 to t-traindays]. Model this training data and forecast T[t]. Then increase t by one, t++, and repeat.
To speed this up, you may train your model only once or at every n'th increment of t.
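The backtesting loop could look roughly like the sketch below. The backtest function, its refit_every parameter, and the window-mean "model" are placeholders I made up so that the example runs without any ML library; in practice fit and predict would wrap your random forest and the lagged features.

```python
def backtest(series, train_days, fit, predict, refit_every=1):
    """Walk-forward backtest: forecast series[t] using only series[t-train_days:t]."""
    forecasts, actuals = [], []
    model = None
    for step, t in enumerate(range(train_days, len(series))):
        window = series[t - train_days:t]  # strictly past values
        if model is None or step % refit_every == 0:
            model = fit(window)  # refit only at every n'th increment of t
        forecasts.append(predict(model, window))
        actuals.append(series[t])
    return forecasts, actuals


def fit_mean(window):
    """Placeholder model: the mean of the training window."""
    return sum(window) / len(window)


def predict_mean(model, recent_window):
    """The placeholder forecast ignores the recent window and returns the stored mean."""
    return model


series = [float(v) for v in range(1, 21)]  # toy data
forecasts, actuals = backtest(series, train_days=5, fit=fit_mean, predict=predict_mean)
mae = sum(abs(f - a) for f, a in zip(forecasts, actuals)) / len(actuals)
print("backtest MAE:", mae)
```

Comparing an error like this against the OOB estimate is the sanity check described above: a large gap suggests something is leaking from the future.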
How to split the training data and test data for LSTM for time series prediction in Tensorflow
(1) We cannot. Imagine trying to predict the weather for tomorrow. Would you want a sequence of temperature values for the last 10 hours or would you want random temperature values of the last 5 years?
Your dataset is a long sequence of values in a 1-hour interval. Your LSTM takes in a sequence of samples that are chronologically connected. For example, with sequence_length = 10 it can take the data from 2018-03-01 09:00:00 to 2018-03-01 19:00:00 as input. If you shuffle the dataset before generating batches that consist of these sequences, you will train your LSTM on predicting based on a sequence of random samples from your whole dataset.
(2) Yes, we need to consider temporal ordering for time series. You can find ways to test your time series LSTM in python here: https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/
The train/test data must be split in a way that respects the temporal ordering: the model is never trained on data that comes after its test period, and it is only tested on data that lies in the future relative to its training data.
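A minimal sketch of such a split, assuming a plain list of hourly values: first cut the series chronologically, then build the sequence_length windows inside each part, so that every window keeps its internal order. The helper names chronological_split and make_windows and the 80/20 ratio are assumptions for illustration, not part of TensorFlow.

```python
def chronological_split(values, train_fraction=0.8):
    """Split without shuffling: the first part is for training, the rest for testing."""
    split = int(len(values) * train_fraction)
    return values[:split], values[split:]


def make_windows(values, sequence_length):
    """Each sample is sequence_length consecutive values; the target is the next value."""
    X, y = [], []
    for start in range(len(values) - sequence_length):
        X.append(values[start:start + sequence_length])
        y.append(values[start + sequence_length])
    return X, y


hourly = list(range(100))  # stand-in for 100 hourly measurements
train_values, test_values = chronological_split(hourly, train_fraction=0.8)
X_train, y_train = make_windows(train_values, sequence_length=10)
X_test, y_test = make_windows(test_values, sequence_length=10)
print(len(X_train), "training windows,", len(X_test), "test windows")
```

Because the cut happens before the windows are built, no training window contains a value from the test period.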