Time-Series - Data Splitting and Model Evaluation

Time-series - data splitting and model evaluation

Note that the trainControl in your original question already takes care of the timeSlicing, so you don't have to create timeSlices by hand.

However, here is how to use createTimeSlices to split the data yourself and then use the slices for training and testing a model.

Step 0: Setting up the data and trainControl (from your question):

library(caret)
library(ggplot2)
library(pls)

data(economics)

Step 1: Creating the timeSlices for the index of the data:

timeSlices <- createTimeSlices(1:nrow(economics),
                               initialWindow = 36, horizon = 12,
                               fixedWindow = TRUE)

This creates a list of training and testing timeSlices.

> str(timeSlices, max.level = 1)
## List of 2
## $ train:List of 431
## .. [list output truncated]
## $ test :List of 431
## .. [list output truncated]

For ease of understanding, I am saving them in separate variables (with initialWindow = 36 and horizon = 12, trainSlices[[1]] covers rows 1:36 and testSlices[[1]] covers rows 37:48):

trainSlices <- timeSlices[[1]]
testSlices <- timeSlices[[2]]

Step 2: Training on the first of the trainSlices:

plsFitTime <- train(unemploy ~ pce + pop + psavert,
                    data = economics[trainSlices[[1]],],
                    method = "pls",
                    preProc = c("center", "scale"))

Step 3: Testing on the first of the testSlices:

pred <- predict(plsFitTime, economics[testSlices[[1]],])

Step 4: Plotting:

true <- economics$unemploy[testSlices[[1]]]

plot(true, col = "red", ylab = "true (red), pred (blue)",
     ylim = range(c(pred, true)))
points(pred, col = "blue")

You can then do this for all the slices:

for(i in 1:length(trainSlices)){
  plsFitTime <- train(unemploy ~ pce + pop + psavert,
                      data = economics[trainSlices[[i]],],
                      method = "pls",
                      preProc = c("center", "scale"))
  pred <- predict(plsFitTime, economics[testSlices[[i]],])

  true <- economics$unemploy[testSlices[[i]]]
  plot(true, col = "red", ylab = "true (red), pred (blue)",
       main = i, ylim = range(c(pred, true)))
  points(pred, col = "blue")
}

As mentioned earlier, this sort of timeSlicing is done in one step by the trainControl from your original question:

> myTimeControl <- trainControl(method = "timeslice",
+ initialWindow = 36,
+ horizon = 12,
+ fixedWindow = TRUE)
>
> plsFitTime <- train(unemploy ~ pce + pop + psavert,
+ data = economics,
+ method = "pls",
+ preProc = c("center", "scale"),
+ trControl = myTimeControl)
> plsFitTime
Partial Least Squares

478 samples
5 predictors

Pre-processing: centered, scaled
Resampling: Rolling Forecasting Origin Resampling (12 held-out with a fixed window)

Summary of sample sizes: 36, 36, 36, 36, 36, 36, ...

Resampling results across tuning parameters:

  ncomp  RMSE  Rsquared  RMSE SD  Rsquared SD
  1      1080  0.443     796      0.297
  2      1090  0.430     845      0.295

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 1.

Hope this helps!!

splitting data for time series prediction

I found a way to use createTimeSlices inspired by Shambho's SO answer.

library(caret)

dates <- seq(as.Date('2017-01-01'), as.Date('2019-12-31'), by = 'days')

df <- data.frame(date = dates)
df$x <- 1
df$y <- 42

timeSlices <- createTimeSlices(1:nrow(df), initialWindow = 365 * 2,
                               horizon = 30, fixedWindow = TRUE, skip = 30)

#str(timeSlices, max.level = 1)

trainSlices <- timeSlices[[1]]
testSlices <- timeSlices[[2]]

for (i in 1:length(trainSlices)) {

  train <- df[trainSlices[[i]],]
  test <- df[testSlices[[i]],]

  # fit and calculate performance on test to ultimately get average etc.

  print(paste0(min(train$date), " - ", max(train$date)))
  print(paste0(min(test$date), " - ", max(test$date)))
  print("")
}

The key for me was to specify skip, as otherwise the window would only move one day at a time and one would end up with too many "folds".

Splitting data into training, testing and validation sets when making a Keras model

Generally, at training time (model.fit), you have two sets: one is the training set and the other is the validation/tuning/development set. With the training set you train the model, and with the validation set you find the best set of hyper-parameters. When you're done, you can then test your model on an unseen data set: a set that was completely hidden from the model, unlike the training and validation sets.
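For reference, here is a minimal sketch of carving out all three sets with two successive train_test_split calls (the toy data and the 0.33/0.2 split ratios are illustrative assumptions):

import numpy as np
from sklearn.model_selection import train_test_split

# toy stand-ins for your features/results
features = np.random.rand(1000, 8)
results = np.random.rand(1000)

# hold out 33% as the final (unseen) test set
X_train, X_test, y_train, y_test = train_test_split(features, results, test_size=0.33)
# carve a validation set out of the remaining training data
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2)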


Now, suppose you used

X_train, X_test, y_train, y_test = train_test_split(features, results, test_size=0.33)

This splits the features and results into 33% of the data for testing and 67% for training. Now, you can do one of two things:

  1. use X_test and y_test as the validation set in model.fit(...), or
  2. use them for the final prediction in model.predict(...)

So, if you choose the test set as a validation set (number 1), you would do as follows:

model.fit(x=X_train, y=y_train,
          validation_data = (X_test, y_test), ...)

In the training log, you will get the validation results along with the training scores. The validation results should match what you get if you later compute model.evaluate(X_test, y_test).
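For example, a quick way to confirm this afterwards (assuming the model above was compiled with at least one metric, so evaluate returns a list):

# should reproduce the last epoch's val_* numbers from the training log
results = model.evaluate(X_test, y_test, verbose=0)
print(dict(zip(model.metrics_names, results)))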


Now, if you choose the test set for the final prediction or final evaluation (number 2), then you need to create a new validation set or use the validation_split argument as follows:

model.fit(x=X_train, y=y_train,
          validation_split = 0.2, ...)

The Keras API will take 20% of the training data (X_train and y_train) and use it for validation. And lastly, for the final evaluation of your model, you can do as follows:

y_pred = model.predict(X_test, batch_size=50)

Now you can compare y_test and y_pred with some relevant metrics.
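For instance, assuming a regression task (the metric choice here is just an illustration):

from sklearn.metrics import mean_absolute_error, mean_squared_error

print(mean_absolute_error(y_test, y_pred))
print(mean_squared_error(y_test, y_pred))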

Is the test data used in PyCaret time series (beta) completely unseen by the model(s)?

If you look at the CV splits, you will notice that they do not use the test data at all. So any step that uses cross-validation, such as create_model, tune_model, blend_models or compare_models, will not use the test data at all for training.

Once you are happy with the models from these steps, you can finalize the model using finalize_model. In this case, whatever model you pass to finalize_model is trained on the complete dataset (train + test) so that you can make true future predictions.
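As a rough sketch of that workflow using the pycaret.time_series module (the toy series, forecast horizon, fold count and model id below are illustrative assumptions, not from the question):

import pandas as pd
from pycaret.time_series import setup, create_model, finalize_model, predict_model

# toy monthly series
y = pd.Series(range(100), index=pd.period_range("2015-01", periods=100, freq="M"))

# fh=12 holds out the last 12 points as the test set; the CV folds never touch it
setup(data=y, fh=12, fold=3, session_id=42)

model = create_model("arima")        # cross-validated on the training split only
final = finalize_model(model)        # refit on the complete dataset (train + test)
preds = predict_model(final, fh=12)  # true future forecast beyond the dataset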

Using Random Forest for time series dataset

We live in a world where "future-to-past causality" only occurs in cool sci-fi movies. Thus, when modeling time series, we like to avoid explaining past events with future events. We also like to verify that our models, strictly trained on past events, can explain future events.

To model a time series T with an RF, rolling is used. For day t, the value T[t] is the target, and the values T[t-k] for k = {1, 2, ..., h}, where h is the past horizon, are used to form features. For a nonstationary time series, T is first converted to e.g. the relative change T_rel[t] = (T[t+1] - T[t]) / T[t].
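A minimal sketch of that feature construction in Python (the toy series, h = 3 and the forest settings are illustrative; T_rel here is the one-step relative change):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

T = pd.Series(np.random.randn(300).cumsum() + 100)  # toy nonstationary series
T_rel = T.pct_change()                              # relative change of T

h = 3  # past horizon
X = pd.concat({f"lag_{k}": T_rel.shift(k) for k in range(1, h + 1)}, axis=1)
y = T_rel
mask = X.notna().all(axis=1) & y.notna()            # drop rows with missing lags

rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X[mask], y[mask])
print(rf.oob_score_)  # the out-of-bag measure discussed below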

To evaluate performance, I advise checking the out-of-bag (OOB) cross-validation measure of the RF. Be aware that there are some pitfalls that can render this measure over-optimistic:

  1. Unknown future-to-past contamination: the rolling is somehow faulty and the model uses future events to explain that same future within the training set.

  2. Non-independent sampling: if the time interval you want to forecast ahead is shorter than the time interval the relative change is computed over, your samples are not independent.

  3. Possibly other mistakes I don't know of yet.

In the end, anyone can make the above mistakes in some latent way. To check that this is not happening, you need to validate your model with backtesting, where each day is forecasted by a model strictly trained on past events only.

When the OOB-CV measure and backtesting wildly disagree, this may be a hint of some bug in the code.

To backtest, do the rolling on T[t-1] to T[t-traindays]: model this training data and forecast T[t]. Then increase t by one (t++) and repeat.

To speed things up, you may train your model only once, or retrain it only at every n'th increment of t.
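A sketch of that backtest loop (the variable names, window length and model settings are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def backtest(X, y, traindays=100, n=5):
    """Forecast y[t] from a model trained strictly on the previous traindays,
    retraining only at every n'th increment of t."""
    preds = {}
    model = None
    for t in range(traindays, len(y)):
        if model is None or (t - traindays) % n == 0:
            model = RandomForestRegressor(n_estimators=200, random_state=0)
            model.fit(X[t - traindays:t], y[t - traindays:t])
        preds[t] = model.predict(X[t:t + 1])[0]
    return preds

# toy usage: lag features built so row t only sees values from before t
series = np.random.default_rng(0).normal(size=300).cumsum()
X = np.column_stack([np.roll(series, k) for k in (1, 2, 3)])[3:]
y = series[3:]
preds = backtest(X, y)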

How to split the training data and test data for LSTM for time series prediction in Tensorflow

(1) We cannot. Imagine trying to predict the weather for tomorrow. Would you want a sequence of temperature values for the last 10 hours, or would you want random temperature values from the last 5 years?

Your dataset is a long sequence of values at a 1-hour interval. Your LSTM takes in a sequence of samples that are chronologically connected. For example, with sequence_length = 10 it could take the data from 2018-03-01 09:00:00 to 2018-03-01 19:00:00 as input. If you shuffle the dataset before generating the batches that consist of these sequences, you will train your LSTM to predict based on sequences of random samples from your whole dataset.
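A minimal sketch of doing it correctly: split chronologically first, then build the overlapping sequences inside each split, so no window crosses the train/test boundary (the toy signal and sequence_length are illustrative):

import numpy as np

def make_sequences(values, sequence_length=10):
    # X[i] = values[i : i+sequence_length], y[i] = the next value after the window
    X = np.stack([values[i:i + sequence_length]
                  for i in range(len(values) - sequence_length)])
    y = values[sequence_length:]
    return X[..., np.newaxis], y  # (samples, timesteps, 1) for an LSTM

series = np.sin(np.linspace(0, 50, 1000))  # toy hourly signal
split = int(len(series) * 0.8)             # chronological split, no shuffling
X_train, y_train = make_sequences(series[:split])
X_test, y_test = make_sequences(series[split:])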


(2) Yes, we need to consider temporal ordering for time series. You can find ways to test your time-series LSTM in Python here: https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/

The train/test data must be split in such a way as to respect the temporal ordering: the model is never trained on data from the future and is only tested on data from the future.


