Error when using predict() on a randomForest object trained with caret's train() using formula
First, almost never use the $finalModel object for prediction. Use predict.train. This is one good example of why.
There is some inconsistency between how some functions (including randomForest and train) handle dummy variables. Most functions in R that use the formula method will convert factor predictors to dummy variables because their models require numerical representations of the data. The exceptions to this are tree- and rule-based models (that can split on categorical predictors), naive Bayes, and a few others.
So randomForest will not create dummy variables when you use randomForest(y ~ ., data = dat), but train (and most others) will when you use a call like train(y ~ ., data = dat).
The error occurs because fuelType is a factor. The dummy variables created by train don't have the same names as the original column, so predict.randomForest can't find them in the new data.
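The name mismatch can be seen without fitting any model. The following is a minimal sketch in base R, assuming a made-up data frame (the values, the price column, and the name dat are hypothetical; only fuelType comes from the question). model.matrix() performs the same kind of dummy expansion that a formula interface applies to factor predictors:

```r
# Hypothetical toy data; only the column name fuelType comes from the question.
dat <- data.frame(
  price    = c(10.5, 12.1, 9.8, 15.2),
  fuelType = factor(c("diesel", "gas", "gas", "diesel"))
)

# Default treatment contrasts: one indicator column per non-reference level.
# [, -1] drops the intercept column.
expanded <- model.matrix(~ price + fuelType, data = dat)[, -1]

original_names <- names(dat)
expanded_names <- colnames(expanded)

print(original_names)
print(expanded_names)

# Columns a model trained on the expanded matrix would look for,
# but which the raw data frame does not contain.
missing_in_raw <- setdiff(expanded_names, original_names)
print(missing_in_raw)
```

A model fitted on the expanded columns knows nothing about a column called fuelType, which is why handing it the raw data frame fails.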
Using the non-formula method with train will pass the factor predictors to randomForest unchanged, and everything will work.
TL;DR
Use the non-formula method with train if you want the same levels, or use predict.train.
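Both options can be sketched as follows. This is an illustration under assumptions, not the original poster's code: it uses the built-in iris data with Species as the factor predictor, assumes the caret and randomForest packages are installed, and all object names (fit_formula, fit_xy, new_rows) are made up for the example.

```r
library(caret)
library(randomForest)

set.seed(1)
ctrl <- trainControl(method = "none")  # fit once, no resampling, to keep the sketch fast
grid <- data.frame(mtry = 2)

# Formula interface: the factor Species is expanded into dummy variables.
fit_formula <- train(Sepal.Length ~ ., data = iris, method = "rf",
                     trControl = ctrl, tuneGrid = grid, ntree = 50)

# Non-formula interface: Species reaches randomForest as a factor.
fit_xy <- train(x = iris[, -1], y = iris$Sepal.Length, method = "rf",
                trControl = ctrl, tuneGrid = grid, ntree = 50)

# Predictor names as seen by the underlying randomForest objects.
names_formula <- rownames(randomForest::importance(fit_formula$finalModel))
names_xy      <- rownames(randomForest::importance(fit_xy$finalModel))
print(names_formula)
print(names_xy)

new_rows <- iris[c(1, 51, 101), ]

# Option 1: predict.train rebuilds the dummy variables before predicting.
pred_train <- predict(fit_formula, newdata = new_rows)

# Option 2: with the non-formula fit, $finalModel accepts the raw factor column.
pred_xy <- predict(fit_xy$finalModel, newdata = new_rows[, -1])

# Raw data passed to the formula fit's $finalModel is rejected.
raw_call <- tryCatch(predict(fit_formula$finalModel, newdata = new_rows),
                     error = function(e) e)
print(inherits(raw_call, "error"))
```

Even when the non-formula fit makes $finalModel usable, predict.train remains the safer choice because it also applies any preprocessing that train was asked to perform.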
R: randomForest error when combining forest produced using Caret
I also had this issue and found the following from this post: Error when using predict() on a randomForest object trained with caret's train() using formula
The randomForest object is in $finalModel, so forestList_caret[[i]]$finalModel in your example. Your code works with the following changes:
- Change line 8 to: forestList <- forestList_caret <- list()
- Change line 28 to: rf.all_caret <- do.call("combine", forestList_caret)
- Insert after line 22:
forestList_caret[[i]] <- forestList_caret[[i]]$finalModel
print(class(forestList_caret[[i]]))
Storing the $finalModel object lets you combine them at the end, and the result is an object with class randomForest. Check with:
print(class(rf.all_caret))
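Since the question's full script is not reproduced here, the following is a self-contained sketch of the same idea under assumptions: it fits three small forests on the built-in iris data (a stand-in for the original data), and assumes caret and randomForest are installed. Only the names forestList_caret and rf.all_caret are taken from the answer above.

```r
library(caret)
library(randomForest)

set.seed(42)
ctrl <- trainControl(method = "none")  # single fit per model, no resampling

forestList_caret <- list()
for (i in 1:3) {
  fit <- train(Species ~ ., data = iris, method = "rf",
               trControl = ctrl, tuneGrid = data.frame(mtry = 2), ntree = 25)
  # Keep only the underlying randomForest object, not the train wrapper.
  forestList_caret[[i]] <- fit$finalModel
}

# combine() only accepts randomForest objects, which is why $finalModel is stored.
rf.all_caret <- do.call(randomForest::combine, forestList_caret)

print(class(rf.all_caret))
print(rf.all_caret$ntree)
```

Calling randomForest::combine by its qualified name avoids clashes with other packages that export a function called combine.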
Error in predicting raster with randomForest, Caret, and factor variables
It took a good bit of testing, but the answer is that raster::predict() only works with models generated from caret::train() that contain factors if the model is specified as a formula (y ~ x1 + x2 + x3) and not as y = y, x = x (as a matrix or data.frame). Only through the formula interface will the model create the proper contrasts or dummy variables. There is no need to make your raster layers into factors via as.factor(). The predict function will do that for you.
Error in UseMethod predict when running Random Forest model
I was able to make that command work after reinstalling rtools and rlang. I also installed caret, but I'm not sure if that was necessary.
At any rate, I think rtools was lacking or not found.
R caret: values of $finalModel$predicted and values obtained by predict()
There already are a lot of questions related to this issue. See
- Using randomForest package in R, how to get probabilities from classification model?
- The Effect of Specifying Training Data as New Data when Making Random Forest Predictions in R
- R random forest inconsistent predictions
- Error when using predict() on a randomForest object trained with caret's train() using formula
- Different results with randomForest() and caret's randomForest (method = "rf")
on SO and Question 1, Question 2, Question 3, Question 4, Question 5 on Stats.SE.
As a couple of answers on Stats.SE mention, dat$pred_caret differs from dat$pred because predict.train uses the whole training set, while the documentation of predict.randomForest says
newdata - a data frame or matrix containing new data. (Note: If not given, the out-of-bag prediction in object is returned.)
Here rf_gridsearch$finalModel$predicted is basically the same as
randomForest:::predict.randomForest(rf_gridsearch$finalModel)
since rf_gridsearch$finalModel is an object of randomForest class. That is, no newdata gets provided.
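The difference between out-of-bag predictions and predictions on the training set can be checked directly with randomForest. This is a small sketch on the built-in iris data rather than the question's dat, and the object names are made up for the example:

```r
library(randomForest)

set.seed(123)
rf <- randomForest(Sepal.Length ~ ., data = iris, ntree = 100)

# No newdata: each row is predicted only by trees that did not see it (out-of-bag).
oob_pred <- predict(rf)

# newdata = training set: every row is passed through all trees,
# including the ones that were grown on it.
train_pred <- predict(rf, newdata = iris)

# The out-of-bag predictions are the ones stored in the fitted object.
same_as_stored <- isTRUE(all.equal(oob_pred, rf$predicted))
print(same_as_stored)

# Predictions on the training set look better than the out-of-bag ones.
oob_mse   <- mean((iris$Sepal.Length - oob_pred)^2)
train_mse <- mean((iris$Sepal.Length - train_pred)^2)
print(c(oob = oob_mse, train = train_mse))
```

The out-of-bag error is the more honest estimate of performance on unseen data, which is why the two columns in the question disagree.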
As for the error, it relates to the fact that train and randomForest treat data differently. This time it's not about scaling or centering, but rather about creating dummies. In particular, randomForest on its own would use the factor variable C directly, while train created the dummy variable CB <- 1 * (C == "B"), so the final model expects a column CB that dat does not have. Hence, you may replicate the result of predict.train with
predict(object = rf_gridsearch$finalModel,
newdata = model.matrix(~ A + B + C, dat[, 2:4])[, -1])
where
model.matrix(~ A + B + C, dat[, 2:4])[, -1]
# A B CB
# 1 1.3 44.5 0
# 2 4.4 50.1 0
# 3 5.5 23.7 1
# 4 6.7 89.2 1
# 5 8.1 10.5 1
How to input a caret trained random forest model into predict() and performance() functions?
If you read the ROCR documentation (the help page for prediction):
it has to be declared which class label denotes the negative, and
which the positive class. Ideally, labels should be supplied as
ordered factor(s), the lower level corresponding to the negative
class, the upper level to the positive class. If the labels are
factors (unordered), numeric, logical or characters, ordering of the
labels is inferred from R's built-in < relation (e.g. 0 < 1, -1 < 1,
'a' < 'b', FALSE < TRUE).
In your case, when you provide rf_train_model$pred$pred, the upper level is still "control", so the best way is to make it TRUE / FALSE. Also, you should provide the actual label, not the predicted label: rf_train_model$pred$obs. See below for an example:
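library(caret)
library(ROCR)
set.seed(100)
# Simulated data: 100 random predictors and a random two-class outcome
df = data.frame(matrix(runif(100*100),ncol=100))
df$outcome = ifelse(runif(100)>0.5,"case","control")
df_train = df[1:80,]
df_test = df[81:100,]
# ctrl was not defined in the original snippet; this definition is an assumption.
# metric = "ROC" needs class probabilities and twoClassSummary, and
# rf_train_model$pred is only stored when savePredictions is enabled.
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     savePredictions = TRUE)
rf_train_model <- train(outcome ~ ., data=df_train,
                        method= "rf",
                        ntree = 1500,
                        tuneGrid = data.frame(mtry = 33),
                        trControl = ctrl,
                        preProc=c("center","scale"),
                        metric="ROC",
                        importance=TRUE)
levels(rf_train_model$pred$pred)
# [1] "case" "control"
# Precision-recall curve; labels become TRUE/FALSE so the positive class is explicit
plotCurve = function(label,positive_class,prob){
  pred = prediction(prob,label==positive_class)
  perf <- performance(pred,"prec","rec")
  plot(perf)
}
# Held-out predictions from resampling
plotCurve(rf_train_model$pred$obs,"case",rf_train_model$pred$case)
# Test set: select the probability column of the positive class by name
plotCurve(df_test$outcome,"case",predict(rf_train_model,df_test,type="prob")[,"case"])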
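To see why "control" ends up as the upper (positive) level under the rule quoted above, the ordering can be checked in base R. This is a small sketch with a made-up label vector:

```r
# Hypothetical labels, using the two class names from the example above.
labels <- c("case", "control", "control", "case")

# R's built-in < relation puts "case" before "control" ...
print("case" < "control")

# ... so "control" is the upper level and would be treated as the positive class.
print(levels(factor(labels)))

# Converting to logical makes the positive class explicit: TRUE means "case".
is_case <- labels == "case"
print(is_case)
```

This is the conversion that plotCurve performs with label==positive_class before calling prediction.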