Error When Using Predict() on a Randomforest Object Trained with Caret's Train() Using Formula

First, almost never use the $finalModel object for prediction. Use predict.train. This is one good example of why.

There is some inconsistency between how some functions (including randomForest and train) handle dummy variables. Most functions in R that use the formula method will convert factor predictors to dummy variables because their models require numerical representations of the data. The exceptions to this are tree- and rule-based models (that can split on categorical predictors), naive Bayes, and a few others.

So randomForest will not create dummy variables when you call randomForest(y ~ ., data = dat), but train (and most other functions) will when called as train(y ~ ., data = dat).

The error occurs because fuelType is a factor. train replaces it with dummy variables whose names differ from the original column name, so predict.randomForest can't find the variables the model was fit on when you hand it the original data.
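The naming mismatch can be seen with base R's model.matrix, which performs the same kind of factor expansion a formula interface does. This is only a minimal sketch on a made-up data frame: dat and the hp column are assumptions for illustration, and fuelType echoes the variable named above.

```r
# Toy data (hypothetical); fuelType is a factor, as in the question.
dat <- data.frame(
  hp       = c(110, 95, 150, 88),
  fuelType = factor(c("gas", "diesel", "gas", "diesel"))
)

# Expand the predictors into a numeric design matrix and drop the intercept.
mm <- model.matrix(~ hp + fuelType, data = dat)[, -1]

colnames(mm)
# [1] "hp"          "fuelTypegas"

# The factor column has been replaced by a dummy column named
# <variable><level>; "fuelType" itself is no longer a column.
"fuelType" %in% colnames(mm)
# [1] FALSE
```

A model fit on mm only knows about fuelTypegas, so new data that still carries a fuelType column does not match.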

Using the non-formula method with train will pass the factor predictors to randomForest and everything will work.

TL;DR

Use the non-formula method with train if you want the factor predictors passed to randomForest unchanged, or use predict.train.
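Both options can be sketched roughly as follows. The data frame, its column names, and the tuning values are made up for illustration, and trainControl(method = "none") is used only to keep the fit quick; none of this comes from the original question.

```r
library(caret)
library(randomForest)

set.seed(1)
# Toy data (hypothetical): one numeric and one factor predictor.
dat <- data.frame(
  y    = rnorm(60),
  x1   = rnorm(60),
  fuel = factor(sample(c("diesel", "gas"), 60, replace = TRUE))
)
ctrl <- trainControl(method = "none")   # single fit, no resampling
grid <- data.frame(mtry = 1)

# Formula interface: train expands fuel into a dummy column.
fit_formula <- train(y ~ ., data = dat, method = "rf",
                     trControl = ctrl, tuneGrid = grid, ntree = 50)
fit_formula$finalModel$xNames           # predictor names the forest was fit on

# predict.train rebuilds the dummy variables, so this works.
p_train <- predict(fit_formula, newdata = dat)

# Calling predict on $finalModel with the raw data is expected to fail.
res <- try(predict(fit_formula$finalModel, newdata = dat), silent = TRUE)
inherits(res, "try-error")

# Non-formula interface: the factor reaches randomForest as a factor,
# so even $finalModel can be given the original columns.
fit_xy <- train(x = dat[, c("x1", "fuel")], y = dat$y, method = "rf",
                trControl = ctrl, tuneGrid = grid, ntree = 50)
p_xy <- predict(fit_xy$finalModel, newdata = dat[, c("x1", "fuel")])
```

Even with the non-formula interface, predict(fit_xy, newdata = ...) remains the safer habit, since it also applies any pre-processing that train was asked to do.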

R: randomForest error when combining forest produced using Caret

I also had this issue and found the following from this post: Error when using predict() on a randomForest object trained with caret's train() using formula

The randomForest object is in $finalModel, so forestList_caret[[i]]$finalModel in your example. Your code works with the following changes:

Change line 8 to forestList <- forestList_caret <- list()

Change line 28 to rf.all_caret <- do.call("combine",forestList_caret)

Insert after line 22:

forestList_caret[[i]] <- forestList_caret[[i]]$finalModel
print(class(forestList_caret[[i]]))

Storing the $finalModel object lets you combine them at the end, and the result is an object with class randomForest. Check with:

print(class(rf.all_caret))
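For readers without the original script, the same idea can be sketched end to end on the built-in iris data. The loop length, ntree, mtry, and object names here are assumptions chosen for illustration, not the values from the question.

```r
library(caret)
library(randomForest)

set.seed(42)
forest_list <- list()
for (i in 1:3) {
  fit <- train(x = iris[, 1:4], y = iris$Species, method = "rf",
               ntree = 20, tuneGrid = data.frame(mtry = 2),
               trControl = trainControl(method = "none"))
  # Keep only the underlying randomForest object, not the train object.
  forest_list[[i]] <- fit$finalModel
}

# combine() accepts randomForest objects, so the list can be merged.
rf_all <- do.call("combine", forest_list)
class(rf_all)
rf_all$ntree   # total number of trees across the merged forests
```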

Error in predicting raster with randomForest, Caret, and factor variables

It took a good bit of testing, but the answer is that raster::predict() only works with models generated from caret::train() that contain factors if the model is specified as a formula (y ~ x1 + x2 + x3) and not as y = y, x = x (as a matrix or data.frame). Only through the formula interface will the model create the proper contrasts or dummy variables. There is no need to make your raster layers into factors via as.factor(). The predict function will do that for you.

Error in UseMethod predict when running Random Forest model

I was able to make that command work after reinstalling rtools and rlang. I also installed caret, but I'm not sure if that was necessary.

At any rate, I think rtools was lacking or not found.

R caret: values of $finalModel$predicted and values obtained by predict()

There already are a lot of questions related to this issue. See

Using randomForest package in R, how to get probabilities from classification model?
The Effect of Specifying Training Data as New Data when Making Random Forest Predictions in R
R random forest inconsistent predictions
Error when using predict() on a randomForest object trained with caret's train() using formula
Different results with randomForest() and caret's randomForest (method = "rf")

on SO and Question 1, Question 2, Question 3, Question 4, Question 5 on Stats.SE.

As a couple of answers on Stats.SE mention, dat$pred_caret differs from dat$pred because predict.train predicts on the whole training set, while the documentation of predict.randomForest says:

newdata - a data frame or matrix containing new data. (Note: If not
given, the out-of-bag prediction in object is returned.)

where rf_gridsearch$finalModel$predicted is basically the same as

randomForest:::predict.randomForest(rf_gridsearch$finalModel)

since rf_gridsearch$finalModel is an object of randomForest class. That is, no newdata gets provided.
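The difference between the two kinds of prediction can be checked directly with randomForest on its own. This is a small sketch on the built-in iris data, not the data from the question; the object names are illustrative.

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 100)

# No newdata: the stored out-of-bag predictions are returned.
oob <- predict(rf)
all(oob == rf$predicted)

# Training data passed as newdata: every tree votes on every row,
# including rows the tree was grown on, so the fit looks better.
insample <- predict(rf, newdata = iris)

mean(oob == iris$Species)        # out-of-bag accuracy
mean(insample == iris$Species)   # in-sample accuracy, typically higher
```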

As for the error, it relates to the fact that train and randomForest treat data differently. This time it's not about scaling or centering, but rather about creating dummies. In particular, randomForest is looking for the C variable (factor), while train created dummy variable CB <- 1 * (C == "B"). Hence, you may replicate the result of predict.train with

predict(object = rf_gridsearch$finalModel, 
        newdata = model.matrix(~ A + B + C, dat[, 2:4])[, -1])

where

model.matrix(~ A + B + C, dat[, 2:4])[, -1]
#     A    B CB
# 1 1.3 44.5  0
# 2 4.4 50.1  0
# 3 5.5 23.7  1
# 4 6.7 89.2  1
# 5 8.1 10.5  1

How to input a caret trained random forest model into predict() and performance() functions?

If you read the vignette of performance:

it has to be declared which class label denotes the negative, and
which the positive class. Ideally, labels should be supplied as
ordered factor(s), the lower level corresponding to the negative
class, the upper level to the positive class. If the labels are
factors (unordered), numeric, logical or characters, ordering of the
labels is inferred from R's built-in < relation (e.g. 0 < 1, -1 < 1,
'a' < 'b', FALSE < TRUE).

In your case, when you provide rf_train_model$pred$pred, the upper level is still "control", so the best way is to make it TRUE / FALSE. Also, you should provide the actual label, rf_train_model$pred$obs, not the predicted label. See below for an example:

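library(caret)
library(ROCR)
set.seed(100)
df = data.frame(matrix(runif(100*100),ncol=100))
# Stored as a factor so train treats this as classification.
df$outcome = factor(ifelse(runif(100)>0.5,"case","control"))

df_train = df[1:80,]
df_test = df[81:100,]

# ctrl was not defined in the original snippet; this definition is an assumption.
# metric = "ROC" needs class probabilities and twoClassSummary, and
# rf_train_model$pred is only filled in when savePredictions is set.
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     savePredictions = "final")

rf_train_model <- train(outcome ~ ., data=df_train, 
                  method= "rf",
                  ntree = 1500, 
                  tuneGrid = data.frame(mtry = 33), 
                  trControl = ctrl, 
                  preProc=c("center","scale"), 
                  metric="ROC",
                  importance=TRUE)

levels(rf_train_model$pred$pred)
# [1] "case"    "control"

# Plot a precision-recall curve, treating positive_class as TRUE.
plotCurve = function(label,positive_class,prob){
pred = prediction(prob,label==positive_class)
perf <- performance(pred,"prec","rec")
plot(perf)
}

# Resampled (held-out) predictions stored in the train object
plotCurve(rf_train_model$pred$obs,"case",rf_train_model$pred$case)
# Test set: use the probability column of the positive class, selected by name
plotCurve(df_test$outcome,"case",predict(rf_train_model,df_test,type="prob")[,"case"])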
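The label ordering that the vignette describes can be checked in base R without fitting anything. The labels vector below is made up for illustration.

```r
# Hypothetical labels using the same two class names as above.
labels <- c("control", "case", "case", "control")

# Character labels are ordered with `<`, so "case" sorts first and
# would be taken as the negative class.
"case" < "control"
# [1] TRUE
levels(factor(labels))
# [1] "case"    "control"

# Converting to logical makes the positive class explicit:
# FALSE < TRUE, so "case" becomes the positive class.
is_case <- labels == "case"
is_case
# [1] FALSE  TRUE  TRUE FALSE
```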
