Different Results with randomForest() and caret's randomForest (method = "rf")

Using the formula interface in train converts factors to dummy variables. To compare results from caret with randomForest, you should use the non-formula interface.

In your case, you should also provide a seed through the seeds argument of trainControl to get the same result as in randomForest.

The training section of the caret webpage has some notes on reproducibility that explain how to use seeds.

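library(randomForest)

# Seed the direct randomForest fit with the same value given to trainControl
set.seed(1)
rf.model <- randomForest(uptake ~ .,
                         data = CO2,
                         ntree = 50,
                         nodesize = 5,
                         mtry = 2,
                         importance = TRUE,
                         metric = "RMSE")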

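library(caret)

# Non-formula interface: predictors (all columns except uptake) and outcome
# are passed separately. allowParallel is a trainControl argument.
caret.oob.model <- train(CO2[, -5], CO2$uptake,
                         method = "rf",
                         ntree = 50,
                         tuneGrid = data.frame(mtry = 2),
                         nodesize = 5,
                         importance = TRUE,
                         metric = "RMSE",
                         trControl = trainControl(method = "oob",
                                                  seeds = 1,
                                                  allowParallel = FALSE))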

If you are doing resampling, you should provide seeds for each resampling iteration and an additional one for the final model. Examples in ?trainControl show how to create them.

In the following example, the last seed is for the final model and I set it to 1.

seeds <- as.vector(c(1:26), mode = "list")

# For the final model
seeds[[26]] <- 1

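# 25 bootstrap resamples (the default) plus the final model use the 26 seeds.
# allowParallel is a trainControl argument.
caret.boot.model <- train(CO2[, -5], CO2$uptake,
                          method = "rf",
                          ntree = 50,
                          tuneGrid = data.frame(mtry = 2),
                          nodesize = 5,
                          importance = TRUE,
                          metric = "RMSE",
                          trControl = trainControl(method = "boot",
                                                   seeds = seeds,
                                                   allowParallel = FALSE))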
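
If you change the number of resamples or tune over more than one mtry value, the seeds list has to change shape too. As described in ?trainControl, it needs B + 1 elements for B resamples: the first B are integer vectors with one seed per candidate model, and the last is a single integer for the final model. Here is a minimal base-R sketch of building such a list. The helper make_seeds and its arguments are hypothetical names used for illustration, not part of caret.

```r
# Hypothetical helper (not a caret function): builds a seeds list in the
# shape that ?trainControl describes for the seeds argument.
make_seeds <- function(n_resamples, n_models, final_seed = 1L, base_seed = 123L) {
  set.seed(base_seed)  # makes the generated seeds themselves reproducible
  seeds <- vector(mode = "list", length = n_resamples + 1L)
  for (i in seq_len(n_resamples)) {
    # one integer seed per candidate model in resample i
    seeds[[i]] <- sample.int(1000000L, n_models)
  }
  # last element: a single seed for the final model fit
  seeds[[n_resamples + 1L]] <- final_seed
  seeds
}

# 25 bootstrap resamples and a single mtry value, as in the example above
seeds <- make_seeds(n_resamples = 25, n_models = 1)
length(seeds)
```

The result can be passed as trainControl(method = "boot", seeds = seeds) in the same way as the hand-built list.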

By correctly using the non-formula interface with caret and setting the seed in trainControl, you will get the same results in all three models.


Problem with type = 'prob' argument in caret::train package

This works more smoothly with terra::predict, but with raster::predict you can use the index argument to specify which output variable(s) you want.

predict_p_rf <- predict(image.x, model_rf, type = 'prob', index=1:3)

See ?raster::predict

The data represent the predicted probability of belonging to a particular category (0 is lowest probability, 1 is highest).

Error when using predict() on a randomForest object trained with caret's train() using formula

First, almost never use the $finalModel object for prediction. Use predict.train. This is one good example of why.

There is some inconsistency between how some functions (including randomForest and train) handle dummy variables. Most functions in R that use the formula method will convert factor predictors to dummy variables because their models require numerical representations of the data. The exceptions to this are tree- and rule-based models (that can split on categorical predictors), naive Bayes, and a few others.

So randomForest will not create dummy variables when you use randomForest(y ~ ., data = dat), but train (and most other functions) will when called as train(y ~ ., data = dat).
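
To see what this expansion does to the predictor names, here is a minimal base-R sketch using the built-in CO2 data. It uses model.matrix as a stand-in, on the assumption that a formula method expands factors in a similar way; the exact column names train produces may differ.

```r
# CO2 ships with base R (datasets package); drop its extra classes
dat <- as.data.frame(CO2)

# Predictors as the non-formula interface receives them: factors stay factors
x_raw <- dat[, -5]
str(x_raw)

# Predictors after formula-style expansion (assumed stand-in for a formula
# method): factors become numeric dummy/contrast columns
x_dummy <- model.matrix(uptake ~ ., data = dat)[, -1]  # drop the intercept
colnames(x_dummy)

# Factor names that no longer exist as columns after the expansion
setdiff(names(x_raw), colnames(x_dummy))
```

A model fitted on the expanded columns expects those expanded names at prediction time, so handing it a data frame with the original factor columns fails.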

The error occurs because fuelType is a factor. The dummy variables created by train don't have the same names so predict.randomForest can't find them.

Using the non-formula method with train will pass the factor predictors to randomForest and everything will work.
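
A minimal sketch of both recommendations together, assuming the caret and randomForest packages are installed. The data from the question is not available here, so the built-in CO2 data is used as a substitute, and the object names are illustrative.

```r
library(caret)  # method = "rf" also needs the randomForest package

dat <- as.data.frame(CO2)
x <- dat[, -5]    # predictors, factors left as factors
y <- dat$uptake   # numeric outcome

set.seed(1)
fit <- train(x, y,
             method = "rf",
             ntree = 50,
             tuneGrid = data.frame(mtry = 2),
             trControl = trainControl(method = "oob"))

# Predict through the train object rather than through fit$finalModel
new_obs <- head(x)
preds <- predict(fit, newdata = new_obs)
preds
```

Calling predict on the train object dispatches to predict.train, which applies the same handling of the predictors that was used during fitting.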


Use the non-formula method with train if you want the same levels, or use predict.train.

