ROC Curve from Training Data in Caret

ROC curve from training data in caret

There is just the savePredictions = TRUE argument missing from ctrl (this also works for other resampling methods):

library(caret)
library(mlbench)
data(Sonar)
ctrl <- trainControl(method = "cv",
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     savePredictions = TRUE)
rfFit <- train(Class ~ ., data = Sonar,
               method = "rf", preProc = c("center", "scale"),
               trControl = ctrl)
library(pROC)
# Select a parameter setting
selectedIndices <- rfFit$pred$mtry == 2
# Plot:
plot.roc(rfFit$pred$obs[selectedIndices],
         rfFit$pred$M[selectedIndices])

ROC

Maybe I am missing something, but a small concern is that train always estimates slightly different AUC values than plot.roc and pROC::auc (absolute difference < 0.005), even though twoClassSummary uses pROC::auc to estimate the AUC. Edit: I assume this happens because the ROC reported by train is the average of the per-fold AUCs, whereas here we calculate a single AUC over the pooled predictions from all resamples.
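You can check this by comparing the pooled AUC with the average of the per-fold AUCs; a minimal sketch, reusing rfFit and selectedIndices from above (fold membership is stored in the Resample column):

preds <- rfFit$pred[selectedIndices, ]

# Pooled AUC over all resamples at once (what plot.roc above is based on)
auc(preds$obs, preds$M)

# Average of the per-fold AUCs, which is what train() reports as ROC
fold_aucs <- sapply(split(preds, preds$Resample),
                    function(d) as.numeric(auc(d$obs, d$M)))
mean(fold_aucs)  # compare with rfFit$results$ROC for mtry == 2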

Update: Since this is getting a bit of attention, here's a solution using plotROC::geom_roc() for ggplot2:

library(ggplot2)
library(plotROC)
ggplot(rfFit$pred[selectedIndices, ],
       aes(m = M, d = factor(obs, levels = c("R", "M")))) +
  geom_roc(hjust = -0.4, vjust = 1.5) +
  coord_equal()

ggplot_roc

ROC curve for the testing set using Caret package

You can use the following code:

library(MLeval)

# mod_fit and testing come from the question's caret workflow
pred <- predict(mod_fit, newdata = testing, type = "prob")
test1 <- evalm(data.frame(pred, testing$Class))

Sample Image

If you want to change the name "Group1" to something else, like GLM, you can use the following code:

test1 <- evalm(data.frame(pred, testing$Class, Group = "GLM"))

Sample Image

ROC curve of the testing dataset

As you have not provided any data, I am using the Sonar data. You can use the following code to make an ROC plot for the test data:

library(caret)
library(MLeval)

data(Sonar)

# Split data
a <- createDataPartition(Sonar$Class, p=0.8, list=FALSE)
train <- Sonar[ a, ]
test <- Sonar[ -a, ]

myControl = trainControl(
  method = "cv",
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = FALSE
)

model_knn = train(
  Class ~ .,
  data = train,
  method = "knn",
  metric = "ROC",
  tuneLength = 10,
  trControl = myControl)

pred <- predict(model_knn, newdata=test, type="prob")
ROC <- evalm(data.frame(pred, test$Class, Group = "KNN"))

Sample Image

How to compute ROC and AUC under ROC after training using caret in R?

An example of computing the AUC with randomForest and ROCR:

library(randomForest)

# predictor_data, target and sampsizes come from the original question
rf_output <- randomForest(x = predictor_data, y = target, importance = TRUE,
                          ntree = 10001, proximity = TRUE, sampsize = sampsizes)

library(ROCR)
predictions=as.vector(rf_output$votes[,2])
pred=prediction(predictions,target)

perf_AUC=performance(pred,"auc") #Calculate the AUC value
AUC=perf_AUC@y.values[[1]]

perf_ROC=performance(pred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(AUC, digits=5, scientific=FALSE)))

Or using pROC and caret:

library(caret)
library(pROC)
data(iris)

iris <- iris[iris$Species == "virginica" | iris$Species == "versicolor", ]
iris$Species <- factor(iris$Species) # setosa should be removed from factor

samples <- sample(NROW(iris), NROW(iris) * .5)
data.train <- iris[samples, ]
data.test <- iris[-samples, ]
forest.model <- train(Species ~., data.train)

result.predicted.prob <- predict(forest.model, data.test, type="prob") # Prediction

result.roc <- roc(data.test$Species, result.predicted.prob$versicolor) # Draw ROC curve.
plot(result.roc, print.thres="best", print.thres.best.method="closest.topleft")

result.coords <- coords(result.roc, "best", best.method="closest.topleft", ret=c("threshold", "accuracy"))
print(result.coords)#to get threshold and accuracy
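If you also want the AUC itself (and, for example, a confidence interval), both can be taken directly from the roc object:

auc(result.roc)     # area under the curve
ci.auc(result.roc)  # confidence interval for the AUC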

How to plot AUC ROC for different caret training models?

I assume that you want to show the ROC curves on the test set, unlike the question linked in the comment (ROC curve from training data in caret), which uses the training data.

The first thing to do will be to extract predictions on the test data (newdata=test_cars), in the form of probabilities (type="prob"):

predictions_nb <- predict(cars_nb, newdata=test_cars, type="prob")
predictions_glm <- predict(cars_glm, newdata=test_cars, type="prob")

This gives us a data.frame with the probabilities of belonging to class 0 or 1. Let's use only the probability of class 1:

predictions_nb <- predict(cars_nb, newdata=test_cars, type="prob")[,"1"]
predictions_glm <- predict(cars_glm, newdata=test_cars, type="prob")[,"1"]

Next I'll use the pROC package to create the ROC curves for the test data (disclaimer: I am the author of this package; there are other ways to achieve the result, but this is the one I am most familiar with):

library(pROC)
roc_nb <- roc(test_cars$am, predictions_nb)
roc_glm <- roc(test_cars$am, predictions_glm)

Finally, you can plot the curves. To get two curves with the pROC package, use the lines function to add the second ROC curve to the plot:

plot(roc_nb, col="green")
lines(roc_glm, col="blue")

To make it more readable you can add a legend:

legend("bottomright", col=c("green", "blue"), legend=c("NB", "GLM"), lty=1)

And with the AUC:

legend_nb <- sprintf("NB (AUC: %.2f)", auc(roc_nb))
legend_glm <- sprintf("GLM (AUC: %.2f)", auc(roc_glm))
legend("bottomright",
col=c("green", "blue"), lty=1,
legend=c(legend_nb, legend_glm))

ROC curves
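If you prefer a ggplot2 version, pROC also provides ggroc(), which takes a (named) list of roc objects; a minimal sketch:

library(ggplot2)
ggroc(list(NB = roc_nb, GLM = roc_glm)) +
  labs(colour = "Model")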

Plot ROC curve from Cross-Validation (training) data in R

As you already did, you can (a) enable savePredictions = TRUE in the trainControl parameter of caret::train and then (b) use the pred variable of the trained model object - which contains all predictions over all partitions and resamples - to compute whichever ROC curve you would like to look at. You now have multiple options for which ROC this can be, e.g.:

You could look at all predictions over all partitions and resamples at once:

plot(roc(predictor = modelObject$pred$CLASSNAME, response = modelObject$pred$obs))
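Note that modelObject$pred contains predictions for every tuning parameter setting that was evaluated (unless you trained with savePredictions = "final"), so you will usually want to restrict it to the finally selected parameters first. A small sketch:

# Keep only the predictions made with the winning tuning parameters
bestPred <- merge(modelObject$pred, modelObject$bestTune)
plot(roc(predictor = bestPred$CLASSNAME, response = bestPred$obs))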

Or you could do this over individual partitions and/or resamples (which is what you tried above). The following example computes the ROC curve per partition and resample, so with 10 partitions and 5 repeats this will result in 50 ROC curves:

library(plyr)
l_ply(split(modelObject$pred, modelObject$pred$Resample), function(d) {
  plot(roc(predictor = d$CLASSNAME, response = d$obs))
})

Depending on your data and model, the latter will show a certain amount of variance in the resulting ROC curves and AUC values. You can see the same variance in the AUC and SD values caret calculated for your individual partitions and resamples, so this results from your data and model and is correct.
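To look at that variance numerically rather than visually, you can compute one AUC per resample; a minimal sketch under the same assumptions as above (pROC loaded, CLASSNAME being your positive-class probability column):

fold_aucs <- sapply(split(modelObject$pred, modelObject$pred$Resample), function(d) {
  as.numeric(auc(response = d$obs, predictor = d$CLASSNAME))
})
round(fold_aucs, 3)                            # one AUC per partition/resample
c(mean = mean(fold_aucs), sd = sd(fold_aucs))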

BTW: I was using the pROC::roc function for the examples above, but you could use any suitable function here. And when using caret::train, obtaining the ROC works the same way regardless of the model type.

Plot ROC curve for bootstrapped caret model

First an explanation:

If you are not going to check how each possible hyperparameter combination predicted on each sample in each re-sample, you can set savePredictions = "final" in trainControl to save space:

fitControl <- trainControl(
  method = "boot632",
  number = 10,
  classProbs = TRUE,
  savePredictions = "final",
  summaryFunction = twoClassSummary
)

After running the model:

model <- train(
  Class ~ .,
  data = my_data,
  method = "xgbTree",
  trControl = fitControl,
  metric = "ROC"
)

The results of interest are in model$pred.

Here you can check how many samples were tested in each re-sample:

nrow(model$pred[model$pred$Resample == "Resample01",]) 
#83

caret always provides predictions for the rows not used in the model build.

nrow(my_data) #208

83/208 makes sense for the number of held-out (out-of-bag) test samples under boot632.

Now to build the ROC curve. You have several options here:

- average the probability for each sample and use that (this is usual for CV, since every sample is held out the same number of times, but it can be done with boot as well) - a rough sketch of this follows right after the list;

- plot everything as is, without averaging;

- plot one ROC per re-sample.
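For the first option, a rough sketch of the averaging could look like this (assuming model$pred from above, with column R holding the probability of class "R" and rowIndex identifying the original row):

# Average the probability of class "R" per original row across re-samples
avg <- aggregate(R ~ rowIndex + obs, data = model$pred, FUN = mean)

pROC::plot.roc(pROC::roc(response = avg$obs,
                         predictor = avg$R,
                         levels = c("M", "R")),
               lwd = 1.5)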

I will show you the second approach:

Create a data frame of class probabilities and true outcomes:

for_lift = data.frame(Class = model$pred$obs,  xgbTree = model$pred$R)

Plot the ROC:

pROC::plot.roc(pROC::roc(response = for_lift$Class,
                         predictor = for_lift$xgbTree,
                         levels = c("M", "R")),
               lwd = 1.5)

Sample Image

You can also do this with ggplot2. To do so, I find it easiest to make a lift object using the caret function lift:

lift_obj = lift(Class ~ xgbTree, data = for_lift, class = "R")

The class argument specifies which class the probabilities refer to (here "R").

library(ggplot2)

ggplot(lift_obj$data) +
  geom_line(aes(1 - Sp, Sn, color = liftModelVar)) +
  scale_color_discrete(guide = guide_legend(title = "method"))

Sample Image

How to plot ROC curves for every cross-validation fold using Caret

library(mlbench)
library(caret)
library(ggplot2)
set.seed(998)

# Prepare data ------------------------------------------------------------

data(Sonar)
my_data <- Sonar

# Cross Validation Definition ---------------------------------------------------

fitControl <- trainControl(
  method = "cv",
  number = 10,
  classProbs = TRUE,
  savePredictions = TRUE,
  summaryFunction = twoClassSummary
)

# Training with Random Forest ----------------------------------------------------------------

model <- train(
  Class ~ .,
  data = my_data,
  method = "rf",
  trControl = fitControl,
  metric = "ROC"
)

# One row per held-out prediction: true class, probability of class "R", and fold
for_lift <- data.frame(Class = model$pred$obs,
                       rf = model$pred$R,
                       resample = model$pred$Resample)

# Build the ROC/lift data separately for each fold
lift_df <- data.frame()
for (fold in unique(for_lift$resample)) {
  fold_df <- dplyr::filter(for_lift, resample == fold)
  lift_obj_data <- lift(Class ~ rf, data = fold_df, class = "R")$data
  lift_obj_data$fold <- fold
  lift_df <- rbind(lift_df, lift_obj_data)
}

# Lift object over all folds combined (not used in the per-fold plot below)
lift_obj <- lift(Class ~ rf, data = for_lift, class = "R")

# Plot ROC ----------------------------------------------------------------

ggplot(lift_df) +
  geom_line(aes(1 - Sp, Sn, color = fold)) +
  scale_color_discrete(guide = guide_legend(title = "Fold"))

Plot

To calculate a performance metric per fold - here accuracy, using the model trained above (a per-fold AUC is sketched after the output):


library(plyr)
library(MLmetrics)
ddply(model$pred, "Resample", summarise,
      accuracy = Accuracy(pred, obs))

Output:

   Resample  accuracy
1    Fold01 0.8253968
2    Fold02 0.8095238
3    Fold03 0.8000000
4    Fold04 0.8253968
5    Fold05 0.8095238
6    Fold06 0.8253968
7    Fold07 0.8333333
8    Fold08 0.8253968
9    Fold09 0.9841270
10   Fold10 0.7936508
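A per-fold AUC can be obtained with the same ddply pattern; a minimal sketch using pROC, where column R holds the probability of class "R":

library(pROC)
ddply(model$pred, "Resample", summarise,
      auc = as.numeric(auc(response = obs, predictor = R)))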

ROC curve from train/test set in caret R package

It's hard to know for sure without a reproducible example, but presumably your response variable bin.frail isn't numeric. For example, it might be coded using letters (e.g., "Y", "N"), or with numbers that are being stored as a factor. You could check this using is.numeric(whas$bin.frail).

As a side note, in your call to roc() it looks like mod1pred is being created from your training data whereas testing$bin.frail is from your test data. You could correct this by adding newdata = testing to your call to predict when creating mod1pred.
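Putting both points together, the corrected calls might look roughly like this (a sketch only, reusing the mod1, testing and bin.frail names from the question and assuming class probabilities are available):

library(pROC)
# Predict on the *test* data, as class probabilities
mod1pred <- predict(mod1, newdata = testing, type = "prob")[, 2]  # probability of the second class
roc_obj <- roc(testing$bin.frail, mod1pred)
plot(roc_obj)
auc(roc_obj)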

ROC curve for Training set and Test set for each fold of cross validation in Caret

It is true that the documentation is not at all clear regarding the contents of rfmodel$pred - I would bet that the predictions included are for the fold used as a test set, but I cannot point to any evidence in the docs. Nevertheless, and regardless of this, you are still missing some points in the way you are trying to get the ROC.

First, let's isolate rfmodel$pred in a separate dataframe for easier handling:

dd <- rfmodel$pred

nrow(dd)
# 450

Why 450 rows? It is because you have tried 3 different parameter sets (in your case just 3 different values for mtry):

rfmodel$results
# output:
  mtry Accuracy Kappa AccuracySD    KappaSD
1    2     0.96  0.94 0.04346135 0.06519202
2    3     0.96  0.94 0.04346135 0.06519202
3    4     0.96  0.94 0.04346135 0.06519202

and 150 rows X 3 settings = 450.

Let's have a closer look at the contents of rfmodel$pred:

head(dd)

# result:
    pred    obs setosa versicolor virginica rowIndex mtry Resample
1 setosa setosa  1.000      0.000         0        2    2    Fold1
2 setosa setosa  1.000      0.000         0        3    2    Fold1
3 setosa setosa  1.000      0.000         0        6    2    Fold1
4 setosa setosa  0.998      0.002         0       24    2    Fold1
5 setosa setosa  1.000      0.000         0       33    2    Fold1
6 setosa setosa  1.000      0.000         0       38    2    Fold1
  • Column obs contains the true values
  • The three columns setosa, versicolor, and virginica contain the respective probabilities calculated for each class, and they sum up to 1 for each row
  • Column pred contains the final prediction, i.e. the class with the maximum probability from the three columns mentioned above

If this were the whole story, your way of plotting the ROC would be OK, i.e.:

selectedIndices <- rfmodel$pred$Resample == "Fold1"
plot.roc(rfmodel$pred$obs[selectedIndices],
         rfmodel$pred$setosa[selectedIndices])

But this is not the whole story (the mere existence of 450 rows instead of just 150 should have given a hint already): notice the existence of a column named mtry; indeed, rfmodel$pred includes the results for all runs of cross-validation (i.e. for all the parameter settings):

tail(dd)
# result:
         pred       obs setosa versicolor virginica rowIndex mtry Resample
445 virginica virginica      0      0.004     0.996      112    4    Fold5
446 virginica virginica      0      0.000     1.000      113    4    Fold5
447 virginica virginica      0      0.020     0.980      115    4    Fold5
448 virginica virginica      0      0.000     1.000      118    4    Fold5
449 virginica virginica      0      0.394     0.606      135    4    Fold5
450 virginica virginica      0      0.000     1.000      140    4    Fold5

This is the ultimate reason why your selectedIndices calculation is not correct; it should also include a specific choice of mtry, otherwise the ROC does not make any sense, since it "aggregates" more than one model:

selectedIndices <- rfmodel$pred$Resample == "Fold1" & rfmodel$pred$mtry == 2
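Rather than hard-coding mtry == 2, you can also take the winning value from the fitted model itself; a small sketch:

selectedIndices <- rfmodel$pred$Resample == "Fold1" &
                   rfmodel$pred$mtry == rfmodel$bestTune$mtry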

--

As I said in the beginning, I bet that the predictions in rfmodel$pred are for the fold used as a test set; indeed, if we compute the accuracies manually, they coincide with the ones reported in rfmodel$results shown above (0.96 for all 3 settings), which we know are for the fold used as a test set (arguably, the respective training accuracies are 1.0):

for (i in 2:4) {  # mtry values in {2, 3, 4}

  # each of the 5 folds holds out 30 of the 150 rows, hence the division by 30
  acc = (length(which(dd$pred == dd$obs & dd$mtry == i & dd$Resample == 'Fold1')) / 30 +
         length(which(dd$pred == dd$obs & dd$mtry == i & dd$Resample == 'Fold2')) / 30 +
         length(which(dd$pred == dd$obs & dd$mtry == i & dd$Resample == 'Fold3')) / 30 +
         length(which(dd$pred == dd$obs & dd$mtry == i & dd$Resample == 'Fold4')) / 30 +
         length(which(dd$pred == dd$obs & dd$mtry == i & dd$Resample == 'Fold5')) / 30
         ) / 5

  print(acc)
}

# result:
[1] 0.96
[1] 0.96
[1] 0.96
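The same check can be written more compactly by first computing the per-fold accuracy and then averaging over folds; a minimal sketch:

acc_by_fold <- aggregate(data.frame(accuracy = dd$pred == dd$obs),
                         by = list(mtry = dd$mtry, Resample = dd$Resample),
                         FUN = mean)
aggregate(accuracy ~ mtry, data = acc_by_fold, FUN = mean)  # should again give 0.96 for every mtry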

