Additional Metrics in Caret - Ppv, Sensitivity, Specificity

Additional metrics in caret - PPV, sensitivity, specificity

Caret already has summary functions to output all the metrics you mention:

defaultSummary outputs Accuracy and Kappa

twoClassSummary outputs AUC (area under the ROC curve - see last line of answer), sensitivity and specificity

prSummary outputs precision and recall

in order to get combined metrics you can write your own summary function which combines the outputs of these three:

library(caret)
MySummary  <- function(data, lev = NULL, model = NULL){
  a1 <- defaultSummary(data, lev, model)
  b1 <- twoClassSummary(data, lev, model)
  c1 <- prSummary(data, lev, model)
  out <- c(a1, b1, c1)
  out}

lets try on the Sonar data set:

library(mlbench)
data("Sonar")

when defining the train control it is important to set classProbs = TRUE since some of these metrics (ROC and prAUC) can not be calculated based on predicted class but based on the predicted probabilities.

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     savePredictions = TRUE,
                     summaryFunction = MySummary,
                     classProbs = TRUE)

Now fit the model of your choice:

mod_fit <- train(Class ~.,
                 data = Sonar,
                 method = "rf",
                 trControl = ctrl)

mod_fit$results
#output
  mtry  Accuracy     Kappa       ROC      Sens      Spec       AUC Precision    Recall         F AccuracySD   KappaSD
1    2 0.8364069 0.6666364 0.9454798 0.9280303 0.7333333 0.8683726 0.8121087 0.9280303 0.8621526 0.10570484 0.2162077
2   31 0.8179870 0.6307880 0.9208081 0.8840909 0.7411111 0.8450612 0.8074942 0.8840909 0.8374326 0.06076222 0.1221844
3   60 0.8034632 0.6017979 0.9049242 0.8659091 0.7311111 0.8332068 0.7966889 0.8659091 0.8229330 0.06795824 0.1369086
       ROCSD     SensSD    SpecSD      AUCSD PrecisionSD   RecallSD        FSD
1 0.04393947 0.05727927 0.1948585 0.03410854  0.12717667 0.05727927 0.08482963
2 0.04995650 0.11053858 0.1398657 0.04694993  0.09075782 0.11053858 0.05772388
3 0.04965178 0.12047598 0.1387580 0.04820979  0.08951728 0.12047598 0.06715206

in this output
ROC is in fact the area under the ROC curve - usually called AUC

and
AUC is the area under the precision-recall curve across all cutoffs.

How to use sum of specificity and sensitivity metric as a summary metric for train in R caret?

Here is a summary function which will use the the sum of Sens + Spec as selection metric:

youdenSumary <- function(data, lev = NULL, model = NULL){
  if (length(lev) > 2) {
    stop(paste("Your outcome has", length(lev), "levels. The joudenSumary() function isn't appropriate."))
  }
  if (!all(levels(data[, "pred"]) == lev)) {
    stop("levels of observed and predicted data do not match")
  }
  Sens <- caret::sensitivity(data[, "pred"], data[, "obs"], lev[1]) 
  Spec <- caret::specificity(data[, "pred"], data[, "obs"], lev[2])
  j <- Sens + Spec
  out <- c(j, Spec, Sens)
  names(out) <- c("j", "Spec", "Sens")
 out
}

To understand why it is defined as such please read this chapter from the caret book. Some answers that might be helpful here on SO are:

Custom Performance Function in caret Package using predicted Probability

Additional metrics in caret - PPV, sensitivity, specificity

Example:

library(caret)
library(mlbench)
data(Sonar)

fitControl <- trainControl(method = "cv",
                           number = 5,
                           summaryFunction = youdenSumary)
fit <-  train(Class ~.,
              data = Sonar,
              method = "rpart", 
              metric = "j" ,
              tuneLength = 5,
              trControl = fitControl)

fit
#output
CART 

208 samples
 60 predictor
  2 classes: 'M', 'R' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 167, 166, 166, 166, 167 
Resampling results across tuning parameters:

  cp          j         Spec       Sens     
  0.00000000  1.394980  0.6100000  0.7849802
  0.01030928  1.394980  0.6100000  0.7849802
  0.05154639  1.387708  0.6300000  0.7577075
  0.06701031  1.398629  0.6405263  0.7581028
  0.48453608  1.215457  0.3684211  0.8470356

j was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.06701031.

How to set a ppv in caret for random forest in r?

We define a function to calculate PPV and return the results with a name:

PPV <- function (data,lev = NULL,model = NULL) {
   value <- posPredValue(data$pred,data$obs, positive = lev[1])
   c(PPV=value)
}

Let's say we have the following data:

library(randomForest)
library(caret)
data=iris
data$Species = ifelse(data$Species == "versicolor","versi","others")
trn = sample(nrow(iris),100)

Then we train by specifying PPV to be the metric:

mdl <- train(Species ~ ., data = data[trn,],
             method = "rf",
             metric = "PPV",
             trControl = trainControl(summaryFunction = PPV, 
                                      classProbs = TRUE))

Random Forest 

100 samples
  4 predictor
  2 classes: 'others', 'versi' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ... 
Resampling results across tuning parameters:

  mtry  PPV      
  2     0.9682811
  3     0.9681759
  4     0.9648426

PPV was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

Now you can see it is trained on PPV. However you cannot force the training to achieve a PPV of 0.9.. It really depends on the data, if your independent variables have no predictive power, it will not improve however much you train it right?

Increasing specificity in Caret package in R

Inside of predict function you need to specify de type='prob' parameter. This allows you to get all the probabilities and choose the threshold of your preference.

model_pred <- predict(model_default, newdata = test_default, type = "prob")

Then, you can manually make a classification. For example:

model_pred_class <- ifelse(model_pred < 0.2, "No", "Yes")

How does caret calculate sensitivity and specificity in resamples?

Please be aware that data(GermanCredit) does not have the same variables as the ones you save in form, it would help for future questions that you post a reproducible example. Also, it would help to use set.seed().

Nevertheless, the issue here is that you need to take in account of the mtry, i.e. the number of "Randomly Selected Predictors" used in the random forest model. See documentation and code here.

I adjusted the GermanCredit so that everyone can run it as is:

library(caret)
data("GermanCredit")
form = as.formula('Class~Amount+SavingsAccountBonds.100.to.500+SavingsAccountBonds.lt.100+SavingsAccountBonds.500.to.1000+
SavingsAccountBonds.lt.100+SavingsAccountBonds.gt.1000+SavingsAccountBonds.Unknown+
                  InstallmentRatePercentage+Age+Housing.ForFree+Housing.Own+Housing.Rent+NumberExistingCredits')
train.control <- trainControl(method="cv", 
                              number=5,
                              summaryFunction = twoClassSummary,
                              classProbs = TRUE,
                              savePredictions='all')

set.seed(100)
rf <- train(form, data=GermanCredit,  method = 'rf',
           metric = 'ROC', trControl=train.control)

If we check the rf we can see that the final value of mtry used in the model was mtry = 2.

> rf
Random Forest 

1000 samples
  12 predictor
   2 classes: 'Bad', 'Good' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 800, 800, 800, 800, 800 
Resampling results across tuning parameters:

  mtry  ROC        Sens        Spec     
   2    0.6465714  0.06333333  0.9842857
   7    0.6413214  0.31333333  0.8571429
  12    0.6358214  0.31666667  0.8385714

ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

Therefore by filtering mtry = 2 in the rf$pred you will get the expected result.

resamp.1 <- rf$pred %>% filter(Resample=='Fold1' & mtry == 2)
cm <- confusionMatrix(resamp.1$pred, resamp.1$obs)
print(cm) 
Confusion Matrix and Statistics

          Reference
Prediction Bad Good
      Bad    7    5
      Good  53  135

               Accuracy : 0.71            
                 95% CI : (0.6418, 0.7718)
    No Information Rate : 0.7             
    P-Value [Acc > NIR] : 0.4123          

                  Kappa : 0.1049          
 Mcnemar's Test P-Value : 6.769e-10       

            Sensitivity : 0.1167          
            Specificity : 0.9643          
         Pos Pred Value : 0.5833          
         Neg Pred Value : 0.7181          
             Prevalence : 0.3000          
         Detection Rate : 0.0350          
   Detection Prevalence : 0.0600          
      Balanced Accuracy : 0.5405          

       'Positive' Class : Bad  

 cm$byClass[1:2] == rf$resample[1,2:3]
  Sens Spec
  TRUE TRUE

EDIT:

You can also control this by checking rf$resampledCM, and see the number of observations in the different cells for different mtry and folds.

Optimising caret for sensitivity still seems to optimise for ROC

You over-complicated things.

Two class summary already contains Sensitivity as output. The column name "Sens". It is enough to specify:

metric = "Sens" to train and
summaryFunction = twoClassSummary to trainControl

Full example:

library(caret)
library(mlbench)
data(Sonar)

rpart_caret_fit <- train(Class~., 
                         data = Sonar,
                         method = "rpart", 
                         tuneLength = 20, 
                         metric = "Sens", 
                         maximize = TRUE,
                         trControl = trainControl(classProbs = TRUE,
                                                  method = "cv",
                                                  number = 5,
                                                  summaryFunction = twoClassSummary))

rpart_caret_fit
CART 

208 samples
 60 predictor
  2 classes: 'M', 'R' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 167, 166, 166, 166, 167 
Resampling results across tuning parameters:

  cp         ROC        Sens       Spec     
  0.0000000  0.7088298  0.7023715  0.7210526
  0.0255019  0.7075400  0.7292490  0.6684211
  0.0510038  0.7105388  0.7758893  0.6405263
  0.0765057  0.6904202  0.7841897  0.6294737
  0.1020076  0.7104681  0.8114625  0.6094737
  0.1275095  0.7104681  0.8114625  0.6094737
  0.1530114  0.7104681  0.8114625  0.6094737
  0.1785133  0.7104681  0.8114625  0.6094737
  0.2040152  0.7104681  0.8114625  0.6094737
  0.2295171  0.7104681  0.8114625  0.6094737
  0.2550190  0.7104681  0.8114625  0.6094737
  0.2805209  0.7104681  0.8114625  0.6094737
  0.3060228  0.7104681  0.8114625  0.6094737
  0.3315247  0.7104681  0.8114625  0.6094737
  0.3570266  0.7104681  0.8114625  0.6094737
  0.3825285  0.7104681  0.8114625  0.6094737
  0.4080304  0.7104681  0.8114625  0.6094737
  0.4335323  0.7104681  0.8114625  0.6094737
  0.4590342  0.6500135  0.8205534  0.4794737
  0.4845361  0.6500135  0.8205534  0.4794737

Sens was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.4845361.

~~Additionally I do not think you can specify control = rpart.control(maxdepth = 6) to caret train.~~ This is not correct - caret passes any parameters forward using .... So you can pass pretty much any argument.

If you are looking to write you own summary functions here is an example on the "Sens":

Sensitivity.fc <- function (data, lev = NULL, model = NULL) { #every summary function takes these three arguments
  obs <- data[, "obs"] #these are the real values - always in column name "obs" in data
  cls <- levels(obs) #there are the levels - you can also pass this to lev argument 
  probs <- data[, cls[2]] #these are the probabilities for the 2nd class - useful only if prob = TRUE
  class <- as.factor(ifelse(probs > 0.5, cls[2], cls[1])) #calculate the classes based on some probability treshold
  Sensitivity <- caret::sensitivity(class, obs) #do the calculation - I was lazy so I used a built in function to do it for me
  names(Sensitivity) <- "Sens" #the name of the output
  Sensitivity
}

and now:

rpart_caret_fit <- train(Class~., 
                         data = Sonar,
                         method = "rpart", 
                         tuneLength = 20, 
                         metric = "Sens", #because of this line: names(Sensitivity) <- "Sens" 
                         maximize = TRUE,
                         trControl = trainControl(classProbs = TRUE,
                                                  method = "cv",
                                                  number = 5,
                                                  summaryFunction = Sensitivity.fc))

Lets check if both produce the same results:

set.seed(1)
fit_sens <- train(Class~., 
                  data = Sonar,
                  method = "rpart", 
                  tuneLength = 20, 
                  metric = "Sens", 
                  maximize = TRUE,
                  trControl = trainControl(classProbs = TRUE,
                                           method = "cv",
                                           number = 5,
                                           summaryFunction = Sensitivity.fc))

set.seed(1)
fit_sens2 <- train(Class~., 
                   data = Sonar,
                   method = "rpart", 
                   tuneLength = 20, 
                   metric = "Sens", 
                   maximize = TRUE,
                   trControl = trainControl(classProbs = TRUE,
                                            method = "cv",
                                            number = 5,
                                            summaryFunction = twoClassSummary))

all.equal(fit_sens$results[c("cp", "Sens")],
          fit_sens2$results[c("cp", "Sens")])  

TRUE

all.equal(fit_sens$bestTune,
          fit_sens2$bestTune)
TRUE

Train on Specificity

Try specifiyng the summaryFunction argument to twoClassSummary inside trainControl along with classProbs = TRUE , and metric = "Spec" inside train():

control <- trainControl(method="cv", 
                        number=10, 
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE) 

fit.svm <- train(target_var ~., 
                 data=dataset,
                 method="svmRadial", 
                 metric="Spec", 
                 trControl=control)

Additional Metrics in Caret - Ppv, Sensitivity, Specificity