Additional Metrics in caret - PPV, Sensitivity, Specificity

Additional metrics in caret - PPV, sensitivity, specificity

Caret already has summary functions to output all the metrics you mention:

defaultSummary outputs Accuracy and Kappa

twoClassSummary outputs AUC (area under the ROC curve - see last line of answer), sensitivity and specificity

prSummary outputs precision and recall
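To see what these functions expect, here is a minimal sketch: every summary function receives a data frame with obs and pred factor columns, and twoClassSummary/prSummary additionally need one probability column per class level (they also rely on the pROC and MLmetrics packages, respectively). The toy predictions below are invented purely for illustration:

library(caret)

set.seed(1)
lev <- c("M", "R")
toy <- data.frame(M = runif(20))                   # probability of class "M"
toy$R    <- 1 - toy$M                              # probability of class "R"
toy$obs  <- factor(ifelse(toy$M + rnorm(20, sd = 0.2) > 0.5, "M", "R"), levels = lev)
toy$pred <- factor(ifelse(toy$M > 0.5, "M", "R"), levels = lev)

defaultSummary(toy, lev = lev)   # Accuracy, Kappa
twoClassSummary(toy, lev = lev)  # ROC, Sens, Spec (needs pROC)
prSummary(toy, lev = lev)        # AUC (precision-recall), Precision, Recall, F (needs MLmetrics)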

In order to get combined metrics, you can write your own summary function which combines the outputs of these three:

library(caret)

MySummary <- function(data, lev = NULL, model = NULL){
  a1 <- defaultSummary(data, lev, model)
  b1 <- twoClassSummary(data, lev, model)
  c1 <- prSummary(data, lev, model)
  out <- c(a1, b1, c1)
  out
}

Let's try it on the Sonar data set:

library(mlbench)
data("Sonar")

When defining the train control, it is important to set classProbs = TRUE, since some of these metrics (ROC and prAUC) cannot be calculated from the predicted classes; they require the predicted probabilities.

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     savePredictions = TRUE,
                     summaryFunction = MySummary,
                     classProbs = TRUE)

Now fit the model of your choice:

mod_fit <- train(Class ~ .,
                 data = Sonar,
                 method = "rf",
                 trControl = ctrl)

mod_fit$results
#output
mtry Accuracy Kappa ROC Sens Spec AUC Precision Recall F AccuracySD KappaSD
1 2 0.8364069 0.6666364 0.9454798 0.9280303 0.7333333 0.8683726 0.8121087 0.9280303 0.8621526 0.10570484 0.2162077
2 31 0.8179870 0.6307880 0.9208081 0.8840909 0.7411111 0.8450612 0.8074942 0.8840909 0.8374326 0.06076222 0.1221844
3 60 0.8034632 0.6017979 0.9049242 0.8659091 0.7311111 0.8332068 0.7966889 0.8659091 0.8229330 0.06795824 0.1369086
ROCSD SensSD SpecSD AUCSD PrecisionSD RecallSD FSD
1 0.04393947 0.05727927 0.1948585 0.03410854 0.12717667 0.05727927 0.08482963
2 0.04995650 0.11053858 0.1398657 0.04694993 0.09075782 0.11053858 0.05772388
3 0.04965178 0.12047598 0.1387580 0.04820979 0.08951728 0.12047598 0.06715206

In this output, ROC is in fact the area under the ROC curve (usually called AUC), while AUC is the area under the precision-recall curve across all cutoffs.
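To double-check this distinction, here is a hedged sketch (assuming the mod_fit object from above; savePredictions = TRUE keeps the hold-out predictions in mod_fit$pred, and these pooled values will be close to, but not identical to, the per-resample averages in mod_fit$results):

library(dplyr)

best_preds <- mod_fit$pred %>% filter(mtry == mod_fit$bestTune$mtry)
twoClassSummary(best_preds, lev = levels(best_preds$obs))  # "ROC" = area under the ROC curve
prSummary(best_preds, lev = levels(best_preds$obs))        # "AUC" = area under the precision-recall curve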

How to use sum of specificity and sensitivity metric as a summary metric for train in R caret?

Here is a summary function which will use the sum of Sens + Spec as the selection metric:

youdenSummary <- function(data, lev = NULL, model = NULL){
  if (length(lev) > 2) {
    stop(paste("Your outcome has", length(lev), "levels. The youdenSummary() function isn't appropriate."))
  }
  if (!all(levels(data[, "pred"]) == lev)) {
    stop("levels of observed and predicted data do not match")
  }
  Sens <- caret::sensitivity(data[, "pred"], data[, "obs"], lev[1])
  Spec <- caret::specificity(data[, "pred"], data[, "obs"], lev[2])
  j <- Sens + Spec
  out <- c(j, Spec, Sens)
  names(out) <- c("j", "Spec", "Sens")
  out
}

To understand why it is defined this way, please read the corresponding chapter of the caret book. Some answers on SO that might be helpful here are:

Custom Performance Function in caret Package using predicted Probability

Additional metrics in caret - PPV, sensitivity, specificity

Example:

library(caret)
library(mlbench)
data(Sonar)

fitControl <- trainControl(method = "cv",
                           number = 5,
                           summaryFunction = youdenSummary)
fit <- train(Class ~ .,
             data = Sonar,
             method = "rpart",
             metric = "j",
             tuneLength = 5,
             trControl = fitControl)

fit
#output
CART

208 samples
60 predictor
2 classes: 'M', 'R'

No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 167, 166, 166, 166, 167
Resampling results across tuning parameters:

cp j Spec Sens
0.00000000 1.394980 0.6100000 0.7849802
0.01030928 1.394980 0.6100000 0.7849802
0.05154639 1.387708 0.6300000 0.7577075
0.06701031 1.398629 0.6405263 0.7581028
0.48453608 1.215457 0.3684211 0.8470356

j was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.06701031.

How to set a PPV in caret for random forest in R?

We define a function to calculate PPV and return the results with a name:

PPV <- function(data, lev = NULL, model = NULL) {
  value <- posPredValue(data$pred, data$obs, positive = lev[1])
  c(PPV = value)
}

Let's say we have the following data:

library(randomForest)
library(caret)

data <- iris
data$Species <- factor(ifelse(data$Species == "versicolor", "versi", "others"))  # caret expects a factor outcome
trn <- sample(nrow(iris), 100)

Then we train by specifying PPV to be the metric:

mdl <- train(Species ~ ., data = data[trn, ],
             method = "rf",
             metric = "PPV",
             trControl = trainControl(summaryFunction = PPV,
                                      classProbs = TRUE))

Random Forest

100 samples
4 predictor
2 classes: 'others', 'versi'

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ...
Resampling results across tuning parameters:

mtry PPV
2 0.9682811
3 0.9681759
4 0.9648426

PPV was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

Now you can see it is trained on PPV. However, you cannot force the training to achieve a PPV of 0.9; it really depends on the data. If your independent variables have no predictive power, the PPV will not improve no matter how much you train.

Increasing specificity in Caret package in R

Inside the predict function you need to specify the type = "prob" parameter. This gives you the class probabilities, so you can choose whatever threshold you prefer.

model_pred <- predict(model_default, newdata = test_default, type = "prob")

Then, you can manually make a classification. For example:

model_pred_class <- ifelse(model_pred[, "Yes"] < 0.2, "No", "Yes")  # model_pred has one probability column per class level
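As a hedged follow-up (the outcome column test_default$default and the levels "No"/"Yes" are assumptions here, since they are not shown in the question), you can then check how the chosen threshold affects sensitivity and specificity:

# assumed: test_default$default is a factor with levels "No" and "Yes"
model_pred_class <- factor(model_pred_class, levels = c("No", "Yes"))
caret::confusionMatrix(model_pred_class, test_default$default)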

How does caret calculate sensitivity and specificity in resamples?

Please be aware that data(GermanCredit) does not have the same variables as the ones you use in form; for future questions it would help if you posted a reproducible example. It would also help to use set.seed().

Nevertheless, the issue here is that you need to take mtry into account, i.e. the number of "Randomly Selected Predictors" used in the random forest model (see the caret documentation and code for details).

I adjusted the GermanCredit example so that everyone can run it as is:

library(caret)
data("GermanCredit")
form <- as.formula('Class ~ Amount + SavingsAccountBonds.100.to.500 + SavingsAccountBonds.lt.100 +
                    SavingsAccountBonds.500.to.1000 + SavingsAccountBonds.lt.100 + SavingsAccountBonds.gt.1000 +
                    SavingsAccountBonds.Unknown + InstallmentRatePercentage + Age + Housing.ForFree +
                    Housing.Own + Housing.Rent + NumberExistingCredits')

train.control <- trainControl(method = "cv",
                              number = 5,
                              summaryFunction = twoClassSummary,
                              classProbs = TRUE,
                              savePredictions = 'all')

set.seed(100)
rf <- train(form, data = GermanCredit, method = 'rf',
            metric = 'ROC', trControl = train.control)

If we check rf, we can see that the final value of mtry used in the model was 2.

> rf
Random Forest

1000 samples
12 predictor
2 classes: 'Bad', 'Good'

No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 800, 800, 800, 800, 800
Resampling results across tuning parameters:

mtry ROC Sens Spec
2 0.6465714 0.06333333 0.9842857
7 0.6413214 0.31333333 0.8571429
12 0.6358214 0.31666667 0.8385714

ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

Therefore, by filtering for mtry == 2 in rf$pred you will get the expected result.

library(dplyr)

resamp.1 <- rf$pred %>% filter(Resample == 'Fold1' & mtry == 2)
cm <- confusionMatrix(resamp.1$pred, resamp.1$obs)
print(cm)
Confusion Matrix and Statistics

Reference
Prediction Bad Good
Bad 7 5
Good 53 135

Accuracy : 0.71
95% CI : (0.6418, 0.7718)
No Information Rate : 0.7
P-Value [Acc > NIR] : 0.4123

Kappa : 0.1049
Mcnemar's Test P-Value : 6.769e-10

Sensitivity : 0.1167
Specificity : 0.9643
Pos Pred Value : 0.5833
Neg Pred Value : 0.7181
Prevalence : 0.3000
Detection Rate : 0.0350
Detection Prevalence : 0.0600
Balanced Accuracy : 0.5405

'Positive' Class : Bad

cm$byClass[1:2] == rf$resample[1,2:3]
Sens Spec
TRUE TRUE

EDIT:

You can also verify this by checking rf$resampledCM, which shows the number of observations in each confusion-matrix cell for the different mtry values and folds.
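For example, a minimal sketch (assuming the rf fit from above; the exact column names of rf$resampledCM, such as the cell counts and the mtry/Resample columns, are worth checking with names(rf$resampledCM)):

head(rf$resampledCM)                                     # confusion-matrix cell counts per tuning value and fold
subset(rf$resampledCM, mtry == 2 & Resample == "Fold1")  # should match the confusion matrix computed above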

Optimising caret for sensitivity still seems to optimise for ROC

You over-complicated things.

twoClassSummary already contains sensitivity in its output, under the column name "Sens". It is enough to specify:

metric = "Sens" in train() and
summaryFunction = twoClassSummary in trainControl()

Full example:

library(caret)
library(mlbench)
data(Sonar)

rpart_caret_fit <- train(Class ~ .,
                         data = Sonar,
                         method = "rpart",
                         tuneLength = 20,
                         metric = "Sens",
                         maximize = TRUE,
                         trControl = trainControl(classProbs = TRUE,
                                                  method = "cv",
                                                  number = 5,
                                                  summaryFunction = twoClassSummary))

rpart_caret_fit
CART

208 samples
60 predictor
2 classes: 'M', 'R'

No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 167, 166, 166, 166, 167
Resampling results across tuning parameters:

cp ROC Sens Spec
0.0000000 0.7088298 0.7023715 0.7210526
0.0255019 0.7075400 0.7292490 0.6684211
0.0510038 0.7105388 0.7758893 0.6405263
0.0765057 0.6904202 0.7841897 0.6294737
0.1020076 0.7104681 0.8114625 0.6094737
0.1275095 0.7104681 0.8114625 0.6094737
0.1530114 0.7104681 0.8114625 0.6094737
0.1785133 0.7104681 0.8114625 0.6094737
0.2040152 0.7104681 0.8114625 0.6094737
0.2295171 0.7104681 0.8114625 0.6094737
0.2550190 0.7104681 0.8114625 0.6094737
0.2805209 0.7104681 0.8114625 0.6094737
0.3060228 0.7104681 0.8114625 0.6094737
0.3315247 0.7104681 0.8114625 0.6094737
0.3570266 0.7104681 0.8114625 0.6094737
0.3825285 0.7104681 0.8114625 0.6094737
0.4080304 0.7104681 0.8114625 0.6094737
0.4335323 0.7104681 0.8114625 0.6094737
0.4590342 0.6500135 0.8205534 0.4794737
0.4845361 0.6500135 0.8205534 0.4794737

Sens was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.4845361.

Additionally, I first thought you could not specify control = rpart.control(maxdepth = 6) to caret's train, but that is not correct: caret passes any extra parameters forward using ..., so you can pass pretty much any argument of the underlying fitting function, as in the sketch below.
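Here is a minimal sketch of that pass-through, assuming the Sonar data loaded above (caret tunes cp itself, so additional rpart settings such as maxdepth are handed over via the control argument):

library(rpart)

fit_depth <- train(Class ~ .,
                   data = Sonar,
                   method = "rpart",
                   metric = "Sens",
                   tuneLength = 5,
                   control = rpart.control(maxdepth = 6),  # forwarded through ... to rpart()
                   trControl = trainControl(classProbs = TRUE,
                                            method = "cv",
                                            number = 5,
                                            summaryFunction = twoClassSummary))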

If you are looking to write your own summary functions, here is an example for "Sens":

Sensitivity.fc <- function (data, lev = NULL, model = NULL) { # every summary function takes these three arguments
  obs <- data[, "obs"] # the observed values - always in the "obs" column of data
  cls <- levels(obs) # the class levels - you can also get these from the lev argument
  probs <- data[, cls[2]] # the probabilities for the 2nd class - available only if classProbs = TRUE
  class <- as.factor(ifelse(probs > 0.5, cls[2], cls[1])) # assign classes based on a probability threshold
  Sensitivity <- caret::sensitivity(class, obs) # I was lazy, so I used a built-in function to do the calculation
  names(Sensitivity) <- "Sens" # the name of the output
  Sensitivity
}

and now:

rpart_caret_fit <- train(Class ~ .,
                         data = Sonar,
                         method = "rpart",
                         tuneLength = 20,
                         metric = "Sens", # because of this line: names(Sensitivity) <- "Sens"
                         maximize = TRUE,
                         trControl = trainControl(classProbs = TRUE,
                                                  method = "cv",
                                                  number = 5,
                                                  summaryFunction = Sensitivity.fc))

Let's check if both produce the same results:

set.seed(1)
fit_sens <- train(Class ~ .,
                  data = Sonar,
                  method = "rpart",
                  tuneLength = 20,
                  metric = "Sens",
                  maximize = TRUE,
                  trControl = trainControl(classProbs = TRUE,
                                           method = "cv",
                                           number = 5,
                                           summaryFunction = Sensitivity.fc))

set.seed(1)
fit_sens2 <- train(Class ~ .,
                   data = Sonar,
                   method = "rpart",
                   tuneLength = 20,
                   metric = "Sens",
                   maximize = TRUE,
                   trControl = trainControl(classProbs = TRUE,
                                            method = "cv",
                                            number = 5,
                                            summaryFunction = twoClassSummary))

all.equal(fit_sens$results[c("cp", "Sens")],
          fit_sens2$results[c("cp", "Sens")])

TRUE

all.equal(fit_sens$bestTune,
          fit_sens2$bestTune)

TRUE

Train on Specificity

Try setting the summaryFunction argument to twoClassSummary inside trainControl, along with classProbs = TRUE, and metric = "Spec" inside train():

control <- trainControl(method = "cv",
                        number = 10,
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE)

fit.svm <- train(target_var ~ .,
                 data = dataset,
                 method = "svmRadial",
                 metric = "Spec",
                 trControl = control)
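As a runnable sketch of the same idea (target_var and dataset above are placeholders; here the Sonar data is substituted, and method = "svmRadial" needs the kernlab package installed):

library(caret)
library(mlbench)
data(Sonar)

fit.svm <- train(Class ~ .,
                 data = Sonar,
                 method = "svmRadial",
                 metric = "Spec",
                 trControl = control)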

