Additional metrics in caret - PPV, sensitivity, specificity
Caret already has summary functions to output all the metrics you mention:
defaultSummary outputs Accuracy and Kappa
twoClassSummary outputs AUC (area under the ROC curve - see last line of answer), sensitivity and specificity
prSummary outputs precision and recall
To get the combined metrics, you can write your own summary function which combines the outputs of these three:
library(caret)
MySummary <- function(data, lev = NULL, model = NULL){
a1 <- defaultSummary(data, lev, model)
b1 <- twoClassSummary(data, lev, model)
c1 <- prSummary(data, lev, model)
out <- c(a1, b1, c1)
out}
Let's try it on the Sonar data set:
library(mlbench)
data("Sonar")
When defining the train control it is important to set classProbs = TRUE, since some of these metrics (ROC and prAUC) cannot be calculated from the predicted class alone; they need the predicted probabilities.
ctrl <- trainControl(method = "repeatedcv",
number = 10,
savePredictions = TRUE,
summaryFunction = MySummary,
classProbs = TRUE)
Now fit the model of your choice:
mod_fit <- train(Class ~.,
data = Sonar,
method = "rf",
trControl = ctrl)
mod_fit$results
#output
mtry Accuracy Kappa ROC Sens Spec AUC Precision Recall F AccuracySD KappaSD
1 2 0.8364069 0.6666364 0.9454798 0.9280303 0.7333333 0.8683726 0.8121087 0.9280303 0.8621526 0.10570484 0.2162077
2 31 0.8179870 0.6307880 0.9208081 0.8840909 0.7411111 0.8450612 0.8074942 0.8840909 0.8374326 0.06076222 0.1221844
3 60 0.8034632 0.6017979 0.9049242 0.8659091 0.7311111 0.8332068 0.7966889 0.8659091 0.8229330 0.06795824 0.1369086
ROCSD SensSD SpecSD AUCSD PrecisionSD RecallSD FSD
1 0.04393947 0.05727927 0.1948585 0.03410854 0.12717667 0.05727927 0.08482963
2 0.04995650 0.11053858 0.1398657 0.04694993 0.09075782 0.11053858 0.05772388
3 0.04965178 0.12047598 0.1387580 0.04820979 0.08951728 0.12047598 0.06715206
In this output, ROC is in fact the area under the ROC curve (usually called AUC), and AUC is the area under the precision-recall curve across all cutoffs.
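To make the class-based column names concrete, here is a minimal base R sketch on made-up toy vectors (obs, pred and the other names are illustrative, not taken from the answer). It assumes the first factor level, "M", is the positive class, which is how caret's two-class summaries treat it. It shows that Recall is the same quantity as Sens, and that Precision is the positive predictive value (PPV):

```r
# Toy observed and predicted classes; "M" (first level) is the positive class
obs  <- factor(c("M", "M", "M", "M", "R", "R", "R", "R", "R", "R"),
               levels = c("M", "R"))
pred <- factor(c("M", "M", "M", "R", "M", "R", "R", "R", "R", "R"),
               levels = c("M", "R"))

# Confusion table: rows are predictions, columns are observations
tab <- table(pred = pred, obs = obs)
tp <- as.numeric(tab["M", "M"])  # true positives
fp <- as.numeric(tab["M", "R"])  # false positives
fn <- as.numeric(tab["R", "M"])  # false negatives
tn <- as.numeric(tab["R", "R"])  # true negatives

sens      <- tp / (tp + fn)      # sensitivity
spec      <- tn / (tn + fp)      # specificity
precision <- tp / (tp + fp)      # precision, i.e. PPV
recall    <- tp / (tp + fn)      # recall, identical to sensitivity
f_score   <- 2 * precision * recall / (precision + recall)

metrics <- c(Sens = sens, Spec = spec, Precision = precision,
             Recall = recall, F = f_score)
print(metrics)
```

This is also why the Sens and Recall columns in the results table carry identical values.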
How to use sum of specificity and sensitivity metric as a summary metric for train in R caret?
Here is a summary function which uses the sum Sens + Spec as the selection metric:
youdenSumary <- function(data, lev = NULL, model = NULL){
if (length(lev) > 2) {
stop(paste("Your outcome has", length(lev), "levels. The youdenSumary() function isn't appropriate."))
}
if (!all(levels(data[, "pred"]) == lev)) {
stop("levels of observed and predicted data do not match")
}
Sens <- caret::sensitivity(data[, "pred"], data[, "obs"], lev[1])
Spec <- caret::specificity(data[, "pred"], data[, "obs"], lev[2])
j <- Sens + Spec
out <- c(j, Spec, Sens)
names(out) <- c("j", "Spec", "Sens")
out
}
To understand why it is defined as such please read this chapter from the caret book. Some answers that might be helpful here on SO are:
Custom Performance Function in caret Package using predicted Probability
Additional metrics in caret - PPV, sensitivity, specificity
Example:
library(caret)
library(mlbench)
data(Sonar)
fitControl <- trainControl(method = "cv",
number = 5,
summaryFunction = youdenSumary)
fit <- train(Class ~.,
data = Sonar,
method = "rpart",
metric = "j" ,
tuneLength = 5,
trControl = fitControl)
fit
#output
CART
208 samples
60 predictor
2 classes: 'M', 'R'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 167, 166, 166, 166, 167
Resampling results across tuning parameters:
cp j Spec Sens
0.00000000 1.394980 0.6100000 0.7849802
0.01030928 1.394980 0.6100000 0.7849802
0.05154639 1.387708 0.6300000 0.7577075
0.06701031 1.398629 0.6405263 0.7581028
0.48453608 1.215457 0.3684211 0.8470356
j was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.06701031.
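Note that Youden's J statistic is conventionally defined as Sens + Spec - 1. Subtracting a constant does not change which row is largest, so selecting on the plain sum picks the same model. A quick base R check using the Spec and Sens values printed in the output above (the variable names are illustrative):

```r
# Values copied from the resampling results table
cp   <- c(0.00000000, 0.01030928, 0.05154639, 0.06701031, 0.48453608)
spec <- c(0.6100000, 0.6100000, 0.6300000, 0.6405263, 0.3684211)
sens <- c(0.7849802, 0.7849802, 0.7577075, 0.7581028, 0.8470356)

sum_metric <- sens + spec        # the "j" column used by the summary function
youden_j   <- sens + spec - 1    # conventional Youden's J

# Both criteria select the same complexity parameter
best_by_sum <- cp[which.max(sum_metric)]
best_by_j   <- cp[which.max(youden_j)]
print(c(best_by_sum = best_by_sum, best_by_j = best_by_j))
```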
How to set a ppv in caret for random forest in r?
We define a function to calculate PPV and return the results with a name:
PPV <- function (data,lev = NULL,model = NULL) {
value <- posPredValue(data$pred,data$obs, positive = lev[1])
c(PPV=value)
}
Let's say we have the following data:
library(randomForest)
library(caret)
data=iris
data$Species = ifelse(data$Species == "versicolor","versi","others")
trn = sample(nrow(iris),100)
Then we train by specifying PPV to be the metric:
mdl <- train(Species ~ ., data = data[trn,],
method = "rf",
metric = "PPV",
trControl = trainControl(summaryFunction = PPV,
classProbs = TRUE))
Random Forest
100 samples
4 predictor
2 classes: 'others', 'versi'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ...
Resampling results across tuning parameters:
mtry PPV
2 0.9682811
3 0.9681759
4 0.9648426
PPV was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
Now you can see it is trained on PPV. However, you cannot force the training to achieve a PPV of 0.9. It really depends on the data: if your independent variables have no predictive power, the PPV will not improve however much you train the model.
Increasing specificity in Caret package in R
Inside the predict function you need to specify the type = "prob" parameter. This gives you the class probabilities, so you can choose the threshold you prefer.
model_pred <- predict(model_default, newdata = test_default, type = "prob")
Then, you can manually make a classification. For example:
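model_pred_class <- ifelse(model_pred[, "Yes"] < 0.2, "No", "Yes") # assumes the classes are named "No" and "Yes"; model_pred has one probability column per class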
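To see how the threshold trades sensitivity against specificity, here is a minimal base R sketch. The probabilities and observed classes are made up, "Yes" is assumed to be the positive class, and the helper names classify_at and spec_at are hypothetical:

```r
# Made-up predicted probabilities, shaped like predict(..., type = "prob")
probs <- data.frame(Yes = c(0.10, 0.15, 0.30, 0.60, 0.70, 0.90))
probs$No <- 1 - probs$Yes
obs <- factor(c("No", "No", "No", "Yes", "No", "Yes"), levels = c("No", "Yes"))

# Predict "Yes" only when its probability reaches the threshold
classify_at <- function(threshold) {
  factor(ifelse(probs$Yes >= threshold, "Yes", "No"), levels = c("No", "Yes"))
}

# Specificity: share of true "No" cases that are predicted "No"
spec_at <- function(threshold) {
  pred <- classify_at(threshold)
  mean(pred[obs == "No"] == "No")
}

# Sensitivity: share of true "Yes" cases that are predicted "Yes"
sens_at <- function(threshold) {
  pred <- classify_at(threshold)
  mean(pred[obs == "Yes"] == "Yes")
}

spec_default <- spec_at(0.5)
spec_strict  <- spec_at(0.8)
sens_default <- sens_at(0.5)
sens_strict  <- sens_at(0.8)
```

Raising the threshold for the positive class increases specificity here, at the cost of sensitivity.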
How does caret calculate sensitivity and specificity in resamples?
Please be aware that data(GermanCredit) does not have the same variables as the ones you saved in form. For future questions, it would help to post a reproducible example and to use set.seed().
Nevertheless, the issue here is that you need to take mtry into account, i.e. the number of "Randomly Selected Predictors" used in the random forest model. See documentation and code here.
I adjusted the GermanCredit example so that everyone can run it as is:
library(caret)
data("GermanCredit")
form = as.formula('Class~Amount+SavingsAccountBonds.100.to.500+SavingsAccountBonds.lt.100+SavingsAccountBonds.500.to.1000+
SavingsAccountBonds.lt.100+SavingsAccountBonds.gt.1000+SavingsAccountBonds.Unknown+
InstallmentRatePercentage+Age+Housing.ForFree+Housing.Own+Housing.Rent+NumberExistingCredits')
train.control <- trainControl(method="cv",
number=5,
summaryFunction = twoClassSummary,
classProbs = TRUE,
savePredictions='all')
set.seed(100)
rf <- train(form, data=GermanCredit, method = 'rf',
metric = 'ROC', trControl=train.control)
If we check rf, we can see that the final value of mtry used in the model was mtry = 2.
> rf
Random Forest
1000 samples
12 predictor
2 classes: 'Bad', 'Good'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 800, 800, 800, 800, 800
Resampling results across tuning parameters:
mtry ROC Sens Spec
2 0.6465714 0.06333333 0.9842857
7 0.6413214 0.31333333 0.8571429
12 0.6358214 0.31666667 0.8385714
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
Therefore, by filtering rf$pred on mtry = 2 you will get the expected result.
library(dplyr) # provides %>% and filter()
resamp.1 <- rf$pred %>% filter(Resample=='Fold1' & mtry == 2)
cm <- confusionMatrix(resamp.1$pred, resamp.1$obs)
print(cm)
Confusion Matrix and Statistics
Reference
Prediction Bad Good
Bad 7 5
Good 53 135
Accuracy : 0.71
95% CI : (0.6418, 0.7718)
No Information Rate : 0.7
P-Value [Acc > NIR] : 0.4123
Kappa : 0.1049
Mcnemar's Test P-Value : 6.769e-10
Sensitivity : 0.1167
Specificity : 0.9643
Pos Pred Value : 0.5833
Neg Pred Value : 0.7181
Prevalence : 0.3000
Detection Rate : 0.0350
Detection Prevalence : 0.0600
Balanced Accuracy : 0.5405
'Positive' Class : Bad
cm$byClass[1:2] == rf$resample[1,2:3]
Sens Spec
TRUE TRUE
EDIT:
You can also verify this by checking rf$resampledCM, which shows the number of observations in each cell for the different mtry values and folds.
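As a cross-check, the statistics printed by confusionMatrix above can be reproduced by hand from the four cell counts. A minimal base R sketch, with 'Bad' as the positive class and counts copied from the printed matrix (the variable names are illustrative):

```r
# Cell counts from the printed confusion matrix (rows: prediction, cols: reference)
cm_counts <- matrix(c(7, 53, 5, 135), nrow = 2,
                    dimnames = list(Prediction = c("Bad", "Good"),
                                    Reference  = c("Bad", "Good")))

tp <- cm_counts["Bad", "Bad"]    # predicted Bad, actually Bad
fp <- cm_counts["Bad", "Good"]   # predicted Bad, actually Good
fn <- cm_counts["Good", "Bad"]   # predicted Good, actually Bad
tn <- cm_counts["Good", "Good"]  # predicted Good, actually Good

sens    <- tp / (tp + fn)              # Sensitivity
spec    <- tn / (tn + fp)              # Specificity
ppv     <- tp / (tp + fp)              # Pos Pred Value
npv     <- tn / (tn + fn)              # Neg Pred Value
acc     <- (tp + tn) / sum(cm_counts)  # Accuracy
bal_acc <- (sens + spec) / 2           # Balanced Accuracy

print(round(c(Sens = sens, Spec = spec, PPV = ppv, NPV = npv,
              Accuracy = acc, BalancedAccuracy = bal_acc), 4))
```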
Optimising caret for sensitivity still seems to optimise for ROC
You over-complicated things.
twoClassSummary already contains sensitivity in its output, under the column name "Sens". It is enough to specify metric = "Sens" to train and summaryFunction = twoClassSummary to trainControl.
Full example:
library(caret)
library(mlbench)
data(Sonar)
rpart_caret_fit <- train(Class~.,
data = Sonar,
method = "rpart",
tuneLength = 20,
metric = "Sens",
maximize = TRUE,
trControl = trainControl(classProbs = TRUE,
method = "cv",
number = 5,
summaryFunction = twoClassSummary))
rpart_caret_fit
CART
208 samples
60 predictor
2 classes: 'M', 'R'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 167, 166, 166, 166, 167
Resampling results across tuning parameters:
cp ROC Sens Spec
0.0000000 0.7088298 0.7023715 0.7210526
0.0255019 0.7075400 0.7292490 0.6684211
0.0510038 0.7105388 0.7758893 0.6405263
0.0765057 0.6904202 0.7841897 0.6294737
0.1020076 0.7104681 0.8114625 0.6094737
0.1275095 0.7104681 0.8114625 0.6094737
0.1530114 0.7104681 0.8114625 0.6094737
0.1785133 0.7104681 0.8114625 0.6094737
0.2040152 0.7104681 0.8114625 0.6094737
0.2295171 0.7104681 0.8114625 0.6094737
0.2550190 0.7104681 0.8114625 0.6094737
0.2805209 0.7104681 0.8114625 0.6094737
0.3060228 0.7104681 0.8114625 0.6094737
0.3315247 0.7104681 0.8114625 0.6094737
0.3570266 0.7104681 0.8114625 0.6094737
0.3825285 0.7104681 0.8114625 0.6094737
0.4080304 0.7104681 0.8114625 0.6094737
0.4335323 0.7104681 0.8114625 0.6094737
0.4590342 0.6500135 0.8205534 0.4794737
0.4845361 0.6500135 0.8205534 0.4794737
Sens was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.4845361.
Additionally, an earlier version of this answer said that you cannot specify control = rpart.control(maxdepth = 6) to caret train. This is not correct: caret passes any extra parameters forward using ... , so you can pass pretty much any argument.
If you are looking to write your own summary functions, here is an example for "Sens":
Sensitivity.fc <- function (data, lev = NULL, model = NULL) { # every summary function takes these three arguments
obs <- data[, "obs"] # the observed (true) values - always in the column named "obs" in data
cls <- levels(obs) # the class levels - also available through the lev argument
probs <- data[, cls[2]] # the probabilities for the 2nd class - available only if classProbs = TRUE
class <- factor(ifelse(probs > 0.5, cls[2], cls[1]), levels = cls) # assign classes using a probability threshold, keeping both levels
Sensitivity <- caret::sensitivity(class, obs) # do the calculation with caret's built-in function
names(Sensitivity) <- "Sens" # the name of the output
Sensitivity
}
and now:
rpart_caret_fit <- train(Class~.,
data = Sonar,
method = "rpart",
tuneLength = 20,
metric = "Sens", #because of this line: names(Sensitivity) <- "Sens"
maximize = TRUE,
trControl = trainControl(classProbs = TRUE,
method = "cv",
number = 5,
summaryFunction = Sensitivity.fc))
Let's check whether both produce the same results:
set.seed(1)
fit_sens <- train(Class~.,
data = Sonar,
method = "rpart",
tuneLength = 20,
metric = "Sens",
maximize = TRUE,
trControl = trainControl(classProbs = TRUE,
method = "cv",
number = 5,
summaryFunction = Sensitivity.fc))
set.seed(1)
fit_sens2 <- train(Class~.,
data = Sonar,
method = "rpart",
tuneLength = 20,
metric = "Sens",
maximize = TRUE,
trControl = trainControl(classProbs = TRUE,
method = "cv",
number = 5,
summaryFunction = twoClassSummary))
all.equal(fit_sens$results[c("cp", "Sens")],
fit_sens2$results[c("cp", "Sens")])
TRUE
all.equal(fit_sens$bestTune,
fit_sens2$bestTune)
TRUE
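Since a custom summary function controls the probability cutoff, one possible extension is to make that cutoff configurable. The following is only a sketch in base R: make_sens_summary is a hypothetical helper (not part of caret) that returns a function with the same three-argument signature, and the toy data frame mimics the obs column plus one probability column per class:

```r
# Hypothetical factory: builds a summary function for a given probability cutoff
make_sens_summary <- function(threshold = 0.5) {
  function(data, lev = NULL, model = NULL) {
    obs <- data[, "obs"]
    cls <- if (is.null(lev)) levels(obs) else lev
    # Predict the 2nd class only when its probability exceeds the cutoff
    pred <- factor(ifelse(data[, cls[2]] > threshold, cls[2], cls[1]),
                   levels = cls)
    # Sensitivity with the 1st class as the positive class
    sens <- sum(pred == cls[1] & obs == cls[1]) / sum(obs == cls[1])
    c(Sens = sens)
  }
}

# Toy data in the shape a summary function receives when class probabilities are on
toy <- data.frame(obs = factor(c("M", "M", "M", "R", "R"), levels = c("M", "R")),
                  M = c(0.90, 0.60, 0.35, 0.20, 0.45),
                  R = c(0.10, 0.40, 0.65, 0.80, 0.55))

sens_50 <- make_sens_summary(0.5)(toy, lev = c("M", "R"))
sens_70 <- make_sens_summary(0.7)(toy, lev = c("M", "R"))
```

Under these assumptions, make_sens_summary(0.7) could be passed as summaryFunction to trainControl together with classProbs = TRUE, in the same way as Sensitivity.fc.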
Train on Specificity
Try setting the summaryFunction argument to twoClassSummary inside trainControl, along with classProbs = TRUE, and metric = "Spec" inside train():
control <- trainControl(method="cv",
number=10,
summaryFunction = twoClassSummary,
classProbs = TRUE)
fit.svm <- train(target_var ~.,
data=dataset,
method="svmRadial",
metric="Spec",
trControl=control)
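One thing to check before optimising "Spec": twoClassSummary treats the first factor level of the outcome as the event of interest (used for Sens) and the second level as the non-event (used for Spec). If the levels are in the wrong order for your problem, you can reorder them before training. A minimal base R sketch, with a made-up outcome vector standing in for dataset$target_var:

```r
# Made-up outcome; factor() orders levels alphabetically by default
target_var <- factor(c("healthy", "sick", "healthy", "sick", "healthy"))
levels_before <- levels(target_var)   # "healthy" comes first

# Make "sick" the first level, so it becomes the event of interest
target_releveled <- relevel(target_var, ref = "sick")
levels_after <- levels(target_releveled)

print(levels_before)
print(levels_after)
```

The underlying values are unchanged; only the level order, and therefore which class Sens and Spec refer to, is different.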