Predicting Probabilities for GBM with the caret Library

Predicting Probabilities for GBM with caret library

When class probabilities are requested, train puts them into a data frame with a column for each class. If the factor levels are not valid variable names, they are automatically changed (e.g. "0" becomes "X0"). train issues a warning in this case that goes something like "At least one of the class levels are not valid R variables names. This may cause errors if class probabilities are generated."
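For example, a minimal sketch of the usual fix (the factor y here is hypothetical): rename the levels with make.names() before calling train.

y <- factor(c(0, 1, 1, 0))          # levels "0" and "1" are not valid R variable names
levels(y) <- make.names(levels(y))  # "0" becomes "X0", "1" becomes "X1"
levels(y)
# [1] "X0" "X1"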

How to run a gbm simulation over probabilities using caret package model results

I found my own answer, although I would be happy if someone finds a more efficient way. This is for running 100 simulations:

SIMUL <- list()

# For each simulation, draw one class per row of PROBS according to its
# class probabilities, then tabulate the class counts
for (i in 1:100) {
  species <- list()
  for (j in 1:nrow(PROBS)) {
    species[[j]] <- sample(c("setosa", "versicolor", "virginica"),
                           size = 1, replace = TRUE, prob = PROBS[j, ])
  }
  SIMUL[[i]] <- as.data.frame(table(unlist(species)))
}

SIMUL <- do.call("rbind", SIMUL)

# Summarise the counts per class across the 100 simulations
SIMUL <- dplyr::group_by(SIMUL, Var1)
SIMUL <- dplyr::summarise(SIMUL,
                          MEAN_class = mean(Freq),
                          MIN_Class = min(Freq),
                          MAX_Class = max(Freq))

This will result in:

SIMUL

Source: local data frame [3 x 4]

        Var1 MEAN_class MIN_Class MAX_Class
      (fctr)      (dbl)     (int)     (int)
1     setosa       50.0        50        50
2 versicolor       49.7        47        53
3  virginica       50.3        47        53
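A somewhat more compact alternative, sketched under the same assumption that PROBS holds one row of class probabilities per observation, draws each simulation in a single pass and keeps the counts in a matrix:

classes <- c("setosa", "versicolor", "virginica")
sims <- replicate(100, {
  draws <- apply(PROBS, 1, function(p) sample(classes, size = 1, prob = p))
  table(factor(draws, levels = classes))
})
# sims is a 3 x 100 matrix: one column of class counts per simulation
rowMeans(sims)         # mean count per class
apply(sims, 1, range)  # min and max count per class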

GBM classification with the caret package

"since whatever summary function is used gets passed predictions that are already binary"

That's definitely not the case.

train doesn't use the classes to compute the ROC curve (unless you go out of your way to make it do so); see the note below.

train can predict the classes as factors (using the internal code that you show) and/or the class probabilities.

For example, this code will compute the class probabilities and use them to get the area under the ROC curve:

library(caret)
library(mlbench)
data(Sonar)

ctrl <- trainControl(method = "cv",
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)
set.seed(1)
gbmTune <- train(Class ~ ., data = Sonar,
                 method = "gbm",
                 metric = "ROC",
                 verbose = FALSE,
                 trControl = ctrl)

In fact, if you omit the classProbs = TRUE bit, you will get the error:

train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()
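For completeness, once the model is fit you can get either type of prediction from the same object; a quick sketch (predicting on the training data purely for illustration):

head(predict(gbmTune, newdata = Sonar))                 # predicted classes (a factor)
head(predict(gbmTune, newdata = Sonar, type = "prob"))  # class probabilities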

Max

How to get the class probabilities AND predictions from caret::predict()?

I'll make my comment into an answer.
Once you generate your prediction table of probabilities, you don't actually need to run the prediction function twice to get the classes. You can add the class column by applying a simple which.max (which runs fast, imo). This assigns to each row the name of the column (one of c("setosa", "versicolor", "virginica")) with the highest probability.

You get this table with both pieces of information, as requested:

library(dplyr)
predict(knnFit, newdata = iris, type = "prob") %>%
  mutate(class = names(.)[apply(., 1, which.max)])
# a random sample of the resulting table:
####     setosa versicolor virginica      class
#### 18       1  0.0000000 0.0000000     setosa
#### 64       0  0.6666667 0.3333333 versicolor
#### 90       0  1.0000000 0.0000000 versicolor
#### 121      0  0.0000000 1.0000000  virginica

P.S. This uses the pipe operator from the dplyr or magrittr packages. The dot (.) refers to the result of the previous instruction.
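A base R sketch of the same idea uses max.col instead of apply/which.max (note that max.col breaks ties at random by default):

probs <- predict(knnFit, newdata = iris, type = "prob")
probs$class <- colnames(probs)[max.col(probs)]  # name of the highest-probability column
head(probs)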

How to get randomForest model output in probability using Caret?

UPDATE: I found the solution through here. Apparently, caret's train does not handle 0 and 1 well as binary class values in the target variable. Changing them to strings ('r' and 's') worked perfectly:

> a$dv <- gsub('0', 'r', a$dv)
> a$dv <- gsub('1', 's', a$dv)
> rf_model <- train(x = a[, -c(1:2)], y = as.factor(a[, 2]), method = "rf", ntree = 120)
> head(predict(rf_model, test_data[, -c(1:2)], type = "prob"))
          r           s
1 0.9750000 0.025000000
2 0.9916667 0.008333333
3 0.2583333 0.741666667
4 0.2833333 0.716666667
5 0.1583333 0.841666667
6 1.0000000 0.000000000
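As an aside, the same recode can be done in one step with factor(); a sketch, assuming the original 0/1 column from the question:

# Alternative to the two gsub() calls: map 0 -> "r" and 1 -> "s" directly
a$dv <- factor(a$dv, levels = c(0, 1), labels = c("r", "s"))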

How to Produce a Confusion Matrix using the 'gbm' Method in the Caret Package

Thanks for including all the required information; I believe this is the solution to your problem:

library(magrittr)
library(gbm)
#> Loaded gbm 2.1.8
library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice
library(e1071)

set.seed(45L)

# Load in your example data to an object ("data")
#Produce a new version of the data frame 'Clusters_Dummy' with the rows shuffled
Cluster_Dummy_2 <- data
NewClusters <- Cluster_Dummy_2[sample(1:nrow(Cluster_Dummy_2)), ]

NewClusters <- as.data.frame(NewClusters)

# Note: the partition indices are stratified on the unshuffled Country column
# but applied to the shuffled copy, so the split is effectively unstratified
training.parameters <- Cluster_Dummy_2$Country %>%
  createDataPartition(p = 0.7, list = FALSE)
train.data <- NewClusters[training.parameters, ]
test.data <- NewClusters[-training.parameters, ]

dim(train.data)
#> [1] 70 11

dim(test.data)
#> [1] 29 11

#Auxiliary function for controlling model fitting
#10-fold cross-validation, repeated 10 times
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           classProbs = TRUE)
#Fit the model
gbmFit1 <- train(Country ~ ., data = train.data,
                 method = "gbm",
                 trControl = fitControl,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
gbmFit1
#> Stochastic Gradient Boosting
#>
#> 70 samples
#> 10 predictors
#> 2 classes: 'France', 'Holland'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 10 times)
#> Summary of sample sizes: 64, 64, 63, 63, 63, 62, ...
#> Resampling results across tuning parameters:
#>
#> interaction.depth n.trees Accuracy Kappa
#> 1 50 0.7397619 0.4810245
#> 1 100 0.7916667 0.5816756
#> 1 150 0.8204167 0.6392434
#> 2 50 0.7396429 0.4813670
#> 2 100 0.7943452 0.5901254
#> 2 150 0.8380357 0.6768166
#> 3 50 0.7361905 0.4711780
#> 3 100 0.7966071 0.5897921
#> 3 150 0.8356548 0.6694202
#>
#> Tuning parameter 'shrinkage' was held constant at a value of 0.1
#>
#> Tuning parameter 'n.minobsinnode' was held constant at a value of 10
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were n.trees = 150, interaction.depth =
#> 2, shrinkage = 0.1 and n.minobsinnode = 10.
summary(gbmFit1)

[Plot: relative influence of each predictor, from summary(gbmFit1)]

#>                     var   rel.inf
#> ID                   ID 66.517974
#> Center_Freq Center_Freq  6.624256
#> Start.Freq   Start.Freq  5.545827
#> Delta.Time   Delta.Time  5.033223
#> Peak.Time     Peak.Time  4.951384
#> End.Freq       End.Freq  3.211461
#> Delta.Freq   Delta.Freq  2.352933
#> Low.Freq       Low.Freq  2.207371
#> High.Freq     High.Freq  1.951895
#> Peak.Freq     Peak.Freq  1.603675

#Predict the model with the test data
pred_model_Tree1 <- predict(object = gbmFit1, newdata = test.data, type = "prob")
pred_model_Tree1
#>         France     Holland
#> 1  0.919393487 0.080606513
#> 2  0.095638010 0.904361990
#> 3  0.019038102 0.980961898
#> 4  0.045807668 0.954192332
#> 5  0.157809127 0.842190873
#> 6  0.987391435 0.012608565
#> 7  0.011436393 0.988563607
#> 8  0.032262438 0.967737562
#> 9  0.151393564 0.848606436
#> 10 0.993447390 0.006552610
#> 11 0.020833439 0.979166561
#> 12 0.993910239 0.006089761
#> 13 0.009170816 0.990829184
#> 14 0.010519644 0.989480356
#> 15 0.995338954 0.004661046
#> 16 0.994153479 0.005846521
#> 17 0.998099611 0.001900389
#> 18 0.056571139 0.943428861
#> 19 0.801327096 0.198672904
#> 20 0.192220458 0.807779542
#> 21 0.899189477 0.100810523
#> 22 0.766542297 0.233457703
#> 23 0.940046468 0.059953532
#> 24 0.069087397 0.930912603
#> 25 0.916674076 0.083325924
#> 26 0.023676968 0.976323032
#> 27 0.996824979 0.003175021
#> 28 0.996068088 0.003931912
#> 29 0.096807861 0.903192139

# Evaluate each prediction, i.e. if the predicted likelihood that the country is France is '0.9'
# and the likelihood it's Holland is '0.1', then the prediction is "France"
pred_model_Tree1$evaluation <- ifelse(pred_model_Tree1$France >= 0.5, "France", "Holland")

# Now you can print the confusionMatrix (make sure each factor has the same levels)
confusionMatrix(factor(pred_model_Tree1$evaluation, levels = unique(test.data$Country)),
                factor(test.data$Country, levels = unique(test.data$Country)))
#> Confusion Matrix and Statistics
#>
#>           Reference
#> Prediction France Holland
#>    France      13       1
#>    Holland      0      15
#>
#>                Accuracy : 0.9655
#>                  95% CI : (0.8224, 0.9991)
#>     No Information Rate : 0.5517
#>     P-Value [Acc > NIR] : 7.947e-07
#>
#>                   Kappa : 0.9308
#>
#>  Mcnemar's Test P-Value : 1
#>
#>             Sensitivity : 1.0000
#>             Specificity : 0.9375
#>          Pos Pred Value : 0.9286
#>          Neg Pred Value : 1.0000
#>              Prevalence : 0.4483
#>          Detection Rate : 0.4483
#>    Detection Prevalence : 0.4828
#>       Balanced Accuracy : 0.9688
#>
#>        'Positive' Class : France
#>

Created on 2022-06-02 by the reprex package (v2.0.1)
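Note that if you only need the confusion matrix and not the probabilities, caret can return the predicted classes directly, which avoids the manual 0.5 cutoff; a sketch with the same fitted object:

# type = "raw" (the default) returns the predicted class labels as a factor
pred_class <- predict(gbmFit1, newdata = test.data)
confusionMatrix(pred_class, factor(test.data$Country))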



Edit

Something seems wrong - perhaps you want to remove the IDs before you train/test the model? (Maybe they weren't randomly assigned?) E.g.

library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(gbm)
#> Loaded gbm 2.1.8
library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice
library(e1071)

set.seed(45L)

#Produce a new version of the data frame 'Clusters_Dummy' with the rows shuffled
Cluster_Dummy_2 <- data
NewClusters <- Cluster_Dummy_2[sample(1:nrow(Cluster_Dummy_2)), ]

NewClusters <- as.data.frame(NewClusters)

training.parameters <- Cluster_Dummy_2$Country %>%
  createDataPartition(p = 0.7, list = FALSE)
train.data <- NewClusters[training.parameters, ] %>%
  select(-ID)
test.data <- NewClusters[-training.parameters, ] %>%
  select(-ID)

dim(train.data)
#> [1] 70 10

dim(test.data)
#> [1] 29 10

#Auxiliary function for controlling model fitting
#10-fold cross-validation, repeated 10 times
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           classProbs = TRUE)
#Fit the model
gbmFit1 <- train(Country ~ ., data = train.data,
                 method = "gbm",
                 trControl = fitControl,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
gbmFit1
#> Stochastic Gradient Boosting
#>
#> 70 samples
#> 9 predictor
#> 2 classes: 'France', 'Holland'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 10 times)
#> Summary of sample sizes: 64, 64, 63, 63, 63, 62, ...
#> Resampling results across tuning parameters:
#>
#> interaction.depth n.trees Accuracy Kappa
#> 1 50 0.5515476 0.08773090
#> 1 100 0.5908929 0.17272118
#> 1 150 0.5958333 0.18280502
#> 2 50 0.5386905 0.06596478
#> 2 100 0.5767262 0.13757567
#> 2 150 0.5785119 0.14935661
#> 3 50 0.5575000 0.09991455
#> 3 100 0.5585119 0.10906906
#> 3 150 0.5780952 0.14820067
#>
#> Tuning parameter 'shrinkage' was held constant at a value of 0.1
#>
#> Tuning parameter 'n.minobsinnode' was held constant at a value of 10
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were n.trees = 150, interaction.depth =
#> 1, shrinkage = 0.1 and n.minobsinnode = 10.
summary(gbmFit1)

[Plot: relative influence of each predictor, from summary(gbmFit1)]

#>                     var   rel.inf
#> Center_Freq Center_Freq 14.094306
#> High.Freq     High.Freq 14.060959
#> Peak.Time     Peak.Time 13.503953
#> Peak.Freq     Peak.Freq 11.358891
#> Delta.Time   Delta.Time  9.964882
#> Low.Freq       Low.Freq  9.610686
#> End.Freq       End.Freq  9.308919
#> Delta.Freq   Delta.Freq  9.097253
#> Start.Freq   Start.Freq  9.000152

#Predict the model with the test data
pred_model_Tree1 <- predict(object = gbmFit1, newdata = test.data, type = "prob")
pred_model_Tree1
#>        France    Holland
#> 1  0.75514031 0.24485969
#> 2  0.44409692 0.55590308
#> 3  0.15027904 0.84972096
#> 4  0.49861536 0.50138464
#> 5  0.95406713 0.04593287
#> 6  0.82122854 0.17877146
#> 7  0.27931450 0.72068550
#> 8  0.50113421 0.49886579
#> 9  0.61912973 0.38087027
#> 10 0.91005442 0.08994558
#> 11 0.42625105 0.57374895
#> 12 0.27339404 0.72660596
#> 13 0.14520192 0.85479808
#> 14 0.16607144 0.83392856
#> 15 0.97198722 0.02801278
#> 16 0.88614818 0.11385182
#> 17 0.65561219 0.34438781
#> 18 0.86793709 0.13206291
#> 19 0.28583233 0.71416767
#> 20 0.97002073 0.02997927
#> 21 0.74408374 0.25591626
#> 22 0.28408111 0.71591889
#> 23 0.07257257 0.92742743
#> 24 0.22724577 0.77275423
#> 25 0.32581206 0.67418794
#> 26 0.59713799 0.40286201
#> 27 0.75814205 0.24185795
#> 28 0.94018097 0.05981903
#> 29 0.51155700 0.48844300

# Evaluate each prediction, i.e. if the predicted likelihood that the country is France is '0.9'
# and the likelihood it's Holland is '0.1', then the prediction is "France"
pred_model_Tree1$evaluation <- ifelse(pred_model_Tree1$France >= 0.5, "France", "Holland")

# Now you can print the confusionMatrix (make sure each factor has the same levels)
confusionMatrix(factor(pred_model_Tree1$evaluation, levels = unique(test.data$Country)),
                factor(test.data$Country, levels = unique(test.data$Country)))
#> Confusion Matrix and Statistics
#>
#>           Reference
#> Prediction France Holland
#>    France       9       7
#>    Holland      4       9
#>
#>                Accuracy : 0.6207
#>                  95% CI : (0.4226, 0.7931)
#>     No Information Rate : 0.5517
#>     P-Value [Acc > NIR] : 0.2897
#>
#>                   Kappa : 0.2494
#>
#>  Mcnemar's Test P-Value : 0.5465
#>
#>             Sensitivity : 0.6923
#>             Specificity : 0.5625
#>          Pos Pred Value : 0.5625
#>          Neg Pred Value : 0.6923
#>              Prevalence : 0.4483
#>          Detection Rate : 0.3103
#>    Detection Prevalence : 0.5517
#>       Balanced Accuracy : 0.6274
#>
#>        'Positive' Class : France
#>

Created on 2022-06-02 by the reprex package (v2.0.1)



Edit 2

For multi-class classification (3 classes):

library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(gbm)
#> Loaded gbm 2.1.8
library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice
library(e1071)

set.seed(45L)

#Produce a new version of the data frame 'Clusters_Dummy' with the rows shuffled
Cluster_Dummy_2 <- data_updated

NewClusters <- Cluster_Dummy_2[sample(1:nrow(Cluster_Dummy_2)), ]

NewClusters <- as.data.frame(NewClusters)

training.parameters <- Cluster_Dummy_2$Country %>%
  createDataPartition(p = 0.7, list = FALSE)
train.data <- NewClusters[training.parameters, ]
test.data <- NewClusters[-training.parameters, ]

dim(train.data)
#> [1] 71 10

dim(test.data)
#> [1] 28 10

#Auxiliary function for controlling model fitting
#10-fold cross-validation, repeated 10 times
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           classProbs = TRUE)
#Fit the model
gbmFit1 <- train(Country ~ ., data = train.data,
                 method = "gbm",
                 trControl = fitControl,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
gbmFit1
#> Stochastic Gradient Boosting
#>
#> 71 samples
#> 9 predictor
#> 3 classes: 'France', 'Holland', 'Spain'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 10 times)
#> Summary of sample sizes: 63, 64, 64, 63, 63, 63, ...
#> Resampling results across tuning parameters:
#>
#> interaction.depth n.trees Accuracy Kappa
#> 1 50 0.4165476 0.07310546
#> 1 100 0.4264683 0.09363788
#> 1 150 0.4164683 0.08078702
#> 2 50 0.3894048 0.03705497
#> 2 100 0.4032341 0.06489744
#> 2 150 0.4075794 0.06765817
#> 3 50 0.4032341 0.05972739
#> 3 100 0.3906944 0.04364377
#> 3 150 0.4236905 0.10068155
#>
#> Tuning parameter 'shrinkage' was held constant at a value of 0.1
#>
#> Tuning parameter 'n.minobsinnode' was held constant at a value of 10
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were n.trees = 100, interaction.depth =
#> 1, shrinkage = 0.1 and n.minobsinnode = 10.
summary(gbmFit1)

[Plot: relative influence of each predictor, from summary(gbmFit1)]

#>                     var   rel.inf
#> Peak.Time     Peak.Time 16.211328
#> End.Freq       End.Freq 15.001295
#> Center_Freq Center_Freq 12.583477
#> Delta.Freq   Delta.Freq 11.236692
#> Start.Freq   Start.Freq 10.692191
#> Delta.Time   Delta.Time  9.224466
#> Peak.Freq     Peak.Freq  8.772731
#> Low.Freq       Low.Freq  8.674891
#> High.Freq     High.Freq  7.602928

#Predict the model with the test data
pred_model_Tree1 <- predict(object = gbmFit1, newdata = test.data, type = "prob")
pred_model_Tree1
#>        France    Holland      Spain
#> 1  0.15839683 0.11884456 0.72275861
#> 2  0.31551164 0.62037910 0.06410925
#> 3  0.06056686 0.03289397 0.90653917
#> 4  0.22705213 0.03439780 0.73855007
#> 5  0.05455049 0.02259610 0.92285341
#> 6  0.34187929 0.25613079 0.40198992
#> 7  0.12857217 0.39860882 0.47281901
#> 8  0.08617855 0.09096950 0.82285196
#> 9  0.22635900 0.62549636 0.14814464
#> 10 0.20887256 0.64739917 0.14372826
#> 11 0.03588915 0.74148076 0.22263010
#> 12 0.03083337 0.48043152 0.48873511
#> 13 0.44698228 0.07630407 0.47671365
#> 14 0.12247065 0.01864920 0.85888015
#> 15 0.03022037 0.08301324 0.88676639
#> 16 0.18190023 0.50467449 0.31342527
#> 17 0.10173416 0.11619956 0.78206628
#> 18 0.29744577 0.31149440 0.39105983
#> 19 0.08555810 0.83492846 0.07951344
#> 20 0.67158503 0.12913684 0.19927813
#> 21 0.33985892 0.30094634 0.35919474
#> 22 0.41752286 0.43288825 0.14958889
#> 23 0.10014057 0.85848587 0.04137356
#> 24 0.02483037 0.57939110 0.39577853
#> 25 0.20376019 0.16867259 0.62756722
#> 26 0.05082254 0.11736656 0.83181090
#> 27 0.02621289 0.74597052 0.22781659
#> 28 0.37202204 0.48168272 0.14629524

# Select the most likely country (i.e. the highest probability)
pred_model_Tree1$evaluation <- factor(max.col(pred_model_Tree1[, 1:3]),
                                      levels = 1:3,
                                      labels = c("France", "Holland", "Spain"))

# Print the confusionMatrix (make sure each factor has the same levels)
confusionMatrix(factor(pred_model_Tree1$evaluation, levels = unique(test.data$Country)),
                factor(test.data$Country, levels = unique(test.data$Country)))
#> Confusion Matrix and Statistics
#>
#>           Reference
#> Prediction Spain France Holland
#>    Spain      10      4       2
#>    France      0      0       1
#>    Holland     4      5       2
#>
#> Overall Statistics
#>
#>                Accuracy : 0.4286
#>                  95% CI : (0.2446, 0.6282)
#>     No Information Rate : 0.5
#>     P-Value [Acc > NIR] : 0.8275
#>
#>                   Kappa : 0.0968
#>
#>  Mcnemar's Test P-Value : 0.0620
#>
#> Statistics by Class:
#>
#>                      Class: Spain Class: France Class: Holland
#> Sensitivity                0.7143       0.00000        0.40000
#> Specificity                0.5714       0.94737        0.60870
#> Pos Pred Value             0.6250       0.00000        0.18182
#> Neg Pred Value             0.6667       0.66667        0.82353
#> Prevalence                 0.5000       0.32143        0.17857
#> Detection Rate             0.3571       0.00000        0.07143
#> Detection Prevalence       0.5714       0.03571        0.39286
#> Balanced Accuracy          0.6429       0.47368        0.50435
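If you also want a threshold-free summary for the three-class case, the pROC package provides a multi-class AUC (Hand and Till's generalisation); a sketch using the probability columns from above:

library(pROC)
# response: the true classes; predictor: the data frame of class probabilities
multiclass.roc(test.data$Country, pred_model_Tree1[, 1:3])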


Created on 2022-06-03 by the reprex package (v2.0.1)

Caret classification thresholds

There are several problems in your code. I will use the PimaIndiansDiabetes data set from mlbench, since it is better suited than the iris data set.

First of all, for splitting data into train and test sets, the code:

train <- data %>% sample_frac(.70)
test <- data %>% sample_frac(.30)

is not suitable, since some rows occurring in the train set will also occur in the test set (each sample_frac call draws independently from the full data).
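You can see the overlap directly; a quick illustration with mtcars (any data frame works once you add an id column):

library(dplyr)
set.seed(1)
df <- mtcars %>% mutate(id = row_number())
train <- df %>% sample_frac(.70)
test <- df %>% sample_frac(.30)
sum(test$id %in% train$id)  # almost certainly > 0: the same rows land in both sets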

Additionally, avoid using function names as object names; it will save you a lot of headaches in the long run.

data(iris)
data <- as.data.frame(iris) # bad object name: data() is a base R function

On to the example:

library(caret)
library(ModelMetrics)
library(dplyr)
library(mlbench)

data(PimaIndiansDiabetes, package = "mlbench")

Create the train and test sets. You may use base R's sample to sample rows, or caret::createDataPartition; the latter is preferable since it tries to preserve the distribution of the response.

set.seed(123)
# Name the argument: the second positional argument of createDataPartition
# is 'times', not 'p'
ind <- createDataPartition(PimaIndiansDiabetes$diabetes, p = 0.7)

tr <- PimaIndiansDiabetes[ind$Resample1, ]
ts <- PimaIndiansDiabetes[-ind$Resample1, ]

This way no rows in the train set will be in the test set.

Let's create the model:

ctrl <- trainControl(method = "cv",
                     number = 5,
                     returnResamp = "none",
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     savePredictions = TRUE,
                     verboseIter = FALSE)

gbmGrid <- expand.grid(interaction.depth = 10,
                       n.trees = 200,
                       shrinkage = 0.01,
                       n.minobsinnode = 4)

set.seed(5627)
gbm_pima <- train(diabetes ~ .,
                  data = tr,
                  method = "gbm",
                  metric = "ROC",
                  tuneGrid = gbmGrid,
                  verbose = FALSE,
                  trControl = ctrl)

Create a vector of probability cutoffs for thresholder:

probs <- seq(0.1, 0.9, by = 0.02)

ths <- thresholder(gbm_pima,
                   threshold = probs,
                   final = TRUE,
                   statistics = "all")

head(ths)

  n.trees interaction.depth shrinkage n.minobsinnode prob_threshold Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall        F1 Prevalence Detection Rate Detection Prevalence
1     200                10      0.01              4           0.10       1.000  0.02222222      0.6562315      1.0000000 0.6562315  1.000 0.7924209  0.6510595      0.6510595            0.9922078
2     200                10      0.01              4           0.12       1.000  0.05213675      0.6633439      1.0000000 0.6633439  1.000 0.7975413  0.6510595      0.6510595            0.9817840
3     200                10      0.01              4           0.14       0.992  0.05954416      0.6633932      0.8666667 0.6633932  0.992 0.7949393  0.6510595      0.6458647            0.9739918
4     200                10      0.01              4           0.16       0.984  0.07435897      0.6654277      0.7936508 0.6654277  0.984 0.7936383  0.6510595      0.6406699            0.9636022
5     200                10      0.01              4           0.18       0.984  0.14188034      0.6821550      0.8750000 0.6821550  0.984 0.8053941  0.6510595      0.6406699            0.9401230
6     200                10      0.01              4           0.20       0.980  0.17179487      0.6886786      0.8833333 0.6886786  0.980 0.8086204  0.6510595      0.6380725            0.9271018
  Balanced Accuracy  Accuracy      Kappa          J      Dist
1         0.5111111 0.6588517 0.02833828 0.02222222 0.9777778
2         0.5260684 0.6692755 0.06586592 0.05213675 0.9478632
3         0.5257721 0.6666781 0.06435166 0.05154416 0.9406357
4         0.5291795 0.6666781 0.07134190 0.05835897 0.9260250
5         0.5629402 0.6901572 0.15350721 0.12588034 0.8585308
6         0.5758974 0.6979836 0.18460584 0.15179487 0.8288729

Extract the threshold probability based on your preferred metric (here Youden's J):

ths %>%
  mutate(prob = probs) %>%
  filter(J == max(J)) %>%
  pull(prob) -> thresh_prob

thresh_prob
0.74

Predict on the test data:

pred <- predict(gbm_pima, newdata = ts, type = "prob")

Create a numeric response (0 or 1) from the response in the test set, since the functions from package ModelMetrics need it:

real <- as.numeric(factor(ts$diabetes)) - 1

ModelMetrics::sensitivity(real, pred$pos, cutoff = thresh_prob)
0.2238806 #based on this it is clear the threshold chosen is not optimal on this test data

ModelMetrics::specificity(real, pred$pos, cutoff = thresh_prob)
0.956

ModelMetrics::kappa(real, pred$pos, cutoff = thresh_prob)
0.2144026 #based on this it is clear the threshold chosen is not optimal on this test data

ModelMetrics::mcc(real, pred$pos, cutoff = thresh_prob)
0.2776309 #based on this it is clear the threshold chosen is not optimal on this test data

ModelMetrics::auc(real, pred$pos)
0.8047463 #decent AUC and low mcc and kappa indicate a poor choice of threshold

AUC is a measure over all thresholds, so it does not require specifying a cutoff threshold.

Since only one train/test split is used, the performance evaluation will be biased. It is best to use nested resampling, so the same can be evaluated over several train/test splits. Here is a way to perform nested resampling.

EDIT: Answer to the questions in comments.

To create the ROC curve you do not need to calculate sensitivity and specificity at all thresholds yourself; you can just use a dedicated package for such a task. The results are probably going to be more trustworthy, too.

I prefer using the pROC package:

library(pROC)

roc.obj <- roc(real, pred$pos)
plot(roc.obj, print.thres = "best")

The best threshold on the figure is the threshold that gives the highest specificity + sensitivity on the test data.
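If you prefer the numbers to the plot, coords() returns the same "best" threshold together with its sensitivity and specificity; a short sketch:

coords(roc.obj, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))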


