How to plot an ROC curve in RStudio from the given values?
Using only base R, you could write the following code:
## your data
df <- read.table(header = TRUE, text = "
Cut_off TP FP TN FN
0.1 100 50 500 450
0.2 150 100 450 400
0.3 250 150 400 300
0.4 300 200 350 250
0.5 350 250 300 200
0.6 350 300 250 200
0.7 400 350 200 150
0.8 400 400 150 150
0.9 450 450 100 100
1.0 500 500 50 50")
## calculate the False Positive Rate
df$FPR <- df$FP/(df$FP + df$TN)
## calculate the True Positive Rate
df$TPR <- df$TP/(df$TP + df$FN)
## df is now:
Cut_off  TP  FP  TN  FN        FPR       TPR
    0.1 100  50 500 450 0.09090909 0.1818182
    0.2 150 100 450 400 0.18181818 0.2727273
    0.3 250 150 400 300 0.27272727 0.4545455
    0.4 300 200 350 250 0.36363636 0.5454545
    0.5 350 250 300 200 0.45454545 0.6363636
    0.6 350 300 250 200 0.54545455 0.6363636
    0.7 400 350 200 150 0.63636364 0.7272727
    0.8 400 400 150 150 0.72727273 0.7272727
    0.9 450 450 100 100 0.81818182 0.8181818
    1.0 500 500  50  50 0.90909091 0.9090909
## plot the ROC with base plot
plot(df$FPR, df$TPR, type = "b",
     xlim = c(0, 1), ylim = c(0, 1),
     main = "ROC Curve",
     xlab = "False Positive Rate (1 - Specificity)",
     ylab = "True Positive Rate (Sensitivity)",
     col = "blue")
abline(a = 0, b = 1, lty=2, col = "grey") ### pure chance line
yielding the following plot:
If you want to mark the cut-off points with labels, add the following line after the abline(...) call:
text(df$FPR, df$TPR+.05, df$Cut_off, col = "blue", cex = .7)
yielding this plot:
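Python isn't needed here, but if you want to sanity-check the rates (and a rough AUC) outside R, a minimal NumPy sketch using the same counts might look like this (the trapezoidal AUC is my addition, not part of the original answer):

```python
import numpy as np

# same counts as in the R data frame above
TP = np.array([100, 150, 250, 300, 350, 350, 400, 400, 450, 500])
FP = np.array([ 50, 100, 150, 200, 250, 300, 350, 400, 450, 500])
TN = np.array([500, 450, 400, 350, 300, 250, 200, 150, 100,  50])
FN = np.array([450, 400, 300, 250, 200, 200, 150, 150, 100,  50])

FPR = FP / (FP + TN)  # false positive rate
TPR = TP / (TP + FN)  # true positive rate

# trapezoidal AUC, padding the curve with the (0, 0) and (1, 1) endpoints
x = np.r_[0.0, FPR, 1.0]
y = np.r_[0.0, TPR, 1.0]
auc = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2)
```

The first row gives FPR = 50/550 and TPR = 100/550, matching the R output above.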
Calculate ROC curve, classification report and confusion matrix for a multilabel classification problem
From v0.21 onwards, scikit-learn includes a multilabel confusion matrix; adapting the example from the docs for 5 classes:
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

y_true = np.array([[1, 0, 1, 0, 0],
                   [0, 1, 0, 1, 1],
                   [1, 1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0, 1],
                   [0, 1, 1, 1, 0],
                   [1, 1, 1, 0, 0]])
multilabel_confusion_matrix(y_true, y_pred)
# result:
array([[[1, 0],
        [0, 2]],

       [[1, 0],
        [0, 2]],

       [[0, 1],
        [1, 1]],

       [[2, 0],
        [0, 1]],

       [[0, 1],
        [2, 0]]])
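Each 2x2 block follows scikit-learn's [[TN, FP], [FN, TP]] layout, computed one-vs-rest per label. As a cross-check, here is class 2 (the third column) counted by hand:

```python
import numpy as np

y_true = np.array([[1, 0, 1, 0, 0],
                   [0, 1, 0, 1, 1],
                   [1, 1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0, 1],
                   [0, 1, 1, 1, 0],
                   [1, 1, 1, 0, 0]])

# one-vs-rest counts for class 2 (third column)
t, p = y_true[:, 2], y_pred[:, 2]
tn = int(np.sum((t == 0) & (p == 0)))
fp = int(np.sum((t == 0) & (p == 1)))
fn = int(np.sum((t == 1) & (p == 0)))
tp = int(np.sum((t == 1) & (p == 1)))

# [[tn, fp], [fn, tp]] reproduces the third block: [[0, 1], [1, 1]]
```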
The usual classification_report also works fine:
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))
# result
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         2
           1       1.00      1.00      1.00         2
           2       0.50      0.50      0.50         2
           3       1.00      1.00      1.00         1
           4       0.00      0.00      0.00         2

   micro avg       0.75      0.67      0.71         9
   macro avg       0.70      0.70      0.70         9
weighted avg       0.67      0.67      0.67         9
 samples avg       0.72      0.64      0.67         9
Regarding ROC, you can take some ideas from the "Plot ROC curves for the multilabel problem" example in the docs (though I am not quite sure the concept itself is very useful here).
Confusion matrix and classification report require hard class predictions (as in the example); ROC requires the predictions as probabilities.
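For instance, given per-label probability scores (here I reuse the example probabilities from the thresholding snippet further down; in practice they would come from your model's predict_proba), a one-vs-rest sketch with roc_curve could look like:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.array([[1, 0, 1, 0, 0],
                   [0, 1, 0, 1, 1],
                   [1, 1, 1, 0, 1]])
# example probability scores (stand-ins for real model output)
y_score = np.array([[0.9, 0.05, 0.12, 0.23, 0.78],
                    [0.11, 0.81, 0.51, 0.63, 0.34],
                    [0.68, 0.89, 0.76, 0.43, 0.27]])

# one ROC curve (and AUC) per label, one-vs-rest
aucs = []
for i in range(y_true.shape[1]):
    fpr, tpr, _ = roc_curve(y_true[:, i], y_score[:, i])
    aucs.append(auc(fpr, tpr))
```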
To convert your probabilistic predictions to hard classes, you need a threshold. Usually (and implicitly) this threshold is taken to be 0.5, i.e. predict 1 if y_pred > 0.5, else predict 0. Nevertheless, this is not always the case, and it depends on the particular problem. Once you have set such a threshold, you can easily convert your probabilistic predictions to hard classes with a list comprehension; here is a simple example:
import numpy as np

y_prob = np.array([[0.9, 0.05, 0.12, 0.23, 0.78],
                   [0.11, 0.81, 0.51, 0.63, 0.34],
                   [0.68, 0.89, 0.76, 0.43, 0.27]])

thresh = 0.5
y_pred = np.array([[1 if i > thresh else 0 for i in j] for j in y_prob])

y_pred
# result:
array([[1, 0, 0, 0, 1],
       [0, 1, 1, 1, 0],
       [1, 1, 1, 0, 0]])
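With NumPy arrays, the same conversion can also be written without the list comprehension, by comparing the whole array against the threshold at once:

```python
import numpy as np

y_prob = np.array([[0.9, 0.05, 0.12, 0.23, 0.78],
                   [0.11, 0.81, 0.51, 0.63, 0.34],
                   [0.68, 0.89, 0.76, 0.43, 0.27]])

# boolean mask -> 0/1 integers, elementwise over the whole array
y_pred = (y_prob > 0.5).astype(int)
```

Both versions produce the same array; the vectorized form is the idiomatic NumPy way and scales better to large arrays.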
Manually computed ROC curve doesn't match sklearn.metrics
FPR is not 1 - precision. The former is FP/(FP + TN), while the latter is FP/(FP + TP).
Correcting the recall_fpr function to use
False_Positive_rate = round(cm[1, 0] / (cm[1, 0] + cm[1, 1]), 3) #FP /(FP + TN)
gives the correct ROC curve:
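To see numerically why confusing the two quantities breaks the curve, here is a toy confusion matrix (the counts are made up purely for illustration):

```python
# hypothetical counts: TP, FP, TN, FN
tp, fp, tn, fn = 30, 10, 50, 10

fpr = fp / (fp + tn)                  # FP / (FP + TN) -> 10/60
one_minus_precision = fp / (fp + tp)  # FP / (FP + TP) -> 10/40

# the two clearly differ, so plotting 1 - precision on the
# x-axis does not give an ROC curve
```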
Confusion matrix, threshold and ROC curve in statsmodels Logit
Well, I think it's because your data is imbalanced: you have a label=1 to label=0 ratio of 0.83%. You can try the LogisticRegression estimator from the sklearn package, where you have the option to specify class_weight='balanced'. I am not sure whether statsmodels also supports this. Alternatively, you could resample your data to fix the imbalance. For that, I highly recommend the imblearn package, which is an extension of scikit-learn and straightforward to use.
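A minimal sketch of the class_weight='balanced' option (the data here is synthetic, just to illustrate the API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# synthetic imbalanced data: 5 positives out of 200 samples
y = np.zeros(200, dtype=int)
y[:5] = 1
X = rng.normal(size=(200, 3))
X[y == 1] += 2.0  # give the minority class some signal

# 'balanced' reweights samples inversely proportional to class frequencies
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

On the resampling side, imblearn offers e.g. RandomOverSampler and SMOTE, which plug into the same fit/predict workflow.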