One-class classification with SVM in R
I think this is what you want:
library(e1071)
data(iris)
df <- iris
df <- subset(df , Species=='setosa') #choose only one of the classes
x <- subset(df, select = -Species) #make x variables
y <- df$Species #make y variable(dependent)
model <- svm(x, y, type='one-classification') #train a one-classification model
print(model)
summary(model) #print summary
# test on the whole set
pred <- predict(model, subset(iris, select=-Species)) #create predictions
Output:
Summary:
> summary(model)
Call:
svm.default(x = x, y = y, type = "one-classification")
Parameters:
SVM-Type: one-classification
SVM-Kernel: radial
gamma: 0.25
nu: 0.5
Number of Support Vectors: 27
Number of Classes: 1
Predictions (only the predictions for rows where Species=='setosa' are shown here, for readability):
> pred
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE
23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
45 46 47 48 49 50
FALSE TRUE TRUE TRUE TRUE TRUE
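Since the model was trained only on setosa, a quick way to sanity-check it is to compare its TRUE/FALSE predictions on the full iris set against whether each row actually is setosa. A minimal sketch, re-running the fit from above:

```r
library(e1071)
data(iris)

# Refit as above: train a one-class SVM on the setosa rows only
train_x <- subset(iris, Species == 'setosa', select = -Species)
model <- svm(train_x, type = 'one-classification')

# Predict over the whole data set and cross-tabulate against the true class
pred <- predict(model, subset(iris, select = -Species))
table(Predicted = pred, ActuallySetosa = iris$Species == 'setosa')
```

With the default nu of 0.5, expect roughly half the setosa rows to be flagged FALSE, as in the output above; lowering nu tightens the bound on rejected training points.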
One Class Classification in R language. What am I doing wrong when generating the confusion matrix?
I see a number of issues. First, it seems that a lot of your data is of class character rather than numeric, which the classifier requires. Let's pick some columns and convert them to numeric. I will use data.table because fread is very convenient.
library(caret)
library(e1071)
library(data.table)
setDT(ds)
#Choose columns
mycols <- c("id","bp","sg","al","su")
#Convert to numeric
ds[,(mycols) := lapply(.SD, as.numeric),.SDcols = mycols]
#Convert classification to logical
data <- ds[, .(bp, sg, al, su, classification = classification == "ckd")]
data
bp sg al su classification
1: 80 1.020 1 0 TRUE
2: 50 1.020 4 0 TRUE
3: 80 1.010 2 3 TRUE
4: 70 1.005 4 0 TRUE
5: 80 1.010 2 0 TRUE
---
396: 80 1.020 0 0 FALSE
397: 70 1.025 0 0 FALSE
398: 80 1.020 0 0 FALSE
399: 60 1.025 0 0 FALSE
400: 80 1.025 0 0 FALSE
Once the data is cleaned up, you can sample a training and test set with createDataPartition, as in your original code.
#Sample data for training and test set
inTrain<-createDataPartition(1:nrow(data),p=0.6,list=FALSE)
train<- data[inTrain,]
test <- data[-inTrain,]
Then we can create the model and make the predictions.
svm.model<-svm(classification ~ bp + sg + al + su, data = train,
type='one-classification',
nu=0.10,
scale=TRUE,
kernel="radial")
#Perform predictions
svm.predtrain<-predict(svm.model,train)
svm.predtest<-predict(svm.model,test)
Your main issue with the cross table was that the model can only predict for cases without any NAs, so you have to subset the classification levels to those that have predictions. Then you can evaluate with confusionMatrix:
confTrain <- table(Predicted=svm.predtrain,
Reference=train$classification[as.integer(names(svm.predtrain))])
confTest <- table(Predicted=svm.predtest,
Reference=test$classification[as.integer(names(svm.predtest))])
confusionMatrix(confTest,positive='TRUE')
Confusion Matrix and Statistics
Reference
Predicted FALSE TRUE
FALSE 0 17
TRUE 55 64
Accuracy : 0.4706
95% CI : (0.3845, 0.558)
No Information Rate : 0.5956
P-Value [Acc > NIR] : 0.9988
Kappa : -0.2361
Mcnemar's Test P-Value : 1.298e-05
Sensitivity : 0.7901
Specificity : 0.0000
Pos Pred Value : 0.5378
Neg Pred Value : 0.0000
Prevalence : 0.5956
Detection Rate : 0.4706
Detection Prevalence : 0.8750
Balanced Accuracy : 0.3951
'Positive' Class : TRUE
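An alternative to subsetting the reference labels by the prediction names is to drop incomplete rows up front with na.omit(), so the prediction and reference vectors line up one-to-one. A sketch assuming the cleaned `data` table built above:

```r
library(caret)
library(e1071)

# Drop rows containing any NA so predictions align 1:1 with the labels
complete <- na.omit(data)

set.seed(1)  # for a reproducible partition
inTrain <- createDataPartition(1:nrow(complete), p = 0.6, list = FALSE)
train <- complete[inTrain, ]
test  <- complete[-inTrain, ]

svm.model <- svm(classification ~ bp + sg + al + su, data = train,
                 type = 'one-classification', nu = 0.10,
                 scale = TRUE, kernel = 'radial')

# No NA rows remain, so no index bookkeeping is needed
confusionMatrix(table(Predicted = predict(svm.model, test),
                      Reference = test$classification),
                positive = 'TRUE')
```

Note that this discards rows with missing values entirely, whereas the name-based subsetting above keeps partially complete rows out of the evaluation only.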
Data
library(archive)
library(data.table)
tf1 <- tempfile(fileext = ".rar")
#Download data file
download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/00336/Chronic_Kidney_Disease.rar", tf1)
tf2 <- tempfile()
#Un-rar file
archive_extract(tf1, tf2)
#Read in data
ds <- fread(paste0(tf2,"/Chronic_Kidney_Disease/chronic_kidney_disease.arff"), fill = TRUE, skip = "48")
#Remove erroneous last column
ds[,V26:= NULL]
#Set column names (from header)
setnames(ds,c("id","bp","sg","al","su","rbc","pc","pcc","ba","bgr","bu","sc","sod","pot","hemo","pcv","wc","rc","htn","dm","cad","appet","pe","ane","classification"))
#Replace "?" with NA
ds[ds == "?"] <- NA
Which algorithm does R use for computing one-class SVM? (package e1071)
You can see the following link:
https://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf
The vignette gives the dual problem formulation of the SVM algorithm this package uses (for one-class SVM, see formulation (3) on page 7). A straightforward transformation from the dual to the primal problem shows that this default implementation is the one Schölkopf suggested; see the paper:
https://www.stat.purdue.edu/~yuzhu/stat598m3/Papers/NewSVM.pdf
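For reference, the primal problem from Schölkopf et al. (to which the dual on page 7 of the vignette corresponds) is, for training points \(x_1,\dots,x_n\) mapped into feature space by \(\phi\):

```latex
\min_{w,\,\xi,\,\rho}\;\; \frac{1}{2}\|w\|^2
  + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho
\quad\text{s.t.}\quad
\langle w, \phi(x_i)\rangle \ge \rho - \xi_i,\;\; \xi_i \ge 0,
```

with decision function \(f(x) = \operatorname{sgn}(\langle w, \phi(x)\rangle - \rho)\). The parameter \(\nu \in (0,1]\) is an upper bound on the fraction of training points treated as outliers and a lower bound on the fraction of support vectors, which is why the nu argument controls how strict the fitted models above are.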