R confusionMatrix error data and reference factors with same levels
The error from confusionMatrix() tells us that the two variables passed to the function need to be factors with the same levels. We can see why we received the error when we run str() on both variables.
> str(pred)
Factor w/ 5318 levels "-23.6495182533792",..: 310 339 419 1105 310 353 1062 942 594 1272 ...
> str(wine_quality$fixed.acidity)
num [1:6497] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
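pred is a factor, whereas wine_quality$fixed.acidity is a numeric vector. The 5318 levels of pred are numeric strings, which means continuous predictions were coerced to a factor rather than class labels. The confusionMatrix() function is used to compare predicted and actual values of a dependent variable. It is not intended to cross tabulate a predicted variable and an independent variable.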
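A quick way to diagnose this before calling confusionMatrix() is to check that both inputs are factors and that their levels match. The sketch below uses small made-up vectors; predicted, actual, not_a_class and the helper same_levels() are hypothetical names for illustration, not objects from the question.

```r
# Hypothetical stand-ins for a prediction vector and a reference vector
predicted   <- factor(c("red", "white", "white", "red"), levels = c("red", "white"))
actual      <- factor(c("red", "white", "red", "red"),   levels = c("red", "white"))
not_a_class <- c(7.4, 7.8, 7.8, 11.2)   # numeric, like fixed.acidity

# TRUE only when both inputs are factors with identical levels
same_levels <- function(a, b) {
  is.factor(a) && is.factor(b) && identical(levels(a), levels(b))
}

same_levels(predicted, actual)        # TRUE
same_levels(predicted, not_a_class)   # FALSE
```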
The code in the question uses fixed.acidity in the confusion matrix when it should be comparing predicted values of type against actual values of type from the testing data.
Also, the code in the question creates the model prior to splitting the data into test and training data. The correct procedure is to split the data before building a model on the training data, make predictions with the testing (hold back) data, and compare actuals to predictions in the testing data.
Finally, the result of the predict() function as coded in the original post is the linear predicted values from the GLM model (equivalent to wine_model$linear.predictors in the output model object). For a binomial model with the default logit link these values are log-odds, so they must be converted to class labels (a factor with the same levels as the reference) before use in confusionMatrix().
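As a rough sketch of that transformation in base R, assuming a binomial GLM with the default logit link: the data here are a stand-in (two species from the built-in iris data set, not the wine data), and the 0.5 cutoff is an arbitrary choice made for illustration.

```r
# Stand-in data (an assumption): two iris species play the role of the two classes
set.seed(1)
dat <- droplevels(subset(iris, Species != "setosa"))
idx <- sample(nrow(dat), size = 75)
training <- dat[idx, ]
testing  <- dat[-idx, ]

fit <- glm(Species ~ Sepal.Length + Sepal.Width, data = training, family = binomial)

# Default predict() output for a glm: the linear predictor (log-odds)
log_odds <- predict(fit, newdata = testing)

# The inverse logit gives the probability of the second factor level ("virginica")
prob_virginica <- plogis(log_odds)

# Threshold at 0.5 (arbitrary) and rebuild a factor with the reference levels
pred_class <- factor(
  ifelse(prob_virginica > 0.5, "virginica", "versicolor"),
  levels = levels(testing$Species)
)

# Base-R cross tabulation of predictions against held-back actuals
table(Prediction = pred_class, Reference = testing$Species)
```

Because pred_class and testing$Species are now factors with identical levels, they are in the form confusionMatrix() expects.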
In practice, it's easier to use caret::train() with the GLM method and binomial family, where predict() will generate results that are usable in confusionMatrix(). We'll illustrate this with the UCI wine quality data.
First, we download the data from the UCI Machine Learning Repository to make the example reproducible.
# Create the target directory first; download.file() does not create it
dir.create("./data", showWarnings = FALSE)
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
              "./data/wine_quality_red.csv")
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
              "./data/wine_quality_white.csv")
Second, we load the data, assign type as either red or white depending on the data file, and bind the data into a single data frame.
red <- read.csv("./data/wine_quality_red.csv",header = TRUE,sep=";")
white <- read.csv("./data/wine_quality_white.csv",header = TRUE,sep=";")
red$type <- "red"
white$type <- "white"
wine_quality <- rbind(red,white)
wine_quality$type <- factor(wine_quality$type)
Next, we split the data into test and training sets based on values of type, so each data frame gets a proportional number of red and white wines, and train the model with the default caret::train() settings and a GLM method.
library(caret)
set.seed(123)
inTrain <- createDataPartition(wine_quality$type, p = 3/4)[[1]]
training <- wine_quality[ inTrain,]
testing <- wine_quality[-inTrain,]
aModel <- train(type ~ ., data = training, method = "glm", family = "binomial")
Finally, we use the model to make predictions on the hold back data frame, and run a confusion matrix.
testLM <- predict(aModel,testing)
confusionMatrix(data=testLM,reference=testing$type)
...and the output:
> confusionMatrix(data=testLM,reference=testing$type)
Confusion Matrix and Statistics

          Reference
Prediction  red white
     red    393     3
     white    6  1221
Accuracy : 0.9945
95% CI : (0.9895, 0.9975)
No Information Rate : 0.7542
P-Value [Acc > NIR] : <2e-16
Kappa : 0.985
Mcnemar's Test P-Value : 0.505
Sensitivity : 0.9850
Specificity : 0.9975
Pos Pred Value : 0.9924
Neg Pred Value : 0.9951
Prevalence : 0.2458
Detection Rate : 0.2421
Detection Prevalence : 0.2440
Balanced Accuracy : 0.9913
'Positive' Class : red
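To see where the headline statistics come from, the counts in the matrix above can be plugged into the usual definitions by hand. This is only a base-R check of the arithmetic with "red" as the positive class; the object name cm is made up for the illustration.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(c(393, 6, 3, 1221), nrow = 2,
             dimnames = list(Prediction = c("red", "white"),
                             Reference  = c("red", "white")))

accuracy    <- sum(diag(cm)) / sum(cm)                  # correct / total
sensitivity <- cm["red", "red"] / sum(cm[, "red"])      # true reds found
specificity <- cm["white", "white"] / sum(cm[, "white"]) # true whites found

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
# accuracy 0.9945, sensitivity 0.9850, specificity 0.9975
```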
'factors with the same levels' in Confusion Matrix
I made a toy data set and examined your code. There were a couple issues:
- R has an easier time with variable names that follow a certain style. Your 'Customer type' variable has a space in it. In general, coding is easier when you avoid spaces, so I renamed it 'Customer_type'. For your data.frame you could simply edit the source file, or use names(df) <- gsub("Customer type", "Customer_type", names(df)).
- I coded 'Customer_type' as a factor. For you this will look like df$Customer_type <- factor(df$Customer_type).
- The documentation for sample.split() says the first argument 'Y' should be a vector of labels, one label per row of the data. But in your code you gave the variable name. Pass the column itself, df$Customer_type, so that sample.split() returns one TRUE/FALSE flag per row. In my example the labels take the values High, Med and Low; to see the levels of your variable you could use levels(df$Customer_type).
- Adjust the rpart() call as shown below.
With these adjustments, your code might be OK.
# toy data
df <- data.frame(City = factor(sample(c("Paris", "Tokyo", "Miami"), 100, replace = TRUE)),
                 Customer_type = factor(sample(c("High", "Med", "Low"), 100, replace = TRUE)),
                 Gender = factor(sample(c("Female", "Male"), 100, replace = TRUE)),
                 Quantity = sample(1:10, 100, replace = TRUE),
                 Total = sample(1:10, 100, replace = TRUE),
                 Date = sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by = "day"), 100),
                 Rating = factor(sample(1:5, 100, replace = TRUE)))
library(rpart)
library(caret)
library(dplyr)
library(caTools)
library(data.tree)
library(e1071)
#Splitting into training and testing data
set.seed(123)
# One TRUE/FALSE flag per row; named in_train to avoid masking base::sample()
in_train <- sample.split(df$Customer_type, SplitRatio = .70) # PASS YOUR LABEL COLUMN HERE
train <- subset(df, in_train == TRUE)
test <- subset(df, in_train == FALSE)
#Training the Decision Tree Classifier
tree <- rpart(Customer_type ~., data = train) # ADJUST YOUR CODE SO IT'S LIKE THIS
#Predictions
tree.customertype.predicted <- predict(tree, test, type= 'class')
#confusion Matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$Customer_type)
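If the reference column in the test set turns out to be a character vector, or a factor whose levels differ from those of the predictions, one option is to rebuild it with the prediction's levels before calling confusionMatrix(). A minimal sketch with made-up vectors (pred and ref_raw are hypothetical, not taken from the toy data above):

```r
# Hypothetical predictions and a character reference with no "High" among its values
pred    <- factor(c("High", "Low", "Med", "Low"), levels = c("High", "Low", "Med"))
ref_raw <- c("Low", "Low", "Med", "Low")

# Rebuild the reference as a factor using the prediction's levels
ref <- factor(ref_raw, levels = levels(pred))

identical(levels(pred), levels(ref))   # TRUE
table(Prediction = pred, Reference = ref)
```

Any reference value that is not among the prediction's levels becomes NA under this approach, so it is worth checking anyNA(ref) afterwards.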