
# Error in Confusion Matrix:The Data and Reference Factors Must Have the Same Number of Levels

## R confusionMatrix error data and reference factors with same levels

The error from `confusionMatrix()` tells us that the two variables passed to the function need to be factors with the same values. We can see why we received the error when we run `str()` on both variables.

```r
> str(pred)
 Factor w/ 5318 levels "-23.6495182533792",..: 310 339 419 1105 310 353 1062 942 594 1272 ...
> str(wine_quality$fixed.acidity)
 num [1:6497] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
```

`pred` is a factor, whereas `wine_quality$fixed.acidity` is a numeric vector. The `confusionMatrix()` function is used to compare predicted and actual values of a dependent variable. It is not intended to cross-tabulate a predicted variable against an independent variable.
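To see the requirement concretely, here is a minimal sketch (with made-up vectors, not the wine data) of a call that satisfies `confusionMatrix()`:

```r
library(caret)

# Hypothetical predicted and actual values: both are factors with
# identical levels, which is what confusionMatrix() requires.
actual    <- factor(c("red", "white", "white", "red"), levels = c("red", "white"))
predicted <- factor(c("red", "white", "red",   "red"), levels = c("red", "white"))

confusionMatrix(data = predicted, reference = actual)

# Passing a numeric vector, or factors whose levels differ, produces the
# "should be factors with the same levels" error instead.
```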

Code in the question uses `fixed.acidity` in the confusion matrix when it should be comparing predicted values of `type` against actual values of `type` from the testing data.

Also, the code in the question creates the model prior to splitting the data into test and training data. The correct procedure is to split the data before building a model on the training data, make predictions with the testing (hold back) data, and compare actuals to predictions in the testing data.

Finally, the result of the `predict()` function as coded in the original post is the linear predicted values from the GLM model (equivalent to `wine_model\$linear.predictors` in the output model object). These values must be further transformed to make them suitable before use in `confusionMatrix()`.
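As a sketch of that transformation (assuming the original post's `wine_model` GLM object and a `testing` data frame, neither of which is shown here), the linear predictors can be mapped through the inverse logit and thresholded into class labels; note that which level counts as the "success" class depends on the factor level order:

```r
# Sketch only: `wine_model` and `testing` are assumed from the original post.
link_pred  <- predict(wine_model, newdata = testing)   # linear predictors (log-odds)
prob_pred  <- plogis(link_pred)                        # inverse logit -> probabilities
class_pred <- factor(ifelse(prob_pred > 0.5, "white", "red"),
                     levels = levels(testing$type))    # match the reference levels

confusionMatrix(data = class_pred, reference = testing$type)
```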

In practice, it's easier to use `caret::train()` with the GLM method and binomial family, where `predict()` will generate results that are usable in `confusionMatrix()`. We'll illustrate this with the UCI wine quality data.

First, we download the data from the UCI Machine Learning Repository to make the example reproducible.

```r
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
              "./data/wine_quality_red.csv")
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
              "./data/wine_quality_white.csv")
```

Second, we load the data, assign `type` as either red or white depending on the data file, and bind the data into a single data frame.

```r
red <- read.csv("./data/wine_quality_red.csv", header = TRUE, sep = ";")
white <- read.csv("./data/wine_quality_white.csv", header = TRUE, sep = ";")
red$type <- "red"
white$type <- "white"
wine_quality <- rbind(red, white)
wine_quality$type <- factor(wine_quality$type)
```

Next, we split the data into test and training sets based on the values of `type`, so that each data frame gets a proportional number of red and white wines, and then train a model with the default `caret::train()` settings and the GLM method.

```r
library(caret)
set.seed(123)
inTrain <- createDataPartition(wine_quality$type, p = 3/4)[[1]]
training <- wine_quality[ inTrain, ]
testing  <- wine_quality[-inTrain, ]
aModel <- train(type ~ ., data = training, method = "glm", family = "binomial")
```

Finally, we use the model to make predictions on the hold back data frame, and run a confusion matrix.

```r
testLM <- predict(aModel, testing)
confusionMatrix(data = testLM, reference = testing$type)
```

...and the output:

```r
> confusionMatrix(data = testLM, reference = testing$type)
Confusion Matrix and Statistics

          Reference
Prediction  red white
     red    393     3
     white    6  1221

               Accuracy : 0.9945
                 95% CI : (0.9895, 0.9975)
    No Information Rate : 0.7542
    P-Value [Acc > NIR] : <2e-16

                  Kappa : 0.985

 Mcnemar's Test P-Value : 0.505

            Sensitivity : 0.9850
            Specificity : 0.9975
         Pos Pred Value : 0.9924
         Neg Pred Value : 0.9951
             Prevalence : 0.2458
         Detection Rate : 0.2421
   Detection Prevalence : 0.2440
      Balanced Accuracy : 0.9913

       'Positive' Class : red
```

## Confusion Matrix Error: Error: `data` and `reference` should be factors with the same levels

Also change `type = "prob"` to `type = "raw"` in the `predict()` call, so that predictions are returned as factor class labels rather than class probabilities.

```r
Table1 <- table(NNPredictions, test$Rank, useNA = "ifany")
cnf1 <- confusionMatrix(Table1)
```
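As a self-contained illustration of the `"prob"`/`"raw"` distinction (using the built-in `iris` data and a knn model rather than the original poster's neural network): `type = "prob"` returns a data frame of class probabilities, which `confusionMatrix()` cannot compare against a factor, while `type = "raw"` returns a factor with the same levels as the reference.

```r
library(caret)
set.seed(42)

# Split the built-in iris data and fit a simple model.
idx      <- createDataPartition(iris$Species, p = 0.75)[[1]]
train_df <- iris[idx, ]
test_df  <- iris[-idx, ]
model    <- train(Species ~ ., data = train_df, method = "knn")

probs  <- predict(model, test_df, type = "prob")  # data frame of probabilities
labels <- predict(model, test_df, type = "raw")   # factor of predicted classes

confusionMatrix(data = labels, reference = test_df$Species)
```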

Answer provided by dclarson