Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor X has new levels
tl;dr: it looks like your factor has some levels that are not represented in your data, so they get dropped from the factors used in the model. In hindsight this isn't terribly surprising, since you can't predict responses for levels the model never saw. That said, it's mildly surprising that R doesn't do something nice for you, like generating NA values automatically. You can solve this problem by using levels(droplevels(NH11$r_maritl)) in constructing your prediction frame, or equivalently EW$xlevels$r_maritl.
A reproducible example:
maritl_levels <- c( "0 Under 14 years", "1 Married - spouse in household",
"2 Married - spouse not in household", "3 Married - spouse in household unknown",
"4 Widowed", "5 Divorced", "6 Separated", "7 Never married", "8 Living with partner",
"9 Unknown marital status")
set.seed(101)
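NH11 <- data.frame(everwrk=rbinom(1000,size=1,prob=0.5),
age_p=runif(1000,20,50),
# explicit factor(): since R 4.0.0, data.frame() no longer converts strings to factors
r_maritl = factor(sample(maritl_levels,size=1000,replace=TRUE),levels=maritl_levels))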
Let's make a missing level:
NH11 <- subset(NH11,as.numeric(NH11$r_maritl) != 3)
Fit the model:
EW <- glm(everwrk~r_maritl+age_p,data=NH11,family=binomial)
predEW <- with(NH11,
expand.grid(r_maritl=levels(r_maritl),age_p=mean(age_p,na.rm=TRUE)))
predict(EW,newdata=predEW)
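Success, in the sense that this reproduces the error:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor r_maritl has new levels 2 Married - spouse not in household
Building the prediction frame from the levels stored in the fitted model avoids it: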
predEW <- with(NH11,
expand.grid(r_maritl=EW$xlevels$r_maritl,age_p=mean(age_p,na.rm=TRUE)))
predict(EW,newdata=predEW)
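If you would rather get NA predictions for unseen levels than drop them from the prediction frame, here is a minimal sketch in base R. The helper na_unseen_levels() is a hypothetical name of mine, not something R provides, and the toy data are made up for illustration. It relies on predict() passing NA predictors through (the default na.action = na.pass for lm and glm fits).

```r
# Hypothetical helper (not part of base R): blank out factor values the
# fitted model never saw, so predict() returns NA for those rows.
na_unseen_levels <- function(model, newdata) {
  for (v in names(model$xlevels)) {
    if (v %in% names(newdata)) {
      vals <- as.character(newdata[[v]])
      vals[!vals %in% model$xlevels[[v]]] <- NA
      newdata[[v]] <- factor(vals, levels = model$xlevels[[v]])
    }
  }
  newdata
}

# Toy data: level "c" exists in the factor but never reaches the model
set.seed(1)
d <- data.frame(y = rnorm(30), g = factor(rep(c("a", "b", "c"), each = 10)))
fit <- lm(y ~ g, data = subset(d, g != "c"))

nd <- data.frame(g = factor(c("a", "b", "c"), levels = c("a", "b", "c")))
p <- predict(fit, newdata = na_unseen_levels(fit, nd))
p  # the row with the unseen level comes back as NA
```

The same helper should work on the EW model above, since glm objects also store their factor levels in $xlevels.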
An issue in R Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = attr(object, : factor
It is typical to get this error when the test data has categories that do not exist in the training data.
There are many ways of handling this. Below, I show how we can use recipes to turn all novel categories into NA values.
library(rsample)
library(rpart)
library(recipes)
set.seed(123)
# I'll use mtcars as an example, but I'll add a new categorical column
data <- mtcars
data$var <- sample(c("A", "B"), size = 32, replace = TRUE)
# Split the data
split <- initial_split(data, prop = 0.7)
train <- training(split)
test <- testing(split)
# Add a new factor level to var just for the test data
test$var[1] <- "C"
# Build the model
model <- rpart(
formula = mpg ~ .,
data = train
)
# Error due to the novel factor level
predict(model, newdata = test)
#> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = attr(object, : factor var has new levels C
# Fixing this with recipes ------------------------------------------------
# Create a recipe that handles novel levels in var
rec <-
recipe(mpg ~ ., data = train) %>%
step_other(var) %>% # This is where we handle new categories
prep()
new_train <- bake(rec, new_data = train)
new_test <- bake(rec, new_data = test)
# We build the model again with the prepared data
model <- rpart(
formula = mpg ~ .,
data = new_train
)
# This works
predict(model, newdata = new_test)
#> 1 2 3 4 5 6 7 8
#> 25.61000 25.61000 25.61000 25.61000 15.64167 15.64167 15.64167 25.61000
#> 9 10
#> 25.61000 15.64167
Created on 2022-05-24 by the reprex package (v2.0.1)
An alternative to step_other() would be step_novel().
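For what it's worth, here is a rough sketch of the step_novel() variant. It assumes the recipes package is installed and that step_novel() keeps its default new_level = "new"; the simple row split is my own shortcut to avoid needing rsample, not the split used above.

```r
library(recipes)

set.seed(123)
data <- mtcars
data$var <- sample(c("A", "B"), size = 32, replace = TRUE)

# Simple deterministic split (an assumption, not the rsample split above)
train <- data[1:22, ]
test <- data[23:32, ]
test$var[1] <- "C"  # a level the training data never contains

# step_novel() reserves an extra level for values unseen during prep()
rec <- prep(step_novel(recipe(mpg ~ ., data = train), var))

new_train <- bake(rec, new_data = train)
new_test <- bake(rec, new_data = test)

levels(new_test$var)  # training levels plus the reserved novel level
new_test$var[1]       # the unseen "C" is mapped to the reserved level
```

A model fitted on new_train then knows about the reserved level, so predict() on new_test has nothing new to complain about.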
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : variable lengths differ (found for 'factor(DAF)')
Since one of your predictors (DAF) appears to be a function of your other two predictors, it looks like you are using the predict function incorrectly with respect to DAF.
Since I don't have your model in order to test, this is more of a brute force solution using base R's predict function. Here we are generating all of the possible combinations of your 2 predictor variables and your derived variable (only 4 combinations in this case).
#Devise the new test matrix
predictdf<-expand.grid(Diabetes=c(0,1), AtrialFib = c(0,1))
predictdf$DAF <- predictdf$Diabetes * predictdf$AtrialFib
#convert from integers to factors (to match the model)
#lapply() keeps the data frame; apply() would return a character matrix
predictdf[]<-lapply(predictdf, factor)
#perform the prediction
predict(r123, newdata=predictdf)
To simplify the problem, allow R to calculate the interaction term directly within the linear regression formula:
lm(MWT1Best~factor(Diabetes)*factor(AtrialFib), data=COPD)
Replace the + with * and the model will take all of the interactions into account.
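To illustrate what * does in a formula, here is a small sketch on the built-in mtcars data. The variables am and vs are stand-ins I picked for two 0/1 predictors; they are not the COPD columns from the question.

```r
# a * b is shorthand for a + b + a:b, so these two fits are the same model
fit_star <- lm(mpg ~ factor(am) * factor(vs), data = mtcars)
fit_long <- lm(mpg ~ factor(am) + factor(vs) + factor(am):factor(vs),
               data = mtcars)
same_coefs <- all.equal(coef(fit_star), coef(fit_long))

# No derived interaction column is needed in the new data:
# predict() rebuilds the interaction from the two base variables
newd <- expand.grid(am = c(0, 1), vs = c(0, 1))
p <- predict(fit_star, newdata = newd)
p
```

Because the interaction is built by the formula, the prediction frame only needs the two base columns, which sidesteps the "variable lengths differ" problem with a hand-made DAF column.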
Factor has new levels error for variable I'm not using
You could try updating mod2$xlevels[["y"]] in the model object:
mod2 <- glm(z~.-y, data=train, family="binomial")
mod2$xlevels[["y"]] <- union(mod2$xlevels[["y"]], levels(test$y))
predict(mod2, newdata=test, type="response")
# 5
#0.5546394
Another option would be to exclude (but not remove) "y" from the training data
mod2 <- glm(z~., data=train[,!colnames(train) %in% c("y")], family="binomial")
predict(mod2, newdata=test, type="response")
# 5
#0.5546394
Error in model.frame.default for Predict() - Factor has new levels - For a Char Variable
The person who answered the question in the post you linked to already gave an indication of why myCharVar is still considered in the model. When you use z~.-y, the formula basically expands to z~(x+y)-y.
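You can check this yourself with terms(). The sketch below uses a tiny made-up data frame with the same column names (z, x, y) as the formula above. The removed variable disappears from the model terms but is still listed among the variables, which is why it still ends up in the model frame.

```r
# Made-up data with the same column names as in the formula above
train <- data.frame(
  z = c(0, 1, 0, 1),
  x = c(1.2, 0.7, 3.1, 2.4),
  y = factor(c("a", "b", "a", "b"))
)

tt <- terms(z ~ . - y, data = train)

formula(tt)                   # the expanded formula, with the dot filled in
labs <- attr(tt, "term.labels")   # terms that actually enter the model
vars <- all.vars(attr(tt, "variables"))  # variables the model frame evaluates

# y is not a model term, but it is still a column of the model frame
mf <- model.frame(tt, data = train)
names(mf)
```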
Now, to answer your other question: Consider the following quote from the predict()
documentation: "For factor variables having numeric levels, you can specify the numeric values in newdata without first converting the variables to factors. These numeric values are checked to make sure they match a level, then the variable is converted internally to a factor".
I think we can assume that the same kind of behaviour occurs for myCharVar. The myCharVar values are first checked against the corresponding existing levels in the model, and this is where it goes wrong: the test set contains values for myCharVar that were never encountered during the training of the model. (Note that the glm function itself also performs factor conversion. It throws a warning when conversion needs to take place.) In summary, the error basically means that the model is unable to make predictions for levels in the test data that were never encountered during the training of the model.
In this post there is another clarification given on the issue.
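Here is a small sketch of that behaviour with a character predictor. The data and the column name grp are made up, and I use lm instead of glm just to keep the toy fit free of warnings; the level check happens in the same place for both.

```r
# Character predictor: the model stores the values seen in training as levels
train <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  grp = c("a", "b", "a", "b", "a", "b"),
  stringsAsFactors = FALSE
)
fit <- lm(y ~ grp, data = train)
fit$xlevels  # the levels the model knows about

# "c" was never seen in training, so predict() stops with an error
test <- data.frame(grp = c("a", "c"), stringsAsFactors = FALSE)
msg <- tryCatch(predict(fit, newdata = test),
                error = function(e) conditionMessage(e))
msg

# One way out: only predict for rows whose value the model has seen
seen <- test$grp %in% fit$xlevels$grp
p <- predict(fit, newdata = test[seen, , drop = FALSE])
```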