Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor X has new levels

tl;dr it looks like your factor has some levels that are not represented in your data, and these get dropped from the factor levels used in the model. In hindsight this isn't terribly surprising, since the model has no way to predict responses for levels it never saw. That said, it's mildly surprising that R doesn't do something nice for you like generating NA values automatically. You can solve this problem by using levels(droplevels(NH11$r_maritl)) in constructing your prediction frame, or equivalently EW$xlevels$r_maritl.

A reproducible example:

maritl_levels <- c( "0 Under 14 years", "1 Married - spouse in household", 
"2 Married - spouse not in household", "3 Married - spouse in household unknown",
"4 Widowed", "5 Divorced", "6 Separated", "7 Never married", "8 Living with partner",
"9 Unknown marital status")
set.seed(101)
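NH11 <- data.frame(everwrk=rbinom(1000,size=1,prob=0.5),
age_p=runif(1000,20,50),
# explicit factor(): since R 4.0, data.frame() no longer converts strings to factors
r_maritl = factor(sample(maritl_levels,size=1000,replace=TRUE),levels=maritl_levels))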

Let's make a missing level:

NH11 <- subset(NH11,as.numeric(NH11$r_maritl) != 3)

Fit the model:

EW <- glm(everwrk~r_maritl+age_p,data=NH11,family=binomial)
predEW <- with(NH11,
expand.grid(r_maritl=levels(r_maritl),age_p=mean(age_p,na.rm=TRUE)))
predict(EW,newdata=predEW)

Success! That reproduces the error:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor r_maritl has new levels 2 Married - spouse not in household

Now fix the problem by building the prediction frame from the levels stored in the model:

predEW <- with(NH11,
expand.grid(r_maritl=EW$xlevels$r_maritl,age_p=mean(age_p,na.rm=TRUE)))
predict(EW,newdata=predEW)
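
To see why the two suggested fixes agree, here is a minimal sketch on toy data (the names f, d, fit and nd are made up for illustration and are not part of the original question). It assumes lm(), which, like glm(), drops unused factor levels when it builds the model frame:

```r
set.seed(1)
# Toy factor with three declared levels, but "c" never occurs in the data
f <- factor(sample(c("a", "b"), 50, replace = TRUE), levels = c("a", "b", "c"))
d <- data.frame(y = rnorm(50), f = f)

fit <- lm(y ~ f, data = d)

levels(d$f)              # all declared levels, including the unused "c"
levels(droplevels(d$f))  # only the levels present in the data
fit$xlevels$f            # the levels the model was actually fitted with

# Building newdata from the model's own levels avoids the "new levels" error
nd <- data.frame(f = factor(fit$xlevels$f))
p <- predict(fit, newdata = nd)
```

Using the xlevels component has the small advantage that it cannot drift out of sync with the fitted model if the data frame is modified later.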

An issue in R Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = attr(object, : factor

It is typical to get this error when the test data has categories that do not exist in the training data.
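
Before reaching for a package, it can help to check which categories are actually new. The following is a rough base R sketch on made-up data (train, test, fit and novel are illustrative names, not from the question): it lists the unseen categories and then recodes the test column with the model's own levels, so that anything unseen becomes NA.

```r
set.seed(2)
# Made-up data: "C" appears only in the test set
train <- data.frame(y = rnorm(20), g = factor(rep(c("A", "B"), 10)))
test  <- data.frame(g = c("A", "B", "C"), stringsAsFactors = FALSE)

fit <- lm(y ~ g, data = train)

# Categories in test that the model has never seen
novel <- setdiff(unique(test$g), fit$xlevels$g)
novel

# Recode using the model's levels; values outside them become NA
test$g <- factor(test$g, levels = fit$xlevels$g)

# predict() uses na.action = na.pass by default, so the NA row yields NA
pred <- predict(fit, newdata = test)
pred
```

Whether an NA prediction is acceptable depends on the application; the alternative is to refit with training data that covers the missing category.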

There are many ways of handling this. Below, I show how we can use recipes to turn all novel categories into NA values.

library(rsample)
library(rpart)
library(recipes)
set.seed(123)

# I'll use mtcars as an example, but I'll add a new categorical column
data <- mtcars
data$var <- sample(c("A", "B"), size = 32, replace = TRUE)

# Split the data
split <- initial_split(data, prop = 0.7)
train <- training(split)
test <- testing(split)

# Add a new factor level to var just for the test data
test$var[1] <- "C"

# Build the model
model <- rpart(
formula = mpg ~ .,
data = train
)

# Error due to the novel factor level
predict(model, newdata = test)
#> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = attr(object, : factor var has new levels C

# Fixing this with recipes ------------------------------------------------

# Create a recipe that handles novel levels in var
rec <-
recipe(mpg ~ ., data = train) %>%
step_other(var) %>% # This is where we handle new categories
prep()

new_train <- bake(rec, new_data = train)
new_test <- bake(rec, new_data = test)

# We build the model again with the prepared data
model <- rpart(
formula = mpg ~ .,
data = new_train
)

# This works
predict(model, newdata = new_test)
#> 1 2 3 4 5 6 7 8
#> 25.61000 25.61000 25.61000 25.61000 15.64167 15.64167 15.64167 25.61000
#> 9 10
#> 25.61000 15.64167

Created on 2022-05-24 by the reprex package (v2.0.1)

An alternative to step_other() would be step_novel().

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : variable lengths differ (found for 'factor(DAF)')

One of your predictors (DAF) is a function of your other two predictors, and it looks like the way you pass DAF to the prediction function is incorrect.

Since I don't have your model in order to test, this is more of a brute force solution using base R's predict function. Here we are generating all of the possible combinations of your 2 predictor variables and your derived variable (only 4 combinations in this case).

#Devise the new test matrix
predictdf<-expand.grid(Diabetes=c(0,1), AtrialFib = c(0,1))
predictdf$DAF <- predictdf$Diabetes * predictdf$AtrialFib

#convert from integers to factors (to match the model)
#lapply keeps a data frame of factors; apply() would return a character matrix
predictdf[] <- lapply(predictdf, factor)
#perform the prediction
predict(r123, predictdf)

To simplify the problem, allow R to calculate the interaction term directly within the linear regression formula:

lm(MWT1Best~factor(Diabetes)*factor(AtrialFib), data=COPD)

Replace the + with * and the model will take all of the interactions into account.
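
As a rough illustration, here is the same idea on simulated data. The COPD data set is not available here, so toy is a made-up stand-in with invented values, and only the column names are taken from the question:

```r
set.seed(3)
# Simulated stand-in for the COPD data (all values are invented)
n <- 200
toy <- data.frame(
  Diabetes  = rbinom(n, 1, 0.4),
  AtrialFib = rbinom(n, 1, 0.3)
)
toy$MWT1Best <- 400 - 30 * toy$Diabetes - 20 * toy$AtrialFib -
  25 * toy$Diabetes * toy$AtrialFib + rnorm(n, sd = 40)

# a * b expands to a + b + a:b, so no hand-made DAF column is needed
fit <- lm(MWT1Best ~ factor(Diabetes) * factor(AtrialFib), data = toy)
names(coef(fit))   # intercept, two main effects, one interaction term

# newdata only needs the raw columns; the formula builds the interaction
nd <- expand.grid(Diabetes = c(0, 1), AtrialFib = c(0, 1))
pred <- predict(fit, newdata = nd)
pred
```

Because the interaction is derived inside the formula, the prediction frame can never disagree with the two variables it is built from.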

Factor has new levels error for variable I'm not using

You could try updating mod2$xlevels[["y"]] in the model object

mod2 <- glm(z~.-y, data=train, family="binomial")
mod2$xlevels[["y"]] <- union(mod2$xlevels[["y"]], levels(test$y))

predict(mod2, newdata=test, type="response")
# 5
#0.5546394

Another option would be to leave "y" out of the data passed to glm(), without deleting it from train:

mod2 <- glm(z~., data=train[,!colnames(train) %in% c("y")], family="binomial")
predict(mod2, newdata=test, type="response")
# 5
#0.5546394

Error in model.frame.default for Predict() - Factor has new levels - For a Char Variable

The person who answered the question in the post you linked to already gave an indication of why myCharVar is still considered in the model. When you use z~.-y, the formula basically expands to z~(x+y)-y, so y is still one of the variables in the model frame even though it is not a model term.
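
The expansion can be inspected directly. This is a small sketch on invented data (d, tt and fit are illustrative names); it assumes the usual behaviour of terms() and glm() in current versions of R:

```r
set.seed(4)
# Invented data: z is the response, x is numeric, y is a factor
d <- data.frame(
  z = rbinom(40, 1, 0.5),
  x = rnorm(40),
  y = factor(sample(c("a", "b", "c"), 40, replace = TRUE))
)

tt <- terms(z ~ . - y, data = d)
formula(tt)               # the dot is expanded first, then y is subtracted
attr(tt, "term.labels")   # only x is left as a model term

fit <- glm(z ~ . - y, data = d, family = binomial)
names(coef(fit))          # no coefficients for y
names(fit$xlevels)        # y's levels are still recorded with the model
```

The last line is what predict() checks newdata against, which is why an unused variable can still trigger the "new levels" error.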

Now, to answer your other question: Consider the following quote from the predict() documentation: "For factor variables having numeric levels, you can specify the numeric values in newdata without first converting the variables to factors. These numeric values are checked to make sure they match a level, then the variable is converted internally to a factor".

I think we can assume that the same kind of behaviour occurs for myCharVar. The myCharVar values are first checked against the corresponding existing levels in the model, and this is where it goes wrong. The test set contains values of myCharVar that were never encountered during the training of the model. (Note that the glm function itself also performs factor conversion, and throws a warning when conversion needs to take place.) In summary, the error means that the model cannot make predictions for levels in the test data that it never saw during training.

In this post there is another clarification given on the issue.


