"Factor Has New Levels" Error for Variable I'm Not Using

Factor has new levels error for variable I'm not using

You could try updating mod2$xlevels[["y"]] in the model object

mod2 <- glm(z~.-y, data=train, family="binomial")
mod2$xlevels[["y"]] <- union(mod2$xlevels[["y"]], levels(test$y))

predict(mod2, newdata=test, type="response")
# 5
#0.5546394

Another option would be to exclude (but not remove) "y" from the training data

mod2 <- glm(z~., data=train[,!colnames(train) %in% c("y")], family="binomial")
predict(mod2, newdata=test, type="response")
# 5
#0.5546394

Error in model.frame.default for Predict() - Factor has new levels - For a Char Variable

The person that answered the question in the post you linked to already gave an indication on why myCharVar is still considered in the model. When you use z~.-y, the formula basically expands to z~(x+y)-y.

Now, to answer your other question: Consider the following quote from the predict() documentation: "For factor variables having numeric levels, you can specify the numeric values in newdata without first converting the variables to factors. These numeric values are checked to make sure they match a level, then the variable is converted internally to a factor".

I think we can assume that the same kind of behaviour occurs for myCharVar. The myCharVar values are first checked against the corresponding existing levels in the model and this is where it goes wrong. The testset contains values for the myCharVar that were never encountered during the training of the model (note that the glm function itself also performs factor conversion. It throws a warning when conversion needs to take place). In summary, the error basically means that the model is unable to make predictions for unknown levels in the testdata that were never encountered during the training of the model.

In this post there is another clarification given on the issue.

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor X has new levels

tl;dr it looks like you have some levels in your factor that are not represented in your data, that get dropped from the factors used in the model. In hindsight this isn't terribly surprising, since you won't be able to predict responses for these levels. That said, it's mildly surprising that R doesn't do something nice for you like generate NA values automatically. You can solve this problem by using levels(droplevels(NH11$r_maritl)) in constructing your prediction frame, or equivalently EW$xlevels$r_maritl.

A reproducible example:

maritl_levels <- c( "0 Under 14 years", "1 Married - spouse in household", 
"2 Married - spouse not in household", "3 Married - spouse in household unknown",
"4 Widowed", "5 Divorced", "6 Separated", "7 Never married", "8 Living with partner",
"9 Unknown marital status")
set.seed(101)
NH11 <- data.frame(everwrk=rbinom(1000,size=1,prob=0.5),
age_p=runif(1000,20,50),
r_maritl = sample(maritl_levels,size=1000,replace=TRUE))

Let's make a missing level:

NH11 <- subset(NH11,as.numeric(NH11$r_maritl) != 3)

Fit the model:

EW <- glm(everwrk~r_maritl+age_p,data=NH11,family=binomial)
predEW <- with(NH11,
expand.grid(r_maritl=levels(r_maritl),age_p=mean(age_p,na.rm=TRUE)))
predict(EW,newdata=predEW)

Success!

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor r_maritl has new levels 2 Married - spouse not in household

predEW <- with(NH11,
expand.grid(r_maritl=EW$xlevels$r_maritl,age_p=mean(age_p,na.rm=TRUE)))
predict(EW,newdata=predEW)


Related Topics



Leave a reply



Submit