Factor has new levels error for variable I'm not using
You could try updating mod2$xlevels[["y"]] in the model object:
mod2 <- glm(z~.-y, data=train, family="binomial")
mod2$xlevels[["y"]] <- union(mod2$xlevels[["y"]], levels(test$y))
predict(mod2, newdata=test, type="response")
# 5
#0.5546394
Another option would be to leave the "y" column out when fitting, without removing it from the train data frame itself:
mod2 <- glm(z~., data=train[,!colnames(train) %in% c("y")], family="binomial")
predict(mod2, newdata=test, type="response")
# 5
#0.5546394
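The question's train and test data are not shown here, so the following is only a minimal sketch on made-up data (the values and the extra level "c" are assumptions for illustration). It reproduces the error and checks that both options give the same prediction:

```r
# Hypothetical data: 'train' and 'test' are invented for illustration only.
train <- data.frame(
  z = c(0, 1, 0, 1, 1, 0),
  x = c(1.2, 0.7, 3.1, 2.2, 0.3, 1.9),
  y = factor(c("a", "b", "a", "b", "a", "b"))
)
test <- data.frame(x = 1.5, y = factor("c"))  # level "c" never occurs in train

# Baseline: y is not a predictor, yet predict() still checks its levels.
mod <- glm(z ~ . - y, data = train, family = "binomial")
msg <- tryCatch(
  predict(mod, newdata = test, type = "response"),
  error = function(e) conditionMessage(e)
)
print(msg)

# Option 1: extend the levels stored in the model so the check passes.
mod1 <- mod
mod1$xlevels[["y"]] <- union(mod1$xlevels[["y"]], levels(test$y))
p1 <- predict(mod1, newdata = test, type = "response")

# Option 2: fit on the data without the y column, so no levels are stored for it.
mod2 <- glm(z ~ ., data = train[, !colnames(train) %in% "y"], family = "binomial")
p2 <- predict(mod2, newdata = test, type = "response")

# Both models use only x, so the predictions should agree.
print(all.equal(unname(p1), unname(p2)))
```

Option 2 is probably the cleaner of the two, since it does not rely on editing the internals of a fitted object.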
Error in model.frame.default for Predict() - Factor has new levels - For a Char Variable
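The person who answered the question in the post you linked to already indicated why myCharVar is still considered in the model. When you use z~.-y, the formula basically expands to z~(x+y)-y. The y term is removed from the fit, but y is still one of the variables in the model frame, so its levels are stored with the model and checked at prediction time.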
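To see this for yourself, you can inspect the terms and the fitted object. This is a small sketch on made-up data (the data frame below is an assumption, not the asker's data):

```r
# Hypothetical training data for illustration.
train <- data.frame(
  z = c(0, 1, 0, 1, 1, 0),
  x = c(1.2, 0.7, 3.1, 2.2, 0.3, 1.9),
  y = factor(c("a", "b", "a", "b", "a", "b"))
)

# Expand the dot against the data and look at which terms survive.
tt <- terms(z ~ . - y, data = train)
term_labels <- attr(tt, "term.labels")
print(term_labels)          # only the terms that enter the fit

mod <- glm(z ~ . - y, data = train, family = "binomial")

# y is not a coefficient, but it is still in the model frame ...
print(names(coef(mod)))
print(colnames(model.frame(mod)))

# ... and its levels are stored, which is what predict() checks against.
print(mod$xlevels)
```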
Now, to answer your other question: consider the following quote from the predict() documentation: "For factor variables having numeric levels, you can specify the numeric values in newdata without first converting the variables to factors. These numeric values are checked to make sure they match a level, then the variable is converted internally to a factor".
I think we can assume that the same kind of behaviour occurs for myCharVar. The myCharVar values are first checked against the corresponding existing levels in the model, and this is where it goes wrong. The test set contains values of myCharVar that were never encountered during the training of the model (note that the glm function itself also performs factor conversion; it throws a warning when conversion needs to take place). In summary, the error means that the model cannot make predictions for levels in the test data that it never saw during training.
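One way to check this before calling predict() is to compare the test values with the levels stored in the model. A minimal sketch, assuming a character predictor named myCharVar and made-up data:

```r
# Hypothetical data: a character predictor with one value unseen in training.
train <- data.frame(
  z = c(0, 1, 1, 0, 1, 0),
  myCharVar = c("a", "a", "b", "b", "a", "b"),
  stringsAsFactors = FALSE
)
test <- data.frame(myCharVar = c("a", "c"), stringsAsFactors = FALSE)

mod <- glm(z ~ myCharVar, data = train, family = binomial)

# Values in the test set that the model has no level for.
unseen <- setdiff(unique(test$myCharVar), mod$xlevels[["myCharVar"]])
print(unseen)

# Predicting on the full test set fails because of the unseen value.
msg <- tryCatch(
  predict(mod, newdata = test, type = "response"),
  error = function(e) conditionMessage(e)
)
print(msg)

# Predicting only on rows with known values works.
known <- test[!test$myCharVar %in% unseen, , drop = FALSE]
p <- predict(mod, newdata = known, type = "response")
print(p)
```

Whether to drop such rows, map them to an existing level, or refit with more data depends on your problem; the sketch only shows how to find them.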
In this post there is another clarification given on the issue.
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor X has new levels
tl;dr it looks like you have some levels in your factor that are not represented in your data, and these get dropped from the factors used in the model. In hindsight this isn't terribly surprising, since you won't be able to predict responses for these levels. That said, it's mildly surprising that R doesn't do something nice for you like generate NA values automatically. You can solve this problem by using levels(droplevels(NH11$r_maritl)) in constructing your prediction frame, or equivalently EW$xlevels$r_maritl.
A reproducible example:
maritl_levels <- c( "0 Under 14 years", "1 Married - spouse in household",
"2 Married - spouse not in household", "3 Married - spouse in household unknown",
"4 Widowed", "5 Divorced", "6 Separated", "7 Never married", "8 Living with partner",
"9 Unknown marital status")
set.seed(101)
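# r_maritl is made a factor explicitly: since R 4.0.0, data.frame() keeps strings as character
NH11 <- data.frame(everwrk=rbinom(1000,size=1,prob=0.5),
age_p=runif(1000,20,50),
r_maritl = factor(sample(maritl_levels,size=1000,replace=TRUE), levels=maritl_levels))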
Let's make a missing level:
NH11 <- subset(NH11,as.numeric(NH11$r_maritl) != 3)
Fit the model:
EW <- glm(everwrk~r_maritl+age_p,data=NH11,family=binomial)
predEW <- with(NH11,
expand.grid(r_maritl=levels(r_maritl),age_p=mean(age_p,na.rm=TRUE)))
predict(EW,newdata=predEW)
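Success, in the sense that the error is reproduced:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor r_maritl has new levels 2 Married - spouse not in household
Building the prediction frame from the levels the model actually used avoids it: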
predEW <- with(NH11,
expand.grid(r_maritl=EW$xlevels$r_maritl,age_p=mean(age_p,na.rm=TRUE)))
predict(EW,newdata=predEW)
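As a quick check of the "or equivalently" claim, here is a smaller self-contained sketch. The data frame d, the factor g, and the unused level "d" are made up for illustration:

```r
set.seed(1)

# Hypothetical data: the factor declares level "d", but no row uses it.
d <- data.frame(
  y = rbinom(60, size = 1, prob = 0.5),
  g = factor(rep(c("a", "b", "c"), each = 20), levels = c("a", "b", "c", "d"))
)

fit <- glm(y ~ g, data = d, family = binomial)

# The declared levels include the unused one; the other two sources do not.
print(levels(d$g))
print(levels(droplevels(d$g)))
print(fit$xlevels$g)

# A prediction frame built from the levels the model used works.
newd <- data.frame(g = fit$xlevels$g)
pred <- predict(fit, newdata = newd, type = "response")
print(pred)

# A prediction frame built from all declared levels triggers the error.
bad <- data.frame(g = levels(d$g))
msg <- tryCatch(
  predict(fit, newdata = bad, type = "response"),
  error = function(e) conditionMessage(e)
)
print(msg)
```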