How to Get Around Error "Factor Has New Levels" in Cross-Validation Glm

How to get around error factor has new levels in cross-validation glm?

To answer the question in your comment: I don't know whether a ready-made function exists. There most likely is one, but I don't know which package contains it. For this example, the following function should work:

set.seed(1)
x <- rnorm(n = 1000)
y <- rep(x = c("A","B"), times = c(500,500))
z <- rep(x = c("D","E","F"), times = c(997,2,1))
data <- data.frame(x,y,z)

#optional tag row for later identification: 
#data$rowid<-1:nrow(data)

stratified <- function(df, column, percent){
  #split the data frame into groups based on column
  listdf <- split(df, df[[column]])
  testsubgroups <- lapply(listdf, function(x){
    #number of rows to sample per group, rounded up
    numsamples <- ceiling(percent*nrow(x))
    #select the rows without replacement
    whichones <- sample(seq_len(nrow(x)), numsamples, replace = FALSE)
    x[whichones,]
  })
  #combine the subgroups into one data frame
  testgroup <- do.call(rbind, testsubgroups)
  testgroup
}

testgroup<-stratified(data, "z", 0.8)

This only splits the initial data by column z. If you are interested in grouping by multiple columns, it could be extended with the group_by function from the dplyr package, but that would be another question.
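As a rough sketch of how such a sample might be used, the example below treats the stratified sample as the training set and the remaining rows as the test set, then checks that the test set contains no level of z that is missing from the training set. Treating the sample as the training set, the rowid column, and the name new_levels are assumptions made for illustration; stratified() is a condensed copy of the function above.

```r
set.seed(1)
data <- data.frame(x = rnorm(1000),
                   z = factor(rep(c("D", "E", "F"), times = c(997, 2, 1))))
data$rowid <- seq_len(nrow(data))  # tag rows so the complement can be found

# Condensed version of stratified(): sample a share of the rows in each group.
stratified <- function(df, column, percent) {
  groups <- split(df, df[[column]])
  picked <- lapply(groups, function(g) {
    g[sample.int(nrow(g), ceiling(percent * nrow(g))), , drop = FALSE]
  })
  do.call(rbind, picked)
}

train <- stratified(data, "z", 0.8)
test  <- data[!data$rowid %in% train$rowid, ]

# Levels that occur in test but not in train; empty means predict() is safe.
new_levels <- setdiff(unique(as.character(test$z)),
                      unique(as.character(train$z)))
print(new_levels)
```

Because the per-group sample size is rounded up, every level with at least one row ends up in the training set, which is what avoids the "new levels" error at prediction time.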

Comment on the statistics: if you have only a few observations for a particular factor level, what kind of fit do you expect? A poor one, with wide confidence limits.

Factor has new levels error for variable I'm not using

You could try updating mod2$xlevels[["y"]] in the model object:

mod2 <- glm(z~.-y, data=train, family="binomial")
mod2$xlevels[["y"]] <- union(mod2$xlevels[["y"]], levels(test$y))

predict(mod2, newdata=test, type="response")
#        5 
#0.5546394

Another option would be to exclude (but not remove) "y" from the training data:

mod2 <- glm(z~., data=train[,!colnames(train) %in% c("y")], family="binomial")
predict(mod2, newdata=test, type="response")
#        5 
#0.5546394

Avoid failing when a factor has new levels in test set

I start with the following data generating process (a binary response variable, one numerical independent variable and 3 categorical independent variables):

set.seed(1)
n <- 500
y <- factor(rbinom(n, size=1, p=0.7))
x1 <- rnorm(n)
x2 <- cut(runif(n), breaks=seq(0,1,0.2))
x3 <- cut(runif(n), breaks=seq(0,1,0.25))
x4 <- cut(runif(n), breaks=seq(0,1,0.1))
df <- data.frame(y, x1, x2, x3, x4)

Here I build the training and testing set in a way to have some categorical covariates (x2 and x3) in the testing set with more categories than in the training set:

idx <- which(df$x2!="(0.6,0.8]" & df$x3!="(0,0.25]")
train_ind <- sample(idx, size=(2/3)*length(idx))
train <- df[train_ind,]
train$x2 <- droplevels(train$x2)
train$x3 <- droplevels(train$x3)
test <- df[-train_ind,]

table(train$x2)
(0,0.2] (0.2,0.4] (0.4,0.6]   (0.8,1] 
     55        40        53        49 

table(test$x2)
(0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]   (0.8,1] 
     58        48        45        90        62 

table(train$x3)
(0.25,0.5] (0.5,0.75]   (0.75,1] 
        66         61         70 

table(test$x3)
(0,0.25] (0.25,0.5] (0.5,0.75]   (0.75,1] 
     131         63         47         62

Of course, predict yields the error message described above by @Setzer22:

glm.res <- glm(y ~ ., data=train, family = binomial(link=logit)) 
preds <- predict(glm.res, test, type="response")

Error in model.frame.default(Terms, newdata, na.action = na.action,
xlev = object$xlevels) : factor x2 has new levels (0.6,0.8]

Here is a (not elegant) way to drop the rows of test whose covariates contain new levels:

dropcats <- function(k) {
   xtst <- test[,k]
   xtrn <- train[,k]
   cmp.tst.trn <- (unique(xtst) %in% unique(xtrn))
   if (is.factor(xtst) && any(!cmp.tst.trn)) {
      cat.tst <- unique(xtst)
      apply(test[,k]==matrix(rep(cat.tst[cmp.tst.trn],each=nrow(test)),
                      nrow=nrow(test)),1,any)
   } else {
      rep(TRUE,nrow(test))
   }
}   
filt <- apply(sapply(2:ncol(df),dropcats),1,all)
subset.test <- test[filt,]

In subset.test, the filtered testing set, x2 and x3 have no observations in the new categories:

table(subset.test[,"x2"])
  (0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]   (0.8,1] 
       26        25        20         0        28

table(subset.test[,"x3"])
  (0,0.25] (0.25,0.5] (0.5,0.75]   (0.75,1] 
         0         29         29         41

Now predict works nicely:

preds <- predict(glm.res, subset(test,filt), type="response")
head(preds)

       30        39        41        49        55        56 
0.7732564 0.8361226 0.7576259 0.5589563 0.8965357 0.8058025

Hope this can help you.
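A shorter variant of the same idea, as a sketch: compare each factor in the test set against the levels stored in the fitted model (the xlevels component used in the other answers here) and keep only the rows whose values were all seen during training. The toy data and the names fit and keep are assumptions for illustration, not part of the example above.

```r
set.seed(1)
train <- data.frame(y  = rbinom(60, 1, 0.5),
                    x1 = rnorm(60),
                    x2 = factor(rep(c("a", "b", "c"), 20)))
# The test set contains a level "d" that the model has never seen.
test <- data.frame(x1 = rnorm(10),
                   x2 = factor(c("a", "b", "c", "d", "a",
                                 "d", "b", "c", "a", "b")))

fit <- glm(y ~ ., data = train, family = binomial)

# Keep a row only if every factor value is among the training levels.
keep <- rep(TRUE, nrow(test))
for (v in names(fit$xlevels)) {
  keep <- keep & as.character(test[[v]]) %in% fit$xlevels[[v]]
}

preds <- predict(fit, newdata = test[keep, ], type = "response")
print(sum(!keep))  # number of test rows dropped
```

The dropped rows get no prediction at all, so this only makes sense when discarding those observations is acceptable for the evaluation.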

Error in model.frame.default for Predict() - Factor has new levels - For a Char Variable

The person that answered the question in the post you linked to already gave an indication on why myCharVar is still considered in the model. When you use z~.-y, the formula basically expands to z~(x+y)-y.
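To make the expansion concrete, here is a small sketch on made-up data (the data frame and the names form, term_labels, and vars are assumptions for illustration). It shows that y is removed from the model terms but is still a variable of the formula, so its levels are still recorded in the fitted model and checked at prediction time.

```r
set.seed(1)
train <- data.frame(x = rnorm(40),
                    y = factor(rep(c("a", "b"), 20)),
                    z = rbinom(40, 1, 0.5))

form <- z ~ (x + y) - y

# Terms that actually enter the fit: only x.
term_labels <- attr(terms(form), "term.labels")
# Variables the formula still refers to: z, x and y.
vars <- all.vars(form)

mod <- glm(form, data = train, family = binomial)

print(term_labels)
print(vars)
print(names(mod$xlevels))  # y is still listed here
```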

Now, to answer your other question: Consider the following quote from the predict() documentation: "For factor variables having numeric levels, you can specify the numeric values in newdata without first converting the variables to factors. These numeric values are checked to make sure they match a level, then the variable is converted internally to a factor".

I think we can assume that the same kind of behaviour occurs for myCharVar. The myCharVar values are first checked against the corresponding levels stored in the model, and this is where it goes wrong: the test set contains values of myCharVar that were never encountered while training the model. (Note that the glm function itself also performs factor conversion and throws a warning when conversion needs to take place.) In summary, the error means that the model cannot make predictions for levels in the test data that it never saw during training.

In this post there is another clarification given on the issue.

cv.glm Issue with missing factors in R

As I mentioned in my comment, here's the example straight from ?errorest in the ipred package:

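library(ipred)
library(MASS)   # provides lda
data("PimaIndiansDiabetes", package = "mlbench")

#prediction wrapper that returns the predicted class, as in ?errorest
mypredict.lda <- function(object, newdata)
  predict(object, newdata = newdata)$class

#cv of a fixed partition of the data
list.tindx <- list(1:100, 101:200, 201:300, 301:400, 401:500,
        501:600, 601:700, 701:768)

errorest(diabetes ~ ., data=PimaIndiansDiabetes, model=lda,
          estimator = "cv", predict = mypredict.lda,
          est.para = control.errorest(list.tindx = list.tindx))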

So you can specify your own cv folds to use, and ensure that they are sufficiently balanced to avoid levels of factors being missing in any single fold.
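One way such balanced folds might be built, as a minimal sketch: deal the observations of each factor level out across the folds in turn, so that each level is spread as evenly as possible. The helper make_stratified_folds is hypothetical (it is not part of ipred), and the toy factor is made up; the result is a list of index vectors in the same shape as list.tindx above.

```r
# Hypothetical helper: spread each level of f across k folds.
make_stratified_folds <- function(f, k) {
  fold <- integer(length(f))
  for (idx in split(seq_along(f), f)) {
    idx <- idx[sample.int(length(idx))]            # shuffle within the level
    fold[idx] <- rep_len(seq_len(k), length(idx))  # deal out fold numbers
  }
  split(seq_along(f), fold)
}

set.seed(1)
f <- factor(rep(c("D", "E", "F"), times = c(90, 6, 4)))
folds <- make_stratified_folds(f, k = 3)

# For every fold, check that the rows left for training cover all levels.
covered <- sapply(folds, function(i) all(levels(f) %in% f[-i]))
print(covered)
```

A level with a single observation can never be covered this way: whichever fold holds it out leaves a training set without that level, so such levels have to be merged or removed beforehand.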

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor X has new levels

tl;dr it looks like you have some levels in your factor that are not represented in your data, that get dropped from the factors used in the model. In hindsight this isn't terribly surprising, since you won't be able to predict responses for these levels. That said, it's mildly surprising that R doesn't do something nice for you like generate NA values automatically. You can solve this problem by using levels(droplevels(NH11$r_maritl)) in constructing your prediction frame, or equivalently EW$xlevels$r_maritl.

A reproducible example:

maritl_levels <- c( "0 Under 14 years", "1 Married - spouse in household", 
  "2 Married - spouse not in household", "3 Married - spouse in household unknown", 
  "4 Widowed", "5 Divorced", "6 Separated", "7 Never married", "8 Living with partner", 
 "9 Unknown marital status")
set.seed(101)
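NH11 <- data.frame(everwrk=rbinom(1000,size=1,prob=0.5),
                 age_p=runif(1000,20,50),
                 # explicit factor: since R 4.0 data.frame() no longer converts strings
                 r_maritl = factor(sample(maritl_levels,size=1000,replace=TRUE),
                                   levels=maritl_levels))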

Let's make a missing level:

NH11 <- subset(NH11,as.numeric(NH11$r_maritl) != 3)

Fit the model and try to predict over every level of the factor:

EW <- glm(everwrk~r_maritl+age_p,data=NH11,family=binomial)
predEW <- with(NH11,
  expand.grid(r_maritl=levels(r_maritl),age_p=mean(age_p,na.rm=TRUE)))
predict(EW,newdata=predEW)

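Success! The error is reproduced:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor r_maritl has new levels 2 Married - spouse not in household

Building the prediction frame from the levels stored in the fitted model avoids it: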

predEW <- with(NH11,
           expand.grid(r_maritl=EW$xlevels$r_maritl,age_p=mean(age_p,na.rm=TRUE)))
predict(EW,newdata=predEW)
