predict.lm() with an Unknown Factor Level in Test Data

predict.lm() with an unknown factor level in test data

Tidied and extended the function by MorgenBall. It is also implemented in sperrorest now.

Additional features

  • drops unused factor levels rather than just setting the missing values to NA
  • issues a message to the user that factor levels have been dropped
  • checks for the existence of factor variables in test_data and returns the original data.frame if none are present
  • works not only for lm and glm but also for glmmPQL

Note: The function shown here may change (improve) over time.

#' @title remove_missing_levels
#' @description Accounts for missing factor levels present only in test data
#' but not in train data by setting values to NA
#'
#' @import magrittr
#' @importFrom purrr map
#' @importFrom gdata unmatrix
#' @importFrom stringr str_split
#'
#' @param fit fitted model on training data
#'
#' @param test_data data to make predictions for
#'
#' @return data.frame with matching factor levels to fitted model
#'
#' @keywords internal
#'
#' @export
remove_missing_levels <- function(fit, test_data) {

  # https://stackoverflow.com/a/39495480/4185785

  # drop empty factor levels in test data
  test_data %>%
    droplevels() %>%
    as.data.frame() -> test_data

  # the 'fit' object structure of 'lm' and 'glmmPQL' differs, so we need to
  # account for it
  if (any(class(fit) == "glmmPQL")) {
    # obtain factor predictors in the model and their levels
    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                     names(unlist(fit$contrasts))))
    # do nothing if no factors are present
    if (length(factors) == 0) {
      return(test_data)
    }

    map(fit$contrasts, function(x) names(unmatrix(x))) %>%
      unlist() -> factor_levels
    factor_levels %>% str_split(":", simplify = TRUE) %>%
      extract(, 1) -> factor_levels

    model_factors <- as.data.frame(cbind(factors, factor_levels))
  } else {
    # obtain factor predictors in the model and their levels
    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                     names(unlist(fit$xlevels))))
    # do nothing if no factors are present
    if (length(factors) == 0) {
      return(test_data)
    }

    factor_levels <- unname(unlist(fit$xlevels))
    model_factors <- as.data.frame(cbind(factors, factor_levels))
  }

  # select column names in test data that are factor predictors in the
  # trained model
  predictors <- names(test_data[names(test_data) %in% factors])

  # for each factor predictor in the data: if a level is not in the model,
  # set the value to NA
  for (i in seq_along(predictors)) {
    found <- test_data[, predictors[i]] %in% model_factors[
      model_factors$factors == predictors[i], ]$factor_levels
    if (any(!found)) {
      # track which variable
      var <- predictors[i]
      # set to NA
      test_data[!found, predictors[i]] <- NA
      # drop empty factor levels in test data
      test_data %>%
        droplevels() -> test_data
      # issue a message to the console
      message(sprintf(paste0("Setting missing levels in '%s', only present",
                             " in test data but missing in train data,",
                             " to 'NA'."),
                      var))
    }
  }
  return(test_data)
}

We can apply this function to the example in the question as follows:

predict(model, newdata = remove_missing_levels(fit = model, test_data = foo.new))

While trying to improve this function, I came across the fact that SL methods like lm and glm need the same levels in train and test, while ML methods such as svm and randomForest fail if the levels are removed: these methods need all levels in both train and test.

A general solution is quite hard to achieve since every fitted model stores its factor level component differently (fit$xlevels for lm and fit$contrasts for glmmPQL). At least it seems to be consistent across lm-related models.
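As a quick illustration of where lm keeps this information, here is a throwaway fit on the built-in iris data (purely for demonstration, not part of the original answer):

m <- lm(Sepal.Length ~ Species, data = iris)
m$xlevels
#> $Species
#> [1] "setosa"     "versicolor" "virginica"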

R: factor as new level when I predict with test data

As @jdobres has already explained why this error pops up, I'll jump straight to the solution approach:

Try the line of code below just before the predict statement:

#add all levels of 'y' in 'test' dataset to fit$xlevels[["y"]] in the fit object
fit$xlevels[["y"]] <- union(fit$xlevels[["y"]], levels(test[["y"]]))

Hope this resolves your problem!
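For context, a minimal sketch of where this line goes; the data frames train and test, the response outcome and the factor column y are placeholder names, not objects from the original question:

fit <- lm(outcome ~ y, data = train)

# make the fit aware of factor levels that only occur in the test data
fit$xlevels[["y"]] <- union(fit$xlevels[["y"]], levels(test[["y"]]))

pred <- predict(fit, newdata = test)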

How to auto-exclude unseen new factor levels in predict.randomForest?

One solution is to combine the Train and Test data, apply as.factor on the combined data, and then split it back into Train and Test. I faced this same issue with random forest, and this solution worked for me.

for example :

combine <- rbind(Train, Test)
combine$var1 <- as.factor(combine$var1)

## then split back into Train and Test
Train$var1 <- combine$var1[1:nrow(Train)]

and similarly for Test, using the remaining rows of combine.

Hope this helps!

Using predict() with RcppArmadillo/RcppEigen when a factor level has only one level

Not a solution, but an explanation for why it happens.

If we inspect the source code of RcppArmadillo:::predict.fastLm(), we find that it constructs the design matrix for the prediction points via

x <- model.matrix(object$formula, newdata)

On the other hand, if we inspect the source for stats::predict.lm(), we find

tt <- terms(object)
## Some source omitted here
Terms <- delete.response(tt)
m <- model.frame(Terms, newdata, na.action = na.action, xlev = object$xlevels)
if (!is.null(cl <- attr(Terms, "dataClasses"))) .checkMFClasses(cl, m)
X <- model.matrix(Terms, m, contrasts.arg = object$contrasts)

which reveals that lm() stores in its result information about the factor levels and contrasts for predictors, while fastLm() reconstructs that information in the predict() call:

names(model)
# [1] "coefficients" "stderr" "df.residual" "fitted.values"
# [5] "residuals" "call" "intercept" "formula"
names(lm_mod) ## Constructed with `lm()` call with same formula
# [1] "coefficients" "residuals" "effects" "rank"
# [5] "fitted.values" "assign" "qr" "df.residual"
# [9] "contrasts" "xlevels" "call" "terms"
# [13] "model"

Note the "xlevels" and "contrasts" elements in an lm object that are not present in a fastLm object. This goes to a larger point, though, from help("fastLM"):

Linear models should be estimated using the lm function. In some cases, lm.fit may be appropriate.

Dirk can correct me if I'm wrong, but I think the point of fastLm() is not to provide a rich OLS implementation that covers all the use cases that stats::lm() does; I think it's more illustrative.

If your problem is big data and that's why you don't want to use stats::lm(), might I suggest something like biglm::biglm()? (See, for example, here). If you are really set on using RcppArmadillo::fastLm(), you could do a smaller version of your workaround; instead of copying your whole data, just append one row to your prediction set for each unused factor level.
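A minimal sketch of that lighter workaround; fit, train, pred_data and the factor column grp are placeholder names, so adapt them to your own objects:

# levels seen in training but absent from the prediction set
missing_lvls <- setdiff(levels(train$grp), unique(as.character(pred_data$grp)))

# work on character vectors so rbind() behaves predictably
padded <- pred_data
padded$grp <- as.character(padded$grp)

# one dummy row per unseen level, so model.matrix() inside predict.fastLm()
# builds the full set of contrast columns
dummy <- padded[rep(1, length(missing_lvls)), , drop = FALSE]
dummy$grp <- missing_lvls

padded <- rbind(padded, dummy)
padded$grp <- factor(padded$grp, levels = levels(train$grp))

# predict, then discard the predictions for the dummy rows
preds <- predict(fit, padded)[seq_len(nrow(pred_data))]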

What does a glm model do with unknown factor levels added after training?

Let's start by looking at the model coefficients for the model fit on the partial dataset, summary(model):

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   26.664      1.068  24.966  < 2e-16 ***
cyl8         -11.564      1.427  -8.102 3.45e-08 ***

Predictions for cyl8 are equal to the intercept plus the effect of cyl8, so 26.664 + (-11.564) = 15.10. For the other factor level (cyl4), the prediction is just the intercept (26.664). Adding an unknown factor level yields that same intercept-only prediction, as R has no basis for estimating the additional factor's effect (it was excluded from the original model).
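Spelled out in R, using the coefficients from the summary above:

intercept <- 26.664
beta_cyl8 <- -11.564

intercept + beta_cyl8  # prediction for cyl8: 15.10
intercept              # prediction for the reference level cyl4: 26.664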
We can see that the estimates of the known factor levels are unaffected when the model is estimated on the full data:

model2 <- glm(formula = mpg ~ cyl, data = mtcars)
summary(model2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  26.6636     0.9718  27.437  < 2e-16 ***
cyl6         -6.9208     1.5583  -4.441 0.000119 ***
cyl8        -11.5636     1.2986  -8.905 8.57e-10 ***

You see that the estimates for cyl8 and the reference category cyl4 are unchanged, so the model yields the same predictions for these factor levels (15.10 and 26.66, respectively). However, predictions for cyl6 are overestimated by 6.92, as you can see from the newly estimated coefficient.

randomForest does not work when training set has more different factor levels than test set

R expects both the training and the test data to have exactly the same factor levels (even if one of the sets has no observations for a given level). In your case, since the test dataset is missing a level that the training set has, you can do

test$val <- factor(test$val, levels=levels(train$val))

to make sure it has all the same levels and they are coded the same way.
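In context, a minimal sketch (train, test, the response y and the factor column val are placeholder names):

library(randomForest)

rf <- randomForest(y ~ val, data = train)

# give the test factor exactly the same level set (and coding) as in training
test$val <- factor(test$val, levels = levels(train$val))

pred <- predict(rf, newdata = test)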

(reposted here to close out the question)


