predict.lm() with an unknown factor level in test data
Tidied and extended the function by MorgenBall. It is also implemented in sperrorest now.
Additional features
- drops unused factor levels rather than just setting the missing values to
NA
. - issues a message to the user that factor levels have been dropped
- checks for existence of factor variables in
test_data
and returns original data.frame if non are present - works not only for
lm
,glm
and but also forglmmPQL
Note: The function shown here may change (improve) over time.
#' @title remove_missing_levels
#' @description Accounts for missing factor levels present only in test data
#' but not in train data by setting values to NA
#'
#' @import magrittr
#' @importFrom gdata unmatrix
#' @importFrom stringr str_split
#'
#' @param fit fitted model on training data
#'
#' @param test_data data to make predictions for
#'
#' @return data.frame with matching factor levels to fitted model
#'
#' @keywords internal
#'
#' @export
remove_missing_levels <- function(fit, test_data) {
# https://stackoverflow.com/a/39495480/4185785
# drop empty factor levels in test data
test_data %>%
droplevels() %>%
as.data.frame() -> test_data
# 'fit' object structure of 'lm' and 'glmmPQL' is different so we need to
# account for it
if (any(class(fit) == "glmmPQL")) {
# Obtain factor predictors in the model and their levels
factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
names(unlist(fit$contrasts))))
# do nothing if no factors are present
if (length(factors) == 0) {
return(test_data)
}
map(fit$contrasts, function(x) names(unmatrix(x))) %>%
unlist() -> factor_levels
factor_levels %>% str_split(":", simplify = TRUE) %>%
extract(, 1) -> factor_levels
model_factors <- as.data.frame(cbind(factors, factor_levels))
} else {
# Obtain factor predictors in the model and their levels
factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
names(unlist(fit$xlevels))))
# do nothing if no factors are present
if (length(factors) == 0) {
return(test_data)
}
factor_levels <- unname(unlist(fit$xlevels))
model_factors <- as.data.frame(cbind(factors, factor_levels))
}
# Select column names in test data that are factor predictors in
# trained model
predictors <- names(test_data[names(test_data) %in% factors])
# For each factor predictor in your data, if the level is not in the model,
# set the value to NA
for (i in 1:length(predictors)) {
found <- test_data[, predictors[i]] %in% model_factors[
model_factors$factors == predictors[i], ]$factor_levels
if (any(!found)) {
# track which variable
var <- predictors[i]
# set to NA
test_data[!found, predictors[i]] <- NA
# drop empty factor levels in test data
test_data %>%
droplevels() -> test_data
# issue warning to console
message(sprintf(paste0("Setting missing levels in '%s', only present",
" in test data but missing in train data,",
" to 'NA'."),
var))
}
}
return(test_data)
}
We can apply this function to the example in the question as follows:
predict(model,newdata=remove_missing_levels (fit=model, test_data=foo.new))
While trying to improve this function, I came across the fact that SL learning methods like lm
, glm
etc. need the same levels in train & test while ML learning methods (svm
, randomForest
) fail if the levels are removed. These methods need all levels in train & test.
A general solution is quite hard to achieve since every fitted model has a different way of storing their factor level component (fit$xlevels
for lm
and fit$contrasts
for glmmPQL
). At least it seems to be consistent across lm
related models.
R: factor as new level when I predict with test data
As @jdobres has already explained the reason of why this error popped up I'll straightforwardly jump to the solution approach:
Let's try below line of code just before predict
statement
#add all levels of 'y' in 'test' dataset to fit$xlevels[["y"]] in the fit object
fit$xlevels[["y"]] <- union(fit$xlevels[["y"]], levels(test[["y"]]))
Hope this would resolve your problem!
How to auto-exclude unseen new factor levels in predict.randomForest?
One solution is to combine Train and Test Matrix and use as.factor on the combined matrix. Then separate into Train and Test again. I had faced this same issue in random forest and this solution had worked for me.
for example :
combine <- rbind(Train,Test)
combine$var1 <- as.factor(combine$var1)
##Then split into Test and Train
Train$var1 <- combine[1:nrow(train)]
similar for Test.
Hope this helps!
Using predict() with RcppArmadillo/RcppEigen when a factor level has only one level
Not a solution, but an explanation for why it happens.
If we inspect the source code of RcppAramdillo:::predict.fastLm()
, we find that it constructs the design matrix for the prediction points via
x <- model.matrix(object$formula, newdata)
On the other hand, if we inspect the source for stats::predict.lm()
, we find
tt <- terms(object)
## Some source omitted here
Terms <- delete.response(tt)
m <- model.frame(Terms, newdata, na.action = na.action, xlev = object$xlevels)
if (!is.null(cl <- attr(Terms, "dataClasses"))) .checkMFClasses(cl, m)
X <- model.matrix(Terms, m, contrasts.arg = object$contrasts)
which reveals that lm()
stores in its result information about the factor levels and contrasts for predictors, while fastLm()
reconstructs that information in the predict()
call:
names(model)
# [1] "coefficients" "stderr" "df.residual" "fitted.values"
# [5] "residuals" "call" "intercept" "formula"
names(lm_mod) ## Constructed with `lm()` call with same formula
# [1] "coefficients" "residuals" "effects" "rank"
# [5] "fitted.values" "assign" "qr" "df.residual"
# [9] "contrasts" "xlevels" "call" "terms"
# [13] "model"
Note the "xlevels"
and "contrasts"
elements in an lm
object that are not present in a fastLm
object. This goes to a larger point, though, from help("fastLM")
:
Linear models should be estimated using the lm function. In some cases, lm.fit may be appropriate.
Dirk can correct me if I'm wrong, but I think the point of fastLm()
is not to provide a rich OLS implementation that covers all the use cases that stats::lm()
does; I think it's more illustrative.
If your problem is big data and that's why you don't want to use stats::lm()
, might I suggest something like biglm::biglm()
? (See, for example, here). If you are really set on using RcppArmadillo::fastLm()
, you could do a smaller version of your workaround; instead of copying your whole data, just append one row to your prediction set for each unused factor level.
What does a glm model do with unknown factor levels added after training?
Lets start by looking at the model coefficients for the partial dataset summary(model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.664 1.068 24.966 < 2e-16 ***
cyl8 -11.564 1.427 -8.102 3.45e-08 ***
Predictions for cyl8
are equal to the intercept + the effect of cyl8
, so 26.664 + -11.564 = 15.10. For the other factor levels (cyl4
), the predictions are equal to the intercept (26.664). Adding a factor level that is unknown will yield the same prediction, as R has no basis for additional factors effect (these were excluded in the original model).
We can see that the estimates of the known factors are unaffected by estimating the model on the full data.
model2<- glm(formula = mpg ~ cyl, data = mtcars)
summary(model2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.6636 0.9718 27.437 < 2e-16 ***
cyl6 -6.9208 1.5583 -4.441 0.000119 ***
cyl8 -11.5636 1.2986 -8.905 8.57e-10 ***
You see that the estimated effects for cyl8
and the reference category cyl4
are unchanged (still 15.10 and 26.66 resp.). As such, the model will yield the same predictions for these factor levels. However, predictions for cyl6
are overestimated by 6.92 as you can see from the newly estimated coefficient.
randomForest does not work when training set has more different factor levels than test set
R expects both the training and the test data to have the exact same levels (even if one of the sets has no observations for a given level or levels). In your case, since the test dataset is missing a level that the train has, you can do
test$val <- factor(test$val, levels=levels(train$val))
to make sure it has all the same levels and they are coded the same say.
(reposted here to close out the question)
Related Topics
Merge by Range in R - Applying Loops
How to Arrange an Arbitrary Number of Ggplots Using Grid.Arrange
What Are the Double Colons (::) in R
R Shiny Rest API Communication
Can't Download Data from Yahoo Finance Using Quantmod in R
Databricks Configure Using Cmd and R
R Knitr: Possible to Programmatically Modify Chunk Labels
How to Format Axis Labels with Exponents with Ggplot2 and Scales
Writing Multiple Data Frames into .CSV Files Using R
Embedded Nul in String' Error When Importing CSV with Fread
Add Objects to Package Namespace
Stacked Bar Chart in R (Ggplot2) with Y Axis and Bars as Percentage of Counts