Random Forest Package in R Shows Error During Prediction() If There Are New Factor Levels Present in Test Data. How to Avoid This Error

The randomForest package in R throws an error in predict() if the test data contains factor levels that were not present in the training data. Is there any way to avoid this error?

One workaround I've found is to first convert the factor variables in your train and test sets into characters:

train$factor <- as.character(train$factor)
test$factor <- as.character(test$factor)

Then add a column to each with a flag for test/train, i.e.

test$isTest <- rep(1,nrow(test))
train$isTest <- rep(0,nrow(train))

Then rbind them

fullSet <- rbind(test,train)

Then convert back to a factor

fullSet$factor <- as.factor(fullSet$factor)

This will ensure that both the test and train sets have the same levels. Then you can split back off:

test.new <- fullSet[fullSet$isTest==1,]
train.new <- fullSet[fullSet$isTest==0,]

and you can drop/NULL out the isTest column from each. Then you'll have sets with identical levels you can train and test on. There might be a more elegant solution, but this has worked for me in the past and you can write it into a little function if you need to repeat it often.
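
The "little function" mentioned above could look roughly like the sketch below. The name align_levels and its cols argument are made up for illustration, and it assumes that train and test have exactly the same columns, since rbind() needs that.

```r
# Hypothetical helper: give the listed factor columns identical levels
# in train and test. Assumes both data frames have the same columns.
align_levels <- function(train, test, cols) {
  # Convert the chosen columns to character in both sets
  train[cols] <- lapply(train[cols], as.character)
  test[cols]  <- lapply(test[cols], as.character)

  # Flag the origin of each row, then stack the two sets
  train$isTest <- 0
  test$isTest  <- 1
  fullSet <- rbind(test, train)

  # Re-create the factors on the combined data so the levels are shared
  fullSet[cols] <- lapply(fullSet[cols], as.factor)

  # Split back and drop the helper flag
  keep <- names(fullSet) != "isTest"
  list(
    train = fullSet[fullSet$isTest == 0, keep, drop = FALSE],
    test  = fullSet[fullSet$isTest == 1, keep, drop = FALSE]
  )
}

# Toy data: "c" only occurs in test, "a" only in train
train <- data.frame(x = factor(c("a", "b")), y = c(1, 2))
test  <- data.frame(x = factor(c("b", "c")), y = c(3, 4))

aligned <- align_levels(train, test, cols = "x")
levels(aligned$train$x)
levels(aligned$test$x)
```

Both calls to levels() should list the same values, which is the point of the workaround.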

Random Forest in R: New factor levels not present in the training data

Fellow newbie here, I was just toying around with the Titanic data these days. I think it doesn't make sense to have the Parch variable as a factor, so maybe make it numeric and that may solve the problem:

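train$Parch <- as.numeric(as.character(train$Parch)) # via character, so a factor gives its values, not its integer codes
test$Parch <- as.numeric(as.character(test$Parch))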

As it stands, the test data has 2 observations with a Parch value of 9, which does not occur in the train data:

> table(train$Parch)

  0   1   2   3   4   5   6 
678 118  80   5   4   5   1 

> table(test$Parch)

  0   1   2   3   4   5   6   9 
324  52  33   3   2   1   1   2 

Alternatively, if you need the variable to be a factor, then you could just add another level to it:

train$Parch <- as.factor(train$Parch) # in my data, Parch is type int
train$Parch
levels(train$Parch) <- c(levels(train$Parch), "9") 
train$Parch # now Parch has 8 levels (0 to 6, plus the new "9")
table(train$Parch) # level 9 is empty
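
Instead of typing the missing level by hand, the same idea can be sketched with the union of the levels from both sets. The toy values below are stand-ins, not the real Titanic columns, and this assumes Parch is already a factor in both data frames.

```r
# Toy stand-ins for the Titanic columns (values are illustrative only)
train <- data.frame(Parch = factor(c(0, 1, 2, 0)))
test  <- data.frame(Parch = factor(c(0, 1, 9)))

# Every level seen in either set
all_levels <- union(levels(train$Parch), levels(test$Parch))

# Rebuild both factors on the shared level set
train$Parch <- factor(train$Parch, levels = all_levels)
test$Parch  <- factor(test$Parch,  levels = all_levels)

levels(train$Parch)
table(train$Parch)  # the level "9" exists in train but has a count of 0
```

Keep in mind that this only makes the levels match. The model still sees no training rows with the added level.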

How does randomForest() predict for new factor levels not in training data?

My opinion is that this is a very bad example; but, here's the answer:

Your created df1 only has factor variables and 4 observations. Here, mtry will equal 1, meaning that roughly half of your trees will be based on b alone and half on a alone. When b == "4" the classification is always 1, i.e. b == "4" perfectly predicts c. Similarly, a == 1 perfectly predicts c == 0.

The reason that this works when you create the data in a single dataset is that the variables are factors whose possible levels exist in both train and test, even though the observed count for some levels is 0 in train. Since "unwanted_char" is a possible level in train$a (although unobserved), it's not a problem for your prediction. If you create these as separate datasets, the factors are created independently and test has new levels.
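
A small sketch of that difference, using made-up data rather than the df1 from the question: subsetting one data frame keeps all levels, while building the two sets separately does not.

```r
# One data frame, split afterwards: the subsets keep all levels
df <- data.frame(a = factor(c("1", "2", "3", "unwanted_char")))
train <- df[1:3, , drop = FALSE]
test  <- df[4, , drop = FALSE]
levels(train$a)  # includes "unwanted_char" although it is unobserved in train
table(train$a)   # its count in train is 0

# Two data frames created separately: each factor only knows its own values
train2 <- data.frame(a = factor(c("1", "2", "3")))
test2  <- data.frame(a = factor("unwanted_char"))
levels(train2$a)

# Levels in test2 that are new to train2
setdiff(levels(test2$a), levels(train2$a))
```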

That is to say, essentially, your example only works because of how factors work in R, which you seem to have misunderstood.

random forest: error in dealing with factor levels in R

Try

newdata$feature_x1 <- factor(newdata$feature_x1, levels=levels(feature_x1))
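
One thing to be aware of with this approach, shown on toy data below (the data frame name train is an assumption, the original line refers to the training feature directly): values that are not among the training levels turn into NA, so it is worth counting them before predicting.

```r
# Toy data; 'train' and the values are illustrative only
train   <- data.frame(feature_x1 = factor(c("a", "b", "a")))
newdata <- data.frame(feature_x1 = factor(c("a", "c")))

# Re-level the new data on the training levels
newdata$feature_x1 <- factor(newdata$feature_x1,
                             levels = levels(train$feature_x1))

newdata$feature_x1              # "c" is not a training level, so it becomes NA
sum(is.na(newdata$feature_x1))  # number of rows that need separate handling
```

How to treat those NA rows (drop them, impute them, and so on) depends on your model and data.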

tidymodels Novel levels found in column

If you notice in the documentation for step_novel(), it says:

When fitting a model that can deal with new factor levels, consider using workflows::add_recipe() with allow_novel_levels = TRUE set in hardhat::default_recipe_blueprint(). This will allow your model to handle new levels at prediction time, instead of throwing warnings or errors.

So you want to do that:

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

data <-
  data.frame(
    Survived = as.factor(c(0,1,1,1,0,0,0,0,0,1,1,1,0,0,0,0)),
    Siblings = as.factor(c(1,1,0,1,0,0,0,3,1,1,0,1,0,0,0,3)),
    Class = as.factor(c(0,1,0,1,0,1,0,0,0,1,0,1,0,1,0,0)),
    Embarked = as.factor(c("s","c","m","m","s","c","s","m","m","s","s","s","s","s","s","s")) 
  )

test <-
  data.frame(
    Siblings = as.factor(c(1,1,0,1,0,0,0,3,1,1,0,1,0,0,0,4)), #New factor level
    Class = as.factor(c(0,1,0,1,0,1,0,0,0,1,0,1,0,1,0,0)),
    Embarked = as.factor(c("s","c","m","m","s","c","s","m","m","s","s","s","s","s","s","s")) 
  )

#Model
rf_model <-
  rand_forest() %>%
  set_args(
    mtry = 3,
    trees = 1000,
    min_n = 15
  ) %>%
  set_engine("ranger", 
             importance = "impurity") %>%
  set_mode("classification")

#Recipe
data_recipe <- 
  recipe(Survived ~Siblings + Class + Embarked, data=data) %>%
  step_novel(Siblings) %>%
  step_dummy(Siblings)

#Workflow
rf_workflow <- 
  workflow() %>%
  add_recipe(data_recipe, 
             blueprint = hardhat::default_recipe_blueprint(allow_novel_levels = TRUE)) %>%
  add_model(rf_model)

final_model <- fit(rf_workflow, data)
final_model
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: rand_forest()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 2 Recipe Steps
#> 
#> • step_novel()
#> • step_dummy()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~3,      x), num.trees = ~1000, min.node.size = min_rows(~15, x),      importance = ~"impurity", num.threads = 1, verbose = FALSE,      seed = sample.int(10^5, 1), probability = TRUE) 
#> 
#> Type:                             Probability estimation 
#> Number of trees:                  1000 
#> Sample size:                      16 
#> Number of independent variables:  5 
#> Mtry:                             3 
#> Target node size:                 15 
#> Variable importance mode:         impurity 
#> Splitrule:                        gini 
#> OOB prediction error (Brier s.):  0.254242

test_predict <- predict(final_model, test)
test_predict
#> # A tibble: 16 x 1
#>    .pred_class
#>    <fct>      
#>  1 0          
#>  2 1          
#>  3 0          
#>  4 1          
#>  5 0          
#>  6 0          
#>  7 0          
#>  8 0          
#>  9 0          
#> 10 1          
#> 11 0          
#> 12 1          
#> 13 0          
#> 14 0          
#> 15 0          
#> 16 0

Created on 2021-07-09 by the reprex package (v2.0.0)

The workflows functions are very strict about factor levels and other aspects of the new data, ensuring that they match up with the training data.

New factor levels not present in the training data

I tested my speculation that the ordered factors were the source of the problem, and got no error when the only thing I did was remove "ordered" from the classes of that structure. I don't see in the documentation that ordered factors are not allowed, but I also don't see that they were specifically considered. It's possible that this hasn't come up before. It would seem that ordering would impose additional complexities, and if you wanted order to be accounted for you could instead offer the as.numeric(.) "scores" to the RF algorithm.
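
Both options can be sketched on a toy ordered factor (the values are made up, not taken from the question's data):

```r
# Toy ordered factor
x <- factor(c("low", "high", "medium"),
            levels = c("low", "medium", "high"),
            ordered = TRUE)
class(x)        # "ordered" "factor"

# Option 1: keep the levels but drop the ordering
x_plain <- factor(x, ordered = FALSE)
class(x_plain)  # "factor"

# Option 2: pass the integer scores instead (1 = low, 2 = medium, 3 = high)
x_score <- as.numeric(x)
x_score
```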