How to Use Random Forests in R with Missing Values

How to use random forests in R with missing values?

My initial reaction to this question was that it didn't show much research effort, since "everyone" knows that random forests don't handle missing values in predictors. But upon checking ?randomForest I must confess that it could be much more explicit about this.

(Although, Breiman's PDF linked to in the documentation does explicitly say that missing values are simply not handled at all.)

The only obvious clue in the official documentation that I could see was that the default value for the na.action parameter is na.fail, which might be too cryptic for new users.

In any case, if your predictors have missing values, you have (basically) two choices:

  1. Use a different tool (rpart, for example, handles missing values nicely).
  2. Impute the missing values.

Not surprisingly, the randomForest package has a function for doing just this, rfImpute. The documentation at ?rfImpute runs through a basic example of its use.
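For reference, here is a minimal sketch of rfImpute along the lines of the example in its help page. The iris data and the artificially inserted NA values are only for illustration; note that rfImpute needs a complete response, so only the predictors are blanked out here.

```r
library(randomForest)

data(iris)
iris.na <- iris
set.seed(111)
# Artificially blank out some predictor values (the response stays complete)
for (i in 1:4) iris.na[sample(150, sample(20, 1)), i] <- NA

set.seed(222)
# Impute the missing predictor values using random forest proximities
iris.imputed <- rfImpute(Species ~ ., data = iris.na)

set.seed(333)
# Fit the forest on the imputed data
iris.rf <- randomForest(Species ~ ., data = iris.imputed)
print(iris.rf)
```

The imputed data frame can then be used anywhere the original one would have been.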

If only a small number of cases have missing values, you might also try setting na.action = na.omit to simply drop those cases.
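A rough sketch of the na.omit route, again on iris with a few NA values inserted purely for illustration (the object names are made up for this example):

```r
library(randomForest)

data(iris)
iris.na <- iris
set.seed(42)
# Blank out one predictor in 10 distinct rows (illustration only)
iris.na[sample(150, 10), "Sepal.Length"] <- NA

# How many rows will be dropped?
n_dropped <- sum(!complete.cases(iris.na))
cat("Rows with missing values:", n_dropped, "\n")

# na.omit drops the incomplete rows before the forest is grown
fit <- randomForest(Species ~ ., data = iris.na, na.action = na.omit)
print(fit)
```

It is worth checking the count of dropped rows first, so you know how much data you are giving up.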

And of course, this answer is a bit of a guess that your problem really is simply having missing values.

Predict on new data with R random forest when there are missing data

I think I understand what you want. You want to take a trained model and make predictions on new data which may have missing values. Rather than impute the missing values, you want the predicted value to be NA for those rows with missing values.

Here is one way to do that. I can even maintain the original row order. The assumptions are that your new data is in a data.frame called new_data and your trained random forest model is called my_forest. Replace these with the names of your objects. I'm also assuming a regression model. If this is a classification problem, let me know and I can alter the code.

Here is a step-by-step method explaining what we are doing.

library(tibble) # rowid_to_column() lives in tibble, not tidyr
library(dplyr)
new_data <- new_data %>% rowid_to_column() # add a rowid column holding the row number
new_data_na <- new_data %>%
  filter(!complete.cases(.)) # save the rows that contain NA in a separate data.frame
new_data_complete <- new_data %>%
  filter(complete.cases(.)) # keep only the rows with no NA
new_data_complete$predicted <- predict(my_forest, newdata = new_data_complete) # make predictions
new_data_na$predicted <- NA_real_ # typed NA so the column has the same data type
new_data_predicted <- rbind(new_data_na, new_data_complete) # bind rows
arrange(new_data_predicted, rowid) # return data to original order

Here is a more code-efficient pipe method using the tools of dplyr. Note how simple this looks. The case_when structure checks each row for NA values with complete.cases(.). The . in the argument tells complete.cases to use all columns. If a row has no NA values, complete.cases(.) returns TRUE and that row receives its prediction. Again, newdata = . is used to tell predict() to use all columns. If a row has one or more NA values, complete.cases(.) returns FALSE. The second line of the case_when structure is a catchall for when the first line is not TRUE; in that case we want the predicted value to be NA. One caveat: case_when evaluates its right-hand sides on the whole data frame and then picks a value per row, so predict() is still called on the rows that contain NA and must itself tolerate them. Note that this method does not involve taking the data apart, and so no effort needs to be made to put it back together.

library(dplyr)
new_data %>%
  mutate(predicted = case_when(complete.cases(.) ~ predict(my_forest, newdata = .),
                               TRUE ~ NA_real_))
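If you would rather not depend on dplyr, the same idea can be sketched in base R. This is only an illustration: my_forest is fitted on mtcars here as a stand-in for your own regression model, and new_data is a small slice of mtcars with one NA inserted by hand.

```r
library(randomForest)

set.seed(1)
# Stand-in regression model; replace with your own trained forest
my_forest <- randomForest(mpg ~ ., data = mtcars)

# Stand-in new data with one missing predictor value in row 2
new_data <- mtcars[1:6, ]
new_data[2, "hp"] <- NA

# TRUE for rows without any NA
ok <- complete.cases(new_data)

# Start with NA everywhere, then fill in predictions for the complete rows only
preds <- rep(NA_real_, nrow(new_data))
preds[ok] <- predict(my_forest, newdata = new_data[ok, , drop = FALSE])
new_data$predicted <- preds

new_data[, c("hp", "predicted")]
```

Because the predictions are written back by position, the original row order is kept without any extra bookkeeping.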

Error in randomForest, NA, missing values in object

Here is a complete, reproducible example with fake data. Without your data I can't give a perfect answer, but if you follow this logic you should be OK.

library(randomForest)

# Generate some fake data
fake_data <- data.frame(
  age = runif(500, 30, 65),
  marital = sample(c("single", "married", "divorced"), 500, TRUE),
  default = sample(c("yes", "no"), 500, TRUE),
  balance = runif(500, 0, 2100),
  housing = sample(c("yes", "no"), 500, TRUE),
  loan = sample(c("yes", "no"), 500, TRUE),
  stringsAsFactors = FALSE
)

# Add some missing data for example

fake_data[sample(x = 1:500, size = 5), "loan"] <- NA

# Drop the rows where loan is NA

fake_data_2 <- fake_data[!is.na(fake_data$loan), ]

cat("You have removed ", nrow(fake_data) - nrow(fake_data_2), " records\n")

# Add target and make sure it is a factor

fake_data_2$y <- as.factor(fake_data_2$loan)

# Drop loan so the model cannot simply read the target off its own copy
fake_data_2$loan <- NULL

# Make characters into factors
library(dplyr)

fake_data_2 <- fake_data_2 %>%
  mutate_if(is.character, as.factor)

fit <- randomForest(y ~ ., data = fake_data_2)

This will yield a valid random forest model.

Random Forest Error in na.fail.default: missing values in object

Assuming train() is from caret, you can specify a function to handle NAs with the na.action parameter. The default is na.fail. A very common alternative is na.omit. The randomForest package also provides na.roughfix, which will "Impute Missing Values by median/mode."

mod_rf <-
  train(left_school ~ job_title + gender + marital_status + age_at_enrollment +
          monthly_wage + educational_qualification + cityD + cityC +
          cityB + cityA + duration_in_program, # Equation (outcome and everything else)
        data = train_data, # Training data
        method = "ranger", # random forest (ranger is much faster than rf)
        metric = "ROC", # area under the curve
        trControl = control_conditions,
        tuneGrid = tune_mtry,
        na.action = na.omit # drop rows with missing values before fitting
  )
mod_rf
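To show what na.roughfix does on its own, here is a small sketch using randomForest directly rather than caret, modelled on the example in its help page. The iris data and the inserted NA values are just for illustration.

```r
library(randomForest)

data(iris)
iris.na <- iris
set.seed(111)
# Artificially blank out some predictor values
for (i in 1:4) iris.na[sample(150, sample(20, 1)), i] <- NA

# Fill numeric columns with the column median and factors with the most frequent level
iris.roughfix <- na.roughfix(iris.na)
summary(iris.roughfix)

# Or let randomForest apply the same fix while fitting
set.seed(222)
iris.narf <- randomForest(Species ~ ., data = iris.na, na.action = na.roughfix)
print(iris.narf)
```

This is a quick and crude imputation, so it is probably best suited to cases where only a small share of the values are missing.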

Error when using Random Forest model to predict test data

You can remove the rows with NA first, then use the predict, RMSE, and R2 functions like this:

library(caret)

tr.Control <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 5)

rf3 <- caret::train(Lifeexp ~ .,
                    data = train.dat2,
                    method = "rf",
                    trControl = tr.Control,
                    preProcess = c("center", "scale"),
                    ntree = 1500,
                    tuneGrid = expand.grid(mtry = seq(1, ncol(train.dat2) - 1))
)

# Remove the rows with NA from the data frame
test.dat2 <- na.omit(test.dat2)
rf.pred <- predict(rf3, newdata = test.dat2, type = "raw")

RMSE.tree = RMSE(rf.pred, test.dat2$Lifeexp)
Rsquare.tree = R2(rf.pred, test.dat2$Lifeexp)
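If it helps to see what those two numbers mean, here is a base R sketch on made-up vectors (obs and pred are hypothetical, not taken from the question). As far as I know, caret's R2() uses the squared correlation by default, which is what is computed here.

```r
# Hypothetical observed and predicted values; one observation is missing
obs  <- c(70.1, 65.3, NA, 80.2, 75.0)
pred <- c(69.0, 66.1, 72.4, 79.5, 76.2)

# Keep only the pairs where both values are present
keep <- !is.na(obs) & !is.na(pred)

# Root mean squared error
rmse <- sqrt(mean((pred[keep] - obs[keep])^2))

# Squared correlation between predictions and observations
r2 <- cor(pred[keep], obs[keep])^2

cat("RMSE:", rmse, " R2:", r2, "\n")
```

Dropping the incomplete pairs first is the same idea as calling na.omit on the test data before predicting.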

