R - Random Forest and More Than 53 Categories

Categorical Variable has a Limit of 53 Values

I can tell you that the caret approach is correct. caret contains tools for data splitting, preprocessing, feature selection, and model tuning with resampling (e.g. cross-validation). Here is a typical workflow for fitting a model with the caret package, using the data you posted.

First, we set a resampling method for tuning the hyperparameters of the chosen model (in your case the tuning parameters are mtry for both ranger and randomForest, plus splitrule and min.node.size for ranger). In this example I choose k-fold cross-validation with k = 10:

library(caret)
control <- trainControl(method="cv",number = 10)

then we create a grid with the possible values that the parameters to be tuned can take:

rangergrid <- expand.grid(mtry = 2:(ncol(data) - 1),
                          splitrule = "extratrees",
                          min.node.size = seq(0.1, 1, 0.1))
rfgrid <- expand.grid(mtry = 2:(ncol(data) - 1))

finally, we fit the chosen models:

random_forest_ranger <- train(response ~ .,
                              data = data,
                              method = 'ranger',
                              trControl = control,
                              tuneGrid = rangergrid)

random_forest_rf <- train(response ~ .,
                          data = data,
                          method = 'rf',
                          trControl = control,
                          tuneGrid = rfgrid)

the output of the train function looks like this:

> random_forest_rf
Random Forest

162 samples
4 predictor
2 classes: 'a', 'b'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 146, 146, 146, 145, 146, 146, ...
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa
  2     0.6852941   0.00000000
  3     0.6852941   0.00000000
  4     0.6602941  -0.04499494

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

For more info on the caret package, take a look at the online vignette.
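
If it helps, here is a small sketch (reusing the objects fitted above; normally you would predict on new data rather than on the training set) of how to inspect the tuned model and get predictions from it:

random_forest_rf$bestTune                                  # the winning combination of tuning parameters
random_forest_rf$finalModel                                # the underlying randomForest fit
predict(random_forest_rf, newdata = data)                  # predicted classes
predict(random_forest_rf, newdata = data, type = "prob")   # class probabilities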

Error with RandomForest in R because of too many categories

Edited to address the follow-up from the OP

I believe using as.matrix in this case implicitly creates factors. It is also not necessary for this package. You can keep the data as a data frame, but you will need to make sure that any unused factor levels are dropped, e.g. with droplevels. There are many reasons an unused factor level may remain in your data set, but a common one is a dropped observation.

Below is a quick example that reproduces your error:

library('randomForest')

# making a toy data frame
x <- data.frame('one'   = c(1, 1, 1, 1, 1, seq(50)),
                'two'   = c(seq(54), NA),
                'three' = seq(55),
                'four'  = seq(55))

x$one <- as.factor(x$one)

x <- na.omit(x) #getting rid of an NA. Note this removes the whole row.

randomForest(one ~., data = as.matrix(x)) #your first error
randomForest(one ~., data = x) #your second error

x <- droplevels(x)

randomForest(one ~., data = x) #OK
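
If you are not sure which column is the problem, a quick check (sketched here on the toy data frame x from above) is to compare the number of levels attached to each column with the number of distinct values actually present; a mismatch means unused levels, and a count above the limit is what triggers randomForest's error:

sapply(x, nlevels)                             # levels attached to each column (0 for non-factors)
sapply(x, function(col) length(unique(col)))   # distinct values actually present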

Can not handle categorical predictors with more than 32 categories

So, just to complete this: I had the exact same problem, and it took me 10 minutes to figure out that there were hidden comments. Thus:

the issue may be that the null values are interpreted as characters, so the column ends up as a factor with one level per distinct string.

Try using the na.strings option when reading the file:

read.csv("filename.csv", na.strings=c("", "NA", "NULL"))
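
To verify the fix, you can compare the column classes with and without na.strings; this is just a sketch using the placeholder file name from above (in older R versions, where stringsAsFactors defaulted to TRUE, the contaminated column would come back as a factor with one level per distinct value):

df_bad  <- read.csv("filename.csv")                                    # "NULL" keeps the column as character/factor
df_good <- read.csv("filename.csv", na.strings = c("", "NA", "NULL"))  # "NULL" and empty strings become NA
sapply(df_bad,  class)
sapply(df_good, class)   # the affected columns should now be numeric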

randomForest Categorical Predictor Limits

I think your variable still has all the factor levels from the original data, even the unused ones. Try adding this line before you fit the forest again:

df2$college_id <- factor(df2$college_id)
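
As a quick sanity check (a small sketch reusing the df2 name from the question), compare the level count before and after rebuilding the factor; droplevels(df2) would do the same for every factor column at once:

nlevels(df2$college_id)                    # still reports every level from the original data
df2$college_id <- factor(df2$college_id)   # rebuild the factor, keeping only levels actually present
nlevels(df2$college_id)                    # now counts only the values present in df2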

