Categorical Variable has a Limit of 53 Values
I can tell you that the caret approach is correct. caret contains tools for data splitting, preprocessing, feature selection, and model tuning with resampling (cross-validation). Here is a typical workflow for fitting a model with the caret package, using the data you posted.
First, we set up a cross-validation method for tuning the hyperparameters of the chosen model (in your case the tuning parameters are mtry for both ranger and randomForest, plus splitrule and min.node.size for ranger). In this example, I choose k-fold cross-validation with k = 10:
library(caret)
# 10-fold cross-validation for hyperparameter tuning
control <- trainControl(method = "cv", number = 10)
Then we create a grid with the candidate values for the parameters to be tuned:
# min.node.size is a count of observations per node, so it takes
# whole-number values
rangergrid <- expand.grid(mtry = 2:(ncol(data) - 1),
                          splitrule = "extratrees",
                          min.node.size = 1:10)
rfgrid <- expand.grid(mtry = 2:(ncol(data) - 1))
Finally, we fit the chosen models:
random_forest_ranger <- train(response ~ .,
                              data = data,
                              method = "ranger",
                              trControl = control,
                              tuneGrid = rangergrid)
random_forest_rf <- train(response ~ .,
                          data = data,
                          method = "rf",
                          trControl = control,
                          tuneGrid = rfgrid)
The output of the train function looks like this:
> random_forest_rf
Random Forest
162 samples
4 predictor
2 classes: 'a', 'b'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 146, 146, 146, 145, 146, 146, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.6852941 0.00000000
3 0.6852941 0.00000000
4 0.6602941 -0.04499494
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
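If you want to compare the two tuned models, caret's resamples() helper collects their cross-validated metrics, and predict() applies a fitted model to new observations. A minimal sketch (new_data is a hypothetical data frame with the same predictor columns; for a strict comparison you would also fix the fold indices, e.g. via the index argument of trainControl):
# Collect and summarize the cross-validated metrics of both models
comparison <- resamples(list(ranger = random_forest_ranger,
                             rf = random_forest_rf))
summary(comparison)
# Apply the tuned model to new observations ('new_data' is hypothetical)
predict(random_forest_rf, newdata = new_data)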
For more info on the caret package, look at the online vignette.
Error with RandomForest in R because of too many categories
Edited to address follow-up from OP
I believe using as.matrix in this case implicitly creates factors. It is also not necessary for this package. You can keep the data as a data frame, but you will need to make sure that any unused factor levels are dropped, for example with droplevels (or something similar). There are many reasons an unused factor level may be in your data set, but a common one is a dropped observation.
Below is a quick example that reproduces your error:
library('randomForest')
# making a toy data frame
x <- data.frame('one' = c(1, 1, 1, 1, 1, seq(50)),
                'two' = c(seq(54), NA),
                'three' = seq(55),
                'four' = seq(55))
x$one <- as.factor(x$one)
x <- na.omit(x)  # getting rid of an NA. Note this removes the whole row.
randomForest(one ~ ., data = as.matrix(x))  # your first error
randomForest(one ~ ., data = x)             # your second error
x <- droplevels(x)
randomForest(one ~ ., data = x)             # OK
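As a quick diagnostic for your own data, you can compare the number of declared levels in each factor column against the values actually present; a mismatch means there are unused levels to drop. A small check, assuming the toy data frame above:
# Number of declared levels for each factor column in the data frame
sapply(Filter(is.factor, x), nlevels)
# Number of values actually present in one column, for comparison
length(unique(x$one))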
Can not handle categorical predictors with more than 32 categories
So, just to complete this: I had the exact same problem, and it took me ten minutes to figure out that there were hidden comments. The solution may be that the null values are interpreted as character strings. Try the na.strings option:
read.csv("filename.csv", na.strings = c("", "NA", "NULL"))
randomForest Categorical Predictor Limits
I think your variable still carried all of the original factor levels, including ones that no longer occur in your subset. Try adding this line before you fit the forest again:
df2$college_id <- factor(df2$college_id)
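For context, factor() rebuilds the column from the values actually present, which discards unused levels; droplevels() does the same for every factor column at once. A minimal sketch using the df2 from the question:
# Rebuild one column, dropping levels that no longer occur
df2$college_id <- factor(df2$college_id)
# Or drop unused levels across all factor columns in one call
df2 <- droplevels(df2)
nlevels(df2$college_id)  # now counts only the values that remain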