Reduce Number of Levels for Large Categorical Variables

Here is an example in R using data.table, but it should be easy to do without data.table as well.

# Load data.table
require(data.table)

# Some data
set.seed(1)
dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace = TRUE)),
                 weight = rnorm(n = 10e3, mean = 70, sd = 20))

# Decide the minimum frequency a level needs...
min.freq <- 3350

# Levels that don't meet the minimum frequency (using data.table)
fail.min.f <- dt[, .N, type][N < min.freq, type]

# Rename all of these levels to "Other" (indexing levels() with the failing
# factor values uses their underlying integer codes, and assigning the same
# name to several levels merges them)
levels(dt$type)[fail.min.f] <- "Other"

Pandas reduce number of categorical variables in value_counts() tabulation

You can use value_counts together with numpy.where, where the condition is built with isin.

If your variable is of type object, see below. If your variable is of type category, skip down toward the bottom.

import pandas as pd
import numpy as np

df = pd.DataFrame({'Color': 'Red Red Blue Red Violet Blue'.split(),
                   'Value': [11, 150, 50, 30, 10, 40]})
print (df)
    Color  Value
0     Red     11
1     Red    150
2    Blue     50
3     Red     30
4  Violet     10
5    Blue     40

a = df.Color.value_counts()
print (a)
Red       3
Blue      2
Violet    1
Name: Color, dtype: int64

# get the index (category labels) of the top 2 values
vals = a[:2].index
print (vals)
Index(['Red', 'Blue'], dtype='object')

df['new'] = np.where(df.Color.isin(vals), 0, 1)
print (df)
    Color  Value  new
0     Red     11    0
1     Red    150    0
2    Blue     50    0
3     Red     30    0
4  Violet     10    1
5    Blue     40    0

Or, if you need to replace all values that are not in the top group, use where:

df['new1'] = df.Color.where(df.Color.isin(vals), 'other')
print (df)
    Color  Value   new1
0     Red     11    Red
1     Red    150    Red
2    Blue     50   Blue
3     Red     30    Red
4  Violet     10  other
5    Blue     40   Blue

For category type:

df = pd.DataFrame({'Color': 'Red Red Blue Red Violet Blue'.split(),
                   'Value': [11, 150, 50, 30, 10, 40]})
df.Color = df.Color.astype('category')

a = df.Color.value_counts()[:2].index
print(a)
CategoricalIndex(['Red', 'Blue'],
categories=['Blue', 'Red', 'Violet'],
ordered=False, dtype='category')

Notice that Violet is still one of the categories. So we need .remove_unused_categories().

vals = df.Color.value_counts()[:2].index.remove_unused_categories()
print(vals)
CategoricalIndex(['Red', 'Blue'],
categories=['Blue', 'Red'],
ordered=False, dtype='category')

As mentioned in the comments, a ValueError occurs when assigning the new variable directly, because 'other' is not one of the existing categories. The workaround is to change the type first:

df['new1'] = df.Color.astype('object').where(df.Color.isin(vals), 'other')
df['new1'] = df['new1'].astype('category')

Encoding categorical variables with hundreds of levels for machine learning algorithms?

I've seen feature hashing and embeddings mentioned in the comments. Apart from that, you can try clustering players by their IDs if you have some additional data.
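
Feature hashing maps each ID into a fixed number of columns via a hash function, so the dimensionality stays bounded no matter how many distinct levels show up. Below is a minimal sketch using scikit-learn's FeatureHasher; the player_id column and the choice of n_features are just illustrative assumptions.

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Hypothetical data with a high-cardinality ID column
df = pd.DataFrame({'player_id': ['p105', 'p2087', 'p105', 'p3342']})

# Hash each ID into a fixed-size feature space; n_features trades off
# collisions against dimensionality
hasher = FeatureHasher(n_features=16, input_type='string')
X_hashed = hasher.transform([[pid] for pid in df['player_id']])

print(X_hashed.shape)  # (4, 16) sparse matrix, however many distinct IDs there are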

Another approach that is suitable for categorical data with many levels is mean encoding.

Mean encoding (also sometimes called target encoding) consists of encoding categories with means of the target: for example, in regression, if a categorical feature has levels 0 and 1, then level 0 is encoded by the mean of the response over the examples where the feature equals 0, and so on. There are some answers on this site that cover it in more detail. I also encourage you to watch this video if you want to learn more about how it works and how to implement it (there are several ways to do mean encoding, each with its pros and cons).
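
As a rough illustration, here is a minimal hand-rolled sketch of mean encoding in pandas; the player_id and target columns are hypothetical, and in practice the means should be computed on the training data only (ideally out-of-fold) so the target does not leak into the features.

import pandas as pd

# Hypothetical data: a high-cardinality categorical feature and a binary target
train = pd.DataFrame({'player_id': ['p1', 'p2', 'p1', 'p3', 'p2', 'p1'],
                      'target':    [1,    0,    1,    0,    1,    0]})

# Mean (target) encoding: replace each category by the mean of the target within that category
means = train.groupby('player_id')['target'].mean()
train['player_id_enc'] = train['player_id'].map(means)

# Categories unseen at prediction time can fall back to the global mean
global_mean = train['target'].mean()
new = pd.DataFrame({'player_id': ['p2', 'p999']})
new['player_id_enc'] = new['player_id'].map(means).fillna(global_mean)
print(new)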

In Python you can do mean encoding yourself, as in the sketch above (some approaches are also shown in the video from the series I linked), or you can try Category Encoders from scikit-learn contrib.
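
For reference, here is a short sketch of what the Category Encoders route might look like (the column names are assumptions; see the package documentation for the smoothing parameters):

# pip install category_encoders
import pandas as pd
import category_encoders as ce

X = pd.DataFrame({'player_id': ['p1', 'p2', 'p1', 'p3', 'p2', 'p1']})
y = pd.Series([1, 0, 1, 0, 1, 0])

# TargetEncoder applies smoothed mean/target encoding to the listed columns
encoder = ce.TargetEncoder(cols=['player_id'])
X_encoded = encoder.fit_transform(X, y)
print(X_encoded)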

Categorical Variable has a Limit of 53 Values

I can tell you that the caret approach is correct. caret contains tools for data splitting, preprocessing, feature selection, and model tuning with resampling/cross-validation. Here is a typical workflow for fitting a model with the caret package (an example with the data you posted).

First, we set a cross-validation method for tuning the hyperparameters of the chosen model (in your case the tuning parameters are mtry for both ranger and randomForest, plus splitrule and min.node.size for ranger). In this example, I choose k-fold cross-validation with k = 10:

library(caret)
control <- trainControl(method = "cv", number = 10)

Then we create a grid with the possible values that the parameters to be tuned can assume:

rangergrid <- expand.grid(mtry = 2:(ncol(data) - 1),
                          splitrule = "extratrees",
                          min.node.size = seq(0.1, 1, 0.1))
rfgrid <- expand.grid(mtry = 2:(ncol(data) - 1))

Finally, we fit the chosen models:

random_forest_ranger <- train(response ~ .,
                              data = data,
                              method = 'ranger',
                              trControl = control,
                              tuneGrid = rangergrid)

random_forest_rf <- train(response ~ .,
                          data = data,
                          method = 'rf',
                          trControl = control,
                          tuneGrid = rfgrid)

The output of the train function looks like this:

> random_forest_rf
Random Forest

162 samples
  4 predictor
  2 classes: 'a', 'b'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 146, 146, 146, 145, 146, 146, ...
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa
  2     0.6852941   0.00000000
  3     0.6852941   0.00000000
  4     0.6602941  -0.04499494

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

For more info on the caret package, look at the online vignette.


