Reduce number of levels for large categorical variables
Here is an example in R. It uses data.table a bit, but it should be easy to do without data.table as well.
# Load data.table
require(data.table)
# Some data
set.seed(1)
dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace = TRUE)),
weight = rnorm(n = 10e3, mean = 70, sd = 20))
# Decide the minimum frequency a level needs...
min.freq <- 3350
# Levels that don't meet the minimum frequency (using data.table)
fail.min.f <- dt[, .N, type][N < min.freq, type]
# Rename all these levels to "Other" (indexing levels() by the factor
# fail.min.f works because its integer codes line up with levels(dt$type))
levels(dt$type)[fail.min.f] <- "Other"
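For readers coming from Python, a rough pandas equivalent of the same idea might look like this (a sketch mirroring the R example's hypothetical data, not code from the original answer):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical data mirroring the R example
dt = pd.DataFrame({"type": pd.Categorical(rng.choice(list("ABC"), 10_000)),
                   "weight": rng.normal(70, 20, 10_000)})

# Decide the minimum frequency a level needs
min_freq = 3350
counts = dt["type"].value_counts()
fail_min_f = counts[counts < min_freq].index

# Relabel the rare levels as "Other", then restore the category dtype
dt["type"] = dt["type"].astype(object).where(~dt["type"].isin(fail_min_f), "Other")
dt["type"] = dt["type"].astype("category")
```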
Pandas reduce number of categorical variables in value_counts() tabulation
You can use value_counts with numpy.where, where the condition is built with isin.
If your variable is of type object see below. If your variable is of type category, then skip down toward the bottom.
df = pd.DataFrame({'Color':'Red Red Blue Red Violet Blue'.split(),
'Value':[11,150,50,30,10,40]})
print (df)
Color Value
0 Red 11
1 Red 150
2 Blue 50
3 Red 30
4 Violet 10
5 Blue 40
a = df.Color.value_counts()
print (a)
Red 3
Blue 2
Violet 1
Name: Color, dtype: int64
#get top 2 values of index
vals = a[:2].index
print (vals)
Index(['Red', 'Blue'], dtype='object')
df['new'] = np.where(df.Color.isin(vals), 0,1)
print (df)
Color Value new
0 Red 11 0
1 Red 150 0
2 Blue 50 0
3 Red 30 0
4 Violet 10 1
5 Blue 40 0
Or, if you need to replace all non-top values, use where:
df['new1'] = df.Color.where(df.Color.isin(vals), 'other')
print (df)
Color Value new1
0 Red 11 Red
1 Red 150 Red
2 Blue 50 Blue
3 Red 30 Red
4 Violet 10 other
5 Blue 40 Blue
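If you want a minimum-frequency cutoff instead of a fixed top-k (closer in spirit to the R answer above), one option is to map the counts back onto the column; a sketch, with min_freq chosen for this toy data:

```python
import pandas as pd

df = pd.DataFrame({'Color': 'Red Red Blue Red Violet Blue'.split(),
                   'Value': [11, 150, 50, 30, 10, 40]})

min_freq = 2
# Look up each row's level frequency, then mask the rare ones
counts = df['Color'].map(df['Color'].value_counts())
df['new2'] = df['Color'].where(counts >= min_freq, 'other')
```

Here Red (3) and Blue (2) survive the cutoff and Violet (1) becomes 'other'.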
For category type:
df = pd.DataFrame({'Color':'Red Red Blue Red Violet Blue'.split(),
'Value':[11,150,50,30,10,40]})
df.Color = df.Color.astype('category')
a= df.Color.value_counts()[:2].index
print(a)
CategoricalIndex(['Red', 'Blue'],
categories=['Blue', 'Red', 'Violet'],
ordered=False, dtype='category')
Notice that Violet is still a category, so we need .remove_unused_categories().
vals = df.Color.value_counts()[:2].index.remove_unused_categories()
print (vals)
CategoricalIndex(['Red', 'Blue'],
categories=['Blue', 'Red'],
ordered=False, dtype='category')
As mentioned in the comments, a ValueError will occur when setting the new variable, because 'other' is not one of the existing categories. The way around that is changing the type first:
df['new1'] = df.Color.astype('object').where(df.Color.isin(vals), 'other')
df['new1'] = df['new1'].astype('category')
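An alternative that avoids the round-trip through object is to add 'other' as a valid category before masking; a sketch of that approach:

```python
import pandas as pd

df = pd.DataFrame({'Color': 'Red Red Blue Red Violet Blue'.split(),
                   'Value': [11, 150, 50, 30, 10, 40]})
df['Color'] = df['Color'].astype('category')

vals = df['Color'].value_counts()[:2].index
# Register 'other' as a category so where() can fill with it safely
col = df['Color'].cat.add_categories('other')
df['new1'] = col.where(col.isin(vals), 'other').cat.remove_unused_categories()
```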
Encoding categorical variables with hundreds of levels for machine learning algorithms?
I've seen feature hashing and embedding mentioned in comments. Apart from that you can try clustering players by IDs if you have some additional data.
Another approach that is suitable for categorical data with many levels is mean encoding.
Mean encoding (also called target encoding) encodes each category with the mean of the target among the examples in that category (for example, in a binary problem with classes 0 and 1, a category is encoded by the mean of the response over the rows belonging to it). There are some answers on this site that provide more detail. I also encourage you to watch this video if you want to learn more about how it works and how to implement it (there are several ways to do mean encoding, each with its pros and cons).
In Python you can do mean encoding yourself (some approaches are shown in the video from the series I linked) or you can try Category Encoders from scikit-learn contrib.
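A minimal sketch of mean encoding in pandas (toy data with hypothetical column names; in practice the means should be computed on the training fold only, with smoothing or cross-fold schemes to limit target leakage):

```python
import pandas as pd

df = pd.DataFrame({'player_id': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'target':    [1,   0,   1,   1,   0,   1]})

# Encode each category by the mean of the target within that category
means = df.groupby('player_id')['target'].mean()
df['player_id_enc'] = df['player_id'].map(means)
```

The encoded column replaces each ID with a single number, so hundreds of levels collapse into one numeric feature.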
Categorical Variable has a Limit of 53 Values
I can tell you that the caret approach is correct. caret contains tools for data splitting, preprocessing, feature selection and model tuning with resampling (cross-validation). Here I post a typical workflow for fitting a model with the caret package (example with the data you posted).
First, we set a cross-validation method for tuning the hyperparameters of the chosen model (in your case the tuning parameters are mtry for both ranger and randomForest, plus splitrule and min.node.size for ranger). In the example, I choose a k-fold cross-validation with k=10.
library(caret)
control <- trainControl(method="cv",number = 10)
Then we create a grid with the possible values that the parameters to be tuned can assume:
rangergrid <- expand.grid(mtry=2:(ncol(data)-1),splitrule="extratrees",min.node.size=seq(0.1,1,0.1))
rfgrid <- expand.grid(mtry=2:(ncol(data)-1))
Finally, we fit the chosen models:
random_forest_ranger <- train(response ~.,
data = data,
method = 'ranger',
trControl=control,
tuneGrid=rangergrid)
random_forest_rf <- train(response ~.,
data = data,
method = 'rf',
trControl=control,
tuneGrid=rfgrid)
The output of the train function looks like this:
> random_forest_rf
Random Forest
162 samples
4 predictor
2 classes: 'a', 'b'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 146, 146, 146, 145, 146, 146, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.6852941 0.00000000
3 0.6852941 0.00000000
4 0.6602941 -0.04499494
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
For more info on the caret package, look at the online vignette.
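For Python users, a rough analogue of this caret workflow is scikit-learn's GridSearchCV; a sketch with toy data standing in for the question's 162 x 4 frame (column setup and grid values are assumptions, not from the original answer):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy two-class data, 162 samples and 4 features, like the caret example
X, y = make_classification(n_samples=162, n_features=4, random_state=1)

# 10-fold CV over mtry-like values (max_features plays the role of mtry)
grid = {'max_features': [2, 3, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=1), grid, cv=10)
search.fit(X, y)
```

search.best_params_ and search.best_score_ then correspond to the "final value used for the model" and its accuracy in the caret output above.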