Random Forest with Classes That Are Very Unbalanced

You should try using sampling methods that reduce the degree of imbalance from 1:10,000 down to 1:100 or 1:10. You should also reduce the size of the trees that are generated. (At the moment these are recommendations that I am repeating only from memory, but I will see if I can track down more authority than my spongy cortex.)

One way of reducing the size of the trees is to set "nodesize" larger. With that degree of imbalance you might need the node size to be really large, say 5-10,000. Here's a thread on R-help:
https://stat.ethz.ch/pipermail/r-help/2011-September/289288.html

In the current state of the question you have sampsize=c(250000,2000), whereas something like sampsize=c(8000,2000) would be more in line with my suggestions. I think you are creating samples in which you have hardly any of the group that was sampled with only 2000.
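
The advice above is for R's randomForest (sampsize, nodesize). Purely as a hedged illustration of the same idea on the Python side, assuming the imbalanced-learn package: its BalancedRandomForestClassifier grows each tree on a sample in which the majority class is downsampled, and min_samples_leaf plays a role loosely similar to nodesize. A minimal sketch on synthetic data:

    # Sketch only: per-tree balanced sampling with imbalanced-learn (not the
    # original R code; data and parameter values are illustrative assumptions)
    from sklearn.datasets import make_classification
    from imblearn.ensemble import BalancedRandomForestClassifier

    X, y = make_classification(n_samples=50000, n_features=20,
                               weights=[0.99, 0.01], random_state=42)

    # Each tree sees the full minority class plus a downsampled majority class;
    # sampling_strategy=0.1 asks for roughly a 1:10 minority:majority ratio.
    # min_samples_leaf is loosely analogous to randomForest's nodesize.
    brf = BalancedRandomForestClassifier(n_estimators=500,
                                         sampling_strategy=0.1,
                                         min_samples_leaf=50,
                                         random_state=42)
    brf.fit(X, y)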

Random forest with unbalanced class (positive is minority class), low precision and weird score distributions

  • Upsampling with SMOTE and SMOTEENN: I am far from being an expert with those, but by upsampling your dataset you might amplify existing noise, which induces overfitting. This could explain why your algorithm cannot classify correctly, giving the results in the first graph.

I found a bit more info here, and maybe a way to improve your results:
https://sci2s.ugr.es/sites/default/files/ficherosPublicaciones/1773_ver14_ASOC_SMOTE_FRPS.pdf

  • When you downsample, you seem to encounter the same overfitting problem, as I understand it (at least for the target result of the previous year). It is hard to deduce the reason behind it without a view of the data, though.

  • Your overfitting problem might come from the number of features you use, which could add unnecessary noise. You could try reducing the number of features and gradually increasing it (using an RFE model; see the sketch after the link below). More info here:

https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/
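
Just to illustrate the RFE idea, a minimal sketch on synthetic data (the estimator choice and n_features_to_select=10 are arbitrary assumptions, not values from the question):

    # Recursive feature elimination sketch on synthetic, imbalanced data
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=2000, n_features=30,
                               n_informative=5, weights=[0.9, 0.1],
                               random_state=42)

    # Recursively drop the weakest features until 10 remain
    rfe = RFE(estimator=LogisticRegression(max_iter=1000),
              n_features_to_select=10, step=1)
    rfe.fit(X, y)

    X_reduced = rfe.transform(X)   # keep only the selected columns
    print(rfe.support_)            # boolean mask of selected features
    print(rfe.ranking_)            # 1 = selected, higher = dropped earlier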

For the models, you mention Random Forest and XGBoost, but you did not mention having tried simpler models. You could try a simpler model and focus on your data engineering.
If you have not tried it yet, maybe you could:

  • Downsample your data
  • Normalize all your data with a StandardScaler
  • Test "brute force" tuning of simple models such as Naive Bayes and Logistic Regression

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Define steps of the pipeline
    steps = [('scaler', StandardScaler()),
             ('log_reg', LogisticRegression(solver='liblinear'))]  # liblinear supports l1 and l2

    pipeline = Pipeline(steps)

    # Specify the hyperparameters (prefixed with the step name for the pipeline)
    parameters = {'log_reg__C': [1, 10, 100],
                  'log_reg__penalty': ['l1', 'l2']}

    # Create train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                        random_state=42)

    # Instantiate a GridSearchCV object: cv
    cv = GridSearchCV(pipeline, param_grid=parameters)

    # Fit to the training set and inspect the best hyperparameters
    cv.fit(X_train, y_train)
    print(cv.best_params_)

Anyway, for your example the pipeline could be as follows (I made it with Logistic Regression, but you can swap in another ML algorithm and change the parameter grid accordingly):

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

param_grid = {'C': [1, 10, 100]}

clf = LogisticRegression(solver='lbfgs', multi_class='auto')
sme = SMOTEENN(smote=SMOTE(k_neighbors=2), random_state=42)
grid = GridSearchCV(estimator=clf, param_grid=param_grid, scoring='f1')

# Scale, then resample with SMOTEENN, then grid-search the classifier
pipeline = Pipeline([('scale', StandardScaler()),
                     ('SMOTEENN', sme),
                     ('grid', grid)])

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
score = cross_val_score(pipeline, X, y, cv=cv)

I hope this may help you.

(edit: I added scoring="f1" in the GridSearchCV)

Unbalanced classification using RandomForestClassifier in sklearn

You can pass a sample_weight argument to the Random Forest fit method:

sample_weight : array-like, shape = [n_samples] or None

Sample weights. If None, then samples are equally weighted. Splits
that would create child nodes with net zero or negative weight are
ignored while searching for a split in each node. In the case of
classification, splits are also ignored if they would result in any
single class carrying a negative weight in either child node.

In older versions there was a preprocessing.balance_weights method to generate balance weights for given samples, such that classes became uniformly distributed. It is still there, in the internal but still usable preprocessing._weights module, but it is deprecated and will be removed in future versions. I don't know the exact reasons for this.
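
If your scikit-learn version no longer has balance_weights, the closest current equivalent I am aware of is compute_sample_weight from sklearn.utils.class_weight; a minimal sketch with toy labels:

    # Sketch: a modern stand-in for the deprecated balance_weights
    import numpy as np
    from sklearn.utils.class_weight import compute_sample_weight

    y = np.array([0, 0, 0, 0, 0, 1])   # toy labels, heavily skewed
    weights = compute_sample_weight(class_weight='balanced', y=y)
    print(weights)  # minority samples receive proportionally larger weights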

Update

Some clarification, as you seem to be confused. sample_weight usage is straightforward once you remember that its purpose is to balance target classes in the training dataset. That is, if you have X as observations and y as classes (labels), then len(X) == len(y) == len(sample_weight), and each element of the sample_weight 1-d array represents the weight for the corresponding (observation, label) pair. For your case, if class 1 is represented 5 times as often as class 0 and you want to balance the class distributions, you could simply use

sample_weight = np.array([5 if i == 0 else 1 for i in y])

assigning a weight of 5 to all class-0 instances and a weight of 1 to all class-1 instances. See balance_weights, mentioned above, for a slightly craftier weight-evaluation function.
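
Putting it together, here is a minimal sketch of passing such weights into RandomForestClassifier.fit (synthetic data mirroring the 1:5 example above; the exact numbers are assumptions):

    # Sketch: feeding per-sample weights into RandomForestClassifier.fit
    # (class 0 is the minority here, so it gets the larger weight)
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=6000, weights=[1/6, 5/6],
                               random_state=42)

    sample_weight = np.array([5 if i == 0 else 1 for i in y])

    rf = RandomForestClassifier(n_estimators=200, random_state=42)
    rf.fit(X, y, sample_weight=sample_weight)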

How to handle class imbalance in sklearn random forests. Should I use sample weights or class weight parameter

Class weights are what you should be using.

Sample weights allow you to specify a multiplier for the impact a particular sample has. Weighting a sample with a weight of 2.0 roughly has the same effect as if the point was present twice in the data (although the exact effect is estimator dependent).

Class weights have the same effect, but they are used to apply a set multiplier to every sample that falls into the specified class. In terms of functionality you could use either, but class_weight is provided for convenience so that you do not have to weight each sample manually. It is also possible to combine the two, in which case the class weights are multiplied by the sample weights.

One of the main uses for sample_weight on the fit() method is to allow boosting meta-algorithms like AdaBoostClassifier to operate on existing decision tree classifiers and increase or decrease the weights of individual samples as needed by the algorithm.
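
As a hedged sketch of the convenience route with scikit-learn's RandomForestClassifier (synthetic data; 'balanced' and 'balanced_subsample' are the built-in class_weight presets):

    # Sketch: letting class_weight do the per-class weighting for you
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=10000, weights=[0.95, 0.05],
                               random_state=42)

    # 'balanced' weights classes inversely to their frequency in y;
    # 'balanced_subsample' recomputes those weights per bootstrap sample.
    rf = RandomForestClassifier(n_estimators=300,
                                class_weight='balanced_subsample',
                                random_state=42)
    rf.fit(X, y)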

Implementing Balanced Random Forest (BRF) in R using RandomForests

You can balance your random forests using case weights. Here's a simple example:

library(ranger) #Best random forest implementation in R

# Make a dataset
set.seed(43)
nrow <- 1000
ncol <- 10
X <- matrix(rnorm(nrow * ncol), ncol=ncol)
CF <- rnorm(ncol)
Y <- (X %*% CF + rnorm(nrow))[,1]
Y <- as.integer(Y > quantile(Y, 0.90))
table(Y)

#Compute weights to balance the RF
w <- 1/table(Y)
w <- w/sum(w)
weights <- rep(0, nrow)
weights[Y == 0] <- w['0']
weights[Y == 1] <- w['1']
table(weights, Y)

#Fit the RF
data <- data.frame(Y=factor(ifelse(Y==0, 'no', 'yes')), X)
model <- ranger(Y~., data, case.weights=weights)
print(model)

