Dealing with the class imbalance in binary classification
Both weighting (cost-sensitive) and thresholding are valid forms of cost-sensitive learning. In the briefest terms, you can think of the two as follows:
Weighting
Essentially one is asserting that the ‘cost’ of misclassifying the rare class is worse than misclassifying the common class. This is applied at the algorithmic level in algorithms such as SVMs, ANNs, and Random Forests. The limitation here is whether the algorithm can deal with weights at all. Furthermore, many applications of weighting are trying to avoid the more serious kind of misclassification (e.g. classifying someone who has pancreatic cancer as not having cancer). In such circumstances, you know why you want to make sure you classify specific classes correctly even in imbalanced settings. Ideally you want to optimize the cost parameters as you would any other model parameter.
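A common way to pick such weights is the inverse-frequency heuristic (the same formula scikit-learn exposes as class_weight='balanced'). A minimal sketch in plain Python; balanced_class_weights is an illustrative helper, not a library function:

```python
from collections import Counter

def balanced_class_weights(y):
    """Inverse-frequency weights, matching scikit-learn's class_weight='balanced':
    w_c = n_samples / (n_classes * count_c)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

y = [0] * 90 + [1] * 10            # 9:1 imbalance
weights = balanced_class_weights(y)
# the rare class 1 gets weight 5.0, the common class 0 gets ~0.56,
# so each misclassified minority instance costs ~9x more
```

These weights would then be passed to whatever cost-sensitive training routine the algorithm provides (and, as the text says, ideally tuned like any other hyperparameter).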
Thresholding
If the algorithm returns probabilities (or some other score), thresholding can be applied after a model has been built. Essentially you move the classification threshold from 50-50 to an appropriate trade-off level. This can typically be optimized by generating a curve of the evaluation metric (e.g. F-measure) against the threshold. The limitation here is that you are making absolute trade-offs: any modification of the cutoff will in turn decrease the accuracy of predicting the other class. If the probabilities for the majority of your common-class instances are exceedingly high (e.g. mostly above 0.85), you are more likely to have success with this method. It is also algorithm-independent (provided the algorithm returns probabilities).
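The curve-based optimization can be sketched as a simple grid sweep over candidate cutoffs on a validation set, keeping the one with the best F-measure. Both helpers below are illustrative, not from any library:

```python
def f_measure(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, scores, grid=None):
    """Pick the cutoff that maximizes F-measure on a held-out validation set."""
    grid = grid or [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda t: f_measure(y_true, [int(s >= t) for s in scores]))

y_val = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.8, 0.9]
cutoff = best_threshold(y_val, scores)
```

In practice the sweep should be done on data not used for training, otherwise the chosen cutoff is itself overfit.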
Sampling
Sampling is another common option applied to imbalanced datasets to bring some balance to the class distributions. There are essentially two fundamental approaches.
Under-sampling
Extract a smaller set of the majority instances and keep all of the minority. This results in a smaller dataset where the class distribution is closer to balanced; however, you have discarded data that may have been valuable. It can be a good fit when you have a very large amount of data to begin with.
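Random under-sampling is only a few lines; a minimal sketch (the undersample helper and its parameter names are illustrative, not from a library):

```python
import random

def undersample(X, y, majority=0, seed=0):
    """Randomly keep only as many majority-class rows as there are minority rows."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label != majority]
    majority_idx = [i for i, label in enumerate(y) if label == majority]
    kept = minority_idx + rng.sample(majority_idx, len(minority_idx))
    return [X[i] for i in kept], [y[i] for i in kept]

X = list(range(100))
y = [0] * 90 + [1] * 10
X_bal, y_bal = undersample(X, y)   # 10 rows of each class remain
```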
Over-sampling
Increase the number of minority instances by replicating them. This results in a larger dataset that retains all the original data, but the duplicated rows may introduce bias. As the dataset grows, you may also begin to hurt computational performance.
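The replication can be sketched as sampling minority rows with replacement until the classes match (the oversample helper is illustrative):

```python
import random

def oversample(X, y, minority=1, seed=0):
    """Replicate minority rows (sampling with replacement) until both classes
    have as many rows as the majority class."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_n = len(y) - len(minority_idx)
    extra = rng.choices(minority_idx, k=majority_n - len(minority_idx))
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

X = list(range(100))
y = [0] * 90 + [1] * 10
X_big, y_big = oversample(X, y)    # 90 rows of each class, all originals kept
```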
Advanced Methods
There are additional, more ‘sophisticated’ methods that help address the potential bias. These include SMOTE, SMOTEBoost, and EasyEnsemble, as referenced in this prior question regarding imbalanced datasets and CSL.
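The core idea of SMOTE, instead of duplicating minority rows, is to synthesize new ones by interpolating between a minority point and one of its nearest minority neighbours. A toy sketch of that idea (not the reference implementation; the imbalanced-learn library provides a production version):

```python
import random

def smote_sketch(minority_points, n_new, k=2, seed=0):
    """Toy SMOTE: pick a minority point, pick one of its k nearest minority
    neighbours, and place a synthetic point on the segment between them."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority_points)
        neighbours = sorted((p for p in minority_points if p is not x),
                            key=lambda p: sq_dist(x, p))[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + lam * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_sketch(minority, 5)
# every synthetic point lies between two real minority points
```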
Model Building
One further note regarding building models with imbalanced data: keep your evaluation metric in mind. For example, metrics such as the F-measure don’t take the true negative rate into account. Therefore, in imbalanced settings it is often recommended to use metrics such as Cohen’s kappa.
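To see why this matters, Cohen's kappa corrects observed agreement for the agreement expected by chance, so a classifier that exploits the imbalance by always predicting the majority class scores zero despite high accuracy. A minimal sketch:

```python
def cohen_kappa(y_true, y_pred):
    """kappa = (p_o - p_e) / (1 - p_e): observed agreement corrected for the
    agreement expected by chance given each side's label frequencies."""
    n = len(y_true)
    p_o = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    labels = set(y_true) | set(y_pred)
    p_e = sum((sum(1 for t in y_true if t == c) / n) *
              (sum(1 for p in y_pred if p == c) / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)

y_true = [0] * 90 + [1] * 10
always_majority = [0] * 100
# accuracy is 0.90, yet kappa is 0.0: the model adds nothing over chance
```

(scikit-learn ships this as sklearn.metrics.cohen_kappa_score.)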
Issues with imbalanced dataset in case of binary classification
Make sure you're dropping the target variable from your features before feeding the data to the classifier:
X = df.drop('target', axis=1)
y = df['target']
I'd also check whether some independent variables are highly correlated with the target. That may give you an idea of what causes an unrealistically perfect classification:
import seaborn as sns
# keep the target in the frame here, so its row/column appears in the heatmap
sns.heatmap(df.corr())
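If you just want the numbers rather than a heatmap, the check reduces to computing each feature's Pearson correlation with the target. A dependency-free sketch (the variable names here are illustrative toy data):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

target = [0, 1, 0, 1, 1, 0]
leaky = [10 * t for t in target]          # a rescaled copy of the target
honest = [0.2, 0.9, 0.4, 0.1, 0.8, 0.3]
# pearson(leaky, target) is exactly 1.0: a telltale sign of target leakage
```

A feature correlating at (or near) 1.0 with the target is the usual culprit behind “perfect” classifiers.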
Dealing with class imbalance with mlr3
To answer your questions:
I am afraid that this approach will also perform class balancing with new data predicting.
This is not correct; where did you get this?
Am I correct not to balance classes in testing data?
Class balancing usually works by adding or removing rows (or adjusting weights). None of those steps should be applied during prediction, as we want exactly one predicted value for each row in the data. Weights, on the other hand, usually have no effect during the prediction phase anyway.
Your assumption is correct.
If so, is there a way of doing this in mlr3?
Just use the PipeOp as described in the blog post. During training it will do the specified over- or under-sampling, while it does nothing during prediction.
Cheers,