At Cmotions we’ve always had a lot of discussion amongst colleagues about whether or not to balance our datasets when building a predictive model. Everyone has their own personal preference and way of working, but which approach is “the best”? We simply didn’t know until we stumbled upon this paper, written by Jacques Wainer and Rodrigo A. Franceschinelli in October 2018. Unlike a lot of other research papers, it focuses on practitioners (us) rather than researchers: in general, a practitioner has a couple of days or weeks to build a model on a dataset, while a researcher may spend several years working with the same dataset. Because not everybody likes to dive into scientific papers, I’ve tried to summarize their conclusions in this blog as best I can.
Research setting
Let’s start with the setting of the research:
- Binary classification problems
- 82 datasets
- The datasets were altered (if possible) to have an imbalance between 0.1% (severe) and 5% (moderate)
- Focus on strong general classifiers:
- Focus on easy to implement balancing strategies:
  - Baseline: no balancing
  - Class weight: assign more weight to the small class. The general setting used here is that the large class has a weight of 1 and the small class a weight equal to the inverse of the imbalance rate. Thus, if this rate is 1%, the weight of the small class is 100
  - SMOTE (synthetic minority over-sampling): for an example of the small class, use KNN to find its nearest neighbours within that class, and create a new data point at a random position between the example and one of those neighbours
  - Underbagging: bagging of the base classifier, where each instance is trained on a balanced sample of the dataset in which the small class is preserved and the large class is randomly undersampled. This strategy has a parameter n, the number of sampled datasets created. In this research n was treated as a hyperparameter and tuned for each dataset
- Focus on most used metrics (in research):
  - Area under the ROC curve (AUC): the estimated probability that a randomly selected positive example will have a higher score than a randomly selected negative example
  - Accuracy (acc): the proportion of correct predictions
  - Balanced accuracy (bac): the arithmetic mean of recall (the proportion of positive examples predicted correctly) and specificity (the proportion of negative examples predicted correctly)
  - F-measure (F1): the harmonic mean of precision (the proportion of positive predictions that are correct) and recall
  - G-mean (gmean): the geometric mean of recall and specificity
  - Matthews correlation coefficient (mcc): the Pearson correlation between the predicted and the correct class
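To make the metric definitions above a bit more tangible, here is a minimal sketch in plain Python that computes them from confusion-matrix counts. The function names `imbalance_metrics` and `pairwise_auc` and the example numbers are my own illustration, not something taken from the paper. Note that F1 is computed from precision and recall, which is the standard definition.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Threshold-based metrics from confusion-matrix counts.

    The positive class is assumed to be the small class.
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # share of positives found
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # share of negatives found
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # share of correct positive predictions

    acc = (tp + tn) / (tp + fp + tn + fn)
    bac = (recall + specificity) / 2                    # arithmetic mean
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)             # harmonic mean
    gmean = math.sqrt(recall * specificity)             # geometric mean
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {"acc": acc, "bac": bac, "f1": f1, "gmean": gmean, "mcc": mcc}


def pairwise_auc(scores_pos, scores_neg):
    """AUC as the share of (positive, negative) pairs in which the positive
    example gets the higher score; ties count as half a win."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical example: 1,000 cases with 1% positives and a model that
# predicts "negative" for every case.
always_negative = imbalance_metrics(tp=0, fp=0, tn=990, fn=10)
print(always_negative)
```

In this made-up example accuracy comes out at 0.99 while balanced accuracy is 0.5 and F1, G-mean and mcc are all 0, which shows why the choice of metric matters so much for imbalanced data.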
The researchers mainly made these choices based on what we as practitioners have in our most-used toolbox. Personally, I can support most of these decisions, although I must say I very rarely use SVM.
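For readers who like to see the three balancing strategies from the list above in code, here is a rough sketch of the core idea behind each, in plain Python. It is not the implementation used in the paper: the function names, the toy data and the seed are my own assumptions, and in practice you would use a library implementation instead.

```python
import math
import random


def class_weights(y):
    """Large class (label 0) gets weight 1, small class (label 1) gets the
    inverse of the imbalance rate."""
    return {0: 1.0, 1: len(y) / sum(y)}


def smote_point(minority, k, rng):
    """Create one synthetic point on the line segment between a small-class
    example and one of its k nearest small-class neighbours."""
    base = rng.choice(minority)
    neighbours = sorted((p for p in minority if p is not base),
                        key=lambda p: math.dist(base, p))[:k]
    other = rng.choice(neighbours)
    gap = rng.random()  # random position between base and other
    return [b + gap * (o - b) for b, o in zip(base, other)]


def underbagging_samples(y, n, rng):
    """Return n balanced index samples: every small-class row plus an equally
    sized random draw (without replacement) from the large class."""
    small = [i for i, label in enumerate(y) if label == 1]
    large = [i for i, label in enumerate(y) if label == 0]
    return [small + rng.sample(large, len(small)) for _ in range(n)]


rng = random.Random(42)                     # arbitrary seed for this sketch
y = [1] * 5 + [0] * 495                     # toy labels with a 1% small class
weights = class_weights(y)                  # small class gets weight 100
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_point(minority, k=2, rng=rng)
samples = underbagging_samples(y, n=3, rng=rng)
```

With underbagging you would then train one copy of the base classifier on each of the `n` samples and combine their predictions, for example by averaging the predicted probabilities.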
The key takeaways when looking at these settings:
- The mildest class imbalance studied is 5%, which is considered moderate in this paper, so more balanced datasets are not taken into account here
- This only holds for binary problems and not multiclass
- The researchers chose strong general classifiers on purpose, since we would use them in practice too. Just keep in mind that this affects how much a balancing strategy changes the result, since some of these classifiers can deal with imbalance to some extent by themselves
Research questions
In this paper Wainer and Franceschinelli answer multiple questions; we will focus on the five that are most relevant to us. These boil down to determining the best balancing strategy in general, for a given metric, and for severely or moderately imbalanced data. They also looked at which classifier works best with which balancing strategy, or without any balancing strategy at all. These are all questions we as Data Scientists have to think about every time we build a predictive model.
If multiple answers are given for the same metric in the tables below, keep in mind that they are listed in no particular order: the differences between them are not significant, so we can’t rank them.
What is in general the best strategy for imbalanced data (for each metric)?
Metric | Best-performing balancing strategy |
AUC | Baseline, Class weight |
acc | Baseline, Class weight |
bac | Underbagging |
f1 | SMOTE |
gmean | Underbagging |
mcc | Baseline, Class weight, SMOTE |
What is the best balancing strategy for moderately (5%) imbalanced data?
Metric | Best-performing balancing strategy |
AUC | Baseline, Class weight, SMOTE |
acc | Baseline, Class weight |
bac | Underbagging |
f1 | SMOTE |
gmean | Underbagging |
mcc | SMOTE |
What is the best balancing strategy for severely (0.1%) imbalanced data?
Metric | Best-performing balancing strategy |
AUC | Baseline, Class weight |
acc | Baseline, Class weight |
bac | Underbagging |
f1 | Baseline, Class weight, SMOTE, Underbagging |
gmean | Underbagging |
mcc | Baseline, Class weight, SMOTE, Underbagging |
What is the best combination of base classifier and balancing strategy?
Metric | Best-performing combination of classifier and balancing strategy |
AUC | RF + Baseline, RF + SMOTE, RF + Class weight, XGB + Baseline, XGB + Class weight, XGB + SMOTE |
acc | XGB + Baseline, XGB + Class weight, RF + Class weight, RF + Baseline |
bac | XGB + Underbagging |
f1 | XGB + SMOTE |
gmean | RF + Underbagging, XGB + Underbagging |
mcc | XGB + SMOTE |
What is the best base classifier when not using a balancing strategy?
Metric | Best-performing base classifier |
AUC | RF, XGB |
acc | RF, XGB |
bac | XGB |
f1 | XGB |
gmean | XGB |
mcc | XGB |
Conclusion
The first conclusion will probably be no surprise for most of us. The XGB classifier is a very strong and stable classifier, which performs better than the other classifiers in most use cases.
The next conclusion surprised us a lot more. We never expected that balancing data isn’t even necessary in most cases, let alone that it is even less necessary if your data is severely imbalanced.
The last conclusion is that the best balancing strategy strongly depends on the metric chosen to evaluate model performance. This means we should base our balancing strategy on the metric we think fits the use case best. Unfortunately, this is never an easy decision to make, as it depends on a lot of factors.
These are the guidelines given in the paper’s conclusion about this:
- If the user is using AUC as the metric of quality then baseline or class weight strategies are likely the best alternatives (regardless of whether the imbalance is moderate (5%) or severe (0.1%)). If the practitioner has control over which base classifier to use, then one should opt for a gradient boosting or a random forest.
- If one is using accuracy as metric then baseline or class weight are the best strategies and the best classifier is gradient boosting.
- If one is using f-measure as metric then SMOTE is a good strategy for any imbalance rate and the best classifier is gradient boosting.
- If one is using G-mean as metric then Underbagging is the best strategy for any imbalance rate and the best classifiers are random forest and gradient boosting.
- If one is using Matthews correlation coefficient as metric then almost all strategies perform equally well in general, but SMOTE seems to perform better for less imbalanced data and the best classifier is gradient boosting.
- If one is using the balanced accuracy as metric then Underbagging is the best strategy for any imbalance rate and the best classifiers are random forest and gradient boosting.
How do we use this in our day-to-day work?
You might wonder: did this settle all our discussions once and for all? To be honest, not really, since we just love to debate and question every decision we make. That is how we make sure we keep growing and learning. But it did help us a lot in our daily work.
The biggest insight for most of us was that balancing the data is often not even necessary, and the most counterintuitive is that this is even more the case for severely imbalanced datasets.
The biggest change for me actually had nothing to do with choosing the right balancing strategy, but with more deliberately picking a metric (or a couple of metrics) when evaluating a model. When it comes to picking a balancing strategy, my own preferred way of working is still to simply try a couple of balancing strategies and evaluate all of the resulting models. Maybe some of you are horrified by this, but it works for me.
What is your preferred way of working when it comes to choosing a metric and/or choosing a balancing strategy? I’m always looking for people who can convince me to change my habits 😉
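To give an idea of what “simply trying a couple of strategies” can look like, here is a minimal sketch using scikit-learn. Everything in it is an assumption for illustration: the synthetic dataset with roughly 5% positives, the random forest as base classifier, the hand-rolled `Underbagging` class and the 0.5 threshold. SMOTE is left out because it is not part of scikit-learn itself; it is available in the separate imbalanced-learn package.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, f1_score,
                             matthews_corrcoef, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 5% positives (not data from the paper).
X, y = make_classification(n_samples=4000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3,
                                          random_state=0)


def fit_baseline():
    return RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)


def fit_class_weight():
    # Large class weight 1, small class the inverse of the imbalance rate.
    weights = {0: 1.0, 1: 1.0 / y_tr.mean()}
    return RandomForestClassifier(n_estimators=100, class_weight=weights,
                                  random_state=0).fit(X_tr, y_tr)


class Underbagging:
    """Averages n models, each trained on all small-class rows plus an
    equally sized random sample of the large class."""

    def __init__(self, n=10, seed=0):
        self.n, self.rng, self.models = n, np.random.default_rng(seed), []

    def fit(self, X, y):
        small, large = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
        for i in range(self.n):
            drawn = self.rng.choice(large, len(small), replace=False)
            idx = np.concatenate([small, drawn])
            model = RandomForestClassifier(n_estimators=50, random_state=i)
            self.models.append(model.fit(X[idx], y[idx]))
        return self

    def predict_proba(self, X):
        return np.mean([m.predict_proba(X) for m in self.models], axis=0)


def evaluate(model):
    proba = model.predict_proba(X_te)[:, 1]
    pred = (proba >= 0.5).astype(int)
    recall = recall_score(y_te, pred)
    specificity = recall_score(y_te, pred, pos_label=0)
    return {
        "auc": roc_auc_score(y_te, proba),
        "acc": accuracy_score(y_te, pred),
        "bac": balanced_accuracy_score(y_te, pred),
        "f1": f1_score(y_te, pred, zero_division=0),
        "gmean": float(np.sqrt(recall * specificity)),
        "mcc": matthews_corrcoef(y_te, pred),
    }


results = {
    "baseline": evaluate(fit_baseline()),
    "class weight": evaluate(fit_class_weight()),
    "underbagging": evaluate(Underbagging(n=10).fit(X_tr, y_tr)),
}
for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

A single train/test split like this is only a quick look. For a real comparison you would want cross-validation, and you would pick the metric that matters for your use case before looking at the numbers.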