Class imbalance can seriously damage the precision of your binary classifier. In this post you will learn some simple ways of evening out the sizes of your classes before training, so that your classifier cannot cheat.
The class imbalance problem
Binary classification is a very common problem in machine learning. The algorithm learns the underlying relationship between the features and the label, and then predicts the label of new instances from their feature values. When the problem is hard, algorithms sometimes take shortcuts instead of learning that relationship. For example, if the number of instances of each class in our training data is very imbalanced, we could obtain a "good" classifier simply by always predicting the majority class for any new sample. We would have created an algorithm with good accuracy (the fraction of correctly classified samples), but one that is completely useless in practice.
In this post, I will show you several techniques to correct class imbalance before training your algorithm. Since all of them rely on some kind of random selection, they can also be applied at intermediate points of your training, for example replacing the naive bootstrap sampling used in ensemble architectures.
Our data set
Before presenting the solutions, let's set up the problem. Imagine you want to predict the sign of the daily returns of the SPX500. We split the data into a train set (from 1995 to 2016) and a test set (from 2017 onwards). Fortunately for any investor, the majority of the returns are positive. The imbalance is not very large, but as we will see, it is enough to mess up our classifiers.
|  | Positive samples | Negative samples |
| --- | --- | --- |
| Train set | 2974 (54%) | 2566 (46%) |
| Test set | 241 (57%) | 181 (43%) |
As features, we will use the 42 previous log returns of the same index. That gives a train set of 5540 instances with 42 features each, and a test set of 422 samples.
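For concreteness, here is a minimal sketch of how such a dataset could be built with pandas. The price series `close` and its source are hypothetical assumptions of mine; the original pipeline is not shown in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical input: 'close' is a pandas Series of daily SPX500 closing
# prices with a DatetimeIndex (e.g. loaded from a CSV file).
def make_dataset(close: pd.Series, n_lags: int = 42) -> pd.DataFrame:
    log_ret = np.log(close).diff()                     # daily log returns
    feats = {f"lag_{i}": log_ret.shift(i) for i in range(1, n_lags + 1)}
    data = pd.DataFrame(feats)
    data["label"] = (log_ret > 0).astype(int)          # 1 = positive return, 0 = negative
    return data.dropna()

# Split by date, matching the 1995-2016 train / 2017 test split used here:
# data = make_dataset(close)
# train, test = data.loc["1995":"2016"], data.loc["2017":]
```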
I already mentioned that accuracy is not an ideal performance score, because it hides the information for each class. In this post, we will use the whole confusion matrix to identify the precision and recall of each class independently.
For simplicity, all the experiments shown here use a standard Random Forest classifier from the Python library scikit-learn (for Pythonistas: we initialized it with the parameters n_estimators=300, max_depth=6, min_samples_split=5).
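In scikit-learn that classifier is set up like this (the parameters are the ones quoted above; `random_state` is an extra assumption of mine, added for reproducibility):

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=300, max_depth=6,
                             min_samples_split=5, random_state=0)
```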
The solutions shown here try to even out the classes before applying the machine learning algorithm. That means having the same number of positive and negative examples, so that the classifier is forced to find the underlying relationship between features and labels.
The solutions
No balance
In order to show the consequences of class imbalance in our models, I first trained the algorithm on the original, imbalanced training set. The confusion matrix for the test set is:
|  | Predicted negative | Predicted positive |
| --- | --- | --- |
| Actual negative | 13 | 167 |
| Actual positive | 6 | 235 |
Please note that although the accuracy of this prediction is 58.8%, the confusion matrix clearly shows that the classifier is cheating: instead of learning from the data, it classifies the vast majority of samples as positive. This is exactly the problem we would like to correct with the following tricks.
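For reference, a sketch of this baseline run, reusing `clf` and the hypothetical `train`/`test` DataFrames from the earlier snippets (note that scikit-learn's `confusion_matrix` puts true classes on rows and predicted classes on columns):

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical feature/label arrays built from the earlier data sketch
X_train, y_train = train.drop(columns="label").values, train["label"].values
X_test,  y_test  = test.drop(columns="label").values,  test["label"].values

# Baseline: fit on the imbalanced training set as-is
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))       # rows: true class, cols: predicted class
print(classification_report(y_test, y_pred))  # per-class precision and recall
```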
Upsampling minority class
The first trick we are going to use is upsampling the negative class by repeating some of its samples until it matches the number of positive examples. From the entire negative set we randomly choose (with repetition) 408 samples and add them to the train set. The confusion matrix obtained on the test set shows that we haven't solved the problem yet.
|  | Predicted negative | Predicted positive |
| --- | --- | --- |
| Actual negative | 63 | 118 |
| Actual positive | 82 | 159 |
This is not entirely surprising: although the number of negative examples increases, we haven't added any new information, so the classifier does not learn anything useful from them.
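For reference, a minimal sketch of this random oversampling step, reusing the hypothetical `X_train`/`y_train` arrays from above (negatives encoded as 0, positives as 1):

```python
import numpy as np
from sklearn.utils import resample

# Split the training set by class
X_neg, X_pos = X_train[y_train == 0], X_train[y_train == 1]

# Draw extra negatives with replacement until both classes have the same size
# (2974 - 2566 = 408 extra samples for the train set described above)
X_extra = resample(X_neg, replace=True,
                   n_samples=len(X_pos) - len(X_neg), random_state=0)

X_up = np.vstack([X_train, X_extra])
y_up = np.concatenate([y_train, np.zeros(len(X_extra), dtype=int)])

clf.fit(X_up, y_up)
```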
Downsampling majority class
To even out the number of samples of each class, instead of making the minority class larger we can make the majority class smaller. For that, we randomly choose (with repetition) only 2566 examples of the majority class and keep all the negative examples. In this case the confusion matrix ends up being:
|  | Predicted negative | Predicted positive |
| --- | --- | --- |
| Actual negative | 109 | 72 |
| Actual positive | 149 | 92 |
It seems that with this downsampling we are overshooting: now it is the recall of the majority class that gets damaged.
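A sketch of the downsampling step, reusing `X_neg` and `X_pos` from the previous snippet (sampling the positives with replacement, as described above):

```python
import numpy as np
from sklearn.utils import resample

# Keep all negatives and sample the positives down to the same size
X_pos_down = resample(X_pos, replace=True, n_samples=len(X_neg), random_state=0)

X_down = np.vstack([X_neg, X_pos_down])
y_down = np.concatenate([np.zeros(len(X_neg), dtype=int),
                         np.ones(len(X_pos_down), dtype=int)])

clf.fit(X_down, y_down)
```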
Upsampling minority class with SMOTE
Lastly, we are going to use a more sophisticated way of upsampling our minority class. Instead of simply repeating random examples, we can create artificial examples of the minority class with a technique called SMOTE (Synthetic Minority Over-sampling Technique). The protocol is simple: first, we randomly choose a negative example e1 and a second negative example e2 that is a neighbour of e1. Then we pick a random point along the line between e1 and e2 in feature space; that point is our new example of the negative class. The synthetic examples are therefore linear combinations of the original ones.
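A minimal toy sketch of that protocol, using scikit-learn's `NearestNeighbors` (this is my own illustrative implementation, not the code used for the experiments; in practice the imbalanced-learn library provides a well-tested `SMOTE` class with a `fit_resample` method):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between a random
    minority sample e1 and one of its k nearest minority neighbours e2."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is its own nearest neighbour
    neigh = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    idx = neigh.kneighbors(X_min, return_distance=False)[:, 1:]

    new = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        e1 = rng.integers(len(X_min))          # random minority sample
        e2 = rng.choice(idx[e1])               # one of its neighbours
        new[i] = X_min[e1] + rng.random() * (X_min[e2] - X_min[e1])
    return new

# e.g. 408 synthetic negatives to even out the train set described above
# X_synth = smote(X_neg, n_new=408)
```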
With this expanded dataset the confusion matrix obtained is:
|  | Predicted negative | Predicted positive |
| --- | --- | --- |
| Actual negative | 90 | 91 |
| Actual positive | 116 | 125 |
The recall of the negative class is finally much better, and the classifier is no longer suffering so much from the class imbalance. If only the features contained some useful information for predicting the label, we would be rich by now.
To conclude
For the sake of simplicity we will leave it here, but there are even more sophisticated ways of creating artificial examples. We could, for example, focus on creating synthetic examples only at the border between the two classes (MSMOTE) to add extra information where it matters most.
So, while training your models, remember to check whether your algorithm is learning your problem or just taking a shortcut.