The data imbalance problem, or more precisely the class imbalance problem, occurs when an overwhelming majority of your data belongs to one class (the majority class) and you have very few samples of your minority class.
You don't really have a data imbalance problem if:
You are not really interested in correctly classifying the minority class (rarely happens!).
Your training dataset is so large that, even though the class ratio is very imbalanced, you still have plenty of samples in your minority class.
Accuracy is not your best choice:
Imagine you have 1,000 samples (990 belong to class A and 10 to class B). Your smart algorithm decides to classify everything as A! Your accuracy in this case is 99%!
Looks great! Except that if you are trying to develop an AI-driven disease recognition kit, you are heading toward another Theranos story.
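The 99% figure above is easy to verify: a minimal sketch (the all-A "classifier" is of course a strawman, not a real model) showing that a classifier which never predicts B still scores 99% accuracy while missing every minority sample:

```python
# Toy dataset from the example: 990 samples of class A, 10 of class B.
y_true = ["A"] * 990 + ["B"] * 10
y_pred = ["A"] * 1000  # the "smart" algorithm: predict A for everything

# Accuracy looks excellent...
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# ...but recall on the minority class B is zero.
recall_b = sum(t == p == "B" for t, p in zip(y_true, y_pred)) / 10

print(accuracy)  # 0.99
print(recall_b)  # 0.0
```

This is why metrics such as recall, precision, or F1 on the minority class tell you far more than accuracy here.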
Depending on the price (and consequences) of false positives and false negatives, you can decide which one your metric should focus on.
For instance, if a false positive costs you $100k and a false negative costs you $10, then your metric should focus on minimizing false positives!
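One simple way to make this concrete is to score models by total dollar cost instead of error count. A sketch using the costs from the example above (the two classifiers' error counts are made-up numbers for illustration):

```python
# Costs from the example: a false positive is far more expensive.
COST_FP = 100_000  # dollars lost per false positive
COST_FN = 10       # dollars lost per false negative

def total_cost(false_positives: int, false_negatives: int) -> int:
    """Total misclassification cost in dollars."""
    return false_positives * COST_FP + false_negatives * COST_FN

# Classifier 1: few false negatives, but 3 costly false positives.
print(total_cost(false_positives=3, false_negatives=50))   # 300500

# Classifier 2: many more false negatives, but only 1 false positive.
print(total_cost(false_positives=1, false_negatives=200))  # 102000
```

By raw error count classifier 1 looks better (53 vs. 201 errors), but in dollar terms classifier 2 wins by a wide margin.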
How to Deal with the problem?
There are 3 types of approaches:
Focus on Data: Using various sampling and boosting techniques, etc. The focus is to obtain a more balanced dataset.
Focus on Algorithm: Most of these approaches focus on modifying the loss function.
Hybrid Methods: Combination of sampling and algorithmic changes.
Oversampling and Under-sampling: Perhaps the most common way of dealing with data imbalance is to oversample the minority class or under-sample the majority class.
Very informative blog post on oversampling in Python
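Both ideas fit in a few lines of NumPy. A minimal sketch on a made-up toy dataset (the 990/10 split mirrors the earlier example; in practice you would resample only the training split):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 990 majority-class (0) vs 10 minority-class (1) samples.
X = rng.normal(size=(1000, 2))
y = np.array([0] * 990 + [1] * 10)

majority = np.where(y == 0)[0]
minority = np.where(y == 1)[0]

# Oversampling: draw minority indices WITH replacement up to the majority size.
over_idx = rng.choice(minority, size=len(majority), replace=True)
X_over = np.vstack([X[majority], X[over_idx]])
y_over = np.concatenate([y[majority], y[over_idx]])

# Under-sampling: draw a majority subset WITHOUT replacement down to the minority size.
under_idx = rng.choice(majority, size=len(minority), replace=False)
X_under = np.vstack([X[under_idx], X[minority]])
y_under = np.concatenate([y[under_idx], y[minority]])

print(np.bincount(y_over))   # [990 990] -- balanced by duplication
print(np.bincount(y_under))  # [10 10]   -- balanced by discarding data
```

Note the trade-off: oversampling duplicates minority points (risking overfitting to them), while under-sampling throws away majority data.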
SMOTE: An interesting generalization of oversampling is to synthesize new data points in the minority class rather than just oversampling with replacement. It uses nearest neighbors and random numbers to generate a new synthetic (fake) data point:
Randomly choose a data point and call it "main".
Find its k (here, five) nearest neighbors.
Randomly choose one of those 5 neighboring points.
Draw a line between your "main" point and the randomly chosen neighbor.
Create a random point on the line you drew and label it as the minority class.
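The steps above can be sketched directly in NumPy. This is a bare-bones illustration of one synthetic point, not the full SMOTE algorithm (real implementations, e.g. `imbalanced-learn`, repeat this until the classes are balanced):

```python
import numpy as np

def smote_point(X_min, k=5, rng=None):
    """Generate one synthetic minority sample, following the steps above."""
    rng = rng or np.random.default_rng()
    # 1. Randomly choose a minority point and call it "main".
    i = rng.integers(len(X_min))
    main = X_min[i]
    # 2. Find its k nearest minority neighbors (index 0 is the point itself).
    dists = np.linalg.norm(X_min - main, axis=1)
    neighbors = np.argsort(dists)[1:k + 1]
    # 3. Randomly choose one of those neighbors.
    j = rng.choice(neighbors)
    # 4-5. Create a random point on the line segment between them.
    gap = rng.random()  # uniform in [0, 1)
    return main + gap * (X_min[j] - main)

rng = np.random.default_rng(42)
X_min = rng.normal(size=(10, 2))          # ten minority samples in 2-D
synthetic = smote_point(X_min, k=5, rng=rng)
print(synthetic.shape)  # (2,)
```

The synthetic point inherits the minority label, so the new sample is "fake" but plausible rather than an exact duplicate.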
Cost-sensitive Classification: Common classification algorithms try to minimize the classification error (if you misclassify A as B, I will penalize the algorithm by increasing the cost function). But for a cost-sensitive classifier, not all misclassifications are created equal! They punish the algorithm more when it misclassifies the minority class (the minority class gets a higher weight).
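In scikit-learn this weighting is exposed through the `class_weight` parameter. A sketch on a made-up 990/10 dataset (the Gaussian clusters and seed are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy imbalanced dataset: two overlapping 2-D Gaussian clusters.
X = np.vstack([rng.normal(0.0, 1.0, (990, 2)),   # majority class 0
               rng.normal(2.0, 1.0, (10, 2))])   # minority class 1
y = np.array([0] * 990 + [1] * 10)

# Unweighted: both classes penalized equally in the loss.
plain = LogisticRegression().fit(X, y)

# Cost-sensitive: "balanced" sets weights inversely proportional to class
# frequencies, so each minority mistake costs roughly 99x more.
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# The weighted model typically predicts the minority class far more often.
print(plain.predict(X).sum(), weighted.predict(X).sum())
```

The same idea appears elsewhere as per-class weights in a neural network's loss (e.g. a weighted cross entropy) rather than as a resampling step.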
There is much more out there, but pragmatically these are the most useful techniques for dealing with imbalanced data. I also want to point out that sometimes it's just easier to re-frame the problem as an anomaly detection problem! In anomaly detection the question becomes how to define (and classify) normal behavior.
Hybrid versions of the aforementioned techniques are also used.
How about Convolutional Neural Nets (CNN)? (Updates)
The methods mentioned above work fine, but some interesting nuances emerge that are not necessarily true for "classical" machine learning techniques. Here is a quick-read experimental paper on the topic.
Focal Loss: This relatively new method was brought to my attention by one of my friends. The method is mainly designed for one- and two-stage object detection and upgrades the cross-entropy loss function by introducing a focusing factor. The focusing factor gives higher weight to the "hard" examples. Read the paper by Facebook AI Research.
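The binary form of the loss from the paper is short enough to sketch in NumPy (the gamma and alpha defaults below are the values the paper reports working well, but tune them for your problem):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross entropy scaled by (1 - p_t)**gamma.

    p     : predicted probability of the positive class
    y     : true 0/1 label
    gamma : focusing factor -- larger gamma down-weights easy examples more
    alpha : class-balancing weight for the positive class
    """
    p_t = np.where(y == 1, p, 1 - p)            # prob. of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# A confidently-correct ("easy") example contributes almost nothing,
# while a poorly-classified ("hard") example keeps a large loss.
easy = focal_loss(np.array([0.95]), np.array([1]))
hard = focal_loss(np.array([0.30]), np.array([1]))
print(float(easy[0]) < float(hard[0]))  # True
```

With gamma = 0 this reduces to ordinary (alpha-weighted) cross entropy, which is a handy sanity check when implementing it in a real training loop.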