The data imbalance problem, or more precisely the class imbalance problem, occurs when the overwhelming majority of your data belongs to one class (the majority class) and you have very few samples of your minority class.
You don't really have data imbalance problem if:
You are not really interested in correctly classifying the minority class (which rarely happens!).
Your training dataset is so large that, even though the class ratio is very imbalanced, you still have plenty of samples in the minority class.
Accuracy is not your best choice:
Imagine you have 1,000 samples (990 belonging to class A and 10 to class B). Your smart algorithm decides to classify everything as A! Its accuracy in this case is 99%!
Looks great! Except if you are trying to develop an AI-driven disease recognition kit, you are heading towards another Theranos story.
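A quick sketch of the 990/10 example above, using scikit-learn's metrics (assuming it is installed), shows how accuracy hides the failure that recall on the minority class exposes:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 990 samples of class A (0) and 10 of class B (1), as in the example above
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)  # the "smart" algorithm: classify everything as A

accuracy = accuracy_score(y_true, y_pred)             # 0.99 -- looks great!
recall_b = recall_score(y_true, y_pred, pos_label=1)  # 0.0 -- misses every single B
```

The 99% accuracy is real, but recall on class B is exactly zero: the classifier never finds a single diseased sample.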
Depending on the price (and consequences) of false positives and false negatives, you can decide which one to focus on more.
For instance, if a false positive costs you $100k and a false negative costs you $10, then you know your metric should focus more on minimizing false positives!
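With those example costs ($100k per false positive, $10 per false negative), the choice between two classifiers becomes a simple expected-cost calculation. The confusion counts below are made up for illustration:

```python
# Costs from the example above: a false positive costs $100k, a false negative $10
COST_FP, COST_FN = 100_000, 10

def total_cost(false_positives, false_negatives):
    """Total dollar cost of a classifier's errors on a test set."""
    return false_positives * COST_FP + false_negatives * COST_FN

# Hypothetical classifiers evaluated on the same test set:
cost_a = total_cost(false_positives=3, false_negatives=50)   # 3 * 100k + 50 * 10
cost_b = total_cost(false_positives=0, false_negatives=200)  # 0 * 100k + 200 * 10
```

Classifier B makes four times as many mistakes overall, yet it is far cheaper, because every one of its mistakes is the inexpensive kind.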
SMOTE: An interesting generalization of oversampling is synthesizing new data points in the minority class rather than just oversampling with replacement. It uses nearest neighbors and random numbers to generate a new synthetic (fake) data point:
Randomly choose a data point and call it "main".
Find its k (e.g., five) nearest neighbors.
Randomly choose one of those k neighbors.
Draw a line between your "main" point and the randomly chosen neighbor.
Create a random point on the line you drew; that synthetic point gets the minority-class label.
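The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not a production implementation (for real use, libraries such as imbalanced-learn provide SMOTE):

```python
import numpy as np

def smote_sample(X_minority, k=5, seed=None):
    """Generate one synthetic minority-class point following the steps above."""
    rng = np.random.default_rng(seed)
    # 1. Randomly choose a "main" point
    main = X_minority[rng.integers(len(X_minority))]
    # 2. Find its k nearest neighbors (index 0 is the point itself, so skip it)
    dists = np.linalg.norm(X_minority - main, axis=1)
    neighbors = X_minority[np.argsort(dists)[1 : k + 1]]
    # 3. Randomly choose one of those k neighbors
    neighbor = neighbors[rng.integers(len(neighbors))]
    # 4-5. Create a random point on the line between "main" and the neighbor
    return main + rng.random() * (neighbor - main)

X_min = np.random.default_rng(0).normal(size=(20, 2))  # toy minority class
synthetic = smote_sample(X_min, k=5, seed=1)
```

Each synthetic point is an interpolation between two real minority samples, so it lands inside the region the minority class already occupies.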
Cost-sensitive Classification: Common classification algorithms try to minimize the classification error (if you misclassify A as B, the algorithm is penalized by an increase in the cost function). But for a cost-sensitive classifier, not all misclassifications are created equal! They penalize the algorithm more when it misclassifies the minority class (the minority class gets a higher weight).
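In scikit-learn, many classifiers expose this weighting directly through the `class_weight` parameter; `"balanced"` weights each class inversely to its frequency. A minimal sketch on synthetic data (the dataset here is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = np.array([0] * 990 + [1] * 10)
X[y == 1] += 2.0  # shift the minority class so it is learnable

# "balanced" reweights classes inversely to their frequency, so every
# misclassified minority sample costs the optimizer roughly 99x more
clf = LogisticRegression(class_weight="balanced").fit(X, y)
predictions = clf.predict(X)
```

Without the weighting, a model on data this skewed can get away with predicting the majority class everywhere; with it, the minority class actually shows up in the predictions.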
There is much more out there, but pragmatically these are the most useful techniques for dealing with imbalanced data. I also want to point out that sometimes it's just easier to re-frame the problem as an anomaly detection problem! In anomaly detection, the question becomes how to define (and classify) normal behavior.
Hybrid versions of the aforementioned techniques are also used.
How about Convolutional Neural Nets (CNN)? (Updates)
Focal Loss: This relatively new method was brought to my attention by a friend. It was designed mainly to address the extreme class imbalance in one-stage object detectors (and is benchmarked against two-stage ones), and it upgrades the cross-entropy loss function by introducing a focusing factor. The focusing factor gives higher weight to the "hard" examples. Read the paper by Facebook AI Research.
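The binary form of the loss from that paper is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class and gamma is the focusing factor. A plain-NumPy sketch:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross entropy down-weighted for easy examples.

    p: predicted probability of class 1; y: true label in {0, 1}.
    gamma is the focusing factor; alpha balances the two classes.
    """
    p_t = np.where(y == 1, p, 1 - p)          # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy example (p_t = 0.9) is down-weighted far more than a hard one (p_t = 0.1)
easy = focal_loss(np.array([0.9]), np.array([1]))
hard = focal_loss(np.array([0.1]), np.array([1]))
```

With gamma = 0 this reduces to alpha-weighted cross entropy; raising gamma shrinks the contribution of well-classified examples so training concentrates on the hard ones.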