Balancing Your Target Feature with the SMOTE() Function
Hello there!
Machine learning algorithms generally predict a class more accurately the more examples of that class they see during training.
A balanced target doesn't always come that easily, especially when it comes to rare events. That's when balancing techniques like SMOTE (Synthetic Minority Over-sampling Technique) come in handy.
SMOTE generates artificial samples of the minority class by interpolating between existing minority samples and their nearest neighbors, so the new rows mirror the statistical relations in the real data while balancing the dataset.
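To make the idea concrete, here is a minimal sketch of the interpolation step at the heart of SMOTE. It is a simplified illustration, not the library's actual implementation: the function name `synthetic_point` and the two example points are made up for this post, and the neighbor search is skipped entirely.

```python
import numpy as np

def synthetic_point(sample, neighbor, rng):
    """Return a random point on the segment between a minority sample
    and one of its minority-class neighbors."""
    gap = rng.random()  # uniform draw in [0, 1)
    return sample + gap * (neighbor - sample)

rng = np.random.default_rng(0)

# Two hypothetical minority-class rows with two features each.
sample = np.array([1.0, 2.0])
neighbor = np.array([3.0, 6.0])

new_point = synthetic_point(sample, neighbor, rng)
print(new_point)
```

Because the new row always falls between two real minority rows, it stays inside the region where the minority class already lives.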
To show how it works in practice, I'll use a dataset that represents one of the most famous cases of rare events, which also happens to be my field of work.
Fraud detection on credit card transactions!
You can find the dataset I'll be working with on Kaggle:
This file contains many credit card transactions. The last feature, "Class", tells you which transactions actually turned out to be fraud.
First, I'll import the CSV into a pandas DataFrame. Let me also show you how many records we have of each class.
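A minimal sketch of that step could look like the one below. To keep it runnable anywhere, it reads a tiny made-up CSV from memory instead of the Kaggle file. The rows and the column subset (V1, V2, Amount) are assumptions for illustration only; with the real file you would pass its path to `pd.read_csv()` instead.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the Kaggle CSV (values are invented).
# With the downloaded file, replace sample_csv with the path to it.
sample_csv = io.StringIO(
    "V1,V2,Amount,Class\n"
    "0.1,1.2,10.0,0\n"
    "0.3,0.8,25.5,0\n"
    "-1.1,0.4,3.2,0\n"
    "0.7,-0.2,99.9,0\n"
    "2.5,-3.1,1.0,1\n"
)

DF = pd.read_csv(sample_csv)

# Count how many rows belong to each class.
class_counts = DF["Class"].value_counts()
print(class_counts)
```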
You'll also find what each feature represents at the link above. I'm not going to bring that information into play, since we're only interested in the balancing method.
As we've seen, only 492 transactions in the entire dataset are fraudulent. In the following lines of code, I'll import SMOTE and separate the features from the target.
DF is the name of my pandas DataFrame. It's also a good idea to create a variable to hold the feature names (you'll see why in a moment).
I use len(DF.columns) simply because I know my target is the last column of the DataFrame. That way I don't have to count the columns beforehand: the code always points at the last column, no matter how many there are.
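The split described above can be sketched roughly like this. The small DataFrame is a made-up stand-in for the real data, and `X` and `y` are names chosen for this sketch. `DF` and `FeatureNames` follow the names used in the post.

```python
import pandas as pd

# Made-up stand-in for the fraud DataFrame; the target is the last column.
DF = pd.DataFrame({
    "V1": [0.1, 0.3, -1.1, 0.7, 2.5],
    "V2": [1.2, 0.8, 0.4, -0.2, -3.1],
    "Amount": [10.0, 25.5, 3.2, 99.9, 1.0],
    "Class": [0, 0, 0, 0, 1],
})

# Every column except the last one is a feature.
FeatureNames = DF.columns[0:len(DF.columns) - 1]

X = DF[FeatureNames]                      # features
y = DF[DF.columns[len(DF.columns) - 1]]   # target: the last column

print(list(FeatureNames))
print(y.name)
```

The import itself is `from imblearn.over_sampling import SMOTE`, from the imbalanced-learn package. If you prefer positional indexing, `DF.iloc[:, :-1]` and `DF.iloc[:, -1]` select the same columns.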
With that set, I'll create the SMOTE() object and call its fit_resample() method to apply the balancing. By default, the minority class is oversampled until it matches the majority class, which gives a 50/50 split for a binary target.
Now the new variables are balanced. The last step is simply putting them back together into a single dataset. That's why I saved the FeatureNames variable at the very beginning of the code.
Checking again how my "Class" feature looks now:
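Putting the pieces back together and re-checking the counts can be sketched as follows. The arrays stand in for the output of `fit_resample()` and contain invented values. `BalancedDF` is a name made up for this sketch.

```python
import numpy as np
import pandas as pd

# Stand-ins for the resampled output (values are invented).
FeatureNames = ["V1", "V2", "Amount"]
X_balanced = np.array([
    [0.1, 1.2, 10.0],
    [0.3, 0.8, 25.5],
    [2.5, -3.1, 1.0],
    [2.1, -2.7, 1.4],
])
y_balanced = np.array([0, 0, 1, 1])

# Rebuild a single DataFrame: features first, target as the last column.
BalancedDF = pd.DataFrame(X_balanced, columns=FeatureNames)
BalancedDF["Class"] = y_balanced

# Both classes should now have the same number of rows.
balanced_counts = BalancedDF["Class"].value_counts()
print(balanced_counts)
```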
That's the simplest way of using this function. I hope you enjoyed this quick tip for your preprocessing stage!
If you did, please don't forget to check out the other posts on my profile.
Thanks!