Student Blog

Fraud Analytics: ML Tutorial on Dealing with an Imbalanced Dataset

October 28, 2019

This blog is posted by WeCloudData’s Immersive Bootcamp student Anthony Chen.

Fraud analytics presents a challenge that people may gloss over at first: the imbalanced dataset. How do we approach it? Where should we start? Which performance measures should we use? The goal of this article is to provide some insight into this challenge and present some tools to enhance your models. The example is done in Python; however, only some of the code is shown, with a heavier emphasis on the high-level view of the workflow itself.



We need to define the business goal and scope of the project before we do anything else. Our objective is to detect instances of credit card fraud without sacrificing the quality of our predictions. Letting fraud go through is costly, so why can’t we build a model that predicts the majority of transactions to be fraud? Imagine that every time you use your credit card, the transaction gets flagged as fraudulent, and you have to phone the credit card company every single time to rectify the situation. Classifying numerous legitimate transactions as fraud would strain the relationship between the company and its clients and could result in losing their business. In addition, investigating each suspicious transaction expends human and monetary resources. Misfiring on even a moderate number of predictions could be more costly than missing a fraudulent one.


The dataset was collected through a collaboration between Worldline and the Machine Learning Group of the Université Libre de Bruxelles. It contains 284,807 credit card transactions made by European cardholders over two days in September 2013. Background information about the data is sparse due to confidentiality issues, and a principal component analysis (PCA) transformation has already been applied to the features. Since the challenge here is class imbalance, I chose this particular dataset because neither feature engineering nor feature scaling is necessary.


Descriptive analysis summarizes what the data looks like, which helps you prepare it for predictive analysis.

A quick look into the target class (Fraud label) distribution yields an interesting graph.

[Figure: distribution of the target class (fraud vs. non-fraud)]

Among the 284,807 data points, only 492 are Class 1 (fraud). They make up only 0.172% of the dataset, with non-fraud accounting for the other 99.828%. This staggering discrepancy creates some obstacles specific to this type of dataset, along with common mistakes people make when trying to remedy them.

  1. During the training phase of the modeling process, the algorithm will see a substantial proportion of non-fraud cases compared to fraud ones. The underrepresented fraud features may be treated as noise and ignored, resulting in underfitting on the fraud class.
  2. Resampling is a great technique to fix the class distribution issue. However, a poor implementation will create a new problem: your model overfitting on the training data.
  3. Commonly used scoring metrics such as accuracy and the ROC-AUC score are not the best measures for gauging the performance of your model.

Using the sklearn package in Python, a baseline model was created with a random forest to illustrate some of these issues. A random forest is a good fit for this scenario because it handles an imbalanced dataset quite well. A quick summary of predicted vs. actual outcomes is displayed below as confusion matrix counts.

70/30 train-test random split with no hyperparameter tuning


True Negative (TN) = 85271
False Positive (FP) = 36
True Positive (TP) = 91
False Negative (FN) = 45

Accuracy = (TP + TN) / (TP + TN + FP + FN) = 0.9991
Recall = TP / (TP + FN) = 0.6691
Precision = TP / (TP + FP) = 0.7165

The accuracy score is an impressive 99.91%, but as mentioned above, it is extremely misleading. Mathematically, the score depends on the TP and TN values in the numerator, since the denominator is the whole test set. Because more than 99% of the labels in the raw set are non-fraud, a model could predict “not fraud” for every transaction and still hit that accuracy mark. We need to target recall while balancing precision.
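To make the point concrete, here is a small sketch that reuses the confusion matrix counts from the baseline above and compares the random forest’s accuracy with that of a hypothetical “model” that labels every transaction as non-fraud. The variable names are my own.

```python
# Counts taken from the baseline confusion matrix above
tn, fp, tp, fn = 85271, 36, 91, 45

negatives = tn + fp   # actual non-fraud transactions in the test set
positives = tp + fn   # actual fraud transactions in the test set
total = negatives + positives

# Accuracy of the random forest baseline
baseline_accuracy = (tp + tn) / total

# Accuracy of a hypothetical classifier that always predicts non-fraud:
# every negative is right, every positive is wrong
all_negative_accuracy = negatives / total

print(f"baseline accuracy:     {baseline_accuracy:.4f}")
print(f"all-negative accuracy: {all_negative_accuracy:.4f}")
```

The two numbers differ only in the third decimal place, even though the second classifier catches no fraud at all.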

Recall is your true positive rate, or the percentage of all frauds correctly identified. A recall of 66.91% means that we are missing about a third of all frauds. Precision, on the other hand, denotes the quality of your positive predictions: of all the fraud predictions made, how many were actually fraud? In this case, the baseline model was correct 71.65% of the time when it predicted fraud. So how do we balance the two? The better question is, how do we maximize both? The F1 score is the harmonic mean of recall and precision: F1 = 2 × (precision × recall) / (precision + recall).


What’s important to note is that precision and recall ignore true negatives in their calculations. Since F1 is derived from precision and recall, it responds only to changes in TP, FP, and FN.
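As a quick check, the three metrics can be computed directly from the baseline counts. This is a minimal sketch with my own variable names; note that TN never appears in any of the formulas.

```python
# Counts from the baseline confusion matrix (TN is deliberately unused)
fp, tp, fn = 36, 91, 45

precision = tp / (tp + fp)   # of all predicted frauds, how many were real
recall = tp / (tp + fn)      # of all real frauds, how many were caught

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Equivalent form written purely in terms of the counts
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"F1        = {f1:.4f}")
```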


Armed with the knowledge of the previous section, we can take some steps to handle the class imbalance and improve on the baseline model. Below is a simple framework I used to build my models and improve my scores.


Data Splitting

We want to split our raw data into a training set and a holdout set that stays untouched. Creating a separate training set allows us to resample it to even out the class distribution. The holdout data should never be seen during model training or used for parameter tuning. A simple stratified train-test split in sklearn is applied, using the fraud class label as the stratification criterion. This keeps the 99.83/0.17 class distribution the same in both sets, which matters when we refit our best model. After the split, 199,364 rows have been allocated to the training set and the rest to the holdout.
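The split described above could look roughly like the sketch below. It runs on a small synthetic DataFrame rather than the real data, so the column names other than ‘Class’, the row counts, and the random seeds are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 rows with 20 "fraud" rows
rng = np.random.default_rng(0)
n_rows = 10_000
data = pd.DataFrame(rng.normal(size=(n_rows, 3)), columns=["V1", "V2", "V3"])
data["Class"] = 0
data.loc[rng.choice(n_rows, size=20, replace=False), "Class"] = 1

# 70/30 split, stratified on the class label so both parts keep
# the same fraud rate as the full dataset
train, holdout = train_test_split(
    data, test_size=0.3, stratify=data["Class"], random_state=42
)

print("train fraud rate:  ", train["Class"].mean())
print("holdout fraud rate:", holdout["Class"].mean())
```

Without `stratify`, a random split of such a rare class could easily leave the holdout with too few fraud cases to evaluate on.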

Resampling our training set

As mentioned earlier, since the vast majority of samples belong to class FRAUD=NO (0), standard classifiers will be heavily biased towards it and will predict it most of the time. The features of the minority class, FRAUD=YES (1), will be treated as noise and ignored during fitting. We can resample our data to decrease the number of majority instances (undersampling), increase the number of minority instances (oversampling), or do a combination of both. I have chosen to perform undersampling, and here is some simple code for it.

import numpy as np

# train is the training DataFrame produced by the stratified split
# find the number of fraud samples so we can downsample our majority class to it
yes_fraud = len(train[train['Class'] == 1])

# retrieve the indices of the non-fraud and fraud samples 
yes_fraud_ind = train[train['Class'] == 1].index
no_fraud_ind = train[train['Class'] == 0].index

# random sample the non-fraud indices based on the amount of fraudulent samples
new_no_fraud_ind = np.random.choice(no_fraud_ind, yes_fraud, replace = False)

# merge the two indices together
undersample_ind = np.concatenate([new_no_fraud_ind, yes_fraud_ind])

# get undersampled dataframe from the merged indices
undersampled_data = train.loc[undersample_ind]


We end up with 688 samples: 344 each of fraud and non-fraud.

Model Building & Training

There are two key decisions to make during the predictive modeling process:

  1. What is the best algorithm for my model?
  2. What is the optimal set of hyperparameters to use for each of my algorithms?

There are many ways to decide which model to use. You can perform K-fold cross-validation across different algorithms and compare scoring metrics. In many machine learning projects, plotting ROC curves for each model can help you determine which one will perform best. However, on a very imbalanced set, the ROC curve is not ideal.

An ROC curve is a plot that displays the relationship between the true positive rate (TPR) and false positive rate (FPR) at all predictive probability thresholds for that model.

Suppose we have the following information and example:

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

Algorithm A gives a high number of TN and x FN
Algorithm B gives a high number of TN and 2x FN

Looking at the FPR formula, you can tell that a large number of TN will always give you a low FPR. In contrast, the extra FN in Algorithm B only affect the TPR, so the two curves differ along a single axis. Hence, the ROC curve will be optimistic about both models even if they are performing poorly.
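One way to see why a large TN count flatters the ROC curve is to compare two made-up models that differ only in how many false alarms they raise. The counts below are hypothetical, chosen to resemble the size of the test set, and the `rates` helper is my own.

```python
def rates(tn, fp, tp, fn):
    """Return TPR, FPR and precision for one set of confusion matrix counts."""
    return {
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }

# Hypothetical counts: model_b raises five times as many false alarms
model_a = rates(tn=85207, fp=100, tp=100, fn=36)
model_b = rates(tn=84807, fp=500, tp=100, fn=36)

for name, r in [("A", model_a), ("B", model_b)]:
    print(f"model {name}: TPR={r['tpr']:.3f}  FPR={r['fpr']:.4f}  "
          f"precision={r['precision']:.3f}")
```

Both points sit in almost the same spot in ROC space because FPR stays well below 1%, yet precision falls from one half to one sixth. This is the kind of difference that precision, recall, and F1 expose.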

With F1 score as the target metric to maximize, we will use Logistic Regression, SVM, and Random Forest as our algorithms with the undersampled set.

# divide undersampled_data into features and class label
X_under = undersampled_data.drop(['Class'], axis = 1)
y_under = undersampled_data['Class']

# import our three algorithms 
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.linear_model import LogisticRegression

# instantiate our classifiers
rfc = RandomForestClassifier(n_estimators = 100)
svc = svm.SVC()
lr = LogisticRegression()

K-fold cross-validation with these algorithms helps with the fact that we now have only 688 data points. With a simple holdout split, removing a portion of that for validation would leave too little for training, which can result in underfitting if there isn’t enough data to discover the underlying patterns. With K-fold, we reduce bias because all the data is used for fitting, and we reduce variance because all the data is used for validation as well.

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated F1 for the random forest; swap in svc or lr,
# or another scoring string such as 'precision' or 'recall'
scores = cross_val_score(rfc, X_under, y_under, cv = 5, scoring = 'f1')
mean_score = scores.mean()


Above is an example of using cross_val_score to retrieve a given score for a given classifier. It returns one score per fold, and we use the mean across all the folds.
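Putting the pieces together, a comparison loop over the three classifiers might look like the sketch below. It uses a synthetic balanced dataset from `make_classification` as a stand-in for the undersampled data, so the scores it prints say nothing about the fraud problem itself. The `max_iter` and `random_state` settings are my own additions.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_under / y_under (688 rows, roughly balanced)
X_demo, y_demo = make_classification(n_samples=688, n_features=10, random_state=0)

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "svm": svm.SVC(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold F1 score for each classifier
mean_f1 = {}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="f1")
    mean_f1[name] = scores.mean()
    print(f"{name}: mean F1 = {mean_f1[name]:.3f}")
```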

A brief summary of the cross-validation scores shows that all three models performed well on the undersampled dataset. Logistic regression and random forest have high scores across the board. The support vector machine scored the highest recall but a much lower precision, which is reflected in its F1 score. Parameter tuning is an important process if we want to improve our models. However, finding the optimal set of parameters via manual tuning can be like trying to find a needle in a haystack.

Grid search is a technique that systematically works through all the combinations of parameter values (given by the user) while cross-validating each one. It can be slow and computationally expensive, but our data is not large and we’re not going to attempt a fully exhaustive search.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [50,100,200],
 "max_depth": [3, None],
 "max_features": [1, 3, 10],
 "min_samples_split": [2, 3, 10],
 "min_samples_leaf": [1, 3, 10],
 "bootstrap": [True, False],
 "criterion": ["gini", "entropy"]}

# rfc is the RandomForestClassifier instantiated earlier
clf = GridSearchCV(rfc, param_grid, scoring = 'f1', cv = 5)
clf.fit(X_under, y_under)

best_params_: returns the set of parameters that yielded the highest score
best_estimator_: returns the estimator refit on the full data with those parameters
best_score_: returns the mean cross-validated score of the best_estimator_
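Here is a small, self-contained sketch of the same idea that runs quickly. It uses a deliberately reduced grid and synthetic data in place of the undersampled set, both of which are assumptions for illustration rather than the grid used for the results in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_under / y_under
X_demo, y_demo = make_classification(n_samples=688, n_features=10, random_state=0)

# Reduced grid: 2 x 2 = 4 combinations, each scored with 5-fold CV
small_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0), small_grid, scoring="f1", cv=5
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best mean F1:   ", round(search.best_score_, 3))

# The refit model, ready to call predict() on new data
best_model = search.best_estimator_
```

The number of fits grows multiplicatively with every parameter added to the grid, which is why the full grid above is noticeably slower than this one.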

[Chart: F1 scores of the three models after grid search]

The F1 scores here are an improvement compared to not tuning any parameters, with SVM seeing the biggest increase.

Holdout Testing

The last step is to refit the tuned models on the training data and predict on the holdout set. Generalization is key here: the holdout set keeps the original 99.83/0.17 class distribution, which is what the models will face in practice. The training data, before resampling, has the same distribution because we did a stratified split earlier.
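The whole workflow, from split to holdout evaluation, can be sketched end to end as follows. The data is synthetic (generated with `make_classification` at roughly a 2% positive rate), the undersampling uses pandas’ `sample` instead of the index-based code shown earlier, and the model is an untuned random forest, so treat this as an outline of the steps rather than a reproduction of the results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the credit card transactions
X, y = make_classification(
    n_samples=20_000, n_features=10, n_informative=5,
    weights=[0.98, 0.02], random_state=0,
)
data = pd.DataFrame(X, columns=[f"V{i}" for i in range(1, 11)])
data["Class"] = y

# 1. Stratified split: the holdout keeps the original class distribution
train, holdout = train_test_split(
    data, test_size=0.3, stratify=data["Class"], random_state=42
)

# 2. Undersample the majority class in the training set only
fraud = train[train["Class"] == 1]
non_fraud = train[train["Class"] == 0].sample(n=len(fraud), random_state=42)
undersampled = pd.concat([fraud, non_fraud])

# 3. Fit on the balanced training data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(undersampled.drop(columns="Class"), undersampled["Class"])

# 4. Predict on the untouched, still imbalanced holdout set
y_holdout = holdout["Class"]
y_pred = model.predict(holdout.drop(columns="Class"))

holdout_scores = {
    "precision": precision_score(y_holdout, y_pred),
    "recall": recall_score(y_holdout, y_pred),
    "f1": f1_score(y_holdout, y_pred),
}
print(holdout_scores)
```

Because only the training set is resampled, the scores on the holdout reflect how the model would behave on data with the real class balance.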

[Chart: recall, precision, and F1 scores of the models on the holdout set]

A recall of nearly 80% and precision of nearly 96% with random forest is a great improvement over our earlier baseline test. We’ve managed to increase both our recall and precision score. Achieving this is great because it was tested on previously unseen data and we hope that this will translate well on future incoming data!


I plan to improve the performance of this predictive model in the future with other sampling techniques (such as oversampling and SMOTE) and different algorithms. I would also like to introduce other ensemble methods apart from random forest. Even though I dismissed the ROC curve, delving into the precision-recall curve and space is something I will look into.

To see Anthony’s original blog post please click here. To follow and see Anthony’s latest blog posts, please click here.

