Detecting Credit Card Fraud with Machine Learning

Tackling credit card fraud, a major concern for banks and fintechs.

João Gustavo
15 min read · Aug 24, 2021
Photo by blocks on Unsplash

To read this article in Portuguese, click here!

In Brazil, about 12.1 million people were victims of some kind of financial fraud in the last year (including relatives of mine). Translated into figures, these financial scams exceed R$1.8 billion in losses per year. Failing to detect these frauds therefore causes considerable losses, both for financial institutions and for consumers.

Another factor to consider is the number of false positives, i.e. that annoyance when you try to make a purchase and have your card preventively blocked. Generally, this occurs when a transaction falls outside the client's usual pattern.

These are the reasons why investment in fraud detection through Artificial Intelligence keeps growing every year, representing a great opportunity in Data Science.

With large volumes of historical data, a machine learning algorithm only slightly better than the previous ones already represents savings of millions of dollars. This is the challenge: to keep improving the algorithms so that they inhibit or prevent fraudulent transactions.

Getting the Data

Photo by Isaac Smith on Unsplash

The data were made available by European credit card companies. In the dataset, there are only 492 frauds among almost 290 thousand transactions, all of which occurred over just two days.

It is easy to observe how extremely unbalanced this dataset is: frauds represent only 0.17% of all transactions. Therefore, we will have a considerable amount of work to do when balancing the data.

Another interesting detail is that the variables are all numeric and anonymized (to preserve client privacy and for security), which is why the columns are presented to us as [V1, V2, V3… V28].

On the Kaggle page where the data are hosted, we are informed that the variables were transformed with the Principal Component Analysis (PCA) method.

This method allows dimensionality reduction while keeping as much information as possible. To do this, the algorithm finds a new set of features called components.

The number of components is less than or equal to the number of original variables. In the case of this project, the components found by the PCA transformation are the columns [V1, V2, V3… V28] themselves.
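Just to illustrate the idea (this is not a step of the project, since the dataset we download already comes transformed), a PCA reduction with scikit-learn on hypothetical data looks something like this:

# Illustrative only: reduce 50 hypothetical features to 28 components
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_original = rng.normal(size=(1000, 50))    # fake data standing in for raw features
pca = PCA(n_components=28)                  # keep 28 components, like V1...V28
X_components = pca.fit_transform(X_original)
print(X_components.shape)                   # (1000, 28)
print(pca.explained_variance_ratio_.sum())  # fraction of the variance retained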

Once we have the data imported into a DataFrame, we can begin our exploratory analysis and ultimately prepare a Machine Learning model, or even several.
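A minimal sketch of this step, assuming the CSV downloaded from Kaggle keeps its default name creditcard.csv:

import pandas as pd

# Load the Kaggle dataset into a DataFrame (file name assumed)
df = pd.read_csv('creditcard.csv')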

Let’s start!

Exploratory Data Analysis

Here is where I will show you some relevant information, so that you can better understand the analysis and feel comfortable while reading.

We can start by checking the first 5 entries of the DataFrame and see what they tell us.

We can notice that the Time and Amount columns were kept at their original values, so as not to harm the analysis. We can also clarify that our target variable is located in the Class column, where 0 is a common transaction and 1 is a fraudulent one.

First entries

Using the describe() method, it is possible to confirm that the variables that underwent the PCA transformation show no apparent discrepancy, and neither does the Time column.

The variable Amount, however, has a mean (over both classes) of 88.34, a median of 22, and an impressive standard deviation of 250.12. Its maximum value is 25,691.16, which explains the standard deviation, since most transactions involve much smaller amounts.
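These statistics can be reproduced with a one-liner (the PCA columns could be included as well):

# Summary statistics for the non-PCA columns
print(df[['Time', 'Amount']].describe())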

DataFrame describe

You can celebrate: when we searched for missing values, we found NONE! That's right, you don't need to do all that data cleaning.
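The check itself is short; something like:

# Largest count of missing values over all columns (zero in this dataset)
print(df.isnull().sum().max())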

We already know that transactions classified as fraudulent represent only 0.17% of our dataset. To get a better visualization, I'm going to plot a bar chart, so we can confirm this imbalance.
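A minimal sketch of that plot, with value_counts() printing the exact numbers alongside it:

import seaborn as sns

# Exact counts per class (0 = common, 1 = fraud)
print(df['Class'].value_counts())
# Bar chart of the class distribution
sns.countplot(x=df['Class']);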

Class Distribution

In the graph above, the discrepancy between the classes is clear. Therefore, it will be necessary to balance the data, so that our models are not harmed by training on unbalanced data. It seems that even though we were spared the work of handling missing values, balancing this data will still require a lot of effort.

In order to compare the behavior and distribution of the classes, 4 graphs were plotted.

  • Graphs 1 and 3 use the time dimension (Time) as the reference; however, no useful information could be identified from them.
  • Graphs 2 and 4 use the transaction value (Amount) as the reference; in these two it was possible to notice that common transactions range between 0 and 5000, unlike fraudulent transactions, very few of which exceed 500.

Look at the graphs below:

Transaction Charts
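These four panels can be reproduced with matplotlib; a sketch, assuming df still holds the original Time and Amount columns:

import matplotlib.pyplot as plt

# 2x2 grid: columns are Time and Amount, rows are the two classes
fig, axs = plt.subplots(2, 2, figsize=(12, 8))
for col_idx, feature in enumerate(['Time', 'Amount']):
    for row_idx, label in enumerate([0, 1]):
        ax = axs[row_idx, col_idx]
        ax.hist(df.loc[df['Class'] == label, feature], bins=40)
        ax.set_title(f'{feature} - Class {label}')
plt.tight_layout()
plt.show()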

Boxplots were also plotted; they are a great tool to check for differences in the pattern of the transactions in relation to their value (Amount).

Below you can see that the two classes have different distributions; our models will probably benefit from this behavior.

Boxplot to check the distribution in relation to the value of transactions
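A sketch of this boxplot with seaborn; the y axis is limited because a handful of very large amounts would squash the boxes:

import seaborn as sns
import matplotlib.pyplot as plt

# Amount distribution per class (0 = common, 1 = fraud)
ax = sns.boxplot(x=df['Class'], y=df['Amount'])
ax.set_ylim(0, 300)  # zoom in on the bulk of the transactions
plt.show()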

Models Presentation

To show the difference in performance between models trained with balanced and unbalanced data, 8 models were built:

  • 4 Logistic Regression models, one trained with unbalanced data and the other three trained with data balanced by different methods.
  • 4 Decision Tree models, distributed in the same way as the Logistic Regression models.

Data Preparation

Before starting to build the models, it is necessary to prepare the data. First, the Time and Amount columns were standardized, since they still had their original values. For this, the StandardScaler class was used.

# Import packages needed from Scikit-Learn to prepare data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Create copy of DataFrame
df_clean = df.copy()
# Standardize the Time and Amount columns
scaler = StandardScaler()
# Standardizing the Amount column
df_clean['Scaler_amount'] = scaler.fit_transform(df_clean['Amount'].values.reshape(-1, 1))
# Standardizing the Time column
df_clean['Scaler_time'] = scaler.fit_transform(df_clean['Time'].values.reshape(-1, 1))
# Excluding the columns with original values
df_clean.drop(['Time', 'Amount'], axis=1, inplace=True)
# See the first entries
df_clean.head()
First Entries with Standardized Data

Separate Training and Testing

For the models to perform well, it is necessary to split the data between training and testing, so that we can evaluate how accurately they predict on data they have never had contact with.

# Separate the data between feature matrix and target vector
X = df_clean.drop('Class', axis=1)
y = df_clean['Class']
# Split the dataset between training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, shuffle=True)

Data Balancing

Finally, the data were balanced so that the models perform better at identifying fraudulent transactions. Balancing also helps avoid overfitting (when a model becomes great at making predictions on data it already knows, but performs worse when dealing with new, unseen data).

For balancing, 3 methods were used:

  • RandomUnderSampler (RUS): discards a random subset of the majority class, preserving the characteristics of the minority class; ideal when you have a large volume of data. This method can lead to lower performance when predicting the majority class.
  • ADASYN: generates new samples, close to the original minority samples that are misclassified by a K-Nearest Neighbors classifier. To better understand this method, click here.
  • SMOTE: the basic implementation of SMOTE, meanwhile, does not distinguish between samples that are easy and hard to classify, as ADASYN does.

Balancing Data with Under-Sampling (RUS)

# Balancing the data
import pandas as pd
import seaborn as sns
from imblearn.under_sampling import RandomUnderSampler
# Setting Parameters
rus = RandomUnderSampler()
X_rus, y_rus = rus.fit_resample(X_train, y_train)
# Check class balancing
print(pd.Series(y_rus).value_counts())
# Plot the new Class distribution
sns.countplot(x=y_rus);
Distribution of Classes with RUS

Balancing Data with Over-Sampling (SMOTE)

# Balancing the data
from imblearn.over_sampling import SMOTE
# Setting Parameters
smo = SMOTE()
X_smo, y_smo = smo.fit_resample(X_train, y_train)
# Check class balancing
print(pd.Series(y_smo).value_counts())
# Plot the new Class distribution
sns.countplot(x=y_smo);
Class Distribution with (SMOTE)

Balancing Data with Over-Sampling (ADASYN)

# Balancing the data
from imblearn.over_sampling import ADASYN
# Setting Parameters
ada = ADASYN()
X_ada, y_ada = ada.fit_resample(X_train, y_train)
# Check class balancing
print(pd.Series(y_ada).value_counts())
# Plot the new Class distribution
sns.countplot(x=y_ada);
Class Distribution with (ADASYN)

Machine Learning

First, it is necessary to understand what Machine Learning is.

A Machine Learning algorithm, also called a model, is a mathematical expression that represents data in the context of a particular problem, often a business problem. The main goal is to go from data to insight via the algorithm. There are two broad categories of machine learning: supervised and unsupervised.

For this project, only supervised models were built.

Supervised models are used to explain or predict data. This is done with the help of historical data used to train the model so that, afterwards, it can predict outputs for new inputs.

Logistic Regression

Logistic Regression is a machine learning algorithm used for classification problems; it performs predictive analysis and is based on the concept of probability.

If you want an in-depth understanding of logistic regression, click here!
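In a nutshell, the model passes a linear combination of the features through the logistic (sigmoid) function, which squeezes any score into a probability between 0 and 1. A tiny sketch:

import numpy as np

def sigmoid(z):
    # Maps any real-valued score into the (0, 1) interval
    return 1 / (1 + np.exp(-z))

# Hypothetical linear score (w . x + b) for one transaction
print(sigmoid(0.8))  # ~0.69 -> estimated probability of the fraud class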

Building the Model without Balancing the Data

The training process for this model used only the data split between training and testing, without any balancing. Scikit-learn was used to build the model.

# Choose and import a model
from sklearn.linear_model import LogisticRegression
# Choose and instantiate Hyperparameters
model_log = LogisticRegression()
# Model Fit (Training)
model_log.fit(X_train, y_train)
# Make predictions on new data
y_pred_log = model_log.predict(X_test)

Evaluating Model Performance

Here we will print out a classification report, the area under the curve (AUC) and a confusion matrix. This will help us evaluate the performance of the model, so that we can compare it to the following models and tell which has a better performance in detecting fraud.

A great metric for tracking the fraud hit rate is the recall of the fraud class in the classification report.

The AUC is one of the best metrics for model evaluation: when the predictions are 100% wrong its value is zero, and when they are 100% right its value is 1. Each model will present a different AUC value, which helps us determine the best model. To better understand AUC, click here!
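The evaluation code itself is not reproduced in this post, so here is a minimal sketch with scikit-learn (assuming, as the reported values suggest, that the AUC is computed from the predicted labels):

from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Per-class precision/recall/f1 (the recall of class 1 is the fraud hit rate)
print(classification_report(y_test, y_pred_log))
# Area under the curve, computed here from the predicted labels
print('AUC: {:.4f}'.format(roc_auc_score(y_test, y_pred_log)))
# Confusion matrix: rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred_log))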

Below we can see that:

  • Our model produced a large number of false negatives (46 fraudulent transactions classified as common), which is not good at all, since the customer or the bank will have to bear the losses.
  • There was great performance in predicting common transactions, but an underperformance in predicting fraud (only 77 of the 123 frauds in the test set were caught).
  • In cases like this, precision and accuracy are not our priority: if the model errs more on false positives and less on false negatives, that is already a great improvement.
  • The AUC of this model was not the worst, but we can improve on this 0.8130.

Building the (RUS) Model

The training process for this model is extremely similar to the previous one; the difference is that it was trained on data balanced by the (RUS) method, while the predictions were still made on the test data.

# Choose and instantiate Hyperparameters
model_log_rus = LogisticRegression()
# Fit of the model with RUS data (Train)
model_log_rus.fit(X_rus, y_rus)
# Make predictions on new data
y_pred_rus = model_log_rus.predict(X_test)

Evaluating Model Performance

This time, the metrics tell us that:

  • The number of false negatives decreased considerably (from 46 to 10), which is a positive point. This way, the bank's losses will be lower.
  • With the big improvement in fraud prediction (from 77 to 113 frauds caught), there was also a large increase in false positives (from 8 to 2710), but this was expected and is not the worst-case scenario.
  • This model had an excellent AUC: 0.9403.

Building the Model (SMOTE)

In this case, the model was trained on the balanced data by the (SMOTE) method and then made its predictions on top of the test data.

# Choose and import a model
from sklearn.linear_model import LogisticRegression
# Choose and instantiate Hyperparameters
model_log_smo = LogisticRegression()
# Fit the model with SMOTE data (Train)
model_log_smo.fit(X_smo, y_smo)
# Make predictions on new data
y_pred_smo = model_log_smo.predict(X_test)

Evaluating the Model’s Performance

With this model it was possible to notice that:

  • The number of false negatives increased compared to the (RUS) model: 12 false negatives against 10.
  • The number of detected frauds also changed slightly: 111 against the 113 of the (RUS) model.
  • However, the differential of this model is that the number of false positives dropped sharply (from 2710 to 1702), a decrease of more than 1000 cases, meaning that bank clients under this model would have their cards blocked less often.
  • This model had an excellent AUC, though not the best: 0.9392.

Building the Model (ADASYN)

This model was trained on balanced data using the (ADASYN) method and made its predictions on top of the test data.

# Choose and instantiate Hyperparameters
model_log_ada = LogisticRegression()
# Fit the model with ADASYN data (Train)
model_log_ada.fit(X_ada, y_ada)
# Make predictions on new data
y_pred_ada = model_log_ada.predict(X_test)

Evaluating the Model’s Performance

This is the last Logistic Regression model, and it showed very interesting results. Let's take a look:

  • The number of false negatives was the lowest among our logistic regression models: only 6, compared to 10 in the (RUS) model.
  • The number of detected frauds was the highest among all the models: 117, while no other model exceeded 113.
  • However, a negative point of this model is that the number of false positives increased sharply, the highest among the compared models (6533 against the 1702 of the SMOTE model). Bank clients under this model would have their cards blocked more often but, in compensation, losses from fraud would be lower.
  • This model had the lowest AUC among the models trained on balanced data: 0.9297.

However, it is worth remembering that the AUC alone is not enough to say which model is the best or the worst; it depends on which solution is most efficient for the institution. Sometimes the best solution for the bank is the one with more false positives and fewer false negatives, rather than the other way around.

Decision Trees

The following processes are extremely similar to the previous ones, but now the model is a decision tree, so the method used to classify transactions is different.

This way we will be able to define which model is better for Fraud Detection, a decision tree or a logistic regression.

For better understanding, I will give a brief explanation of how a Decision Tree model works.

Basically, a decision tree consists of a start node (also known as the root), internal nodes, branches, and leaves. At each node the data are split into increasingly pure subsets; the purest subsets are found at the leaves, where the classification is made.

For a graphical visualization of how a decision tree is built, visit the R2D3 link.
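Once a tree is trained, you can also inspect the splits it learned in plain text; a quick sketch on a toy dataset, using scikit-learn's export_text:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy example just to show the root / internal nodes / leaves structure
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, criterion='entropy')
tree.fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))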

Building the Tree Model without Data Balancing

To build the model, once again Scikit-Learn was used. As with the Logistic Regression models, this first model was trained on unbalanced data.

# Choose and import a model
from sklearn.tree import DecisionTreeClassifier
# Choose the Hyperparameters
model_tree = DecisionTreeClassifier(max_depth=4,criterion='entropy')
# Model Fit(Train)
model_tree.fit(X_train, y_train)
# Making predictions on top of test data
y_pred_tree = model_tree.predict(X_test)

Evaluating Model Performance

Here we will use the metrics you are already familiar with, the classification report, the area under the curve (AUC) and a confusion matrix.

Always remember that we can see the fraud hit rate in the recall column of the classification report.

Let’s take a look at how our first decision tree model performed:

  • Our model produced fewer false negatives than the first LR (Logistic Regression) model (34 versus 46 fraudulent transactions classified as common); here we can see that a decision tree performed better with unbalanced data.
  • There was great performance in predicting common transactions, even better than the LR model, and more frauds were caught (89 versus 77), though still below a satisfactory level.
  • Again, in cases like this, precision and accuracy are not our priority: erring more on false positives and less on false negatives is already a great improvement.
  • The AUC of this model was better than our LR model's: 0.8617.

So far, everything indicates that Decision Tree models are better in this situation, but is that true only for unbalanced data, or for balanced data as well?

This you will find out in the next models.

Building the Tree Model (RUS)

Same as before: a model trained on RUS-balanced data, but this time a decision tree. Here you can check whether Decision Trees continue to perform better than the Logistic Regression models.

# Choose the hyperparameters
model_tree_rus = DecisionTreeClassifier(max_depth=8, criterion='entropy')
# Model Fit with RUS Data(Train)
model_tree_rus.fit(X_rus, y_rus)
# Making predictions on top of test data
y_pred_tree_rus = model_tree_rus.predict(X_test)

Evaluating Model Performance

Let’s see how our first decision tree model, with balanced data, performed:

  • The number of false negatives decreased considerably (from 34 to 13), which is a positive point, but unfortunately the LR model did better when trained on this data.
  • There was an improvement in fraud prediction (from 89 to 110), but it generated a huge increase in false positives (from 6 to 6863); this was somewhat unexpected, and the increase is disastrous when compared to the LR model.
  • The AUC of 0.8989 was the lowest among the models trained with balanced data.

Building the Tree Model (SMOTE)

This model was trained on balanced data by the (SMOTE) method and then performed its predictions on top of the test data.

# Choose the Hyperparameters
model_tree_smo = DecisionTreeClassifier(max_depth=6, criterion='entropy')
# Fit Model with SMOTE data (Train)
model_tree_smo.fit(X_smo, y_smo)
# Making predictions on top of test data
y_pred_tree_smo = model_tree_smo.predict(X_test)

Evaluating the Model’s Performance

With this model, we can see that:

  • The number of false negatives increased compared to the tree model (RUS): from 13 to 18.
  • The number of detected frauds fell: 105 against the 110 of the tree model (RUS).
  • However, this tree model presented the fewest false positives (3588 against the 6863 of the previous model), a decrease of more than 3200 cases, meaning that bank clients under this model would have their cards blocked less often.
  • This model had an excellent AUC, but not the best: 0.9016.

The LR models still appear to perform better at detecting fraud.

Building the Tree Model (ADASYN)

This model was trained on balanced data by the (ADASYN) method and performed its predictions on top of the test data.

# Choose the Hyperparameters
model_tree_ada = DecisionTreeClassifier(max_depth=8, criterion='entropy')
# Fit the Model with ADASYN data (Train)
model_tree_ada.fit(X_ada, y_ada)
# Making predictions on top of test data
y_pred_tree_ada = model_tree_ada.predict(X_test)

Evaluating Model Performance

In the last Decision Tree model we had the most balanced results, though not the best. Take a look:

  • The number of false negatives was neither the lowest nor the highest among the tree models: 15, between the 13 of the tree model (RUS) and the 18 of the SMOTE one.
  • Fraud detection remained balanced as well: 108 frauds were caught, while the best tree model caught 110.
  • Like the LR model (ADASYN), this tree model also has a negative point regarding false positives (4032 against SMOTE's 3588), though it is worth remembering that it is not the model with the most false positives.
  • This model had the highest AUC among the tree models: 0.9107.

Conclusion

As we can see, this project is different from the others, since there were no missing values and no data cleaning was necessary. Although we worked with good-quality, clean data, we had to deal with the class imbalance and the PCA transformation, which still demanded a considerable amount of work.

It can be concluded that:

  • Logistic Regression algorithms performed better when dealing with balanced data.
  • Decision Tree algorithms were superior when dealing with unbalanced data.
  • The ideal solution is the one that best serves the institution, which may be the one with the highest AUC or the one that detects the most frauds.
  • The algorithm that best predicted fraud was the Logistic Regression model trained on data balanced with the ADASYN method.

To access the complete project, click here! Follow me on LinkedIn and keep an eye on my GitHub, there you can find more projects in the future.
