
Predictive Churn Modeling Using Python

October 28, 2019

This blog is posted by WeCloudData’s Data Science Bootcamp student Austin Jung.

Customer churn is a common business problem in many industries. Losing customers is costly for any business, so identifying unhappy customers early on gives you a chance to offer them incentives to stay. In this post, I am going to talk about machine learning for the automated identification of unhappy customers, also known as customer churn prediction.

Data: Telecom customer data
Tool: Python
Machine Learning: Logistic, SVM, KNN, and Random Forest

Let’s get started!

First, I imported all the libraries and read the CSV file into a pandas DataFrame.

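A minimal sketch of that setup (the file name telecom_churn.csv is an assumption; substitute your own path):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the telecom data into a pandas DataFrame
# ('telecom_churn.csv' is a placeholder file name)
churn_df = pd.read_csv('telecom_churn.csv')

print(churn_df.shape)  # (3333, 21)
churn_df.head()
```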

The feature space holds 3,333 observations and 21 features including ‘Churn’, which is our target variable.


We can use the describe() method to confirm whether or not our columns have missing values. Fortunately, our data set has already been pre-cleaned, so no missing values were detected. In real life, we often have to deal with unclean data sets ourselves.

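Roughly, the check looks like this:

```python
# Summary statistics for the numeric columns
churn_df.describe()

# Count missing values per column explicitly as well;
# every count comes back 0 because this data set is pre-cleaned
print(churn_df.isnull().sum())
```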

Of the 3,333 total customers, 483 churned while 2,850 stayed.

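A quick way to see this, assuming the target column is named 'Churn':

```python
# Class balance of the target variable
print(churn_df['Churn'].value_counts())
# False    2850
# True      483
```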

Isolate target data (Churn).

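Something like the following, assuming 'Churn' is stored as a Boolean column (map string labels to True/False first if it is not):

```python
# Pull the target out and encode it as 1 (churned) / 0 (stayed)
y = churn_df['Churn'].astype(int).values
```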

There are a few variables that need to be removed from our columns, since not all of them are useful in predicting whether a customer will churn. For example, the customer's contact information (area code and phone number) and geographical information (state) are useless for predicting churn.

A technique called information gain can be used to see which variables are most important in predicting churn. Here, the most important variables are 'customer service calls', 'number of mins called', and 'credit used'.
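The post does not show this step; one way to approximate information gain is scikit-learn's mutual_info_classif, which estimates the mutual information between each feature and the target. A sketch, scoring only the numeric columns:

```python
from sklearn.feature_selection import mutual_info_classif

# Estimate mutual information (information gain) per numeric feature
numeric_features = churn_df.select_dtypes(include=[np.number])
mi = mutual_info_classif(numeric_features, y, random_state=42)
mi_scores = pd.Series(mi, index=numeric_features.columns)
print(mi_scores.sort_values(ascending=False).head(10))
```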

Columns holding 'yes'/'no' values, such as "Int'l Plan" and "VMail Plan", have to be converted to Boolean values.

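A sketch of both steps; the column names follow the standard telecom churn data set and may differ in your copy:

```python
# Drop identifier and geographic columns (plus the target itself)
to_drop = ['State', 'Area Code', 'Phone', 'Churn']
X = churn_df.drop(to_drop, axis=1)

# Convert the 'yes'/'no' plan columns to Boolean values
yes_no_cols = ["Int'l Plan", 'VMail Plan']
X[yes_no_cols] = X[yes_no_cols] == 'yes'
```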

Next, I created a StandardScaler() object called scaler, fit it to the features, and then transformed the features to a scaled version.

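In code:

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler to the features, then transform them
# to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

(Strictly speaking, the scaler should be fit on the training split only to avoid leaking information into the test set; this sketch follows the post's order of operations.)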

Let's split the data into a training set (70%) and a testing set (30%) and import the additional libraries for Logistic, KNN, SVC, and Random Forest.

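A sketch of the split and the imports (random_state is an arbitrary choice for reproducibility):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# 70% training / 30% testing split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42)
```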

Let’s compare Logistic, SVC, KNN, and Random Forest.

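One way to run the comparison; the hyperparameters here are scikit-learn defaults, not necessarily what the original analysis used:

```python
from sklearn.metrics import classification_report

models = {
    'Logistic': LogisticRegression(),
    'SVC': SVC(),
    'KNN': KNeighborsClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model and report precision/recall on the test set
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
```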

For this churn analysis, I did not use accuracy for evaluation since it can be misleading for imbalanced classes such as ours. For the evaluation of our model, I used precision and recall instead.

What are precision and recall?

Precision measures the accuracy of the positive predictions. In other words, when the model predicts yes, how often is it correct?
Precision = True Positives / Predicted Yes = TP / (TP + FP)

Recall (also known as sensitivity) measures how many of the actual positive observations are predicted correctly. In other words, when it's actually yes, how often does the model predict yes?
Recall = True Positives / Actual Yes = TP / (TP + FN)

According to the chart above,
Precision: RF > SVC > KNN > Logistic
Recall: RF > SVC > KNN > Logistic

We can also construct a confusion matrix and a ROC curve to dig further into the quality of our results.

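For example, for the random forest (a sketch; any of the fitted models above would work the same way):

```python
from sklearn.metrics import confusion_matrix, roc_curve, auc

rf = models['Random Forest']
print(confusion_matrix(y_test, rf.predict(X_test)))

# The ROC curve needs predicted probabilities, not hard labels
y_score = rf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_score)
plt.plot(fpr, tpr, label='RF (AUC = %.3f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
```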

In the graph above, we tried the hold-out validation method.


Let's also try k-fold cross-validation, but only for SVC and Random Forest (RF), because as we saw above, the precision values for logistic regression and KNN are lower (below 0.90) than for the other two (above 0.90). Cross-validation attempts to avoid overfitting while still producing a prediction for each observation in the dataset.

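A sketch with scikit-learn's cross_val_score, trying both K = 3 and K = 5 as in the post:

```python
from sklearn.model_selection import cross_val_score

for k in (3, 5):
    for name in ('SVC', 'Random Forest'):
        scores = cross_val_score(models[name], X_scaled, y, cv=k)
        print('K=%d  %s: %.3f' % (k, name, scores.mean()))
```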

When K = 3, the cross-validated accuracies for SVM and RF are 0.920 and 0.939 respectively. With K = 5, RF's performance improved to 0.940.

The table below (using the random forest) shows the predicted probability (pred_prob), the number of observations assigned that probability (count), and the true churn rate within that group (true_prob).

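One way to build such a table, binning the random forest's test-set predictions to one decimal place (an assumption about how the original table was constructed):

```python
# Round predicted churn probabilities into 0.1-wide bins
pred_prob = np.round(rf.predict_proba(X_test)[:, 1], 1)
calib = pd.DataFrame({'pred_prob': pred_prob, 'churned': y_test})

# For each bin: how many customers landed in it, and how many actually churned
table = calib.groupby('pred_prob').agg(count=('churned', 'size'),
                                       true_prob=('churned', 'mean'))
print(table)
```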

The choice of probability threshold will be based on business context. If the business cares more about its budget (marketing expenditure), a higher threshold should be targeted (above 0.8 or 0.9). Otherwise, a lower threshold can be chosen so the company can reach a larger number of customers who are at risk of churning.

For example, say the company wants a high threshold, such as a probability above 0.9. According to our chart, the random forest predicted that 77 people had a 0.9 probability of churning, and in actuality that group churned at a rate of about 0.948.

We should also consider lift. For example, suppose we have an average churn rate of 5% (baseline), but our model has identified a segment with a churn rate of 20%. Then that segment would have a lift of 4.0 (20% / 5%). This can help us get a better picture of the overall performance of our model. It can also help us set a threshold, to see which customers are worth targeting.
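Continuing the sketch above, lift per probability bin is simply the bin's true churn rate over the overall rate:

```python
# Overall churn rate as the baseline (483 / 3333 ≈ 0.145 in this data)
baseline = y.mean()
table['lift'] = table['true_prob'] / baseline
print(table)
```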

Okay, so we have a model that can predict which customers are at risk of churning. What can we do with it? We can start thinking about how to retain the customers with a high probability of leaving our company.

Many things have to be considered before reaching out to those customers. First, we should consider our campaign or marketing budget, and then decide on a reasonable threshold probability (or bins, or deciles). Customers with a low probability of churning can be removed from re-targeting lists, which could lead to cost savings in marketing.

Among those targeted customers, we can use clustering techniques such as K-means to identify segments, prepare tailored strategies for each audience, and offer different kinds of discounts or other incentives. Retention campaigns through channels such as Facebook, Google Ads, Twitter, or email are another good way to keep customers.
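As a rough sketch, segmenting the at-risk customers with scikit-learn's KMeans might look like this (the 0.5 threshold and three clusters are arbitrary choices):

```python
from sklearn.cluster import KMeans

# Keep customers whose predicted churn probability clears the threshold
at_risk = X_scaled[rf.predict_proba(X_scaled)[:, 1] > 0.5]

# Group them into segments so each can receive a tailored retention offer
kmeans = KMeans(n_clusters=3, random_state=42)
segments = kmeans.fit_predict(at_risk)
```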

To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, click here to see our upcoming course schedule.
