Predictive Churn Modeling Using Python

October 28, 2019

This blog post was written by WeCloudData's Data Science Bootcamp student Austin Jung.

Customer churn is a common business problem in many industries. Losing customers is costly for any business, so identifying unhappy customers early on gives you a chance to offer them incentives to stay. In this post, I am going to talk about machine learning for the automated identification of unhappy customers, also known as customer churn prediction.

Data: Telecom customer data
Tool: Python
Machine Learning: Logistic Regression, SVM, KNN, and Random Forest

Let’s get started!

First, I imported all the libraries and read the CSV file into a pandas DataFrame.

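The original code appeared as a screenshot; a minimal sketch of this step might look like the following (the file name telecom_churn.csv is an assumption):

```python
import pandas as pd

# Load the telecom customer data (file name is illustrative)
df = pd.read_csv('telecom_churn.csv')

print(df.shape)  # expect (3333, 21)
```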

The data set holds 3,333 observations and 21 columns, including 'Churn', which is our target variable.


We can use the describe() method to confirm whether or not our columns have missing values. Fortunately, our data set has already been pre-cleaned, so no missing values were detected. In real life, we would often have to deal with unclean data sets and handle missing values ourselves.

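A sketch of that check, assuming the DataFrame df from above:

```python
# Summary statistics; the count row reveals columns with missing values
print(df.describe())

# A direct count of missing values per column (all zeros for this data set)
print(df.isnull().sum())
```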

Of the 3,333 total customers, 483 churned while 2,850 stayed.

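Something like the following reproduces those counts (again a sketch, not the original screenshot):

```python
# Class balance of the target variable: ~2,850 stayed vs. ~483 churned
print(df['Churn'].value_counts())
```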

Isolate target data (Churn).

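A minimal version of this step, assuming 'Churn' is stored as a boolean column (if it holds 'yes'/'no' strings, map them to booleans first):

```python
# Isolate the target and encode it as 0/1
y = df['Churn'].astype(int)
```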

There are a few variables that need to be removed from our columns, since not all of them are useful in predicting whether a customer will churn. For example, the customer's identifying information (area code and phone number) and geographical information (state) are of no use in predicting churn.

A technique called information gain can be used to see which variables are most important in predicting churn. The most important variables are 'customer service calls', 'number of minutes called', and 'credit used'.

Columns that contain 'yes'/'no' values, such as 'Int'l Plan' and 'VMail Plan', have to be converted to Boolean values.

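A sketch of both steps; the column names 'State', 'Area Code', 'Phone', "Int'l Plan", and 'VMail Plan' are assumptions about this data set:

```python
# Drop identifiers and geography that carry no predictive signal,
# along with the target itself
X = df.drop(columns=['Churn', 'State', 'Area Code', 'Phone'])

# Convert the 'yes'/'no' plan columns to 0/1
for col in ["Int'l Plan", 'VMail Plan']:
    X[col] = (X[col] == 'yes').astype(int)
```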

Next, I created a StandardScaler() object called scaler, fit it to the features, and then transformed the features to a scaled version.

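That step might look like this with scikit-learn:

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler to the features and transform them in one call
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```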

Let's split the data into a training set (70%) and a testing set (30%) and import the additional libraries for Logistic Regression, KNN, SVC, and Random Forest.

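A sketch of the split and the imports (the random_state is arbitrary):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42)
```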

Let's compare Logistic Regression, SVC, KNN, and Random Forest.

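One way to run that comparison (the hyperparameters are defaults, not necessarily those used in the original post):

```python
from sklearn.metrics import classification_report

models = {
    'Logistic': LogisticRegression(),
    'SVC': SVC(probability=True),
    'KNN': KNeighborsClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=100),
}

# Fit each model and report precision and recall on the test set
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
```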

For this churn analysis, I did not use accuracy for evaluation, since accuracy can be misleading for imbalanced classes such as ours. Instead, I evaluated the models using precision and recall.

What are precision and recall?

Precision measures the accuracy of positive predictions. In other words, when the model predicts yes, how often is it correct?
Precision = True Positives / Predicted Positives

Recall (also known as sensitivity) measures how many of the actual positives are predicted correctly. In other words, when it's actually yes, how often does the model predict yes?
Recall = True Positives / Actual Positives

According to the chart above,
Precision: RF > SVC > KNN > Logistic
Recall: RF > SVC > KNN > Logistic

We can also construct a confusion matrix and a ROC curve to dig further into the quality of our results.

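A sketch of both diagnostics for the Random Forest, reusing the fitted models from above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, auc

rf = models['Random Forest']
print(confusion_matrix(y_test, rf.predict(X_test)))

# ROC curve built from predicted churn probabilities
probs = rf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label=f'RF (AUC = {auc(fpr, tpr):.3f})')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```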

The results above were obtained with the hold-out validation method.


Let's also try k-fold cross-validation, but only for SVC and Random Forest (RF), because, as we saw above, the precision values for Logistic Regression and KNN (below 0.90) are lower than those of the other two (above 0.90). Cross-validation helps avoid overfitting while still producing a prediction for each observation in the dataset.

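A sketch of the cross-validation runs (cross_val_score defaults to accuracy scoring, which matches the numbers quoted below):

```python
from sklearn.model_selection import cross_val_score

for name in ['SVC', 'Random Forest']:
    for k in (3, 5):
        scores = cross_val_score(models[name], X_scaled, y, cv=k)
        print(f'{name}, K={k}: mean accuracy {scores.mean():.3f}')
```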

When K = 3, model accuracy for SVM and RF is 0.920 and 0.939, respectively. With K = 5, model performance improved to 0.940 for RF.

The table below (using Random Forest) shows the predicted probability (pred_prob), the number of observations assigned each predicted probability (count), and the true churn rate within each group (true_prob).

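One way to build such a table, assuming the fitted Random Forest rf from above (binning the probabilities into 0.1-wide buckets is an assumption about how the original table was constructed):

```python
import numpy as np

# Bucket predicted churn probabilities to one decimal place, then compare
# each bucket's size (count) with its actual churn rate (true_prob)
pred_prob = np.round(rf.predict_proba(X_test)[:, 1], 1)
table = (pd.DataFrame({'pred_prob': pred_prob, 'churn': y_test.values})
         .groupby('pred_prob')['churn']
         .agg(count='size', true_prob='mean'))
print(table)
```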

The choice of probability threshold will be based on business context. If the business cares more about its budget (marketing expenditure), a higher threshold should be targeted (above 0.8 or 0.9). Otherwise, a lower threshold can be used so the company can target a larger number of customers who are at risk of churning.

For example, let's say the company wants to target a high threshold, such as a probability above 0.9. According to our table, the Random Forest assigned 77 people a churn probability of 0.9, and in actuality that group churned at a rate of about 0.948.

We should also consider lift. For example, suppose we have an average churn rate of 5% (the baseline), but our model has identified a segment with a churn rate of 20%. That segment has a lift of 4.0 (20% / 5%). Lift gives a better picture of the overall performance of our model and can also help us set a threshold for which customers are worth targeting.

Okay, so we have a model that can predict which customers are at risk of churning; what can we do with it? We can start thinking about how to retain the customers with a high probability of leaving our company.

Many things have to be considered before reaching out to those customers. First, we should consider our campaign or marketing budget, and then decide on a reasonable threshold probability (or bins, or deciles) for targeting. Customers with a low probability of churning can be removed from re-targeting lists, which could lead to cost savings in marketing.

From those targeted customers, we can use clustering techniques such as k-means to identify segments in our data and prepare different strategies for each target audience, such as offering different kinds of discounts or other incentives. Retention marketing through channels such as Facebook, Google Ads, Twitter, or email campaigns is another good way to retain customers.

To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, click here to see our upcoming course schedule.
