This blog is posted by WeCloudData’s Data Science Bootcamp student Austin Jung.
Customer churn is a common business problem in many industries. Losing customers is costly for any business, so identifying unhappy customers early on gives you a chance to offer them incentives to stay. In this post, I am going to talk about machine learning for the automated identification of unhappy customers, also known as customer churn prediction.
Data: Telecom customer data
Tool: Python
Machine Learning: Logistic Regression, SVM, KNN, and Random Forest
Let’s get started!
First, I imported all the libraries and read the CSV file into a pandas DataFrame.
The data set holds 3,333 observations and 21 columns, including ‘Churn’, which is our target variable.
We can use the describe() method to check whether our numeric columns have missing values: a count lower than the number of rows means some entries are missing. Fortunately, our data set has already been pre-cleaned, so no missing values were detected. In real life, data sets are rarely this clean, and we usually have to handle missing values ourselves.
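Here is a minimal sketch of that check on a made-up frame. The column names other than ‘Churn’ and all the values are placeholders, not the real telecom data, and one value is deliberately left missing so the check has something to find.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the telecom data; names and values are assumptions.
df = pd.DataFrame({
    "Day Mins": [265.1, 161.6, np.nan, 299.4],
    "CustServ Calls": [1, 1, 0, 2],
    "Churn": [False, False, False, True],
})

print(df.shape)  # (rows, columns)

# Direct check: number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Indirect check: describe() reports the non-null count per numeric column.
non_null_counts = df.describe().loc["count"]
print(non_null_counts)
```

On a pre-cleaned data set like ours, every entry of missing_per_column would be zero.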
Of the 3,333 total customers, 483 churned while 2,850 stayed, a churn rate of about 14.5%.
Next, I isolated the target variable (Churn).
There are a few variables that need to be removed from our columns. Not all the variables are useful in predicting if a customer will churn. For example, the customer’s identifying information (area code and phone number) and geographical information (state) are of little use in predicting churn.
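The two steps above can be sketched as follows. The column names ‘State’, ‘Area Code’, ‘Phone’, and ‘CustServ Calls’ are my assumptions about how the columns might be labelled, and the rows are invented for illustration.

```python
import pandas as pd

# Toy rows; every column name except 'Churn' is an assumed label.
df = pd.DataFrame({
    "State": ["KS", "OH", "NJ"],
    "Area Code": [415, 415, 408],
    "Phone": ["000-0001", "000-0002", "000-0003"],
    "CustServ Calls": [1, 1, 4],
    "Churn": [False, False, True],
})

# Isolate the target as 0/1 integers.
y = df["Churn"].astype(int)

# Drop the target plus the identifier and geography columns from the features.
X = df.drop(columns=["State", "Area Code", "Phone", "Churn"])

print(X.columns.tolist())
print(y.value_counts())
```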
A technique called information gain is used to see which variables are most important in predicting churn. The most important variables are ‘customer service calls’, ‘number of mins called’ and ‘credit used’.
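One way to rank features like this is scikit-learn's mutual_info_classif, which estimates the mutual information between each feature and the target (the quantity that information gain measures). This is a sketch on synthetic data, not the code used for the post: the feature names are hypothetical and the data is generated so that one feature carries signal and the other is pure noise.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 1000

# Synthetic features: one drives the target, the other is unrelated noise.
signal = rng.normal(size=n)
noise = rng.normal(size=n)
y = (signal + 0.5 * rng.normal(size=n) > 1.0).astype(int)

X = pd.DataFrame({"signal_feature": signal, "noise_feature": noise})

# Higher score = the feature tells us more about the target.
scores = mutual_info_classif(X, y, random_state=0)
ranking = pd.Series(scores, index=X.columns).sort_values(ascending=False)
print(ranking)
```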
Columns that hold ‘yes’/‘no’ values, such as “Int’l Plan” and “VMail Plan”, have to be converted to Boolean values.
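A small sketch of that conversion, assuming the raw values are the lowercase strings "yes" and "no" (the three rows are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Int'l Plan": ["no", "yes", "no"],
    "VMail Plan": ["yes", "no", "no"],
})

# Comparing against "yes" yields True for "yes" and False for anything else.
yes_no_cols = ["Int'l Plan", "VMail Plan"]
for col in yes_no_cols:
    df[col] = df[col] == "yes"

print(df.dtypes)
print(df)
```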
Afterwards, I created a StandardScaler() object called scaler, fit it to the features, and then transformed the features into a scaled version.
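As a sketch of what the scaler does, here it is on a tiny made-up feature matrix. After scaling, each column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: two columns on very different scales.
X = np.array([
    [265.1, 1.0],
    [161.6, 1.0],
    [243.4, 0.0],
    [299.4, 2.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # fit, then transform, in one call

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

A common variation is to fit the scaler on the training split only and reuse it to transform the test split, so that no information from the test set leaks into preprocessing.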
Next, I split the data into a training set (70%) and a testing set (30%) and imported the additional libraries for Logistic Regression, KNN, SVC, and Random Forest.
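A 70/30 split can be sketched with train_test_split. The data here is random, and both stratify=y (which keeps the churn ratio similar in the two sets) and the random_state value are my additions, not settings taken from the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # 100 made-up customers, 3 features
y = np.array([1] * 15 + [0] * 85)    # imbalanced target, like churn

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```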
Let’s compare Logistic, SVC, KNN, and Random Forest.
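A comparison along these lines can be sketched as a loop over the four classifiers. This runs on synthetic imbalanced data from make_classification with default hyperparameters, so the numbers it prints are not the ones reported in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in: roughly 15% positives, like our churn rate.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "Logistic": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
    }
    print(name, results[name])
```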
For this churn analysis, I did not use accuracy for evaluation since it can be misleading for imbalanced classes such as ours. For the evaluation of our model, I used precision and recall instead.
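To see why accuracy can mislead here, consider a "model" that predicts that nobody churns. Using the class counts from our data (483 churned, 2,850 stayed), this sketch shows it reaches high accuracy while catching no churners at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts from the data set: 483 churned (1), 2,850 stayed (0).
y_true = np.array([1] * 483 + [0] * 2850)

# A useless baseline that always predicts "stays".
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)

print(f"accuracy: {acc:.3f}")  # share of customers who stayed
print(f"recall:   {rec:.3f}")  # share of churners caught
```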
What are precision and recall?
Precision measures how accurate the positive predictions are. In other words, when it predicts yes, how often is it correct?
Precision = True Positives / Predicted yes (True Positives + False Positives)
Recall (also known as sensitivity) measures how many of the actual positives are predicted correctly. In other words, when it’s actually yes, how often does it predict yes?
Recall = True Positives / Actual yes (True Positives + False Negatives)
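A small worked example with invented counts may help. Suppose 60 customers actually churn and the model flags 50 customers, 40 of them correctly. The sketch computes both metrics by hand and checks them against scikit-learn.

```python
from sklearn.metrics import precision_score, recall_score

# Invented labels: 60 actual churners (1) followed by 40 non-churners (0).
y_true = [1] * 60 + [0] * 40
# Predictions: 40 churners caught, 20 missed, 10 false alarms, 30 correct "no".
y_pred = [1] * 40 + [0] * 20 + [1] * 10 + [0] * 30

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)  # 40 / 50
recall = tp / (tp + fn)     # 40 / 60

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```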
According to the chart above,
Precision: RF > SVC > KNN > Logistic
Recall: RF > SVC > KNN > Logistic
We can also construct a confusion matrix and a ROC curve to dig further into the quality of our results.
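As a sketch of how these two diagnostics can be computed (without plotting), here is a random forest on synthetic data. The data set, split, and model settings are placeholders rather than the ones behind the results in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)

# The ROC curve needs scores, so use the predicted probability of class 1.
churn_prob = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, churn_prob)
auc = roc_auc_score(y_test, churn_prob)
print(f"AUC: {auc:.3f}")
```

Plotting tpr against fpr gives the ROC curve; the closer the AUC is to 1, the better the model separates churners from non-churners.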
The results above were obtained with the hold-out validation method.
Let’s also try k-fold cross validation, but only for SVC and Random Forest (RF) because, as we saw above, the precision values for logistic regression and KNN are lower (below 0.90) than those of the other two (above 0.90). Cross validation guards against an overly optimistic estimate from a single split, while still producing an out-of-sample prediction for every observation in the data set.
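A sketch of k-fold cross validation for the two models, again on synthetic data. The use of StratifiedKFold with shuffling is my choice for the example; the original analysis may have set up its folds differently.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    StratifiedKFold, cross_val_predict, cross_val_score,
)
from sklearn.svm import SVC

X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)
models = {"SVC": SVC(), "RF": RandomForestClassifier(random_state=0)}

# Mean accuracy across folds for K = 3 and K = 5.
cv_scores = {}
for k in (3, 5):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        cv_scores[(name, k)] = scores.mean()
        print(f"{name}, K={k}: {scores.mean():.3f}")

# One out-of-fold prediction per observation.
cv5 = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof_pred = cross_val_predict(models["RF"], X, y, cv=cv5)
```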
When K = 3, model accuracy for SVM and RF was 0.920 and 0.939 respectively. With K = 5, RF performance improved slightly to 0.940.
The table below (using random forest) shows the predicted probability (pred_prob), the number of observations assigned that predicted probability (count), and the observed churn rate within that group (true_prob).
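A table with these three columns can be built roughly as follows. This is a sketch on synthetic data: the forest size of 10 trees, the 5-fold out-of-fold probabilities, and the rounding to one decimal place are my assumptions about how such a table could be produced.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)

# Out-of-fold churn probability for every observation.
rf = RandomForestClassifier(n_estimators=10, random_state=0)
churn_prob = cross_val_predict(rf, X, y, cv=5, method="predict_proba")[:, 1]

# Group observations by rounded predicted probability and compare
# with how often they actually belong to the positive class.
table = (
    pd.DataFrame({"pred_prob": np.round(churn_prob, 1), "churned": y})
    .groupby("pred_prob")["churned"]
    .agg(count="size", true_prob="mean")
    .reset_index()
)
print(table)
```

If the model is well calibrated, true_prob stays close to pred_prob in each row.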
The choice of probability threshold will depend on the business context. If the business cares more about its budget (marketing expenditure), a higher threshold should be used (above 0.8 or 0.9). Otherwise, a lower threshold can be used, so the company can target a larger number of customers who are at risk of churning.
For example, let’s say the company wants to use a high threshold, such as a probability of 0.9 or above. According to our table, the random forest assigned 77 customers a churn probability of 0.9, and about 94.8% of that group actually churned.
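The trade-off can be sketched with ten invented customers: a high threshold targets few people but almost all of them are real churners, while a lower threshold reaches every churner at the cost of some wasted offers. The helper name summarize is hypothetical.

```python
import numpy as np

# Invented predicted churn probabilities and actual outcomes (1 = churned).
churn_prob = np.array([0.95, 0.92, 0.90, 0.85, 0.70, 0.60, 0.40, 0.30, 0.10, 0.05])
churned = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])

def summarize(threshold):
    """Count targeted customers and score the targeting at a threshold."""
    targeted = churn_prob >= threshold
    n_targeted = int(targeted.sum())
    hits = int(churned[targeted].sum())
    return {
        "targeted": n_targeted,
        "precision": hits / n_targeted,
        "recall": hits / int(churned.sum()),
    }

high = summarize(0.9)
low = summarize(0.5)
print("threshold 0.9:", high)
print("threshold 0.5:", low)
```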
We should also consider lift. For example, suppose we have an average churn rate of 5% (baseline), but our model has identified a segment with a churn rate of 20%. Then that segment would have a lift of 4.0 (20%/5%). This can help us get a better picture of the overall performance of our model. It can also help us set a threshold, to see which customers are worth targeting.
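The lift calculation is simple enough to sketch in a few lines. The first call reproduces the 20%/5% example; the second applies the same formula to numbers from this post (a 0.948052 churn rate in the 0.9 group against a baseline of 483 churners out of 3,333 customers). The function name lift is my own.

```python
def lift(segment_rate, baseline_rate):
    """Ratio of a segment's churn rate to the overall churn rate."""
    return segment_rate / baseline_rate

# The worked example: 20% segment churn against a 5% baseline.
example_lift = lift(0.20, 0.05)
print(example_lift)

# Numbers from this analysis: the 0.9 group against the overall churn rate.
baseline = 483 / 3333
top_group_lift = lift(0.948052, baseline)
print(round(top_group_lift, 2))
```

By this measure, customers in the 0.9 group churn roughly six and a half times as often as the average customer.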
Okay, so we have a model that can predict which customers are at risk of churning. What can we do with it? We can start thinking about how to retain the customers who have a high probability of leaving our company.
Many things have to be considered before reaching out to those customers. First, we should consider our campaign or marketing budget, and then decide on a reasonable threshold probability (or bins or deciles) of customers. Customers with a low probability of churning can be removed from re-targeting lists, which could lead to cost savings in marketing.
Among those targeted customers, we can use clustering techniques such as k-means to identify segments in our data. We should prepare tailored strategies for those different target audiences, for example by offering different kinds of discounts or other incentives. Retention marketing through channels such as Facebook, Google Ads, Twitter, or email campaigns is another good way to retain customers.
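As a sketch of the segmentation idea, here is k-means on two invented groups of at-risk customers. The two features (day minutes and customer service calls), the group shapes, and the choice of two clusters are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Invented at-risk customers: columns are [day minutes, service calls].
heavy_users = rng.normal(loc=[300, 1], scale=[20, 0.5], size=(50, 2))
frequent_callers = rng.normal(loc=[150, 5], scale=[20, 0.5], size=(50, 2))
at_risk = np.vstack([heavy_users, frequent_callers])

# Scale first so that minutes do not dominate the distance calculation.
X_scaled = StandardScaler().fit_transform(at_risk)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(X_scaled)

print(np.bincount(segments))  # number of customers in each segment
```

Each segment could then receive its own offer, such as a usage discount for heavy users and a service follow-up for customers who call support often.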