Program  

Courses
Location
Corporate
Our Students
Resources
Bootcamp Programs
Short Courses
Portfolio Courses
Bootcamp Programs

Launch your career in Data and AI through our bootcamp programs

  • Industry-leading curriculum
  • Real portfolio/industry projects
  • Career support program
  • Both Full-time & Part-time options.
Data Science & Big Data
Data Engineering

Become a data analyst through building hands-on data/business use cases

Become an AI/ML engineer by getting specialized in deep learning, computer vision, NLP, and MLOps

Become a DevOps Engineer by learning AWS, Docker, Kubernetes, IaaS, IaC (Terraform), and CI/CD

Short Courses

Improve your data & AI skills through self-paced and instructor-led courses

  • Industry-leading curriculum
  • Portfolio projects
  • Part-time flexible schedule
AI ENGINEERING
Portfolio Courses

Learn to build impressive data/AI portfolio projects that get you hired

  • Portfolio project workshops
  • Work on real industry data & AI project
  • Job readiness assessment
  • Career support & job referrals

Build data strategies and solve ML challenges for real clients

Help real clients build BI dashboard and tell data stories

Build end to end data pipelines in the cloud for real clients

Location

Choose to learn at your comfort home or at one of our campuses

Corporate Partners

We’ve partnered with many companies on corporate upskilling, branding events, talent acquisition, as well as consulting services.

AI/Data Transformations with our customized and proven curriculum

Do you need expert help on data strategies and project implementations? 

Hire Data, AI, and Engineering talents from WeCloudData

Our Students

Meet our amazing alumni working in the Data industry

Read our students’ stories on how WeCloudData have transformed their career

Resources

Check out our events and blog posts to learn and connect with like-minded professionals working in the industry

Let’s get together and enjoy the fun from treasure hunting in massive real-world datasets

Read blogs and updates from our community and alumni

Explore different Data Science career paths and how to get started

Blog

Student Blog

Preprocessing Criteo Dataset for Prediction of Click Through Rate on Ads

October 15, 2020

The blog is posted by WeCloudData’s student Amany Abdelhalim.

In this post, I will be taking you through the steps that I performed to preprocess the Criteo Data set.

Image for post

Some Aspects to Consider when Preprocessing the Data

Criteo data set is an online advertising dataset released by Criteo Labs. It contains feature values and click feedback for millions of display Ads, the data serves as a benchmark for clickthrough rate (CTR) prediction. Every Ad has features that describe the data. The data set has 40 attributes, the first attribute is the label where a value 1 represents that the Ad has been clicked on and a 0 represents it wasn’t clicked on. The attributes consist of 13 integer columns and 26 categorical columns.

I used PySpark on DataBricks notebooks to preprocess the data and train ML models.

I started by reading the Critieo data in a data frame, the Criteo data was a TSV file tab separated and zipped (.gz). I saved the Criteo data set as a Parquet file, which saves the data in columnar format that allows the data to be handled mush faster than CSV format.

One side note here to consider about Parquet files, that sometimes I noticed that overwriting an existing Parquet file could cause the data to be corrupted so I solved this situation by creating a new Parquet file rather than overwriting the old one.

I started by exploring the data in order to check whether it is balanced or not. The total number of records in the dataset is 45,840,617 records. I counted the records where the label equals 1 and where the labels equals 0 as follows:

I found that the number of clicked on Ads is only approximately 26% of the data while 74% of the data represents unclicked on Ads.

 

The count of clicked on adds (1) versus unclicked on adds (0)

The conclusion here is that the data is unbalanced. So I balanced the data by creating a new dataset that consists of all the Ad records that was clicked on added to it a random sample of the Ad records that was not clicked on, the size of that sample approximately equals the number of the Ad records that was clicked on.

So, 11,745,671 records were selected from the 34,095,179 records of not clicked on Ads.

The size of selected sample of the not clicked Ads is almost equal to the clicked on Ads

 

I explored whether the columns had null values or not. I calculated the percentage of null values in each column. I found that out of the 39 coulmns, 15 coulmns had no null values and 24 columns had null values with different percentages.

 

Out of the 24 coulmns that had null values, I found that 2 columns (one integer coulmn and one categorical coulmn) had more than 75% null values which is a high percentage. I decided to drop those two columns. Now the dataset has the label column, and 37 other coulmns which are 12 integer columns and 25 categorical columns. Five other attributes (2 integer columns and 3 categorical columns) had more than 40% null values between 44% to 45%. I decided to convert those coulmns into boolean types True/False. I treated the null/zero values as False otherwise True.

For the columns that had null values (missing values) less than 40%, I did the following: for the integer coulmns, I calculated the mean for the clicked on Ad records and for the not clicked on Ad records. I replaced the null values in each column by the mean value based on whether the missing value exists in an Ad record that was clicked on or unclicked on. For the categorical columns, I calculated the mode for the clicked on Ad records and for the unclicked Ad records. I replaced the null values in each column by the mode value based on whether the missing value exists in an Ad record that was clicked on or unclicked on.

I also looked for outliers in the integer columns and replaced those values by the mean value. I Calculated the interquartile range (IQR) for the data in each coulmn. I then considered any value that is more than 1.5 x (IQR) plus the third quartile or less than the first quartile minus 1.5 x (IQR) as an outlier.

Since the integer columns had continuous values with different ranges. I z-scored the values of the integer columns, then bucketized them.

The categorical columns in the dataset are anonymized strings, I transformed the values to indexes using the StringIndexr() transformer. Then I counted the distinct values for each categorical column.

Column name, distinct values count

I found that most of them have high dimensionality (e.g. 5,736,858 distinct values) and some considered low dimensional (e.g. 27 distinct values). I applied one hot encoding on the low dimensional categorical columns only. I combined the preprocessed features into one sparse vector using vector assembler.

For the steps described above, I needed to develop seven custom transformers. I developed transformers to perform the boolean transformation, to replace the null values in both the integer and categorical columns, to cap the outliers in the integer columns, scale the integer columns

From pyspark.ml.feature, I used the built in Bucketizer() to bucketize the integer columns, the StringIndexr() to index the categorical columns with both low and high cardinality, the OneHotEncoder() to one hot encode the categorical columns with low cardinality and the VectorAssembler() to combine relevant columns into one column that I called “features”.

I added all the transformers for all the columns to the stages of a pipeline and then I fitted and transformed the dataset using the pipeline.

I arranged the data that will be used for training and testing ML models into a dataset that contains two columns; the label and the features column. I splited the data into a testing set (25%) and a training set (75%).

I trained three ML models; a logistic Regression, Random Forest and a Linear SVM Model on the training data set using the default parameter values for those models. I transformed the testing set using the three models and got label predictions. I calculated the accuracy of each model. The accuracy was approximately 70% (Linear SVM), 72% (logistic Regression) and 74% (Random Forest). By tunning the parameter values I was able to increase the models accuracy, for instance with the Random forest model I was able to get the accuracy to jump to 78% by setting the numTrees=50 and the maxDepth=20. You can always use something like grid search in sklearn which in Spark is referred to as ParamGridBuilder() to experiment couple of values for different parameters and get the best model that acheives the highest accuracy. You just need to ensure that you experiment with the right parameter values. For instance you need to ensure that the maxDepth can not go beyond 30, more than 30 is not supported yet. You also need to insure that your cluster configuration is done properly. I had set my cluster autopilot options to enable “auto scaling”, and I realized that when training my RF model I was getting errors indicating that some jobs has failed and from the error I got an indication that may be because the workers were auto scaling this could have been the problem where the jobs kept loosing some workers. Disabling the “auto scaling” option solved my problem. You should also make sure that the machines on your cluster have enough RAM. For instance, if you are training with a value more than 20 for the maxDepth you need machines with definitely more than 32 GB. You also need to expect that the training time will increase depending on the size of the data. For instance, training a random forest model with numTrees=50 and the maxDepth=20 took me more than 2 hours on a cluster of 16 machines, m5.2xlarge (32 GB, 8 cores).

You can find my code here. I hope, you enjoyed my post and found it useful. In my next post, I will take you through the steps that I followed to use the trained models to make predictions on streaming data. You can find the next post here.

To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, click here to see the learning path. To read more posts from Amany, check out her Medium posts here.

Other blogs you might like
Student Blog
The blog is posted by WeCloudData’s student Luis Vieira. I will be showing how to build a real-time dashboard on…
by Student WeCloudData
October 21, 2020
Uncategorized
Take a central role The Bank of Canada has a vision to be “a leading central bank—dynamic, engaged and…
by Shaohua Zhang
May 21, 2020
Uncategorized
Big Data for Data Scientists – Info Session from WeCloudData…
by WeCloudData
November 9, 2019
Previous
Next

Kick start your career transformation

WeCloudData

WeCloudData is the leading data science and AI academy. Our blended learning courses have helped thousands of learners and many enterprises make successful leaps in their data journeys.

Sign up for newsletter
This field is for validation purposes and should be left unchanged.