*The blog is posted by WeCloudData’s student *Amany Abdelhalim.

Criteo data set is an online advertising dataset released by Criteo Labs. It contains feature values and click feedback for millions of display Ads, the data serves as a benchmark for clickthrough rate (CTR) prediction. Every Ad has features that describe the data. The data set has 40 attributes, the first attribute is the label where a value 1 represents that the Ad has been clicked on and a 0 represents it wasn’t clicked on. The attributes consist of 13 integer columns and 26 categorical columns.

I used PySpark on DataBricks notebooks to preprocess the data and train ML models.

I started by reading the Critieo data in a data frame, the Criteo data was a TSV file tab separated and zipped (.gz). I saved the Criteo data set as a Parquet file, which saves the data in columnar format that allows the data to be handled mush faster than CSV format.

One side note here to consider about Parquet files, that sometimes I noticed that overwriting an existing Parquet file could cause the data to be corrupted so I solved this situation by creating a new Parquet file rather than overwriting the old one.

I started by exploring the data in order to check whether it is balanced or not. The total number of records in the dataset is 45,840,617 records. I counted the records where the label equals 1 and where the labels equals 0 as follows:

I found that the number of clicked on Ads is only approximately 26% of the data while 74% of the data represents unclicked on Ads.

The conclusion here is that the data is unbalanced. So I balanced the data by creating a new dataset that consists of all the Ad records that was clicked on added to it a random sample of the Ad records that was not clicked on, the size of that sample approximately equals the number of the Ad records that was clicked on.

So, 11,745,671 records were selected from the 34,095,179 records of not clicked on Ads.

I explored whether the columns had null values or not. I calculated the percentage of null values in each column. I found that out of the 39 coulmns, 15 coulmns had no null values and 24 columns had null values with different percentages.

Out of the 24 coulmns that had null values, I found that 2 columns (one integer coulmn and one categorical coulmn) had more than 75% null values which is a high percentage. I decided to drop those two columns. Now the dataset has the label column, and 37 other coulmns which are 12 integer columns and 25 categorical columns. Five other attributes (2 integer columns and 3 categorical columns) had more than 40% null values between 44% to 45%. I decided to convert those coulmns into boolean types True/False. I treated the null/zero values as False otherwise True.

For the columns that had null values (missing values) less than 40%, I did the following: for the integer coulmns, I calculated the mean for the clicked on Ad records and for the not clicked on Ad records. I replaced the null values in each column by the mean value based on whether the missing value exists in an Ad record that was clicked on or unclicked on. For the categorical columns, I calculated the mode for the clicked on Ad records and for the unclicked Ad records. I replaced the null values in each column by the mode value based on whether the missing value exists in an Ad record that was clicked on or unclicked on.

I also looked for outliers in the integer columns and replaced those values by the mean value. I Calculated the interquartile range (IQR)** **for the data in each coulmn. I then considered any value that is more than 1.5 x (IQR) plus the third quartile or less than the first quartile minus 1.5 x (IQR) as an outlier.

Since the integer columns had continuous values with different ranges. I z-scored the values of the integer columns, then bucketized them.

The categorical columns in the dataset are anonymized strings, I transformed the values to indexes using the StringIndexr() transformer. Then I counted the distinct values for each categorical column.

I found that most of them have high dimensionality (e.g. 5,736,858 distinct values) and some considered low dimensional (e.g. 27 distinct values). I applied one hot encoding on the low dimensional categorical columns only. I combined the preprocessed features into one sparse vector using vector assembler.

For the steps described above, I needed to develop seven custom transformers. I developed transformers to perform the boolean transformation, to replace the null values in both the integer and categorical columns, to cap the outliers in the integer columns, scale the integer columns

From *pyspark.ml.feature,* I used the built in Bucketizer() to bucketize the integer columns, the StringIndexr() to index the categorical columns with both low and high cardinality, the OneHotEncoder() to one hot encode the categorical columns with low cardinality and the VectorAssembler() to combine relevant columns into one column that I called “features”.

I added all the transformers for all the columns to the stages of a pipeline and then I fitted and transformed the dataset using the pipeline.

I arranged the data that will be used for training and testing ML models into a dataset that contains two columns; the *label* and the *features* column. I splited the data into a testing set (25%) and a training set (75%).

I trained three ML models; a logistic Regression, Random Forest and a Linear SVM Model on the training data set using the default parameter values for those models. I transformed the testing set using the three models and got label predictions. I calculated the accuracy of each model. The accuracy was approximately 70% (Linear SVM), 72% (logistic Regression) and 74% (Random Forest). By tunning the parameter values I was able to increase the models accuracy, for instance with the Random forest model I was able to get the accuracy to jump to 78% by setting the numTrees=50 and the maxDepth=20. You can always use something like grid search in sklearn which in Spark is referred to as ParamGridBuilder() to experiment couple of values for different parameters and get the best model that acheives the highest accuracy. You just need to ensure that you experiment with the right parameter values. For instance you need to ensure that the maxDepth can not go beyond 30, more than 30 is not supported yet. You also need to insure that your cluster configuration is done properly. I had set my cluster autopilot options to enable “auto scaling”, and I realized that when training my RF model I was getting errors indicating that some jobs has failed and from the error I got an indication that may be because the workers were auto scaling this could have been the problem where the jobs kept loosing some workers. Disabling the “auto scaling” option solved my problem. You should also make sure that the machines on your cluster have enough RAM. For instance, if you are training with a value more than 20 for the maxDepth you need machines with definitely more than 32 GB. You also need to expect that the training time will increase depending on the size of the data. For instance, training a random forest model with numTrees=50 and the maxDepth=20 took me more than 2 hours on a cluster of 16 machines, m5.2xlarge (32 GB, 8 cores).

You can find my code **here**. I hope, you enjoyed my post and found it useful. In my next post, I will take you through the steps that I followed to use the trained models to make predictions on streaming data. You can find the next post **here**.