Student Blog

Embarrassingly Parallel Model Training on Spark — Pandas UDF

June 1, 2020

The blog is posted by WeCloudData’s Big Data course student Udayan Maurya.

Spark is one of the most popular tool to perform map-reduce tasks efficiently on large scale distributed data-sets. Additionally, Spark comes with MLlib package to perform Machine Learning on distributed data. On the flip side Python has very mature libraries: Numpy, Pandas, Scikit-Learn, etc. to perform data analysis and machine learning.

Spark’s native ML library though powerful generally lack in features. Python’s libraries usually provide more options to tinker with model parameters, resulting in better tuned models. However, Python is bound on single compute machine and one contiguous block of memory, which makes it infeasible to be used for training on large scale distributed data-set.

Pandas — UDF (Get best of both worlds)

Pyspark performs computation through following data-flow:

data flow chart

We trigger our spark-context program on Master(“Local” in image here) node. The Spark program is executed on JVMs. Whenever during execution of code Spark encounters Python functions JVMs starts Python processes on working clusters and Python takes on the assigned task. Necessary data required for computation is pickled to Python and results are pickled back to JVM. Below is schematic data-flow for Pandas UDFs.


Pandas-UDF have similar data-flow. Additionally, to make the process more performance efficient “Arrow” (Apache Arrow is a cross-language development platform for in-memory data.) is used. Arrow allows data transfer from JVMs to Python processes in vectorized form, resulting in better performance in Python processes.

There is ample of reference material available for data munging, data manipulation, grouped data aggregation and model prediction/execution using Pandas UDFs. This tutorial serves the purpose of providing a guide to perform “Embarrassingly Parallel Model Training” using Pandas UDF.


  • Spark cluster (Databricks free community edition can be used)
  • Python Libraries: Numpy, Pandas, Scikit-Learn, and Joblib

*I have used Databricks environment for this tutorial

1. Create Directory to store required files

To operate on Databricks File System we use dbutils commands. Create a temporary folder in your DBFS.

This creates a “temporary” directory in “dbfs:/FileStore”

2. Import required Libraries

3. Create dummy data for training

We created two data-sets df1 and df2 to train models in parallel.

  • df1: Y = 2.5 X + random noise
  • df2: Y = 3.0 X + random noise

We are saving the data-sets in temporary folder from step 1 in zipped format. Zipped files are read in Saprk in one single partition. We are using this feature to keep different data-sets in separate partition, so we can train individual models on each of them.


4. Read files as Spark DataFrame


As files are in zipped format we get one partition for each file. Resulting in total two partitions of SparkDF, instead of default 8. This facilitates in keeping all the data in one worker node and one contiguous memory. Therefore, Machine Learning model can easily be trained using Python libraries on each partition to produce individual models.

5. Pandas UDF function for Model Training

The UDF function here trains Scikit-Learn model Spark Data-frame partitions and pickle individual models to “file:/databricks/driver”. Unfortunately, we cannot pickle directly to “FileStore” in Databricks environment. But, can later move the model files to “FileStore”!

6. Train the models

Create “sparkDF2” which call “train_lm_pandas_udf” on “sparkDF”, created in step 4.

To initiate model training we need to trigger action on sparkDF2. Make sure the action used to trigger touches all partitions otherwise training will only happen for partitions touched.

Spark-UI: Job DAG

As evident form Job DAG, the columns are navigated to Python using Arrow, to complete the model training. Additionally, we can see that there is no Shuffle Write/Read due to Embarrassingly parallel nature of the job.

Spark-UI: Stages Event Timeline

Above displays event timeline for stages of execution of spark job. Parallel execution reduces the time to train by 1/2! (which can be improved by improving on size of your cluster and training more models).

7. Testing the models

As we have trained models on dummy data, we have good idea of what model coefficients should be. So we can simply load model files and see if model coefficients make sense.


Trained Model Files


Evident from results above. Model coefficient for df1 is 2.55, which is close to our equation for df1: Y = 2.5 X + random noise. And for df2 it is 2.93 consistent with df2: Y = 3.0 X + random noise.

Full notebook and code and be downloaded from:


Spark is one of the most popular tool to perform map-reduce tasks efficiently on large scale distributed data-sets…

Review one use case of parallel model training here:

Let’s Read Customer Reviews (actually-make machine do it!)

Making sense of Customer Reviews

To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, click here to see the learning path. To read more posts from Udayan, check out his Medium posts here.
Join our programs and advance your career in Data EngineeringData Science

"*" indicates required fields

This field is for validation purposes and should be left unchanged.
Other blogs you might like
Student Blog
The blog is posted by WeCloudData’s Big Data course student Abhilash Mohapatra. This story represents an easy path to…
by Student WeCloudData
April 30, 2020
Job Market
Hello, data enthusiasts and aspiring data scientists! I’m thrilled to present a  comprehensive exploration into the North American data…
by Cherice
December 22, 2023
Student Blog
The blog is posted by WeCloudData’s  student Sneha Mehrin. Overview on how to ingest stack overflow data using Kinesis…
by Student WeCloudData
October 28, 2020

Kick start your career transformation