Student Blog

An Introduction To Spark and Its Behavior.

October 7, 2020

The blog is posted by WeCloudData’s Big Data course student Abhilash Mohapatra.

Checklist Followed:

  1. MapReduce, Hadoop and Spark.
  2. Spark Architecture.
  3. Spark in Cluster.
  4. Predicate Pushdown, Broadcasting and Accumulators.

1. MapReduce, Hadoop and Spark

For this section, let the table below represent data stored in S3 that is to be processed.

Loan Payment Table

The table below shows the Map and the Shuffle + Reduce phases in green and blue respectively. As shown, the data is never read into the driver program.

Map Reduce in Hadoop Cluster.

In step 0, the input data is read from the source. In the first step, the data is loaded onto the various nodes of the cluster. In the second, the data is mapped into key-value pairs based on the required columns. In the third, the key-value pairs are shuffled across nodes so that the final count can be calculated. In the fourth, the key-value pairs are reduced to total counts. Finally, the counts are loaded into the driver.

The above technique of processing data in parallel is called MapReduce. The technique's algorithm can be written in any language. Hadoop already implements Map and Reduce, along with the logic to manage clusters for data storage.
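The steps above can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop's implementation. The loan records, the `status` column and the two-node split are made-up assumptions, since the original table is not reproduced here.

```python
from collections import defaultdict

# Hypothetical loan payment records (illustrative only).
records = [
    {"loan_id": 1, "status": "PAID"},
    {"loan_id": 2, "status": "LATE"},
    {"loan_id": 3, "status": "PAID"},
    {"loan_id": 4, "status": "PAID"},
    {"loan_id": 5, "status": "LATE"},
]

def load(rows, num_nodes):
    """Step 1: split the input across the nodes of the cluster."""
    return [rows[i::num_nodes] for i in range(num_nodes)]

def map_phase(partition):
    """Step 2: emit a (key, 1) pair for the column of interest."""
    return [(row["status"], 1) for row in partition]

def shuffle(mapped_partitions):
    """Step 3: bring all pairs with the same key together."""
    grouped = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Step 4: reduce each key's values to a total count."""
    return {key: sum(values) for key, values in grouped.items()}

partitions = load(records, num_nodes=2)
mapped = [map_phase(p) for p in partitions]
# Final step: only the small dictionary of counts reaches the "driver".
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Only `counts` is returned to the calling code, which mirrors the point that the full data set never has to be read into the driver.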

During Map, Shuffle and Reduce, the MapReduce algorithm writes the results of each intermediate step to disk. Because writing to disk at every intermediate step is expensive, Spark keeps the intermediate data in RAM instead. Along with RAM, Spark uses the concepts of Lineage, RDD, Transformation and Action.
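To make the idea of lineage, transformation and action a little more concrete, here is a toy sketch in plain Python. `LazyDataset` is a hypothetical stand-in for an RDD, not Spark's actual class: transformations only record a step in the lineage, and nothing is computed until an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: remembers its lineage, computes only on an action."""

    def __init__(self, source, lineage=()):
        self._source = source          # the original input data
        self.lineage = tuple(lineage)  # ordered (name, function) steps

    # Transformations: return a new dataset with a longer lineage; nothing runs yet.
    def map(self, fn):
        return LazyDataset(self._source, self.lineage + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self.lineage + (("filter", fn),))

    # Action: replay the lineage in memory and return the result to the caller.
    def collect(self):
        data = list(self._source)
        for name, fn in self.lineage:
            if name == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

payments = LazyDataset([100, 250, 40, 900])  # made-up payment amounts
large_doubled = payments.filter(lambda x: x >= 100).map(lambda x: x * 2)

steps = [name for name, _ in large_doubled.lineage]
result = large_doubled.collect()  # the only point where work happens
print(steps, result)
```

Because each dataset carries the full list of steps that produced it, a lost result could be rebuilt by replaying the lineage from the source, which is the role lineage plays in this sketch.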

2. Spark Architecture

Below are a few of the basic components of the Spark architecture.

SparkSession vs SparkContext
Group By in Spark
Data Lineage
DAG Representation
Spark Output on Action.

3. Spark in Cluster

A Spark application can be submitted to the cluster using spark-submit. Along with the application program, spark-submit accepts the Spark configuration parameters required to run the application successfully in the cluster, passed with the --conf parameter. Once spark-submit triggers an application, SparkSession, with the help of SparkContext and the Cluster Manager, creates RDDs, performs transformations and manages the cluster for application execution.
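As a rough illustration of how such a command is put together, the sketch below assembles a spark-submit command line in Python without running it. The application file `loan_counts.py`, the `yarn` master and the configuration values are assumptions made for the example. `--master` and `--conf key=value` are standard spark-submit options.

```python
import shlex

def build_spark_submit(app_path, master, conf):
    """Assemble a spark-submit command line as a list of arguments."""
    cmd = ["spark-submit", "--master", master]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]  # one --conf per setting
    cmd.append(app_path)  # the application comes after the options
    return cmd

cmd = build_spark_submit(
    app_path="loan_counts.py",  # hypothetical application file
    master="yarn",              # assumed cluster manager
    conf={
        "spark.executor.memory": "2g",
        "spark.executor.instances": "4",
    },
)
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run` on a machine where Spark is installed. Here it is only printed.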

Transformation + Action in a submitted Spark Application

Each submitted application consists of a number of Jobs. The number of Jobs triggered = the number of Actions present in the submitted application. Each Job (a set of Transformations) is further divided into Stages. Each shuffle of data across cluster nodes results in a new Stage. Each Stage consists of Tasks. The number of Tasks = the number of Partitions the RDD is divided into during the operation.

In the above example, Number of Jobs = 1 = Number of Actions. Though 5 jobs were planned, only one was executed and the rest were skipped. Number of Stages = 2: one for mapping the RDD from the input file and the other for the Group By, which shuffles the data. Number of Tasks = 1 = Number of Partitions.
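The counting rules can be written down as a small Python sketch. It is a simplification that assumes a single action at the end of the application and counts one stage for the initial mapping plus one for every shuffle. The `Op` class and the operation names are hypothetical and only describe the example above.

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    kind: str               # "transformation" or "action"
    shuffles: bool = False  # True for wide operations such as a group by

def summarize(ops, num_partitions):
    """Count jobs, stages and tasks per stage for a single-action application."""
    jobs = sum(1 for op in ops if op.kind == "action")
    shuffle_count = sum(1 for op in ops if op.shuffles)
    stages = shuffle_count + 1 if jobs else 0  # every shuffle starts a new stage
    return {"jobs": jobs, "stages": stages, "tasks_per_stage": num_partitions}

application = [
    Op("map input file to RDD", "transformation"),
    Op("group by", "transformation", shuffles=True),
    Op("collect", "action"),
]
summary = summarize(application, num_partitions=1)
print(summary)
```

With one action, one shuffle and one partition, the sketch reports one job, two stages and one task per stage.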

4. Predicate Pushdown, Broadcasting and Accumulators

Predicate Pushdown

For parallel processing, Spark uses shared variables. When the driver sends a task to the executors, a copy of each shared variable goes to every node so that it can be used while performing the task. Spark uses two types of shared variables: broadcast variables and accumulators.

Broadcasting Variable.
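The two kinds of shared variable can be imitated in plain Python to show the difference in how they are used. The `Broadcast` and `Accumulator` classes below are toy stand-ins written for this example, not Spark's classes, and the status codes are made up. In PySpark the corresponding objects are created with `SparkContext.broadcast` and `SparkContext.accumulator`.

```python
from types import MappingProxyType

class Broadcast:
    """Read-only lookup value that the driver ships once to every executor."""
    def __init__(self, value):
        self.value = MappingProxyType(dict(value))  # tasks cannot modify it

class Accumulator:
    """Counter that tasks only add to; the driver reads the total."""
    def __init__(self, initial=0):
        self.value = initial

    def add(self, amount):
        self.value += amount

def run_task(partition, lookup, bad_rows):
    """Translate status codes using the broadcast table, counting unknown codes."""
    out = []
    for code in partition:
        if code in lookup.value:
            out.append(lookup.value[code])
        else:
            bad_rows.add(1)
    return out

status_names = Broadcast({"P": "PAID", "L": "LATE"})  # hypothetical lookup table
bad_rows = Accumulator(0)
partitions = [["P", "L", "X"], ["P", "?"]]            # two made-up partitions

results = [run_task(p, status_names, bad_rows) for p in partitions]
print(results, bad_rows.value)
```

The broadcast value is only read inside the tasks, while the accumulator is only written inside the tasks and read afterwards by the driver code.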

Thanks for reading. Questions and Suggestions are welcome.



To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, view the learning path. To read more posts from Abhi, check out his Medium posts here.
