- MapReduce, Hadoop and Spark.
- Spark Architecture.
- Spark in Cluster.
- Predicate Pushdown, Broadcasting and Accumulators.
1. MapReduce, Hadoop and Spark
For this section, let the table below represent data stored in S3 that is to be processed.
The table below shows the Map phase in green and the Shuffle + Reduce phases in blue. As shown, the input data is never read into the driver program.
In step 0, the input data is read from the source. First, the data is loaded onto the nodes of the cluster. Second, the data is mapped into key-value pairs based on the required columns. Third, the key-value pairs are shuffled across nodes so that all pairs with the same key end up on the same node. Fourth, the key-value pairs are reduced to total counts. Finally, only the counts are loaded into the driver.
This technique of processing data in parallel is called MapReduce. The algorithm can be written in any language. Hadoop already provides an implementation of Map and Reduce, along with the logic to manage clusters for data storage.
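The steps above can be imitated in plain Python. This is only a minimal single-process sketch: the `records` list, the function names and the two "nodes" are assumptions made for illustration, not the table used in this article or Hadoop's implementation.

```python
from collections import defaultdict

# Hypothetical input rows, standing in for the data stored in S3.
records = ["apple", "banana", "apple", "cherry", "banana", "apple"]


def split_into_partitions(rows, num_nodes):
    """Step 1: spread the rows over the nodes of the cluster (round-robin)."""
    return [rows[i::num_nodes] for i in range(num_nodes)]


def map_phase(partition):
    """Step 2: turn every row into a (key, 1) pair."""
    return [(row, 1) for row in partition]


def shuffle_phase(mapped_partitions, num_nodes):
    """Step 3: send all pairs that share a key to the same node."""
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    for partition in mapped_partitions:
        for key, value in partition:
            # Python's string hash changes between runs, so the node a key
            # lands on may differ, but the final counts do not.
            nodes[hash(key) % num_nodes][key].append(value)
    return nodes


def reduce_phase(node):
    """Step 4: collapse the values of each key into a total count."""
    return {key: sum(values) for key, values in node.items()}


def map_reduce_count(rows, num_nodes=2):
    partitions = split_into_partitions(rows, num_nodes)
    mapped = [map_phase(p) for p in partitions]
    shuffled = shuffle_phase(mapped, num_nodes)
    # Final step: only the small per-key totals reach the "driver".
    totals = {}
    for node in shuffled:
        totals.update(reduce_phase(node))
    return totals


counts = map_reduce_count(records)
print(counts)
```

Because every key is reduced on exactly one node, merging the per-node results in the driver never has to add two partial counts together.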
During Map, Shuffle and Reduce, MapReduce writes the result of every intermediate step to disk. Because writing to disk at each step is expensive, Spark instead keeps intermediate data in RAM wherever it can. Alongside in-memory storage, Spark relies on the concepts of lineage, RDDs, transformations and actions.
2. Spark Architecture
Below are a few of the basic components of the Spark architecture.
a. Cluster and Cluster Manager: As shown in the above representation, each cluster has a master (driver) node and worker nodes. Coordination between the master and the workers, along with their operation, is handled by an application called the cluster manager. Spark has its own cluster manager and can also integrate with YARN, Mesos and Kubernetes. In AWS EMR, the Spark cluster is managed by YARN.
b. Driver and Executor: Each Spark application is associated with a single driver and multiple executor programs. The driver is the main controller and, together with the cluster manager, distributes tasks among the executors for data processing. The driver program can run either on the master node or on a worker node, which is controlled by the deploy-mode parameter. In an environment with multiple users, it is advisable to submit Spark applications in cluster mode to avoid overloading the master node. The number of CPU cores and the amount of memory given to the driver and the executors are controlled by parameters such as spark.driver.cores, spark.driver.memory, spark.executor.cores and spark.executor.memory. For more details, visit An Introduction to Data Pipeline with Spark in AWS.
c. SparkSession and SparkContext: Both SparkSession and SparkContext are entry points to the Spark cluster, as shown above. SparkContext was the entry point into the Spark JVM before 2.0. SparkSession was introduced in 2.0 and encapsulates the Spark, Hive and SQL contexts. In older Spark releases, more than one SparkContext could be run in the same JVM using the parameter spark.driver.allowMultipleContexts, but this makes the JVM less stable, and a crash in one SparkContext may bring down the others. From 2.0 onwards, SparkSession is the entry point to the JVM. Multiple SparkSessions can be launched, each with its own configuration, tables and views, and they work independently. As shown in the above figure, all SparkSessions share a single SparkContext, which works with the cluster manager to allocate and manage executors for the individual sessions.
d. Transformation and Action: Spark uses lazy evaluation. For all calculations or mappings, called transformations, Spark only builds a data flow plan called a DAG. The data flow is triggered and executed according to that DAG only when an action (such as show, collect, count or write) is called. Note that cache is not an action: it only marks the data to be kept once an action has computed it. If a failure occurs, Spark follows the DAG to recompute the lost data. For the example explained in the MapReduce section, below is the Spark code.
Once the above code is executed, Spark only creates the plan, or data lineage, which can be inspected with the explain method as shown below.
Once an action such as show is executed, the plan is run and data movement takes place, producing the DAG diagram and output below.
As seen above, a total of 5 intermediate RDDs are created for the final query. These 5 RDDs can be compared with the 5 steps shown for MapReduce in a Hadoop cluster. While calculating the final answer, MapReduce writes all intermediate results to disk, whereas Spark works on RDDs and keeps the data in RAM until the final result is calculated.
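The lazy behaviour described in this section can be imitated with a small plain-Python class. This is a sketch of the idea only, not Spark's implementation: `LazyDataset`, `read_source` and the sample numbers are hypothetical names and data chosen for illustration.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, lineage=()):
        self._source = source            # function that produces the input rows
        self._lineage = tuple(lineage)   # ordered (name, step) transformation records

    # Transformations: return a new dataset with a longer lineage, compute nothing.
    def map(self, fn):
        step = ("map", lambda rows: [fn(r) for r in rows])
        return LazyDataset(self._source, self._lineage + (step,))

    def filter(self, fn):
        step = ("filter", lambda rows: [r for r in rows if fn(r)])
        return LazyDataset(self._source, self._lineage + (step,))

    def explain(self):
        """Describe the plan without executing it."""
        return " -> ".join(["source"] + [name for name, _ in self._lineage])

    # Action: replay the whole lineage, starting from the source.
    def collect(self):
        rows = self._source()
        for _, step in self._lineage:
            rows = step(rows)
        return rows


reads = []  # grows by one entry every time the source is actually read


def read_source():
    reads.append("read")
    return [1, 2, 3, 4, 5, 6]


plan = LazyDataset(read_source).filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
print(plan.explain())      # only the plan exists; the source has not been read
print(len(reads))
result = plan.collect()    # the action triggers the read and both transformations
print(result, len(reads))
```

Calling `collect` a second time replays the lineage from the source again, which is the same mechanism that lets lost data be recomputed after a failure.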
3. Spark in Cluster
A Spark application is submitted to the cluster using spark-submit. Along with the application program, spark-submit accepts the Spark configuration parameters needed to run the application successfully in the cluster, passed through the conf parameter. Once spark-submit launches an application, the SparkSession, with the help of the SparkContext and the cluster manager, creates RDDs, performs transformations and manages the cluster for application execution.
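As a rough illustration of how the deploy mode and the conf parameters fit together, the sketch below assembles a spark-submit command line in Python without running it. The helper name `build_spark_submit`, the file name `my_app.py` and all resource values are assumptions for illustration; suitable values depend on the cluster.

```python
import shlex


def build_spark_submit(app_path, deploy_mode="cluster", master="yarn", conf=None):
    """Assemble a spark-submit command line as a list of arguments."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    command = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    # Every configuration entry becomes its own "--conf key=value" pair.
    for key, value in (conf or {}).items():
        command += ["--conf", f"{key}={value}"]
    command.append(app_path)
    return command


command = build_spark_submit(
    "my_app.py",  # hypothetical application file
    conf={
        # Illustrative values only.
        "spark.driver.cores": 1,
        "spark.driver.memory": "2g",
        "spark.executor.cores": 2,
        "spark.executor.memory": "4g",
    },
)
print(shlex.join(command))
```

Keeping the command as a list until the last moment avoids quoting problems if it is later passed to something like `subprocess.run`.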
Each submitted application consists of a number of jobs. Number of jobs triggered = number of actions in the submitted application. Each job (a set of transformations) is further divided into stages. Each shuffle of data across cluster nodes starts a new stage. Each stage consists of tasks. Number of tasks in a stage = number of partitions the RDD is divided into during that operation.
In the above example, number of jobs = 1 = number of actions. Though 5 jobs were planned, only one was executed and the rest were skipped. Number of stages = 2: one for mapping the RDD from the input file and one for the group by, which shuffles the data. Number of tasks = 1 = number of partitions.
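The counting rules above can be written down as a small function. This is a simplified sketch: it assumes a single linear pipeline ending in one action, treats every shuffle as starting exactly one new stage, and gives every stage the same number of partitions. The operation names in `pipeline` are hypothetical labels, not the article's actual code.

```python
def summarize_execution(operations, num_partitions):
    """Count jobs, stages and tasks per stage for a linear list of operations.

    Each operation is a (name, kind) pair where kind is "narrow" (no data
    movement), "shuffle" (data moves between nodes) or "action" (triggers a job).
    """
    allowed = {"narrow", "shuffle", "action"}
    for name, kind in operations:
        if kind not in allowed:
            raise ValueError(f"unknown kind {kind!r} for operation {name!r}")
    num_actions = sum(1 for _, kind in operations if kind == "action")
    num_shuffles = sum(1 for _, kind in operations if kind == "shuffle")
    return {
        "jobs": num_actions,                # one job per action
        "stages": num_shuffles + 1,         # each shuffle starts a new stage
        "tasks_per_stage": num_partitions,  # one task per partition
    }


# Hypothetical pipeline resembling the group-by count example.
pipeline = [
    ("read input file", "narrow"),
    ("select columns", "narrow"),
    ("group by key", "shuffle"),
    ("count", "narrow"),
    ("show", "action"),
]
summary = summarize_execution(pipeline, num_partitions=1)
print(summary)
```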
4. Predicate Pushdown, Broadcasting and Accumulators
a. Predicate Pushdown: Whenever a filter is applied to the data, the filter logic is pushed down to the data source and applied while reading, instead of pulling all the data and filtering it afterwards. This concept of pushing the filter down to the data source is called predicate pushdown.
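The effect of predicate pushdown can be seen with a toy data source that counts how many rows leave it. This is a plain-Python sketch of the idea, not how Spark or any real source implements it; the `Source` class, the `is_canada` filter and the rows are hypothetical.

```python
class Source:
    """Toy data source that counts how many rows it sends to the reader."""

    def __init__(self, rows):
        self._rows = rows
        self.rows_sent = 0

    def scan(self, predicate=None):
        """Return rows; when a predicate is given, apply it inside the source."""
        selected = [r for r in self._rows if predicate is None or predicate(r)]
        self.rows_sent += len(selected)
        return selected


# Hypothetical data.
rows = [
    {"country": "CA", "amount": 10},
    {"country": "US", "amount": 25},
    {"country": "CA", "amount": 40},
    {"country": "FR", "amount": 5},
]


def is_canada(row):
    return row["country"] == "CA"


# Without pushdown: pull every row, then filter on the reader's side.
plain_source = Source(rows)
filtered_late = [r for r in plain_source.scan() if is_canada(r)]

# With pushdown: the source applies the filter before sending anything.
pushdown_source = Source(rows)
filtered_early = pushdown_source.scan(is_canada)

print(plain_source.rows_sent, pushdown_source.rows_sent)
```

Both paths give the same rows; the difference is only in how much data has to travel from the source.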
For parallel processing, Spark uses shared variables. When the driver sends a task to the executors, a copy of each shared variable goes to every node so that it can be used while performing the task. Spark provides two types of shared variables, described below.
b. Broadcast Variable: A broadcast variable keeps a read-only copy of a dataset on each node, which avoids shuffling that data and so improves efficiency. When joining a small file with a large one, the small file can be broadcast to every node so that no data has to be shuffled between cluster nodes for the join.
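A broadcast join can be sketched in plain Python by giving every partition of the large dataset its own copy of the small lookup table. The function `broadcast_join` and the order and country rows are hypothetical and only illustrate the idea; in PySpark the corresponding tools are SparkContext.broadcast and the pyspark.sql.functions.broadcast join hint.

```python
def broadcast_join(large_partitions, small_table, key):
    """Inner-join each partition of the large data with a copied lookup table."""
    # Built once on the "driver" from the small dataset.
    lookup = {row[key]: row for row in small_table}
    joined = []
    for partition in large_partitions:
        node_copy = dict(lookup)  # the read-only copy shipped to this node
        # Rows are matched where they already live, so nothing is shuffled.
        joined.append(
            [{**row, **node_copy[row[key]]} for row in partition if row[key] in node_copy]
        )
    return joined


# Hypothetical large dataset, already split over two nodes.
orders = [
    [{"order_id": 1, "code": "CA"}, {"order_id": 2, "code": "US"}],
    [{"order_id": 3, "code": "CA"}, {"order_id": 4, "code": "FR"}],
]
# Hypothetical small dataset that fits comfortably on every node.
countries = [
    {"code": "CA", "name": "Canada"},
    {"code": "US", "name": "United States"},
]

joined = broadcast_join(orders, countries, key="code")
print(joined)
```

The output keeps the partition layout of the large dataset, which is the point: only the small table moved.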
c. Accumulator: An accumulator is a shared variable used for accumulative operations such as counters and sums. Worker nodes can only add to an accumulator and cannot read it; only the driver can read the accumulated global value. Accumulator variables are sent to the worker nodes by the driver each time a task is triggered.
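The add-only behaviour of an accumulator can be sketched with two small classes: one for what a task sees and one for what the driver sees. This is a hypothetical plain-Python model, not Spark's API; `Accumulator`, `TaskAccumulator`, `run_task` and the sample rows are assumptions for illustration.

```python
class TaskAccumulator:
    """What a task receives: it can add, but has no way to read the global total."""

    def __init__(self):
        self._local = 0

    def add(self, amount):
        self._local += amount


class Accumulator:
    """Driver-side accumulator that merges the updates reported by tasks."""

    def __init__(self, initial=0):
        self._total = initial

    def task_copy(self):
        # A fresh copy is handed out each time a task is started.
        return TaskAccumulator()

    def merge(self, task_accumulator):
        self._total += task_accumulator._local

    @property
    def value(self):
        # Only the driver holds this object, so only the driver can read it.
        return self._total


def run_task(partition, bad_rows):
    """Sum the numeric rows of one partition and count the malformed ones."""
    total = 0
    for row in partition:
        if row.isdigit():
            total += int(row)
        else:
            bad_rows.add(1)
    return total


# Hypothetical input, split into two partitions.
partitions = [["1", "2", "x"], ["4", "", "6"]]

bad_rows = Accumulator()
grand_total = 0
for partition in partitions:
    task_acc = bad_rows.task_copy()      # sent to the worker with the task
    grand_total += run_task(partition, task_acc)
    bad_rows.merge(task_acc)             # reported back when the task finishes

print(grand_total, bad_rows.value)
```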
Thanks for reading. Questions and suggestions are welcome.
To read more posts from Abhi, check out his Medium posts.