An Introduction To Spark and Its Behavior.

October 7, 2020

This blog is posted by WeCloudData’s Big Data course student Abhilash Mohapatra.

Checklist Followed:

  1. MapReduce, Hadoop and Spark.
  2. Spark Architecture.
  3. Spark in Cluster.
  4. Predicate Pushdown, Broadcasting and Accumulators.

1. MapReduce, Hadoop and Spark

For this section, assume the table below represents data stored in S3 that needs to be processed.

Loan Payment Table

The table below shows the Map step in green and the Shuffle + Reduce steps in blue. As shown, the data itself is never read into the driver program.

Map Reduce in Hadoop Cluster.

In step 0, the input data is read from the source. First, the data is loaded onto the various nodes of the cluster. Second, the data is mapped into key-value pairs based on the required columns. Third, the key-value pairs are shuffled across nodes so the final count can be calculated. Fourth, the key-value pairs are reduced to total counts. Finally, the counts are loaded into the driver.

This technique of processing data in parallel is called MapReduce. The algorithm can be implemented in any language. Hadoop already provides an implementation of Map and Reduce, along with the logic to manage clusters for data storage.
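The steps above can be sketched in plain Python. This is only an illustration of the Map, Shuffle and Reduce phases, not Hadoop code; the loan records, the `num_nodes` parameter and all function names are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical loan payment records: (loan_id, payment_status).
records = [
    ("L1", "PAID"),
    ("L2", "LATE"),
    ("L3", "PAID"),
    ("L4", "PAID"),
    ("L5", "LATE"),
    ("L6", "DEFAULT"),
]

def load(rows, num_nodes):
    """Step 1: split the input across the nodes of the cluster."""
    return [rows[i::num_nodes] for i in range(num_nodes)]

def map_phase(partition):
    """Step 2: emit a (key, 1) pair for the column of interest."""
    return [(status, 1) for _, status in partition]

def shuffle(mapped_partitions):
    """Step 3: bring all pairs with the same key together."""
    grouped = defaultdict(list)
    for partition in mapped_partitions:
        for key, value in partition:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Step 4: reduce each key's values to a total count."""
    return {key: sum(values) for key, values in grouped.items()}

partitions = load(records, num_nodes=2)
mapped = [map_phase(p) for p in partitions]
# Final step: only the small dictionary of counts reaches the "driver".
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Note that the driver only ever sees `counts`; the raw records stay in the per-node partitions throughout.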

During Map, Shuffle and Reduce, the MapReduce algorithm writes the results of each intermediate step to disk. Because writing to disk at every intermediate step is expensive, Spark keeps intermediate data in RAM wherever it can. Along with RAM, Spark relies on the concepts of Lineage, RDD, Transformation and Action.
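To make the idea of Lineage, Transformation and Action more concrete, here is a toy sketch in plain Python. `LazyDataset` is a made-up class, not Spark's RDD API; it only assumes that transformations are recorded as a lineage and that nothing is computed until an action such as `collect` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, lineage=()):
        self._source = source
        self.lineage = tuple(lineage)  # ordered (name, function) steps

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self.lineage + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self.lineage + (("filter", fn),))

    # Action: replay the lineage over the source and return the result.
    def collect(self):
        data = iter(self._source)
        for name, fn in self.lineage:
            data = map(fn, data) if name == "map" else filter(fn, data)
        return list(data)


calls = []  # records every element the map function actually touches

def double(x):
    calls.append(x)
    return x * 2

ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # still 0: only the lineage exists so far
result = ds.collect()              # the action triggers the computation
print([name for name, _ in ds.lineage], result)
```

Because each dataset carries its full lineage, a lost result could be recomputed from the source by replaying the same steps, which is the intuition behind lineage-based recovery.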

2. Spark Architecture

Below are a few of the basic components of the Spark architecture.

SparkSession vs SparkContext
Group By in Spark
Data Lineage
DAG Representation
Spark Output on Action.

3. Spark in Cluster

A Spark application can be submitted to the cluster using spark-submit. Along with the application program, spark-submit accepts the Spark configuration parameters the application needs to run successfully on the cluster, passed through the --conf option. Once spark-submit launches an application, the SparkSession, with the help of the SparkContext and the cluster manager, creates RDDs, performs transformations and manages the cluster for the application's execution.
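As a rough illustration of what such a submission looks like, the sketch below assembles a spark-submit command line in Python without running it. The helper `build_spark_submit`, the file name `loan_counts.py`, the `yarn` master and the chosen configuration values are assumptions for the example; `--master` and `--conf key=value` are standard spark-submit options.

```python
import shlex

def build_spark_submit(app_path, master, conf=None, app_args=()):
    """Assemble a spark-submit argument list; each setting becomes --conf key=value."""
    cmd = ["spark-submit", "--master", master]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_path)        # the application program comes after the options
    cmd += list(app_args)       # anything after it is passed to the application
    return cmd

cmd = build_spark_submit(
    "loan_counts.py",           # hypothetical application file
    master="yarn",              # assumed cluster manager
    conf={
        "spark.executor.memory": "2g",
        "spark.sql.shuffle.partitions": "8",
    },
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell on a machine where Spark is installed; the list form in `cmd` is what one would hand to `subprocess.run` instead.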

Transformation + Action in a submitted Spark Application

Each submitted application consists of a number of Jobs. The number of Jobs triggered = the number of Actions in the submitted application. Each Job (a set of Transformations) is further divided into Stages. Each shuffle of data across cluster nodes starts a new Stage. Each Stage consists of Tasks. The number of Tasks = the number of partitions the RDD is divided into during the operation.
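These counting rules can be written down as a small Python sketch. The step list and the `summarize` function are hypothetical, and the sketch assumes a simple linear application with a single action and a partition count that stays the same in every stage; real Spark plans can be more involved.

```python
# Hypothetical description of an application: each step is (name, kind),
# where kind is "narrow" (no shuffle), "shuffle", or "action".
application = [
    ("read_file", "narrow"),
    ("map_to_pairs", "narrow"),
    ("group_by", "shuffle"),
    ("count", "action"),
]

def summarize(steps, num_partitions):
    """Apply the rules: jobs = actions, each shuffle opens a new stage,
    tasks per stage = number of partitions."""
    jobs = sum(1 for _, kind in steps if kind == "action")
    shuffles = sum(1 for _, kind in steps if kind == "shuffle")
    stages = shuffles + 1  # the first stage plus one new stage per shuffle
    return {"jobs": jobs, "stages": stages, "tasks_per_stage": num_partitions}

summary = summarize(application, num_partitions=1)
print(summary)
```

With one action, one shuffle and one partition, the sketch reports one job, two stages and one task per stage.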

In the above example, Number of Jobs = 1 = Number of Actions. Though 5 jobs were planned, only one was executed and the rest were skipped. Number of Stages = 2: one for mapping the RDD from the input file and the other created by the shuffle during the Group By. Number of Tasks = 1 = Number of Partitions.

4. Predicate Pushdown, Broadcasting, Accumulator

Predicate Pushdown

For parallel processing, Spark uses shared variables. When the driver sends a task to the executors, a copy of each shared variable goes to every node so that it can be used while performing the task. Spark provides two types of shared variables: broadcast variables and accumulators.

Broadcasting Variable.
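The behavior of the two kinds of shared variables can be imitated in plain Python. This is a simulation only: the `Accumulator` class, `run_task` and the sample data are made up for illustration. In PySpark the real counterparts are created with `SparkContext.broadcast` and `SparkContext.accumulator`.

```python
import copy

class Accumulator:
    """Toy accumulator: tasks may only add to it; the driver reads the total."""

    def __init__(self, initial=0):
        self.value = initial

    def add(self, amount):
        self.value += amount

def run_task(partition, lookup, bad_rows):
    """One task: uses its read-only copy of the lookup and reports unknown codes."""
    out = []
    for code in partition:
        if code in lookup:
            out.append(lookup[code])
        else:
            bad_rows.add(1)  # tasks write to the accumulator, never read it
    return out

# Driver side
status_names = {"P": "PAID", "L": "LATE"}        # small lookup table to "broadcast"
partitions = [["P", "L", "X"], ["P", "?", "L"]]  # hypothetical partitioned data
bad_rows = Accumulator()

results = []
for partition in partitions:
    node_copy = copy.deepcopy(status_names)      # each node receives its own copy
    results.extend(run_task(partition, node_copy, bad_rows))

print(results, bad_rows.value)
```

The lookup table travels from the driver to the tasks, while the accumulator carries a count from the tasks back to the driver.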

Thanks for reading. Questions and Suggestions are welcome.

Regards

Abhi.

To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, view the learning path. To read more posts from Abhi, check out his Medium posts here.
