Student Blog

Data Processing Stack Overflow Data Using Apache Spark on AWS EMR

November 2, 2020

The blog is posted by WeCloudData's student Sneha Mehrin.

An overview of how to process data in Spark using Databricks, add the script as a step in an AWS EMR job, and output the data to Amazon Redshift.

This article is part of a series and a continuation of the previous post.


In the previous post, we saw how to stream the data with Kinesis Firehose using either StackAPI or the Kinesis Data Generator.

In this post, let's look at the key processing steps that need to be performed before we send the data to any analytics tool.

The key steps I followed in this phase are as below:

Step 1: Prototype in Databricks

I always use Databricks for prototyping because it is free and can connect directly to AWS S3. Since EMR is charged by the hour, Databricks is the best way to go if you want to prototype Spark code for your projects.

Key steps in Databricks

  1. Create a cluster

This is fairly straightforward: just create a cluster and attach any notebook to it.

  2. Create a notebook and attach it to the cluster

  3. Mount the S3 bucket to Databricks

The good thing about Databricks is that you can connect it directly to S3, but you need to mount the S3 bucket first.

The code below mounts the S3 bucket so that you can access its contents.
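The original mount snippet is not preserved in this copy; a minimal sketch following the standard Databricks pattern would look like the following. The secret scope name `aws` and its key names are assumptions, and this only runs inside a Databricks notebook, where `dbutils` is predefined.

```python
# Pull AWS credentials from a (hypothetical) Databricks secret scope named "aws".
access_key = dbutils.secrets.get(scope="aws", key="access-key")
secret_key = dbutils.secrets.get(scope="aws", key="secret-key")
# Any "/" in the secret key must be URL-encoded before it goes into the URI.
encoded_secret_key = secret_key.replace("/", "%2F")

# Mount the bucket under /mnt so notebooks can read it like a local path.
dbutils.fs.mount(
    source=f"s3a://{access_key}:{encoded_secret_key}@stack-overflow-bucket",
    mount_point="/mnt/stack-overflow-bucket",
)
```

Once mounted, the bucket's files are readable at `/mnt/stack-overflow-bucket/` from any notebook attached to the cluster.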

The following command shows the contents of the mounted S3 bucket:

%fs ls /mnt/stack-overflow-bucket/

After this, we do some exploratory data analysis in Spark.

The key processing steps performed here are:

  • Check for and delete duplicate entries, if any.
  • Convert the date into a Salesforce-compatible format (YYYY/MM/DD).
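The date-conversion step above can be sketched as a small helper; the function below is plain Python so the logic is easy to test, and the equivalent Spark calls are shown in a comment. The column names are hypothetical, since the original notebook code is not preserved here.

```python
from datetime import datetime

def to_salesforce_date(raw_timestamp: str) -> str:
    """Convert an ISO-8601 timestamp string into the
    Salesforce-compatible YYYY/MM/DD format."""
    return datetime.fromisoformat(raw_timestamp).strftime("%Y/%m/%d")

print(to_salesforce_date("2020-11-02T10:15:00"))  # 2020/11/02

# In PySpark, the same two processing steps would look roughly like
# (column names are assumptions):
#   from pyspark.sql.functions import col, date_format
#   df = df.dropDuplicates(["question_id"])
#   df = df.withColumn("creation_date",
#                      date_format(col("creation_date"), "yyyy/MM/dd"))
```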

Step 2: Create a test EMR cluster

The PySpark script that you developed in the Databricks environment gives you the skeleton of the code.

However, you are not done yet!
You need to ensure that this code won't fail when you run the script in AWS. For this purpose, I create a dummy EMR cluster and test my code in a Jupyter notebook. This step is covered in detail in this post.

Step 3: Test the prototype code in PySpark

In the Jupyter notebook that you created, make sure to test out the code.

Key points to note here:

Kinesis Firehose outputs the data into multiple files within a day. The naming of the files depends on the prefix of the S3 location set while creating the delivery stream.

The Spark job runs on the next day and is expected to pick up all of the previous day's files from the S3 bucket.

The function get_latest_filename creates a text string that matches the file path Spark is supposed to pick up.

So in the Jupyter notebook, I am testing whether Spark is able to process the files without any errors.
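The body of get_latest_filename is not preserved in this copy, but based on the description above it builds yesterday's date into the S3 prefix. A sketch, assuming Firehose's default YYYY/MM/DD key layout and the bucket name used earlier:

```python
from datetime import datetime, timedelta

def get_latest_filename(bucket: str = "stack-overflow-bucket") -> str:
    """Return the S3 prefix matching the files Firehose wrote yesterday.

    Assumes the delivery stream uses Firehose's default YYYY/MM/DD
    date-based key prefix, so one prefix covers the whole day's files.
    """
    yesterday = datetime.utcnow() - timedelta(days=1)
    return "s3://{}/{}/".format(bucket, yesterday.strftime("%Y/%m/%d"))

# Spark can then read all of yesterday's files in one call, e.g.:
#   df = spark.read.json(get_latest_filename())
```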

Step 4: Convert the IPython notebook into a Python script and upload it to S3

You can use the commands below to convert the IPython notebook into a Python script:

pip install ipython

pip install nbconvert

ipython nbconvert --to script stack_processing.ipynb

After this is done, upload the script to the S3 folder.

Step 5: Add the script as a step to the EMR job using boto3

Before we do this, we need to configure the Redshift cluster details and test that the functionality is working.

This will be covered in the Redshift post.
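Adding the script as an EMR step with boto3 follows a standard pattern: build a step definition that runs the script via spark-submit, then submit it with add_job_flow_steps. A sketch, where the cluster ID, region, and script path are placeholders:

```python
def build_spark_step(script_s3_path: str, name: str = "stack_processing") -> dict:
    """Build an EMR step definition that runs a PySpark script via spark-submit."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets EMR execute arbitrary commands as a step.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_path],
        },
    }

def add_step_to_cluster(cluster_id: str, script_s3_path: str) -> str:
    """Submit the step to a running EMR cluster and return the new step ID."""
    import boto3  # imported here so the step builder stays dependency-free

    emr = boto3.client("emr", region_name="us-east-1")  # region is a placeholder
    response = emr.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[build_spark_step(script_s3_path)],
    )
    return response["StepIds"][0]
```

A call would look like `add_step_to_cluster("j-XXXXXXXX", "s3://stack-overflow-bucket/scripts/stack_processing.py")`, with the cluster ID taken from the EMR console.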

To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, view the learning path. To read more posts from Sneha, check out her Medium posts here.