
Streaming Stack Overflow Data Using Kinesis Firehose

October 28, 2020

This blog post was written by WeCloudData student Sneha Mehrin.

An overview of how to ingest Stack Overflow data using Kinesis Firehose and Boto3 and store it in S3.


This article is part of a series and a continuation of the previous post.

Why Use Streaming Data Ingestion?

Traditional enterprises follow a methodology of batch processing where you gather the data, load it periodically into a database, and analyse it hours, days, or weeks later.

However, with numerous data sources continuously generating streams of data, it has become imperative for most businesses to process and analyse data at massive scale within milliseconds of latency.

Apache Kafka and Amazon Kinesis are two of the most widely adopted message queue systems.

Two Main Streaming Services Offered by Amazon

Kinesis Firehose is the easiest way to persist your streaming data to a supported destination.

The key advantage of Kinesis Streams is that data becomes available to consumers almost immediately after it is added.

Key Steps in the Data Ingestion Pipeline

Most of the steps in Kinesis Firehose are pretty straightforward, so let's get straight to it.

Prerequisites

  1. Set up your AWS account.
  2. Install the AWS CLI (pip install awscli).
  3. Run aws configure to set the AWS Access Key and Secret Access Key.
  4. Run ls -alf ~ to locate the .aws folder (it is hidden).
  5. Configure AWS credentials for a demo profile: aws configure --profile bigdata-demo

In all the scripts, I connect to AWS from PyCharm using the profile configured above.
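
As a minimal sketch, this is how a Boto3 session can pick up that named profile (the client usage is illustrative):

import boto3

# Reuse the named profile configured above instead of the default credentials.
session = boto3.Session(profile_name="bigdata-demo")
firehose = session.client("firehose")  # client used later to send records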

Creating a Delivery Stream

Step 1: Log in to the AWS console, choose the Kinesis service, and select Kinesis Firehose.

Step 2: Give the Kinesis Firehose delivery stream a name.

Step 3: Choose "Direct PUT or other sources" as the source, since we will be streaming with Python and Boto3.

Step 4: Keep the default options for processing records, since we will use Spark to process them.

Step 5: Choose S3 as the destination and select the S3 bucket.

It is important to set a prefix on the S3 bucket, because Spark will process the records from that exact folder location.

S3 Bucket Prefix

StackOverFlow/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}

Error Prefix

myError/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/!{firehose:error-output-type}
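
When Firehose delivers a batch, it expands these timestamp expressions at write time, so an object written on October 28, 2020 would land under a key such as StackOverFlow/year=2020/month=10/day=28/ — exactly the partitioned folder layout the Spark job will read from.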

Step 6: Leave the default options under "Configure settings".

Step 7: Choose an IAM role that provides read/write access to Kinesis.

Step 8: Review and create the delivery stream.
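
For reference, the same delivery stream can also be created programmatically. Below is a minimal sketch with Boto3; the stream name, bucket ARN, and role ARN are placeholders, not values from this project:

import boto3

session = boto3.Session(profile_name="bigdata-demo")
firehose = session.client("firehose")

# Placeholder names/ARNs: substitute your own.
firehose.create_delivery_stream(
    DeliveryStreamName="stackoverflow-demo",
    DeliveryStreamType="DirectPut",  # matches the "Direct PUT" source chosen above
    ExtendedS3DestinationConfiguration={
        "BucketARN": "arn:aws:s3:::my-demo-bucket",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-demo-role",
        "Prefix": "StackOverFlow/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
        "ErrorOutputPrefix": "myError/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/!{firehose:error-output-type}/",
    },
)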

Sending Data to Kinesis Firehose

Now that we have the delivery stream created, our next step is to send the data to Firehose.

There are two ways to send data to Firehose: through the Kinesis Agent, or programmatically through the Firehose API (for example, with Boto3).

In my use case, I will use Python to connect to AWS with Boto3 and then use the Kinesis Firehose API to send the data to Firehose.

Below is the code that streams data from Stack Overflow and sends it to Kinesis Firehose.

The full code is available in this gist: https://gist.github.com/snehamehrin/b1b2db14eb420b7bc398010e31a4e07b
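
Since the gist embed does not render outside the original page, here is a minimal sketch of the same idea (not necessarily the exact code in the gist), assuming the requests library for the Stack Exchange API and a hypothetical delivery stream named stackoverflow-demo:

import json
import time

import boto3
import requests

session = boto3.Session(profile_name="bigdata-demo")
firehose = session.client("firehose")

def fetch_todays_questions():
    # Stack Exchange API: questions sorted by creation date, restricted to today.
    now = int(time.time())
    midnight = now - (now % 86400)  # midnight UTC today
    resp = requests.get(
        "https://api.stackexchange.com/2.3/questions",
        params={
            "site": "stackoverflow",
            "sort": "creation",
            "order": "desc",
            "fromdate": midnight,
            "todate": now,
        },
    )
    resp.raise_for_status()
    return resp.json().get("items", [])

def send_to_firehose(records, stream_name="stackoverflow-demo"):
    for rec in records:
        # Firehose expects bytes; newline-delimit records so Spark can split them.
        firehose.put_record(
            DeliveryStreamName=stream_name,
            Record={"Data": (json.dumps(rec) + "\n").encode("utf-8")},
        )

if __name__ == "__main__":
    send_to_firehose(fetch_todays_questions())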

Key Tips

We will stream only the data created today and then output it to the S3 bucket.

Setting sort="creation" in the API request fetches Stack Overflow data by creation date, so only the posts created today are streamed.
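
For example, a request such as https://api.stackexchange.com/2.3/questions?site=stackoverflow&sort=creation&fromdate=<today-midnight-epoch> (a hypothetical illustration of the call in the sketch above) returns only the questions created since midnight.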

A Spark job that runs the next day will process the previous day's data from the S3 bucket and append it to Redshift.

The Stack Overflow API has a lot of limitations and can only serve 10,000 requests per day, so in order to get more data I used the Kinesis Data Generator.

Once I had an idea of what the actual data looked like from the script above, I used the Kinesis Data Generator to generate a large amount of fake data for analytics purposes.

You can check out the article on generating data with the Kinesis Data Generator here.

The next step is to process this data using Spark, which is covered in detail in this article.

To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, view the learning path. To read more posts from Sneha, check out her Medium posts here.
