Student Blog

How to Build a Technical Design Architecture for an Analytics Data Pipeline

October 26, 2020

This blog post was written by WeCloudData’s student Sneha Mehrin.

An overview of designing and building a technical architecture for an analytics data pipeline.

[Image: overview of an analytics data pipeline]

This article is a continuation of the previous post and will outline how to transform our user requirements into a technical design and architecture.

We start from the two major requirements gathered in the previous post and work out how to satisfy them.

The discovery phase is usually the hardest, because you have to engage multiple stakeholders and the tech team to build the right solution. The only way to get through this is to ask questions!


Key Questions to ask during the design phase

1. What are the key KPI metrics?

  • Number of questions per day (with the ability to visualise by month)
  • Number of answers per day (with the ability to visualise by month)
  • Number of accepted answers per day (with the ability to visualise by month)
  • Number of unaccepted answers per day (with the ability to visualise by month)
  • Average view count of a question
  • Number of questions with no answers
  • Number of votes per day
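
As a rough sketch of what computing these daily metrics could look like once the raw question records land in S3, here is a PySpark aggregation. The bucket path is hypothetical, and is_answered is only a rough proxy for "has an accepted answer", so treat this as an illustration rather than the author's actual job:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("so-kpis").getOrCreate()

# Hypothetical S3 location of the raw question records delivered by Firehose.
questions = spark.read.json("s3://so-pipeline/questions/")

daily = (
    questions
    # creation_date arrives as epoch seconds in the Stack Exchange API
    .withColumn("day", F.to_date(F.from_unixtime("creation_date")))
    .groupBy("day")
    .agg(
        F.count("*").alias("questions_per_day"),
        F.sum("answer_count").alias("answers_per_day"),
        # is_answered is used as a rough proxy for accepted answers
        F.sum(F.col("is_answered").cast("int")).alias("accepted_answers"),
        F.sum((~F.col("is_answered")).cast("int")).alias("unaccepted_answers"),
        F.avg("view_count").alias("avg_view_count"),
        F.sum((F.col("answer_count") == 0).cast("int")).alias("questions_no_answers"),
        F.sum("score").alias("votes_per_day"),
    )
)
daily.show()
```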

2. Do these key metrics need to be calculated, or are they readily available in a database?

  • Data will most likely be available at the lowest granularity (day-wise) in a database.
  • Large organisations will have data warehouses with data marts that aggregate this data by month or year.
  • Aggregation can also be performed in the analytics tool to give a monthly or yearly view, depending on the data volume and the timeline of data needed.
  • For instance, if the requirement is to show five years’ worth of data plus the current year day-wise, the data volume will be huge. In that case it is better to load the pre-aggregated five-year data from the data warehouse into the analytics tool, load the day-wise raw data alongside it, and perform calculations only for the day-wise metrics.
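
To make that last point concrete, here is a sketch of rolling the daily metrics up to month level before loading the historical window into the analytics tool. It reuses the hypothetical daily DataFrame from the earlier sketch:

```python
import pyspark.sql.functions as F

# Five years of daily rows (~1,800) collapse into ~60 monthly rows,
# which is what gets loaded for the historical view.
monthly = (
    daily
    # truncate the "day" date column to the first of its month
    .withColumn("month", F.trunc("day", "month"))
    .groupBy("month")
    .agg(
        F.sum("questions_per_day").alias("questions_per_month"),
        F.sum("answers_per_day").alias("answers_per_month"),
        # note: averaging the daily averages is an approximation
        F.avg("avg_view_count").alias("avg_view_count"),
    )
)
```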

3. Where does the data come from?

  • Historical data might be available in a database or data warehouse.
  • Real-time data can be ingested through StackAPI.
  • In this project, I will be using StackAPI to stream the data and the Kinesis Data Generator to mock up some streaming data.

In an ideal situation, historical records would be loaded as a one-time activity, while daily questions are stored in the data lake and synced to the analytics tool.
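
As a minimal sketch of that ingestion step, assuming the Python stackapi package (the site name and paging settings below are my choices, not necessarily the author's):

```python
from datetime import datetime, timedelta
from stackapi import StackAPI

# stackapi wraps the Stack Exchange REST API and handles paging and backoff.
site = StackAPI('stackoverflow')
site.page_size = 100   # items per request
site.max_pages = 5     # cap requests to stay well under the daily quota

# Pull yesterday's questions; StackAPI converts datetimes to epoch seconds.
now = datetime.utcnow()
response = site.fetch('questions', fromdate=now - timedelta(days=1), todate=now)

for question in response['items']:
    print(question['question_id'], question['title'])
```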

4. What is the format of this data?

  • Data streamed using StackAPI arrives in JSON format.
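
For illustration, an abridged question record in this JSON format might look like the following (the field names follow the public Stack Exchange /questions schema; the values are invented):

```python
import json

raw = """{
  "question_id": 64538313,
  "title": "How do I flatten nested JSON in PySpark?",
  "creation_date": 1603720800,
  "score": 3,
  "view_count": 120,
  "answer_count": 1,
  "is_answered": true,
  "tags": ["python", "pyspark"]
}"""

record = json.loads(raw)
print(record["title"], record["view_count"])
```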

5. Is there any additional data modelling required?

  • Data streamed through StackAPI already has the shape of a fact table, so no further modelling is required.

6. Do you need stream processing or batch processing?

  • For the user group identified, batch processing would suffice.
  • Jobs can be scheduled daily and the dashboard refreshed on the same cadence (a skeleton of such a job is sketched below).
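
Here is what that daily batch job could look like as a skeleton, with the schedule handled by cron on the EC2 host. The paths, job name, and cron entry are placeholders:

```python
from datetime import date, timedelta
from pyspark.sql import SparkSession

# Example cron entry on the EC2 host (runs at 02:00 every day):
#   0 2 * * * spark-submit /opt/jobs/daily_kpis.py

def run_daily_job():
    spark = SparkSession.builder.appName("daily-kpis").getOrCreate()
    yesterday = date.today() - timedelta(days=1)
    # Firehose writes objects under a date-based key prefix by default.
    path = f"s3://so-pipeline/questions/{yesterday:%Y/%m/%d}/"
    questions = spark.read.json(path)
    # ... compute the daily KPIs as in the earlier sketch, then load them out
    print(f"Processed {questions.count()} records for {yesterday}")

if __name__ == "__main__":
    run_daily_job()
```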

7. Where can you store the data?

  • Data can be stored in Amazon Redshift.
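
A hedged sketch of loading the transformed DataFrame into Redshift over plain JDBC follows; the cluster endpoint, table, and credentials are placeholders, and daily is the aggregated DataFrame from the earlier sketch:

```python
# Write the daily KPI DataFrame into a Redshift table via JDBC.
(
    daily.write
    .format("jdbc")
    .option("url", "jdbc:redshift://my-cluster.abc123.us-east-1"
                   ".redshift.amazonaws.com:5439/dev")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .option("dbtable", "analytics.daily_question_kpis")
    .option("user", "etl_user")
    .option("password", "<redacted>")
    .mode("append")
    .save()
)
```

In practice a Spark-to-Redshift connector that stages data in S3 and issues a COPY is usually faster for bulk loads; plain JDBC is shown here only because it needs no extra dependencies.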

8. What would be the volume of the data?

  • StackAPI is limited to 10,000 requests per day, which caps the number of records that can be streamed daily.
  • To get a fuller end-to-end view of the pipeline, I also used the Kinesis Data Generator to mock up data.
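
The author used the Kinesis Data Generator UI for this; an equivalent boto3 snippet that pushes mock question records into a Firehose delivery stream would look roughly like the following (the stream name and field ranges are invented):

```python
import json
import random
import time

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Generate and send 100 mock question records, one per put_record call.
for i in range(100):
    mock_question = {
        "question_id": 1_000_000 + i,
        "creation_date": int(time.time()),
        "score": random.randint(0, 20),
        "view_count": random.randint(0, 500),
        "answer_count": random.randint(0, 5),
        "is_answered": random.choice([True, False]),
    }
    firehose.put_record(
        DeliveryStreamName="so-questions-stream",
        Record={"Data": (json.dumps(mock_question) + "\n").encode("utf-8")},
    )
```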

9. What will be the visualisation tool?

  • Einstein Analytics will be used to visualise these metrics.

Technical Architecture

With these questions answered, we can now put together a technical architecture diagram for the data pipeline.

[Image: technical architecture diagram]

Brief Overview of the data pipeline:

  • Kinesis Firehose is chosen to stream the data from StackAPI and write it to an S3 bucket folder.
  • Spark batch-processes the streams from S3 on a daily basis and writes the transformed data to Redshift; this is a script scheduled on EC2 once a day.
  • Einstein Analytics uses its native S3 connector to sync the data and display it in the dashboards; the dashboards are refreshed every day with the previous day’s data.

In the upcoming articles, I will be exploring each component of the pipeline in depth.

Here is the article describing how I streamed the data using Kinesis and stored it in S3 for further processing!

To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, view the learning path. To read more posts from Sneha, check out her Medium posts here.
