
How to Build a Technical Design Architecture for an Analytics Data Pipeline

October 26, 2020

This blog is posted by WeCloudData's student Sneha Mehrin.

An overview of designing and building a technical architecture for an analytics data pipeline.

[Image: overview of the analytics data pipeline]

This article is a continuation of the previous post and will outline how to transform our user requirements into a technical design and architecture.

Keep the two major requirements from the previous post in mind as we work through the design.

The discovery phase is usually the hardest, because you have to engage multiple stakeholders and the tech team to build the right solution.

The only way to get through it is to ask questions!


Key questions to ask during the design phase

1. What are the key KPI metrics?

  • Number of questions per day (should be able to visualise by month)
  • Number of answers per day (should be able to visualise by month)
  • Number of accepted answers per day (should be able to visualise by month as well)
  • Number of unaccepted answers per day (should be able to visualise by month as well)
  • Average view count of a question
  • Number of questions with no answers
  • Number of votes per day

2. Do these key metrics need to be calculated, or are they readily available in a database?

  • Data will most likely be available at the lowest granularity (day-wise) in a database.
  • Huge organisations will have data warehouses with data marts that aggregate this data by month or year.
  • Aggregation can also be performed in the analytics layer to produce a monthly or yearly view, depending on the data volume and the timeline of data needed (a roll-up sketch follows this list).
  • For instance, if the requirement is to show five years' worth of data plus the current year day-wise, the data volume will be huge. In that case it is better to load the pre-aggregated five-year data into analytics from the data warehouse, load the day-wise raw data separately, and only perform calculations for the day-wise metrics.
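As an illustration of that day-to-month roll-up (not the project's actual code), here is a minimal pandas sketch; the column names are hypothetical placeholders for the day-level KPI extract:

```python
import pandas as pd

def monthly_rollup(daily: pd.DataFrame) -> pd.DataFrame:
    """Roll day-wise KPI counts (one row per day) up to a monthly view.

    Expects a 'question_date' column plus numeric metric columns such as
    'questions', 'answers', 'accepted_answers' (names are illustrative).
    """
    return (
        daily.assign(month=pd.to_datetime(daily["question_date"]).dt.to_period("M"))
             .groupby("month", as_index=False)
             .sum(numeric_only=True)
    )
```

The same roll-up could equally be done in the data warehouse or in the analytics tool itself; where it runs is exactly the trade-off described above.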

3. Where does the data come from?

  • Historical data might be available in a database or data warehouse.
  • Real-time data can be ingested through the Stack Exchange API.
  • In this project, I will be using the Stack Exchange API to stream the data and Kinesis Data Generator to mock up some streaming data.

In an ideal situation, historical records should be loaded as a one-time activity, while daily questions should be stored in the data lake and synced to analytics (a fetch sketch follows below).
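The streaming itself is covered in the next article; as a quick illustration of pulling a daily window of questions from Python, here is a minimal sketch using the StackAPI wrapper. The choice of client library, the page/window sizes, and the site are assumptions for illustration, not necessarily what the project uses:

```python
from datetime import datetime, timedelta
from stackapi import StackAPI

# Fetch roughly the last 24 hours of Stack Overflow questions.
SITE = StackAPI('stackoverflow')
SITE.page_size = 100   # records per page (API maximum)
SITE.max_pages = 10    # stay well under the daily request quota

now = datetime.utcnow()
yesterday = now - timedelta(days=1)

questions = SITE.fetch(
    'questions',
    fromdate=int(yesterday.timestamp()),
    todate=int(now.timestamp()),
)

print(len(questions['items']))  # list of question objects in JSON form
```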

4. What is the format of this data?

  • Data streamed from the Stack Exchange API arrives in JSON format (a flattening sketch follows below).
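As a quick illustration of turning that JSON into tabular rows, here is a minimal sketch that picks out the fields the KPIs need from one question object. The field names follow the Stack Exchange question schema and should be verified against the actual API filter used; this is not the project's own code:

```python
def flatten_question(item: dict) -> dict:
    """Flatten one Stack Exchange question object into the KPI fields."""
    return {
        "question_id": item.get("question_id"),
        "creation_date": item.get("creation_date"),            # epoch seconds
        "view_count": item.get("view_count"),
        "answer_count": item.get("answer_count"),
        "is_answered": item.get("is_answered"),
        "score": item.get("score"),
        "accepted_answer_id": item.get("accepted_answer_id"),  # absent if none accepted
        "tags": ",".join(item.get("tags", [])),
    }

# 'questions' is the response dict from the fetch sketch above.
rows = [flatten_question(q) for q in questions["items"]]
```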

5. Is there any additional data modelling required?

  • Data streamed through the Stack Exchange API maps naturally to a single fact table, so no further modelling is required.

6. Do you need stream processing or batch processing?

  • For the user group identified, batch processing will suffice.
  • Jobs can be scheduled daily, and the dashboard can be refreshed daily.

7. Where can you store the data?

  • Data can be stored in Amazon Redshift.

8. What would be the volume of the data?

  • The Stack Exchange API has a limit of 10,000 requests per day, so there is a cap on the number of records that can be streamed per day.
  • In order to get a complete view of the pipeline, I also used Kinesis Data Generator to mock up data (see the sketch below).
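Kinesis Data Generator is a hosted web tool, but the same effect can be mocked directly from Python. Here is a minimal sketch using boto3, assuming a Kinesis Data Firehose delivery stream already exists; the stream name, region, and field values are placeholders, not the project's actual configuration:

```python
import json
import random
import time

import boto3

# Push mock question records into a Firehose delivery stream.
firehose = boto3.client("firehose", region_name="us-east-1")

for i in range(100):
    record = {
        "question_id": 1_000_000 + i,
        "creation_date": int(time.time()),
        "view_count": random.randint(0, 500),
        "answer_count": random.randint(0, 5),
        "is_answered": random.choice([True, False]),
        "score": random.randint(-2, 50),
    }
    firehose.put_record(
        DeliveryStreamName="stackoverflow-questions",  # hypothetical stream name
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )
```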

9. What will be the visualisation tool?

  • Einstein Analytics will be used to visualise this.

Technical Architecture

With these questions answered, we can now put together a technical architecture diagram for the data pipeline.

[Image: technical architecture diagram]

Brief Overview of the data pipeline:

  • Kinesis Data Firehose is chosen to stream the data from the Stack Exchange API and deliver it to an S3 bucket folder.
  • Spark batch-processes the streams from S3 on a daily basis and writes the transformed data to Redshift; this is a script scheduled on EC2 once a day (a sketch follows below).
  • Einstein Analytics uses its native S3 connector to sync the data and display it in dashboards, which are refreshed every day with the previous day's data.
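As a rough illustration of the daily batch step (the real job is covered in the follow-up articles), here is a minimal PySpark sketch that reads a day of raw JSON from S3, aggregates a few of the KPIs, and writes the result to Redshift over JDBC. The bucket, table, and connection details are placeholders, and the Redshift JDBC driver is assumed to be available on the Spark classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Daily batch job: read yesterday's raw JSON from S3, aggregate KPI metrics,
# and append the result to a Redshift table. All names below are placeholders.
spark = SparkSession.builder.appName("stackoverflow-daily-batch").getOrCreate()

raw = spark.read.json("s3a://stackoverflow-raw/questions/2020/10/26/")

daily = (
    raw.withColumn("question_date", F.to_date(F.from_unixtime("creation_date")))
       .groupBy("question_date")
       .agg(
           F.count("question_id").alias("questions"),
           F.sum(F.col("is_answered").cast("int")).alias("answered_questions"),
           F.avg("view_count").alias("avg_view_count"),
       )
)

(daily.write
    .format("jdbc")
    .option("url", "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/dev")
    .option("dbtable", "analytics.daily_question_metrics")
    .option("user", "awsuser")
    .option("password", "...")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .mode("append")
    .save())
```

On EC2, a job like this can be scheduled with cron (or any scheduler) to run once a day, which matches the daily dashboard refresh described above.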

In the upcoming articles, I will be exploring each component of the pipeline in depth.

Here is the article describing how I streamed the data using Kinesis and stored it in S3 for further processing!

To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, view the learning path. To read more posts from Sneha, check out her Medium posts here.
