1. What are the key metrics (KPIs)?
- Number of questions per day (should also be viewable by month).
- Number of answers per day (should also be viewable by month).
- Number of accepted answers per day (should also be viewable by month).
- Number of unaccepted answers per day (should also be viewable by month).
- Average view count per question.
- Number of questions with no answers.
- Number of votes per day.
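As a rough illustration of how a few of these metrics could be derived from raw question records, here is a minimal sketch. The sample records are made up, and the field names (`creation_date` as a Unix timestamp, `view_count`, `answer_count`) are assumed to follow the Stack Exchange API question object.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical sample records (illustrative values only).
questions = [
    {"question_id": 1, "creation_date": 1577836800, "view_count": 10, "answer_count": 0},
    {"question_id": 2, "creation_date": 1577840400, "view_count": 30, "answer_count": 2},
    {"question_id": 3, "creation_date": 1580515200, "view_count": 20, "answer_count": 1},
]


def day_of(record):
    """Convert the Unix timestamp of a record into a UTC calendar date."""
    return datetime.fromtimestamp(record["creation_date"], tz=timezone.utc).date()


# Number of questions per day.
questions_per_day = Counter(day_of(q) for q in questions)

# The same metric rolled up by (year, month) for the monthly view.
questions_per_month = Counter()
for day, count in questions_per_day.items():
    questions_per_month[(day.year, day.month)] += count

# Average view count per question.
average_view_count = sum(q["view_count"] for q in questions) / len(questions)

# Number of questions with no answers.
unanswered_questions = sum(1 for q in questions if q["answer_count"] == 0)
```

The daily counts are computed once and the monthly view is derived from them, which mirrors the idea of storing data at the lowest granularity and aggregating upwards.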
2. Do these key metrics need to be calculated, or are they readily available in a database?
- Data will most likely be available at the lowest granularity (day-wise) in a database.
- Large organisations will have data warehouses with data marts that aggregate this data by month or year.
- Aggregation can also be performed in analytics to give a monthly or yearly view, depending on the data volume and the time span of data needed.
- For instance, if the requirement is to show five years' worth of data plus day-wise data for the current year, the data volume will be huge. In that case it is better to load the five years of data pre-aggregated from the data warehouse, load the raw day-wise data only for the current year, and perform calculations in analytics only for the day-wise metrics.
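The hybrid approach described above could look roughly like the sketch below: monthly totals that arrive pre-aggregated are combined with raw daily counts that are rolled up on the fly. All names and numbers here are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical pre-aggregated monthly counts, as a data mart might provide them.
historical_monthly = {(2019, 11): 310, (2019, 12): 290}

# Hypothetical raw daily counts for the current year, keyed by (year, month, day).
current_daily = {(2020, 1, 1): 12, (2020, 1, 2): 9, (2020, 2, 1): 15}


def monthly_view(historical, daily):
    """Merge pre-aggregated months with daily rows rolled up to months."""
    combined = Counter(historical)
    for (year, month, _day), count in daily.items():
        combined[(year, month)] += count
    # Sort by (year, month) so the result reads chronologically.
    return dict(sorted(combined.items()))


view = monthly_view(historical_monthly, current_daily)
```

Only the current-year rows need any calculation here; the historical months pass through untouched, which is the point of loading them pre-aggregated.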
3. Where does the data come from?
- Historical data might be available in a database or data warehouse.
- Real-time data can be ingested through stackapi.
- In this project, I will use stackapi to stream the data and the Kinesis Data Generator to mock up some streaming data.
Ideally, historical records would be loaded as a one-time activity, and each day's questions would be stored in the data lake and synced by analytics.
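A minimal sketch of pulling one day's questions through stackapi might look like this. It assumes the third-party `stackapi` Python package, whose `StackAPI.fetch` method accepts `fromdate` and `todate` as Unix timestamps; the helper names `day_bounds`, `fetch_questions_for_day`, and `to_json_lines` are hypothetical and not part of the library.

```python
import json
from datetime import date, datetime, time, timedelta, timezone


def day_bounds(day):
    """Return (first, last) Unix timestamps of the given UTC calendar day."""
    start = datetime.combine(day, time.min, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return int(start.timestamp()), int(end.timestamp()) - 1


def fetch_questions_for_day(site, day):
    """Fetch the questions created on `day` using a StackAPI-like client."""
    fromdate, todate = day_bounds(day)
    response = site.fetch("questions", fromdate=fromdate, todate=todate, sort="creation")
    return response["items"]


def to_json_lines(items):
    """Serialise records as newline-delimited JSON, one record per line."""
    return "".join(json.dumps(item) + "\n" for item in items)


# Usage (needs the `stackapi` package and network access, so it is left commented out):
#   from stackapi import StackAPI
#   site = StackAPI("stackoverflow")
#   items = fetch_questions_for_day(site, date(2020, 1, 1))
#   payload = to_json_lines(items)
```

Newline-delimited JSON is chosen here because it is easy to append record by record to a stream and to read back line by line in a batch job.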
4. What is the format of this data?
- Data streamed through stackapi is in JSON format.
5. Is there any additional data modelling required?
- Data streamed through stackapi already has the shape of a fact table, so no further modelling is required.
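To make the JSON-to-fact-table point concrete, here is a small sketch that parses one record and flattens it into a single row. The sample payload is invented, the field names are assumed to match the Stack Exchange API question object, and `to_fact_row` is a hypothetical helper.

```python
import json

# Hypothetical sample payload (illustrative values only).
raw = '''{"question_id": 101, "creation_date": 1577836800, "view_count": 42,
 "answer_count": 2, "score": 5, "is_answered": true, "accepted_answer_id": 202,
 "tags": ["python", "pandas"], "owner": {"user_id": 7, "reputation": 1500}}'''


def to_fact_row(question):
    """Flatten one question record into a flat, one-row-per-question dict."""
    owner = question.get("owner", {})
    return {
        "question_id": question["question_id"],
        "creation_date": question["creation_date"],
        "view_count": question["view_count"],
        "answer_count": question["answer_count"],
        "score": question["score"],
        "is_answered": question["is_answered"],
        # Absent when no answer has been accepted, so default to None.
        "accepted_answer_id": question.get("accepted_answer_id"),
        "owner_user_id": owner.get("user_id"),
        # Store the tag list as one delimited string to keep the row flat.
        "tags": ";".join(question.get("tags", [])),
    }


row = to_fact_row(json.loads(raw))
```

Each record is already one row per question with numeric measures attached, so the only work is flattening the nested `owner` and `tags` fields.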
6. Do you need stream processing or batch processing?
- For the user group identified, batch processing would suffice.
- Jobs can be scheduled on a daily basis and the dashboard can be refreshed on a daily basis.
7. Where can you store the data?
- Data can be stored in Amazon Redshift.
8. What would be the volume of the data?
- Stackapi is limited to 10,000 requests per day, which caps the number of records that can be streamed per day.
- To get a complete view of the pipeline, I also used the Kinesis Data Generator to mock up data.
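The request limit translates into a rough upper bound on daily record volume. The back-of-the-envelope sketch below assumes every request returns a full page of 100 items, which is taken here to be the maximum page size of the Stack Exchange API; real volumes would be lower whenever pages are not full.

```python
# Daily request quota, as stated above.
DAILY_REQUEST_QUOTA = 10_000

# Assumption: each request returns a full page of 100 items.
ASSUMED_PAGE_SIZE = 100

# Upper bound on the number of records retrievable in one day.
max_records_per_day = DAILY_REQUEST_QUOTA * ASSUMED_PAGE_SIZE
```

Under these assumptions the ceiling is one million records per day, which helps explain why mock data is useful for exercising the pipeline at larger volumes.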
9. What will be the visualisation tool?
- Einstein Analytics will be used to visualise the data.
With answers to all these questions, we can now draw up a technical architecture diagram for the data pipeline.
Brief overview of the data pipeline:
- Kinesis Firehose streams the data from stackapi and writes it to a folder in an S3 bucket.
- Spark batch processes the streamed files from S3 on a daily basis and writes the transformed data to Redshift. This is a script scheduled on EC2 to run once a day.
- Einstein Analytics uses its native S3 connector to sync the data and display it in the dashboards. Dashboards are refreshed every day with the previous day's data.
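One practical detail of the daily batch step is deciding which S3 folder to read. The sketch below assumes Firehose is left on its default object naming, which prefixes objects with the UTC arrival time as `YYYY/MM/dd/HH/`; the base prefix `stack-questions/` and the helper `previous_day_prefix` are hypothetical.

```python
from datetime import date, timedelta


def previous_day_prefix(run_date, base="stack-questions/"):
    """Build the S3 key prefix covering all hourly folders of the previous day."""
    day = run_date - timedelta(days=1)
    # Firehose's default layout is YYYY/MM/dd/HH/, so stopping at the day
    # level matches every hourly folder written on that day.
    return f"{base}{day:%Y/%m/%d}/"


# Example: a job running on 1 March 2020 would read the folder for 29 February.
prefix = previous_day_prefix(date(2020, 3, 1))
```

Passing the run date in explicitly, rather than calling `date.today()` inside the function, makes it straightforward to rerun the job for a missed day.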
In the upcoming articles, I will be exploring each component of the pipeline in depth.
Here is the article describing how I streamed the data using Kinesis and stored it in S3 for further processing!
To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, view the learning path. To read more posts from Sneha, check out her Medium posts here.