Consulting Case Study: Real-time Data Streaming Pipeline Optimization

October 19, 2021

Background

Our client provides advanced agriculture tools and digital information that help farmers become more profitable. The company uses sensor solutions to deliver real-time, actionable insights and gives farmers the power to control their operating costs. Its product saves farms over $20,000 annually by improving energy efficiency and reducing machine maintenance through predictive analytics.

The WeCloudData team provided two main services:

Comprehensive data streaming pipeline optimization
Real-time data visualization using QuickSight

The proposed pipeline proved more efficient and functional at collecting and visualizing large volumes of data and at sending timely notifications to end users.

Problem Statement

The client uses AWS as its main cloud provider. Kinesis Data Firehose and AWS Lambda transform and store the data that the devices collect. The data is served to the client’s app via RDS and DynamoDB. The app provides time-series analytics, including energy consumption and its associated cost.

However, under pressure from the growing volume of real-time data and the need for timely analysis, the client wanted to update the pipeline infrastructure to make it more robust, reliable, and scalable. The current pipeline breaks at random and takes a long time to process data for frontend users, and DynamoDB imposes a rate limit. WeCloudData proposed a few changes to improve the pipeline’s reliability and scalability.

Tools used: AWS (IoT Core, Kinesis Data Firehose, Kinesis Data Analytics, S3, Lambda, DynamoDB, API Gateway, SNS, Athena, QuickSight)

Challenges

The current pipeline is quite sophisticated, and it took some time to understand how the data gets transformed and consumed by end users. The infrastructure is loosely coupled, so it required a detailed review and a complete understanding of the entire data flow.

Original way of data collection and storage

During the review of the pipeline, a few flaws were discovered and patched immediately. The WeCloudData team also proposed a few pipeline design changes to improve reliability and reduce the cost of the infrastructure.

Key results

We discovered that a few Glue crawlers were running every hour on buckets related to some devices. These crawlers contributed substantially to the increase in infrastructure cost. We recommended pausing the crawlers and enabling the Glue metadata registry at the Kinesis level. This approach cuts the time and number of idle tasks and makes the pipeline less reliant on Glue crawlers.

We proposed a few design changes. One of the most suitable is to use Athena and QuickSight for data analytics and data visualization (see the appendix for the dashboard that the WeCloudData team created).

Proposed pipeline to address issues with DynamoDB and provide visualization to end-users.

The team also added a step that uses Kinesis Analytics to pre-aggregate the data on a per-minute basis instead of saving each data point (per-second basis) from each device. This should make computing some statistics less intensive and save cost. We also recommended differentiating devices by type, which streamlines the process of deploying new devices.
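To illustrate the per-minute pre-aggregation idea, here is a minimal sketch in plain Python. It is not the Kinesis Analytics application used in the project; it only shows the tumbling-window logic. The record layout (device ID, epoch seconds, value), the function name, and the chosen statistics are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def aggregate_per_minute(readings):
    """Collapse per-second readings into one summary row per device per minute.

    `readings` is an iterable of (device_id, epoch_seconds, value) tuples.
    This tuple layout is hypothetical and chosen only for illustration.
    """
    buckets = defaultdict(list)
    for device_id, epoch_seconds, value in readings:
        # Round the timestamp down to the start of its one-minute window.
        minute_start = epoch_seconds - epoch_seconds % 60
        buckets[(device_id, minute_start)].append(value)

    # Emit one row per (device, minute) instead of one row per second.
    return [
        {
            "device_id": device_id,
            "minute_start": minute_start,
            "count": len(values),
            "avg": mean(values),
            "min": min(values),
            "max": max(values),
        }
        for (device_id, minute_start), values in sorted(buckets.items())
    ]
```

With one reading per second, each device produces one stored row per minute rather than sixty, which is where the expected savings in storage and downstream computation come from.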

In addition to aggregation, we deployed a pre-trained anomaly detection model provided by Amazon that is built on the Random Cut Forest algorithm. This extra functionality outputs an anomaly score for each appliance that the device is connected to. A Lambda function checks for abnormal scores and notifies the user via text message using the SNS service.
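The score check and notification step could look roughly like the sketch below. It is an illustration under stated assumptions, not the Lambda function deployed for the client: the threshold value, the record fields (`appliance`, `anomaly_score`), and the function names are hypothetical. The `publish` argument stands in for the SNS call so the example runs without AWS credentials; in a real Lambda it would wrap something like the boto3 SNS client's `publish` method.

```python
# Hypothetical cut-off; a real threshold would be tuned on the client's data.
ANOMALY_THRESHOLD = 3.0


def find_anomalies(records, threshold=ANOMALY_THRESHOLD):
    """Return the records whose anomaly score exceeds the threshold."""
    return [r for r in records if r["anomaly_score"] > threshold]


def notify(anomalies, publish):
    """Send one text message per anomalous record.

    `publish` is any callable that accepts the message string. It is an
    assumption standing in for the SNS publish call.
    """
    messages = []
    for record in anomalies:
        message = (
            f"Unusual reading on {record['appliance']} "
            f"(anomaly score {record['anomaly_score']:.2f})"
        )
        publish(message)
        messages.append(message)
    return messages


if __name__ == "__main__":
    sample = [
        {"appliance": "pump", "anomaly_score": 4.2},
        {"appliance": "fan", "anomaly_score": 0.8},
    ]
    # Printing stands in for sending a text message.
    notify(find_anomalies(sample), publish=print)
```

Passing the publisher in as a parameter keeps the threshold logic testable on its own, separate from the messaging service.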

Prototype pipeline to aggregate data and detect anomalies

Conclusion

The proposed changes to the data infrastructure have been tested and shown to significantly reduce infrastructure cost and make the pipeline more resilient. Through this engagement, WeCloudData gained consulting experience in the smart agriculture industry, which applies IoT solutions.

Appendix: Dashboard of the real-time voltage usage (demo)

The client has a subscription-based app for farmers, where they can log in and visualize key information related to their energy expenditure. This information is collected by sensors provided by the client and stored in AWS S3 and AWS DynamoDB.

The client requested QuickSight dashboard templates that could provide valuable KPIs and metrics to their customers about their energy expenditure. During our first contact, the client mentioned that it would be nice to include predictions and forecasts among these metrics.
