
Data Engineering Series #2: Cloud Services and FOSS in a Data Engineer’s World

December 7, 2020

Open Source (OSS) frameworks have improved the quality of Big Data processing with their diverse set of tools addressing numerous use cases.

In fact, if you are a part of a team working on building a modern data architecture, chances are high you are using an open-source stack.

Similarly, Cloud Computing has enabled Big Data solutions that are scalable and cost-effective in the analytics space.

Open Source and Cloud: The Correlation
In the cloud ecosystem, many of the commercially available cloud services fall into one of three categories:

Similar to an OSS ➡ Offers similar features (e.g., AWS Step Functions and Apache Airflow).

Modeled after an OSS ➡ Follows or inherits the design principles of an existing open-source framework (e.g., AWS Kinesis and Apache Kafka).

Managed service of an OSS ➡ Takes care of deploying and maintaining the OSS framework so that it is ready to use (e.g., AWS RDS for PostgreSQL and PostgreSQL).

To understand more, let’s touch upon the basics…

Getting to know the cloud
When getting to know cloud services, the first hurdle for many of us is figuring out where to start among the plethora of services available.

So, for ease of understanding, and irrespective of the cloud provider (AWS, Azure, GCP, etc.), let’s group the big data related cloud services into these stages.

[Figure: cloud service stages in a chart format]

Now, let’s try to understand the cloud ecosystem by comparing AWS cloud services with their equivalent open-source frameworks. (A similar comparison can be drawn with Azure and GCP as well.)

Data Ingestion:

| AWS Service | What it does | Relation with OSS | OSS Alternative |
|---|---|---|---|
| Kinesis | Stream processing | Modelled after | Apache Kafka |
| SQS | Message queue | Similar to | RabbitMQ |
| Managed Streaming for Kafka (MSK) | Stream processing | Managed service of | Apache Kafka |
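
To make the "modelled after" relation for Kinesis and Kafka a little more concrete: both systems split a stream into parallel units (shards in Kinesis, partitions in Kafka) and route each record by hashing a key. Kinesis maps the MD5 hash of the partition key onto a 128-bit range owned by one shard, and Kafka's default partitioner likewise hashes the record key to pick a partition. The sketch below only approximates the Kinesis side. It assumes the hash space is split evenly across shards, and the function name `shard_for_key` and the sample keys are illustrative, not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 digests are 128 bits wide


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard that would receive this partition key.

    Assumption: the 128-bit hash space is divided evenly across shards.
    Real streams can have uneven ranges after resharding.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, byteorder="big")
    # Scale the hash into [0, shard_count) without floating point.
    return hash_key * shard_count // HASH_SPACE


if __name__ == "__main__":
    # Hypothetical device IDs used as partition keys.
    for key in ("device-1", "device-2", "device-3"):
        print(key, "->", "shard", shard_for_key(key, shard_count=4))
```

Records with the same key always land on the same shard, which is what preserves per-key ordering in both systems.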

Data Storage:

| AWS Service | What it does | Relation with OSS | OSS Alternative |
|---|---|---|---|
| S3 | Object store | Similar to | Minio, Swift, Ceph, … |
| RDS | Relational database | Managed service of | MariaDB, MySQL, Postgres |
| DynamoDB | NoSQL database | Similar to | Apache Cassandra |
| ElastiCache | In-memory cache | Managed service of | Memcached, Redis |
| Neptune | Graph database | Similar to | Neo4j |
| Amazon QLDB | Ledger database | Modelled after | Hyperledger |
| Amazon DocumentDB | Document database | Similar to | MongoDB |
| AWS Lake Formation | Data lake | Similar to | HDFS |
| EC2 EBS | Block storage for EC2 | Similar to | OpenEBS, Portworx |
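
The RDS row above is a "managed service of" case: the database engine is the same, so client code typically changes only in where it connects. As a rough illustration, the sketch below builds a standard PostgreSQL connection URI for a self-hosted instance and for an RDS instance. The host names, user, password, and database name are all made up for the example, and `postgres_uri` is a hypothetical helper rather than a library function.

```python
from urllib.parse import quote


def postgres_uri(host: str, user: str, password: str, dbname: str,
                 port: int = 5432, sslmode: str = "prefer") -> str:
    """Build a PostgreSQL connection URI, percent-encoding the credentials."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(dbname, safe='')}?sslmode={sslmode}"
    )


# Hypothetical endpoints: neither host refers to a real server.
SELF_HOSTED = postgres_uri("localhost", "etl", "s3cret!", "warehouse")
MANAGED = postgres_uri(
    "example-db.abc123.us-east-1.rds.amazonaws.com",
    "etl", "s3cret!", "warehouse", sslmode="require",
)

if __name__ == "__main__":
    print(SELF_HOSTED)
    print(MANAGED)
```

Either URI can be handed to the same PostgreSQL client library; only the host and TLS setting differ.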

Data Processing:

| AWS Service | What it does | Relation with OSS | OSS Alternative |
|---|---|---|---|
| Elastic MapReduce (EMR) | Hadoop | Managed service of | Hadoop |
| Step Functions | Workflow orchestrator | Similar to | Apache Airflow, Flyte |
| AWS Glue | ETL | Managed service of | Apache Spark |
| Lambda | Serverless | Similar to | Knative, OpenFaaS, Fn |
| Batch | Batch job computing | Similar to | Apache Airflow on Kubernetes |
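
To give a feel for why Step Functions is listed as "similar to" Apache Airflow: both describe a workflow as named tasks plus the order in which they run. In Airflow that order is written in Python (for example `extract >> transform >> load`), while Step Functions takes a JSON document in the Amazon States Language. The sketch below generates such a document for a simple linear pipeline. The helper `linear_state_machine` and the Lambda ARNs are hypothetical, and only the `Task` state type is covered.

```python
import json
from typing import Dict


def linear_state_machine(task_arns: Dict[str, str]) -> dict:
    """Chain tasks, in insertion order, into an Amazon States Language definition."""
    if not task_arns:
        raise ValueError("at least one task is required")
    names = list(task_arns)
    states = {}
    for i, name in enumerate(names):
        state = {"Type": "Task", "Resource": task_arns[name]}
        if i + 1 < len(names):
            state["Next"] = names[i + 1]  # hand over to the following task
        else:
            state["End"] = True  # last task terminates the execution
        states[name] = state
    return {"StartAt": names[0], "States": states}


# Hypothetical Lambda function ARNs, for illustration only.
definition = linear_state_machine({
    "Extract": "arn:aws:lambda:us-east-1:123456789012:function:extract",
    "Transform": "arn:aws:lambda:us-east-1:123456789012:function:transform",
    "Load": "arn:aws:lambda:us-east-1:123456789012:function:load",
})

if __name__ == "__main__":
    print(json.dumps(definition, indent=2))
```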

Data Analysis & Visualization:

| AWS Service | What it does | Relation with OSS | OSS Alternative |
|---|---|---|---|
| Amazon Redshift | Data warehousing | Similar to | Spark SQL, Apache Hive, Presto |
| Athena | Data warehousing | Similar to | Spark SQL, Apache Hive, Presto |
| CloudSearch | Search | Similar to | Elasticsearch |
| Elasticsearch Service | Search | Managed service of | Elasticsearch |
| QuickSight | Business analytics | Similar to | PowerBI |

Deployment:

| AWS Service | What it does | Relation with OSS | OSS Alternative |
|---|---|---|---|
| Elastic Container Registry (ECR) | Container registry | Managed service of | Docker Registry, Quay |
| Elastic Container Service (ECS) | Container orchestration | Similar to | Kubernetes, Marathon |
| Elastic Kubernetes Service (EKS) | Container orchestration | Managed service of | Kubernetes |
| CloudFormation | Infrastructure as code | Similar to | Terraform |
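
As a small illustration of the infrastructure-as-code idea behind CloudFormation and Terraform: instead of clicking through a console, you declare the resources you want in a file and let the tool create them. The sketch below assembles a minimal CloudFormation template that declares a single S3 bucket. The helper `bucket_template`, the logical ID, and the bucket name are placeholders chosen for this example. Terraform would express the same intent in its own configuration language.

```python
import json


def bucket_template(logical_id: str, bucket_name: str) -> dict:
    """Return a minimal CloudFormation template declaring one S3 bucket."""
    if not logical_id.isalnum():
        # CloudFormation logical IDs must be alphanumeric.
        raise ValueError("logical_id must be alphanumeric")
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative template: one S3 bucket for raw data",
        "Resources": {
            logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {"BucketName": bucket_name},
            }
        },
    }


# Placeholder names, not real resources.
template = bucket_template("RawDataBucket", "example-raw-data-bucket")

if __name__ == "__main__":
    print(json.dumps(template, indent=2))
```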

Some notable cloud adoption figures with respect to Big Data:

– To date, AWS users have launched more than 15 million Hadoop clusters (EMR / containerized versions).
– “Container-as-a-service” (EKS, ECS) and “database-as-a-service” (RDS, DynamoDB) are the most commonly used managed services in 2020.
– Database services usage is up 127% year over year.

Next Steps…

  1. You can see how these services are put to use in real-world use cases in this article.
  2. This whitepaper from AWS on Big Data is a good place to understand its services.
  3. Start getting hands-on by following this repo.

Going forward, I’ll publish detailed posts on tools and frameworks used by Data Engineers day in and day out.

Follow for updates.

To read more posts from Srinidhi, check out her posts here.
