Big Data for Data Engineering

Track Course
Advanced
Fully Ready

About the Course

This course introduces big data principles, distributed computing with Apache Spark, and modern architectures like data lakes and lakehouses. It emphasizes using Databricks for large-scale data processing, covers NoSQL and schema-on-read, and explores real-time streaming.

Learning Outcomes

By the end of this course, participants will be able to:

  • Explain the principles of big data and distributed computing, including the role of Apache Spark in processing large-scale datasets.
  • Design and implement data lake and lakehouse architectures using tools such as Azure Data Lake Storage, Delta Lake, and open table formats.
  • Build scalable data processing workflows on Databricks, using Spark for both batch and real-time processing of structured data.
  • Integrate NoSQL databases and schema-on-read designs into modern data architectures to support unstructured and semi-structured data at scale.

Curriculum

  • Chapter 1: Big Data Foundations

    Overview:

    In this chapter, learners study the principles of big data, distributed systems, and MapReduce, and are introduced to Apache Spark on Databricks.

    Topics:

    • Big data significance and distributed computing concepts
    • Hadoop ecosystem: HDFS, YARN, Hive, HBase, Sqoop, Zookeeper, Kafka, NiFi
    • Introduction to Apache Spark
    • Databricks: workspace setup, Spark cluster creation, DBFS, ADLS integration
    • Labs: Spark DataFrame and Spark SQL exercises
    • Mini Project
  • Chapter 2: Data Lake Architecture

    Overview:

    This chapter covers the design and implementation of data lakes for semi-structured and unstructured data, including NoSQL databases.

    Topics:

    • Advantages of data lakes
    • NoSQL databases: Cosmos DB, MongoDB, Cassandra
    • Unstructured and semi-structured data modeling
    • Schema-on-read strategies
    • Labs: ingest and process unstructured data; ETL in data lakes using Spark and Azure Data Factory
  • Chapter 3: Lakehouse Architecture

    Overview:

    Learners will explore lakehouse concepts and architectures, combining features of data lakes and warehouses.

    Topics:

    • Data lakehouse vs. data warehouse
    • ACID in data lakes
    • Open table formats: Hudi, Iceberg, Delta
    • Key lakehouse attributes: schema-on-read, schema evolution, time travel
    • Structured streaming with Databricks Spark
    • Labs: streaming data ingestion, upserts, deletes, merges, and copy operations in Azure Synapse/Fabric
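
As a rough illustration of the schema-on-read idea listed in Chapters 2 and 3, the sketch below keeps raw JSON records untouched and applies a schema only when the data is read. It uses plain Python rather than Spark so that it runs anywhere. The sample records, `READ_SCHEMA`, and `read_with_schema` are hypothetical names invented for this example and are not taken from the course material.

```python
import json

# Raw records kept exactly as they arrived (hypothetical sample data).
# Note that the records do not share the same set of fields.
RAW_LINES = [
    '{"user_id": "1", "event": "click", "ts": "2024-01-01T10:00:00"}',
    '{"user_id": "2", "event": "view", "extra": {"page": "/home"}}',
    '{"user_id": "3"}',
]

# The schema belongs to the reader, not to the storage layer:
# field name -> (type cast, default used when the field is absent).
READ_SCHEMA = {
    "user_id": (int, None),
    "event": (str, "unknown"),
}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project each record onto the schema at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field, default)
            # Cast only when a value (or default) is available.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, READ_SCHEMA)
for row in rows:
    print(row)
```

Fields outside the schema (such as `ts` and `extra`) stay in the raw data and can be surfaced later by changing only the reader's schema, which is the main appeal of the approach for semi-structured data.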

Tools

Apache Spark, Databricks
Azure (Data Lake Storage, Synapse Analytics, Data Factory)
Delta Lake, Open table formats (Hudi, Iceberg, Delta)
NoSQL databases (Cosmos DB, MongoDB, Cassandra)



Common Questions

Find answers to your questions about the Learning Track
  • Standard Courses: Focused, short courses that build foundational or intermediate skills through hands-on exercises, enabling you to apply what you learn immediately.
  • Track Courses: Structured learning paths that guide you from beginner to advanced levels. They include practical projects that integrate multiple tools and workflows, aligned with industry best practices, helping you gain the skills and confidence to tackle real-world challenges.

Track Courses are only accessible through the Professional or Unlimited+ subscription plans:

  • Standard Plan gives you access to all Standard Courses.
  • Professional Plan gives you access to both Standard and Track Courses within your chosen domain.
  • Unlimited+ Plan provides full access to all courses — both Standard and Track — across all domains.

 

All courses are self-paced, so you can learn whenever it fits your schedule.

Each course lists its prerequisites where any apply. Many Standard Courses are beginner-friendly.

