Data engineering has become a hot topic in recent years, mainly due to the rise of artificial intelligence, big data, and data science. Every enterprise is moving toward digitalization, and data holds enormous value for them. Before an organization can act on its data requirements, it first needs to establish a data architecture and platform and build pipelines to collect, transmit, and transform data. That is what data engineering does.
Data engineering is a field that has been growing rapidly over the past few years; the Google Trends chart above speaks for itself. You probably have lots of questions about data engineering. For example:
- So what is data engineering?
- Why is data engineering important?
- Why should companies care about data engineering?
This article explains what data engineering is and introduces some useful use cases.
What is Data Engineering?
Data engineering is the process of collecting, ingesting, storing, transforming, cleaning, and modelling big data. It is governed by people and processes, driven by modern tools, and practised by data engineers to support data and analytics teams in driving innovation.
It is safe to say that anything data-related has something to do with data engineering. Most importantly, any company that invests in data science and AI needs a very solid data engineering capability.
The following are important aspects of data engineering.
Data Collection
Data collection is the process of collecting and extracting data from different sources. Most companies have at least a dozen different data sources. Data can come from different application systems, SaaS platforms, the Internet, and so on.
- If data comes from a database, data collection means building the connectors to read data from the databases (e.g., a relational database or NoSQL database)
- If data needs to be extracted from a SaaS platform such as Mailchimp or HubSpot, then data collection means writing connectors that read data from the vendor's API (a rough connector sketch follows this list)
- When data is not directly available, data collection may involve writing JavaScript or Python code to scrape data from the web
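As a rough illustration, the sketch below pulls records from a hypothetical paginated SaaS REST API; the endpoint, parameters, and response shape are assumptions for the example, not any specific vendor's API.

```python
import requests

def collect_from_api(base_url, api_key, page_size=100):
    """Pull all records from a hypothetical paginated REST endpoint."""
    records, page = [], 1
    while True:
        resp = requests.get(
            f"{base_url}/contacts",                       # hypothetical endpoint
            params={"page": page, "per_page": page_size}, # assumed pagination scheme
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("results", [])            # assumed response shape
        if not batch:
            break
        records.extend(batch)
        page += 1
    return records

# Example usage (placeholder URL and key):
# records = collect_from_api("https://api.example.com/v1", "SECRET_KEY")
```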
Data Ingestion
Once data is collected from the source, it needs to be ingested into a destination. We could simply save the data as a CSV file on a server; however, a well-designed system takes many things into consideration:
- How fast does the data need to be ingested?
- How frequently should the data be ingested?
- How many different destinations will share the same data, and at what pace?
- How scalable does the ingestion pipeline need to be?
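A minimal batch-ingestion sketch is shown below, using SQLite from the Python standard library as a stand-in for the real destination; in practice the destination would be a warehouse or lake, and the batch size and schedule would follow from the questions above. The table and column names are illustrative.

```python
import sqlite3

def ingest_records(records, db_path="warehouse.db", batch_size=500):
    """Append collected records into a destination table in small batches."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_contacts (id TEXT, email TEXT, created_at TEXT)"
    )
    rows = [(r.get("id"), r.get("email"), r.get("created_at")) for r in records]
    for i in range(0, len(rows), batch_size):
        conn.executemany(
            "INSERT INTO raw_contacts VALUES (?, ?, ?)", rows[i:i + batch_size]
        )
        conn.commit()   # commit per batch so a mid-run failure loses little work
    conn.close()
```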
Data Storage
Data storage is a big topic. Depending on the use cases, data engineers may choose different storage systems and formats. For example, analytics engineers and data analysts may prefer data to be stored in a relational database (RDBMS) such as Postgres or MySQL, or in a data warehouse such as Snowflake or Redshift. Big data engineers may want the data persisted in a data lake such as Hadoop HDFS or Amazon S3. Modern data engineers may even want big data stored in a Lakehouse.
Data can also be stored in many different formats, from flat files such as CSV to JSON and Parquet. In the database world, data can be stored in row-oriented or column-oriented databases.
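As a small illustration of format choices, the snippet below writes the same table to CSV, JSON, and Parquet with pandas (the Parquet call assumes pyarrow or fastparquet is installed). Columnar Parquet is typically smaller and faster to scan for analytics than row-oriented flat files.

```python
import pandas as pd

df = pd.DataFrame(
    {"order_id": [1, 2, 3], "amount": [19.9, 5.0, 42.5], "country": ["CA", "US", "CA"]}
)

df.to_csv("orders.csv", index=False)          # flat, human-readable, row-oriented
df.to_json("orders.json", orient="records")   # nested-friendly, common for APIs
df.to_parquet("orders.parquet")               # columnar, compressed, analytics-friendly
```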
Data Transformation
Most data systems have a staging environment where raw data can be prepared before being loaded into the destination systems. At this stage, data goes through a series of transformations to meet business, compliance, and analytics needs; it is also where data engineers spend a big chunk of their effort. Common data transformation steps include:
- Data selection
- Data filtering
- Data merging and joining
- Data aggregation
- Data grouping and windowing
Data engineers leverage many tools at this stage. Depending on the data processing platform, they may use tools such as the following (a small pandas sketch follows the list):
- DataFrame (Pandas, Dask)
- SQL
- Spark (MapReduce, SparkSQL)
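For illustration, here is a minimal pandas sketch that touches most of the steps above (selection, filtering, joining, grouping, and aggregation); the tables and column names are made up.

```python
import pandas as pd

orders = pd.DataFrame(
    {"order_id": [1, 2, 3, 4], "customer_id": [10, 10, 20, 30],
     "amount": [50.0, 20.0, 75.0, 5.0], "status": ["paid", "paid", "refunded", "paid"]}
)
customers = pd.DataFrame({"customer_id": [10, 20, 30], "region": ["East", "West", "East"]})

paid = orders[orders["status"] == "paid"]              # filtering
paid = paid[["order_id", "customer_id", "amount"]]     # selection
enriched = paid.merge(customers, on="customer_id")     # merging/joining
by_region = (
    enriched.groupby("region")["amount"]               # grouping
    .agg(["sum", "mean", "count"])                     # aggregation
    .reset_index()
)
print(by_region)
```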
Data Pipelines & Automation
Data engineers usually build data processing DAGs (Directed Acyclic Graphs) whose tasks depend on each other. To build high-quality data applications, data pipelines should be automated as much as possible and only require manual intervention when part of the lineage breaks and needs a data engineer's attention. Data pipelines can become quite complex and unwieldy, so they need to be well tested, and data engineers need to adopt software engineering best practices.
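As a hedged example, a simple DAG in Apache Airflow (one popular orchestrator among several) might look roughly like this; the task names, callables, and schedule are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def collect():   ...   # placeholder: pull data from sources
def transform(): ...   # placeholder: clean and reshape the data
def load():      ...   # placeholder: load into the warehouse

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # newer Airflow versions use `schedule` instead
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="collect", python_callable=collect)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3   # declare the dependencies that form the DAG
```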
Data Quality and Governance
Data quality is a big part of the enterprise data governance effort. While building a data pipeline, data engineers always need to include data quality checks so that the pipeline doesn't run into garbage-in, garbage-out problems. A small issue in a pipeline may result in very bad data quality and ruin the entire data science and analytics effort.
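Below is a minimal sketch of in-pipeline quality checks using plain pandas assertions; dedicated tools such as Great Expectations or dbt tests play a similar role at larger scale. The columns and rules are illustrative.

```python
import pandas as pd

def check_quality(df: pd.DataFrame) -> None:
    """Fail the pipeline early instead of loading garbage downstream."""
    assert not df.empty, "no rows ingested"
    assert df["order_id"].is_unique, "duplicate order_id values"
    assert df["amount"].notna().all(), "null amounts found"
    assert (df["amount"] >= 0).all(), "negative amounts found"

orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [50.0, 20.0, 75.0]})
check_quality(orders)   # raises AssertionError (and stops the pipeline) on bad data
```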
Data governance is not just a matter of technology; it involves people and processes. Companies assign data owners and stewards to come up with governance policies, and data quality and monitoring processes are essential parts of the effort.
The Convergence of Engineering Disciplines
In the past few years, we've started to see the convergence of engineering disciplines, and modern data engineering requires skill sets from many of them. For example, in some companies a data engineer may need to know DevOps to a certain degree in order to write well-tested and automated data pipelines. Data engineers often write production-grade software such as data connectors and pipelines. In many organizations, data engineers may also need to do basic analytics, build dashboards, and even help data scientists automate machine learning pipelines.
Can Data Engineering be automated?
At WeCloudData, we often hear this from our students. Can data engineering be automated? The answer is: of course! And it should be automated as much as possible.
Automation is happening in every industry with the advancement of AI. For example, ChatGPT can probably write a data pipeline script simply by predicting the next word, and GitHub Copilot already powers a lot of developers' and software engineers' daily work.
Recently, we have also seen many projects and startups building GUI-based data engineering tools, which make data engineering a low-code or no-code effort.
However, from WeCloudData's and many experts' perspective, because producing high-quality data for analytics and AI is so crucial to a company's innovation, companies rarely should (or will) rely completely on code generated by bots and AI. While much low-level code can be automated, business logic is often hard to automate. For example, banks using exactly the same tools and systems will end up building very different data engineering pipelines because of differences in legacy systems, talent, budget, existing infrastructure, internal policies, and business logic.
Traditional Data Engineering
We'd like to point out that data engineering is not actually a new concept; it has been around for as long as data systems have existed, though it may have lived under different names. For example, data engineers might be called ETL Developers in a bank, Big Data Developers in a tech company, Data Warehouse Engineers or Data Architects in an insurance company, or simply software engineers in a data-driven startup. The reason we call it traditional data engineering is that the field has evolved so fast in the past 8-12 years that many old technology stacks are rarely used or are being phased out in many companies. The data industry is abundant with tooling choices, and companies are constantly looking for better ways to scale and manage their data engineering efforts.
ETL Developer
- Big companies used to hire many ETL developers to work on data migration and ETL projects. ETL is short for Extract-Transform-Load. ETL developers are called upon when new data sources are identified and new data integration is required for the business.
- ETL developers are very specialized in data transformation and usually work with legacy platforms such as IBM DataStage and Informatica.
SQL Developer
- Companies working with relational databases and data warehouses often need SQL developers to help write advanced queries to load and transform data. Compared to data analysts and data scientists, SQL developers are more specialized in database internals and are able to write more efficient and advanced queries. This usually means cost savings and improved query performance.
Big Data Developer
- Ever since the birth of MapReduce and Hadoop in the mid-2000s, big data developers have been in steady demand, especially in enterprises and tech firms that collect and process big data.
- Big data developers usually come from a software engineering background. They understand distributed systems well and can write MapReduce and Spark jobs to process large datasets.
Modern Data Engineering
While the fundamentals of data engineering haven't changed dramatically in the past 10 years, the tools and ecosystems have grown tremendously. For example, Apache Spark has dethroned MapReduce to become the king of big data processing. Many companies that invested heavily in data lakes are now realizing that they need to chase the next wave of Lakehouse initiatives while keeping their legacy big data systems running.
There are several data engineering trends to watch in the coming years:
The Modern Data Stack
- SQL is making a comeback in recent years. This is certainly good news for many, because SQL is widely adopted and loved by analysts, engineers, and developers. The modern data stack community usually settles on a particular technology stack and believes that SQL can be used to solve most modern data challenges.
- One example of a modern data stack is as follows:
- Data ingestion: Fivetran
- ELT: dbt + Snowflake
- Reverse ETL: Hightouch
The Data Lakehouse
- Traditional big data tech has proven to be clunky and doesn't meet all of an enterprise's data processing needs. In the past few years, streaming data processing has become more and more popular, and companies have been asking for ACID features in the data lake.
- Data Lakehouse technologies such as Apache Hudi, Apache Iceberg, and Databricks' Delta Lake have become the buzzwords in big data tech. They allow companies to combine the processing of both real-time and batch data (a rough sketch follows this list).
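As a hedged sketch of the lakehouse idea, the snippet below writes and reads a Delta Lake table with PySpark; it assumes the delta-spark package is installed and that the path is a local placeholder rather than a production lake location.

```python
from pyspark.sql import SparkSession

# Assumes the `delta-spark` package is installed; these configs enable the Delta extensions.
spark = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

events = spark.createDataFrame([(1, "click"), (2, "purchase")], ["user_id", "event"])
events.write.format("delta").mode("overwrite").save("/tmp/lakehouse/events")  # ACID write
spark.read.format("delta").load("/tmp/lakehouse/events").show()               # read back
```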
Serverless Data Engineering
- Another interesting trend in modern data engineering is driven by the rapid adoption of Cloud technologies. Cloud service providers such as AWS, Azure, and GCP have built new capabilities that enable data engineers to run data processing and analytics pipelines in a serverless fashion.
- While it won't meet every company's needs and can get overly complex for large-scale problems, serverless brings flexibility, pricing transparency, and scalability (a rough sketch follows).
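As a hedged illustration, a serverless ingestion step on AWS Lambda might look roughly like the handler below: it is triggered whenever a new object lands in an S3 bucket, reads the file, and hands the records off for processing. The assumption that the file contains JSON records is for the example only.

```python
import json

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Triggered by an S3 'object created' event; processes the new file."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    obj = s3.get_object(Bucket=bucket, Key=key)
    rows = json.loads(obj["Body"].read())   # assumes the object contains JSON records

    # ... transform `rows` and write them to the destination of choice ...
    return {"processed": len(rows), "source": f"s3://{bucket}/{key}"}
```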
Data Engineering Use Cases
Data Warehouse
A data warehouse is a centralized location for storing a company's structured data. It usually stores product, sales, operations, finance, marketing, and customer data. Raw data is extracted from data sources, then cleaned, transformed, modelled, and loaded into the data warehouse. An enterprise data warehouse powers most business intelligence and analytics efforts and is therefore a critical piece of data infrastructure.
For example, the sales team may want to understand daily or weekly sales performance. Traditionally, IT may help write SQL queries to prepare the data for ad-hoc reporting. However, the prepared data may contain useful information that can also be leveraged by marketing, sales, and product teams. Therefore, having a data engineer prepare the data and load it into a data warehouse that can be accessed by different teams is very valuable. Different teams access a single source of truth, so data interpretation can be as accurate and consistent as possible.
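To make the idea concrete, the sketch below loads a daily sales summary into a warehouse table (SQLite as a stand-in for a real warehouse) that several teams can then query consistently; the table and column names are illustrative.

```python
import sqlite3

import pandas as pd

warehouse = sqlite3.connect("warehouse.db")

# Data engineer: load the cleaned, modelled summary once.
daily_sales = pd.DataFrame(
    {"date": ["2024-01-01", "2024-01-01"],
     "region": ["East", "West"],
     "revenue": [1200.0, 800.0]}
)
daily_sales.to_sql("fct_daily_sales", warehouse, if_exists="replace", index=False)

# Sales, marketing, and product teams: query the same single source of truth.
by_region = pd.read_sql(
    "SELECT region, SUM(revenue) AS revenue FROM fct_daily_sales GROUP BY region",
    warehouse,
)
print(by_region)
```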
Who should care about Data Engineering?
The answer is simple: everyone should.
Executives and stakeholders
It is important that a company's executives understand the importance of data engineering. It is directly related to budgeting, and companies should invest in data engineering as early as possible. As data systems and problems become more complex, the cost of data engineering and architecture mistakes becomes higher.
Data Scientists
Data scientists and analysts should have a solid understanding of common data engineering best practices. There's a culture in many organizations, and among data scientists, that data should always be prepared by data engineers. We believe that data scientists need to know how to write efficient data processing scripts and rely less on data engineers to prepare data. Though data scientists don't need to write and automate as many data pipelines as data engineers do, acknowledging the importance of writing efficient and scalable code matters.
Developers
Software engineers and developers write the web and mobile applications that generate raw data. They usually work with data engineers to collect and ingest data into data lakes and warehouses. Frequent communication between developers and data engineers is required to sync up on source data structure changes, so that the impact on data ingestion pipelines is minimized when changes happen.
IT
IT often needs to work with data engineers to provide source data connection details, grant access to infrastructure, and respond to other internal requests. It's important for IT professionals to understand common data engineering workflows.