So what is a data engineer and what does a data engineer do? This article gives an introduction to types of data engineers, data engineer skills and the differences among different data jobs.
In 2022, people created an estimated 90 zettabytes of data (1 zettabyte is equivalent to 1,000,000,000,000,000,000,000 (10^21) bytes). Internet of Things devices collect data at the edge, and social media generates ever more unstructured data such as video and voice.
This growth has created strong demand for data engineers. While demand for data engineers does not grow in step with the astonishing growth of data generated (because a big part of data collection and ingestion can be automated), the data industry has seen a steady increase of interest in data engineers.
What is a Data Engineer
In the previous chapter, we introduced data engineering as the process of collecting, ingesting, storing, transforming, cleaning, and modelling big data. Data engineers are the people mainly involved in this process. Let’s break it down and give more specific examples.
Depending on the data sources, data engineers work with different tools to extract data.
- When the source is a relational database such as Postgres or MySQL, data engineers usually write SQL queries to extract the data. These queries encode business logic, and data engineers construct them to retrieve the data they need from the database. When the source data is stored in a NoSQL database, data engineers construct queries using that database’s specific API.
- When the data sources are SaaS applications such as Mailchimp or HubSpot, modern data engineers may work with pre-built connectors provided by tools such as Fivetran or Airbyte to automate data extraction and ingestion. When connectors are not available, data engineers may need to write custom connectors or work directly with the SaaS API.
- Sometimes data engineers need to use Python or other programming tools to scrape and crawl websites for data. Scrapy, Selenium, and Beautiful Soup are popular Python packages for web scraping.
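The relational case above can be sketched in a few lines. This is a minimal illustration only: it uses Python's built-in `sqlite3` module as a stand-in for Postgres or MySQL, and the `orders` table, its columns, and the function names are hypothetical.

```python
import sqlite3


def build_demo_source():
    """Create an in-memory database that stands in for a production source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, customer_id INTEGER, "
        "amount REAL, order_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        [
            (1, 10, 25.0, "2022-01-05"),
            (2, 11, 40.0, "2022-02-10"),
            (3, 10, 15.5, "2022-03-01"),
        ],
    )
    conn.commit()
    return conn


def extract_recent_orders(conn, since):
    """Return orders placed on or after `since` (an ISO date string).

    The WHERE clause is the business logic; the `?` placeholder keeps the
    query parameterized instead of concatenating values into the SQL text.
    """
    query = """
        SELECT order_id, customer_id, amount, order_date
        FROM orders
        WHERE order_date >= ?
        ORDER BY order_date, order_id
    """
    return conn.execute(query, (since,)).fetchall()


source = build_demo_source()
for row in extract_recent_orders(source, "2022-02-01"):
    print(row)
```

Against a real Postgres or MySQL source the query would look much the same; mainly the driver and the placeholder syntax would change.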
Data engineers work with many tools for data ingestion. Which tool to use usually depends on the company’s existing data infrastructure and project goals and budget.
- In companies that build modern data stacks, data engineers might work with Fivetran, Stitch, or Airbyte to ingest data directly into the data warehouse. In this case, coding is only required when custom connectors need to be created, and the focus of data engineers shifts to data transformation. This approach follows the ELT (extract, load, transform) paradigm.
- In companies that deal with lots of unstructured data, data engineers may need to ingest data into a data lake first, for example AWS S3 or the Hadoop Distributed File System. In this case, data engineers need to work with the input data format and help define the output data structure. They also need to write code for preliminary data encryption and masking to comply with privacy regulations. Serverless functions such as AWS Lambda functions are often used to transform data on the fly before it is saved to the destination.
- In many cases, modern data engineers also need to ingest streaming data for real-time processing and analytics. Data streams from the source systems are ingested into distributed messaging systems such as Kafka and Kinesis.
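As a rough illustration of masking data on the fly before it lands in the data lake, the sketch below pseudonymizes an email field with a keyed hash (HMAC-SHA256). This is one possible masking technique, not the only one. The `handler` signature mimics a serverless entry point, but the event shape, field names, and hard-coded key are assumptions made for the example.

```python
import hashlib
import hmac
import json

# Hypothetical key for demonstration only; a real pipeline would load it
# from a secrets manager rather than hard-coding it.
SECRET_KEY = b"demo-only-key"


def pseudonymize(value, key=SECRET_KEY):
    """Replace a sensitive string with a keyed SHA-256 digest.

    The same input always maps to the same token, so joins on the masked
    column still work, but the original value is not stored.
    """
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record, pii_fields=("email",)):
    """Return a copy of `record` with the listed PII fields pseudonymized."""
    masked = dict(record)
    for field in pii_fields:
        if masked.get(field) is not None:
            masked[field] = pseudonymize(str(masked[field]))
    return masked


def handler(event, context=None):
    """Serverless-style entry point; `event["records"]` is an assumed shape."""
    return [json.dumps(mask_record(r), sort_keys=True) for r in event["records"]]


print(handler({"records": [{"user_id": 1, "email": "Ada@Example.com"}]}))
```

Keyed hashing is a form of pseudonymization rather than encryption: it cannot be reversed with the key. If the original values must be recoverable, an encryption library would be needed instead.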
Data engineers also need to work with many different storage systems:
- Relational databases (Postgres, MySQL, SQL Server, etc.)
- NoSQL databases (MongoDB, DynamoDB, Cassandra, Elasticsearch, etc.)
- Data warehouses (Snowflake, Redshift, BigQuery, Azure SQL Data Warehouse, now Azure Synapse Analytics)
- Data lake storage (Hadoop, object storage)
- Data lakehouse storage
Data engineers should understand the differences among database engines and their particular use cases, and help the company choose the right databases for storage. In bigger companies, this is sometimes the job of data architects.
Data transformation is where data engineers spend the bulk of their time. Data engineers work with tools like SQL, Spark, and Lambda functions to transform data in batch or in real time.
Data transformation is important because data only generates real business value once business logic has been applied to it. This is where we see significant differences among companies.
- When the company deals with large datasets, data engineers may write Spark jobs to process data in a distributed cluster.
- When the company uses a data warehouse, data engineers may use tools such as dbt and SQL for complex data transformations.
- When the dataset is small and messy, data engineers may just use Pandas for data transformation, and the pipeline can be deployed to AWS Lambda.
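To make the "small and messy" case concrete, here is a minimal cleaning sketch. Pandas would be the usual choice, as noted above; the example sticks to the standard library so it runs anywhere. The sample data, column names, and cleaning rules (for example, treating a blank spend as 0.0) are illustrative assumptions, not rules from any real pipeline.

```python
import csv
import io
from datetime import date

# Hypothetical raw extract with typical problems: stray whitespace,
# inconsistent casing, a duplicate row, a blank value, and a bad date.
RAW_CSV = """customer_id,signup_date,country,monthly_spend
1,2022-01-03, ca ,120.50
2,2022-01-04,US,
2,2022-01-04,US,
3,not-a-date,us,80
"""


def parse_date(text):
    """Return a date for ISO-formatted text, or None if it cannot be parsed."""
    try:
        return date.fromisoformat(text.strip())
    except ValueError:
        return None


def clean_rows(raw_text):
    """Deduplicate rows and normalize types, casing, and missing values."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        key = tuple(row.values())
        if key in seen:  # drop exact duplicate rows
            continue
        seen.add(key)
        spend = row["monthly_spend"].strip()
        cleaned.append(
            {
                "customer_id": int(row["customer_id"]),
                "signup_date": parse_date(row["signup_date"]),
                "country": row["country"].strip().upper(),
                # Assumed business rule: a missing spend counts as 0.0.
                "monthly_spend": float(spend) if spend else 0.0,
            }
        )
    return cleaned


for record in clean_rows(RAW_CSV):
    print(record)
```

A function like `clean_rows` is small enough to be wrapped in a serverless handler, which is the deployment style the last bullet describes.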
Data transformation can happen in different places.
- If a modern data stack is adopted, data engineers write data transformation logic inside the data warehouse and use tools such as dbt to automate the workflow.
- If the company adopts the ETL approach, data engineers might use external tools and platforms such as Apache Spark for data transformations and then load the data into a traditional warehouse.
- In the more traditional approach found in big companies, data engineers work with specialized ETL tools such as Informatica, where transformation is also done in an external environment before data is loaded into the downstream data store.
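The difference between the approaches above is mostly about where the transformation runs. The sketch below shows the ELT style: raw data is loaded untouched, and the transformation is then expressed in SQL inside the database. It uses an in-memory SQLite database as a stand-in for a cloud warehouse, and the table and column names are hypothetical.

```python
import sqlite3


def run_elt(raw_events):
    """Load raw events as-is, then transform them with SQL in the database."""
    conn = sqlite3.connect(":memory:")

    # Load: land the raw data without reshaping it.
    conn.execute(
        "CREATE TABLE raw_events "
        "(user_id INTEGER, event_type TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_events)

    # Transform: build an analytics table from the raw table using SQL,
    # which is the kind of statement a tool like dbt would manage.
    conn.execute(
        """
        CREATE TABLE user_revenue AS
        SELECT user_id,
               COUNT(*) AS purchases,
               SUM(amount_cents) / 100.0 AS revenue
        FROM raw_events
        WHERE event_type = 'purchase'
        GROUP BY user_id
        """
    )
    return conn.execute(
        "SELECT user_id, purchases, revenue FROM user_revenue ORDER BY user_id"
    ).fetchall()


events = [
    (1, "purchase", 1250),
    (1, "page_view", 0),
    (2, "purchase", 300),
    (1, "purchase", 750),
]
print(run_elt(events))
```

In an ETL version of the same job, the filtering and aggregation would happen in an external engine such as Spark or a dedicated ETL tool, and only the finished `user_revenue` rows would be loaded.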
Data engineers working on ETL/ELT need a very solid understanding of data modelling for RDBMS and NoSQL engines.
Data Pipelines & Automation
ETL/ELT processes can be very complex. The code base can get hairy and hard to control as the project grows, especially when the business logic is complex. In some large systems, data engineers may need to deal with thousands of database tables, and without automation this can be quite a daunting task.
Data processing units often depend on each other. For example, a data pipeline that processes one massive transaction table might feed several downstream processes. Instead of building separate pipelines that each re-run the hour-long processing, it makes sense for all downstream pipelines to depend on the single pipeline that produces the transaction table. These dependencies form a DAG (Directed Acyclic Graph), and data engineers need to build and maintain DAGs for complex data processing needs.
Data engineers use platforms such as Apache Airflow, Prefect, or Dagster to build and orchestrate pipelines. Some of these platforms can be deployed on Kubernetes clusters, so data engineers also need a basic understanding of containers and container orchestration.
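The dependency idea behind a DAG can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not how Airflow, Prefect, or Dagster are used in practice; it only illustrates how an orchestrator derives a valid run order from declared dependencies. The task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of upstream tasks it depends on.
# Three downstream jobs reuse the single expensive transaction load.
PIPELINE = {
    "load_transactions": set(),
    "daily_revenue": {"load_transactions"},
    "fraud_features": {"load_transactions"},
    "exec_dashboard": {"daily_revenue", "fraud_features"},
}


def execution_order(dag):
    """Return the tasks in an order where every upstream task runs first."""
    return list(TopologicalSorter(dag).static_order())


def is_acyclic(dag):
    """Check that the dependencies contain no cycle, i.e. form a valid DAG."""
    try:
        execution_order(dag)
        return True
    except CycleError:
        return False


print(execution_order(PIPELINE))
print(is_acyclic({"a": {"b"}, "b": {"a"}}))
```

Because `load_transactions` appears only once in the graph, it runs once no matter how many downstream tasks depend on it, which is the saving described above.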
Data Engineer vs Data Scientist vs Software Engineer vs DevOps Engineer
You have probably noticed that when we introduced data engineering, we didn’t really mention statistics, math, or machine learning. That is one of the major differences between data engineering and data science. Data engineers mainly focus on building data pipelines, ingesting data, and preparing data for analytics projects; they don’t need to build machine learning models. However, data engineers work very closely with data scientists, and the two roles share a common set of skills. Data engineers sometimes also need software engineering and DevOps skills.
Data Engineer vs Data Scientist
- Both data engineers and data scientists (or BI engineers) need to be very good at writing SQL queries, the fundamental skill for manipulating data stored in relational databases. While data engineers mainly use SQL to transform raw data and load it into analytics environments, data scientists extract relevant data from the analytics databases for exploratory analysis and machine learning. In short, data engineers make sure that data is properly prepared and of high quality, and data scientists analyze that prepared data for insights.
- Data engineers and data scientists also need to be skilled at programming. When data is not processed in SQL databases, they may write Python and Spark scripts to process data in different execution environments such as a Spark cluster, a serverless function (for example AWS Lambda), or a Python process in a container.
- Data engineers write Python scripts to automate data pipelines in Apache Airflow or Prefect, while data scientists might use the same tools to build machine learning pipelines. In some companies, ML pipeline automation is also done by the data engineers, but it’s up to the data scientists to develop, train, and tune the machine learning models.
- Data engineers and data scientists also need to be comfortable working with big data frameworks such as Hadoop and Spark, with data scientists focusing more on the ML side.
Data Engineer vs DevOps Engineer / Cloud Engineer
When you look at WeCloudData’s data engineering bootcamp curriculum, you will notice that we cover a lot of cloud computing knowledge. We not only teach students how to work with data but also cover important aspects of data and cloud infrastructure. DevOps engineers and cloud engineers work with cloud infrastructure as well. Here are the similarities and differences between data engineers and DevOps/cloud engineers:
- DevOps engineers help software teams build CI/CD pipelines. This requires knowledge of the software development lifecycle, continuous integration, and continuous deployment. On the CI/CD side, DevOps engineers mainly deal with software code. Data engineers, on the other hand, work with data pipelines. They not only work with pipeline code but also need to make sure that the data is properly version controlled and of high quality.
- DevOps engineers version control code while data engineers version control both code and data.
- Both DevOps engineers and data engineers need to know how to write test cases and follow test-driven best practices.
- Both DevOps engineers and data engineers work with cloud infrastructure. While DevOps engineers focus more on IaaS (Infrastructure-as-a-Service) and IaC (Infrastructure-as-Code), data engineers work more closely with PaaS (Platform-as-a-Service) tools. For example, data engineers might work with Snowflake for data warehousing and Databricks for big data processing, but they usually don’t set up the underlying infrastructure for these platforms. From time to time, during the PoC (proof-of-concept) stage, data engineers may be required to benchmark different tools; they will then set up the environment using Docker containers and work with cloud compute and storage services such as AWS EC2, ECS, and S3.
- DevOps and cloud engineers spend more effort on creating and maintaining the cloud infrastructure. For example, to help the company scale its application and data systems, DevOps engineers work with cloud engineers to create a Kubernetes cluster. They make sure that the infrastructure is reliable, secure, and scalable. Data engineers, on the other hand, focus on processing the data generated by applications and preparing it for analytics.
Data Engineer vs Software Engineer
In big tech companies, a lot of data engineers actually come from a software engineering background. Some of them are data-savvy engineers and developers. Data engineers and software engineers sometimes need to work closely together to define the data format used for extraction and ingestion. Any upstream changes on the software application side, such as adding new tables, changing table structures, or adding or removing fields, need to be communicated properly to the data engineers who work on data ingestion.
- Data engineers also share a common set of skills with software engineers. For example, they all need to have good foundational knowledge of computer science. They know at least one programming language very well, such as Python, Java, or Scala.
- Both data engineers and software engineers need to be familiar with distributed systems and big data. Software engineers working with big data might focus on streaming data processing and building data-intensive applications, while data engineers focus more on building big data pipelines for batch and streaming data.
- Both data engineers and software engineers need to know SQL very well. Software engineers write CRUD (Create, Read, Update, Delete) operations so their applications can interact with databases, while data engineers will be working with lots of data at the same time (extracting, transforming, loading) in batches.
- Software engineers and data engineers also need to know how to write unit tests and integration tests. Data engineers write test cases to make sure that data transformation functions and UDFs are valid and take all cases into consideration. They also need to write queries that test the validity of the data tables in the data pipeline, to avoid garbage-in, garbage-out situations.
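The two kinds of tests mentioned in the last bullet can be sketched as follows: a unit test for a transformation function, and a query-based check on a table. The function `normalize_amount`, the `payments` table, and the quality rule (no null keys, no negative amounts) are hypothetical examples, and SQLite stands in for the pipeline's database.

```python
import sqlite3
import unittest


def normalize_amount(raw):
    """Convert strings like '$1,234.50' to a float; return None for blanks."""
    if raw is None or not raw.strip():
        return None
    return float(raw.replace("$", "").replace(",", "").strip())


def count_bad_rows(conn):
    """Data-quality query: count rows with a null key or a negative amount."""
    return conn.execute(
        "SELECT COUNT(*) FROM payments WHERE payment_id IS NULL OR amount < 0"
    ).fetchone()[0]


class PipelineTests(unittest.TestCase):
    def test_normalize_amount_handles_edge_cases(self):
        self.assertEqual(normalize_amount("$1,234.50"), 1234.5)
        self.assertEqual(normalize_amount(" 7 "), 7.0)
        self.assertIsNone(normalize_amount("   "))
        self.assertIsNone(normalize_amount(None))

    def test_payments_table_has_no_bad_rows(self):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE payments (payment_id INTEGER, amount REAL)")
        conn.executemany(
            "INSERT INTO payments VALUES (?, ?)", [(1, 10.0), (2, 0.0)]
        )
        self.assertEqual(count_bad_rows(conn), 0)
```

Saved as a module, these tests can be run with `python -m unittest`. In a real pipeline the table check would run against the actual output table after each load rather than against fixture data.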
The Value of Data Engineers
Truth be told, data engineers don’t get the same glamour as data scientists do. Data scientists (DS) and data analysts (DA) work more closely with business teams such as product, marketing, and sales. Business teams are always hungry for data and insights to drive better decisions. Therefore, data scientists and analysts are more likely to get the recognition.
Data engineers are the unsung heroes. But many organizations today are starting to realize the importance of data engineers. In the U.S., senior data engineers earn a higher salary on average than people in DS and DA jobs. Many startups are hiring data architects and engineers before hiring data scientists. We created the diagram below to help you understand how different teams collaborate and the importance of data engineers.
Let’s start from the bottom of the diagram.
- Data Engineers
- Before any machine learning and analytics can happen, data needs to be collected. Data engineers work with the application team to identify data sources and metadata.
- Data engineers help ingest the raw data into the data warehouse or data lake and help encrypt and mask the raw data.
- Data engineers then build data pipelines to transform and integrate all the data and create the data tables used for analytics.
- Data Analysts
- Data analysts extract data from the analytics database and create ad-hoc reports to address business users’ questions.
- Data analysts work with dbt to transform data and build wide tables for reporting purposes (in some companies this could be done by BI developers, analytics engineers, or data engineers).
- Data analysts present the dashboards and ad-hoc reports to the business, get feedback, and tweak the dashboards and reports until they are accepted by the end users.
- Data Scientists
- Data scientists extract data from the analytics database or data lake and begin with data exploration (EDA, Exploratory Data Analysis). Heavy-lifting data preparation steps are usually performed in the data warehouse (Snowflake, Redshift, BigQuery) or the data lake (Athena, Trino, Spark, etc.).
- Data scientists then bring data into a machine learning environment such as AWS SageMaker or a Databricks Spark cluster.
- Data scientists run various experiments and build ML pipelines for supervised or unsupervised learning.
- Data scientists interpret the predictive modeling outcomes for the product and marketing teams.
- Once the models are accepted, data scientists work with data engineers or machine learning engineers to automate the scoring process (in many companies, model scoring and deployment are handled by data scientists as well).
- Machine Learning Engineers
- Machine learning engineers help data scientists train and tune ML models at scale. For data-intensive ML jobs, machine learning engineers may leverage distributed platforms such as Spark for data transformation and feature engineering.
- For large-scale model training, machine learning engineers leverage open source frameworks such as Ray or Spark ML. AWS SageMaker is used by MLEs at companies that prefer the AWS ML stack.
- Machine learning engineers also need to help deploy the ML models. Tasks include model packaging, building prediction service endpoints, and model testing and validation after deployment. Model retraining and tuning pipelines are automated by ML engineers as well. Some of these tasks fall under the MLOps realm, but in many organizations MLEs actually wear multiple hats. The line between MLOps and MLE is pretty blurred.
Hopefully this section explains the importance of data engineers. One suggestion WeCloudData would like to give is to develop crossover skills. Whether you want to specialize in DS, DE, or MLE, it’s always helpful to learn some skills from other roles.
A Day in the Life of a Data Engineer
What does a typical day look like for a data engineer? Well, it varies depending on the type of data engineer we are talking about. For this section, let’s look at a typical day in the life of a data engineer working on data lake projects. Note that not all data engineers work in the same way, so take it with a grain of salt.
- 9:30 – 10:00 – Log on to the work laptop and reply to some emails. Quickly check the data pipeline dashboard to make sure nothing alarming is going on.
- 10:00 – 10:30 – Meet with the team in the daily project standup to discuss accomplishments, challenges, and roadblocks for the day.
- 10:30 – 12:30 – Pick an assigned task from JIRA and start working on a new data transformation task using an AWS Lambda function.
- 12:30 – 13:30 – Lunch break
- 13:30 – 14:30 – Wrap up the data transformation task, test it locally, and deploy it to the cloud.
- 14:30 – 15:00 – Join a meeting with the product manager and data scientists to discuss new data ingestion requirements. Contribute to the meeting minutes.
- 15:00 – 17:00 – Draft the data pipeline architecture using Lucidchart and send the draft diagram to related teams.
- 17:00 – 17:30 – Send Slack messages to the platform team to discuss the pros and cons of different CDC (change data capture) strategies for a new ingestion project. Log the work in JIRA or Notion and wrap up the day.
Different Types of Data Engineer
If you’re about to embark on a data engineering career, make sure you understand that there are different types of data engineers and depending on the teams, the scope of their work may be very different.
Data Architect vs Data Engineer
We get this question often: what is the difference between a data architect and a data engineer?
Well, while a data architect may come from a data engineering background, they primarily work at the strategy level. Data architects spend less time on hands-on implementation, but they know enough (many are very experienced) about different tool stacks, data governance, and data architecture to help the organization set direction on tooling, high-level data infrastructure planning, and data systems design. Data engineers, on the other hand, spend more time on execution. Let’s dive in and discuss the different types of data engineers.
Four types of data engineers
In this section, WeCloudData categorizes data engineers into four different types:
- ETL/ELT Engineer
- ETL engineers are typically involved in data ingestion, integration, and transformation. They work with various data sources and bring the source data into the data lake or warehouse, using different data connectors and ingestion tools to bring web analytics data, SaaS data, and operational data together and integrate them via big data processing. Some big companies use traditional tools such as DataStage and Informatica and require data engineers with specialized experience. Many tech companies work with open source tools such as Apache NiFi, Kafka, and Hadoop for data ingestion and integration. ETL engineers need to learn many different tools as they move from project to project. But keep in mind that tools are just tools: the most important skill of an ETL/ELT engineer is the ability to write good code that transforms data based on complex business logic. This requires strong SQL, Python, and Spark programming skills, a solid understanding of data models, and great attention to detail.
- Data-savvy Software Engineer
- In big tech companies, many data engineers come from a software engineering background. This type of data engineer helps companies build data-intensive applications. For example, building a real-time recommender system that can handle tens of thousands of requests every second requires very good system design skills. Unlike ETL/ELT data engineers, who take care of data quality throughout the transformation stages, data software engineers care about the scalability, latency, and reliability of data applications. They work with streaming data ingested into Kafka, write Spark Streaming or Apache Flink jobs to process the data in near real time, and write the processed data to other data stores such as caches or NoSQL databases. Instead of SQL and Python, they are more likely to work with programming languages such as Scala, Java, or Go.
- BI/Analytics Engineer
- BI and analytics engineers mainly work in data warehouse environments. They work with data connectors such as Fivetran to ingest data into a data warehouse such as Snowflake, use dbt for SQL-based data transformations, and build dimensional models or wide tables for analytics purposes.
- Analytics engineers don’t need to specialize in Python and Spark, but they need to be really good at complex SQL queries and data modeling.
- The rapid development of the modern data stack ecosystem has helped create many analytics engineer jobs. Cloud data warehouse platforms such as Snowflake have also brought about a resurgence of SQL-based data modeling.
- ML Data Engineer
- A common understanding is that ML engineers work on building ML pipelines. But in many companies, especially startups, data engineers need to wear multiple hats. Ultimately it’s the job description, not the job title, that defines a role. So don’t be surprised to be asked to automate ML pipelines.
- Data engineers also work really closely with the data science team, or even within it. A lot of ML engineering work actually involves data engineering skills.
- WeCloudData’s suggestion for aspiring data engineers is to never pass up an opportunity to learn crossover skills.
We hope that this article helps you understand what data engineers are and what they do on a daily basis. If you liked this article, please help us spread the word.