Data Engineer Is Just Another Engineer

How data engineers should apply engineering principles to their data pipelines

Faisal Malik Widya Prasetya
11 min read · Mar 27, 2021

Introduction to Data Engineering

The name tells you everything: data engineering is the engineering counterpart of data science, just as chemical engineering is the engineering counterpart of chemistry, and so on for every other engineering/science pairing. Sometimes engineering applies science to build something, and sometimes science uses an engineering product to discover something. In the case of data engineering, data science mostly relies on the data engineering product, the data pipeline, to provide its data. The relationship can be seen in the data analytics hierarchy of needs.

Data Analytics Hierarchy of Needs

The hierarchy of needs basically says that the lower layers are the foundation of the ones above them. In this case, data engineering needs input from the development work that generates data, whether from IoT devices, web apps, mobile apps, and so on. Data analysis and data science (they often overlap, so I treat them as one layer for simplicity) need the input data from data engineering, and machine learning engineering needs a model from the data scientists to deploy to end customers.

Engineering Principles

There are a lot of definitions of engineering principles, but for this post let's agree on one. I would put it as simply as possible: the processes that must exist to get things done.

Engineering Principles in Process

Almost every engineering project needs these principles. For example, when you build a bridge, there must be an architecture that describes how the bridge will be built; it often also provides measurements, scale, and so on to make the next steps easier. Infrastructure is a must: without proper materials, foundations, and other infrastructure, the bridge will not hold. Operations are what we actually want: once the infrastructure is ready, the bridge can operate, and vehicles from bikes to trucks can cross to the other side. During operations, there should be a monitoring system to check whether the bridge has an issue, which is important to keep operations running smoothly and to decide what actions to take when an issue appears.

As the title says, a data engineer is just another engineer: data engineering needs proper infrastructure to run its operations, which most of the time means a data pipeline, and that infrastructure needs an architecture design to make sure there are no missing links. Unfortunately, most of the data engineering courses I have found focus only on the operations side, on how to create a data pipeline, without explaining the engineering principles; if there is any extra scope, they give some insight into the importance of monitoring. This makes some sense, because most courses target beginners who want to kickstart their career as data engineers, while architecture and infrastructure are higher-level decisions made by senior or managerial data engineers. But these principles matter: they let beginners think top-down, which pays off when they want to accelerate their career or when they join an early-stage startup with a flat structure and have to build the data pipeline from scratch on their own.

Data Engineering Architecture

The data engineering architecture contains the blueprint of the end-to-end data pipeline, from the sources to the destination. The sources can vary, but ideally there is only one destination, since we want a single source of truth. The architecture tells us which tech stacks will be used and how they communicate with each other so that data from the sources is automatically delivered to the destination. Several concepts are needed to design the architecture of a data pipeline, such as:

  • ELT (Extract Load Transform) or ETL (Extract Transform Load): both are concepts for extracting data from the source and loading it into the destination. The real decision is where the transformation happens: during the pipeline or in the destination. In my experience, that depends on the destination technology. If the destination has a powerful engine to transform the data, it is better to use ELT and transform in the destination. If the destination has no transform capability, use ETL. The simplest metaphor is supplier-to-warehouse integration: do we arrange our supply during delivery or inside the warehouse? Of course, it depends on the warehouse's capabilities.
ETL (Top) vs ELT (Bottom)
  • Batch Processing or Stream Processing: to decide which one to use, we need to pin down the actual need. Do the stakeholders need real-time data, or is a delay acceptable? If they need real-time data (and this has to be confirmed with the stakeholders), you have to use stream processing. If a delay is acceptable, use batch processing. This decision matters a lot, because stream processing technologies are expensive and harder to build. Batch processing is also the choice when the extraction approach is pull rather than push. Pull extraction is needed when the data source is static and we have to query or request the data on a schedule, like a supplier waiting for our delivery service to pick up their products. Push extraction, on the other hand, is needed when the data is already moving and we have to capture it; in other words, there is another trigger that moves the data.
Batch Processing ELT (Left) vs Stream Processing ELT (Right)
  • Incremental Load or Full Load: this choice applies to batch processing, since stream processing is inherently incremental. Incremental load means you extract and load only the new data from the source that does not yet exist in the destination, while full load means you extract all the data from the source and load it into the destination at once. Incremental load is clearly more efficient, but there are cases where we cannot use it, such as when the source has no order key and no unique key. A full load is simple: query all the data from the source and load it into the destination. For an incremental load, we need the load state from the destination, or we can save the load state somewhere else so we don't have to hit the destination to get it. The load state is simply the latest value of the order key (the column used to order the data and find the most recent rows). Once we have the load state, we query the source with it to get only the new data, and then load that into the destination. In the transform step, since rows in the source can be updated, we keep only the latest version of each row by removing older rows that share the same unique key (the column that identifies a row as a unique entity). That is why we need both an order key and a unique key in the source table to perform an incremental load (see the sketch after this list).
Incremental Load ELT Batch Processing
  • Data Lake or Data Warehouse or Data Mart: unlike the previous points, I'll start with the definitions and their differences. A data lake is a collection of big raw data (usually unstructured). This is a good approach when you have no schema defined up front, or as a staging area for data that has not been transformed yet. The best thing about this approach is that changes in the source don't cause issues downstream. A data warehouse, on the other hand, is a collection of big clean data: the data has to be clean, easy to access, and it should integrate all the data sources needed. A data mart is a smaller dataset produced by transforming data from the data warehouse, with a transformation designed to be as specific as possible. We don't really have to choose between the three; we can implement all of them at once. For example, in an ELT approach, we put the data into the data lake, transform it into the data warehouse, and finally create data marts from the warehouse based on stakeholder requests.
Data Lake to Data Warehouse to Data Mart
  • And many more.
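To make the incremental load flow concrete, here is a minimal Python sketch of one batch run. The in-memory lists, the updated_at order key, and the id unique key are stand-ins for real source and destination tables; the point is the shape of the logic, not a drop-in implementation.

from datetime import datetime, timezone

# Hypothetical stand-ins for the real source and destination tables.
# In a real pipeline these would be query results and a warehouse client.
source_table = [
    {"id": 1, "name": "a", "updated_at": datetime(2021, 3, 1, tzinfo=timezone.utc)},
    {"id": 2, "name": "b", "updated_at": datetime(2021, 3, 2, tzinfo=timezone.utc)},
    {"id": 1, "name": "a2", "updated_at": datetime(2021, 3, 3, tzinfo=timezone.utc)},
]
destination_table = [
    {"id": 1, "name": "a", "updated_at": datetime(2021, 3, 1, tzinfo=timezone.utc)},
]

ORDER_KEY = "updated_at"   # column used to find the most recent rows
UNIQUE_KEY = "id"          # column that identifies a unique entity


def get_load_state(destination):
    """Load state = latest order-key value already in the destination."""
    return max((row[ORDER_KEY] for row in destination), default=None)


def extract_new_rows(source, load_state):
    """Pull only rows newer than the load state (the incremental part)."""
    if load_state is None:
        return list(source)  # first run falls back to a full load
    return [row for row in source if row[ORDER_KEY] > load_state]


def deduplicate(rows):
    """Keep only the latest version of each unique key."""
    latest = {}
    for row in sorted(rows, key=lambda r: r[ORDER_KEY]):
        latest[row[UNIQUE_KEY]] = row
    return list(latest.values())


load_state = get_load_state(destination_table)
new_rows = extract_new_rows(source_table, load_state)
destination_table = deduplicate(destination_table + new_rows)
print(destination_table)  # id=1 now holds the updated "a2" row, id=2 is new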

After defining which concepts to use, the tech stack needs to be decided. There are several managed services on AWS, GCP, Azure, and other cloud providers, and you can also use open source technologies. Say we decide on ELT batch processing with incremental load; we then have to choose our scheduler, processing engine, and destination. In my case the data sources usually live in the AWS ecosystem, but we want to use GCP BigQuery as the destination, so we have to use both stacks.

  • Scheduler: EventBridge
  • Process Engine: Lambda Function
  • Destination: BigQuery as Data Lake and Data Warehouse

Here’s our architecture:

ELT Batch Processing with Incremental Load
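As a rough sketch, the EventBridge-to-Lambda part of this architecture could be wired up with boto3 as below. The rule name, schedule, and function ARN are made up, and a real deployment would more likely go through an infrastructure-as-code or deployment tool than ad hoc calls.

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

FUNCTION_ARN = "arn:aws:lambda:ap-southeast-1:123456789012:function:elt-batch"  # hypothetical

# EventBridge rule acting as the scheduler, e.g. one run per hour.
events.put_rule(
    Name="elt-batch-hourly",
    ScheduleExpression="rate(1 hour)",
    State="ENABLED",
)

# Point the rule at the Lambda function that runs the pipeline.
events.put_targets(
    Rule="elt-batch-hourly",
    Targets=[{"Id": "elt-batch-lambda", "Arn": FUNCTION_ARN}],
)

# EventBridge also needs permission to invoke the function.
lambda_client.add_permission(
    FunctionName="elt-batch",
    StatementId="allow-eventbridge-hourly",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
)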

For stream processing we need a capturer, and luckily AWS has API Gateway to capture HTTP request data. Sometimes we also have to collect the data before loading it to the destination to reduce cost, and for that we can use Kinesis (a sketch of the capture side follows the diagram below). Here's an example of ELT stream processing.

ELT Stream Processing with Stream Collector
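Here is a minimal sketch of the capture side, assuming API Gateway is set up as a Lambda proxy integration and a hypothetical Kinesis stream name; a separate consumer Lambda would then read the collected batches and load them into BigQuery.

import json

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "events-capture"  # hypothetical Kinesis stream


def handler(event, context):
    """Lambda behind API Gateway: capture the pushed payload into Kinesis."""
    payload = event.get("body") or "{}"  # raw JSON string from the HTTP request
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=payload.encode("utf-8"),
        PartitionKey=str(json.loads(payload).get("column1", "default")),
    )
    return {"statusCode": 200, "body": json.dumps({"status": "captured"})}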

Of course this architecture is still oversimplified, since we haven't included monitoring (which gets its own section) or credential access, but the overall process is already captured in the diagrams above.

Data Engineering Infrastructure

Once we have the architecture, we have to allocate resources for the implementation using the infrastructure tech stack the architecture calls for: how the API Gateway signature works and how it confirms that the service is sending the data we want, how much memory each Lambda function needs, how many rows of data should be collected in one Kinesis batch, and so on. With the stack above we have actually minimized the amount of infrastructure setup needed, because these services are serverless, which means there is only a minimal amount of actual server management. If we decide to use Kubernetes or deploy an orchestrator like Airflow instead, we have to manage a lot more infrastructure: how autoscaling will work, how much resource to allocate to each node or worker, how to manage resource access with VPCs, security groups, subnets, and so on. In my experience I had to work closely with the DevOps team, and I found that a lot of fun. A data engineering course should probably also elaborate on how to reduce cost and use the tech stack efficiently.
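Two of those knobs, the Lambda memory size and the Kinesis batch size, can be set with boto3 roughly like this; the function name, stream ARN, and numbers are placeholders, and in practice you would probably manage them through your deployment tool rather than one-off calls.

import boto3

lambda_client = boto3.client("lambda")

# How much memory (and implicitly CPU) one pipeline function gets.
lambda_client.update_function_configuration(
    FunctionName="elt-stream-loader",  # hypothetical function name
    MemorySize=512,                    # MB
    Timeout=120,                       # seconds
)

# How many Kinesis records are collected into one batch before the
# consumer Lambda is invoked to load them into the destination.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:ap-southeast-1:123456789012:stream/events-capture",
    FunctionName="elt-stream-loader",
    BatchSize=500,
    StartingPosition="LATEST",
)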

Data Engineering Operations (Data Pipeline)

The operations part is where we write the actual data pipeline code. As I mentioned above, most data engineering courses focus on this part and miss several essentials of architecture design and infrastructure. Since we have designed the architecture and built the infrastructure, we can develop the data pipeline for operations easily. I recommend AWS Chalice to develop and deploy the data pipeline, as it is really Pythonic. It still has several limitations, such as the scheduler input and so on, but it is worth it for the simplicity. Since the transformation is done in the destination, in this case BigQuery, we can take advantage of scheduled queries, or, if you prefer to keep all operations in Python, you can query the BigQuery table from Python. I usually put the JSON string inside the raw data "lake". It would look something like this.

╔══════════════════════════════════════════════════════╗
║ raw_data ║
╠══════════════════════════════════════════════════════╣
║ {"column1": "value1", "column2": 1, "column3": true} ║
║ {"column1": "value2", "column2": 7, "column3": false}║
║ {"column1": "value3", "column2": 3, "column3": true} ║
║ {"column1": "value4", "column2": 4, "column3": true} ║
╚══════════════════════════════════════════════════════╝
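Here is a minimal sketch of a scheduled Chalice function that could land rows like the ones above into the raw table; the schedule, the table ID, and the extract_from_source stub are assumptions, not a full implementation.

import json

from chalice import Chalice, Rate
from google.cloud import bigquery

app = Chalice(app_name="elt-batch-pipeline")
RAW_TABLE = "my-project.my_dataset.raw_data"  # hypothetical BigQuery data lake table


def extract_from_source():
    """Hypothetical extract step; in reality this queries the source incrementally."""
    return [{"column1": "value5", "column2": 5, "column3": False}]


@app.schedule(Rate(1, unit=Rate.HOURS))  # the EventBridge schedule, managed by Chalice
def load_raw_data(event):
    bq = bigquery.Client()
    # Keep each record as a JSON string so source schema changes
    # never break the destination schema.
    payload = [{"raw_data": json.dumps(row)} for row in extract_from_source()]
    errors = bq.insert_rows_json(RAW_TABLE, payload)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")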

The JSON string values can be extracted with BigQuery's JSON_EXTRACT_SCALAR function. This step is important to make sure the destination schema is not affected by changes in the source schema. If we extract the values from the JSON string, we get the following table in the clean data warehouse.

╔══════════╦═════════╦═════════╗
║ column1 ║ column2 ║ column3 ║
╠══════════╬═════════╬═════════╣
║ value1 ║ 1 ║ TRUE ║
║ value2 ║ 7 ║ FALSE ║
║ value3 ║ 3 ║ TRUE ║
║ value4 ║ 4 ║ TRUE ║
╚══════════╩═════════╩═════════╝
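If you prefer the Python route over a BigQuery scheduled query, the transform could look roughly like this; the project, dataset, and table names are made up.

from google.cloud import bigquery

bq = bigquery.Client()

# Rebuild the clean warehouse table from the raw JSON "lake" table.
# JSON_EXTRACT_SCALAR keeps the warehouse schema stable even if the
# source adds or renames fields we don't care about.
transform_sql = """
CREATE OR REPLACE TABLE `my-project.my_dataset.clean_table` AS
SELECT
  JSON_EXTRACT_SCALAR(raw_data, '$.column1') AS column1,
  CAST(JSON_EXTRACT_SCALAR(raw_data, '$.column2') AS INT64) AS column2,
  CAST(JSON_EXTRACT_SCALAR(raw_data, '$.column3') AS BOOL) AS column3
FROM `my-project.my_dataset.raw_data`
"""

bq.query(transform_sql).result()  # wait for the transform job to finish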

Data Engineering Monitoring

Monitoring is essential in a data pipeline: it ensures that no error goes unnoticed, and if an error does occur we can spot it and handle it as soon as possible. The simplest monitoring is the dashboard provided by AWS Lambda, but we don't always have time to watch the pipeline, and we shouldn't have to check for errors when there are none. So the answer is alerting. The best method for alerting is to use the API of the communication channel you use the most. In my case I created a Slack channel with a Slack webhook URL that I can push messages to with an API call. For every function I develop, I wrap the body in a try clause and, at the end, except a granular exception aliased as a variable, which later becomes the error message sent to the Slack channel.
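A sketch of that pattern, assuming the webhook URL sits in an environment variable and the requests library is available; run_pipeline is a hypothetical placeholder for the actual pipeline code.

import os
import traceback

import requests

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # hypothetical env var


def alert_to_slack(message):
    """Push an error message to the alerting Slack channel via its webhook."""
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)


def run_pipeline():
    """Hypothetical placeholder for the actual extract-and-load code."""
    raise NotImplementedError


def handler(event, context):
    try:
        run_pipeline()
    except Exception as error:  # ideally a more granular exception class
        alert_to_slack(f"Pipeline failed: {error}\n{traceback.format_exc()}")
        raise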

Conclusion

Data engineering is just another engineering discipline that should follow engineering principles in practice: design the architecture, build the infrastructure, develop the data pipeline, and monitor for errors. The first two principles are rarely discussed in beginner data engineering courses, even though they are important for accelerating a career in data engineering and are super helpful when doing data engineering at an early-stage startup.

Note: the architecture, infrastructure, data pipeline, and monitoring models above are conceptually production grade, but the implementation is still oversimplified.


Faisal Malik Widya Prasetya

Data Engineer with experience at various startups and in consulting.