(Published for RefinePro on May 15, 2020)
Extracting, transforming, and loading (ETL) data is a complex process at the center of most organizations’ data extraction projects. As we saw in our article on web scraping and ETL, the implementation of an ETL workflow is a process that requires a lot of in-depth knowledge in several subfields of statistics and programming.
ETL developers thus work at the crossroad of different fields. They must ensure that every aspect of the data life cycle has been addressed to ensure its operability and maintainability. To help your developer navigate the deep and dark waters of ETL, RefinePro has drawn on their years of experience to create a list of ETL principles and best practices.
What you see here is not meant to be a grocery list; these guidelines need to be considered, and then implemented or rejected. Your developer will draw on their understanding of the project and their experience to decide which principles are needed, when, and at what range.
TEN BEST PRACTICES FOR ETL WORKFLOW IMPLEMENTATION
Modularity is the process of writing reusable code structures to help you keep your job consistent in terms of size and functionalities. With modularity, your project structure is easier to understand, making troubleshooting easier too. The ultimate goal is to improve job readability and maintainability by avoiding the need to write the same code over and over again.
Atomicity is used to break down complex jobs into independent and more understandable parts. The workflow is divided between distinct units of work and small and individual executable processes. Each of these parts can be executed separately. It makes testing and troubleshooting easier since the developer doesn’t need to run a long-running process to debug a single operation.
3. Change Detection and Increment
A change detection strategy detects differences and allows incremental data loading. This means that only records changed since the last update is brought into the ETL process, avoiding unnecessary transformations.
Your ETL process should be implemented in a way that ensures its scalability to allow your project to adapt to the growing volume of data. This way, you don’t have to redefine a new project at every new stage of your growth, saving you time and money.
DATA QUALITY AND ERROR MANAGEMENT
Another important aspect of any ETL workflow implementation is data quality and error management. We have an article explaining in detail how to guarantee the quality of the data loaded in your databases, and how to deal with validation errors. So, we won’t go into too much detail in this article, but here’s a list of principles to consider during implementation:
5. Error Detection and Data Validation
During validation, the data being extracted is checked according to a predefined profile. This profile represents what a “good” data is for your project. The goal is to control data as early as possible to limit the computing time and avoid processing data that will be rejected later on. It also makes recovery easier as errors are detected early in the process.
6. Recovery and Restartability
Recovery and restartability address the capability of the workflow to resume after an error. It includes the process by which the data stays in a stable state following an error. That requires the use of database backup, as well as commit and rollback features. When we commit, we make permanent a set of changes in the code. Rollback, on the other hand, is the capability to return a program back to an earlier version. They are both used to manage workflow and errors, in combination with another practice known as idempotence.
An operation is defined as idempotent when it gives the same result after being called once or multiple times. In real life, the best example would be the elevator button: you have the same result whether you push it once or fifteen times. How can idempotence be relevant in data transformation, knowing how your data is always changing? Because sometimes, it doesn’t. If a data suddenly stops being updated, you still get the same results in your tables. The same would apply if your own transformation deployments were to stop. With idempotent transformations, you avoid a system failure when the ETL process itself fails.
8. Data Lineage
Data lineage helps identify which ETL steps a specific data point went through and to understand where it originated from, when it was loaded, and how it was transformed. It eases debugging and increase trust in the data by making the process transparent, thus validating the integrity of the end results. Thanks to lineage, we can guarantee the integrity of the data and the process that extracted and loaded it into your database.
Checking your logs for potential mistakes is not enough to ensure that your load was a success. Your system should be designed to check for errors and to support auditing of your primary metrics (like the number of rows processed).
10. Script Configuration
When executed, the configuration variable modifies a workflow behaviour. Variables are stored separately from the job so you can modify them without editing and deploying the scripts. Identifying the right variables is an integral part of any professional ETL job. For example, configuration variables contains parameters such as the server name and credentials, which should never be hardcoded in a job.
FOUR BEST PRACTICES FOR ETL WORKFLOW TO SCHEDULE, MONITOR, AND MAINTAIN
Every data project includes an administrative component. We already covered the scheduling, monitoring, and maintaining of web scraping. However, these three important tasks also need to be executed at the ETL process level. Here’s a list of principles that will help your developer manage the workflows more efficiently. Most ETL software comes with a server edition that provides those four features. If you are looking for a technology agnostic (or cheaper) solution, contact RefinePro for more details.
Your developer will ensure that all the moving parts of your workflow come together to deal with the different nature, frequency, and cadence of your source data. It includes executing the different ETL modules and their dependencies, in the right order, along with logging, scheduling, alert monitoring, and managing code and data storage. This orchestration demands a high level of know-how, but also access to the right resources. You should never hesitate to ask for the services of an expert like RefinePro to help you implement your project.
12. Metadata Management
Metadata is basically data about your data. They hold all kinds of information describing our ETL workflow. It can include information on where the data comes from, how many data points it contains and the data extraction strategy. Most importantly, a well-designed metadata system will maintain various versions of execution, including the status, the extraction and transformation methods used, the changes in source systems, etc. Thanks to the metadata, your developer can keep track of all these changes over several months or even years and your team will have all it needs to analyze the system more efficiently.
Every step of an ETL project must be logged using a central logging component. Relevant events are then recorded whether they happen before, during, or after extraction, transformation, or loading.
11. Code Management and Storage
You should always keep track of your code and store it somewhere safe. There are three main reasons for this. First, you should always keep track of your code versions, so you can come back and restore your scripts after a bug or an error. Second, code management and storage will enable collaboration between members of your team and other external collaborators. And third, because your code needs to be kept separate from the execution environment.
AND WITH THAT…
With those fourteen best practices, you have everything you need to make sure your ETL workflow fits your organization’s needs. Whether your plan is to grow, to discover new markets, to know more about your competitors, to develop a business plan, or to introduce a new product to the market, data should always be at the cornerstone of your analysis. Selecting the best ETL tool will only be the first step to ensure your end data is reliable and relevant to your goals. Implementing and maintaining these tools and the process that binds them is paramount to your project success.
This article only scratches the surface of ETL design principles and best practices. Your developer will need to know which ones need to be applied, when they should be implemented, and at what range. Your developer needs to balance the robustness of the data pipeline and its development cost. But these principles and guidelines implemented at the right moment with the right goal in mind will guarantee the quality of your data, but will also help you manage an already complex process with more ease and fewer headaches.
RefinePro’s team has been helping clients implement ETL projects for years, and doing so, has developed a deep understanding of its many internal mechanisms. And they rely on such best practices to guarantee that every ETL workflows they create answer all their client’s needs.