(Published for RefinePro on April 5, 2020)
Maintaining the quality of your data is paramount to any web scraping or data integration project. Think about it: there’s absolutely no point in collecting a massive amount of data if you can’t rely on it to make sound decisions! And the only way to maintain high quality is by implementing quality checks and validation at every step of your data pipeline. As the saying goes: garbage in, garbage out!
We’ve discussed already what a good ETL (Extract Transform Load) tool is and what it should do, and we’ll now learn how data quality insurance fits into that process. ETL is the process of extracting, transforming and loading the data. It defines what, when, and how data gets from your source sites to your readable database. Data quality, on the other hand, relies on the implementation of a system from the early stage of extraction all the way to the final loading of your data into readable databases.
Choosing the right scraper and the right ETL tools will help you streamline this process, but these tools don’t automatically guarantee the quality of your end results. That’s why you need to work with a partner, like us, who will put in place all the proper checkpoints.
DATA QUALITY IN YOUR ETL PROCESS
When you sat down to define your web scraping project, you made a list of sources you would be collecting the data from. Already, the choices you made will have an impact on the quality of the data. It’s important to always rely on trustworthy source sites that are relevant to your goals.
Don’t forget, scheduling, maintaining, and monitoring are essential aspects to ensure your data is current. For example, if you always extract your competitor prices the day they put everything on sale, your data won’t reflect the real sale prices.
At the extracting phase, you know what your data is, and you should already implement scripts that will check its quality. This way, you allow the system to troubleshoot closer to the source itself, and you can take action immediately before your data are transformed.
Transformation is when most of the quality checks are done. No matter what tool is used, it should at least perform the following tasks:
- Data profiling: the data is analyzed in terms of quality, but also format, volume, etc.
- Data matching and cleansing: related entries are merged, and duplicates are eliminated.
- Data enrichment: the value of your data is extended by adding other relevant data to it.
- Data normalization and validation: the integrity of the data is checked, and validation errors are managed.
That last point is particularly important. Data validation is the process used to ensure the high quality of the data after it’s been cleaned and enriched. During the validation process, data is being checked against predefined rules, constraints, or routines regarding its meaningfulness and correctness. And to implement data security solution, dynamic profiling is, without a doubt, the best solution. Integrated into the validation process, it automatically profiles your dataset, including value type, mandatory field, number of records, and use them to classify data according to a baseline of what is “acceptable”: the profile. Once deployed, it automatically warns you when a data does not match the defined profile, without the need for manual handling.
When we sit down to create a project, we will help you define and document what a good profile is for your data. While developing the ETL workflow, we make sure the system continuously validates the results. And we implement a test-driven development process to ensure the data quality is not over-engineered.
At this point, you know your data. It’s been transformed to fit your needs and, if your quality check system is efficient, the data that reaches you is reliable. This way, you avoid overloading your database or data warehouse with unreliable or bad quality data, and you ensure that the end results have been validated. Your team can then use it to define strategies and make plans.
THREE DATA QUALITY STRATEGIES
With validation and dynamic profiling, you ensure that the database you send to your team contains only data that’s usable and meets your most specific requirements. But in order to do that, you also need to deal with error validation. These errors occur when a data does not respect the validation rules established for the project. In other words, the data you extracted can’t get in the transformation step, or it doesn’t meet your requirements. This error can originate from either a bug in your workflow (you had the right data, but implementation did not work out) or changes in the data coming from the source (the extracted data structure did not match what was expected).
Depending on the on how critical is the data, we apply one of the following strategy to deal with data errors.
When a validation error occurs, processing continues with the incorrect records. The system logs the error to be reviewed later. This strategy works well for non-critical systems with a limited budget for data quality supervision.
When a validation error occurs, the processing continues and the erroneous record is rejected. The rejected records need to be reviewed and corrected before they can be reintroduced into the workflow. It is the most common approach. It protects downstream systems while ensuring good data keep flowing.
When an error occurs, the circuit opens, preventing low-quality data from propagating to downstream processes. Latest changes are rolled back, so the target system is closer to the version that existed before the execution of the ETL flow. In this case, the resulting data will not include all records, but records are guaranteed to be correct if present. This is the most conservative approach. However, you have to keep in mind that the system is on pause until someone addresses the error and restarts the system.
WHY DATA QUALITY SHOULD BE DECOUPLED
We saw how your quality check system is integrated into the ETL process of a data pipeline. But in real life, your dataquality insurance system should cover all the data pipeline. When we work with clients, we create scripts that support multiple data collection and transformation scripts (and so, numerous ETL processes).
How would that work? Let’s say you’re trying to gather data from your suppliers’ catalog to populate your eCommerce site. You could create ten different collections and transformation scripts for ten different suppliers, but only write a single data quality script to ensure quality.
The other good reason to implement ETL and quality checks as distinct processes is to ensure easy restartability. Collection and transformation can take hours to process. By having reliable data quality check different from the web scraping or ETL, you don’t have to restart the entire pipeline when you make a small correction, or you fix a bug.
If you rely on your processing steps (scraping or ETL) to check the quality of your data, you run the risk of having a system that is too cumbersome to manage, and so you run the risk of letting bad data slip through the cracks.
AND WITH THAT
Having a robust ETL tool supported by a great scraper is crucial to any data aggregation project. But to ensure that the end results meet your needs, you also need to make sure you have a quality check system in place.
When you sit down with an expert like RefinePro to define what your needs are, they make sure that the system generates data that is reliable, usable, but also relevant to your goals. They make sure to create scripts that allow efficient error management in a timely manner. Because errors will occur: websites get updated all the time without notice, and unless your system helps you deal with these changes, you run the risk of having missing or erroneous data embedded in your datasets. Or you run the risk of your system crashing without knowing how to fix it. Your expert will help you develop data pipelines that integrate all the required validation processes to make sure the end results are what you’re expecting.