‘Garbage in, garbage out’ has been used so often to summarize data quality problems that it has become a cliché. The cost behind it is real, though. Various industry studies have uncovered the high cost of bad data, and it’s estimated that poor data quality costs organizations an average of $12 million yearly. Data teams waste 40% of their time troubleshooting data downtime, even at mature data organizations that use advanced data stacks.
Data quality has always been a critical component of enterprise data governance, and it remains an Achilles’ heel for CIOs, CCOs and CROs. In fact, it has become even more challenging to tackle as data volumes and types (structured, unstructured and semi-structured) have proliferated.
Data quality is not just a technology problem, and it never will be, because we rarely think about the quality of the data we source when implementing new business initiatives and technology. Technology is only an enabler. To get the most from it, we need to examine the underlying business processes and look for opportunities to re-engineer or revamp them whenever we start a new technology project. Understanding these business processes means asking:
- What data do we need?
- Do we understand the sources of this data?
- Do we have control over these sources?
- Do we need to apply any transformations (i.e., changes to this data)?
- Most importantly, do our end users trust the data for their usage and reporting?
These questions sound basic and obvious. However, most organizations have trust issues with their data. End users rarely know the source of truth, so they build their own data fiefdoms, creating their own reports and maintaining their own dashboards. Eventually, this produces multiple ‘sources of truth,’ each a different version of the others. The result is sleepless nights, especially when we need to submit a regulatory report or an SEC filing or make an executive decision. This not only wastes valuable engineering time; it also costs precious revenue and diverts attention from initiatives that move the needle for the business. It is also a misuse of data scientists’ core skills, adding cost and time that could be better spent on the organization’s business priorities.
Over time, data quality issues have become more extensive, more complex and costlier to manage. A survey conducted by Monte Carlo suggests that nearly half of all organizations most often measure data quality by the number of customer complaints their company receives, highlighting the ad hoc nature of this vital element of modern data strategy. Most organizations decide to address the issue in a piecemeal fashion. That is a practical approach, but it requires tremendous effort to understand the data, document the lineage, identify data owners, identify key data elements (KDEs), maintain these KDEs and apply the data governance lifecycle to the data. It is also only a tactical solution: sooner or later, we need to start another tactical project to resolve the issues caused by the previous one, and so on. The result is an endless cycle of massive IT spending, frustration over low return on investment from technology projects, and purchases of new technology products that promise a total overhaul.
What is data quality management?
Data quality management (DQM) is the set of procedures, policies, and processes an enterprise uses to maintain reliable data in a data warehouse that serves as a system of record, golden record, master record or single version of the truth. The data must first be cleansed using a structured workflow that involves profiling, matching, merging, correcting and augmenting source data records. DQM workflows must also ensure that the data’s format, content, handling, and management comply with all relevant standards and regulations.
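To make the cleansing workflow concrete, here is a minimal sketch of the profiling, correcting, matching and merging steps on a handful of customer records. The field names, the sample data and the choice of a normalized email address as the match key are all assumptions made for illustration; real DQM tools use far richer matching logic.

```python
from collections import defaultdict

def normalize(record):
    """Correct obvious formatting noise so equivalent records can be matched."""
    return {
        "email": (record.get("email") or "").strip().lower(),
        "name": " ".join((record.get("name") or "").split()).title(),
        "phone": record.get("phone"),
    }

def profile(records):
    """Profile the data by counting missing values per field."""
    missing = defaultdict(int)
    for record in records:
        for field, value in record.items():
            if value in (None, ""):
                missing[field] += 1
    return dict(missing)

def match_and_merge(records):
    """Match on email, then merge by filling empty fields from later duplicates."""
    golden = {}
    for record in records:
        key = record["email"]
        if not key:
            continue  # assumption: unmatched records go to manual review
        merged = golden.setdefault(key, dict(record))
        for field, value in record.items():
            if merged.get(field) in (None, "") and value not in (None, ""):
                merged[field] = value
    return list(golden.values())

# Hypothetical source records containing one duplicate customer.
source = [
    {"email": " Ana@Example.com ", "name": "ana  silva", "phone": None},
    {"email": "ana@example.com", "name": "Ana Silva", "phone": "555-0100"},
    {"email": "bo@example.com", "name": "", "phone": None},
]
cleaned = [normalize(r) for r in source]
missing_report = profile(cleaned)
golden_records = match_and_merge(cleaned)
```

In this sketch, the two records for the same customer collapse into one golden record that keeps the phone number found on the duplicate, while the profile report shows which fields still have gaps.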
So how do we tackle data quality with a proactive approach? There are a few options, from the traditional approach to the real-time solution.
- Traditional approach: Data quality at the source
  - This is the traditional and, in most cases, the best approach to handling data quality.
  - It starts with identifying all the data sources (external and internal).
  - Next comes documenting the data quality requirements and rules.
  - These rules are then applied at the source level (in the case of external sources, we apply them where the data enters our environment).
  - Once quality is handled at the source level, we publish this data for the end users through applications such as a data lake or a data warehouse. This data lake or warehouse becomes the ‘system of insight’ for everyone in the organization.
  - Pros of this approach:
    - Most reliable approach
    - One-time and strategic solution
    - Helps you optimize your business processes
  - Cons of this approach:
    - We need a cultural shift toward looking at data quality at the source level and ensuring this is applied every time there is a new data source.
    - This is possible only with executive sponsorship, i.e., a top-down decision-making approach that makes data quality an integral part of every employee’s daily activities.
    - Data owners must be ready to invest time and funding to implement data quality at the sources they are responsible for.
- Implementation of a data quality management tool
  - Modern DQM tools automate profiling, monitoring, parsing, standardizing, matching, merging, correcting, cleansing, and enhancing data for delivery into enterprise data warehouses and other downstream repositories. They let teams create and revise data quality rules, and they support workflow-based monitoring and corrective actions, both automated and manual, in response to quality issues.
  - This approach involves working with the business stakeholders to develop an overall data quality strategy and framework, then selecting and implementing the best tool for that framework.
  - The implemented tool should be able to discover all data, profile it and find patterns. The tool then needs to be trained with data quality rules.
  - Once the tool is trained to a satisfactory level, it starts applying the rules, which helps improve the overall data quality.
  - Training the tool is perpetual: it keeps learning as you discover and input new rules.
  - Pros of this approach:
    - Easy to implement, with quick results
    - There is no need to work separately on in-depth lineage documentation (the tool automates the data lineage) and governance methodology; we only need to define the DQ workflows so that the tool can automate them.
  - Cons of this approach:
    - Training the tool requires a good understanding of the data and the data quality requirements.
    - There is a tendency to expect that everything will be automated. This is not the case.
    - This is not a strategic solution; it does not help with business process improvement.
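Both approaches depend on documented data quality rules that are applied where data enters the environment. The following is a minimal sketch of what that can look like, assuming a hypothetical "orders" source; the `Rule` class, the rule names and the allowed currency list are invented for illustration and do not come from any particular DQM product.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when the record passes

# Hypothetical documented rules for an illustrative "orders" source.
RULES = [
    Rule("order_id_present", lambda r: bool(r.get("order_id"))),
    Rule(
        "amount_non_negative",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
    Rule("currency_allowed", lambda r: r.get("currency") in {"USD", "EUR", "GBP"}),
]

def violations(record, rules=RULES):
    """Return the names of every rule the record fails."""
    return [rule.name for rule in rules if not rule.check(record)]

# Apply the rules at the point of entry, before anything is published downstream.
incoming = [
    {"order_id": "A-1", "amount": 25.0, "currency": "USD"},
    {"order_id": "", "amount": -3, "currency": "usd"},
]
report = {index: violations(record) for index, record in enumerate(incoming)}
```

Keeping the rules as named, declarative entries means the list can double as documentation, and new rules can be added as they are discovered without changing the checking logic.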
Based on the above considerations, we believe the best approach is a combination of the traditional approach and the DQM tool approach:
- First, set up a business-driven data quality framework and an organization responsible for supporting it.
- Second, define an enterprise DQ philosophy: “whoever creates the data owns the data.” Surround this with guiding principles and appropriate incentives. Organize around domain-driven design and treat data as a product.
- Third, develop an architectural blueprint that treats good data and bad data separately, and deploy a robust real-time exception framework that notifies the data owner of data quality issues. This framework should include a real-time dashboard that highlights successes and failures with clear, well-defined metrics. Bad data should never flow into the good data pipeline.
- Fourth, mandate adoption of this holistic DQ ecosystem for each existing domain, source and application within a reasonable timeframe, and for every new application going forward.
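The third step, keeping bad data out of the good data pipeline and notifying its owner, can be sketched as follows. This is a simplified batch illustration rather than a real-time framework, and the record fields, the validity check, the owner naming scheme and the `notify` placeholder are all assumptions.

```python
from collections import defaultdict

def route(records, is_valid, owner_of):
    """Split records into a good stream and a quarantine grouped by data owner."""
    good, quarantine = [], defaultdict(list)
    for record in records:
        if is_valid(record):
            good.append(record)
        else:
            quarantine[owner_of(record)].append(record)
    return good, dict(quarantine)

def notify(owner, bad_records):
    # Placeholder: a real system might send an email, chat message or ticket.
    return f"{owner}: {len(bad_records)} record(s) failed data quality checks"

# Hypothetical records arriving from two source systems.
records = [
    {"source": "crm", "customer_id": 1, "email": "a@example.com"},
    {"source": "crm", "customer_id": None, "email": "b@example.com"},
    {"source": "billing", "customer_id": 3, "email": ""},
]
good, quarantine = route(
    records,
    is_valid=lambda r: r["customer_id"] is not None and bool(r["email"]),
    owner_of=lambda r: f"{r['source']}-team",
)
alerts = [notify(owner, bad) for owner, bad in quarantine.items()]
pass_rate = len(good) / len(records)  # one metric a dashboard could display
```

Only the `good` list would continue into downstream pipelines; quarantined records stay with their owners until fixed, which is one way to reinforce the “whoever creates the data owns the data” principle.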
Data quality remains one of the foremost challenges for most organizations, and there is no guaranteed approach to solving it. One needs to weigh various factors, such as the organization’s technology landscape, legacy architecture, existing data governance operating model, business processes, and, most importantly, the organizational culture. The problem cannot be solved with new technology alone or by adding more people. The solution needs to combine business process re-engineering, a data-driven decision-making culture, and the ability to use DQ tools effectively. It is not a one-time effort but a lifestyle change for the organization.