A common challenge for many of our clients is organizing and understanding all the data available to them. While this issue is not new, it has been magnified by the sheer volume of data being generated, consumed and analyzed, along with the growing range of business processes and uses that data supports.
In response to this challenge, organizations are looking for ways to increase data literacy and understanding to empower their workforces. Data dictionaries have been a longstanding potential solution, but they have given way to a more holistic concept – the data catalog – which combines technical and functional metadata with other capabilities such as auto-discovery of new data. As these tools evolve, we are seeing enhanced AI use cases for auto-generating business definitions, along with capabilities to automatically classify data based on analysis of field names, data values and other factors.
A properly defined data catalog can head off business challenges such as terminology contention and misuse across teams. For example, a marketing team might define potential customers as ‘any individual who could potentially purchase our goods and services,’ while the sales team might view potential customers as ‘individuals who have purchased from our company in the past six months.’ Both definitions are valid within their own context, which is why an enterprise-wide data catalog is needed to put added context around data and ensure the data is the right fit for its intended use.
Business glossaries (aka business data dictionaries) have long been used to maintain the context associated with an organization’s key terms and fields, but many organizations have struggled to keep these artifacts current amid the constant flow of new data and the ever-increasing number of analytics use cases. Having the combined automation of a good data catalog to help maintain architecture, technical schema information (e.g., data definition language, or DDL) and the lineage of fields within data sources can lessen the load. Data catalogs are therefore often defined as the merging of business context and technical information to create a ‘one stop shop’ for important metadata maintained across an organization.
A data catalog tool typically possesses the following capabilities:
- Data discovery: The tool can connect to data sources using configured, authorized credentials and ingest technical metadata (the data definition language, or DDL, information) directly into the data catalog. This supports automated scanning and indexing of data assets across the organization (a simple schema-scanning sketch follows this list).
- Metadata management: Allows for capturing and storing attributes for business, technical, and common metadata. Specific attributes may include, but are not limited to, application, field definition, ownership structure, associated data domains, classification, data quality definitions/rules and prioritization/governance categories. This facilitates centralized management of metadata, improving consistency and accuracy.
- Data classification/labeling: Establishes conventions for organizing information based on data type, sensitivity level, access requirements and retention policies. Tools can support enhanced data governance by categorizing data through automated matching, which aids compliance efforts (see the field-name matching sketch after this list).
- Searchability: Enables users to retrieve information about specific terms or types of data relevant to their queries. This enhances the user experience by allowing quick, efficient searches across vast amounts of metadata, making it easier to find relevant datasets.
- Data ownership: Allows assignment of different owners (e.g., data owners, business owners) who support different aspects of metadata maintenance, such as the initial population of metadata, ongoing updates and recertification of recorded metadata for their owned data assets. This keeps metadata entries current and accurate.
- Data lineage: Defines the upstream/downstream flow of connected datasets. Tools vary here, but options generally include automating lineage capture based on discovery efforts, visualizing recorded lineage through flow charts, and automatically updating lineage diagrams and related materials as underlying scans pick up new fields or new sources. This provides visibility into the lifecycle of data as it moves through different stages, helping users understand dependencies and impacts within workflows or reports (see the lineage-graph sketch after this list).
- Data observability: Monitors health indicators such as freshness, volume anomalies and schema changes for various datasets, ensuring that any issues affecting dataset quality are promptly identified. It helps maintain trust in analytics outcomes through timely detection and resolution of inconsistencies and errors (a basic freshness and volume check is sketched after this list).
- Workflow for approvals: Allows tasks to be assigned and tracked, providing record keeping and task management for decision points across a wide array of activities, such as updating metadata, assigning stewardship, determining classification, defining quality rules and approving use cases for workstreams like a Privacy Impact Assessment (PIA), a data product evaluation or a periodic certification of a data element.
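To make the discovery capability concrete, here is a minimal Python sketch of a schema scan, using the standard library’s sqlite3 module as a stand-in for a governed data source. A real catalog scanner would query the source’s own system catalog (for example, information_schema) using authorized credentials; the sales.db file name is a hypothetical example.

```python
# Minimal sketch: ingest table and column metadata (DDL-level information)
# from a SQLite database standing in for a governed data source.
import sqlite3

def scan_sqlite_schema(db_path: str) -> list[dict]:
    """Collect table/column metadata as catalog entries."""
    entries = []
    with sqlite3.connect(db_path) as conn:
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
        for (table_name,) in tables:
            # PRAGMA table_info returns (cid, name, type, notnull, default, pk)
            for _, col_name, col_type, notnull, _default, pk in conn.execute(
                f"PRAGMA table_info({table_name})"
            ):
                entries.append({
                    "table": table_name,
                    "column": col_name,
                    "type": col_type,
                    "nullable": not notnull,
                    "primary_key": bool(pk),
                })
    return entries

# Example usage against a hypothetical local database file:
# for entry in scan_sqlite_schema("sales.db"):
#     print(entry)
```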
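Automated classification matching can be as simple as pattern rules applied to field names. The sketch below shows one minimal approach; the patterns and sensitivity labels are illustrative assumptions, and production tools typically combine name matching with profiling of actual data values.

```python
# Minimal sketch: rule-based classification/labeling via field-name patterns.
import re

CLASSIFICATION_RULES = [
    (re.compile(r"ssn|social_security", re.IGNORECASE), "Restricted"),
    (re.compile(r"email|phone|address|birth", re.IGNORECASE), "Confidential - PII"),
    (re.compile(r"salary|account_number|card_number", re.IGNORECASE), "Confidential - Financial"),
]

def classify_field(field_name: str) -> str:
    """Return the first matching sensitivity label, or a default."""
    for pattern, label in CLASSIFICATION_RULES:
        if pattern.search(field_name):
            return label
    return "Internal"  # default when no rule matches

print(classify_field("customer_email"))  # Confidential - PII
print(classify_field("order_total"))     # Internal
```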
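Lineage is naturally modeled as a directed graph pointing from upstream datasets to the datasets derived from them. The sketch below records edges and walks downstream impact; the dataset names are hypothetical.

```python
# Minimal sketch: lineage as a directed graph with a downstream-impact walk.
from collections import defaultdict

downstream = defaultdict(set)

def add_lineage(source: str, target: str) -> None:
    """Record that `target` is built from `source`."""
    downstream[source].add(target)

# Example lineage: raw orders feed a curated table, which feeds two reports.
add_lineage("raw.orders", "curated.orders")
add_lineage("curated.orders", "reports.daily_sales")
add_lineage("curated.orders", "reports.customer_ltv")

def impacted_by(dataset: str) -> set[str]:
    """Walk the graph to find every downstream dependency of a dataset."""
    impacted, stack = set(), [dataset]
    while stack:
        for child in downstream[stack.pop()]:
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted

print(impacted_by("raw.orders"))
# {'curated.orders', 'reports.daily_sales', 'reports.customer_ltv'}
```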
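Observability checks often reduce to comparing recorded health indicators against expectations. The sketch below illustrates simple freshness and volume checks; the 24-hour and 50% thresholds are assumptions for illustration, not recommendations.

```python
# Minimal sketch: freshness and volume checks against illustrative thresholds.
from datetime import datetime, timedelta, timezone

def check_dataset(last_loaded: datetime, row_count: int,
                  expected_rows: int, max_age_hours: int = 24) -> list[str]:
    """Return a list of health warnings for one dataset."""
    issues = []
    if datetime.now(timezone.utc) - last_loaded > timedelta(hours=max_age_hours):
        issues.append("stale: data has not refreshed within the expected window")
    if expected_rows and abs(row_count - expected_rows) / expected_rows > 0.5:
        issues.append("volume anomaly: row count deviates more than 50% from normal")
    return issues

# Example usage with hypothetical values:
print(check_dataset(
    last_loaded=datetime.now(timezone.utc) - timedelta(hours=30),
    row_count=400,
    expected_rows=1000,
))
```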
Establishing specific calculation notes and acceptable thresholds significantly enhances the adoption and application of data quality metrics. High-quality data is one of the most frequent requests from business stakeholders, yet without adequate cataloging efforts in place it often remains an elusive goal.
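As an illustration of a documented calculation paired with an acceptable threshold, the sketch below computes a simple completeness metric for a field and compares it to an agreed threshold; the field, sample values and 98% threshold are hypothetical.

```python
# Minimal sketch: a cataloged quality rule with a documented calculation
# (completeness = share of populated values) and an acceptable threshold.
def completeness(values: list) -> float:
    """Share of records with a populated value."""
    if not values:
        return 0.0
    populated = sum(1 for v in values if v not in (None, ""))
    return populated / len(values)

THRESHOLD = 0.98  # acceptable completeness agreed with the data owner

emails = ["a@example.com", "", "b@example.com", None, "c@example.com"]
score = completeness(emails)
print(f"completeness={score:.0%}, passes={score >= THRESHOLD}")
```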
Capturing key metadata helps companies build a library of agreed-upon standards. This allows organizations to quickly spot and address both metadata and underlying data issues, leading to better operational decisions.
To learn more about our Data and Analytics services, contact us.