Just like physical wellness, data health has to be achieved holistically. Here are some tips for achieving this.
Data quality is essential to data health. When data quality becomes a company-wide priority, analytics teams will not have to face the specific challenge of combining disparate sources, and instead can focus on driving some of the most important decisions of the organization.
The dimensions of data quality cover a number of metrics that indicate the overall quality of files, databases, data lakes, and data warehouses.
Academic research describes up to 10 data quality dimensions—sometimes more—but, in practice, there are five that are critical to most users:
- Completeness: Is the data sufficiently complete for its intended use?
- Accuracy: Is the data correct, reliable, and/or certified by some governance body? Data provenance and lineage—where data originates and how it has been used—may also fall in this dimension, as certain sources are deemed more accurate or trustworthy than others.
- Timeliness: Is this the most recent data? Is it recent enough to be relevant for its intended use?
- Consistency: Does the data maintain a consistent format throughout the dataset? Does it stay the same between updates and versions? Is it sufficiently consistent with the other datasets to allow joins or enrichments?
- Accessibility: Is the data easily retrievable by the people who need it?
Each of these dimensions corresponds to a challenge for an analytics group: if the data does not provide a clear and accurate picture of reality, it will lead to poor decisions, missed opportunities, increased cost, or compliance risks.
In addition to these common dimensions, dimensions specific to a business domain are usually added as well, typically for compliance.
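As a rough illustration of how some of these dimensions might be turned into numbers, the sketch below scores completeness, timeliness, and consistency as simple ratios over a small in-memory dataset. The records, field names, the 30-day freshness window, and the two-letter uppercase country format are all assumptions made up for this example; accuracy and accessibility are left out because they usually depend on external reference data and access controls rather than on the dataset alone.

```python
from datetime import date

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "email": "a@example.com", "country": "US", "updated": date(2024, 5, 1)},
    {"id": 2, "email": None, "country": "us", "updated": date(2024, 4, 28)},
    {"id": 3, "email": "c@example.com", "country": "FR", "updated": date(2023, 1, 15)},
    {"id": 4, "email": "", "country": "DE", "updated": date(2024, 4, 30)},
]


def completeness(rows, field):
    """Share of rows where the field is present and non-empty."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def timeliness(rows, field, as_of, max_age_days):
    """Share of rows updated within max_age_days of the as_of date."""
    fresh = sum(1 for r in rows if (as_of - r[field]).days <= max_age_days)
    return fresh / len(rows)


def consistency(rows, field, is_valid):
    """Share of rows whose field matches the expected format."""
    ok = sum(1 for r in rows if is_valid(r[field]))
    return ok / len(rows)


def is_upper_two_letter_code(value):
    """Assumed format rule: a two-letter uppercase country code."""
    return (
        isinstance(value, str)
        and len(value) == 2
        and value.isalpha()
        and value.isupper()
    )


as_of = date(2024, 5, 2)  # fixed reference date so the result is reproducible
report = {
    "completeness(email)": completeness(records, "email"),
    "timeliness(updated)": timeliness(records, "updated", as_of, max_age_days=30),
    "consistency(country)": consistency(records, "country", is_upper_two_letter_code),
}

for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Each score is a value between 0 and 1, which makes it easy to set a threshold per dimension and per intended use. What counts as "sufficient" remains a business decision, not something the code can decide.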
So, what are the keys to building a good data health system just as we would build a good healthcare program?
- Identification of risk factors: Some risks are endogenous, such as the company’s own applications, processes, and employees, while others (partners, suppliers, customers) come from the outside. By recognizing the areas that present the most risk, we can more effectively prevent dangers before they arise.
- Prevention programs: Good data hygiene requires good data practices and discipline. Consider nutrition labels: the widespread adoption of standardized nutrition facts or nutrition scores functions as education on how a given meal will affect your overall health. Similarly, at Talend, we use our ‘Trust Score’ to internally assess and control the intake of data, producing information that is easier to understand and harder to ignore.
- Proactive inoculation: Vaccines teach the body to recognize and fight a pathogen before an infection begins. For our data infrastructure, machine learning serves a similar function, training our systems to recognize bad data and suspect sources before they can take hold and contaminate our programs, applications, or analytics.
- Regular monitoring: In the medical realm, the annual checkup used to be the primary method of monitoring a patient’s health over time. With the advent of medical wearables that can collect a number of indicators, from standard ones such as activity or heart rate to more specific ones such as blood sugar levels in a person with diabetes, the human body becomes observable. In the data world, we use terms like assessment or profiling, but it is basically the same thing, and continuous observability might soon become a reality here as well. The sooner an issue is detected, the higher the chances of an effective treatment. In medicine it can be a matter of life and death (the Apple Watch has already saved lives). The risks are different, of course, but data quality observability could save corporate lives, too.
- Protocols for continuous prognosis: Doctors can only prescribe the right therapy when they know what to treat. But medicine is not purely a hard science. The prognosis is a model that requires constant revision and improvement. It is fair to set this expectation in data health too: it is a continuously improving model, but you cannot afford not to have it.
- Efficient treatments: Any medical treatment involves a risk/benefit assessment. A treatment is recommended when the benefits outweigh the potential side effects, but that does not mean you only move ahead when there is zero risk. In data, there are tradeoffs as well. Data quality checks can introduce extra steps into the process. Crucial layers of security can also slow things down. There is a long tail of edge-case data quality problems that cannot be solved with pure automation and that require a human touch, despite the potential for human error. Good data health professionals master this balancing act just like doctors do.
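To make the idea of regular monitoring a little more concrete, here is a minimal sketch of a recurring check. It profiles each incoming batch for one indicator (the share of missing values in a field) and flags a batch whose value rises well above the average of earlier batches. The batches, the `email` field, and the 0.10 tolerance are hypothetical choices for illustration; this is not a description of how any particular product implements observability.

```python
from statistics import mean


def null_rate(rows, field):
    """Fraction of rows where the field is missing or empty."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(field) in (None, ""))
    return missing / len(rows)


def is_anomalous(history, current, tolerance=0.10):
    """Flag the current value if it exceeds the historical mean by more than tolerance."""
    if not history:
        return False  # nothing to compare against yet
    return current - mean(history) > tolerance


# Hypothetical batches arriving over time.
batches = [
    [{"email": "a@example.com"}, {"email": "b@example.com"},
     {"email": "c@example.com"}, {"email": None}],
    [{"email": "d@example.com"}, {"email": "e@example.com"},
     {"email": ""}, {"email": "f@example.com"}],
    [{"email": None}, {"email": None},
     {"email": "g@example.com"}, {"email": None}],
]

history = []  # null rates observed so far
alerts = []   # indexes of batches that were flagged

for index, batch in enumerate(batches):
    rate = null_rate(batch, "email")
    if is_anomalous(history, rate):
        alerts.append(index)
    history.append(rate)

print("null rates:", history)
print("flagged batches:", alerts)
```

A fixed tolerance over a running mean is deliberately simple. In practice the threshold, the indicators tracked, and whether a flagged batch is blocked or merely reported are part of the risk/benefit tradeoff described above.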
By establishing a culture of continuous improvement, backed by people equipped with the best tools and software available for data quality, we can protect ourselves from the biggest and most common risks. And if we embed quality functionality throughout the data lifecycle (before data enters the pipeline, while it flows through the system, and as it is used by analysts and applications), we can make data health a way of life.