
Classifications of Data Quality Costs and Benefits
April 2010: Originally published in IDQ Newsletter Vol 6 Issue 2
Carlo Batini and Monica Scannapieco

Editor’s Note: This article is an excerpt from the book Data Quality: Concepts, Methodologies and Techniques by Carlo Batini and Monica Scannapieco. This excerpt has been slightly modified from its original form to fit the format of a stand-alone article. Readers will find a link to the book at our IAIDQ Bibliography page at bibliography.iaidq.org.
 
 

Introduction

Cost-benefit analysis is an arduous task in many cost domains, and it is even more arduous in the area of data quality (DQ) because the discipline is less consolidated. This article discusses how to quantify: (a) the costs of current poor data quality, (b) the costs of initiatives to improve data quality, and (c) the benefits that are gained from such initiatives.

Cost Classifications

Three very detailed cost classifications are proposed by English [1], Loshin [3], and Eppler and Helfert [2]. We first present and describe these three classifications, then propose a common classification framework to compare them.

The English Classification

The English classification is shown in Figure 1. English divides data quality costs into three broad categories as follows (with English’s recommendations listed in parentheses):

  • Costs caused by low data quality (eliminate these),
  • Assessment and inspection costs, incurred in order to verify if processes are performing properly (minimize these), and
  • Process improvement and defect prevention costs, resulting from activities to improve the quality of data, with the goal of eliminating or reducing the costs of poor data quality (invest in these and optimize them).

Costs due to low data quality are analyzed in depth in the English approach and are subdivided into three categories.

  • Process failure costs. These costs occur when poor quality information causes a process not to perform properly. As an example, inaccurate mailing addresses cause correspondence to be misdelivered.
  • Information scrap and rework. Information of poor quality requires several types of defect management activities, such as reworking, cleaning, or rejecting. Examples of this category are:
    • Redundant data handling: if the poor quality of a source makes it useless, time and money have to be spent to collect and maintain data in another database;
    • Business rework costs: these costs arise from re-performing failed processes, such as resending correspondence;
    • Data verification costs: when data users do not trust the data, they have to perform their own quality inspection to remove low-quality data.
  • Loss and missed opportunity costs. These correspond to the revenues and profits not realized because of poor information quality. For example, due to low accuracy of customer e-mail addresses, a percentage of customers already acquired cannot be reached in periodic advertising campaigns. This results in lower revenues, roughly proportional to the decrease of accuracy in addresses.
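
The proportionality in the e-mail example can be made concrete with a rough estimate. The following is only a minimal sketch, assuming that campaign revenue scales linearly with the share of customers whose address is accurate; the function name and all figures are hypothetical and are not part of the English classification.

```python
def estimated_lost_revenue(revenue_if_all_reached: float,
                           address_accuracy: float) -> float:
    """Estimate campaign revenue lost to inaccurate e-mail addresses.

    Assumption (hypothetical): revenue is linear in the fraction of
    acquired customers whose stored address is accurate.
    """
    if not 0.0 <= address_accuracy <= 1.0:
        raise ValueError("address_accuracy must be between 0 and 1")
    unreachable_share = 1.0 - address_accuracy
    return revenue_if_all_reached * unreachable_share


# Illustrative figures only: a campaign worth 100,000 if every customer
# is reached, with 92% of the stored addresses accurate.
loss = estimated_lost_revenue(100_000.0, 0.92)
print(f"Estimated lost revenue: {loss:.2f}")
```

A linear model is the simplest reading of "roughly proportional"; a real analysis would need campaign-specific response rates.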


Figure 1: The English Classification


The Loshin Classification

The Loshin classification is shown in Figure 2. Loshin groups the costs of low data quality according to their impacts:

  • The operational impacts domain covers the aspects of a system for processing information and the costs of maintaining the operation of that system. Operational issues are typically short-term issues;
  • The tactical impacts domain attempts to address system problems before they arise. Tactical issues are typically medium-term issues;
  • The strategic impacts domain addresses the decisions that affect the long term.

Several cost sub-categories are listed for both the operational impacts and tactical/strategic impacts domains. Examples of operational impact costs include:

  • Detection costs are incurred when a data quality problem provokes a system error or processing failure;
  • Correction costs are associated with the actual correction of a problem;
  • Rollback costs are incurred when work that has been performed needs to be undone;
  • Rework costs are incurred when a processing stage must be repeated;
  • Prevention costs arise when a new activity is implemented to take the necessary actions to prevent operational failure due to a detected data quality problem.
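
When these sub-categories are used to account for an actual incident, a simple tally per category is often enough. The sketch below is an illustration only: the five category names come from the list above, while the helper `total_by_category`, the log format, and every amount are hypothetical.

```python
from collections import defaultdict
from enum import Enum


class OperationalCost(Enum):
    """Operational impact cost sub-categories listed above."""
    DETECTION = "detection"
    CORRECTION = "correction"
    ROLLBACK = "rollback"
    REWORK = "rework"
    PREVENTION = "prevention"


def total_by_category(entries):
    """Sum (category, amount) pairs into a per-category total."""
    totals = defaultdict(float)
    for category, amount in entries:
        if amount < 0:
            raise ValueError("cost amounts must be non-negative")
        totals[category] += amount
    return dict(totals)


# Hypothetical cost log for one data quality incident (amounts are made up).
incident_log = [
    (OperationalCost.DETECTION, 400.0),
    (OperationalCost.CORRECTION, 1_500.0),
    (OperationalCost.ROLLBACK, 250.0),
    (OperationalCost.REWORK, 900.0),
    (OperationalCost.CORRECTION, 300.0),
]

totals = total_by_category(incident_log)
for category in OperationalCost:
    # Categories with no logged cost are reported as zero.
    print(f"{category.value:<10} {totals.get(category, 0.0):>10.2f}")
```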

Examples of tactical/strategic impact costs include:

  • Delay, due to inaccessible data resulting in a delayed decision process that, in turn, may cause productivity delays,
  • Lost opportunities, i.e., the negative impact on potential opportunities in strategic initiatives, and
  • Organizational mistrust. Unsatisfied by inconsistencies in data, managers decide to implement and maintain their own decision support system, frequently using the same data from the same data sources. This in turn leads to redundant work and further data inconsistencies across the organization.

Figure 2: The Loshin Classification of Data Quality Costs

The Eppler-Helfert Classification

The Eppler-Helfert classification is shown in Figure 3. Eppler and Helfert derive their classification with a bottom-up approach. First, they produce a list of specific costs that have been mentioned in the literature, such as higher maintenance costs and data re-input costs. Then, they generate a list of direct costs associated with improving or assuring data quality, such as training costs of improving data quality know-how. At this point, they put together two sub-classifications corresponding to the two major classes of costs, namely costs due to poor data quality and quality improvement and assurance costs.

Costs due to poor data quality are further categorized in terms of their measurability or impact, resulting in direct vs. indirect cost classes. Direct costs are those monetary effects that arise immediately from low data quality, while indirect costs arise from the intermediate effects. Improvement costs are further categorized into three subgroups: prevention costs, detection costs, and repair costs.
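
The two-level structure described in the text can be written down as a small tree and flattened into a checklist. This is a minimal sketch covering only the categories named in the two paragraphs above, not the full content of Figure 3; the names `EPPLER_HELFERT` and `checklist` are hypothetical.

```python
# Only the categories named in the text; Figure 3 contains more detail.
EPPLER_HELFERT = {
    "Costs due to poor data quality": {
        "Direct costs": {},
        "Indirect costs": {},
    },
    "Quality improvement and assurance costs": {
        "Prevention costs": {},
        "Detection costs": {},
        "Repair costs": {},
    },
}


def checklist(tree, path=()):
    """Yield every leaf category as a tuple path starting at the root."""
    for name, children in tree.items():
        current = path + (name,)
        if children:
            yield from checklist(children, current)
        else:
            yield current


for item in checklist(EPPLER_HELFERT):
    print(" > ".join(item))
```

Keeping the taxonomy as data rather than prose makes it easy to extend a branch with the specific cost items relevant to a given organization.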

Figure 3: The Eppler-Helfert Classification of Data Quality Costs

Comparative Classification for Costs

For the purpose of producing a new classification that allows for the integration of the three classifications discussed above, we use a second classification proposed by Eppler and Helfert. Such a classification produces a conceptual framework that can be used in the cost-benefit analysis of data quality programs. It is based on the data production life cycle approach, which distinguishes between data entry, data processing, and data usage costs.

The iterative attribution of all the cost categories of the three previous classifications to this new high-level classification leads to the comparative classification depicted in Figure 4. The different background patterns used for the English, Loshin, and Eppler-Helfert classification items are shown in the legend.

When comparing the three classifications, we notice that they have very few items in common, all placed at an abstract level, namely corrective costs, preventive costs, and process improvement costs. The two most similar classifications are those of English and Loshin.

Figure 4: A Comparative Classification for Costs

Benefits Classification

Benefits can typically be classified into three categories:

  • Monetizable, when they correspond to values that can be directly expressed in terms of money. For example, improved data quality results in increased monetary revenues.
  • Quantifiable, when they cannot be expressed in terms of money, but one or more indicators exist that measure them, expressed in a different numeric domain. For example, improved data quality in Government-to-Business relationships may result in less time wasted by businesses, which can be expressed in terms of a time indicator. Observe that in several contexts a quantifiable benefit can be expressed in terms of a monetizable benefit if a reasonable and realistic conversion function is found between the quantifiable domain and money. In our example, if the time wasted by businesses is productive time, the “wasted time” quantifiable benefit can be translated in terms of the monetizable benefit “unproductively spent money”.
  • Intangible, when they cannot be directly expressed by a numeric indicator. A typical intangible benefit is avoiding the loss of image that an agency or a company suffers when inaccurate data is communicated to customers, e.g., requests to citizens for undue tax payments from the revenue agency.
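
The conversion function mentioned for quantifiable benefits can be illustrated with a short sketch. It assumes the simplest possible conversion, a constant monetary rate per unit; the class and function names, the hourly rate, the currency, and the number of hours are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class BenefitKind(Enum):
    MONETIZABLE = "monetizable"
    QUANTIFIABLE = "quantifiable"
    INTANGIBLE = "intangible"


@dataclass(frozen=True)
class Benefit:
    name: str
    kind: BenefitKind
    value: Optional[float] = None  # None for intangible benefits
    unit: Optional[str] = None     # e.g. "hours" or a currency code


def to_monetizable(benefit: Benefit, rate_per_unit: float,
                   currency: str = "EUR") -> Benefit:
    """Convert a quantifiable benefit to money using a constant rate.

    The constant rate is an assumption; the text only requires that the
    conversion function be reasonable and realistic.
    """
    if benefit.kind is not BenefitKind.QUANTIFIABLE or benefit.value is None:
        raise ValueError("only quantifiable benefits with a value can be converted")
    return Benefit(
        name=f"{benefit.name} (monetized)",
        kind=BenefitKind.MONETIZABLE,
        value=benefit.value * rate_per_unit,
        unit=currency,
    )


# Hypothetical figures: 1,200 productive hours no longer wasted by
# businesses, valued at 35 currency units per hour.
wasted_time = Benefit("Reduced time wasted by businesses",
                      BenefitKind.QUANTIFIABLE, 1_200.0, "hours")
money = to_monetizable(wasted_time, rate_per_unit=35.0)
print(f"{money.name}: {money.value:.2f} {money.unit}")
```

Intangible benefits are deliberately rejected by the conversion, since by definition no numeric indicator is available for them.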

Figure 5 shows the English and Loshin items represented together, corresponding to benefits in the three categories. With regard to monetizable benefits, the two classifications agree in the indication of economic issues related to revenue increase and cost decrease.

As to quantifiable and intangible benefits, the English classification is richer. Among the intangible benefits listed, the reference to service quality is particularly relevant.


Figure 5: A Comparative Classification of Benefits

Summary

As already mentioned, cost-benefit analysis is an arduous task in many cost domains. In the data quality domain, existing proposals range from classifications provided for costs and benefits to methodologies for performing the cost-benefit analysis process.

Classifications are either generic or specific, e.g., for the financial domain. The advantages of generic classifications range from establishing clearer terminology to providing consistent measurement metrics.

Like most generic classifications, the five described in this article can be used as checklists during the cost-benefit analysis activity.


References

[1] English, L. P.: Improving Data Warehouse and Business Information Quality, Wiley and Sons, 1999.

[2] Eppler, M. J., and Helfert, M.: A Classification and Analysis of Data Quality Costs. Proceedings of the 9th International Conference on Information Quality (IQ 2004), MIT, Boston.

[3] Loshin, D.: Enterprise Knowledge Management - The Data Quality Approach, Morgan Kaufmann Series, 2001.


Excerpt from the book Data Quality: Concepts, Methodologies and Techniques

Copyright © 2006 Springer-Verlag Berlin Heidelberg


About the Authors


Carlo Batini has been a full professor of Computer Engineering at the University of Milano Bicocca since 1986, where he previously became associate professor in 1983. His research interests include cooperative information systems, information systems and database modeling and design, usability of information systems, and data and information quality. From 1995 to 2003 he was a member of the board of directors of the Authority for Information Technology in public administration, where he headed several large-scale projects for the modernization of public administration.


 

Monica Scannapieco is a research associate at the Computer Engineering Department of the University of Roma La Sapienza. Her research interests are data quality issues, including data quality dimensions, measurement and improvement techniques, dynamics of data quality, and record matching.