What is Data Quality?

There are several opinions on, and definitions of, what data quality is supposed to be. Most of the time, we adopt the “fitness for use by data consumers” definition given by Richard Wang and Diane Strong [1], who investigated how data consumers subjectively perceive data quality. Although this work has been a milestone in data quality research, in my opinion data quality is not always subjective. Imagine valid combinations of cities and countries, or accurate population values. Who defines these quality measures? Surely not an individual. These examples are rules that are derived from public knowledge or dictated by natural circumstances. So besides the individual requirements of data consumers, at least the following may also be sources of data quality rules:

  • Real-world phenomena (e.g. city/country combinations)
  • Organizational policies (e.g. all TVs in my data must have a screen size)
  • Legal regulations (e.g. all groceries must have an expiration date)
  • IT needs (e.g. URIs must be dereferenceable)
  • Standards (e.g. the syntax of xsd:dateTime or ZIP codes)
  • Task requirements (e.g. population data for all populated places must be complete to calculate the world population)

Reflecting these sources, I would define data quality as the degree to which data meets the combined requirements for the task at hand. Many of these requirements can be derived from the sources listed above. When using them as data quality rules, we should be aware that they may contradict each other and change over time. Hence, we must manage our data quality rules just like our data.
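To make the "degree to which data meets the requirements" idea more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the `Rule` class, the `quality_degree` function, the reference set `VALID_CITY_COUNTRY` and the sample records (including the population figures) are all hypothetical, and the score is just one simple way to quantify the degree, namely the share of passed rule checks.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass(frozen=True)
class Rule:
    """A data quality rule together with the source it was derived from."""
    name: str
    source: str
    check: Callable[[Record], bool]  # returns True if the record passes


# Hypothetical reference data standing in for real-world knowledge.
VALID_CITY_COUNTRY = {("Munich", "Germany"), ("Paris", "France")}

RULES: List[Rule] = [
    Rule("valid_city_country", "real-world phenomena",
         lambda r: (r.get("city"), r.get("country")) in VALID_CITY_COUNTRY),
    Rule("population_present", "task requirements",
         lambda r: r.get("population") is not None),
    # A missing population is handled by the rule above, so it passes here.
    Rule("population_non_negative", "real-world phenomena",
         lambda r: r.get("population") is None or r["population"] >= 0),
]


def quality_degree(records: List[Record], rules: List[Rule]) -> float:
    """Share of (record, rule) checks that pass, between 0.0 and 1.0."""
    total = len(records) * len(rules)
    if total == 0:
        return 1.0  # nothing to check, so nothing is violated
    passed = sum(1 for r in records for rule in rules if rule.check(r))
    return passed / total


# Illustrative records; the values are made up for this example.
records = [
    {"city": "Munich", "country": "Germany", "population": 1500000},
    {"city": "Paris", "country": "Germany", "population": 2100000},
    {"city": "Paris", "country": "France", "population": None},
]

score = quality_degree(records, RULES)
print(f"{score:.2f}")
```

Because each rule carries its source, the rule list itself can be stored, versioned and reviewed like any other data, which is one way to handle rules that change over time or contradict each other.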

As part of my PhD thesis, I am currently investigating the quality of Semantic Web data sets at the instance level. I have also published a data quality constraints library at http://semwebquality.org/ontologies/dq-constraints# which may be used in conjunction with SPIN ( http://spinrdf.org/ ) to define data quality requirements as constraint rules based on knowledge derived from the sources listed above. The constraints do not directly restrict the openness of the web, since it is up to the data owner or provider whether instances with potential data quality problems are cleansed. Rather, the constraints help to identify incorrect or suspicious data and to raise transparency about the quality state of Semantic Web data sets in the first place. If you are interested in this kind of quality assessment, please see my publications and/or contact me.
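The "identify, but do not cleanse" idea can be sketched in a few lines of Python. Note that this is not SPIN and does not use the constraints library mentioned above; it is a plain Python analogy in which the constraint names, the `find_violations` function and the sample records are all hypothetical. The point it tries to show is that the check only produces a report and leaves the data untouched.

```python
import copy
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
# A constraint is a name plus a predicate that returns True on a violation.
Constraint = Tuple[str, Callable[[Record], bool]]

# Hypothetical constraints modelled on the examples in the list of sources.
CONSTRAINTS: List[Constraint] = [
    ("missing_expiration_date",
     lambda r: r.get("type") == "grocery" and not r.get("expiration_date")),
    ("missing_screen_size",
     lambda r: r.get("type") == "tv" and r.get("screen_size") is None),
]


def find_violations(records: List[Record],
                    constraints: List[Constraint]) -> List[Tuple[int, str]]:
    """Return (record index, constraint name) pairs; records are not modified."""
    report = []
    for index, record in enumerate(records):
        for name, is_violated in constraints:
            if is_violated(record):
                report.append((index, name))
    return report


# Illustrative records, made up for this example.
data = [
    {"type": "tv", "screen_size": 42},
    {"type": "tv", "screen_size": None},
    {"type": "grocery", "expiration_date": ""},
]

snapshot = copy.deepcopy(data)  # kept only to show the data stays unchanged
violations = find_violations(data, CONSTRAINTS)
print(violations)
```

Whether the flagged records are then corrected, removed or simply left as they are remains the decision of the data owner or provider.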

[1] Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems, 12(4), 5-33.