Explain! Data Provenance

The term provenance derives from the Latin verb “provenire”, which means “to emerge”, “to originate from” or “to come from”. Related terms used in this context are “lineage” and “pedigree”. In culture and the arts, provenance has long been used to document the chronology of the ownership, custody or location of a historical object or work of art, while digital libraries use it to document the lifecycle of a digital object.

In the context of (scientific) data and data management, provenance means documenting where a piece of data comes from and the processes and methodology by which it was produced. Recording data provenance, as a type of metadata, is important for confirming the authenticity of data and for enabling its reuse.

Data provenance answers the questions of why and how the data were produced, as well as where, when and by whom.

Why care about data provenance?

Provenance is about the trust, credibility and reproducibility of research. It is an important factor in judging the quality of data: how much trust one can place in the results, how reproducible they are and how reusable the data are. Providing provenance metadata therefore requires cooperation between data producers and data users.

In data-intensive research, for example, the data users are unlikely to be the data producers. Data producers may configure a simulation or an instrument in a certain way to collect primary data, or apply certain methodologies and processes to extract, transform and analyse input data to produce an output data product.

The accountability of research relies on the credibility and trustworthiness of the input data, as data are the scientific basis of the analysis. Data users may therefore want to check data quality, along with the expected level of imprecision.

Recording and managing data provenance

Some provenance information is routinely collected as part of a metadata set, e.g. date created, creator, instrument or software used and data processing methods. Good data management therefore forms the basis for accurately recording provenance.
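
As a rough illustration of the routinely collected fields mentioned above, the following Python sketch builds a small provenance record for one data file. The field names, the example creator and the "hypothetical temperature logger" are assumptions made for this example, not part of any metadata standard; the SHA-256 checksum is an added convenience that lets a data user verify the record describes the file they actually hold.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def build_provenance_record(data: bytes, creator: str, instrument: str,
                            methods: list[str]) -> dict:
    """Collect basic provenance fields for one data file.

    The field names are illustrative, not taken from a metadata standard.
    """
    return {
        "date_created": datetime.now(timezone.utc).isoformat(),
        "creator": creator,
        "instrument": instrument,
        "software": f"Python {platform.python_version()}",
        "processing_methods": list(methods),
        # Checksum ties the record to the exact bytes it describes.
        "sha256": hashlib.sha256(data).hexdigest(),
    }


record = build_provenance_record(
    data=b"station,temp_c\nA,12.5\n",
    creator="Jane Doe",  # hypothetical creator
    instrument="hypothetical temperature logger",
    methods=["removed duplicate rows", "converted Fahrenheit to Celsius"],
)
print(json.dumps(record, indent=2))
```

A record like this could be saved next to the data file, so that the provenance travels with the data when it is shared.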

  • Provenance can be recorded in a single README text file that describes the data collection and processing methods used
  • Provenance can also be recorded in a more structured way by using metadata standards, ranging from generic to discipline- or topic-specific
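  • The W3C Provenance Working Group, building on the work of the earlier W3C Provenance Incubator Group, has developed a Provenance Data Model (PROV-DM) and a Provenance Ontology (PROV-O), both published as W3C Recommendations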
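
To make the structured approach more concrete, the sketch below describes a single processing step using the three core PROV concepts: entities (the data), activities (the processing) and agents (who was responsible). The layout loosely follows the PROV-JSON serialization, but it is a simplified illustration that has not been validated against the specification; the `ex:` identifiers and the helper functions `prov_document` and `sources_of` are hypothetical names introduced for this example.

```python
import json


def prov_document(raw: str, cleaned: str, activity: str, agent: str) -> dict:
    """Describe 'cleaned was produced from raw by activity, run by agent'.

    Loosely modelled on PROV-JSON; simplified and not validated.
    """
    return {
        "prefix": {"ex": "http://example.org/"},
        "entity": {raw: {}, cleaned: {}},
        "activity": {activity: {}},
        "agent": {agent: {}},
        # The activity read the raw data ...
        "used": {"_:u1": {"prov:activity": activity, "prov:entity": raw}},
        # ... and produced the cleaned data.
        "wasGeneratedBy": {"_:g1": {"prov:entity": cleaned, "prov:activity": activity}},
        # The agent was responsible for the activity.
        "wasAssociatedWith": {"_:a1": {"prov:activity": activity, "prov:agent": agent}},
        # Shortcut relation: cleaned data derives from raw data.
        "wasDerivedFrom": {"_:d1": {"prov:generatedEntity": cleaned, "prov:usedEntity": raw}},
    }


def sources_of(doc: dict, entity: str) -> list[str]:
    """Return the entities that `entity` was directly derived from."""
    return [
        rel["prov:usedEntity"]
        for rel in doc.get("wasDerivedFrom", {}).values()
        if rel["prov:generatedEntity"] == entity
    ]


doc = prov_document("ex:raw.csv", "ex:cleaned.csv", "ex:cleaning-run", "ex:jane")
print(json.dumps(doc, indent=2))
print(sources_of(doc, "ex:cleaned.csv"))
```

Compared with a free-text README, a structured record of this kind can be queried by software, for example to trace a data product back to its inputs.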

