Provenance in the Wild: Provenance at The Gazette

July 22nd, 2014

This post was originally published at https://lucmoreau.wordpress.com/2014/07/22/provenance-in-the-wild-provenance-at-the-gazette/.

Today, following my post on Provenance in the 2014 National Climate Assessment, I continue to blog about applications making use of PROV. The Gazette has been the UK’s official public record since 1665; it has a long and established history and has been at the heart of British public life for almost 350 years (see an overview of its history). Today’s Gazette continues to publish business-critical information, and, thanks to its digital transformation, this information is now more accessible than ever before.

A quick reminder of what I mean by provenance:

Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing.

The W3C PROV recommendations offer a conceptual model for provenance and its mappings to various technologies such as RDF, XML, and a simple textual form, but also the means to expose, find, and share provenance over the Web.
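To make the model concrete, here is a minimal sketch using the Python prov package (an assumption: pip install prov; the ex: names are made up for illustration). It builds an entity generated by an activity, with both attributed and associated to an agent, and prints the result in the textual PROV-N form.

from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

doc.entity("ex:notice")          # a piece of data or a thing
doc.activity("ex:publishing")    # a process acting upon entities
doc.agent("ex:the-gazette")      # a person or institution

doc.wasGeneratedBy("ex:notice", "ex:publishing")
doc.wasAssociatedWith("ex:publishing", "ex:the-gazette")
doc.wasAttributedTo("ex:notice", "ex:the-gazette")

print(doc.get_provn())           # textual PROV-N rendering of the document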

In a true open government approach, the tender for The Gazette web site is available online, and requested the use of PROV (which was then a Candidate Recommendation). The purpose of provenance on The Gazette is to describe the capture, transformation, enrichment, and publishing process applied to all notices published by The Gazette. Let us examine how this was actually deployed.

For instance, the notice available at https://www.thegazette.co.uk/notice/2152652 records a change in a partnership. In the right-hand column, we see a link to the Provenance Trail.

[Image: notice 2152652, annotated to highlight the Provenance Trail link]

Following this link, we obtain the provenance information for this notice:

[Image: the provenance page for notice 2152652]

On this page, we find a graphical representation of the publication pipeline for this notice, and various links to machine-processable representations of provenance (in RDF/XML and JSON). When uploading this provenance into the Southampton PROV Translator service, we obtain the following graphical representation, which shows a much more complex and detailed pipeline.
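As an aside, one does not need a hosted service to inspect these downloads. A minimal local sketch with the Python prov package (the file name is hypothetical, standing for the PROV-JSON file linked from the provenance page):

from prov.model import ProvDocument

# Hypothetical local copy of the PROV-JSON linked from the provenance page.
doc = ProvDocument.deserialize("notice2152652-provenance.json")
print(doc.get_provn())  # the same provenance, rendered in textual PROV-N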

[Image: the same provenance rendered by the Southampton tools, showing a more detailed pipeline]

Provenance information is not just exposed in a browsable format; it is also exposed in a machine-processable format. Going back to the original https://www.thegazette.co.uk/notice/2152652 page and looking at the HTML source, we can find the following link element, stating the existence of a relation between the current document and the provenance page. The relation http://www.w3.org/ns/prov#has_provenance is defined in the W3C Provenance Working Group’s Provenance Access and Query specification.

<link rel="http://www.w3.org/ns/prov#has_provenance"
      href="https://www.thegazette.co.uk/id/notice/2152652/provenance"
      title="PROVENANCE" />
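A client can discover this link programmatically. Here is a minimal sketch in Python (standard library only); it is illustrative rather than production code, and assumes the notice URL still serves the markup above.

from html.parser import HTMLParser
from urllib.request import urlopen

PROV_REL = "http://www.w3.org/ns/prov#has_provenance"

class ProvLinkFinder(HTMLParser):
    """Collect href values of <link> elements carrying the PROV-AQ relation."""
    def __init__(self):
        super().__init__()
        self.prov_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            attributes = dict(attrs)
            if attributes.get("rel") == PROV_REL and "href" in attributes:
                self.prov_links.append(attributes["href"])

page = urlopen("https://www.thegazette.co.uk/notice/2152652").read().decode("utf-8")
finder = ProvLinkFinder()
finder.feed(page)
print(finder.prov_links)  # expected: the /id/notice/2152652/provenance URL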

Tools like Danius Michaelides’ ProvExtract can pick up this link and feed it into the Southampton Provenance Tool suite. ProvExtract also extracts some metadata, expressed as RDFa embedded in the document.

[Image: ProvExtract output for notice 2152652]

Unfortunately, a slight interoperability issue showed up here. The resource https://www.thegazette.co.uk/id/notice/2152652/provenance has only an HTML representation. In our ProvBook, Paul and I explain how content negotiation can be used to serve multiple representations, including standardized ones such as Turtle, PROV-XML, and PROV-N. It is an improvement that may be considered by The Gazette in future releases. Overall, I think it is remarkable how The Gazette exposes provenance in both visual and machine-processable formats. Congratulations to The Gazette’s team for this achievement.
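For the record, such a request is easy to issue. A sketch of what a client would send, should a future release support content negotiation (text/turtle here; other registered media types such as text/provenance-notation for PROV-N could be requested the same way):

from urllib.request import Request, urlopen

req = Request(
    "https://www.thegazette.co.uk/id/notice/2152652/provenance",
    headers={"Accept": "text/turtle"},  # ask for a Turtle representation
)
with urlopen(req) as response:
    # Today this returns text/html; with content negotiation in place,
    # the server could answer with text/turtle instead.
    print(response.headers.get("Content-Type"))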

I will finish this post with a few concluding remarks.

  1. While the National Climate Assessment 2014 report exposes provenance to end users as text, The Gazette opted for a high-level visualization of the pipeline. It is interesting to observe how simplified The Gazette’s graphical representation is, compared to the graphical rendering of the raw data displayed in this post. It shows that abstraction is an important processing step to apply to provenance to make it understandable. This was a recurrent topic of discussion at Provenance Week 2014.
  2. The Gazette also provides signed versions of its provenance (and other metadata). This is a powerful way of asserting the authorship of such provenance: in other words, it is a cryptographic form of provenance of provenance, which is non-repudiable (The Gazette cannot deny publishing such information) and unforgeable (nobody else can claim to have published this information). Here, practitioners are ahead of standardisation and theory: there is no standard way of signing provenance, and there is no formal definition of a suitable normal form of provenance ready for signature. A sketch of one possible ad-hoc approach follows this list.
  3. As we supply such PROV data to the tools we have developed in Southampton, we can obtain interesting visualizations. It really shows the benefits of standardisation, since PROV data produced by The Gazette team can be consumed by independently developed applications.
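On point 2, here is a sketch of one ad-hoc approach to signing: sort an N-Triples serialization as a crude normal form, then hash it. This is illustrative only (blank nodes would defeat such a naive canonicalization, which is precisely why a standard is needed); rdflib is assumed to be installed, and the file name is hypothetical.

import hashlib
from rdflib import Graph

g = Graph()
g.parse("notice2152652-provenance.rdf")  # hypothetical local copy (RDF/XML)

nt = g.serialize(format="nt")            # one triple per line
if isinstance(nt, bytes):                # older rdflib versions return bytes
    nt = nt.decode("utf-8")
canonical = "\n".join(sorted(nt.splitlines()))  # crude normal form

# The digest below is what a publisher would then sign with its private key.
print(hashlib.sha256(canonical.encode("utf-8")).hexdigest())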

Provenance in the Wild: the 2014 National Climate Assessment

July 11th, 2014

This was originally posted on http://lucmoreau.wordpress.com/2014/07/11/provenance-in-the-wild-the-2014-national-climate-assessment/.
A year after the publication of the PROV recommendations by the W3C Provenance Working Group, it is nice to see the deployment of applications making use of PROV. In this post, I talk about the 2014 National Climate Assessment report.

A quick reminder of what I mean by provenance:

Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing.

The W3C PROV recommendations offer a conceptual model for provenance and its mappings to various technologies such as RDF, XML, and a simple textual form, but also the means to expose, find, and share provenance over the Web.

The National Climate Assessment is a report on climate published every four years by the US government. “The report of the National Climate Assessment provides an in-depth look at climate change impacts on the U.S. It details the multitude of ways climate change is already affecting and will increasingly affect the lives of Americans.” Given the controversy around climate change, it is a critical piece of evidence-based scientific analysis.

For the 2014 edition, it was decided that provenance would be used to link all artefacts presented in the report to the original data sets, methods, and scientific articles. Specifically, PROV provenance! According to the authors, “the entity-activity-agent model of PROV has been applied through the use of resources, activities, and contributors”. Let us find some illustrations of PROV and how it is exposed to users, in textual form but also in machine-processable format. For example, dereferencing http://data.globalchange.gov/image/1a061197-95cf-47bd-9db4-f661c711a174, we obtain the following page.

[Image: the NCA page for the projected summer precipitation change image]

It describes an image resource: the projected precipitation change for summer. Besides its attribution to Kenneth Kunkel, we also find interesting provenance information. This image is said to be derived from a specific dataset, and was produced by an activity also linked from that page. This provenance information is not only exposed in textual form; it is also available via Semantic Web technologies. By clicking on the Turtle button at the bottom of that page, we find a Turtle file containing, among others, the following triple, which uses the PROV property prov:wasDerivedFrom.

<http://data.globalchange.gov/image/1a061197-95cf-47bd-9db4-f661c711a174> prov:wasDerivedFrom <http://data.globalchange.gov/dataset/nca3-cmip3-r201205> .

Note that I was not able to find properties linking to the activity, nor explicit attribution (though both are mentioned in the text). So it seems that not all the information was exposed through this Turtle resource. For details about the underpinning data model, see http://data.globalchange.gov/resources. A REST API is also described at http://data.globalchange.gov/api_reference. All NCA 2014 resources have representations in Turtle. Many ontologies are used, including, most notably, PROV, but also GCIS (an ontology designed for the Global Change Information System), which specializes PROV. A SPARQL endpoint also exists at http://data.globalchange.gov/sparql.
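A small consumer sketch with rdflib illustrates this machine-processability (the .ttl URL is an assumption based on the page’s Turtle button, and it presumes the 2014 endpoints still resolve):

from rdflib import Graph, Namespace

PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.parse("http://data.globalchange.gov/image/"
        "1a061197-95cf-47bd-9db4-f661c711a174.ttl", format="turtle")

# List every PROV derivation asserted in the image's metadata.
for subject, source in g.subject_objects(PROV.wasDerivedFrom):
    print(subject, "prov:wasDerivedFrom", source)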

Overall, NCA 2014 is a very impressive and rich resource, which is not only pleasant to browse, but also exposes very detailed metadata, in particular about the origins of all its artifacts. Congratulations to the NCA team for producing such a system. From the provenance standpoint, it also brings up interesting issues.

  1. NCA 2014 was a massive exercise in provenance reconstruction. In an ideal world, provenance should not be reconstructed, but directly copied, validated, reproduced, and suitably exposed. There are technological, methodological, and cultural impediments to such reproducibility. That may well be the topic of another post.
  2. It is an interesting question how provenance should be exposed to end users. The NCA team opted for a simple conversion of PROV to text. It remains an open question whether this is the most usable way of making provenance accessible.