Provenance in the Wild: Provenance at The Gazette

July 22nd, 2014

This post was originally published at https://lucmoreau.wordpress.com/2014/07/22/provenance-in-the-wild-provenance-at-the-gazette/.

Today, following my post on Provenance in the 2014 National Climate Assessment, I continue to blog about applications making use of PROV. The Gazette has been the UK’s official public record since 1665; with a long and established history, it has been at the heart of British public life for almost 350 years (see an overview of its history). Today’s Gazette continues to publish business-critical information, and, thanks to its digital transformation, this information is now more accessible than ever before.

A quick reminder of what I mean by provenance:

Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing.

The W3C PROV recommendations offer a conceptual model for provenance, its mapping to various technologies such as RDF, XML, or simple textual form, but also the means to expose, find and share provenance over the Web.

In a true open government approach, the tender for The Gazette web site is available online, and it requested the use of PROV (which was then a Candidate Recommendation). The purpose of provenance on The Gazette is to describe the capture, transformation, enrichment, and publishing process applied to all Notices published by The Gazette. Let us examine how this was actually deployed.

For instance, the notice available at https://www.thegazette.co.uk/notice/2152652 records a change in a partnership. In the right-hand column, we see a link to the Provenance Trail.

[Image: notice2152652-annotated]

Following this link, we obtain the provenance information for this notice:

[Image: notice2152652-provenance]

On this page, we find a graphical representation of the publication pipeline for this notice, and various links to machine-processable representations of provenance (in RDF/XML and JSON). When uploading this provenance to the Southampton PROV Translator service, we obtain the following graphical representation, which shows a much more complex and detailed pipeline.


[Image: notice2152652-provenance-tool]

Provenance information is not just exposed in a browsable format; it is also exposed in a machine-processable format. Going back to the original https://www.thegazette.co.uk/notice/2152652 page and looking at the HTML source, we can find the following link element, stating the existence of a relation between the current document and the provenance page. The relation http://www.w3.org/ns/prov#has_provenance is defined in the W3C Provenance Working Group’s Provenance Access and Query specification.

<link rel="http://www.w3.org/ns/prov#has_provenance"
      href="https://www.thegazette.co.uk/id/notice/2152652/provenance"
      title="PROVENANCE" />
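This discovery step can be sketched in a few lines of Python. The snippet below is a minimal illustration (not The Gazette’s or ProvExtract’s actual code): it parses an HTML fragment with the standard library’s html.parser and collects the href of any link element whose rel is the PROV has_provenance relation.

```python
from html.parser import HTMLParser

PROV_HAS_PROVENANCE = "http://www.w3.org/ns/prov#has_provenance"

class ProvLinkFinder(HTMLParser):
    """Collect the href of <link> elements whose rel is prov:has_provenance."""

    def __init__(self):
        super().__init__()
        self.provenance_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            attributes = dict(attrs)
            if attributes.get("rel") == PROV_HAS_PROVENANCE:
                self.provenance_links.append(attributes.get("href"))

html_fragment = '''<link rel="http://www.w3.org/ns/prov#has_provenance"
      href="https://www.thegazette.co.uk/id/notice/2152652/provenance"
      title="PROVENANCE" />'''

finder = ProvLinkFinder()
finder.feed(html_fragment)
print(finder.provenance_links)
# prints ['https://www.thegazette.co.uk/id/notice/2152652/provenance']
```

A consumer would then dereference the extracted URL to retrieve the provenance document itself.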

Tools like Danius Michaelides’ ProvExtract can pick up this link and feed it into the Southampton Provenance Tool suite. ProvExtract also extracts some metadata, expressed as RDFa embedded in the document.


[Image: provextract]

Unfortunately, a slight interoperability issue showed up here. The resource https://www.thegazette.co.uk/id/notice/2152652/provenance has only an HTML representation. In our ProvBook, Paul and I explain how content negotiation can be used to serve multiple representations, including standardized ones such as Turtle, PROV-XML, and PROV-N. It is an improvement that may be considered by The Gazette in future releases. Overall, I think it is remarkable how The Gazette exposes provenance in both visual and machine-processable formats. Congratulations to The Gazette’s team for this achievement.

I will finish this post with a few concluding remarks.

  1. While the 2014 National Climate Assessment report exposes provenance to end users as text, The Gazette opted for a high-level visualization of the pipeline. It is interesting to observe how simplified The Gazette’s graphical representation is, compared to the graphical rendering of the raw data displayed in this post. This shows that abstraction is an important processing step to apply to provenance to make it understandable; it was a recurrent topic of discussion at Provenance Week 2014.
  2. The Gazette also provides signed versions of its provenance (and other metadata). This is a powerful way of asserting the authorship of such provenance: in other words, it is a cryptographic form of provenance of provenance, which is non-repudiable (The Gazette cannot deny publishing such information) and unforgeable (nobody else can claim to have published this information). Here, practitioners are ahead of standardisation and theory: there is no standard way of signing provenance, and there is no formal definition of a suitable normal form of provenance ready for signature.
  3. As we supply such PROV data to the tools we have developed in Southampton, we can obtain interesting visualizations. This really shows the benefits of standardisation, since PROV data produced by The Gazette’s team can be consumed by independently developed applications.
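The signing issue raised in point 2 can be illustrated with a small sketch. Since no standard normal form exists, the code below simply assumes one (whitespace-normalized statements in sorted order) and computes the digest that a signature would be applied to; the statement syntax and the normalization rule are my own assumptions, not The Gazette’s.

```python
import hashlib

def canonical_form(statements):
    # A naive normal form: whitespace-normalized statements in sorted order.
    # This is an assumed convention; there is no standard normal form for PROV.
    return "\n".join(sorted(" ".join(s.split()) for s in statements))

def signable_digest(statements):
    # The digest over the canonical form is what an asymmetric signature
    # (e.g. RSA) would actually be computed on.
    return hashlib.sha256(canonical_form(statements).encode("utf-8")).hexdigest()

stmts = ["ex:notice prov:wasGeneratedBy ex:publish .",
         "ex:publish a prov:Activity ."]
reordered = list(reversed(stmts))

# The same statements in a different order yield the same digest,
# so the signature does not depend on serialization order.
print(signable_digest(stmts) == signable_digest(reordered))
# prints True
```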

Provenance in the Wild: the 2014 National Climate Assessment

July 11th, 2014

This was originally posted on http://lucmoreau.wordpress.com/2014/07/11/provenance-in-the-wild-the-2014-national-climate-assessment/.
A year after the publication of the PROV recommendations by the W3C Provenance Working Group, it is nice to see the deployment of applications making use of PROV. In this post, I talk about the 2014 National Climate Assessment report.

A quick reminder of what I mean by provenance:

Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing.

The W3C PROV recommendations offer a conceptual model for provenance, its mapping to various technologies such as RDF, XML, or simple textual form, but also the means to expose, find and share provenance over the Web.

The National Climate Assessment is a four-yearly report on climate published by the US government. “The report of the National Climate Assessment provides an in-depth look at climate change impacts on the U.S. It details the multitude of ways climate change is already affecting and will increasingly affect the lives of Americans.” Given the controversy around climate change, it is a critical piece of evidence-based scientific analysis.

For the 2014 edition, it was decided that provenance would be used to link all artefacts presented in the report to original data sets, methods, and scientific articles. Specifically, PROV provenance! According to the authors, “the entity-activity-agent model of PROV has been applied through the use of resources, activities, and contributors”. Let us find some illustrations of PROV and how it is exposed to users, in textual form, but also in machine-processable format. For example, dereferencing http://data.globalchange.gov/image/1a061197-95cf-47bd-9db4-f661c711a174, we obtain the following page.

[Image: nca]

It describes an image resource: the projected precipitation change for summer. Besides its attribution to Kenneth Kunkel, we also find interesting provenance information. This image is said to be derived from a specific dataset, and was produced by an activity also linked from that page. This provenance information is not only exposed in textual form; it is also available through Semantic Web technologies. By clicking on the Turtle button at the bottom of that page, we find a Turtle file including the following triple, which uses the PROV property prov:wasDerivedFrom.

<http://data.globalchange.gov/image/1a061197-95cf-47bd-9db4-f661c711a174> prov:wasDerivedFrom <http://data.globalchange.gov/dataset/nca3-cmip3-r201205> .

Note that I was not able to find properties linking to the activity, nor explicit attribution (though both are mentioned in the text), so it seems that not all the information was exposed through this Turtle resource. For details about the underpinning data model, see http://data.globalchange.gov/resources. A REST API is also described at http://data.globalchange.gov/api_reference. All NCA 2014 resources have representations in Turtle. Many ontologies are used, most notably PROV, but also GCIS (an ontology designed for the Global Change Information System), which specializes PROV. A SPARQL endpoint also exists at http://data.globalchange.gov/sparql.
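To give a flavour of how the SPARQL endpoint might be queried, here is a sketch that builds a request URL for the derivation of the image above. The query text is my own, and the use of a plain query parameter follows the standard SPARQL 1.1 Protocol; it is an assumption that this endpoint behaves that way, not something taken from the GCIS documentation.

```python
from urllib.parse import urlencode

ENDPOINT = "http://data.globalchange.gov/sparql"

# Ask for everything the image was derived from.
query = """
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?source WHERE {
  <http://data.globalchange.gov/image/1a061197-95cf-47bd-9db4-f661c711a174>
      prov:wasDerivedFrom ?source .
}
"""

# Per the SPARQL 1.1 Protocol, a query can be sent as the "query" parameter
# of a GET request; one would fetch this URL with a suitable Accept header.
url = ENDPOINT + "?" + urlencode({"query": query})
print(url)
```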

Overall, NCA 2014 is a very impressive and rich resource, which is not only pleasant to browse, but also exposes very detailed metadata, in particular about the origins of all its artifacts. Congratulations to the NCA team for producing such a system. From the provenance standpoint, it also brings up interesting issues.

  1. NCA 2014 was a massive exercise in provenance reconstruction. In an ideal world, provenance should not be reconstructed, but directly copied, validated, reproduced, and suitably exposed. There are technological, methodological, and cultural impediments to such reproducibility. That may well be the topic of another blog.
  2. It is an interesting question as to how provenance should be exposed to end users. The NCA team opted for a simple conversion of PROV to text. It remains an open question as to whether this is the most usable way of making provenance accessible.


PROV Book now available on the Kindle

November 25th, 2013

The title says it all: the PROV Book is now available on the Kindle. We think this is a great way to have the book, especially if you’re using it as reference material when building applications.

[Image: provkindle]

Provenance for disaster response

November 6th, 2013

Provenance & PROV are described in the context of the ORCHID disaster response project. The segment starts at 10:30 into the video.

A little provenance goes a long way

October 11th, 2013

PROV is a rich vocabulary that was designed to tackle a variety of use cases. The Provenance Working Group worked really hard to design PROV to facilitate its adoption. In our book, Paul and I provide many recipes to design, deploy, and use provenance in the context of a complex data journalism scenario.

However, we argue that identifying a resource, exposing its authors with attribution, and expressing what it is derived from already goes a long way towards a provenance-enabled Web.

Echoing Jim Hendler’s quote “A Little Semantics Goes a Long Way”, Paul and I conclude the ProvBook with a quote of our own:

A little provenance goes a long way

How could we “eat our own dog food” and express the provenance of this quote?

Simple, with the following Turtle snippet:

@prefix prov: <http://www.w3.org/ns/prov#>.
@prefix provbook: <http://www.provbook.org/>.

provbook:a-little-provenance-goes-a-long-way a prov:Entity;
    prov:value "A little provenance goes a long way";
    prov:wasAttributedTo provbook:Paul ;
    prov:wasAttributedTo provbook:Luc ;
    prov:wasDerivedFrom <http://www.cs.rpi.edu/~hendler/LittleSemanticsWeb.html>.

We have identified the quote with the URL http://www.provbook.org/a-little-provenance-goes-a-long-way. For convenience, we provided a copy of the quote itself (using the property prov:value). We identified Paul and myself as the authors. And finally, we gave credit to Jim by indicating that our quote was inspired by his: this notion is called derivation, and is expressed with the property prov:wasDerivedFrom.
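The same pattern can be generated programmatically. The helper below is a hypothetical sketch using plain string formatting (a real application would more likely use an RDF library); the function name and its arguments are my own, not part of any PROV tooling.

```python
PREFIXES = (
    "@prefix prov: <http://www.w3.org/ns/prov#>.\n"
    "@prefix provbook: <http://www.provbook.org/>.\n\n"
)

def simple_provenance(subject, value, authors, derived_from):
    """Emit minimal Turtle: an entity with a value, attribution, and derivation."""
    lines = [f"{subject} a prov:Entity;"]
    lines.append(f'    prov:value "{value}";')
    for author in authors:
        lines.append(f"    prov:wasAttributedTo {author} ;")
    lines.append(f"    prov:wasDerivedFrom <{derived_from}>.")
    return PREFIXES + "\n".join(lines)

turtle = simple_provenance(
    "provbook:a-little-provenance-goes-a-long-way",
    "A little provenance goes a long way",
    ["provbook:Paul", "provbook:Luc"],
    "http://www.cs.rpi.edu/~hendler/LittleSemanticsWeb.html",
)
print(turtle)
```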

All these statements can be represented graphically. In PROV, yellow ellipses represent entities, whereas orange pentagons represent agents; the agents here are the authors of the quote.


[Image: little]


Apply this motto in your own context, and publish simple provenance statements about your resources. Really, a little provenance goes a long way …


PROV Book – Available on Amazon

October 4th, 2013

Yesterday, Luc and I received our physical copies of Provenance: An Introduction to PROV in the mail. Even though the book is primarily designed to be distributed digitally, it’s always great actually holding a copy in your hands. You can now order your own physical copy on Amazon. The Amazon page for the book also includes the ability to look inside it.

[Images: booksonshelf; Prov Book Cover]


— Paul

Provenance: An Introduction to PROV is now available

September 13th, 2013

We are excited to announce the availability of Provenance: An Introduction to PROV. This is the seventh of Morgan & Claypool’s Synthesis Lectures on the Semantic Web, edited by Jim Hendler and Ying Ding. We are proud that our book is part of this series, which describes the technologies and practices that are creating a more machine-readable web.

In this book, we have tried to give a practical and insightful look at the W3C PROV recommendations for provenance interchange. While PROV is already being adopted, we believe that an introduction focused on the usage of provenance and answering key use cases would be beneficial to those looking to leverage PROV in their systems.

We hope you find the book useful and we are looking forward to your feedback. On this blog, we’ll keep you informed about PROV related topics.


Luc Moreau and Paul Groth