Monday, November 15, 2010

IPTC and Semantic Web Technologies - Linked Data, Metadata and Ontology

At the IPTC's most recent face-to-face meeting in Rome, we reviewed our explorations of semantic web technologies for news. The news standards body has been looking at three major areas:

Linked Data
We discussed our work to turn IPTC's subject codes into Linked Data using SKOS concepts and Dublin Core properties. Michael Steidl (Managing Director of the IPTC) was planning to demo the IPTC Linked Data ... but, sadly, Internet access was not working in the hotel! He was, however, able to discuss the proposed collaboration between the IPTC and MINDS on Linked Data for news.
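To make the idea concrete, here is a minimal sketch of what one subject code might look like as a SKOS concept with a Dublin Core property, serialized as Turtle. The example.org URIs, the definition text, the date and the choice of dcterms:modified are all assumptions for illustration, not the IPTC's published data.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
DCTERMS = "http://purl.org/dc/terms/"
XSD = "http://www.w3.org/2001/XMLSchema#"


def escape_literal(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def concept_to_turtle(uri, scheme_uri, label, definition, modified, lang="en"):
    """Return a Turtle document describing one subject code as a skos:Concept."""
    prefixes = [
        f"@prefix skos: <{SKOS}> .",
        f"@prefix dcterms: <{DCTERMS}> .",
        f"@prefix xsd: <{XSD}> .",
    ]
    body = [
        f"<{uri}> a skos:Concept ;",
        f"    skos:inScheme <{scheme_uri}> ;",
        f'    skos:prefLabel "{escape_literal(label)}"@{lang} ;',
        f'    skos:definition "{escape_literal(definition)}"@{lang} ;',
        f'    dcterms:modified "{modified}"^^xsd:date .',
    ]
    return "\n".join(prefixes + [""] + body)


# Hypothetical URIs and placeholder text, used only to show the shape.
turtle = concept_to_turtle(
    uri="http://example.org/subjectcode/sport",
    scheme_uri="http://example.org/subjectcode/",
    label="sport",
    definition="Placeholder definition for the sport subject.",
    modified="2010-11-15",
)
print(turtle)
```

Each concept gets its own URI, so other datasets can link to it. That is what makes it Linked Data rather than just a list of codes.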

Linking and Mapping

Much of the discussion about IPTC's Linked Data work turned on the difficulties of mapping. In addition to representing the IPTC subject codes in RDF/XML and RDF/Turtle, some work was done to map the 17 top-level IPTC terms to DBpedia concepts. We quickly figured out that these top-level terms are chiefly umbrella terms and so don't map very well to individual DBpedia concepts. The meeting felt that it would be good to map the second-level terms, but the problem is that this is quite a lot of work and - as usual - it isn't clear who will do it! We then explored some of the challenges of creating and maintaining the links in Linked Data - that is where a lot of the value, but also much of the investment, lies.
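One way to picture the mapping problem is as a list of links, each using one of the SKOS mapping properties. The sketch below writes such links as N-Triples. The IPTC-side URIs are placeholders under example.org and the choice of relations is my own illustration, not the mapping the IPTC actually produced.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"

# The five mapping properties defined by SKOS.
MAPPING_RELATIONS = {
    "exactMatch", "closeMatch", "broadMatch", "narrowMatch", "relatedMatch",
}


def mapping_triples(mappings):
    """Yield one N-Triples line per (source_uri, relation, target_uri) tuple."""
    for source, relation, target in mappings:
        if relation not in MAPPING_RELATIONS:
            raise ValueError(f"not a SKOS mapping relation: {relation}")
        yield f"<{source}> <{SKOS}{relation}> <{target}> ."


# Hypothetical mappings: an umbrella term and a more specific second-level term.
EXAMPLE_MAPPINGS = [
    ("http://example.org/subjectcode/sport", "closeMatch",
     "http://dbpedia.org/resource/Sport"),
    ("http://example.org/subjectcode/soccer", "closeMatch",
     "http://dbpedia.org/resource/Association_football"),
]

for line in mapping_triples(EXAMPLE_MAPPINGS):
    print(line)
```

Generating the triples is the easy part. Deciding which relation holds for each pair of terms is editorial work that someone has to do and then maintain.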

My slides about IPTC's Linked Data work are available on SlideShare.

Metadata in HTML - rNews and hNews
Many news providers have created feeds to supply news using IPTC formats such as NITF and NewsML-G2. However, a growing number of news consumers only want to work with "pure" web technologies, i.e. HTML rather than XML. So, the IPTC has been looking at the two major paths to represent metadata in HTML - microformats and RDFa.

I discussed hNews - the microformat for news that was adopted by the community in late 2009 - which builds upon hAtom by adding a few news-specific fields (such as Source and Dateline). As well as explaining how to add microformats to your HTML templates, I provided some statistics that the Associated Press has gathered on adoption. (As of October 2010, we know of about 1,200 sites using hNews, predominantly in North America.) See my Prezi on hNews for more.
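To show why consistent markup makes life easier for toolmakers, here is a rough sketch of pulling a few hNews fields out of a page using only Python's standard library. The sample HTML is invented, the set of class names is a small subset chosen for illustration, and a real microformats parser handles far more cases (nested hCards, the abbr title pattern, multiple entries per page).

```python
from html.parser import HTMLParser

# A small, illustrative subset of the hAtom/hNews class names.
HNEWS_FIELDS = {"entry-title", "source-org", "dateline", "updated"}
# Void elements have no closing tag, so they must not be tracked as open.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class HNewsExtractor(HTMLParser):
    """Collect the text inside elements whose class names a known field."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._open = []  # one list of matching field names per open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self._open.append([c for c in classes if c in HNEWS_FIELDS])

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._open:
            self._open.pop()

    def handle_data(self, data):
        # Text belongs to every enclosing element that names a field.
        for names in self._open:
            for name in names:
                self.fields[name] = self.fields.get(name, "") + data


def extract_hnews(html):
    """Return a dict of field name to whitespace-normalized text."""
    parser = HNewsExtractor()
    parser.feed(html)
    return {name: " ".join(text.split()) for name, text in parser.fields.items()}


# Invented sample markup, assumed to be well-formed.
SAMPLE = """
<div class="hnews hentry">
  <h1 class="entry-title">Example headline</h1>
  <span class="source-org vcard"><span class="fn org">Example News Agency</span></span>
  <span class="dateline">ROME</span>
  <abbr class="updated" title="2010-11-15T09:00:00Z">Nov. 15, 2010</abbr>
</div>
"""

print(extract_hnews(SAMPLE))
```

The extractor knows nothing about any particular site's templates. It only relies on the agreed class names, which is the whole point of a shared microformat.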

Evan Sandhaus (Semantic Technologist at The New York Times) described rNews - a proposal for an RDFa vocabulary for news. As the names imply, rNews and hNews are similar in intent (news-specific metadata in HTML) but somewhat different in approach. Whereas hNews went through the microformats process, an RDFa vocabulary can be created by anyone. Evan has created an initial rNews draft based loosely on the NewsML-G2, NITF and hNews models, but it is clearly heavily influenced by the needs of The New York Times.

Members of the IPTC's Semantic Web Yahoo! Group can view Evan's rNews draft and are encouraged to discuss it in that email group. At the Rome face-to-face meeting there was quite a lot of interest, but also several issues raised about the details of the first draft. The meeting generally agreed to continue looking at both hNews and rNews, with a view to making a recommendation on both in 2011.

The benefit of getting rNews and hNews adopted by the IPTC is that greater industry support translates into less work for toolmakers: if many news providers support hNews and/or rNews - and do so in very similar ways - then it is easier to build parsers and tools to extract metadata from HTML.

News Ontology
Benoît Sergent of the European Broadcasting Union discussed the work that he and his colleague Jean-Pierre Evain have been doing to create a news ontology, based upon the NewsML-G2 news model. Benoît described how EBU would like to combine the video content that it produces with content from its member organizations and other third parties. If they can represent this information using a flexible, universal model (the news ontology) they could use off-the-shelf tools (such as a triple store) to query, manipulate and recombine that content.
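As a toy illustration of the "query and recombine" idea, the sketch below stores statements from several providers as subject-predicate-object triples and answers a question that cuts across them. The ex: identifiers and predicate names are invented for this example and are not taken from the draft news ontology. A real triple store would use full URIs and a query language such as SPARQL.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return [
        t for t in triples
        if all(want is None or want == got for want, got in zip(pattern, t))
    ]


# Hypothetical statements about content from three different sources.
TRIPLES = {
    ("ex:item1", "ex:provider", "ex:EBU"),
    ("ex:item1", "ex:hasSubject", "ex:football"),
    ("ex:item1", "ex:mediaType", "video"),
    ("ex:item2", "ex:provider", "ex:MemberBroadcaster"),
    ("ex:item2", "ex:hasSubject", "ex:football"),
    ("ex:item3", "ex:provider", "ex:ThirdParty"),
    ("ex:item3", "ex:hasSubject", "ex:politics"),
}


def providers_for_subject(triples, subject_term):
    """List the providers that have at least one item about subject_term."""
    items = {s for s, _, _ in match(triples, predicate="ex:hasSubject", obj=subject_term)}
    return sorted(
        o
        for item in items
        for _, _, o in match(triples, subject=item, predicate="ex:provider")
    )


print(providers_for_subject(TRIPLES, "ex:football"))
```

Because every source is reduced to the same three-part statements, adding another provider means adding more triples rather than changing the query code.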

In many ways, this is the most fundamental piece of the semantic web work that the IPTC is undertaking. For many, it is also the least accessible. Members of the IPTC SemWeb group can view a draft of the news ontology and can comment in that email list.
