Wednesday, November 17, 2010

IPTC and Rights Expression Languages

IPTC has been looking at how to express rights for news content. At the Rome IPTC meeting, I presented the work to date (slides are on slideshare). And Daniel Pähler presented ODRL, the Open Digital Rights Language.

The News Industry Need for Machine Readable Rights
Various news publishers have identified the ability to express rights in a machine-readable way as a priority. In part, this reflects the fundamental changes that have been transforming the news industry. Once, an agency such as the Associated Press distributed content to editors at newspapers and broadcasters, who would select which items they would use. In the process of this selection, they would be able to read any editors' notes, which could include any restrictions that needed to be observed. However, increasingly, news outlets are fully automated, with very little - if any - editorial oversight of what is published. Amongst other things, this drives the need for the expression of rights and restrictions in a way that can be evaluated automatically. This automation would allow the editorial process to be more efficient. In general, an editor still needs to exercise their judgement as to whether a particular restriction applies in a particular context. But automatic evaluation of rights and restrictions can identify the items that need those decisions, rather than having editors inspect every single item. (This exercise of editorial judgement means that these systems are not like DRM, in which particular actions are forbidden and the prohibitions are typically enforced by the devices involved.)
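To make the triage idea concrete, here is a minimal sketch of how machine-readable restrictions could route items either to an editor or straight to publication. The field names ("restrictions", "slug") and the sample data are invented for illustration, not taken from any IPTC format.

```python
# Hypothetical sketch: flag items that carry machine-readable restrictions
# for editorial review, and let unrestricted items flow straight through.

def triage(items):
    """Split a feed into items an editor must review and items that
    can be published automatically."""
    needs_review, auto_publish = [], []
    for item in items:
        # Any non-empty restrictions list means a human decision is needed.
        (needs_review if item.get("restrictions") else auto_publish).append(item)
    return needs_review, auto_publish

feed = [
    {"slug": "storm-photo", "restrictions": ["no-use-in-uk"]},
    {"slug": "market-report", "restrictions": []},
]
review, publish = triage(feed)
```

The point is that editors only look at `review`, rather than every single item.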

IPTC and Rights Expression
IPTC reviewed how the various news formats that it maintains allow for the expression of rights. In every case, there are currently semi-structured ways to express rights using natural language, which don't easily allow for the fully machine-readable rights expressions that member companies need. On the other hand, the IPTC has consistently decided that it didn't want to develop a machine-readable rights expression language itself - members of the IPTC have felt qualified to develop news formats, whereas legal matters are a different domain.

After reviewing several candidates, we felt that ODRL v2 was the best fit for an existing (though not quite yet complete) rights expression language. In particular, it offers the ability to create an industry-specific rights vocabulary that can be "plugged in" to the ODRL framework. The IPTC has been working with ACAP on developing this vocabulary and with the ODRL group to help refine the ODRL v2 framework itself.
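To illustrate the "plug in" idea (and only to illustrate it - this is a simplified, hypothetical sketch, not a literal ODRL v2 serialization), the framework supplies the policy/permission/constraint structure, while an industry vocabulary supplies the actions and constraint values. All namespaces and identifiers below are invented:

```xml
<!-- Hypothetical sketch of the ODRL v2 "plug-in" pattern: the framework
     provides the structure; a news-industry vocabulary (the invented
     "news:" namespace) provides the action. Not actual ODRL v2 syntax. -->
<policy xmlns="http://example.org/odrl-v2-sketch"
        xmlns:news="http://example.org/news-rights-vocab">
  <permission>
    <asset idref="urn:example:photo:1234"/>
    <action resource="news:publish"/>
    <constraint name="spatial" operator="eq" rightOperand="iso3166:GB"/>
  </permission>
</policy>
```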

Daniel's slides can be downloaded from the ODRL Wiki.  They give a nice introduction to the ODRL effort and the ODRL v2 approach.

Questions on Rights and ODRL
Daniel and I were asked several questions about rights for news in general and ODRL in particular. For example, there was some discussion about whether rights are really only applicable to photographs (the consensus was no - rights are increasingly being applied to every media type). There were questions about how to apply rights to the parts of an item (such as the frames within a video). The partMeta structure of NewsML-G2 handles this nicely (assuming we were to include ODRL within the partMeta structure, which I think we would). There were also questions about what industries are using ODRL today. Chiefly, the mobile phone industry uses ODRL v1 as part of the OMA DRM system. Daniel explained that there are several academic projects that are using or working on ODRL v2 (Daniel himself is from the University of Koblenz).

Get Involved
You can find out more about the ACAP ODRL profile - and even participate in the work - by visiting a special Wiki page set up by ODRL for ACAP:

Tuesday, November 16, 2010

Adding Foreign Namespace Support to the NITF XML Schema

As previously discussed on this blog, NITF has very limited support for foreign namespaces.  I've been experimenting with ways to remedy this and presented the results at the most recent face-to-face IPTC meeting (in Rome).  I have posted the NITF slides on slideshare.

During the meeting, it was decided to break the NITF 4.0 effort into two:

  • NITF 3.6 release, which would directly address the addition of foreign namespaces and could be approved as early as January 2011
  • NITF 4.0 which would add support for G2 features (such as qcodes) and will likely require significantly more work to develop and approve
It was decided to take this approach, since foreign namespace support addresses pressing needs (in fact, many people do not realize that it isn't currently legal to mix and match NITF with other XML schemas).  It was also felt that this change is relatively minor and so doesn't merit a major version number change (not sure I agree with that, but ...).

Therefore, I have now created a set of NITF 3.6 schema files, fixing various bugs in the previous experimental schema and adding expansion points that were missing.  So, compared with NITF 3.5, the experimental schema:

  • Added any attributes in globalNITFAttributes and commonNITFAttributes
  • Added any element into head, body, docdata, body.head, block, enriched text, after body, media, body.end
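For those who haven't worked with schema wildcards, the expansion points amount to adding xs:any / xs:anyAttribute wildcards at the relevant spots. A simplified sketch of the technique (this is an illustration, not the actual NITF 3.6 schema text - the type name is invented):

```xml
<!-- Sketch of a foreign-namespace expansion point using XML Schema
     wildcards. "##other" admits elements/attributes from any namespace
     other than the target namespace; "lax" validates them only if a
     schema for them is available. -->
<xs:complexType name="exampleBlock" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:sequence>
    <!-- the existing NITF content model would go here -->
    <xs:any namespace="##other" processContents="lax"
            minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
  <xs:anyAttribute namespace="##other" processContents="lax"/>
</xs:complexType>
```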

I've also versioned the ruby files to 3.6 and altered the comments to discuss NITF 3.6, rather than NITF 3.5. Finally, I've created an NITF instance that exercises the various foreign namespace capabilities.

These files are also available via the NITF 3.6 directory on the IPTC website.

Comments?  Questions?  Critiques?

Monday, November 15, 2010

IPTC and Semantic Web Technologies - Linked Data, Metadata and Ontology

At the IPTC's most recent face-to-face meeting in Rome, we reviewed our explorations of semantic web technologies for news. The news standards body has been looking at three major areas:

Linked Data
We discussed our work to turn IPTC's subject codes into Linked Data using SKOS concepts and Dublin Core properties. Michael Steidl (Managing Director of the IPTC) was planning to demo the IPTC Linked Data ... but, sadly, Internet access was not working in the hotel! He was, however, able to discuss the proposed collaboration between the IPTC and MINDS on Linked Data for news.

Linking and Mapping

Much of the discussion about IPTC's Linked Data work turned on the difficulties of mapping. In addition to representing the IPTC subject codes in RDF/XML and RDF/Turtle, there was some work done to map from the 17 top level IPTC terms to dbpedia concepts. We quickly figured out that these top level terms are chiefly umbrella terms and so don't map very well to individual dbpedia concepts. The meeting felt that it would be good to map the second level terms, but the problem is that this is quite a lot of work and - as usual - it isn't clear who will do it! We then explored some of the challenges of creating and maintaining the links in Linked Data - that is where a lot of the value, but also much of the investment, lies.
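For illustration, a mapping from a second-level subject code to a dbpedia concept might be expressed with a SKOS mapping property along these lines (the subject-code URI and prefix here are invented, and the choice of skos:closeMatch over skos:exactMatch is exactly the kind of judgement call that makes this work slow):

```turtle
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix subj:    <http://example.org/iptc/subjectcode/> .   # illustrative prefix
@prefix dbpedia: <http://dbpedia.org/resource/> .

# Hypothetical mapping of one second-level subject code.
subj:01003000 skos:closeMatch dbpedia:Bullfighting .
```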

My slides about IPTC's Linked Data work are available on slideshare:

Metadata in HTML - rNews and hNews
Many news providers have created feeds to supply news using IPTC formats such as NITF and NewsML-G2. However, there are an increasing number of consumers of news who only want to work with "pure" web technologies, i.e. HTML rather than XML. So, the IPTC has been looking at the two major paths to represent metadata in HTML - microformats and RDFa.

I discussed hNews - the microformat for news that was adopted by the community in late 2009 - which builds upon hAtom by adding a few news-specific fields (such as Source and Dateline). As well as explaining how to add microformats to your HTML templates, I provided some statistics that the Associated Press has gathered on adoption. (As of October 2010, we know of about 1,200 sites using hNews, predominantly in North America). See my Prezi on hNews for more.
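To give a flavour of what this looks like in a template, here is a minimal sketch of an hNews entry. The class names follow the hNews draft as I understand it (hAtom's hentry plus the news-specific source-org and dateline fields); the content itself is invented:

```html
<!-- Minimal hNews sketch: an hAtom entry plus the news-specific
     source-org and dateline properties. Invented content. -->
<div class="hnews hentry">
  <h1 class="entry-title">Example Headline</h1>
  <p>
    <span class="dateline">LONDON</span> -
    <abbr class="published" title="2010-11-17T09:00:00Z">Nov. 17, 2010</abbr>
  </p>
  <div class="vcard source-org"><span class="org fn">Example News Agency</span></div>
  <div class="entry-content"><p>Story text goes here.</p></div>
</div>
```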

Evan Sandhaus (Semantic Technologist at The New York Times) described rNews - a proposal for an RDFa vocabulary for news. As the names imply, rNews and hNews are similar in intent (news-specific metadata in HTML) but somewhat different in approach. Whereas hNews went through the microformats process, an RDFa vocabulary can be created by anyone. Evan has created an initial rNews draft based somewhat on the NewsML-G2, NITF and hNews models but it is clearly heavily influenced by the needs of the New York Times.

Members of the IPTC's Semantic Web Yahoo! Group can view Evan's rNews draft and are encouraged to discuss it in that email group. At the Rome face-to-face meeting there was quite a lot of interest, but also several issues raised about the details of the first draft. The meeting generally agreed to continue looking at both hNews and rNews, with a view to making a recommendation on both in 2011.

The benefit of getting rNews and hNews adopted by the IPTC is that greater industry support translates into less work for toolmakers: if many news providers support hNews and/or rNews - and do so in very similar ways - then it is easier to build parsers and tools to extract metadata from HTML.

News Ontology
Benoît Sergent of the European Broadcasting Union discussed the work that he and his colleague Jean-Pierre Evain have been doing to create a news ontology, based upon the NewsML-G2 news model. Benoît described how EBU would like to combine the video content that it produces with content from its member organizations and other third parties. If they can represent this information using a flexible, universal model (the news ontology) they could use off-the-shelf tools (such as a triple store) to query, manipulate and recombine that content.

In many ways, this is the most fundamental piece of the semantic web work that the IPTC is undertaking. It is also the least accessible, for many. Members of the IPTC SemWeb group can view a draft of the news ontology and can comment in that email list.

Thursday, September 2, 2010

Expanding Beyond the Point - Geo Geometries and Features

When adding "geo data" to your content, it is tempting to think you just need to add a latitude and longitude and you're done.  After all, this allows you to plot your data on a map, by applying a push pin or the like.
what's the point?

However, that lat+long pair may indicate the centre of a location in your content, but it doesn't reveal anything about the scale or type of the location.  There are two ways you can convey that extra information.  One is to make use of a geometry, the other is to identify the feature type.  And you can use both together.

Geo Geometries

Geometry at the National Mosque
If you look at the various geo standards, you'll see that it is common to support several different types of geometry for geographic data - typically points, plus shapes such as polygons, boxes and circles.
The polygon, box and circle all express areas.  The needs of your application will likely drive which one might be most appropriate.  For example, if you want to display a map at the right "zoom level", it is likely that a point (aka the centroid) and a box (aka the bounding box) are enough.  (In which case, you'll need three points to express them - one for the centroid, two for the corners of the bounding box.  Obviously, the other geometries have different characteristics - you only need the centroid and a radius to express a circle, whereas a polygon must have at least three points, but can have many more).
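As a small sketch of the point-versus-area idea, here is how you might derive the two geometries mentioned above from a polygon: a centroid for the push pin, and a bounding box for choosing a zoom level. (The "centroid" here is just the arithmetic mean of the vertices, which is a simple stand-in for a true polygon centroid but good enough for placing a marker; the coordinates are made up.)

```python
# Derive a marker point and a bounding box from a polygon's vertices.

def centroid(points):
    """Arithmetic mean of the vertices - a rough marker position."""
    lats = [lat for lat, lon in points]
    lons = [lon for lat, lon in points]
    return (sum(lats) / len(points), sum(lons) / len(points))

def bounding_box(points):
    """South-west and north-east corners enclosing all the points."""
    lats = [lat for lat, lon in points]
    lons = [lon for lat, lon in points]
    return (min(lats), min(lons)), (max(lats), max(lons))

# A rough triangle around a fictional park:
polygon = [(51.0, -0.2), (51.2, 0.0), (51.0, 0.2)]
```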

Geo Features

Pigeon Point / Sky Whale...
On the other hand, it could be that what you're aiming to do is to describe what is known as the "geo feature", rather than or in addition to the geometry of an area.  In other words, how is the area classified - is it a city, a country, a park, a farm, a forest, a lake ...?  If so, then there are some geo ontologies that might help.  A popular one is the geonames ontology, which breaks down geo features into subtypes of administrative boundaries, hydrographic, area, populated place, road / railroad, spot, hypsographic, undersea and vegetation.
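Those top-level geonames feature classes can be captured in a simple lookup table. The one-letter codes below are the geonames feature-class codes, mapped to the categories listed above (this reflects my reading of the geonames documentation):

```python
# The top-level geonames feature classes as a lookup table.
GEONAMES_FEATURE_CLASSES = {
    "A": "administrative boundaries",
    "H": "hydrographic",
    "L": "area",
    "P": "populated place",
    "R": "road / railroad",
    "S": "spot",
    "T": "hypsographic",
    "U": "undersea",
    "V": "vegetation",
}

def describe(feature_class):
    """Human-readable category for a geonames feature-class code."""
    return GEONAMES_FEATURE_CLASSES.get(feature_class, "unknown")
```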

Wednesday, July 21, 2010

Linked Data and the World Cup

In January of 2010, I attended the News Linked Data Summit. This was a collection of several organizations involved in news production and distribution, looking to see if there was a way to collaborate on moving forward the Semantic Web (and particularly Linked Data) for news. It was a very interesting discussion, led mainly by the BBC and The Guardian. At the end of the day, the group decided to move forward together by producing Linked Data for the UK Election. (In January, the election was clearly going to happen in the near future, but had not yet been announced - it wound up happening in May 2010).

I had a counter proposal for a Linked Data experiment - the World Cup. This made more sense for my employer and seemed to be of interest to at least some others. It was not to be, however. (I later spoke to some folks about whatever happened with the UK Election Linked Data experiment. As far as I can tell, the Guardian did produce some election data. But it isn't clear to me that this turned into anything bigger).

By Shine 2010

However, it turns out that the BBC did use the Semantic Web to power their World Cup “microsite” (actually their World Cup site has more pages than the non-SemWeb Sports section). In "The World Cup and a call to action Around Linked Data", BBC Architect John O'Donovan gives an overview of how they used fine-grained metadata to be able to produce their site with far less need for editorial curation of individual pages. In an accompanying piece, Jem Rayfield describes the technical details of the "dynamic semantic publishing".

Although their descriptions are couched in the terms of the Semantic Web, much of what they describe is more to do with the application of fine-grained metadata than with the particular data formats they use. They do describe how they make use of certain Semantic Web technologies - such as an RDF Triplestore / SPARQL system - to apply derived properties. And it is almost certainly the case that the use of the RDF model has given them more flexibility than can be the case when you use RDBMS systems - or even XML-based schema. However, fundamentally, what they are describing is the potential for what could be done via the application of metadata to content. It is interesting to see it in action.

The England Team didn't do so well at the World Cup - thanks Doug88888!

It is also interesting how much of a revelation this is to people. (The first article was widely distributed via Twitter. And the comments are quite breathless in their admiration).

Hopefully, the work that the IPTC is doing on Linked Data and the Semantic Web will help other news organizations (including the AP!) unlock some of the great metadata work that is going on behind the scenes...

Monday, May 17, 2010

Towards NITF 4.0 - Experimental Support for "Foreign" Namespaces

One of the critiques that has been leveled against NITF for many years is that it cannot be customized by including "other" XML namespaces [1]. In NITF 3.5, we took a step towards opening up the NITF 3.4 schema by fixing a bug in namespace support in enriched text [2].

However, we decided that full support for foreign namespaces was such a big change, that this would constitute one of the major pieces of work for NITF 4.0 [3]. I've created an *experimental* NITF XSD with foreign namespace support. It can be downloaded from

I've performed a number of tests with this schema and it seems to me to be going in the right direction. I've also discovered a problem with the NITF 3.5 namespace support, which I've fixed in this experimental XSD [4]. I promise to write a future note about the choices I made in adding foreign namespace support. However, I wanted to get the current version out there, to give people a chance to download and try it out [5].

[1] See for example, the excellent discussion of schemas and extensibility by Bob DuCharme and how NITF is too closed

[2] Get the NITF schema at

[3] Discussion of the plans for NITF 4.0 (amongst other things) can be reviewed in

[4] A special prize for the first person to figure out what the bug was

[5] Note that the NITF 4.0 experimental schema contains the documentation I copied over from the NITF 3.5 DTD. I'm still keen to get feedback on this too!

Friday, May 14, 2010

SKOS and Protégé HOWTO

In the IPTC, we are doing some work to figure out how to represent the IPTC Controlled Vocabularies as Linked Data.  We've decided to use SKOS as the RDF vocabulary.  One of the things we wanted to do was to use a tool that "understands" SKOS.  We decided to look at Protégé for this.  Here are the steps we figured out to make it work (for some values of work):

- Download version 4.0.2 from
- Install it on your PC
- Add the SKOSed plugin (use the Check for plugins... item under File)
- Add the Pellet Reasoner plugin
- (you have to restart Protégé before the new plugins are active)
- Add Views to the Individuals tab: SKOSed view -> Inferred Concept Hierarchy + SKOS Usage
- Then you may load a SKOS vocabulary - but only one with narrower and broader relationships; it does not work with the ...Transitive variants
- Then you should run the Pellet Reasoner against this vocabulary
- Only then will you see the hierarchy in the Inferred Concept Hierarchy frame.
(JPE later adds "I have declared the narrower and broader properties as "transitive" using protégé and it  works.")

Posting them here, to make it easier for me to find (and maybe to help others).

Friday, May 7, 2010

Recently, I was asked for pointers to introductory material on the Semantic Web, specifically for such topics as N3 and Turtle.  I found "The Semantic Web 1-2-3" the most helpful in starting to get to grips with the mysteries of the SemWeb.  Note, however, that it is somewhat outdated now (it refers to DAML+OIL for example).  But it is still a good foundation and it has more links to great semwebby material than you can shake a stick at, if that's your idea of fun.  A couple of extra links that might be of help are
I can't really find anything good that explains Turtle, other than the formal spec.  But my twitter-length explanation is that Turtle is N3 minus the reasoning extensions but plus internationalization (i.e. it is a more exact rendition of RDF than N3 is).
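For example, the internationalization bit is visible even in a single Turtle statement: literals can carry language tags. (The concept URI below is invented for illustration.)

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# One subject, one property, two language-tagged labels.
<http://example.org/concept/weather> skos:prefLabel "weather"@en ,
                                                    "Wetter"@de .
```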

Other great links out there?

Monday, April 26, 2010

Experimenting with Bull Fighting and the Semantic Web

Paul Kelly and I took a look at what it would take to represent NewsML-G2 style Knowledge Items as SKOS.

 Here's a typical G2 style concept from the IPTC subject code vocabulary:

<conceptId qcode="subj:01003000" created="2008-01-01T00:00:00+00:00" />
<type qcode="cpnat:abstract" />
<name xml:lang="en-GB">Abstract concept</name>
<name xml:lang="en-GB">bullfighting</name>
<name xml:lang="de">Stierkampf</name>
<name xml:lang="fr">Tauromachie</name>
<name xml:lang="es">toros</name>
<name xml:lang="es">闘牛</name>
<definition xml:lang="en-GB">Classical contest pitting man against the bull</definition>
<definition xml:lang="de">Klassischer Wettkampf Mann gegen Stier.</definition>
<definition xml:lang="fr">Tauromachie, combat entre un homme et un taureau</definition>
<definition xml:lang="es">Clasico enfrentamiento entre Toro y Hombre.</definition>
<definition xml:lang="es">人間を雄牛と戦わせる伝統的な競技</definition>
<broader qcode="subj:01000000" type="cpnat:abstract" />

Here's what I think that would render as in SKOS - first in N3:

@prefix dc: <>.
@prefix skos: <>.
@prefix subj: <>.
@prefix rdf: <>.

subj:01003000 rdf:type skos:Concept;
dc:created "2008-01-01T00:00:00+00:00";
skos:prefLabel "bullfighting"@en-GB;
skos:prefLabel "Stierkampf"@de;
skos:prefLabel "Tauromachie"@fr;
skos:prefLabel "toros"@es;
skos:prefLabel "闘牛"@es;
skos:definition "Classical contest pitting man against the bull"@en-GB;
skos:definition "Klassischer Wettkampf Mann gegen Stier."@de;
skos:definition "Tauromachie, combat entre un homme et un taureau"@fr;
skos:definition "Clasico enfrentamiento entre Toro y Hombre."@es;
skos:definition "人間を雄牛と戦わせる伝統的な競技"@es;
skos:broader subj:01000000.

Some conclusions. G2 concepts are easy to map to SKOS! But there are some choices that need to be made (e.g. do we use skos:broader or owl:broader?) It feels like a G2 conceptset is almost identical to a SKOS Concept Scheme.
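The mechanics of the mapping are simple enough to sketch in a few lines: pull the language-tagged names out of a G2 concept fragment and emit skos:prefLabel statements. The input below is a simplified, well-formed version of the concept shown above (just two of the names); this is an illustration, not a complete mapper.

```python
# Sketch: turn the <name xml:lang="..."> children of a G2 concept into
# skos:prefLabel lines.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

g2_concept = """
<concept>
  <conceptId qcode="subj:01003000"/>
  <name xml:lang="en-GB">bullfighting</name>
  <name xml:lang="de">Stierkampf</name>
</concept>
"""

def names_to_skos(xml_text):
    root = ET.fromstring(xml_text)
    qcode = root.find("conceptId").get("qcode")
    return [
        f'{qcode} skos:prefLabel "{name.text}"@{name.get(XML_LANG)};'
        for name in root.findall("name")
    ]

# e.g. the first line is: subj:01003000 skos:prefLabel "bullfighting"@en-GB;
```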

And there's a bug in the IPTC mapping of the subject codes (the Japanese definition and name is labeled as Spanish).

¡Olé! by karwik

Sunday, April 18, 2010

Connecting oXygen and eXist

What could be more natural than connecting oXygen and eXist? After all, they both share that second upper case X.

In fact, I'm trying to teach myself how to perform XQueries using eXist-db, the "open source native XML database". This is somewhat tricky to start. Although it is relatively easy to download eXist, it does require ensuring that you have a JDK installed on your Windows machine. But then I ran into the real problem: there is no "hello world" example that tells you completely step-by-step how to write your first XQUERY, upload it to eXist and then execute it via the web interface. Or, at least, I couldn't find anything.
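For what it's worth, the query itself can be as small as this - the tricky part is the uploading and executing, not the writing:

```xquery
xquery version "1.0";
(: A minimal "hello world" XQuery: it simply constructs one element. :)
<greeting>Hello, eXist!</greeting>
```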

I have, however, figured out a partial workaround: I have figured out how to use oXygen as a sort of front end to eXist, to allow me to upload, edit and run my XQueries. (I still haven't figured out how to execute the XQueries directly from the webserver. Perhaps one day...).

I followed this screencast which shows how to configure your oXygen editor to connect to eXist. Although I had to use the jars listed in this howto, rather than the ones shown in the screencast! Just in case either that howto or that screencast disappears, here's what I did:

-1. Download a JDK (a JRE isn't sufficient)
0. Download eXist
1. In oXygen, switch to the Database Perspective
2. Configure an eXist datasource:
- 2a. Click on "New" under the top panel labeled "Data Sources"
- 2b. Select "eXist" from the "Type" dropdown
- 2c. Give it a name (such as "Exist Datasource")
- 2d. Click "Add"
- 2e. Select the following JARs from within the exist library:
  • exist/exist.jar
  • exist/lib/core/xmldb.jar
  • exist/lib/core/xmlrpc-client-3.1.3.jar
  • exist/lib/core/xmlrpc-common-3.1.3.jar
  • exist/lib/core/ws-commons-util-1.0.2.jar
- 2f. Click "Open" then "OK"
3. Create an Exist Connection
- 3a. Click "New" under the bottom panel, labeled "Connections"
- 3b. In the Data Source dropdown, select the one you created in step two (which I suggested you call "Exist Datasource")
- 3c. Give your connection a name (such as "Exist Connection")
- 3d. In the XML DB URI field, edit the placeholder to be the host address of your Exist installation (and modify anything else that differs, such as the port number)
- 3e. Replace the username and password fields with the correct values
- 3f. Hit "OK"
4. You should now have an exist connection which you can open by double clicking!

Saturday, January 16, 2010

Help Haiti or Appalachia?

A friend of mine on Facebook opined

wonders hwo throwing money at Haiti will fix anything? How about we send that money and that Red Cross aid to the people in Appalachia? They're our own citizens starving and living in hovels. But no, we go to Haiti.....*looks around....steps off soapbox*

I responded:

although it seems as though a lot of people have died in haiti, the real risk is what happens next. if there is no significant help sent right away, then there will likely be more deaths due to shortages of food and water and from epidemics (cholera, typhoid, dysentery, hepatitis) due to lack of sanitation facilities. in 2004, in the aftermath of the indian ocean earthquake, "only" about 200,000 to 300,000 people died because the international response was relatively good. (the usa sent about $2.8bn - which works out to around $9.81 per person).

and it isn't a one way street. in the aftermath of katrina, countries all around the world offered to send aid to the usa in their hour of need.

finally, it isn't a case of either or, is it? most of us could probably afford to send $9.81 to haiti and $9.81 to help people in the slow motion crisis of appalachia.

Tuesday, January 12, 2010

CQL - A Web Friendly Query Language with Metadata Support?

I've been thinking about query languages recently.

As part of my experimentation with MarkLogic, I've been playing with XQUERY, the query language for XML. I find it able to do everything I need it to do. I haven't invested enough time and effort to fully grok it (in the way that I did with XSLT last year). However, the fact that XQUERY and XSLT share a common addressing model (XPATH) is a huge win for me. At least I can start *somewhere* with XQUERY. But I can't find an XQUERY equivalent to dpawson's excellent XSL Frequently Asked Questions. I suppose I need to go Old Skool and actually read one of the two books on XQUERY that I bought years ago.

Another query language that I have looked at but have done little more than think about is SPARQL, the query language for RDF. Just like XQUERY (and clearly modeled on the query language category killer SQL) you essentially compose a query document that you post to a service and get back a results document in return (XQUERY lets you construct results in XML, SPARQL lets you construct results as RDF or as variable bindings, as I understand it). There are increasing opportunities to play with SPARQL endpoints, so I suspect that I will have the opportunity to actually work with SPARQL at some point.
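To show what that "query document" looks like in the SPARQL case, here is a minimal query against an imaginary endpoint (the concept URI is invented for illustration):

```sparql
# Ask an (imaginary) endpoint for the English label of a concept.
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?label
WHERE {
  <http://example.org/concept/weather> skos:prefLabel ?label .
  FILTER ( lang(?label) = "en" )
}
```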

However, it is difficult to discuss search without mentioning Google. One of the nice things about Google is that you can compose your searches as URIs. And you can supply additional fields to modify your search parameters using standard URI conventions. I was looking for something that might work in this appealingly simple way but give you access to the full power of searching metadata fields. I didn't want to have to invent my own query syntax.

I looked first at A9's Open Search. This is an attempt to allow search engines to publish a profile of the search syntax and formats that they support. Initially very focused on keyword search, there is now a set of draft extensions that are mainly aimed at extending the *results* of a query with additional namespaced information, although there is some limited support for allowing an engine to indicate which parameters it supports. So, the Open Search approach seems quite nice, but it doesn't quite hit the sweet spot I was looking for of being able to mix free text search with fielded search, all wrapped up in a RESTful interface.

Then, I came across SRU (Search/Retrieval via URL) and specifically CQL (now the "Contextual Query Language", but formerly known as the "Common Query Language"). It seems that SRU and CQL grew out of efforts to create a fully "of the web" successor to the pre-web Z39.50 library search and retrieve protocol. The SRU part makes it all RESTful (there's an SRW protocol for those of you who prefer to be SOAPy). And the CQL syntax lets you specify fielded search using a nice, extensible mechanism. For the cherry on top, you can return different types of XML (my favourite meta language).
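To make the "RESTful" part concrete, here is a sketch of composing an SRU searchRetrieve URL that carries a CQL query. The host is invented; the parameter names (operation, version, query) follow the SRU spec as I read it, and "dc.title" assumes a context set the server actually supports:

```python
# Compose an SRU searchRetrieve request URL carrying a CQL query.
from urllib.parse import urlencode

def sru_url(base, cql_query, version="1.1"):
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,   # the CQL expression, URL-encoded by urlencode
    }
    return base + "?" + urlencode(params)

url = sru_url("http://example.org/sru", 'dc.title = "bullfighting"')
```

The appeal is that the whole query - fielded search included - is just a URL you can GET.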

The only thing I am struggling to find is much evidence of widespread adoption or even open source implementation. The Library of Congress are hosting the SRU/CQL pages, so I assume they have adopted it, at least in part. Some other library type organizations (such as COPAC) have SRU/SRW services, alongside their Z39.50 interfaces. And there is some evidence of attempts to somehow bring OpenSearch and SRU/CQL together, apparently by Nature.

Any suggestions as to where else I should look?

Monday, January 11, 2010

Writing Use Cases

I thought I would re-familiarize myself with the art of writing "use cases". (A snappy definition is "A use case is a prose description of a system’s behavior when interacting with the outside world.") After a little bit of research, I quickly concluded that Alistair Cockburn's Writing Effective Use Cases, first published in 2000, is still the bible for this topic.

I thought I would see if he has any more recent publications. I found his website and immediately liked it, as he describes himself as "an internationally renowned project witchdoctor". It seems he is still teaching a course based on his book as well as courses based on the Crystal and SCRUM agile methodologies.

Browsing the site, I noticed that Alistair seems to have evolved his thinking a bit on use cases. Reading the pages tagged "use cases", it appears that he has been moving towards a mature and nuanced view of where they work and the exact form(s) the use cases should take. His reflections on the occasion of the tenth anniversary of publishing his book are particularly illuminating. He still uses use cases, but seems to favour "ultra light" use cases in some situations.

My take away is that use cases and stories make sense as a way to structure requirements documents. And that there is value in delivering even just sketches of use cases, particularly when combined with "agile" methods.

Monday, January 4, 2010

Thinking in 3D

I was hoping to see Avatar over the holiday break - I hear it is best to see in 3D. In the meantime, this got me thinking a little about the potential for 3d printing.

When I first heard mention of this technology, it sounded like a sci-fi dream: the ability to create physical objects on demand via printing. But it turns out it is not some far-off future state - 3d printing is used today in manufacturing, mainly for rapid prototyping. And it is starting to spread elsewhere. Artists are using 3d printing to create sculpture. RepRap is an open source / open hardware project to create self-replicating machines (3d printers to create more 3d printers).

Current trends include escalating fuel costs and a substitution of the virtual world for physical objects. Could 3d printers be used to create a network of hyper-local micro factories that manufacture the few physical items we want? On demand creation right next door, rather than transporting boxes of things long distances?

Friday, January 1, 2010

Fun, but it didn't work

I think it is safe to admit that my experiment didn't work.

Oh, well. I'm still glad I tried.