Monday, July 22, 2013

AWS, SQS and Poisonous Items

I've been using Amazon Web Services to handle large-scale processing of content. One handy AWS offering is SQS, the Simple Queue Service. It's great, since it allows you to decouple (and hence scale) your processing. (There are other advantages besides.) However, I've encountered a problem with queues that I've nicknamed "The Poisonous Item". I thought I would share it, together with a couple of workarounds (but not solutions) for handling it.
Q is for Queue by Darren Tunnicliff

A Typical Architecture Using SQS
Let's pretend that you have a system that uses SQS to process XML documents and upload the results to your Amazon S3 storage. The processing of each XML document can take a variable amount of time (e.g. depending on the size or complexity of the document), so you decide you want it to scale nicely. You therefore create one application that writes the document ids to an SQS queue and a second application that processes each document and writes the result to S3. Because each document can be handled independently of all the others, you can run as many instances of the document processing application as you like, in parallel, fed by your SQS queue. This can be pictured as in the diagram below.
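The decoupling is the key idea: the producer and the workers only share the queue. Here's a minimal sketch of the pattern using an in-memory queue as a stand-in for SQS (the names `process_document` and the fake document ids are made up for illustration):

```python
import queue
import threading

# Stand-in for the SQS queue: the producer writes document ids,
# and any number of worker instances consume them independently.
work = queue.Queue()
results = []
lock = threading.Lock()

def process_document(doc_id):
    # Placeholder for the real XML processing + upload to S3.
    return f"processed-{doc_id}"

def worker():
    while True:
        doc_id = work.get()
        if doc_id is None:          # sentinel: no more work
            break
        out = process_document(doc_id)
        with lock:
            results.append(out)

# Producer: enqueue the document ids.
for i in range(10):
    work.put(f"doc-{i}")

# Scale out simply by running more workers in parallel.
workers = [threading.Thread(target=worker) for _ in range(3)]
for t in workers:
    t.start()
for _ in workers:
    work.put(None)                  # one sentinel per worker
for t in workers:
    t.join()

print(len(results))  # → 10
```

With real SQS, you'd swap the in-memory queue for send/receive/delete calls, but the shape of the producer and worker code stays the same.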
Visibility Timeout
The SQS queue has a visibility timeout, with a default of 30 seconds. When a service fetches an item from the queue, the item is hidden for that period of time, to allow the service to handle it. If all goes well, the service deletes the original item from the queue. However, if the service crashes, the item becomes visible on the queue again (because the visibility timeout expires). This is all a good thing, since it means your system is reliable in the face of problems (like an instance of your XML processing application crashing).
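To make the semantics concrete, here's a toy model of the visibility timeout - not the real SQS API, just the behavior, using a logical clock instead of real time:

```python
class VisibilityQueue:
    """Toy model of SQS visibility-timeout semantics (not the real API)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.items = {}        # item_id -> (body, invisible_until)
        self.next_id = 0

    def send(self, body):
        self.items[self.next_id] = (body, 0.0)
        self.next_id += 1

    def receive(self, now):
        for item_id, (body, invisible_until) in self.items.items():
            if now >= invisible_until:
                # Hide the item for the visibility-timeout period.
                self.items[item_id] = (body, now + self.visibility_timeout)
                return item_id, body
        return None

    def delete(self, item_id):
        self.items.pop(item_id, None)

q = VisibilityQueue(visibility_timeout=30)
q.send("doc-1")

item = q.receive(now=0)           # a consumer fetches the item...
assert q.receive(now=10) is None  # ...so it is hidden from everyone else
# If that consumer crashes and never deletes it, the item
# reappears once the timeout expires:
assert q.receive(now=31) is not None
```

If the consumer finishes in time and calls `delete`, the item is gone for good; the reappearance only happens when nobody deletes it before the timeout.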
Poison by Thorius
Poison Items
The “poison” item scenario is as follows: there’s some problem with one particular item, e.g. it is really huge and takes, say, 10 minutes to process. That means the item will time out and become available to each of the consuming services in turn, while another is still processing it. (I also call this the “Titanic” effect, where a safety measure actually makes the system more vulnerable to certain problems.)

The problem, of course, is that every single instance of your XML processing application will eventually be "poisoned" by the long-running item. In the best case, one of the applications eventually completes and removes the poison item from the SQS queue. Even in this case, however, your applications are doing a lot of duplicate work.
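A back-of-the-envelope calculation shows how much duplicate work this can cause. Assuming an idle consumer is always available to pick the item up each time it reappears, the numbers here are just illustrative:

```python
import math

visibility_timeout = 30      # seconds (the SQS default)
processing_time = 600        # a 10-minute "poison" item

# Each time the timeout expires before anyone finishes, the item
# becomes visible again and another idle consumer picks it up.
# Rough upper bound on deliveries before the first consumer completes:
deliveries = math.ceil(processing_time / visibility_timeout)
print(deliveries)  # → 20
```

So a single slow item can be handed out roughly twenty times, tying up twenty consumer slots on work that only needed to be done once.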

So, how can you try to cope with this?
I need a timeout by Ruth Tsang
A Longer Timeout
The first workaround, of course, is to bump up the visibility timeout beyond the default 30 seconds. This is relatively simple to do and can eliminate the poison item problem altogether. Exactly what timeout to pick is, of course, very much dependent on your application (if it is very high - in the hours - then maybe you need to break your processing into smaller steps?) But you still need to have a timeout, to cope with the legitimate problem of a crashed system.

One rule of thumb is to set the timeout at the mean processing time plus two standard deviations. That way, if your processing times follow a "bell curve" (more formally, a normal distribution), your timeout will cover almost 98% of the situations your application will encounter. But you also need to weigh this against having too many items in the queue sit invisible, since that might mislead you into thinking your processing is complete when it isn't. Or, if you're using autoscaling, it might result in winding down servers too quickly (since legitimate items might be invisible for too long).
Jude'll Fix It no. 103 by Derek Davalos
Fix It!
The other workaround is to try to "fix" the reason for the lengthy processing time. Of course, this is extremely dependent on why the times are variable in the first place. In my situation, my XML application assembles smaller documents into larger documents and runs them through an XSLT transform. Since the number of subdocuments can vary considerably, the processing time varies just as much (if not more). So, my "fix" was to cap the total number of subdocuments at a reasonable upper limit. This kind of limit might not work for you (and is certainly a workaround rather than a true fix).

Potatoes-Kipfler-Heat affected harvest 2040gram by graibeard
What else can you do to work around the "poison item" problem?

Monday, July 1, 2013

JSON, Rights and Linked Data for News: the Latest IPTC Meeting

What is the best way to represent news using JSON? How can publishers convey rights metadata, to make automatic publishing more efficient? What role does linked data play in improving the production and consumption of news?

In June 2013, publishers from around the world gathered at the IPTC face-to-face meeting in Paris - graciously hosted by the AFP - to discuss these and other topics.
News in JSON
JSON is a lightweight format which continues to gain in popularity and tool support. The IPTC has therefore undertaken an effort to define the best way to represent key news properties using this technology. Although we could automatically translate from one of IPTC's existing XML standards into JSON, we believe it is better to create a spec for news properties in a way that results in more "natural" JSON. Find out what we're proposing via my latest News in JSON slides, which discuss the current NINJS draft.
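To give a flavor of what "natural" JSON for news looks like, here is an illustrative item. The property names follow the style of the NINJS draft, but the example is mine and the values are invented, so treat it as a sketch rather than the normative spec:

```json
{
  "uri": "http://example.com/news/item/1234",
  "type": "text",
  "versioncreated": "2013-06-20T09:00:00Z",
  "headline": "IPTC meets in Paris",
  "byline": "A. Reporter",
  "body_text": "Publishers gathered to discuss JSON, rights and linked data for news."
}
```

The point is that a developer can consume this directly with any JSON library, with no knowledge of the IPTC's XML standards required.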

Rights Metadata
There is a lot of interest amongst publishers of all sizes in expressing rights metadata, particularly for photos. Presently, most publishers need to have editors read the notes associated with each photo, to find out if there are any restrictions they should observe. The promise of machine-readable rights metadata is to make this a more efficient process: for example, automatically flagging the cases where an editor needs to decide whether a particular piece of content can be used, rather than examining every photo just in case.

I've been leading an effort within the IPTC to create RightsML specifically to support publishing industry requirements, based on ODRL, the general-purpose framework from the W3C Community Group. In March 2013, we organized a one-day conference with representatives from publishers, news agencies, photographers' trade associations, law firms and standards bodies to examine the question: how can technology help assert and protect the rights of content creators? Find out more about that discussion, including video of the presentations. We are now focused on driving adoption of the RightsML standard, with better documentation and examples.
photographer by liz west
Embedding Rights in Photos
ODRL - and hence RightsML - has a data model with a well-defined representation in XML. However, many producers of photos would rather have the rights expressed inside the binaries themselves, rather than - or in addition to - in an XML "sidecar". (That's in part because many photo workflows discard any other files and just work with the photos themselves.) The challenge is what format to use. Our experiments with embedding either RDF or XML within XMP didn't work. So, now we're looking at expressing ODRL in JSON.

NewsML-G2
Currently, NewsML-G2 is the IPTC's flagship standard for news exchange. It continues to evolve, with a full production release each year (and "developer" releases in between). At this meeting, APA unveiled a new Perl library to make it easier to produce G2. And we learnt about a major effort to compare the details of NewsML-G2 production by major providers, with the goal of harmonizing them, to make it easier for our customers.
Harmony Roof Sculpture - Opera Garnier - Palais Opera by ell brown
Linked Data and the Semantic Web
We heard about interesting progress from the BBC on using rNews and the Storyline Ontology to enrich the presentation of news on the web. The AFP's medialab showed innovative news prototypes, also leveraging rNews, including one for extracting quotes by politicians on various topics. And we at the AP discussed some of the details behind our Metadata Services offering.

Join Us
As well as updates on standards and practical sharing of industry information, the once-a-year AGM is when the IPTC updates its policies. At the Paris meeting, we decided to add an additional level of membership: now individuals can join the group, making it easier for people to participate in the standards used by the news industry and allowing them to attend these kinds of face-to-face meetings. Contact the IPTC Office for more information.