Friday, October 4, 2013

iCloud and the Kindness of Strangers

Yesterday, I lost my phone.

Don't Worry - This Story has a Happy Ending
After emerging from a rehearsal for the forthcoming production of Spamalot, I dropped my phone. By the time I made it to my car, I realized it must have fallen somewhere between there and the rehearsal room. I hunted everywhere for it, and kind friends helped me search too, but to no avail. It was gone!

Gradually, the full magnitude of the situation hit me. I rely on my phone to help me navigate my life, in every sense. Then I remembered that I had signed up for "Find My iPhone". I'd never tried it or even investigated it. But it seemed worth a shot.

Find Your Phone, Find Your Phone!
I drove home from the theatre, dejected. Then I did some research and followed the quick and easy steps:

1. I logged into iCloud, which I'd never done before
2. I clicked on the friendly radar style symbol for "Find My iPhone"
3. I picked my iPhone from the list of various Apple devices I have apparently registered over the years
4. It showed me where my iPhone was! Back at the theatre, which I kind of knew. But it was nice to know it was still there, as of a minute earlier.
5. I then clicked on the "Lost Mode" option and set up a phone number to call and a message to display. I could have set up a passcode, but I have already enabled maximum security on my phone

Almost instantly, someone called me on the number, from my phone! It turns out that someone from another show (Big River) rehearsing at the same theatre had found my phone and turned it in. Soon enough, I was reunited with My Precious...

Lessons Learned
I learnt that I should be a bit more careful with my phone, of course, and just how bereft I felt without it. I also learnt that the Find My iPhone technology is quick, easy and effective. But all that fancy technology would have been worthless without a simple act of kindness by someone who didn't know me and whom I am unlikely to be able to repay.

Although, I could repay them by going to see their show... If you're anywhere near, you should go see Big River before October 13th. And, while you're at it, why not get tickets for Spamalot, too?

But, before you do anything, check you know where your phone is.

Monday, July 22, 2013

AWS, SQS and Poisonous Items

I've been using Amazon Web Services to handle large-scale processing of content. One handy AWS offering is SQS, the Simple Queue Service. This is great, since it allows you to decouple (and hence scale) your processing. (There are other advantages besides.) However, I've encountered a problem with queues that I've nicknamed "The Poisonous Item". I thought I would share it, together with a couple of workarounds (but not solutions) for handling it.
Q is for Queue by Darren Tunnicliff
http://www.flickr.com/photos/darrentunnicliff/3717976312/

A Typical Architecture Using SQS
Let's pretend that you have a system that uses SQS to process XML documents and upload the results to your Amazon S3 storage. The processing of each XML document can take a variable amount of time (e.g. depending on the size or complexity of the document), so you decide you want to make it scale nicely. You therefore create an application that writes the document ids to an SQS queue, and a second application that performs the processing on each document and writes the result to S3. Because each document can be handled independently from all of the others, you can run as many instances of the document processing application as you like, in parallel, fed by your SQS queue. This can be pictured as in the diagram below.
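Just to make that concrete, here's a rough sketch of the producer side. I'm using the boto3 Python SDK purely for illustration, and the queue name and document ids are invented:

import boto3

# Hypothetical queue name and document ids, purely for illustration
QUEUE_NAME = "xml-documents-to-process"
document_ids = ["doc-0001", "doc-0002", "doc-0003"]

sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName=QUEUE_NAME)

# The producer just writes one message per document id;
# the consumers can then be scaled out independently
for doc_id in document_ids:
    queue.send_message(MessageBody=doc_id)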
Visibility Timeout
The SQS queue has a visibility timeout, with a default of 30 seconds. What this means is that when a service fetches something from the queue, the item is hidden for a period of time, to allow for the service to handle it. If all goes well, then the service deletes the original item from the queue. However, if the service crashes, the item becomes visible on the queue again (because the visibility timeout expires). This is all a good thing, since it means that your service is reliable in the face of problems (like an instance of your XML processing application crashing).
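Here's a similarly hedged sketch of the consumer loop (again boto3, with process_document standing in for the real XML work and the bucket name invented). The key point is that the message is only deleted once the processing has succeeded:

import boto3

QUEUE_NAME = "xml-documents-to-process"   # same hypothetical queue as above
BUCKET = "my-processed-documents"         # hypothetical S3 bucket

sqs = boto3.resource("sqs")
s3 = boto3.resource("s3")
queue = sqs.get_queue_by_name(QueueName=QUEUE_NAME)

def process_document(doc_id):
    # Stand-in for the real XML processing
    return "<processed>%s</processed>" % doc_id

while True:
    for message in queue.receive_messages(MaxNumberOfMessages=1, WaitTimeSeconds=20):
        result = process_document(message.body)
        s3.Object(BUCKET, message.body + ".xml").put(Body=result.encode("utf-8"))
        # Only delete once processing has succeeded; if this instance crashes
        # first, the visibility timeout expires and the item reappears on the queue
        message.delete()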
Poison by Thorius
http://www.flickr.com/photos/thorius/288024760/
Poison Items
The “poison” item scenario is as follows: there's some problem with one particular item, e.g. it is really huge and takes, say, 10 minutes to process. That means the item will outlive the visibility timeout and become visible again while the first service is still processing it, so it gets picked up by the next consuming service, and then the next, in turn. (I also call this the “Titanic” effect, where a safety measure actually makes something more vulnerable to certain issues.)

The problem, of course, is that every single instance of your XML processing application will eventually be "poisoned" by the long-running item. In the best case, one of the applications eventually completes and removes the poison item from the SQS queue. Even in this case, however, your applications are doing a lot of duplicate work.

So, how can you try to cope with this?
I need a timeout by Ruth Tsang
http://www.flickr.com/photos/ruthtsang/7247429542/
A Longer Timeout
The first workaround, of course, is to bump up the visibility timeout to something higher than the default 30 seconds. This is relatively simple to do and can eliminate the poison item problem altogether. Exactly what level to pick for your timeout is, of course, very much dependent on your application (if it is very high - in the hours - then maybe you need to break your processing into smaller steps?). But you still need to have a timeout, to cope with the legitimate problem of a crashed system.
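For what it's worth, raising the queue's default is essentially a one-liner (the 300 seconds here is just a placeholder, and again this is boto3 for illustration):

import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="xml-documents-to-process")["QueueUrl"]

# Raise the default visibility timeout from 30 seconds to 5 minutes
sqs.set_queue_attributes(QueueUrl=queue_url,
                         Attributes={"VisibilityTimeout": "300"})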

One rule of thumb is to set the timeout at roughly the mean processing time plus two standard deviations. That way, if your processing times conform to a "bell curve" (more formally, a normal distribution), your timeout will cover almost 98% of the situations your application will encounter. But you also need to weigh this against having too many items in the queue be invisible, since that might mislead you into thinking your processing is complete when it isn't. Or, if you're using autoscaling, it might result in winding down servers too quickly (since legitimate items might be invisible for too long).
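As a back-of-the-envelope illustration of that rule of thumb (the sample timings are made up):

import statistics

# Made-up sample of observed processing times, in seconds
processing_times = [22, 25, 31, 28, 40, 35, 27, 30]

mean = statistics.mean(processing_times)
stdev = statistics.stdev(processing_times)

# Mean plus two standard deviations covers roughly 98% of a normal distribution
suggested_timeout = int(mean + 2 * stdev)
print("Suggested visibility timeout: %d seconds" % suggested_timeout)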
Jude'll Fix It no. 103 by Derek Davalos
http://www.flickr.com/photos/derekdavalos/9203747318/
Fix It!
The other workaround is to try to "fix" the reason for the lengthy processing time. Of course, this is extremely dependent on why the times are variable in the first place. In my situation, my XML application assembles smaller documents into larger documents and runs them through an XSLT. Since the number of subdocuments can vary considerably, the processing time varies just as much (if not more so). So, my "fix" was to cap the total number of subdocuments at a reasonable upper limit. This kind of limit might not work for you (and is certainly a workaround rather than a solution).
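In case it helps, the sort of cap I mean is nothing more sophisticated than this (the limit and the function are hypothetical, and how you choose which subdocuments to keep depends entirely on your content):

# Hypothetical upper bound on how many subdocuments get assembled into one
# large document, so that no single queue item takes pathologically long
MAX_SUBDOCS = 200

def select_subdocuments(subdocument_ids, limit=MAX_SUBDOCS):
    # Keep only the first `limit` subdocuments
    return subdocument_ids[:limit]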

Potatoes-Kipfler-Heat affected harvest 2040gram by graibeard
http://www.flickr.com/photos/graibeard/4121218392/
Suggestions?
What else can you do to work around the "poison item" problem?

Monday, July 1, 2013

JSON, Rights and Linked Data for News: the Latest IPTC Meeting

What is the best way to represent news using JSON? How can publishers convey rights metadata, to make automatic publishing more efficient? What role does linked data play in improving the production and consumption of news?

In June 2013, publishers from around the world gathered at the IPTC face-to-face meeting in Paris - graciously hosted by the AFP - to discuss these and other topics.
News in JSON
JSON is a lightweight format which continues to gain in popularity and tool support. The IPTC has therefore undertaken an effort to define the best way to represent key news properties using this technology. Although we could automatically translate from one of IPTC's existing XML standards into JSON, we believe it is better to create a spec for news properties in a way that results in more "natural" JSON. Find out what we're proposing via my latest News in JSON slides, which discuss the current NINJS draft.

Rights Metadata
There is a lot of interest amongst publishers of all sizes in expressing rights metadata, particularly for photos. Presently, most publishers need to have editors read the notes associated with each photo, to find out if there are any restrictions they should observe. The promise of machine-readable rights metadata is to make this a more efficient process: for example, to be able to automatically detect when an editor needs to decide whether to use a particular piece of content, rather than examining every photo, just in case.

I've been leading an effort within the IPTC to create RightsML specifically to support publishing industry requirements, based on the general-purpose ODRL framework, a W3C Community Group standard. In March 2013, we organized a one day conference with representatives from publishers, news agencies, photographers' trade associations, law firms and standards bodies to examine the question: how can technology help assert and protect the rights of content creators? Find out more about that discussion, including video of the presentations. We are now focused on driving adoption of the RightsML standard, with better documentation and examples.
photographer by liz west
http://www.flickr.com/photos/calliope/1430290427/
Embedding Rights in Photos
ODRL, and hence RightsML, has a data model with a well-defined representation in XML. However, many producers of photos would prefer the rights to be expressed inside the binaries themselves, rather than - or in addition to - in an XML "sidecar". (That's in part because many photo workflows discard any other files and just work with the photos themselves.) The challenge is what format to use? Our experiments with trying to embed either RDF or XML within XMP didn't work. So, now we're looking at expressing ODRL in JSON.

NewsML-G2
Currently, NewsML-G2 is the IPTC's flagship standard for news exchange. It continues to evolve, with a full production release each year (and "developer" releases in between). At this meeting, APA unveiled a new Perl library to make it easier to produce G2. And we learnt about a major effort to compare the details of NewsML-G2 production by major providers, with the goal of harmonizing them, to make it easier for our customers.
Harmony Roof Sculpture - Opera Garnier - Palais Opera by ell brown
http://www.flickr.com/photos/ell-r-brown/3772502193/
Linked Data and the Semantic Web
We heard about interesting progress from the BBC on using rNews and the Storyline Ontology to enrich the presentation of news on the web. The AFP's medialab showed innovative news prototypes, also leveraging rNews, including one for extracting quotes by politicians on various topics. And we at the AP discussed some of the details behind our Metadata Services offering.

Join Us
As well as updates on standards and practical sharing of industry information, the once-a-year AGM is when the IPTC updates its policies. At the Paris meeting, we decided to add an additional level of membership: now individuals can join the group, making it easier for people to participate in the standards used by the news industry and allowing them to attend these kinds of face-to-face meetings. Contact the IPTC Office for more information.

Wednesday, May 29, 2013

Big Data: the Big Picture

In May 2013, I was invited to be part of a panel at DAM NY 2013 talking about "Big Data". I shared some of what we're doing with Big Data at the Associated Press.



In particular, I've worked on an effort to create a digital archive - all of the text, photo, video, graphics and audio content that AP has ever published - which adds up to hundreds of millions of items. We combine that with our rich taxonomy of people, places, companies, organizations and subjects, available to you as the AP Metadata Services. I explained a bit about how we use that archive of content to provide insight and drive further enrichment, all in support of better products and services.

I feel that the term "Big Data" is a bit of a buzzword that is getting attached to a lot of different efforts and there's some healthy skepticism about the true value of some of what is touted under that particular heading. However, we have really found that there's a lot to be learned from bringing together significant data sets. And, also, that working with large data sets really does require some different techniques.

The World of Big Data in a Single Infographic

The other day, this Wikibon Infographic on Big Data caught my eye. It seems to define the term pretty broadly, but manages to convey some of the key technologies that underpin the field (not just Hadoop) and why there seems to be such an upsurge (a combination of low scaling costs, maturing tools and larger enterprises with a thirst for data-driven strategies).

Worth a glance.

Monday, February 25, 2013

Mining for eBooks

In February 2013, the W3C, in partnership with IDPF and BISG, organized a workshop on eBooks, in conjunction with O'Reilly's TOC. I was invited to speak about AP's and IPTC's experience with implementing permissions and restrictions with machine readable rights. (ePub lets you include DRM statements; it seems that some publishers are using ODRL v1; IPTC have selected ODRL v2 for the foundation of RightsML). It was a great experience being on the panel and I got a lot of thoughtful and interesting questions.
eBook Readers Galore by libraryman
http://www.flickr.com/photos/libraryman/5052936803/
eBook Newbie
I'm a bit of an eBook neophyte. However, I learnt a lot from hearing the other publishers talking about their experiences, hopes and frustrations with this digital publishing mechanism. And it struck me how similar the news industry is to the book industry. In his opening keynote, Bill McCoy talked about the three main ways that publishers deliver books these days: files, apps and websites. Of course, these are also the three main ways that news is delivered today (not to mention dead trees, in both cases, for non-digital publishing). As various other speakers presented at the workshop, they repeatedly used examples from newspapers and magazines (although, sometimes, as illustrations of what *not* to do). And, it has to be said, both book publishers and news publishers are in the same boat of trying to figure out their digital futures.

For more about both the eBook workshop and TOC, I recommend Ivan Herman's reflections.
Evolution of Readers by jblyberg
http://www.flickr.com/photos/jblyberg/4505413539/

Mining for eBooks
Given how easy it is to create and publish an eBook, it would seem that mining a news archive could yield some interesting books. Some news publishers are already conducting experiments with ebooks in this way. For example, the UK's Guardian have a series of Guardian Shorts. (Martin Belam wrote some quite interesting articles about how he worked with the Guardian archive to create ebooks on the Internet and the Olympics). Similarly, Vanity Fair have also started to play with ebooks.

Of course, ebooks aren't the only way to make use of a rich news archive. The New York Times recently launched their TimesMachine, which lets you browse back issues from 1851 to 1922 (“all the news which was fit to print”).

As software continues to eat the world, it will be interesting to see how formerly different kinds of publishers converge and diverge in their attempts to make their digital ways.

Thursday, January 24, 2013

I have been playing a lot with Amazon Web Services. For numerous reasons, I principally like to use these key bits of software in the work I do:

Python
lxml
s3cmd

As a consequence, I find myself repeatedly doing the following steps to bring the Amazon Linux up to scratch for what I need. I thought I would document them here, in case anyone else finds these steps useful. But also as an easy way for me to find it again...

To Just Generally Bring the Server Up To Date
sudo yum update

Amongst other things, this brings you to Python 2.6, which is sufficiently up-to-date for what I need. (By the time you or future me reads this, I suppose it might be more up-to-date than that).

Install lxml
lxml is the best library I've found for working with XML in Python. It is compatible with, but offers lots of nice enhancements beyond, the standard ElementTree, including better support for XPath and built-in support for XSLT processing.
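For instance, here's a tiny example of the XPath and XSLT support that makes lxml so handy (the document and stylesheet are toy examples):

from lxml import etree

doc = etree.fromstring("<items><item id='1'>first</item><item id='2'>second</item></items>")

# XPath support beyond what the standard ElementTree offers
print(doc.xpath("//item[@id='2']/text()"))   # ['second']

# Built-in XSLT processing
stylesheet = etree.fromstring("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <count><xsl:value-of select="count(//item)"/></count>
  </xsl:template>
</xsl:stylesheet>""")
transform = etree.XSLT(stylesheet)
print(str(transform(doc)))   # serialized result, e.g. <count>2</count>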

Based on this very handy blog post, I do the following to install lxml

sudo yum install gcc
sudo yum install python26-devel
sudo yum install libxslt
sudo yum install libxslt-devel
sudo yum install libxml2-devel
sudo easy_install lxml


That last step to install lxml can take a few minutes. But the whole thing typically takes perhaps ten minutes.



Install s3cmd
Since I do a lot of work with Amazon s3, it is handy to have a command line interface to list, get and put files to s3 buckets. I tried following the instructions about how to install s3cmd on the site, but it just wouldn't work. So, now I do this and it works like a charm:

sudo yum --enablerepo epel install s3cmd


And you can run s3cmd --configure to set up and test out the s3 configuration, if you like.
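After that, day-to-day use is just a handful of commands (the bucket and file names here are placeholders):

s3cmd ls s3://my-bucket/
s3cmd put localfile.xml s3://my-bucket/some/path/
s3cmd get s3://my-bucket/some/path/remotefile.xml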

Machine Readable Rights: A One Day Conference in Amsterdam


I'm helping to organize a free, one day conference on 12th March 2013 in Amsterdam, to discuss "Machine Readable Rights and the News Industry".
Sunset Over Amsterdam (Frontpage) by Werner Kunz
http://www.flickr.com/photos/werkunz/4565900446/
We're aiming to bring together the major players who are interested in the topic, across business, legal, editorial and technical groups. We have quite a few people who have signed up already, but it isn't too late to register if you're interested in attending - or speaking.
Coffee cup by Doug88888
http://www.flickr.com/photos/doug88888/2953428679/
The IPTC has organized similar one day conferences before. We've found that the presentations and panel discussions are always thought provoking. And the less formal introductions and discussions that happen over the coffee breaks, lunch time chats and after meeting drinks are at least as important.

If you're interested in machine readable rights, then don't hesitate to sign up!