I've been using Amazon Web Services to handle large-scale processing of content. One handy AWS service is SQS, the Simple Queue Service. It's great because it allows you to decouple (and hence scale) your processing. (There are other advantages besides.) However, I've encountered a problem with queues that I've nicknamed "the poisonous item". I thought I would share it, together with a couple of workarounds (but not solutions) for handling it.
Q is for Queue by Darren Tunnicliff (http://www.flickr.com/photos/darrentunnicliff/3717976312/)
A Typical Architecture Using SQS
Let's pretend that you have a system that uses SQS to process XML documents and upload the results to Amazon S3. The processing of each XML document can take a variable amount of time (e.g. depending on the size or complexity of the document), so you want the system to scale nicely. You therefore create one application that writes the document ids to an SQS queue, and a second application that performs the processing on each document and writes the result to S3. Because each document can be handled independently of all the others, you can run as many instances of the document-processing application as you like, in parallel, fed by your SQS queue. This can be pictured as in the diagram below, and a rough code sketch follows.
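The producer half of that picture might look roughly like the following (a minimal sketch using boto3, the AWS SDK for Python; the queue URL and the enqueue_documents helper are illustrative, not taken from the real system):

```python
import boto3

sqs = boto3.client("sqs")

# Illustrative queue URL; substitute your own.
DOC_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/doc-queue"

def enqueue_documents(document_ids):
    """Write each document id to the SQS queue for the workers to pick up."""
    for doc_id in document_ids:
        sqs.send_message(QueueUrl=DOC_QUEUE_URL, MessageBody=doc_id)
```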
Visibility Timeout
The SQS queue has a visibility timeout, with a default of 30 seconds. What this means is that when a service fetches something from the queue, the item is hidden for a period of time, to give the service time to handle it. If all goes well, the service then deletes the original item from the queue. However, if the service crashes, the item becomes visible on the queue again (because the visibility timeout expires). This is all a good thing, since it means that your service is reliable in the face of problems (like an instance of your XML processing application crashing).
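To make the timeout behaviour concrete, here is a sketch of the consumer side (boto3 again; process_document and the queue URL are placeholders). The received message stays hidden for the visibility timeout while we work on it, and is only deleted after successful processing; if the worker crashes before the delete, the item reappears for another worker:

```python
import boto3

sqs = boto3.client("sqs")
DOC_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/doc-queue"

def process_document(doc_id):
    ...  # fetch the document, transform it, write the result to S3 (placeholder)

while True:
    resp = sqs.receive_message(QueueUrl=DOC_QUEUE_URL,
                               MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)  # long polling
    for msg in resp.get("Messages", []):
        # From here the item is invisible to other consumers until the
        # visibility timeout (default 30 seconds) expires.
        process_document(msg["Body"])
        # Only a successful run deletes the item; if we crash before this
        # line, the item becomes visible again and another worker retries it.
        sqs.delete_message(QueueUrl=DOC_QUEUE_URL,
                           ReceiptHandle=msg["ReceiptHandle"])
```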
Poison by Thorius (http://www.flickr.com/photos/thorius/288024760/)
Poison Items
The “poison” item scenario is as follows: there’s some problem with one particular item, e.g. it is really huge and takes, say, 10 minutes to process. Because that is far longer than the visibility timeout, the item times out and becomes visible again while it is still being processed, so it is picked up by each of the consuming services in turn. (I also call this the “Titanic” effect, where a safety measure actually makes something more vulnerable to certain problems.)
The problem, of course, is that every instance of your XML processing application will eventually be "poisoned" by the long-running item. In the best case, one of the instances eventually completes and removes the poison item from the SQS queue, but even then your applications have done a lot of duplicate work.
So, how can you try to cope with this?
I need a timeout by Ruth Tsang (http://www.flickr.com/photos/ruthtsang/7247429542/)
A Longer Timeout
The first workaround, of course, is to bump up the visibility timeout above the default 30 seconds. This is relatively simple to do and can eliminate the poison item problem altogether. Exactly what value to pick is, of course, very much dependent on your application (if it is very high, in the hours, then maybe you need to break your processing into smaller steps?). But you still need a timeout, to cope with the legitimate problem of a crashed system.
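In practice this is a one-line queue attribute, or a per-receive override (a sketch with the same illustrative queue URL as before; both calls are standard boto3):

```python
import boto3

sqs = boto3.client("sqs")
DOC_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/doc-queue"

# Raise the queue-wide default visibility timeout from 30s to 10 minutes.
sqs.set_queue_attributes(QueueUrl=DOC_QUEUE_URL,
                         Attributes={"VisibilityTimeout": "600"})

# Or override the timeout just for the messages returned by one receive call.
resp = sqs.receive_message(QueueUrl=DOC_QUEUE_URL,
                           MaxNumberOfMessages=1,
                           VisibilityTimeout=600)
```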
One rule of thumb is to set the timeout at the mean processing time plus two standard deviations. That way, if your processing times conform to a "bell curve" (more formally, a normal distribution), your timeout will cover almost 98% of the situations your application will encounter. But you also need to weigh this against having too many items in the queue invisible at once, since that might mislead you into thinking your processing is complete when it isn't. Or, if you're using autoscaling, it might cause you to wind down servers too quickly (since legitimate items stay invisible for too long, the queue looks emptier than it really is).
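As a rough illustration of that rule of thumb (the sample timings here are invented), you could derive a timeout from measured processing times:

```python
import statistics

# Hypothetical per-document processing times in seconds, gathered from logs.
timings = [12.0, 18.5, 22.1, 15.3, 30.2, 19.8, 25.4, 14.1]

# Mean plus two standard deviations covers ~98% of a normal distribution.
visibility_timeout = int(statistics.mean(timings) + 2 * statistics.stdev(timings))
print(f"Suggested visibility timeout: {visibility_timeout}s")
```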
Jude'll Fix It no. 103 by Derek Davalos (http://www.flickr.com/photos/derekdavalos/9203747318/)
Fix It!
The other workaround is to try to "fix" the reason for the lengthy processing time. Of course, this is extremely dependent on why the times vary in the first place. In my situation, my XML application assembles smaller documents into larger ones and runs them through an XSLT transform. Since the number of subdocuments can vary considerably, the processing time varies just as much (if not more so). So my "fix" was to cap the total number of subdocuments at a reasonable upper limit. That kind of cap might not work for you (and it is certainly a workaround rather than a solution).
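In outline, the cap is just a guard before assembly (MAX_SUBDOCS and assemble are made-up names; the real code is more involved):

```python
# Illustrative cap; MAX_SUBDOCS is an assumed tuning knob, not from the post.
MAX_SUBDOCS = 200

def assemble(subdocs):
    # Cap the batch so one oversized document can't outlive the
    # visibility timeout and poison the queue.
    capped = subdocs[:MAX_SUBDOCS]
    ...  # combine the capped batch and run it through the XSLT (placeholder)
```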
Potatoes - Kipfler - Heat affected harvest 2040gram by graibeard (http://www.flickr.com/photos/graibeard/4121218392/)
Suggestions?
What else can you do to work around the "poison item" problem?