SyntaxHighlighter

SyntaxHighlighter

Monday, July 22, 2013

AWS, SQS and Poisonous Items

I've been using Amazon Web Services to handle large scale processing of content. One handy AWS is SQS, the Simple Queue Service. This is great, since it allows you to decouple (and hence scale) your processing. (There are other advantages besides). However, I've encountered a problem with queues that I've nicknamed "The Poisonous Item". I thought I would share it, together with a couple of workarounds (but not solutions) for handling it.
Q is for Queue by Darren Tunnicliff
http://www.flickr.com/photos/darrentunnicliff/3717976312/

A Typical Architecture Using SQS
Let's pretend that you have a system that uses SQS to process XML documents and update them to your Amazon S3 storage. The processing of each XML document can take a variable amount of time (e.g. depending on the size or complexity of the document), so you decide you want to make it scale nicely. You therefore create an application that writes the document ids to an SQS queue and then you create a second application that performs the processing on each document and writes the result to S3. Because each document can be handled independently from all of the others, you can therefore run as many instances of the document processing application as you like, in parallel, fed by your SQS queue. This can be pictured as in the diagram below.
Visibility Timeout
The SQS queue has a visibility timeout, with a default of 30 seconds. What this means is that when a service fetches something from the queue, the item is hidden for a period of time, to allow for the service to handle it. If all goes well, then the service deletes the original item from the queue. However, if the service crashes, the item becomes visible on the queue again (because the visibility timeout expires). This is all a good thing, since it means that your service is reliable in the face of problems (like an instance of your XML processing application crashing).
Poison by Thorius
http://www.flickr.com/photos/thorius/288024760/
Poison Items
The “poison” item scenario is as follows: there’s some problem with one particular item, e.g. it is really huge and takes, say, 10 minutes to process. That means that the item will timeout and become available to each of the consuming services in turn, whilst the others are still processing it. (I also call this the “Titanic” effect where a safety measure actually makes something more vulnerable to certain issues).

The problem, of course, is that every single instance of your XML processing application will eventually be "poisoned" by the long-running item. In the best case, one of the applications eventually completes and removes the poison item from the SQS queue. Even in this case, however, your applications are doing a lot of duplicate work.

So, how can you try to cope with this?
I need a timeout by Ruth Tsang
http://www.flickr.com/photos/ruthtsang/7247429542/
A Longer Timeout
The first workaround, of course, is to bump up the visibility timeout to be higher than the default 30 seconds. This is relatively simple to do and can completely eliminate the poison item problem altogether. Exactly what level to pick for your timeout is, of course, very much dependent on your application (if it is very high - in the hours - then maybe you need to break your processing into smaller steps?) But you still need to have a timeout, to cope with the legitimate problem of a crashed system.

One rule of thumb is to estimate two standard deviations for the range of processing times. That way, if your processing times conform to a "bell curve" (more formally, a normal distribution) then your timeout will cover almost 98% of the situations your application will encounter. But you also need to weigh this against having too many items in the queue be invisible, since it might mislead you into thinking your processing is complete, when it isn't. Or, if you're using autoscaling, it might result in winding down servers too quickly (since legitimate items might be invisible too long).
Jude'll Fix It no. 103 by Derek Davalos
http://www.flickr.com/photos/derekdavalos/9203747318/
Fix It!
The other workaround is to try to "fix" the reason for the lengthy processing time. Of course, this is extremely dependent on why the times are variable in the first place. In my situation, my XML application assembles smaller documents into larger documents and runs them through an XSLT. Since the number of subdocuments can vary considerably, the processing time varies just as much (if not more so). So, my "fix" was to limit the total number of subdocuments with a reasonable upper limit.  This kind of limit might not work for you (and is certainly a workaround).

Potatoes-Kipfler-HeatAffectedHarvest-928 8-2040gram

Potatoes-Kipfler-Heat affected harvest 2040gram by graibeard
http://www.flickr.com/photos/graibeard/4121218392/
Suggestions?
What else can you do to work around the "poison item" problem?

5 comments:

  1. I'm not sure how the processing could be checked from outside, other than by having the process itself periodically set some completion value, not unlike a status counter bar we see all the time when software is loading (to keep the human attention span from timing out) -- the timeout in this situation (as with our push service) isn't really measuring the right thing, it intends to trap stopped processes, but is instead trapping only time since inception.

    the solution is to have the called threads able to report feedback in the status, resetting the clock on every iteration of some feature. it's not possible with 3rd party libraries unless they are modified, so it shows a missing use-case with potentially long-running function calls such as an xsl transform or HTTP PUT; the libraries need to have a facility to load in a call-back status function that is given a user-defined function to call on each iteration of the main, eg on entering each new template or on initiating each 1500 byte transfer block.

    so the alternative solution is a watchdog thread to graft in this facility using timer interrupts: extend the built-in timeout to cover the 100% case, but modify the processes that we can control to notify the wait with an interrupt signal when our called process completes or when the called process fails to report an update to our watchdog within a 'reasonable' time, and throw our own exception in the latter case.

    ReplyDelete
  2. perhaps a simpler solution would be to use a heuristic: before making the connection or submitting the XML for transform, we modify the timeout value proportional to the size of the package!

    ReplyDelete
    Replies
    1. Yes - in fact, right after i published my description of the problem and potential workarounds - a simpler solution struck me: detect whether a particular processing job is going to be large. (In my case, I am assembling smaller XML documents into one big document which is going to be transformed via XSLT. I've decided that once I have more than 250 subdocuments, the processing is going to take too long, i.e. become poisonous). So, my solution is to pass these known long jobs to another queue, which can have a longer timeout and may otherwise do special processing to cope. I think that breaking the work up into further subtypes, using more queues, is simpler than using threads (as you point out).

      Delete
  3. mrG's suggestions are good, I think. If you are not concerned that processing would ever fail you could just use a very high visibility timeout covering all cases and have the responsible process delete the item from the queue when it is finished. If you were concerned that things might from time to time go wrong you could record the fact that you were initiating processing for a job and have every process check that record to see if a job that is visible had previously been taken on by another process. If it had you could log that and take appropriate action, which might include re-trying the job or taking it off the queue and sending an alert of a possible processing failure.

    ReplyDelete
    Replies
    1. Bumping up the timeout value helps a bit (certainly, it cuts back on too many poisonous items) but there are still going to be situations where processing fails. For example, if you're using EC2 spot instances to handle your processing, your instances might disappear all of a sudden. I think that, in general, it is best to assume that failures will happen and that you need to be able to handle things gracefully. Logging (and implicitly locking) a particular item implies some kind of centralized registry, which still runs the risk of breaking when a particular instance of processing fails.

      Another alternative that occurred to me later (in addition to introducing another queue for big processing as described above) included breaking the processing down into smaller pieces, e.g. having processes that do the XML transformation for each individual subdocument and then another process that simply assembles the results. That shouldn't suffer as much from the problem of really long processing times. And perhaps it could be coordinated via AWS's Simple Workflow http://aws.amazon.com/swf/

      Delete