Making use of queueing in your architecture

by Andrew Boag

Catalyst has had a lot to do with all manner of integration solutions, and our focus on using open source technologies has given us the flexibility to integrate with even the most inflexible external system. Whether it’s web services, direct database access, a bespoke protocol or even parsing a CSV file, we’ve learnt a lot about how to package sensible and robust integration solutions.

Over the last two years in the Sydney team we have been doing a lot of work with large AWS application stacks, comprising a number of independent, integrated applications. As part of these stacks, we have made use of the SQS queuing-as-a-service solution (http://aws.amazon.com/sqs/) as well as some open source equivalents such as RabbitMQ and Redis.

Employing a queue has some really nifty advantages.

Placing a queue between applications

In one of our recent projects we glued together a CRM (Customer Relationship Management) system with an LMS (Learning Management System). After an event in the CRM, a course enrolment needs to be triggered in the LMS. Often, one would do this via a web service call from the CRM to the LMS: “Please enrol John Smith in Basket Weaving 101” … Generally, this works just fine.

But if the LMS has been taken down for an upgrade when the CRM makes the web service call, then the web service call may fail and the enrolment will not propagate into the LMS. Not good.

We put an SQS queueing layer between the CRM and the LMS. Instead of calling a web service, the CRM puts the enrolment request onto a queue, which the LMS regularly checks and processes.

In the case of an LMS upgrade or outage, the queue will grow until the LMS comes back online and starts processing the queue. Easy!

In our case, most of the time the queue is empty or has a small number of records.
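
To make the pattern concrete, here is a minimal sketch in Python. It is not our production code: an in-memory deque stands in for SQS so the example runs anywhere, and the function names, the lms_online flag and the second student are all made up for illustration.

```python
import json
from collections import deque

# Hypothetical in-memory stand-in for the queue; a real stack would use SQS,
# RabbitMQ or similar, which also keeps messages safe across restarts.
enrolment_queue = deque()

def crm_request_enrolment(student, course):
    """CRM side: put the request on the queue and return straight away."""
    enrolment_queue.append(json.dumps({"student": student, "course": course}))

def lms_process_queue(lms_online, enrolled):
    """LMS side: run on a schedule; only drains the queue while the LMS is up."""
    processed = 0
    while lms_online and enrolment_queue:
        message = json.loads(enrolment_queue.popleft())
        enrolled.append((message["student"], message["course"]))
        processed += 1
    return processed

enrolled = []
crm_request_enrolment("John Smith", "Basket Weaving 101")
crm_request_enrolment("Jane Doe", "Basket Weaving 101")  # illustrative second student

# LMS is down for an upgrade: nothing is processed, but nothing is lost either.
lms_process_queue(lms_online=False, enrolled=enrolled)
backlog_during_outage = len(enrolment_queue)

# LMS comes back online and catches up on the backlog.
lms_process_queue(lms_online=True, enrolled=enrolled)
print(backlog_during_outage, len(enrolment_queue), enrolled)
```

The CRM never needs to know whether the LMS is up. It only needs the queue to be available.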

Some of the advantages of this approach for us were:

There is less cleanup required after an unplanned outage in the LMS when the CRM may still have been online. None of the enrolment requests fail or are lost.
Testing the integration process from the CRM during development was easier as we could create test cases defining what the queue state should be at the end of an operation. This was often easier than having to get a staging CRM talking to a staging LMS.
Testing the LMS enrolment processing also just meant loading a “pre-cooked” queue and watching the LMS process the items.
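
As a rough illustration of the two testing points above, the sketch below tests each side against the queue alone. The crm_on_event and lms_drain functions, the event types and the message shape are hypothetical stand-ins, not our actual integration code.

```python
from collections import deque

def crm_on_event(queue, event):
    """Hypothetical CRM hook: turn a qualifying CRM event into an enrolment request."""
    if event["type"] == "course_purchased":
        queue.append({"student": event["student"], "course": event["course"]})

def lms_drain(queue):
    """Hypothetical LMS worker: process every queued request and return the enrolments."""
    enrolments = []
    while queue:
        item = queue.popleft()
        enrolments.append((item["student"], item["course"]))
    return enrolments

# CRM-side test: assert on the final queue state, with no LMS involved.
crm_queue = deque()
crm_on_event(crm_queue, {"type": "course_purchased",
                         "student": "John Smith",
                         "course": "Basket Weaving 101"})
crm_on_event(crm_queue, {"type": "contact_updated", "student": "John Smith"})
assert list(crm_queue) == [{"student": "John Smith", "course": "Basket Weaving 101"}]

# LMS-side test: load a "pre-cooked" queue, with no CRM involved.
precooked = deque([{"student": "John Smith", "course": "Basket Weaving 101"}])
lms_result = lms_drain(precooked)
assert lms_result == [("John Smith", "Basket Weaving 101")]
assert not precooked  # the queue should be empty afterwards
```

Because the queue is the only shared contract, each side can be tested without standing up the other.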

Obviously, not all integrations will suit this model. We still have to make some web service calls between the systems when real-time communication is required, or when the caller needs a status value returned by the call.

Still, we will definitely be looking for ways to apply these lessons to other problems.

Using queues for spike load tolerance

Another great use of queues is as a means to handle large spikes in activity or load. The following is a contrived example.

A Telco might have an SMS vote aggregation application that can process a maximum of 1,000 incoming SMS messages per second.

But during a popular episode of Australia’s Got Talent, where large numbers of viewers are submitting votes to keep their favourite singer around till next week, there might be burst periods of time with up to 10,000 messages per second. This could mean votes get lost and/or our vote counting application completely collapses under the load.

Queueing solutions are usually easy to scale horizontally (not always the case with relational databases), so we can increase throughput capacity by load balancing across multiple queueing nodes. We can also scale vertically by giving the queueing server more compute resources.

So in the case of our SMS vote aggregation problem, placing a queue in front of the message processor means that we can let the queue swell for a brief period of high-vote activity. With 10,000 messages arriving per second and 1,000 being processed, it grows by up to 9,000 messages per second. Once the incoming rate falls below 1,000 messages per second, the queue processor will start eating away at the backlog and will eventually give an accurate aggregation of the results. All without a single outage!
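
The arithmetic is easy to check with a small simulation. The sketch below is contrived like the example itself: the five-second burst and the 500 messages per second afterwards are assumed figures, and simulate_backlog is a hypothetical helper rather than part of any queueing product.

```python
def simulate_backlog(arrivals_per_second, capacity_per_second):
    """Return the queue depth at the end of each second."""
    backlog = 0
    depths = []
    for arrivals in arrivals_per_second:
        # The processor removes at most capacity_per_second messages each second.
        backlog = max(0, backlog + arrivals - capacity_per_second)
        depths.append(backlog)
    return depths

# Assumed traffic: a 5 second burst at 10,000 msg/s, then 500 msg/s afterwards.
burst_seconds = 5
traffic = [10_000] * burst_seconds + [500] * 100
depths = simulate_backlog(traffic, capacity_per_second=1_000)

peak_backlog = max(depths)
first_empty_second = depths.index(0) + 1  # seconds elapsed when the queue first empties
seconds_to_drain = first_empty_second - burst_seconds
print(peak_backlog, seconds_to_drain)
```

With these assumed numbers the queue peaks at 45,000 messages and takes 90 seconds to drain after the burst ends, which is the latency we trade for not dropping any votes.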

Obviously, this means that there might be some latency to get the results … but we have a more robust and reliable system which will always be responsive and online, even during periods of burst activity.

I’m sure there are lots of other ways that we’ll find to apply queueing to the solutions that we provide our clients.

Hope this is helpful to someone out there.