Michael Platt's WebLog

Architectural Infrastructure and Reliability

I really like architecting highly available systems; I think they are the most architecturally demanding. In fact, serious performance and scalability issues often result in poor reliability or availability. The standard way to build a highly available system is to cluster; however, I prefer to start with a queued approach and then cluster as appropriate. The nice thing about this approach is that if something does fail, the system will restart without data or transaction loss.

To give an example of this, the European Football system I talked about earlier was built with a transactional application, but of course the infrastructure it ran on was critical to the reliability of the system. As I mentioned, there were about 10 servers per city and 10 cities in the system. Each server had dual LANs and was clustered. Additionally, there were dual WANs to each city, and the central system (a Digital Alpha) was also clustered with a hot standby (it was a hub-and-spoke system).

Basically everything was duplicated everywhere, so when I insisted on queues between all the systems the design team thought I was mad. I made lots of threats about what would happen to them if the system failed, and eventually they included queues between all the servers.

On the day of the cup final, when the systems had to be running, it was very stormy (nothing unusual for UK weather); however, we were more interested in the system performance than the match conditions.

We were monitoring from the central hub and the systems were all running perfectly when we suddenly lost all the communications links! Panic! We picked up the phone, but that was out too. Using a mobile phone we got through to the telecom carrier, and it transpired that a lightning strike on the phone carrier’s exchange in the central city had taken out all their lines. I didn’t have a clustered exchange! I was so glad we had included the queues, as the remote cities continued to process independently, writing the transactions to the queues. Things were a bit tense as we phoned around the cities, checking the queue lengths and increasing the queue capacity where necessary.
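To make the idea concrete, here is a minimal store-and-forward sketch of what each remote city was doing. It is an illustration only, not the code from the football system: the names StoreAndForward and Link are hypothetical, and the queue here lives in memory, whereas a real one would need to be persisted transactionally to survive a restart.

```python
from collections import deque


class StoreAndForward:
    """Hypothetical outbox for one remote site: queue locally, forward when the link is up."""

    def __init__(self, send):
        self.send = send        # callable that raises ConnectionError while the link is down
        self.pending = deque()  # transactions not yet delivered to the hub

    def submit(self, txn):
        # Always queue first, so local processing never depends on the link.
        self.pending.append(txn)
        self.flush()

    def flush(self):
        # Forward in order; stop at the first failure and keep the rest queued.
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                return False
            self.pending.popleft()
        return True


class Link:
    """Stand-in for the WAN link to the hub (an assumption for this sketch)."""

    def __init__(self):
        self.up = True
        self.delivered = []

    def send(self, txn):
        if not self.up:
            raise ConnectionError("link down")
        self.delivered.append(txn)
```

While the link is down, submit() still succeeds and pending simply grows, which is the queue length we were phoning around to check. Once the link returns, a single flush() delivers the backlog in its original order.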

After a few minutes the exchange systems came back online and the queues started to flush. Immediately the central server went to 100% and stayed there for a couple of minutes whilst the queues all cleared down, teaching me very quickly the need for throttling in queued systems. Luckily the central server kept running and the system then settled down.
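One simple form of throttling is to drain a backlog in capped batches with a pause between them, rather than all at once. The sketch below shows that idea under stated assumptions: the function names, the batch size and the pause callback are all hypothetical and are not taken from the original system.

```python
from collections import deque


def drain_throttled(backlog, handle, max_per_tick):
    """Process at most max_per_tick queued items and return how many were handled."""
    handled = 0
    while backlog and handled < max_per_tick:
        handle(backlog.popleft())
        handled += 1
    return handled


def drain_all(backlog, handle, max_per_tick, pause):
    """Drain the whole backlog in capped batches, calling pause() between batches."""
    ticks = 0
    while backlog:
        drain_throttled(backlog, handle, max_per_tick)
        ticks += 1
        if backlog:
            # Give the receiving server some breathing room before the next batch.
            pause()
    return ticks
```

In practice pause might be something like lambda: time.sleep(0.1), and max_per_tick would be tuned to what the central server can absorb. The trade-off is that the backlog takes longer to clear, but the hub is never hit with every queue flushing at full speed.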

So the nice thing about using queues is that they will keep your system running even when something unforeseen happens but, as is the case in most failure analysis, it’s the recovery that is the most difficult part of a high-availability system. Needless to say, I am a great fan of messaging and queues!

Comments
  • Interesting story, but it raises an obvious point: most architects rose from the ranks of developers and have little to no infrastructure background. It raises the question of how much architects need to be aware of the underlying communications and infrastructure, which is so often taken for granted.

  • A couple of things. First, a lot of older architects (e.g. me) come from a systems analysis rather than a programming background, which is a slightly different focus. I worry a lot about the new breed of developer architects; I am not sure they have the breadth of experience required for architecting big systems, hence this blog and your question.

  • To follow on, someone (!) told me that I shouldn’t have architected with a single point of failure (the exchange). True, mea culpa; I am alas all too fallible, which is why I use queues.
    Also, the skill requirement for architecture, and how that is or is not being met, is a big topic I would like to address separately.