Interconnected

In the increasingly interconnected micro-services world we’re creating, the saying “a chain is only as strong as its weakest link” is particularly pertinent.

It’s quite easy for a single service to be dependent upon a number of downstream services as the diagram below shows.

An outage or go-slow in any one of the downstream services can have a knock-on impact upstream and right back to users. Measuring this in SLAs: let’s say B, C, D, E and F each aim for an availability SLA of 99.99%. Assuming they all meet it, the best A can achieve is roughly 99.95%, since the availabilities multiply. More realistically, B, C, D, E and F are probably dependent on other services themselves, and before you know it end users are doing well to see anything above 99% uptime.
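As a rough back-of-the-envelope check (a sketch, assuming independent failures and a straight serial dependency), the compound figure is just a multiplication:

```python
# Rough compound availability for a service that depends on several
# downstream services in series (assumes their failures are independent).
downstream_slas = [0.9999] * 5   # B, C, D, E, F each at 99.99%

compound = 1.0
for sla in downstream_slas:
    compound *= sla

print(f"Best case for A: {compound:.4%}")   # ~99.95%
```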

So what strategies do we have for dealing with this?

Firstly, you could just live with it. Really, don’t knock this option. Ask “do I really need the availability?” and “does it really matter if it goes down?”. Before worrying about any elaborate plan to deal with the situation, it’s worth considering whether it is really all that bad.

Ok, so it is… The next question should be “do I need a response immediately?”. If not, go asynchronous and put a queue between the services. If the recipient is down, messages will simply queue up until it comes back – no problem. Just make sure the queue is persistent and as local as possible to the source.
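A minimal sketch of the idea, using a local SQLite file as the persistent queue (the table, the `send` callable and the file name are all illustrative, not a prescribed design):

```python
import sqlite3
import time

# Local, persistent outbox: messages survive restarts and simply queue up
# while the downstream recipient is unavailable.
db = sqlite3.connect("outbox.db")
db.execute("CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY, body TEXT, created REAL)")

def enqueue(body: str) -> None:
    """Called by the source service; returns immediately."""
    db.execute("INSERT INTO outbox (body, created) VALUES (?, ?)", (body, time.time()))
    db.commit()

def drain(send) -> None:
    """Background worker: attempt delivery, delete only on success."""
    for msg_id, body in db.execute("SELECT id, body FROM outbox ORDER BY id").fetchall():
        try:
            send(body)                        # hypothetical downstream call
        except Exception:
            break                             # recipient down – leave messages queued
        db.execute("DELETE FROM outbox WHERE id = ?", (msg_id,))
        db.commit()
```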

If it is a request-response model then consider a queue in any case. A queue can often be configured to time out old messages, which helps deal with slow responses (e.g. if there is no response within 5 seconds, abandon the request). This avoids building up a large backlog of messages waiting to be processed – a backlog that ties consumers up on requests which will never be answered and can block them for far longer than the downstream service is actually unavailable. It can also be more efficient to use a queue-based competing-consumers model than to have multiple connections banging away sporadically.
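A sketch of both ideas in-process (a real broker would give you per-message TTLs and durable queues; the five-second age limit and four workers below are illustrative): each message carries its enqueue time, consumers drop anything too old to still be worth answering, and several consumer threads compete for work from the same queue.

```python
import queue
import threading
import time

MAX_AGE_SECONDS = 5.0
work = queue.Queue()

def submit(request) -> None:
    # Stamp each message so consumers can tell how stale it is.
    work.put((time.time(), request))

def handle(request) -> None:
    time.sleep(0.1)                      # stand-in for the real processing

def consumer(worker_id: int) -> None:
    while True:
        enqueued_at, request = work.get()
        if time.time() - enqueued_at > MAX_AGE_SECONDS:
            # Nobody is waiting for this response any more – drop it rather
            # than tying the consumer up on work that will never be used.
            work.task_done()
            continue
        handle(request)
        work.task_done()

# Competing consumers: several workers pulling from the same queue.
for i in range(4):
    threading.Thread(target=consumer, args=(i,), daemon=True).start()
```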

On top of this, ensure you’re using non-blocking libraries and implement circuit-breakers that trip when downstream services go offline. This of course raises the question, “what sort of response do I provide if the circuit-breaker is open?”… Well, that depends…
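For reference, a minimal circuit-breaker sketch (the thresholds and timings are illustrative; in practice a resilience library would do this for you): after a run of consecutive failures the breaker opens and calls fail fast, and once a cool-off period has passed a single trial call is allowed through to see whether the downstream service has recovered.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open – failing fast")
            # Cool-off elapsed: half-open, allow one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()   # trip (or re-trip) the breaker
            raise
        else:
            self.failures = 0
            self.opened_at = None              # success closes the circuit
            return result
```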

In some cases you can cache previous responses and serve those. If this sort of caching model works then even better: you can decouple the request for content from the call that fetches it from the downstream service, so that you are in effect always serving from cache. Allowing stale cache entries to be served while revalidating, even when downstream services are unavailable, can significantly improve the responsiveness and availability of the system. Don’t discard cached items just because they’re old; keep using them until a fresh copy can be obtained. Size is a concern, but if you can afford it then cache your content for as long as the RTO demands (the service should be back by then, e.g. 4 hours) and revalidate as frequently as the business demands the content be fresh (e.g. every 10 minutes).
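A serve-stale-while-revalidate sketch along those lines (the 10-minute freshness and 4-hour retention windows are the illustrative figures above, `fetch` stands in for whatever downstream call you make, and for brevity it revalidates inline rather than in the background):

```python
import time

FRESH_FOR = 10 * 60          # revalidate entries older than 10 minutes
KEEP_FOR = 4 * 60 * 60       # keep serving stale entries for up to 4 hours (the RTO)

_cache = {}                  # key -> (value, fetched_at)

def get(key, fetch):
    """fetch is a callable that hits the downstream service and may raise."""
    now = time.time()
    entry = _cache.get(key)

    if entry is not None and now - entry[1] < FRESH_FOR:
        return entry[0]                        # fresh enough – serve from cache

    try:
        value = fetch(key)                     # attempt to revalidate
    except Exception:
        if entry is not None and now - entry[1] < KEEP_FOR:
            return entry[0]                    # downstream is down – serve stale
        raise                                  # nothing usable in the cache
    else:
        _cache[key] = (value, now)
        return value
```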

It may sound risky, but this approach can even be used with sensitive data such as user permissions. You’re relying on a coincidence of bad events which is quite unlikely – e.g. a user’s permissions are revoked (the old version is still in cache), at the same time as the permissions system goes down, at the same time as the user attempts something they previously could do but should no longer be allowed to. It’s your risk, but what’s worse… one user doing something bad, or the whole system being unavailable?

If you can’t or don’t have a cache, can you implement a default or fallback option? A blank slot on an otherwise working page may be the best of a bad set of options, but the best nonetheless.
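The fallback itself can be as simple as this sketch (the `fetch_recommendations` call and the empty default are hypothetical): catch the failure at the edge of the page component and substitute something harmless.

```python
def recommendations_for(user_id, fetch_recommendations):
    """Render a page slot; fall back to an empty list rather than break the page."""
    try:
        return fetch_recommendations(user_id)
    except Exception:
        return []        # blank slot on an otherwise working page
```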

All else failing, apologise, quickly (see circuit-breaker) and profusely. Let the user know it’s you, not them, that they needn’t worry (e.g. you’ve not charged them and have unpicked any dependent transactions should there be any) and that you’ll be back as soon as you can.

Finally, log everything, monitor and alert. Quite apart from the fact that it’s bad to rely on your customers to tell you when you have a problem, in many cases the user may not even realise something is amiss – it can easily be overlooked. Logging and monitoring make it much easier to know when you have an issue, as well as allowing faster root-cause analysis.
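A small sketch of closing that loop (the threshold, names and alerting hook are all illustrative): log every downstream call, keep a simple failure count, and raise an alert when failures pile up rather than waiting for a customer to phone in.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("downstream")

ALERT_AFTER = 10        # illustrative threshold
_failures = 0

def alert(message: str) -> None:
    log.critical("ALERT: %s", message)       # stand-in for a real pager/alerting hook

def call_downstream(name, fn, *args, **kwargs):
    global _failures
    try:
        result = fn(*args, **kwargs)
    except Exception:
        _failures += 1
        log.exception("call to %s failed (%d consecutive failures)", name, _failures)
        if _failures >= ALERT_AFTER:
            alert(f"{name} is failing repeatedly")
        raise
    else:
        _failures = 0
        log.info("call to %s succeeded", name)
        return result
```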

Queues, circuit-breakers, serve-stale-while-revalidate and logging.

Feedback – Logging and Monitoring

It seems to me that we are seeing an increasing number of issues such as this one reported by the Guardian: a lost transaction results in a credit default against an individual, with the result that they cannot obtain a mortgage to buy a house. A small error for the company; a huge impact for the individual.

The company admitted that despite the request being submitted on their website they did not receive it!? So either the user pressed submit and then walked away without noticing that the response was something other than “all ok!”, or the response was “all ok!” and the company failed to process the request correctly.

If the former then, well, user error for being a muppet… As end users we all need to accept some responsibility and check that we get the feedback we expect.

For the latter, there are several reasons why subsequent processing could have failed: poor transaction management, so the request never gets committed; poor process management, so the request drops into some dead-letter queue never to be dealt with (whether through incompetence or malicious intent); or a system failure and a need to roll back, with resulting data loss.

With the growth in IT over the past couple of decades there are bound to be some quality issues, the result of ever shorter and more demanding deadlines and ever tighter budgets. Time and effort needs to be spent exploring the hypothetical space of what could go wrong so that at least some conscious awareness and acceptance of the risks is achieved. Once that’s been done, I’m usually quite happy to be overruled by the customer on the basis of cost and time pressures.

However, it’s often not expensive to put some logging and monitoring in place – and in this case there must have been something, for the company to admit the request had been submitted. Web logs, application logs, database logs etc. are all valuable sources of information when doing problem determination (PD). You do, though, need to spend at least some time reviewing and auditing these so you can identify issues and deal with them accordingly.

I remember one case where a code change was rushed out which never actually committed any transaction. Fortunately we had the safety net of some judicious logging which allowed us to recover and replay the transactions. WARNING: It worked here but this isn’t always a good idea!

In general though, logging and monitoring are a very good idea. In some cases system defects will be identified; in others, transient issues will be found which may need temporary workarounds. Whatever the underlying issue, it’s important to incorporate feedback and quality controls into the design of systems so that problems are identified before they become disasters. At a rudimentary level logging can help with this, but you need to close the feedback loop through active monitoring, with processes in place to deal with incidents when they arise. It really shouldn’t just be something we do only when the customer complains.

I don’t know the details of what happened in this case. It could have been user error, or we could applaud the company for having logging in place, or they could just have got lucky. In any case, we need feedback on how our systems and processes are performing and operating in order to deal with issues when they arise, improve quality, and indeed improve the business value of the system itself through continuous improvement.