
Interconnected

In the increasingly interconnected micro-services world we're creating, the saying "a chain is only as strong as its weakest link" is particularly pertinent.

It's quite easy for a single service to be dependent upon a number of downstream services, as the diagram below shows.

[Diagram: service A calling downstream services B, C, D, E and F]
An outage or go-slow in any one of the downstream services can have a knock-on impact upstream, right back to the users. Measured in SLAs: say each of B, C, D, E and F aims for an availability SLA of 99.99%. Assuming they all meet this, the best A can achieve is about 99.95%. More realistically, B, C, D, E and F are probably dependent on other services themselves, and before you know it end users are doing well to see anything above 99% uptime.
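To make the arithmetic concrete, here's a minimal sketch (Python, figures purely illustrative) of how availabilities multiply up when a service needs all of its downstream dependencies to answer:

```python
# Composite availability of a service that needs ALL of its
# downstream dependencies to be up (series composition).
def composite_availability(own: float, downstream: list[float]) -> float:
    result = own
    for a in downstream:
        result *= a
    return result

# Service A itself perfect, five dependencies each at 99.99%:
print(composite_availability(1.0, [0.9999] * 5))  # ~0.9995, i.e. about 99.95%
```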

So what strategies do we have for dealing with this?

Firstly, you could just live with it. Really, don't knock this option. Ask "do I really need the availability?" and "does it really matter if it goes down?". Before worrying about any elaborate plan to deal with the situation, it's worth considering whether the situation is really all that bad.

Ok, so it is... The next question should be "do I need a response immediately?". If not, go asynchronous and put a queue between the two services. If the recipient is down, messages simply queue up until it comes back - no problem. Just make sure the queue is as local as possible to the source, and persistent.
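As a sketch of the pattern (the sqlite-backed "outbox" here is purely illustrative; in practice you'd use a proper broker), the producer writes to a local, durable queue and a background worker drains it to the downstream service whenever it's reachable:

```python
# Minimal local, durable "outbox" queue sketch using sqlite3 from the
# standard library. The producer never blocks on the downstream service.
import json
import sqlite3

db = sqlite3.connect("outbox.db")
db.execute("CREATE TABLE IF NOT EXISTS outbox "
           "(id INTEGER PRIMARY KEY, body TEXT, sent INTEGER DEFAULT 0)")

def enqueue(message: dict) -> None:
    # Local and persistent: succeeds even if the downstream service is down.
    db.execute("INSERT INTO outbox (body) VALUES (?)", (json.dumps(message),))
    db.commit()

def drain(send) -> None:
    # Background worker: forward pending messages when the recipient is back.
    pending = db.execute("SELECT id, body FROM outbox WHERE sent = 0").fetchall()
    for msg_id, body in pending:
        try:
            send(json.loads(body))
        except Exception:
            break  # downstream still unavailable; try again later
        db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (msg_id,))
        db.commit()
```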

If it is a request-response model then consider a queue in any case. A queue can often be configured to time out old messages and so deal with slow responses (e.g. if no response in 5 seconds then abandon). This avoids building up a backlog of messages that will never be usefully processed, which would otherwise keep the consumer busy for far longer than the downstream service is actually unavailable. And a queue-based competing-consumer model is often more efficient than having multiple connections banging away sporadically.
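A sketch of the timeout idea (names are hypothetical and not tied to any particular broker): stamp each request with when it was enqueued, and have the consumer discard anything older than the cut-off rather than doing work nobody is still waiting for:

```python
import time

REQUEST_TTL_SECONDS = 5  # matches the "abandon after 5 seconds" example above

def consume(queue, handle):
    # 'queue' yields dicts of the form {"enqueued_at": <unix time>, ...}
    for message in queue:
        age = time.time() - message["enqueued_at"]
        if age > REQUEST_TTL_SECONDS:
            continue  # the caller has long since given up; don't do the work
        handle(message)
```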

On top of this, ensure you're using non-blocking libraries and implement circuit-breakers that trip when downstream services go offline. This of course raises the question, "what sort of response do I provide if the circuit-breaker is open?"... Well, that depends...
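A minimal circuit-breaker sketch (thresholds are illustrative; a production library would be more thorough about concurrency and half-open behaviour):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None   # None => circuit closed

    def call(self, fn, fallback):
        half_open = False
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                return fallback()       # open: fail fast, don't touch downstream
            half_open = True            # timeout elapsed: allow a single trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if half_open or self.failures >= self.failure_threshold:
                self.opened_at = time.time()   # trip (or re-trip) the breaker
            return fallback()
        self.failures = 0
        self.opened_at = None
        return result
```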

In some cases you can cache previous responses and serve those. If this sort of caching model works then even better: you can decouple the request for content from the fetch that retrieves it from the downstream service, so that in effect you're always serving from cache. Allowing stale cache entries to be served, and revalidated, even when downstream services are unavailable can significantly improve the responsiveness and availability of the system. Don't discard cached items just because they're old - keep using them until a fresh copy can be obtained. Size is a concern, but if you can afford it then cache your content for as long as the RTO demands (the service should be back by then, e.g. 4 hours) and revalidate as frequently as the business demands the content be fresh (e.g. every 10 minutes).
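A sketch of serve-stale-while-revalidate along those lines (the 10-minute and 4-hour figures are just the examples above, not recommendations):

```python
import threading
import time

FRESH_FOR = 10 * 60      # revalidate if older than 10 minutes (business freshness)
KEEP_FOR = 4 * 60 * 60   # only insist on a blocking refresh beyond the 4-hour RTO

_cache = {}              # key -> (value, fetched_at)

def get(key, fetch):
    now = time.time()
    entry = _cache.get(key)
    if entry is None or now - entry[1] > KEEP_FOR:
        value = fetch(key)                   # cold, or beyond RTO: must go downstream
        _cache[key] = (value, now)
        return value
    value, fetched_at = entry
    if now - fetched_at > FRESH_FOR:
        # Stale but usable: serve it immediately, refresh in the background.
        threading.Thread(target=_revalidate, args=(key, fetch), daemon=True).start()
    return value

def _revalidate(key, fetch):
    try:
        _cache[key] = (fetch(key), time.time())
    except Exception:
        pass   # downstream unavailable: keep serving the stale copy
```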

It may sound risky, but this approach can even be used with sensitive data such as user permissions. You're looking at a coincidence of bad events which is quite unlikely - e.g. a user's permissions are revoked (so the old version is in cache), at the same time as the permissions system goes down, at the same time as the user attempts something they previously could do but should no longer be allowed to. It's your risk, but what's worse... one user doing something bad, or the whole system being unavailable?

If you can't cache, or don't have a cached copy to hand, can you implement a default or fallback option? A blank slot on an otherwise working page may be the best of a bad set of options, but the best nonetheless.
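As a sketch (the recommendations example is hypothetical), the fallback can be as dumb as a hard-coded default:

```python
def recommendations_for(user_id, fetch_recommendations):
    try:
        return fetch_recommendations(user_id)
    except Exception:
        # Downstream unavailable: an empty slot keeps the rest of the page working.
        return []
```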

All else failing, apologise, quickly (see circuit-breakers) and profusely. Let the user know it's you, not them; that they needn't worry (e.g. you've not charged them, and have unpicked any dependent transactions should there be any); and that you'll be back as soon as you can.

Finally, log everything, monitor and alert. Quite apart from the fact that it's bad to rely on your customers to tell you when you've a problem, in many cases the user may not even realise something is amiss and it can easily be overlooked. Ensuring you log and monitor makes it much easier to know when you've an issue, as well as speeding up root-cause analysis.

Queues, circuit-breakers, serve-stale-while-revalidate and logging.




