In the increasingly interconnected micro-services world we’re creating, the saying “a chain is only as strong as its weakest link” is particularly pertinent.

It’s quite easy for a single service to be dependent upon a number of downstream services as the diagram below shows.

An outage or go-slow in any one of the downstream services can have a knock-on impact upstream and right back to users. Measuring this in SLAs, let’s say B, C, D, E and F each aim for an availability SLA of 99.99%. Assuming they meet this, and that A needs all five to respond, the best A can achieve is roughly 99.95% (0.9999 multiplied by itself five times). More realistically, B, C, D, E and F are probably dependent on other services themselves, and before you know it end users are doing well to see anything above 99% uptime.
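As a rough check on that arithmetic, here is a minimal sketch. It assumes the downstream services fail independently and that A itself never fails, neither of which is stated above; the function name is purely illustrative.

```python
def serial_availability(*availabilities: float) -> float:
    """Availability of a caller that needs every one of its dependencies up.

    Assumes failures are independent, so the individual availabilities multiply.
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


# Five downstream services (B to F), each meeting a 99.99% SLA.
best_case_for_a = serial_availability(*[0.9999] * 5)
print(f"{best_case_for_a:.2%}")
```

Adding more entries to the list shows how quickly the figure erodes as the dependency chain grows.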

So what strategies do we have for dealing with this?

Firstly, you could just live with it. Really, don’t knock this option. Ask yourself “do I really need the availability?” and “does it really matter if it goes down?”. Before we worry about any elaborate plan to deal with the situation, it’s worth considering if the situation is really all that bad.

Ok, so it is… The next question should be “do I need a response immediately?”. If not, go asynchronous and put a queue between the two services. If the recipient is down, messages will queue up until it comes back – no problem. Just make sure the queue is persistent and as local as possible to the source.

If it is a request-response model then consider a queue in any case. A queue can often be set to time out old messages and so deal with slow responses (e.g. if there is no response in 5 seconds then abandon the request). This avoids a large backlog of messages waiting to be processed. Such a backlog can tie up the consumer with requests whose callers have long since given up, blocking it for much longer than the downstream service was actually unavailable. It can also be more efficient to have a queue-based competing-consumer model than multiple connections banging away sporadically.
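The message time-out idea can be sketched in a few lines. This is only an illustration using Python’s in-process `queue.Queue`; a real message broker would normally expire messages itself, and the names `enqueue`, `drain_fresh` and `MAX_AGE_SECONDS` are hypothetical.

```python
import queue
import time

MAX_AGE_SECONDS = 5.0  # assumption: mirrors the "5 seconds" example in the text


def enqueue(q, payload, now=None):
    """Store the payload together with the time it was enqueued."""
    q.put((time.monotonic() if now is None else now, payload))


def drain_fresh(q, max_age=MAX_AGE_SECONDS, now=None):
    """Return payloads still worth processing and silently drop expired ones."""
    now = time.monotonic() if now is None else now
    fresh = []
    while True:
        try:
            enqueued_at, payload = q.get_nowait()
        except queue.Empty:
            break
        if now - enqueued_at <= max_age:
            fresh.append(payload)
        # Otherwise the caller has already given up, so do no work for it.
    return fresh
```

When the consumer comes back after an outage it then only spends time on requests that somebody is still waiting for.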

On top of this, ensure you’re using non-blocking libraries and implement circuit-breakers to trip when downstream services go offline. This of course raises the question, “what sort of response do I provide if the circuit-breaker is open?”… Well, that depends…
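For readers who have not met the pattern, a circuit-breaker can be sketched roughly as below. This is a simplified, single-threaded illustration rather than a production implementation; the class name, thresholds and `clock` parameter are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised immediately while the breaker is open, without calling downstream."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before tripping
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("downstream marked unavailable")
            # Reset period has elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The caller catches `CircuitOpenError` and decides what to serve instead, which is the question the rest of this post deals with.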

In some cases you can cache previous responses and serve these. If this sort of caching model works then, even better, you can decouple the request for content from the fetching of it from a downstream service, so that you’re in effect always serving from cache. Allowing stale cache entries to be served whilst revalidating, even when downstream services are unavailable, can significantly improve the responsiveness and availability of the system. Don’t discard cached items just because they’re old. Keep using them until a fresh copy can be obtained. Size is a concern, but if you can afford it then cache your content for as long as the RTO demands (the service should be back by then, e.g. 4 hours) and revalidate as frequently as the business demands the content be fresh (e.g. every 10 minutes).
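A minimal sketch of the “keep serving stale entries” idea follows. To stay short it revalidates inline on the request rather than in a background task, so it is a simplification of the fully decoupled model described above. The class name and parameters are hypothetical; the defaults of 10 minutes and 4 hours are taken from the examples in the text.

```python
import time


class StaleTolerantCache:
    def __init__(self, fetch, fresh_for=600.0, keep_for=4 * 3600.0,
                 clock=time.monotonic):
        self.fetch = fetch          # function that calls the downstream service
        self.fresh_for = fresh_for  # revalidate after this many seconds
        self.keep_for = keep_for    # never serve entries older than this
        self.clock = clock
        self.entries = {}           # key -> (value, fetched_at)

    def get(self, key):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and now - entry[1] < self.fresh_for:
            return entry[0]  # still fresh, no downstream call needed
        try:
            value = self.fetch(key)
        except Exception:
            # Downstream is unavailable: fall back to the old copy if we may.
            if entry is not None and now - entry[1] < self.keep_for:
                return entry[0]
            raise
        self.entries[key] = (value, now)
        return value
```

The old entry is only replaced once a fresh copy has actually been obtained, never simply because it has aged.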

It may sound risky, but this approach can even be used with sensitive data such as user permissions. You’re looking at a coincidence of bad events which is quite unlikely – e.g. a user’s permissions are revoked (the old version is in cache), at the same time as the permissions system goes down, at the same time as the user attempts something they previously could but should no longer be allowed to do. It’s your risk, but what’s worse… One user doing something bad or the whole system being unavailable?

If you can’t or don’t have a cache then can you implement a default or fallback option? Having a blank slot on a page, but a working page otherwise, may be the best of a bad set of options but the best nonetheless.

All else failing, apologise, quickly (see circuit-breaker) and profusely. Let the user know it’s you, not them, that they needn’t worry (e.g. you’ve not charged them and have unpicked any dependent transactions should you have them) and that you’ll be back as soon as you can.

Finally, log everything, monitor and alert. Quite apart from the fact that it’s bad to rely on your customers to tell you when you’ve a problem, in many cases the user may not even realise something is amiss, so the problem can easily be overlooked. Ensuring you log and monitor makes it much easier to know when you’ve an issue, as well as making root-cause analysis faster.

Queues, circuit-breakers, serve-stale-while-revalidate and logging.

The Matrix

The matrix may well be the most under-appreciated utility in the toolbox of architects.

We produce diagrams, verbose documents and lists-of-stuff till the cows come home, but matrices are an all too rare, almost mythical, beast. Their power though is more real than the healing and purification properties of true Unicorn horns, despite what some may say.

Here’s an example.

The diagram below shows a contrived and simplified matrix of the relationship between user stories and components. In many cases such a matrix may cross hundreds of stories and dozens of components.

Picture of a matrix from a spreadsheet

Crucially we can see for a particular story which components are impacted. This provides much needed assurance to the architect that we have the needed coverage and allows us to easily see where functionality has no current solution. In this case “US4: Audit Logging”.

Adding some prioritisation (col C) allows us to see if this is going to be an immediate issue or not. In this case the product owner has (foolishly) decided auditing isn’t important…

Developers can use the matrix to see which components need implementation for a story and see what other requirements are impacted by the components they’re about to develop.

Now, it may well be that we’ll proceed and accept any technical debt associated with high-priority requirements to deliver them faster. It may also be that the lower priority requirements never get delivered, so no problem. But it may instead be that the next story in the backlog has some particular nuanced requirement which makes things rather hairy, and is best considered up-front rather than walking into a pit because we did things another way. It’s a balancing game with pros and cons – the matrix provides visibility, which all parties can use, to aid the assessment.

And there’s more (in true infomercial style)… We can also see that the “Access Gateway”, “Article Management” and “Database” components appear to cover many stories. This may be fine if the functionality they provide is consistent across requirements – for example the “Access Gateway” may simply be doing authentication and authorisation consistently – but in other cases it suggests some decomposition and refinement is needed – for example we may wish to consider breaking out “Articles” and “Comments” into two separate components which have more clearly defined responsibilities. Regardless, it helps to see that some components are going to be critical to a lot of requirements and may need more care and attention than others.
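The three readings of the matrix described above (gaps in coverage, the impact of a component, and heavily used components) are easy to automate once the matrix is held as data. The sketch below uses invented data: only “US4: Audit Logging” and the three component names come from the text, and the other stories and all of the mappings are hypothetical.

```python
# Hypothetical story-to-component matrix (rows are stories, values are the
# components each story touches). An empty set means no current solution.
matrix = {
    "US1: Read Article": {"Access Gateway", "Article Management", "Database"},
    "US2: Post Comment": {"Access Gateway", "Article Management", "Database"},
    "US3: Log In": {"Access Gateway"},
    "US4: Audit Logging": set(),
}


def uncovered_stories(matrix):
    """Stories that no component currently supports."""
    return sorted(story for story, comps in matrix.items() if not comps)


def stories_touching(matrix, component):
    """Stories impacted by work on the given component."""
    return sorted(story for story, comps in matrix.items() if component in comps)


def component_load(matrix):
    """Number of stories each component participates in."""
    counts = {}
    for comps in matrix.values():
        for component in comps:
            counts[component] = counts.get(component, 0) + 1
    return counts


print(uncovered_stories(matrix))
print(stories_touching(matrix, "Database"))
print(component_load(matrix))
```

A spreadsheet does the same job perfectly well; the point is only that each question is a simple row or column scan.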

So where does this particular matrix come from? We could be accused of the near cardinal sin today of following a waterfall mentality with the need for a big up-front design phase. Not so. It’s more akin to a medical triage.

We have a backlog. We need to review the backlog and sketch out the core components required to support this. We don’t need to dig into each component in great detail – just enough to provide assurances that we have what’s needed for the priority requirements and that the requirements have enough detail to support this (basically some high level grooming). Low priority or simple requirements we may skim over (patient will live (or die)), higher priority or complex ones we assess till we can build the assurances we need (patient needs treatment).

When new requirements arise we can also quickly assess these against the matrix to see where the impact will be.

This is just one of many useful matrices. Story-to-story can help identify requirement dependencies. Likewise for component-to-component. Mappings from logical components to infrastructure help build a view of the required environment and can, when taken to the physical level, be used for automatic identification of things like firewall rules. You can even connect matrices together to identify which requirements are fulfilled by which servers – e.g. physical-node to logical-node to component to requirement maps – or use them for problem analysis to work out what’s broken – e.g. “this function isn’t working, which components could this relate to?”. Of course, matrices are only as valuable as the quality of the data they hold, so such capabilities are often not realised.
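Connecting matrices together amounts to composing mappings. A rough sketch, with entirely hypothetical requirement, node and server names:

```python
def compose(first, second):
    """Chain two matrices: a -> {b} and b -> {c} give a -> {c}."""
    return {
        a: set().union(*(second.get(b, set()) for b in bs))
        for a, bs in first.items()
    }


# Hypothetical data, for illustration only.
requirement_to_component = {
    "US1": {"Article Management", "Database"},
    "US3": {"Access Gateway"},
}
component_to_node = {
    "Article Management": {"app-node"},
    "Database": {"db-node"},
    "Access Gateway": {"web-node"},
}
node_to_server = {
    "app-node": {"srv-01", "srv-02"},
    "db-node": {"srv-03"},
    "web-node": {"srv-01"},
}

requirement_to_server = compose(
    compose(requirement_to_component, component_to_node), node_to_server
)
print(sorted(requirement_to_server["US1"]))
```

Inverting the same mappings gives the problem-analysis view: start from a failing server or function and walk back to the requirements or components it could affect.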

Like Unicorns, matrices can be magical. Fortunately for us, and – I hate to break this to you – unlike Unicorns, matrices are real (despite what some may say!).