Inter-microservice Integrity

A central issue in a microservices environment is how to maintain transactional integrity between services.

The scenario is fairly simple. Service A performs some operation which persists data and at the same time raises an event or notifies service B of this action.

There are a couple of failure scenarios that raise problems.

Firstly, service B could be unavailable. Does service A roll back or unpick the transaction? What if it’s already been committed in A? Do you notify the service consumer of a failure and trigger what could be a cascading failure across the entire service network? Or do you accept long-term inconsistency between A and B?

Secondly, if service B is available but you don’t commit in service A before raising the event then you’ve told B about something that’s not committed… What happens if you then try to commit in A and find you can’t? Do you now need to have compensating transactions to tell service B “oops, ignore that previous message!”?

I’ll set aside the use of queues/topics between services, as these really just move the service B failure point, although there are topologies in which the risks can be mitigated this way (e.g. through queues local to the origin service A).

There are several options to address this issue:

Event Sourcing – In this model the event is king. Service A persists events to an event store and downstream consumers can subscribe to these events. An event does not represent the current state of an entity but the history of what has happened; to understand the current state of the entity you need to replay the event history. Service B then consumes from the event store. Beyond this, a read view can be maintained by service A to provide an efficient presentation of current state. Note, though, that in this model there is a risk of read inconsistency, since that view is updated separately from writes to the event store. If you can tolerate the additional complexity and eventual consistency then this can work well. It also means you can have a fast write store for the events, independent of what may be a more complex view of current state. Such event sourcing is often used in CQRS implementations.
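As a rough illustration of the replay idea (not any particular framework; the entity, event kinds and store here are all hypothetical), current state is derived by folding over the event history:

```python
# Minimal event-sourcing sketch: appends are the only writes;
# current state is recovered by replaying the history.
from dataclasses import dataclass, field


@dataclass
class Event:
    entity_id: str
    kind: str       # e.g. "OrderCreated", "ItemAdded", "OrderClosed"
    payload: dict


@dataclass
class EventStore:
    events: list = field(default_factory=list)

    def append(self, event: Event) -> None:
        # History is immutable; we only ever append.
        self.events.append(event)

    def replay(self, entity_id: str) -> dict:
        # Fold the event history into the entity's current state.
        state: dict = {}
        for e in self.events:
            if e.entity_id != entity_id:
                continue
            if e.kind == "OrderCreated":
                state = {"items": [], "status": "open"}
            elif e.kind == "ItemAdded":
                state["items"].append(e.payload["sku"])
            elif e.kind == "OrderClosed":
                state["status"] = "closed"
        return state
```

A separate read view would simply be a cached result of `replay`, updated asynchronously – which is exactly where the read-inconsistency risk mentioned above comes from.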

Outbox (store-and-forward) – In this model, instead of publishing directly to service B, we store the event in the same datastore as used by service A. With many traditional databases this means we can write the event inside the same transaction as is used to persist the main service’s data. A separate thread can then publish events from the outbox to service B independently.
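A minimal sketch of the pattern, using SQLite purely as a stand-in for service A’s datastore (the table names and event shape are made up). The key point is that the business row and the outbox row commit – or roll back – together:

```python
# Transactional-outbox sketch: business data and the event are written
# in ONE transaction; a separate publisher drains the outbox later.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
conn.execute(
    "CREATE TABLE outbox "
    "(id INTEGER PRIMARY KEY, payload TEXT, published INTEGER DEFAULT 0)"
)


def place_order(order_id: str, total: float) -> None:
    # 'with conn' commits both inserts together, or rolls both back.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"event": "OrderPlaced", "id": order_id}),),
        )


def drain_outbox(publish) -> None:
    # Run independently (e.g. on a background thread): deliver each
    # unpublished event to service B, then mark it as sent.
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # hand off to service B
        with conn:
            conn.execute(
                "UPDATE outbox SET published = 1 WHERE id = ?", (row_id,)
            )
```

Note the delivery guarantee this gives is at-least-once: if the publisher crashes between publishing and marking the row, the event is sent again, so service B needs to tolerate duplicates.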

A variant of the outbox could be to implement fallback options or circuit breakers to retry messages or fallback to the outbox only when messages fail. This can improve the general throughput and responsiveness of services.
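That variant might look something like this sketch, where the `try_publish` callable and the background retry loop that drains the parked events are assumed to exist elsewhere:

```python
# Fallback sketch: attempt direct delivery to service B first and only
# park the event in the outbox when delivery fails.
def publish_or_park(event: dict, try_publish, outbox: list) -> str:
    """Try direct delivery; on failure, store the event for later retry.

    try_publish: callable that delivers to service B (may raise
    ConnectionError on failure); outbox: wherever parked events go.
    """
    try:
        try_publish(event)
        return "published"
    except ConnectionError:
        # A background retry loop (or circuit breaker reset) drains
        # these later, preserving the outbox's delivery guarantee.
        outbox.append(event)
        return "parked"
```

A real circuit breaker would also track consecutive failures and skip the direct attempt entirely while the circuit is open, saving the connection timeout on every call.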

Balancing Controls – An alternative to the above solutions is to accept some degree of failure and implement policing strategies to verify that all expected events were received by the subscribing service. For example: how many orders did service B process in the past hour vs. how many orders were accepted by service A? In some cases it may be acceptable to introduce such delayed checks, but they can be invasive. In addition, where gaps are identified a strategy is needed to replay events and fill in those gaps. You’ll know there’s a problem, but you may not be able to fix it automatically…

To make such balancing controls easier and more consistent across services, a standardised inbox (e.g. on service B) can be used to store inbound messages (or signatures thereof) once processed, combined with a similarly standardised outbox (e.g. on service A), so that checks can be made more easily and missed events can be replayed from the outbox. Such standardisation can place undesirable restrictions on the freedom of microservices to vary and optimise for their specific needs, but may be the least-worst option.
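With standardised inbox and outbox tables, the balancing check itself can be as simple as a set comparison between message identifiers over some window. A hypothetical sketch:

```python
# Reconciliation sketch: everything service A sent (outbox) should
# appear in service B's inbox; the difference is the replay candidates.
def reconcile(outbox_ids: set, inbox_ids: set) -> dict:
    """Balance outbox vs. inbox message IDs for a given time window."""
    missing_in_b = outbox_ids - inbox_ids   # sent but never processed
    unknown_to_a = inbox_ids - outbox_ids   # processed but never sent?!
    return {
        "sent": len(outbox_ids),
        "received": len(inbox_ids),
        "missing_in_b": sorted(missing_in_b),   # replay these from the outbox
        "unknown_to_a": sorted(unknown_to_a),   # should normally be empty
    }
```

Anything in `missing_in_b` can be replayed from the outbox; anything in `unknown_to_a` indicates a deeper problem (lost outbox records, or a rogue publisher) that probably needs a human.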

Which option is preferable depends on what constraints you can accept.

If you can tolerate inconsistent reads and eventual consistency and have high throughput requirements then event-sourcing may be preferable.

If on the other hand you absolutely must be able to read-your-writes and ensure consistency within a service then an outbox may be simpler and provides better consistency.

Alternatively if you can tolerate some level of failure and workaround corrections then perhaps balancing controls may suffice.


I suspect most of us working in IT today use agile methodologies such as Kanban, Scrum and SAFe. We also strive to keep up to date with the latest developments in languages, libraries, patterns, architectures etc.

All of this is with the intent of improving the delivery speed, quality, efficiency, maintainability and the cost effectiveness of the systems we build – oh, and whilst improving our CVs at the same time.

Care is needed, though, to ensure these tools don’t distract us from delivering the solutions they were employed for. This often isn’t down to any specific fault with the methods and tools themselves, but with how we set about using them.

There can often be a blinkered tendency to focus maniacally on the tools and methods themselves and, in doing so, fail to deliver most effectively on actual requirements and customer needs. If all you have is a hammer then everything looks like a nail.

Worse still, the job can become about servicing the tools used to do the work rather than the product of the work itself.

Furthermore these tools are often barely distinguishable from each other and for most use cases it doesn’t much matter if you use one or another; Java or .NET, AWS or GCP, Scrum or Kanban. Use what works for your team and be the master of tools, not a slave to them.

The product of your creativity is that which you build; value is derived from how your customers use the product.

Focus on the product and the value it gives to your customers. Produce working solutions first and foremost – functionally and non-functionally – as efficiently as possible.

Over time we learn new methods and tools that improve delivery and optimise value for the customer, but we must always be mindful that tools and methods are only a means to an end and not an end in themselves.

A Year’s Worth

A few Christmases ago I was messing about creating bubble maps – no doubt in some mince pie and port induced state of inebriation (quality of code is consequently as you’d expect).

I’d long forgotten about this until Kent Beck’s recent post about trying to understand A Year’s Worth of effort.

This looked familiar and so with a little manipulation here’s a simple utility to convert priority ordered stories into a visual bubble map.

SAFe Shite!

SAFe (Scaled Agile Framework – for which I will not provide the link because they don’t deserve it) is worthless shite!

It’s just a bunch of practices developed elsewhere, rebranded by a cabal of consultants to cream money out of large organisations. The only additions it brings are there to placate management by helping masquerade traditional waterfall processes and hierarchical structures as “agile” – it does little to nothing to actually change the org!

Save your money and go open source with your processes. SAFe is bollocks!


I was/am interested in the whole Equifax hack and how it happened. To this end I posted a brief link yesterday to the Struts team’s response. A simple case of failing to patch! Case closed…

But then I’ve been thinking that’s not really very fair.

This was (apparently) caused by a defect that’s been around for a long time. The developers had reacted very quickly when the problem was identified (within 24 hrs) but Equifax – by all accounts – had failed to patch for a further 6 months.

What did we expect? That they’d patch it the next day? No chance. Within a month? Maybe. But if the issue is embedded in some third party product then they’re dependent upon a fix being provided and if it’s in some in-house developed tool then they need to be able to rebuild the app and test it before they can deploy. Struts was/is extremely popular. It was the Spring of its day and is still deeply embedded in all sorts of corporate applications and off the shelf products. Fixing everything isn’t going to happen overnight.

Companies like Equifax will also have hundreds, even thousands, of applications and each application will have dozens of dependencies any one of which could have suffered a similar issue. On top of this, most of these applications will be minor, non critical tools which have been around for many years and which frankly few will care about. Running a programme to track all of these dependencies, patch applications and test them all before rolling them into production would take an army of developers, testers, sys-ops and administrators working around the clock just to tread water. New features? Forget it. Zero-day? Shuffles shoes… Mind you, it’d be amusing to see how change management would handle this…

So we focus on the priority applications and the low-hanging fruit of patching (OS etc.) and hope that’s good enough? Hmm… anything else we can do?

Well, we’re getting better with build, test and deployment automation, but we’re a long way from perfection. So do some of that; it’ll make dealing with change all the easier, but it’s no silver bullet. And again, good luck with change management…

Ultimately though we have to assume we’re not going to have perfect code (there’s no such thing!)… that we’re not able to patch against every vulnerability… and that zero day exploits are a real risk.

Other measures are required regardless of your patching strategy. Reverse proxies, security filters, firewalls, intrusion detection, n-tier architectures, heterogeneous software stacks, encryption, pen-testing etc. Security is like layers of swiss cheese – no single layer will ever be perfect, you just hope the holes don’t line up when you stack them all together. Add to this some decent monitoring of traffic and an understanding of norms and patterns – something which you actually have people looking at continually rather than after the event – and you stand a chance of protecting yourself against such issues, or at least of identifying potential attacks before they become actual breaches.

Equifax may have failed to patch some Struts defect for six months but that’s not the worst of it. That they were vulnerable to such a defect in the first place smells like… well, like they didn’t have enough swiss cheese. That an employee tool was also accessible online and provided access to customer information with admin/admin credentials goes on to suggest a real lack of competency and recklessness at senior levels.

Adding insult to injury, to blame an open-source project (for the wrong defect!) which heroically responded and addressed the real issue within 24 hrs of it being identified six months earlier (!?) makes Equifax look like an irresponsible child, always blaming someone else for its reckless actions.

They claim to be “an innovative global information solutions company”. So innovative they’re bleeding edge and giving their, no our!, data away. I’m just not sure who’s the bigger criminal… the hackers or Equifarce!

Equifax Data Breach Due to Failure to Install Patches

“the Equifax data compromise was due to their failure to install the security updates provided in a timely manner.”

Source: MEDIA ALERT: The Apache Software Foundation Confirms Equifax Data Breach Due to Failure to Install Patches Provided for Apache® Struts™ Exploit : The Apache Software Foundation Blog

As simple as that apparently. Keep up to date with patching.


I should probably have learnt this some time ago…

Quite often we find no-one is willing to do the {insert-task-here}.

I don’t know why. Fear of getting it wrong. Fear of ridicule. Fear of crayons. Whatever. Here’s a tip on how to get things moving when no-one seems willing…


It doesn’t even matter if you do it badly. In fact it’s often better to do it badly on purpose!

You’ll be amazed (or maybe not) at the number of people that come out of the woodwork to provide their own “advice”. All of a sudden you’ll have no end of input. Just be prepared to bite your tongue and take solace in the knowledge that you took one for the greater good.

Someone’s got to get the ball rolling…

It’s an older code..

(let’s assume we’re talking about encryption keys here rather than pass codes though it really makes little difference… and note that your passwords are a slightly different concern)

Is it incompetence to use an old code? No.

For synchronous requests (e.g. those over HTTPS) there’s a handshake process you go through every so often to agree a new key. Client and server then continue to use this key until it expires, at which point they agree a new one. If the underlying certificate changes you simply go through the handshake again.

For asynchronous requests things aren’t as easy. You could encrypt and queue a request one minute and change the key the next, but the message remains on the queue for another hour before it gets processed. In these cases you can either reject the message (usually unacceptable) or try the older key and accept it for a while longer.
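The “try the older key” idea amounts to keeping a key ring and attempting keys newest-first. A toy sketch (the XOR “cipher” here is a stand-in for a real one such as AES-GCM, and is emphatically not for production use):

```python
# Key-ring sketch: decrypt with the current key first, falling back to
# older keys for messages queued before a rotation. An HMAC over the
# plaintext tells us whether a given key was the right one.
import hmac


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    # NOT real encryption: a reversible XOR stand-in, plus an HMAC tag
    # so decryption with the wrong key is detectable.
    body = bytes(b ^ key[i % len(key)] for i, b in enumerate(plaintext))
    return body + hmac.digest(key, plaintext, "sha256")


def toy_decrypt(key: bytes, blob: bytes) -> bytes:
    body, tag = blob[:-32], blob[-32:]   # sha256 tag is 32 bytes
    plaintext = bytes(b ^ key[i % len(key)] for i, b in enumerate(body))
    if not hmac.compare_digest(tag, hmac.digest(key, plaintext, "sha256")):
        raise ValueError("wrong key")
    return plaintext


def decrypt_with_ring(keys: list, blob: bytes) -> bytes:
    """keys[0] is the current key; older keys follow in rotation order."""
    for key in keys:
        try:
            return toy_decrypt(key, blob)
        except ValueError:
            continue
    raise ValueError("no key in the ring can decrypt this message")
```

Retiring an old key then just means dropping it from the ring once no message encrypted under it can still be in flight.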

Equally, with persistent storage you could change the key every month, but you can’t go round decrypting and re-encrypting all historic content, and accepting the outage this causes, every time. Well, not if you’ve billions of records and an availability SLA of greater than a few percent. So again, you’ve got to let the old codes work…

You could use full disk/database encryption, but that’s got other issues – like it’s next to useless once the disks are spinning… And besides, when you change the disk password you’re not actually changing the key and re-encrypting the data; you’re just changing the password used to obtain the key.
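That password-wraps-the-key point can be sketched as envelope encryption in miniature. Again the XOR “cipher” is a toy stand-in for a real one, and the function names are made up:

```python
# Key-wrapping sketch: the data key encrypts the data; the password only
# protects (wraps) the data key. Changing the password re-wraps the data
# key -- the data itself is never touched.
import hashlib


def xor(key: bytes, data: bytes) -> bytes:
    # Toy reversible stand-in for a real cipher.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def wrap_key(password: str, data_key: bytes) -> bytes:
    # Derive a key-encryption key (KEK) from the password and wrap the
    # data key with it. (A real system would use a proper KDF with salt.)
    kek = hashlib.sha256(password.encode()).digest()
    return xor(kek, data_key)


def change_password(old_pw: str, new_pw: str, wrapped: bytes) -> bytes:
    # Unwrap with the old password, re-wrap with the new one.
    # Note: no data is decrypted or re-encrypted here.
    data_key = xor(hashlib.sha256(old_pw.encode()).digest(), wrapped)
    return wrap_key(new_pw, data_key)
```

This is why a disk-password change is near-instant: only a few dozen bytes (the wrapped key) are rewritten, while every block on disk stays encrypted under the same, unchanged data key.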

So it is ok to accept old codes. For a while at least.

An empire spread throughout the galaxy isn’t going to be able to distribute new codes to every stormtrooper instantaneously. Even if they do have the dark-side on their side…