2020/04/04

Inter-microservice Integrity


A central issue in a microservices environment is how to maintain transactional integrity between services.

The scenario is fairly simple. Service A performs some operation which persists data and at the same time raises an event or notifies service B of this action.



There are a couple of failure scenarios that cause problems.

Firstly, service B could be unavailable. Does service A roll back or unpick the transaction? What if it's already been committed in A? Do you notify the service consumer of a failure and trigger what could be a cascading failure across the entire service network? Or do you accept long-term inconsistency between A & B?

Secondly, if service B is available but you don't commit in service A before raising the event, then you've told B about something that hasn't been committed... What happens if you then try to commit in A and find you can't? Do you now need compensating transactions to tell service B "oops, ignore that previous message!"?

I'll ignore the use of queues/topics between services, as the queue really just becomes the service B failure point, although there are topologies in which this risk can be mitigated (e.g. through queues local to the originating service A).

There are several options to address this issue:


Event Sourcing - In this model the event is king. Service A persists events to an event store and downstream consumers can subscribe to these events. An event does not represent the current state of an entity but rather part of the history of what has happened to it; to understand the current state you replay the event history. Service B then consumes from the event store. Beyond this, a read view can be maintained by service A to provide an efficient presentation of current state. Note though that in this model there is a risk of read inconsistency, since that view is updated separately from writes to the event store. If you can tolerate the additional complexity and eventual consistency then this can work well. It also means you can have a fast write store for the events, independent of what may be a more complex view of current state. Such event sourcing is often used in CQRS implementations.
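
As a minimal sketch of the append-and-replay idea (the table, event types and entity here are all hypothetical, not a prescribed schema): service A only ever appends immutable events, and the current state of an entity is derived by replaying its history in order.

    import json
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE event_store (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        entity_id TEXT, event_type TEXT, payload TEXT)""")

    def append_event(entity_id, event_type, payload):
        # Writes are cheap: an append to the event store, no update-in-place.
        db.execute("INSERT INTO event_store (entity_id, event_type, payload) VALUES (?, ?, ?)",
                   (entity_id, event_type, json.dumps(payload)))
        db.commit()

    def current_state(entity_id):
        # Replay the entity's event history, oldest first, to rebuild its current state.
        state = {}
        for event_type, payload in db.execute(
                "SELECT event_type, payload FROM event_store WHERE entity_id = ? ORDER BY id",
                (entity_id,)):
            if event_type == "OrderCreated":
                state = json.loads(payload)
            elif event_type == "OrderAmended":
                state.update(json.loads(payload))
        return state

    append_event("order-1", "OrderCreated", {"item": "book", "qty": 1})
    append_event("order-1", "OrderAmended", {"qty": 2})
    print(current_state("order-1"))  # -> {'item': 'book', 'qty': 2}

A separately maintained read view would cache the output of current_state, which is exactly where the read inconsistency mentioned above creeps in.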


Outbox (store-and-forward) - In this model, instead of publishing directly to service B, we store the event in the same datastore as used by service A. With many traditional databases this means we can commit the event inside the same transaction as the data persisted by the main service. A separate thread can then publish events from the outbox to service B independently.
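
A minimal sketch of the pattern, assuming a relational store and hypothetical table names: the order and its event are written in one local transaction, and a poller later publishes anything still unpublished. This gives at-least-once delivery, so service B should tolerate duplicates.

    import json
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, item TEXT)")
    db.execute("""CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, published INTEGER DEFAULT 0)""")

    def create_order(order_id, item):
        with db:  # one local transaction covers both the business data and the event
            db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, item))
            db.execute("INSERT INTO outbox (payload) VALUES (?)",
                       (json.dumps({"event": "OrderCreated", "order_id": order_id}),))

    def publish_pending(send):
        # Run periodically (e.g. on a separate thread); 'send' delivers to service B.
        rows = db.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
        for row_id, payload in rows:
            send(json.loads(payload))  # if this raises, the row stays unpublished and is retried
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
            db.commit()

    create_order("order-1", "book")
    publish_pending(lambda event: print("delivered", event))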

A variant of the outbox could be to implement fallback options or circuit breakers, retrying messages or falling back to the outbox only when direct publishing fails. This can improve the general throughput and responsiveness of services.
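
Sketched below under the same assumptions (the helpers are hypothetical, and a real circuit breaker would track failure rates rather than catching every exception): attempt direct delivery first for lower latency, and only write to the outbox when that fails, so the poller can retry it later.

    def publish_or_store(event, send, store_in_outbox):
        try:
            send(event)              # direct call to service B (or a broker)
        except Exception:
            store_in_outbox(event)   # fall back to store-and-forward; the poller retries later

    # Example: direct delivery fails, so the event is captured for later retry.
    def failing_send(event):
        raise ConnectionError("service B unavailable")

    pending = []
    publish_or_store({"event": "OrderCreated", "order_id": "order-1"},
                     send=failing_send, store_in_outbox=pending.append)
    print(pending)  # -> [{'event': 'OrderCreated', 'order_id': 'order-1'}]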


Balancing Controls - An alternative to the above solutions is to accept some degree of failure and implement policing strategies to verify that all expected events were received by the subscribing service. For example, how many orders did service B process in the past hour vs. how many orders were accepted by service A? In some cases it may be acceptable to introduce such delayed checks, but these checks can be invasive. In addition, where gaps are identified a strategy is needed to replay events and fill in those gaps. You'll know there's a problem, but you may not be able to fix it automatically...

To make such balancing controls easier and more consistent across services, a standardised inbox (e.g. on service B) can be used to store inbound messages, or signatures thereof, once processed. Combining this with a similarly standardised outbox (e.g. on service A) makes the checks easier and allows missed events to be replayed from the outbox, as sketched below. Such standardisation can place undesirable restrictions on the freedom of microservices to vary and optimise for their specific needs, but it may be the least worst option.
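
A sketch of what such a reconciliation might look like, assuming standardised outbox and inbox records keyed by event id (all names here are hypothetical): events that A sent but B never recorded are replayed, relying on B to process duplicates idempotently.

    def reconcile(outbox_events, inbox_event_ids, replay):
        # outbox_events: {event_id: event} recorded by service A in the checking window.
        # inbox_event_ids: set of event ids service B recorded as processed.
        missing = [event for event_id, event in outbox_events.items()
                   if event_id not in inbox_event_ids]
        for event in missing:
            replay(event)  # re-deliver the gap; service B must handle duplicates idempotently
        return len(outbox_events), len(inbox_event_ids), len(missing)

    sent = {"e1": {"order": "1"}, "e2": {"order": "2"}, "e3": {"order": "3"}}
    received = {"e1", "e3"}
    print(reconcile(sent, received, replay=lambda event: print("replaying", event)))
    # -> replays e2, then prints (3, 2, 1): 3 sent, 2 recorded by B, 1 replayed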


Which option is preferable depends on what constraints you can accept.

If you can tolerate inconsistent reads and eventual consistency, and have high throughput requirements, then event sourcing may be preferable.

If on the other hand you absolutely must be able to read your writes and ensure consistency within a service, then an outbox may be simpler and provide better consistency.

Alternatively, if you can tolerate some level of failure and work around it with corrections, then perhaps balancing controls may suffice.

