2021/03/30

An Observation

Much has changed in the past few years, hell, much has changed in the past few weeks, but that’s another story... and I’ve found a little time on my hands in which to tidy things up.

The world of non-functional requirements (NFRs) has never been so important, and yet it remains irritatingly ignored by so many - in particular by product owners who seem to think NFRs are nothing more than a tech concern.


So if your fancy new product collapses when you get too many users, is that ok?


It’s fair that the engineering team should be asking “how many users are we going to get?” or “how many failures can we tolerate?”, but the only person who can really answer those questions is the product owner.


The dumb answer to these sorts of questions is “lots!” or “none!”, because at that point you’ve given carte blanche to the engineering team to over-engineer... and that most likely means it’ll take a hell of a lot longer to deliver and/or cost a hell of a lot more to run.


The dumb answer is also “only a couple” and “hell, I don’t care”, because, well... you’ll have different problems, mostly about whether you should be in business at all, I suspect.


So the balance is somewhere in the middle. Take your OKRs and have the discussion with the engineering team about what the concerns may be and where the hot spots are. Run through some “what-if...?” scenarios to understand the potential impact of under- or over-estimating. In many cases there are strategies for reactively expanding or shrinking resources which can at least buy you some time... if not more.
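
As a crude illustration (a sketch only - the thresholds, signal names and worker counts are made up for the example, not taken from any real system), that kind of reactive expand/shrink decision can be as simple as comparing a couple of observed signals against the bounds you've agreed:

    # Minimal sketch of a reactive scaling decision. The thresholds and
    # signal names are illustrative assumptions, not real values.
    def desired_workers(current_workers, queue_depth, p99_latency_ms,
                        max_latency_ms=500, min_workers=2, max_workers=20):
        """Return how many workers we'd like, given a couple of observed signals."""
        if p99_latency_ms > max_latency_ms or queue_depth > current_workers * 100:
            target = current_workers * 2       # expand: buy time under load
        elif p99_latency_ms < max_latency_ms / 4 and queue_depth == 0:
            target = current_workers // 2      # shrink: stop paying for idle capacity
        else:
            target = current_workers           # within agreed bounds - do nothing
        return max(min_workers, min(max_workers, target))

    # e.g. desired_workers(4, queue_depth=900, p99_latency_ms=620) -> 8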


Regardless, from an engineering perspective we need to:

1. Understand expectations and what really matters - driven through an understanding of product OKRs.

2. Derive NFRs, agree these with product owners as acceptance criteria (sketched below), and architect and design accordingly.

3. Monitor and observe what’s actually going on - how else do you prove you’ve met your acceptance criteria?
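
To make the second point concrete, here's what an NFR can look like once it's written down as an executable acceptance criterion. A minimal sketch - the thresholds and the fetch_metric helper are illustrative assumptions, not anything agreed with a real product owner:

    # Sketch: an agreed NFR ("p99 checkout latency under 800ms at expected peak,
    # error rate below 0.1%") expressed as a test. fetch_metric is a hypothetical
    # helper that would query whatever monitoring backend you actually run.
    def fetch_metric(name: str) -> float:
        raise NotImplementedError("wire this up to your monitoring backend")

    def test_checkout_meets_agreed_nfrs():
        p99_latency_ms = fetch_metric("checkout.latency.p99")
        error_rate = fetch_metric("checkout.error_rate")
        assert p99_latency_ms < 800    # agreed with the product owner, not plucked from the air
        assert error_rate < 0.001      # i.e. fewer than 1 in 1000 requests fail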


For example, a system I worked on recently involves processing transactions running into the millions of dollars. In this case, losing or duplicating a transaction can be seriously bad for your health.


The consequence? A lot of discussion with product teams resulting in significant design effort to ensure the integrity of transactions as they pass through the system. Realtime monitoring as transactions flow, near-time balancing controls as belt-and-braces, and financial accounting and reporting on top (to be honest the business only really cares about the accounting, but as a daily process that feedback loop is way too slow from an engineering perspective).
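
For illustration, a near-time balancing control doesn't have to be clever - it's essentially counting and summing what went in against what came out over a window. A rough sketch, assuming a simple record shape of id and amount:

    # Sketch of a near-time balancing control: compare what entered the system
    # with what left it over a window, and flag any discrepancy. The record
    # shape and the idea of a fixed window are illustrative assumptions.
    from decimal import Decimal

    def balance(inbound, outbound):
        """Return (missing_ids, duplicate_ids, amount_difference) for one window."""
        in_ids = [t["id"] for t in inbound]
        out_ids = [t["id"] for t in outbound]
        missing = set(in_ids) - set(out_ids)                        # dropped somewhere
        duplicates = {i for i in out_ids if out_ids.count(i) > 1}   # processed twice
        diff = (sum(Decimal(t["amount"]) for t in inbound)
                - sum(Decimal(t["amount"]) for t in outbound))
        return missing, duplicates, diff

    # Anything other than (set(), set(), Decimal("0")) should raise an alert
    # long before the daily accounting run notices.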


Duplicate transactions are avoided - at the cost of reduced availability - and dropped transactions are detected and alerted on within a few minutes and can be automatically replayed (if we ever grow the balls to turn that on).
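
That trade-off is easier to see in code: if the processor can't tell whether it has already seen a transaction, it refuses to process it rather than risk paying someone twice - that's the reduced availability part. A minimal sketch, with an in-memory set standing in for whatever durable store you'd actually use:

    # Sketch of idempotent processing: each transaction carries an id and we only
    # ever act on an id once. The in-memory set stands in for a durable, shared store.
    seen_ids = set()

    def process(transaction, pay):
        """Process a transaction exactly once; pay() is the side effect we must not repeat."""
        tx_id = transaction["id"]
        # In the real thing, if the durable store can't be reached we stop right here
        # and refuse to process - reduced availability in exchange for no duplicates.
        if tx_id in seen_ids:
            return "skipped-duplicate"
        pay(transaction)
        seen_ids.add(tx_id)
        return "processed"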


This isn’t cheap and involves a lot of testing and verification - including some chaos testing to simulate duplicates (not losses, yet) - and I hope it’s been worth it.
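
The duplicate-injection side of that chaos testing is conceptually simple: feed the pipeline a stream in which some transactions deliberately appear twice and check the output still contains each one exactly once. A rough sketch, with the pipeline reduced to a plain function for the purposes of the example:

    import random
    from collections import Counter

    # Sketch of a duplicate-injection chaos test: replay some transactions a
    # second time and assert the pipeline's output still contains each id once.
    def with_injected_duplicates(transactions, rate=0.1, seed=42):
        rng = random.Random(seed)
        stream = []
        for tx in transactions:
            stream.append(tx)
            if rng.random() < rate:
                stream.append(dict(tx))   # deliberately send it again
        return stream

    def test_pipeline_survives_duplicates(pipeline, transactions):
        output = pipeline(with_injected_duplicates(transactions))
        counts = Counter(tx["id"] for tx in output)
        assert all(count == 1 for count in counts.values()), "a duplicate made it through"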


It also involves significant investment in monitoring, tracing and alerting to ensure we have good visibility of what’s going on across the platform so we can spot problems quickly if they do happen (and they can, they just shouldn’t result in a financial loss for customers or ourselves).
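
The tracing part of that is largely discipline: stamp a correlation id on each transaction at the edge and log it at every hop, so one flow can be followed across the platform. A minimal sketch of the idea - the field names and stages are made up for the example:

    import json, logging, uuid

    logger = logging.getLogger("payments")

    # Sketch of correlation-id tracing: one id per transaction, included in every
    # log line, so a single flow can be followed end to end across services.
    def log_event(correlation_id, stage, **fields):
        logger.info(json.dumps({"correlation_id": correlation_id, "stage": stage, **fields}))

    def handle(transaction):
        correlation_id = transaction.get("correlation_id") or str(uuid.uuid4())
        log_event(correlation_id, "received", amount=transaction["amount"])
        # ... validation, payment, persistence - each step logs the same id ...
        log_event(correlation_id, "completed")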


This is a build I hope never to have to call on, and the sort of failures which scare me most involve the outage of entire clusters and complete site failures - this stuff needs specific test focus. Auto-scaling, rebalancing and node failures are such common occurrences now as to be BAU and should not be the reason you’re called out of bed at 2am - we test this stuff by simply breathing these days.


Through discussion with product owners on OKRs we can start to uncover our NFRs and what degree of failure can and cannot be tolerated. We use this knowledge to architect and design solutions appropriate to the problem space, and we monitor and observe to ensure we’re within bounds. 


And if the boundaries move? Then we adapt, evolve, change accordingly. Through discussion we’ve hopefully had a chance to consider what could go wrong and should already have some escape routes planned... just in case.
