
Operability

Operability... The smörgåsbord of non-functional requirements. Often boiled down to maintainability, reliability, availability and supportability, it's another area of non-functionals that is often forgotten. In part this may be due to the lack of immediate need pushing such requirements down the priority list, but more frequently they're just plain missed. Whatever the reason, here's my list of operability topics...

Maintainability - The ability to maintain the system on an ongoing basis. Change happens, and from day one you can expect change cases to land, resulting in modifications to the baseline. Such change introduces entropy to the system which, over time, will degrade its operation unless dealt with. Software maintenance and refactoring may be required, as will corrective and preventative action (upgrades, patching, archiving etc.) to ensure smooth running of the system. This leads to a somewhat nebulous requirement to be "able to maintain the system", but in solution terms it simply means the system must be patched, monitored, house-kept etc. Configuration management requirements will sprout up to ensure the software can be maintained, as will procedures for change management. Some useful techniques have become more fashionable recently, including continuous integration and automated testing, which can make the system more resilient to change - or at least make it easier to identify issues.

Reliability - Shit happens and at some point in time you will experience a failure. This may be due to the failure of some component in the system (identify those SPOFs - Single Points of Failure), to unexpected variances in input data, or simply to sustained operation over a long period of time (for which soak testing may be performed). SLAs can be defined for how much data loss can be tolerated (e.g. RPO - Recovery Point Objective) and how long it should take to recover in the event of a failure (e.g. RTO - Recovery Time Objective); these in turn drive backup and recovery schedules. Reliability is also tightly coupled with...
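To make the RPO/RTO idea concrete, here's a minimal sketch of checking a backup schedule against those objectives. The function names and the example figures are hypothetical, and it assumes the worst-case data loss is one full backup interval (a failure just before the next backup runs).

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case data loss (one full backup interval) fits the RPO."""
    return backup_interval <= rpo


def meets_rto(measured_restore_time: timedelta, rto: timedelta) -> bool:
    """True if a restore, as timed in a real recovery drill, fits the RTO."""
    return measured_restore_time <= rto


# Hypothetical example: nightly backups against a 4 hour RPO fall short,
# while a 90 minute restore fits inside a 2 hour RTO.
print(meets_rpo(timedelta(hours=24), rpo=timedelta(hours=4)))
print(meets_rto(timedelta(minutes=90), rto=timedelta(hours=2)))
```

The restore time needs to be a measured figure from an actual recovery exercise, not an estimate.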

Availability - What are the required hours of operation? What maintenance windows need to be scheduled? Availability costs - you immediately start doubling up on resources as soon as you want 99.9% availability (just under 9hrs of downtime a year for a theoretical 24x7 system). If that figure includes scheduled maintenance time then you're going to need a parallel system to take the load during the maintenance window (redundant components etc.)... and if you're using horizontal scaling to achieve the desired performance and capacity then you can't count on that same capacity to cover an outage - what's left is not going to be big enough (maybe...). Some components are difficult to make resilient to failures and maintenance (databases and data-centers for example), so eventually it gets silly expensive and you need to stop and accept some downtime.
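The arithmetic behind that "just under 9hrs" figure is easy to sketch. This assumes a theoretical 24x7 system and a 365-day year; the function name is mine and purely illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # theoretical 24x7 system, non-leap year


def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours of downtime permitted per year at the given availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (100 - availability_pct) / 100


# Each extra "nine" cuts the permitted downtime by a factor of ten.
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_hours_per_year(target):.2f} hours/year")
```

At 99.9% this works out to roughly 8.76 hours a year, and that budget has to cover scheduled maintenance too if the requirement doesn't exclude it.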

Supportability - You need workable processes, procedures, skills and tools in place to deal with issues (help-desk, change management, service management etc.), and the system needs to log enough detail to allow root-cause problem determination (PD). Monitoring can be introduced but needs to be appropriate - too many alerts and the service-desk won't handle them, too few and the service-desk won't know what's going on. I'm perhaps famed for error messages saying things like "this error should never occur because of xxx...". Some think they're worthless (since they should never occur) but I have more than once seen these in logs, and the inclusion of a detailed stack trace helps to PD the issue. I've also come across many more messages of the form "something went wrong" with absolutely no detail. These are the times when I feel like gouging out someone's eyes with spoons. Support also extends to skill-set availability. If you can't hire anyone to support the solution then it doesn't matter that the solution is the best thing since sliced bread, because you can't really use it. The cost of skills can vary considerably, and choosing technology you can afford to support needs to be considered up-front.
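As a rough illustration of the "should never occur" style of message, here's a minimal sketch using Python's standard logging module. The logger name, the apply_discount function and the discount rule are all hypothetical; the point is only that the message says why the error was thought impossible, carries the values involved, and keeps the stack trace.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical logger name


def apply_discount(order_id: str, total: float, discount: float) -> float:
    """Apply a discount, logging full context if the 'impossible' happens."""
    try:
        if discount > total:
            # Assumed invariant: upstream validation caps discounts at the total.
            raise ValueError("discount exceeds order total")
        return total - discount
    except ValueError:
        # logger.exception logs at ERROR level and appends the stack trace.
        logger.exception(
            "This error should never occur because discounts are capped at "
            "the order total upstream: order_id=%s total=%.2f discount=%.2f",
            order_id, total, discount,
        )
        raise
```

Compare that with a bare "something went wrong": the same failure, but with nothing for whoever is on support to work from.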

The sorts of issues addressed by operability requirements occur infrequently, but this rarity is also a very good reason why the procedures, technology and skills need to be exercised on a regular basis. If your backup solution doesn't work then you'll want to know before you have to rely on it. This alone should push such requirements back up the priority list.
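Exercising a backup can be as simple as restoring it somewhere harmless and comparing the result with the original. The sketch below assumes a single-file backup and uses a plain file copy as a stand-in for whatever your real restore procedure is; restore_drill and sha256_of are illustrative names, not part of any backup product.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_drill(original: Path, backup: Path) -> bool:
    """Restore the backup to a scratch directory and compare it with the original."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / original.name
        shutil.copy2(backup, restored)  # stand-in for the real restore step
        return sha256_of(restored) == sha256_of(original)
```

Run something like this on a schedule and alert when it returns False, so a broken backup is found during a drill and not during a recovery.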

And finally, a very good suggestion made to me by a colleague is "plan-for-failure!". At some point you'll need it.
