Operability

Operability... the smörgåsbord of non-functional requirements. Often boiled down to maintainability, reliability, availability and supportability, it's another area of non-functionals that is often forgotten. In part this may be because the lack of immediate need pushes such requirements down the priority list, but more frequently they're just plain missed. Whatever the reason, here's my list of operability topics...

Maintainability - The ability to maintain the system on an ongoing basis. Change happens, and from day one you can expect change cases to land, resulting in modifications to the baseline. Such change introduces entropy which, over time, will degrade the operation of the system unless dealt with. Software maintenance and refactoring may be required, as will corrective and preventative action (upgrades, patching, archiving etc.) to ensure smooth running of the system. This leads to a somewhat nebulous requirement to be "able to maintain the system", but in solution terms it simply means the system must be patched, monitored, house-kept etc. Configuration management requirements will sprout up to ensure the software can be maintained, as will procedures for change management. Some useful techniques for addressing these have become more fashionable recently, including continuous integration and automated testing, which can make the system more resilient to change - or at least make it easier to identify issues.

Reliability - Shit happens and at some point in time you will experience a failure. This may be due to the failure of some component in the system (identify those SPOFs - Single Points of Failure), to unexpected variances in input data, or simply to sustained operation over a long period of time (for which soak testing may be performed). SLAs can be defined for how much data loss can be tolerated (e.g. RPO - Recovery Point Objective) and how long it should take to recover in the event of a failure (e.g. RTO - Recovery Time Objective), which in turn drive backup and recovery schedules. Reliability is also tightly coupled with...

Availability - What are the required hours of operation? What maintenance windows need to be scheduled? Availability costs - you immediately start doubling up on resources as soon as you want 99.9% availability (just under 9 hours of downtime a year for a theoretical 24x7 system). If this includes scheduled maintenance time then you're going to need a parallel system to take the load during this window (redundant components etc.)... and if you're using horizontal scaling to achieve the desired performance and capacity then you can't rely on that same capacity to cover an outage - it's not going to be big enough (maybe...). Some components are difficult to make resilient to failures and maintenance (databases and data centres, for example), so eventually it gets silly expensive and you need to stop and accept some downtime.
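To put some numbers on that, here's a rough sketch of the arithmetic. The function names are mine, and the redundancy formula assumes the copies fail independently of each other, which is optimistic in practice (shared power, shared network, shared bugs), so treat the results as an upper bound rather than a promise.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_pct: float) -> float:
    """Downtime budget per year for a 24x7 system at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def redundant_availability(component_pct: float, copies: int) -> float:
    """Availability when any one of `copies` identical components is enough.

    Assumption: the copies fail independently of each other.
    """
    if copies < 1:
        raise ValueError("need at least one copy")
    unavailable = (1 - component_pct / 100) ** copies
    return (1 - unavailable) * 100


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% available -> {hours:.2f} hours of downtime a year")
    # Doubling up a 99% component, under the independence assumption above.
    print(f"two 99% copies -> {redundant_availability(99.0, 2):.2f}% available")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is roughly why the cost curve gets silly so quickly.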

Supportability - You need workable processes, procedures, skills and tools in place to deal with issues (help-desk, change management, service management etc.), and the system needs to log enough detail to allow problem determination (PD) down to the root cause. Monitoring can be introduced but needs to be appropriate - too many alerts and the service-desk won't handle them, too few and the service-desk won't know what's going on. I'm perhaps famed for error messages saying things like "this error should never occur because of xxx...". Some think they're worthless (since they should never occur) but I have more than once seen these in logs, and the inclusion of a detailed stack trace helps to PD the issue. I've also come across many more messages of the form "something went wrong" with absolutely no detail. These are the times when I feel like gouging out someone's eyes with spoons. Support also extends to skill-set availability. If you can't hire anyone to support the solution then it doesn't matter that it's the best thing since sliced bread, because you can't really use it. The cost of skills can vary considerably, and choosing technology you can afford to support needs to be considered up-front.
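For what it's worth, here's the sort of "should never occur" message I mean, sketched with Python's standard logging module. The logger name, function, order IDs and rates are all made up for illustration; the point is that the message says what happened, why it was thought impossible, and carries the stack trace with it.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical logger name

# Hypothetical lookup table, for illustration only.
SHIPPING_RATES = {"EU": 5.0, "US": 7.5}


def shipping_cost(order_id: str, region: str) -> float:
    """Return the shipping rate for a region, logging full context on failure."""
    try:
        return SHIPPING_RATES[region]
    except KeyError:
        # logger.exception logs at ERROR level and appends the stack trace.
        logger.exception(
            "No shipping rate for region %r on order %s. This should never "
            "occur because regions are validated at checkout. Known regions: %s",
            region,
            order_id,
            sorted(SHIPPING_RATES),
        )
        raise  # don't swallow it; let the caller decide what to do
```

Compare that with `logger.error("something went wrong")`, which tells whoever is on call precisely nothing.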

The sort of issues addressed by operability requirements occur infrequently but this rarity is also a very good reason why the procedures, technology and skills need to be exercised on a regular basis. If your backup solution doesn't work then you'll want to know before you have to rely on it. This alone should push such requirements back up the priority list.
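As a minimal sketch of what exercising a backup might look like: restore it somewhere safe, check the restored content against a checksum recorded when the backup was taken, and time the whole thing against your RTO. The names here are hypothetical and the file copy is only a stand-in for whatever your real restore procedure is.

```python
import hashlib
import shutil
import time
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum of a file's contents (reads the whole file into memory)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def restore_drill(backup: Path, restore_to: Path,
                  expected_sha256: str, rto_seconds: float) -> dict:
    """Restore a backup to a scratch location and report how it went.

    Assumption: `expected_sha256` was recorded when the backup was taken.
    """
    started = time.monotonic()
    shutil.copyfile(backup, restore_to)  # stand-in for the real restore step
    elapsed = time.monotonic() - started
    return {
        "content_matches": sha256_of(restore_to) == expected_sha256,
        "restore_seconds": elapsed,
        "within_rto": elapsed <= rto_seconds,
    }
```

Run something like this on a schedule and alert when `content_matches` or `within_rto` comes back false, so you find out on a quiet Tuesday rather than in the middle of an incident.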

And finally, a very good suggestion made to me by a colleague is "plan-for-failure!". At some point you'll need it.
