Interview Tales I – The Bathtub Curve

I’ve been to a few interviews recently, most of which have been bizarrely enjoyable affairs where I’ve had the opportunity to discover much about how things work elsewhere. However, one of them was with an organisation which has suffered some pretty high-profile system failures of late, which I timidly pointed out, hoping not to offend. The response was, in my view, both arrogant and ignorant – perhaps I did offend…

I was informed, rather snootily, that this incident was a one-off, having occurred just once in the interviewer’s 15+ years of experience, and couldn’t happen again. Hmm… I had raised the point having worked on a number of technology refresh projects and being familiar with the Bathtub Curve (shown below – image courtesy of the Engineering Statistics Handbook).

[Figure: the Bathtub Curve, from the Engineering Statistics Handbook]

What this shows is that failures are common during the early life of a system (new build, many defects etc.). Things then settle down to a fairly stable, but persistent, level of failures for the main life of the system before things start to wear out and the number of incidents increases again – ultimately becoming catastrophic.
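For the curious, the curve itself is easy enough to sketch in code. The snippet below is purely illustrative (the Weibull parameters are made up to give the characteristic shape and don’t come from any real system); it models the overall failure rate as the sum of a decreasing early-life rate, a constant useful-life rate and an increasing wear-out rate.

```python
# Illustrative bathtub curve: overall failure rate as the sum of three parts.
# All parameters are made up purely to produce the characteristic shape.

def weibull_hazard(t, shape, scale):
    """Weibull hazard rate: h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t):
    infant = weibull_hazard(t, shape=0.5, scale=10.0)     # early-life failures (decreasing)
    useful = 0.01                                          # steady failures during the main life
    wear_out = weibull_hazard(t, shape=5.0, scale=100.0)   # wear-out failures (increasing)
    return infant + useful + wear_out

if __name__ == "__main__":
    for t in (1, 5, 20, 50, 80, 100, 120):
        print(f"t={t:>3}  failure rate ~ {bathtub_hazard(t):.4f}")
```

Running it shows the rate starting high, flattening out through the middle of the system’s life and then climbing again towards the end, which is exactly the pattern in the figure above.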

This is kind of obvious for mechanical devices (cars and the like) but perhaps not so much for software. I still have an old ’80s book on software engineering which states that “software doesn’t decay!”. However, as pointed out previously, software is subject to change from a variety of sources; change brings decay, and decay increases failure rates. The Bathtub Curve applies.

Now, the reason I mentioned the failure in the first place was that the press coverage I had read pointed towards a combination of ageing systems and complex integration solutions holding things together. I was therefore expecting an answer along the lines of “yes, we need to work to make sure it doesn’t happen again” and “that’s why we’re hiring, because we need to address these issues”. The conversation could then have led on to my experiences on refresh projects, hurrah!… It didn’t work out like that, even though it did seem that the raison d’être behind the role itself was precisely that they didn’t have a good enough grip on the existing overall IT environment.

It’s entirely possible that the interviewer is correct (or gets lucky). However, given there have actually been a couple of such incidents at the same organisation recently – two individually unique issues, of course – I’m kind of suspicious that what they’re actually seeing is the start of the ramp-up in incidents typified by the Bathtub Curve. Time will tell.

I wasn’t offered the job, but then again, I didn’t want it either, so I think we’re happy going our separate ways.

Entropy – Part 2

A week or so ago I wrote a piece on entropy and how IT systems have a tendency for disorder to increase in a similar manner to the second law of thermodynamics. This article aims to identify what we can do about it…

It would be nice if there were some silver bullet, but the fact of the matter is that, as with the second law, the only real way to minimise disorder is to put some work in.

1. Housekeeping

As the debris of life slowly turns your pristine home into something more akin to the local dump, so the daily churn of changes gradually slows and destabilises your previously spotless new IT system. The solution is to crack on with the weekly chore of housekeeping in both cases (or possibly daily if you’ve kids, cats, dogs etc.). It’s often overlooked and forgotten, but a lack of housekeeping is frequently the cause of unnecessary outages.

Keeping logs clean and cycling them on a regular basis (e.g. hoovering), monitoring disk usage (e.g. checking you’ve enough milk), cleaning up temporary files (e.g. discarding those out-of-date tins of sardines), refactoring code (e.g. a spring clean) and so on is not difficult and there’s little excuse for not doing it. Reviewing the content of logs and gathering metrics on usage and performance can also help anticipate how frequently housekeeping is required and ensure the smooth running of the system (e.g. you could measure the amount of fluff hoovered up each week and use this as the basis to decide which days, and how frequently, the hoovering needs doing – good luck with that one!). This can also lead to additional work to introduce archiving capabilities (e.g. self storage) or purging of redundant data (e.g. taking the rubbish down the dump). But like your home, a little housekeeping done frequently is less effort (cost) than waiting till you can’t get into the house because the door’s jammed and the men in white suits and masks are threatening to come in and burn everything.
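To make this concrete, here’s a minimal sketch of the sort of scheduled housekeeping job I have in mind. The directories, retention period and disk threshold are all invented for the example; a real system would have its own equivalents and would typically run something like this from cron or a scheduler.

```python
# Minimal housekeeping sketch: compress old logs, purge stale temporary files
# and warn when disk usage creeps up. Paths and thresholds are illustrative only.
import gzip
import shutil
import time
from pathlib import Path

LOG_DIR = Path("/var/log/myapp")   # hypothetical application log directory
TMP_DIR = Path("/tmp/myapp")       # hypothetical temporary file directory
MAX_AGE_DAYS = 14                  # keep two weeks of uncompressed material
DISK_WARN_PERCENT = 80             # shout before the disk actually fills

def compress_old_logs():
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for log in LOG_DIR.glob("*.log"):
        if log.stat().st_mtime < cutoff:
            with log.open("rb") as src, gzip.open(f"{log}.gz", "wb") as dst:
                shutil.copyfileobj(src, dst)   # archive the old log
            log.unlink()                       # then remove the original

def purge_temp_files():
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for tmp in TMP_DIR.iterdir():
        if tmp.is_file() and tmp.stat().st_mtime < cutoff:
            tmp.unlink()

def check_disk_usage():
    usage = shutil.disk_usage(LOG_DIR)
    percent_used = usage.used / usage.total * 100
    if percent_used > DISK_WARN_PERCENT:
        print(f"WARNING: {percent_used:.0f}% of disk in use on {LOG_DIR}")

if __name__ == "__main__":
    compress_old_logs()
    purge_temp_files()
    check_disk_usage()
```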

2. Standards Compliance

By following industry standards you stand a significantly better chance of being able to patch/upgrade/enhance without pain in the future than if you decide to do your own thing.

That should be enough said on the matter, but the number of times I see teams misusing APIs or writing their own solutions to what are common problems is frankly staggering. We (and I especially) all like to build our own palaces. Unfortunately we lack sufficient exposure to the problem space to be able to produce designs which combine elegance with the flexibility to address the full range of use cases, or the authority and foresight to predict the future and influence it in a meaningful way. In short, standards are generally thought out by better people than you or me.

Once a standard is established, any future work will usually try to build on it or provide a roadmap for moving from the old standard to the new.

3. Automation

The ability to repeatedly and reliably build the system decreases effort (cost) and improves quality and reliability. Any manual step in the build process will eventually lead to some degree of variance with potentially unquantifiable consequences. There are numerous tools available to help with this (e.g. Jenkins), though unfortunately usage of such tools is not as widespread as you would hope.
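Even without a full CI server the principle is the same: have the machine perform identical steps in an identical order every time, so there’s nothing left for a human to get subtly wrong. A rough sketch, with entirely hypothetical build and test commands:

```python
# Minimal repeatable-build sketch. The commands are placeholders for whatever
# your real toolchain uses; the point is that every run does the same thing.
import subprocess
import sys

BUILD_STEPS = [
    ["git", "clean", "-fdx"],   # start from a known-clean workspace
    ["make", "build"],          # hypothetical build target
    ["make", "test"],           # hypothetical test target
]

def run_build():
    for step in BUILD_STEPS:
        print("Running:", " ".join(step))
        if subprocess.run(step).returncode != 0:
            sys.exit(f"Step failed: {' '.join(step)}")

if __name__ == "__main__":
    run_build()
```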

But perhaps the real killer feature is test automation, which enables you to continuously execute tests against the system at negligible cost when compared to maintaining a 24×7 human test team. With this in place (and getting the right test coverage is always an issue) you can exercise the system in any number of hypothetical scenarios to identify issues, both functional and non-functional, in a test environment before the production environment becomes compromised.
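By way of illustration, a tiny automated suite using pytest might look like the one below. The function under test and the business rule are invented for the example; what matters is that the same scenarios get exercised on every single build for next to nothing.

```python
# Illustrative test automation with pytest. calculate_discount is a made-up
# stand-in for whatever business logic the real system contains.
import pytest

def calculate_discount(order_total, is_member):
    """Hypothetical rule: members get 10% off orders over 100."""
    if is_member and order_total > 100:
        return round(order_total * 0.9, 2)
    return order_total

@pytest.mark.parametrize("total, member, expected", [
    (50, False, 50),       # small order, non-member: no discount
    (50, True, 50),        # small order, member: still no discount
    (150, False, 150),     # large order, non-member: no discount
    (150, True, 135.0),    # large order, member: 10% off
    (100, True, 100),      # boundary case: exactly 100 gets no discount
])
def test_discount_scenarios(total, member, expected):
    assert calculate_discount(total, member) == expected
```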

Computers are very good at doing repetitive tasks consistently. Humans are very good at coming up with new and creative test cases. Use each appropriately.

Much like housekeeping, frequent testing yields benefits at lower cost than simply waiting till the next major release, when all sorts of issues will be uncovered and need to be addressed – many of which may have been around a while though no-one noticed… because no-one tested. Regular penetration testing and review of security procedures will help to proactively avoid vulnerabilities as they are uncovered in the wild, and regular testing of new browsers will help identify compatibility issues before your end-users do. There are some tools to help automate in this space (e.g. Security AppScan and WebDriver), though clearly it does cost to run and maintain such a continuous integration and testing regime. However, so long as the focus is correct and pragmatic, the cost benefits should be realised.
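For the browser-compatibility side of things, a minimal WebDriver smoke test might look something like this. The URL and expected page title are placeholders; the idea is simply that the same check runs against every browser you care about before your end-users get anywhere near it.

```python
# Minimal Selenium WebDriver sketch for a cross-browser smoke test.
# The URL and expected title are hypothetical placeholders.
from selenium import webdriver

BROWSERS = {
    "firefox": webdriver.Firefox,
    "chrome": webdriver.Chrome,
}

def smoke_test(make_driver):
    driver = make_driver()
    try:
        driver.get("https://example.com/login")  # hypothetical page under test
        assert "Login" in driver.title           # hypothetical expected title
    finally:
        driver.quit()                            # always release the browser

if __name__ == "__main__":
    for name, make_driver in BROWSERS.items():
        smoke_test(make_driver)
        print(f"{name}: OK")
```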

4. Design Patterns

Much like standards compliance, use of design patterns and good practices such as abstraction, isolation and dependency injection can help to ensure changes in the future can be accommodated with minimal effort. I mention this separately, though, since the two should not be confused. Standards may (or may not) adopt good design patterns, and equally non-standard solutions may (or may not) adopt good design patterns – there are no guarantees either way.

Using design patterns also increases the likelihood that the next developer to come along will be able to pick up the code with greater ease than if it’s some weird hare-brained spaghetti bowl of nonsense made up after a rather excessive liquid lunch. Dealing with the daily churn of changes becomes easier, maintenance costs come down and incidents are reduced.
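To illustrate the dependency-injection point, here’s a small sketch with classes invented purely for the example. Because the report generator depends on an abstraction rather than constructing its own data source, the source can be swapped, or stubbed out for testing, without touching the generator at all.

```python
# Illustrative dependency injection: ReportGenerator receives its DataSource
# rather than creating one, so implementations can be swapped freely.
from abc import ABC, abstractmethod

class DataSource(ABC):
    @abstractmethod
    def fetch(self) -> list[dict]:
        ...

class DatabaseSource(DataSource):
    def fetch(self) -> list[dict]:
        # in real life this would query a database
        return [{"item": "widget", "qty": 3}]

class StubSource(DataSource):
    """Swapped in for tests without changing ReportGenerator."""
    def fetch(self) -> list[dict]:
        return [{"item": "test", "qty": 1}]

class ReportGenerator:
    def __init__(self, source: DataSource):  # the dependency is injected here
        self.source = source

    def build(self) -> str:
        rows = self.source.fetch()
        return "\n".join(f"{row['item']}: {row['qty']}" for row in rows)

if __name__ == "__main__":
    print(ReportGenerator(DatabaseSource()).build())  # production wiring
    print(ReportGenerator(StubSource()).build())      # test wiring
```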

So in summary, entropy should be considered a BAU (Business as Usual) issue and practices should be put in place to deal with it. Housekeeping, standards compliance, automation through continuous integration and the use of design patterns all help to minimise the impact of change and keep the level of disorder down.

Next time, some thoughts on how to measure entropy in the enterprise…

Entropy

Entropy is a measure of disorder in a system. A number of years ago I was flicking through an old book on software engineering from the 1970s. Beyond being a right riveting read, it expounded the view that software does not suffer from decay: that, once set, software programs would follow the same rules over and over and produce the same results time and again, ad infinitum. In effect, software was free from decay.

I would like to challenge this view.

We design a system, spend many weeks and months considering every edge case, crafting the code so we’ve handled every possible issue nature can throw at us, including those “this exception can never happen but just in case it does…” scenarios. We test it till we can test no more without actually going live, and then release our latest, most wondrous creation on the unsuspecting public. It works, and for a fleeting moment all is well with the universe… From this moment on, decay eats away at our precious creation like rats gnawing at the discarded carcass of the Sunday roast.

Change is pervasive, and whilst it seems reasonable enough that were we able to precisely reproduce the starting conditions the program would run time and again as it did the first time, this isn’t strictly correct for reasons of quantum mechanics and our inability to time travel (at least so far as we know today). However, I’ll ignore the effects of quantum mechanics and time travel for now and focus on the more practical reasons for change and how this causes decay and increasing entropy in computer systems.

Firstly, there’s the general use of the system. Most systems have some sort of data store, if only for logging, and data is collected in increasing quantities and in a greater variety of combinations over time. This can lead to permutations which were never anticipated, exposing functional defects, or to volumes beyond the planned capacity of the system. The code may remain the same, but when we look at a system and consider it as an atomic unit in its entirety, it is continuously changing. Its subsequent behaviour becomes increasingly unpredictable.

Secondly there’s the environment the system exists within – most of which is totally beyond any control. Patches for a whole stack of components are continually released from the hardware up. The first response from most first-line support organisations is “patch to latest level” (which is much easier said than done) but if you do manage to keep up with the game then these patches will affect how the system runs.

Conversely, if you don’t patch then you leave yourself vulnerable to the defects that the patches were designed to resolve. The knowledge that the defect itself exists changes the environment in which the system runs because now the probability that someone will try to leverage the defect is significantly increased – which again increases the uncertainty over how the system will operate. You cannot win and the cost of doing nothing may be more than the cost of addressing the issue.

Then there’s change that we inflict ourselves.

If you’re lucky and the system has been a success then new functional requirements will arise – this is a good thing (perhaps one for later, but a system which does not functionally evolve is a dead end and essentially a failure – call it a “panda” if you wish). The business will invent new and better ways to get the best out of the system, new use cases which can be fulfilled become apparent, and a flurry of activity follows. All of which changes the original system.

There’s also change to non-functional requirements. Form needs a refresh every 18 months or so, security defects need to be fixed (really, they do!), performance and capacity improvements may be needed, and the whole physical infrastructure needs to be refreshed periodically. The simple act of converting a physical server to virtual (aka P2V conversion), which strives to keep the existing system as close to its current state as possible, detritus and all, will typically provide more compute, RAM and disk than was ever considered possible. Normally this makes the old application run so much faster than before, but occasionally that speed increase can have devastating effects on the function of time-sensitive applications. Legislative requirements, keeping compliant with the latest browsers etc. all bring more change…

Don’t get me wrong, change is normally a good thing and the last thing we want is a world devoid of change. The problem is that all this change increases the disorder (entropy) of the system. Take the simple case of a clean OS install. On day 1 the system is clean and well ordered; examining the disk and logs shows a tidy registry and clean log and temporary directories. Day 2 brings a few patches, which add registry entries, some logs, a few downloads etc., but it’s still good enough. By day n, though, you’ve a few hundred patches installed, several thousand log files and a raft of old downloads and temporary files lying around.

The constant state of flux means that IT systems are essentially subject to the same tendency for disorder to increase as stated in the second law of thermodynamics. Change unfortunately brings disorder and complexity. Disorder and complexity make things harder to maintain and manage, increasing fragility and instability. Increased management effort results in increased costs.

Well, that’s enough for today… next week, what we can do about it.