Power to the People

Yesterday I received my usual gas and electricity bill from my supplier, with the not-so-usual increase to my monthly direct debit of a nice round 100%! 100% on top of what is already more than I care for… joy!

What followed was the all-too-familiar vent-spleen / spit-feathers routine before the situation was resolved by a very nice customer services representative who had clearly seen this before… hmm…

So, as I do, I ponder darkly on how such a situation could have arisen. And as an IT guy, I ponder darkly about how said situation came about through IT (oh what a wicked web we weave)… Ok, so pure conjecture, but this lot have previous…

100%! What on earth convinced them to add 100%? Better still, what convinced them to add 100% when I was in fact in credit and they had just reimbursed me £20 as a result?…

Customer service rep: It’s the computers you see sir.

Me: The computers?

CSR: Well because they reimbursed you they altered your direct-debit amount.

Me: Yeah, ok, so they work out that I’m paying too much and reduce my monthly which would kind of make sense but it’s gone up! Up by 100%!

CSR: Well er yes. But I can fix that and change it back to what it was before…

Me: Yes please do!

CSR: Can I close the complaint now?

Me: Well, you can’t do much else can you? But really, you need to speak to your IT guys because this is just idiotic…

(more was said but you get the gist).

So, theories for how this came about:

  1. Some clever-dick specified a requirement that if they refund some money, then claw it back ASAP by increasing the monthly DD by 100%!
  2. A convoluted matrix exists – a bit like their pricing structure, virtually impossible to comprehend – detailing how and when to apply various degrees of adjustment to DD amounts, with near-infinite paths that cannot be proven via any currently known mathematics on the face of this good earth.
  3. A defect exists somewhere in the code.

If 1 or 2, then sack the idiots who came up with such a complete mess – probably the same lot who dream up energy pricing models, so a “win-win” as they say!

If 3 then, well, shit happens. Bugs happen. Defects exist; god knows I’ve been to root-cause of many…

It’s just that this isn’t the first time, or the second, or the third…

(start wavy dreamy lines)

The last time, they threatened to take me to court because I supposedly wouldn’t let their meter maid in, when they’d already been once and hadn’t even tried again… and since they couldn’t reconcile two different “computer systems” properly, it kept on bitching until it ratcheted up to that “sue the bastards” level. Nice.

(end wavy dreamy lines)

… and this is such a simple thing to test for. You’ve just got to pump in data for a bunch of test scenarios and look at the results – the same applies to that beastly matrix! … or you’ve got to do code reviews and look for that “if increase < 1.0 then surely it’s wrong and add 1.0 to make it right” or “increase by current/current + (1.0*current/current)”.
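By way of illustration only (the function name, rules and bounds below are all invented; I have no idea what their code actually looks like), a handful of table-driven scenarios run on every build would catch this class of nonsense long before it reached a customer:

```python
# Hypothetical sketch: the supplier's real code and rules are unknown to me.
# The point is simply that a small table of scenarios, run automatically,
# would flag a "refund => double the direct debit" rule immediately.

def revised_direct_debit(current_dd, account_balance):
    """Toy stand-in for the supplier's adjustment logic (assumed behaviour)."""
    if account_balance > 0:                    # customer in credit
        return max(current_dd * 0.9, 5.0)      # expect a modest reduction, not a doubling
    return current_dd

# scenario table: (current DD, balance, lowest acceptable DD, highest acceptable DD)
scenarios = [
    (100.0,  20.0,  50.0, 100.0),   # in credit after a £20 refund: should not increase
    (100.0,   0.0,  50.0, 110.0),   # balance roughly level: small drift at most
    (100.0, -80.0, 100.0, 200.0),   # genuinely in debt: an increase is fair game
]

for current, balance, low, high in scenarios:
    new_dd = revised_direct_debit(current, balance)
    assert low <= new_dd <= high, (
        f"DD jumped from {current} to {new_dd} with balance {balance}"
    )
print("all direct debit scenarios within sane bounds")
```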

So: bad test practices and/or bad coding practices – or, I could also accept, really bad architecture which can’t be implemented, really bad project management which skips anything resembling best practice, or really bad leadership which doesn’t give a toss about the customer – and really bad operations and maintenance regardless, because they know they’ve got some pretty basic problems and yet clearly can’t bring themselves to get them fixed.

It all comes down to money, and with practices like this they’ll be losing customers hand over fist (they do, apparently, have the worst customer satisfaction ratings).

They could look at agile, continuous integration, test automation, DevOps, TDD and BDD practices, as is the rage, but they need to brush away the fairy dust that often accompanies these concepts (concepts I generally support, incidentally) and realise this does not mean you can abandon all sanity and give up on the basic principles of testing and coding!

If anything, these concepts weigh even more heavily on such fundamentals – more detailed tracking of delivery progress and performance, pair programming, reviewing test coverage, using tests to drive development, automating build and test as much as possible to improve consistency and quality, getting feedback from operational environments so you detect and resolve issues faster, continuous improvement and so on.

Computer systems are more complex and more depended on by society than ever before. They change at an ever-increasing rate, interact with other systems in a constantly changing environment, and are managed by disparate teams spread across the globe who come and go with the prevailing technological wind. Your customers and your business rely on them 100%: at best to get by, at worst to exist. You cannot afford not to do it right!

I’m sure the issues are more complex than this and there are probably some institutionalised problems preventing efficient resolution, more’s the pity. But hey, off to search for a new energy provider…

Chaos Monkey

I’ve had a number of discussions in the past about how we should be testing failover and recovery procedures on a regular basis – to make sure they work and everyone knows what to do, so you’re not caught out when it happens for real (which will be at the worst possible moment). Scheduling these tests, even in production, is (or should be) possible at some convenient(ish) time. If you think it isn’t, then you’ve already got a resiliency problem (you’ll be down when some component fails) as well as a maintenance problem.

I’ve also talked (ok, muttered) about how a healthy injection of randomness can actually improve stability, resilience and flexibility. Something covered by Nassim Taleb in his book Antifragile.

Anyway, beaten to the punch again: Netflix developed a tool called Chaos Monkey (part of their Simian Army) a few years back which randomly kills elements of the infrastructure to help identify weak points. Well worth checking out on codinghorror.com.

For the record… I’m not advocating that you use Chaos Monkey in production… Just that it’s a good way to test the resiliency of your environment and identify potential failure points. You should be testing procedures in production in a more structured manner.

Entropy – Part 2

A week or so ago I wrote a piece on entropy and how IT systems have a tendency for disorder to increase in a similar manner to the second law of thermodynamics. This article aims to identify what we can do about it…

It would be nice if there were some silver bullet but, as with the second law itself, the fact of the matter is that the only real way to minimise disorder is to put some work in.

1. Housekeeping

As the debris of life slowly turns your pristine home into something more akin to the local dump, so the daily churn of changes gradually slows and destabilises your previously spotless new IT system. The solution in both cases is to crack on with the weekly chore of housekeeping (or possibly daily if you’ve kids, cats, dogs etc.). It’s often overlooked and forgotten, but a lack of housekeeping is frequently the cause of unnecessary outages.

Keeping logs clean and cycled on a regular basis (e.g. hoovering), monitoring disk usage (e.g. checking you’ve enough milk), cleaning up temporary files (e.g. discarding those out-of-date tins of sardines), refactoring code (e.g. a spring clean) etc. is not difficult and there’s little excuse for not doing it. Reviewing the content of logs and gathering metrics on usage and performance can also help anticipate how frequently housekeeping is required to ensure smooth running of the system (e.g. you could measure the amount of fluff hoovered up each week and use this as the basis to decide which days, and how frequently, the hoovering needs doing – good luck with that one!). This can also lead to additional work to introduce archiving capabilities (e.g. self storage) or purging of redundant data (e.g. taking the rubbish down the dump). But like your home, a little housekeeping done frequently is less effort (cost) than waiting till you can’t get into the house because the door’s jammed and the men in white suits and masks are threatening to come in and burn everything.
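To make the log-cycling part concrete, here’s a minimal sketch using Python’s standard logging module (the file name, sizes and retention below are purely illustrative, not a recommendation):

```python
# Minimal log-cycling sketch using Python's standard library.
# Paths and thresholds are illustrative only.
import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger("billing")
logger.setLevel(logging.INFO)

# Keep at most 5 old log files of ~10 MB each; anything older is discarded
# automatically, so the disk never silently fills up with ancient logs.
handler = RotatingFileHandler(
    "billing.log", maxBytes=10 * 1024 * 1024, backupCount=5
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.info("direct debit recalculated")  # example entry
```

The same little-and-often principle applies whether it’s logrotate on a server, an archiving job against the database, or the hoovering.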

2. Standards Compliance

By following industry standards you stand a significantly better chance of being able to patch/upgrade/enhance without pain in the future than if you decide to do your own thing.

That should be enough said on the matter, but the number of times I see teams misusing APIs or writing their own solutions to what are common problems is frankly staggering. We all (and me especially) like to build our own palaces. Unfortunately we lack sufficient exposure to the problem space to produce designs which combine elegance with the flexibility to address the full range of use cases, or the authority and foresight to predict the future and influence it in a meaningful way. In short, standards are generally thought out by better people than you or me.

Once a standard is established, any future work will usually build on it or provide a roadmap for moving from the old standard to the new.

3. Automation

The ability to repeatedly and reliably build the system decreases effort (cost) and improves quality and reliability. Any manual step in the build process will eventually lead to some degree of variance with potentially unquantifiable consequences. There are numerous tools available to help with this (e.g. Jenkins) though unfortunately usage of such tools is not as widespread as you would hope.

But perhaps the real killer feature is test automation, which enables you to continuously execute tests against the system at negligible cost (compared to maintaining a 24×7 human test team). With this in place (and getting the right test coverage is always an issue) you can exercise the system in any number of hypothetical scenarios to identify issues, both functional and non-functional, in a test environment before the production environment becomes compromised.
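At its most basic this can be as simple as a scripted build-and-test step that a CI server such as Jenkins runs on every commit. The sketch below is illustrative only, with placeholder commands standing in for whatever your build and test tooling actually is:

```python
# Sketch of a scripted build-and-test step of the kind a CI server such as
# Jenkins could run on every commit. The commands are placeholders for your
# actual build and test tooling.
import subprocess
import sys

steps = [
    ["make", "build"],              # compile / package the system (placeholder)
    ["make", "unit-test"],          # fast functional checks (placeholder)
    ["make", "integration-test"],   # slower end-to-end scenarios (placeholder)
]

for step in steps:
    print("running:", " ".join(step))
    result = subprocess.run(step)
    if result.returncode != 0:
        sys.exit(result.returncode)  # fail fast so the build goes red immediately

print("build and tests passed")
```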

Computers are very good at doing repetitive tasks consistently. Humans are very good at coming up with new and creative test cases. Use each appropriately.

Much like housekeeping, frequent testing yields benefits at lower cost than simply waiting till the next major release, when all sorts of issues will be uncovered and need to be addressed – many of which may have been around a while though no-one noticed… because no-one tested. Regular penetration testing and review of security procedures will help to proactively avoid vulnerabilities as they are uncovered in the wild, and regular testing of new browsers will help identify compatibility issues before your end-users do. There are some tools to help automate in this space (e.g. Security AppScan and WebDriver), though clearly it does cost to run and maintain such a continuous integration and testing regime. However, so long as the focus is correct and pragmatic, the cost benefits should be realised.
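For example, a minimal WebDriver check (using Selenium’s Python bindings; the URL and element below are made up purely for illustration) of the kind that can be re-run automatically against each new browser release:

```python
# Minimal browser compatibility smoke test using Selenium WebDriver
# (Python bindings). The URL and the checked element are illustrative only.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()   # swap for webdriver.Chrome(), Edge, etc.
try:
    driver.get("https://example.com/account/login")
    assert "Login" in driver.title                    # did the page render at all?
    field = driver.find_element(By.ID, "username")    # is the key element present?
    assert field.is_displayed()
finally:
    driver.quit()
```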

4. Design Patterns

Much like standards compliance, use of design patterns and good practices such as abstraction, isolation and dependency injection can help to ensure changes in the future can be accommodated with minimal effort. I mention this separately, though, since the two should not be confused. Standards may (or may not) adopt good design patterns, and equally non-standard solutions may (or may not) adopt good design patterns – there are no guarantees either way.

Using design patterns also increases the likelihood that the next developer to come along will be able to pick up the code with greater ease than if it’s some weird hare-brained spaghetti bowl of nonsense made up after a rather excessive liquid lunch. Dealing with the daily churn of changes becomes easier, maintenance costs come down and incidents are reduced.
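To illustrate the dependency injection point with an entirely invented example (nothing to do with any real billing system): pass the dependency in rather than hard-wiring it, and both testing and future change get considerably easier:

```python
# Invented example of constructor-based dependency injection: the calculator
# depends on an abstraction, so the notification mechanism (or rules source,
# or database) can be swapped without touching the calculation code.
from abc import ABC, abstractmethod

class Notifier(ABC):
    @abstractmethod
    def send(self, message: str) -> None: ...

class EmailNotifier(Notifier):
    def send(self, message: str) -> None:
        print(f"emailing customer: {message}")   # real mail call would go here

class RecordingNotifier(Notifier):
    """Test double: records messages instead of sending anything."""
    def __init__(self) -> None:
        self.messages: list[str] = []
    def send(self, message: str) -> None:
        self.messages.append(message)

class DirectDebitCalculator:
    def __init__(self, notifier: Notifier) -> None:
        self.notifier = notifier                 # injected, not hard-wired

    def adjust(self, current: float, balance: float) -> float:
        new_amount = current if balance >= 0 else current + abs(balance) / 12
        self.notifier.send(f"your direct debit is now £{new_amount:.2f}")
        return new_amount

# Production wiring vs. test wiring: only the injected dependency changes.
DirectDebitCalculator(EmailNotifier()).adjust(100.0, 20.0)
recorder = RecordingNotifier()
DirectDebitCalculator(recorder).adjust(100.0, -60.0)
assert recorder.messages == ["your direct debit is now £105.00"]
```

The calculation logic never needs to change when the notification mechanism does, and the test double makes the behaviour trivial to verify.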

So in summary, entropy should be considered a BAU (business as usual) issue and practices should be put in place to deal with it. Housekeeping, standards compliance, automation through continuous integration, and use of design patterns all help to minimise the impact of change and keep the level of disorder down.

Next time, some thoughts on how to measure entropy in the enterprise…