Skip to main content

Entropy - Part 2

A week or so ago I wrote a piece on entropy and how IT systems have a tendency for disorder to increase in a similar manner to the second law of thermodynamics. This article aims to identify what we can do about it...

It would be nice if there was some silver bullet but the fact of the matter is that; like the second law, the only real way to minimise disorder is to put some work in.

1. Housekeeping

As the debris of life slowly turns your pristine home into something more akin to the local dump, so the daily churn of changes gradually slows and destabilises your previously spotless new IT system. The solution is to crack on with the weekly chore of housekeeping in both cases (or possibly daily if you've kids, cats, dogs etc.). It's often overlooked and forgotten but a lack of housekeeping is frequently the cause of unnecessary outages.

Keeping logs clean and cycling on a regular basis (e.g. hoovering), monitoring disk usage (e.g. checking you've enough milk), cleaning up temporary files (e.g. discarding those out of date tins of sardines), refactoring code (e.g. a spring clean) etc. is not difficult and there's little excuse for not doing it. Reviewing the content of logs and gathering metrics on usage and performance can also help anticipate how frequently housekeeping is required ensure smooth running of the system (e.g. you could measure the amount of fluff hoovered up each week and use this as the basis to decide which days and how frequently the hoovering needs doing - good luck with that one!). This can also lead to additional work to introduce archiving capabilities (e.g. self storage) or purging of redundant data (e.g. taking the rubbish down the dump). But like your home, a little housekeeping done frequently is less effort (cost) than waiting till you can't get into the house because the doors jammed and the men in white suits and masks are threatening to come in and burn everything.

2. Standards Compliance

By following industry standards you stand a significantly better chance of being able to patch/upgrade/enhance without pain in the future than if you decide to do your own thing.

That should be enough said on the matter but the number of times I see teams misusing APIs or writing their own solutions to what are common problems is frankly staggering. We (and me especially) all like to build our own palaces. Unfortunately we lack sufficient exposure to the space of a problem to be able to produce designs which combines elegance with flexibility to address the full range of use cases or the authority and foresight to predict the future and influence this in a meaningful way. In short, standards are generally thought out by better people than you or me.

Once a standard is established then any future work will usually try to build on this or provide a roadmap of how to move from the old standard to the new.

3. Automation

The ability to repeatedly and reliably build the system decreases effort (cost) and improves quality and reliability. Any manual step in the build process will eventually lead to some degree of variance with potentially unquantifiable consequences. There are numerous tools available to help with this (e.g. Jenkins) though unfortunately usage of such tools is not as widespread as you would hope.

But perhaps the real killer feature is test automation which enables you to continuously execute tests against the system at comparatively negligible cost (when compared to maintaining a 24x7 human test team). With this in place (and getting the right test coverage is always an issue) you can exercise the system in any number of hypothetical scenarios to identify issues; both functional and non-functional, in a test environment before the production environment becomes compromised.

Computers are very good at doing repetitive tasks consistently. Humans are very good at coming up with new and creative test cases. Use each appropriately.

Much like housekeeping, frequent testing yields benefits at lower cost than simply waiting till the next major release when all sorts of issues will be uncovered and need to be addressed - many of which may have been around a while though no-one noticed... because no-one tested. Regular penetration testing and review of security procedures will help to proactively avoid vulnerabilities as they are uncovered in the wild, and regular testing of new browsers will help identify compatibility issues before your end-users do. There are some tools to help automate in this space (e.g. Security AppScan and WebDriver) though clearly it does cost to run and maintain such a continuous integration and testing regime. However, so long as the focus is correct and pragmatic then the cost benefits should be realised.

4. Design Patterns

Much like standards compliance, use of design patterns and good practices such as abstraction, isolation and dependency injection can help to ensure changes in the future can be accommodated at minimal effort. I mention this separately though since the two should not be confused. Standards may (or may not) adopt good design patterns and equally non-standard solutions may (or may not) adopt good design patterns - there are no guarantees either way.

Using design patterns also increases the likelihood that the next developer to come along will be able to pick up the code with greater ease than if it's some weird hair-brained spaghetti bowl of nonsense made up after a rather excessive liquid lunch. Dealing with the daily churn of changes becomes easier, maintenance costs come down and incidents are reduced.

So in summary, entropy should be considered a BAU (Business as Usual) issue and practices should be put in place to deal with it. Housekeeping, standards-compliance, automation through continuous integration and use of design patterns all help to keep the impact of change minimised and keep the level of disorder down.

Next time, some thoughts on how to measure entropy in the enterprise...


  1. […] Well, that’s enough for today.. next week, what we can do about it… […]


Post a comment

Popular posts from this blog

An Observation

Much has changed in the past few years, hell, much has changed in the past few weeks, but that’s another story... and I’ve found a little time on my hands in which to tidy things up. The world of non-functionals has never been so important and yet remains irritatingly ignored by so many - in particular by product owners who seem to think NFRs are nothing more than a tech concern. So if your fancy new product collapses when you get get too many users, is that ok? It’s fair that the engineering team should be asking “how many users are we going to get?”,   or “how many failures can we tolerate?” but the only person who can really answer those questions is the product owner.   The dumb answer to these sort of question is “lots!”, or “none!” because at that point you’ve given carte-blanche to the engineering team to over engineer... and that most likely means it’ll take a hell of a lot longer to deliver and/or cost a hell of a lot more to run. The dumb answer is also “only a couple” and “

Inter-microservice Integrity

A central issue in a microservices environment is how to maintain transactional integrity between services. The scenario is fairly simple. Service A performs some operation which persists data and at the same time raises an event or notifies service B of this action. There's a couple of failure scenarios that raise a problem. Firstly, service B could be unavailable. Does service A rollback or unpick the transaction? What if it's already been committed in A? Do you notify the service consumer of a failure and trigger what could be a cascading failure across the entire service network? Or do you accept long term inconsistency between A & B? Secondly, if service B is available but you don't commit in service A before raising the event then you've told B about something that's not committed... What happens if you then try to commit in A and find you can't? Do you now need to have compensating transactions to tell service B "oops, ignore that previous messag

Equifax Data Breach Due to Failure to Install Patches

"the Equifax data compromise was due to their failure to install the security updates provided in a timely manner." Source: MEDIA ALERT: The Apache Software Foundation Confirms Equifax Data Breach Due to Failure to Install Patches Provided for Apache® Struts™ Exploit : The Apache Software Foundation Blog As simple as that apparently. Keep up to date with patching.