
Interview Tales I - The Bathtub Curve

I've been to a few interviews recently, most of which have been bizarrely enjoyable affairs where I've had the opportunity to discover much about how things work elsewhere. However, I recently went to an interview at an organisation which has suffered some pretty high-profile system failures recently, which I timidly pointed out, hoping not to offend. The response was, in my view, both arrogant and ignorant - perhaps I did offend...

I was informed, rather snootily, that this incident was a one-off, having occurred just once in the 15+ year experience of the interviewer, and couldn't happen again. Hmm... I raised the point having worked on a number of technology refresh projects and being familiar with the Bathtub Curve (shown below - image courtesy of the Engineering Statistics Handbook).


What this shows is that during the early life of a system failures are common (new build, many defects, etc.). Things then settle down to a fairly stable, but persistent, level of failures for the main life of the system, before things start to wear out and the number of incidents increases again - ultimately becoming catastrophic.
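The shape of the curve can be sketched numerically. One common way to model it - an illustration on my part, not anything from the handbook image above - is to sum three Weibull hazard functions: a decreasing one for early-life (infant mortality) failures, a constant one for the stable useful-life period, and an increasing one for wear-out. The parameter values below are invented purely to produce the characteristic shape:

```python
# Illustrative sketch of a bathtub-shaped failure rate, modelled as the sum
# of three Weibull hazards. Shape k < 1 gives a decreasing hazard (infant
# mortality), k = 1 a constant hazard (useful life), k > 1 an increasing
# hazard (wear-out). All parameter values here are made up for illustration.

def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k/lam) * (t/lam)**(k-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t):
    early   = weibull_hazard(t, shape=0.5, scale=20.0)  # decreasing: defects shaken out
    useful  = weibull_hazard(t, shape=1.0, scale=10.0)  # constant: background failures
    wearout = weibull_hazard(t, shape=5.0, scale=15.0)  # increasing: ageing and decay
    return early + useful + wearout

if __name__ == "__main__":
    # Failure rate is high early on, dips through mid-life, then climbs again.
    for t in (1, 5, 10, 14):
        print(f"t={t:2d}  h(t)={bathtub_hazard(t):.3f}")
```

Printing the hazard at a few points in the system's life shows the rate dipping through mid-life and climbing again towards the end - the bathtub.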

This is kind of obvious for mechanical devices (cars and the like) but perhaps not so much for software. I still have an old '80s book on software engineering which states that "software doesn't decay!". However, as pointed out previously, software is subject to change from a variety of sources; change brings decay, and decay increases failure rates. The Bathtub Curve applies.

Now, the reason I mentioned the failure in the first place was that the press coverage I had read pointed towards a combination of ageing systems and complex integration solutions holding things together. I was therefore expecting an answer along the lines of "yes, we need to work to make sure it doesn't happen again" and "that's why we're hiring, because we need to address these issues". This could then lead on, and I could relate my experiences on refresh projects, hurrah!... It didn't work out like that, even though it did seem that the raison d'ĂȘtre behind the role itself was precisely that they didn't have a good enough grip on the existing overall IT environment.

It's entirely possible that the interviewer is correct (or gets lucky). However, given there have actually been a couple of such incidents at the same organisation recently - two individually unique issues, of course - I'm kind of suspicious that what they're actually seeing is the start of the ramp-up in incidents typified by the Bathtub Curve. Time will tell.

I wasn't offered the job, but then again, I didn't want it either so I think we're happy going our separate ways.
