
Instrumentation Revolution

It's long been good practice to include some sort of tracing in code to help diagnose problems as and when they arise (and they will). And as maligned as simply dumping to stdout is, I would rather see that than no trace at all. That said, numerous logging frameworks exist and there's little excuse not to use one.

We have, though, got into the habit of disabling much of this valued output in order to preserve performance. That's understandable, of course: heavily used components or loops can chew up an awful lot of time and I/O writing out "Validating record 1 of 64000000. Validating record 2 of 64000000..." and so on. Useful, huh?

And we have various levels of output - debug, info, warning, fatal - and the ability to turn the output level up for specific classes or libraries. Cool.

But what we do in production is turn off anything below a warning, and when something goes wrong we scramble about, often under change control, to try to get some more data out of the system. And most of the time... you need to add more debug statements to the code to get the data out that you want. Emergency code releases - aren't they just great fun?

Let's face it: it's the 1970s and we are, on the whole, British Leyland, knocking up rust buckets which break down every few hundred miles for no reason at all.

British Leyland Princess

My parents had several of these and they were, without exception, shit.

One of the most significant leaps forward in the automotive industry over the past couple of decades has been the instrumentation of many parts of the car, along with complex electronic management systems to monitor and fine-tune performance. Now when you open the bonnet (hood) all you see is a large plastic box screaming "Do not open!".


And if you look really carefully you might find a naff-looking SMART socket where an engineer can plug his computer in to get more data out. The car can tell him which bit is broken and probably talk him through the procedure to fix it...

Meanwhile, back in the IT industry...

It's high time we applied some of the lessons from the failed '70s automotive industry to the computer systems we build (and I don't mean the unionised industries). Instrument your code!

For every piece of code, for every component part of your system, you need to ask, "what should I monitor?". It should go without saying that you need to log exceptions when they're raised but you should also consider logging:

  • Time-spent (in milliseconds or microseconds) for potentially slow operations (i.e. anything that goes over a network or has time-complexity risks).

  • Frequency of occurrence - Just log the event and let the monitoring tools do the work to calculate frequency.

  • Key events - Especially entry points into your application (web access logs are a good place to start), startup, shutdown etc. but also which code path requests went down.

  • Data - Recording specific parameters, configuration items etc. Be very careful, though, about what you record, to avoid having any personal or sensitive data in log files - no passwords or card numbers etc...

  • Environment utilisation - CPU, memory, disk, network - Necessary to know how badly you're affecting the environment in which your code is hosted.


If you can scale the application horizontally you can probably afford the few microseconds it's going to take to log the required data safely enough.
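As a concrete sketch of the time-spent and key-event points above, here's one way to do it with Python's standard logging module. The logger name and the "validate_batch" operation label are made up for illustration; the timing approach (a context manager around the slow section) is the general idea, not a prescription.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("myapp")  # hypothetical logger name


@contextmanager
def timed(operation):
    """Log time-spent (in milliseconds) for a potentially slow operation."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        # One structured line per operation; monitoring tools can
        # aggregate frequency and percentiles from these later.
        log.info("operation=%s time_ms=%.3f", operation, elapsed_ms)


# Usage: wrap the risky section rather than logging every record.
with timed("validate_batch"):
    records = [i * i for i in range(64_000_000 // 1000)]  # stand-in for real work
```

Note the single summary line per batch, rather than one line per record - you get the timing data without the "record 1 of 64000000" I/O storm.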

Then, once logged, you need to process and visualise this data. I would recommend decoupling your application from the monitoring infrastructure as much as possible by logging to local files or, if that's not possible, streaming it out asynchronously to somewhere (a queue, Amazon Kinesis etc.). By decoupling you keep the responsibilities clear and can vary either without necessarily impacting the other.
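The decoupling above can be had cheaply in-process too. Python's standard library ships a queue-based pair of handlers that keep slow I/O off the application thread - a minimal sketch, assuming a local file called app.log as the destination:

```python
import logging
import logging.handlers
import queue

# Records go onto an in-memory queue; the application thread never blocks
# on disk or network I/O when it logs.
log_queue = queue.Queue(-1)  # unbounded

# The handler that actually does the (potentially slow) writing.
file_handler = logging.FileHandler("app.log")  # hypothetical path
file_handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

# A background thread drains the queue into the real handler.
listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

log = logging.getLogger("myapp")
log.addHandler(logging.handlers.QueueHandler(log_queue))
log.setLevel(logging.INFO)

log.info("startup complete")  # returns immediately; write happens off-thread

listener.stop()  # drain and flush on shutdown
```

Swapping the file handler for one that pushes to a queue or Kinesis changes the destination without touching the application code - which is the point of the decoupling.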

You then need an agent to watch the logged output and ship it to a repository, a repository to store the data, and some means of analysing that data as and when required.


Using tools like Kibana, Elasticsearch and Logstash - all from Elastic - you can easily monitor files and visualise the data in pretty much real time. You can even do so in the development environment (I run Elasticsearch and Kibana on a Raspberry Pi 2, for example) to try to understand the behaviour of your code before you get anywhere near production.
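One thing that makes that pipeline painless is emitting one JSON object per line, so the shipper doesn't need fragile regex parsing. A sketch of a JSON formatter using only the standard library (the field names here mimic the common @timestamp convention but are otherwise my own choice):

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line - trivial for a log shipper to parse."""

    def format(self, record):
        return json.dumps({
            "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%S",
                                        time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("myapp.payments")  # hypothetical component name
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("card authorised")  # one machine-readable event per line
```

Each line lands in the index as structured fields, so dashboards can filter on level or logger rather than grepping message text.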

So now when that production problem occurs you can see when the root event occurred and the impact it has across numerous components, without needing to go through change control to get more data out whilst the users suffer yet another IT failure. Once you know where to look, the problem is, nine times out of ten, as good as fixed. Dashboards can be set up to show at a glance the behaviour of the entire system, and you'll soon find your eye gets used to the patterns and will pick up on changes quite easily if you're watching the right things.

The final step is to automate the processing of this data, correlate it across components and act accordingly to optimise the solution and eventually self-heal. Feedback control.
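In miniature, feedback control is just: sample, compare against the normal operating band, act. A toy sketch below - the window size, threshold and "scale out" decision are all assumptions; in practice the samples would come from your monitoring repository rather than an in-memory list:

```python
import statistics

WINDOW = 10        # how many recent samples to consider (assumed)
SLOW_MS = 500.0    # acceptable median response time (assumed)


def should_scale_out(recent_times_ms):
    """True when the median of the last WINDOW samples breaches the limit.

    Using the median over a window, rather than reacting to a single slow
    request, filters out one-off blips so the loop doesn't flap.
    """
    window = recent_times_ms[-WINDOW:]
    return len(window) == WINDOW and statistics.median(window) > SLOW_MS


print(should_scale_out([120.0] * 9 + [900.0]))  # one blip: False
print(should_scale_out([900.0] * 10))           # sustained slowness: True
```

Wire the True branch to an action - add capacity, restart a sick node, shed load - and you have the beginnings of a self-healing loop.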


With the costs of computing power falling and the costs of an outage rising you can't afford not to know what's going on. For now you may have to limit yourself to getting the data into your enterprise monitoring solution - something like Tivoli Monitoring - for operations to support. It's a start...

Without the data we're blind. It's time we started to instrument our systems more thoroughly.
