It’s long been good practice to include some sort of tracing in code to help with problems as and when they arise (and they will). And as maligned as simply dumping to stdout is, I would prefer to see this than no trace at all. However, numerous logging frameworks exist and there’s little excuse not to use one.
We have though gotten into the habit of disabling much of this valued output in order to preserve performance. This is understandable of course as heavily used components or loops can chew up an awful lot of time and I/O writing out “Validating record 1 of 64000000. Validating record 2 of 64000000…” and so on. Useful huh?
And we have various levels of output – debug, info, warning, fatal – and the ability to turn the output level up for specific classes or libraries. Cool.
But what we do in production is turn off anything below a warning and when something goes wrong we scramble about; often under change control, to try and get some more data out of the system. And most of the time… you need to add more debug statements to the code to get the data out that you want. Emergency code releases, aren’t they just great fun?
Let’s face it, it’s the 1970’s and we are; on the whole, British Leyland knocking up rust buckets which break down every few hundred miles for no reason at all.
My parents had several of these and they were, without exception, shit.
One of the most significant leaps forward in the automotive industry over the past couple of decades has been the instrumentation of many parts of the car along with complex electronic management systems to monitor and fine tune performance. Now when you open the bonnet (hood) all you see is a large plastic box screaming “Do not open!”.
And if you look really carefully you might find a naff looking SMART socket when an engineer can plug his computer in to get more data out. The car can tell him which bit is broken and probably talk him through the procedure to fix it…
Meanwhile, back in the IT industry…
It’s high time we applied some of the lessons from the failed ’70s automotive industry to the computer systems we build (and I don’t mean the unionised industries). Instrument your code!
For every piece of code, for every component part of your system, you need to ask, “what should I monitor?”. It should go without saying that you need to log exceptions when they’re raised but you should also consider logging:
- Time-spent (in milli or microseconds) for potentially slow operations (i.e. anything that goes over a network or has time-complexity risks).
- Frequency of occurrence – Just log the event and let the monitoring tools do the work to calculate frequency.
- Key events – Especially entry points into your application (web access logs are a good place to start), startup, shutdown etc. but also which code path requests went down.
- Data – Recording specific parameters or configuration items etc. You do though need to be very careful here as to what you record to avoid having any personal or sensitive data in log files – no passwords or card numbers etc…
- Environment utilisation – CPU, memory, disk, network – Necessary to know how badly you’re affecting the environment in which your code is homed.
If you can scale the application horizontally you can probably afford the few microseconds it’s going to take to log the required data safely enough.
Then, once logged you need to process and visualise this data. I would recommend decoupling your application from the monitoring infrastructure as much as possible by logging to local files or; if that’s not possible, stream it out asynchronously to somewhere (a queue, Amazon Kinesis etc.). By decoupling you keep the responsibilities clear and can vary either without necessarily impacting the other.
You then need some agent to monitor the logged output and upload it to some repository, a repository to store the data and some means of analysing this data as and when required.
Using tools like Kibana, ElasticSearch and LogStash – all from Elastic – you can easily monitor files and visualise the data in pretty much real-time. You can even do so in the development environment (I run ElasticSearch and Kibana on a Raspberry Pi 2 for example) to try to understand the behaviour of your code before you get anywhere near production.
So now when that production problem occurs you can see when the root event occurred and impact it has across numerous components without needing to go through change-control to get more data out whilst the users suffer yet another IT failure. Once you know where to look the problem is 9 times out of 10 fixed. Dashboards can be set up to show at a glance the behaviour of the entire system and you’ll soon find your eye gets used to the patterns and will pick up on changes quite easily if you’re watching the right things.
The final step is to automate the processing of this data, correlate it across components and act accordingly to optimise the solution and eventually self-heal. Feedback control.
With the costs of computing power falling and the costs of an outage rising you can’t afford not to know what’s going on. For now you may have to limit yourself to getting the data into your enterprise monitoring solution – something like Tivoli Monitoring – for operations to support. It’s a start…
Without the data we’re blind. It’s time we started to instrument our systems more thoroughly.