2016/11/12

Instrumentation as a 1st Class Citizen

I wrote previously that we are moving into an era of instrumentation and things are indeed improving. Just not as fast as I'd like. There's a lot of good stuff out there to support instrumentation and monitoring including the likes of the ELK (ElasticSearch, Logstash, Kibana) and TIG (Telegraf, InfluxDB, Grafana) stacks as well as their more commercial offerings such as TICK (Telegraf, InfluxDB, Chronograf, Kapacitor), Splunk, DataDog, AppDynamics and others. The problem is, few still really treat instrumentation as a real concern... until it's too late.

Your customers love this stuff! Really, they do! There's nothing quite as sexy as an interactive graph showing how your application is performing as the load increases - transactions, visitors, response-times, server utilisation, queue-depths etc. When things are going well it gives everyone a warm fuzzy feeling that all is right with the universe. When things are going wrong it helps to quickly focus you in on where the problem is.

However, this stuff needs to be built into everything we do and not be an afterthought when the pressures on to ship-it and you can't afford the time and effort to retrofit it. By then it's too late.

As architects we need to put in the infrastructure and services needed to support instrumentation, monitoring and alerting. At a minimum this means putting in place standards for logging, data-retention polices, a data collection solution, repository for the data and some tooling to allow us to search that data and visualize what's going on. Better still we can add alerting when thresholds breach and use richer analytics to allow us to scale up and down to meet demand.

As developers we need to be considering what metrics we want to capture from the components we build as we're working on them. Am I interested in how long it's taking for this function call? Do I want to know how many messages a service is handling? How many threads are being spawned? What exceptions are being thrown? Where from? What the queue depths are?.. etc. Almost certainly... YES! And this means putting in place strategies for logging these things. Perhaps you can find the data in existing log files.. Perhaps you need to use better tooling for detailed monitoring... Perhaps you need to write some code yourself to track how things are going...

Doing this from the start will enable you to get a much better feel for how things are working before you launch - including a reasonable view of performance and infrastructure demands which will allow you to focus your efforts better later when you do get into sizing and performance testing. It'll mean you're not scrambling around look for log files to help you root-cause issues as your latest release goes into meltdown. And it'll mean your customer won't be chewing your ear off asking you what's going on every five minutes - they'll be able to see it for themselves...

So please, get it in front of your customer, your product owner, your sponsor, your architects, your developers, your testers and make instrumentation a 1st class citizen in the backlog.

No comments:

Post a Comment

Voyaging dwarves riding phantom eagles

It's been said before... the only two difficult things in computing are naming things and cache invalidation... or naming things and som...