Instrumentation Revolution

It’s long been good practice to include some sort of tracing in code to help with problems as and when they arise (and they will). And as maligned as simply dumping to stdout is, I would rather see that than no trace at all. That said, numerous logging frameworks exist and there’s little excuse not to use one.

We have, though, got into the habit of disabling much of this valued output in order to preserve performance. That’s understandable of course: heavily used components or loops can chew up an awful lot of time and I/O writing out “Validating record 1 of 64000000. Validating record 2 of 64000000…” and so on. Useful, huh?

And we have various levels of output – debug, info, warning, fatal – and the ability to turn the output level up for specific classes or libraries. Cool.

But what we do in production is turn off anything below a warning, and when something goes wrong we scramble about, often under change control, to try and get some more data out of the system. And most of the time you need to add more debug statements to the code to get the data out that you want. Emergency code releases – aren’t they just great fun?

Let’s face it, it’s the 1970s and we are, on the whole, British Leyland knocking up rust buckets which break down every few hundred miles for no reason at all.

British Leyland Princess

My parents had several of these and they were, without exception, shit.

One of the most significant leaps forward in the automotive industry over the past couple of decades has been the instrumentation of many parts of the car along with complex electronic management systems to monitor and fine tune performance. Now when you open the bonnet (hood) all you see is a large plastic box screaming “Do not open!”.

[Image: under the bonnet of a modern car – just a sealed plastic cover]

And if you look really carefully you might find a naff-looking SMART socket where an engineer can plug his computer in to get more data out. The car can tell him which bit is broken and probably talk him through the procedure to fix it…

Meanwhile, back in the IT industry…

It’s high time we applied some of the lessons from the failed ’70s automotive industry to the computer systems we build (and I don’t mean the unionised industries). Instrument your code!

For every piece of code, for every component part of your system, you need to ask, “what should I monitor?”. It should go without saying that you need to log exceptions when they’re raised but you should also consider logging:

  • Time spent (in milliseconds or microseconds) for potentially slow operations (i.e. anything that goes over a network or has time-complexity risks).
  • Frequency of occurrence – just log the event and let the monitoring tools do the work of calculating frequency.
  • Key events – especially entry points into your application (web access logs are a good place to start), startup, shutdown etc., but also which code path requests went down.
  • Data – recording specific parameters, configuration items etc. You do though need to be very careful about what you record, to avoid having any personal or sensitive data in log files – no passwords or card numbers etc.
  • Environment utilisation – CPU, memory, disk, network – necessary to know how badly you’re affecting the environment in which your code is homed.

If you can scale the application horizontally you can probably afford the few microseconds it’s going to take to log the required data safely enough.
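As a rough sketch of the time-spent and frequency points above – Python, standard library only, with a made-up logger name and operation label – a decorator keeps the instrumentation out of the business logic:

```python
import logging
import time
from functools import wraps

logger = logging.getLogger("myapp.instrumentation")  # hypothetical logger name

def timed(operation):
    """Log time spent (in milliseconds) in a potentially slow operation."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                # One event per call - let the monitoring tools work out frequency.
                logger.info("operation=%s time_ms=%.3f", operation, elapsed_ms)
        return wrapper
    return decorator

@timed("validate_record")
def validate_record(record):
    ...  # the real work goes here
```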

Then, once logged, you need to process and visualise this data. I would recommend decoupling your application from the monitoring infrastructure as much as possible by logging to local files or, if that’s not possible, streaming it out asynchronously to somewhere (a queue, Amazon Kinesis etc.). By decoupling you keep the responsibilities clear and can vary either without necessarily impacting the other.
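A minimal sketch of that decoupling using Python’s standard-library logging (the file path and logger names are invented): application threads only push records onto an in-memory queue, a background listener writes JSON lines to a local file, and whatever ships the file onwards is someone else’s problem.

```python
import json
import logging
import logging.handlers
import queue

class JsonLineFormatter(logging.Formatter):
    """One JSON object per line so any shipper can tail the file."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

log_queue = queue.Queue(-1)  # unbounded in-memory buffer between app and I/O
file_handler = logging.FileHandler("/var/log/myapp/events.log")  # hypothetical path
file_handler.setFormatter(JsonLineFormatter())

# Application threads only pay for a queue put; this listener thread does the disk I/O.
listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

logger = logging.getLogger("myapp")
logger.addHandler(logging.handlers.QueueHandler(log_queue))
logger.setLevel(logging.INFO)

logger.info("startup complete")
```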

You then need some agent to monitor the logged output and upload it to some repository, a repository to store the data and some means of analysing this data as and when required.

[Image: log shipping pipeline – application log files, collection agent, central repository, analysis and dashboards]

Using tools like Kibana, Elasticsearch and Logstash – all from Elastic – you can easily monitor files and visualise the data in pretty much real time. You can even do so in the development environment (I run Elasticsearch and Kibana on a Raspberry Pi 2, for example) to try to understand the behaviour of your code before you get anywhere near production.
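For instance – assuming you’ve logged a numeric time_ms field and a ts timestamp into a hypothetical myapp-logs index – pulling back the slowest recent operations is a short query against the Elasticsearch REST API:

```python
import requests

# Index and field names are assumptions - use whatever you actually logged.
resp = requests.get(
    "http://localhost:9200/myapp-logs/_search",
    json={
        "query": {"range": {"time_ms": {"gte": 500}}},  # operations slower than 500 ms
        "sort": [{"ts": "desc"}],
        "size": 20,
    },
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])
```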

So now when that production problem occurs you can see when the root event occurred and the impact it has across numerous components, without needing to go through change control to get more data out whilst the users suffer yet another IT failure. Once you know where to look, the problem is nine times out of ten as good as fixed. Dashboards can be set up to show at a glance the behaviour of the entire system and you’ll soon find your eye gets used to the patterns and will pick up on changes quite easily if you’re watching the right things.

The final step is to automate the processing of this data, correlate it across components and act accordingly to optimise the solution and eventually self-heal. Feedback control.

[Image: feedback control loop – monitor, analyse, act, repeat]


With the costs of computing power falling and the costs of an outage rising you can’t afford not to know what’s going on. For now you may have to limit yourself to getting the data into your enterprise monitoring solution – something like Tivoli Monitoring – for operations to support. It’s a start…

Without the data we’re blind. It’s time we started to instrument our systems more thoroughly.

Power to the People

Yesterday I received my usual gas and electricity bill from my supplier, with the not-so-usual increase to my monthly direct debit of a nice round 100%! 100% on top of what is already more than I care for… joy!

What followed was the all-too-familiar venting of spleen and spitting of feathers before the situation was resolved by a very nice customer services representative who had clearly seen this before… hmm…

So, as I do, I ponder darkly on how such a situation could have arisen. And as an IT guy, I ponder darkly about how said situation came about through IT (oh what a wicked web we weave)… Ok, so pure conjecture, but this lot have previous…

100%! What on earth convinced them to add 100%? Better still, what convinced them to add 100% when I was in fact in credit and they had just reimbursed me £20 as a result?…

Customer service rep: It’s the computers you see sir.

Me: The computers?

CSR: Well because they reimbursed you they altered your direct-debit amount.

Me: Yeah, ok, so they work out that I’m paying too much and reduce my monthly which would kind of make sense but it’s gone up! Up by 100%!

CSR: Well er yes. But I can fix that and change it back to what it was before…

Me: Yes please do!

CSR: Can I close the complaint now?

Me: Well, you can’t do much else can you? But really, you need to speak to your IT guys because this is just idiotic…

(more was said but you get the gist).

So, theories for how this came about:

  1. Some clever-dick specified a requirement that if they refund some money then they claw it back ASAP by increasing the monthly DD by 100%.
  2. A convoluted matrix exists which is virtually impossible to comprehend (a bit like their pricing structure), detailing how and when to apply various degrees of adjustment to DD amounts, with near-infinite paths that cannot be proven via any currently known mathematics on the face of this good earth.
  3. A defect exists somewhere in the code.

If 1 or 2, then sack the idiots who came up with such a complete mess – probably the same lot who dream up energy pricing models, so a “win-win” as they say!

If 3 then, well, shit happens. Bugs happen. Defects exist; God knows I’ve been to the root cause of many…

It’s just that this isn’t the first time, or the second, or the third…

(start wavy dreamy lines)

The last time, they threatened to take me to court because I wouldn’t let their meter maid in when they’d already been and so hadn’t even tried again… and since they couldn’t reconcile two different “computer systems” properly it kept on bitching until it ratcheted up to that “sue the bastards” level. Nice.

(end wavy dreamy lines)

… and this is such a simple thing to test for. You’ve just got to pump in data for a bunch of test scenarios and look at the results – the same applies to that beastly matrix! Or you’ve got to do code reviews and look for that “if increase < 1.0 then surely it’s wrong and add 1.0 to make it right” or “increase by current/current + (1.0*current/current)”.
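To labour the point, a table-driven test of the sort I mean might look like the sketch below. The function is entirely hypothetical – a stand-in for whatever their billing engine actually does – but the invariant is the thing: a refund should never double the direct debit.

```python
import unittest

def adjusted_direct_debit(current_monthly, refund_issued):
    """Hypothetical stand-in for the billing rule: a refund means the customer
    is in credit, so the monthly amount should stay the same or fall - never rise."""
    return current_monthly

class DirectDebitTests(unittest.TestCase):
    def test_refund_never_increases_monthly_amount(self):
        for current in (10.0, 55.0, 120.0):  # pump in a bunch of scenarios
            new_amount = adjusted_direct_debit(current, refund_issued=True)
            self.assertLessEqual(new_amount, current)

if __name__ == "__main__":
    unittest.main()
```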

So: bad test practices and/or bad coding practices – or, I could also accept, really bad architecture which can’t be implemented, really bad project management which just skips anything resembling best practice, or really bad leadership which doesn’t give a toss about the customer – and really bad operations and maintenance regardless, because they know they’ve got some pretty basic problems and yet they clearly can’t bring themselves to get them fixed.

It all comes down to money, and with practices like this they’ll be losing customers hand over fist (they do have the worst customer satisfaction ratings, apparently).

They could look at agile, continuous integration, test automation, DevOps, TDD and BDD practices, as is all the rage, but they need to brush away the fairy dust that often accompanies these concepts (concepts I generally support, incidentally) and realise this does not mean you can abandon all sanity and give up on the basic principles of testing and coding!

If anything these concepts weigh even more heavily on such fundamentals – more detailed tracking of delivery progress and performance, pair programming, reviewing test coverage, using tests to drive development, automating build and testing as much as possible to improve consistency and quality, getting feedback from operational environments so you detect and resolve issues faster, continuous improvement and so on.

Computer systems are more complex and more depended on by society than ever before; they change at an ever-increasing rate, interact in a constantly changing environment with other systems, and are managed by disparate teams spread across the globe who come and go with the prevailing technological wind. Your customers and your business rely on them 100% to, at best, get by and, at worst, to exist. You cannot afford not to do it right!

I’m sure the issues are more complex than this and there are probably some institutionalised problems preventing efficient resolution, more’s the pity. But hey, off to search for a new energy provider…

Session Abolition

I’ve been going through my bookcase – on orders from a higher being – to weed out old, redundant books and make way for… well, I’m not entirely sure what, but anyway, it’s not been very successful.

I came across an old copy of Release It! by Michael T. Nygard and started flicking through, chuckling occasionally as memories (good and bad) surfaced. It’s an excellent book but made me stop and think when I came across a note reading:

Serve small cookies
Use cookies for identifiers, not entire objects. Keep session data on the server, where it can't be altered by a malicious client.

There’s nothing fundamentally wrong with this, other than that it chimes with a problem I’m currently facing and I don’t like any of the usual solutions.

Sessions either reside in some sort of stateful pool (persistent database, session management server, replicated memory etc.) or, more commonly, exist stand-alone within each node of a cluster. In either case load balancing is needed to route requests to the home node where the session exists (delays in replication mean you can’t go to just any node even when a stateful pool is used). Such load balancing is performed by a network load balancer, reverse proxy, web server (mod_proxy, the WebSphere plugin etc.) or application server and can work using numerous different algorithms: IP-based routing, round-robin, least-connections etc.

So in my solution I now need some sort of load balancer – more components, joy! But even worse, it’s creating havoc with reliability. Each time a node fails I lose all the sessions on that server (unless I plump for a session management server, which I need like a hole in the head). And nodes fail all the time… (think cloud, autoscaling and hundreds of nodes).

So now I’m going to kind-of break that treasured piece of advice from Michael and create larger cookies (more likely request parameters) and include in them some ever-so-slightly-sensitive details which I really shouldn’t. I should point out this isn’t as criminal as it sounds.

Firstly the data really isn’t that sensitive. It’s essentially routing information that needs to be remembered between requests – not my credit card details.

Secondly it’s still very small – a few bytes or so but I’d probably not worry too much until it gets to around 2K+ (some profiling required here I suspect).

Thirdly, there are other ways to protect the data – notably encryption and hashing. If I don’t want the client to be able to read it then I’ll encrypt it. If I don’t mind the client reading the data but want to make sure it’s not been tampered with, I’ll use an HMAC instead. A JSON Web Token-like format should work well in most cases.

Now I can have no session on the back-end servers at all but instead need to decrypt (or verify the hash of) and decode a token on each request. If a node fails I don’t care (much) as any other node can handle the same request, and my load balancing can be as dumb as I like.
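A minimal sketch of the idea using nothing but the Python standard library (the secret key and payload fields are placeholders, and a proper JWT library would do this more rigorously):

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"replace-me"  # placeholder - and yes, this is the key management problem

def encode_token(payload: dict) -> str:
    """Produce a compact URL-safe token: base64(payload).base64(hmac)."""
    body = base64.urlsafe_b64encode(json.dumps(payload).encode())
    sig = hmac.new(SECRET_KEY, body, hashlib.sha256).digest()
    return body.decode() + "." + base64.urlsafe_b64encode(sig).decode()

def decode_token(token: str) -> dict:
    """Verify the HMAC before trusting anything in the payload."""
    body_b64, sig_b64 = token.split(".")
    expected = hmac.new(SECRET_KEY, body_b64.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, base64.urlsafe_b64decode(sig_b64)):
        raise ValueError("token has been tampered with")
    return json.loads(base64.urlsafe_b64decode(body_b64))

# Any node can verify and route on the token - no session affinity required.
token = encode_token({"route": "eu-west", "customer_ref": "x123"})  # illustrative fields
print(decode_token(token))
```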

I’ve sacrificed performance for reliability – both in terms of computational effort server-side and in terms of network payload – and made some simplifications to the overall topology to boot. CPU cycles are getting pretty cheap now though, and this pattern should scale horizontally and vertically – time for some testing… The network penalty isn’t so cheap, but again it should be acceptable, and if I avoid using cookies for the token then I can at least save carrying it on every single request.

It also means that in a network of micro-services, so long as each service propagates these tokens around, the rather thorny routing problem in this sort of environment virtually disappears.

I do though now have a key management problem. Somewhere, somehow I need to store the keys securely whilst distributing them to every node in the cluster… oh and don’t mention key-rotation…

A little chit-chat

It’s been a while since I posted anything so I thought I’d write an article on the pros and cons of chatty interfaces…

When you’re hosting a sleepover for the kids and struggling to get them all to sleep you’ll no doubt hear that plaintive cry of “I’m thirsty!”, usually followed in quick succession by 20 other “Me too!”s…

What you don’t do is walk down the stairs, fetch a glass from the cupboard, fill it with milk from the fridge and take it back upstairs… then repeat 20 more times – you may achieve your steps-goal for the day but it’ll take far too long. Instead you collect all the orders, go downstairs once, fill 21 glasses and take them all back upstairs on a tray (or more likely just give them the bottle and let them fight it out).

And so it is with a client-server interface. If you’ve a collection of objects on a client (say customer records) and you want to get some more info on each one (such as the address), you’d be better off asking the server “can you give me the addresses for customers x, y and z?” in one go rather than doing it three times. The repeated latency, transit time and processing by the server may appear fine for a few records but will very quickly deteriorate as the number of records rises. And if you’ve many clients doing this sort of thing you can get contention for resources server-side, which breaks things for all users – even those not having a wee natter…
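To make the difference concrete, here’s a sketch in Python – the endpoints are hypothetical, and it assumes the server offers a bulk lookup – showing the chatty version against the batched one:

```python
import requests

API = "https://api.example.com"  # hypothetical service

def get_addresses_chatty(customer_ids):
    # One round trip per customer: latency and server load scale with the list.
    return {cid: requests.get(f"{API}/customers/{cid}/address").json()
            for cid in customer_ids}

def get_addresses_batched(customer_ids):
    # One round trip for the lot - assumes the service exposes a bulk endpoint.
    resp = requests.get(f"{API}/addresses", params={"ids": ",".join(customer_ids)})
    return resp.json()

addresses = get_addresses_batched(["x", "y", "z"])
```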

Is chatty always bad?

It comes down to the atomicity of requests, responsibility and feedback.

Atomicity…

If it’s meaningful to ask a question like “can you give me the address for customer x, the account balance for customer y and the shoe size of customer z?” then by all means do so, but most likely that’s really three separate questions and you’ll get better reuse and maintainability if you split them out. Define services to address meaningful questions at the appropriate granularity – what is “appropriate” is for you to work out…

Responsibility…

You could break the requests out server-side and have a single request which collects lots of data and returns it all at once, rather than having lots of small requests go back and forth. This model is common for web pages which contain data from multiple sources – one request for the page, the server collects data from the sources and combines this in one response back to the user. Problem is, if any one of the sources is performing badly (or is down) the whole page suffers and your server chokes in sympathy as part of a cascading failure. So is it the responsibility of the server to collate the data and provide a unified view (often it is), or can that responsibility be pushed to the client (equally often it can be)? If it can be pushed to the client then you can actually improve performance and reliability through more chatty round-trips. Be lazy, do one thing well, and delegate responsibility as much as possible…
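If the aggregation does stay server-side, a sketch of one way to limit the damage (the source names and URLs are invented): fetch each source independently with its own timeout, so a slow or dead dependency degrades one panel rather than the whole page.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

SOURCES = {  # hypothetical downstream services
    "news": "https://news.example.com/top",
    "weather": "https://weather.example.com/today",
    "offers": "https://offers.example.com/latest",
}

def fetch(url):
    return requests.get(url, timeout=2).json()

def collect_page_data():
    """Fetch each source independently; a slow or dead source degrades to None
    rather than taking the whole page (and the server) down with it."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = {name: pool.submit(fetch, url) for name, url in SOURCES.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=3)
            except Exception:  # timeout, connection error, bad JSON...
                results[name] = None
    return results
```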

Feedback…

Your end user is a grumpy and impatient piece of wetware, incapable of acting repetitiously, who seeks instant gratification with minimal effort and is liable to throw the tantrums of a two-year-old when asked to “Please wait…”. God help us all if you send them an actual error… To improve usability you can trigger little round-trip requests to validate data asynchronously as they go and keep the user informed. This feedback is often enough to keep the user happy – all children like attention – and avoids wasted effort. Filling in a 20-page form only to have it rejected because of some cryptic checkbox on page 3 isn’t going to win you any friends. And yet it is still oh so common… I generally work server-side and it’s important to remember why we’re here – computers exist for the benefit of mankind. User interface development is so much more complicated than server-side, and yet simplicity for the user has to be the goal.

So chatty is generally bad for performance and reliability, though it can be good for UX and, depending on where the responsibility sits, can improve things overall.

However, as we move towards a micro-services based environment the chatter is going to get louder and more disruptive. Understanding the nature and volumetrics of the questions that are asked of services and ensuring they are designed to address these at the correct granularity will help keep the chatter down whilst making the code more maintainable and reusable. As always, it’s never black-and-white…


Documents v Wikis

I used to spend a significant proportion of my time working with documents. Nasty 100-page beasties where 80% felt like generic copy text designed principally with the goal of obscuring the 20% of new content you were really interested in. Consequently I developed a severe dislike of any document more than a few pages long.

The agile manifesto came along and suggested we focus on “working software over comprehensive documentation” which by some has unfortunately been taken to mean “no documentation”. Let’s just say there’s a significant grey area between the extremes of “comprehensive” and “none-at-all”.

Personally I’m aware that I fall more into the “comprehensive” camp than the other, though I put this down to the fact that, for me, putting things down on paper is a great way of helping me think things through. For me, documentation is a design tool.

On the other hand, wikis…! I used to see wikis as a saviour from the hours/days/weeks spent reviewing documents and trying to keep content consistent across multiple work products. Finally, a tool where I can update in one place, link to other content and focus on the detail without the endless repetition. Something in support of the agile manifesto which endeavours to provide enough documentation without going overboard on it. Unfortunately recent experience has exposed a few issues with this.

Here’s how the two compare:

For – documents:

  • Good formatting control.
  • Easy to provide a historic record.
  • Provides a point of contract sign-off.
  • Easy to work offline.
  • Generally accessible.

For – wikis:

  • Highly used critical content is maintained.
  • Good sharing within the team.
  • Hyperlinking – easy to reference other content.
  • Promotes short/succinct content.
  • Good historic record (one is kept automatically).

Against – documents:

  • You have a naïve hope that someone will read it.
  • Promotes bloat and duplication.
  • Promotes the MS monopoly.
  • Poorly maintained.
  • Rapidly goes stale.

Against – wikis:

  • Promotes the illusion of quality.
  • Poor formatting control.
  • Requires online connectivity.
  • Low-usage detail rapidly becomes out of date.
  • Poor sharing outside of the team.
  • Hyperlinking – nothing is ever in one place.
  • Poor historic record (too many fine-grained changes make finding a version difficult).

Hyperlinks, like the connections in relational databases, are really cool and to my mind often more important than the content itself. That wikis make such a poor job of maintaining these links is, in my view, a significant flaw in their nature. The web benefits from such loose coupling through competition between content providers (i.e. websites), but wikis – as often maintained by internal departments – don’t have the interested user base or competitive stimulus to make this model work consistently well. Documents just side-step the issue by duplicating content.

So wikis should work well to maintain operational documentation where the user base has a vested interest in ensuring the content is maintained. Just so they don’t get called out at 3am on Sunday morning because no-one knows the location of a database or what script to run to restart the system.

Documents on the other hand should work better as design products and contractual documents which all parties need to agree on at a fixed point and which aren’t going to be turned to very often. Ten commandments stuff.

The problem is the bit connecting the two and the passage of time and decay turning truth into lies. The matrix connecting requirements and solution, architecture and operations, theory and practice, design and build, dogma and pragmatism. Where the “why?” is lost and forgotten, myth becomes currency and where the rain dance needed each time the database fails-over originates. Not so much the whole truth and nothing but the truth, as truth in parts, assumptions throughout and lies overall.

The solution may be to spend more effort maintaining the repositories of knowledge – whether it’s documents, wikis or stone tablets sent from above. It’s just a shame that effort costs so much and is seen as being worth so little by most whilst providing the dogmatic few with the zeal to pursue a pointless agenda.


Computation, Data and Connections

My ageing brain sees things in what feels like an overly simplistic and reductionist way. Not exactly the truth so much as a simplified version of reality which I can understand well enough to allow me to function (to a degree).

So in computing I see everything as being composed of three fundamental (overly simplified) elements: computation, data and connections.

Computation refers to those components which do stuff, essentially acting on data to transform it from one form to another. This may be as simple as taking data from a database and returning it as a web page or as complex as processing video data to identify and track movements of individuals.

Data is the information itself, which critically has no capability to do anything. Like Newton’s first law, data at rest will remain at rest until an external computation acts upon it. Data may be in a database (SQL or NoSQL), file system, portable USB drive, memory etc. (memory is really just fast temporary data storage).

And connectivity hooks things together. Without some means of connectivity, computation can’t get a handle on the data and nothing much happens regardless. This may be a physical Ethernet cable between devices, a logical database driver, a file-system handle etc.

So in every solution I can define each component (physical or logical) as fulfilling one of these three core functions.

The next step from a non-functional perspective is to consider a matrix of these elements and my four classifications for non-functional requirements (security, performance & capacity, operability and form) to discuss the key considerations that are required for each.

For example, backup and recovery is primarily a consideration for data. Computational elements have no data as such (log files, for example, are data output from compute components) and so a different strategy can be adopted to recover these (e.g. redundant/resilient instances, configuration management, automated build/dev-ops capabilities etc.). Likewise, connectivity recovery tends to be fairly manual since connections can traverse many components, any one of which could fail, and so more effort is required for problem determination (PD) and identifying the root cause. A key requirement to aid this is a full understanding of the connections between components and active monitoring of each so you can pinpoint problems more easily – unfortunately it’s naïve to think you understand all the connections between the components in your systems.

The matrix thus looks something like this:

(Rows are my four non-functional classifications, columns the three elements; “–” marks cells still to be filled in.)

  • Security – Computation: Computation v Security; Data: –; Connections: –
  • Performance & Capacity – Computation: –; Data: –; Connections: –
  • Operability – Computation: redundancy, configuration management, build automation; Data: backup, recovery, resiliency; Connections: connectivity model, monitoring
  • Form – Computation: –; Data: –; Connections: –

Now I just need to fill the rest of it in so when I work on a new solution I can classify each component and refer to the matrix to identify non-functional aspects to consider. At least that’s the theory…

Interview Tales I – The Bathtub Curve

I’ve been to a few interviews recently, most of which have been bizarrely enjoyable affairs where I’ve had the opportunity to discover much about how things work elsewhere. However, I recently went to an interview with an organisation which has suffered some pretty high-profile system failures of late, which I timidly pointed out, hoping not to offend. The response was, in my view, both arrogant and ignorant – perhaps I did offend…

I was informed, rather snootily, that this incident was a one-off, having occurred just once in the interviewer’s 15+ years of experience, and couldn’t happen again. Hmm… I raised the point having worked on a number of technology refresh projects and being familiar with the Bathtub Curve (shown below – image courtesy of the Engineering Statistics Handbook).

[Figure: the bathtub curve – Engineering Statistics Handbook]

What this shows is how, during the early life of a system, failures are common (new build, many defects etc.). Things then settle down to a fairly stable, but persistent, level of failures for the main life of the system, before things start to wear out and the number of incidents increases again – ultimately becoming catastrophic.

This is kind of obvious for mechanical devices (cars and the like) but perhaps not so much for software. I still have an old ’80s book on software engineering which states that “software doesn’t decay!”. However, as pointed out previously, software is subject to change from a variety of sources; change brings decay and decay increases failure rates. The bathtub curve applies.
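For the curious, the bathtub shape is easy to play with numerically: a decreasing hazard for early-life defects, a constant hazard for the useful life and an increasing hazard for wear-out, summed together. The parameters in this sketch are purely illustrative, not fitted to anything real.

```python
def bathtub_failure_rate(t, infant=0.6, wear_out=3.5, scale=10.0, base=0.02):
    """Toy hazard function: a decreasing Weibull term (early-life defects),
    a constant term (useful life) and an increasing Weibull term (wear-out)."""
    early = (infant / scale) * (t / scale) ** (infant - 1)
    late = (wear_out / scale) * (t / scale) ** (wear_out - 1)
    return early + base + late

for year in range(1, 16):
    print(f"year {year:2d}: failure rate ~ {bathtub_failure_rate(year):.3f}")
```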

Now the reason I mentioned the failure in the first place was because the press I had read pointed towards a combination of ageing systems and complex integration solutions holding things together. I was therefore expecting an answer along the lines of “yes, we need to work to make sure it doesn’t happen again” and “that’s why we’re hiring – we need to address these issues”. That could then lead on and I could relate my experiences on refresh projects, hurrah!… It didn’t work out like that, even though it did seem that the raison d’être behind the role itself was precisely that they didn’t have a good enough grip on the existing overall IT environment.

It’s entirely possible that the interviewer is correct (or gets lucky). However, given there have actually been a couple of such incidents at the same organisation recently – two individually unique issues, of course – I’m kind of suspicious that what they’re actually seeing is the start of the ramp-up in incidents typified by the Bathtub Curve. Time will tell.

I wasn’t offered the job, but then again, I didn’t want it either so I think we’re happy going our separate ways.