2016/11/26

Scalable = Horizontal

There are two ways of scaling: vertical and horizontal, but only one of them is really scalable.

Vertical scaling essentially means bigger nodes. If you've got 8GB of RAM, go to 16GB. If you've got 2 cores, go to 4... and so on.

Horizontal scaling means adding more nodes. One node to two nodes, to three and so on.

As a rule, horizontal scaling is good. Theoretically there's no limit to the number of nodes you can have.

As a rule, vertical scaling is bad. You quickly run into constraints on the number of cores or the amount of RAM you can support. And for many of today's problems this just doesn't work. Solutions need to be both scalable at internet scale and available 24x7. Relying on large single nodes in such situations is not ideal (and those supercomputers with 250,000+ processors are really horizontal solutions as well).

The problem is, horizontal scaling isn't trivial. The culprits here are data and networking (plumbing, really). State and caches need to be distributed and available to every node. Databases need copies across nodes and those copies need to be synchronised. Sharding usually becomes necessary (or you just end up with many very large nodes). And so on... Your best bet is to avoid state as much as possible. But once you've cracked it you can build much larger solutions more efficiently (commodity hardware, virtualisation, containers etc.) and flex more rapidly than in the vertical world.
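To make the sharding point concrete, here's a minimal sketch in Python. It's purely illustrative - the node names and the simple modulo routing are my assumptions, not a recommendation:

# A minimal sketch of hash-based sharding (illustrative only; the node
# names and the modulo routing are assumptions, not from this post).
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def shard_for(key):
    # Hash the key and pick a node. Stable for a fixed node list, but
    # adding or removing a node remaps most keys - consistent hashing
    # or rendezvous hashing reduces that churn.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

print(shard_for("customer:42"))  # routes this key to one of the nodes

Even in a toy like this you can see why flexing horizontally takes planning: the moment the node list changes, the routing changes with it, and that's before you've dealt with moving the data.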

I could go on about how historically the big players have loved the vertical-scaling thing (think Oracle and IBM trying to sell you those huge servers and SQL database solutions with the $$$ price-tags)... The world has since found NoSQL solutions which take a fundamentally different approach by accepting that consistency in many cases really isn't as important as we once thought - and many of these are open-source...

Whatever, there's only one way to scale... Horizontal!

 

2016/11/12

Instrumentation as a 1st Class Citizen

I wrote previously that we are moving into an era of instrumentation, and things are indeed improving. Just not as fast as I'd like. There's a lot of good stuff out there to support instrumentation and monitoring, including the likes of the ELK (ElasticSearch, Logstash, Kibana) and TIG (Telegraf, InfluxDB, Grafana) stacks as well as their more commercial offerings such as TICK (Telegraf, InfluxDB, Chronograf, Kapacitor), Splunk, DataDog, AppDynamics and others. The problem is, few really treat instrumentation as a first-class concern... until it's too late.

Your customers love this stuff! Really, they do! There's nothing quite as sexy as an interactive graph showing how your application is performing as the load increases - transactions, visitors, response-times, server utilisation, queue-depths etc. When things are going well it gives everyone a warm fuzzy feeling that all is right with the universe. When things are going wrong it helps to quickly focus you in on where the problem is.

However, this stuff needs to be built into everything we do, not bolted on as an afterthought when the pressure's on to ship it and you can't afford the time and effort to retrofit it. By then it's too late.

As architects we need to put in place the infrastructure and services needed to support instrumentation, monitoring and alerting. At a minimum this means standards for logging, data-retention policies, a data collection solution, a repository for the data and some tooling to allow us to search that data and visualize what's going on. Better still, we can add alerting when thresholds are breached and use richer analytics to allow us to scale up and down to meet demand.
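As a toy example of the kind of logging standard I mean, here's a sketch using Python's standard logging module. The field names and the service name are assumptions on my part, not a prescription - the point is simply that one consistent JSON record per event gives the collection and search tooling downstream something predictable to work with:

# A sketch of a structured (JSON-per-line) logging standard; the field
# names and service name are illustrative assumptions.
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "ts": time.time(),
            "level": record.levelname,
            "service": "order-service",   # hypothetical service name
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("orders").info("order accepted")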

As developers we need to be considering what metrics we want to capture from the components we build as we're working on them. Am I interested in how long this function call is taking? Do I want to know how many messages a service is handling? How many threads are being spawned? What exceptions are being thrown? Where from? What the queue depths are? ... and so on. Almost certainly... YES! And this means putting in place strategies for logging these things. Perhaps you can find the data in existing log files... Perhaps you need to use better tooling for detailed monitoring... Perhaps you need to write some code yourself to track how things are going...
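For example, a crude way to start capturing call timings without any extra tooling is a sketch like this - the metric name and the log-based output are assumptions for illustration; in practice you'd push the numbers to whichever collector you've chosen:

# A rough sketch of capturing call timings in-process; the metric name
# and the log-based output are illustrative assumptions.
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("metrics")

def timed(metric_name):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                log.info("%s took %.2fms", metric_name, elapsed_ms)
        return wrapper
    return decorator

@timed("orders.handle_message")
def handle_message(msg):
    ...  # hypothetical message handler

handle_message({"id": 42})  # logs: orders.handle_message took 0.01ms

It's a few lines of code while you're writing the component; retrofitting the same thing across a whole estate later is a project in its own right.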

Doing this from the start will enable you to get a much better feel for how things are working before you launch - including a reasonable view of performance and infrastructure demands which will allow you to focus your efforts better later when you do get into sizing and performance testing. It'll mean you're not scrambling around looking for log files to help you root-cause issues as your latest release goes into meltdown. And it'll mean your customer won't be chewing your ear off asking you what's going on every five minutes - they'll be able to see it for themselves...

So please, get it in front of your customer, your product owner, your sponsor, your architects, your developers, your testers and make instrumentation a 1st class citizen in the backlog.
