In the increasingly interconnected micro-services world we’re creating, the saying “a chain is only as strong as its weakest link” is particularly pertinent.

It’s quite easy for a single service to be dependent upon a number of downstream services as the diagram below shows.

An outage or go-slow in any one of the downstream services can have a knock-on impact upstream and right back to users. Measuring this in SLAs, let’s say each of B, C, D, E and F aims for an availability SLA of 99.99%. Assuming they meet this, and that A needs all five of them to be up, the best A can achieve is 99.95%. More realistically, B, C, D, E and F are probably dependent on other services and before you know it end users are doing well to see anything above 99% uptime.
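The arithmetic behind that 99.95% can be sketched in a few lines. This assumes the downstream failures are independent of each other and that A itself never fails, neither of which is stated in the example; the function name is mine.

```python
def serial_availability(availabilities):
    """Best-case availability of a service that needs every dependency up."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


# Five downstream services (B to F), each meeting a 99.99% SLA.
best_case_a = serial_availability([0.9999] * 5)
print(f"{best_case_a:.4%}")  # prints 99.9500%
```

Add a second layer of dependencies behind each of B to F and the product shrinks further, which is how end users end up nearer 99%.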

So what strategies do we have for dealing with this?

Firstly, you could just live with it. Really, don’t knock this option. Question “do I really need the availability?”, “does it really matter if it goes down?”. Before we worry about any elaborate plan to deal with the situation it’s worth considering if the situation is really all that bad.

Ok, so it is… The next question should be “do I need a response immediately?”. If not, go asynchronous and put a queue between them. If the recipient is down messages will queue up until they come back – no problem. Just make sure the queue is as local as possible to the source and persistent.

If it is a request-response model then consider a queue in any case. A queue can often be set to time out old messages and deal with slow responses (e.g. if no response in 5 seconds then abandon). This can often save having very many messages in a backlog waiting to be processed. Such a backlog can cause lock-ups for requests which will never be processed and block the consumer for much longer than the downstream service is unavailable. And it can often be more efficient to have a queue based competing consumer model than having multiple connections banging away sporadically.
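As a rough sketch of the “abandon after 5 seconds” idea, a consumer can simply skip anything that has been sitting in the backlog too long. Most real queue products offer a message TTL setting that does this for you; the in-memory deque, the `(enqueued_at, payload)` tuple layout and the function name below are assumptions for illustration only.

```python
import time
from collections import deque


def next_fresh_message(backlog, max_age=5.0, now=None):
    """Pop messages until one is found that is still worth processing.

    backlog holds (enqueued_at, payload) tuples, oldest first.
    Returns the payload, or None if everything left had expired.
    """
    now = time.monotonic() if now is None else now
    while backlog:
        enqueued_at, payload = backlog.popleft()
        if now - enqueued_at <= max_age:
            return payload
        # Too old: the caller gave up long ago, so drop it silently.
    return None


backlog = deque([(0.0, "stale request"), (7.0, "recent request")])
print(next_fresh_message(backlog, now=10.0))  # prints recent request
```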

On top of this, ensure you’re using non-blocking libraries and implement circuit-breakers to trip when downstream services go offline. This of course begs the question, “what sort of response do I provide if the circuit-breaker is open?”… Well, that depends…
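For anyone who hasn’t met one, a circuit-breaker can be sketched roughly as follows. This is a minimal, single-threaded illustration and not a substitute for a proper library; the class name, thresholds and the injectable `clock` are all assumptions of mine.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast rather than wait on a service we know is down.
                raise CircuitOpenError("downstream marked unavailable")
            # Reset timeout has elapsed: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # trip (or re-trip)
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The caller catches `CircuitOpenError` and decides what to serve instead, which is where the options below come in.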

In some cases you can cache previous responses and serve these. If this sort of caching model works then even better, you can decouple the request for content from the fetching of it from a downstream service so that you’re in effect always serving from cache. Allowing stale cache entries to be served whilst revalidating, even when downstream services are unavailable, can significantly improve the responsiveness and availability of the system. Don’t discard cached items just because they’re old. Keep using them until a fresh copy can be obtained. Size is a concern but if you can afford it then cache your content for as long as the RTO demands (the service should be back by then, e.g. 4hrs) and revalidate as frequently as the business demands the content be fresh (e.g. every 10 minutes).
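A minimal sketch of the serve-stale idea, using the 10 minute and 4 hour figures from above. For brevity it revalidates inline on the request rather than in the background, so it is not the fully decoupled version described; the class and parameter names are my own.

```python
import time


class StaleTolerantCache:
    def __init__(self, fetch, fresh_for=600.0, keep_for=4 * 3600.0,
                 clock=time.monotonic):
        self.fetch = fetch          # callable(key); may raise if downstream is down
        self.fresh_for = fresh_for  # revalidate after this long (10 minutes)
        self.keep_for = keep_for    # give up on an entry after this long (4 hrs)
        self.clock = clock
        self.entries = {}           # key -> (value, stored_at)

    def get(self, key):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and now - entry[1] < self.fresh_for:
            return entry[0]  # fresh enough, no downstream call
        try:
            value = self.fetch(key)
        except Exception:
            if entry is not None and now - entry[1] < self.keep_for:
                return entry[0]  # old, but better than an error page
            raise
        self.entries[key] = (value, now)
        return value
```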

It may sound risky, but this approach can even be used with sensitive data such as user-permissions. You’re looking at a coincidence of bad events which is quite unlikely – e.g. a user’s permissions are revoked (the old version is in cache), at the same time as the permissions system goes down, at the same time as the user attempts something they previously could but should no longer be allowed to do. It’s your risk but what’s worse… One user doing something bad or the whole system being unavailable?

If you can’t or don’t have a cache then can you implement a default or fallback option? Having a blank slot on a page, but a working page otherwise, may be the best of a bad set of options but the best nonetheless.

All else failing, apologise, quickly (see circuit-breaker) and profusely. Let the user know it’s you, not them, that they needn’t worry (e.g. you’ve not charged them and have unpicked any dependent transactions should you have them) and that you’ll be back as soon as you can.

Finally, log everything, monitor and alert. Regardless of the fact that it’s bad to rely on your customers to tell you when you’ve a problem, in many cases the user may not even realise something is amiss. It can easily be overlooked. Ensuring you log and monitor makes it much easier to know when you’ve an issue, as well as allowing faster root-cause analysis.

Queues, circuit-breakers, serve-stale-while-revalidate and logging.

Scalable = Horizontal

There are two ways of scaling, vertical and horizontal, but there’s only one which is really scalable.

Vertical scaling essentially means bigger nodes. If you’ve got 8GB RAM, go to 16GB. If you’ve 2 cores, go to 4.. and so on.

Horizontal scaling means adding more nodes. One node to two nodes, to three and so on.

As a rule, horizontal scaling is good. Theoretically there’s no limit to the number of nodes you can have.

As a rule, vertical scaling is bad. You quickly run into constraints over the number of cores or RAM you can support. And for many of today’s problems this just doesn’t work. Solutions need to be both scalable at the internet scale and available as in 24×7. Relying on large single nodes in such situations is not ideal. (And those supercomputers with 250,000+ processors are really horizontal solutions as well.)

The problem is, horizontal scaling isn’t trivial. The culprits here are data and networking (plumbing really). State and caches need to be distributed and available to all. Databases need copies across nodes and need to be synchronised. Sharding usually becomes necessary (or you just end up with many very large nodes). And so on… Best bet is to avoid state as much as possible. But once you’ve cracked it you can build much larger solutions more efficiently (commodity hardware, virtualisation, containers etc.) and flex more rapidly than in the vertical world.
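To give a flavour of the plumbing involved, the simplest form of sharding just hashes a key to pick a node. This is a naive sketch (the function name and key format are hypothetical): changing the number of shards remaps most keys, which is why real systems tend towards consistent hashing or a lookup directory.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard number in the range 0..shard_count-1."""
    # A stable hash is used because Python's built-in hash() for strings
    # varies between processes, and every node must agree on the answer.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


for customer in ["customer-1001", "customer-1002", "customer-1003"]:
    print(customer, "-> shard", shard_for(customer, 4))
```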

I could go on about how historically the big players love the vertical-scaling thing (think Oracle and IBM trying to sell you those huge servers and SQL database solutions with the $$$ price-tags)… The world found NoSQL solutions which take a fundamentally different approach by accepting that consistency in many cases really isn’t as important as we once thought – and many of these are open-source…

Whatever, there’s only one way to scale… Horizontal!


Instrumentation as a 1st Class Citizen

I wrote previously that we are moving into an era of instrumentation and things are indeed improving. Just not as fast as I’d like. There’s a lot of good stuff out there to support instrumentation and monitoring including the likes of the ELK (ElasticSearch, Logstash, Kibana) and TIG (Telegraf, InfluxDB, Grafana) stacks as well as their more commercial offerings such as TICK (Telegraf, InfluxDB, Chronograf, Kapacitor), Splunk, DataDog, AppDynamics and others. The problem is, few still really treat instrumentation as a real concern… until it’s too late.

Your customers love this stuff! Really, they do! There’s nothing quite as sexy as an interactive graph showing how your application is performing as the load increases – transactions, visitors, response-times, server utilisation, queue-depths etc. When things are going well it gives everyone a warm fuzzy feeling that all is right with the universe. When things are going wrong it helps to quickly focus you in on where the problem is.

However, this stuff needs to be built into everything we do and not be an afterthought when the pressure’s on to ship-it and you can’t afford the time and effort to retrofit it. By then it’s too late.

As architects we need to put in the infrastructure and services needed to support instrumentation, monitoring and alerting. At a minimum this means putting in place standards for logging, data-retention policies, a data collection solution, a repository for the data and some tooling to allow us to search that data and visualize what’s going on. Better still we can add alerting when thresholds breach and use richer analytics to allow us to scale up and down to meet demand.

As developers we need to be considering what metrics we want to capture from the components we build as we’re working on them. Am I interested in how long it’s taking for this function call? Do I want to know how many messages a service is handling? How many threads are being spawned? What exceptions are being thrown? Where from? What the queue depths are?.. etc. Almost certainly… YES! And this means putting in place strategies for logging these things. Perhaps you can find the data in existing log files.. Perhaps you need to use better tooling for detailed monitoring… Perhaps you need to write some code yourself to track how things are going…
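If you do end up writing some code yourself, it needn’t be much. Below is a rough sketch of a decorator that counts calls, errors and time taken per function. The in-memory dictionaries are a stand-in for whatever metrics collector you actually use, and `handle_message` is a made-up example function.

```python
import time
from collections import defaultdict
from functools import wraps

# Stand-ins for a real metrics backend.
call_counts = defaultdict(int)
error_counts = defaultdict(int)
total_seconds = defaultdict(float)


def instrumented(func):
    """Record call count, error count and elapsed time for func."""
    name = func.__name__

    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            error_counts[name] += 1
            raise
        finally:
            # Runs on success and failure alike.
            call_counts[name] += 1
            total_seconds[name] += time.perf_counter() - start

    return wrapper


@instrumented
def handle_message(payload):
    if not payload:
        raise ValueError("empty message")
    return payload.upper()
```

Ship those three numbers off to your collector on a timer and you have the beginnings of the graphs your customer will love.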

Doing this from the start will enable you to get a much better feel for how things are working before you launch – including a reasonable view of performance and infrastructure demands which will allow you to focus your efforts better later when you do get into sizing and performance testing. It’ll mean you’re not scrambling around looking for log files to help you root-cause issues as your latest release goes into meltdown. And it’ll mean your customer won’t be chewing your ear off asking you what’s going on every five minutes – they’ll be able to see it for themselves…

So please, get it in front of your customer, your product owner, your sponsor, your architects, your developers, your testers and make instrumentation a 1st class citizen in the backlog.


We all hate e-mail. We all love e-mail…

E-mail is like writing a letter. There was a time when sitting down to write a letter (with pen) was an almost pleasant task which you expected to take a good hour on a rainy day… including moments of displaced thought spent staring out of the window (at this point anyone under the age of 30 is probably wondering what the hell I’m on about!).

I still can (and do) waste a good hour or two writing an e-mail.

E-mail is not:

  1. A replacement for conversation. The best tech solutions we have for this are Google Hangouts, Facetime or Skype etc. or even; god forbid, the telephone (psst, don’t tell the kids they can talk into those things). Ping-pong emails are just an ineffective and tedious form of conversation – even worse when they’re to a cc list who mostly couldn’t care less about the topic. It’s morse-code compared to the telephone. If you can, get off your back-side, walk across the office and talk to them!
  2. A replacement for instant-messaging. IM deserves more credit than it typically receives and in many organisations is a fundamental necessity to improve communication. You should be ashamed of yourself if you use e-mail this way! IM is quicker, simpler and crucially doesn’t fill your day with a tonne of “in-box” items you’ll never get round to. And if your only experience of IM is Lync.. you need to get out more.
  3. A replacement for group conversation. Get a room for god’s sake! Co-location is the #1 solution for group communication. But if you can’t do that (and don’t give in, fight for this as it will revolutionise your working life) then many tools are available to ease comms over a distance. Many of these, Slack notably in my experience, can be truly engaging for group conversations.
  4. And the cherry on the cake… a replacement for documentation. If you think that sending an email with detailed information counts as “documentation” then you deserve to be taken outside, strapped to the stocks, de-trousered, painted in pigs-blood and have your children forced to throw a variety of spoiled food products at your sorry carcass till their tears run dry. Put it in a wiki or a teamroom or in a document on a file-system – I care not which. But stuffed in the crevice of some email chain where it’s neither obvious nor available to those that need it only serves to deter the distribution of knowledge, increase confusion and encourage chaos and entropy to thrive. If your organisation works this way then your organisation is likely living off institutionalised knowledge which may walk out the door tomorrow.

We all hate e-mail. We all love e-mail… No, scratch that. E-mail is rubbish and should be relegated to the same historic status as letter writing. Occasionally nice to receive but quaint and you’d rather not spend your time writing them… It’s time to abandon e-mail!


We can have a small server…


…a big server (aka vertical scaling)…


.. a cluster of servers (aka horizontal scaling)…


.. or even a compute grid (horizontal scaling on steroids).


For resiliency we can have active-passive…


… or active-active…


… or replication in a cluster or grid…


…each with their own connectivity, load-balancing and routing concerns.

From a logical perspective we could have a simple client-server setup…


…a two tier architecture…


…an n-tier architecture…


…a service oriented (micro- or ESB) architecture…


…and so on.

And in each environment we can have different physical topologies depending on the environmental needs, with logical nodes mapped to each environment’s servers…


With our functional components deployed on our logical infrastructure using a myriad of other deployment topologies..


… or …


… and on and on and on…

And this functional perspective can be implemented using dozens of design patterns and a plethora of integration patterns.


With each component implemented using whichever products and packages we choose to be responsible for supporting one or more requirements and capabilities…


So the infrastructure we rely on, the products we select, the components we build or buy; the patterns we adopt and use… all exist for nothing but the underlying requirement.

We should therefore be able to trace from requirement through the design all the way to the tin on the floor.

And if we can do that we can answer lots of interesting questions such as “what happens if I turn this box off?”, “what’s impacted if I change this requirement?” or even “which requirements are driving costs?”. Which in turn can help improve supportability, maintainability and availability and reduce costs. You may even find your product sponsor questioning if they really need this or that feature…
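As a toy illustration of the “what happens if I turn this box off?” question, suppose the traceability is held as two simple mappings. All the requirement, component and server names below are invented, and a real model would live in an architecture repository or CMDB rather than a script.

```python
# Hypothetical traceability data: requirement -> components -> servers.
requirement_components = {
    "REQ-1 search catalogue": ["search-service", "web-ui"],
    "REQ-2 take payment": ["payment-service", "web-ui"],
}
component_servers = {
    "web-ui": ["web01", "web02"],
    "search-service": ["app01"],
    "payment-service": ["app02"],
}


def impacted_requirements(server):
    """Requirements with at least one component deployed on the server."""
    hit = {component
           for component, servers in component_servers.items()
           if server in servers}
    return sorted(requirement
                  for requirement, components in requirement_components.items()
                  if hit & set(components))


print(impacted_requirements("app01"))  # prints ['REQ-1 search catalogue']
```

Whether a requirement is actually broken or merely degraded depends on the resiliency model of each component (web01 has a twin in web02), which this sketch ignores.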

Performance Testing is Easy

Performance testing is easy. We just throw as many requests at the system as we can as quickly as we want and measure the result. Job done, right?

tl;dr? Short form…

  1. Understand the user scenarios and define tests. Review the mix of scenarios per test and the type of tests to be executed (peak, stress, soak, flood).
  2. Size and prepare the test environment and data. Consider the location of injectors and servers and mock peripheral services and systems where necessary.
  3. Test the tests!
  4. Execute and monitor everything. Start small and ramp up.
  5. Analyse results, tune, rinse and repeat until happy.
  6. Report the results.
  7. And question to what level of depth performance testing is really required…

Assuming we’ve got the tools and the environments, the execution of performance tests should be fairly simple. The first hurdle though is in preparing for testing.

User Scenarios and Test Definitions

In order to test we first need to understand the sort of user scenarios that we’re going to encounter in production which warrant testing. For existing systems we can usually do some analysis on web-logs and the like to figure out what users are actually doing and try to model these scenarios. For this we may need a year or more of data to see if there are any seasonal variations and to understand what the growth trend looks like. For new systems we don’t have this data so need to make some assumptions and estimates as to what’s really going to happen. We also need to determine which of the scenarios we’re going to model and the transaction rates we want them to achieve.

When we’ve got system users calling APIs or running batch-jobs the variability is likely to be low. Human users are a different beast though and can wander off all over the place doing weird things. To model all scenarios can be a lot of effort (which equals a lot of cost) and a risk based approach is usually required. Considerations here include:

  • Picking the top few scenarios that account for the majority of activity. It depends on the system, but I’d suggest keeping these scenarios down to <5 – the fewer the better so long as it’s reasonably realistic.
  • Picking the “heavy” scenarios which we suspect are most intensive for the system (often batch jobs and the like).
  • Introducing noise to tests to force the system into doing things they’d not be doing normally. This sort of thing can be disruptive (e.g. a forced load of a library not otherwise used may be just enough to push the whole system over the edge in a catastrophic manner).

We next need to consider the relative mix of user scenarios for our tests (60% of users executing scenario A, 30% doing scenario B, 10% doing scenario C etc.) and the combinations of scenarios we want to consider (running scenarios A, B, C vs. A, B, C plus batch job Y).
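Most load tools let you configure this mix directly, but the idea is just weighted random selection. A sketch using the 60/30/10 split above (the function name and the seed are my own additions, the seed being there so a run can be repeated):

```python
import random
from collections import Counter

# Relative mix of scenarios from the example above (percent of users).
scenario_mix = {"A": 60, "B": 30, "C": 10}


def assign_scenarios(user_count, mix, seed=None):
    """Pick a scenario for each virtual user according to the mix."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=user_count)


assignments = assign_scenarios(1000, scenario_mix, seed=42)
print(Counter(assignments))
```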

Some of these tests may not be executed for performance reasons but for operability – e.g. what happens if my backup runs when I’m at peak load? or what happens when a node in a cluster fails?

We also need test data.

For each scenario we should be able to define the test data requirements. This is stuff like user-logins, account numbers, search terms etc.

Just getting 500 test user logins set up can be a nightmare. The associated test authentication system may not have capacity to handle the number of logins or accounts and we may need to mock it out. It’s all too common for peripheral systems not to be in the position to enable performance testing as we’d like and in any case we may want something that is more reliable when testing. For any mock services we do decide to build we need to work out how they should respond and what their performance should look like (it’s no good having a mock service return in 0.001 seconds when the real thing takes 1.8 seconds).
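A mock with realistic latency might look something like the sketch below. The 1.8 second mean comes from the example above; the jitter, the class name and the shape of the response are assumptions, and in practice you’d want to measure the real service’s distribution rather than guess at a normal one.

```python
import random
import time


class MockAuthService:
    """Stand-in for a login service that responds about as slowly as the real one."""

    def __init__(self, mean_latency=1.8, jitter=0.3, sleep=time.sleep, seed=None):
        self.mean_latency = mean_latency
        self.jitter = jitter
        self.sleep = sleep  # injectable so the mock itself can be tested quickly
        self.rng = random.Random(seed)

    def login(self, username):
        # Sample a delay around the mean; never negative.
        delay = max(0.0, self.rng.gauss(self.mean_latency, self.jitter))
        self.sleep(delay)
        return {"user": username, "token": f"test-token-{username}"}
```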

Account numbers have security implications and we may need to create dummy data. Search terms; especially from humans, can be wild and wonderful – returning millions or zero records in place of the expected handful.

In all cases, we need to prepare the environment based on the test data we’re going to use and size it correctly. Size it? Well, if production is going to have 10 million records it’s not much good testing with 100! Copies of production data, possibly obfuscated, can be useful for existing systems. For new systems though we need to create the data. Here be dragons. The distribution of randomly generated data almost certainly won’t match that of real data – there are far more instances of surnames like Smith, Jones, Taylor, Williams or Brown than there are like Zebedee. If the distribution isn’t correct then the test may be invalid (e.g. we may hit one shard or tablespace and associated nodes and disks too little or too much).
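One way to avoid a uniform spread is to weight the generator. The weights below are invented purely for illustration; ideally you’d derive them from real (or obfuscated production) data.

```python
import random
from collections import Counter

# Made-up relative frequencies: common names dominate, rare ones are rare.
surname_weights = {
    "Smith": 50,
    "Jones": 40,
    "Taylor": 30,
    "Williams": 30,
    "Brown": 25,
    "Zebedee": 1,
}


def generate_surnames(count, seed=None):
    """Generate surnames with a skewed, more lifelike distribution."""
    rng = random.Random(seed)
    names = list(surname_weights)
    weights = [surname_weights[name] for name in names]
    return rng.choices(names, weights=weights, k=count)


sample = generate_surnames(10000, seed=1)
print(Counter(sample).most_common())
```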

I should point out here that there’s a short cut for some situations. For existing systems with little in the way of stringent security requirements, no real functional changes and idempotent requests (think application upgrades or hardware migrations of primarily read-only websites), replaying the legacy web-logs may be a valid way to test. It’s cheap, quick and simple – if it’s viable.

We should also consider the profile and type of tests we want to run. For each test profile there are three parts: the ramp-up time (how long it takes to get to the target volume), the steady-state time (how long the test runs at this level for) and the ramp-down time (how quickly we close the test). We usually care little for ramp-down and can close the test down quickly, but in some cases we want a nice clean shutdown. In terms of test types there are:

  • Peak load test – Typically a 1 to 2 hr test at peak target volumes. e.g. Ramp-up 30 minutes, steady-state 2hrs, ramp-down 5 mins.
  • Stress test – A longer test continually adding load beyond peak volumes to see how the system performs under excessive load and potentially where the break point is. e.g. Ramp-up 8 hrs, steady-state 0hrs, ramp-down 5 mins.
  • Soak test – A really long test running for 24hrs or more to identify memory leaks and the impact of peripheral/scheduled tasks. e.g. Ramp-up 30 mins, steady-state 24hrs, ramp-down 5 mins.
  • Flood test (aka Thundering Herd) – A short test where all users arrive in a very short period. In this scenario we can often see chaos ensue initially but the environment settling down after a short period. e.g. Ramp-up 0mins, steady-state 2hrs, ramp-down 5 mins
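The three-part profile is easy to model, which helps when checking that the tool is doing what we asked. A sketch follows, assuming a linear ramp and a target of 500 users (both assumptions, as is the class itself; real tools have their own ways of expressing this).

```python
from dataclasses import dataclass


@dataclass
class LoadProfile:
    name: str
    target_users: int
    ramp_up_mins: float
    steady_mins: float
    ramp_down_mins: float

    def users_at(self, minute):
        """Number of active virtual users at a given minute into the test."""
        if minute < 0:
            return 0
        if minute < self.ramp_up_mins:
            return round(self.target_users * minute / self.ramp_up_mins)
        minute -= self.ramp_up_mins
        if minute < self.steady_mins:
            return self.target_users
        minute -= self.steady_mins
        if minute < self.ramp_down_mins:
            return round(self.target_users * (1 - minute / self.ramp_down_mins))
        return 0


peak = LoadProfile("peak", 500, ramp_up_mins=30, steady_mins=120, ramp_down_mins=5)
flood = LoadProfile("flood", 500, ramp_up_mins=0, steady_mins=120, ramp_down_mins=5)

print(peak.users_at(15))   # prints 250 (halfway up the ramp)
print(flood.users_at(0))   # prints 500 (everyone arrives at once)
```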

So we’re now ready to script our tests. We have the scenarios, we know the transaction volumes, we have test data, our environment is prep’d and we’ve mocked out any peripheral services and systems.


There are many test tools available from the free Apache JMeter and Microsoft web stress tools to commercial products such as HP LoadRunner and Rational Performance Tester to cloud based solutions such as Soasta or Blitz. Which tool we choose depends on the nature of the application and our budget. Cloud tools are great if we’re hosting in the public cloud, not so good if we’re an internal service.

The location of the load injectors (the servers which run the actual tests) is also important. If these are sitting next to the test server we’ll get different results than if the injector is running on someone’s laptop connected via a VPN tunnel over a 256kbit ADSL line somewhere in the Scottish Highlands. Which case is more appropriate will depend on what we’re trying to test and where we consider the edge of our responsibility to lie. We have no control over the sort of devices and connectivity internet users have so perhaps our responsibility stops at the point of ingress into our network? Or perhaps it’s a corporate network and we’re only concerned with the point of ingress into our servers? We do need to design and work within these constraints so measuring and managing page weight and latency is always a concern but we don’t want to have the complexity of all that “stuff” out there which isn’t our responsibility weighing us down.

Whichever tool we choose, we can now complete the scripting and get on with testing.


Firstly, check everything is working. Run the scripts with a single user for 20 minutes or so to ensure things are responding as expected and that the transaction load is correct. This will ensure that as we add more users we’re scaling as we want and that the scripts aren’t themselves defective. We then quite quickly ramp the tests up: 1 user, 10 users, 100 users etc. This helps to identify any concurrency problems early on, with fewer users than the full load (which can add too much noise and make it hard to see what’s really going on).

If we’ve an existing system, once we know the scripts work we will want to get a baseline from the legacy system to compare to. This means running the tests on the legacy system. What? Hang on! This means we need another instance of the system available running the old codebase with similar test data and similar; but possibly not identical, scripts! Yup. That it does.

If we’ve got time-taken logging enabled (%D for Apache mod_log_config) then we could get away with comparing the old production response times with the new system so long as we’re happy the environments are comparable (same OS, same types of nodes, same spec, same topology, NOT necessarily the same scale in terms of numbers of servers) and that the results are showing the same thing (which depends on what upstream network connectivity is being used). But really, a direct comparison of test results is better – comparing apples with apples.

We also need to consider what to measure and monitor. We are probably interested in:

  • For the test responses:
    • Average, max, min and 95th percentile for the response time per request type.
    • Average, max, min size for page weight.
    • Response codes – 20x/30x probably good, lots of 40x/50x suggests the test or servers are broken.
    • Network load and latency.
  • For the test servers:
    • CPU, memory, disk and network utilisation throughout the test run.
    • Key metrics from middle-ware; queue depths, cache-hit rates, JVM garbage collection (note that JVM memory will look flat at the server level so needs some JVM monitoring tools). These will vary depending on the middle-ware and for databases we’ll want a DBA to advise on what to monitor.
    • Number of sessions.
    • Web-logs and other log files.
  • For the load injectors:
    • CPU, memory, disk and network utilisation throughout the test run. Just to make sure it’s not the injectors that are overstretched.
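Any test tool will report the response-time figures for you, but for the avoidance of doubt about what is meant, here’s a sketch of the calculation. It uses the nearest-rank method for the 95th percentile, which is one of several common definitions, so your tool’s numbers may differ slightly.

```python
def summarise(times_ms):
    """Average, min, max and 95th percentile of a list of response times."""
    if not times_ms:
        raise ValueError("no response times recorded")
    ordered = sorted(times_ms)
    # Nearest-rank percentile: the smallest value with at least 95% of
    # samples at or below it. Integer maths avoids float rounding issues.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "avg": sum(ordered) / len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p95": ordered[rank - 1],
    }


print(summarise([120, 95, 110, 3000, 105]))
```

Note how a single slow response drags the average up; it’s the reason for reporting the percentile alongside it.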

And finally we can test.

Analysis and Tuning

It’s important to verify that the test achieved the expected transaction rates and usage profiles. Reviews of log files to ensure no-errors and web-logs to confirm transaction rates and request types help verify that all was correct before we start to review response times and server utilisation.

We can then go through the process of correlating test activity with utilisation, identifying problems and limits near capacity (JVM memory for example), and extrapolating for production – for which some detailed understanding of the scaling nature of the system is required.

It’s worth noting that whilst tests rarely succeed first time, in my experience it’s just as likely to be an issue with the test as it is with the system itself. It’s therefore necessary to plan to execute tests multiple times. A couple of days is normally not sufficient for proper performance testing.

All performance test results should be documented for reporting and future needs. To already have an understanding of why certain changes have been made and a baseline to compare to the next time the tests are run is invaluable. It’s not war-and-peace, just a few pages of findings in a document or wiki. Most test tools will also export the results to a PDF which can be attached to keep track of the detail.


This post is already too long but one thing to question is… Is it worth the effort?

A Zipf distribution exists for systems and few really have that significant a load. Most handle a few transactions a second if that. I wouldn’t suggest “no performance testing” but I would suggest sizing the effort depending on the criticality and expected load. Getting a few guys in the office to hit F5 whilst we eyeball the CPU usage may well be enough. In code we can also include timing metrics in unit tests and execute these a few thousand times in a loop to see if there’s any cause for concern. Getting the engineering team to consider and monitor performance early on can help avoid issues later and reduce the need for multiple performance test iterations.
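A timing check in a unit test can be as crude as the sketch below. The function under test, the iteration count and the one second budget are all made up for illustration; pick a budget generous enough that the test doesn’t flap on a busy build server.

```python
import time
import unittest


def parse_order(line):
    """Hypothetical function under test."""
    order_id, quantity = line.split(",")
    return int(order_id), int(quantity)


class ParseOrderTimingTest(unittest.TestCase):
    ITERATIONS = 5000
    BUDGET_SECONDS = 1.0  # assumed, deliberately generous

    def test_parse_order_is_fast_enough(self):
        start = time.perf_counter()
        for _ in range(self.ITERATIONS):
            parse_order("1001,3")
        elapsed = time.perf_counter() - start
        self.assertLess(elapsed, self.BUDGET_SECONDS)
```

It won’t tell you how the system behaves under concurrent load, but it will flag the day someone makes a hot function ten times slower.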

Critical systems with complex transactions or an expected high load (which as a rough guide I would say is anything around 10tps or more) should be tested more thoroughly. Combining capacity needs with operational needs informs the decision – four 9’s and 2k tps is the high end from my experience – and a risk based approach should always be used when considering performance testing.

Availability SLAs

I’ve been considering availability recently given the need to support five 9’s availability (amongst other non-functionals) and have decided to draw up the list below. Thoughts/comments appreciated.

Assumptions at the bottom.

99% – Accepted downtime = 3.6 days/year

Single server is sufficient assuming disk failures and restarts can be accommodated within minutes to hours.

Data centre (DC) failure means we’ve time to rebuild the application elsewhere if we need to (most DC issues will be resolved quicker than this) and restores from back-up tapes should be easy enough.

Daily backups off-site required in case of total DC failure.

99.9% – Accepted downtime = Just under 9 hrs/year

We probably lose 4.5hrs/year due to patching if we use a single server which only allows 4.5hrs to resolve a crash or other failure. This is close and likely not enough time (especially if a crash occurs more than once a year).

Therefore we need a clustered service for resiliency (alternating which node is patched to avoid service outage during that time). This may be active-active nodes or active-passive which makes SQL database configuration simpler.

DC issues are probably resolved in time but a cold stand-by available in a second DC is advisable (restore from offsite backups, with the option to use the pre-production environment if capacity allows and it’s in a second DC).

Daily backup with redo logs taken (and transferred offsite) every 4 hours.

99.99% – Accepted downtime = Just under 1 hr/year

We could in theory still accommodate with a clustered solution in one location but an issue at the datacenter level will be a real headache.

So we now want resiliency across DCs but can tolerate an hour to switch over if required. Active-passive DC solution therefore required with geographically dispersed data centres.

We need to replicate data in near-time and have a secondary environment available (warm) ready to take the load in the event it’s required, with GTM (Global Traffic Manager) ready to route traffic if our DNS changes take too long to ripple out. Classic SQL technology (with those enterprise options) is still viable.

But at least we can still use traditional storage and database technology (daily backups, redo logs shipped every 30 mins; database mirroring etc.).

99.999% – Accepted downtime = Just over 5 mins/year (my requirement)

5 minutes is probably too short a time to fire-up a secondary environment so we need active-active data-centres, GTM now used to distribute load across DC’s and route solely to one in the event of an outage in another.

Data replication must be bi-directional allowing reads and writes simultaneously to each DC. This is complex, adds latency which degrades performance and consequently has significant impact on decisions relating to storage and database technology. Most classical SQL databases start to struggle and we probably need to ensure data is sharded between data-centers by whatever strategy makes most sense for the data and allow replication of shards across DC’s.

Application components need to be responsive to failures and route accordingly when they are detected. Monitoring, alerting and automatic failover is needed to ensure the response to failure is rapid. A Tunguska-scale collision becomes an event worth considering and could have a huge impact on the power network, disrupting multiple DCs if not sufficiently geographically distributed. However at the odds of one collision every 500 years, so long as we can rebuild everything within 40 hrs somewhere unaffected it’s a risk that could be taken. A tertiary “cold” DC ready for such an event becomes a consideration though.

99.9999% – Accepted downtime = Around 30secs/year

Taking things into silly territory…

We need to know pretty damn quickly that something’s gone wrong. Timeouts have to be reduced to ensure that we have time to retry transactions which may need to go to other nodes or DCs when a failure occurs. Components need to become more distributed, location agnostic, atomic and self-managing – automatic failover at each instance and each tier is required. This results in changes to the sort of patterns adopted in design and development, additional complexity to detect failures in a timely manner, and routing and retry considerations to avoid failures. Additional DC’s are necessary and a data-centre on the moon becomes something to consider.

99.99999% – Accepted downtime = Around 3 secs/year

A “blip” will be considered an outage and we’re reaching the level where typical response times today are unacceptable – what happens when accepted downtime is less than transaction performance!?

Timeouts are reduced to a daft level as we need to know in 1.5 seconds at most to allow time to retry. The deeper down we go, the less time is available and the worse things get (in a three tier architecture we’ve 0.375secs to complete any transactions). Trying to achieve data consistency now becomes virtually impossible. The option of a DC on the moon is no longer viable though due to latency (1.3s for light to get from Earth to the moon, 2.6s round trip).
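The 0.375secs figure comes from halving the time available at each tier so that a single retry always fits. A small sketch of that arithmetic, assuming the whole 3 second yearly budget is available to one transaction and that each tier allows exactly one retry (the function name is mine):

```python
from typing import List


def tier_timeouts(downtime_budget_s: float, tiers: int) -> List[float]:
    """Timeout for each tier, top down, leaving room for one retry per tier."""
    timeouts = []
    remaining = downtime_budget_s
    for _ in range(tiers):
        remaining /= 2  # keep half the time back for a single retry
        timeouts.append(remaining)
    return timeouts


print(tier_timeouts(3.0, 3))  # [1.5, 0.75, 0.375]
```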



  1. Individual servers recycle once a month for 30 mins for patching.
  2. Assumes available hours requirement is 24×7.
  3. Says nothing about scalability.
  4. Assumes a data center outage occurs once every 10 years.
  5. Assumes a server crash once every six months.
  6. Assumes RPO (recovery point objective) is the same as the availability requirement.
  7. Assumes RAID storage used to avoid single disk outages.
  8. Assumes cutover from active node to passive node takes less than 1 minute.

Windows Update – Really? Still? In 2015?

I’d almost forgotten how ridiculous this is after a couple of years with OS X and Linux… Happened on the train this morning. Well, I’ll just look at the scenic view that is the south of England for the next half an hour then… Thanks Microsoft!


p.s. Some would claim this isn’t such a bad thing since the south of England can be beautiful. In this case though, I had work to do.


Instrumentation Revolution

It’s long been good practice to include some sort of tracing in code to help with problems as and when they arise (and they will). And as maligned as simply dumping to stdout is, I would prefer to see this than no trace at all. However, numerous logging frameworks exist and there’s little excuse not to use one.

We have though gotten into the habit of disabling much of this valued output in order to preserve performance. This is understandable of course as heavily used components or loops can chew up an awful lot of time and I/O writing out “Validating record 1 of 64000000. Validating record 2 of 64000000…” and so on. Useful huh?

And we have various levels of output – debug, info, warning, fatal – and the ability to turn the output level up for specific classes or libraries. Cool.
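As a rough illustration using Python’s standard `logging` module (the logger names are made up, and Python calls the top level CRITICAL rather than fatal), a production-style default with one component turned up looks something like this:

```python
import logging

# Attach a console handler with a simple format.
logging.basicConfig(format="%(asctime)s %(levelname)s %(name)s %(message)s")

# Production-style default: nothing below a warning gets through.
logging.getLogger().setLevel(logging.WARNING)

# Turn the level up for one (hypothetical) component only.
logging.getLogger("orders.validation").setLevel(logging.DEBUG)

logging.getLogger("orders.validation").debug("validating record 1")  # emitted
logging.getLogger("orders.payment").info("payment accepted")  # suppressed
```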

But what we do in production is turn off anything below a warning and when something goes wrong we scramble about, often under change control, to try and get some more data out of the system. And most of the time… you need to add more debug statements to the code to get the data out that you want. Emergency code releases, aren’t they just great fun?

Let’s face it, it’s the 1970s and we are, on the whole, British Leyland knocking up rust buckets which break down every few hundred miles for no reason at all.

British Leyland Princess

My parents had several of these and they were, without exception, shit.

One of the most significant leaps forward in the automotive industry over the past couple of decades has been the instrumentation of many parts of the car along with complex electronic management systems to monitor and fine tune performance. Now when you open the bonnet (hood) all you see is a large plastic box screaming “Do not open!”.

And if you look really carefully you might find a naff-looking SMART socket where an engineer can plug his computer in to get more data out. The car can tell him which bit is broken and probably talk him through the procedure to fix it…

Meanwhile, back in the IT industry…

It’s high time we applied some of the lessons from the failed ’70s automotive industry to the computer systems we build (and I don’t mean the unionised industries). Instrument your code!

For every piece of code, for every component part of your system, you need to ask, “what should I monitor?”. It should go without saying that you need to log exceptions when they’re raised but you should also consider logging:

  • Time-spent (in milliseconds or microseconds) for potentially slow operations (i.e. anything that goes over a network or has time-complexity risks).
  • Frequency of occurrence – Just log the event and let the monitoring tools do the work to calculate frequency.
  • Key events – Especially entry points into your application (web access logs are a good place to start), startup, shutdown etc. but also which code path requests went down.
  • Data – Recording specific parameters or configuration items etc. You do though need to be very careful here as to what you record to avoid having any personal or sensitive data in log files – no passwords or card numbers etc…
  • Environment utilisation – CPU, memory, disk, network – Necessary to know how badly you’re affecting the environment in which your code is homed.

If you can scale the application horizontally you can probably afford the few microseconds it’s going to take to log the required data safely enough.
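A minimal sketch of what this can look like in Python, using only the standard library. The `timed` helper, the logger name and the `key=value` message layout are my own assumptions rather than any standard; the point is that one small wrapper gets you the event, its outcome and the time spent in a single log line.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("instrumentation")  # hypothetical logger name


@contextmanager
def timed(operation, **fields):
    """Log one line per operation with its outcome and time spent in ms.

    `fields` are extra key=value pairs. Never pass personal or sensitive
    data (passwords, card numbers) here.
    """
    started = time.perf_counter()
    outcome = "ok"
    try:
        yield
    except Exception:
        outcome = "error"
        log.exception("event=%s exception raised", operation)
        raise
    finally:
        elapsed_ms = (time.perf_counter() - started) * 1000
        extras = " ".join(f"{key}={value}" for key, value in fields.items())
        log.info("event=%s outcome=%s elapsed_ms=%.3f %s",
                 operation, outcome, elapsed_ms, extras)
```

Used as `with timed("fetch_customer", region="eu"): ...` around anything that crosses the network. Frequency falls out for free, since the monitoring tools can simply count the events.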

Then, once logged, you need to process and visualise this data. I would recommend decoupling your application from the monitoring infrastructure as much as possible by logging to local files or, if that’s not possible, streaming it out asynchronously to somewhere (a queue, Amazon Kinesis etc.). By decoupling you keep the responsibilities clear and can vary either without necessarily impacting the other.
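One way to get both properties in Python is the standard library’s `QueueHandler` and `QueueListener`: the application only ever puts records on an in-memory queue, and a background thread does the file I/O. This is a sketch under the assumption that a local file is the destination; the function name, logger name and log format are mine.

```python
import logging
import logging.handlers
import queue


def start_background_file_logging(path, logger_name="app"):
    """Route a logger's records through a queue to a file written off-thread.

    Returns the logger and the listener; call listener.stop() at shutdown
    to drain anything still queued.
    """
    log_queue = queue.Queue()  # unbounded; the app never blocks on disk I/O

    file_handler = logging.FileHandler(path)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))

    # The listener owns the slow handler and runs in its own thread.
    listener = logging.handlers.QueueListener(log_queue, file_handler)
    listener.start()

    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    return logger, listener
```

Swapping the file handler for something that streams elsewhere later on does not touch the application code, which is the whole point of the decoupling.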

You then need some agent to monitor the logged output and upload it to some repository, a repository to store the data and some means of analysing this data as and when required.
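Whatever agent you pick, life is easier if each log line is machine-readable. A hedged sketch, assuming your agent can parse one JSON object per line; the formatter class and the `fields` convention for extra data are my own invention, not part of any library:

```python
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Format each record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional structured data passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)
```

With this attached to a handler, `log.info("order placed", extra={"fields": {"elapsed_ms": 12.5}})` produces a line the repository can index by field rather than by regex.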

Using tools like Kibana, Elasticsearch and Logstash – all from Elastic – you can easily monitor files and visualise the data in pretty much real-time. You can even do so in the development environment (I run Elasticsearch and Kibana on a Raspberry Pi 2 for example) to try to understand the behaviour of your code before you get anywhere near production.

So now when that production problem occurs you can see when the root event occurred and the impact it has across numerous components without needing to go through change-control to get more data out whilst the users suffer yet another IT failure. Once you know where to look the problem is 9 times out of 10 fixed. Dashboards can be set up to show at a glance the behaviour of the entire system and you’ll soon find your eye gets used to the patterns and will pick up on changes quite easily if you’re watching the right things.

The final step is to automate the processing of this data, correlate it across components and act accordingly to optimise the solution and eventually self-heal. Feedback control.
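In its simplest form, feedback control is just a rule that watches a metric and triggers an action. The toy sketch below assumes latency samples arrive one at a time and that `on_breach` is whatever corrective action you have automated (recycle a node, shed load, page someone); the class name and thresholds are purely illustrative.

```python
from collections import deque


class LatencyWatch:
    """Call `on_breach` when the rolling mean latency exceeds a threshold."""

    def __init__(self, window, threshold_ms, on_breach):
        self.samples = deque(maxlen=window)  # keeps only the newest samples
        self.threshold_ms = threshold_ms
        self.on_breach = on_breach

    def record(self, elapsed_ms):
        self.samples.append(elapsed_ms)
        if len(self.samples) < self.samples.maxlen:
            return  # not enough data yet to judge
        mean = sum(self.samples) / len(self.samples)
        if mean > self.threshold_ms:
            self.on_breach(mean)
```

Real systems need more care than this, such as damping so the action does not fire on every sample, but the loop of measure, compare and act is the same.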


With the costs of computing power falling and the costs of an outage rising you can’t afford not to know what’s going on. For now you may have to limit yourself to getting the data into your enterprise monitoring solution – something like Tivoli Monitoring – for operations to support. It’s a start…

Without the data we’re blind. It’s time we started to instrument our systems more thoroughly.