The internet is down, all is well

A few weeks ago I switched from Zen Internet (stable enough; a touch more expensive than the big boys; excellent customer service) to Community Fibre (1Gb/s; fibre to the premises; symmetric; cheaper than Zen; only if you live in London).

The speed has been excellent. Overall just under 1Gb/s (around 960Mb/s) direct from the hub, up to 600Mb/s from a laptop in line-of-sight and easily 150Mb/s - up and down - everywhere else in the house.

Until today.

When it broke.

No idea why. An engineer will be sent to the house tomorrow, and in the meantime I'm operating off a mobile phone (works better than I expected).

I shall give Community Fibre the chance to fix this teething problem, though that is not the reason for writing this post.

I also have a so-called smart thermostat, and since the wi-fi was down (I turned the now useless router off) the poor wee little smart hub had no way of talking to the thermostats, and we had no way of telling it to switch the heating on or off from the app (of course there's an app...).

This does not make for a happy house when the wife wants to have a shower.

These systems have fall-backs for such cases though, and pressing a button on the hub at least fires up the boiler, even if it doesn't talk to the thermostats - cue running around the house turning on the radiators and regaining enough brownie points to at least not spend the evening in the dog-house. Pressing the button again turns it off. Magic.

Ah-ha! I thought. But what happens if the wi-fi is working in the house even though the big bad internet is down? So I turned the router back on to find out.

To my surprise, it worked! The wife was not impressed but that's ok, I was (there's joy in small things). I can control the heating from the app despite the internet being unavailable.

It would be all too easy to design a system whereby the app and smart hub need internet access to initiate the handshake enabling the two to talk - a middle-man if you will. That this system (Drayton Wiser) did not rely on this, and after a little thinking (spinning) connected and controlled my boiler despite the lack of internet access, restored my faith that there are at least some competent engineers out there. The design team behind this has thought to cover a number of failure scenarios, such as the internet being unavailable or my local wi-fi being down, with some graceful degradation in capabilities.

In all cases it should be possible to operate the system - turn the heating on and off. It may not be pretty, but it is viable and preferable to freezing (or spending the evening under the cold stare of a woman with dirty hair).
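For the programmers in the room, here is roughly what I imagine that graceful degradation looks like. I have no idea how Drayton Wiser actually does it; this is a minimal sketch in Python, and every name in it (`set_heating`, `via_cloud`, `via_local_wifi`) is made up for illustration. The idea is simply to try each control path in order of preference and fall through to the next when one is down.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical names throughout; not the real Wiser API.
Channel = Tuple[str, Callable[[bool], None]]


def set_heating(on: bool, channels: Sequence[Channel]) -> Optional[str]:
    """Try each control path in order; return the name of the one that worked."""
    for name, send in channels:
        try:
            send(on)
            return name
        except ConnectionError:
            continue  # this path is down, degrade to the next one
    return None  # nothing worked: time to press the button on the hub


def via_cloud(on: bool) -> None:
    # Stand-in for the internet route, which is down in this story.
    raise ConnectionError("internet is down")


def via_local_wifi(on: bool) -> None:
    # Stand-in for talking to the hub directly over the home network.
    pass


used = set_heating(True, [("cloud", via_cloud), ("local wi-fi", via_local_wifi)])
print(used)  # local wi-fi
```

When every channel fails the function returns `None`, which is the cue for the physical button and a run around the radiators.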

We all too often ignore failure scenarios and exception cases - they're hard, complicated and expensive. Instead we fall into a trap of assuring ourselves that "it'll never happen", that the real world is a perfect and clean environment which never fails, or that 99.9% uptime is good enough. In truth the real world is random, noisy and unreliable, and one over which we have an insignificant amount of control. 99.9% uptime is essentially telling one in a thousand customers that you don't care about them. If you want to scale beyond a few hundred customers that 0.1% matters; at a million customers it can be a significant existential threat to your business.
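To put rough numbers on that 0.1%, here is a back-of-the-envelope calculation. It assumes a 365-day year and reads the 0.1% in two ways: as time the service is down, and as the share of customers who hit a failure. The function names are mine, purely for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Hours per year a service is down at the given uptime percentage."""
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0


def customers_affected(customers: int, failure_percent: float) -> float:
    """Number of customers hit if failure_percent of them see a failure."""
    return customers * failure_percent / 100.0


print(round(downtime_hours_per_year(99.9), 2))    # 8.76
print(round(customers_affected(1_000_000, 0.1)))  # 1000
```

That is most of a working day without service every year, or a thousand unhappy people at a million customers.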

The amount of hardware and software between the user and your application is staggering, and any of it could and will fail in ways you have not considered[1]. Network and node failures happen. We need to accept this and introduce measures to defend ourselves: bulkheads to contain failures in some nodes, retry mechanisms for transient failures, queues to decouple components, caching for reliability as well as performance, circuit breakers, and my favourite of the day, a get-out-of-jail-free card for key functions for when the shit hits the fan (like a button to turn a boiler on or a hot-spot through a mobile phone to stay connected).
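Two of those measures are small enough to sketch. What follows is a minimal, illustrative Python version of a retry with exponential backoff and a very basic circuit breaker with a fallback. The names and default values are my own assumptions, and it treats `ConnectionError` as the only transient failure; real libraries handle far more cases than this.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry(call: Callable[[], T], attempts: int = 3, base_delay: float = 0.1) -> T:
    """Retry a call that may fail transiently, doubling the wait each time."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts, let the caller decide
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


class CircuitBreaker:
    """After max_failures consecutive failures, skip the primary for reset_after seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()  # circuit open: don't even try the primary
            self.opened_at = None  # cool-off over: allow one trial call
        try:
            result = primary()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0  # success closes the circuit again
        return result
```

The `fallback` argument is where the get-out-of-jail-free card goes: whatever still gets the job done when the primary route is broken.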

The "if everything else fails I can still get the job done by doing x" kind of magic. Trust me. You'll sleep better.

[1] Originally I wrote "likely not considered", but no, you should assume you ain't considered half of it.
