Voyaging dwarves riding phantom eagles

It's been said before... the only two difficult things in computing are naming things and cache invalidation... or naming things and something else, but always naming things.

When I started in IT we used to name servers after Star Trek ships, the Enterprise, Excelsior, Voyager etc. This I could cope with because I was naïve and a bit of a trekkie (or should that be a trekker?) so this seemed amusing enough.

Then we started naming software packages after authors (Kafka), languages after snakes (Python) and chemical reactions (Rust), and the origin of some JavaScript libraries is a complete mystery (JHipster anyone?).

Then there are scrum team names based on Lord of the Rings characters (Hobbits, Elves etc.) which, erm, ok, and Marvel universe characters, Iron Man, Batman, whatever-man... (sorry guys, but the Marvel universe is boring).

And if you have project code names based on birds of prey then please don't be surprised if I can't tell the difference between a goshawk and a sparrowhawk.

I have very little connection with these names and they often bear no resemblance to what these things do, or if there is then any connection is so obscure as to be missed by the vast majority (sometimes being clever is stupid). 

Occasionally this is a good thing - obscurity and ambiguity are sometimes beneficial - but often such names serve only to confuse, restrict the flow of information and require mental vlookups to cross-reference code names to something more readily understood. We may as well invent our own language.

So "voyaging dwarves riding phantom eagles" would be work done by the accounting team on the development database to onboard a new client (or a scene from the Hobbit). Which do you think is clearer?

Beat me over the head with Rick and Morty memes long enough and perhaps I'll get it, but I'd rather spend my time riding my bike than watching cartoons (personal choices folks).

Names matter - and they are hard - so if you want to communicate clearly then use clear, simple and unambiguous names. Fair to say, I don't work in marketing...


The internet is down, all is well

A few weeks ago I switched from Zen Internet (stable enough; a touch more expensive than the big boys; excellent customer service) to Community Fibre (1Gb/s; fibre to the premises; symmetric; cheaper than Zen; only if you live in London).

The speed has been excellent. Overall just under 1Gb/s (around 960Mb/s) direct from the hub, up to 600Mb/s from a laptop in line-of-sight and easily 150Mb/s - up and down - everywhere else in the house.

Until today.

When it broke.

No idea why, an engineer will be sent to the house tomorrow, and in the meantime I'm operating off a mobile phone (works better than I expected).

I shall give Community Fibre the chance to fix this teething problem though that is not the reason for writing this post.

I also have a so-called smart thermostat, and since the wi-fi was down (I turned the now useless router off) the poor wee little smart hub had no way of talking to the thermostats and no way for us to tell it to switch the heating on or off from the app (of course there's an app...).

This does not make for a happy house when the wife wants to have a shower.

These systems have fall-backs for such cases though, and pressing a button on the hub at least fires up the boiler, even if it doesn't talk to the thermostats - cue running around the house turning on the radiators and regaining enough brownie points to at least not spend the evening in the dog-house. Pressing the button again turns it off. Magic.

Ah-ha! I thought. But what happens if the wi-fi is working in the house even though the big bad internet is down?

To my surprise, it worked! The wife was not impressed but that's ok, I was (there's joy in small things). I can control the heating from the app despite the internet being unavailable.

It would be all too easy to design a system whereby the app and smart hub need internet access to initiate the handshake enabling the two to talk - a middle-man if you will. That this system (Drayton Wiser) did not rely on this and, after a little thinking (spinning), connected and controlled my boiler - despite the lack of internet access - restored my faith that there are at least some competent engineers out there. The design team clearly thought through a number of failure scenarios, degrading capabilities gracefully when the internet is unavailable or my local wi-fi is down.

In all cases it should be possible to operate the system - turn the heating on and off. It may not be pretty, but it is viable and preferable to freezing (or spending the evening under the cold stare of a woman with dirty hair).

We all too often ignore failure scenarios and exception cases - they're hard, complicated and expensive. Instead we fall into a trap of assuring ourselves that "it'll never happen", that the real world is a perfect and clean environment which never fails, or that 99.9% uptime is good enough. In truth the real world is random, noisy and unreliable, and one over which we have an insignificant amount of control. 99.9% uptime is essentially telling one in a thousand customers that you don't care about them. If you want to scale beyond a few hundred customers that 0.1% matters; at a million customers it can be an existential threat to your business.
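To put rough numbers on that - a back-of-the-envelope sketch in Python, nothing more:

```python
# What "99.9% uptime" actually buys you - quick arithmetic, no more.
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def downtime_hours_per_year(uptime: float) -> float:
    """Expected hours offline per year at a given uptime fraction."""
    return HOURS_PER_YEAR * (1 - uptime)

def customers_let_down(uptime: float, customers: int) -> int:
    """The 'one in a thousand' customers you're telling you don't care about."""
    return round((1 - uptime) * customers)

print(downtime_hours_per_year(0.999))        # ~8.76 hours a year offline
print(customers_let_down(0.999, 1_000_000))  # 1000 customers out of a million
```

Three nines sounds impressive until you translate it: nearly nine hours offline a year, or a thousand ignored customers per million.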

The amount of hardware and software between the user and your application is staggering and any of it could and will fail in ways you have not considered[1]. Network and node failures happen. We need to accept this and introduce measures to defend ourselves. Measures such as bulkheads to contain failures in some nodes, retry mechanisms for transient failures, queues to decouple components, caching for reliability as well as performance, circuit breakers, and my favourite of the day, a get-out-of-jail-free card for key functions for when the shit hits the fan (like a button to turn a boiler on, or a hot-spot through a mobile phone to stay connected).
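As a sketch of what one of those measures looks like in practice, here's a minimal retry with exponential backoff (illustrative Python, not production code - a real version would distinguish transient from fatal errors, cap the delay and log each attempt):

```python
import random
import time

def with_retries(fn, attempts=3, base_delay=0.1):
    """Call fn, retrying on failure with exponential backoff.

    A minimal sketch: retries absorb transient blips; if we run out,
    the exception propagates so the caller's fallback can take over.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries - time for the get-out-of-jail-free card
            # back off exponentially, with jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Wrap a flaky network call in `with_retries` and transient failures get absorbed; when the retries run out, the fallback - the big red button on the boiler - takes over.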

The "if everything else fails I can still get the job done by doing x" magic! Trust me. You'll sleep better.

[1] Originally I wrote "likely not considered", but no, you should assume you ain't considered half of it.


We can't go on like this!

I'm sitting here in the sun - yes, it's sunny in south London - and for the past 30 minutes I've been trying to buy another of Martha Wells' excellent Murderbot Diaries on Kobo Books. It's failed. Several times. Once it insisted on an update first, grrr. Once it just timed out. Once I gave up. And now... the ereader has decided to reboot itself. It nearly worked once, but Monzo approval was needed and the Monzo app decided it was "Refreshing..."... and I guess it's hot, so perhaps it needed a long cold shower because, well, that didn't work either - or rather, when it did, the Kobo had given up!

In the midst of this Microsoft logged me out of OneDrive for some random reason (I don't care why right now and I don't need to be interrupted to be told), and the god awful Apple AirPlay stuttered, stop-start, repeatedly. You can't listen to music like this.

All I wanted to do was sit in the sun, read some trashy sci-fi (Murderbot is excellent trash sci-fi - and I mean that in the best possible way), and listen to some Metallica in the background (blame Stranger Things).

To add insult to injury my mother called and her computer's no longer running Skype and a reboot won't bring it back. And you can forget about talking her through going to Launchpad and clicking on Skype. She's 78, once threatened to leave my father if he ever bought a new computer again (this was the '80s), and despises them (computers, not men). I have her using Gmail and Skype and I've not got the energy to add any more complexity to her life or be the IT support bod any more than is strictly necessary. I shall fix Skype the next time I visit (which, btw, what's the f**king point of video chat if you need to visit in person to get it working!).

We can't go on like this. This stuff needs to be much more reliable, much more tolerant of noise, much less invasive, so much easier to use and just plain better.

On the up side, it's 18:30, time for a whisky and a game of cards.


Picture yourself on a boat on a river...

PO: We need a bridge over the river right here.

Me: Why?

PO: Because the customer needs to get to the other side.

Me: Why can't they use the bridge half a mile up river?

PO: Because that'll take them on a half hour round trip! That's not very nice...

Me: Right... and er, how many people do we think will use this bridge?

PO: None. We're not actually expecting anyone to use it. Not for the first year anyway, but we need to have it for compliance reasons.

Me: Which are?

PO: Compliance reasons.

Me: and...

PO: Urgh,... Sally told Peter told Ravi told Mike told Sue told Priya told Vlad told Jean-Claude told me that it was "important compliance reasons". And Sally is very busy and even more important...

Me: <sigh>... So it's not a big deal if that no-one has to walk for 30 minutes then.

PO: Well, someone might use it, and what if they're in a hurry?

Me: They should have planned ahead...

PO: You can't tell a customer to "plan ahead"!

Me: Yes, you can. You do it every time you go on a road trip, you just need to know to expect it. Anyway, how about we have a boat that can ferry people across?

PO: No, no, no! We've told you before. We don't like boats. They're expensive and can't carry many passengers.

Me: Not as expensive as bridges and you just told me we don't have any customers!

PO: Look Mr Grumpy, are you going to design a bridge or not?

Me: .... what's your budget?

PO: Haven't thought about that, but it needs to be ready a week on Tuesday... and who's that fool in the river?

Me: That'll be a customer who fell off the rope bridge we built up river...


Don't treat people like serverless functions.

When I were knee high to a grasshopper we didn't have all this new-fangled cloud infrastructure and we certainly didn't have the concept of serverless computing. How can you compute without a computer?...

But before my time (and I'm not that old!) computers were people. People like Sally. Actual humans sitting in offices with bits of paper, pencils and tables of logarithms and trigonometric functions. Adding, subtracting, scribbling down results, checking and verifying. Whether a human being or a hunk of metal, you need a computer to compute.

Well, most of the time the thing you're computing is important enough that you can't afford for it not to be computed. It's a bad thing if you need to calculate wage-packets and can't do it, whether it's because Sally's off sick and can't compute today or the server has gone down because it's run out of disk space. You're going to have a riot on your hands come Friday afternoon when the pubs open...

Which brings us to the concept of redundancy. 

Rather than relying on Sally alone we need to ensure we have someone else around who can also compute when she's out sick. Equally - in the world of tin - we need backups in case our primary fails - disks, networks, servers, power, cooling, data-centers etc. 

This is an expensive habit.

Almost every system will be important enough to warrant some degree of redundancy. The more critical they become, the greater the degree of redundancy required and the more time architects spend worrying about the impact of component failures, how long it takes to recover, how much data could be lost, acceptable error rates and so on.

In the bad old days we would literally have a standby server for every primary. Boom! Double the costs right there.

Just imagine if we needed to employ Jane as well as Sally just to cope with the days Sally was sick? Of course we'd not do this. We need to ensure the function can be picked up by someone else, but that person could be someone in a team with a degree of redundancy built in (or even someone whose main job is something else - Bob from accounting or Sally's manager, for example). Perhaps we can work out that if we need seven computers we best hire ten to cover the days some are sick...
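That "hire ten to cover seven" hunch can be sanity-checked with a little probability. A quick sketch, assuming - simplistically - that absences are independent, with a made-up 5% daily sickness rate:

```python
from math import comb

def p_enough_staff(hired: int, needed: int, p_sick: float) -> float:
    """Probability that at least `needed` of `hired` people turn up,
    treating each absence as an independent event (a big simplification)."""
    p_in = 1 - p_sick
    # Binomial: sum over all outcomes where k people are in and k >= needed
    return sum(
        comb(hired, k) * p_in**k * p_sick**(hired - k)
        for k in range(needed, hired + 1)
    )

print(f"{p_enough_staff(7, 7, 0.05):.3f}")   # hire exactly 7: ~0.698
print(f"{p_enough_staff(10, 7, 0.05):.3f}")  # hire 10: ~0.999
```

Hire exactly the seven you need and you're short-staffed roughly one day in three; three spare bodies gets you to three nines. Redundancy is expensive, but so is the alternative.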

Having a level of redundancy in the organisation provides the flexibility to handle outages and failures. Besides, people can't work at 100% capacity. They will burn out, productivity will fall, they'll hate you for it and will leave as soon as a better opportunity turns up.

Anyway, back in the world of tin, along came virtualisation and we could host multiple virtual machines (VMs) on one physical host, in much the same way one person could turn their hand to multiple tasks. This was great ('ish) as it reduced the number of physical servers significantly and saved on hardware, power, space, CO2 emissions and consequently dollars. We still needed some degree of redundancy in case a host node went down or a VM failed, but it's much better than before.

How much better?

Well, most systems aren't Google or Netflix. I know, surprising huh?

Most systems don't need to support thousands of transactions per second. Mostly it's less than 1 tps. Yup, one! And often it's a lot less than one... a few hundred transactions a day is typical of many systems. Me and my Casio fx-85GT Plus can handle that!

So we can stuff a lot of VMs onto a single physical host with perhaps 50 VMs running across 3 physical hosts in a cluster whilst still maintaining enough redundancy to ensure availability. Make that tin sweat!

Suffice to say, if we treated Sally like the tin, she would not be impressed.

VMs are still pretty hefty though. Each VM runs its own copy of the operating system and associated processes which makes them pretty big (GBs of RAM for the VM compared with perhaps a few MB for application processes). We've had multi-tasking operating systems for decades now and there's little reason we can't run multiple application processes on the same server. Other than it's a really bad idea.

Developers make lots of assumptions about the environment they're running in - which libraries and versions are available, what file paths they can use etc. - and a lot of these things aren't compatible with each other. They're also really bad at security and act like peace-loving drug-infused hippies... "hey, why would anyone else read my files man?". Add to this that bugs happen (always will), resulting in unstable or runaway processes crashing or consuming all the resources available, and it's a recipe for disaster.

Running multiple disparate application processes on the same server is a bad idea. Or was...

Now along comes containerization, providing a degree of isolation between processes within a server to prevent one hippie process treading on another hippie's toes. And we know how exposed hippies' toes are, don't we?

This can give us an order of magnitude increase in processes on a host so we're now up to 500 containers across our 3 node cluster of computers. Nice.

Sally on the other hand is seriously pissed.

But we still have to manage a bunch of physical servers underpinning our applications. Whether VMs or containers, there's a bunch of power hungry, raging hot physical computers burning away in the background. And in the case of human computers, really angry overworked ones.

Then came serverless.

Forget the server. You pay-per-use - that few hundred transactions a day - leveraging services provided by cloud providers. Everything from databases to messaging to raw compute can be provided as a pay-per-use service without needing to worry about the server or redundancy.

Erm, well, except for the cloud service provider - who worries about it a lot - and your architect, who still needs to worry about the non-functional as well as functional characteristics of all those services we end up consuming (especially the economics of serverless if that one tps turns into many thousands...).

Of course serverless isn't really server-less and there's always a bit of tin somewhere. We're really talking about building applications out of services (like AWS Lambda or Google Cloud Functions). Those services are carefully managed by cloud providers who supply all the necessary redundancy to support the resiliency, availability and scalability you'd expect.
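To make that concrete, here's what "forget the server" looks like from the developer's side - a sketch following the AWS Lambda Python handler convention (the event shape here is made up for illustration):

```python
import json

def handler(event, context):
    """Handle one request. No process to daemonise, no host to patch:
    the cloud provider invokes this per request and deals with the
    redundancy, scaling and tin underneath."""
    # 'name' is an illustrative field, not part of any real Lambda event schema
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}"}),
    }
```

You pay per invocation - which, at a few hundred transactions a day, is a very different cost profile from a standby server doubling your bill.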

But what about Sally? Does she still have a job?

Sadly no. Sally has now been moved into the gig economy on a zero-hours contract and works on a pay-per-use basis. She doesn't get any guaranteed work, an hourly rate or sickness benefits. Please don't treat people like serverless functions.



Docs

There, I said it. A four letter swear word. Something worse than the F’ word if the horror on the boss’ face is anything to go by.

We don’t do “documentation” anymore and besides, the agile manifesto says it’s immoral to write a word of documentation. The code is the documentation. That you have to get inside the twisted maze of my mind, and work out what drug infused insanity I was trying to convey at the time and which may or may not result in what was intended regardless… that’s your problem. 

Such nonsense pervades software development today. Although…

It’s not the lack of documentation per se but the lack of demonstrable thought that bugs me. A critical explanation of why things are the way they are.

Why has solution design been so poorly treated?

The answer to that lies in what design is trying to achieve.

Solution design aims to address the needs of the system, communicates how these needs are met and enables testing of the solution when the cost of change is lowest.

This testing is achieved by static walkthroughs and peer review.

However, in an agile world where we break down features and stories into small chunks that can be delivered rapidly and iteratively, the hefty tomes of yesterday's solution design documents are equally shrunk to focus on the few features and stories in scope.

Taken far enough it ends in a combined whiteboard design and review session with a handful of developers. And beyond taking a photo and circulating to the team, what's the point in doing anything more?

Firstly there's the supporting teams who are going to care for your solution through its early teething days, rebellious teenage years, into maturity and all the way to the grave – death being one of the few certainties in life. Making these guys bump their way around a darkened room trying to figure out how things hang together is unfair if not sadistic.

These guys don't work in scrum teams or story by story. They work from incident to incident, from shit-storm to fan-splattering shit-storm. They need a concise and holistic view of the solution for which a few poorly framed holiday photos doesn't cut it. Honestly, these guys are too nice to you.

Then there's the not insignificant matter of consistency. 

You can argue the marginal benefits of any technology but you need to justify any change which runs counter to the inherent design patterns of a system as contributing significant value to overcome the increased complexity and costs it brings.

Do things consistently and you get paybacks in terms of proven reliability, easier maintainability, reduced cognitive load, faster delivery and so on.

Do things inconsistently and whilst you may get to play with the latest tech you're being pretty selfish - if that's your motivation. And if it's not then see above re justifying the increased complexity and costs.

Which brings me to my point on solution design.

Solution design is about more than addressing the needs of a few stories on a sprint by sprint basis. It's about addressing the broad needs of the system and providing a vision for the longer term. Defining the patterns which will be used time and again and which (should) enable that payback through consistency.

This isn't an argument that technological progress isn't a good thing or that we should never have crawled out of the sea. Or that your resident architect has all the answers. It's an argument to think about the broad needs, key decisions, responsibilities, patterns, principles, policies and costs - immediate and long term - that provide the foundation on which any system is based.

At times it may seem futile and time consuming but it's cheap to change this stuff early on rather than late on when it's most expensive.

And the way we do that best is through reasoned discourse. Discourse best expressed through quality documentation - words, diagrams and matrices.


The Con of Agile (or why agile reductionism is hard…)

Agile is, to a large extent, a radical breakdown of function into small incremental features delivered in a prioritised manner with rapid feedback to inform on the next evolutionary step to deliver ever greater customer value.

Executed as such you may not - probably should not - get what you originally intended, and even if you fail then at least you've failed fast and saved yourself an expensive pipedream.


But agility is not easy. It is not something a tool, method or consultant will magically fix for you – despite what they tell you. Radical breakdown of function is hard. Reductionism is hard.

Most systems we build today are inherently complex with dependencies spread far and wide through the enterprise. Through reductionism we attempt to decompose these complex systems into their simpler component parts – a technique that has a long and successful history in software development – only to then recompose them in many and varied ways to deliver the outcomes we desire.

In a typical large organisation it's not uncommon for a system to require integration with dozens of others, and for any particular new feature to impact several of these at a time - in fact it's rarer for a feature to be self-contained within one system. Integration is the norm.

And here our problems start… 

According to Conway's law an organisation will typically design systems that mirror the organisation's communication structure, i.e. we end up with various systems mirroring the group and team structure of the organisation, integrated along the same communication channels as the organisation.

The net result is that we have specialist teams with a deep understanding not just of the technology but the values and ethos of each system and the organisation they reflect. These teams own the system and have veto rights – rightly so – over what functionality they do or do not support. 

So on one hand we have a prioritised backlog of customer-focused features, and on the other a collection of disparate teams available to deliver them, hell-bent on aligning to the internal organisation structure. Something the customer usually doesn't (and shouldn't) give a toss about.

An instinctive reaction is for teams to start to break down features into stories that make sense from the organisational structure perspective. This can create a reflection of the organisation in the user journey and leads to stories which are really tasks, because that's easiest for teams to consume. And if you're simply working off a prioritised list of tasks you're not agile. You may be able to re-prioritise, but you've lost the connection with why you're doing something and that critical feedback loop from the customer, so you can't respond to their changing needs.

Perhaps worse still, a disparate collection of tasks spread across a plethora of teams does not provide a clear vision and direction for the people working in those teams. Treated this way people can start to feel like slaves to the machine, churning out widgets hour after hour with no clear understanding of why. Motivation and quality suffer, and we no longer have small agile teams focused on delivering customer value but a collection of teams each with their own perspective on what we're trying to achieve.

As we break down features and stories into smaller chunks, maintaining a focus on customer value and the overall vision is hard - particularly in a large and complex organisation.

There’s now a twofold need both for well-articulated, customer focused and concise feature and story definition, and for solution designs which clearly stitches together the various parts of the organisation; and associated systems, connected to deliver those features.

The features and stories say what we’re trying to do and why we’re trying to do it.

The solution design says how we’re going to do it.

And that understanding of why we’re doing something is the “Ah-ha!” moment that gives us meaning and purpose, the reason we get out of bed in the morning. 

Ultimately it’s easy to explain what we’re trying to do and how we plan to do it – they’re concrete things we can action and it’s natural for us poor humans to want to focus on solving problems. It’s much harder to articulate why we’re trying to do something, and we all too often forget that others aren’t aware of the vision in the first place.

So as we break down capabilities and features into ever smaller chunks we need to consciously focus on the “why” we’re building this chunk and less so the “what” and the “how”. It may seem counter intuitive but solving problems is easy, finding them is harder.

As a consequence our features and stories should be heavy on the explanation of why we’re trying to do something and lighter on what we’re actually going to do. The “how” we’re doing something should be reserved for the solution design and provides traceability from the “what” and the “why” to the organisational structure, providing the foundation for that task list everyone seems to desire.

As we break down stories we also tend to reduce the number of teams involved, and if we go far enough that they only impact one team then great. It won't always be the case though, and we shouldn't twist stories to fit the internal organisation structure - making them fit us rather than the customer.

In short, as we break down features and stories we need to focus on why we're doing something and what we're going to do to address that need. Don't worry about how we're doing it - leave that to the technical team of architects and engineers - give them the easy task of connecting the dots and solving the problem.

The trap in agile reductionism is that we, as human beings, have a tendency to want to solve the problem, and in doing so lose the connection to the why that gives us meaning and purpose. We end up focusing on the solution, and thereby reduce the solution rather than the problem itself. The two are not the same. One is optimised for the organisation, the other for the customer.
