!Attachments, Links

As I’ve said before, email is broken. Unfortunately, like smack, many seem hooked on it. For your methadone, start to wean yourself off by sending links and not attachments.

It’s more accessible, more secure, you can keep the content up-to-date and ensure everyone sees the current version. It’ll even stop your mail file from bloating…

Just use a wiki like Confluence and try not to attach docs to pages…


In the increasingly interconnected micro-services world we’re creating the saying “a chain is only as strong as its weakest link” is particularly pertinent.

It’s quite easy for a single service to be dependent upon a number of downstream services as the diagram below shows.

An outage or go-slow in any one of the downstream services can have a knock-on impact upstream and right back to users. Measuring this in SLAs, let’s say B, C, D, E and F each aim for an availability SLA of 99.99%. Assuming they meet this, the best A can achieve is 99.95%. More realistically, B, C, D, E and F are probably dependent on other services and before you know it end users are doing well to see anything above 99% uptime.
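The arithmetic behind those numbers is just compounding, assuming independent failures; a quick sketch:

```python
def compound_availability(*availabilities):
    """End-to-end availability when every dependency must be up
    and failures are independent: multiply the SLAs together."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# Service A depending on B..F, each offering 99.99%:
print(f"{compound_availability(*[0.9999] * 5):.4%}")  # ~99.95%
```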

So what strategies do we have for dealing with this?

Firstly, you could just live with it. Really, don’t knock this option. Ask “do I really need the availability?”, “does it really matter if it goes down?”. Before we worry about any elaborate plan to deal with the situation, it’s worth considering whether the situation is really all that bad.

Ok, so it is… The next question should be “do I need a response immediately?”. If not, go asynchronous and put a queue between them. If the recipient is down, messages will queue up until it comes back – no problem. Just make sure the queue is as local as possible to the source, and persistent.

If it is a request-response model then consider a queue in any case. A queue can often be set to time out old messages and deal with slow responses (e.g. if no response in 5 seconds then abandon). This saves building up a large backlog of messages waiting to be processed – a backlog which causes lock-ups for requests that will never be answered and blocks the consumer for far longer than the downstream service is actually unavailable. And it can often be more efficient to have a queue-based competing-consumer model than multiple connections banging away sporadically.
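Most brokers support this natively through per-message expiry or TTL settings, but the idea is simple enough to sketch; the names and the 5-second threshold here are illustrative:

```python
import time
from collections import deque

TIMEOUT_SECONDS = 5  # abandon requests older than this
queue = deque()      # items are (enqueued_at, payload) tuples

def enqueue(payload):
    queue.append((time.monotonic(), payload))

def consume():
    """Pop the next message, silently discarding any that have expired.
    Expired requests would never get a useful response, so processing
    them only blocks the consumer for longer than the outage itself."""
    while queue:
        enqueued_at, payload = queue.popleft()
        if time.monotonic() - enqueued_at > TIMEOUT_SECONDS:
            continue  # stale: the caller gave up long ago
        return payload
    return None
```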

On top of this, ensure you’re using non-blocking libraries and implement circuit-breakers to trip when downstream services go offline. This of course begs the question, “what sort of response do I provide if the circuit-breaker is open?”… Well, that depends…
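Tripping a breaker needs surprisingly little machinery; a minimal sketch (thresholds illustrative – production systems would lean on something like resilience4j or Polly rather than rolling their own):

```python
import time

class CircuitBreaker:
    """Trips open after max_failures consecutive failures and fails
    fast (serving the fallback) until reset_after seconds have passed."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.reset_after:
            self.opened_at = None   # half-open: allow a trial call
            self.failures = 0
            return False
        return True

    def call(self, fn, fallback):
        if self.is_open():
            return fallback()       # fail fast, don't hammer the sick service
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        return result
```

The `fallback` here is whatever is appropriate for the service in question – a cached copy, a default, or an apology.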

In some cases you can cache previous responses and serve this. If this sort of caching model works then even better, you can decouple the request for content from that fetching it from a downstream service so that you’re in effect always serving from cache. Allowing stale cache entries to be served whilst revalidating even when downstream services are unavailable can significantly improve the responsiveness and availability of the system. Don’t discard cached items just because they’re old. Keep using them until a fresh copy can be obtained. Size is a concern but if you can afford it then cache your content for as long as the RTO demands (the service should be back by then, e.g 4hrs) and revalidate as frequently as the business demands the content be fresh (e.g. every 10 minutes).
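That serve-stale-while-revalidate policy might look like this; the 10-minute freshness and 4-hour RTO figures are the example numbers from above:

```python
import time

FRESH_FOR = 10 * 60   # revalidate every 10 minutes (business freshness)
KEEP_FOR = 4 * 3600   # only discard after the 4-hour RTO

cache = {}  # key -> (fetched_at, value)

def get(key, fetch):
    """Serve fresh entries from cache; revalidate stale ones, but fall
    back to the stale copy if the downstream fetch fails."""
    now = time.monotonic()
    entry = cache.get(key)
    if entry and now - entry[0] < FRESH_FOR:
        return entry[1]              # fresh enough: serve as-is
    try:
        value = fetch(key)           # stale or missing: revalidate
        cache[key] = (now, value)
        return value
    except Exception:
        if entry and now - entry[0] < KEEP_FOR:
            return entry[1]          # downstream down: serve stale
        raise                        # nothing usable left
```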

It may sound risky, but this approach can even be used with sensitive data such as user permissions. You’re looking at a coincidence of bad events which is quite unlikely – e.g. a user’s permissions are revoked (the old version is in cache), at the same time as the permissions system goes down, at the same time as the user attempts something they previously could do but should no longer be allowed to. It’s your risk, but what’s worse… one user doing something bad or the whole system being unavailable?

If you can’t or don’t have a cache then can you implement a default or fallback option? Having a blank slot on a page, but a working page otherwise, may be the best of a bad set of options but the best nonetheless.

All else failing, apologise, quickly (see circuit-breaker) and profusely. Let the user know it’s you, not them, that they needn’t worry (e.g. you’ve not charged them and have unpicked any dependent transactions should you have them) and that you’ll be back as soon as you can.

Finally, log everything, monitor and alert. Regardless of the fact that it’s bad to rely on your customers to tell you when you’ve a problem, in many cases the user may not even realise something is amiss. It can easily be overlooked. Ensuring you log and monitor makes it much easier to know when you’ve an issue, as well as allowing faster root-cause analysis.

Queues, circuit-breakers, serve-stale-while-revalidate and logging.

The Matrix

The matrix may well be the most under-appreciated utility in the toolbox of architects.

We produce diagrams, verbose documents and lists-of-stuff till the cows come home, but matrices are an all too rare, almost mythical, beast. Their power, though, is more real than the healing and purification properties of true Unicorn horns, despite what some may say.

Here’s an example.

The diagram below shows a contrived and simplified matrix of the relationship between user stories and components. In many cases such a matrix may cross hundreds of stories and dozens of components.

Picture of a matrix from a spreadsheet

Crucially we can see for a particular story which components are impacted. This provides much needed assurance to the architect that we have the needed coverage and allows us to easily see where functionality has no current solution. In this case “US4: Audit Logging”.

Adding some prioritisation (col C) allows us to see if this is going to be an immediate issue or not. In this case the product owner has (foolishly) decided auditing isn’t important…

Developers can use the matrix to see which components need implementation for a story and see what other requirements are impacted by the components they’re about to develop.
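The same checks fall out mechanically if the matrix is held as data rather than a picture; a sketch with hypothetical story and component names:

```python
from collections import Counter

# Hypothetical story-to-component matrix.
matrix = {
    "US1: Register":      {"Access Gateway", "Database"},
    "US2: Post Article":  {"Access Gateway", "Article Management", "Database"},
    "US3: Comment":       {"Access Gateway", "Article Management", "Database"},
    "US4: Audit Logging": set(),   # no component covers this yet
}

# Stories with no current solution:
uncovered = [story for story, comps in matrix.items() if not comps]

# Components touched by many stories - candidates for decomposition:
load = Counter(c for comps in matrix.values() for c in comps)
```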

Now, it may well be that we’ll proceed and accept any technical debt associated with high-priority requirements to deliver them faster. It may also be that the lower priority requirements never get delivered, so no problem. But it may instead be that the next story in the backlog has some particular nuanced requirement which makes things rather hairy, and which is best considered up-front rather than walked into blindly. It’s a balancing game with pros and cons – the matrix provides visibility to aid the assessment, which all parties can use.

And there’s more (in true infomercial style)… We can also see that the “Access Gateway”, “Article Management” and “Database” components appear to cover many stories. This may be fine if the functionality they provide is consistent across requirements – for example the “Access Gateway” may simply be doing authentication and authorisation consistently – but in other cases it suggests some decomposition and refinement is needed – for example we may wish to consider breaking out “Articles” and “Comments” into two separate components which have more clearly defined responsibilities. Regardless, it helps to see that some components are going to be critical to a lot of requirements and may need more care and attention than others.

So where does this particular matrix come from? We could be accused of the near cardinal sin today of following a waterfall mentality with the need for a big up-front design phase. Not so. It’s more akin to a medical triage.

We have a backlog. We need to review the backlog and sketch out the core components required to support this. We don’t need to dig into each component in great detail – just enough to provide assurances that we have what’s needed for the priority requirements and that the requirements have enough detail to support this (basically some high level grooming). Low priority or simple requirements we may skim over (patient will live (or die)), higher priority or complex ones we assess till we can build the assurances we need (patient needs treatment).

When new requirements arise we can also quickly assess them against the matrix to see where the impact will be.

This is just one of many useful matrices. Story-to-story can help identify requirement dependencies. Likewise for component-to-component. Mappings from logical components to infrastructure help build a view of the required environment and can, when taken to the physical level, be used for automatic identification of things like firewall rules. You can even connect matrices together to identify which requirements are fulfilled by which servers – e.g. physical-node to logical-node to component to requirement maps – or use them for problem analysis to work out what’s broken – e.g. “this function isn’t working, which components could it relate to?”. Of course, their value is only as good as the quality of the data they hold, so such capabilities are often not realised.

Like Unicorns, matrices can be magical. Fortunately for us – and I hate to break this to you – unlike Unicorns, matrices are real (despite what some may say!).

Soft Guarantees

In “The Need for Strategic Security”, Martyn Thomas considers some of the risks in the way systems are designed and built today, and some potential solutions to address the concerns raised. One of the solutions proposed is for software to come with a guarantee, or at least some warranty, around its security.

Firstly, I am (thankfully) not a lawyer but I can imagine the mind-bogglingly twisted legalese that will be wrapped around such guarantees. So much so as to make them next to useless (bar giving some lawyer the satisfaction of adding another pointless 20 paragraphs of drivel to the already bloated terms and conditions…). However, putting this aside, I would welcome the introduction of such guarantees if it is at all possible.

For many years now we’ve convinced ourselves that it is not possible to write a program which is bug-free. Even the simple program:

echo "Hello World"

has dependencies on libraries and the operating system (along with the millions of lines of code therein), all the way down to the BIOS, which means we cannot be 100% sure even this simple program will always work. We can never be sure it will run correctly for every nanosecond of every hour of every day of every year… for ever! It is untestable, and absolute certainty is not possible.

At a more practical level, however, we can bound our guarantees and accept some risks: “compatible with RHEL 7.2”, “… will work until year-end 2020…”, “… needs s/w dependencies x, y, z…” etc. Hmm, it’s beginning to sound much like a software licence and system-requirements checklist… Yuck! On the whole, we’re pretty poor at providing any assurances over the quality, reliability and security of our products.

Martyn’s point, though, is that more rigorous methods and tools will enable us to be more certain (perhaps not absolutely) about the security of the software we develop and rely on, allowing us to be more explicit about the guarantees we can offer.

Today we have tools such as SonarQube, which helps to improve the quality of code, or IBM Security AppScan for automated penetration testing. Ensuring such tools are used can help, but they need to be used right – if used at all. All too often a quick scan is done and only the few top (and typically obvious) issues are addressed. The variation in report output I have seen for scans of the same thing, using the same tools, but performed by different testers is quite ridiculous. A poor workman blames his tools.

Such tools also tend to be run once on release and rarely thereafter. The ecosystem in which our software is used evolves rapidly so continual review is needed to detect issues as and when new vulnerabilities are discovered.

In addition to tools we also need industry standards and certifications to qualify our products and practitioners against. In the security space we do have some standards such as CAPS and certification programmes such as CCP. Unfortunately few products go through the certification process unless they are specifically intended for government use and certified professionals are few and in-demand. Ultimately it comes down to time-and-money.

However, as our software is used in environments never originally intended for it, and as devices become increasingly connected and more critical to our way of life (or our convenience), it will be increasingly important that all software comes with some form of compliance assurance over its security – for which more accessible standards will be needed. Imagine if, in 10 years’ time when we all have “smart” fridges, some rogue state-sponsored hack manages to cycle them through a cold-warm-cold cycle on Christmas Eve… Would we notice on Christmas Day? Would anyone be around to address such a vulnerability? Roast potatoes and E. coli turkey, anyone? Not such a merry Christmas… (though the alcohol may help kill some of the bugs).

In addition, the software development community today is largely made up of enthusiastic and (sometimes) well-meaning amateurs. Many have no formal qualification or are self-taught. Many are cut’n’pasters who frankly don’t get-IT and just go through the motions. Don’t get me wrong, there are lots of really good developers out there. It’s just there are a lot more cowboys. As a consequence our reliance on security-through-obscurity is deeper than perhaps many would openly admit.

It’s getting better though and the quality of recent graduates I work with today has improved significantly – especially as we seem to have stopped trying to turn arts graduates into software engineers.

Improved and proven tools and standards help but at the heart of the concern is the need for a more rigorous scientific method.

As architects and engineers we need more evidence, transparency and traceability before we can provide the assurances and stamp of quality that a guarantee deserves. Evidence of the stresses components can handle and constraints that contain this. Transparency in design and in test coverage and outcome. Traceability from requirement through design and development into delivery. Boundaries within which we can guarantee the operation of software components.

We may not be able to write bug-free code but we can do it well enough and provide reasonable enough boundaries as to make guarantees workable – but to do so we need a more scientific approach. In the meantime we need to keep those damned lawyers in check and stop them running amok with the drivel they revel in.

Not all encryption is equal

Shit happens, data is stolen (or leaked) and your account details, passwords and bank-account are available online to any criminal who wants it (or at least is prepared to buy it).

But don’t panic, the data was encrypted so you’re ok. Sit back, relax in front of the fire and have another mince pie (or six).

We see this time and again in the press. Claims that the data was encrypted… they did everything they could… blah blah blah. Hmm, I think we need more detail.

It’s common practice across many large organisations today to encrypt data using full-disk encryption with tools such as BitLocker or Becrypt. This is good practice and should be encouraged, but it is only the first line of defence as it only really helps when the disk is spun down and the machine powered off. If the machine is running (or even sleeping) then all you need is the user’s password and you’re in. And who today really wants to shut down a laptop when they head home… and perhaps stop for a pint on the way?

In the data-centre the practice is less common because the risk of disks being taken out of servers and smuggled out of the building is lower. On top of this, the disks are almost always spinning, so any user or administrator who has access to the server can get access to the data.

So, moving up a level, we can use database encryption tools such as Transparent Data Encryption to encrypt the database files on the server. Great! So now normal OS users can’t access the data and need to go through the data access channel to get it. Problem is, lots of people have access to databases including DBAs who probably shouldn’t be able to see the raw data itself but who generally can. On top of this, service accounts are often used for application components to connect and if these credentials are available to some wayward employee… your data could be pissing out an open window.

To protect against these attack vectors we need to use application level encryption. This isn’t transparent and developers need to build in data encryption and decryption routines as close to exposed interfaces as practical. Now having access to the OS, files or database doesn’t do enough to expose the data. An attacker also needs to get hold of the encryption keys which should be held on separate infrastructure such as an HSM. All of which costs time and money.
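A sketch of what encrypting at the application boundary might look like, using the Fernet recipe from the Python `cryptography` library (the function names are illustrative, and in production the key would come from an HSM or key-management service, never from code or config sitting next to the data):

```python
from cryptography.fernet import Fernet  # pip install cryptography

# Illustrative only: a real key would be fetched from an HSM/KMS.
fernet = Fernet(Fernet.generate_key())

def store_card(pan: str) -> bytes:
    """Encrypt before the value ever reaches the database, a log line
    or a DBA's SELECT *."""
    return fernet.encrypt(pan.encode())

def read_card(token: bytes) -> str:
    """Decrypt as close to the exposed interface as practical."""
    return fernet.decrypt(token).decode()
```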

Nothing’s perfect, and there’s still the risk that a wayward developer siphons off data as it passes through the system, or that some users have too-broad access rights and can access data, keys and code. These can be mitigated through secure development practices, change management and good access management… to a degree.

Furthermore, encrypting everything impacts functionality – searching encrypted data becomes practically impossible – or may not be as secure as you’d expect – a little statistical analysis on some fields can expose the underlying data without actually needing to decrypt it due to a lack of sufficient variance in the raw data. Some risks need to be accepted.

We can then start to talk about the sort of ciphers used, how they are used and whether these and the keys are sufficiently strong and protected.

So when we hear in the press that leaked data was encrypted, we need to know more about how it was encrypted before deciding whether we need to change our passwords before tucking into the next mince pie.

Merry Christmas!

Scalable = Horizontal

There are two ways of scaling, vertical and horizontal, but only one of them is really scalable.

Vertical scaling essentially means bigger nodes. If you’ve got 8GB RAM, go to 16GB. If you’ve 2 cores, go to 4.. and so on.

Horizontal scaling means adding more nodes. One node to two nodes, to three and so on.

As a rule, horizontal scaling is good. Theoretically there’s no limit to the number of nodes you can have.

As a rule, vertical scaling is bad. You quickly run into constraints over the number of cores or the amount of RAM you can support. And for many of today’s problems this just doesn’t work. Solutions need to be both scalable at internet scale and available 24×7. Relying on large single nodes in such situations is not ideal (and those supercomputers with 250,000+ processors are really horizontal solutions as well).

The problem is, horizontal scaling isn’t trivial. The culprits here are data and networking (plumbing really). State and caches need to be distributed and available to all. Databases need copies across nodes and need to be synchronised. Sharding usually becomes necessary (or you just end up with many very large nodes). And so on… Best bet is to avoid state as much as possible. But once you’ve cracked it you can build much larger solutions more efficiently (commodity hardware, virtualisation, containers etc.) and flex more rapidly than in the vertical world.
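Sharding itself can start as simply as hashing the key, so that every node routes the same way without any shared lookup state; a sketch (node names illustrative):

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]

def shard_for(key: str, nodes=NODES) -> str:
    """Deterministically route a key to a node: every client applies
    the same rule, so no central lookup table is needed."""
    digest = hashlib.sha256(key.encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]
```

The catch is that naive modulo reshuffles most keys whenever a node is added or removed, which is why real systems reach for consistent hashing instead.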

I could go on about how historically the big players love the vertical-scaling thing (think Oracle and IBM trying to sell you those huge servers and SQL databases solutions with the $$$ price-tags)… The world found NoSQL solutions which take a fundamentally different approach by accepting that consistency in many cases really isn’t as important as we once thought – and many of these are open-source…

Whatever, there’s only one way to scale… Horizontal!


Instrumentation as a 1st Class Citizen

I wrote previously that we are moving into an era of instrumentation and things are indeed improving. Just not as fast as I’d like. There’s a lot of good stuff out there to support instrumentation and monitoring including the likes of the ELK (ElasticSearch, Logstash, Kibana) and TIG (Telegraf, InfluxDB, Grafana) stacks as well as their more commercial offerings such as TICK (Telegraf, InfluxDB, Chronograf, Kapacitor), Splunk, DataDog, AppDynamics and others. The problem is, few still really treat instrumentation as a real concern… until it’s too late.

Your customers love this stuff! Really, they do! There’s nothing quite as sexy as an interactive graph showing how your application is performing as the load increases – transactions, visitors, response-times, server utilisation, queue-depths etc. When things are going well it gives everyone a warm fuzzy feeling that all is right with the universe. When things are going wrong it helps to quickly focus you in on where the problem is.

However, this stuff needs to be built into everything we do and not be an afterthought when the pressure’s on to ship it and you can’t afford the time and effort to retrofit it. By then it’s too late.

As architects we need to put in the infrastructure and services needed to support instrumentation, monitoring and alerting. At a minimum this means putting in place standards for logging, data-retention polices, a data collection solution, repository for the data and some tooling to allow us to search that data and visualize what’s going on. Better still we can add alerting when thresholds breach and use richer analytics to allow us to scale up and down to meet demand.

As developers we need to be considering what metrics we want to capture from the components we build as we’re working on them. Am I interested in how long it’s taking for this function call? Do I want to know how many messages a service is handling? How many threads are being spawned? What exceptions are being thrown? Where from? What the queue depths are?.. etc. Almost certainly… YES! And this means putting in place strategies for logging these things. Perhaps you can find the data in existing log files.. Perhaps you need to use better tooling for detailed monitoring… Perhaps you need to write some code yourself to track how things are going…
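A cheap way to start answering “how long is this call taking?” is a decorator that logs the duration of every invocation in a form Logstash or Telegraf could later parse; a sketch:

```python
import functools
import logging
import time

log = logging.getLogger("metrics")

def timed(fn):
    """Log each call's duration as key=value pairs for easy parsing."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.info("call=%s duration_ms=%.2f", fn.__name__, elapsed_ms)
    return wrapper

@timed
def handle_message(msg):
    return msg.upper()
```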

Doing this from the start will enable you to get a much better feel for how things are working before you launch – including a reasonable view of performance and infrastructure demands which will allow you to focus your efforts better later when you do get into sizing and performance testing. It’ll mean you’re not scrambling around looking for log files to help you root-cause issues as your latest release goes into meltdown. And it’ll mean your customer won’t be chewing your ear off asking you what’s going on every five minutes – they’ll be able to see it for themselves…

So please, get it in front of your customer, your product owner, your sponsor, your architects, your developers, your testers and make instrumentation a 1st class citizen in the backlog.


The title of this post is encrypted.

This page is also encrypted, via TLS (aka the new name for SSL).

Anyone sniffing traffic on the wire must first decrypt the TLS traffic and then decrypt the content to work out what the message says.

But why bother with two layers of encryption?

Ok, so forgive the fact that this page is publicly accessible and TLS is decrypted before your eyes. It’s possibly a poor example and in any case I’d like to talk about the server side of this traffic.

In many organisations, TLS is considered sufficient to provide security for data in-transit. The problem is TLS typically terminates on a load-balancer or on a web-server and is forwarded from there to another downstream server. Once this initial decryption takes place data often flows over the internal network of organisations in plain text. Many organisations consider this to be fine practice since the internal network is locked down with firewalls and intrusion detection devices etc. Some organisations even think it’s good practice so that they can monitor internal traffic more easily.

However, there is obvious concern over insider attacks, with sysadmins or disgruntled employees in a good position to skim off the data easily (and clean up any traces after themselves). Additionally, requests are often logged (think access logs and other server logs) and these can record some of the data submitted. Such data-exhaust is often available in volume to internal employees.

It’s possible to re-wrap traffic between each node to avoid network sniffing but this doesn’t help data-exhaust and the constant un-wrap-re-wrap becomes increasingly expensive if not in CPU and IO then in effort to manage all the necessary certificates. Still, if you’re concerned then do this or terminate TLS on the application-server.

But we can add another layer of encryption to programmatically protect sensitive data we’re sending over the wire in addition to TLS. Application components will need to decrypt this for use and when this happens the data will be in plain text in memory but right now that’s about as good as we can get.

The same applies for data at-rest – in fact this is arguably far worse. You can’t rely on full database encryption or file-system encryption. Once the machine is up and running anyone with access to the database or server can easily have full access to the raw data in all its glory. These sort of practices only really protect against devices being lifted out of your data-centre – in which case you’ve got bigger problems…

The safest thing here is to encrypt the attributes you’re concerned about before you store them and decrypt on retrieval. This sort of practice causes all sorts of problems in terms of searching but then should you really be searching passwords or credit card details? PII details; names, addresses etc, are the main issue here and careful thought about what really needs to be searched for; and some constructive data-modelling, may be needed to make this workable. Trivial it is not and compromises abound.

All this encryption creates headaches around certificate and key management, but such is life and this is just another issue we need to deal with. Be paranoid!

p.s. If you really want to know what the title says you can try the password over here.