It’s an older code..

(let’s assume we’re talking about encryption keys here rather than pass codes, though it really makes little difference… and note that your passwords are a slightly different concern)

Is it incompetence to use an old code? No.

For synchronous requests (e.g. those over HTTPS) there’s a handshake process you go through every few minutes to agree a new key. Client and server then continue to use this key until it expires, then agree a new one. If the underlying certificate changes you simply go through the handshake again.

For asynchronous requests things aren’t as easy. You could encrypt and queue a request one minute and change the key the next, but the message remains on the queue for another hour before it gets processed. In these cases you can either reject the message (usually unacceptable) or try the older key and accept it for a while longer.

Equally, with persistent storage you could change the key every month, but you can’t go round decrypting and re-encrypting all historic content and accepting the outage this causes every time. Well, not if you’ve billions of records and an availability SLA measured in nines. So again, you’ve got to let the old codes work…
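
For what it’s worth, this pattern is easy enough to sketch in code. The example below uses the Python cryptography package’s Fernet and MultiFernet (my choice purely for illustration): encrypt with the newest key, let older keys still decrypt, and re-encrypt lazily rather than in one outage-inducing batch.

# A minimal sketch of key rotation with a grace period, using the Python
# "cryptography" package (pip install cryptography). Keys are generated inline
# purely for illustration - in reality they'd live in an HSM or KMS.
from cryptography.fernet import Fernet, MultiFernet

old_key = Fernet(Fernet.generate_key())   # the "older code" still in circulation
new_key = Fernet(Fernet.generate_key())   # the key agreed at the last rotation

# Encrypt with the first (newest) key; decryption falls back through the rest.
keys = MultiFernet([new_key, old_key])

legacy_token = old_key.encrypt(b"message queued before the rotation")
print(keys.decrypt(legacy_token))         # still readable - the old code checks out

# When convenient, re-encrypt under the newest key; once everything has been
# rotated, the old key can finally be retired from the list.
refreshed = keys.rotate(legacy_token)
print(keys.decrypt(refreshed))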

You could use full disk/database encryption but that’s got other issues – like it’s next to useless once the disks are spinning… And besides, when you change the disk password you’re not actually changing the key and re-encrypting the data, you’re just changing the password used to obtain the key.

So it is ok to accept old codes. For a while at least.

An empire spread throughout the galaxy isn’t going to be able to distribute new codes to every stormtrooper instantaneously. Even if they do have the dark-side on their side…

Lost and Fnd

I could spend the next two years tweaking things as much as I have the past two but it’s time to get some feedback on this so here we go…

Software modelling tools are expensive, bad at collaboration and more complicated than the vast majority of solutions demand. They set their sights on an unachievable utopia, demand constant maintenance to avoid being constantly out of date and dictate that we all follow overly prescriptive rules which may be philosophically correct but are practically irrelevant.

Diagramming tools are cheaper, better at collaboration and simpler, but are all too fluid, lack structure and meaning and are far too open to interpretation. The best such tool I use is a whiteboard, a pen and a room – a shame they’re so transient and unstructured.

A middle way is needed.

To this end I introduce Fnd (alpha).

Fnd lets you build catalogs, diagrams and matrices for solutions in a collaborative manner.

For now you can only sign in with Google, and when you do you’ll first be asked to create an account. Once you do, you’ll see the home page, which provides a list of current solutions.

Hit [+] to create a new solution and provide any details you need.

Save this and you should now see a new solution defined on the homepage.

To edit the entry, click on the main icon. To access the solution page, click on the link icon next to the solution name. The solution page provides a list of the catalogs, diagrams and matrices associated with that solution.

From here you can create a structural diagram, and in the diagram view you can set the title, description and tags (a little more on tagging will come later).

The nav bar lets you choose the stencil and colours, and provides actions to save, reload, screenshot, document, delete (item) and delete diagram.

To add something to the diagram, select the catalog item and “new”. This will show a popup allowing you to define the item and add it to the diagram.

Each catalog type has attributes specific to its needs. Choose that which suits best. For example, a component shows as:

When added to the diagram it appears with a stencil relevant to its type. This stencil can be changed by selecting the object in the diagram then the stencil type from the “Shapes” or “UML” drop-down. In the example below there are two components, one shown as a UML component stencil, the other as a database component.

Double-click on the object to edit the settings and make sure you save the diagram – editing an object is immediate; changes to diagrams need to be saved.

Relationships can be added by clicking and dragging on the link icon (green +) from one object to the link icon on another.

Dragging from an actor to a component, for example, results in a link between the two.

Double-clicking on the link itself allows the attributes of the link to be defined. By default every link is a dependency but this can be changed as desired.

… and so on to build up diagrams.

Perhaps more importantly, if you grant another user access to your account (email-address > account > add user) then you can both edit the same objects/diagrams at the same time and will see updates reflected in each other’s browser.

Matrices provide views of relationships in tabular and animated form; for example, the diagram above can be rendered as a relationship table or as an animation.

And catalog lists provide access to all objects of that type.

There’s more to this with multiple diagrams providing different views, the ability to search and add objects from one solution to another, using tags to provide filtered views, generating documentation from diagrams and so on. I can’t promise diagrams are always “pretty” or that models are “correct” but instead aim for a hopefully happy compromise somewhere between the two – enough to be understood!

A word of warning… Fnd is alphaware, I’m looking for feedback, it has bugs – some I know, some I do not. I use it daily in my work and it works for me (mostly). Let me know how you get on – the good, the bad and the ugly – and in turn I’ll try to improve it and make it more stable and functional.

You can access Fnd at https://nfa.cloud/. Feedback to admin [at] nfa.cloud.

p.s. Fnd is short for Foundation and is simply a tip of my hat to one of my favourite authors.

Capabilities and Responsibilities

According to TFD, “capabilities” are:

  1. The quality of being capable; ability.
  2. A talent or ability that has potential for development or use: student of great capabilities.
  3. The capacity to be used, treated, or developed for a specific purpose: nuclear capability.

Whereas “responsibilities” are:

  1. The state, quality, or fact of being responsible.
  2. Something for which one is responsible; a duty, obligation, or burden.

The words “developed for a specific purpose” indicate, at least to some extent, a degree of responsibility. All too often, whether we are responsible for some function or ability is forgotten in the enthusiasm to build it. Blinded by the simple fact that we have the ability or potential, we charge ahead regardless of whether it’s really our responsibility.

It’s not necessarily wrong – if no-one else takes responsibility or you want to compete in some capability then so be it – but in a larger enterprise you could just be in-fighting, duplicating effort or neglecting your true responsibilities. Something start-ups may not suffer from (or end up dying silently from in the cacophony of the market)…

Having the ability to do something does not make it the right thing to do. We talk a lot about capability, we need to talk more about responsibility. After all, what would the world be like if we all chose to exercise our (amateur) nuclear capabilities?

!Attachments, Links

As I’ve said before, email is broken. Unfortunately, like smack, many seem hooked on it. For your methadone, start to wean yourself off by sending links and not attachments.

It’s more accessible and more secure; you can keep the content up to date and ensure everyone sees the current version. It’ll even stop your mail file from bloating…

Just use a wiki like Confluence and try not to attach docs to pages…

Interconnected

In the increasingly interconnected micro-services world we’re creating, the saying “a chain is only as strong as its weakest link” is particularly pertinent.

It’s quite easy for a single service to be dependent upon a number of downstream services as the diagram below shows.

An outage or go-slow in any one of the downstream services can have a knock-on impact upstream and right back to users. Measuring this in SLAs, let’s say B, C, D, E and F each aim for an availability SLA of 99.99%. Assuming they meet this, the best A can achieve is 99.95%. More realistically, B, C, D, E and F are probably dependent on other services themselves, and before you know it end users are doing well to see anything above 99% uptime.
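
The arithmetic is simple enough to check (a quick sketch assuming independent failures and ignoring A’s own downtime):

# Best-case availability of A given five independent downstream dependencies,
# each just meeting a 99.99% SLA (A's own failures ignored for simplicity).
downstream = [0.9999] * 5            # B, C, D, E, F

best_case_for_a = 1.0
for availability in downstream:
    best_case_for_a *= availability

print(f"{best_case_for_a:.4%}")      # ~99.9500%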

So what strategies do we have for dealing with this?

Firstly, you could just live with it. Really, don’t knock this option. Ask “do I really need the availability?” and “does it really matter if it goes down?”. Before we worry about any elaborate plan to deal with the situation, it’s worth considering whether the situation is really all that bad.

Ok, so it is… The next question should be “do I need a response immediately?”. If not, go asynchronous and put a queue between them. If the recipient is down, messages will queue up until it comes back – no problem. Just make sure the queue is as local as possible to the source, and persistent.

If it is a request-response model then consider a queue in any case. A queue can often be set to time out old messages and deal with slow responses (e.g. if no response in 5 seconds then abandon). This can save a large backlog of messages building up that will never be processed – a backlog which causes lock-ups and blocks the consumer for much longer than the downstream service is actually unavailable. And it can often be more efficient to have a queue-based competing-consumer model than multiple connections banging away sporadically.
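
As a rough illustration of the point (the names and timings below are mine, not from any particular broker), a competing consumer that simply drops requests the caller has already given up on:

import queue
import time

REQUEST_TTL_SECONDS = 5.0   # abandon anything the caller will have timed out on

def consume(requests: queue.Queue) -> None:
    # Competing-consumer loop: several of these can share one queue.
    while True:
        enqueued_at, payload = requests.get()
        if time.time() - enqueued_at > REQUEST_TTL_SECONDS:
            continue        # stale request - don't block newer work behind it
        handle(payload)

def handle(payload: str) -> None:
    ...                     # call the downstream service here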

On top of this, ensure you’re using non-blocking libraries and implement circuit-breakers to trip when downstream services go offline. This of course begs the question, “what sort of response do I provide if the circuit-breaker is open?”… Well, that depends…
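
A circuit-breaker itself is simple enough to sketch (a bare-bones illustration rather than a substitute for a proper library such as resilience4j or pybreaker):

import time

class CircuitBreaker:
    # Trip after N consecutive failures, then fail fast until a cool-off
    # period has passed; a single trial call is then allowed through.

    def __init__(self, max_failures=3, reset_after_seconds=30.0):
        self.max_failures = max_failures
        self.reset_after_seconds = reset_after_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after_seconds:
                raise RuntimeError("circuit open - failing fast")
            self.opened_at = None           # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result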

In some cases you can cache previous responses and serve those. If this sort of caching model works then even better: you can decouple the request for content from the call that fetches it from a downstream service, so that you’re in effect always serving from cache. Allowing stale cache entries to be served while revalidating, even when downstream services are unavailable, can significantly improve the responsiveness and availability of the system. Don’t discard cached items just because they’re old; keep using them until a fresh copy can be obtained. Size is a concern, but if you can afford it then cache your content for as long as the RTO demands (the service should be back by then, e.g. 4 hrs) and revalidate as frequently as the business demands the content be fresh (e.g. every 10 minutes).
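
In code, serve-stale-while-revalidate needn’t be much more than this (a sketch with the numbers from the example above; a real implementation would also worry about cache size and refreshing in the background):

import time

CACHE_TTL = 10 * 60          # revalidate every 10 minutes (business freshness)
CACHE_MAX_AGE = 4 * 60 * 60  # keep serving stale entries up to the 4 hr RTO

_cache = {}                  # key -> (fetched_at, value)

def get(key, fetch):
    now = time.time()
    entry = _cache.get(key)
    if entry and now - entry[0] < CACHE_TTL:
        return entry[1]                     # fresh enough: serve from cache
    try:
        value = fetch(key)                  # try to revalidate downstream
        _cache[key] = (now, value)
        return value
    except Exception:
        if entry and now - entry[0] < CACHE_MAX_AGE:
            return entry[1]                 # downstream is down: serve stale
        raise                               # nothing usable: fall back or apologise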

It may sound risky, but this approach can even be used with sensitive data such as user permissions. You’re looking at a coincidence of bad events which is quite unlikely – e.g. a user’s permissions are revoked (the old version is in cache), at the same time as the permissions system goes down, at the same time as the user attempts something they previously could do but should no longer be allowed to… It’s your risk, but what’s worse: one user doing something bad or the whole system being unavailable?

If you can’t or don’t have a cache then can you implement a default or fallback option? Having a blank slot on a page, but a working page otherwise, may be the best of a bad set of options but the best nonetheless.

All else failing, apologise, quickly (see circuit-breaker) and profusely. Let the user know it’s you, not them, that they needn’t worry (e.g. you’ve not charged them and have unpicked any dependent transactions should you have them) and that you’ll be back as soon as you can.

Finally, log everything, monitor and alert. Regardless of the fact that it’s bad to rely on your customers to tell you when you’ve a problem, in many cases the user may not even realise something is amiss – it can easily be overlooked. Ensuring you log and monitor makes it much easier to know when you’ve an issue, as well as allowing faster root-cause analysis.

Queues, circuit-breakers, serve-stale-while-revalidate and logging.

The Matrix

The matrix may well be the most under-appreciated utility in the toolbox of architects.

We produce diagrams, verbose documents and lists-of-stuff till the cows come home, but matrices are an all too rare, almost mythical, beast. Their power though is more real than the healing and purification properties of true Unicorn horns, despite what some may say.

Here’s an example.

The diagram below shows a contrived and simplified matrix of the relationship between user stories and components. In many cases such a matrix may cross hundreds of stories and dozens of components.

[Picture of a matrix from a spreadsheet]

Crucially, we can see for a particular story which components are impacted. This provides much-needed assurance to the architect that we have the coverage required and allows us to easily see where functionality has no current solution – in this case “US4: Audit Logging”.
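
The same check is trivial to automate if the matrix lives somewhere machine-readable (a sketch with invented example data, loosely based on the matrix above):

# Story-to-component matrix as a simple mapping; an empty list means the story
# currently has no component covering it (the "US4: Audit Logging" situation).
matrix = {
    "US1: Register user": ["Access Gateway", "User Management", "Database"],
    "US2: Post article":  ["Access Gateway", "Article Management", "Database"],
    "US3: Comment":       ["Access Gateway", "Article Management", "Database"],
    "US4: Audit Logging": [],
}

uncovered = [story for story, components in matrix.items() if not components]
print("No current solution for:", uncovered)

# The transpose answers the developer's question: which stories does a
# given component touch?
by_component = {}
for story, components in matrix.items():
    for component in components:
        by_component.setdefault(component, []).append(story)
print(by_component["Article Management"])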

Adding some prioritisation (col C) allows us to see if this is going to be an immediate issue or not. In this case the product owner has (foolishly) decided auditing isn’t important…

Developers can use the matrix to see which components need implementation for a story and see what other requirements are impacted by the components they’re about to develop.

Now, it may well be that we’ll proceed and accept any technical debt associated with high-priority requirements to deliver them faster. It may also be that the lower priority requirements never get delivered, so no problem. But it may instead be that the next story in the backlog has some particular nuanced requirement which makes things rather hairy, and is best to consider up-front rather than walk into a pit because we did things another way. It’s a balancing game with pros and cons – the matrix provides visibility to aid the assessment, which all parties can use.

And there’s more (in true infomercial style)… We can also see that the “Access Gateway”, “Article Management” and “Database” components appear to cover many stories. This may be fine if the functionality they provide is consistent across requirements – for example the “Access Gateway” may simply be doing authentication and authorisation consistently – but in other cases it suggests some decomposition and refinement is needed – for example we may wish to consider breaking out “Articles” and “Comments” into two separate components which have more clearly defined responsibilities. Regardless, it helps to see that some components are going to be critical to a lot of requirements and may need more care and attention than others.

So where does this particular matrix come from? We could be accused of the near cardinal sin today of following a waterfall mentality with the need for a big up-front design phase. Not so. It’s more akin to a medical triage.

We have a backlog. We need to review the backlog and sketch out the core components required to support it. We don’t need to dig into each component in great detail – just enough to provide assurances that we have what’s needed for the priority requirements and that the requirements have enough detail to support this (basically some high level grooming). Low priority or simple requirements we may skim over (patient will live (or die)); higher priority or complex ones we assess till we can build the assurances we need (patient needs treatment).

When new requirements arise we can also quickly assess them against the matrix to see where the impact will be.

This is just one of many useful matrices. Story-to-story can help identify requirement dependencies; likewise component-to-component. Mapping logical components to infrastructure helps build a view of the required environment and can, when taken to the physical level, be used for automatic identification of things like firewall rules. You can even connect matrices together to identify which requirements are fulfilled by which servers – e.g. physical-node to logical-node to component to requirement maps – or use them for problem analysis to work out what’s broken – e.g. “this function isn’t working, which components could this relate to?”. Their value of course is only as good as the quality of the data they hold, so such capabilities are often not realised.
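
Chaining those mappings together is just a couple of dictionary hops (hypothetical names, purely for illustration):

# Which requirements does a given server ultimately support?
server_to_components = {"web-01": ["Access Gateway"], "db-01": ["Database"]}
component_to_stories = {
    "Access Gateway": ["US1: Register user", "US2: Post article"],
    "Database": ["US1: Register user", "US2: Post article", "US3: Comment"],
}

def stories_on(server):
    return {story
            for component in server_to_components.get(server, [])
            for story in component_to_stories.get(component, [])}

print(stories_on("db-01"))   # problem on db-01? these are the stories at risk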

Like Unicorns, matrices can be magical. Fortunately for us – and I hate to break this to you – unlike Unicorns, matrices are real (despite what some may say!).

Soft Guarantees

In “The Need for Strategic Security” Martyn Thomas considers some of the risks in the way systems are designed and built today, along with some potential solutions to address the concerns raised. One of the solutions proposed is for software to come with a guarantee, or at least some warranty, around its security.

Firstly, I am (thankfully) not a lawyer, but I can imagine the mind-bogglingly twisted legalese that will be wrapped around such guarantees. So much so as to make them next to useless (bar giving some lawyer the satisfaction of adding another pointless 20 paragraphs of drivel to the already bloated terms and conditions…). However, putting this aside, I would welcome the introduction of such guarantees if it is at all possible.

For many years now we’ve convinced ourselves that it is not possible to write a program which is bug-free. Even the simple program:

echo "Hello World"

has dependencies on libraries and the operating system – along with the millions of lines of code therein – all the way down to the BIOS, which means we cannot be 100% sure even this simple program will always work. We can never be sure it will run correctly for every nanosecond of every hour of every day of every year… forever! It is untestable and absolute certainty is not possible.

At a more practical level, however, we can bound our guarantees and accept some risks: “compatible with RHEL 7.2”, “…will work until year-end 2020…”, “…needs s/w dependencies x, y, z…” and so on. Humm, it’s beginning to sound much like a software licence and system requirements checklist… Yuck! On the whole, we’re pretty poor at providing any assurances over the quality, reliability and security of our products.

Martyn’s point though is that more rigorous methods and tools will enable us to be more certain (perhaps not absolutely) about the security of the software we develop and rely on, allowing us to be more explicit about the guarantees we can offer.

Today we have tools such as SonarQube, which helps to improve the quality of code, or IBM Security AppScan for automated penetration testing. Ensuring such tools are used can help, but they need to be used “right” if used at all. All too often a quick scan is done and only the few top (and typically obvious) issues are addressed. The variation in report output I have seen for scans of the same thing, using the same tools but performed by different testers, is quite ridiculous. A poor workman blames his tools.

Such tools also tend to be run once on release and rarely thereafter. The ecosystem in which our software is used evolves rapidly so continual review is needed to detect issues as and when new vulnerabilities are discovered.

In addition to tools we also need industry standards and certifications to qualify our products and practitioners against. In the security space we do have some standards such as CAPS and certification programmes such as CCP. Unfortunately few products go through the certification process unless they are specifically intended for government use and certified professionals are few and in-demand. Ultimately it comes down to time-and-money.

However, as our software is used in environments never originally intended for it, and as devices become increasingly connected and more critical to our way of life (or our convenience), it will be increasingly important that all software comes with some form of compliance assurance over its security. For that, more accessible standards will be needed. Imagine if in 10 years’ time, when we all have “smart” fridges, some rogue state-sponsored hack manages to cycle them through a cold-warm-cold cycle on Christmas Eve… Would we notice on Christmas Day? Would anyone be around to address such a vulnerability? Roast potatoes and E. coli turkey, anyone? Not such a merry Christmas… (though the alcohol may help kill some of the bugs).

In addition, the software development community today is largely made up of enthusiastic and (sometimes) well-meaning amateurs. Many have no formal qualification or are self-taught. Many are cut’n’pasters who frankly don’t get-IT and just go through the motions. Don’t get me wrong, there are lots of really good developers out there. It’s just there are a lot more cowboys. As a consequence our reliance on security-through-obscurity is deeper than perhaps many would openly admit.

It’s getting better though and the quality of recent graduates I work with today has improved significantly – especially as we seem to have stopped trying to turn arts graduates into software engineers.

Improved and proven tools and standards help but at the heart of the concern is the need for a more rigorous scientific method.

As architects and engineers we need more evidence, transparency and traceability before we can provide the assurances and stamp of quality that a guarantee deserves. Evidence of the stresses components can handle and constraints that contain this. Transparency in design and in test coverage and outcome. Traceability from requirement through design and development into delivery. Boundaries within which we can guarantee the operation of software components.

We may not be able to write bug-free code but we can do it well enough and provide reasonable enough boundaries as to make guarantees workable – but to do so we need a more scientific approach. In the meantime we need to keep those damned lawyers in check and stop them running amok with the drivel they revel in.

Not all encryption is equal

Shit happens, data is stolen (or leaked) and your account details, passwords and bank-account are available online to any criminal who wants it (or at least is prepared to buy it).

But don’t panic, the data was encrypted so you’re ok. Sit back, relax in front of the fire and have another mince pie (or six).

We see this time and again in the press. Claims that the data was encrypted… they did everything they could… blah blah blah. Humm, I think we need more detail.

It’s common practice across many large organisations today to encrypt data using full-disk encryption with tools such as BitLocker or Becrypt. This is good practice and should be encouraged, but it is only the first line of defence as it only really helps when the disk is spun down and the machine powered off. If the machine is running (or even sleeping) then all you need is the user’s password and you’re in. And who today really wants to shut down a laptop when they head home… and perhaps stop for a pint on the way?

In the data centre the practice is less common because the risk of disks being taken out of servers and smuggled out of the building is lower. On top of this, the disks are almost always spinning, so any user/administrator who has access to the server can get access to the data.

So, moving up a level, we can use database encryption tools such as Transparent Data Encryption to encrypt the database files on the server. Great! So now normal OS users can’t access the data and need to go through the data access channel to get it. Problem is, lots of people have access to databases including DBAs who probably shouldn’t be able to see the raw data itself but who generally can. On top of this, service accounts are often used for application components to connect and if these credentials are available to some wayward employee… your data could be pissing out an open window.

To protect against these attack vectors we need to use application-level encryption. This isn’t transparent: developers need to build in data encryption and decryption routines as close to exposed interfaces as practical. Now, having access to the OS, files or database isn’t enough to expose the data. An attacker also needs to get hold of the encryption keys, which should be held on separate infrastructure such as an HSM. All of which costs time and money.
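
To make the distinction concrete, here’s roughly what that looks like at the interface (a sketch using the Python cryptography package’s Fernet; in reality the key would be fetched from an HSM or key-management service, never generated and held alongside the code):

from cryptography.fernet import Fernet

# Illustration only: a real deployment would fetch the key from an HSM/KMS,
# not generate it next to the data it protects.
cipher = Fernet(Fernet.generate_key())

def save_account(db, account_id, sort_code, account_number):
    # Encrypt sensitive fields as close to the exposed interface as practical;
    # what lands in the database (and on disk, and in backups) is ciphertext.
    db[account_id] = {
        "sort_code": cipher.encrypt(sort_code.encode()),
        "account_number": cipher.encrypt(account_number.encode()),
    }

def load_account(db, account_id):
    record = db[account_id]
    return {field: cipher.decrypt(value).decode() for field, value in record.items()}

db = {}
save_account(db, "cust-001", "12-34-56", "87654321")
print(db["cust-001"])                # a DBA or OS user sees only ciphertext
print(load_account(db, "cust-001"))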

Nothing’s perfect and there’s still the risk that a wayward developer siphons off data as it passes through the system, or that some users have too broad access rights and can access data, keys and code. These can be mitigated through secure development practices, change management and good access management… to a degree.

Furthermore, encrypting everything impacts functionality – searching encrypted data becomes practically impossible – or may not be as secure as you’d expect – a little statistical analysis on some fields can expose the underlying data without actually needing to decrypt it due to a lack of sufficient variance in the raw data. Some risks need to be accepted.

We can then start to talk about the sort of ciphers used, how they are used and whether these and the keys are sufficiently strong and protected.

So when we hear in the press that leaked data was encrypted, we need to know more about how it was encrypted before deciding whether we need to change our passwords before tucking into the next mince pie.

Merry Christmas!