Equifarce!

I was/am interested in the whole Equifax hack and how it happened. To this end I posted a brief link yesterday to the Struts team’s response. A simple case of failing to patch! Case closed…

But then I’ve been thinking that’s not really very fair.

This was (apparently) caused by a defect that’s been around for a long time. The developers had reacted very quickly when the problem was identified (within 24 hrs) but Equifax – by all accounts – had failed to patch for a further 6 months.

What did we expect? That they’d patch it the next day? No chance. Within a month? Maybe. But if the issue is embedded in some third-party product then they’re dependent upon a fix being provided, and if it’s in some in-house developed tool then they need to be able to rebuild the app and test it before they can deploy. Struts was/is extremely popular. It was the Spring of its day and is still deeply embedded in all sorts of corporate applications and off-the-shelf products. Fixing everything isn’t going to happen overnight.

Companies like Equifax will also have hundreds, even thousands, of applications, and each application will have dozens of dependencies, any one of which could have suffered a similar issue. On top of this, most of these applications will be minor, non-critical tools which have been around for many years and which, frankly, few will care about. Running a programme to track all of these dependencies, patch applications and test them all before rolling them into production would take an army of developers, testers, sys-ops and administrators working around the clock just to tread water. New features? Forget it. Zero-day? Shuffles shoes… Mind you, it’d be amusing to see how change management would handle this…

So we focus on the priority applications and the low-hanging fruit of patching (OS etc.) and hope that’s good enough? Hmm… anything else we can do?

Well, we’re getting better with build, test and deployment automation but we’re a long way from perfection. So do some of that – it’ll make dealing with change all the easier – but it’s no silver bullet. And again, good luck with change management…

Ultimately though we have to assume we’re not going to have perfect code (there’s no such thing!)… that we’re not able to patch against every vulnerability… and that zero day exploits are a real risk.

Other measures are required regardless of your patching strategy. Reverse proxies, security filters, firewalls, intrusion detection, n-tier architectures, heterogeneous software stacks, encryption, pen-testing etc. Security is like layers of Swiss cheese – no single layer will ever be perfect, you just hope the holes don’t line up when you stack them all together. Add to this some decent monitoring of traffic and an understanding of norms and patterns – at least something which you actually have people looking at continually rather than after the event – and you stand a chance of protecting yourself against such issues, or at least of identifying potential attacks before they become actual breaches.

Equifax may have failed to patch some Struts defect for six months but that’s not the worst of it. That they were vulnerable to such a defect in the first place smells like… well, like they didn’t have enough Swiss cheese. That an employee tool was also accessible online and provided access to customer information with admin/admin credentials goes on to suggest a real lack of competency, and recklessness, at senior levels.

Adding insult to injury, to blame an open-source project (for the wrong defect!) which heroically responded and addressed the real issue within 24 hrs of it being identified six months earlier (!?) makes Equifax look like an irresponsible child, always blaming someone else for its own reckless actions.

They claim to be “an innovative global information solutions company”. So innovative they’re bleeding edge and giving their – no, our! – data away. I’m just not sure who’s the bigger criminal… the hackers or Equifarce!

Equifax Data Breach Due to Failure to Install Patches

“the Equifax data compromise was due to their failure to install the security updates provided in a timely manner.”

Source: MEDIA ALERT: The Apache Software Foundation Confirms Equifax Data Breach Due to Failure to Install Patches Provided for Apache® Struts™ Exploit : The Apache Software Foundation Blog

As simple as that apparently. Keep up to date with patching.

DIY

I should probably have learnt this some time ago…

Quite often we find no-one is willing to do the {insert-task-here}.

I don’t know why. Fear of getting it wrong. Fear of ridicule. Fear of crayons. Whatever. Here’s a tip on how to get things moving when no-one seems willing…

DO IT YOURSELF!

It doesn’t even matter if you do it badly. In fact it’s often better to do it badly on purpose!

You’ll be amazed (or maybe not) at the number of people that come out of the woodwork to provide their own “advice”. All of a sudden you’ll have no end of input. Just be prepared to bite your tongue and take solace in the knowledge that you took one for the greater good.

Someone’s got to get the ball rolling…

It’s an older code..

(let’s assume we’re talking about encryption keys here rather than pass codes though it really makes little difference… and note that your passwords are a slightly different concern)

Is it incompetence to use an old code? No.

For synchronous requests (e.g. those over HTTPS) there’s a handshake process you go through every few minutes to agree a new key. Client and server then continue to use this key until it expires, then they agree a new one. If the underlying certificate changes you simply go through the handshake again.

For asynchronous requests things aren’t as easy. You could encrypt and queue a request one minute and change the key the next but the message remains on the queue for another hour before it gets processed. In these cases you can either reject the message (usually unacceptable) or try the older key and accept that for a while longer.

Equally, with persistent storage you could change the key every month, but you can’t go round decrypting and re-encrypting all historic content – and accepting the outage this causes – every time. Well, not if you’ve got billions of records and an availability SLA of greater than a few percent. So again, you’ve got to let the old codes work…
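
One way to square this is key versioning. A minimal sketch in Java – assuming AES-GCM, and with an illustrative KeyRing class whose names are mine rather than any particular library’s – where every ciphertext records the id of the key that wrote it, new writes use the current key, and old records stay readable:

    // Versioned keys: old records stay readable while new writes use the
    // current key. Illustrative sketch only - not a hardened implementation.
    import javax.crypto.Cipher;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.GCMParameterSpec;
    import java.util.Map;

    class KeyRing {
        private final Map<Integer, SecretKey> keys; // keyId -> key, old and new
        private final int currentId;                // writes always use this one

        KeyRing(Map<Integer, SecretKey> keys, int currentId) {
            this.keys = keys;
            this.currentId = currentId;
        }

        int currentId() { return currentId; }

        // Decrypt with whichever key the record was written under. Rotation
        // then means adding a new current key, not re-encrypting billions of
        // rows in one big bang.
        byte[] decrypt(int keyId, byte[] iv, byte[] ciphertext) throws Exception {
            SecretKey key = keys.get(keyId);
            if (key == null) throw new IllegalStateException("key " + keyId + " retired");
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
            return cipher.doFinal(ciphertext);
        }
    }

Truly old keys can then be retired once everything written under them has been rewritten or expired – lazily, in the background, rather than as a big-bang outage.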

You could use full disk/database encryption but that’s got other issues – like it’s next to useless once the disks are spinning… And besides, when you change the disk password you’re not actually changing the key and re-encrypting the data, you’re just changing the password used to obtain the key.

So it is ok to accept old codes. For a while at least.

An empire spread throughout the galaxy isn’t going to be able to distribute new codes to every stormtrooper instantaneously. Even if they do have the dark-side on their side…

Lost and Fnd

I could spend the next two years tweaking things as much as I have the past two but it’s time to get some feedback on this so here we go…

Software modelling tools are expensive, bad at collaboration and more complicated than the vast majority of solutions demand. They set their sights on an unachievable utopia, demand constant maintenance to avoid being constantly out of date and dictate that we all follow overly prescriptive rules which may be philosophically correct but are practically irrelevant.

Diagramming tools are cheaper, better at collaboration and simpler, but are all too fluid, lack structure and meaning, and are far too open to interpretation. The best such tool I use is a whiteboard, a pen and a room – a shame they’re so transient and unstructured.

A middle way is needed.

To this end I introduce Fnd (alpha).

Fnd lets you build catalogs, diagrams and matrices for solutions in a collaborative manner.

For now you can only sign in with Google, and the first time you do you’ll be asked to create an account. Once that’s done you’ll see the home page, which provides a list of current solutions.

Hit [+] to create a new solution and provide any details you need.

Save this and you should now see a new solution defined on the homepage.

To edit the entry click on the main icon. To open the solution page, click on the link icon next to the solution name. The solution page provides a list of the catalogs, diagrams and matrices associated with that solution.

From here you can create a structural diagram

and in the diagram view you can set the title, description and tags (a little more on tagging later).

The nav bar allows you to choose the stencil and colours, and to save, reload, screenshot, document, delete an item or delete the diagram.

To add something to the diagram select the catalog item and “new”. This will show a popup allowing you to define the item and add it to the diagram.

Each catalog type has attributes specific to its needs. Choose whichever suits best. For example, a component shows as:

When added to the diagram it appears with a stencil relevant to its type. This stencil can be changed by selecting the object in the diagram then the stencil type from the “Shapes” or “UML” drop-down. In the example below there are two components, one shown as a UML component stencil, the other as a database component.

Double click on the object to edit the settings and make sure you save the diagram – editing an object is immediate, changes to diagrams need to be saved.

Relationships can be added by clicking and dragging on the link icon (green +) from one object to the link icon on another.

From actor to component results in:

Double clicking on the link itself allows the attributes of the link to be defined. By default every link is a dependency but this can be changed as desired.

… and so on to build up diagrams.

Perhaps more importantly, if you grant another user access to your account (email-address > account > add user) then you can both edit the same objects/diagrams at the same time and will see updates reflected in each other’s browser.

Matrices provide views of relationships in tabular and animated form. For example the above diagram appears as:

and

And catalog lists provide access to all objects of that type.

There’s more to this with multiple diagrams providing different views, the ability to search and add objects from one solution to another, using tags to provide filtered views, generating documentation from diagrams and so on. I can’t promise diagrams are always “pretty” or that models are “correct” but instead aim for a hopefully happy compromise somewhere between the two – enough to be understood!

A word of warning… Fnd is alphaware. I’m looking for feedback, and it has bugs – some I know about, some I do not. I use it daily in my work and it works for me (mostly). Let me know how you get on – the good, the bad and the ugly – and in turn I’ll try to improve it and make it more stable and functional.

You can access Fnd at https://nfa.cloud/. Feedback to admin [at] nfa.cloud.

p.s. Fnd is short for Foundation and simply a tip of my hat to one of my favourite authors.

Capabilities and Responsibilities

According to TFD, “capabilities” are:

  1. The quality of being capable; ability.
  2. A talent or ability that has potential for development or use: a student of great capabilities.
  3. The capacity to be used, treated, or developed for a specific purpose: nuclear capability.

Whereas “responsibilities” are:

  1. The state, quality, or fact of being responsible.
  2. Something for which one is responsible; a duty, obligation, or burden.

The words “developed for a specific purpose” indicate, at least to some extent, a degree of responsibility. All too often, whether we are responsible for some function or ability is forgotten in the enthusiasm to build it. Blinded by the simple fact that we have the ability or potential, we charge ahead regardless of whether it’s really our responsibility.

It’s not necessarily wrong – if no-one else takes responsibility, or you want to compete in some capability, then so be it – but in a larger enterprise you could just be in-fighting, duplicating effort or neglecting your true responsibilities. Something start-ups may not suffer from (or end up dying silently in the cacophony of the market)…

Having the ability to do something does not make it the right thing to do. We talk a lot about capability, we need to talk more about responsibility. After all, what would the world be like if we all chose to exercise our (amateur) nuclear capabilities?

!Attachments, Links

As I’ve said before, email is broken. Unfortunately, like smack, many seem hooked on it. For your methadone, start to wean yourself off by sending links and not attachments.

It’s more accessible, more secure, and you can keep the content up-to-date and ensure everyone sees the current version. It’ll even stop your mail file from bloating…

Just use a wiki like Confluence and try not to attach docs to pages…

Interconnected

In the increasingly interconnected micro-services world we’re creating, the saying “a chain is only as strong as its weakest link” is particularly pertinent.

It’s quite easy for a single service to be dependent upon a number of downstream services as the diagram below shows.

An outage or go-slow in any one of the downstream services can have a knock-on impact upstream and right back to users. Measuring this in SLAs, let’s say each of B, C, D, E and F aims for an availability SLA of 99.99%. Assuming they meet this, the best A can achieve is 0.9999⁵ ≈ 99.95%. More realistically, B, C, D, E and F are probably dependent on other services themselves, and before you know it end users are doing well to see anything above 99% uptime.
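
The arithmetic, for the avoidance of doubt – a toy calculation which assumes failures are independent and that A has a hard dependency on all five:

    // Compound availability: A is only up when all five dependencies are up.
    public class SlaMath {
        public static void main(String[] args) {
            double dep = 0.9999;           // each of B..F at 99.99%
            double a = Math.pow(dep, 5);   // A needs all five up at once
            System.out.printf("best case for A: %.4f%%%n", a * 100); // ~99.95%
        }
    }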

So what strategies do we have for dealing with this?

Firstly, you could just live with it. Really, don’t knock this option. Question “do I really need the availability?”, “does it really matter if it goes down?”. Before we worry about any elaborate plan to deal with the situation it’s worth considering whether the situation is really all that bad.

Ok, so it is… The next question should be “do I need a response immediately?”. If not, go asynchronous and put a queue between them. If the recipient is down, messages will queue up until it comes back – no problem. Just make sure the queue is as local as possible to the source, and persistent.

If it is a request-response model then consider a queue in any case. A queue can often be set to time out old messages and deal with slow responses (e.g. if no response in 5 seconds then abandon). This can save having very many messages sat in a backlog waiting to be processed – a backlog which causes lock-ups for requests that will never be serviced and blocks the consumer for much longer than the downstream service is actually unavailable. And it can often be more efficient to have a queue-based competing-consumers model than multiple connections banging away sporadically.
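
A minimal sketch of that expiry idea in Java, using an in-memory queue as a stand-in for a real broker (most brokers can do this for you with a per-message TTL, which is the better option where available):

    // Consumer-side expiry: abandon requests the caller has already given up on.
    import java.util.concurrent.BlockingQueue;

    class ExpiringConsumer {
        record Request(String payload, long enqueuedAtMillis) {}

        static final long TTL_MILLIS = 5_000; // abandon requests older than 5s

        static void consume(BlockingQueue<Request> queue) throws InterruptedException {
            while (true) {
                Request req = queue.take();
                long age = System.currentTimeMillis() - req.enqueuedAtMillis();
                if (age > TTL_MILLIS) {
                    continue; // caller timed out long ago; don't waste cycles on it
                }
                process(req);
            }
        }

        static void process(Request req) { /* real work here */ }
    }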

On top of this, ensure you’re using non-blocking libraries and implement circuit-breakers to trip when downstream services go offline. This of course begs the question, “what sort of response do I provide if the circuit-breaker is open?”… Well, that depends…
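
For the record, the shape of a circuit-breaker – a bare-bones sketch rather than production code (there you’d reach for a library such as Resilience4j or Hystrix):

    // After N consecutive failures the breaker opens and calls fail fast for a
    // cool-off period; then a single trial call is let through to probe recovery.
    import java.util.function.Supplier;

    class CircuitBreaker<T> {
        private final int failureThreshold;
        private final long coolOffMillis;
        private int consecutiveFailures = 0;
        private long openedAtMillis = 0;

        CircuitBreaker(int failureThreshold, long coolOffMillis) {
            this.failureThreshold = failureThreshold;
            this.coolOffMillis = coolOffMillis;
        }

        synchronized T call(Supplier<T> downstream, Supplier<T> fallback) {
            boolean open = consecutiveFailures >= failureThreshold;
            if (open && System.currentTimeMillis() - openedAtMillis < coolOffMillis) {
                return fallback.get();       // open: fail fast, don't call downstream
            }
            try {
                T result = downstream.get(); // closed, or a half-open trial call
                consecutiveFailures = 0;
                return result;
            } catch (RuntimeException e) {
                if (++consecutiveFailures >= failureThreshold) {
                    openedAtMillis = System.currentTimeMillis();
                }
                return fallback.get();
            }
        }
    }

The interesting part is what that fallback supplier actually returns…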

In some cases you can cache previous responses and serve those. If this sort of caching model works then even better: you can decouple the request for content from fetching it from the downstream service, so that you’re in effect always serving from cache. Allowing stale cache entries to be served whilst revalidating, even when downstream services are unavailable, can significantly improve the responsiveness and availability of the system. Don’t discard cached items just because they’re old. Keep using them until a fresh copy can be obtained. Size is a concern, but if you can afford it then cache your content for as long as the RTO demands (the service should be back by then, e.g. 4 hrs) and revalidate as frequently as the business demands the content be fresh (e.g. every 10 minutes).
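
A sketch of the serve-stale-while-revalidate idea – illustrative names, in-memory only; a real implementation would bound the cache size and cap entry age at the RTO as above:

    // Always answer from cache; refresh entries in the background. If the
    // downstream service is down the refresh fails quietly and the stale
    // entry keeps being served.
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.function.Function;

    class StaleWhileRevalidateCache<K, V> {
        private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        // loader must return a non-null value or throw
        StaleWhileRevalidateCache(Function<K, V> loader, long refreshSeconds) {
            scheduler.scheduleAtFixedRate(() -> cache.replaceAll((key, stale) -> {
                try {
                    return loader.apply(key);  // fresh copy obtained
                } catch (RuntimeException e) {
                    return stale;              // downstream down: keep the old value
                }
            }), refreshSeconds, refreshSeconds, TimeUnit.SECONDS);
        }

        V get(K key) { return cache.get(key); }   // may be stale, never blocks
        void put(K key, V value) { cache.put(key, value); }
    }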

It may sound risky, but this approach can even be used with sensitive data such as user permissions. You’re looking at a coincidence of bad events which is quite unlikely – e.g. a user’s permissions are revoked (the old version is in cache) at the same time as the permissions system goes down, at the same time as the user attempts something they previously could do but should no longer be allowed to. It’s your risk, but what’s worse… one user doing something bad or the whole system being unavailable?

If you can’t cache, or don’t have a cache, then can you implement a default or fallback option? Having a blank slot on a page, but a working page otherwise, may be the best of a bad set of options – but the best nonetheless.

All else failing, apologise quickly (see circuit-breaker) and profusely. Let the user know it’s you, not them; that they needn’t worry (e.g. you’ve not charged them, and have unpicked any dependent transactions should you have them); and that you’ll be back as soon as you can.

Finally, log everything, monitor and alert. Regardless of the fact that it’s bad to rely on your customers to tell you when you’ve a problem, in many cases the user may not even realise something is amiss – it can easily be overlooked. Ensuring you log and monitor makes it much easier to know when you’ve an issue, as well as allowing faster root-cause analysis.

Queues, circuit-breakers, serve-stale-while-revalidate and logging.