I should probably have learnt this some time ago…

Quite often we find no-one is willing to do the {insert-task-here}.

I don’t know why. Fear of getting it wrong. Fear of ridicule. Fear of crayons. Whatever. Here’s a tip on how to get things moving when no-one seems willing: make a start yourself.


It doesn’t even matter if you do it badly. In fact it’s often better to do it badly on purpose!

You’ll be amazed (or maybe not) at the number of people that come out of the woodwork to provide their own “advice”. All of a sudden you’ll have no end of input. Just be prepared to bite your tongue and take solace in the knowledge that you took one for the greater good.

Someone’s got to get the ball rolling…

Lost and Fnd

I could spend the next two years tweaking things as much as I have the past two, but it’s time to get some feedback on this, so here we go…

Software modelling tools are expensive, bad at collaboration and more complicated than the vast majority of solutions demand. They set their sights on an unachievable utopia, demand constant maintenance to avoid being constantly out of date and dictate that we all follow overly prescriptive rules which may be philosophically correct but are practically irrelevant.

Diagramming tools are cheaper, better at collaboration and simpler, but are all too fluid, lack structure and meaning and are far too open to interpretation. The best such tool I use is a whiteboard, a pen and a room – a shame that these are so transient and unstructured.

A middle way is needed.

To this end I introduce Fnd (alpha).

Fnd lets you build catalogs, diagrams and matrices for solutions in a collaborative manner.

For now you can only sign in with Google, and when you do you’ll first be asked to create an account. Once you have, you’ll see the home page, which provides a list of current solutions.

Hit [+] to create a new solution and provide any details you need.

Save this and you should now see a new solution defined on the homepage.

To edit the entry, click on the main icon. To access the solution page, click on the link icon next to the solution name. The solution page provides a list of the catalogs, diagrams and matrices associated with that solution.

From here you can create a structural diagram

and in the diagram view you can set the title, description and tags (a little more on tagging will come later).

The nav bar lets you choose the stencil and colours, and save, reload, screenshot, document, delete an item or delete the whole diagram.

To add something to the diagram, select the catalog item and “new”. This will show a popup allowing you to define it and add it to the diagram.

Each catalog type has attributes specific to its needs. Choose whichever suits best. For example, a component shows as:

When added to the diagram it appears with a stencil relevant to its type. This stencil can be changed by selecting the object in the diagram then the stencil type from the “Shapes” or “UML” drop-down. In the example below there are two components, one shown as a UML component stencil, the other as a database component.

Double click on the object to edit the settings and make sure you save the diagram – editing an object is immediate, changes to diagrams need to be saved.

Relationships can be added by clicking and dragging on the link icon (green +) from one object to the link icon on another.

Dragging from an actor to a component results in:

Double clicking on the link itself allows the attributes of the link to be defined. By default every link is a dependency but this can be changed as desired.

… and so on to build up diagrams.

Perhaps more importantly, if you grant another user access to your account (email-address > account > add user) then you can both edit the same objects/diagrams at the same time and will see updates reflected in each other’s browser.

Matrices provide views of relationships in tabular and animated form. For example the above diagram appears as:


And catalog lists provide access to all objects of that type.

There’s more to this with multiple diagrams providing different views, the ability to search and add objects from one solution to another, using tags to provide filtered views, generating documentation from diagrams and so on. I can’t promise diagrams are always “pretty” or that models are “correct” but instead aim for a hopefully happy compromise somewhere between the two – enough to be understood!

A word of warning… Fnd is alphaware, I’m looking for feedback, it has bugs – some I know, some I do not. I use it daily in my work and it works for me (mostly). Let me know how you get on – the good, the bad and the ugly – and in turn I’ll try to improve it and make it more stable and functional.

You can access Fnd at Feedback to admin [at]

p.s. Fnd is short for Foundation and simply a tip of my hat to one of my favourite authors


In the increasingly interconnected micro-services world we’re creating, the saying “a chain is only as strong as its weakest link” is particularly pertinent.

It’s quite easy for a single service to be dependent upon a number of downstream services as the diagram below shows.

An outage or go-slow in any one of the downstream services can have a knock-on impact upstream and right back to users. Measuring this in SLAs, let’s say each of B, C, D, E and F aims for an availability SLA of 99.99%. Assuming they meet this, the best A can achieve is 99.95%. More realistically, B, C, D, E and F are probably dependent on other services, and before you know it end users are doing well to see anything above 99% uptime.
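As a sketch of the arithmetic, assuming failures are independent, the composite availability is simply the product of the individual availabilities:

```python
# Composite availability of a service that depends on N downstream
# services, assuming their failures are independent of one another.
def composite_availability(*availabilities: float) -> float:
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# Five downstream services (B..F), each meeting a 99.99% SLA,
# cap the best A can achieve at roughly 99.95%.
best_case_for_a = composite_availability(*([0.9999] * 5))
print(f"{best_case_for_a:.4%}")
```

Add another layer of dependencies behind each of those and the product shrinks quickly towards that sub-99% figure.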

So what strategies do we have for dealing with this?

Firstly, you could just live with it. Really, don’t knock this option. Question “do I really need the availability?”, “does it really matter if it goes down?”. Before we worry about any elaborate plan to deal with the situation it’s worth considering if the situation is really all that bad.

Ok, so it is… The next question should be “do I need a response immediately?”. If not, go asynchronous and put a queue between them. If the recipient is down messages will queue up until they come back – no problem. Just make sure the queue is as local as possible to the source and persistent.

If it is a request-response model then consider a queue in any case. A queue can often be set to time out old messages and deal with slow responses (e.g. if no response in 5 seconds then abandon). This avoids a large backlog of messages waiting to be processed – messages which can cause lock-ups for requests that will never be answered and block the consumer for far longer than the downstream service is unavailable. And it can often be more efficient to have a queue-based competing-consumer model than multiple connections banging away sporadically.
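A minimal sketch of that timeout idea (the names and the 5-second threshold are illustrative; real brokers offer message TTLs natively): the consumer simply discards anything that has sat on the queue longer than a caller would wait for an answer.

```python
import queue
import time

STALE_AFTER = 5.0  # seconds: abandon requests older than this

def drain(q: "queue.Queue") -> list:
    """Drain the queue, discarding messages too old to be worth answering."""
    handled = []
    while True:
        try:
            enqueued_at, payload = q.get_nowait()
        except queue.Empty:
            break
        if time.monotonic() - enqueued_at > STALE_AFTER:
            continue  # the requester has long since given up; drop it
        handled.append(payload)
    return handled
```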

On top of this, ensure you’re using non-blocking libraries and implement circuit-breakers to trip when downstream services go offline. This of course begs the question, “what sort of response do I provide if the circuit-breaker is open?”… Well, that depends…
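A circuit breaker can be as little as counting consecutive failures and failing fast for a cool-off period. This is a deliberately minimal sketch (thresholds are made up, and production libraries do far more – half-open probing, metrics, etc.):

```python
import time

class CircuitBreaker:
    """Trip after `max_failures` consecutive failures, reject calls
    while open, and allow a trial call after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success resets the count
        return result
```

The key property is that once open, the breaker answers immediately rather than tying up threads waiting on a dead downstream service.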

In some cases you can cache previous responses and serve those. If this sort of caching model works then even better: you can decouple the request for content from fetching it from a downstream service so that you’re in effect always serving from cache. Allowing stale cache entries to be served whilst revalidating, even when downstream services are unavailable, can significantly improve the responsiveness and availability of the system. Don’t discard cached items just because they’re old. Keep using them until a fresh copy can be obtained. Size is a concern but if you can afford it then cache your content for as long as the RTO demands (the service should be back by then, e.g. 4hrs) and revalidate as frequently as the business demands the content be fresh (e.g. every 10 minutes).
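A sketch of the idea (revalidation is done inline here for brevity, where a real implementation would refresh in the background; the timings mirror the 10-minute / 4-hour example above):

```python
import time

class StaleWhileRevalidateCache:
    """Serve fresh entries from cache; revalidate entries older than
    `fresh_for`, but keep serving the stale copy if the downstream
    fetch fails, until the `evict_after` (RTO) window expires."""

    def __init__(self, fetch, fresh_for: float = 600, evict_after: float = 4 * 3600):
        self.fetch = fetch              # downstream call; may raise
        self.fresh_for = fresh_for      # e.g. 10 minutes
        self.evict_after = evict_after  # e.g. 4 hours (the RTO)
        self.entries = {}               # key -> (fetched_at, value)

    def get(self, key):
        now = time.monotonic()
        entry = self.entries.get(key)
        if entry and now - entry[0] <= self.fresh_for:
            return entry[1]                      # fresh hit
        try:
            value = self.fetch(key)              # revalidate
            self.entries[key] = (now, value)
            return value
        except Exception:
            if entry and now - entry[0] <= self.evict_after:
                return entry[1]                  # serve stale: better than nothing
            raise                                # nothing usable left
```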

It may sound risky, but this approach can even be used with sensitive data such as user-permissions. You’re looking at a coincidence of bad events which is quite unlikely – e.g. a user’s permissions are revoked (the old version is in cache), at the same time as the permissions system goes down, at the same time as the user attempts something they previously could do but should no longer be allowed to. It’s your risk, but what’s worse: one user doing something bad or the whole system being unavailable?

If you can’t or don’t have a cache then can you implement a default or fallback option? Having a blank slot on a page, but a working page otherwise, may be the best of a bad set of options but the best nonetheless.

All else failing, apologise, quickly (see circuit-breaker) and profusely. Let the user know it’s you, not them, that they needn’t worry (e.g. you’ve not charged them and have unpicked any dependent transactions should you have them) and that you’ll be back as soon as you can.

Finally, log everything, monitor and alert. Regardless of the fact that it’s bad to rely on your customers to tell you when you’ve a problem, in many cases the user may not even realise something is amiss. It can easily be overlooked. Ensuring you log and monitor makes it much easier to know when you’ve an issue, as well as allowing faster root-cause analysis.

Queues, circuit-breakers, serve-stale-while-revalidate and logging.

The Matrix

The matrix may well be the most under-appreciated utility in the toolbox of architects.

We produce diagrams, verbose documents and lists-of-stuff till the cows come home, but matrices are an all too rare, almost mythical, beast. Their power though is more real than the healing and purification properties of true Unicorn horns, despite what some may say.

Here’s an example.

The diagram below shows a contrived and simplified matrix of the relationship between user stories and components. In many cases such a matrix may cross hundreds of stories and dozens of components.

Picture of a matrix from a spreadsheet

Crucially we can see for a particular story which components are impacted. This provides much needed assurance to the architect that we have the needed coverage and allows us to easily see where functionality has no current solution. In this case “US4: Audit Logging”.

Adding some prioritisation (col C) allows us to see if this is going to be an immediate issue or not. In this case the product owner has (foolishly) decided auditing isn’t important…

Developers can use the matrix to see which components need implementation for a story and see what other requirements are impacted by the components they’re about to develop.
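The same checks the matrix supports by eye can be scripted. A sketch, with story and component names invented to echo the example above:

```python
from collections import Counter

# Hypothetical story-to-component matrix, as a mapping of sets.
coverage = {
    "US1: Read article":  {"Access Gateway", "Article Management", "Database"},
    "US2: Post comment":  {"Access Gateway", "Article Management", "Database"},
    "US3: Search":        {"Access Gateway", "Search", "Database"},
    "US4: Audit Logging": set(),  # functionality with no current solution
}

# Stories with no covering component...
uncovered = [story for story, comps in coverage.items() if not comps]

# ...and components spanning many stories: candidates for extra care,
# or for decomposition into clearer responsibilities.
usage = Counter(c for comps in coverage.values() for c in comps)
```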

Now, it may well be that we’ll proceed and accept any technical debt associated with high-priority requirements to deliver them faster. It may also be that the lower priority requirements never get delivered, so no problem. But it may instead be that the next story in the backlog has some particular nuanced requirement which makes things rather hairy, and is best considered up-front rather than walking into a pit by doing things another way. It’s a balancing game with pros and cons – the matrix provides visibility to aid the assessment which all parties can use.

And there’s more (in true infomercial style)… We can also see that the “Access Gateway”, “Article Management” and “Database” components appear to cover many stories. This may be fine if the functionality they provide is consistent across requirements – for example the “Access Gateway” may simply be doing authentication and authorisation consistently – but in other cases it suggests some decomposition and refinement is needed – for example we may wish to consider breaking out “Articles” and “Comments” into two separate components which have more clearly defined responsibilities. Regardless, it helps to see that some components are going to be critical to a lot of requirements and may need more care and attention than others.

So where does this particular matrix come from? We could be accused of the near cardinal sin today of following a waterfall mentality with the need for a big up-front design phase. Not so. It’s more akin to a medical triage.

We have a backlog. We need to review the backlog and sketch out the core components required to support this. We don’t need to dig into each component in great detail – just enough to provide assurances that we have what’s needed for the priority requirements and that the requirements have enough detail to support this (basically some high level grooming). Low priority or simple requirements we may skim over (patient will live (or die)), higher priority or complex ones we assess till we can build the assurances we need (patient needs treatment).

When new requirements arise we can also quickly assess them against the matrix to see where the impact will be.

This is just one of many useful matrices. Story-to-story can help identify requirement dependencies. Likewise for component-to-component. Mappings from logical components to infrastructure help build a view of the required environment and can, when taken to the physical level, be used for automatic identification of things like firewall rules. You can even connect matrices together to allow identification of which requirements are fulfilled by which servers – e.g. physical-node to logical-node to component to requirement maps – or use them for problem analysis to work out what’s broken – e.g. “this function isn’t working, which components could this relate to?”. Their value of course is only as good as the quality of the data they hold, so such capabilities are often not realised.
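Connecting matrices is just composing relations. A sketch with invented names, mapping requirements through components and logical nodes down to physical servers:

```python
# Hypothetical matrices expressed as dict-of-sets relations.
req_to_comp = {"US1": {"Gateway", "Articles"}, "US3": {"Gateway", "Search"}}
comp_to_node = {"Gateway": {"web"}, "Articles": {"app"}, "Search": {"app"}}
node_to_server = {"web": {"srv01"}, "app": {"srv02", "srv03"}}

def compose(rel_ab: dict, rel_bc: dict) -> dict:
    """Compose two relations: a -> {b} and b -> {c} gives a -> {c}."""
    return {a: {c for b in bs for c in rel_bc.get(b, set())}
            for a, bs in rel_ab.items()}

# Which physical servers fulfil which requirements?
req_to_server = compose(compose(req_to_comp, comp_to_node), node_to_server)
```

Inverting the same relations answers the problem-analysis question in the other direction – “this server is down, which requirements are affected?”.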

Like Unicorns, matrices can be magical. Fortunately for us, and I hate to break this to you, unlike Unicorns, matrices are real (despite what some may say!).

Scaling the Turd

It has been quite some time since my last post… Mainly because I’ve spent an inordinate amount of time trying to get an application scaling and performing as needed. We’ve done it, but I’m not happy.

Not happy, in part because of the time it’s taken, but mainly because the solution is totally unsatisfactory. It’s an off the shelf (COTS) package so making changes beyond a few “customisations” is out of the question, and the supplier has been unwilling to accept the problem is within the product and instead points to our “environment” and “customisations” – of which, IMHO, neither are particularly odd.

At root there are really two problems.

One – a single JVM can only handle 10 tps (transaction/page requests per second) according to the supplier. Our requirement is around 50.

Two – performance degrades over time to unacceptable levels if a JVM is stressed hard. So 10 tps realistically becomes more like 1 to 2 tps after a couple of days of soak testing.

So we’ve done a lot of testing – stupid amounts of testing! Over and over, tweaking this and that, changing connection and thread pools, JVM settings, memory allocations etc. with pretty much no luck. We’ve checked the web-servers, the database, the queue configuration (itself an abomination of a setup), the CPU is idle, memory is plentiful, garbage-collection working a treat, disk IO is non-existent, network-IO measured in the Kb/sec. Nada! Except silence from the supplier…

And then we’ve taken thread dumps and can see stuck threads and lock contention so we know roughly where the problem lies, passed this to the supplier, but still, silence…

Well, not quite silence. They finally offered that “other customers don’t have these issues” and “other customers typically run 20+ JVMs”! Excuse me? 20 JVMs is typical..? wtf!? So really they’re admitting that the application doesn’t scale within a JVM. That it cannot make use of resources efficiently within a JVM and that the only way to make it work is to throw more JVMs at it. Sounds to me like a locking issue in the application – one that no doubt gets worse as the application is stressed. Well at least we have a fix…

This means that we’ve ended up with 30 JVMs across 10 servers (VMs) for one component to handle a pathetic 50tps! – something I would expect 2 or 3 servers to handle quite easily given the nature of the application (the content delivery aspect of a content management system). And the same problem pervades the applications other components so we end up with 50 servers (all VMs bar a physical DB cluster) for an application handling 50 tps… This is not efficient or cost effective.

There are also many other issues with the application including such idiocies as synchronous queueing, a total lack of cache headers (resulting in a stupidly high hit-rate for static resources) and really badly implemented Lucene indexing (closing and opening indexes continually). It is, by some margin, the worst COTS application I have had the misfortune to come across (I’ll admit I’ve seen worse home-grown ones, so I’m not sure where that leaves us in the buy-v-build argument…).

So what’s wrong with having so many JVMs?

Well, cost for a start. Even though we can cram more JVMs onto fewer VMs we need to step this up in chunks of RAM required per JVM (around 4GB). So, whilst I’m not concerned about CPU, a 20GB 4vCPU host can really only support 4 JVMs (some space is needed for OS and other process overheads). Lots of tin, doing nothing.

But the real issue is maintenance. How the hell do you manage that many JVMs and VMs efficiently? You can use clustering in the application-server, oh, except that this isn’t supported by the supplier (like I said, the worst application ever!). So we’ve now got monitors and scripts for each JVM and each VM and when something breaks (… and guess what, with this pile of sh*t, it does) we need to go round each node fixing them one-by-one.

Anyway, lessons learned, how should we have scaled such an application? What would I do differently now that I know? (bar using some completely different product of course)

Firstly I would have combined components together where we can. There’s no reason why some WARs couldn’t be homed together (despite the supplier’s design suggesting otherwise). This would help reduce some of the JVMs and improve the reliability of some components (that queueing mechanism specifically).

Secondly, given we can’t use a real cluster in the app-server, we can (now) use containers to package up each component of the application instead. This then becomes our scaling and maintenance point, and rather than having 50 servers to manage we have 7 or 8 images to maintain (still a lot for such an application). This then allows us to scale up or down at the container level more quickly. The whole application wouldn’t fit this model (the DB in particular would remain as it is) but most of it would.

Of course it doesn’t solve the root cause unfortunately but it is a more elegant, maintainable and cheaper solution and, bar eradicating this appalling product from the estate, one that would have been so much more satisfying.

So that’s the project for the summer… Work out how to containerise this sort of COTS application, and how to connect, route and scale the containers in a way that is manageable, efficient and cost effective. Next project please!



We can have a small server…

Screen Shot 2016-02-13 at 11.43.20

…a big server (aka vertical scaling)…

Screen Shot 2016-02-13 at 11.43.27

.. a cluster of servers (aka horizontal scaling)…

Screen Shot 2016-02-13 at 11.48.34

.. or even a compute grid (horizontal scaling on steroids).

Screen Shot 2016-02-13 at 11.43.41

For resiliency we can have active-passive…

Screen Shot 2016-02-13 at 11.52.46

… or active-active…

Screen Shot 2016-02-13 at 11.52.51

… or replication in a cluster or grid…

Screen Shot 2016-02-13 at 11.59.01

…each with their own connectivity, load-balancing and routing concerns.

From a logical perspective we could have a simple client-server setup…

Screen Shot 2016-02-13 at 13.03.29

…a two tier architecture…

Screen Shot 2016-02-13 at 13.03.35

…an n-tier architecture…

Screen Shot 2016-02-13 at 13.03.40

…a service oriented (micro- or ESB) architecture…

Screen Shot 2016-02-13 at 13.03.44

…and so on.

And in each environment we can have different physical topologies depending on the environmental needs with logical nodes mapped to each environments servers…

Screen Shot 2016-02-13 at 13.04.01

With our functional components deployed on our logical infrastructure using a myriad of other deployment topologies..

Screen Shot 2016-02-13 at 13.04.21

… or …

Screen Shot 2016-02-13 at 13.04.37

… and on and on and on…

And this functional perspective can be implemented using dozens of design patterns and a plethora of integration patterns.

Screen Shot 2016-02-13 at 12.08.46

With each component implemented using whichever products and packages we choose to be responsible for supporting one or more requirements and capabilities…

Screen Shot 2016-02-13 at 13.20.31

So the infrastructure we rely on, the products we select, the components we build or buy; the patterns we adopt and use… all exist for nothing but the underlying requirement.

We should therefore be able to trace from requirement through the design all the way to the tin on the floor.

And if we can do that we can answer lots of interesting questions such as “what happens if I turn this box off?”, “what’s impacted if I change this requirement?” or even “which requirements are driving costs?”. Which in turn can help improve supportability, maintainability and availability and reduce costs. You may even find your product sponsor questioning if they really need this or that feature…

The Irresponsible Architect

I was once told that we, as architects, should always do “the right thing”! What is “right” is of course debatable and since you can take many different viewpoints – financial (the cheapest), time (the quickest), security (the most secure), etc. – you can often have the argument with yourself quite successfully without worrying about other people’s opinions.

But as architects it is often up to us to weigh the balance of concerns and seek a compromise which provides a reasonable solution from all viewpoints. Harder than it sounds, but that is what I choose to interpret as “the right thing”.

There may still be many competing solution options which appear viable but some of those are false. Demons placed in front of you to tempt you into taking that first – and fatal – bite of the apple… The reason? Incorrectly ascribed responsibilities.

It is often easy to say “I need to do a lookup here” or “I need to parse this bit of data there” and it’s often quick (and cheap) to JFDI and stick some dirty piece of code in where it shouldn’t be. The code becomes more complicated than it needs to be as responsibilities become distributed throughout, and maintenance becomes harder and more costly as the days and years tick by. Finally you’ll want to pour petrol over the machine and let the damn thing burn.

The same applies from an infrastructure perspective. Using a database server as a file-server because, well, it’s accessible, or using your backup procedures as an archive because they’re kind of similar(!?), is wrong. Tomorrow someone will move the database and it’ll break, or your requirements for archiving will change and your backup solution will no longer be appropriate. Burn baby burn…

So before we sign off on the solution we have to ask “are the responsibilities of each component (logical and physical) well defined and reasonable?”, “are the dependencies and relationships to other components a natural (necessary) consequence of those responsibilities?” and “if I were to rip this component out and replace it with something else… would I rather immolate myself?”. If you answer “no” to either of the first two, or “yes” to the last, then it’s probably not “right”. Probably… not always. Sometimes you’ve just got to JFDI, sometimes you don’t care about tomorrow (throwaway code, temporary tin etc.) and sometimes, just sometimes, you’ll be wrong when you thought you were right (we’re all fallible… right?). Once you have a clear view of the components and their responsibilities, then you can worry about the precise implementation details…

And finally, if a higher authority overrules you then so long as you’ve explained the rationale, issues and implications clearly, it’s not your fault and you can sleep (or try to) with a clear conscience. Hey, you could be wrong!

So, as a big fan of keeping lists, for each component we need to define its:

  • Responsibilities
  • Rationale
  • Issues and implications
  • Dependencies and relationships to other components
  • Implementation