nonfunctionalarchitect.com

2016/11/12

Instrumentation as a 1st Class Citizen

I wrote previously that we are moving into an era of instrumentation and things are indeed improving. Just not as fast as I'd like. There's a lot of good stuff out there to support instrumentation and monitoring including the likes of the ELK (ElasticSearch, Logstash, Kibana) and TIG (Telegraf, InfluxDB, Grafana) stacks as well as their more commercial offerings such as TICK (Telegraf, InfluxDB, Chronograf, Kapacitor), Splunk, DataDog, AppDynamics and others. The problem is, few still really treat instrumentation as a real concern... until it's too late.

Your customers love this stuff! Really, they do! There's nothing quite as sexy as an interactive graph showing how your application is performing as the load increases - transactions, visitors, response-times, server utilisation, queue-depths etc. When things are going well it gives everyone a warm fuzzy feeling that all is right with the universe. When things are going wrong it helps to quickly focus you in on where the problem is.

However, this stuff needs to be built into everything we do and not be an afterthought when the pressure's on to ship it and you can't afford the time and effort to retrofit it. By then it's too late.

As architects we need to put in the infrastructure and services needed to support instrumentation, monitoring and alerting. At a minimum this means putting in place standards for logging, data-retention policies, a data collection solution, a repository for the data and some tooling to allow us to search that data and visualize what's going on. Better still we can add alerting when thresholds are breached and use richer analytics to allow us to scale up and down to meet demand.

As developers we need to be considering what metrics we want to capture from the components we build as we're working on them. Am I interested in how long it's taking for this function call? Do I want to know how many messages a service is handling? How many threads are being spawned? What exceptions are being thrown? Where from? What the queue depths are?.. etc. Almost certainly... YES! And this means putting in place strategies for logging these things. Perhaps you can find the data in existing log files.. Perhaps you need to use better tooling for detailed monitoring... Perhaps you need to write some code yourself to track how things are going...
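As a rough illustration of the "write some code yourself" option, here is a minimal Python sketch that times a function call and records whether it threw. The logger name `metrics`, the `key=value` log format and the `handle_message` function are all made up for the example; a real setup would follow whatever logging standard your architects have put in place.

```python
import functools
import logging
import time

logger = logging.getLogger("metrics")  # hypothetical logger name


def timed(func):
    """Log how long each call to func takes and whether it raised."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        outcome = "ok"
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Record the exception type, then let it propagate unchanged.
            outcome = type(exc).__name__
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info(
                "metric=call_duration function=%s outcome=%s elapsed_ms=%.2f",
                func.__name__, outcome, elapsed_ms,
            )
    return wrapper


@timed
def handle_message(payload):
    # Stand-in for real work done by a component.
    return payload.upper()
```

Lines like these are easy for a collector to parse and ship to whichever stack you've chosen, which is really the point: the data exists from day one rather than being retrofitted.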

Doing this from the start will enable you to get a much better feel for how things are working before you launch - including a reasonable view of performance and infrastructure demands which will allow you to focus your efforts better later when you do get into sizing and performance testing. It'll mean you're not scrambling around looking for log files to help you root-cause issues as your latest release goes into meltdown. And it'll mean your customer won't be chewing your ear off asking you what's going on every five minutes - they'll be able to see it for themselves...

So please, get it in front of your customer, your product owner, your sponsor, your architects, your developers, your testers and make instrumentation a 1st class citizen in the backlog.

2016/09/23

macOS Minutes

Ah yes, Microsoft minutes... oh no, hang on...!!!

19,498 yrs later...

2016/09/10

MzE1NjE5NTI0ODYyfjIxMTAxNjI4NzMxNn5VMkZzZEdWa1gxOHorQVFYL2NQRFkwUVgyRm82anNJRGJHOW0vV1ZjdVEvVUR0cjZRZ0t3Y21WYWdCalo0dEZW

The title of this post is encrypted.

This page is also encrypted (via TLS (aka the new name for SSL)).

Anyone sniffing traffic on the wire must first decrypt the TLS traffic and then decrypt the content to work out what the message says.

But why bother with two layers of encryption?

Ok, so forgive the fact that this page is publicly accessible and TLS is decrypted before your eyes. It's possibly a poor example and in any case I'd like to talk about the server side of this traffic.

In many organisations, TLS is considered sufficient to provide security for data in-transit. The problem is TLS typically terminates on a load-balancer or on a web-server and is forwarded from there to another downstream server. Once this initial decryption takes place data often flows over the internal network of organisations in plain text. Many organisations consider this to be fine practice since the internal network is locked down with firewalls and intrusion detection devices etc. Some organisations even think it's good practice so that they can monitor internal traffic more easily.

However, there is an obvious concern over insider attacks, with system-admins or disgruntled employees being in a good position to skim off the data easily (and clean up any trace after themselves). Additionally, requests are often logged (think access logs and other server logs) and these can record some of the data submitted. Such data-exhaust is often available in volume to internal employees.

It's possible to re-wrap traffic between each node to avoid network sniffing but this doesn't help data-exhaust and the constant un-wrap-re-wrap becomes increasingly expensive if not in CPU and IO then in effort to manage all the necessary certificates. Still, if you're concerned then do this or terminate TLS on the application-server.

But we can add another layer of encryption to programmatically protect sensitive data we're sending over the wire in addition to TLS. Application components will need to decrypt this for use and when this happens the data will be in plain text in memory but right now that's about as good as we can get.

The same applies for data at-rest - in fact this is arguably far worse. You can't rely on full database encryption or file-system encryption. Once the machine is up and running anyone with access to the database or server can easily have full access to the raw data in all its glory. These sorts of practices only really protect against devices being lifted out of your data-centre - in which case you've got bigger problems...

The safest thing here is to encrypt the attributes you're concerned about before you store them and decrypt on retrieval. This sort of practice causes all sorts of problems in terms of searching but then should you really be searching passwords or credit card details? PII details (names, addresses etc.) are the main issue here, and careful thought about what really needs to be searched for, plus some constructive data-modelling, may be needed to make this workable. Trivial it is not and compromises abound.
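One example of the kind of data-modelling compromise this can mean (my choice of illustration, not the only approach) is a "blind index": store a keyed hash of the attribute next to its ciphertext so exact-match lookups still work without decrypting anything. The sketch below uses only the Python standard library and deliberately leaves the encryption itself out; `INDEX_KEY`, the row layout and `find_by_email` are hypothetical.

```python
import hashlib
import hmac


def blind_index(value: str, index_key: bytes) -> str:
    """Keyed hash of a normalised value, usable for equality lookups only."""
    normalised = value.strip().lower().encode("utf-8")
    return hmac.new(index_key, normalised, hashlib.sha256).hexdigest()


# Hypothetical key; in practice it would come from your key-management setup
# and be kept separate from the encryption key.
INDEX_KEY = b"example-index-key-not-for-real-use"

# Each stored row holds the attribute (encrypted elsewhere) plus its index.
rows = [
    {
        "email_ciphertext": "<encrypted elsewhere>",
        "email_index": blind_index("Alice@example.com", INDEX_KEY),
    },
]


def find_by_email(email: str):
    """Return rows whose blind index matches the supplied email address."""
    wanted = blind_index(email, INDEX_KEY)
    return [r for r in rows if hmac.compare_digest(r["email_index"], wanted)]
```

The compromise is plain to see: you get equality search only (no wildcards, no sorting), and identical values produce identical index entries, which leaks a little about the data. Whether that's acceptable depends on the attribute.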

All this encryption creates headaches around certificate and key management but such is life and this is just another issue we need to deal with. Be paranoid!

p.s. If you really want to know what the title says you can try the password over here.

2016/09/02

Channel 4 in France

Slight obsession some would say, but I enjoy F1... not so much that I'm prepared to pay Sky whatever extortionate fee they've come up with today though, so I tend to watch the highlights only on C4. Nice coverage btw guys - shame to lose you next year.

Anyway, I have a VPN (OpenVPN) running off a Synology DiskStation to allow me to tunnel through home when I'm abroad. Works a treat... normally. Channel 4 does not.

Initially I thought it was DNS leakage picking up that name resolution is from French servers. You can see this by visiting www.dnsleaktest.com and running the "standard test". Even though I'm reported as being in the UK, all my DNS servers are in France... Humm, I smell a fish...

Am I in the UK or France?

To work around this I set up a proxy server on the DiskStation and the same test now reports UK DNS servers as everything goes through the proxy.

Definitely looks like I'm in the UK... But still no luck on C4...

Finally, I set the timezone I was in to UK rather than France and this seemed to do the trick. Note that you need to change the timezone on the laptop, not the time itself or you'll have all sorts of trouble connecting securely to websites including C4.

In the end, the proxy doesn't seem necessary so they don't appear to be picking up on DNS resolution yet though it's the sort of thing that they could look at adding (that, and device geolocation using HTML5 geo API though for this there are numerous plugins for browsers to report fake locations).

Incidentally, BBC iPlayer works fine and does so without fiddling with timezone.

The net wasn't really designed to expose your physical location, and IP-to-location lookups such as MaxMind are more of a workaround than a true means of identifying where you are. Using Tor as a more elaborate tunnel makes you appear to be all over the place as your IP address jumps around, and corporate proxies, especially for large organisations, can make you appear to be in all sorts of weird places. Makes you wonder... All these attempts to limit your access based on an IP address to prop up digital rights management just don't work. They're all too easy to work around.

p.s. Turns out that whilst France doesn't have free-to-air F1 coverage, most places have some form of satellite TV via CanalSat or TNT which includes the German RTL channel. It'll do nothing to improve my French but at least I get to watch the race on the big screen...

2016/07/29

Scaling the Turd

It has been quite some time since my last post... Mainly because I've spent an inordinate amount of time trying to get an application scaling and performing as needed. We've done it, but I'm not happy.

Not happy, in part because of the time it's taken, but mainly because the solution is totally unsatisfactory. It's an off the shelf (COTS) package so making changes beyond a few "customisations" is out of the question and the supplier has been unwilling to accept the problem is within the product and instead points to our "environment" and "customisations" - neither of which, IMHO, is particularly odd.

At root there are really two problems.

One - according to the supplier, a single JVM can only handle 10 tps (transactions, i.e. page requests, per second). Our requirement is around 50.

Two - performance degrades over time to unacceptable levels if a JVM is stressed hard. So 10tps realistically becomes more like 1 to 2 tps after a couple of days of soak testing.

So we've done a lot of testing - stupid amounts of testing! Over and over, tweaking this and that, changing connection and thread pools, JVM settings, memory allocations etc. with pretty much no luck. We've checked the web-servers, the database, the queue configuration (itself an abomination of a setup), the CPU is idle, memory is plentiful, garbage-collection working a treat, disk IO is non-existent, network-IO measured in the Kb/sec. Nada! Except silence from the supplier...

And then we've taken thread dumps and can see stuck threads and lock contention so we know roughly where the problem lies, passed this to the supplier, but still, silence...

Well, not quite silence. They finally offered that "other customers don't have these issues" and "other customers typically run 20+ JVMs"! Excuse me? 20 JVMs is typical..? wtf!? So really they're admitting that the application doesn't scale within a JVM. That it cannot make use of resources efficiently within a JVM and that the only way to make it work is to throw more JVMs at it. Sounds to me like a locking issue in the application - one that no doubt gets worse as the application is stressed. Well at least we have a fix...

This means that we've ended up with 30 JVMs across 10 servers (VMs) for one component to handle a pathetic 50tps! - something I would expect 2 or 3 servers to handle quite easily given the nature of the application (the content delivery aspect of a content management system). And the same problem pervades the application's other components so we end up with 50 servers (all VMs bar a physical DB cluster) for an application handling 50 tps... This is not efficient or cost effective.

There are also many other issues with the application including such idiocies as synchronous queueing, a total lack of cache headers (resulting in a stupid high hit-rate for static resources) and really badly implemented Lucene indexing (closing and opening indexes continually). It is, by some margin, the worst COTS application I have had the misfortune to come across (I'll admit I've seen worse home-grown ones so not sure where that leaves us in the buy-v-build argument...).

So what's wrong with having so many JVMs?

Well, cost for a start. Even though we can cram more JVMs onto fewer VMs we need to step this up in chunks of RAM required per JVM (around 4GB). So, whilst I'm not concerned about CPU, a 20GB 4vCPU host can really only support 4 JVMs (some space is needed for OS and other process overheads). Lots of tin, doing nothing.
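The sums behind this are simple enough to sketch in Python. The throughput and RAM figures are the ones quoted in this post; the 4GB reserved for the OS and other processes is my assumption for the example (I only said "some space is needed"), and the function names are made up.

```python
import math


def jvms_needed(target_tps: float, tps_per_jvm: float) -> int:
    """JVMs required to hit a target throughput at a given per-JVM rate."""
    return math.ceil(target_tps / tps_per_jvm)


def jvms_per_host(host_ram_gb: int, jvm_ram_gb: int, reserved_gb: int) -> int:
    """JVMs that fit on a host once RAM for the OS etc. is set aside."""
    return (host_ram_gb - reserved_gb) // jvm_ram_gb


def hosts_needed(jvms: int, per_host: int) -> int:
    """Hosts required to house a given number of JVMs."""
    return math.ceil(jvms / per_host)


nominal = jvms_needed(50, 10)      # 5 JVMs at the supplier's 10 tps
degraded = jvms_needed(50, 2)      # 25 JVMs once a JVM sags to 2 tps
per_host = jvms_per_host(20, 4, 4) # 4 JVMs on a 20GB host (4GB reserved, assumed)
hosts = hosts_needed(30, per_host) # 8 hosts for 30 JVMs if packed that tightly
```

On paper the supplier's figure says 5 JVMs; the degradation is what multiplies that into dozens, and RAM rather than CPU then dictates how many hosts you pay for.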

But the real issue is maintenance. How the hell do you manage that many JVMs and VMs efficiently? You can use clustering in the application-server, oh, except that this isn't supported by the supplier (like I said, the worst application ever!). So we've now got monitors and scripts for each JVM and each VM and when something breaks (... and guess what, with this pile of sh*t, it does) we need to go round each node fixing them one-by-one.

Anyway, lessons learned, how should we have scaled such an application? What would I do differently now that I know? (bar using some completely different product of course)

Firstly I would have combined components together where we can. There's no reason why some WARs couldn't be homed together (despite the supplier's design suggesting otherwise). This would help reduce some of the JVMs and improve the reliability of some components (that queueing mechanism specifically).

Secondly, given we can't use a real cluster in the app-server, we can (now) use containers to package up each component of the application instead. This then becomes our scaling and maintenance point and rather than having 50 servers to manage we have 7 or 8 images to maintain (still a lot for such an application). This then allows us to scale up or down at the container level more quickly. The whole application wouldn't fit this model (DB in particular would remain as it is) but most of it ~~would~~ should.

Of course it doesn't solve the root cause unfortunately but it is a more elegant, maintainable and cheaper solution and, bar eradicating this appalling product from the estate, one that would have been so much more satisfying.

So that's the project for the summer... Work out how to containerise this sort of COTS application, how to connect, route and scale them in a way that is manageable, efficient and cost effective. Next project please!

2016/03/31

Hack a Mousetrap

My 10 year old son and I have been playing with Arduino recently... specifically to build a mouse trap alarm. And clearly we're not the only ones thinking about this... Some guys over at MS have tried to hack a mousetrap using every piece of technology they can get their hands on (I'm sure I saw the kitchen sink in there somewhere). Nice :)

2016/03/30

Security, Impact, Truth and Environments

I’ve been known to berate others for abusing environments - despite my personal habits - but I think it’s time for me to curtail my anger and reconsider exactly what distinguishes one environment from another and why.

We’re used to managing a plethora of environments - production, standby, prod-support, pre-production, performance test, UAT, system-test, development-integration, dev etc. - each of which has its own unique characteristics and purpose and each with a not insignificant cost.

With all those environments we can very easily have 5 or 6 times the infrastructure required to run production sitting mostly idle - and yet still needing to be maintained and patched and consuming kilowatts of power. All this for what can seem like no good reason bar to satisfy some decades-old procedural dictate handed down by those on high.

Unsurprisingly many organisations try to combine responsibilities into a smaller set of environments to save on $’s at the cost of increased risk. And recent trends in dev-ops, cloud and automation are helping to reduce the day-to-day need for all these environments even further. After all, if we can spin up a new server, install the codebase and introduce it into service in a matter of minutes then why not kill it just as quickly? If we can use cheaper t2.micro instances in dev and use m4.large only in prod then why shouldn’t we do so?

So we can shrink the number and size of environments so now we only have 2 or 3 times production and with auto-scaling this baseline capacity can actually be pretty low.

If we can get there...

… and the problem today is that whilst the technology exists, the legacy architectures, standards, procedures and practices adopted over many years by organisations simply don't allow these tools and techniques to be adopted at anywhere near the pace at which they are developing in the wild. That application written 10 years ago just doesn’t fit with the new cloud strategy the company is trying to develop. In short, revolution is fast (and bloody) and evolution is slow.

Our procedures and standards need to evolve at the same rate as technology and this just isn’t happening.

So I’ve been considering what all these environments are for and why they exist and think it comes down to three concerns; security, impact and truth.

- Security - What’s the security level of the data held? More often than not the production environment is the only one authorised to contain production data. That means it contains sensitive data or PII, has lots of access-control and auditing, firewalls everywhere, tripwires etc. There’s no way every developer is going to get access to this environment. Access is on a needs-to-know basis only... and we don’t need (and shouldn’t want) to know.
- Impact - What’s the impact to the business if the environment dies or runs slow? If dev goes down, no-one cares. Hell, if pre-prod goes down no-one bar prod-support really cares.
- Truth - How true to version X does the environment have to be? Production clearly needs to be the correct release of the codebase across the board (MVT aside). If we have the wrong code with the wrong database then it matters. In the development environment?.. if a script fails then frankly it’s not the end of the world, and besides dev is usually going to be version X+n, unstable and flaky in any case.

So in terms of governance it’s those things that keep management awake at night. They want to know who’s got access to what, what they can do, on what boxes, with what assets and what the risk is to data exposure. When we want to push out the next release they want to know the impact if it screws up, that we've got a back-out plan for when it does and that we've tested it - the release, the install plan and the back-out. In short, they’re going to be a complete pain in the backside. For good reason.

But can we rethink our environments around these concerns and does this help? If we can demonstrate to management that we’ve met these needs then why shouldn’t they let us reduce, remove and recycle environments at will?

Production and stand-by will have to be secure and the truth. But the impact if stand-by goes down isn’t the same. There’s a risk if prod falls over but that’s not the same thing. So allowing data-analysts access to stand-by to run all sorts of wild and crazy queries may not be an issue unless prod falls flat on its face - a risk some will be willing to take to make more use of the tin and avoid environment spread. Better still, if the data in question isn’t sensitive or is just internal-use-only then why not mirror a copy into dev environments to provide a more realistic test data-set for developers?

And if the data is sensitive? Anonymise it and use that, or a decent sample of it, in dev and test environments. Doing so will improve the quality of code by increasing the likelihood developers will detect patterns and edge-cases sooner in the development cycle.
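A minimal Python sketch of that idea, assuming records arrive as dictionaries: sensitive fields are swapped for keyed-hash pseudonyms and only a sample is kept. The field names, the secret and the function names are hypothetical. Strictly speaking this is pseudonymisation rather than true anonymisation, so treat it as a starting point and check it against your own data-protection rules.

```python
import hashlib
import hmac
import random

SENSITIVE_FIELDS = {"name", "email"}  # hypothetical field names


def pseudonymise(record: dict, secret: bytes) -> dict:
    """Replace sensitive fields with stable pseudonyms; copy the rest as-is."""
    out = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256)
            out[key] = f"{key}_{digest.hexdigest()[:12]}"
        else:
            out[key] = value
    return out


def anonymised_sample(records: list, secret: bytes, fraction: float, seed: int = 0) -> list:
    """Pick a repeatable random sample of records and pseudonymise each one."""
    rng = random.Random(seed)
    count = min(len(records), max(1, round(len(records) * fraction)))
    return [pseudonymise(r, secret) for r in rng.sample(records, count)]
```

Because the same input always maps to the same pseudonym, relationships between records survive, which is what lets developers still spot the patterns and edge-cases.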

In terms of impact, if the impact to the business of an application outage is low then why insist on the full range of environments when frankly one or two will do? Many internal applications are only used 9 to 5 and have an RTO and RPO in excess of 24 hrs. The business need to clearly understand what they’re agreeing to but ultimately it’s their $’s we’re spending and once they realise the cost they may be all too willing to take the risk. Having five different environments for every application for the sake of consistency alone isn’t justifiable.

And not all truths are equal. Some components don’t need the same rigour as others and may have lower impact to the business if they’re degraded to some degree. Allowing some components, especially expensive ones, to have fewer environments may complicate topologies and make the system as a whole harder to comprehend, but if we can justify it then so be it. We do though need to make sure this is very clearly understood by all involved else chaos can ensue - especially if some instances span environments (here be dragons).

Finally, if engineering teams paid more attention during development to performance and operability and could demonstrate this then the need for dedicated performance/pre-prod environments may also be reduced. We don't need an environment matching production to understand the performance profile of the application under load. We just need to consider the system's characteristics and test cases with a willingness (i.e. an acceptance of risk) to extrapolate. A truthful representation of production is usually not necessary.

Risk is everything here and if we think about how the application’s concerns stack up against the security risk, the impact risk to the business and the risk of things not being the truth, the whole truth and nothing but… then perhaps we can be smarter about how we structure our environments to help reduce the costs involved irrespective of adopting revolutionary technology.