Government IT Spend

The way that governments are run, there'll be a huge amount of duplication and waste in this lot. The situation is even worse when you consider that historically it's mostly proprietary software with n-year support contracts for stuff that's rarely used (but hits the headlines when it is). Not at all surprising.

The future for government IT is open-source and cloud-based.

Bypassing BT’s DNS Service

I suffered from BT's failure yesterday, which knocked out many sites, though thankfully it didn't seem to affect nonfunctionalarchitect.com – phew! What a relief, huh?

Anyway, BT has now apologised for the incident and is investigating the root cause. Well, feeling lost and detached from reality without full and proper access to the net (internet access should be a human right), I naturally did my own investigating, which included the obligatory reboots to no avail (my Mac, wife's PC, home-hub) – and you know they'll make you redo these steps if you have to call support…

Some sites could be pinged, some couldn't (could not resolve host), which points at a DNS issue. Bypassing BT's DNS isn't that easy though, as they have a transparent DNS service in place, which means you can't just add Google's free DNS servers to your list (8.8.8.8 and 8.8.4.4 if you're interested). Doing this in my case simply resulted in an error message saying that BT's Parental Controls were on and prevented me from using another DNS service. Turning Parental Controls off stopped the error message but didn't help me resolve names, because the transparent DNS service carries on intercepting requests regardless.
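(If you fancy scripting the same sort of check rather than pinging by hand, here's a rough sketch using the JDK's JNDI DNS provider to ask Google's DNS directly alongside the system resolver. The hostname is just an example, and of course a transparent DNS service may well intercept the direct query too – which is rather the point.)

```java
import java.net.InetAddress;
import java.util.Hashtable;
import javax.naming.directory.Attributes;
import javax.naming.directory.InitialDirContext;

// Rough diagnostic: compare the system resolver with a direct query to Google DNS.
// If the first fails but the second works, your ISP's resolver is the likely culprit.
public class DnsCheck {
    public static void main(String[] args) throws Exception {
        String host = "example.com";  // substitute a site that fails to resolve for you

        // 1. Whatever resolver the OS (and hence the ISP's transparent service) gives you
        try {
            System.out.println("System resolver: " + InetAddress.getByName(host).getHostAddress());
        } catch (Exception e) {
            System.out.println("System resolver failed: " + e.getMessage());
        }

        // 2. Ask Google's DNS directly via the JNDI DNS provider
        //    (a transparent DNS service may intercept this too)
        Hashtable<String, String> env = new Hashtable<>();
        env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory");
        env.put("java.naming.provider.url", "dns://8.8.8.8");
        Attributes attrs = new InitialDirContext(env).getAttributes(host, new String[] { "A" });
        System.out.println("Google DNS says: " + attrs.get("A"));
    }
}
```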

I could only think of two methods to bypass BT’s DNS service:

1. Use a VPN.

This will still rely on BT's network but prevents them from intercepting anything since it's all secure in a warm and cosy encrypted VPN tunnel. The only problem here is finding a VPN end-point to connect to first – I have one, but it's there to allow me remote access to my house, which in turn relies on BT. Doh!

2. Use TOR (The Onion Router) and Privoxy.

This stops the browser doing its own DNS lookups (hence the use of Privoxy) and all requests are sent over the TOR network, surfacing anywhere in the world (preferably somewhere not using BT's DNS service, though I have little control over this). It's not the fastest solution but it works. Fortunately I had an old VM with TOR and Privoxy installed and configured, so with a few tweaks (listen on 0.0.0.0 (all addresses) rather than 127.0.0.1 (localhost only)) I could configure all the machines in the house to use this VM as a proxy service and bingo! We were back online and didn't have to risk talking to each other anymore – phew!
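If you want to check a machine really is going out via the Privoxy VM, something like the snippet below does the trick from Java. The hostname tor-vm.local is made up and 8118 is just Privoxy's default listen port – substitute whatever you've actually configured.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;

// Quick check that traffic is going via the Privoxy/TOR VM rather than directly.
// "tor-vm.local" is a made-up hostname; 8118 is Privoxy's default listen port.
public class ProxyCheck {
    public static void main(String[] args) throws IOException {
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("tor-vm.local", 8118));
        HttpURLConnection conn =
                (HttpURLConnection) new URL("https://check.torproject.org/").openConnection(proxy);
        // The target hostname is handed to the proxy, so the local (BT) resolver stays out of the loop.
        System.out.println("Response code via proxy: " + conn.getResponseCode());
    }
}
```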

TOR is awesome and useful for accessing sites which may be blocked by your service provider, your government, or blocked for some other legal reason (such as why the really cool but generally inaccessible BBC Future site is blocked for fee-paying British residents). It's also useful if you want to test stuff from somewhere else in the world over what feels like a wet piece of string for a network.

Resiliency needs to be considered before you have a failure. In this instance you need to have a VM (or physical machine) pre-configured and ready for such an emergency (and don't call 999, they won't be able to help…). Smug mode on!

Excremental Form

We often think we know what good design is, whether it be system, code or graphic design, and it's a good thing that we strive for perfection.

Perfection, though, is subjective, comes at a cost and is ultimately unachievable. We must embrace the kludges, hacks, work-arounds and other compromises and, like the Greek idiom “whoever is not Greek is barbarian”, we should be damn proud of being that little bit barbaric even if we continue to admire the Greeks.

The question is not whether the design is good but whether the compromises are justified, sound and fit for purpose. Even shit can have good and bad form.

Chaos Monkey

I’ve had a number of discussions in the past about how we should be testing failover and recovery procedures on a regular basis – to make sure they work and everyone knows what to do so you’re not caught out when it happens for real (which will be at the worst possible moment). Scheduling these tests, even in production, is (or should be) possible at some convenient(ish) time. If you think it isn’t then you’ve already got a resiliency problem (you’ll be down whenever a component fails) as well as a maintenance problem.

I’ve also talked (ok, muttered) about how a healthy injection of randomness can actually improve stability, resilience and flexibility. Something covered by Nassim Taleb in his book Antifragile.

Anyway, beaten to the punch again: Netflix developed a tool called Chaos Monkey (part of their Simian Army) a few years back which randomly kills elements of the infrastructure to help identify weak points. Well worth checking out the write-up on codinghorror.com.

For the record… I’m not advocating that you use Chaos Monkey in production… Just that it’s a good way to test the resiliency of your environment and identify potential failure points. You should be testing procedures in production in a more structured manner.
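If it helps to picture the idea, here's a toy sketch of the principle – emphatically not Netflix's actual implementation, just an illustration, and killInstance is a stand-in for whatever kill mechanism your own tooling provides:

```java
import java.util.List;
import java.util.Random;

// Toy illustration of the Chaos Monkey idea: pick one node at random, kill it,
// then watch whether the rest of the estate copes without it.
public class TinyChaosMonkey {
    private static final Random RANDOM = new Random();

    public static void main(String[] args) {
        List<String> nodes = List.of("app-server-1", "app-server-2", "app-server-3"); // example names
        String victim = nodes.get(RANDOM.nextInt(nodes.size()));
        System.out.println("Killing " + victim + " - does everything else keep running?");
        killInstance(victim);
    }

    // Stand-in for real tooling (cloud API call, 'kill -9', pulling the power cable...).
    private static void killInstance(String node) {
        // deliberately left empty - this is where your own automation goes
    }
}
```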

Telco CDNs & Monopolies

Telco CDNs (Content Distribution Networks) are provided by telcos embedding content-caching infrastructure deep in the network, close to the end-user (just before the last km of copper wire). The result is improved streaming to end-users and significantly less load on both the content provider's servers and the telco's wider network. It's a win-win-win for everyone.

Telcos charge content providers for this service. If the telco's network has a limited client base then perhaps there's not much point in the content provider paying them to cache the content, since it'll not reach many end-users. If the telco is a state-run (or previously state-run) monopoly, then if you want to make sure your content is delivered in the best quality you'll pay (if you can). The telco could thus be accused of abuse if they are seen to be using a monopoly position to drive ever-higher profits through leveraging this sort of technology. It can also be considered an abuse of net-neutrality principles by essentially prioritising (biasing) content. Worse still, if it's state-run you'll wonder if it's 1984 all over again (the fashion was truly awful!).

Technically I think the idea of telco CDNs is pretty neat and efficient (storage capacity is cheap compared to network capacity). I’d also not want to add directly to the cost of my internet connection to fund the infrastructure to support this so am pleased if someone else is prepared to pay.

Ultimately, though, we all pay of course, and you could argue that this model at least attempts to ensure users of high-volume services such as Netflix pay rather than everyone. However, as with net-neutrality concerns in general, I wonder when the first public outcry will come… When we discover a telco is prioritising its own video streaming service over a competitor's? When we find the government has been using such methods to intentionally drop “undesirable” content? Or when we can't watch EastEnders in HD because the BBC hasn't paid their bill recently?

Resilient WebSphere Session Management

I’ve been promising myself that I’ll write this short piece sometime, and since the football today has been a little sluggish I thought I’d take time out from the World Cup and get on with it… (you know it won’t be short either…).

Creating applications that can scale horizontally is, in theory, pretty simple. Processing must be parallelizable such that the work can be split amongst all member processors and servers in a cluster. Map-reduce is a common pattern implemented to achieve this. Another, even more common, pattern is the simple request-response mechanism of the web. It may not sound like it, since each request is typically independent of the others, but from a server's perspective it is arguably an example of parallel processing. Map-reduce handles pre-requisites by breaking jobs down into separate map and reduce tasks (fork and join) and chaining multiple map-reduce jobs. The web implements its own natural scheduling of requests which must be performed in sequence, as a consequence of the wet-ware interacting at a snail's pace with the UI. In this case any state needing to be retained between requests is typically held in sessions – in-memory on the server.
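To make the "state held in session" bit concrete, a bare-bones servlet keeping a counter per user looks something like this (plain Servlet API, nothing WAS-specific):

```java
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

// Minimal example of per-user state held in an in-memory session between requests.
public class VisitCountServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        HttpSession session = request.getSession(true);           // create the session if needed
        Integer visits = (Integer) session.getAttribute("visits");
        visits = (visits == null) ? 1 : visits + 1;
        session.setAttribute("visits", visits);                   // state lives on this server only
        response.getWriter().println("Requests seen in this session: " + visits);
    }
}
```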

Resiliency, though, is a different issue from scalability.

In map-reduce, if a server fails then the processing task can be restarted on another node. There'll be some repeat work performed, as the results of the in-flight task will have been lost (and maybe more), but computers don't much mind doing repetitive tasks and will quite willingly get on with it without much grumbling (ignoring the question of “free will” in computing for the moment).

Humans do mind repeating themselves though (I’ve wanted to measure my reluctance to repeat tasks over time since I think it’s got progressively worse in recent years…).

So how do you not lose a user's session state if a server goes down?

Firstly, you're likely going to piss someone off. There'll be some request in mid-flight the second the server goes down, unless you're in maintenance mode and are quiescing the server cleanly. Of course you could not bother with server-side session state at all and track all data through cookies running back and forth over the network. This isn't very good – lots of network traffic, and not very secure if you need to hold anything the user (or Eve) shouldn't see, or if you're concerned about someone spoofing requests. Sometimes it's viable though…

But really you want a way for the server to handle such failures for you… and with WebSphere Application Server (WAS) there are a few options (see how long it takes me to get to the point!).

==== SCROLL TO HERE IF YOU WANT TO SKIP THE RATTLING ====

The WAS plugin should always be used in front of WAS. The plugin will route requests to the correct downstream app server based on a clone id tagged on to the end of the session id cookie (JSESSIONID). If the target server is not available (the plugin cannot open a connection to the server) then another will be tried. It also means that whatever HTTP server (Apache, IIS, IHS) a request lands on, it will be routed to the correct WAS server where the session is held in memory. It's also quite configurable, on the fly, for problem determination, so well worth becoming friends with.
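If you're curious which clone a given request is pinned to, the clone id is visible in the cookie itself. The exact layout can vary, but a WAS session cookie generally looks like <cache id><session id>:<clone id>, so a quick-and-dirty parse is enough for troubleshooting (the cookie value below is made up – check a real one from your own environment):

```java
// Quick-and-dirty look at which clone a WAS session cookie is pinned to.
// Assumes the usual "<cache id><session id>:<clone id>[:<failover clone id>...]" layout.
public class CloneIdPeek {
    public static void main(String[] args) {
        String jsessionid = "0000A1b2C3d4E5f6:vuel491u"; // made-up example value
        String[] parts = jsessionid.split(":");
        if (parts.length > 1) {
            System.out.println("Primary clone id: " + parts[1]);
        } else {
            System.out.println("No clone id present - the plugin will pick any server");
        }
    }
}
```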

When the request finally lands on the WAS server then you’ve essentially three options for how you manage sessions for resiliency.

  1. Local sessions – Do nothing and all sessions will be held in memory on the local server. In this instance, if the server goes down, you'll lose the session and users will have to log in again and repeat any work they've done to date which is held in session (and note, as above, users don't like repeating themselves).
  2. Database persistent sessions – Configure a JDBC data source and WAS can store changes to the session in a database (make sure all your objects are serializable – there's a sketch of what that looks like after this list). The implementation has several options to optimise for performance over safety and the like, but at the end of the day you're writing session information to a database – it can have a significant performance impact and adds another pre-requisite dependency (i.e. a supported, available and resilient database). Requests hitting the original server will find session data available in-memory already. Requests hitting another server will incur a database round trip to fetch session state. As a one-off hit it's tolerable, but to avoid repeated DB hits you still want to use the plugin.
  3. Memory-to-memory replication – Here changes to user sessions are replicated, in the background, between all servers in a cluster. In theory any server could serve requests and the plugin can be ignored, but in practice you'll still want requests to go back to the origin to increase the likelihood that the server has the correct state, as even memory-to-memory replication can take some (small) time. There are two modes this can operate in: peer-to-peer (normal) and client-server (where a server operates as a dedicated session state server).
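For options 2 and 3, everything you stuff into the session has to be serializable – something along these lines, with the field names obviously just illustrative:

```java
import java.io.Serializable;

// Example of a session attribute that's safe to persist or replicate:
// small, serializable, and free of non-serializable resources like DB connections.
public class CheckoutState implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String customerId;   // illustrative fields - keep the total well under 4kB
    private final String basketId;
    private int currentPage;

    public CheckoutState(String customerId, String basketId) {
        this.customerId = customerId;
        this.basketId = basketId;
    }

    public void setCurrentPage(int currentPage) { this.currentPage = currentPage; }
    public int getCurrentPage() { return currentPage; }
    public String getCustomerId() { return customerId; }
    public String getBasketId() { return basketId; }
}
```

Put it in session with session.setAttribute("checkout", state) and both the database and memory-to-memory options can do their job.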

My preference is for peer-to-peer memory-to-memory replication due to performance and cost factors (no additional database required, which would also need to be resilient; no dedicated session state server). Details of how you can set this up are in the WAS Admin Redbook.

Finally, you should always keep the amount of data stored in session objects to a minimum (< 4kB) and all objects need to be serializable if you want to replicate or store sessions in a database. Don't store the complete results of a cursor in session for quick access – repeat the query and return only the results you want (using paging to skip through) – and don't store things like database connections in session; it won't work, at least not for long…
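By way of illustration, rather than parking a whole result set in session you re-run the query and fetch just the page you need; the table and column names below are made up and the LIMIT/OFFSET syntax varies by database:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Fetch one page of results on demand instead of caching the whole result set in session.
public class OrderHistoryDao {
    public List<String> fetchPage(Connection connection, String customerId, int page, int pageSize)
            throws SQLException {
        String sql = "SELECT order_ref FROM orders WHERE customer_id = ? "
                   + "ORDER BY order_date DESC LIMIT ? OFFSET ?";   // syntax varies by database
        List<String> refs = new ArrayList<>();
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setString(1, customerId);
            ps.setInt(2, pageSize);
            ps.setInt(3, page * pageSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    refs.add(rs.getString("order_ref"));
                }
            }
        }
        return refs;   // only this page ends up anywhere near the session
    }
}
```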

Scaling on a budget

Pre-cloud era. You have a decision to make. Do you define your capacity and performance requirements in the belief that you’ll build the next top 1000 web-site in the world or start out with the view that you’ll likely build a dud which will be lucky to get more than a handful of visits each day?

If the former then you’ll need to build your own data-centres (redundant globally distributed data-centres). If the latter then you may as well climb into your grave before you start. But most likely you’ll go for something in the middle, or rather at the lower end, something which you can afford.

The problem comes when your site becomes popular. Worse still, when that popularity is temporary. In most cases you’ll suffer something like a slashdot effect for a day or so which will knock you out temporarily but could trash your image permanently. If you started at the higher end then your problems have probably become terminal (at least financially) already.

It’s a dilemma that every new web-site needs to address.

Post-cloud era. You have a choice – IaaS or PaaS? If you go with infrastructure then you can possibly scale out horizontally by adding more servers when needed. This though is relatively slow to provision* since you need to spin up a new server, install your applications and components, add it to the cluster, configure load-balancing, DNS resiliency and so on. Vertical scaling may be quicker but provides limited additional headroom. And this assumes you designed the application to scale in the first place – if you didn’t then chances are probably 1 in 10 that you’ll get lucky. On the up side, the IaaS solution gives you the flexibility to do-your-own-thing and your existing legacy applications have a good chance they can be made to run in the cloud this way (everything is relative of course).

If you go with PaaS then you’re leveraging (in theory) a platform which has been designed to scale but which constrains your solution design in doing so. Your existing applications have little chance they’ll run off-the-shelf (actually, no chance at all really) though if you’re lucky some of your libraries may (may!) work depending on compatibility (Google App Engine for Java, Microsoft Azure for .NET for example). The transition is more painful with PaaS but where you gain is in highly elastic scalability at low cost because it’s designed into the framework.

IaaS is great (this site runs on it), is flexible with minimal constraints, low cost and can be provisioned quickly (compared to the pre-cloud world).

PaaS provides a more limited set of capabilities at a low price point and constrains how applications can be built so that they scale and co-host with other users' applications (introducing multi-tenancy issues).

A mix of these options probably provides the best solution overall depending on individual component requirements and other NFRs (security for example).

Anyway, it traverses the rat's maze of my mind today because of its relevance to the news… Many government web-sites have pitiful visitor numbers until they get slashdotted or are placed at #1 on the BBC website – something which happens quite regularly, though most of the time the sites get very little traffic – peaky. Today's victim is the Get Safe Online site, which collapsed under load – probably as a result of the BBC advertising it. For such sites perhaps PaaS is the way forward.

* I can't really believe I'm calling IaaS “slow” given provisioning can be measured in minutes and hours when previously you'd be talking days, weeks and likely months…