Chaos Monkey

I’ve had a number of discussions in the past about how we should be testing failover and recovery procedures on a regular basis – to make sure they work and that everyone knows what to do, so you’re not caught out when it happens for real (which will be at the worst possible moment). Scheduling these tests, even in production, is (or should be) possible at some convenient(’ish) time. If you think it isn’t, then you’ve already got a resiliency problem (you’ll be down whenever a component fails) as well as a maintenance problem.

I’ve also talked (ok, muttered) about how a healthy injection of randomness can actually improve stability, resilience and flexibility – something covered by Nassim Taleb in his book Antifragile.

Anyway, beaten to the punch again: Netflix developed a tool called Chaos Monkey (part of their Simian Army) a few years back which randomly kills elements of the infrastructure to help identify weak points. Well worth checking out on codinghorror.com.

For the record… I’m not advocating that you use Chaos Monkey in production… Just that it’s a good way to test the resiliency of your environment and identify potential failure points. You should be testing procedures in production in a more structured manner.

Resilient WebSphere Session Management

I’ve been promising myself that I’ll write this short piece sometime, and since the football today has been a little sluggish I thought I’d take a timeout from the World Cup and get on with it… (you know it won’t be short either…).

Creating applications that can scale horizontally is, in theory, pretty simple. Processing must be parallelizable so that the work can be split amongst all member processors and servers in a cluster. Map-reduce is a common pattern implemented to achieve this. Another, even more common, pattern is the simple request-response mechanism of the web. It may not sound like it, since each request is typically independent of the others, but from a server’s perspective it is arguably an example of parallel processing. Map-reduce handles pre-requisites by breaking jobs down into separate map and reduce tasks (fork and join) and by chaining multiple map-reduce jobs. The web implements its own natural scheduling of requests, which must be performed in sequence as a consequence of the wet-ware interacting at a snail’s pace with the UI. In this case any state that needs to be retained between requests is typically held in sessions – in memory on the server.

Resiliency, though, is a different issue from scalability.

In map-reduce, if a server fails then the processing task can be restarted on another node. There’ll be some repeat work performed, as the results of the in-flight task will have been lost (and maybe more), but computers don’t much mind doing repetitive tasks and will quite willingly get on with it without much grumbling (ignoring the question of “free will” in computing for the moment).

Humans do mind repeating themselves though (I’ve wanted to measure my reluctance to repeat tasks over time since I think it’s got progressively worse in recent years…).

So how do you not lose a user’s session state if a server goes down?

Firstly, you’re likely going to piss someone off. There’ll be some request in mid-flight the second the server goes down, unless you’re in maintenance mode and are quiescing the server cleanly. Of course you could not bother with server session state at all and track all data through cookies running back and forth over the network. This isn’t very good – lots of network traffic, and not very secure if you need to hold anything the user (or Eve) shouldn’t see, or if you’re concerned about someone spoofing requests. Sometimes it’s viable though…

But really you want a way for the server to handle such failures for you… and with WebSphere Application Server (WAS) there are a few options (see how long it takes me to get to the point!).

==== SCROLL TO HERE IF YOU WANT TO SKIP THE RATTLING ====

The WAS plugin should always be used in front of WAS. The plugin will route requests to the correct downstream app-server based on a clone id tagged on to the end of the session id cookie (JSESSIONID). If the target server is not available (the plugin cannot open a connection to the server) then another will be tried. It also means that whatever HTTP server (Apache, IIS, IHS) a request lands on, it will be routed to the correct WAS server where the session is held in memory. It’s quite configurable, on the fly, for problem determination, so it’s well worth becoming friends with.
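
As a quick way of seeing that affinity in action, you can pull the clone id off the back of the JSESSIONID cookie yourself and log it. A minimal sketch, assuming the usual cacheId + sessionId + ":" + cloneId layout of the cookie value – the filter below is something I’ve made up for illustration, not anything WAS provides:

```java
// Hypothetical servlet filter that logs which clone a request has affinity to,
// assuming the common WAS cookie layout cacheId + sessionId + ":" + cloneId
// (e.g. JSESSIONID=0000AbCdEf...:-1). Class and message text are illustrative.
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletRequest;

public class CloneIdLoggingFilter implements Filter {

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        if (req instanceof HttpServletRequest) {
            Cookie[] cookies = ((HttpServletRequest) req).getCookies();
            if (cookies != null) {
                for (Cookie c : cookies) {
                    if ("JSESSIONID".equals(c.getName()) && c.getValue().contains(":")) {
                        // Everything after the first ':' is the clone id the plugin routes on
                        String cloneId = c.getValue().substring(c.getValue().indexOf(':') + 1);
                        System.out.println("Request affinity to clone: " + cloneId);
                    }
                }
            }
        }
        chain.doFilter(req, res);
    }

    @Override public void init(FilterConfig cfg) { }
    @Override public void destroy() { }
}
```

In normal running you’d leave all of this to the plugin; it’s just handy when you’re trying to work out which server a particular browser is stuck to.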

When the request finally lands on the WAS server, you’ve essentially got three options for how you manage sessions for resiliency.

1. Local Sessions – Do nothing and all sessions will be held in memory on the local server. In this instance, if the server goes down, you’ll lose the session and users will have to log in again and repeat any work they’ve done to date which is held in session (and note, as above, users don’t like repeating themselves).
2. Database persistent sessions – Configure a JDBC source and WAS can store changes to the session in a database (make sure all your objects are serializable – there’s a small sketch of this after the list). The implementation has several options to optimize for performance over safety and the like, but at the end of the day you’re writing session information to a database – it can have a significant performance impact and adds another pre-requisite dependency (i.e. a supported, available and resilient database). Requests hitting the original server will find session data available in-memory already. Requests hitting another server will incur a database round trip to fetch session state. As a one-off hit it’s tolerable, but to avoid repeated DB hits you still want to use the plugin.
3. Memory-to-memory replication – Here changes to user sessions are replicated, in the background, between all servers in a cluster. In theory any server could serve requests and the plugin can be ignored, but in practice you’ll still want requests to go back to the origin to increase the likelihood that the server has the correct state, as even memory-to-memory replication can take some (small) time. There are two modes this can operate in: peer-to-peer (normal) and client-server (where a server operates as a dedicated session state server).
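
For options 2 and 3, anything you put in session has to serialize cleanly, so it pays to keep session attributes as small, plain value objects. A minimal sketch – the class and field names are purely illustrative:

```java
// Illustrative sketch of a session attribute that survives database persistence
// or memory-to-memory replication: small, Serializable and free of
// non-serializable resources such as connections. Names are made up.
import java.io.Serializable;

public class CheckoutState implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String basketId;   // reference to server-side data, not the data itself
    private final int currentStep;   // where the user is in the flow

    public CheckoutState(String basketId, int currentStep) {
        this.basketId = basketId;
        this.currentStep = currentStep;
    }

    public String getBasketId() { return basketId; }
    public int getCurrentStep() { return currentStep; }
}
```

Stored with something like session.setAttribute("checkoutState", new CheckoutState(basketId, 2)) – going back through setAttribute also gives the session manager the best chance of spotting the change, depending on how you’ve configured what gets written.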

My preference is for peer-to-peer memory-to-memory replication due to performance and cost factors (no additional database required – which would itself need to be resilient – and no dedicated session state server). Details of how you can set this up are in the WAS Admin Redbook.

Finally, you should always keep the amount of data stored in session objects to a minimum (&lt;4kB), and all objects need to be serializable if you want to replicate sessions or store them in a database. Don’t store the complete results of a cursor in session for quick access – repeat the query and return only the results you want (using paging to skip through) – and don’t store things like database connections in session; it won’t work, at least not for long…
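
To make that last point concrete, here’s a rough sketch of the paging idea – re-run the query for each page and keep only the current, small page of results in session. The DAO, method and attribute names are all assumptions, not anything from WAS:

```java
// Rough sketch of the paging idea: re-run the query per page and keep only the
// current page in session, never the cursor, result set or connection behind it.
// OrderDao, findOrders(...) and the attribute names are hypothetical.
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import javax.servlet.http.HttpSession;

public class OrderPaging {

    private static final int PAGE_SIZE = 20;

    public void showPage(HttpSession session, OrderDao orderDao, String userId, int pageNumber) {
        // Fetch just the rows for this page; the database does the skipping.
        ArrayList<OrderSummary> page =
                new ArrayList<>(orderDao.findOrders(userId, pageNumber * PAGE_SIZE, PAGE_SIZE));

        // A small, Serializable list is cheap to replicate or persist...
        session.setAttribute("orderPage", page);
        session.setAttribute("orderPageNumber", pageNumber);

        // ...whereas a connection or open cursor is not Serializable and won't
        // survive replication, persistence or a failover, so it never goes in session.
    }

    // Hypothetical supporting types, just to keep the sketch self-contained.
    public interface OrderDao {
        List<OrderSummary> findOrders(String userId, int offset, int limit);
    }

    public static class OrderSummary implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String orderId;
        public final String status;
        public OrderSummary(String orderId, String status) {
            this.orderId = orderId;
            this.status = status;
        }
    }
}
```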

Windows 7 Incident

Having recently been responsible for an estate-wide software upgrade programme to Windows 7 for many thousands of devices, I sympathise, but I have to find this amusing. However, it is an interesting approach to achieving a refresh in particularly short order… Make the best of it guys, treat it as an opportunity to audit your estate… I do hope your backup procedures are working though… 😉
