
Resilient WebSphere Session Management

I've been promising myself that I'll write this short piece sometime, and since the football today has been a little sluggish I thought I'd take time out from the World Cup and get on with it... (you know it won't be short either...).

Creating applications that can scale horizontally is, in theory, pretty simple. Processing must be parallelizable such that the work can be split amongst all member processors and servers in a cluster. Map-reduce is a common pattern implemented to achieve this. Another, even more common, pattern is the simple request-response mechanism of the web. It may not sound like it, since each request is typically independent of the others, but from a server's perspective it is arguably an example of parallel processing. Map-reduce handles prerequisites by breaking jobs down into separate map and reduce tasks (fork and join) and chaining multiple map-reduce jobs. The web implements its own natural scheduling of requests, which must be performed in sequence as a consequence of the wetware interacting at a snail's pace with the UI. In this case any state needing to be retained between requests is typically held in sessions, in memory on the server.

Resiliency though is a different issue than scalability.

In map-reduce, if a server fails then the processing task can be restarted on another node. There'll be some repeat work performed, as the results of the in-flight task will have been lost (and maybe more), but computers don't much mind doing repetitive tasks and will quite willingly get on with it without much grumbling (ignoring the question of "free will" in computing for the moment).

Humans do mind repeating themselves though (I've wanted to measure my reluctance to repeat tasks over time since I think it's got progressively worse in recent years...).

So how do you not lose a user's session state if a server goes down?

Firstly, you're likely going to piss someone off. There'll be some request in mid-flight the second the server goes down, unless you're in maintenance mode and are quiescing the server cleanly. Of course you could not bother with server session state at all and track all data through cookies running back and forth over the network. This isn't very good: lots of network traffic, and not very secure if you need to hold anything the user (or Eve) shouldn't see, or if you're concerned about someone spoofing requests. Sometimes it's viable though...

But really you want a way for the server to handle such failures for you... and with WebSphere Application Server (WAS) there are a few options (see how long it takes me to get to the point!).


The WAS plugin should always be used in front of WAS. The plugin will route requests to the correct downstream app server based on a clone id tagged on to the end of the session id cookie (JSESSIONID). If the target server is not available (the plugin cannot open a connection to the server) then another will be tried. It also means that whichever HTTP server (Apache, IIS, IHS) a request lands on, it will be routed to the WAS server where the session is held in memory. The plugin is also quite configurable on the fly, which helps with problem determination, so it's well worth becoming friends with.
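To make the affinity idea concrete, here's a rough sketch of the routing decision in Java. This is not the plugin's actual code; it only mimics the behaviour described above. I'm assuming the clone id is appended to the JSESSIONID value after a colon, and the class name `AffinityRouter`, the clone ids and the server names are all made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

/** Illustrative sketch of session-affinity routing; NOT the real WAS plugin logic. */
final class AffinityRouter {
    // Hypothetical clone id -> server name table, kept in insertion order.
    private final Map<String, String> serverByCloneId;
    // Hypothetical health check: true if a connection to the server can be opened.
    private final Predicate<String> isUp;

    AffinityRouter(Map<String, String> serverByCloneId, Predicate<String> isUp) {
        this.serverByCloneId = new LinkedHashMap<>(serverByCloneId);
        this.isUp = isUp;
    }

    /** Picks the server that owns the session if it is up, otherwise any server that is up. */
    Optional<String> route(String jsessionId) {
        if (jsessionId != null) {
            // Assumption: "<session id>:<clone id>[:<clone id>...]".
            String[] parts = jsessionId.split(":");
            for (int i = 1; i < parts.length; i++) {
                String server = serverByCloneId.get(parts[i]);
                if (server != null && isUp.test(server)) {
                    return Optional.of(server);
                }
            }
        }
        // No usable affinity: fall back to the first available server.
        return serverByCloneId.values().stream().filter(isUp).findFirst();
    }

    public static void main(String[] args) {
        Map<String, String> table = new LinkedHashMap<>();
        table.put("cloneA", "server1");
        table.put("cloneB", "server2");
        // Pretend server2 has just gone down.
        AffinityRouter router = new AffinityRouter(table, s -> !s.equals("server2"));
        System.out.println(router.route("0000abc:cloneB").orElse("no server available"));
    }
}
```

The point to notice is the fallback: when the owning server is down the request still lands somewhere, but whether the session is there to greet it depends on which of the options below you've chosen.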

When the request finally lands on the WAS server then you've essentially three options for how you manage sessions for resiliency.

  1. Local Sessions - Do nothing and all sessions will be held in memory on the local server. In this instance, if the server goes down, you'll lose the session and users will have to log in again and repeat any work they've done to date which is held in session (and note, as above, users don't like repeating themselves).

  2. Database persistent sessions - Configure a JDBC source and WAS can store changes to the session in a database (make sure all your objects are serializable). The implementation has several options to optimize for performance over safety and the like but at the end of the day you're writing session information to a database - it can have a significant performance impact and adds another pre-requisite dependency (i.e. a supported, available and resilient database). Requests hitting the original server will find session data available in-memory already. Requests hitting another server will incur a database round trip to fetch session state. As a one-off hit it's tolerable but to avoid repeated DB hits you still want to use the plugin.

  3. Memory to memory replication - Here changes to user sessions are replicated, in the background, between all servers in a cluster. In theory any server could serve requests and the plugin can be ignored, but in practice you'll still want requests to go back to the origin to increase the likelihood that the server has the correct state, as even memory-to-memory replication can take some (small) time. There are two modes this can operate in: peer-to-peer (normal) and client-server (where a server operates as a dedicated session state server).

My preference is for peer-to-peer memory-to-memory replication due to performance and cost factors (no additional database required which would also need to be resilient, no dedicated session state server). Details of how you can set this up are in the WAS Admin Redbook.

Finally, you should always keep the amount of data stored in session objects to a minimum (<4kB) and all objects need to be serializable if you want to replicate or store sessions in a database. Don't store the complete results of a cursor in session for quick access - repeat the query and return only the results you want (using paging to skip through) - and don't store things like database connections in session, it won't work, at least, not for long...
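If you want to check the size and serializability advice before your sessions hit a replication domain or a database, something like the following could be dropped into a unit test. It's a minimal sketch using plain JDK serialization; the class name `SessionAttributeCheck` and the 4 kB threshold constant are my own choices for illustration, and the size WAS actually replicates may differ from what this measures.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

/** Illustrative helper: checks a would-be session attribute against the guidelines above. */
final class SessionAttributeCheck {
    // The ~4kB guideline from the text, expressed in bytes.
    static final int MAX_BYTES = 4 * 1024;

    /** Returns the Java-serialized size in bytes, or -1 if the value cannot be serialized. */
    static int serializedSize(Object value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            // NotSerializableException is an IOException, so it lands here too.
            return -1;
        }
        return buffer.size();
    }

    /** True if the value serializes and fits within the size guideline. */
    static boolean isSessionFriendly(Object value) {
        int size = serializedSize(value);
        return size >= 0 && size <= MAX_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(isSessionFriendly("user-42"));      // small and serializable
        System.out.println(isSessionFriendly(new byte[8192])); // serializable but too big
        System.out.println(isSessionFriendly(new Object()));   // not serializable at all
    }
}
```

A database connection would fail this check for the same reason `new Object()` does: it isn't serializable, which is exactly why it has no business being in a session.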

