Session Abolition

I’ve been going through my bookcase, on orders from a higher being, to weed out old, redundant books and make way for… well, I’m not entirely sure what, but anyway, it’s not been very successful.

I came across an old copy of Release It! by Michael T. Nygard and started flicking through, chuckling occasionally as memories (good and bad) surfaced. It’s an excellent book, but it made me stop and think when I came across a note reading:

Serve small cookies
Use cookies for identifiers, not entire objects. Keep session data on the server, where it can't be altered by a malicious client.

There’s nothing fundamentally wrong with this advice, other than that it chimes with a problem I’m currently facing, and I don’t like any of the usual solutions.

Sessions either reside in some sort of stateful pool (persistent database, session-management server, replicated memory etc.) or, more commonly, exist stand-alone within each node of a cluster. In either case load-balancing is needed to route requests to the home node where the session exists (delays in replication mean you can’t go to just any node, even when a stateful pool is used). Such load-balancing is performed by a network load-balancer, reverse proxy, web server (mod_proxy, the WebSphere plugin etc.) or application server, and can use numerous different algorithms: IP-based routing, round-robin, least-connections etc.
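To make the affinity problem concrete, here’s a minimal sketch (hypothetical node names, not any particular load-balancer’s implementation) contrasting round-robin, which scatters a client across nodes and so breaks in-memory sessions, with IP-hash routing, which pins each client to a “home” node:

```python
import hashlib
from itertools import cycle

# Hypothetical cluster of three nodes.
NODES = ["node-a", "node-b", "node-c"]

_rr = cycle(NODES)

def route_round_robin() -> str:
    """Each request goes to the next node in turn - in-memory sessions break."""
    return next(_rr)

def route_ip_hash(client_ip: str) -> str:
    """Hash the client IP so repeat requests hit the same 'home' node."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return NODES[int.from_bytes(digest[:4], "big") % len(NODES)]

# The same client is always routed to the same node:
assert route_ip_hash("203.0.113.7") == route_ip_hash("203.0.113.7")
```

The catch, of course, is exactly what follows: the moment a client’s home node dies, its session dies with it.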

So in my solution I now need some sort of load-balancer – more components, joy! But even worse, it’s creating havoc with reliability. Each time a node fails I lose all sessions on that server (unless I plump for a session-management server, which I need like a hole in the head). And nodes fail all the time… (think cloud, autoscaling and hundreds of nodes).

So now I’m going to kind-of break that treasured piece of advice from Michael and create larger cookies (more likely request parameters), including in them some ever-so-slightly-sensitive details which I really shouldn’t. I should point out this isn’t as criminal as it sounds.

Firstly, the data really isn’t that sensitive. It’s essentially routing information that needs to be remembered between requests – not my credit card details.

Secondly, it’s still very small – a few bytes or so – and I’d probably not worry too much until it gets to around 2K+ (some profiling required here, I suspect).

Thirdly, there are other ways to protect the data – notably encryption and hashing. If I don’t want the client to be able to read it then I’ll encrypt it. If I don’t mind the client reading the data but want to make sure it hasn’t been tampered with, I’ll use an HMAC instead. A JSON Web Token-like format should work well in most cases.

Now I can have no session on the back-end servers at all, but instead need to decrypt (or verify the hash of) and decode a token on each request. If a node fails I don’t care (much) as any other node can handle the same request, and my load-balancing can be as dumb as I wish.
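The HMAC variant is only a few lines. A minimal sketch, JWT-ish in shape (base64 payload, dot, base64 signature) but not a compliant JWT – and the hard-coded key is purely illustrative, which is rather the point of the key-management moan below:

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"example-secret"  # illustrative only; real keys need secure distribution

def sign_token(payload: dict) -> str:
    """Encode routing state as base64(JSON) + '.' + base64(HMAC-SHA256)."""
    body = base64.urlsafe_b64encode(json.dumps(payload).encode())
    mac = hmac.new(SECRET_KEY, body, hashlib.sha256).digest()
    return (body + b"." + base64.urlsafe_b64encode(mac)).decode()

def verify_token(token: str) -> dict:
    """Recompute the HMAC; reject anything the client has tampered with."""
    body, mac = token.encode().rsplit(b".", 1)
    expected = hmac.new(SECRET_KEY, body, hashlib.sha256).digest()
    if not hmac.compare_digest(base64.urlsafe_b64decode(mac), expected):
        raise ValueError("token has been tampered with")
    return json.loads(base64.urlsafe_b64decode(body))

token = sign_token({"shard": "eu-west", "step": 3})
assert verify_token(token) == {"shard": "eu-west", "step": 3}
```

Any node holding the key can verify any token, which is exactly what lets the load-balancing be dumb.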

I’ve sacrificed performance for reliability – both in terms of computational effort server-side and in terms of network payload – and simplified the overall topology to boot. CPU cycles are getting pretty cheap now though, and this pattern should scale both horizontally and vertically – time for some testing… The network penalty isn’t so cheap, but again it should be acceptable, and if I avoid using “cookies” for the token then I can at least avoid the overhead of sending it with every single request.

It also means that in a network of micro-services, so long as each service propagates these tokens, the rather thornier routing problem in that sort of environment virtually disappears.

I do though now have a key-management problem. Somewhere, somehow, I need to store the keys securely whilst distributing them to every node in the cluster… oh, and don’t mention key-rotation…
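Since I mentioned it anyway: one common mitigation is to prefix each signature with a key ID, so that old tokens stay verifiable while new ones are signed with the latest key. Rotation then means adding a key and, later, retiring the oldest. A sketch (the key IDs and keys here are invented, and the distribution problem is very much still unsolved):

```python
import hashlib
import hmac

# Hypothetical key ring: old key kept around so in-flight tokens still verify.
KEYS = {"k1": b"old-key", "k2": b"current-key"}
CURRENT_KEY_ID = "k2"

def sign(data: bytes) -> str:
    """Sign with the current key, prefixing the key ID."""
    mac = hmac.new(KEYS[CURRENT_KEY_ID], data, hashlib.sha256).hexdigest()
    return f"{CURRENT_KEY_ID}.{mac}"

def verify(data: bytes, signature: str) -> bool:
    """Look up the key by ID; unknown (retired) IDs simply fail verification."""
    key_id, mac = signature.split(".", 1)
    key = KEYS.get(key_id)
    if key is None:  # key already retired - client must re-establish state
        return False
    expected = hmac.new(key, data, hashlib.sha256).hexdigest()
    return hmac.compare_digest(mac, expected)

assert verify(b"routing-state", sign(b"routing-state"))
```

Retiring a key then just invalidates whatever tokens were signed with it – acceptable for routing state, less so for anything precious.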

Internet Scale Waste

Whilst reading up on internet scale computing I came across a presentation on Slideshare which contains the page below.

[SoftLayer slide]

23 million domains for 24,000 customers = just under 1,000 domains per customer. Now that seems like a lot, but I strongly suspect it’s more like most customers having one domain with a few having many, many thousands (something akin to a Zipf distribution). Likely someone out there has many hundreds of thousands of domains… I wonder who needs so many domains…
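A crude back-of-the-envelope check of that hunch: spread the 23 million domains over the 24,000 customers with the simplest Zipf-like weighting (customer at rank r holds a share proportional to 1/r – just one of many possible shapes, and purely illustrative) and see how lopsided ownership gets:

```python
# Zipf-flavoured sketch of domain ownership skew (figures from the slide).
CUSTOMERS = 24_000
DOMAINS = 23_000_000

# Normalising constant: the harmonic number H(CUSTOMERS).
harmonic = sum(1 / r for r in range(1, CUSTOMERS + 1))

def share(rank: int) -> float:
    """Fraction of all domains held by the customer at this rank."""
    return (1 / rank) / harmonic

top_customer = DOMAINS * share(1)
top_one_percent = DOMAINS * sum(share(r) for r in range(1, CUSTOMERS // 100 + 1))

print(f"top customer:  ~{top_customer:,.0f} domains")
print(f"top 1% hold:   {top_one_percent / DOMAINS:.0%} of all domains")
```

Even this simplistic shape puts millions of domains with the top customer and well over half of the total with the top 1% – the average of “just under 1,000 each” tells you almost nothing.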

As an aside, ahem, I get a lot of comments from people purporting to be from something like www.hu12gyd38hasjakdh8102e12e2djklasdagghkagqdncc.com, all of which turn out to be spammers. Hmmm… I wonder how much spam/botware/malware waste resides in the cloud…?


Chaos Monkey

I’ve had a number of discussions in the past about how we should be testing failover and recovery procedures on a regular basis – to make sure they work and everyone knows what to do, so you’re not caught out when it happens for real (which will be at the worst possible moment). Scheduling these tests, even in production, is (or should be) possible at some convenient(’ish) time. If you think it isn’t, then you’ve already got a resiliency problem (you’re out when some component fails) as well as a maintenance problem.

I’ve also talked (ok, muttered) about how a healthy injection of randomness can actually improve stability, resilience and flexibility. Something covered by Nassim Taleb in his book Antifragile.

Anyway, beaten to the punch again: Netflix developed a tool called Chaos Monkey (part of their Simian Army) a few years back which randomly kills elements of the infrastructure to help identify weak points. Well worth checking out the write-up on codinghorror.com.

For the record… I’m not advocating that you use Chaos Monkey in production… just that it’s a good way to test the resiliency of your environment and identify potential failure points. You should be testing procedures in production in a more structured manner.
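The core idea is simple enough to sketch in a few lines. This toy version (emphatically not Netflix’s actual tool; all names and parameters are invented) just rolls a die each “tick” and, if it comes up, terminates one instance at random – the test being whether the remaining fleet still carries the load:

```python
import random

def chaos_tick(instances, kill_probability=0.1, rng=None):
    """One chaos round: maybe kill a random instance; return the survivors."""
    rng = rng or random.Random()
    if instances and rng.random() < kill_probability:
        victim = rng.choice(instances)
        print(f"terminating {victim}")
        instances = [i for i in instances if i != victim]
    return instances

fleet = ["web-1", "web-2", "web-3"]
fleet = chaos_tick(fleet, kill_probability=1.0, rng=random.Random(42))
assert len(fleet) == 2  # one instance gone; the service must survive on two
```

The randomness is the point: if you only ever kill the node you expect to fail, you only ever test the failure you already planned for.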

Scaling on a budget

Pre-cloud era. You have a decision to make. Do you define your capacity and performance requirements in the belief that you’ll build the next top-1000 web-site in the world, or start out with the view that you’ll likely build a dud which will be lucky to get more than a handful of visits each day?

If the former then you’ll need to build your own data-centres (redundant globally distributed data-centres). If the latter then you may as well climb into your grave before you start. But most likely you’ll go for something in the middle, or rather at the lower end, something which you can afford.

The problem comes when your site becomes popular. Worse still, when that popularity is temporary. In most cases you’ll suffer something like a slashdot effect for a day or so, which will knock you out temporarily but could trash your image permanently. If you started at the higher end then your problems have probably become terminal (at least financially) already.

It’s a dilemma that every new web-site needs to address.

Post-cloud era. You have a choice – IaaS or PaaS? If you go with infrastructure then you can possibly scale out horizontally by adding more servers when needed. This, though, is relatively slow to provision* since you need to spin up a new server, install your applications and components, add it to the cluster, configure load-balancing, DNS resiliency and so on. Vertical scaling may be quicker but provides limited additional headroom. And this assumes you designed the application to scale in the first place – if you didn’t, then chances are probably 1 in 10 that you’ll get lucky. On the upside, the IaaS solution gives you the flexibility to do your own thing, and your existing legacy applications have a good chance of being made to run in the cloud this way (everything is relative of course).

If you go with PaaS then you’re leveraging (in theory) a platform which has been designed to scale, but which constrains your solution design in doing so. Your existing applications have little chance of running off-the-shelf (actually, no chance at all really), though if you’re lucky some of your libraries may (may!) work depending on compatibility (Google App Engine for Java, or Microsoft Azure for .NET, for example). The transition is more painful with PaaS, but where you gain is in highly elastic scalability at low cost, because it’s designed into the framework.

IaaS is great (this site runs on it): flexible with minimal constraints, low cost, and quick to provision (compared to the pre-cloud world).

PaaS provides a more limited set of capabilities at a low price point and constrains how applications can be built so that they scale and co-host with other users’ applications (introducing multi-tenancy issues).

A mix of these options probably provides the best solution overall depending on individual component requirements and other NFRs (security for example).

Anyway, it traverses the rat’s maze of my mind today due to its relevance in the news… Many Government web-sites have pitiful visitor numbers until they get slashdotted or are placed at #1 on the BBC website – something which happens quite regularly, though most of the time the sites get very little traffic – peaky. Today’s victim is the Get Safe Online site, which collapsed under load – probably as a result of the BBC advertising it. For such sites perhaps PaaS is the way forward.

* I can’t really believe I’m calling IaaS “slow” given provisioning can be measured in minutes and hours when previously you’d be talking days, weeks and likely months…

Cloud Jobs

Cloud is the current buzz in the industry, and various cloud service-providers are jockeying for the #1 position. Beyond the hype and bravado I’ve been wondering who is really taking the lead, because from my point of view it feels like it’s down to Amazon and Google.

So I searched a few job-sites to see which cloud service providers are cited as requirements for positions; the results are below.

[Chart: cloud job-search results, 2014-04-24]


Lots of “Cloud” jobs, and AWS (Amazon Web Services) occurs quite frequently, with Azure (Microsoft) and Rackspace relatively hot (compared to OpenShift, SoftLayer and Oracle Cloud). Google App Engine (GAE) gets a few hits, whilst the general search for “Google” (which covers “Google Apps” and so much more), if included, would bring the search results into a comparable position to AWS – but this is too general to count as “Cloud” so I’ve excluded it here. Google Compute Engine got no hits.

So Cloud is big, Amazon is #1 (currently) and Azure is pretty popular, which shouldn’t be much of a surprise from the enterprise perspective. That OpenStack has a presence compared to end service-providers such as SoftLayer (IBM) and OpenShift (Red Hat) indicates that there’s work in the open-source cloud space, which is good to see (AFAIC), and some of this looks to be in building private clouds. But the lack of any hits for SoftLayer, OpenShift or Oracle Cloud is a bit of a surprise – I’d have thought someone would be after skills in this stuff. Anyway, my somewhat unscientific reckoning as to where we are, based on a very small and selective sample of data, is:

  1. The notion that “Amazon=Cloud” is hard to shift and the rest look to be rather slow to the party.
  2. Microsoft Azure is the preferred option for many enterprises who have a historic investment in all things MS and .NET.
  3. Google may be late to the IaaS party, but since the net is the lifeblood of Google I suspect that in the wider context of “cloud” they’ll probably do OK (they’ve also got a hell of a lot of compute capacity lying around).
  4. Open-source cloud has a comparatively strong position compared to where OSS usually is (i.e. as the lowest cost option when you get down to IT as a commodity).
  5. There’s a lot of demand for cloud which doesn’t have any of these big cloud service providers as a requirement so the space for competition should be pretty hot despite this apparent Amazon/Microsoft duopoly.

OK, it’s hardly scientific, and the scope of these service providers varies significantly, so comparison is perhaps unfair. There’s also the fact that some search results are of the form “… help us move from X to Y”, which yields hits on both X and Y; though skills are required in both, it’s really Y that should count. It’s also a very narrow selection of jobs in Britain today and says nothing about the rest of the world or what’s already in use. Anyway, for this evening it’s answered my question and I’ll be reading up on my AWS, Azure and OpenStack to keep my skills current this weekend… 🙂

For the record, the job sites searched were jobserve.com, monster.co.uk, and totaljobs.com.