Scaling the Turd

It has been quite some time since my last post… mainly because I’ve spent an inordinate amount of time trying to get an application scaling and performing as needed. We’ve done it, but I’m not happy.

Not happy, in part because of the time it’s taken, but mainly because the solution is totally unsatisfactory. It’s an off-the-shelf (COTS) package, so making changes beyond a few “customisations” is out of the question, and the supplier has been unwilling to accept that the problem is within the product, instead pointing to our “environment” and “customisations” – of which, IMHO, neither is particularly odd.

At root there are really two problems.

One – a single JVM can only handle 10 tps (transaction/page requests per second), according to the supplier. Our requirement is around 50.

Two – performance degrades over time to unacceptable levels if a JVM is stressed hard, so 10 tps realistically becomes more like 1 to 2 tps after a couple of days of soak testing.
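The sizing arithmetic behind those two problems is worth spelling out. A minimal sketch (the 10 tps rated figure and the 2 tps degraded figure are from the testing described here; the ceiling division is just the obvious way to size the JVM count):

```java
public class JvmSizing {
    // JVMs needed to hit a target throughput, given per-JVM capacity (ceiling division).
    public static int jvmsNeeded(int targetTps, int perJvmTps) {
        return (targetTps + perJvmTps - 1) / perJvmTps;
    }

    public static void main(String[] args) {
        // At the supplier's rated 10 tps per JVM, 50 tps needs 5 JVMs...
        System.out.println(jvmsNeeded(50, 10)); // 5
        // ...but at the ~2 tps seen after a couple of days of soak, it needs 25.
        System.out.println(jvmsNeeded(50, 2));  // 25
    }
}
```

Which is roughly how a 5-JVM estate balloons into a 25-to-30-JVM one.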

So we’ve done a lot of testing – stupid amounts of testing! Over and over, tweaking this and that, changing connection and thread pools, JVM settings, memory allocations etc., with pretty much no luck. We’ve checked the web servers, the database and the queue configuration (itself an abomination of a setup); the CPU is idle, memory is plentiful, garbage collection is working a treat, disk I/O is non-existent and network I/O is measured in KB/sec. Nada! Except silence from the supplier…

We’ve also taken thread dumps and can see stuck threads and lock contention, so we know roughly where the problem lies. We passed this on to the supplier but, still, silence…
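To make the thread-dump finding concrete, here’s a minimal, self-contained sketch of the pattern a dump reveals: one thread sitting BLOCKED on a monitor that another thread holds. All the names here are mine for illustration – this is the shape of the problem, not the product’s code:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class ContentionDemo {
    private static final Object LOCK = new Object();

    // Spawns a thread that grabs LOCK and sits on it, then a second thread that
    // tries to enter the same synchronized block. Returns true once the second
    // thread is observed in BLOCKED state - exactly what a thread dump shows
    // when an application serialises all its work through one lock.
    public static boolean observeBlockedThread() throws InterruptedException {
        CountDownLatch lockHeld = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (LOCK) {
                lockHeld.countDown();
                try { Thread.sleep(5000); } catch (InterruptedException ignored) {}
            }
        }, "lock-holder");
        Thread waiter = new Thread(() -> {
            synchronized (LOCK) { /* can't get here until holder releases */ }
        }, "lock-waiter");

        holder.start();
        lockHeld.await();      // make sure the holder owns the monitor first
        waiter.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        boolean sawBlocked = false;
        for (int i = 0; i < 200 && !sawBlocked; i++) {   // poll for up to ~4s
            ThreadInfo info = mx.getThreadInfo(waiter.getId());
            sawBlocked = info != null && info.getThreadState() == Thread.State.BLOCKED;
            Thread.sleep(20);
        }
        holder.interrupt();    // release the monitor so both threads can finish
        holder.join();
        waiter.join();
        return sawBlocked;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("waiter BLOCKED on monitor: " + observeBlockedThread());
    }
}
```

In a real dump (`jstack <pid>`) you see the same thing as dozens of threads “waiting to lock <0x…>” on the same monitor – which is why throughput flatlines while the CPU stays idle.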

Well, not quite silence. They finally offered that “other customers don’t have these issues” and that “other customers typically run 20+ JVMs”! Excuse me? 20 JVMs is typical? wtf!? So really they’re admitting that the application doesn’t scale within a JVM – that it cannot make use of resources efficiently within a JVM, and that the only way to make it work is to throw more JVMs at it. Sounds to me like a locking issue in the application – one that no doubt gets worse as the application is stressed. Well, at least we have a fix…

This means we’ve ended up with 30 JVMs across 10 servers (VMs) for one component to handle a pathetic 50 tps – something I would expect 2 or 3 servers to handle quite easily given the nature of the application (the content-delivery aspect of a content management system). And the same problem pervades the application’s other components, so we end up with 50 servers (all VMs bar a physical DB cluster) for an application handling 50 tps… This is neither efficient nor cost-effective.

There are also many other issues with the application, including such idiocies as synchronous queueing, a total lack of cache headers (resulting in a stupidly high hit-rate for static resources) and really badly implemented Lucene indexing (closing and reopening indexes continually). It is, by some margin, the worst COTS application I have had the misfortune to come across (I’ll admit I’ve seen worse home-grown ones, so I’m not sure where that leaves us in the buy-v-build argument…).
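One of those issues – the missing cache headers – has a simple, conventional fix: long-lived `Cache-Control` headers on static resources so browsers (and any CDN in front) stop hammering the origin. A hedged sketch using the JDK’s built-in `com.sun.net.httpserver` rather than the product’s actual stack (the path and max-age value are illustrative):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;

public class CacheHeaderDemo {
    // Serve "static" content with a long-lived Cache-Control header and return
    // the header value a client actually sees.
    public static String fetchCacheControl() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/static/", exchange -> {
            byte[] body = "/* static asset */".getBytes();
            // One day of caching for unchanging resources (value is illustrative)
            exchange.getResponseHeaders().add("Cache-Control", "public, max-age=86400");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });
        server.start();
        try {
            URI uri = URI.create("http://localhost:" + server.getAddress().getPort()
                    + "/static/app.css");
            HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
            conn.getResponseCode();   // perform the request
            return conn.getHeaderField("Cache-Control");
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetchCacheControl());
    }
}
```

A one-line header like that can take the bulk of static-resource hits off the application entirely – which matters a lot when each JVM can only manage 10 tps.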

So what’s wrong with having so many JVMs?

Well, cost for a start. Even though we can cram more JVMs onto fewer VMs, we need to step this up in chunks of the RAM required per JVM (around 4GB). So, whilst I’m not concerned about CPU, a 20GB 4vCPU host can really only support 4 JVMs (some space is needed for OS and other process overheads). Lots of tin, doing nothing.
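The host-packing arithmetic, for completeness. A small sketch – the 2GB reserved for OS and other processes is my assumption; the post above only says “some space is needed”:

```java
public class HostPacking {
    // JVMs that fit on a host: reserve headroom for the OS and other processes,
    // then pack fixed-size JVM heaps into what's left (integer division floors).
    public static int jvmsPerHost(int hostRamGb, int perJvmGb, int osOverheadGb) {
        return (hostRamGb - osOverheadGb) / perJvmGb;
    }

    public static void main(String[] args) {
        // 20GB host, ~4GB per JVM, ~2GB (assumed) for the OS -> 4 JVMs per host.
        System.out.println(jvmsPerHost(20, 4, 2)); // 4
    }
}
```

So 30 JVMs for one component means 8 hosts minimum, however idle their CPUs sit.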

But the real issue is maintenance. How the hell do you manage that many JVMs and VMs efficiently? You could use clustering in the application server – oh, except that isn’t supported by the supplier (like I said, the worst application ever!). So we’ve now got monitors and scripts for each JVM and each VM, and when something breaks (… and guess what, with this pile of sh*t, it does) we need to go round each node fixing them one by one.

Anyway, lessons learned: how should we have scaled such an application? What would I do differently now that I know (bar using some completely different product, of course)?

Firstly, I would have combined components where possible. There’s no reason why some of the WARs couldn’t be homed together (despite the supplier’s design suggesting otherwise). This would reduce the number of JVMs and improve the reliability of some components (that queueing mechanism specifically).

Secondly, given we can’t use a real cluster in the app server, we can (now) use containers to package up each component of the application instead. The container then becomes our scaling and maintenance point, and rather than having 50 servers to manage we have 7 or 8 images to maintain (still a lot for such an application). This allows us to scale up or down at the container level much more quickly. The whole application wouldn’t fit this model (the DB in particular would remain as it is), but most of it would.
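The shape of that setup, sketched as a compose-style config. Everything here is hypothetical – image names, ports, replica counts and the 4GB heap are illustrative, not the product’s actual packaging:

```yaml
# Hypothetical compose file: one image per COTS component,
# scaled and maintained at the container level rather than per-VM.
services:
  content-delivery:
    image: internal-registry/cots-delivery:1.0   # hypothetical image name
    deploy:
      replicas: 5            # scale here instead of hand-managing JVMs on VMs
    environment:
      JAVA_OPTS: "-Xmx4g"    # the ~4GB heap each JVM needs
    ports:
      - "8080"
```

Fixing a broken node then becomes “replace the container from the image” rather than logging on to each VM one by one.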

Of course, it doesn’t solve the root cause, unfortunately, but it is a more elegant, maintainable and cheaper solution and, bar eradicating this appalling product from the estate, one that would have been so much more satisfying.

So that’s the project for the summer… work out how to containerise this sort of COTS application, and how to connect, route and scale the containers in a way that is manageable, efficient and cost-effective. Next project please!