Scalable = Horizontal

There are two ways of scaling, vertical and horizontal, but only one of them is really scalable.

Vertical scaling essentially means bigger nodes. If you’ve got 8GB RAM, go to 16GB. If you’ve got 2 cores, go to 4, and so on.

Horizontal scaling means adding more nodes. One node to two nodes, to three and so on.

As a rule, horizontal scaling is good. Theoretically there’s no limit to the number of nodes you can have.

As a rule, vertical scaling is bad. You quickly run into constraints on the number of cores or the amount of RAM you can support, and for many of today’s problems this just doesn’t work. Solutions need to be scalable at internet scale and available 24×7. Relying on large single nodes in such situations is not ideal (and those supercomputers with 250,000+ processors are really horizontal solutions as well).

The problem is, horizontal scaling isn’t trivial. The culprits here are data and networking (plumbing, really). State and caches need to be distributed and available to all nodes. Databases need copies across nodes and need to be kept synchronised. Sharding usually becomes necessary (or you just end up with many very large nodes). And so on… The best bet is to avoid state as much as possible. But once you’ve cracked it you can build much larger solutions more efficiently (commodity hardware, virtualisation, containers etc.) and flex more rapidly than in the vertical world.
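
To illustrate why sharding pushes routing decisions into the application (or a proxy), here’s a minimal sketch of hash-based shard selection. The node names and the simple modulo scheme are invented for the example; a real system would add replication, rebalancing and something like consistent hashing so that adding a node doesn’t remap every key.

```java
import java.util.List;
import java.util.zip.CRC32;

// Minimal sketch: route a record to one of N shards by hashing its key.
// Node names are illustrative; production systems need replication,
// rebalancing and consistent hashing on top of this.
public class ShardRouter {
    private final List<String> shards;

    public ShardRouter(List<String> shards) {
        this.shards = shards;
    }

    public String shardFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(java.nio.charset.StandardCharsets.UTF_8));
        int index = (int) (crc.getValue() % shards.size());
        return shards.get(index);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(List.of("db-node-1", "db-node-2", "db-node-3"));
        System.out.println("customer:42 -> " + router.shardFor("customer:42"));
    }
}
```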

I could go on about how historically the big players love the vertical-scaling thing (think Oracle and IBM trying to sell you those huge servers and SQL database solutions with the $$$ price-tags)… The world found NoSQL solutions which take a fundamentally different approach by accepting that consistency in many cases really isn’t as important as we once thought – and many of these are open-source…

Whatever, there’s only one way to scale… Horizontal!

 

Performance Testing is Easy

Performance testing is easy. We just throw as many requests at the system as we can, as quickly as we can, and measure the result. Job done, right?

tl;dr? Short form…

  1. Understand the user scenarios and define tests. Review the mix of scenarios per test and the type of tests to be executed (peak, stress, soak, flood).
  2. Size and prepare the test environment and data. Consider the location of injectors and servers and mock peripheral services and systems where necessary.
  3. Test the tests!
  4. Execute and monitor everything. Start small and ramp up.
  5. Analyse results, tune, rinse and repeat until happy.
  6. Report the results.
  7. And question to what level of depth performance testing is really required…

Assuming we’ve got the tools and the environments, the execution of performance tests should be fairly simple. The first hurdle, though, is in preparing for testing.

User Scenarios and Test Definitions

In order to test, we first need to understand the sort of user scenarios we’re going to encounter in production and which of them warrant testing. For existing systems we can usually do some analysis on web-logs and the like to figure out what users are actually doing and model those scenarios. For this we may need a year or more of data to see whether there are any seasonal variations and to understand what the growth trend looks like. For new systems we don’t have this data, so we need to make some assumptions and estimates as to what’s really going to happen. We also need to determine which of the scenarios we’re going to model and the transaction rates we want them to achieve.

When we’ve got system users calling APIs or running batch jobs the variability is likely to be low. Human users are a different beast though, and can wander off all over the place doing weird things. Modelling all scenarios can be a lot of effort (which equals a lot of cost), so a risk-based approach is usually required. Considerations here include:

  • Picking the top few scenarios that account for the majority of activity. It depends on the system, but I’d suggest keeping these scenarios down to <5 – the fewer the better so long as it’s reasonably realistic.
  • Picking the “heavy” scenarios which we suspect are most intensive for the system (often batch jobs and the like).
  • Introducing noise to tests to force the system into doing things it wouldn’t be doing normally. This sort of thing can be disruptive (e.g. a forced load of a library not otherwise used may be just enough to push the whole system over the edge in a catastrophic manner).

We next need to consider the relative mix of user scenarios for our tests (60% of users executing scenario A, 30% doing scenario B, 10% doing scenario C etc.) and the combinations of scenarios we want to consider (running scenarios A, B and C vs. A, B and C plus batch job Y).
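
As a minimal sketch of how a load script might implement that mix, here’s a weighted scenario picker. The scenario names and the 60/30/10 split are just the illustrative figures above, not something prescribed by any particular tool.

```java
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ThreadLocalRandom;

// Sketch: pick a scenario per virtual user according to the target mix.
// Weights follow the illustrative 60/30/10 split from the text.
public class ScenarioMix {
    private final NavigableMap<Integer, String> cumulative = new TreeMap<>();
    private int total = 0;

    public ScenarioMix add(String scenario, int weight) {
        total += weight;
        cumulative.put(total, scenario);
        return this;
    }

    public String pick() {
        int r = ThreadLocalRandom.current().nextInt(total); // 0..total-1
        return cumulative.higherEntry(r).getValue();
    }

    public static void main(String[] args) {
        ScenarioMix mix = new ScenarioMix()
                .add("scenarioA", 60)
                .add("scenarioB", 30)
                .add("scenarioC", 10);
        for (int i = 0; i < 5; i++) {
            System.out.println("virtual user " + i + " runs " + mix.pick());
        }
    }
}
```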

Some of these tests may not be executed for performance reasons but for operability – e.g. what happens if my backup runs when I’m at peak load? or what happens when a node in a cluster fails?

We also need test data.

For each scenario we should be able to define the test data requirements. This is stuff like user-logins, account numbers, search terms etc.

Just getting 500 test user logins set up can be a nightmare. The associated test authentication system may not have the capacity to handle the number of logins or accounts, and we may need to mock it out. It’s all too common for peripheral systems not to be in a position to support performance testing as we’d like, and in any case we may want something more reliable when testing. For any mock services we do decide to build, we need to work out how they should respond and what their performance should look like (it’s no good having a mock service return in 0.001 seconds when the real thing takes 1.8 seconds).
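
For illustration, here’s a minimal sketch of a mock HTTP endpoint that injects a realistic delay before responding. The 1.8 second figure is just the example above, and the path, port and payload are invented.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Sketch: a stand-in for a slow peripheral service (e.g. authentication).
// It sleeps to mimic the real service's observed response time so the
// system under test sees realistic downstream latency.
public class MockAuthService {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        server.createContext("/auth/login", exchange -> {
            try {
                Thread.sleep(1800); // match the real service's ~1.8s response time
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            byte[] body = "{\"token\":\"dummy\"}".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        System.out.println("Mock auth service listening on :8081");
    }
}
```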

Account numbers have security implications and we may need to create dummy data. Search terms, especially from humans, can be wild and wonderful – returning millions of records, or zero, in place of the expected handful.

In all cases we need to prepare the environment based on the test data we’re going to use and size it correctly. Size it? Well, if production is going to have 10 million records it’s not much good testing with 100! Copies of production data, possibly obfuscated, can be useful for existing systems. For new systems, though, we need to create the data. Here be dragons. The distribution of randomly generated data almost certainly won’t match that of real data – there are far more instances of surnames like Smith, Jones, Taylor, Williams or Brown than there are like Zebedee. If the distribution isn’t correct then the test may be invalid (e.g. we may hit one shard or tablespace, and its associated nodes and disks, too little or too much).
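
As a rough sketch of generating test data with a more realistic skew, the snippet below draws surnames from a skewed distribution rather than uniformly. The name list and the cubic skew are invented for illustration, not taken from real census data.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ThreadLocalRandom;

// Sketch: skew generated surnames towards the common ones rather than
// drawing uniformly, so shards/indexes see a more production-like spread.
// The name list and the cubic skew are illustrative only.
public class TestDataGenerator {
    private static final List<String> SURNAMES = List.of(
            "Smith", "Jones", "Taylor", "Williams", "Brown",
            "Davies", "Evans", "Wilson", "Thomas", "Zebedee");

    static String randomSurname() {
        double u = ThreadLocalRandom.current().nextDouble();
        // Cubing pushes most picks towards the start of the list (the common names).
        int index = (int) (Math.pow(u, 3) * SURNAMES.size());
        return SURNAMES.get(index);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < 100_000; i++) {
            counts.merge(randomSurname(), 1, Integer::sum);
        }
        counts.forEach((name, n) -> System.out.printf("%-10s %d%n", name, n));
    }
}
```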

I should point out here that there’s a shortcut for some situations. For existing systems with little in the way of stringent security requirements, no real functional changes and idempotent requests – think application upgrades or hardware migrations of primarily read-only websites – replaying the legacy web-logs may be a valid way to test. It’s cheap, quick and simple – if it’s viable.

We should also consider the profile and type of tests we want to run. Each test profile has three parts: the ramp-up time (how long it takes to get to the target volume), the steady-state time (how long the test runs at that level) and the ramp-down time (how quickly we close the test – we usually care little for this and can close the test down quickly, but in some cases we want a nice clean shutdown). These profiles are sketched as data after the list below. In terms of test types there are:

  • Peak load test – Typically a 1 to 2 hr test at peak target volumes. e.g. Ramp-up 30 minutes, steady-state 2hrs, ramp-down 5 mins.
  • Stress test – A longer test continually adding load beyond peak volumes to see how the system performs under excessive load and potentially where the break point is. e.g. Ramp-up 8 hrs, steady-state 0hrs, ramp-down 5 mins.
  • Soak test – A really long test running for 24hrs or more to identify memory leaks and the impact of peripheral/scheduled tasks. e.g. Ramp-up 30 mins, steady-state 24hrs, ramp-down 5 mins.
  • Flood test (aka Thundering Herd) – A short test where all users arrive in a very short period. In this scenario we can often see chaos ensue initially but the environment settling down after a short period. e.g. Ramp-up 0mins, steady-state 2hrs, ramp-down 5 mins
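
As a hedged sketch, the profiles above can be captured as simple data for whichever tool or harness drives the test. The Duration values mirror the examples in the list and aren’t tied to any particular product.

```java
import java.time.Duration;
import java.util.List;

// Sketch: the test profiles from the list above expressed as plain data.
// A driver (JMeter plugin, custom harness, etc.) would read these to shape load.
public class TestProfiles {
    record Profile(String name, Duration rampUp, Duration steadyState, Duration rampDown) {}

    static final List<Profile> PROFILES = List.of(
            new Profile("peak",   Duration.ofMinutes(30), Duration.ofHours(2),  Duration.ofMinutes(5)),
            new Profile("stress", Duration.ofHours(8),    Duration.ZERO,        Duration.ofMinutes(5)),
            new Profile("soak",   Duration.ofMinutes(30), Duration.ofHours(24), Duration.ofMinutes(5)),
            new Profile("flood",  Duration.ZERO,          Duration.ofHours(2),  Duration.ofMinutes(5)));

    public static void main(String[] args) {
        PROFILES.forEach(p -> System.out.printf("%-6s ramp-up %s, steady %s, ramp-down %s%n",
                p.name(), p.rampUp(), p.steadyState(), p.rampDown()));
    }
}
```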

So we’re now ready to script our tests. We have the scenarios, we know the transaction volumes, we have test data, our environment is prep’d and we’ve mocked out any peripheral services and systems.

Scripting

There are many test tools available, from the free Apache JMeter and Microsoft web stress tools, to commercial products such as HP LoadRunner and Rational Performance Tester, to cloud-based solutions such as Soasta or Blitz. Which tool we choose depends on the nature of the application and our budget. Cloud tools are great if we’re hosting in the public cloud, not so good if we’re an internal service.

The location of the load injectors (the servers which run the actual tests) is also important. If these are sitting next to the test server we’ll get different results than if the injector is running on someone’s laptop connected via a VPN tunnel over a 256kbit ADSL line somewhere in the Scottish Highlands. Which case is more appropriate will depend on what we’re trying to test and where we consider the edge of our responsibility to lie. We have no control over the sort of devices and connectivity internet users have, so perhaps our responsibility stops at the point of ingress into our network? Or perhaps it’s a corporate network and we’re only concerned with the point of ingress into our servers? We do need to design and work within these constraints, so measuring and managing page weight and latency is always a concern, but we don’t want the complexity of all that “stuff” out there which isn’t our responsibility weighing us down.

Whichever tool we choose, we can now complete the scripting and get on with testing.

Testing

Firstly, check everything is working. Run the scripts with a single user for 20 minutes or so to ensure things are responding as expected and that the transaction load is correct. This will ensure that as we add more users we’re scaling as we want and that the scripts aren’t themselves defective. We then ramp the tests up fairly quickly – 1 user, 10 users, 100 users etc. This helps to identify any concurrency problems early on, with fewer users than expected (too many users can add too much noise and make it hard to see what’s really going on).
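
A minimal sketch of that kind of ramp using the JDK’s HttpClient is below. The target URL, user counts, request counts and pacing are invented for illustration; a real run would go through the chosen test tool.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: ramp virtual users up in steps (1, 10, 100) against a test endpoint,
// printing status codes and elapsed time so early concurrency problems show up
// before the full-volume runs.
public class RampUpCheck {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();
    private static final URI TARGET = URI.create("https://test.example.com/health"); // illustrative URL

    static void virtualUser(int requests) {
        HttpRequest request = HttpRequest.newBuilder(TARGET).GET().build();
        for (int i = 0; i < requests; i++) {
            long start = System.nanoTime();
            try {
                HttpResponse<Void> response = CLIENT.send(request, HttpResponse.BodyHandlers.discarding());
                long millis = (System.nanoTime() - start) / 1_000_000;
                System.out.println(response.statusCode() + " in " + millis + "ms");
            } catch (Exception e) {
                System.out.println("request failed: " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        for (int users : new int[] {1, 10, 100}) {
            System.out.println("=== ramping to " + users + " concurrent users ===");
            ExecutorService pool = Executors.newFixedThreadPool(users);
            for (int u = 0; u < users; u++) {
                pool.submit(() -> virtualUser(50));
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.MINUTES);
        }
    }
}
```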

If we’ve an existing system, once we know the scripts work we’ll want to get a baseline from the legacy system to compare against. This means running the tests on the legacy system. What? Hang on! This means we need another instance of the system available, running the old codebase, with similar test data and similar – but possibly not identical – scripts! Yup. That it does.

If we’ve got time-taken logging enabled (%D for Apache mod_log_config) then we could get away with comparing the old production response times with the new system, so long as we’re happy the environments are comparable (same OS, same types of nodes, same spec, same topology, though not necessarily the same scale in terms of numbers of servers) and that the results are showing the same thing (which depends on what upstream network connectivity is being used). But really, a direct comparison of test results is better – comparing apples with apples.

We also need to consider what to measure and monitor. We are probably interested in:

  • For the test responses:
    • Average, max, min and 95th percentile for the response time per request type (see the sketch after this list).
    • Average, max, min size for page weight.
    • Response codes – 20x/30x probably good, lots of 40x/50x suggests the test or servers are broken.
    • Network load and latency.
  • For the test servers:
    • CPU, memory, disk and network utilisation throughout the test run.
    • Key metrics from middleware: queue depths, cache-hit rates, JVM garbage collection (note that JVM memory will look flat at the server level, so some JVM monitoring tools are needed). These will vary depending on the middleware, and for databases we’ll want a DBA to advise on what to monitor.
    • Number of sessions.
    • Web-logs and other log files.
  • For the load injectors:
    • CPU, memory, disk and network utilisation throughout the test run. Just to make sure it’s not the injectors that are overstretched.

And finally we can test.

Analysis and Tuning

It’s important to verify that the test achieved the expected transaction rates and usage profiles. Reviewing log files to ensure there were no errors, and web-logs to confirm transaction rates and request types, helps verify that all was correct before we start to review response times and server utilisation.

We can then go through the process of correlating test activity with utilisation, identifying problems and limits near capacity (JVM memory for example), and extrapolating for production – for which some detailed understanding of how the system scales is required.

It’s worth noting that whilst tests rarely succeed first time, in my experience it’s just as likely to be an issue with the test as it is with the system itself. It’s therefore necessary to plan to execute tests multiple times. A couple of days is normally not sufficient for proper performance testing.

All performance test results should be documented for reporting and future needs. Already having an understanding of why certain changes were made, and a baseline to compare against the next time the tests are run, is invaluable. It’s not war-and-peace – just a few pages of findings in a document or wiki. Most test tools will also export the results to a PDF which can be attached to keep track of the detail.

Conclusion?

This post is already too long, but one thing to question is… is it worth the effort?

Something like a Zipf distribution applies to system load: few systems really have that significant a load, and most handle a few transactions a second, if that. I wouldn’t suggest “no performance testing”, but I would suggest sizing the effort according to the criticality and expected load. Getting a few guys in the office to hit F5 whilst we eyeball the CPU usage may well be enough. In code we can also include timing metrics in unit tests and execute these a few thousand times in a loop to see if there’s any cause for concern (a rough sketch follows below). Getting the engineering team to consider and monitor performance early on can help avoid issues later and reduce the need for multiple performance test iterations.
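
A hedged sketch of that kind of unit-level timing check is below; the function under test, iteration count and 5ms-per-call budget are all illustrative.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertTrue;

// Sketch: a crude timing check inside a unit test. The function under test,
// iteration count and per-call budget are illustrative, not prescriptive.
class PricingEngineTimingTest {

    private long calculatePrice(int basketSize) {
        // Stand-in for the real, possibly expensive, function under test.
        long total = 0;
        for (int i = 0; i < basketSize; i++) {
            total += (i * 31L) % 97;
        }
        return total;
    }

    @Test
    void priceCalculationStaysWithinBudget() {
        int iterations = 10_000;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            calculatePrice(1_000);
        }
        double avgMillis = (System.nanoTime() - start) / 1_000_000.0 / iterations;
        System.out.printf("average %.3fms per call%n", avgMillis);
        assertTrue(avgMillis < 5.0, "average call time exceeded 5ms budget: " + avgMillis);
    }
}
```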

Critical systems with complex transactions or an expected high load (which, as a rough guide, I’d say is anything around 10tps or more) should be tested more thoroughly. Combining capacity needs with operational needs informs the decision – four 9s and 2k tps is the high end in my experience – and a risk-based approach should always be used when considering performance testing.

Session Abolition

I’ve been going through my bookcase (on orders from a higher being) to weed out old, redundant books and make way for… well, I’m not entirely sure what, but anyway, it’s not been very successful.

I came across an old copy of Release It! by Michael T. Nygard and started flicking through, chuckling occasionally as memories (good and bad) surfaced. It’s an excellent book but made me stop and think when I came across a note reading:

Serve small cookies
Use cookies for identifiers, not entire objects. Keep session data on the server, where it can't be altered by a malicious client.

There’s nothing fundamentally wrong with this, other than that it chimes with a problem I’m currently facing, and I don’t like any of the usual solutions.

Sessions either reside in some sort of stateful pool (persistent database, session management server, replicated memory etc.) or, more commonly, exist stand-alone within each node of a cluster. In either case load-balancing is needed to route requests to the home node where the session exists (delays in replication mean you can’t go to just any node, even when a stateful pool is used). Such load-balancing is performed by a network load-balancer, reverse proxy, web-server (mod_proxy, WebSphere plugin etc.) or application server, and can work using numerous different algorithms: IP-based routing, round-robin, least-connections etc.

So in my solution I now need some sort of load-balancer – more components, joy! But even worse, it’s creating havoc with reliability. Each time a node fails I lose all the sessions on that server (unless I plump for a session-management server, which I need like a hole in the head). And nodes fail all the time… (think cloud, autoscaling and hundreds of nodes).

So now I’m going to kind-of break that treasured piece of advice from Michael and create larger cookies (more likely request parameters) and include in them some ever-so-slightly-sensitive details which I really shouldn’t. I should point out this isn’t as criminal as it sounds.

Firstly the data really isn’t that sensitive. It’s essentially routing information that needs to be remembered between requests – not my credit card details.

Secondly, it’s still very small – a few bytes or so, though I’d probably not worry too much until it gets to around 2K+ (some profiling required here I suspect).

Thirdly, there are other ways to protect the data – notably encryption and hashing. If I don’t want the client to be able to read it then I’ll encrypt it. If I don’t mind the client reading the data but want to make sure it’s not been tampered with, I’ll use an HMAC instead. A JSON Web Token-like format should work well in most cases.

Now I can have no session on the back-end servers at all, but instead need to decrypt (or verify the hash of) and decode a token on each request. If a node fails I don’t care (much) as any other node can handle the same request, and my load balancing can be as dumb as I wish.
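
A minimal sketch of that HMAC-protected token using only the JDK is below. The payload layout and key handling are illustrative (a real JWT library would add headers, expiry, key rotation and a constant-time comparison).

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch: a tamper-evident (not encrypted) routing token in the spirit of a JWT.
// The payload format and key handling are illustrative only.
public class RoutingToken {
    private static final byte[] KEY = "replace-with-a-real-shared-secret".getBytes(StandardCharsets.UTF_8);

    private static String hmac(String data) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(KEY, "HmacSHA256"));
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(mac.doFinal(data.getBytes(StandardCharsets.UTF_8)));
    }

    static String issue(String routingInfo) throws Exception {
        String payload = Base64.getUrlEncoder().withoutPadding()
                .encodeToString(routingInfo.getBytes(StandardCharsets.UTF_8));
        return payload + "." + hmac(payload);
    }

    static String verify(String token) throws Exception {
        String[] parts = token.split("\\.");
        // Note: a real implementation would use a constant-time comparison here.
        if (parts.length != 2 || !hmac(parts[0]).equals(parts[1])) {
            throw new SecurityException("token has been tampered with");
        }
        return new String(Base64.getUrlDecoder().decode(parts[0]), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        String token = issue("region=eu-west;backend=orders-v2"); // illustrative routing info
        System.out.println("token: " + token);
        System.out.println("decoded on next request: " + verify(token));
    }
}
```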

I’ve sacrificed performance for reliability – both in terms of computational effort server-side and in terms of network payload – and made some simplifications to the overall topology to boot. CPU cycles are getting pretty cheap now though, and this pattern should scale horizontally and vertically – time for some testing… The network penalty isn’t so cheap, but again should be acceptable, and if I avoid using cookies for the token then I can at least avoid the overhead of sending it on every single request.

It also means that in a network of micro-services, so long as each service propagates these tokens around, the thornier routing problem in that sort of environment virtually disappears.

I do though now have a key management problem. Somewhere, somehow I need to store the keys securely whilst distributing them to every node in the cluster… oh and don’t mention key-rotation…

Scaling on a budget

Pre-cloud era: you have a decision to make. Do you define your capacity and performance requirements in the belief that you’ll build the next top-1000 website in the world, or start out with the view that you’ll likely build a dud which will be lucky to get more than a handful of visits each day?

If the former then you’ll need to build your own data-centres (redundant, globally distributed data-centres). If the latter then you may as well climb into your grave before you start. But most likely you’ll go for something in the middle, or rather at the lower end – something which you can afford.

The problem comes when your site becomes popular. Worse still, when that popularity is temporary. In most cases you’ll suffer something like a slashdot effect for a day or so which will knock you out temporarily but could trash your image permanently. If you started at the higher end then your problems have probably become terminal (at least financially) already.

It’s a dilemma that every new web-site needs to address.

Post-cloud era: you have a choice – IaaS or PaaS? If you go with infrastructure then you can scale out horizontally by adding more servers when needed. This though is relatively slow to provision* since you need to spin up a new server, install your applications and components, add it to the cluster, configure load-balancing, DNS resiliency and so on. Vertical scaling may be quicker but provides limited additional headroom. And this assumes you designed the application to scale in the first place – if you didn’t then the chances are maybe 1 in 10 that you’ll get lucky. On the up side, the IaaS solution gives you the flexibility to do your own thing, and your existing legacy applications have a good chance of being made to run in the cloud this way (everything is relative of course).

If you go with PaaS then you’re leveraging (in theory) a platform which has been designed to scale but which constrains your solution design in doing so. Your existing applications have little chance of running off-the-shelf (actually, no chance at all really), though if you’re lucky some of your libraries may (may!) work depending on compatibility (Google App Engine for Java, Microsoft Azure for .NET, for example). The transition is more painful with PaaS, but where you gain is in highly elastic scalability at low cost, because it’s designed into the framework.

IaaS is great (this site runs on it), is flexible with minimal constraints, low cost and can be provisioned quickly (compared to the pre-cloud world).

PaaS provides a more limited set of capabilities at a low price point and constrains how applications can be built so that they scale and co-host with other users’ applications (introducing multi-tenancy issues).

A mix of these options probably provides the best solution overall depending on individual component requirements and other NFRs (security for example).

Anyway, it traverses the rat’s maze of my mind today due to relevance in the news… Many Government websites have pitiful visitor numbers until they get slashdotted or are placed at #1 on the BBC website – something which happens quite regularly, though most of the time the sites get very little traffic: peaky. Today’s victim is the Get Safe Online site, which collapsed under load – probably as a result of the BBC advertising it. For such sites perhaps PaaS is the way forward.

* I can’t really believe I’m calling IaaS “slow” given provisioning can be measured in the minutes and hours when previously you’d be talking days, weeks and likely months…

MongoDB Write Concern Performance

MongoDB is a popular NoSQL database which scales to very significant volumes through sharding and can provide resiliency through replica sets. MongoDB doesn’t support the sort of transaction isolation that you might get with a more traditional database (read committed, dirty reads etc.) and works at the document level as an atomic transaction (it’s either inserted/updated, or it’s not) – you cannot have a transaction spanning multiple documents.

What MongoDB does provide is called “Write Concern”, which offers some assurance over whether the transaction was safely written or not.

You can store a document and request “acknowledgement” (or not), whether the document was replicated to any replica-sets (for resiliency), whether the document was written to the transaction log etc. There’s a very good article on the details of Write-Concern over on the MongoDB site. Clearly the performance will vary depending on the options chosen and the Java driver supports a wide range of these:

  • ACKNOWLEDGED
  • ERRORS_IGNORED
  • FSYNCED
  • FSYNC_SAFE
  • JOURNALED
  • JOURNAL_SAFE
  • MAJORITY
  • NONE
  • NORMAL
  • REPLICAS_SAFE
  • REPLICA_ACKNOWLEDGED
  • SAFE
  • UNACKNOWLEDGED

So for a performance comparison I fired up a small 3-node MongoDB cluster (2 database servers, 1 arbiter) and ran a script to store 100 documents in the database using each of the available write concerns to see what the difference is. The database was cleaned down each time (to zero, so overall it is very small).
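
The original script isn’t reproduced here, but a minimal sketch of the same kind of loop using the legacy 2.x Java driver (which is where constants like SAFE and REPLICAS_SAFE come from) might look like this. Host, database and collection names are illustrative.

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.WriteConcern;

// Sketch: time 100 inserts under a few write concerns using the legacy
// 2.x Java driver. Connection details are illustrative.
public class WriteConcernTimer {
    public static void main(String[] args) {
        MongoClient mongo = new MongoClient("mongo-primary", 27017);
        DB db = mongo.getDB("perftest");
        DBCollection collection = db.getCollection("docs");

        WriteConcern[] concerns = {
                WriteConcern.UNACKNOWLEDGED,
                WriteConcern.ACKNOWLEDGED,
                WriteConcern.REPLICA_ACKNOWLEDGED,
                WriteConcern.JOURNALED
        };

        for (WriteConcern concern : concerns) {
            collection.drop(); // clean down between runs, as in the test
            long start = System.nanoTime();
            for (int i = 0; i < 100; i++) {
                collection.insert(new BasicDBObject("seq", i).append("payload", "x"), concern);
            }
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(concern + ": " + millis + "ms for 100 inserts");
        }
        mongo.close();
    }
}
```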

WARNING: Performance testing is highly dependent upon the environment in which it is run. These results are based on a dev/test environment running three guests on the same host node; they may not be representative for you and exist only to provide a comparison.

The results for all modes are shown below and show three relatively distinct clusters.

[Figure: Write Concern – All Modes]

Note: The initial run in all cases incurs a start-up cost and hence appears slower than normal. This dissipates quickly though, and performance can be seen to improve after this first run.

The slowest of these are FSYNCED, FSYNC_SAFE, JOURNALED and JOURNAL_SAFE (with JOURNAL_SAFE being the slowest).

[Figure: Write-Concern Cluster 3 – Slowest]

These options all require the data to be written to disk, which explains why they are significantly slower than the other options, though the contended nature of the test environment likely makes the results appear worse than they would be in a production environment. The FSYNC modes are mainly useful for backups and the like, so shouldn’t be used in code. The JOURNALED modes depend on the journal commit interval (default 30 or 100ms) as well as the performance of your disks. Interestingly, JOURNAL_SAFE is supposedly the same as JOURNALED, so it seems a little odd that I consistently see a relatively significant difference in performance.

The second cluster improves performance significantly (from 3.5s overall to 500ms). This group covers the MAJORITY, REPLICAS_SAFE and REPLICA_ACKNOWLEDGED options.

[Figure: Write-Concern Cluster 2 – Mid]

These options are all related to data replication to secondary nodes. REPLICA_ACKNOWLEDGED waits for two servers to have stored the data whilst MAJORITY waits for the majority to have stored it, and in this test, since there are only two database servers, it’s unsurprising that the results are similar. As the number of database servers increases, MAJORITY may be safer than REPLICA_ACKNOWLEDGED but will suffer some performance degradation. This isn’t a linearly scaled performance drop though, since replication will generally occur in parallel. REPLICAS_SAFE is supposedly the same as REPLICA_ACKNOWLEDGED and in this instance the results seem to back this up.

The fastest options cover everything else; ACKNOWLEDGED, SAFE, NORMAL, NONE, ERRORS_IGNORED and UNACKNOWLEDGED.

[Figure: Write-Concern Cluster 1 – Fastest]

In theory I was expecting SAFE and ACKNOWLEDGED to be similar, with NORMAL, NONE, ERRORS_IGNORED and UNACKNOWLEDGED quicker still, since this last set shouldn’t wait for any acknowledgement from the server – once written to the socket, assume all is ok. However, the code I used was an older library I developed some time back which returns the object ID once stored. Since this has to read some data back, some sort of acknowledgement is implicit, so it’s unsurprising that they all perform similarly.

ERRORS_IGNORED and NONE are deprecated and shouldn’t be used any more, whilst NORMAL seems an odd name given that the default for MongoDB itself is ACKNOWLEDGED!?

In summary: for raw speed ACKNOWLEDGED should do, though if you want fire-and-forget then UNACKNOWLEDGED (with code that doesn’t read anything back) should be faster still. A performance drop will occur if you want the assurance that the data has been replicated to another server via REPLICA_ACKNOWLEDGED, and this will depend on your network performance and server locations, so is worth testing for your specific needs. Finally, if you want to know it’s gone to disk then it’s slower still with the JOURNALED option, especially if you’ve contention on the disks as I do. For the truly paranoid there should be a REPLICA_JOURNALED option which would confirm both replicated and journaled.

Finally, if you insist on a replica acknowledging as well, then that replica needs to be online, and your code may hang if one is not available. If you’ve lots of replicas then this may be acceptable, but if you’ve only one (as in this test case) then it’s bad enough to bring the application down immediately.

 

Mad Memoization (or how to make computers make mistakes)

Memoization is a technique used to cache the results of computationally expensive functions to improve performance and throughput on subsequent executions. It can be implemented in a variety of languages but is perhaps best suited to functional programming languages, where the response of a function should be consistent for a given set of input values. It’s a nice idea and has some uses, but perhaps isn’t all that common since we tend to design programs so that we only call such functions once, when needed, in any case.

I have a twist on this. Rather than remembering the response to a function with a particular set of values, remember the responses to a function and just make a guess at the response next time.

A guess could be made based on the entropy of the input and/or output values. For example, where the response is a boolean value (true or false) and you find that 99% of the time the response is “true” but it takes 5 seconds to work this out, then… to hell with it, just return “true” and don’t bother with the computation. Lazy I know.

Of course some of the time the response would be wrong but that’s the price you pay for improving performance throughput.

There would be some (possibly significant) cost to determining the entropy of inputs/outputs, and any function which modifies the internal state of the system (i.e. is not idempotent) should be excluded from such treatment for obvious reasons. You’d also only really want to rely on such behaviour when the system is busy and nearly overloaded already, so you need a way to quickly get through the backlog – think of it like the exit gates of a rock concert when a fire breaks out: you quickly want to ditch the “check-every-ticket” protocol in favour of a “let-everyone-out-asap” solution. A rough sketch of this follows.
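
As a hedged sketch (not a recommendation), the wrapper below remembers the distribution of past answers for an expensive boolean check and, once the system reports itself as overloaded, returns the majority answer without doing the work. The 99% threshold, the 1000-sample warm-up and the overload signal are invented for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;
import java.util.function.Predicate;

// Sketch: "mad memoization" for an expensive boolean check. Under normal load
// it computes the real answer and records it; under heavy load it just returns
// whichever answer has dominated so far. The thresholds and the overload
// signal are illustrative, and the wrapped function must be idempotent.
public class GuessingCheck<T> {
    private final Predicate<T> expensiveCheck;
    private final BooleanSupplier systemOverloaded;      // e.g. driven by queue depth or CPU
    private final AtomicLong trues = new AtomicLong();
    private final AtomicLong falses = new AtomicLong();

    public GuessingCheck(Predicate<T> expensiveCheck, BooleanSupplier systemOverloaded) {
        this.expensiveCheck = expensiveCheck;
        this.systemOverloaded = systemOverloaded;
    }

    public boolean test(T input) {
        long t = trues.get(), f = falses.get(), total = t + f;
        if (systemOverloaded.getAsBoolean() && total > 1000) {
            double majority = Math.max(t, f) / (double) total;
            if (majority >= 0.99) {
                return t >= f; // guess: the answer that's been right 99% of the time
            }
        }
        boolean result = expensiveCheck.test(input); // the slow, honest path
        (result ? trues : falses).incrementAndGet();
        return result;
    }
}
```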

You could even complicate the process a little further and employ a decision tree (based on information gain, for example) when trying to determine the response to a particular set of inputs.

So, you need to identify expensive idempotent functions, calculate the entropy of inputs and outputs, build associated decision trees, get some feedback on the performance and load on the system and work out at which point to abandon reason and open the floodgates – all dynamically! Piece of piss… (humm, maybe not).

Anyway, your program would make mistakes when under load but should improve performance and throughput overall. Wtf! Like when would this ever be useful?

  • DoS attacks? Requests could be turned away at the front door to protect services deeper in the system?
  • The Slashdot effect? You may not give the users what they want but you’ll at least not collapse under the load.
  • Resiliency? If you’re dependent on some downstream component which is not responding (you could be hitting timeouts after way too many seconds) then these requests will look expensive, and you could fall back to some default response (which may or may not be correct!?).

Ok, perhaps not my best idea to date but I like the idea of computers making mistakes by design rather than through incompetence of the developer (sorry, harsh I know, bugs happen, competent or otherwise).

Right, off to take the dog for a walk, or just step outside then come back in again if she’s feeling tired…