A good high-level explanation of how Spectre and Meltdown work… and why the Raspberry Pi isn’t vulnerable to either.
There are two ways of scaling, vertical and horizontal, but only one of them is really scalable.
Vertical scaling essentially means bigger nodes. If you’ve got 8GB RAM, go to 16GB. If you’ve got 2 cores, go to 4, and so on.
Horizontal scaling means adding more nodes. One node to two nodes, to three and so on.
As a rule, horizontal scaling is good. Theoretically there’s no limit to the number of nodes you can have.
As a rule, vertical scaling is bad. You quickly run into constraints over the number of cores or the amount of RAM you can support. And for many of today’s problems this just doesn’t work. Solutions need to be scalable at internet scale and available 24×7. Relying on large single nodes in such situations is not ideal (and those supercomputers with 250,000+ processors are really horizontal solutions as well).
The problem is, horizontal scaling isn’t trivial. The culprits here are data and networking (plumbing really). State and caches need to be distributed and available to all. Databases need copies across nodes and need to be synchronised. Sharding usually becomes necessary (or you just end up with many very large nodes). And so on… Best bet is to avoid state as much as possible. But once you’ve cracked it you can build much larger solutions more efficiently (commodity hardware, virtualisation, containers etc.) and flex more rapidly than in the vertical world.
I could go on about how historically the big players love the vertical-scaling thing (think Oracle and IBM trying to sell you those huge servers and SQL database solutions with the $$$ price-tags)… The world found NoSQL solutions which take a fundamentally different approach by accepting that consistency in many cases really isn’t as important as we once thought – and many of these are open-source…
Whatever, there’s only one way to scale… Horizontal!
I wrote previously that we are moving into an era of instrumentation and things are indeed improving. Just not as fast as I’d like. There’s a lot of good stuff out there to support instrumentation and monitoring including the likes of the ELK (ElasticSearch, Logstash, Kibana) and TIG (Telegraf, InfluxDB, Grafana) stacks as well as their more commercial offerings such as TICK (Telegraf, InfluxDB, Chronograf, Kapacitor), Splunk, DataDog, AppDynamics and others. The problem is, few still really treat instrumentation as a real concern… until it’s too late.
Your customers love this stuff! Really, they do! There’s nothing quite as sexy as an interactive graph showing how your application is performing as the load increases – transactions, visitors, response-times, server utilisation, queue-depths etc. When things are going well it gives everyone a warm fuzzy feeling that all is right with the universe. When things are going wrong it helps to quickly focus you in on where the problem is.
However, this stuff needs to be built into everything we do and not be an afterthought when the pressure’s on to ship it and you can’t afford the time and effort to retrofit it. By then it’s too late.
As architects we need to put in the infrastructure and services needed to support instrumentation, monitoring and alerting. At a minimum this means putting in place standards for logging, data-retention policies, a data collection solution, a repository for the data and some tooling to allow us to search that data and visualize what’s going on. Better still, we can add alerting when thresholds are breached and use richer analytics to allow us to scale up and down to meet demand.
As developers we need to consider what metrics we want to capture from the components we build as we’re working on them. Am I interested in how long this function call takes? Do I want to know how many messages a service is handling? How many threads are being spawned? What exceptions are being thrown? Where from? What the queue depths are? etc. Almost certainly… YES! And this means putting in place strategies for logging these things. Perhaps you can find the data in existing log files… Perhaps you need better tooling for detailed monitoring… Perhaps you need to write some code yourself to track how things are going…
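As a minimal sketch of the “how long is this function call taking?” question – in Python, with hypothetical names – a logging decorator is often enough to start capturing timings from day one:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("metrics")

def timed(func):
    """Log how long each call takes -- a cheap first step towards instrumentation."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.info("%s took %.2fms", func.__name__, elapsed_ms)
    return wrapper

@timed
def handle_message(msg):
    # Stand-in for whatever unit of work we want to keep an eye on.
    return msg.upper()
```

From here it’s a small step to shipping the same numbers to whatever collection stack (ELK, TIG etc.) you’ve put in place.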
Doing this from the start will enable you to get a much better feel for how things are working before you launch – including a reasonable view of performance and infrastructure demands which will allow you to focus your efforts better later when you do get into sizing and performance testing. It’ll mean you’re not scrambling around looking for log files to root-cause issues as your latest release goes into meltdown. And it’ll mean your customer won’t be chewing your ear off asking you what’s going on every five minutes – they’ll be able to see it for themselves…
So please, get it in front of your customer, your product owner, your sponsor, your architects, your developers, your testers and make instrumentation a 1st class citizen in the backlog.
It has been quite some time since my last post… Mainly because I’ve spent an inordinate amount of time trying to get an application scaling and performing as needed. We’ve done it, but I’m not happy.
Not happy, in part because of the time it’s taken, but mainly because the solution is totally unsatisfactory. It’s an off-the-shelf (COTS) package so making changes beyond a few “customisations” is out of the question, and the supplier has been unwilling to accept that the problem is within the product, instead pointing to our “environment” and “customisations” – of which, IMHO, neither is particularly odd.
At root there are really two problems.
One – a single JVM can only handle 10 tps according to the supplier (transaction/page requests/second). Our requirement is around 50.
Two – performance degrades over time to unacceptable levels if a JVM is stressed hard. So 10tps realistically becomes more like 1 to 2 tps after a couple of days of soak testing.
So we’ve done a lot of testing – stupid amounts of testing! Over and over, tweaking this and that, changing connection and thread pools, JVM settings, memory allocations etc. with pretty much no luck. We’ve checked the web-servers, the database, the queue configuration (itself an abomination of a setup), the CPU is idle, memory is plentiful, garbage-collection working a treat, disk IO is non-existent, network-IO measured in the Kb/sec. Nada! Except silence from the supplier…
And then we’ve taken thread dumps and can see stuck threads and lock contention so we know roughly where the problem lies, passed this to the supplier, but still, silence…
Well, not quite silence. They finally offered that “other customers don’t have these issues” and “other customers typically run 20+ JVMs”! Excuse me? 20 JVMs is typical..? wtf!? So really they’re admitting that the application doesn’t scale within a JVM. That it cannot make use of resources efficiently within a JVM and that the only way to make it work is to throw more JVMs at it. Sounds to me like a locking issue in the application – one that no doubt gets worse as the application is stressed. Well at least we have a fix…
This means that we’ve ended up with 30 JVMs across 10 servers (VMs) for one component to handle a pathetic 50tps! – something I would expect 2 or 3 servers to handle quite easily given the nature of the application (the content delivery aspect of a content management system). And the same problem pervades the applications other components so we end up with 50 servers (all VMs bar a physical DB cluster) for an application handling 50 tps… This is not efficient or cost effective.
There are also many other issues with the application including such idiocies as synchronous queueing, a total lack of cache headers (resulting in a stupidly high hit-rate for static resources) and really badly implemented Lucene indexing (closing and opening indexes continually). It is, by some margin, the worst COTS application I have had the misfortune to come across (I’ll admit I’ve seen worse home-grown ones, so not sure where that leaves us in the buy-v-build argument…).
So what’s wrong with having so many JVMs?
Well, cost for a start. Even though we can cram more JVMs onto fewer VMs we need to step this up in chunks of RAM required per JVM (around 4GB). So, whilst I’m not concerned about CPU, a 20GB 4vCPU host can really only support 4 JVMs (some space is needed for OS and other process overheads). Lots of tin, doing nothing.
But the real issue is maintenance. How the hell do you manage that many JVMs and VMs efficiently? You can use clustering in the application-server, oh, except that this isn’t supported by the supplier (like I said, the worst application ever!). So we’ve now got monitors and scripts for each JVM and each VM and when something breaks (… and guess what, with this pile of sh*t, it does) we need to go round each node fixing them one-by-one.
Anyway, lessons learned, how should we have scaled such an application? What would I do differently now that I know? (bar using some completely different product of course)
Firstly I would have combined components together where we can. There’s no reason why some WARs couldn’t be homed together (despite the suppliers design suggesting otherwise). This would help reduce some of the JVMs and improve the reliability of some components (that queueing mechanism specifically).
Secondly, given we can’t use a real cluster in the app-server, we can (now) use containers to package up each component of the application instead. This then becomes our scaling and maintenance point and rather than having 50 servers to manage we have 7 or 8 images to maintain (still a lot for such an application). This then allows us to scale up or down at the container level more quickly. The whole application wouldn’t fit this model (the DB in particular would remain as it is) but most of it would.
Of course it doesn’t solve the root cause, unfortunately, but it is a more elegant, maintainable and cheaper solution and, bar eradicating this appalling product from the estate, one that would have been so much more satisfying.
So that’s the project for the summer… Work out how to containerise this sort of COTS application, and how to connect, route and scale the containers in a way that is manageable, efficient and cost effective. Next project please!
We can have a small server…
…a big server (aka vertical scaling)…
.. a cluster of servers (aka horizontal scaling)…
.. or even a compute grid (horizontal scaling on steroids).
For resiliency we can have active-passive…
… or active-active…
… or replication in a cluster or grid…
…each with their own connectivity, load-balancing and routing concerns.
From a logical perspective we could have a simple client-server setup…
…an n-tier architecture…
…a service oriented (micro- or ESB) architecture…
…and so on.
And in each environment we can have different physical topologies depending on the environmental needs with logical nodes mapped to each environments servers…
With our functional components deployed on our logical infrastructure using a myriad of other deployment topologies..
… or …
… and on and on and on…
And this functional perspective can be implemented using dozens of design patterns and a plethora of integration patterns.
With each component implemented using whichever products and packages we choose to be responsible for supporting one or more requirements and capabilities…
So the infrastructure we rely on, the products we select, the components we build or buy; the patterns we adopt and use… all exist for nothing but the underlying requirement.
We should therefore be able to trace from requirement through the design all the way to the tin on the floor.
And if we can do that we can answer lots of interesting questions such as “what happens if I turn this box off?”, “what’s impacted if I change this requirement?” or even “which requirements are driving costs?”. Which in turn can help improve supportability, maintainability and availability and reduce costs. You may even find your product sponsor questioning if they really need this or that feature…
Performance testing is easy. We just throw as many requests at the system as we can as quickly as we want and measure the result. Job done right?
tl;dr? Short form…
- Understand the user scenarios and define tests. Review the mix of scenarios per test and the type of tests to be executed (peak, stress, soak, flood).
- Size and prepare the test environment and data. Consider the location of injectors and servers and mock peripheral services and systems where necessary.
- Test the tests!
- Execute and monitor everything. Start small and ramp up.
- Analyse results, tune, rinse and repeat until happy.
- Report the results.
- And question to what level of depth performance testing is really required…
Assuming we’ve got the tools and the environments, the execution of performance tests should be fairly simple. The first hurdle though is in preparing for testing.
User Scenarios and Test Definitions
In order to test we first need to understand the sort of user scenarios that we’re going to encounter in production which warrant testing. For existing systems we can usually do some analysis on web-logs and the like to figure out what users are actually doing and try to model these scenarios. For this we may need a year or more of data to see if there are any seasonal variations and to understand what the growth trend looks like. For new systems we don’t have this data so need to make some assumptions and estimates as to what’s really going to happen. We also need to determine which of the scenarios we’re going to model and the transaction rates we want them to achieve.
When we’ve got system users calling APIs or running batch-jobs the variability is likely to be low. Human users are a different beast though and can wander off all over the place doing weird things. To model all scenarios can be a lot of effort (which equals a lot of cost) and a risk based approach is usually required. Considerations here include:
- Picking the top few scenarios that account for the majority of activity. It depends on the system, but I’d suggest keeping these scenarios down to <5 – the fewer the better so long as it’s reasonably realistic.
- Picking the “heavy” scenarios which we suspect are most intensive for the system (often batch jobs and the like).
- Introducing noise to tests to force the system into doing things they’d not be doing normally. This sort of thing can be disruptive (e.g. a forced load of a library not otherwise used may be just enough to push the whole system over the edge in a catastrophic manner).
We next need to consider the relative mix of user scenarios for our tests (60% of users executing scenario A, 30% doing scenario B, 10% doing scenario C etc.) and the combinations of scenarios we want to consider (running scenarios A, B, C vs. A, B, C plus batch job Y).
Some of these tests may not be executed for performance reasons but for operability – e.g. what happens if my backup runs when I’m at peak load? or what happens when a node in a cluster fails?
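As a sketch of how a weighted scenario mix can drive a test script – plain Python, with made-up scenario names and percentages:

```python
import random

# Hypothetical mix: 60% scenario A, 30% B, 10% C.
SCENARIO_MIX = {"A": 0.6, "B": 0.3, "C": 0.1}

def pick_scenario(rng):
    """Pick one scenario per simulated user, weighted by the mix."""
    names = list(SCENARIO_MIX)
    weights = list(SCENARIO_MIX.values())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(1)  # seeded so the mix is repeatable between runs
counts = {name: 0 for name in SCENARIO_MIX}
for _ in range(10_000):
    counts[pick_scenario(rng)] += 1
```

Most commercial and open-source load tools express the same idea declaratively, but it’s worth sanity-checking that the mix you configured is the mix you actually got.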
We also need test data.
For each scenario we should be able to define the test data requirements. This is stuff like user-logins, account numbers, search terms etc.
Just getting 500 test user logins set up can be a nightmare. The associated test authentication system may not have the capacity to handle the number of logins or accounts and we may need to mock it out. It’s all too common for peripheral systems not to be in a position to enable performance testing as we’d like, and in any case we may want something more reliable when testing. For any mock services we do decide to build we need to work out how they should respond and what their performance should look like (it’s no good having a mock service return in 0.001 seconds when the real thing takes 1.8 seconds).
Account numbers have security implications and we may need to create dummy data. Search terms, especially from humans, can be wild and wonderful – returning millions of records, or zero, in place of the expected handful.
In all cases, we need to prepare the environment based on the test data we’re going to use and size it correctly. Size it? Well, if production is going to have 10 million records it’s not much good testing with 100! Copies of production data, possibly obfuscated, can be useful for existing systems. For new ones though we need to create the data. Here be dragons. The distribution of randomly generated data almost certainly won’t match that of real data – there are far more instances of surnames like Smith, Jones, Taylor, Williams or Brown than there are like Zebedee. If the distribution isn’t correct then the test may be invalid (e.g. we may hit one shard or tablespace and its associated nodes and disks too little or too much).
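To make the distribution point concrete, here’s a sketch of generating skewed rather than uniform test data – the surname weights are purely illustrative, not real census figures:

```python
import random

# Illustrative weights only -- not real frequency data.
SURNAME_WEIGHTS = {"Smith": 500, "Jones": 400, "Taylor": 300,
                   "Williams": 280, "Brown": 260, "Zebedee": 1}

def generate_surnames(n, seed=42):
    """Generate n surnames with a realistic-ish skew rather than a uniform spread."""
    rng = random.Random(seed)  # seeded so the test data set is reproducible
    names = list(SURNAME_WEIGHTS)
    weights = list(SURNAME_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=n)

sample = generate_surnames(10_000)
```

A uniform generator would hit every shard equally; a skewed one like this will hammer whichever shard holds the Smiths – which is exactly what production will do.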
I should point out here that there’s a short cut for some situations. For existing systems with little in the way of stringent security requirements, no real functional changes and idempotent requests – think application upgrades or hardware migrations of primarily read-only websites – replaying the legacy web-logs may be a valid way to test. It’s cheap, quick and simple – if it’s viable.
We should also consider the profile and type of tests we want to run. Each test profile has three parts: the ramp-up time (how long it takes to get to the target volume), the steady-state time (how long the test runs at this level), and the ramp-down time (how quickly we close the test – we usually care little for this and can shut down quickly, but in some cases we want a nice clean shutdown). In terms of test types there are:
- Peak load test – Typically a 1 to 2 hr test at peak target volumes. e.g. Ramp-up 30 minutes, steady-state 2hrs, ramp-down 5 mins.
- Stress test – A longer test continually adding load beyond peak volumes to see how the system performs under excessive load and potentially where the break point is. e.g. Ramp-up 8 hrs, steady-state 0hrs, ramp-down 5 mins.
- Soak test – A really long test running for 24hrs or more to identify memory leaks and the impact of peripheral/scheduled tasks. e.g. Ramp-up 30 mins, steady-state 24hrs, ramp-down 5 mins.
- Flood test (aka Thundering Herd) – A short test where all users arrive in a very short period. In this scenario we often see chaos ensue initially but the environment settle down after a short period. e.g. Ramp-up 0 mins, steady-state 2hrs, ramp-down 5 mins.
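The profiles above are easy to pin down as data, which keeps the team honest about what each test actually means – a sketch, with the same illustrative numbers:

```python
# Hypothetical test profiles: times in minutes, matching the examples above.
PROFILES = {
    "peak":   {"ramp_up": 30,  "steady": 120,  "ramp_down": 5},
    "stress": {"ramp_up": 480, "steady": 0,    "ramp_down": 5},
    "soak":   {"ramp_up": 30,  "steady": 1440, "ramp_down": 5},
    "flood":  {"ramp_up": 0,   "steady": 120,  "ramp_down": 5},
}

def total_duration_mins(profile):
    """Total wall-clock time a test run will occupy the environment for."""
    p = PROFILES[profile]
    return p["ramp_up"] + p["steady"] + p["ramp_down"]
```

Handy when booking shared test environments – a soak test at these numbers occupies the kit for over a day.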
So we’re now ready to script our tests. We have the scenarios, we know the transaction volumes, we have test data, our environment is prep’d and we’ve mocked out any peripheral services and systems.
There are many test tools available from the free Apache JMeter and Microsoft web stress tools to commercial products such as HP LoadRunner and Rational Performance Tester to cloud based solutions such as Soasta or Blitz. Which tool we choose depends on the nature of the application and our budget. Cloud tools are great if we’re hosting in the public cloud, not so good if we’re an internal service.
The location of the load injectors (the servers which run the actual tests) is also important. If these are sitting next to the test server we’ll get different results than if the injector is running on someone’s laptop connected via a VPN tunnel over a 256kbit ADSL line somewhere in the Scottish Highlands. Which case is more appropriate will depend on what we’re trying to test and where we consider the edge of our responsibility to lie. We have no control over the sort of devices and connectivity internet users have so perhaps our responsibility stops at the point of ingress into our network? Or perhaps it’s a corporate network and we’re only concerned with the point of ingress into our servers? We do need to design and work within these constraints so measuring and managing page weight and latency is always a concern, but we don’t want the complexity of all that “stuff” out there which isn’t our responsibility weighing us down.
Whichever tool we choose, we can now complete the scripting and get on with testing.
Firstly, check everything is working. Run the scripts with a single user for 20 minutes or so to ensure things are responding as expected and that the transaction load is correct. This will ensure that as we add more users we’re scaling as we want and that the scripts aren’t themselves defective. We then quite quickly ramp the tests up: 1 user, 10 users, 100 users etc. This helps to identify any concurrency problems early on, with fewer users than expected (too many users can add too much noise and make it hard to see what’s really going on).
If we’ve an existing system, once we know the scripts work we will want to get a baseline from the legacy system to compare to. This means running the tests on the legacy system. What? Hang on! This means we need another instance of the system available running the old codebase with similar test data and similar; but possibly not identical, scripts! Yup. That it does.
If we’ve got time-taken logging enabled (%D for Apache mod_log_config) then we could get away with comparing the old production response times with the new system so long as we’re happy the environments are comparable (same OS, same types of nodes, same spec, same topology, NOT necessarily the same scale in terms of numbers of servers) and that the results are showing the same thing (which depends on what upstream network connectivity is being used). But really, a direct comparison of test results is better – comparing apples with apples.
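If time-taken logging is enabled, pulling the %D field (microseconds) out of an access log is straightforward. A hedged sketch – the log format and sample line below are assumptions, so the regex would need adjusting to your actual LogFormat:

```python
import re

# Assumes a LogFormat ending in %D (response time in microseconds), e.g.:
#   LogFormat "%h %l %u %t \"%r\" %>s %b %D" timed
LINE_RE = re.compile(r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) (?P<micros>\d+)$')

def response_time_ms(line):
    """Return the %D response time in milliseconds, or None if the line doesn't match."""
    m = LINE_RE.search(line)
    return int(m.group("micros")) / 1000 if m else None

# Made-up sample line for illustration.
line = '10.0.0.1 - - [01/Jan/2017:10:00:00 +0000] "GET /home HTTP/1.1" 200 5120 183422'
```

Run over a day’s worth of legacy production logs, this gives a free baseline to compare the new system against.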
We also need to consider what to measure and monitor. We are probably interested in:
- For the test responses:
- Average, max, min and 95th percentile for the response time per request type.
- Average, max, min size for page weight.
- Response codes – 20x/30x probably good, lots of 40x/50x suggests the test or servers are broken.
- Network load and latency.
- For the test servers:
- CPU, memory, disk and network utilisation throughout the test run.
- Key metrics from middle-ware; queue depths, cache-hit rates, JVM garbage collection (note that JVM memory will look flat at the server level so needs some JVM monitoring tools). These will vary depending on the middle-ware and for databases we’ll want a DBA to advise on what to monitor.
- Number of sessions.
- Web-logs and other log files.
- For the load injectors:
- CPU, memory, disk and network utilisation throughout the test run. Just to make sure it’s not the injectors that are overstretched.
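A minimal sketch of the response-time summary statistics listed above (plain Python, no test tool assumed; the percentile calculation here uses simple nearest-rank rather than interpolation):

```python
def summarise(times_ms):
    """Average, max, min and 95th percentile (nearest-rank) of response times."""
    ordered = sorted(times_ms)
    p95_index = max(0, int(round(0.95 * len(ordered))) - 1)
    return {
        "avg": sum(ordered) / len(ordered),
        "max": ordered[-1],
        "min": ordered[0],
        "p95": ordered[p95_index],
    }

# Illustrative data: response times of 1..100 ms.
stats = summarise(list(range(1, 101)))
```

The 95th percentile matters more than the average – a healthy mean can hide a miserable tail that 1 in 20 of your users is living in.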
And finally we can test.
Analysis and Tuning
It’s important to verify that the test achieved the expected transaction rates and usage profiles. Reviews of log files to ensure there were no errors, and of web-logs to confirm transaction rates and request types, help verify that all was correct before we start to review response times and server utilisation.
We can then go through the process of correlating test activity with utilisation, identifying problems, limits near capacity (JVM memory for example) and extrapolate for production – for which some detailed understanding of the scaling nature of the system is required.
It’s worth noting that whilst tests rarely succeed first time, in my experience it’s just as likely to be an issue with the test as it is with the system itself. It’s therefore necessary to plan to execute tests multiple times. A couple of days is normally not sufficient for proper performance testing.
All performance test results should be documented for reporting and future needs. Already having an understanding of why certain changes were made, and a baseline to compare against the next time the tests are run, is invaluable. It’s not war-and-peace – just a few pages of findings in a document or wiki. Most test tools will also export the results to a PDF which can be attached to keep track of the detail.
This post is already too long but one thing to question is… Is it worth the effort?
A Zipf distribution applies to systems and few really have that significant a load. Most handle a few transactions a second, if that. I wouldn’t suggest “no performance testing” but I would suggest sizing the effort according to the criticality and expected load. Getting a few guys in the office to hit F5 whilst we eyeball the CPU usage may well be enough. In code we can also include timing metrics in unit tests and execute these a few thousand times in a loop to see if there’s any cause for concern. Getting the engineering team to consider and monitor performance early on can help avoid issues later and reduce the need for multiple performance test iterations.
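The unit-test timing idea can be sketched like this – `format_account` is a made-up stand-in for whatever small unit of code you want to keep an eye on:

```python
import time

def format_account(number):
    # Hypothetical unit of work we want a rough timing for.
    return f"ACC-{number:010d}"

def time_per_call_us(func, iterations=10_000):
    """Run func in a tight loop and return the mean time per call in microseconds."""
    start = time.perf_counter()
    for i in range(iterations):
        func(i)
    return (time.perf_counter() - start) / iterations * 1e6

per_call = time_per_call_us(format_account)
```

With a deliberately generous threshold this becomes a smoke test the CI build can assert on – not a benchmark, just an early warning that something got an order of magnitude slower.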
Critical systems with complex transactions or an expected high load (which as a rough guide I would say is anything around 10tps or more) should be tested more thoroughly. Combining capacity needs with operational needs informs the decision – four 9’s and 2k tps is the high end from my experience – and a risk based approach should always be used when considering performance testing.
BT Sport’s online player – which, being polite, is piss-poor, with a UX design seemingly provided by a six-year-old – is a fine example of how not to deal with errors in user interfaces. “User” being the key word here…
Rather than accepting that users are human beings in need of meaningful error messages, and perhaps in-situ advice about what to do about them, they insist on providing cryptic codes with no explanation which you need to google the meaning of (ok, I admit I ignored the FAQ!). This will lead you to an appalling page, clearly knocked up by the six-year-old’s senior sibling on day one of code-club, littered with links you need to eyeball, which finally leads you to something useful telling you how to deal with the idiocy of BT’s design decisions.
In this particular case it’s the decision some halfwit architect made to require users to downgrade their security settings (e.g. VC002 or VC040)! So a national telecom provider and ISP is insisting that users weaken their online security settings so they can access a half-arsed service?…
Half-arsed since, when you do get it working, it performs abysmally, with the video quality of a 1996 RealAudio stream over a 28.8 kbps connection. This is likely because some mindless exec has decided they don’t want people watching on their laptops – they’d rather they use the “oh-god-please-don’t-get-me-started-on-how-bad-this-device-is” BT Vision Box. I feel sick just thinking about it…
In non-functional terms:
- Form – Fail
- Performance – Fail
- Security – Fail
- Operability – Well, all I know is that it failed on day one of launch and I suspect it’s as solid as a house of cards behind the scenes. Let’s see what happens if the service falls over at Champions League final kick-off!
Success with non-functionals alone doesn’t guarantee success – you need a decent functional product and, let’s face it, Champions League football is pretty decent – but no matter how good the function, if it’s unusable, if it makes your eyes bleed, if it performs like a dog, if it’s insecure and if it’s not reliable then you’re going to fail! It’s actually pretty impressive that BT have managed to fail on (almost) every count! Right, off now to watch something completely different not supplied by BT… oh, except they’re still my ISP because – quelle surprise – that bit actually works!
I’d almost forgotten how ridiculous this is after a couple of years with OSX and Linux… Happened on the train this morning. Well, I’ll just look at the scenic view that is the south of England for the next half an hour then… Thanks Microsoft!
p.s. Some would claim this isn’t such a bad thing since the south of England can be beautiful, in this case though, I had work to do.
I’ve been doing a little performance prototyping and my usual technique of logging milliseconds spent doesn’t quite cut it as the result fluctuates between 0ms and 1ms – not enough granularity to allow for any useful comparison. Switching to nanoseconds does the trick – case A is a little over 0.6ms slower than case B in my test… Cool!
What’s the difference between a nanosecond (10⁻⁹s) and a microsecond (10⁻⁶s)? Grace puts it in perspective… but I’m talking milliseconds (10⁻³s)… so that’d be just shy of 300km per ms, or 180km longer in A compared to B. What a waste…
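For the record, switching from millisecond to nanosecond granularity is a one-liner in most languages; a sketch in Python using `time.perf_counter_ns` (Python 3.7+) as an illustration:

```python
import time

def timed_ns(func, *args):
    """Time a single call with nanosecond resolution."""
    start = time.perf_counter_ns()
    result = func(*args)
    return result, time.perf_counter_ns() - start

# Illustrative workload: sum the integers 0..999.
result, elapsed_ns = timed_ns(sum, range(1000))
```

At this granularity the 0ms-vs-1ms coin-flip disappears and sub-millisecond differences between two implementations become measurable.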
I’ve been going through my bookcase; on orders from a higher-being, to weed out old, redundant books and make way for… well, I’m not entirely sure what, but anyway, it’s not been very successful.
I came across an old copy of Release It! by Michael T. Nygard and started flicking through, chuckling occasionally as memories (good and bad) surfaced. It’s an excellent book but made me stop and think when I came across a note reading:
There’s nothing fundamentally wrong with this other than it chimes with a problem I’m currently facing and I don’t like any of the usual solutions.
Sessions either reside in some sort of stateful pool (persistent database, session management server, replicated memory etc.), or more commonly exist stand-alone within each node of a cluster. In either case load-balancing is needed to route requests to the home node where the session exists (delays in replication mean you can’t go to just any node, even when a stateful pool is used). Such load-balancing is performed by a network load-balancer, reverse proxy, web-server (mod_proxy, the WebSphere plugin etc.) or application server, and can use numerous different algorithms: IP-based routing, round-robin, least-connections etc.
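As an illustration of the home-node idea, a simple hash-based sticky-routing scheme (hypothetical node names) looks something like this – and shows why a node failure strands every session homed on it:

```python
import hashlib

NODES = ["app1", "app2", "app3"]  # hypothetical cluster members

def home_node(session_id, nodes=NODES):
    """Sticky routing: hash the session id so the same session always lands on the same node."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return nodes[int.from_bytes(digest[:4], "big") % len(nodes)]
```

If the node returned by `home_node` dies, every session mapped to it is lost (or, with modulo hashing like this, remapped wholesale when the node list changes) – which is exactly the reliability havoc described below.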
So in my solution I now need some sort of load-balancer – more components, joy! But even worse, it’s creating havoc with reliability. Each time a node fails I lose all the sessions on that server (unless I plump for a session-management server, which I need like a hole in the head). And nodes fail all the time… (think cloud, autoscaling and hundreds of nodes).
So now I’m going to kind-of break that treasured piece of advice from Michael and create larger cookies (more likely request parameters) and include in them some ever-so-slightly-sensitive details which I really shouldn’t. I should point out this isn’t as criminal as it sounds.
Firstly the data really isn’t that sensitive. It’s essentially routing information that needs to be remembered between requests – not my credit card details.
Secondly it’s still very small – a few bytes or so – and I’d probably not worry too much until it gets to around 2K+ (some profiling required here I suspect).
Thirdly, there are other ways to protect the data – notably encryption and hashing. If I don’t want the client to be able to read it then I’ll encrypt it. If I don’t mind the client reading the data but want to make sure it’s not been tampered with, I’ll use an HMAC instead. A JSON Web Token-like format should work well in most cases.
Now I can have no session on the back-end servers at all but instead need to decrypt (or verify the hash) and decode a token on each request. If a node fails I don’t care (much) as any other node can handle the same request and my load balancing can be as dumb as I can wish.
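A minimal sketch of the HMAC-signed token approach, using Python’s standard `hmac` module – the key and payload here are made up, and in reality the key comes from the key-management problem mentioned later:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-key-not-for-production"  # hypothetical; really from key management

def sign_token(payload: dict) -> str:
    """Encode the payload and append an HMAC so tampering is detectable."""
    body = base64.urlsafe_b64encode(json.dumps(payload, sort_keys=True).encode())
    mac = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + mac

def verify_token(token: str) -> dict:
    """Reject the token if the HMAC doesn't match; otherwise return the payload."""
    body, mac = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac, expected):
        raise ValueError("token tampered with")
    return json.loads(base64.urlsafe_b64decode(body))

token = sign_token({"route": "node-7"})
```

The client can read the payload but can’t alter it without the server noticing – and since any node holds the same key, any node can verify the token, so no home node is needed.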
I’ve sacrificed performance for reliability – both in terms of computational effort server side and in terms of network payload – and made some simplification to the overall topology to boot. CPU cycles are getting pretty cheap now though and this pattern should scale horizontally and vertically – time for some testing… The network penalty isn’t so cheap but again should be acceptable and if I avoid using “cookies” for the token then I can at least save the load on every single request.
It also means that in a network of micro-services, so long as each service propagates these tokens around, the more thorny routing problem in this sort of environment virtually disappears.
I do though now have a key management problem. Somewhere, somehow I need to store the keys securely whilst distributing them to every node in the cluster… oh and don’t mention key-rotation…