Performance & Capacity

Performance and capacity - one of my favourite topics...

We all want our applications to perform well and scale to large volumes, but it's not always easy to achieve. Suffice it to say, it starts with requirements - non-functional requirements. How many users do you expect? How many hits? What functions will be exercised? How often? And so on.

Getting these NFRs defined is itself not trivial. Some clients will optimistically overestimate the volume and frequency of requests, which adds cost to the solution; others will underestimate in the hope of saving a buck, which may result in collapse of the service if it's more successful than expected. Beyond this it's often a crystal-ball-gazing exercise, as you rarely have the actual data to hand in advance.

Once you do though, the two key points regarding performance and capacity are "Model It" and "Measure It".

Modelling allows you to allocate time and resource budgets to the various components of the solution as you go through design. This helps in determining where resource constraints may bite and leads to better solutions for handling the volume and frequency required. Time lost in iterative or sequential tasks may not be an issue where the number of iterations or sequences is low, but becomes a fundamental problem at large volumes. The explosion of Big Data projects in recent times shows, in part, the trend in handling such situations.
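
To make that concrete, here's a minimal sketch of the kind of back-of-the-envelope model I mean - the component names and costs are entirely hypothetical figures, not measurements from any real system:

```python
# Hypothetical per-call costs for each component of the solution.
COMPONENT_COST_MS = {
    "web_tier": 5,
    "app_logic": 20,
    "db_query": 15,
}

def predicted_response_ms(db_queries_per_request: int) -> float:
    """Predict the response time of one request given how many
    sequential DB queries it performs."""
    return (COMPONENT_COST_MS["web_tier"]
            + COMPONENT_COST_MS["app_logic"]
            + db_queries_per_request * COMPONENT_COST_MS["db_query"])

# One query per request is fine; an N+1 access pattern is not.
print(predicted_response_ms(1))    # 40 ms
print(predicted_response_ms(100))  # 1525 ms - the iteration dominates
```

Even with made-up numbers the point stands: a component that is cheap in isolation dominates the moment it sits inside a loop.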

Measuring allows you to determine what the actual response times and capacity requirements are. These measurements can be compared against the model to determine whether the model is sufficiently accurate, and allow problem areas to be identified. Perhaps most importantly of all, it allows you to return to the client with a stack of metrics and graphs in hand to prove that the performance and capacity requirements have been met. Ok, prove may be too strong a word...
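
A measuring run can start as simply as timing requests and summarising the percentiles. A minimal sketch, assuming a hypothetical local endpoint and the modelled figure from the example above:

```python
import statistics
import time
import urllib.request

URL = "http://localhost:8080/health"  # hypothetical endpoint
MODELLED_TARGET_MS = 40               # the figure the model predicted

# Time a batch of requests and collect the latencies in milliseconds.
samples = []
for _ in range(50):
    start = time.perf_counter()
    urllib.request.urlopen(URL).read()
    samples.append((time.perf_counter() - start) * 1000)

samples.sort()
p95 = samples[int(len(samples) * 0.95)]  # nearest-rank 95th percentile
print(f"median={statistics.median(samples):.1f}ms p95={p95:.1f}ms "
      f"(model said {MODELLED_TARGET_MS}ms)")
```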

Measuring once in production also allows a comparison to be made against the baseline to identify minor issues which may become more critical over time. Running out of disk space or having the site slow down until it's virtually unusable are common problems, and the only way to identify them in advance is to measure and extrapolate.
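
Extrapolation needn't be sophisticated to be useful. A sketch with purely illustrative numbers:

```python
# Two baseline readings of disk usage, a month apart (illustrative).
disk_used_gb_week0 = 120.0
disk_used_gb_week4 = 180.0
disk_total_gb = 500.0

# A straight-line fit is crude, but it catches the
# "we'll be full in six weeks" class of problem early.
growth_per_week = (disk_used_gb_week4 - disk_used_gb_week0) / 4
weeks_until_full = (disk_total_gb - disk_used_gb_week4) / growth_per_week
print(f"~{weeks_until_full:.0f} weeks of headroom at current growth")
```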

Both modelling and measuring need to cover the different types of resource you're going to be dealing with: time, storage, CPU, memory and network utilisation all need to be considered. All too often the challenge is to demonstrate that the "time" element is satisfied without considering the impact elsewhere. Failing to do so prevents further extrapolation of results and the identification of particularly constrained components.
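
In practice that means snapshotting the other dimensions whenever you take timing measurements. A sketch using the third-party psutil library (the metrics chosen are just examples):

```python
import psutil  # third-party: pip install psutil

def resource_snapshot() -> dict:
    """Capture CPU, memory, disk and network figures alongside a
    timing run, so results can be extrapolated later."""
    net = psutil.net_io_counters()
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "mem_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
        "net_bytes_sent": net.bytes_sent,
        "net_bytes_recv": net.bytes_recv,
    }

print(resource_snapshot())
```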

Performance testing is a skill in its own right, with many tools on the market to assist; some open-source such as JMeter, others commercial products such as LoadRunner or Rational Performance Tester. Various scenarios exist to test applications (a toy ramp-up driver is sketched after this list), such as:

  • Load Testing - Ramping up the load to the target NFRs to see how the system behaves.

  • Break-point Testing - Continuously ramping up the load to see when the system (or the test tools) breaks.

  • Soak Testing - Loading the system over an extended period of time to see what happens as the result of sustained load on the system.

  • Flood Testing - Causing a sudden flood of traffic to hit the system, such as may occur from the Slashdot effect or a DoS attack.
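
As promised, a toy ramp-up driver - purely to illustrate the load vs break-point distinction, not a substitute for a real tool like JMeter. The URL and thread counts are hypothetical; for load testing you'd cap the ramp at the NFR target, for break-point testing you'd keep going until errors appear:

```python
import concurrent.futures
import time
import urllib.request

URL = "http://localhost:8080/"  # hypothetical system under test

def hit(_):
    """Issue one request; report success or failure."""
    try:
        urllib.request.urlopen(URL, timeout=5).read()
        return True
    except Exception:
        return False

# Ramp concurrency up in steps, reporting success rate at each level.
for concurrency in (1, 5, 10, 25, 50):
    with concurrent.futures.ThreadPoolExecutor(concurrency) as pool:
        start = time.perf_counter()
        results = list(pool.map(hit, range(concurrency * 10)))
        elapsed = time.perf_counter() - start
    print(f"{concurrency:>3} threads: {sum(results)}/{len(results)} ok "
          f"in {elapsed:.1f}s")
```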


Care is also needed to craft test data sets which are a reasonable approximation of production. Generating data in volume with the correct variance can be a complex activity in itself. Random number and string generators may produce data which is not representative of real life, such that the tested performance of the system does not match production. Introducing edge cases (users with lots of orders, for example) or generating values which follow a more natural distribution (e.g. Benford's Law) is needed - and at a volume which matches the target capacity of the system, so that performance testing is done at the correct scale.
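
As a sketch of the natural-distribution point: sampling values log-uniformly across several orders of magnitude yields leading digits that follow Benford's Law, where a plain uniform generator would not:

```python
import collections
import math
import random

# Log-uniform values between 1 and 1,000,000 - the leading digits
# of such data follow Benford's Law.
values = [10 ** random.uniform(0, 6) for _ in range(100_000)]

# Compare observed leading-digit frequencies with Benford's
# prediction: P(d) = log10(1 + 1/d).
counts = collections.Counter(int(str(v)[0]) for v in values)
for d in range(1, 10):
    expected = math.log10(1 + 1 / d)
    print(f"digit {d}: observed {counts[d] / len(values):.3f} "
          f"expected {expected:.3f}")
```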

Once in production, measurement should continue on a regular basis to ensure smooth running of the system. Additional activities may be required to tune performance or archive redundant data, and only by measuring can this be done. Patterns may also start to appear in the data as the normal cycle of life has its effect on the system. Metrics can tell you whether you're hitting the NFRs the business targeted, or whether you're about to hit that break-point identified during testing. You can identify quiet periods to determine when best to schedule maintenance, and spot problems before users do.
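
Finding those quiet periods is a simple aggregation over whatever request metrics you're already collecting. A sketch with hypothetical hourly counts:

```python
import collections

# (hour-of-day, request_count) samples - illustrative data only.
samples = [(0, 120), (1, 80), (2, 45), (3, 50), (9, 900),
           (12, 1400), (18, 1100), (23, 300)]

# Average the counts per hour, then rank hours from quietest up.
by_hour = collections.defaultdict(list)
for hour, count in samples:
    by_hour[hour].append(count)

quietest = sorted(by_hour, key=lambda h: sum(by_hour[h]) / len(by_hour[h]))
print("quietest hours:", quietest[:3])  # e.g. [2, 3, 1]
```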

Of course, lies, damned lies and statistics can have a persuasive hold over us and need to be taken in context. However, I would rather have some metrics as evidence that performance and capacity requirements are met than leave it up to chance.
