Performance & Capacity

Performance and capacity - one of my favourite topics...

We all want our applications to perform well and scale to large volumes, but that's not always easy to achieve. Suffice it to say, it starts with requirements: non-functional requirements. How many users do you expect? How many hits? What functions will be exercised? How often? And so on.

Getting these NFRs defined is itself not trivial. Some clients will optimistically overestimate the volume and frequency of requests, which adds cost to the solution; others will underestimate in the hope of saving a buck, which may result in collapse of the service if it turns out to be more successful than predicted. Beyond this it's often a crystal-ball-gazing exercise, as you rarely have the actual data to hand in advance.

Once you do, though, the two key points regarding performance and capacity are "Model It" and "Measure It".

Modelling allows you to assign resource and time budgets to the various components of the solution as you go through design. This helps in determining where resource constraints may bite, and leads to better solutions for handling the required volume and frequency. Time lost in iterative or sequential tasks may not be an issue where the number of iterations or sequences is low, but will become a fundamental problem at large volumes. The explosion of BigData projects in recent times shows, in part, the trend in handling such situations.
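As a sketch of what such a model can look like at its simplest, here is a back-of-envelope sizing using Little's Law (in-flight requests = arrival rate × residence time). All the figures below are invented for illustration, not taken from any real NFR:

```python
import math

# Hypothetical NFR figures - these are assumptions for illustration.
peak_requests_per_sec = 200      # peak arrival rate from the NFRs
avg_response_time_sec = 0.25     # per-request residence time
threads_per_server = 10          # worker threads each server provides

# Little's Law: average number of requests in flight at any instant.
concurrent_requests = peak_requests_per_sec * avg_response_time_sec

# Servers needed, with ~30% headroom so we aren't running at the limit.
servers = math.ceil(concurrent_requests * 1.3 / threads_per_server)

print(concurrent_requests)  # 50.0 requests in flight
print(servers)              # 7 servers
```

Crude as it is, a model like this immediately shows which variable dominates: halve the response time and the server count roughly halves too, which is the kind of insight that steers design before any code exists.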

Measuring allows you to determine what the actual response times and capacity requirements are. These measurements can usefully be compared against the model to determine whether the model is sufficiently accurate, and allow problem areas to be identified. Perhaps most importantly of all, it allows you to return to the client with a stack of metrics and graphs in hand to prove that the performance and capacity requirements have been met. OK, "prove" may be too strong a word...
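The raw measurements usually need summarising before they mean anything to a client; percentiles are the usual currency. A minimal sketch, using a nearest-rank percentile over some invented response-time samples:

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Invented response times in milliseconds - stand-ins for real measurements.
response_times_ms = [120, 95, 310, 150, 88, 102, 450, 130, 99, 140]

print(percentile(response_times_ms, 50))   # 120 - the median
print(percentile(response_times_ms, 95))   # 450 - the figure clients usually care about
```

Reporting the 95th percentile alongside the median matters: an average can look healthy while a long tail of slow requests is quietly ruining the experience for real users.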

Measuring once in production also allows a comparison to be made against the baseline to identify minor issues which may become more critical over time. Running out of disk space or having the site slow down until it is virtually unusable are common issues, and the only way to identify them in advance is to measure and extrapolate.
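The extrapolation itself need not be sophisticated; a straight-line fit over periodic measurements answers "how long until the disk fills?" well enough to schedule action. A sketch with invented weekly figures:

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + c."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return m, my - m * mx

# Invented weekly disk-usage measurements in GB.
weeks = [0, 1, 2, 3, 4]
used_gb = [100, 112, 125, 136, 150]
capacity_gb = 500

slope, intercept = fit_line(weeks, used_gb)
weeks_to_full = (capacity_gb - intercept) / slope

print(round(slope, 1))          # 12.4 GB growth per week
print(round(weeks_to_full, 1))  # 32.3 weeks of headroom left
```

Real growth is rarely perfectly linear, so the number is a warning light rather than a deadline, but it turns "we'll run out eventually" into "we have roughly eight months".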

Both modelling and measuring need to cover the different types of resource you're going to be dealing with: time, storage, CPU, memory and network utilisation all need to be considered. All too often the challenge is to demonstrate that the "time" element is satisfied without considering the impact elsewhere. Failing to do so prevents further extrapolation of results and identification of particularly constrained components.
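Capturing more than just elapsed time doesn't require much machinery; even the standard library covers the basics. A stdlib-only sketch (Unix-ish - the `resource` module is not available on Windows, and a library such as psutil would be the richer option in practice):

```python
import resource
import shutil
import time

start = time.perf_counter()
# ... the work being measured would go here ...
elapsed = time.perf_counter() - start

# Storage: free space on the volume of interest.
disk = shutil.disk_usage("/")

# Memory: peak resident set size of this process (KB on Linux).
peak_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(f"elapsed={elapsed:.3f}s "
      f"disk_free={disk.free // 2**30}GiB "
      f"peak_rss={peak_rss_kb}KB")
```

Logging all three alongside each timing run is what makes the later extrapolation possible: a response time that holds steady while memory climbs is telling you something a stopwatch alone never would.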

Performance testing is a skill in its own right, with many tools on the market to assist; some open-source, such as JMeter, others commercial, such as LoadRunner or Rational Performance Tester. Various scenarios exist to test applications, such as:

  • Load Testing - Ramping up the load to the target NFRs to see how the system behaves.

  • Break-point Testing - Continuously ramping up the load to see when it (or the test tooling) breaks.

  • Soak Testing - Loading the system over an extended period of time to see what happens as the result of sustained load on the system.

  • Flood Testing - Causing a sudden flood of traffic to hit the system, such as may occur from the slashdot effect or a DoS attack.
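The ramp-up idea behind load and break-point testing can be illustrated in a few lines. This toy sketch steps the offered load up in stages and records throughput at each step; the "system under test" is just a stand-in function, where a real test would drive HTTP traffic with a tool like JMeter:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def system_under_test():
    """Stand-in for a real request; pretend each one takes ~10 ms."""
    time.sleep(0.01)

def run_stage(concurrent_users, requests_per_user):
    """Drive one stage of the ramp and return observed requests/sec."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        for _ in range(concurrent_users * requests_per_user):
            pool.submit(system_under_test)
    elapsed = time.perf_counter() - start
    return concurrent_users * requests_per_user / elapsed

# The ramp: each stage adds concurrent users towards the NFR target.
for users in (1, 5, 10):
    print(users, "users ->", round(run_stage(users, requests_per_user=20), 1), "req/s")
```

The point of interest is the shape of the curve: throughput should climb roughly linearly with users until some resource saturates, and the stage where it flattens (or response times blow out) is your break-point.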

Care is also needed to craft test data-sets which are a reasonable approximation of production. Generating data in volume with the correct variance can be a complex activity. Random number and string generators may produce data which is not representative of real life, such that the tested performance of the system does not match production. Introducing edge cases (users with lots of orders, for example) or generating values which follow more natural distributions (e.g. Benford's Law) is needed, and at a volume which matches the target capacity of the system, so that performance testing is done at the correct scale.
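As one sketch of making synthetic data look more natural, here is a generator whose values have leading digits following Benford's Law, P(leading digit d) = log10(1 + 1/d), rather than the uniform spread a naive random generator gives. The value ranges are invented for illustration:

```python
import math
import random

# Benford's Law weights for leading digits 1..9.
BENFORD_WEIGHTS = [math.log10(1 + 1 / d) for d in range(1, 10)]

def benford_amount(rng, low_exp=1, high_exp=5):
    """A synthetic monetary-style value with a Benford leading digit."""
    digit = rng.choices(range(1, 10), weights=BENFORD_WEIGHTS)[0]
    magnitude = 10 ** rng.randint(low_exp, high_exp)
    # Adding a value below `magnitude` preserves the leading digit.
    return digit * magnitude + rng.randint(0, magnitude - 1)

rng = random.Random(42)
samples = [benford_amount(rng) for _ in range(10_000)]

ones = sum(1 for s in samples if str(s)[0] == "1") / len(samples)
print(round(ones, 2))  # roughly 0.30, as Benford predicts for the digit 1
```

A data-set generated this way exercises indexes and caches more like production data does, which is exactly the property uniformly random test data lacks.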

Once in production, measurement should continue on a regular basis to ensure smooth running of the system. Additional activities may be required to tune performance or archive redundant data, and only by measuring can this be done. Patterns may also start to appear in the data as the normal cycle of life has its effect on the system. Metrics can tell you whether you're hitting the goals of the business in terms of the NFRs targeted, or whether you're about to hit that break-point identified during testing. You can identify quiet periods to determine when best to schedule maintenance, and identify problems before users do.

Of course, lies, damned lies and statistics can have a persuasive hold over us and need to be taken in context. However, I would rather have some metrics as evidence that performance and capacity requirements are met than leave it up to chance.
