
Availability SLAs

I've been thinking about availability recently, given the need to support five 9s (amongst other non-functionals), and have drawn up the list below. Thoughts/comments appreciated.

Assumptions at the bottom.
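
Each target below comes with an annual downtime budget, and they all fall out of the same simple sum. A minimal sketch (not part of the original reasoning, just for the curious), assuming the 24x7 requirement from the assumptions and a 365-day year:

```python
# A minimal sketch of the downtime budget implied by each availability target,
# assuming a 24x7 requirement (assumption 2) and a 365-day year.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def downtime_budget_secs(availability: float) -> float:
    """Seconds of downtime permitted per year at the given availability."""
    return SECONDS_PER_YEAR * (1 - availability)

for label, availability in [
    ("99%", 0.99),
    ("99.9%", 0.999),
    ("99.99%", 0.9999),
    ("99.999%", 0.99999),
    ("99.9999%", 0.999999),
    ("99.99999%", 0.9999999),
]:
    secs = downtime_budget_secs(availability)
    print(f"{label:>9}  {secs / 3600:9.2f} hrs/year  ({secs:,.1f} secs)")
```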

99% - Accepted downtime = 3.6 days/year


A single server is sufficient, assuming disk failures and restarts can be accommodated within minutes to hours.

A data centre (DC) failure leaves us time to rebuild the application elsewhere if we need to (most DC issues will be resolved more quickly than this), and restores from back-up tapes should be easy enough.

Daily backups off-site required in case of total DC failure.

99.9% - Accepted downtime = Just under 9 hrs/year


With a single server we lose around 6 hrs/year to patching alone (30 mins a month per assumption 1), which leaves under 3 hrs to resolve a crash or other failure. That's cutting it fine and likely not enough time (especially if a crash occurs more than once a year).


Therefore we need a clustered service for resiliency, alternating which node is patched to avoid a service outage during that window. The nodes may be active-active, or active-passive, which keeps the SQL database configuration simpler.
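
As an illustration of the alternating-patch idea, here's a minimal sketch; the node names and the drain/patch/health-check functions are hypothetical placeholders for whatever clustering and load-balancing tooling is actually in use.

```python
import time

# A minimal sketch of patching a two-node cluster one node at a time so the
# service never loses its last healthy member. Node names and the drain /
# patch / health-check functions are hypothetical placeholders.

NODES = ["node-a", "node-b"]  # hypothetical node names

def drain(node):
    """Take the node out of the load balancer / mark it passive."""
    print(f"draining {node}")

def patch_and_reboot(node):
    print(f"patching {node}")
    time.sleep(1)  # stands in for the ~30 min patch window (assumption 1)

def healthy(node):
    """Probe the node's health endpoint (assumed to exist)."""
    return True

def rolling_patch(nodes):
    for node in nodes:
        others = [n for n in nodes if n != node]
        if not all(healthy(n) for n in others):
            raise RuntimeError("abort: patching would remove the last healthy node")
        drain(node)
        patch_and_reboot(node)
        while not healthy(node):
            time.sleep(5)
        print(f"{node} back in service")

rolling_patch(NODES)
```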


DC issues are probably resolved in time, but a cold stand-by in a second DC is advisable: restore from the (off-site) backups, or use the pre-production environment if capacity allows and it sits in a second DC.


Daily backups, with redo logs taken (and transferred off-site) every 4 hours.
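
For what it's worth, a minimal sketch of that 4-hourly log shipping is below; the paths are hypothetical, and a real setup would lean on the database's own archiving/log-shipping tooling rather than file copies.

```python
import shutil
import time
from pathlib import Path

# A minimal sketch of shipping redo/archive logs off-site every 4 hours.
# Both paths are hypothetical placeholders.

REDO_DIR = Path("/var/db/redo")          # hypothetical local redo/archive logs
OFFSITE_DIR = Path("/mnt/offsite/redo")  # hypothetical off-site replica mount
INTERVAL_SECS = 4 * 60 * 60              # ship every 4 hours

def ship_new_logs():
    OFFSITE_DIR.mkdir(parents=True, exist_ok=True)
    for log in sorted(REDO_DIR.glob("*.log")):
        target = OFFSITE_DIR / log.name
        if not target.exists():
            shutil.copy2(log, target)
            print(f"shipped {log.name}")

while True:
    ship_new_logs()
    time.sleep(INTERVAL_SECS)
```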

99.99% - Accepted downtime = Just under 1 hr/year


We could in theory still accommodate this with a clustered solution in one location, but an issue at the data centre level would be a real headache.


So we now want resiliency across DCs, but can tolerate an hour to switch over if required. An active-passive DC solution is therefore required, with geographically dispersed data centres.

We need to replicate data in near real-time and have the secondary environment available (warm), ready to take the load in the event it's required, with a GTM (Global Traffic Manager) ready to route traffic in case our DNS changes take too long to ripple out. Classic SQL technology (with the enterprise options) is still viable.
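
A minimal sketch of the GTM routing decision for an active-passive pair: send everything to the primary DC while its health probe passes, otherwise fail over to the warm secondary. The health-check URLs are hypothetical placeholders.

```python
import urllib.request

# A minimal sketch of an active-passive routing decision: primary DC while it
# is healthy, warm secondary otherwise. URLs are hypothetical placeholders.

PRIMARY_DC_HEALTH = "https://dc1.example.com/health"
SECONDARY_DC_HEALTH = "https://dc2.example.com/health"

def is_healthy(health_url, timeout=2.0):
    try:
        with urllib.request.urlopen(health_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def choose_dc():
    """Return the DC that should receive traffic right now."""
    if is_healthy(PRIMARY_DC_HEALTH):
        return "dc1"
    return "dc2"  # the warm standby takes the load

print("routing traffic to", choose_dc())
```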


But at least we can still use traditional storage and database technology (daily backups, redo logs shipped every 30 mins; database mirroring etc.).

99.999% - Accepted downtime = Just under 5 mins/year (my requirement)


5 minutes is probably too short a time to fire up a secondary environment, so we need active-active data centres, with the GTM now used to distribute load across DCs and to route solely to one in the event of an outage in another.


Data replication must be bi-directional, allowing reads and writes simultaneously in each DC. This is complex, adds latency which degrades performance, and consequently has a significant impact on decisions relating to storage and database technology. Most classical SQL databases start to struggle, and we probably need to shard data between data centres by whatever strategy makes most sense for the data, and replicate the shards across DCs.
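
One possible sharding strategy, sketched minimally: each record has a "home" DC chosen by hashing its shard key, with writes applied locally in the owning DC and replicated asynchronously to the others. The DC names, shard key and replication call are all hypothetical.

```python
import hashlib

# A minimal sketch of hash-based sharding across data centres with
# asynchronous cross-DC replication. Names are hypothetical placeholders.

DCS = ["dc1", "dc2"]

def home_dc(shard_key):
    """Deterministically map a shard key (e.g. a customer id) to its home DC."""
    digest = hashlib.sha256(shard_key.encode()).hexdigest()
    return DCS[int(digest, 16) % len(DCS)]

def write(shard_key, record):
    owner = home_dc(shard_key)
    # Synchronous local write in the owning DC...
    print(f"write {record} to {owner}")
    # ...then asynchronous replication of the shard to the other DCs.
    for dc in DCS:
        if dc != owner:
            print(f"replicate to {dc} (async)")

write("customer-42", {"balance": 100})
```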


Application components need to be responsive to failures and route around them when they are detected. Monitoring, alerting and automatic failover are needed to ensure the response to failure is rapid. A Tunguska-scale collision becomes an event worth considering: it could have a huge impact on the power network, disrupting multiple DCs if they aren't sufficiently geographically distributed. However, at odds of one collision every 500 years (500 years of 5 mins/year is roughly 40 hrs of accumulated budget), so long as we can rebuild everything within 40 hrs somewhere unaffected it's a risk that could be taken. A tertiary "cold" DC ready for such an event becomes a consideration though.
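
A minimal sketch of the monitor/alert/auto-failover loop: probe each component, raise an alert on failure, and automatically withdraw the affected DC from the GTM pool. The component list, probe and GTM/alerting hooks are all hypothetical stand-ins.

```python
# A minimal sketch of monitoring, alerting and automatic failover.
# Components, probes and the GTM/alerting hooks are hypothetical placeholders.

COMPONENTS = {
    "dc1-web": True, "dc1-db": True,
    "dc2-web": True, "dc2-db": True,
}  # name -> pretend health state, just for the sketch

def probe(name):
    return COMPONENTS[name]  # stands in for a real health check

def alert(message):
    print("ALERT:", message)  # stands in for paging/alerting tooling

def withdraw_from_gtm(dc):
    print(f"GTM: stop routing traffic to {dc}")

def watchdog_once():
    for name in COMPONENTS:
        if not probe(name):
            alert(f"{name} failed its health check")
            withdraw_from_gtm(name.split("-")[0])  # fail the whole DC over

watchdog_once()  # a real watchdog would run this every few seconds
```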

99.9999% - Accepted downtime = Around 30 secs/year


Taking things into silly territory...

We need to know pretty damn quickly that something's gone wrong. Timeouts have to be reduced to ensure we have time to retry transactions, which may need to go to other nodes or DCs when a failure occurs. Components need to become more distributed, location-agnostic, atomic and self-managing; automatic failover at each instance and each tier is required. This means changes to the sort of patterns adopted in design and development, additional complexity to detect failures in a timely manner, and routing and retry considerations to avoid failures. Additional DCs are necessary, and a data centre on the moon becomes something to consider.
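
A minimal sketch of the reduced-timeout, retry-elsewhere behaviour: each attempt gets a small per-try timeout within an overall deadline, and failures are retried against another node or DC while budget remains. The endpoint names and the call itself are hypothetical stand-ins for a real remote request.

```python
import random
import time

# A minimal sketch of deadline-budgeted retries across nodes/DCs.
# Endpoints and the call() function are hypothetical placeholders.

ENDPOINTS = ["dc1-node1", "dc1-node2", "dc2-node1"]  # hypothetical
OVERALL_DEADLINE = 2.0   # seconds: kept small so a failure barely dents the budget
PER_TRY_TIMEOUT = 0.5    # seconds: fail fast to leave time for retries elsewhere

def call(endpoint, timeout):
    # Stand-in for a real RPC; randomly fails to exercise the retry path.
    if random.random() < 0.3:
        raise TimeoutError(f"{endpoint} did not answer within {timeout}s")
    return f"ok from {endpoint}"

def resilient_call():
    deadline = time.monotonic() + OVERALL_DEADLINE
    last_error = None
    for endpoint in ENDPOINTS:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            return call(endpoint, min(PER_TRY_TIMEOUT, remaining))
        except TimeoutError as exc:
            last_error = exc  # try the next node/DC
    raise TimeoutError(f"all endpoints failed: {last_error}")

print(resilient_call())
```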

99.99999% - Accepted downtime = Around 3 secs/year


A "blip" will be considered an outage and we're reaching the level where typical response times today are unacceptable - what happens when accepted downtime is less than transaction performance!?

Timeouts are reduced to a daft level, as we need to know within 1.5 seconds at most to allow time to retry. The deeper down the stack we go, the less time is available and the worse things get: if each tier halves its budget to leave room for a retry, the deepest tier of a three-tier architecture has 0.375 secs to complete any transaction. Trying to achieve data consistency now becomes virtually impossible. The option of a DC on the moon is no longer viable, though, due to latency (1.3 secs for light to get from the Earth to the Moon, 2.6 secs round trip).

 

Notes/Assumptions



  1. Individual servers recycle once a month for 30 mins for patching.

  2. Assumes available hours requirement is 24x7.

  3. Says nothing about scalability.

  4. Assumes a data centre failure occurs once every 10 years.

  5. Assumes a server crash once every six months.

  6. Assumes RPO (recovery point objective) is the same as the availability requirement.

  7. Assumes RAID storage used to avoid single disk outages.

  8. Assumes cutover from the active node to the passive node takes less than 1 minute.
