2015/10/18

Availability SLAs

I've been thinking about availability recently, given a need to support five 9's (amongst other non-functionals), and have drawn up the list below. Thoughts/comments appreciated.

Assumptions at the bottom.
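For reference, the accepted-downtime figures quoted against each level below just fall out of simple arithmetic on the availability percentage. A minimal Python sketch, assuming a 365-day year:

    # Where the accepted-downtime figures below come from (365-day year;
    # the figures in the headings are rounded, hence "just under"/"around").
    MINUTES_PER_YEAR = 365 * 24 * 60

    for availability in (0.99, 0.999, 0.9999, 0.99999, 0.999999, 0.9999999):
        downtime_mins = MINUTES_PER_YEAR * (1 - availability)
        print(f"{availability:.5%} -> {downtime_mins / (24 * 60):.2f} days = "
              f"{downtime_mins / 60:.2f} hrs = {downtime_mins:.2f} mins = "
              f"{downtime_mins * 60:.1f} secs per year")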

99% - Accepted downtime = 3.65 days/year


A single server is sufficient, assuming disk failures and restarts can be accommodated within minutes to hours.

Data centre (DC) failure means we've time to rebuild the application elsewhere if we need to (most DC issues will be resolved quicker than this), and restoring from back-up tapes should be easy enough.

Daily backups off-site required in case of total DC failure.

99.9% - Accepted downtime = Just under 9 hrs/year


With a single server we lose around 6 hrs/year to patching alone (assumption 1), which leaves less than 3 hrs to resolve a crash or other failure. That's cutting it fine and likely not enough time, especially with a crash expected roughly every six months (assumption 5).


Therefore we need a clustered service for resiliency, alternating which node is patched to avoid a service outage during patching. This may be active-active or active-passive nodes (the latter makes SQL database configuration simpler).
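As a sanity check on the patching arithmetic above, here's a rough sketch of the single-server case against the 99.9% budget, using the assumptions from the notes at the bottom (30 mins of patching per month, a crash roughly every six months):

    # Does a single server fit inside the 99.9% budget, given the assumptions below?
    budget_hrs = 365 * 24 * (1 - 0.999)   # ~8.76 hrs/year accepted downtime
    patching_hrs = 12 * 0.5               # 6 hrs/year of planned patching
    remaining_hrs = budget_hrs - patching_hrs

    print(f"Budget:   {budget_hrs:.2f} hrs/year")
    print(f"Patching: {patching_hrs:.2f} hrs/year")
    print(f"Left for ~2 crashes a year: {remaining_hrs:.2f} hrs "
          f"(~{remaining_hrs / 2:.2f} hrs per crash)")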


DC issues are probably resolved in time, but a cold stand-by in a second DC is advisable: restore from the (off-site) backups, with the option to use the pre-production environment if capacity allows and it's in a second DC.


Daily backup with redo logs taken (and transferred offsite) every 4 hours.
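A quick look at the worst-case data loss (RPO) each backup scheme in this post implies; a sketch, assuming loss is bounded by the age of the newest copy held off-site:

    # Worst-case data loss (RPO) for the backup schemes mentioned in this post,
    # assuming loss is bounded by the age of the newest off-site copy.
    schemes = {
        "daily backups only (99% tier)": 24 * 60,                  # minutes
        "daily backups + redo logs shipped every 4 hrs": 4 * 60,
        "daily backups + redo logs shipped every 30 mins": 30,
    }
    for scheme, worst_case_mins in schemes.items():
        print(f"{scheme}: up to {worst_case_mins} mins of data lost")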

99.99% - Accepted downtime = Just under 1 hr/year


In theory we could still accommodate this with a clustered solution in one location, but an issue at the data centre level would be a real headache.


So we now want resiliency across DCs, but can tolerate up to an hour to switch over if required. An active-passive DC solution is therefore needed, with geographically dispersed data centres.

We need to replicate data in near real-time and have the secondary (warm) environment ready to take the load in the event it's required, with a GTM (Global Traffic Manager) ready to route traffic in case our DNS changes take too long to ripple out. Classic SQL technology (with the enterprise options) is still viable.
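Conceptually the GTM's job in an active-passive pair is a simple decision; a hypothetical sketch (the DC names and the health_check stub are invented, and a real GTM makes this decision at the DNS/network layer rather than in application code):

    # Hypothetical sketch of the active-passive routing decision: send all
    # traffic to the primary DC unless its health check fails, then fail over
    # to the warm secondary.
    PRIMARY, SECONDARY = "dc-primary", "dc-secondary"   # invented names

    def health_check(dc: str) -> bool:
        """Stub: in reality this would probe the DC's service endpoint."""
        return dc == PRIMARY   # pretend the primary is currently healthy

    def route() -> str:
        return PRIMARY if health_check(PRIMARY) else SECONDARY

    print(f"Traffic routed to: {route()}")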


But at least we can still use traditional storage and database technology (daily backups, redo logs shipped every 30 mins; database mirroring etc.).

99.999% - Accepted downtime = Around 5 mins/year (my requirement)


Five minutes is probably too short a time to fire up a secondary environment, so we need active-active data centres, with the GTM now used to distribute load across DCs and to route solely to one in the event of an outage in the other.


Data replication must be bi-directional, allowing reads and writes simultaneously in each DC. This is complex, adds latency which degrades performance, and consequently has a significant impact on decisions relating to storage and database technology. Most classical SQL databases start to struggle, so we probably need to shard data between data centres by whatever strategy makes most sense for the data, and replicate the shards across DCs.
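To illustrate the sharding idea, a minimal sketch assuming a hash-the-key strategy across two DCs (the DC names and customer key are invented; the right sharding strategy depends entirely on the data):

    # Shard data between two active-active DCs by hashing the key, with each
    # shard also replicated to the other DC for resilience.
    import hashlib

    DCS = ["dc-1", "dc-2"]   # invented names

    def home_dc(customer_id: str) -> str:
        """The DC that owns (accepts writes for) this customer's shard."""
        digest = int(hashlib.md5(customer_id.encode()).hexdigest(), 16)
        return DCS[digest % len(DCS)]

    def replica_dc(customer_id: str) -> str:
        """The other DC, which holds a replica of the shard."""
        return DCS[(DCS.index(home_dc(customer_id)) + 1) % len(DCS)]

    print(home_dc("customer-42"), replica_dc("customer-42"))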


Application components need to be responsive to failures and route around them when they are detected. Monitoring, alerting and automatic failover are needed to ensure the response to failure is rapid. A Tunguska-scale collision becomes an event worth considering: it could have a huge impact on the power network, disrupting multiple DCs if they are not sufficiently geographically distributed. At odds of roughly one collision every 500 years, though, it's a risk that could be taken so long as we can rebuild everything within 40 hrs somewhere unaffected. A tertiary "cold" DC ready for such an event becomes a consideration, however.
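As a back-of-envelope check on that risk, using the one-in-500-years odds and the 40 hr rebuild time above:

    # Rough expected-downtime contribution of a Tunguska-scale event,
    # using ~1-in-500-years odds and ~40 hrs to rebuild somewhere unaffected.
    annual_probability = 1 / 500
    rebuild_hrs = 40
    expected_mins_per_year = annual_probability * rebuild_hrs * 60
    print(f"Expected contribution: ~{expected_mins_per_year:.1f} mins/year")

Averaged out that's about 4.8 mins/year, which is in the same ballpark as the entire five 9's budget, hence the tertiary cold DC being worth a thought.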

99.9999% - Accepted downtime = Around 30secs/year


Taking things into silly territory...

We need to know pretty damn quickly that something's gone wrong. Timeouts have to be reduced to ensure that we have time to retry transactions, which may need to go to other nodes or DCs when a failure occurs. Components need to become more distributed, location agnostic, atomic and self-managing, with automatic failover at each instance and each tier. This results in changes to the sort of patterns adopted in design and development, additional complexity to detect failures in a timely manner, and routing and retry considerations to avoid failures. Additional DCs are necessary, and a data centre on the moon becomes something to consider.
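To make the timeout/retry point concrete, a hypothetical sketch (call_endpoint is a stand-in for whatever RPC client is actually in play, and the endpoint names are invented):

    # Keep per-attempt timeouts tight so there's budget left to retry against
    # another node, or another DC, before the caller gives up.
    import random

    random.seed(1)   # seeded so the demo output is repeatable

    ENDPOINTS = ["node-a.dc-1", "node-b.dc-1", "node-a.dc-2"]   # invented names
    PER_ATTEMPT_TIMEOUT_SECS = 0.5   # must leave room for retries overall

    def call_endpoint(endpoint: str, timeout: float) -> str:
        """Stub for a real remote call; randomly fails to simulate an outage."""
        if random.random() < 0.3:
            raise TimeoutError(f"{endpoint} didn't answer within {timeout}s")
        return f"ok from {endpoint}"

    def resilient_call() -> str:
        last_error = None
        for endpoint in ENDPOINTS:               # fail over node first, then DC
            try:
                return call_endpoint(endpoint, PER_ATTEMPT_TIMEOUT_SECS)
            except TimeoutError as err:
                last_error = err                 # monitoring/alerting hook here
        raise RuntimeError("all endpoints failed") from last_error

    print(resilient_call())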

99.99999% - Accepted downtime = Around 3 secs/year


A "blip" will be considered an outage and we're reaching the level where typical response times today are unacceptable - what happens when accepted downtime is less than transaction performance!?

Timeouts are reduced to a daft level, as we need to know within 1.5 seconds at most to allow time to retry. The deeper down the stack we go, the less time is available and the worse things get (in a three-tier architecture we've around 0.375 secs to complete any transaction). Trying to achieve data consistency now becomes virtually impossible. The option of a DC on the moon is no longer viable though, due to latency (around 1.3 secs for light to get from the Earth to the Moon, 2.6 secs round trip).
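The 0.375 secs figure comes from halving the budget at each tier so there's always time left for one retry; roughly:

    # Each tier halves its time budget so it can still retry once before its
    # own caller times out; a single incident mustn't blow the ~3 secs budget.
    budget_secs = 3.0            # roughly seven 9's worth of accepted downtime
    for tier in range(1, 4):     # three-tier architecture
        budget_secs /= 2
        print(f"Tier {tier} timeout: {budget_secs:.3f} secs")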

 

Notes/Assumptions



  1. Individual servers recycle once a month for 30 mins for patching.

  2. Assumes available hours requirement is 24x7.

  3. Says nothing about scalability.

  4. Assumes a data centre failure occurs once every 10 years.

  5. Assumes a server crashes once every six months.

  6. Assumes RPO (recovery point objective) is the same as the availability requirement.

  7. Assumes RAID storage is used to avoid single-disk outages.

  8. Assumes cutover from the active node to the passive node takes less than 1 minute.
