I’ve had a number of discussions in the past about how we should be testing failover and recovery procedures on a regular basis – to make sure they work and everyone knows what to do so you’re not caught out when it happens for real (which will be at the worst possible moment). Scheduling these tests, even in production, is (or should be) possible at some convenient(‘ish) time. If you think it isn’t then you’ve already got a resiliency problem (you’re out when some component fails) as well as a maintenance problem.
I’ve also talked (ok, muttered) about how a healthy injection of randomness can actually improve stability, resilience and flexibility. Something covered by Nassim Taleb in his book Antifragile.
Anyway, beat to the punch again, Netflix developed a tool called Chaos Monkey (aka Simian Army) a few years back with randomly kills elements of the infrastructure to help identify weak points. Well worth checking out on codinghorror.com.
For the record… I’m not advocating that you use Chaos Monkey in production… Just that it’s a good way to test the resiliency of your environment and identify potential failure points. You should be testing procedures in production in a more structured manner.