I've had a number of discussions in the past about how we should be testing failover and recovery procedures on a regular basis - to make sure they work and that everyone knows what to do, so you're not caught out when it happens for real (which will be at the worst possible moment). Scheduling these tests, even in production, is (or should be) possible at some convenient('ish) time. If you think it isn't, then you've already got a resiliency problem (you're out when some component fails) as well as a maintenance problem.
I've also talked (ok, muttered) about how a healthy injection of randomness can actually improve stability, resilience and flexibility, something Nassim Nicholas Taleb covers in his book Antifragile.
Anyway, beaten to the punch again: Netflix developed a tool called Chaos Monkey (part of its Simian Army) a few years back, which randomly kills elements of the infrastructure to help identify weak points. Well worth checking out on codinghorror.com.
For the record... I'm not advocating that you use Chaos Monkey in production... Just that it's a good way to test the resiliency of your environment and identify potential failure points. You should be testing procedures in production in a more structured manner.
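To make the "randomly kill something" idea concrete, here's a minimal sketch of the principle in Python. It is not Netflix's implementation; the `unleash_monkey` function, the `terminate` callback and the `kill_probability` parameter are all names I've made up for illustration, and in a real test environment `terminate` would call whatever your platform uses to stop an instance.

```python
import random
from typing import Callable, Optional, Sequence


def unleash_monkey(
    instances: Sequence[str],
    terminate: Callable[[str], None],
    kill_probability: float = 0.2,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Maybe kill one randomly chosen instance; return its name, or None.

    Hypothetical helper: `terminate` is supplied by the caller and does
    the actual stopping (here it is just a stand-in).
    """
    rng = rng or random.Random()
    # Most runs do nothing, so failures arrive unpredictably.
    if not instances or rng.random() >= kill_probability:
        return None
    victim = rng.choice(list(instances))
    terminate(victim)
    return victim


if __name__ == "__main__":
    # Made-up instance names; point this at a test environment only.
    fleet = ["web-1", "web-2", "db-1"]
    killed = unleash_monkey(
        fleet,
        terminate=lambda name: print(f"killing {name}"),
        kill_probability=1.0,
    )
    print(f"victim: {killed}")
```

Run something like this on a schedule against a test environment and see what falls over. Passing in a seeded `random.Random` makes a run repeatable when you want to replay a failure.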