So now you have the perfect BCP, but does it work?
Last month I described the things that make up a great Business Continuity Plan (BCP) for your IT Disaster Recovery (DR) needs. So I’m sure you have your plan freshly printed, bound and sitting in your office, just like NASA had the “How to launch a rocket” handbook at the Kennedy Space Center in case the computers went down in the ’70s. But… have you tested it? And I mean really tested it, from test fail-overs to user testing to fail-back operations and procedures?
Without testing your plan you cannot rely on it. As with any good plan, it’s when you put it into action that counts. Moreover, you cannot just test it once and rubber-stamp it. It must be tested over and over again with different scenarios in mind. For example, when you test the fail-over of a single system, did you capture all of its dependencies in your planning?
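One way to check the dependency question before a single-system test is to walk a dependency map and compare it with what the plan actually covers. The following is only a minimal sketch: the system names, the DEPENDS_ON map and both helper functions are hypothetical, and in practice the map would come from your own inventory or CMDB.

```python
from collections import deque

# Hypothetical dependency map: each system lists what it needs in order to run.
DEPENDS_ON = {
    "payroll-app": ["payroll-db", "auth-service"],
    "payroll-db": ["storage-array"],
    "auth-service": ["directory-server"],
    "storage-array": [],
    "directory-server": [],
}


def all_dependencies(system, depends_on):
    """Return every direct and indirect dependency of `system`."""
    seen = set()
    queue = deque(depends_on.get(system, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also protects against cycles
        seen.add(current)
        queue.extend(depends_on.get(current, []))
    return seen


def missing_from_plan(system, depends_on, planned):
    """Dependencies of `system` that the fail-over plan does not mention."""
    return all_dependencies(system, depends_on) - set(planned) - {system}


# Example: the plan only lists the two obvious, direct dependencies.
gaps = missing_from_plan("payroll-app", DEPENDS_ON, ["payroll-db", "auth-service"])
print(sorted(gaps))
```

In this made-up example the plan lists only the direct dependencies, so the check surfaces the storage array and directory server as gaps to close before the test.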
A good plan will also develop over time. Remember I mentioned in my last post that we are in a DevOps world now: systems and applications are changing so fast that we need to adapt our traditional thinking. No longer will you write a plan that lasts for 3 years because that was the lifecycle of a system. Your plan could be out of date in months or even weeks.
Testing is also where you can make the biggest gains in your DR planning to improve your RTO (Recovery Time Objective). Do you fail over to the real DR environment, or do you have a test DR system in cloud-based infrastructure that you can just ‘reset’ after testing? The second form of testing can be extremely cost-effective. Instead of failing over to your real DR environment, you fail over to a dummy set of systems that are just a copy you made yesterday. This way you can test small changes or new scenarios without impacting users and production functions. Think of it as a temporary ‘Test and Dev’ environment for your DR.
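To see where those RTO gains are hiding, it helps to time each step of a test fail-over and compare the total against the objective. Here is a rough sketch of that arithmetic. The step names, durations and the four-hour RTO are invented for illustration, and it assumes the steps run one after another rather than in parallel.

```python
from datetime import timedelta


def rto_report(step_durations, rto):
    """Summarise one test fail-over against the RTO.

    Assumes the steps are sequential, so the recovery time is their sum.
    """
    total = sum(step_durations.values(), timedelta())
    slowest = max(step_durations, key=step_durations.get)
    return {
        "total": total,
        "met_rto": total <= rto,
        "margin": rto - total,    # negative if the RTO was missed
        "slowest_step": slowest,  # the first place to look for savings
    }


# Hypothetical timings recorded during a test fail-over.
test_run = {
    "declare disaster": timedelta(minutes=15),
    "restore database": timedelta(minutes=95),
    "start applications": timedelta(minutes=40),
    "user verification": timedelta(minutes=30),
}

report = rto_report(test_run, rto=timedelta(hours=4))
print(report)
```

With these made-up numbers the run takes three hours against a four-hour RTO, and the database restore stands out as the step most worth shortening.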
While testing, it is critical to identify risks that could impede a successful fail-over. These can range from something as simple as not having the correct password to access a system to an application start-up procedure being out of date. Make sure that you mark these items for follow-up so they don’t slip through the cracks and become landmines later. Items like these are easy to resolve in testing, but during a DR event, when the clock is ticking on your RTO and MTO (Maximum Tolerable Outage), they are ten times harder.
While all this testing is taking place, it is a good idea to document the flow for each scenario. Most RTOs are missed because the processes and procedures are too generic and do not factor in the nuances of each scenario. It is important to factor these in to achieve your outcome; a one-size-fits-all approach does not work. You may not be able to cover every eventuality that will ever present itself, but you can at least cover 80%. Within this, I would expect a plan to cover, at a minimum, hardware, platform and site failures.
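A simple check can flag which of those minimum scenarios still lack a documented flow. This is just an illustrative sketch: the runbook structure, its steps and the coverage_gaps helper are hypothetical, and the three required scenarios are the ones named above.

```python
# The minimum scenarios a plan should cover: hardware, platform and site failures.
REQUIRED_SCENARIOS = {"hardware", "platform", "site"}


def coverage_gaps(runbooks, required=REQUIRED_SCENARIOS):
    """Return the required scenarios that have no documented steps, sorted by name."""
    return sorted(s for s in required if not runbooks.get(s))


# Hypothetical state of the documentation after a round of testing.
runbooks = {
    "hardware": ["Fail over to standby host", "Verify application health"],
    "platform": [],  # scenario listed, but the flow was never written down
}

print(coverage_gaps(runbooks))  # prints ['platform', 'site']
```

An empty list of steps counts as a gap here on purpose: a scenario that is named but has no documented flow is no better than one that is missing entirely.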
Remember that all of this is done in the hope we never need it. Like insurance, we hate paying for it, but when we need it, it pays for itself time and time again.
You’re not alone in the challenge that is BCDR planning and operations. If you need help and guidance or an entire strategy, please reach out to us.