Over the past year, I have been involved in numerous Disaster Recovery (DR) engagements, including reviews and implementations. When I start, my first question is NOT “What is your RTO/RPO?” (although that is important).
My first question is “What is your goal?” It is amazing how many blank stares I get from this question. The thought behind them is probably “Idiot, my goal is disaster recovery!” But I then explain that there are many different types of disasters and many types of disaster recovery options. The most entertaining response I get when I ask this question is “Well, someone told us we needed a disaster recovery solution.” Of course, this is normally in Florida during hurricane season …
My mind drifts off to my favorite Jimmy Buffett song – “Tryin’ To Reason With Hurricane Season”
Anyway, as a fun refresher, I thought I would open a discussion of WHY we do a Disaster Recovery implementation, NOT HOW we do a DR implementation. Of course, determining how to do a DR implementation is easy once we determine why we want a DR solution.
Seriously, most teams jump into the technology solution before considering the requirements or the goals of the solution. So let’s go and have some fun!
Goals of Disaster Recovery
Most people know me as an advocate of flexibility, and a DR implementation is no different. The company DR goals should be defined by the type and scope of the disaster. Some types of disasters are obvious, including natural and man-made disasters resulting in a “smoking hole”. However, other failures may also require a disaster declaration, or at least use of the DR environment to recover from an incident. The following graphic provides a focus on the goals of a disaster recovery solution.
Most DR implementations take an Enterprise focus, maintaining a secondary site for the production site. Most of the time, the disaster recovery site is a “lights out” location and is rarely tested. This implementation only satisfies a compliance person or a government compliance check box, which is important. However, how does this support the business objective or the customer’s experience?
The disaster recovery goal should focus on a) how does this support our customer experience and b) how does this allow us to drive business?
After all, the resources for disaster recovery are expensive. Also, the management of data centers and environments requires attention and the support of tools and personnel. Why not make these environments work for the company as well?
The following graphic represents a goal-oriented focus on the disaster recovery solution. As the goals become more focused and flexible, the beneficiary transitions from internal operations to client experience and business focus.
Expanding the goals of the disaster recovery implementation to include flexibility removes the self-serving goal of checking a box. I have reviewed DR solutions that force customer applications to fail over to disaster recovery because of a business application failure. Or worse, I have evaluated a client environment that isolates customers into separate environments, yet the DR plan requires all customers to transition to DR in the event of a single customer outage.
Isolating applications into groups supporting distinct applications or customer installations is great. Providing a “Site Switching” strategy for each application group is excellent and improves the customer experience and confidence! The isolation of applications, databases and incidents provides an effective solution for disaster recovery. Moving a mountain during a disaster gets noticed, but causing a customer a multiple-day outage due to a deleted table gets noticed more.
As a manager of operations for a workforce staffing company, I recall an incident with PeopleSoft. A very capable person was performing an upgrade of a PeopleSoft application and mistakenly deleted the Vendor table. Our ability to quickly rebuild this table in our DR location and place it back in production not only saved our butts, but also saved the company the 2 to 3 days of embarrassment it would otherwise have taken to rebuild the table.
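To make the idea of isolated application groups concrete, here is a minimal sketch of scoping a site switch to only the group that owns the failed application. The group names, application names, and the `AppGroup`, `groups_to_switch`, and `switch_site` helpers are all hypothetical, invented for illustration; a real switch would involve far more than flipping a field.

```python
from dataclasses import dataclass


@dataclass
class AppGroup:
    """A hypothetical isolated group of applications that can switch sites on its own."""
    name: str
    applications: set[str]
    active_site: str = "primary"


def groups_to_switch(groups: list[AppGroup], failed_app: str) -> list[AppGroup]:
    """Return only the groups that contain the failed application."""
    return [g for g in groups if failed_app in g.applications]


def switch_site(group: AppGroup, target_site: str = "dr") -> None:
    """Stand-in for the real site switch; here it only records the new active site."""
    group.active_site = target_site


# Illustrative inventory: two customer groups and one internal business group.
groups = [
    AppGroup("customer-a", {"a-portal", "a-db"}),
    AppGroup("customer-b", {"b-portal", "b-db"}),
    AppGroup("back-office", {"hr", "vendor-db"}),
]

# A back-office failure moves only the back-office group to DR.
for group in groups_to_switch(groups, "vendor-db"):
    switch_site(group)

for group in groups:
    print(group.name, group.active_site)
```

The point of the sketch is the scoping question, not the mechanics: a business application failure should never force customer groups to move.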
Disaster Recovery Testing Frequency
Testing a disaster recovery cut-over requires coordination and resources. I am often asked “How often should our company perform a DR test?” The question is simple but, most of the time, misguided.
A disaster is not convenient or forgiving. In operations at the time of a disaster, everything is on fire and everyone is yelling. Only experience and muscle memory provide the difference between minutes and days.
Testing DR is not about proving the ability to move applications from one point to another; that is done daily. The POINT of testing DR is to give the individuals experience and a practiced process.
In one of my favorite movies – The Last Samurai – Tom Cruise’s character challenges a soldier to shoot him while being attacked. Of course the soldier panics and fails the challenge and the movie continues. But, the point is clear – things change in times of stress!
The following graphic provides a guideline for when to test the switchover to improve the efficiency of the team.
As the testing flows down the graphic, the team’s efficiency also improves. As we test the common Site Switching on a monthly or quarterly basis, the experience of the supporting teams increases as well. Therefore, the testing of the Enterprise cutover would encompass smaller, flexible site transitions, which are tested frequently.
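A testing cadence like this can be tracked with very little tooling. The sketch below flags which drill scopes are overdue. The monthly interval for Site Switching comes from the text above; the yearly interval for the Enterprise cutover, the scope names, and the sample dates are my assumptions for illustration only.

```python
from datetime import date, timedelta

# Days allowed between drills for each scope.
DRILL_INTERVAL_DAYS = {
    "site_switch": 30,          # monthly, as suggested in the text
    "enterprise_cutover": 365,  # assumption: yearly, not stated in the text
}


def next_due(scope: str, last_tested: date) -> date:
    """Date by which the next drill for this scope should happen."""
    return last_tested + timedelta(days=DRILL_INTERVAL_DAYS[scope])


def overdue_drills(last_tested: dict[str, date], today: date) -> list[str]:
    """Return the scopes whose next drill date has already passed."""
    return sorted(
        scope for scope, tested in last_tested.items()
        if next_due(scope, tested) < today
    )


# Hypothetical drill history.
history = {
    "site_switch": date(2024, 1, 10),
    "enterprise_cutover": date(2023, 6, 1),
}

overdue = overdue_drills(history, today=date(2024, 3, 1))
print("Overdue drills:", overdue)
```

Adjust the intervals to whatever your own team can sustain; the value is in having someone look at the list regularly.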
Can your team perform a disaster recovery test? Can your team DEFINE a disaster for your operations? If not, you have some homework!
Better question: when was the last time you checked your spare tire? When was the last time you changed a tire? Your tire will not go flat when it is convenient; it will go flat when it is raining and you are on the freeway.
My last story … when I was a director of operations, I managed a team of DBAs and system administrators. One DBA at a time rotated from application work to triage, and during the triage rotation they handled the weekly issues and requests. In the middle of this stressful period, I would write the name of a database, a date and a time on a piece of paper and put it on their desk. They would have to recover the database to that date and time while they were doing triage. The process was meant to keep them sharp, not to test their abilities.
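If you want to try the piece-of-paper drill yourself, the slip is easy to generate. This is a minimal sketch that picks a random database and a random recovery target inside a backup retention window. The database names, the 7-day retention window, and the `make_drill` helper are assumptions for illustration; I did this by hand.

```python
import random
from datetime import datetime, timedelta


def make_drill(databases, now, retention_days=7, rng=None):
    """Pick a database and a recovery target time within the retention window."""
    rng = rng or random.Random()
    database = rng.choice(databases)
    # Choose a whole number of minutes back, from 1 minute to the full window.
    minutes_back = rng.randint(1, retention_days * 24 * 60)
    target = (now - timedelta(minutes=minutes_back)).replace(second=0, microsecond=0)
    return database, target


# Hypothetical inventory and a fixed clock so the example is repeatable.
databases = ["hr_prod", "vendor_prod", "billing_prod"]
now = datetime(2024, 3, 1, 9, 0)

database, target = make_drill(databases, now, rng=random.Random(42))
print(f"Recover {database} to {target:%Y-%m-%d %H:%M}")
```

Drop the result on the triage desk at a busy moment, not a quiet one. That is the whole exercise.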