Achieving Disaster Recovery SLAs in Minutes with Cohesity

By Jon Hildebrand • April 25, 2019

When it comes to disaster recovery, nearly everyone agrees that it is complicated. Many organizations struggle to get basic backup and recovery in place, let alone take the next step in protecting the business. Recently, at Cloud Field Day 5, we announced workflows on the Cohesity platform that can be used for disaster recovery, depending on the required recovery SLA. Unfortunately, we were unable to discuss the option with the shortest recovery SLA, which we’ve called Failover/Failback, so I’ll cover it in today’s blog.

Here Is What You Need to Know

An overview of Failover and Failback using Replication

Let’s Get Started!

The process differs depending on whether we are protecting a virtual machine or a Cohesity view. In this case, we are going to focus on a virtual machine object. To allow for Failover and Failback of a protected virtual machine, two tasks need to be performed:

  •  The first is to install the Cohesity agent onto the virtual machine that will be protected. While we will not protect the virtual machine with the agent, it is necessary for the Failback operation.
  •  The second is to enable Cloud Migration in the Advanced settings of a protection job. Enabling this option will force the protection job to communicate with the agent that has been previously installed on the virtual machine and it will verify the agent is running.
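The two prerequisites above can be tracked with a simple checklist. The sketch below is only an illustration: `ProtectedVm` and `missing_prerequisites` are hypothetical names made up for this example, not part of any Cohesity API, and the flags would have to be filled in by hand or from your own inventory.

```python
from dataclasses import dataclass


@dataclass
class ProtectedVm:
    """Hypothetical record of one protected virtual machine (not a Cohesity type)."""
    name: str
    agent_installed: bool          # Cohesity agent present in the guest OS
    cloud_migration_enabled: bool  # Cloud Migration set on its protection job


def missing_prerequisites(vm: ProtectedVm) -> list[str]:
    """Return the Failover/Failback prerequisites the VM does not yet meet."""
    missing = []
    if not vm.agent_installed:
        missing.append("install the Cohesity agent on the virtual machine")
    if not vm.cloud_migration_enabled:
        missing.append("enable Cloud Migration in the protection job's Advanced settings")
    return missing


# Example: the agent is installed, but Cloud Migration has not been enabled yet.
vm = ProtectedVm(name="app-server-01", agent_installed=True, cloud_migration_enabled=False)
for task in missing_prerequisites(vm):
    print(f"{vm.name}: {task}")
```

An empty list means both tasks are done and the virtual machine is a candidate for Failover and Failback.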

Enabling the Cloud Migration feature of a Protection Job on a Virtual Machine

During a typical protection job run, the initial backup to the local source occurs first. Once the data is properly indexed, the Cohesity system begins replicating the data to the second source. This step is heavily dependent upon bandwidth, and the time to complete the replication task can vary greatly, especially if the second source is in a public cloud. Once replication completes, the second source lists a new protection job with the same name as the one on the first source. The newly replicated protection job will have an Inactive flag enabled, as shown below:

The Inactive flag appearing on a replication Cohesity Protection Job
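To make the bandwidth dependence of the replication step concrete, here is a rough back-of-the-envelope estimate. It is a sketch under simplifying assumptions: the function name, the 500 GB size, and the link speeds are illustrative, and the calculation ignores deduplication, compression, incremental transfers, and protocol overhead, all of which will change real-world numbers.

```python
def estimate_transfer_seconds(data_bytes: float,
                              link_bits_per_second: float,
                              efficiency: float = 1.0) -> float:
    """Estimate the time to move data_bytes over a link of the given speed.

    efficiency is the fraction of the link actually usable (0 < efficiency <= 1).
    """
    if link_bits_per_second <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    return data_bytes * 8 / (link_bits_per_second * efficiency)


GB = 10**9
data_to_replicate = 500 * GB  # illustrative amount of data to replicate

for label, bits_per_second in [("1 Gbps", 10**9), ("100 Mbps", 10**8)]:
    hours = estimate_transfer_seconds(data_to_replicate, bits_per_second) / 3600
    print(f"{label}: {hours:.1f} hours")
# 1 Gbps: 1.1 hours
# 100 Mbps: 11.1 hours
```

Even this simplified model shows a tenfold difference between the two links, which is why replication to a public cloud over a constrained connection deserves attention when planning a recovery SLA.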

By selecting Failover in the UI, an administrator can begin the process of selecting a new virtualization environment or public cloud environment. Once the source is selected, we will run through what is known as a CloudSpin operation. The CloudSpin operation will convert the workload to the appropriate environment’s machine type and power on the virtual machine.

After adding the converted virtual machine to the new environment, an administrator needs to configure a new protection job to enable backups of the new device. However, unlike in the primary environment, this protection job needs to be configured as a physical device. The reason for this configuration has to do with using the Cohesity agent. At this time, CloudSpin is unable to convert public cloud virtual machine formats to on-premises virtual machine formats. To ensure we can continue to protect the data, we use the Cohesity agent we installed into the operating system of the virtual machine.

In the event of a disaster, the virtual machine is now available, and its data is available on the new source.

Time for Failback

How do we get this data back once the primary site can be recovered? At our secondary site, we configure replication to a Cohesity cluster (new or existing) at the primary site. We configure a protection job to replicate the backup data, and we follow the same procedure for creating a virtual machine in our primary site virtualization environment.

Now It’s Your Turn!

Many enterprises know first-hand how complex disaster recovery can be, and many have not implemented it for exactly that reason. By utilizing Cohesity and the cloud, we believe you can simplify disaster recovery while providing multiple recovery SLA options. We help reduce Failover and Failback times from hours to minutes, so that your disaster recovery plan fits your business and application recovery needs.