Disasters in a Hadoop environment have various origins: a major natural disaster that takes out an entire data center, an extended power outage that makes the Hadoop platform unavailable, a DBA accidentally dropping an entire database, an application bug corrupting data stored on HDFS, or worse, a cyberattack. No matter the cause, proper Hadoop disaster recovery mechanisms must be in place before an application rolls out to production, so that data is protected in each of these scenarios.
The right mechanism for protecting this data depends on several factors, most notably the application's recovery point objective (RPO)—how much data loss is tolerable—and recovery time objective (RTO)—how quickly the application must be back online.
Multiple replicas in Hadoop are a great way to protect against hardware failures such as a disk drive or server failure. However, replication within a single cluster does not protect against natural disasters, human error, application corruption, or cyberattacks. One or more additional protection mechanisms will have to be put in place to cover these scenarios.
So why not just use synchronous data replication to protect against a data center failure? There are serious trade-offs to weigh before deploying an active synchronous data replication solution for Hadoop disaster recovery.
In the real world, very few applications (particularly transactional applications) require this type of stringent RPO and RTO. If your application is one of those few critical applications, active replication may make sense, but it comes with its own limitations and cost considerations, noted below.
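The RPO difference between the two approaches can be made concrete with a bit of arithmetic. This is an illustrative sketch, not a measurement; the replication interval below is an assumed example value.

```python
# Sketch: worst-case recovery point objective (RPO) under scheduled
# asynchronous replication. The interval is an assumed example value.

replication_interval_min = 15  # assumed: changes shipped every 15 minutes

# If the primary site fails just before the next replication cycle runs,
# up to one full interval of changes can be lost.
worst_case_rpo_min = replication_interval_min

# Synchronous replication targets an RPO of ~0, because every change is
# acknowledged by the remote cluster before the application proceeds.
sync_rpo_min = 0

print(worst_case_rpo_min)  # 15
print(sync_rpo_min)        # 0
```

For most applications, a bounded, non-zero RPO like this is an acceptable trade for the cost and performance issues described next.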
Synchronously replicating data will negatively impact your application's performance. Every change made on the production system must be transmitted to and acknowledged by the remote Hadoop cluster before the application can proceed with the next change. The size of the impact depends on the network connectivity between the two clusters, which will most likely be a slower wide area network (WAN) link.
Synchronous data replication solutions require software to be installed on the production Hadoop cluster. This software intercepts all writes to the file system, which can destabilize the production system and therefore requires extensive testing before going into production. Also, any disruption on the WAN will bring your application to a halt, since data changes can no longer be transmitted to the remote cluster or acknowledged. This can result in downtime and disruption to your production applications.
With active real-time replication, all changes (temporary or permanent) are sent over the network to the remote Hadoop cluster. This places significantly more load on the WAN than an asynchronous replication approach, which transmits far less data over the network.
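The WAN-traffic difference comes from how Hadoop jobs behave: they often write large volumes of intermediate data that is deleted before the job finishes. The sketch below uses a hypothetical workload with assumed data volumes to illustrate the gap.

```python
# Illustrative sketch: WAN traffic shipped by synchronous vs. asynchronous
# replication for a hypothetical Hadoop job. Volumes are assumed examples.

temp_writes_gb = 500   # assumed intermediate/shuffle data, deleted after the job
final_output_gb = 50   # assumed output data that actually survives

# Synchronous replication ships every change as it happens, including the
# temporary data that will shortly be deleted.
sync_wan_gb = temp_writes_gb + final_output_gb

# Interval-based asynchronous replication ships only the data still present
# at the end of the interval, so the deleted intermediate data never
# crosses the WAN.
async_wan_gb = final_output_gb

print(sync_wan_gb)   # 550
print(async_wan_gb)  # 50
```

Under these assumptions, synchronous replication moves about 10x more data across the WAN for the same surviving output.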
These solutions typically carry much higher hardware, software, and networking costs.
Human error, application corruption, and ransomware attacks are far more likely than a natural disaster taking out an entire data center. Protecting data against these likelier events should be a higher priority for an enterprise. An active disaster recovery solution will not protect data in these scenarios, since all changes (intentional or accidental) are propagated to the disaster recovery copy almost instantaneously.
Although real-time replication delivers the best possible RPO and RTO, it comes with limitations and considerations that need to be carefully thought through. Implementing an active Hadoop disaster recovery solution must be done in the context of the application's criticality to get the best return on investment. If not, it can result in unnecessary expenditure, affect the availability of the production Hadoop system, and consume excessive resources in managing the production Hadoop environment.
Watch this video to get deeper insights into the Cohesity solution for Hadoop backup and recovery.
Apache and Hadoop are trademarks of the Apache Software Foundation.