4 Reasons Your Existing Hadoop Backup & Recovery Strategy is Falling Short

By Jay Desai • November 14, 2019

The rapid adoption of Hadoop within the enterprise has resulted in the deployment of a number of haphazard, quick-fix Hadoop backup and recovery mechanisms. These primitive solutions usually come bundled with the Hadoop distributions themselves, though some are cobbled together by DevOps teams within organizations. While they may seem to work on the surface, they often put your data and organization at significant risk, particularly as your systems grow bigger and more complex. Any downtime or data loss (from failed recoveries) in the event of a disaster will severely impact your business in terms of reputation, costs, and/or time-to-market.

Digging deeper, the inadequacies of these solutions are better understood by examining the underlying misconceptions regarding Hadoop from a data protection perspective.

1. Relying on File System Replicas for Hadoop Backup and Recovery

Replicas are a great way to protect data against hardware failures (such as one or more nodes going down, or disk drives failing). However, they do not protect your data against the more common scenarios, where user errors (for example, a DBA inadvertently dropping a Hive table) or application bugs corrupt data in the database. Because every replica faithfully mirrors the source, a deletion or corruption is simply propagated to all copies.

A large technology company relied on HDFS's three-way replication to protect its data. A DBA accidentally deleted a 400-terabyte Hive table because of a typo. With no true backup in place, the company had to recreate the data from the source systems, which took four weeks of elapsed time and numerous engineering resources. By its estimates, the total cost of those resources and the associated downtime was $1.1M.

2. Using HDFS Snapshots

The Hadoop Distributed File System (HDFS) provides snapshot capabilities that create point-in-time copies of specific files and directories. While this may seem like a good data protection strategy, it has severe limitations as described below:

  • HDFS snapshots are file-level snapshots. As such, they do not work well with databases like Hive and HBase, because the associated schema definitions aren’t captured in the backups.
  • Since snapshots are stored on the same nodes as the data, a node or disk failure results in the loss of both the snapshots and the data being protected.
  • Recovering data is onerous as it requires someone to manually locate the files being recovered by combing through all the snapshots, rebuild any schemas pertinent to the time of recovery, and finally recover the data files.
  • Storing even a moderate number of snapshots will increase the storage requirements of the Hadoop cluster, thus limiting IT’s ability to go further back in time for the purposes of data recovery.
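To see why snapshot-based recovery is so onerous, consider the combing step: HDFS exposes each snapshot as a read-only copy under `<dir>/.snapshot/<name>`, and an operator has to search those copies by hand for the files to restore. A minimal Python sketch of that search, run against temporary local directories standing in for the HDFS layout (the snapshot names and file paths are illustrative, not from a real cluster):

```python
import os
import tempfile

def find_in_snapshots(snap_root, rel_path):
    """Scan every snapshot for a copy of rel_path.

    Mirrors the manual 'combing' step of snapshot recovery: locate
    every snapshot that still contains the file, then have a human
    decide which point in time is the right one to restore from.
    """
    hits = []
    for name in sorted(os.listdir(snap_root)):
        candidate = os.path.join(snap_root, name, rel_path)
        if os.path.exists(candidate):
            hits.append(candidate)
    return hits  # the operator still has to pick the right copy

# Simulate two snapshots; the deleted file survives only in the older one.
root = tempfile.mkdtemp()
snap = os.path.join(root, ".snapshot")
os.makedirs(os.path.join(snap, "s1"))
os.makedirs(os.path.join(snap, "s2"))
with open(os.path.join(snap, "s1", "part-00000"), "w") as f:
    f.write("data")

print(find_in_snapshots(snap, "part-00000"))  # one hit, from snapshot s1
```

Even this toy version only finds the data files; rebuilding the Hive or HBase schema that matches that point in time is a separate, entirely manual step.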

3. Writing Custom DevOps Scripts for Hadoop Backup and Recovery

Many organizations with in-house DevOps teams resort to writing custom scripts for backing up their Hive and HBase databases and their HDFS files. Often, several person-months are spent writing and testing these scripts to make sure they will work under all scenarios.

The scripts need to be periodically updated to handle larger datasets, upgrades to the Hadoop distribution, and any other non-trivial changes to the data center infrastructure. Like snapshots, scripts only make copies of data; recovery remains a completely manual process and is therefore just as onerous and error prone as with the snapshot approach. Unless tested regularly, scripts can also result in data loss, particularly if the DevOps team that wrote them is no longer around.

A retail organization had written scripts to back up its Hive and HBase databases. Although the scripts had to be run manually, failed frequently, and required regular changes, the process seemed to be working until a data-loss incident. When the retailer tried to recover from its backups, it discovered that the backup script had been failing silently: runs were reported as successful even though no usable backups were being produced. The backups failed the organization when it needed them most, resulting in data loss.
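The silent-failure mode in stories like this often comes down to a wrapper that never checks the exit status of the copy command it launches. A minimal Python sketch of the anti-pattern and its fix, using plain `cp` as a stand-in for the actual HDFS copy (the function names and paths are hypothetical):

```python
import subprocess

def naive_backup(src, dest):
    # Anti-pattern: launch the copy but ignore its exit status,
    # so the job reports success even when the copy failed.
    subprocess.run(["cp", src, dest])
    return "SUCCESS"

def checked_backup(src, dest):
    # Fix: propagate the child's real exit status so monitoring
    # (and the operator) can see that the backup actually failed.
    result = subprocess.run(["cp", src, dest],
                            capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"backup of {src} failed: "
                           f"{result.stderr.strip()}")
    return "SUCCESS"

# A source path that does not exist makes the copy fail,
# yet the naive wrapper still reports success.
print(naive_backup("/no/such/table", "/tmp/backup"))  # prints SUCCESS anyway
```

Checking exit codes is only the first step, of course; a backup is not truly verified until a restore from it has been tested.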

4. Using Backup Tools from Hadoop Distributions

Commercial Hadoop distributions come packaged with backup tooling. These tools provide only basic capabilities and may not meet an organization’s recovery point objective (RPO) and recovery time objective (RTO). They are primarily a user interface on top of HDFS snapshots, so all of the limitations associated with HDFS snapshots mentioned above apply here as well. They generally offer no easy recovery mechanism, so recovery remains manual and error prone.

A Solid Hadoop Backup and Recovery Strategy

As Hadoop-based applications and databases become more critical, organizations need to take a more serious look at their recovery strategies for Hadoop. A proper, well-thought-out Hadoop backup and recovery strategy is needed to ensure that data can be recovered reliably and quickly, and that backup operations do not consume excessive engineering or DevOps resources.

A modern Hadoop backup and recovery solution must:

  • Completely eliminate the need for scripting
  • Be fully automated, with no need for dedicated resources
  • Require very little Hadoop expertise
  • Be extremely reliable and scalable to manage petabytes of data
  • Meet internal compliance requirements for RPO and RTO
  • Protect data in case of ransomware attacks
  • Integrate with cloud storage to reduce costs
  • Preserve multiple point-in-time copies of data
  • Be designed with recovery in mind
  • Be data aware and able to deduplicate big data formats

Watch this video to get deeper insights into the Cohesity solution for Hadoop backup and recovery.