The rapid adoption of Hadoop within the enterprise has resulted in the deployment of a number of haphazard, quick-fix Hadoop backup and recovery mechanisms. These primitive solutions are usually bundled with the Hadoop distributions themselves or cobbled together by DevOps teams within organizations. While they may seem to work on the surface, they often put your data and organization at significant risk, particularly as your systems become bigger and more complex. Any downtime or data loss caused by a failed recovery in the event of a disaster will severely impact your business in terms of reputation, cost, and time-to-market.
The inadequacies of these solutions are best understood by digging deeper into the underlying misconceptions about Hadoop from a data protection perspective.
Replicas are a great way to protect data against hardware failures, such as one or more nodes going down or disk drives failing. However, they do not protect your data against the more common scenarios in which user errors (for example, a DBA inadvertently dropping a Hive table) or application bugs corrupt or delete data in the database, because every change, including a mistaken one, is faithfully applied to all replicas.
A large technology company relied on three Hadoop replicas to protect its data. A DBA accidentally deleted a 400-terabyte Hive table due to a typo. With no true backup in place, the company ended up recreating the data from the source, which took four weeks of elapsed time and numerous engineering resources. Per its estimates, the total cost of these resources and the associated downtime was $1.1M.
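To see why replication offered no protection here, consider a minimal sketch using the Hadoop FileSystem API (the file path and replication factor are hypothetical, not taken from the incident above): a single delete call removes the file from the namespace, and every replica of its blocks is reclaimed along with it.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationIsNotBackup {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical warehouse file, replicated three times across the cluster.
        Path table = new Path("/user/hive/warehouse/orders/part-00000");
        FileStatus status = fs.getFileStatus(table);
        System.out.println("Replication factor: " + status.getReplication()); // e.g. 3

        // One mistaken call (or one typo'd CLI command) removes the file's
        // metadata; the NameNode then schedules ALL replicas of its blocks
        // for deletion. Replication offers no way to get the data back.
        fs.delete(table, false);
    }
}
```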
The Hadoop Distributed File System (HDFS) provides snapshot capabilities that create point-in-time copies of specific files and directories. While this may seem like a good data protection strategy, it has severe limitations: snapshots reside on the same cluster as the data they protect, they only take care of making copies of the data, and recovery from them is a manual, onerous, and error-prone process.
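As a rough illustration of how HDFS snapshots are used (the directory and snapshot names below are hypothetical), the sketch creates a point-in-time snapshot through the standard HDFS API. Note that there is no built-in restore operation; recovery is a manual copy out of the hidden .snapshot directory, which is part of why it remains onerous.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class HdfsSnapshotSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/hive/warehouse"); // hypothetical directory

        // An administrator must first mark the directory as snapshottable.
        ((DistributedFileSystem) fs).allowSnapshot(dir);

        // Create a point-in-time snapshot. It lives inside the same cluster
        // and the same namespace as the data it is meant to protect.
        Path snapshot = fs.createSnapshot(dir, "before-schema-change");
        System.out.println("Snapshot created: " + snapshot);

        // There is no built-in "restore": recovery means finding the right
        // files under <dir>/.snapshot/before-schema-change/ and copying them
        // back by hand (or with distcp), then repairing Hive/HBase metadata.
    }
}
```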
Many organizations with in-house DevOps teams resort to writing custom scripts to back up their Hive and HBase databases and their HDFS files. Several person-months are often spent writing and testing these scripts to make sure they will work under all scenarios.
The scripts need to be updated periodically to handle larger datasets, upgrades to the Hadoop distribution, and any other non-trivial changes to the data center infrastructure. Like snapshots, scripts only take care of making copies of data; recovery is a completely manual process and remains just as onerous and error prone as it is with the snapshots approach. Unless they are tested regularly, scripts can also result in data loss, particularly if the DevOps team that wrote them is no longer around.
A retail organization had written scripts to back up its Hive and HBase databases. Although the scripts had to be run manually, failed frequently, and required regular changes, the process seemed to be working until a data-loss incident occurred. When the retailer tried to recover the data from its backups, it discovered that the backup script had been failing silently: the backups were being reported as successful when, in reality, they were failing. The backups failed the organization when it needed them most, resulting in data loss.
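The failure mode in this story is typical of homegrown backup scripts: the copy command's exit status is never checked, so failures get reported as successes. The sketch below, which assumes a distcp-based copy between hypothetical clusters, shows the kind of check that is easy to omit, and why even a correct exit status is not proof that the backup is restorable.

```java
import java.util.Arrays;
import java.util.List;

public class NaiveBackupDriver {
    public static void main(String[] args) throws Exception {
        // Hypothetical paths: copy the warehouse to a second cluster with distcp.
        List<String> cmd = Arrays.asList(
                "hadoop", "distcp",
                "hdfs://prod-nn:8020/user/hive/warehouse",
                "hdfs://backup-nn:8020/backups/warehouse/2024-01-15");

        Process p = new ProcessBuilder(cmd).inheritIO().start();
        int exitCode = p.waitFor();

        // The retailer's script skipped this check and always reported success.
        // distcp exits non-zero when the copy fails, so the status must be
        // propagated; even then, only a periodic test restore proves that the
        // copied data is actually recoverable.
        if (exitCode != 0) {
            System.err.println("Backup FAILED with exit code " + exitCode);
            System.exit(exitCode);
        }
        System.out.println("Backup copy completed; schedule a restore test to verify it.");
    }
}
```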
Commercial Hadoop distributions come packaged with backup tools. These tools provide only basic backup capabilities and may not meet an organization's recovery point objective (RPO) and recovery time objective (RTO). They primarily provide a user interface on top of HDFS snapshots, so all of the limitations associated with HDFS snapshots mentioned above show up here as well. These tools generally do not provide any easy recovery mechanism, so recovery continues to be manual and error prone.
As Hadoop-based applications and databases become more critical, organizations need to take a more serious look at their recovery strategies for Hadoop. A proper, well-thought-out Hadoop backup and recovery strategy is needed to ensure that data can be recovered reliably and quickly, and that backup operations do not consume excessive engineering or DevOps resources.
Watch this video to get deeper insights into the Cohesity solution for Hadoop backup and recovery.