As monolithic systems reached their limits, they began to be replaced with scale-out distributed systems. The trend started two decades ago in compute, when mainframes were replaced with server farms, and then made its way to storage (databases, file systems). In the database world, the Relational vs. NoSQL debate has been raging for some time.
Today, I want to talk to you about distributed data storage platforms and consistency models. The consistency model is a very important property to consider when planning your storage infrastructure.
Let’s start with some basics. In a distributed system, the assumption is that individual nodes will fail. The system must be resilient to node failures. Therefore, the data must be duplicated across multiple nodes for redundancy.
In this context, let’s ask the following question: “If I perform a write (or update) on one node, will I always see the updated data across all the nodes?”
It seems like an innocuous question, and everyone would answer in the affirmative: “Duh, of course!” But not so fast.
This is in fact a hard problem to solve in distributed systems, especially while maintaining performance. Systems that make this guarantee are called “strictly consistent.” Many systems, however, take the easy way out and provide only eventual consistency.
Let’s define eventual consistency vs strict consistency.
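As a rough illustration of the two definitions, here is a minimal Python sketch of the write paths. This is a toy model, not any vendor’s actual implementation; the class and function names are invented for the example.

```python
class Replica:
    """A single storage node holding a copy of the data."""
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.online = True

def write_strict(replicas, key, value):
    # Strict consistency: acknowledge the client only after every
    # replica has durably applied the update.
    for r in replicas:
        if not r.online:
            raise IOError(f"replica {r.name} unavailable, write not acknowledged")
    for r in replicas:
        r.data[key] = value
    return "ack"

def write_eventual(replicas, key, value):
    # Eventual consistency: acknowledge after the first replica accepts
    # the write; the others are updated asynchronously, some time later.
    replicas[0].data[key] = value
    return "ack"

nodes = [Replica("a"), Replica("b"), Replica("c")]
write_strict(nodes, "k1", "v1")    # visible on every node once acked
write_eventual(nodes, "k2", "v2")  # visible only on node "a" for now
```

The key difference is what an acknowledgment means: under strict consistency, “ack” implies every copy is up to date; under eventual consistency, it only implies one copy exists, and a read from another node may return stale data or nothing at all.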
The following video demonstrates the process of eventual consistency.
Summary of the process
The following video demonstrates the process of strict consistency.
Summary of the process
Clearly, strict consistency is better because the user is guaranteed to always see the latest data, and data is protected as soon as it is written. The illustration below compares the two consistency models.
Why, then, doesn’t every system implement strict consistency? Mostly because the implementation of strict consistency can significantly impact performance, specifically latency and throughput. How much of an impact depends on the scenario.
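To see where the latency cost comes from, consider that a strictly consistent write cannot be acknowledged until the slowest replica has confirmed it, while an eventually consistent write waits for only one node. A toy model, with made-up per-replica latency figures:

```python
def strict_ack_latency(replica_latencies_ms):
    # The client waits for every replica, so the slowest one gates the write.
    return max(replica_latencies_ms)

def eventual_ack_latency(replica_latencies_ms):
    # The client waits only for the first (local) replica.
    return replica_latencies_ms[0]

# Hypothetical persist + network round-trip times for three replicas.
latencies = [2.0, 3.5, 9.0]
print(strict_ack_latency(latencies))    # 9.0 ms, gated by the slow node
print(eventual_ack_latency(latencies))  # 2.0 ms
```

One slow or overloaded node stretches every strictly consistent write, which is why well-designed systems work hard to keep tail latencies low rather than abandoning the guarantee.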
Strict consistency isn’t always required, and eventual consistency may suffice in some use cases. For example, suppose an item is added to a shopping cart just as a datacenter node fails. It is not a disaster for the customer to add that item again, so eventual consistency would be sufficient in this case.
However, you wouldn’t want this happening to your bank account with a deposit you just made. Having your money vanish because a node failed is unacceptable. Strict consistency is required in financial transactions.
In enterprise storage, there are cases where eventual consistency is the right model, such as the example of cross-site replication. But in the vast majority of cases strict consistency is required. Let’s look at a few examples where strict consistency is needed.
It so happens that one of the leading scale-out file storage systems provides only eventual consistency. The data is written to only one node (on NVRAM) and acknowledged. An enterprise customer once explained to me that under heavy load, a node may get marked offline. Effectively it’s down, resulting in clients getting “File-Not-Found” errors for files they had successfully written just a few seconds prior. This wreaks havoc on their applications.
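The failure mode that customer described can be reproduced with a small simulation. This is a hypothetical sketch of the behavior, not the vendor’s code: a write acknowledged by a single node is invisible to readers the moment that node goes offline.

```python
class FileNode:
    """One node in a scale-out file cluster."""
    def __init__(self):
        self.files = {}
        self.online = True

def single_copy_write(nodes, path, data):
    # Eventual consistency: ack as soon as one node has the data.
    nodes[0].files[path] = data
    return "ack"

def read(nodes, path):
    for n in nodes:
        if n.online and path in n.files:
            return n.files[path]
    raise FileNotFoundError(path)

cluster = [FileNode(), FileNode(), FileNode()]
single_copy_write(cluster, "/reports/q3.csv", b"...")
cluster[0].online = False  # node marked offline under heavy load
# read(cluster, "/reports/q3.csv") now raises FileNotFoundError,
# even though the write was acknowledged seconds earlier.
```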
Next-generation scale-out backup solutions provide instant VM recovery from backup. Such solutions boot VMs from a copy of the backup image on the backup system. The backup system serves as primary storage for the duration of the recovery, until the data can be moved back to the original datastore using Storage vMotion. The advantage is clear: you are back in business ASAP.
However, many scale-out backup solutions only provide eventual consistency for the writes. Consequently, if a failure happens on the recovery node, the application fails and the system loses live production VM data.
With strict consistency, users always see the latest data, and data is protected as soon as it is written. As a result, application availability/uptime and zero data loss are guaranteed, even if the infrastructure fails.
These considerations of scale-out file storage and instant recovery from backup should be top of mind as you design your backup environment.
There are risks and problems that consistency models pose to any organization using traditional, or modern, data-protection and recovery solutions. Unfortunately, there is a tremendous lack of awareness and understanding about this topic.
Vendors offer both traditional and modern data protection and recovery solutions. Many provide quick restore of a VM or data with a feature often called Instant Recovery. The goal is to minimize downtime and meet the Recovery Time Objective (RTO).
However, the restoration workflow and implementation is different depending on the vendor and customer’s infrastructure.
A series of recovery steps, manual or automated, is performed to restore an environment such as VMware vSphere. Typically, the data protection and recovery solution, where a copy of the VM or data is stored, provides some form of storage abstraction, and vSphere provides the additional compute resources.
After the VM is recovered, it has to be migrated back to the primary storage platform. In vSphere, Storage vMotion is used to migrate the data over the network. It is possible to recover and instantiate a VM in minutes.
However, a full restore cannot complete in a few minutes if it means moving hundreds of gigabytes across a network. Depending on the size and capacity being transferred, the process can take a long time to complete. Restore times will depend on network bandwidth, interface saturation, and other factors.
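A back-of-the-envelope estimate makes the point. The numbers below (image size, link speed, efficiency factor) are illustrative assumptions, not measurements:

```python
def transfer_hours(size_gb, link_gbps, efficiency=0.7):
    # efficiency roughly accounts for protocol overhead and link contention
    gigabits = size_gb * 8
    return gigabits / (link_gbps * efficiency) / 3600

# Moving a 500 GB VM image back over a 1 Gb/s link:
print(round(transfer_hours(500, 1.0), 1))       # ~1.6 hours
# Even over a 10 Gb/s link it is roughly ten minutes:
print(round(transfer_hours(500, 10.0) * 60))    # ~10 minutes
```

So while the VM can be booted from the backup copy in minutes, the Storage vMotion back to primary storage runs for a long time, and the backup system is serving live production I/O for that entire window.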
This video illustrates the process of restoring a vSphere environment with vMotion using eventual consistency.
Summary of the process
This is not an acceptable outcome when you depend on a data protection and recovery solution as your insurance policy. The result – depending on the magnitude of the failure – can put a company out of business, or at the very least, cost someone their job.
This video illustrates the process of restoring a vSphere environment with vMotion using strict consistency.
The steps shown are what enterprises should expect, and demand, from their data protection and recovery solution when looking to leverage features such as Instant Recovery.
This video summarizes the information above and demonstrates the difference between strict and eventual consistency. It takes you through two scenarios step by step: the first covers backups with Oracle RMAN, and the second an instant restore for VMware.
This article explained why Cohesity implements strict consistency: it mitigates the risks of data inaccessibility and potential data loss due to node failure. Before you choose your data protection and recovery solution, ask questions about its underlying design principles.
With strict consistency, data written to Cohesity is acknowledged only after it is safely written to multiple nodes in the cluster, not just to one node. This avoids a single point of failure, which is critical to avoiding downtime and maintaining business continuity.
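That acknowledgment rule can be sketched as a quorum write. This is a generic sketch of the technique, with invented names, not Cohesity’s actual implementation:

```python
class Node:
    """One node in the cluster."""
    def __init__(self):
        self.files = {}
        self.online = True

def quorum_write(nodes, path, data, quorum=2):
    # Acknowledge only after at least `quorum` nodes have persisted the
    # data, so no single node failure can lose an acknowledged write.
    acks = 0
    for n in nodes:
        if n.online:
            n.files[path] = data
            acks += 1
    if acks < quorum:
        raise IOError("quorum not reached; write not acknowledged")
    return "ack"

def read(nodes, path):
    for n in nodes:
        if n.online and path in n.files:
            return n.files[path]
    raise FileNotFoundError(path)

cluster = [Node(), Node(), Node()]
quorum_write(cluster, "/db/backup.img", b"...")
cluster[0].online = False           # a node fails after the ack
read(cluster, "/db/backup.img")     # still succeeds: another copy exists
```

Contrast this with the single-copy acknowledgment discussed earlier: here, losing any one node after the ack still leaves a readable copy, which is exactly the guarantee an application depends on during an instant recovery.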
Cohesity’s DataPlatform features, capabilities, and modern architecture were designed to meet the requirements of businesses today – not two-to-three decades ago. Check out Cohesity SpanFS and learn more about how our file system provides you with the highest levels of performance, resilience, and scale.