Why Global Deduplication Matters

By Rawlinson Rivera • July 2, 2018

Given Gartner’s prediction of 800% growth in enterprise data within the next five years, storage space efficiency capabilities, such as deduplication, are critical to any enterprise storage platform.

To deal with the expected data growth, enterprises need storage products that deliver the highest level of space efficiency at an optimal cost. Deduplication is one of the key storage technologies enterprises rely on to achieve that efficiency and reduce infrastructure costs.

Why Global Deduplication Matters: More than a Checkbox

Deduplication is a space efficiency feature found in many enterprise storage products. However, because there is no standard way to measure it, data reduction ratios and cost savings are difficult to quantify when comparing vendors. It is rarely an apples-to-apples comparison.

This is mainly because vendors use different implementations and techniques that produce different results. I think it’s safe to say that, now more than ever, deduplication has become more than just another checkbox on a storage product’s feature list. It’s crucial for enterprises to understand a product’s implementation details in order to evaluate the advertised deduplication ratios and cost savings.

Key Characteristics of Data Deduplication

Deduplication is one of the most valued storage technologies in enterprise storage products because it’s designed to eliminate the need to store multiple copies of identical files. This reduces the amount of storage capacity required to store any given amount of data.

The value of deduplication is measured in a couple of ways (a quick calculation is sketched after this list):

  • Effectiveness of data reduction capabilities and maximum achievable deduplication ratios.
  • Overall impact on storage infrastructure cost and other data center resources and functions, such as network bandwidth, data replication, and disaster recovery requirements.
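
To make the first bullet concrete, here is a minimal sketch of how a deduplication ratio and the corresponding space savings are commonly calculated. The 5:1 figure is a made-up example, not a vendor benchmark:

```python
def dedupe_ratio(logical_bytes: float, physical_bytes: float) -> float:
    """Ratio of data written by clients to data actually stored on disk."""
    return logical_bytes / physical_bytes

def space_savings(logical_bytes: float, physical_bytes: float) -> float:
    """Fraction of capacity saved; 0.8 means 80% less storage consumed."""
    return 1 - physical_bytes / logical_bytes

# Hypothetical example: 10 TB of backup data reduced to 2 TB on disk.
logical, physical = 10_000, 2_000              # GB
print(dedupe_ratio(logical, physical))         # 5.0  -> a "5:1" reduction
print(space_savings(logical, physical))        # 0.8  -> 80% space savings
```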

Two traits of a deduplication implementation largely determine the level of data reduction efficiency and its overall value:

  • Algorithm: fixed-length vs. variable-length chunking.
  • Global vs. local deduplication: dedupe across multiple nodes vs. on a single node.

Explained a little further:

  • Fixed-length dedupe algorithms divide data into fixed-size chunks, compare them, and store only the unique chunks. Compared to the variable-length alternative, fixed-length dedupe is far less efficient, but easier to implement.
  • Variable-length dedupe algorithms use context-aware anchor points to divide data into chunks based on the characteristics of the data itself rather than a fixed size. This matters because with fixed-length dedupe, a small offset in the data set shifts every subsequent chunk boundary and can cause a significant loss of space savings (see the sketch after this list).
  • Global dedupe is the most effective approach to data reduction: it increases the dedupe ratio, which reduces the capacity required to store the data.
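
The offset problem mentioned above can be shown in a few lines. This is a toy sketch (tiny chunk size, plain SHA-256 fingerprints), not any vendor's actual algorithm:

```python
import hashlib

def fixed_chunks(data: bytes, size: int = 8):
    """Split data into fixed-size chunks (fixed-length dedupe)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def fingerprints(chunks):
    """Hash each chunk so duplicates can be detected by fingerprint."""
    return {hashlib.sha256(c).hexdigest() for c in chunks}

original = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
shifted = b"x" + original                      # a single byte inserted up front

# Every fixed-size boundary moves by one byte, so no chunks match anymore.
common = fingerprints(fixed_chunks(original)) & fingerprints(fixed_chunks(shifted))
print(len(common))                             # 0 -- all space savings are lost
```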

With global dedupe, once a piece of data has been written and acknowledged on one node, writes of the same data to any other node in the cluster are recognized as duplicates; the implementation references the existing copy instead of writing the data again.
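
A toy sketch of that idea, assuming a single fingerprint index visible to all nodes; this is illustrative only and not a representation of Cohesity's internal data structures:

```python
import hashlib

class GlobalDedupeIndex:
    """Toy cluster-wide fingerprint index shared by every node."""

    def __init__(self):
        self.owner = {}                        # fingerprint -> node holding the chunk

    def write(self, node: str, chunk: bytes) -> str:
        fp = hashlib.sha256(chunk).hexdigest()
        if fp in self.owner:
            # Duplicate: store only a reference to the existing copy.
            return f"{node}: duplicate, reference copy on {self.owner[fp]}"
        self.owner[fp] = node
        return f"{node}: unique, chunk stored"

index = GlobalDedupeIndex()
print(index.write("node1", b"block-D"))        # node1: unique, chunk stored
print(index.write("node2", b"block-D"))        # node2: duplicate, reference copy on node1
```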

Cohesity Global Variable-Length Deduplication and Compression

Cohesity utilizes a unique, variable-length data deduplication technology that spans an entire cluster, resulting in significant savings across the entire storage footprint. With variable-length deduplication, the chunk size is not fixed. Instead, the algorithm divides the data into chunks of varying sizes based on the data characteristics.

Because the chunk boundaries are determined by the data itself, the resulting variable-sized chunks deliver greater data reduction than fixed-size deduplication. The efficiency benefit of variable-length deduplication also compounds over time as additional backups are retained.

Cohesity also lets you decide whether data should be deduplicated inline (as the data is written to the system) or post-process (after the data is written to the system), so backup protection jobs can be tuned against backup windows. Cohesity additionally compresses the deduped blocks to further maximize space efficiency.
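
A rough sketch of the trade-off between the two modes (function names and structure are illustrative, not Cohesity's API): inline dedupe does the fingerprinting in the write path and minimizes I/O, while post-process dedupe keeps the ingest path fast and defers the work until after the backup window.

```python
import hashlib

def write_inline(chunks, store, index):
    """Inline dedupe: fingerprint each chunk and skip duplicates before writing."""
    for c in chunks:
        fp = hashlib.sha256(c).hexdigest()
        if fp not in index:
            index.add(fp)
            store.append(c)                    # only unique chunks ever hit disk

def write_post_process(chunks, store):
    """Post-process dedupe: land everything first, reclaim duplicates later."""
    store.extend(chunks)                       # fast ingest, nothing extra in the write path
    seen, unique = set(), []
    for c in store:                            # background pass, after the backup window
        fp = hashlib.sha256(c).hexdigest()
        if fp not in seen:
            seen.add(fp)
            unique.append(c)
    store[:] = unique                          # duplicates reclaimed

# Usage sketch
chunks = [b"A", b"B", b"A"]
inline_store, post_store = [], []
write_inline(chunks, inline_store, set())
write_post_process(chunks, post_store)
print(len(inline_store), len(post_store))      # 2 2 -- same end state, different timing
```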

The logical diagrams below illustrate the space efficiency capabilities of deduplication by comparing two different implementations:

Node Level Deduplication

Node level deduplication only maintains a single copy of block D; D2 is a pointer to D1. No dedupe is achieved for blocks A & B.

Cluster Level Deduplication

Cluster level deduplication only maintains a single copy of blocks A, B & D. A2, B2 & D2 are just pointers to A1, B1 & D1. This results in greater efficiencies in terms of utilization.

Cohesity’s global deduplication across all nodes in a cluster results in less storage consumed than the node-level deduplication used in several other data protection solutions.
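
Counting the blocks in the two diagrams makes the difference easy to see. The sketch below assumes a placement where block D's copies land on the same node while the second copies of A and B land on a different node, consistent with the captions above:

```python
# Block copies as written, before deduplication. Placement is an assumption
# for illustration: both copies of D land on node1, while the second copies
# of A and B land on node2.
node1 = ["A1", "B1", "D1", "D2"]
node2 = ["A2", "B2"]

def block(copy: str) -> str:
    return copy[0]                             # "A1" and "A2" are copies of block "A"

# Node-level dedupe: each node only sees its own blocks.
node_level = sum(len({block(c) for c in n}) for n in (node1, node2))

# Cluster-level (global) dedupe: one index spans every node.
cluster_level = len({block(c) for c in node1 + node2})

print(node_level)      # 5 blocks stored -- A and B are still duplicated across nodes
print(cluster_level)   # 3 blocks stored -- a single copy of A, B, and D
```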

The figure below illustrates the logic and functions performed as part of Cohesity’s global deduplication implementation.

Cohesity SpanFS creates variable-length chunks of data, which optimizes the level of deduplication regardless of the file type, and the deduplication spans the entire cluster for savings across a customer’s entire storage footprint. In addition to providing global data deduplication, Cohesity allows customers to decide whether their data should be deduplicated inline (when the data is written to the system), post-process (after the data is written to the system), or not at all.

This type of implementation is more complex, and more effective, than a fixed-length dedupe algorithm. It is worth highlighting that Cohesity’s global variable-length deduplication is implemented on a distributed system (a scale-out, shared-nothing cluster).

With this implementation, the dedupe algorithm inserts markers at variable intervals in order to maximize the ability to match data, regardless of the file system block size in use.
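
A minimal sketch of what content-defined markers buy you (a toy window hash with illustrative constants, not Cohesity's actual marker algorithm): because boundaries are chosen by the data rather than by offset, most chunks still match after the data shifts.

```python
import hashlib, random

WINDOW = 16      # bytes examined at each position (illustrative)
MASK = 0x3F      # one marker roughly every 64 bytes on average (illustrative)

def variable_chunks(data: bytes):
    """Cut a chunk wherever a small sliding-window hash hits the marker value."""
    chunks, start = [], 0
    for i in range(WINDOW, len(data)):
        digest = hashlib.sha256(data[i - WINDOW:i]).digest()
        if digest[0] & MASK == 0:              # marker found -> chunk boundary
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

random.seed(0)
data = bytes(random.getrandbits(8) for _ in range(4000))
shifted = b"\x00" * 5 + data                   # prepend five bytes

a, b = set(variable_chunks(data)), set(variable_chunks(shifted))
print(f"{len(a & b)} of {len(a)} chunks still match after the shift")
```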

Today, there aren’t many distributed storage systems with this type of implementation that are capable of delivering this combination of data efficiency and cost.

– Enjoy

For future updates about Hyperconverged Secondary Storage, Cloud Computing, Networking, Storage, and anything in our wonderful world of technology be sure to follow me on Twitter: @PunchingClouds