All’s Fair in Love and Distributed Storage

Intro

There has been a growing shift in the way applications use and consume storage. I don’t want to give a computer history lesson here, but suffice it to say that since the mainstream adoption of virtualization in the early 2000s, the demands on traditional storage arrays have changed. I/O has gone from single, monolithic streams to hundreds of smaller, disparate requests hitting the same array. Things had to change.

Distributed computing once lived mainly in university labs and in the startups of the late 90s, like Yahoo and Google. For storage, it meant spreading workloads across several controllers instead of having everything hit a single controller. The disks connected to each storage controller all participate in the same pool of storage. This paradigm lets many more workloads hit essentially the same pool of disk, and each of those workloads can get an even share of disk and controller resources.

The one thing that has lagged behind these changes is the set of tools and techniques for managing distributed storage. As systems and storage administrators, we found that our day-to-day tools weren’t geared for distributed systems. This series of blog posts looks to address just that: how to hack your system admin tools to work in a distributed systems era!

I do work at Cohesity, Inc., and all these posts will work with the Cohesity DataPlatform.

I hope you find these posts helpful!

Rsync – A Love / Hate relationship.

If you need to move a vast quantity of data between two Linux/*NIX systems, rsync can be your best friend. But making it go fast can be mind-crushingly painful. Harder still is parallelizing streams across several storage controllers. To go fast in a distributed system you need to accomplish two things: parallelize as much as possible, and push as many disk operations as you can. Thankfully, in Linux, we have a few different ways to accomplish this.

GNU Parallel

If you are OK with installing open source software on your servers, take a look at GNU Parallel. This simple utility can make life a lot easier. Below is a quick sample of how to run n instances of rsync with parallel:

rsync -avzm --stats --safe-links --ignore-existing --dry-run --human-readable $src $dest > /tmp/transaction.log
cat /tmp/transaction.log | parallel --will-cite -j $threads rsync -avzm --relative --stats --safe-links --ignore-existing --human-readable {} $dest >> /tmp/results.log

The first rsync line builds a list of the transactions that rsync needs to perform (it is a dry run, so nothing is copied yet). The second line reuses that output to drive the real rsync commands. The -j option sets the number of jobs parallel will run at once. Simple as that!
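One thing to keep in mind when reusing the dry-run log: with -v and --stats, rsync typically writes more than paths into it. The sketch below assumes the usual layout (a "sending incremental file list" header, one path per line, then a blank line followed by the statistics) and keeps only the file paths. The function name list_paths_from_dry_run is made up for this example, and the layout is an assumption, so check it against your own /tmp/transaction.log before relying on it.

```shell
# Hypothetical helper: print only the file paths from an rsync dry-run log.
# Assumes the log layout described above; other rsync messages are not handled.
list_paths_from_dry_run() {
    awk '
        /^sending incremental file list$/ { next }   # skip the header line
        /^$/                              { exit }   # blank line: statistics start here
        !/\/$/                            { print }  # keep files, skip directory entries
    ' "$1"
}

# Possible use in place of the plain cat:
#   list_paths_from_dry_run /tmp/transaction.log | parallel --will-cite -j $threads ...
```

Skipping directory entries is a design choice for this sketch: handing a directory to one of the parallel rsync jobs would make that single job copy the whole subtree on its own.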

Xargs, Or: How to Use What’s On the Truck

If you don’t like installing other software, or are under stricter change management for your servers, you can accomplish the same task using xargs to run the rsync processes in parallel. Here’s an example of the xargs way of doing it. Simply put, we are searching for files in a directory and sending them to rsync:

find . ! -type d -print0 | xargs -0 -P$THREADS -I% rsync -az % $DESTDIR/%

Using the -P option allows you to specify the number of rsync threads you’re generating, and then you just need to pass a destination directory at the end.
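One thing to watch with this approach: as far as I know, rsync will not create missing parent directories under the destination (newer releases have a --mkpath option for that), so files in nested subdirectories may fail on the first pass. Here is a minimal sketch that recreates the directory tree first and then fans out the copies. The function name parallel_copy and the COPY_CMD override are made up for this example; COPY_CMD defaults to "rsync -az" and exists only so you can preview or swap the copy command.

```shell
# Hypothetical wrapper around the find | xargs one-liner.
# Usage: parallel_copy SRC_DIR DEST_DIR [THREADS]
# DEST_DIR must be an absolute path because the function changes directory.
parallel_copy() {
    (
        src_dir=$1
        dest_dir=$2
        threads=${3:-5}
        cd "$src_dir" || exit 1
        # Recreate the directory tree so every file has a parent to land in.
        find . -type d -print0 | xargs -0 -I% mkdir -p "$dest_dir/%"
        # Launch up to $threads copies at a time, one file per invocation.
        # COPY_CMD is left unquoted on purpose so "rsync -az" splits into words.
        find . ! -type d -print0 |
            xargs -0 -P "$threads" -I% ${COPY_CMD:-rsync -az} % "$dest_dir/%"
    )
}
```

Setting COPY_CMD="echo rsync -az" before calling the function prints the commands instead of running them, which is a cheap way to check the paths before moving real data.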

We now have two different ways to run rsync in parallel at the process level, but we still need to spread those processes across all the storage controllers. For this trick, we can use a pair of nested for loops. This method works with either xargs or parallel. The example script takes two configuration files: the first contains a list of SOURCE directories and the second contains a list of DESTINATION directories. In our example we are using a single Cohesity View, and mounting the View to our server through all of our available VIPs:

export srcs=$1
export dests=$2
export threads=5
export loop_src=`cat $srcs`
export loop_dest=`cat $dests`
for src in $loop_src
do
    for dest in $loop_dest
    do
        echo "about to rsync $src to $dest"
        # Place your method of paralleling rsync here
        echo "completed the rsync of $src to $dest…moving on to next"
        # Drop the destination we just used so the next source gets the next one
        loop_dest=`echo $loop_dest | awk '{for (i=2; i<=NF; i++) print $i}'`
        echo "new list of dirs is $loop_dest"
        break
    done
done

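This is a pretty standard for loop construction, except we had to come up with a clever way to step through the destination directories. You can see this in the re-initialization of the loop_dest variable: after each source is handed off, the destination just used is dropped from the list, and the break sends the next source to the next destination.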
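As far as I can tell, once the destination list has been used up the inner loop has nothing left to iterate over, so any sources beyond the number of destinations would be skipped. If you have more sources than mount points, here is a minimal sketch of one way to wrap around to the first destination again. The function name assign_round_robin is made up for this example; it only prints source and destination pairs separated by a tab, and leaves the actual rsync call to you.

```shell
# Hypothetical helper: pair each source with a destination, round-robin.
# $1: file listing sources, $2: file listing destinations (one per line).
assign_round_robin() {
    if [ ! -s "$2" ]; then
        echo "no destinations listed in $2" >&2
        return 1
    fi
    awk '
        # First file read (destinations): remember each non-empty line.
        NR == FNR { if ($0 != "") dests[n++] = $0; next }
        # Second file read (sources): hand out destinations in order, wrapping with %.
        $0 != ""  { if (n == 0) exit 1; print $0 "\t" dests[i++ % n] }
    ' "$2" "$1"
}

# Possible use:
#   assign_round_robin sources.txt destinations.txt |
#   while IFS="$(printf '\t')" read -r src dest; do
#       echo "about to rsync $src to $dest"
#       # Place your method of paralleling rsync here
#   done
```

With three sources and two mount points, the third source lands back on the first mount point, so every VIP keeps getting work.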

And there you have it! A completely parallelized and fully distributed wrapper for rsync! For those following along, here’s the entire script utilizing the GNU Parallel method:

#!/bin/bash
#
# To use this script please have parallel installed
# In Ubuntu just run: sudo apt-get install parallel
# Once installed, create two text files:
# sources.txt should contain a line-separated list of the files you wish to rsync
# destinations.txt should contain a line-separated list of the Cohesity mount points
# Make a mount point for every VIP in your cluster and map them all to the same View
# Then just run this script: p_rsync sources.txt destinations.txt

export srcs=$1
export dests=$2
export threads=5
export loop_src=`cat $srcs`
export loop_dest=`cat $dests`

# check for Parallel
echo "checking for Parallel to be installed"
program="parallel"
if ! command -v "$program" > /dev/null 2>&1 ; then
    echo "$program is not installed"
    exit 1
fi

echo "parsing the SOURCE and DESTINATION strings to start the RSYNC in parallel"
# iterate over a list of Source directories and pass them off to Destinations one by one…
for src in $loop_src
do
    for dest in $loop_dest
    do
        echo "about to rsync $src to $dest"
        rsync -avzm --stats --safe-links --ignore-existing --dry-run --human-readable $src $dest > /tmp/transaction.log
        # The trailing & runs this transfer in the background so the next pair can start
        cat /tmp/transaction.log | parallel --will-cite -j $threads rsync -avzm --relative --stats --safe-links --ignore-existing --human-readable {} $dest >> /tmp/results.log &
        echo "launched the rsync of $src to $dest…moving on to next"
        # Drop the destination we just used so the next source gets the next one
        loop_dest=`echo $loop_dest | awk '{for (i=2; i<=NF; i++) print $i}'`
        echo "new list of dirs is $loop_dest"
        break
    done
done
# Wait for the background transfers to finish before exiting
wait

Written By

Greg Statton

Office of the CTO - Data & AI
