There has been a growing shift in the way applications use and consume storage. I don’t want to give a computer history lesson here, but suffice it to say that since the mainstream adoption of virtualization in the early 2000s, the demands on traditional storage arrays have changed. I/O patterns have gone from a single monolithic stream to hundreds of smaller, disparate requests hitting the same array. Things had to change.
Distributed computing mainly lived in university labs and in the startups of the late ’90s, like Yahoo and Google. For storage, this meant spreading workloads across several controllers instead of everything hitting a single controller. All of the disks connected to each storage controller participated in the same pool of storage. This paradigm allows many more workloads to hit what is essentially the same pool of disk, with each workload getting an even share of disk and controller resources.
The one thing that has lagged behind these changes is the tooling and techniques for managing distributed storage. As systems and storage administrators, our day-to-day tools weren’t geared for distributed systems. This series of blog posts looks to address just that: how to hack your sysadmin tools to work in a distributed systems era!
While I do work at Cohesity, Inc., and all of these posts will work with the Cohesity DataPlatform, the techniques should apply to most distributed storage platforms.
I hope you find these posts helpful!
If you need to move a vast quantity of data between two Linux/*NIX systems, rsync can be your best friend. But making it go fast can be mind-crushingly painful, and it’s harder still when you try to parallelize streams across several storage controllers. To go fast in a distributed system you need to accomplish two things: parallelize as much as possible, and push as many concurrent disk operations as you can. Thankfully, in Linux, we have a few different ways to accomplish this.
If you are OK with installing open-source software on your servers, take a look at GNU Parallel. This simple utility can make life a lot easier. Below is a quick sample of how to run n instances of rsync with parallel:
The first rsync command builds a list of the transfers rsync needs to make; the second line re-uses that output to drive the parallel rsync jobs. Running parallel with the -j option sets the number of threads parallel will execute. Simple as that!
If you don’t like installing extra software, or are under stricter change management for your servers, you can accomplish the same task using xargs to execute the rsyncs in parallel. Here’s an example of the xargs way of doing it. Simply put, we are searching for files in a directory and sending them to rsync:
Using the -P option allows you to specify the number of rsync threads you’re generating, and then you just need to pass a destination directory at the end.
Now we can parallelize the transfers in two different ways, but we still need to spread them across all the storage controllers. For this trick, we can use a series of nested for loops, and the method works with either xargs or parallel. The example script takes two configuration files: the first contains a list of SOURCE directories, and the second contains a list of DESTINATION directories. In our example we are using a single Cohesity View, mounting the View on our server once per available VIP:
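One way to sketch the round-robin pairing logic. The config-file contents here are fabricated stand-ins for the real SOURCE/DESTINATION lists (which the actual script takes as arguments), and the echo is a stand-in for launching the parallelized rsync:

```shell
# Hypothetical config files; the real script would receive them as $1 and $2.
src_cfg=$(mktemp); dest_cfg=$(mktemp)
printf '%s\n' /data/proj1 /data/proj2 /data/proj3 > "$src_cfg"
printf '%s\n' /mnt/view-vip1 /mnt/view-vip2 > "$dest_cfg"

# Read the destination mounts (one per VIP) into an array.
dests=()
while read -r d; do dests+=("$d"); done < "$dest_cfg"

loop_dest=0
pairs=()
while read -r src; do
    dest=${dests[$loop_dest]}
    pairs+=("$src -> $dest")   # here you'd launch the parallel rsync instead
    echo "$src -> $dest"

    # Re-initialize loop_dest so sources wrap back around the destinations.
    loop_dest=$((loop_dest + 1))
    if [ "$loop_dest" -ge "${#dests[@]}" ]; then loop_dest=0; fi
done < "$src_cfg"
```

With three sources and two destination mounts, the third source wraps back around to the first VIP, spreading the load evenly across controllers.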
This is a pretty standard loop construct, except we needed a clever way to wrap back around to the beginning of the destination directories. You can see this in the re-initialization of the loop_dest variable.
And there you have it! A completely parallelized and fully distributed wrapper for rsync! For those following along, here’s the entire script utilizing the GNU Parallel method: