Rsync: A Key File Management Tool¶
TL;DR¶
If you need to manage a lot of files, it is well worth your time to learn how to use the rsync command. This program is available on UNIX and Mac computers by default. It is increasingly available on Windows as the Windows Subsystem for Linux (WSL) becomes more common.
Introduction¶
JHPCE users store petabytes of data on our storage servers. Some of it accumulates as work is done. Much is imported from outside the cluster as source material for research. Moving around large numbers of files can be done with a variety of methods. Many users default to cp or mv.
All tools have limitations, most of which we, thankfully, don't encounter in day-to-day use. However, when manipulating large numbers of files and files in directory trees many levels deep, these issues can begin to surface. Some tools give you more confidence in the results than others. Some are misleadingly quiet despite failing or producing a result that doesn't match the source location.
The cp command is a good example of a familiar tool that can produce unexpected results. It supports recursion through the -R flag. Depending on the UNIX version, cp may or may not treat symbolic links or hard-linked files the way you expect.
What is the state of your copy if a cp or mv attempt fails mid-stream? In the middle of a file?
For example, consider what happens if a program has a limitation where it cannot handle file paths longer than 1024 characters. You would be unlikely to experience a problem using that program to copy files around unless they have paths like /stored/in/directory/trees/with/many/entries/YYYY-MM-DD-MM-SS/long-filename-with-details-embedded-in-its-name.dat
Such paths are common in the sciences and become more common as time passes and data accumulates.
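To get a feel for how long the paths in a tree really are, something like the following can help. It is a minimal sketch: the scratch tree is made up for the demonstration, and the 1024-character limit above is only a hypothetical figure, so compare the result against whatever limit your own tool documents.

```shell
# Build a small scratch tree; these paths exist only for this demo.
work=$(mktemp -d)
mkdir -p "$work/stored/in/directory/trees/with/many/entries"
touch "$work/stored/in/directory/trees/with/many/entries/long-filename-with-details-embedded-in-its-name.dat"

# Print the length and name of the longest relative path under the tree.
# (Paths containing newlines would confuse this simple pipeline.)
longest=$(cd "$work" && find . | awk '{ print length($0), $0 }' | sort -n | tail -n 1)
echo "$longest"
```

Replace the scratch directory with your own data directory to see its deepest path.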
Rsync¶
Rsync is a powerful command for copying large numbers of files between a SOURCE (aka SRC) and a DESTINATION (aka DEST), within a single computer or between computers.
Note
A key benefit of rsync is that it will copy only the material needed to make DEST match SRC. Most other programs, such as cp or scp, will always copy over all of SRC to DEST.
There are MANY arguments which control how the copying will occur. So you can do things like specify files to exclude, create a log of the actions taken, and provide statistics when done. Rsync uses a variety of tests to compare files and directories. We wrote this page because using rsync properly for complex situations takes some skill and explanation.
The most common flag used with rsync is -a for "archive". This provides a recursive copy, preserving symbolic links, timestamps, file permissions, user & group ownership, and device and special files. It does NOT preserve ACLs, extended attributes, or hard links; add -A, -X, and -H if you need those.
Use the appropriate resources¶
When transferring files within the cluster, please use compute nodes when copying more than a few files.
When transferring files into and out of the cluster, please use the transfer node. The various tools for that work are described in this document. The rsync information on this page can be useful for such transfers.
Example rsync SLURM batch job¶
Is shown here.
SOURCE and DESTINATION¶
Multiple items can be listed as SRC material. Of course there can only be one DEST.
Rsync never updates the SRC location. Changes, if any, only occur in the DEST.
Files in DEST but not SRC will not be deleted unless you specify one of the delete flags.
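SRC and DEST can each be a local path or a remote location written as a hostname, a colon, and a path. Only one side can be remote: rsync does not copy directly between two remote hosts. rsync will use ssh to send files between hosts, so you can also specify a different username for the remote side.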
Here are the different scenarios and their core syntax:
- Local copying:
rsync [OPTION...] SRC... DEST
- Access via remote shell:
- Pull:
rsync [OPTION...] [USER@]HOST:SRC... [DEST]
- Push:
rsync [OPTION...] SRC... [USER@]HOST:DEST
Rsync servers¶
One can configure an rsync server which is always listening for desired transactions. Very few people do that. JHPCE doesn't run any. This is being mentioned because some of the language in the manual page can be confusing if you don't know about this.
Mind the trailing slash!¶
If the SOURCE represents a directory then adding a trailing forward slash to it will cause the contents of the directory to be copied into DESTINATION. If there is no trailing forward slash, then the SOURCE directory itself will be copied into DESTINATION.
- Example
- If directory /some/source/place contains files a, b, and c
rsync -a /some/source/place/ /a/destination/location/
- will result in the contents of /a/destination/location/ being a, b, and c.
- If the command was instead
rsync -a /some/source/place /a/destination/location/
- the result will instead be the directory /a/destination/location/place, which contains a, b, and c.
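Some versions of cp behave the same way when using the -R recursive flag, notably the BSD cp shipped with macOS. GNU cp on Linux ignores a trailing slash on a source directory, so test on your own system before relying on it.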
Rsync Examples¶
It pays to be cautious when running commands which can cause many changes in short order.
Rsync can be used to compare two directory trees without updating anything. If both possibly have unique data, then you want to be careful and run preliminary --dry-run commands with a variety of informational flags.
Show what would be done but make no changes:
rsync -a --verbose --dry-run --stats /local1 /other2
Determine what would be done in more detail (but make no changes):
rsync -a --verbose --dry-run --stats --itemize-changes /local1 /other2
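Only copy files ending in ".txt". An --include by itself does not exclude anything, so directories must be included for rsync to descend into them and everything else must be excluded explicitly. Rules are applied in the order given:
rsync -avz --include='*/' --include='*.txt' --exclude='*' /src /dest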
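A good combination to consider using. It includes options that try to be complete, safe, and efficient. Because --delete-after removes files from DEST that are not in SRC, preview the result with --dry-run first. Substitute your own SRC and DEST:
rsync -avhAXH --progress --numeric-ids --sparse --one-file-system --stats --delete-after SRC/ DEST/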
Tip
To compare two directory trees, you can use a number of flags together to see what is different. Remember that rsync is unidirectional, so you would have to run it twice to compare in both directions. It might be true that each directory tree contains many duplicates but also some unique items. Direct the output into a text file for easier viewing and comparing.
These include:
- --dry-run: don't change anything
- --itemize-changes: show what would be modified about each file
- --delete-after: identify any extra files found in DEST that aren't in SRC (reported after the transfer pass rather than during it)
- --stats: produce a summary of file numbers
However, --itemize-changes produces a cryptic 11-character string for each file or directory. We have copied a useful chart from the Internet that you can consult. See this page for the key and this page for the manual page section.
Rsync Flags You May Want To Use¶
Rsync is very flexible. The manual page is long. Here are some useful arguments to know about, organized somewhat by purpose.
Most flags have two forms you can choose between: a short one consisting of a single character, or a readable form preceded by two hyphens. Long-form flags that take a value are typically followed by an equals sign and the value.
--archive, -a archive mode; equals -rlptgoD (no -H,-A,-X)
--acls, -A preserve ACLs (implies --perms)
--xattrs, -X preserve extended attributes
--hard-links, -H preserve hard links
--exclude-from=FILE read exclude patterns from FILE, one per line. there is a corresponding --include-from
--exclude={'*.txt','dir3','dir4'} bash brace expansion into one --exclude per pattern. there is a corresponding --include
--dry-run, -n perform a trial run with no changes made
--list-only list the files instead of copying them
--ignore-existing don't overwrite existing files, no matter what
--max-delete=NUM don't delete more than NUM files (SET TO 0 OR LOW NUMBER FOR SAFETY)
--one-file-system, -x ensure that you stay inside the SRC file system
--atimes, -U preserve access (use) times
--crtimes, -N preserve create times (newness)
--times, -t preserve modification times
--numeric-ids don't map uid/gid values by user/group name (more efficient)
--delete-after wait until the transfer has finished to process deletes, instead of deleting during it (the manual page describes --delete-delay as the more efficient way to defer deletions)
--partial keep partially transferred files so that an interrupted transfer can be resumed
--progress show which file is being copied (not stored in log file!!)
--itemize-changes, -i output a change-summary for all updates (the output is complex; see the Tip under Rsync Examples for links to a key)
--log-file=FILE log what rsync does to FILE; useful when using --itemize-changes
--verbose, -v increase verbosity
--info=FLAGS fine-grained informational verbosity
--human-readable, -h output numbers in a human-readable format
--stats provide statistics at the end
--size-only skip files that match in size (know what you're doing)
--ignore-times, -I don't skip files that match size and time (when in doubt...)
--checksum, -c skip based on checksums, not mod-time & size
--sparse turn sequences of nulls into sparse blocks
--bwlimit=RATE limit the impact of the rsync on the network (RATE is in units of 1024 bytes per second by default)
--chmod=CHMOD affect file and/or directory permissions
--usermap=STRING custom username mapping
--groupmap=STRING custom groupname mapping (STRING is not simply a group name; it is a comma-separated list of FROM:TO pairs)
--chown=USER:GROUP simple username/groupname mapping
Archive tools moving data through a pipe¶
Rsync is usually the best method. But rsync wasn't always available in the past, or it didn't support copying one or another attribute that more basic tools did, such as particular kinds of file permissions or data forks.
One method that can be quick and effective is to use an archive program like tar to create an archive file which is passed through an input/output pipe to the same program running in another directory that extracts files into that location. This technique can use available system memory as buffer space, leading to smoother flows of data as disk reads and writes occur with optimal amounts of bytes.
This example copies the named directory some-directory from the current working directory to another location. Because tar works with blocks of data 512 bytes in size, and by default groups only a small number of them into each read or write, a higher efficiency is achieved by telling it to use a larger blocking factor (here 20480 blocks, or 10 MiB) to reduce the overhead of doing input/output requests in such small sizes.
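tar -c -b 20480 -f - some-directory | (cd /destination/place && tar -x -b 20480 -f -)
The same technique works between hosts by running the extracting tar through ssh:
tar -c -b 20480 -f - some-directory | ssh myaccount@another-host "cd /destination/place && tar -x -b 20480 -f -"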
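The pipe technique can be tried safely on a scratch area first. This is a minimal sketch with made-up paths. The blocking factor of 2048 (1 MiB) is an assumption chosen only to keep the demo portable, since some tar implementations may reject very large values.

```shell
work=$(mktemp -d)
mkdir -p "$work/from/some-directory" "$work/to"
echo "payload" > "$work/from/some-directory/file.dat"

# -b 2048 means 2048 x 512-byte blocks, i.e. 1 MiB per read or write.
# Using && ensures tar only runs if the cd succeeded.
(cd "$work/from" && tar -c -b 2048 -f - some-directory) \
    | (cd "$work/to" && tar -x -b 2048 -f -)

# Confirm the file arrived intact.
cat "$work/to/some-directory/file.dat"
```

Joining cd and tar with && rather than a semicolon keeps the extraction from running in the wrong directory when the destination does not exist.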