BIAC Backup Strategies

All BIAC users should have plans to back up their data to protect against disk failures. This page outlines what data BIAC backs up automatically for you, and lists some strategies that you can use to back up the rest of your data.

BIAC Experiment Data Backup Policies

BIAC employs a multi-level data protection strategy, which offers users comprehensive backup and restore options. Though this is a service provided by BIAC, this does not preclude you from implementing your own backup or mirroring strategy, if you so desire. It is always a good idea to back up any and all data that you would be sorry to lose. See below for suggestions.

Automatic Archiving of Raw MR Data

All of the raw MRI files that come directly from the MRI scanners (DICOM and P-files) are automatically backed up to tape. These tapes are kept indefinitely. Note that this does not include the automatically reconstructed data that ends up in your Data directories – the idea is that these can be regenerated from the raw data if needed.

Automatic Nightly Mirror of Specific Experiment Subdirectories

BIAC also mirrors, to a separate (read-only) server, all data in experiment directories except for the Data and Analysis subdirectories. This mirror is updated nightly, and the mirror quota is limited to 20GB. Experiment directories that are on Munin are mirrored to \\Hill\Data\Mirror, and those on any other server are mirrored to \\Munin\Data\Mirror. The intent is to maintain copies of “important” files, including those (like analysis scripts) that can be used to regenerate other data. Accordingly, keep your behavioral data, analysis scripts, final figures, and anything else that you cannot (or would rather not) regenerate or restore on your own in one of the mirrored subdirectories (like Scripts or Stimuli). Note that a file you accidentally delete will disappear from the mirror at the next nightly mirror operation. However, several of BIAC's file servers offer snapshot capabilities that let you recover deleted files (or older versions of existing files).

User-Configurable Snapshots of Experiment Directories

You can request to have snapshots set up for your data directories. A snapshot is a point-in-time, read-only copy of the data in your directory, which only stores changed blocks of data in order to be more space-efficient. A snapshot rule specifies the schedule on which to take the snapshots and how many to save. You can have multiple rules for your data. This allows you to restore old versions of a file or a whole directory in case of accidental deletion or other unwanted changes. By default, there is only one snapshot being saved (taken each night at 11:00 PM). Send an email to help@biac.duke.edu to request changes to the snapshot rules for your data directories.

How to check your snapshots

Every data directory on Munin has a hidden snapshot directory (called ~snapshot). Inside you will find directories that correspond to your snapshots. Within those snapshot directories, you will find read-only copies of your data as it was when the snapshot was taken. You can restore those files simply by copying them to the desired location.
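As a minimal sketch of the copy-based restore described above, the following shell functions list the snapshots under a data directory and copy a file back out of one. It assumes the data directory is mounted on a UNIX-like machine; the function names and the snapshot name used in the usage line are hypothetical, so check the actual names inside your own ~snapshot directory.

```shell
#!/bin/bash
# Sketch: list snapshots and restore a file from the hidden ~snapshot directory.
# Function names are illustrative; snapshot names depend on your snapshot rules.

# List the snapshot directories available under a data directory.
list_snapshots() {
    local data_dir="$1"
    ls -1 "$data_dir/~snapshot"
}

# Copy a file (or directory) out of a snapshot to a destination of your choice.
# -R recurses into directories; -p preserves timestamps and permissions.
restore_from_snapshot() {
    local data_dir="$1" snapshot="$2" rel_path="$3" dest="$4"
    cp -Rp "$data_dir/~snapshot/$snapshot/$rel_path" "$dest"
}
```

For example, `restore_from_snapshot /path/to/Experiment SNAPSHOTNAME Scripts/run.sh ./run.sh.restored` would copy the older version next to the current one instead of overwriting it.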

If your data are on Hill or Fatt, you can right-click your data directory, choose Properties, and then open the Previous Versions tab to see your snapshots (also known as shadow copies on Windows).

Default Snapshot Rules

If you don't specify a snapshot policy, you will get the following default rules.

Snapshot every night at 11:00 PM. Save 4 snapshots.
Snapshot every Sunday night at 11:00 PM. Save 3 snapshots (i.e., 3 weeks).

Example Snapshot Rules

Snapshot every night at 11:00 PM. Save 14 snapshots (i.e., two weeks).
Snapshot every 4 hours. Save 6 snapshots (i.e., one day's worth).
Snapshot every Sunday at 6:00 AM. Save 12 snapshots (i.e., 12 weeks).
Snapshot on 1st of every month at 11:00 PM. Save 6 snapshots (i.e., 6 months).

Notes

The snapshots will take up space in your directory. If your directory gets close to full, the oldest existing snapshots will need to be deleted to free up space.
It may be tempting to save 6 months of snapshots, but that will mean that when you delete a file or directory, its space will not be freed for 6 months.
The more files change between snapshots, the more space will be taken up by the next snapshot. On the flip side, if nothing changes between snapshots, the snapshot will take up no space.
The schedule is actually a crontab entry, which is quite flexible. If you don't know what this means, just let us know in plain old English what schedule you want, and we'll get it sorted out. (If you are curious, see http://en.wikipedia.org/wiki/Cron)
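For the curious, the example snapshot rules above would look roughly like this as cron schedules. This is only a sketch assuming the standard five-field cron syntax; the exact form that the snapshot system accepts is an assumption, and the number of snapshots to keep is specified separately from the schedule.

```text
# minute  hour  day-of-month  month  day-of-week

# every night at 11:00 PM
0 23 * * *

# every 4 hours, on the hour
0 */4 * * *

# every Sunday at 6:00 AM
0 6 * * 0

# on the 1st of every month at 11:00 PM
0 23 1 * *
```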

Help! How do I back up my data?

NOTE: the advice presented here is informational and not comprehensive. It is intended to provide guidance to BIAC users in creating a backup plan. We provide no guarantees that the information or commands below are accurate or up-to-date.

There are many cost-effective external storage devices on the market that are suitable for backing up your data. Many manufacturers package software that can manage backups to the external storage device. Most external hard drives have USB interfaces, many have FireWire, and some have eSATA. On many systems eSATA is not hot-pluggable and the drive must be connected at boot time, so it is best suited to external media that you expect to leave connected 24/7. The interface and capacity you choose depend on the capabilities of the computer to which you expect to connect the device, the amount of data you need to back up, and the type of backups you intend to run. You may even wish to budget for double backups of your data (on separate drives) to protect against your backup devices failing.

Backup background and terminology

Several backup models are used. For the purpose of this discussion, we will define the following terms. Note that usage of these terms may vary widely.

Mirror

A mirror is a copy of the data. Generally, the mirror is updated periodically to match the current state of the source data. This includes removing files from the mirror that are no longer in the source.

Pros: If the mirrored data are stored on a separate server, the data are preserved even in the case of a source server failure. Restoring the data is as simple as copying from the mirror. The space required is the same as the size of your source data.
Cons: Data you have deleted on the source are deleted on the mirror at the next update time, not allowing for recovery of accidentally deleted files.

Snapshots

A snapshot is a read-only copy of a file system, made at an exact point in time. A snapshot comprises only the changed parts of files and resides on the same server as the source data. By saving multiple snapshots, one can provide a record of how files changed over time and allow for the recovery of earlier versions of files. Several BIAC servers offer this capability internally (but note that these internal snapshots are also at risk if the file server goes down due to hardware failure).

Pros: Accidentally deleted files are available for recovery if the files existed at a time when a snapshot was taken. In its simplest form this strategy is essentially the same as mirroring: at a minimum, you periodically copy the source data to a new location whose name uniquely identifies the date/time at which the snapshot was taken.
Cons: Without clever file system strategies (see Backup strategies for UNIX-based machines, below), the space required for multiple full snapshots is the sum of the sizes of the source data at all the times you take snapshots (i.e., N * size(source data) for N snapshots, if the size of the source data does not change over time).

Full + Incremental backup

Given a pre-existing full backup, an incremental backup only stores those files and directories that have changed since the last full (or incremental) backup. Essentially “snapshots of changes”, this is a time- and space-saving strategy, providing the same recovery benefits as snapshots, but the space required is related to the rate at which your data changes, rather than the size of the full data. This strategy has historically been used to minimize the amount of data transferred when backups are spread across multiple removable media such as tapes. There are several categories of incremental backup, ranging from file-level incremental backup (files which have changed are transferred in their entirety) to byte-level incremental backup (where only the bytes that have changed are stored), with block-level in between (storing changes at the granularity of blocks consisting of a fixed number of bytes).

Pros: Just like for snapshots, accidentally deleted data is available for recovery if the data existed when a (full or incremental) backup was taken. However, space (and time) required for incremental backups is typically much less than for full snapshots taken at the same frequency.
Cons: This can be a complicated strategy, requiring sophisticated software that is capable of doing the right thing. Also, restoring the data is more complicated than a simple backup, as multiple versions of a file (or parts of a file) may be stored in different locations.

Backup Strategies

Backup strategies for Windows

Robocopy is in the P:\Tools directory, and you can use it in a Command Prompt window much like rsync on Linux.

P:\TOOLS\ROBOCOPY \\Path\to\Data F:\Path\to\Backup /NFL /E /XO /R:5 /TEE /LOG:U:\Path\to\Logfile\backup.log

The above command copies the source location (\\Path\to\Data) to the backup destination (F:\Path\to\Backup). The flags used are:

/NFL = turns off file name logging
/E   = copies all subdirectories (even empty ones)
/XO  = skips source files that are older than the destination file of the same name
/R:n = specifies the number of retries (n) if a copy fails
/TEE = displays output in the console (as well as in the log file, if specified)
/LOG = location of the text log file to create (overwrites any existing log file)

The command can be saved in a batch file (e.g., BACKUP.BAT). This batch file can then be set up as a Windows Scheduled Task to run automatically each night.

For detailed information on the usage of Robocopy, check the documentation.

Backup strategies for UNIX-based machines (Linux, Mac, etc.)

First, you may want to reformat your external storage device to a native Linux file system such as ext3. If so, repartition the disk to have an ext3 partition using fdisk DISKDEVICE (see fdisk usage), and then make a file system on the partition you just created using mkfs.ext3 PARTITIONDEVICE, where DISKDEVICE and PARTITIONDEVICE may be something like /dev/sdX and /dev/sdX1 respectively. Both steps destroy any data already on the device, so be very careful not to overwrite the wrong drive!

The tools described below are available for most (and distributed with many) UNIX-based operating systems.

Rsync-based approaches

Summary: use this for maintaining simple mirrors and/or multiple snapshots for general workloads.

There are many open-source backup programs available for Linux. The one we will discuss here is rsync, which is available for nearly all UNIX-based systems (including Mac OS). rsync was originally designed for efficiently mirroring data (including very large files) between two machines across a network, transferring only data that differs between the two systems. Several enhancements to this tool have made it very useful even for “local” backups that only involve one machine or even one file system. The basic syntax of rsync for backup is:

rsync -a SOURCEFILE TARGETFILE

for copying one file, or:

rsync -a SOURCEPATHS... TARGETDIR

for copying multiple files or directories. The -a option means “archive” and turns on several options that are considered useful for archiving data (including -r, which specifies full recursive copies of source directories). SOURCEPATHS can be one or more files or directories. Source directories are copied differently depending on whether they have a trailing slash (/). Without the trailing slash (e.g. /path/to/source/dir1), the directory itself is copied as a subdirectory of TARGETDIR (e.g. TARGETDIR/dir1). With a trailing slash (e.g. /path/to/source/dir1/, which contains the file fileA), the contents of the directory are put directly into TARGETDIR (e.g. TARGETDIR/fileA). Both source and target paths can also be remote locations, i.e. remote.server.domain.name:/path/on/remote/server, and most newer versions of rsync will automatically connect to that server using ssh to transfer the data.
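The trailing-slash rule is easy to get wrong, so here is a small sketch that makes the choice explicit. The two function names are hypothetical helpers, not part of rsync; each one normalizes the source path and then runs the basic rsync -a command shown above.

```shell
#!/bin/bash
# Sketch: how a trailing slash on the source changes where rsync puts files.
# The helper names are illustrative only.

# Without a trailing slash: the directory itself is created inside the target
# (e.g. TARGETDIR/dir1/fileA).
copy_directory_itself() {
    local source_dir="${1%/}" target_dir="$2"     # strip one trailing slash, if any
    rsync -a "$source_dir" "$target_dir"
}

# With a trailing slash: only the contents land in the target
# (e.g. TARGETDIR/fileA).
copy_directory_contents() {
    local source_dir="${1%/}/" target_dir="$2"    # force exactly one trailing slash
    rsync -a "$source_dir" "$target_dir"
}
```

Wrapping the two cases in separately named functions means a stray slash typed on the command line cannot silently change the layout of the backup.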

The above command creates a full backup of the source paths. If you run it again, it will by default look at timestamps and file sizes in the source and target paths to determine what files have changed since the last time you ran the command; it will only copy or overwrite files that have changed. Plus, if it detects that a file has changed, and if either of the source or target paths is a remote location, only those bytes that have changed are transferred between the client and server. In a sense this is the incremental backup solution but where the incremental backup is merged into the full backup. By default the above command will not delete files that have been deleted at the source, so the target backup may accumulate detritus. To address this, there is the --delete option, but there is a very good reason this is not on by default! (Consider what would happen if you mistakenly specify the wrong source directory!)
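One way to reduce the risk of --delete is to preview the result first. The sketch below uses rsync's --dry-run and --itemize-changes options to list what would change, including files that would be removed from the target, before running the same command for real. The function names are hypothetical.

```shell
#!/bin/bash
# Sketch: preview what --delete would remove before mirroring for real.
# The helper names are illustrative only.

# Print the changes rsync would make, without touching the target.
# --dry-run makes no changes; --itemize-changes lists each affected path
# (paths that would be removed are reported as "*deleting").
preview_mirror() {
    local source_dir="${1%/}/" target_dir="$2"
    rsync -a --delete --dry-run --itemize-changes "$source_dir" "$target_dir"
}

# Apply the same mirror for real once the preview looks right.
apply_mirror() {
    local source_dir="${1%/}/" target_dir="$2"
    rsync -a --delete "$source_dir" "$target_dir"
}
```

If the preview lists far more deletions than you expect, the source or target path is probably wrong.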

Note that the above solution creates a single mirror that approximates the state of the source directories at a particular point in time (approximates, because if the source paths are on a live filesystem, the source files may change during the backup).

In the discussion of snapshots earlier on this page, we mentioned file system strategies that can make the “snapshot” backup strategy more space-efficient. In this vein, rsync has a very useful option, --link-dest=PREVIOUSBACKUP, that lets you use a previous backup as the basis for a subsequent “full” backup: rsync only transfers files that have changed, and creates hard links to the files in the previous backup that have not changed:

rsync -a --link-dest=PREVIOUSBACKUP SOURCEPATHS... TARGETPATH

This way, you can specify a new target directory for each backup, and it will look exactly like a full backup, but the new backup will only use as much disk space as is needed for the changed files. Moreover, recovering space on the target file system is simply a matter of removing the backup directories that you are no longer interested in; the hard links preserve any files that are still pointed to by the other backups. This is a way of maintaining several snapshots (again, approximate) over time without needing to store or transfer multiple copies of all files. The only cost, beyond the space needed for changed files, is extra inodes. Another plus is that files that have been deleted at the source will not show up in later snapshots, avoiding the need to use the potentially dangerous --delete option.

This strategy combines many good aspects of the snapshot and full + incremental backup strategies. Here is a sample Perl script that uses the above method to create “snapshots” to a target path.

A Backup Script for Systems with Rsync

You can run it periodically, just specifying the source and destination paths (the same paths every time) and it will take care of timestamping each snapshot and sending the latest previous snapshot to the --link-dest option.
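For readers who prefer shell, the same idea can be sketched as follows. This is not the Perl script referenced above, only a minimal illustration under stated assumptions: the function name is hypothetical, snapshot directories are named by timestamp (YYYY-MM-DD_HHMMSS) so that they sort chronologically, and only entries whose names start with a digit are treated as snapshots.

```shell
#!/bin/bash
# Sketch: timestamped rsync snapshots that hard-link unchanged files to the
# previous snapshot. The naming scheme and function name are assumptions.

# Usage: take_snapshot SOURCE_DIR BACKUP_ROOT [SNAPSHOT_NAME]
take_snapshot() {
    local source_dir="${1%/}/" backup_root="$2"
    local name="${3:-$(date +%Y-%m-%d_%H%M%S)}"
    local previous

    mkdir -p "$backup_root" || return 1
    # Use an absolute path: a relative --link-dest would be interpreted
    # relative to the destination directory.
    backup_root="$(cd "$backup_root" && pwd)" || return 1

    # Timestamped names sort chronologically, so the last one is the latest.
    # Entries not starting with a digit (e.g. lost+found) are ignored.
    previous="$(ls -1 "$backup_root" | grep '^[0-9]' | sort | tail -n 1)"

    if [ -n "$previous" ] && [ "$previous" != "$name" ]; then
        rsync -a --link-dest="$backup_root/$previous" "$source_dir" "$backup_root/$name"
    else
        # First run: nothing to link against, so make a plain full copy.
        rsync -a "$source_dir" "$backup_root/$name"
    fi
}
```

Calling `take_snapshot /path/to/source /path/to/backups` repeatedly creates one directory per run. Removing an old snapshot directory frees only the space used by files that no other snapshot still links to.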

Another option is to use rsnapshot, which does much the same as the above script and has more features (supporting multiple levels of backup – monthly, daily, hourly if desired), but may require a little more configuration. Also, the automated periodic backup methods it proposes require that your backup media always be connected (or at least connected whenever the backups are scheduled to run).

Rdiff-based approaches

Summary: use this for maintaining full + incremental backups of directories with large files, only parts of which change often.

rdiff-backup bases its backup algorithm and data transfer protocol on that of rsync (see above), but goes further and allows for block-level incremental backup (described above in “Full + Incremental backup”). Though rsync does an admirable job of limiting the amount of data sent in a remote data transfer, it always stores full copies of the files at the target. This is the right thing for mirroring, and even for snapshots the --link-dest option allows multiple (identical) copies of the same file to be stored without much extra disk space. However, if you are using snapshots of very large files that have isolated changes, rsync must store multiple copies of that file even though only a small number of bytes might have changed.

rdiff-backup addresses this issue by maintaining a mirror of your data but preserving a byte-level change history (based on the rsync delta algorithm and stored in a rdiff-backup-data subdirectory), allowing you to go back in time and restore the state of the source data at the time of an earlier run of rdiff-backup. Restoring from the last backup is as simple as copying the mirror, but restoring from an earlier backup requires running rdiff-backup using the -r option and specifying a timestamp.
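A minimal sketch of that workflow is shown below. It uses the classic rdiff-backup command-line form that matches the -r option mentioned above; newer releases also provide a subcommand-style interface and may warn about the older form, so check the documentation for the version you have installed. The function names are hypothetical.

```shell
#!/bin/bash
# Sketch: back up with rdiff-backup and restore an earlier state.
# Assumes the classic command-line syntax; helper names are illustrative.

# Mirror SOURCE_DIR into BACKUP_DIR, keeping a change history in
# BACKUP_DIR/rdiff-backup-data.
backup_with_history() {
    local source_dir="$1" backup_dir="$2"
    rdiff-backup "$source_dir" "$backup_dir"
}

# List the backup times that can be restored from BACKUP_DIR.
list_backup_times() {
    rdiff-backup --list-increments "$1"
}

# Restore the state as of WHEN (for example "3D" for three days ago,
# or "now" for the latest backup) into a new directory.
restore_as_of() {
    local when="$1" backup_dir="$2" restore_dir="$3"
    rdiff-backup -r "$when" "$backup_dir" "$restore_dir"
}
```

Restoring into a new, empty directory rather than over the live data lets you compare the old and current versions before replacing anything.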

rdiff-backup also stores file metadata in a separate file in the target directory, so it does not need a full traversal of the target tree to collect timestamps, checksums, etc.

All of these things theoretically make it less resource-intensive than a standard rsync, but many people find in practice that rdiff-backup takes longer than rsync. However, rdiff-backup may be the only reasonable option for backing up large files that change frequently but where the changes represent a small fraction of the file's data. For more info, consult the rdiff-backup website.
