All BIAC users should have plans to back up their data to protect against disk failures. This page outlines what data BIAC backs up automatically for you, and lists some strategies that you can use to back up the rest of your data.
BIAC employs a multi-level data protection strategy, which offers users comprehensive backup and restore options. Though this is a service provided by BIAC, this does not preclude you from implementing your own backup or mirroring strategy, if you so desire. It is always a good idea to back up any and all data that you would be sorry to lose. See below for suggestions.
All of the raw MRI files that come directly from the MRI scanners (DICOM and P-files) are automatically backed up to tape. These tapes are kept indefinitely. Note that this does not include the automatically reconstructed data that ends up in your Data directories – the idea is that these can be regenerated from the raw data if needed.
BIAC also now mirrors, to a separate (read-only) server, all data in experiment directories except for the Analysis subdirectories. This mirror is updated nightly. The mirror quota is limited to 20GB. Experiment directories that are on Munin are mirrored to \\Hill\Data\Mirror, and those on any other server are mirrored to \\Munin\Data\Mirror. The intent is to maintain copies of “important” files, including those (like analysis scripts) that can be used to regenerate other data. Accordingly, it makes sense to keep copies of your behavioral data, analysis scripts, final figures, and anything else that you are unable or unwilling to regenerate or restore on your own in one of the mirrored subdirectories (like Stimuli). Note that mirroring means that files you have accidentally deleted will not survive in the mirror after the next nightly mirror operation. However, several of BIAC's file servers do offer snapshot capabilities that enable you to recover deleted files (or older versions of existing files).
You can request to have snapshots set up for your data directories. A snapshot is a point-in-time, read-only copy of the data in your directory, which only stores changed blocks of data in order to be more space-efficient. A snapshot rule specifies the schedule on which to take the snapshots and how many to save. You can have multiple rules for your data. This allows you to restore old versions of a file or a whole directory in case of accidental deletion or other unwanted changes. By default, there is only one snapshot being saved (taken each night at 11:00 PM). Send an email to firstname.lastname@example.org to request changes to the snapshot rules for your data directories.
Every data directory on Munin has a hidden snapshot directory (called ~snapshot). Inside you will find directories that correspond to your snapshots. Within those snapshot directories, you will find read-only copies of your data as it was when the snapshot was taken. You can restore those files simply by copying them to the desired location.
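The restore step described above can be sketched as a small shell function for a Linux client. This is only an illustration: it assumes the Munin share is mounted locally, and the mount point, snapshot name and file names are hypothetical placeholders rather than actual BIAC paths.

```shell
# Copy one file out of a read-only snapshot directory back into the live
# data directory. All paths passed in are hypothetical placeholders.
# Keep paths containing "~snapshot" quoted so the shell never tries to
# treat the leading tilde as a user name.
restore_from_snapshot() {
    snapshot_dir=$1   # e.g. "/mnt/munin/MyExp/~snapshot/SNAPSHOTNAME"
    rel_path=$2       # path of the file relative to the data directory
    live_dir=$3       # e.g. "/mnt/munin/MyExp"

    # Refuse to continue if the file is not present in the snapshot.
    if [ ! -e "$snapshot_dir/$rel_path" ]; then
        echo "not found in snapshot: $rel_path" >&2
        return 1
    fi

    # Recreate the parent directory, then copy while preserving timestamps.
    mkdir -p "$(dirname "$live_dir/$rel_path")"
    cp -p "$snapshot_dir/$rel_path" "$live_dir/$rel_path"
}
```

For example, restore_from_snapshot "/mnt/munin/MyExp/~snapshot/SNAPSHOTNAME" "Scripts/run.m" "/mnt/munin/MyExp" would copy the snapshot version of Scripts/run.m over the live copy.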
If your data are on Hill or Fatt, you can right-click your data directory and then click on the Previous Versions tab to see your snapshots (also known as shadow copies on Windows).
If you don't specify a snapshot policy, you will get the following default rules.
NOTE: the advice presented here is informational and not comprehensive. It is intended to provide guidance to BIAC users in creating a backup plan. We provide no guarantees that the information or commands below are accurate or up-to-date.
There are many cost-effective external storage devices available on the market that are suitable for backing up your data. Many manufacturers package software that can manage backups to the external storage device. Most external hard drives have USB interfaces, many have FireWire, and some have eSATA. (eSATA is not hot-pluggable on every system and may need to be connected at boot time, so it is best suited to media that you expect to leave connected 24/7.) The interface and capacity you choose depend on the capabilities of the computer to which you expect to connect the device, the amount of data you need to back up, and the type of backups you intend to run. You may even wish to budget for double-backups of your data (on separate drives) to protect against your backup devices failing.
Several backup models are used. For the purpose of this discussion, we will define the following terms. Note that usage of these terms may vary widely.
A mirror is a copy of the data. Generally, the mirror is updated periodically to match the current state of the source data. This includes removing files from the mirror that are no longer in the source.
A snapshot is a read-only copy of a file system, made at an exact point in time. A snapshot comprises only the changed parts of files and resides on the same server as the source data. By saving multiple snapshots, one can provide a record of how files changed over time and allow for the recovery of earlier versions of files. Several BIAC servers offer this capability internally (but note that these internal snapshots are also at risk if the file server goes down due to hardware failure).
Given a pre-existing full backup, an incremental backup only stores those files and directories that have changed since the last full (or incremental) backup. Essentially “snapshots of changes”, this is a time- and space-saving strategy, providing the same recovery benefits as snapshots, but the space required is related to the rate at which your data changes, rather than the size of the full data. This strategy has historically been used to minimize the amount of data transferred when backups are spread across multiple removable media such as tapes. There are several categories of incremental backup, ranging from file-level incremental backup (files which have changed are transferred in their entirety) to byte-level incremental backup (where only the bytes that have changed are stored), with block-level in between (storing changes at the granularity of blocks consisting of a fixed number of bytes).
Robocopy is in the P:\Tools directory, and you can use it in a Command Prompt window much like rsync on Linux.
P:\TOOLS\ROBOCOPY \\Path\to\Data F:\Path\to\Backup /NFL /E /XO /R:5 /TEE /LOG:U:\Path\to\Logfile\backup.log
The above command will take a location, and back it up to a desired destination. The flags used are:
/NFL = turns off file name logging
/E = copies all subdirectories (even empty ones)
/XO = excludes source files that are older than the file of the same name at the destination
/R:n = specifies the number of retries if the copy fails
/TEE = displays output in the console (as well as in the log file, if specified)
/LOG = location of the text log file to create (overwrites any existing log file)
The command can be saved in a batch file (e.g., BACKUP.BAT). This batch file can then be set up as a Windows Scheduled Task to be run automatically each night.
For detailed information on the usage of Robocopy, check the documentation.
First, you may want to reformat your external storage device to a native Linux file system such as ext3. If so, repartition your disk to have an ext3 partition using fdisk DISKDEVICE (see fdisk usage), and then make a filesystem on the partition you just created using mkfs.ext3 PARTITIONDEVICE, where DISKDEVICE and PARTITIONDEVICE may be something like /dev/sdX and /dev/sdX1 respectively. Be very careful not to overwrite the wrong drive by accident!
The tools described below are available for most (and distributed with many) UNIX-based operating systems.
Summary: use this for maintaining simple mirrors and/or multiple snapshots for general workloads.
There are many open-source backup programs available for Linux. The one we will discuss here is rsync, which is available for nearly all UNIX-based systems (including Mac OS).
rsync was originally designed for efficiently mirroring data (including very large files) between two machines across a network, transferring only data that differs between the two systems. Several enhancements to this tool have made it very useful even for “local” backups that only involve one machine or even one file system. The basic syntax of rsync for backup is:
rsync -a SOURCEFILE TARGETFILE
for copying one file, or:
rsync -a SOURCEPATHS... TARGETDIR
for copying multiple files or directories. The -a option means “archive” and turns on several options that are considered useful for archiving data (including -r, which specifies full recursive copies of source directories).
SOURCEPATHS can be one or more files or directories. Source directories are copied differently depending on whether they have a trailing slash (/). Without the trailing slash (e.g. /path/to/source/dir1), the directory itself is copied as a subdirectory of TARGETDIR (e.g. TARGETDIR/dir1). With a trailing slash (e.g. /path/to/source/dir1/, which contains the file fileA), the contents of the directory are put into TARGETDIR (e.g. TARGETDIR/fileA). Both source and target paths can also be remote locations, i.e. remote.server.domain.name:/path/on/remote/server, and most recent versions of rsync will automatically take that to mean they should connect to that server using ssh and retrieve the data.
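The trailing-slash rule is easy to get wrong, so here is a minimal sketch that demonstrates it in a throwaway temporary directory. The function name and directory names (target_a, target_b) are made up for this illustration.

```shell
# Demonstrate rsync's trailing-slash rule in a temporary directory.
demo_trailing_slash() {
    work=$(mktemp -d)
    mkdir -p "$work/source/dir1" "$work/target_a" "$work/target_b"
    echo "example" > "$work/source/dir1/fileA"

    # No trailing slash: dir1 itself is created inside the target.
    rsync -a "$work/source/dir1" "$work/target_a"

    # Trailing slash: only the contents of dir1 are copied into the target.
    rsync -a "$work/source/dir1/" "$work/target_b"

    # List what ended up in each target, relative to the work directory.
    (cd "$work" && find target_a target_b -type f | sort)
    rm -rf "$work"
}
```

Calling demo_trailing_slash should list target_a/dir1/fileA for the first copy and target_b/fileA for the second.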
The above command creates a full backup of the source paths. If you run it again, it will by default look at timestamps and file sizes in the source and target paths to determine which files have changed since the last time you ran the command; it will only copy or overwrite files that have changed. In addition, if it detects that a file has changed, and if either the source or the target path is a remote location, only the parts of the file that have changed are transferred between the client and server. In a sense this is the incremental backup solution, but one where the incremental backup is merged into the full backup. By default the above command will not delete files that have been deleted at the source, so the target backup may accumulate detritus. To address this, there is the --delete option, but there is a very good reason this is not on by default! (Consider what would happen if you mistakenly specify the wrong source directory!)
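One way to reduce the risk of --delete is to preview the run first. The sketch below assumes rsync's standard --dry-run and --itemize-changes options; the function name and its "preview" argument are invented for this example.

```shell
# Mirror SOURCE into TARGET, removing files from TARGET that no longer
# exist in SOURCE. Pass "preview" as the third argument to list the
# changes without making them.
mirror_with_delete() {
    source_dir=$1
    target_dir=$2
    mode=${3:-apply}

    if [ "$mode" = "preview" ]; then
        # --dry-run reports what would happen without changing anything;
        # --itemize-changes prints one line per affected file, including
        # a "*deleting" line for each file that would be removed.
        rsync -a --delete --dry-run --itemize-changes "$source_dir/" "$target_dir/"
    else
        rsync -a --delete "$source_dir/" "$target_dir/"
    fi
}
```

Run it with "preview" first and read the "*deleting" lines carefully; only rerun it without the third argument once the list looks right.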
Note that the above solution creates a single mirror that approximates the state of the source directories at a particular point in time (approximates, because if the source paths are on a live filesystem, the source files may change during the backup).
In the definition of “incremental backups” earlier in this page we mention filesystem tricks that may make the “snapshot” backup strategy more efficient. In this vein, rsync has a very useful option --link-dest=PREVIOUSBACKUP that enables you to use a previous backup as a basis for a subsequent “full” backup, but will only transfer files that have changed, and will “hard link” to files in the previous backup that haven't changed:
rsync -a --link-dest=PREVIOUSBACKUP SOURCEPATHS... TARGETPATH
This way, you can specify a new target directory for each backup, and it will look exactly like a full backup, but the new backup will only use up as much disk space as needed for the changed files. Moreover, recovering space on the target file system is a matter of simply removing the backup directories that you are no longer interested in. The hard links will preserve the files pointed to by the other backups. This is a way of maintaining several snapshots (again, approximate) over time without needing to store or transfer multiple copies of all files. The only cost, beyond the space needed for changed files, is extra inodes. Another plus is that files that have been deleted at the source will not show up in later snapshots, avoiding the need to use the potentially dangerous --delete option.
This strategy combines many good aspects of the snapshot and full + incremental backup strategies. Here is a sample Perl script that uses the above method to create “snapshots” to a target path.
You can run it periodically, just specifying the source and destination paths (the same paths every time), and it will take care of timestamping each snapshot and passing the latest previous snapshot to the --link-dest option.
Another option is to use rsnapshot, which does much the same as the above script and has more features (supporting multiple levels of backup: monthly, daily, and hourly if desired), but may require a little more configuration. Also, the automated periodic backup methods it proposes require that your backup media be always connected (or at least connected whenever the backups are scheduled to run).
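For readers who prefer shell, the timestamped --link-dest approach described above can be sketched as follows. This is not the Perl script referred to on this page, only a minimal illustration under stated assumptions: the function name, the timestamp format and the optional third argument are all invented here.

```shell
# Create a timestamped snapshot of SOURCE under BACKUP_ROOT, hard-linking
# unchanged files to the most recent earlier snapshot (if there is one).
# BACKUP_ROOT should be an absolute path, because rsync resolves a
# relative --link-dest directory against the destination directory.
make_snapshot() {
    source_dir=$1
    backup_root=$2
    # Optional third argument overrides the snapshot name (default: now).
    stamp=${3:-$(date +%Y-%m-%d_%H%M%S)}

    mkdir -p "$backup_root"
    # Timestamped names sort chronologically, so the last entry is newest.
    previous=$(ls -1 "$backup_root" | sort | tail -n 1)

    if [ -n "$previous" ]; then
        rsync -a --link-dest="$backup_root/$previous" \
            "$source_dir/" "$backup_root/$stamp/"
    else
        # First run: nothing to link against, so make a plain full copy.
        rsync -a "$source_dir/" "$backup_root/$stamp/"
    fi
}
```

Running make_snapshot /path/to/source /path/to/backups at different times produces one directory per run. Each looks like a full copy, but unchanged files share disk space with the earlier snapshots. The sketch assumes BACKUP_ROOT contains nothing but snapshot directories.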
Summary: use this for maintaining full + incremental backups of directories with large files, only parts of which change often.
rdiff-backup bases its backup algorithm and data transfer protocol on that of rsync (see above), but goes further and allows for block-level incremental backup (described above in “Full + Incremental backup”). Though rsync does an admirable job of limiting the amount of data sent in a remote data transfer, it always stores full copies of the files at the target. This is the right thing for mirroring, and even for snapshots the --link-dest option allows multiple (identical) copies of the same file to be stored without much extra disk space. However, if you are using snapshots of very large files that have isolated changes, rsync must store multiple copies of that file even though only a small number of bytes might have changed.
rdiff-backup addresses this issue by maintaining a mirror of your data but preserving a byte-level change history (based on the rsync delta algorithm and stored in a rdiff-backup-data subdirectory), allowing you to go back in time and restore the state of the source data at the time of an earlier run of rdiff-backup. Restoring from the last backup is as simple as copying the mirror, but restoring from an earlier backup requires running rdiff-backup using the -r option and specifying a timestamp.
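The backup and restore steps can be sketched as two thin wrappers around the classic rdiff-backup command line. This assumes a release that still accepts that syntax; the function names are invented here, and the time values in the comments are only examples.

```shell
# Thin wrappers around the classic rdiff-backup command line.
# Newer releases also offer a subcommand-style syntax and may print a
# deprecation warning for the form used here.

# Back up SOURCE into TARGET; later runs record changes as increments
# under TARGET/rdiff-backup-data.
rdiff_backup_run() {
    rdiff-backup "$1" "$2"
}

# Restore one file or directory as it was at TIME_SPEC.
# TIME_SPEC can be "now" for the latest state or an interval such as
# "3D" for three days ago. RESTORE_TO should not exist yet.
rdiff_restore_as_of() {
    time_spec=$1
    backed_up_path=$2   # a path inside the rdiff-backup target directory
    restore_to=$3
    rdiff-backup -r "$time_spec" "$backed_up_path" "$restore_to"
}
```

For example, rdiff_restore_as_of 3D /path/to/backup/bigfile.dat /tmp/bigfile.dat would write out the file as it stood three days ago, provided a backup covering that time exists.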
Another feature of rdiff-backup is that it stores file metadata in a separate file in the target directory, so it does not require a full traversal of the target tree to collect timestamps, checksums, etc.
All of these things theoretically make it less resource-intensive than a standard rsync, but many people find in practice that rdiff-backup takes longer than rsync. rdiff-backup may be the only reasonable option for backing up large files that change frequently but where the changes represent a small fraction of the file's data. For more info, consult the rdiff-backup website.