Overview

chkfs - A command-line tool for storing filesystems inside a chkstore.

This stores a filesystem using the chkstore library. Use case goals include:

  • Back up many different old hard drives that contain redundant copies of filesystems, in a deduplicating manner.
  • Store in a self-describing transparent format, so that if a user finds themselves with a typical fresh Linux install but no network and no access to this code, they can still restore backups using bzip2, cat, cp, etc.
  • Incremental backup with atomically consistent cached progress state: If a backup process dies, it can be restarted and catch up to its previous run without using heavy resources.
    • Atomically consistent means a backup process can die suddenly at any step without corrupting the store.
    • Consistency also means that multiple writing processes can update the storage simultaneously without corrupting it. The only failure in this case is that one process may overwrite another's "snapshot pointer". An overwritten snapshot pointer can be reconstructed with an expensive scan of the store.
    • Cached means the progress tracking state can be removed, and the only effect is that the next backup run will use more disk I/O and time, but will not lose information or revert any committed backup state.
  • Support many different backup source filesystems (old DOS FAT, iso9660, NTFS...). Support for reading the filesystems comes from the kernel by dint of mounting, but the backup tool should save all relevant filesystem metadata.
    • This includes filenames in any encoding. The known encodings are ASCII and UTF-8, but if neither encoding can represent a filename, an "unknown" encoding stores the binary data directly. Encodings are "sniffed" by first validating against ASCII, then UTF-8, then falling back to unknown. This means the encoding is only a hint, because a filename written in some other encoding may happen to validate as ASCII or UTF-8 and be misinterpreted. However, the original bytes are always kept, so no data is lost or corrupted.
  • Restore portions of the stored data.
    • The stored data can be inspected and restored in a fine-grained manner, such as by retrieving a single file from a large snapshot, or a directory together with everything beneath it.
  • Recursive directory structures.
    • OSX, tahoe-lafs, and some other filesystems allow recursive directory structures. (In OSX for example, directories may be hard-linked.)
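
The filename encoding "sniffing" described in the goals above can be sketched roughly as follows. This is only an illustration of the validate-ASCII, then UTF-8, then fall back order; the function name sniff_encoding and the label strings are assumptions made for this example, not the names chkfs actually uses.

```python
def sniff_encoding(name: bytes) -> str:
    """Return a best-guess encoding label for a raw filename.

    The labels "ascii", "utf8" and "unknown" are illustrative only.
    """
    # Try the strictest encoding first, then the more permissive one.
    for label, codec in (("ascii", "ascii"), ("utf8", "utf-8")):
        try:
            name.decode(codec)
        except UnicodeDecodeError:
            continue
        return label
    # Neither validated: the caller keeps the raw bytes untouched.
    return "unknown"


# The label is only a hint. These two bytes could be "é" in UTF-8 or the
# two characters "Ã©" in Latin-1; sniffing cannot tell the difference.
ambiguous = b"\xc3\xa9"
label = sniff_encoding(ambiguous)
```

Because the raw bytes are stored either way, a wrong guess affects only how a name is displayed, not whether it can be restored.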

Unsupported Use Cases:

  • Deletion. My philosophy is to buy a new hard drive and to save data forever. There is a security risk, but OTOH, it's impossible to tell how valuable any datum may be in the future.
  • Redundancy. The underlying filesystem or storage drivers can handle this, and it's best to leave that complexity in a different layer.
  • High Availability. If the storage node explodes, all data is lost. To prevent this, delegate to another tool such as tahoe-lafs.
  • Privacy. Delegate to the underlying filesystem.
  • Crossing Trust Boundaries. This is intended for a case where anyone with read access to the store can read everything. If a user needs privacy within a backup, they could encrypt files before backing up and manage that complexity themselves.
  • Keeping chkfs storage on "unusual" or old filesystems: The design is intended to store old filesystem contents, but not to store on old filesystems. In particular, chkstore and chkfs assume directories can hold many, many entries, with names at least around 80 ASCII bytes long. (They also currently assume the storage filesystem supports hard links for efficient commits, and O_CREAT|O_EXCL for avoiding multi-process collisions.)
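
To illustrate why hard links and O_CREAT|O_EXCL matter on the storage filesystem, here is a minimal sketch of one common commit pattern built from those two primitives. It is an assumption about how such a commit could work, not a description of the chkstore implementation; the function name commit_blob and the temporary file naming are hypothetical.

```python
import os


def commit_blob(store_dir: str, name: str, data: bytes) -> bool:
    """Store data as store_dir/name. Return True if this call created it.

    Hypothetical helper: chkstore's real commit logic may differ.
    """
    final_path = os.path.join(store_dir, name)
    tmp_path = os.path.join(store_dir, f".tmp-{os.getpid()}-{name}")
    # O_EXCL makes creation fail if another writer already owns tmp_path.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        try:
            # The hard link appears atomically: readers see either no
            # entry or the complete file, never a partial write.
            os.link(tmp_path, final_path)
        except FileExistsError:
            # Another process committed this name first.
            return False
        return True
    finally:
        # The temporary name is no longer needed in either case.
        os.unlink(tmp_path)
```

If the process dies before the link step, only a stray temporary file is left behind and the committed entries are untouched, which is the property the atomic consistency goal asks for.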

Future use case:

  • A read-only fuse interface for convenient restore out of the chkfs.

Bonus use cases:

  • Integration as a backend in other networked/decentralized data stores such as camlistore or tahoe-lafs.

FAQ:

  • Why not cp -a or cp -r?
    • cp is lossy in some ways in which chkfs is not: the vfs metadata about the source is not copied, and the source filesystem may have metadata which cannot be stored in the target filesystem (including filename encoding issues). chkfs also suffers some of these limitations because it relies on the vfs layer for reading source filesystems. In exchange, chkfs sacrifices the convenience of having the backup files available directly as a filesystem (without a fuse interface), so it loses the ability to run find | grep, for instance.
  • Why not tar or many of the existing very mature unix backup systems?
    • The "old school" solutions I'm aware of do not support all of the use cases above without excessive headache. The tradeoff is that old-school solutions are well tested in a large variety of circumstances and widely available.
  • Why not camlistore, tahoe-lafs, freenet, or decentralized storage tech X?
    • I don't need decentralization for personal backups. There's no need for networking, redundancy, or trust boundary complexity. (See the Unsupported Use Cases section.)
  • Why not bup or another scheme which is better at dedup?
    • chkfs prefers a "fairly transparent" store, as described above. It should be possible to restore a backup without this tool, using only bzip2, cp, vim, etc.