.. -*- coding: utf-8-with-signature -*-

==================
The Tahoe BackupDB
==================

1.  `Overview`_
2.  `Schema`_
3.  `Upload Operation`_
4.  `Directory Operations`_

Overview
========
To speed up backup operations, Tahoe maintains a small database known as the
"backupdb". This is used to avoid re-uploading files which have already been
uploaded recently.

This database lives in ``~/.tahoe/private/backupdb.sqlite``, and is a SQLite
single-file database. It is used by the "``tahoe backup``" command. In the
future, it may optionally be used by other commands such as "``tahoe cp``".

The purpose of this database is twofold: to manage the file-to-cap
translation (the "upload" step) and the directory-to-cap translation (the
"mkdir-immutable" step).

The overall goal of optimizing backup is to reduce the work required when the
source disk has not changed (much) since the last backup. In the ideal case,
running "``tahoe backup``" twice in a row, with no intervening changes to the
disk, will not require any network traffic. Minimal changes to the source
disk should result in minimal traffic.

This database is optional. If it is deleted, the worst effect is that a
subsequent backup operation may use more effort (network bandwidth, CPU
cycles, and disk IO) than it would have without the backupdb.

The database uses sqlite3, which is included as part of the standard Python
library with Python 2.5 and later. For Python 2.4, Tahoe will try to install the
"pysqlite" package at build-time, but this will succeed only if sqlite3 with
development headers is already installed. On Debian and Debian derivatives
you can install the "python-pysqlite2" package (which, despite the name,
actually provides sqlite3 rather than sqlite2). On old distributions such
as Debian etch (4.0 "oldstable") or Ubuntu Edgy (6.10) the "python-pysqlite2"
package won't work, but the "sqlite3-dev" package will.

Schema
======

The database contains the following tables::

  CREATE TABLE version
  (
   version integer  -- contains one row, set to 1
  );

  CREATE TABLE local_files
  (
   path  varchar(1024) PRIMARY KEY, -- index, this is an absolute UTF-8-encoded local filename
   size  integer,         -- os.stat(fn)[stat.ST_SIZE]
   mtime number,          -- os.stat(fn)[stat.ST_MTIME]
   ctime number,          -- os.stat(fn)[stat.ST_CTIME]
   fileid integer
  );

  CREATE TABLE caps
  (
   fileid integer PRIMARY KEY AUTOINCREMENT,
   filecap varchar(256) UNIQUE    -- URI:CHK:...
  );

  CREATE TABLE last_upload
  (
   fileid INTEGER PRIMARY KEY,
   last_uploaded TIMESTAMP,
   last_checked TIMESTAMP
  );

  CREATE TABLE directories
  (
   dirhash varchar(256) PRIMARY KEY,
   dircap varchar(256),
   last_uploaded TIMESTAMP,
   last_checked TIMESTAMP
  );
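
As a rough illustration, the schema above can be created with Python's
standard ``sqlite3`` module. This is only a sketch: the helper name
``open_backupdb`` and the use of an in-memory database are assumptions made
for the example, not a description of how Tahoe itself opens the backupdb.

```python
import sqlite3

# The five tables from the schema listing above.
SCHEMA = """
CREATE TABLE version (version INTEGER);
CREATE TABLE local_files (
 path VARCHAR(1024) PRIMARY KEY,
 size INTEGER,
 mtime NUMBER,
 ctime NUMBER,
 fileid INTEGER
);
CREATE TABLE caps (
 fileid INTEGER PRIMARY KEY AUTOINCREMENT,
 filecap VARCHAR(256) UNIQUE
);
CREATE TABLE last_upload (
 fileid INTEGER PRIMARY KEY,
 last_uploaded TIMESTAMP,
 last_checked TIMESTAMP
);
CREATE TABLE directories (
 dirhash VARCHAR(256) PRIMARY KEY,
 dircap VARCHAR(256),
 last_uploaded TIMESTAMP,
 last_checked TIMESTAMP
);
"""


def open_backupdb(dbfile=":memory:"):
    """Hypothetical helper: create a fresh database with the schema above."""
    db = sqlite3.connect(dbfile)
    db.executescript(SCHEMA)
    # The version table holds a single row, set to 1.
    db.execute("INSERT INTO version (version) VALUES (1)")
    db.commit()
    return db


db = open_backupdb()
# List the user tables (SQLite's internal bookkeeping tables are skipped).
tables = sorted(
    row[0]
    for row in db.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type='table' AND name NOT LIKE 'sqlite_%'"
    )
)
print(tables)
```

Passing a filename instead of ``":memory:"`` would create the single-file
database on disk.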

Upload Operation
================

The upload process starts with a pathname (like ``~/.emacs``) and wants to end up
with a file-cap (like ``URI:CHK:...``).

The first step is to convert the path to an absolute form
(``/home/warner/.emacs``) and do a lookup in the local_files table. If the path
is not present in this table, the file must be uploaded. The upload process
is:

1. record the file's size, ctime (which is the directory-entry change time or
   file creation time depending on OS) and modification time

2. upload the file into the grid, obtaining an immutable file read-cap

3. add an entry to the 'caps' table, with the read-cap, to get a fileid

4. add an entry to the 'last_upload' table, with the current time

5. add an entry to the 'local_files' table, with the fileid, the path,
   and the local file's size/ctime/mtime
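
Steps 3 through 5 might be sketched as follows. The function name
``record_upload``, the placeholder cap string ``URI:CHK:example``, and the
choice to re-use an existing fileid when the same filecap is already present
are all assumptions made for illustration; the upload itself (step 2) is
outside the scope of the sketch.

```python
import sqlite3
import time


def record_upload(db, path, size, mtime, ctime, filecap, now=None):
    """Hypothetical helper covering steps 3-5 of the list above."""
    now = time.time() if now is None else now
    cur = db.cursor()
    # Step 3: filecap is UNIQUE, so look for an existing row first
    # (assumption: identical caps share one fileid).
    row = cur.execute(
        "SELECT fileid FROM caps WHERE filecap=?", (filecap,)
    ).fetchone()
    if row is None:
        cur.execute("INSERT INTO caps (filecap) VALUES (?)", (filecap,))
        fileid = cur.lastrowid
    else:
        fileid = row[0]
    # Step 4: remember when the file was uploaded (and implicitly checked).
    cur.execute(
        "INSERT OR REPLACE INTO last_upload VALUES (?,?,?)",
        (fileid, now, now),
    )
    # Step 5: map the local path and its stat info to the fileid.
    cur.execute(
        "INSERT OR REPLACE INTO local_files VALUES (?,?,?,?,?)",
        (path, size, mtime, ctime, fileid),
    )
    db.commit()
    return fileid


db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE local_files (path VARCHAR(1024) PRIMARY KEY, size INTEGER,
                          mtime NUMBER, ctime NUMBER, fileid INTEGER);
CREATE TABLE caps (fileid INTEGER PRIMARY KEY AUTOINCREMENT,
                   filecap VARCHAR(256) UNIQUE);
CREATE TABLE last_upload (fileid INTEGER PRIMARY KEY,
                          last_uploaded TIMESTAMP, last_checked TIMESTAMP);
""")

# "URI:CHK:example" is a made-up placeholder, not a real cap.
fileid = record_upload(db, "/home/warner/.emacs", 1234, 100, 90,
                       "URI:CHK:example", now=1000)
same_cap_fileid = record_upload(db, "/home/warner/.emacs.bak", 1234, 100, 90,
                                "URI:CHK:example", now=2000)
```

In this sketch, two local paths whose contents produce the same filecap end
up pointing at the same row in 'caps'.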

If the path *is* present in 'local_files', the easy-to-compute identifying
information is compared: file size and ctime/mtime. If these differ, the file
must be uploaded. The row is removed from the local_files table, and the
upload process above is followed.

If the path is present but ctime or mtime differs, the file may have changed.
If the size differs, then the file has certainly changed. At this point, a
future version of the "backup" command might hash the file and look for a
match in an as-yet-defined table, in the hopes that the file has simply been
moved from somewhere else on the disk. This enhancement requires changes to
the Tahoe upload API before it can be significantly more efficient than
simply handing the file to Tahoe and relying upon the normal convergence to
notice the similarity.

If ctime, mtime, or size is different, the client will upload the file, as
above.
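
The comparison described above can be sketched like this. The helper name
``needs_upload`` is hypothetical, and the sketch only covers the
size/mtime/ctime comparison, not the periodic filecheck or any command-line
options.

```python
import os
import sqlite3
import stat
import tempfile


def needs_upload(db, path):
    """Hypothetical helper: True if the path is new or its stat info changed."""
    abspath = os.path.abspath(os.path.expanduser(path))
    s = os.stat(abspath)
    current = (s[stat.ST_SIZE], s[stat.ST_MTIME], s[stat.ST_CTIME])
    row = db.execute(
        "SELECT size, mtime, ctime FROM local_files WHERE path=?",
        (abspath,),
    ).fetchone()
    if row is None:
        return True  # never seen before: must be uploaded
    return tuple(row) != current  # any difference forces an upload


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE local_files (path VARCHAR(1024) PRIMARY KEY, size INTEGER,"
    " mtime NUMBER, ctime NUMBER, fileid INTEGER)"
)

with tempfile.TemporaryDirectory() as d:
    fn = os.path.join(d, "example.txt")
    with open(fn, "w") as f:
        f.write("hello")
    before = needs_upload(db, fn)  # path not in local_files yet

    s = os.stat(fn)
    db.execute(
        "INSERT INTO local_files VALUES (?,?,?,?,?)",
        (os.path.abspath(fn), s[stat.ST_SIZE], s[stat.ST_MTIME],
         s[stat.ST_CTIME], 1),
    )
    unchanged = needs_upload(db, fn)  # stat info matches the stored row

    with open(fn, "a") as f:
        f.write(" world")  # the size changes, so the file certainly changed
    after = needs_upload(db, fn)
```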

If these identifiers are the same, the client will assume that the file is
unchanged (unless the ``--ignore-timestamps`` option is provided, in which
case the client always re-uploads the file), and it may be allowed to skip
the upload. For safety, however, we require that the client periodically
perform a filecheck on these probably-already-uploaded files, and re-upload
anything that doesn't look healthy. The client looks the fileid up in the
'last_upload' table and reads its 'last_checked' column, to see how long it
has been since the file was last checked.

A "random early check" algorithm should be used, in which a check is
performed with a probability that increases with the age of the previous
results. E.g. files that were last checked within a month are not checked,
files that were checked 5 weeks ago are re-checked with 25% probability, 6
weeks with 50%, more than 8 weeks are always checked. This reduces the
"thundering herd" of filechecks-on-everything that would otherwise result
when a backup operation is run one month after the original backup. If a
filecheck reveals the file is not healthy, it is re-uploaded.
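
One way to realize the example numbers above is a linear ramp between four
and eight weeks. The sketch below assumes exactly that; the linear shape,
the reading of "a month" as four weeks, and the function names are
assumptions chosen to match the example percentages, not a statement of what
any particular Tahoe release does.

```python
import random

WEEK = 7 * 24 * 60 * 60           # seconds
NO_CHECK_BEFORE = 4 * WEEK        # "within a month", taken here as 4 weeks
ALWAYS_CHECK_AFTER = 8 * WEEK


def check_probability(age):
    """Probability of re-checking a file last checked `age` seconds ago."""
    if age <= NO_CHECK_BEFORE:
        return 0.0
    if age >= ALWAYS_CHECK_AFTER:
        return 1.0
    # Linear ramp: 5 weeks -> 0.25, 6 weeks -> 0.5, 7 weeks -> 0.75.
    return (age - NO_CHECK_BEFORE) / (ALWAYS_CHECK_AFTER - NO_CHECK_BEFORE)


def should_check(last_checked, now, rng=random):
    """Randomly decide whether to run a filecheck this time."""
    return rng.random() < check_probability(now - last_checked)


for weeks in (3, 5, 6, 9):
    print(weeks, check_probability(weeks * WEEK))
```

Because each file draws its own random number, files that were all uploaded
on the same day have their re-checks spread over the following weeks rather
than all landing in a single backup run.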

If the filecheck shows the file is healthy, or if the filecheck was skipped,
the client gets to skip the upload, and uses the previous filecap (from the
'caps' table) to add to the parent directory.

If a new file is uploaded, a new entry is put in the 'caps' and 'last_upload'
tables, and an entry is made in the 'local_files' table to reflect the mapping
from local disk pathname to uploaded filecap. If an old file is re-uploaded,
the 'last_upload' entry is updated with the new timestamps. If an old file is
checked and found healthy, the 'last_upload' entry is updated.

Relying upon timestamps is a compromise between efficiency and safety: a file
which is modified without changing the timestamp or size will be treated as
unmodified, and the "``tahoe backup``" command will not copy the new contents
into the grid. The ``--ignore-timestamps`` option can be used to disable this
optimization, forcing every byte of the file to be hashed and encoded.

Directory Operations
====================

Once the contents of a directory are known (a filecap for each file, and a
dircap for each directory), the backup process must find or create a tahoe
directory node with the same contents. The contents are hashed, and the hash
is queried in the 'directories' table. If found, the last-checked timestamp
is used to perform the same random-early-check algorithm described for files
above, but no new upload is performed. Since "``tahoe backup``" creates immutable
directories, it is perfectly safe to re-use a directory from a previous
backup.

If not found, the web-API "mkdir-immutable" operation is used to create a new
directory, and an entry is stored in the table.

The comparison operation ignores timestamps and metadata, and pays attention
solely to the file names and contents.
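
To illustrate the idea of a contents-only hash, the sketch below hashes the
sorted child names together with their caps and nothing else. The exact
serialization and hash function used by Tahoe are not specified in this
document, so the length-prefixed encoding, the choice of SHA-256, and the
function name ``directory_contents_hash`` are all assumptions. The cap
strings are made-up placeholders.

```python
import hashlib


def directory_contents_hash(children):
    """Hypothetical helper.

    `children` maps each child name to its cap string (a filecap for a
    file, a dircap for a subdirectory). Timestamps and other metadata are
    deliberately left out.
    """
    h = hashlib.sha256()
    for name in sorted(children):  # sort so insertion order is irrelevant
        for field in (name, children[name]):
            data = field.encode("utf-8")
            # Length prefix keeps ("ab", "c") distinct from ("a", "bc").
            h.update(b"%d:" % len(data))
            h.update(data)
            h.update(b",")
    return h.hexdigest()


a = directory_contents_hash({"notes.txt": "URI:CHK:aaa", "src": "URI:DIR2-CHK:bbb"})
b = directory_contents_hash({"src": "URI:DIR2-CHK:bbb", "notes.txt": "URI:CHK:aaa"})
c = directory_contents_hash({"notes.txt": "URI:CHK:zzz", "src": "URI:DIR2-CHK:bbb"})
print(a == b, a == c)
```

Two directories with the same names and caps hash identically regardless of
where they sit in the tree, which is what lets a renamed or moved directory
be found again in the 'directories' table.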

By using a directory-contents hash, the "``tahoe backup``" command is able to
re-use directories from other places in the backed up data, or from old
backups. This means that renaming a directory and moving a subdirectory to a
new parent both count as "minor changes" and will result in minimal Tahoe
operations and subsequent network traffic (new directories will be created
for the modified directory and all of its ancestors). It also means that you
can perform a backup ("#1"), delete a file or directory, perform a backup
("#2"), restore it, and then the next backup ("#3") will re-use the
directories from backup #1.

The best case is a null backup, in which nothing has changed. This will
result in minimal network bandwidth: one directory read and two modifies. The
``Archives/`` directory must be read to locate the latest backup, and must be
modified to add a new snapshot, and the ``Latest/`` directory will be updated to
point to that same snapshot.