= The Tahoe BackupDB =

To speed up backup operations, Tahoe maintains a small database known as the
"backupdb". This is used to avoid re-uploading files which have already been
uploaded recently.

This database lives in ~/.tahoe/private/backupdb.sqlite, and is a SQLite
single-file database. It is used by the "tahoe backup" command. In the future,
it will also be used by "tahoe mirror", and by "tahoe cp" when the
--use-backupdb option is included.

The purpose of this database is specifically to manage the file-to-cap
translation (the "upload" step). It does not address directory updates. A
future version will include a directory cache.

The overall goal of optimizing backup is to reduce the work required when the
source disk has not changed since the last backup. In the ideal case, running
"tahoe backup" twice in a row, with no intervening changes to the disk, will
not require any network traffic.

This database is optional. If it is deleted, the worst effect is that a
subsequent backup operation may use more effort (network bandwidth, CPU
cycles, and disk IO) than it would have without the backupdb.

The database uses sqlite3, which is included as part of the standard Python
library with Python 2.5 and later. For Python 2.4, Tahoe will try to install
the "pysqlite" package at build-time, but this will succeed only if sqlite3
with development headers is already installed. On Debian and Debian
derivatives you can install the "python-pysqlite2" package (which, despite
the name, actually provides sqlite3 rather than sqlite2). On old
distributions such as Debian etch (4.0 "oldstable") or Ubuntu Edgy (6.10),
the "python-pysqlite2" package won't work, but the "sqlite3-dev" package
will.

== Schema ==

The database contains the following tables:

CREATE TABLE version
(
 version integer  -- contains one row, set to 1
);

CREATE TABLE local_files
(
 path  varchar(1024) PRIMARY KEY, -- index, this is os.path.abspath(fn)
 size  integer,         -- os.stat(fn)[stat.ST_SIZE]
 mtime number,          -- os.stat(fn)[stat.ST_MTIME]
 ctime number,          -- os.stat(fn)[stat.ST_CTIME]
 fileid integer
);

CREATE TABLE caps
(
 fileid integer PRIMARY KEY AUTOINCREMENT,
 filecap varchar(256) UNIQUE    -- URI:CHK:...
);

CREATE TABLE last_upload
(
 fileid INTEGER PRIMARY KEY,
 last_uploaded TIMESTAMP,
 last_checked TIMESTAMP
);

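As a minimal sketch, the four tables can be created from Python with the
standard sqlite3 module. The statements below are the ones from this
document, written so that SQLite accepts them as-is ("--" comments, PRIMARY
KEY placed before the comma). The create_backupdb() helper is hypothetical
and is not taken from Tahoe's own code.

```python
import sqlite3

# The schema from this document, normalized so SQLite can execute it.
SCHEMA = """
CREATE TABLE version
(
 version INTEGER  -- contains one row, set to 1
);

CREATE TABLE local_files
(
 path   VARCHAR(1024) PRIMARY KEY, -- os.path.abspath(fn)
 size   INTEGER,                   -- os.stat(fn)[stat.ST_SIZE]
 mtime  NUMBER,                    -- os.stat(fn)[stat.ST_MTIME]
 ctime  NUMBER,                    -- os.stat(fn)[stat.ST_CTIME]
 fileid INTEGER
);

CREATE TABLE caps
(
 fileid  INTEGER PRIMARY KEY AUTOINCREMENT,
 filecap VARCHAR(256) UNIQUE       -- URI:CHK:...
);

CREATE TABLE last_upload
(
 fileid        INTEGER PRIMARY KEY,
 last_uploaded TIMESTAMP,
 last_checked  TIMESTAMP
);
"""


def create_backupdb(dbfile):
    """Create the tables in a new database and return the connection.

    Hypothetical helper for illustration only.
    """
    conn = sqlite3.connect(dbfile)
    conn.executescript(SCHEMA)
    # The version table holds a single row, set to 1.
    conn.execute("INSERT INTO version (version) VALUES (1)")
    conn.commit()
    return conn


conn = create_backupdb(":memory:")
# AUTOINCREMENT makes SQLite add its own sqlite_sequence table; skip it.
tables = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master "
    "WHERE type='table' AND name NOT LIKE 'sqlite_%'"))
print(tables)
```

Passing ":memory:" keeps the example from touching
~/.tahoe/private/backupdb.sqlite; a real path can be given instead.
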
Notes: if we extend the backupdb to assist with directory maintenance (see
below), we may need paths in multiple places, so it would make sense to
create a table for them, and change the local_files table to refer to a
pathid instead of an absolute path:

CREATE TABLE paths
(
 path varchar(1024) UNIQUE,  -- index
 pathid integer PRIMARY KEY AUTOINCREMENT
);

== Operation ==

The upload process starts with a pathname (like ~/.emacs) and wants to end up
with a file-cap (like URI:CHK:...).

The first step is to convert the path to an absolute form
(/home/warner/.emacs) and do a lookup in the 'local_files' table. If the path
is not present in this table, the file must be uploaded. The upload process
is:

 1. record the file's size, ctime (inode change time on Unix), and mtime
    (modification time)
 2. upload the file into the grid, obtaining an immutable file read-cap
 3. add an entry to the 'caps' table, with the read-cap, to get a fileid
 4. add an entry to the 'last_upload' table, with the current time
 5. add an entry to the 'local_files' table, with the fileid, the path,
    and the local file's size/ctime/mtime

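The five steps above can be sketched in Python as follows. This is an
illustration under stated assumptions, not Tahoe's implementation:
record_upload() is a hypothetical name, the "upload" argument is a
hypothetical callable standing in for the real grid upload, and setting
last_checked to the upload time is an assumption.

```python
import os
import sqlite3
import stat
import tempfile
import time


def record_upload(conn, path, upload):
    """Upload `path` and record the result in the backupdb tables.

    `upload` is a hypothetical callable: it takes an absolute path and
    returns an immutable read-cap string.
    """
    abspath = os.path.abspath(path)
    # 1. record size, ctime, and mtime
    s = os.stat(abspath)
    size, ctime, mtime = s[stat.ST_SIZE], s[stat.ST_CTIME], s[stat.ST_MTIME]
    # 2. upload the file, obtaining a read-cap
    filecap = upload(abspath)
    # 3. add the cap to 'caps' (reusing the row if the cap is already known)
    row = conn.execute(
        "SELECT fileid FROM caps WHERE filecap=?", (filecap,)).fetchone()
    if row is None:
        fileid = conn.execute(
            "INSERT INTO caps (filecap) VALUES (?)", (filecap,)).lastrowid
    else:
        fileid = row[0]
    # 4. record the upload time (assumption: a fresh upload counts as checked)
    now = time.time()
    conn.execute(
        "INSERT OR REPLACE INTO last_upload "
        "(fileid, last_uploaded, last_checked) VALUES (?,?,?)",
        (fileid, now, now))
    # 5. map the local path to the fileid
    conn.execute(
        "INSERT OR REPLACE INTO local_files "
        "(path, size, mtime, ctime, fileid) VALUES (?,?,?,?,?)",
        (abspath, size, mtime, ctime, fileid))
    conn.commit()
    return filecap


# Demo against an in-memory database and a temporary file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE local_files (path VARCHAR(1024) PRIMARY KEY, size INTEGER,
                          mtime NUMBER, ctime NUMBER, fileid INTEGER);
CREATE TABLE caps (fileid INTEGER PRIMARY KEY AUTOINCREMENT,
                   filecap VARCHAR(256) UNIQUE);
CREATE TABLE last_upload (fileid INTEGER PRIMARY KEY,
                          last_uploaded TIMESTAMP, last_checked TIMESTAMP);
""")
with tempfile.TemporaryDirectory() as tmp:
    fn = os.path.join(tmp, "example.txt")
    with open(fn, "wb") as f:
        f.write(b"hello")
    # The lambda is a placeholder; it does not produce a real cap.
    filecap = record_upload(conn, fn, lambda p: "URI:CHK:placeholder")
    stored = conn.execute(
        "SELECT size, fileid FROM local_files WHERE path=?",
        (os.path.abspath(fn),)).fetchone()
print(filecap, stored)
```
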
If the path *is* present in 'local_files', the easy-to-compute identifying
information is compared: file size and ctime/mtime. If these differ, the file
must be uploaded. The row is removed from the 'local_files' table, and the
upload process above is followed.

If the path is present but ctime or mtime differs, the file may have changed.
If the size differs, then the file has certainly changed. At this point, a
future version of the "backup" command might hash the file and look for a
match in an as-yet-defined table, in the hopes that the file has simply been
moved from somewhere else on the disk. This enhancement requires changes to
the Tahoe upload API before it can be significantly more efficient than
simply handing the file to Tahoe and relying upon the normal convergence to
notice the similarity.

If ctime, mtime, or size is different, the client will upload the file, as
above.

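The comparison described above might look like the following sketch. The
function name lookup_unchanged() is hypothetical; it returns the stored
fileid when size, mtime, and ctime all match, and otherwise removes any stale
row and returns None so that the caller falls back to the upload process.

```python
import os
import sqlite3
import stat
import tempfile


def lookup_unchanged(conn, path):
    """Return the stored fileid if `path` looks unchanged, else None."""
    abspath = os.path.abspath(path)
    row = conn.execute(
        "SELECT size, mtime, ctime, fileid FROM local_files WHERE path=?",
        (abspath,)).fetchone()
    if row is None:
        return None  # never uploaded from this path
    s = os.stat(abspath)
    current = (s[stat.ST_SIZE], s[stat.ST_MTIME], s[stat.ST_CTIME])
    if current != tuple(row[:3]):
        # Size or timestamps differ: drop the stale row, upload again.
        conn.execute("DELETE FROM local_files WHERE path=?", (abspath,))
        conn.commit()
        return None
    return row[3]


# Demo: record a file, then change its size and look it up again.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE local_files (path VARCHAR(1024) PRIMARY KEY, "
    "size INTEGER, mtime NUMBER, ctime NUMBER, fileid INTEGER)")
with tempfile.TemporaryDirectory() as tmp:
    fn = os.path.join(tmp, "example.txt")
    with open(fn, "wb") as f:
        f.write(b"hello")
    s = os.stat(fn)
    conn.execute(
        "INSERT INTO local_files VALUES (?,?,?,?,?)",
        (os.path.abspath(fn), s[stat.ST_SIZE], s[stat.ST_MTIME],
         s[stat.ST_CTIME], 7))
    before = lookup_unchanged(conn, fn)  # matches the stored row
    with open(fn, "ab") as f:
        f.write(b" world")               # the size now differs
    after = lookup_unchanged(conn, fn)
    remaining = conn.execute(
        "SELECT COUNT(*) FROM local_files").fetchone()[0]
print(before, after, remaining)
```
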
If these identifiers are the same, the client will assume that the file is
unchanged (unless the --ignore-timestamps option is provided, in which case
the client always re-uploads the file), and it may be allowed to skip the
upload. For safety, however, we require that the client periodically perform
a filecheck on these probably-already-uploaded files, and re-upload anything
that doesn't look healthy. The client looks the fileid up in the
'last_upload' table, to see how long it has been since the file was last
checked.

A "random early check" algorithm should be used, in which a check is
performed with a probability that increases with the age of the previous
results. For example, files that were last checked within a month are not
checked, files that were checked 5 weeks ago are re-checked with 25%
probability, those checked 6 weeks ago with 50%, and those checked more than
8 weeks ago are always checked. This reduces the "thundering herd" of
filechecks-on-everything that would otherwise result when a backup operation
is run one month after the original backup. If a filecheck reveals the file
is not healthy, it is re-uploaded.

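One way to realize the example numbers above is a linear ramp between four
and eight weeks. This is a sketch under assumptions: "a month" is treated as
four weeks, and the function names and the linear shape are illustrative
choices, not a specification.

```python
import random

WEEK = 7 * 24 * 60 * 60  # seconds


def check_probability(age, min_age=4 * WEEK, max_age=8 * WEEK):
    """Probability of re-checking a file last checked `age` seconds ago.

    Linear ramp: 0 up to min_age, 1 from max_age on. With the default
    bounds this gives 25% at 5 weeks and 50% at 6 weeks.
    """
    if age <= min_age:
        return 0.0
    if age >= max_age:
        return 1.0
    return (age - min_age) / float(max_age - min_age)


def should_check(last_checked, now, rng=random.random):
    """Decide at random whether to run a filecheck now."""
    return rng() < check_probability(now - last_checked)


for weeks in (3, 5, 6, 9):
    print(weeks, check_probability(weeks * WEEK))
```

Because each file makes its own random decision, the checks for files
uploaded in the same backup run are spread over several later runs instead
of all falling due at once.
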
If the filecheck shows the file is healthy, or if the filecheck was skipped,
the client gets to skip the upload, and uses the previous filecap (from the
'caps' table) to add to the parent directory.

If a new file is uploaded, a new entry is put in the 'caps' and 'last_upload'
tables, and an entry is made in the 'local_files' table to reflect the
mapping from local disk pathname to uploaded filecap. If an old file is
re-uploaded, the 'last_upload' entry is updated with the new timestamps. If
an old file is checked and found healthy, the 'last_upload' entry is updated.

Relying upon timestamps is a compromise between efficiency and safety: a file
which is modified without changing the timestamp or size will be treated as
unmodified, and the "tahoe backup" command will not copy the new contents
into the grid. The --ignore-timestamps option can be used to disable this
optimization, forcing every byte of the file to be hashed and encoded.

== Directory Caching ==

A future version of the backupdb will also record a secure hash of the most
recent contents of each tahoe directory that was used in the last backup run.
The directories created by the "tahoe backup" command are all read-only, so
it should be difficult to violate the assumption that these directories are
unmodified since the previous pass. In the future, Tahoe will provide truly
immutable directories, making this assumption even more solid.

In the current implementation, when the backup algorithm is faced with the
decision to either create a new directory or share an old one, it must read
the contents of the old directory to compare it against the desired new
contents. This means that a "null backup" (performing a backup when nothing
has been changed) must still read every Tahoe directory from the previous
backup.

With a directory-caching backupdb, these directory reads will be bypassed,
and the null backup will use minimal network bandwidth: one directory read
and two modifies. The Archives/ directory must be read to locate the latest
backup, and must be modified to add a new snapshot, and the Latest/ directory
will be updated to point to that same snapshot.

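To illustrate how a hash of a directory's contents could replace the
directory read, here is a minimal sketch. This document does not specify the
hash function or how the contents are serialized, so SHA-256, the
length-prefixed encoding, and the directory_hash() name are all assumptions
made for the example.

```python
import hashlib


def directory_hash(children):
    """Hash a directory described as a dict of child name -> cap string.

    Names are processed in sorted order so the result does not depend on
    the order in which the children were listed.
    """
    h = hashlib.sha256()
    for name in sorted(children):
        for field in (name, children[name]):
            data = field.encode("utf-8")
            # Length-prefix each field so field boundaries are unambiguous.
            h.update(str(len(data)).encode("ascii") + b":")
            h.update(data)
    return h.hexdigest()


# The caps below are placeholders, not real Tahoe caps.
a = directory_hash({"a.txt": "URI:CHK:one", "b.txt": "URI:CHK:two"})
b = directory_hash({"b.txt": "URI:CHK:two", "a.txt": "URI:CHK:one"})
c = directory_hash({"a.txt": "URI:CHK:one", "b.txt": "URI:CHK:changed"})
print(a == b, a == c)
```

If the hash computed for the desired new contents equals the one recorded
for the old directory, the old directory can be shared without reading it
from the grid.
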