   1 .. -*- coding: utf-8-with-signature -*-
   2
   3 =====================
   4 Lease database design
   5 =====================
   6
   7 The target audience for this document is developers who wish to understand
   8 the new lease database (leasedb) planned to be added in Tahoe-LAFS v1.11.0.
   9
  10
  11 Introduction
  12 ------------
  13
  14 A "lease" is a request by an account that a share not be deleted before a
  15 specified time. Each storage server stores leases in order to know which
  16 shares to spare from garbage collection.
  17
  18 Motivation
  19 ----------
  20
  21 The leasedb will replace the current design in which leases are stored in
  22 the storage server's share container files. That design has several
  23 disadvantages:
  24
  25 - Updating a lease requires modifying a share container file (even for
  26   immutable shares). This complicates the implementation of share classes.
  27   The mixing of share contents and lease data in share files also led to a
  28   security bug (ticket `#1528`_).
  29
  30 - When only the disk backend is supported, it is possible to read and
  31   update leases synchronously because the share files are stored locally
  32   to the storage server. For the cloud backend, accessing share files
  33   requires an HTTP request, and so must be asynchronous. Accepting this
  34   asynchrony for lease queries would be both inefficient and complex.
  35   Moving lease information out of shares and into a local database allows
  36   lease queries to stay synchronous.
  37
  38 Also, the current cryptographic protocol for renewing and cancelling leases
  39 (based on shared secrets derived from secure hash functions) is complex,
  40 and the cancellation part was never used.
  41
  42 The leasedb solves the first two problems by storing the lease information in
  43 a local database instead of in the share container files. The share data
  44 itself is still held in the share container file.
  45
  46 At the same time as implementing leasedb, we devised a simpler protocol for
  47 allocating and cancelling leases: a client can use a public key digital
  48 signature to authenticate access to a foolscap object representing the
  49 authority of an account. This protocol is not yet implemented; at the time
  50 of writing, only an "anonymous" account is supported.
  51
  52 The leasedb also provides an efficient way to get summarized information,
  53 such as total space usage of shares leased by an account, for accounting
  54 purposes.
  55
  56 .. _`#1528`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1528
  57
  58
  59 Design constraints
  60 ------------------
  61
  62 A share is stored as a collection of objects. The persistent storage may be
  63 remote from the server (for example, cloud storage).
  64
  65 Writing the store objects to persistent storage is in general not an
  66 atomic operation, so the leasedb also keeps track of which shares are in
  67 an inconsistent state because they have been partly written. (This may
  68 change in future when we implement a protocol to improve atomicity of
  69 updates to mutable shares.)
  70
  71 Leases are no longer stored in shares. The same share format is used as
  72 before, but the lease slots are ignored, and are cleared when rewriting a
  73 mutable share. The new design also does not use lease renewal or cancel
  74 secrets. (They are accepted as parameters in the storage protocol interfaces
  75 for backward compatibility, but are ignored. Cancel secrets were already
  76 ignored due to the fix for `#1528`_.)
  77
  78 The new design needs to be fail-safe in the sense that if the lease database
  79 is lost or corruption is detected, no share data will be lost (even though
  80 the metadata about leases held by particular accounts has been lost).
  81
  82
  83 Accounting crawler
  84 ------------------
  85
  86 A "crawler" is a long-running process that visits share container files at a
  87 slow rate, so as not to overload the server by trying to visit all share
  88 container files one after another immediately.
  89
  90 The accounting crawler replaces the previous "lease crawler". It examines
  91 each share container file and compares it with the state of the leasedb, and
  92 may update the state of the share and/or the leasedb.
  93
  94 The accounting crawler may perform the following functions (but see ticket
  95 #1834 for a proposal to reduce the scope of its responsibility):
  96
  97 - Remove leases that are past their expiration time. (Currently, this is
  98   done automatically before deleting shares, but we plan to allow expiration
  99   to be performed separately for individual accounts in future.)
 100
 101 - Delete the objects containing unleased shares — that is, shares that have
 102   stable entries in the leasedb but no current leases (see below for the
 103   definition of "stable" entries).
 104
 105 - Discover shares that have been manually added to storage, via ``scp`` or
 106   some other out-of-band means.
 107
 108 - Discover shares that are present when a storage server is upgraded to
 109   a leasedb-supporting version from a previous version, and give them
 110   "starter leases".
 111
 112 - Recover from a situation where the leasedb is lost or detectably
 113   corrupted. This is handled in the same way as upgrading from a previous
 114   version.
 115
 116 - Detect shares that have unexpectedly disappeared from storage.  The
 117   disappearance of a share is logged, and its entry and leases are removed
 118   from the leasedb.
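
The crawler functions listed above can be pictured as a reconciliation pass
between what is found on storage and what the leasedb records. The following
is only a minimal sketch under stated assumptions: the function name
``plan_crawler_actions``, the action labels, and the in-memory data shapes are
hypothetical and are not taken from the Tahoe-LAFS code. It covers discovery
of unknown shares, deletion of unleased stable shares, and detection of
shares that have disappeared.

```python
# Hypothetical state labels; a share with no entry is in the NONE state.
STATE_COMING = "STATE_COMING"
STATE_STABLE = "STATE_STABLE"
STATE_GOING = "STATE_GOING"


def plan_crawler_actions(found_on_storage, entries, now):
    """Decide what one crawler pass would do (illustrative only).

    found_on_storage: set of (storage_index, shnum) pairs seen on storage.
    entries: dict mapping (storage_index, shnum) to a tuple
             (state, list of lease expiration times in Unix seconds).
    now: current time in Unix seconds.
    """
    actions = []

    # Shares on storage with no leasedb entry: manually added, left over
    # from an upgrade, or the leasedb was lost. They get starter leases.
    for key in sorted(found_on_storage - set(entries)):
        actions.append(("add_starter_lease", key))

    for key in sorted(entries):
        state, expirations = entries[key]
        if state != STATE_STABLE:
            # Shares being written or deleted are left alone.
            continue
        if key not in found_on_storage:
            # The share disappeared unexpectedly: log it, drop the entry.
            actions.append(("log_and_remove_entry", key))
        elif all(t < now for t in expirations):
            # Every lease is past its expiration time: the share is unleased.
            actions.append(("delete_unleased_share", key))
    return actions
```

Returning a list of planned actions instead of performing them keeps the
sketch free of storage and database side effects, which is a choice made for
illustration and not a description of the real crawler.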
 119
 120
 121 Accounts
 122 --------
 123
 124 An account holds leases for some subset of shares stored by a server. The
 125 leasedb schema can handle many distinct accounts, but for the time being we
 126 create only two accounts: an anonymous account and a starter account. The
 127 starter account is used for leases on shares discovered by the accounting
 128 crawler; the anonymous account is used for all other leases.
 129
 130 The leasedb has at most one lease entry per account per (storage_index,
 131 shnum) pair. This entry stores the times when the lease was last renewed and
 132 when it is set to expire (if the expiration policy does not force it to
 133 expire earlier), represented as Unix UTC-seconds-since-epoch timestamps.
 134
 135 For more on expiration policy, see `docs/garbage-collection.rst
 136 <../garbage-collection.rst>`__.
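
To make the "at most one lease entry per account per (storage_index, shnum)
pair" rule concrete, here is a minimal sketch of how such a schema might look.
It assumes an SQLite database; only the ``shares`` table name appears in this
document, so the ``leases`` table, every column name, the numeric account ids,
and the helper functions are hypothetical. The composite primary key on the
``leases`` table is what enforces the one-lease-per-account rule, and the last
helper shows the kind of per-account space summary mentioned in the
Motivation section.

```python
import sqlite3

# Hypothetical schema: table layout and column names are assumptions.
SCHEMA = """
CREATE TABLE shares (
    storage_index TEXT NOT NULL,
    shnum         INTEGER NOT NULL,
    used_space    INTEGER NOT NULL,
    state         INTEGER NOT NULL,
    PRIMARY KEY (storage_index, shnum)
);
CREATE TABLE leases (
    storage_index   TEXT NOT NULL,
    shnum           INTEGER NOT NULL,
    account_id      INTEGER NOT NULL,
    renewal_time    INTEGER NOT NULL,  -- Unix seconds since epoch (UTC)
    expiration_time INTEGER NOT NULL,  -- Unix seconds since epoch (UTC)
    PRIMARY KEY (storage_index, shnum, account_id)
);
"""

# Hypothetical ids for the two accounts described above.
ANONYMOUS_ACCOUNT = 0
STARTER_ACCOUNT = 1


def open_leasedb(path=":memory:"):
    """Open a database and create the sketch schema."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def add_or_renew_lease(db, storage_index, shnum, account_id, now, duration):
    """Create the lease, or overwrite the existing one for this account."""
    db.execute(
        "INSERT OR REPLACE INTO leases VALUES (?, ?, ?, ?, ?)",
        (storage_index, shnum, account_id, now, now + duration),
    )


def total_leased_space(db, account_id):
    """Sum the space used by all shares leased by one account."""
    row = db.execute(
        "SELECT COALESCE(SUM(s.used_space), 0) FROM shares s"
        " JOIN leases l ON l.storage_index = s.storage_index"
        " AND l.shnum = s.shnum"
        " WHERE l.account_id = ?",
        (account_id,),
    ).fetchone()
    return row[0]
```

Renewing a lease through ``add_or_renew_lease`` replaces the existing row
instead of adding a second one, so the lease count for a given account and
share never exceeds one.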
 137
 138
 139 Share states
 140 ------------
 141
 142 The leasedb holds an explicit indicator of the state of each share.
 143
 144 The diagram and descriptions below give the possible values of the "state"
 145 indicator, what that value means, and transitions between states, for any
 146 (storage_index, shnum) pair on each server::
 147
 148
 149   #        STATE_STABLE -------.
 150   #         ^   |   ^ |         |
 151   #         |   v   | |         v
 152   #    STATE_COMING | |    STATE_GOING
 153   #         ^       | |         |
 154   #         |       | v         |
 155   #         '----- NONE <------'
 156
 157
 158 **NONE**: There is no entry in the ``shares`` table for this (storage_index,
 159 shnum) in this server's leasedb. This is the initial state.
 160
 161 **STATE_COMING**: The share is being created or (if a mutable share)
 162 updated. The store objects may have been at least partially written, but
 163 the storage server doesn't have confirmation that they have all been
 164 completely written.
 165
 166 **STATE_STABLE**: The store objects have been completely written and are
 167 not in the process of being modified or deleted by the storage server. (They
 168 could have been modified or deleted behind the back of the storage server,
 169 but if they have, the server has not noticed that yet.) The share may or may
 170 not be leased.
 171
 172 **STATE_GOING**: The share is being deleted.
 173
 174 State transitions
 175 -----------------
 176
 177 • **STATE_GOING** → **NONE**
 178
 179     trigger: The storage server gains confidence that all store objects for
 180     the share have been removed.
 181
 182     implementation:
 183
 184     1. Remove the entry in the leasedb.
 185
 186 • **STATE_STABLE** → **NONE**
 187
 188     trigger: The accounting crawler noticed that all the store objects for
 189     this share are gone.
 190
 191     implementation:
 192
 193     1. Remove the entry in the leasedb.
 194
 195 • **NONE** → **STATE_COMING**
 196
 197     triggers: A new share is being created, as explicitly signalled by a
 198     client invoking a creation command, *or* the accounting crawler discovers
 199     an incomplete share.
 200
 201     implementation:
 202
 203     1. Add an entry to the leasedb with **STATE_COMING**.
 204
 205     2. (In case of explicit creation) begin writing the store objects to hold
 206        the share.
 207
 208 • **STATE_STABLE** → **STATE_COMING**
 209
 210     trigger: A mutable share is being modified, as explicitly signalled by a
 211     client invoking a modification command.
 212
 213     implementation:
 214
 215     1. Change the state of this entry in the leasedb to **STATE_COMING**.
 216
 217     2. Begin updating the store objects.
 218
 219 • **STATE_COMING** → **STATE_STABLE**
 220
 221     trigger: All store objects have been written.
 222
 223     implementation:
 224
 225     1. Change the state value of this entry in the leasedb from
 226        **STATE_COMING** to **STATE_STABLE**.
 227
 228 • **NONE** → **STATE_STABLE**
 229
 230     trigger: The accounting crawler discovers a complete share.
 231
 232     implementation:
 233
 234     1. Add an entry to the leasedb with **STATE_STABLE**.
 235
 236 • **STATE_STABLE** → **STATE_GOING**
 237
 238     trigger: The share should be deleted because it is unleased.
 239
 240     implementation:
 241
 242     1. Change the state value of this entry in the leasedb from
 243        **STATE_STABLE** to **STATE_GOING**.
 244
 245     2. Initiate removal of the store objects.
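
The seven transitions listed above can be collected into a lookup table that
rejects anything else. This is a minimal sketch, not the leasedb code: the
``ShareState`` enum, the ``ALLOWED_TRANSITIONS`` set, and ``check_transition``
are hypothetical names introduced for illustration.

```python
from enum import Enum


class ShareState(Enum):
    NONE = "NONE"  # no entry in the shares table
    COMING = "STATE_COMING"
    STABLE = "STATE_STABLE"
    GOING = "STATE_GOING"


# Exactly the transitions described in this section, as (old, new) pairs.
ALLOWED_TRANSITIONS = {
    (ShareState.GOING, ShareState.NONE),
    (ShareState.STABLE, ShareState.NONE),
    (ShareState.NONE, ShareState.COMING),
    (ShareState.STABLE, ShareState.COMING),
    (ShareState.COMING, ShareState.STABLE),
    (ShareState.NONE, ShareState.STABLE),
    (ShareState.STABLE, ShareState.GOING),
}


def check_transition(old, new):
    """Return the new state, or raise ValueError if the move is not listed."""
    if (old, new) not in ALLOWED_TRANSITIONS:
        raise ValueError(
            f"illegal share state transition: {old.name} -> {new.name}"
        )
    return new
```

Under this table a request that would move a share from **STATE_GOING** to
**STATE_COMING** is rejected, and so is a direct **STATE_COMING** to
**STATE_GOING** move, since neither appears in the list above.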
 246
 247
 248 The following constraints are needed to avoid race conditions:
 249
 250 - While a share is being deleted (entry in **STATE_GOING**), we do not accept
 251   any requests to recreate it. That would result in add and delete requests
 252   for store objects being sent concurrently, with undefined results.
 253
 254 - While a share is being added or modified (entry in **STATE_COMING**), we
 255   treat it as leased.
 256
 257 - Creation or modification requests for a given mutable share are serialized.
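
The three constraints above can be illustrated with a few small helpers. This
is a sketch under assumptions: the function and class names are hypothetical,
and the use of ``threading.Lock`` is purely illustrative, since this document
does not say which mechanism the storage server uses to serialize requests.

```python
import threading
from collections import defaultdict

# Hypothetical state labels; None stands for the NONE state (no entry).
STATE_COMING = "STATE_COMING"
STATE_STABLE = "STATE_STABLE"
STATE_GOING = "STATE_GOING"


def may_accept_write(state):
    """Refuse creation or modification while the share is being deleted."""
    return state != STATE_GOING


def is_protected_from_gc(state, current_lease_count):
    """A share being added or modified is treated as leased."""
    return state == STATE_COMING or current_lease_count > 0


class ShareLocks:
    """Hands out one lock per (storage_index, shnum) to serialize writers."""

    def __init__(self):
        self._registry_lock = threading.Lock()
        self._locks = defaultdict(threading.Lock)

    def lock_for(self, storage_index, shnum):
        # The registry lock guards creation of per-share locks.
        with self._registry_lock:
            return self._locks[(storage_index, shnum)]
```

A writer would hold the lock returned by ``lock_for`` for the duration of a
creation or modification request, so that a second request for the same share
waits until the first has finished.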
 258
 259
 260 Unresolved design issues
 261 ------------------------
 262
 263 - What happens if a write to store objects for a new share fails
 264   permanently?  If we delete the share entry, then the accounting crawler
 265   will eventually get to those store objects and see that their lengths
 266   are inconsistent with the length in the container header. This will cause
 267   the share to be treated as corrupted. Should we instead attempt to
 268   delete those objects immediately? If so, do we need a direct
 269   **STATE_COMING** → **STATE_GOING** transition to handle this case?
 270
 271 - What happens if only some store objects for a share disappear
 272   unexpectedly?  This case is similar to only some objects having been
 273   written when we get an unrecoverable error during creation of a share, but
 274   perhaps we want to treat it differently in order to preserve information
 275   about the storage service having lost data.
 276
 277 - Does the leasedb need to track corrupted shares?
 278
 279
 280 Future directions
 281 -----------------
 282
 283 Clients will have key pairs identifying accounts, and will be able to add
 284 leases for a specific account. Various space usage policies can be defined.
 285
 286 Better migration tools ('tahoe storage export'?) will create export files
 287 that include both the share data and the lease data, and then an import tool
 288 will both put the share in the right place and update the recipient node's
 289 leasedb.