=====================
Lease database design
=====================

The target audience for this document is developers who wish to understand
the new lease database (leasedb) planned to be added in Tahoe-LAFS v1.11.0.


Introduction
------------

A "lease" is a request by an account that a share not be deleted before a
specified time. Each storage server stores leases in order to know which
shares to spare from garbage collection.

Motivation
----------

The leasedb will replace the current design in which leases are stored in
the storage server's share container files. That design has several
disadvantages:

- Updating a lease requires modifying a share container file (even for
  immutable shares). This complicates the implementation of share classes.
  The mixing of share contents and lease data in share files also led to a
  security bug (ticket `#1528`_).

- When only the disk backend is supported, it is possible to read and
  update leases synchronously because the share files are stored locally
  to the storage server. For the cloud backend, accessing share files
  requires an HTTP request, and so must be asynchronous. Accepting this
  asynchrony for lease queries would be both inefficient and complex.
  Moving lease information out of shares and into a local database allows
  lease queries to stay synchronous.

Also, the current cryptographic protocol for renewing and cancelling leases
(based on shared secrets derived from secure hash functions) is complex,
and the cancellation part was never used.

The leasedb solves the first two problems by storing the lease information in
a local database instead of in the share container files. The share data
itself is still held in the share container file.

At the same time as implementing leasedb, we devised a simpler protocol for
allocating and cancelling leases: a client can use a public key digital
signature to authenticate access to a foolscap object representing the
authority of an account. This protocol is not yet implemented; at the time
of writing, only an "anonymous" account is supported.

The leasedb also provides an efficient way to get summarized information,
such as total space usage of shares leased by an account, for accounting
purposes.

.. _`#1528`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1528


Design constraints
------------------

A share is stored as a collection of objects. The persistent storage may be
remote from the server (for example, cloud storage).

Writing to the persistent store objects is in general not an atomic
operation. So the leasedb also keeps track of which shares are in an
inconsistent state because they have been partly written. (This may
change in future when we implement a protocol to improve atomicity of
updates to mutable shares.)

Leases are no longer stored in shares. The same share format is used as
before, but the lease slots are ignored, and are cleared when rewriting a
mutable share. The new design also does not use lease renewal or cancel
secrets. (They are accepted as parameters in the storage protocol interfaces
for backward compatibility, but are ignored. Cancel secrets were already
ignored due to the fix for `#1528`_.)

The new design needs to be fail-safe in the sense that if the lease database
is lost or corruption is detected, no share data will be lost (even though
the metadata about leases held by particular accounts has been lost).


Accounting crawler
------------------

A "crawler" is a long-running process that visits share container files at a
slow rate, so as not to overload the server by trying to visit all share
container files one after another immediately.

The accounting crawler replaces the previous "lease crawler". It examines
each share container file and compares it with the state of the leasedb, and
may update the state of the share and/or the leasedb.

The accounting crawler may perform the following functions (but see ticket
#1834 for a proposal to reduce the scope of its responsibility):

- Remove leases that are past their expiration time. (Currently, this is
  done automatically before deleting shares, but we plan to allow expiration
  to be performed separately for individual accounts in future.)

- Delete the objects containing unleased shares, that is, shares that have
  stable entries in the leasedb but no current leases (see below for the
  definition of "stable" entries).

- Discover shares that have been manually added to storage, via ``scp`` or
  some other out-of-band means.

- Discover shares that are present when a storage server is upgraded to
  a leasedb-supporting version from a previous version, and give them
  "starter leases".

- Recover from a situation where the leasedb is lost or detectably
  corrupted. This is handled in the same way as upgrading from a previous
  version.

- Detect shares that have unexpectedly disappeared from storage.  The
  disappearance of a share is logged, and its entry and leases are removed
  from the leasedb.


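The discovery and disappearance checks above amount to comparing two sets of
(storage_index, shnum) pairs. The following is only a minimal sketch of that
comparison, not the crawler's actual code: the function name ``reconcile``,
the string state values, and the in-memory data structures are all
assumptions made for illustration.

```python
# Hypothetical state values; the real leasedb may encode states differently.
STATE_COMING, STATE_STABLE, STATE_GOING = "COMING", "STABLE", "GOING"


def reconcile(found_in_storage, leasedb_states):
    """Compare shares seen in storage with the entries known to the leasedb.

    found_in_storage: set of (storage_index, shnum) pairs for complete shares.
    leasedb_states:   dict mapping (storage_index, shnum) -> state.

    Returns (needs_starter_lease, disappeared), both as sorted lists.
    """
    # Shares present in storage but unknown to the leasedb: manually added,
    # left over from an older version, or orphaned by a lost leasedb.
    needs_starter_lease = sorted(found_in_storage - set(leasedb_states))

    # Stable entries whose store objects are gone. Entries in other states
    # are skipped because their objects are legitimately in flux.
    disappeared = sorted(
        key
        for key, state in leasedb_states.items()
        if state == STATE_STABLE and key not in found_in_storage
    )
    return needs_starter_lease, disappeared
```

Under these assumptions, a lost leasedb is simply the case where
``leasedb_states`` is empty: every share found in storage is then reported as
needing a starter lease, which matches the "same way as upgrading"
behaviour described above.
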
Accounts
--------

An account holds leases for some subset of shares stored by a server. The
leasedb schema can handle many distinct accounts, but for the time being we
create only two accounts: an anonymous account and a starter account. The
starter account is used for leases on shares discovered by the accounting
crawler; the anonymous account is used for all other leases.

The leasedb has at most one lease entry per account per (storage_index,
shnum) pair. This entry stores the times when the lease was last renewed and
when it is set to expire (if the expiration policy does not force it to
expire earlier), represented as Unix UTC-seconds-since-epoch timestamps.

For more on expiration policy, see `docs/garbage-collection.rst
<../garbage-collection.rst>`__.


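To make the "at most one lease entry per account per share" rule concrete,
here is a small sketch using Python's ``sqlite3`` module. It is not the real
leasedb schema: the ``leases`` table, every column name, and the helper
functions are assumptions chosen for illustration. Only the existence of a
``shares`` table is taken from this document.

```python
import sqlite3

# Hypothetical schema. The composite primary key on leases enforces at most
# one lease per account per (storage_index, shnum) pair.
SCHEMA = """
CREATE TABLE shares (
    storage_index TEXT NOT NULL,
    shnum INTEGER NOT NULL,
    used_space INTEGER NOT NULL,
    PRIMARY KEY (storage_index, shnum)
);
CREATE TABLE leases (
    storage_index TEXT NOT NULL,
    shnum INTEGER NOT NULL,
    account_id INTEGER NOT NULL,
    renewal_time INTEGER NOT NULL,     -- Unix seconds since epoch (UTC)
    expiration_time INTEGER NOT NULL,  -- Unix seconds since epoch (UTC)
    PRIMARY KEY (storage_index, shnum, account_id)
);
"""


def add_or_renew_lease(db, storage_index, shnum, account_id, now, duration):
    """Create the lease, or overwrite the account's existing lease."""
    db.execute(
        "INSERT OR REPLACE INTO leases VALUES (?, ?, ?, ?, ?)",
        (storage_index, shnum, account_id, now, now + duration),
    )


def space_used_by(db, account_id):
    """Total size of all shares on which the account holds a lease."""
    row = db.execute(
        "SELECT COALESCE(SUM(s.used_space), 0) FROM shares s"
        " JOIN leases l ON l.storage_index = s.storage_index"
        " AND l.shnum = s.shnum"
        " WHERE l.account_id = ?",
        (account_id,),
    ).fetchone()
    return row[0]
```

A database for experimentation can be created with
``db = sqlite3.connect(":memory:")`` followed by
``db.executescript(SCHEMA)``. Renewing a lease twice for the same account and
share leaves a single row whose timestamps reflect the latest renewal, and
the summary query is an example of the per-account accounting mentioned in
the Motivation section.
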
Share states
------------

The leasedb holds an explicit indicator of the state of each share.

The diagram and descriptions below give the possible values of the "state"
indicator, what that value means, and transitions between states, for any
(storage_index, shnum) pair on each server::


  #        STATE_STABLE -------.
  #         ^   |   ^ |         |
  #         |   v   | |         v
  #    STATE_COMING | |    STATE_GOING
  #         ^       | |         |
  #         |       | v         |
  #         '----- NONE <------'


**NONE**: There is no entry in the ``shares`` table for this (storage_index,
shnum) in this server's leasedb. This is the initial state.

**STATE_COMING**: The share is being created or (if a mutable share)
updated. The store objects may have been at least partially written, but
the storage server doesn't have confirmation that they have all been
completely written.

**STATE_STABLE**: The store objects have been completely written and are
not in the process of being modified or deleted by the storage server. (They
could have been modified or deleted behind the back of the storage server,
but if they have, the server has not noticed that yet.) The share may or may
not be leased.

**STATE_GOING**: The share is being deleted.

State transitions
-----------------

• **STATE_GOING** → **NONE**

    trigger: The storage server gains confidence that all store objects for
    the share have been removed.

    implementation:

    1. Remove the entry in the leasedb.

• **STATE_STABLE** → **NONE**

    trigger: The accounting crawler noticed that all the store objects for
    this share are gone.

    implementation:

    1. Remove the entry in the leasedb.

• **NONE** → **STATE_COMING**

    triggers: A new share is being created, as explicitly signalled by a
    client invoking a creation command, *or* the accounting crawler discovers
    an incomplete share.

    implementation:

    1. Add an entry to the leasedb with **STATE_COMING**.

    2. (In case of explicit creation) begin writing the store objects to hold
       the share.

• **STATE_STABLE** → **STATE_COMING**

    trigger: A mutable share is being modified, as explicitly signalled by a
    client invoking a modification command.

    implementation:

    1. Change the state value of this entry in the leasedb from
       **STATE_STABLE** to **STATE_COMING**.

    2. Begin updating the store objects.

• **STATE_COMING** → **STATE_STABLE**

    trigger: All store objects have been written.

    implementation:

    1. Change the state value of this entry in the leasedb from
       **STATE_COMING** to **STATE_STABLE**.

• **NONE** → **STATE_STABLE**

    trigger: The accounting crawler discovers a complete share.

    implementation:

    1. Add an entry to the leasedb with **STATE_STABLE**.

• **STATE_STABLE** → **STATE_GOING**

    trigger: The share should be deleted because it is unleased.

    implementation:

    1. Change the state value of this entry in the leasedb from
       **STATE_STABLE** to **STATE_GOING**.

    2. Initiate removal of the store objects.


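The seven transitions listed above can be captured as a small lookup table.
The sketch below is illustrative only: it models the ``shares`` table as a
plain dictionary, represents **NONE** as the absence of a key, and uses
made-up string values for the states. The name ``apply_transition`` is
likewise an assumption, not part of the leasedb API.

```python
NONE = None               # no entry in the (modelled) shares table
STATE_COMING = "COMING"   # hypothetical encodings of the three states
STATE_STABLE = "STABLE"
STATE_GOING = "GOING"

# Exactly the (old_state, new_state) pairs described in this section.
ALLOWED_TRANSITIONS = {
    (STATE_GOING, NONE),
    (STATE_STABLE, NONE),
    (NONE, STATE_COMING),
    (STATE_STABLE, STATE_COMING),
    (STATE_COMING, STATE_STABLE),
    (NONE, STATE_STABLE),
    (STATE_STABLE, STATE_GOING),
}


def apply_transition(shares, key, new_state):
    """Move the share identified by key to new_state, or raise ValueError.

    shares maps (storage_index, shnum) -> state; a missing key means NONE.
    """
    old_state = shares.get(key, NONE)
    if (old_state, new_state) not in ALLOWED_TRANSITIONS:
        raise ValueError("illegal transition %r -> %r" % (old_state, new_state))
    if new_state is NONE:
        del shares[key]          # "Remove the entry in the leasedb."
    else:
        shares[key] = new_state  # add the entry or change its state value
```

Note that the table contains no **STATE_COMING** → **STATE_GOING** pair, so
this sketch rejects that transition; whether such a transition should exist
is one of the open questions discussed later in this document.
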
The following constraints are needed to avoid race conditions:

- While a share is being deleted (entry in **STATE_GOING**), we do not accept
  any requests to recreate it. That would result in add and delete requests
  for store objects being sent concurrently, with undefined results.

- While a share is being added or modified (entry in **STATE_COMING**), we
  treat it as leased.

- Creation or modification requests for a given mutable share are serialized.


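One way to picture the three constraints above is as two predicates plus a
per-share lock. This is a rough sketch under stated assumptions: the
function and class names are invented, the state values are placeholder
strings, and ``threading.Lock`` is used purely as a stand-in for whatever
serialization mechanism the storage server actually uses.

```python
import threading

STATE_COMING = "COMING"   # hypothetical encodings; None means no entry
STATE_STABLE = "STABLE"
STATE_GOING = "GOING"


def may_accept_creation(state):
    """First constraint: refuse to recreate a share that is being deleted."""
    return state != STATE_GOING


def is_protected_from_deletion(state, unexpired_lease_count):
    """Second constraint: a share in STATE_COMING is treated as leased."""
    return state == STATE_COMING or unexpired_lease_count > 0


class ShareWriteLocks:
    """Third constraint: one lock per share serializes writes to it."""

    def __init__(self):
        self._guard = threading.Lock()   # protects the dictionary itself
        self._locks = {}

    def lock_for(self, key):
        # Always hand back the same lock object for the same share.
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())
```

A writer would wrap each creation or modification in
``with locks.lock_for((storage_index, shnum)):`` so that two requests for the
same mutable share cannot interleave, while requests for different shares
proceed independently.
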
Unresolved design issues
------------------------

- What happens if a write to store objects for a new share fails
  permanently?  If we delete the share entry, then the accounting crawler
  will eventually get to those store objects and see that their lengths
  are inconsistent with the length in the container header. This will cause
  the share to be treated as corrupted. Should we instead attempt to
  delete those objects immediately? If so, do we need a direct
  **STATE_COMING** → **STATE_GOING** transition to handle this case?

- What happens if only some store objects for a share disappear
  unexpectedly?  This case is similar to only some objects having been
  written when we get an unrecoverable error during creation of a share, but
  perhaps we want to treat it differently in order to preserve information
  about the storage service having lost data.

- Does the leasedb need to track corrupted shares?


Future directions
-----------------

Clients will have key pairs identifying accounts, and will be able to add
leases for a specific account. Various space usage policies can be defined.

Better migration tools ('tahoe storage export'?) will create export files
that include both the share data and the lease data, and then an import tool
will both put the share in the right place and update the recipient node's
leasedb.