From: David-Sarah Hopwood <david-sarah@jacaranda.org>
Date: Thu, 6 Dec 2012 03:53:54 +0000 (+0000)
Subject: docs/proposed/leasedb.rst: Add design doc for leasedb.
X-Git-Tag: allmydata-tahoe-1.10.1a1~252
X-Git-Url: https://git.rkrishnan.org/components/com_hotproperty/flags/@manifest?a=commitdiff_plain;h=b94721ad403b4ffef378545e5e38b92a7f64376e;p=tahoe-lafs%2Ftahoe-lafs.git

docs/proposed/leasedb.rst: Add design doc for leasedb.
This version refs #1834.

Signed-off-by: David-Sarah Hopwood <david-sarah@jacaranda.org>
---

diff --git a/docs/proposed/leasedb.rst b/docs/proposed/leasedb.rst
new file mode 100644
index 00000000..bbd0e92a
--- /dev/null
+++ b/docs/proposed/leasedb.rst
@@ -0,0 +1,288 @@
+
+=====================
+Lease database design
+=====================
+
+The target audience for this document is developers who wish to understand
+the new lease database (leasedb) planned to be added in Tahoe-LAFS v1.11.0.
+
+
+Introduction
+------------
+
+A "lease" is a request by an account that a share not be deleted before a
+specified time. Each storage server stores leases in order to know which
+shares to spare from garbage collection.
+
+Motivation
+----------
+
+The leasedb will replace the current design in which leases are stored in
+the storage server's share container files. That design has several
+disadvantages:
+
+- Updating a lease requires modifying a share container file (even for
+  immutable shares). This complicates the implementation of share classes.
+  The mixing of share contents and lease data in share files also led to a
+  security bug (ticket `#1528`_).
+
+- When only the disk backend is supported, it is possible to read and
+  update leases synchronously because the share files are stored locally
+  to the storage server. For the cloud backend, accessing share files
+  requires an HTTP request, and so must be asynchronous. Accepting this
+  asynchrony for lease queries would be both inefficient and complex.
+  Moving lease information out of shares and into a local database allows
+  lease queries to stay synchronous.
+
+Also, the current cryptographic protocol for renewing and cancelling leases
+(based on shared secrets derived from secure hash functions) is complex,
+and the cancellation part was never used.
+
+The leasedb solves the first two problems by storing the lease information in
+a local database instead of in the share container files. The share data
+itself is still held in the share container file.
+
+At the same time as implementing leasedb, we devised a simpler protocol for
+allocating and cancelling leases: a client can use a public key digital
+signature to authenticate access to a foolscap object representing the
+authority of an account. This protocol is not yet implemented; at the time
+of writing, only an "anonymous" account is supported.
+
+The leasedb also provides an efficient way to get summarized information,
+such as total space usage of shares leased by an account, for accounting
+purposes.
+
+.. _`#1528`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1528
+
+
+Design constraints
+------------------
+
+A share is stored as a collection of objects. The persistent storage may be
+remote from the server (for example, cloud storage).
+
+Writing store objects to persistent storage is, in general, not an atomic
+operation. The leasedb therefore also keeps track of which shares are in an
+inconsistent state because they have been only partly written. (This may
+change in future when we implement a protocol to improve atomicity of
+updates to mutable shares.)
+
+Leases are no longer stored in shares. The same share format is used as
+before, but the lease slots are ignored, and are cleared when rewriting a
+mutable share. The new design also does not use lease renewal or cancel
+secrets. (They are accepted as parameters in the storage protocol interfaces
+for backward compatibility, but are ignored. Cancel secrets were already
+ignored due to the fix for `#1528`_.)
+
+The new design needs to be fail-safe in the sense that if the lease database
+is lost or corruption is detected, no share data will be lost (even though
+the metadata about leases held by particular accounts has been lost).
+
+
+Accounting crawler
+------------------
+
+A "crawler" is a long-running process that visits share container files at a
+slow rate, so as not to overload the server by visiting every share
+container file in immediate succession.
+
+The accounting crawler replaces the previous "lease crawler". It examines
+each share container file and compares it with the state of the leasedb, and
+may update the state of the share and/or the leasedb.
+
+The accounting crawler may perform the following functions (but see ticket
+#1834 for a proposal to reduce the scope of its responsibility):
+
+- Remove leases that are past their expiration time. (Currently, this is
+  done automatically before deleting shares, but we plan to allow expiration
+  to be performed separately for individual accounts in future.)
+
+- Delete the objects containing unleased shares — that is, shares that have
+  stable entries in the leasedb but no current leases (see below for the
+  definition of "stable" entries).
+
+- Discover shares that have been manually added to storage, via ``scp`` or
+  some other out-of-band means.
+
+- Discover shares that are present when a storage server is upgraded to
+  a leasedb-supporting version from a previous version, and give them
+  "starter leases".
+
+- Recover from a situation where the leasedb is lost or detectably
+  corrupted. This is handled in the same way as upgrading from a previous
+  version.
+
+- Detect shares that have unexpectedly disappeared from storage.  The
+  disappearance of a share is logged, and its entry and leases are removed
+  from the leasedb.
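+
+As a rough illustration of the discovery, deletion and disappearance cases
+above, the following Python sketch plans the actions for one crawler pass.
+The function name ``plan_crawler_actions``, its arguments and the action
+labels are hypothetical and are not taken from the Tahoe-LAFS code. The
+sketch assumes that every discovered share is complete and that expired
+leases have already been removed from ``leased``::
+
+  def plan_crawler_actions(found_in_storage, db_states, leased):
+      """Return a list of (action, key) pairs for one crawler pass.
+
+      found_in_storage: set of (storage_index, shnum) keys seen in storage
+      db_states: dict mapping each key with a leasedb entry to its state
+      leased: set of keys that have at least one current lease
+      """
+      actions = []
+      # Shares present in storage but unknown to the leasedb (added
+      # out-of-band, or found after an upgrade or a lost leasedb).
+      for key in sorted(found_in_storage - set(db_states)):
+          actions.append(("add_entry_with_starter_lease", key))
+      for key in sorted(db_states):
+          if db_states[key] != "STATE_STABLE":
+              continue  # shares being written or deleted are left alone
+          if key not in found_in_storage:
+              # The share has unexpectedly disappeared from storage.
+              actions.append(("log_and_remove_entry", key))
+          elif key not in leased:
+              # Stable entry with no current lease: eligible for deletion.
+              actions.append(("delete_share", key))
+      return actions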
+
+
+Accounts
+--------
+
+An account holds leases for some subset of shares stored by a server. The
+leasedb schema can handle many distinct accounts, but for the time being we
+create only two accounts: an anonymous account and a starter account. The
+starter account is used for leases on shares discovered by the accounting
+crawler; the anonymous account is used for all other leases.
+
+The leasedb has at most one lease entry per account per (storage_index,
+shnum) pair. This entry stores the times when the lease was last renewed and
+when it is set to expire (if the expiration policy does not force it to
+expire earlier), represented as Unix UTC-seconds-since-epoch timestamps.
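+
+The one-lease-per-account-per-share rule can be sketched with a composite
+primary key. This document does not specify the database engine or schema,
+so the use of SQLite, the column names, the numeric account ids and the
+helper functions below are all assumptions made for illustration (only the
+``shares`` table name appears in this document, and its state column is
+omitted here)::
+
+  import sqlite3
+
+  SCHEMA = """
+  CREATE TABLE shares (
+      storage_index TEXT NOT NULL,
+      shnum INTEGER NOT NULL,
+      used_space INTEGER NOT NULL,
+      PRIMARY KEY (storage_index, shnum)
+  );
+  CREATE TABLE leases (
+      storage_index TEXT NOT NULL,
+      shnum INTEGER NOT NULL,
+      account_id INTEGER NOT NULL,
+      renewal_time INTEGER NOT NULL,     -- Unix seconds since epoch (UTC)
+      expiration_time INTEGER NOT NULL,  -- Unix seconds since epoch (UTC)
+      PRIMARY KEY (storage_index, shnum, account_id)
+  );
+  """
+
+  ANONYMOUS_ACCOUNT, STARTER_ACCOUNT = 0, 1  # hypothetical account ids
+
+  def open_leasedb(path=":memory:"):
+      db = sqlite3.connect(path)
+      db.executescript(SCHEMA)
+      return db
+
+  def add_or_renew_lease(db, storage_index, shnum, account_id, now, duration):
+      # The primary key guarantees at most one lease per account per share,
+      # so renewing replaces the existing row instead of adding another.
+      db.execute("INSERT OR REPLACE INTO leases VALUES (?, ?, ?, ?, ?)",
+                 (storage_index, shnum, account_id, now, now + duration))
+      db.commit()
+
+  def space_leased_by(db, account_id):
+      # Total size of all shares on which this account holds a lease.
+      row = db.execute(
+          "SELECT COALESCE(SUM(s.used_space), 0) FROM shares s"
+          " JOIN leases l ON s.storage_index = l.storage_index"
+          " AND s.shnum = l.shnum WHERE l.account_id = ?",
+          (account_id,)).fetchone()
+      return row[0]
+
+Keeping the share size next to the lease rows is what would make a
+per-account space summary a single local query in a design like this.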
+
+For more on expiration policy, see `docs/garbage-collection.rst
+<../garbage-collection.rst>`__.
+
+
+Share states
+------------
+
+The leasedb holds an explicit indicator of the state of each share.
+
+The diagram and descriptions below give the possible values of the "state"
+indicator, what that value means, and transitions between states, for any
+(storage_index, shnum) pair on each server::
+
+
+  #        STATE_STABLE -------.
+  #         ^   |   ^ |         |
+  #         |   v   | |         v
+  #    STATE_COMING | |    STATE_GOING
+  #         ^       | |         |
+  #         |       | v         |
+  #         '----- NONE <------'
+
+
+**NONE**: There is no entry in the ``shares`` table for this (storage_index,
+shnum) in this server's leasedb. This is the initial state.
+
+**STATE_COMING**: The share is being created or (if a mutable share)
+updated. The store objects may have been at least partially written, but
+the storage server doesn't have confirmation that they have all been
+completely written.
+
+**STATE_STABLE**: The store objects have been completely written and are
+not in the process of being modified or deleted by the storage server. (They
+could have been modified or deleted behind the back of the storage server,
+but if they have, the server has not noticed that yet.) The share may or may
+not be leased.
+
+**STATE_GOING**: The share is being deleted.
+
+State transitions
+-----------------
+
+• **STATE_GOING** → **NONE**
+
+    trigger: The storage server gains confidence that all store objects for
+    the share have been removed.
+
+    implementation:
+
+    1. Remove the entry in the leasedb.
+
+• **STATE_STABLE** → **NONE**
+
+    trigger: The accounting crawler notices that all the store objects for
+    this share are gone.
+
+    implementation:
+
+    1. Remove the entry in the leasedb.
+
+• **NONE** → **STATE_COMING**
+
+    triggers: A new share is being created, as explicitly signalled by a
+    client invoking a creation command, *or* the accounting crawler discovers
+    an incomplete share.
+
+    implementation:
+
+    1. Add an entry to the leasedb with **STATE_COMING**.
+
+    2. (In case of explicit creation) begin writing the store objects to hold
+       the share.
+
+• **STATE_STABLE** → **STATE_COMING**
+
+    trigger: A mutable share is being modified, as explicitly signalled by a
+    client invoking a modification command.
+
+    implementation:
+
+    1. Change the state of this entry in the leasedb to **STATE_COMING**.
+
+    2. Begin updating the store objects.
+
+• **STATE_COMING** → **STATE_STABLE**
+
+    trigger: All store objects have been written.
+
+    implementation:
+
+    1. Change the state value of this entry in the leasedb from
+       **STATE_COMING** to **STATE_STABLE**.
+
+• **NONE** → **STATE_STABLE**
+
+    trigger: The accounting crawler discovers a complete share.
+
+    implementation:
+
+    1. Add an entry to the leasedb with **STATE_STABLE**.
+
+• **STATE_STABLE** → **STATE_GOING**
+
+    trigger: The share should be deleted because it is unleased.
+
+    implementation:
+
+    1. Change the state value of this entry in the leasedb from
+       **STATE_STABLE** to **STATE_GOING**.
+
+    2. Initiate removal of the store objects.
+
+
+The following constraints are needed to avoid race conditions:
+
+- While a share is being deleted (entry in **STATE_GOING**), we do not accept
+  any requests to recreate it. That would result in add and delete requests
+  for store objects being sent concurrently, with undefined results.
+
+- While a share is being added or modified (entry in **STATE_COMING**), we
+  treat it as leased.
+
+- Creation or modification requests for a given mutable share are serialized.
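+
+The transitions and the first two constraints above can be summarized as a
+small table-driven state machine. This is a minimal sketch, not the
+Tahoe-LAFS implementation: the class ``ShareStateTable``, the set
+``ALLOWED_TRANSITIONS`` and the in-memory dict standing in for the
+``shares`` table are hypothetical, and serialization of requests for
+mutable shares is not modelled::
+
+  NONE = "NONE"
+  STATE_COMING = "STATE_COMING"
+  STATE_STABLE = "STATE_STABLE"
+  STATE_GOING = "STATE_GOING"
+
+  # Every (old, new) pair listed under "State transitions".
+  ALLOWED_TRANSITIONS = {
+      (STATE_GOING, NONE),
+      (STATE_STABLE, NONE),
+      (NONE, STATE_COMING),
+      (STATE_STABLE, STATE_COMING),
+      (STATE_COMING, STATE_STABLE),
+      (NONE, STATE_STABLE),
+      (STATE_STABLE, STATE_GOING),
+  }
+
+  class ShareStateTable:
+      def __init__(self):
+          # (storage_index, shnum) -> state; a missing key means NONE
+          self._states = {}
+
+      def get(self, key):
+          return self._states.get(key, NONE)
+
+      def transition(self, key, new_state):
+          old_state = self.get(key)
+          if (old_state, new_state) not in ALLOWED_TRANSITIONS:
+              # For example, STATE_GOING -> STATE_COMING: a share that is
+              # being deleted must not be recreated.
+              raise ValueError("illegal transition %s -> %s"
+                               % (old_state, new_state))
+          if new_state == NONE:
+              del self._states[key]
+          else:
+              self._states[key] = new_state
+
+      def is_protected(self, key, has_lease):
+          # A share that is being written is treated as leased.
+          return has_lease or self.get(key) == STATE_COMING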
+
+
+Unresolved design issues
+------------------------
+
+- What happens if a write to store objects for a new share fails
+  permanently?  If we delete the share entry, then the accounting crawler
+  will eventually get to those store objects and see that their lengths
+  are inconsistent with the length in the container header. This will cause
+  the share to be treated as corrupted. Should we instead attempt to
+  delete those objects immediately? If so, do we need a direct
+  **STATE_COMING** → **STATE_GOING** transition to handle this case?
+
+- What happens if only some store objects for a share disappear
+  unexpectedly?  This case is similar to only some objects having been
+  written when we get an unrecoverable error during creation of a share, but
+  perhaps we want to treat it differently in order to preserve information
+  about the storage service having lost data.
+
+- Does the leasedb need to track corrupted shares?
+
+
+Future directions
+-----------------
+
+Clients will have key pairs identifying accounts, and will be able to add
+leases for a specific account. Various space usage policies can be defined.
+
+Better migration tools (``tahoe storage export``?) will create export files
+that include both the share data and the lease data, and then an import tool
+will both put the share in the right place and update the recipient node's
+leasedb.