From: David-Sarah Hopwood
Date: Thu, 6 Dec 2012 03:53:54 +0000 (+0000)
Subject: docs/proposed/leasedb.rst: Add design doc for leasedb.
X-Git-Tag: allmydata-tahoe-1.10.1a1~252
X-Git-Url: https://git.rkrishnan.org/uri/URI:DIR2-RO:%5B%5E?a=commitdiff_plain;h=b94721ad403b4ffef378545e5e38b92a7f64376e;p=tahoe-lafs%2Ftahoe-lafs.git

docs/proposed/leasedb.rst: Add design doc for leasedb.

This version refs #1834.

Signed-off-by: David-Sarah Hopwood
---

diff --git a/docs/proposed/leasedb.rst b/docs/proposed/leasedb.rst
new file mode 100644
index 00000000..bbd0e92a
--- /dev/null
+++ b/docs/proposed/leasedb.rst
@@ -0,0 +1,288 @@
+
+=====================
+Lease database design
+=====================
+
+The target audience for this document is developers who wish to understand
+the new lease database (leasedb) planned to be added in Tahoe-LAFS v1.11.0.
+
+
+Introduction
+------------
+
+A "lease" is a request by an account that a share not be deleted before a
+specified time. Each storage server stores leases in order to know which
+shares to spare from garbage collection.
+
+Motivation
+----------
+
+The leasedb will replace the current design in which leases are stored in
+the storage server's share container files. That design has several
+disadvantages:
+
+- Updating a lease requires modifying a share container file (even for
+  immutable shares). This complicates the implementation of share classes.
+  The mixing of share contents and lease data in share files also led to a
+  security bug (ticket `#1528`_).
+
+- When only the disk backend is supported, it is possible to read and
+  update leases synchronously because the share files are stored locally
+  to the storage server. For the cloud backend, accessing share files
+  requires an HTTP request, and so must be asynchronous. Accepting this
+  asynchrony for lease queries would be both inefficient and complex.
+  Moving lease information out of shares and into a local database allows
+  lease queries to stay synchronous.
+
+Also, the current cryptographic protocol for renewing and cancelling leases
+(based on shared secrets derived from secure hash functions) is complex,
+and the cancellation part was never used.
+
+The leasedb solves the first two problems by storing the lease information in
+a local database instead of in the share container files. The share data
+itself is still held in the share container file.
+
+At the same time as implementing leasedb, we devised a simpler protocol for
+allocating and cancelling leases: a client can use a public key digital
+signature to authenticate access to a foolscap object representing the
+authority of an account. This protocol is not yet implemented; at the time
+of writing, only an "anonymous" account is supported.
+
+The leasedb also provides an efficient way to get summarized information,
+such as total space usage of shares leased by an account, for accounting
+purposes.
+
+.. _`#1528`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1528
+
+
+Design constraints
+------------------
+
+A share is stored as a collection of objects. The persistent storage may be
+remote from the server (for example, cloud storage).
+
+Writing to the persistent store objects is in general not an atomic
+operation. So the leasedb also keeps track of which shares are in an
+inconsistent state because they have been partly written. (This may
+change in future when we implement a protocol to improve atomicity of
+updates to mutable shares.)
+
+Leases are no longer stored in shares. The same share format is used as
+before, but the lease slots are ignored, and are cleared when rewriting a
+mutable share. The new design also does not use lease renewal or cancel
+secrets. (They are accepted as parameters in the storage protocol interfaces
+for backward compatibility, but are ignored. Cancel secrets were already
+ignored due to the fix for `#1528`_.)
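As an illustration of what keeping leases in a local database allows, here is a minimal sketch using Python's standard ``sqlite3`` module. The table and column names (``shares``, ``leases``, ``used_space``, ``account_id``) and the helper functions are assumptions made for this example, not the schema the leasedb actually uses. The sketch shows a lease being added or renewed without touching any share container file, and a synchronous per-account space usage summary of the kind mentioned under Motivation.

```python
import sqlite3

# Hypothetical schema for illustration only; the real leasedb schema may differ.
SCHEMA = """
CREATE TABLE shares (
    storage_index TEXT NOT NULL,
    shnum         INTEGER NOT NULL,
    used_space    INTEGER NOT NULL,
    PRIMARY KEY (storage_index, shnum)
);
CREATE TABLE leases (
    storage_index   TEXT NOT NULL,
    shnum           INTEGER NOT NULL,
    account_id      INTEGER NOT NULL,
    renewal_time    INTEGER NOT NULL,
    expiration_time INTEGER NOT NULL,
    PRIMARY KEY (storage_index, shnum, account_id)
);
"""


def open_leasedb(path=":memory:"):
    """Open (and initialise) a local lease database."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def add_or_renew_lease(db, storage_index, shnum, account_id, now, duration):
    """Record a lease locally; no share container file is read or written."""
    # The composite primary key keeps at most one lease per account per
    # share, so renewing simply replaces the existing row.
    db.execute(
        "INSERT OR REPLACE INTO leases VALUES (?, ?, ?, ?, ?)",
        (storage_index, shnum, account_id, now, now + duration),
    )


def account_usage(db, account_id):
    """Return the total used_space of all shares leased by one account."""
    row = db.execute(
        "SELECT COALESCE(SUM(s.used_space), 0) FROM shares s "
        "JOIN leases l ON l.storage_index = s.storage_index "
        "AND l.shnum = s.shnum "
        "WHERE l.account_id = ?",
        (account_id,),
    ).fetchone()
    return row[0]
```

Because both functions only issue local SQL statements, they can run synchronously even when the share data itself lives in remote cloud storage.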
+
+The new design needs to be fail-safe in the sense that if the lease database
+is lost or corruption is detected, no share data will be lost (even though
+the metadata about leases held by particular accounts has been lost).
+
+
+Accounting crawler
+------------------
+
+A "crawler" is a long-running process that visits share container files at a
+slow rate, so as not to overload the server by trying to visit all share
+container files one after another immediately.
+
+The accounting crawler replaces the previous "lease crawler". It examines
+each share container file and compares it with the state of the leasedb, and
+may update the state of the share and/or the leasedb.
+
+The accounting crawler may perform the following functions (but see ticket
+#1834 for a proposal to reduce the scope of its responsibility):
+
+- Remove leases that are past their expiration time. (Currently, this is
+  done automatically before deleting shares, but we plan to allow expiration
+  to be performed separately for individual accounts in future.)
+
+- Delete the objects containing unleased shares, that is, shares that have
+  stable entries in the leasedb but no current leases (see below for the
+  definition of "stable" entries).
+
+- Discover shares that have been manually added to storage, via ``scp`` or
+  some other out-of-band means.
+
+- Discover shares that are present when a storage server is upgraded to
+  a leasedb-supporting version from a previous version, and give them
+  "starter leases".
+
+- Recover from a situation where the leasedb is lost or detectably
+  corrupted. This is handled in the same way as upgrading from a previous
+  version.
+
+- Detect shares that have unexpectedly disappeared from storage. The
+  disappearance of a share is logged, and its entry and leases are removed
+  from the leasedb.
+
+
+Accounts
+--------
+
+An account holds leases for some subset of shares stored by a server.
+The leasedb schema can handle many distinct accounts, but for the time being
+we create only two accounts: an anonymous account and a starter account. The
+starter account is used for leases on shares discovered by the accounting
+crawler; the anonymous account is used for all other leases.
+
+The leasedb has at most one lease entry per account per (storage_index,
+shnum) pair. This entry stores the times when the lease was last renewed and
+when it is set to expire (if the expiration policy does not force it to
+expire earlier), represented as Unix UTC-seconds-since-epoch timestamps.
+
+For more on expiration policy, see `docs/garbage-collection.rst
+<../garbage-collection.rst>`__.
+
+
+Share states
+------------
+
+The leasedb holds an explicit indicator of the state of each share.
+
+The diagram and descriptions below give the possible values of the "state"
+indicator, what that value means, and transitions between states, for any
+(storage_index, shnum) pair on each server::
+
+
+    #        STATE_STABLE -------.
+    #         ^   |   ^ |        |
+    #         |   v   | |        v
+    #    STATE_COMING | |   STATE_GOING
+    #         ^       | |        |
+    #         |       | v        |
+    #         '----- NONE <------'
+
+
+**NONE**: There is no entry in the ``shares`` table for this (storage_index,
+shnum) in this server's leasedb. This is the initial state.
+
+**STATE_COMING**: The share is being created or (if a mutable share)
+updated. The store objects may have been at least partially written, but
+the storage server doesn't have confirmation that they have all been
+completely written.
+
+**STATE_STABLE**: The store objects have been completely written and are
+not in the process of being modified or deleted by the storage server. (They
+could have been modified or deleted behind the back of the storage server,
+but if they have, the server has not noticed that yet.) The share may or may
+not be leased.
+
+**STATE_GOING**: The share is being deleted.
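The diagram above can be read as a table of permitted transitions. The following is a minimal in-memory sketch, not the leasedb implementation: the ``ShareStateTable`` class and its method names are hypothetical, and a real server would keep the state in the ``shares`` table rather than in a dictionary. As in the descriptions above, **NONE** is represented by the absence of an entry.

```python
NONE = "NONE"
STATE_COMING = "STATE_COMING"
STATE_STABLE = "STATE_STABLE"
STATE_GOING = "STATE_GOING"

# Every arrow in the state diagram, as (old_state, new_state) pairs.
ALLOWED_TRANSITIONS = {
    (NONE, STATE_COMING),
    (NONE, STATE_STABLE),
    (STATE_COMING, STATE_STABLE),
    (STATE_STABLE, STATE_COMING),
    (STATE_STABLE, STATE_GOING),
    (STATE_STABLE, NONE),
    (STATE_GOING, NONE),
}


class ShareStateTable:
    """Hypothetical in-memory stand-in for the state column of the leasedb."""

    def __init__(self):
        # Maps (storage_index, shnum) -> state; a missing key means NONE.
        self._states = {}

    def get(self, storage_index, shnum):
        return self._states.get((storage_index, shnum), NONE)

    def transition(self, storage_index, shnum, new_state):
        """Move a share to new_state, rejecting arrows not in the diagram."""
        key = (storage_index, shnum)
        old_state = self.get(storage_index, shnum)
        if (old_state, new_state) not in ALLOWED_TRANSITIONS:
            raise ValueError(
                "illegal transition %s -> %s" % (old_state, new_state)
            )
        if new_state == NONE:
            # Reaching NONE means the entry is removed altogether.
            del self._states[key]
        else:
            self._states[key] = new_state
```

With this table, a request to recreate a share that is still in **STATE_GOING** raises ``ValueError``, because the diagram has no arrow from **STATE_GOING** to **STATE_COMING**.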
+
+State transitions
+-----------------
+
+• **STATE_GOING** → **NONE**
+
+    trigger: The storage server gains confidence that all store objects for
+    the share have been removed.
+
+    implementation:
+
+    1. Remove the entry in the leasedb.
+
+• **STATE_STABLE** → **NONE**
+
+    trigger: The accounting crawler notices that all the store objects for
+    this share are gone.
+
+    implementation:
+
+    1. Remove the entry in the leasedb.
+
+• **NONE** → **STATE_COMING**
+
+    triggers: A new share is being created, as explicitly signalled by a
+    client invoking a creation command, *or* the accounting crawler discovers
+    an incomplete share.
+
+    implementation:
+
+    1. Add an entry to the leasedb with **STATE_COMING**.
+
+    2. (In case of explicit creation) begin writing the store objects to hold
+       the share.
+
+• **STATE_STABLE** → **STATE_COMING**
+
+    trigger: A mutable share is being modified, as explicitly signalled by a
+    client invoking a modification command.
+
+    implementation:
+
+    1. Change the state value of this entry in the leasedb from
+       **STATE_STABLE** to **STATE_COMING**.
+
+    2. Begin updating the store objects.
+
+• **STATE_COMING** → **STATE_STABLE**
+
+    trigger: All store objects have been written.
+
+    implementation:
+
+    1. Change the state value of this entry in the leasedb from
+       **STATE_COMING** to **STATE_STABLE**.
+
+• **NONE** → **STATE_STABLE**
+
+    trigger: The accounting crawler discovers a complete share.
+
+    implementation:
+
+    1. Add an entry to the leasedb with **STATE_STABLE**.
+
+• **STATE_STABLE** → **STATE_GOING**
+
+    trigger: The share should be deleted because it is unleased.
+
+    implementation:
+
+    1. Change the state value of this entry in the leasedb from
+       **STATE_STABLE** to **STATE_GOING**.
+
+    2. Initiate removal of the store objects.
+
+
+The following constraints are needed to avoid race conditions:
+
+- While a share is being deleted (entry in **STATE_GOING**), we do not accept
+  any requests to recreate it.
+  That would result in add and delete requests for store objects being sent
+  concurrently, with undefined results.
+
+- While a share is being added or modified (entry in **STATE_COMING**), we
+  treat it as leased.
+
+- Creation or modification requests for a given mutable share are serialized.
+
+
+Unresolved design issues
+------------------------
+
+- What happens if a write to store objects for a new share fails
+  permanently? If we delete the share entry, then the accounting crawler
+  will eventually get to those store objects and see that their lengths
+  are inconsistent with the length in the container header. This will cause
+  the share to be treated as corrupted. Should we instead attempt to
+  delete those objects immediately? If so, do we need a direct
+  **STATE_COMING** → **STATE_GOING** transition to handle this case?
+
+- What happens if only some store objects for a share disappear
+  unexpectedly? This case is similar to only some objects having been
+  written when we get an unrecoverable error during creation of a share, but
+  perhaps we want to treat it differently in order to preserve information
+  about the storage service having lost data.
+
+- Does the leasedb need to track corrupted shares?
+
+
+Future directions
+-----------------
+
+Clients will have key pairs identifying accounts, and will be able to add
+leases for a specific account. Various space usage policies can be defined.
+
+Better migration tools ('tahoe storage export'?) will create export files
+that include both the share data and the lease data, and then an import tool
+will both put the share in the right place and update the recipient node's
+leasedb.