From: Zooko O'Whielacronx Date: Tue, 29 Apr 2008 22:51:58 +0000 (-0700) Subject: docs: remove the redundant (and therefore bit-rotting) parts of mutable-DSA.txt and... X-Git-Tag: allmydata-tahoe-1.1.0~175 X-Git-Url: https://git.rkrishnan.org/pf/content/en/seg/index.php?a=commitdiff_plain;h=2f3a44f820197e79c8af6ecb06b10f70402b5228;p=tahoe-lafs%2Ftahoe-lafs.git docs: remove the redundant (and therefore bit-rotting) parts of mutable-DSA.txt and instead refer to mutable.txt --- diff --git a/docs/mutable-DSA.txt b/docs/mutable-DSA.txt index b02815c1..73f3eb78 100644 --- a/docs/mutable-DSA.txt +++ b/docs/mutable-DSA.txt @@ -6,99 +6,9 @@ mutable files that were introduced in tahoe-0.7.0 . This proposal has not yet been implemented. Please see mutable-DSA.svg for a quick picture of the crypto scheme described herein) -= Mutable Files = - -Mutable File Slots are places with a stable identifier that can hold data -that changes over time. In contrast to CHK slots, for which the -URI/identifier is derived from the contents themselves, the Mutable File Slot -URI remains fixed for the life of the slot, regardless of what data is placed -inside it. - -Each mutable slot is referenced by two different URIs. The "read-write" URI -grants read-write access to its holder, allowing them to put whatever -contents they like into the slot. The "read-only" URI is less powerful, only -granting read access, and not enabling modification of the data. The -read-write URI can be turned into the read-only URI, but not the other way -around. - -The data in these slots is distributed over a number of servers, using the -same erasure coding that CHK files use, with 3-of-10 being a typical choice -of encoding parameters. The data is encrypted and signed in such a way that -only the holders of the read-write URI will be able to set the contents of -the slot, and only the holders of the read-only URI will be able to read -those contents. Holders of either URI will be able to validate the contents -as being written by someone with the read-write URI. The servers who hold the -shares cannot read or modify them: the worst they can do is deny service (by -deleting or corrupting the shares), or attempt a rollback attack (which can -only succeed with the cooperation of at least k servers). - -== Consistency vs Availability == - -There is an age-old battle between consistency and availability. Epic papers -have been written, elaborate proofs have been established, and generations of -theorists have learned that you cannot simultaneously achieve guaranteed -consistency with guaranteed reliability. In addition, the closer to 0 you get -on either axis, the cost and complexity of the design goes up. - -Tahoe's design goals are to largely favor design simplicity, then slightly -favor read availability, over the other criteria. - -As we develop more sophisticated mutable slots, the API may expose multiple -read versions to the application layer. The tahoe philosophy is to defer most -consistency recovery logic to the higher layers. Some applications have -effective ways to merge multiple versions, so inconsistency is not -necessarily a problem (i.e. directory nodes can usually merge multiple "add -child" operations). - -== The Prime Coordination Directive: "Don't Do That" == - -The current rule for applications which run on top of Tahoe is "do not -perform simultaneous uncoordinated writes". That means you need non-tahoe -means to make sure that two parties are not trying to modify the same mutable -slot at the same time. For example: - - * don't give the read-write URI to anyone else. Dirnodes in a private - directory generally satisfy this case, as long as you don't use two - clients on the same account at the same time - * if you give a read-write URI to someone else, stop using it yourself. An - inbox would be a good example of this. - * if you give a read-write URI to someone else, call them on the phone - before you write into it - * build an automated mechanism to have your agents coordinate writes. - For example, we expect a future release to include a FURL for a - "coordination server" in the dirnodes. The rule can be that you must - contact the coordination server and obtain a lock/lease on the file - before you're allowed to modify it. - -If you do not follow this rule, Bad Things will happen. The worst-case Bad -Thing is that the entire file will be lost. A less-bad Bad Thing is that one -or more of the simultaneous writers will lose their changes. An observer of -the file may not see monotonically-increasing changes to the file, i.e. they -may see version 1, then version 2, then 3, then 2 again. - -Tahoe takes some amount of care to reduce the badness of these Bad Things. -One way you can help nudge it from the "lose your file" case into the "lose -some changes" case is to reduce the number of competing versions: multiple -versions of the file that different parties are trying to establish as the -one true current contents. Each simultaneous writer counts as a "competing -version", as does the previous version of the file. If the count "S" of these -competing versions is larger than N/k, then the file runs the risk of being -lost completely. If at least one of the writers remains running after the -collision is detected, it will attempt to recover, but if S>(N/k) and all -writers crash after writing a few shares, the file will be lost. - - -== Small Distributed Mutable Files == - -SDMF slots are suitable for small (<1MB) files that are editing by rewriting -the entire file. The three operations are: - - * allocate (with initial contents) - * set (with new contents) - * get (old contents) - -The first use of SDMF slots will be to hold directories (dirnodes), which map -encrypted child names to rw-URI/ro-URI pairs. +This file shows only the differences from RSA-based mutable files to +(EC)DSA-based mutable files. You have to read and understand mutable.txt before +reading this file (mutable-DSA.txt). === SDMF slots overview === @@ -380,166 +290,6 @@ which the public key can be hashed three times to produce the desired portion of the storage index. We believe that 64 bits of material is sufficiently resistant to this form of pre-image attack to serve as a suitable deterrent. - -=== Server Storage Protocol === - -The storage servers will provide a mutable slot container which is oblivious -to the details of the data being contained inside it. Each storage index -refers to a "bucket", and each bucket has one or more shares inside it. (In a -well-provisioned network, each bucket will have only one share). The bucket -is stored as a directory, using the base32-encoded storage index as the -directory name. Each share is stored in a single file, using the share number -as the filename. - -The container holds space for a container magic number (for versioning), the -write enabler, the nodeid which accepted the write enabler (used for share -migration, described below), a small number of lease structures, the embedded -data itself, and expansion space for additional lease structures. - - # offset size name - 1 0 32 magic verstr "tahoe mutable container v1" plus binary - 2 32 20 write enabler's nodeid - 3 52 32 write enabler - 4 84 8 data size (actual share data present) (a) - 5 92 8 offset of (8) count of extra leases (after data) - 6 100 368 four leases, 92 bytes each - 0 4 ownerid (0 means "no lease here") - 4 4 expiration timestamp - 8 32 renewal token - 40 32 cancel token - 72 20 nodeid which accepted the tokens - 7 468 (a) data - 8 ?? 4 count of extra leases - 9 ?? n*92 extra leases - -The "extra leases" field must be copied and rewritten each time the size of -the enclosed data changes. The hope is that most buckets will have four or -fewer leases and this extra copying will not usually be necessary. - -The (4) "data size" field contains the actual number of bytes of data present -in field (7), such that a client request to read beyond 504+(a) will result -in an error. This allows the client to (one day) read relative to the end of -the file. The container size (that is, (8)-(7)) might be larger, especially -if extra size was pre-allocated in anticipation of filling the container with -a lot of data. - -The offset in (5) points at the *count* of extra leases, at (8). The actual -leases (at (9)) begin 4 bytes later. If the container size changes, both (8) -and (9) must be relocated by copying. - -The server will honor any write commands that provide the write token and do -not exceed the server-wide storage size limitations. Read and write commands -MUST be restricted to the 'data' portion of the container: the implementation -of those commands MUST perform correct bounds-checking to make sure other -portions of the container are inaccessible to the clients. - -The two methods provided by the storage server on these "MutableSlot" share -objects are: - - * readv(ListOf(offset=int, length=int)) - * returns a list of bytestrings, of the various requested lengths - * offset < 0 is interpreted relative to the end of the data - * spans which hit the end of the data will return truncated data - - * testv_and_writev(write_enabler, test_vector, write_vector) - * this is a test-and-set operation which performs the given tests and only - applies the desired writes if all tests succeed. This is used to detect - simultaneous writers, and to reduce the chance that an update will lose - data recently written by some other party (written after the last time - this slot was read). - * test_vector=ListOf(TupleOf(offset, length, opcode, specimen)) - * the opcode is a string, from the set [gt, ge, eq, le, lt, ne] - * each element of the test vector is read from the slot's data and - compared against the specimen using the desired (in)equality. If all - tests evaluate True, the write is performed - * write_vector=ListOf(TupleOf(offset, newdata)) - * offset < 0 is not yet defined, it probably means relative to the - end of the data, which probably means append, but we haven't nailed - it down quite yet - * write vectors are executed in order, which specifies the results of - overlapping writes - * return value: - * error: OutOfSpace - * error: something else (io error, out of memory, whatever) - * (True, old_test_data): the write was accepted (test_vector passed) - * (False, old_test_data): the write was rejected (test_vector failed) - * both 'accepted' and 'rejected' return the old data that was used - for the test_vector comparison. This can be used by the client - to detect write collisions, including collisions for which the - desired behavior was to overwrite the old version. - -In addition, the storage server provides several methods to access these -share objects: - - * allocate_mutable_slot(storage_index, sharenums=SetOf(int)) - * returns DictOf(int, MutableSlot) - * get_mutable_slot(storage_index) - * returns DictOf(int, MutableSlot) - * or raises KeyError - -We intend to add an interface which allows small slots to allocate-and-write -in a single call, as well as do update or read in a single call. The goal is -to allow a reasonably-sized dirnode to be created (or updated, or read) in -just one round trip (to all N shareholders in parallel). - -==== migrating shares ==== - -If a share must be migrated from one server to another, two values become -invalid: the write enabler (since it was computed for the old server), and -the lease renew/cancel tokens. - -Suppose that a slot was first created on nodeA, and was thus initialized with -WE(nodeA) (= H(WEM+nodeA)). Later, for provisioning reasons, the share is -moved from nodeA to nodeB. - -Readers may still be able to find the share in its new home, depending upon -how many servers are present in the grid, where the new nodeid lands in the -permuted index for this particular storage index, and how many servers the -reading client is willing to contact. - -When a client attempts to write to this migrated share, it will get a "bad -write enabler" error, since the WE it computes for nodeB will not match the -WE(nodeA) that was embedded in the share. When this occurs, the "bad write -enabler" message must include the old nodeid (e.g. nodeA) that was in the -share. - -The client then computes H(nodeB+H(WEM+nodeA)), which is the same as -H(nodeB+WE(nodeA)). The client sends this along with the new WE(nodeB), which -is H(WEM+nodeB). Note that the client only sends WE(nodeB) to nodeB, never to -anyone else. Also note that the client does not send a value to nodeB that -would allow the node to impersonate the client to a third node: everything -sent to nodeB will include something specific to nodeB in it. - -The server locally computes H(nodeB+WE(nodeA)), using its own node id and the -old write enabler from the share. It compares this against the value supplied -by the client. If they match, this serves as proof that the client was able -to compute the old write enabler. The server then accepts the client's new -WE(nodeB) and writes it into the container. - -This WE-fixup process requires an extra round trip, and requires the error -message to include the old nodeid, but does not require any public key -operations on either client or server. - -Migrating the leases will require a similar protocol. This protocol will be -defined concretely at a later date. - -=== Code Details === - -The current FileNode class will be renamed ImmutableFileNode, and a new -MutableFileNode class will be created. Instances of this class will contain a -URI and a reference to the client (for peer selection and connection). The -methods of MutableFileNode are: - - * replace(newdata) -> OK, ConsistencyError, NotEnoughSharesError - * get() -> [deferred] newdata, NotEnoughSharesError - * if there are multiple retrieveable versions in the grid, get() returns - the first version it can reconstruct, and silently ignores the others. - In the future, a more advanced API will signal and provide access to - the multiple heads. - -The peer-selection and data-structure manipulation (and signing/verification) -steps will be implemented in a separate class in allmydata/mutable.py . - === SMDF Slot Format === This SMDF data lives inside a server-side MutableSlot container. The server @@ -586,142 +336,8 @@ key sizes. has tree (with root "R"), from which a minimal subset of hashes is put in the share hash chain in (8). -=== Recovery === - -The first line of defense against damage caused by colliding writes is the -Prime Coordination Directive: "Don't Do That". - -The second line of defense is to keep "S" (the number of competing versions) -lower than N/k. If this holds true, at least one competing version will have -k shares and thus be recoverable. Note that server unavailability counts -against us here: the old version stored on the unavailable server must be -included in the value of S. - -The third line of defense is our use of testv_and_writev() (described below), -which increases the convergence of simultaneous writes: one of the writers -will be favored (the one with the highest "R"), and that version is more -likely to be accepted than the others. This defense is least effective in the -pathological situation where S simultaneous writers are active, the one with -the lowest "R" writes to N-k+1 of the shares and then dies, then the one with -the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the -one with the highest "R" writes to k-1 shares and dies. Any other sequencing -will allow the highest "R" to write to at least k shares and establish a new -revision. - -The fourth line of defense is the fact that each client keeps writing until -at least one version has N shares. This uses additional servers, if -necessary, to make sure that either the client's version or some -newer/overriding version is highly available. - -The fifth line of defense is the recovery algorithm, which seeks to make sure -that at least *one* version is highly available, even if that version is -somebody else's. - -The write-shares-to-peers algorithm is as follows: - - * permute peers according to storage index - * walk through peers, trying to assign one share per peer - * for each peer: - * send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test - * this means that we will overwrite any old versions, and we will - overwrite simultaenous writers of the same version if our R is higher. - We will not overwrite writers using a higher seqnum. - * record the version that each share winds up with. If the write was - accepted, this is our own version. If it was rejected, read the - old_test_data to find out what version was retained. - * if old_test_data indicates the seqnum was equal or greater than our - own, mark the "Simultanous Writes Detected" flag, which will eventually - result in an error being reported to the writer (in their close() call). - * build a histogram of "R" values - * repeat until the histogram indicate that some version (possibly ours) - has N shares. Use new servers if necessary. - * If we run out of servers: - * if there are at least shares-of-happiness of any one version, we're - happy, so return. (the close() might still get an error) - * not happy, need to reinforce something, goto RECOVERY - -RECOVERY: - * read all shares, count the versions, identify the recoverable ones, - discard the unrecoverable ones. - * sort versions: locate max(seqnums), put all versions with that seqnum - in the list, sort by number of outstanding shares. Then put our own - version. (TODO: put versions with seqnum us ahead of us?). - * for each version: - * attempt to recover that version - * if not possible, remove it from the list, go to next one - * if recovered, start at beginning of peer list, push that version, - continue until N shares are placed - * if pushing our own version, bump up the seqnum to one higher than - the max seqnum we saw - * if we run out of servers: - * schedule retry and exponential backoff to repeat RECOVERY - * admit defeat after some period? presumeably the client will be shut down - eventually, maybe keep trying (once per hour?) until then. - - - - -== Medium Distributed Mutable Files == - -These are just like the SDMF case, but: - - * we actually take advantage of the Merkle hash tree over the blocks, by - reading a single segment of data at a time (and its necessary hashes), to - reduce the read-time alacrity - * we allow arbitrary writes to the file (i.e. seek() is provided, and - O_TRUNC is no longer required) - * we write more code on the client side (in the MutableFileNode class), to - first read each segment that a write must modify. This looks exactly like - the way a normal filesystem uses a block device, or how a CPU must perform - a cache-line fill before modifying a single word. - * we might implement some sort of copy-based atomic update server call, - to allow multiple writev() calls to appear atomic to any readers. - -MDMF slots provide fairly efficient in-place edits of very large files (a few -GB). Appending data is also fairly efficient, although each time a power of 2 -boundary is crossed, the entire file must effectively be re-uploaded (because -the size of the block hash tree changes), so if the filesize is known in -advance, that space ought to be pre-allocated (by leaving extra space between -the block hash tree and the actual data). - -MDMF1 uses the Merkle tree to enable low-alacrity random-access reads. MDMF2 -adds cache-line reads to allow random-access writes. - -== Large Distributed Mutable Files == - -LDMF slots use a fundamentally different way to store the file, inspired by -Mercurial's "revlog" format. They enable very efficient insert/remove/replace -editing of arbitrary spans. Multiple versions of the file can be retained, in -a revision graph that can have multiple heads. Each revision can be -referenced by a cryptographic identifier. There are two forms of the URI, one -that means "most recent version", and a longer one that points to a specific -revision. - -Metadata can be attached to the revisions, like timestamps, to enable rolling -back an entire tree to a specific point in history. - -LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2 -provides explicit support for revision identifiers and branching. - == TODO == -improve allocate-and-write or get-writer-buckets API to allow one-call (or -maybe two-call) updates. The challenge is in figuring out which shares are on -which machines. First cut will have lots of round trips. - -(eventually) define behavior when seqnum wraps. At the very least make sure -it can't cause a security problem. "the slot is worn out" is acceptable. - -(eventually) define share-migration lease update protocol. Including the -nodeid who accepted the lease is useful, we can use the same protocol as we -do for updating the write enabler. However we need to know which lease to -update.. maybe send back a list of all old nodeids that we find, then try all -of them when we accept the update? - - We now do this in a specially-formatted IndexError exception: - "UNABLE to renew non-existent lease. I have leases accepted by " + - "nodeids: '12345','abcde','44221' ." - Every node in a given tahoe grid must have the same common DSA moduli and exponent, but different grids could use different parameters. We haven't figured out how to define a "grid id" yet, but I think the DSA parameters