From: Brian Warner Date: Fri, 26 Oct 2007 09:26:56 +0000 (-0700) Subject: docs: add writeup of our mutable-file plans X-Git-Tag: allmydata-tahoe-0.7.0~353 X-Git-Url: https://git.rkrishnan.org/pf/content/en/footer/about.html?a=commitdiff_plain;h=c4d2a5faa2e6864d93c2b447c8134ffd4dea8f6c;p=tahoe-lafs%2Ftahoe-lafs.git docs: add writeup of our mutable-file plans --- diff --git a/docs/mutable.txt b/docs/mutable.txt new file mode 100644 index 00000000..891462c4 --- /dev/null +++ b/docs/mutable.txt @@ -0,0 +1,458 @@ + +(protocol proposal, work-in-progress, not authoritative) + += Mutable Files = + +Mutable File Slots are places with a stable identifier that can hold data +that changes over time. In contrast to CHK slots, for which the +URI/identifier is derived from the contents themselves, the Mutable File Slot +URI remains fixed for the life of the slot, regardless of what data is placed +inside it. + +Each mutable slot is referenced by two different URIs. The "read-write" URI +grants read-write access to its holder, allowing them to put whatever +contents they like into the slot. The "read-only" URI is less powerful, only +granting read access, and not enabling modification of the data. The +read-write URI can be turned into the read-only URI, but not the other way +around. + +The data in these slots is distributed over a number of servers, using the +same erasure coding that CHK files use, with 3-of-10 being a typical choice +of encoding parameters. The data is encrypted and signed in such a way that +only the holders of the read-write URI will be able to set the contents of +the slot, and only the holders of the read-only URI will be able to read +those contents. Holders of either URI will be able to validate the contents +as being written by someone with the read-write URI. The servers who hold the +shares cannot read or modify them: the worst they can do is deny service (by +deleting or corrupting the shares), or attempt a rollback attack (which can +only succeed with the cooperation of at least k servers). + +== Consistency vs Availability == + +There is an age-old battle between consistency and availability. Epic papers +have been written, elaborate proofs have been established, and generations of +theorists have learned that you cannot simultaneously achieve guaranteed +consistency with guaranteed reliability. In addition, the closer to 0 you get +on either axis, the cost and complexity of the design goes up. + +Tahoe's design goals are to largely favor design simplicity, then slightly +favor read availability, over the other criteria. + +As we develop more sophisticated mutable slots, the API may expose multiple +read versions to the application layer. The tahoe philosophy is to defer most +consistency recovery logic to the higher layers. Some applications have +effective ways to merge multiple versions, so inconsistency is not +necessarily a problem (i.e. directory nodes can usually merge multiple "add +child" operations). + +== The Prime Coordination Directive: "Don't Do That" == + +The current rule for applications which run on top of Tahoe is "do not +perform simultaneous uncoordinated writes". That means you need non-tahoe +means to make sure that two parties are not trying to modify the same mutable +slot at the same time. For example: + + * don't give the read-write URI to anyone else. Dirnodes in a private + directory generally satisfy this case, as long as you don't use two + clients on the same account at the same time + * if you give a read-write URI to someone else, stop using it yourself. An + inbox would be a good example of this. + * if you give a read-write URI to someone else, call them on the phone + before you write into it + * build an automated mechanism to have your agents coordinate writes. + For example, we expect a future release to include a FURL for a + "coordination server" in the dirnodes. The rule can be that you must + contact the coordination server and obtain a lock/lease on the file + before you're allowed to modify it. + +If you do not follow this rule, Bad Things will happen. The worst-case Bad +Thing is that the entire file will be lost. A less-bad Bad Thing is that one +or more of the simultaneous writers will lose their changes. An observer of +the file may not see monotonically-increasing changes to the file, i.e. they +may see version 1, then version 2, then 3, then 2 again. + +Tahoe takes some amount of care to reduce the badness of these Bad Things. +One way you can help nudge it from the "lose your file" case into the "lose +some changes" case is to reduce the number of competing versions: multiple +versions of the file that different parties are trying to establish as the +one true current contents. Each simultaneous writer counts as a "competing +version", as does the previous version of the file. If the count "S" of these +competing versions is larger than N/k, then the file runs the risk of being +lost completely. If at least one of the writers remains running after the +collision is detected, it will attempt to recover, but if S>(N/k) and all +writers crash after writing a few shares, the file will be lost. + + +== Small Distributed Mutable Files == + +SDMF slots are suitable for small (<1MB) files that are editing by rewriting +the entire file. The three operations are: + + * allocate (with initial contents) + * set (with new contents) + * get (old contents) + +The first use of SDMF slots will be to hold directories (dirnodes), which map +encrypted child names to rw-URI/ro-URI pairs. + +=== SDMF slots overview === + +Each SDMF slot is created with a public/private key pair (known as the +"verification key" and the "signature key"). The public key is hashed to form +the "read key" (an AES symmetric key), and the read key is hashed to form the +Storage Index (a unique string). The private key and public key are +concatenated together and hashed to form the "write key". The write key is +then hashed to form the "write enabler master". For each storage server on +which a share is kept, the write enabler master is concatenated with the +server's nodeid and hashed, and the result is called the "write enabler" for +that particular server. + +The read-write URI consists of the write key and the storage index. The +read-only URI contains just the read key. + +The SDMF slot is allocated by sending a request to the storage server with a +desired size, the storage index, and the write enabler for that server's +nodeid. If granted, the write enabler is stashed inside the slot's backing +store file. All further write requests must be accompanied by the write +enabler or they will not be honored. The storage server does not share the +write enabler with anyone else. + +The SDMF slot structure will be described in more detail below. The important +pieces are: + + * a sequence number + * a root hash "R" + * the encoding parameters (including k, N, and the file size) + * a signed copy of [seqnum,R,encoding_params], using the signature key + * the verification key (not encrypted) + * the share hash chain (part of a Merkle tree over the share hashes) + * the share data itself (erasure-coding of read-key-encrypted file data) + * the signature key, encrypted with the write key + +The access pattern for read is: + * use storage index to locate 'k' shares with identical 'R' values + * read verification key + * hash verification key, compare against read key + * OOPS!!! verification key is in the clear, so read key is too!! FIX! + * read seqnum, R, encoding parameters, signature + * verify signature + * read share data, hash + * read share hash chain + * validate share hash chain up to the root "R" + * submit share data to erasure decoding + * decrypt decoded data with read-key + * submit plaintext to application + +The access pattern for write is: + * use the storage index to locate at least one share + * read verification key and encrypted signature key + * decrypt signature key using write-key + * concatenate signature and verification keys, compare against write-key + * hash verification key to form read-key + * encrypt plaintext from application with read-key + * erasure-code crypttext to form shares + * compute Merkle tree of shares, find root "R" + * create share data structures, one per server: + * use seqnum which is one higher than the old version + * share hash chain has log(N) hashes, different for each server + * signed data is the same for each server + * now we have N shares and need homes for them + * walk through peers + * if share is not already present, allocate-and-set + * otherwise, try to modify existing share: + * send testv_and_writev operation to each one + * testv says to accept share if their(seqnum+R) <= our(seqnum+R) + * count how many servers wind up with which versions (histogram over R) + * keep going until N servers have the same version, or we run out of servers + * if any servers wound up with a different version, report error to + application + * if we ran out of servers, initiate recovery process (described below) + +=== Server Storage Protocol === + +The storage servers will provide a mutable slot container which is oblivious +to the details of the data being contained inside it. Each storage index +refers to a "bucket", and each bucket has one or more shares inside it. (In a +well-provisioned network, each bucket will have only one share). The bucket +is stored as a directory, using the base32-encoded storage index as the +directory name. Each share is stored in a single file, using the share number +as the filename. + +The container holds space for a container magic number (for versioning), the +write enabler, the nodeid for which the write enabler was generated (for +share migration, TBD), a small number of lease structures, the embedded data +itself, and expansion space for additional lease structures. + + # offset size name + 1 0 32 magic verstr "tahoe mutable container v1" plus binary + 2 32 32 write enabler's nodeid + 3 64 32 write enabler + 4 72 8 offset of extra leases (after data) + 5 80 288 four leases: + 0 4 ownerid (0 means "no lease here") + 4 4 expiration timestamp + 8 32 renewal token + 40 32 cancel token + 6 368 ?? data + 7 ?? 4 count of extra leases + 8 ?? n*72 extra leases + +The "extra leases" field must be copied and rewritten each time the size of +the enclosed data changes. The hope is that most buckets will have four or +fewer leases and this extra copying will not usually be necessary. + +The server will honor any write commands that provide the write token and do +not exceed the server-wide storage size limitations. Read and write commands +MUST be restricted to the 'data' portion of the container: the implementation +of those commands MUST perform correct bounds-checking to make sure other +portions of the container are inaccessible to the clients. + +The two methods provided by the storage server on these "MutableSlot" share +objects are: + + * readv(ListOf(offset=int, length=int)) + * returns a list of bytestrings, of the various requested lengths + * offset < 0 is interpreted relative to the end of the data + * spans which hit the end of the data will return truncated data + + * testv_and_writev(write_enabler, test_vector, write_vector) + * this is a test-and-set operation which performs the given tests and only + applies the desired writes if all tests succeed. This is used to detect + simultaneous writers, and to reduce the chance that an update will lose + data recently written by some other party (written after the last time + this slot was read). + * test_vector=ListOf(TupleOf(offset, length, opcode, specimen)) + * the opcode is a string, from the set [gt, ge, eq, le, lt, ne] + * each element of the test vector is read from the slot's data and + compared against the specimen using the desired (in)equality. If all + tests evaluate True, the write is performed + * write_vector=ListOf(TupleOf(offset, newdata)) + * offset < 0 is not yet defined, it probably means relative to the + end of the data, which probably means append, but we haven't nailed + it down quite yet + * write vectors are executed in order, which specifies the results of + overlapping writes + * return value: + * error: OutOfSpace + * error: something else (io error, out of memory, whatever) + * (True, old_test_data): the write was accepted (test_vector passed) + * (False, old_test_data): the write was rejected (test_vector failed) + * both 'accepted' and 'rejected' return the old data that was used + for the test_vector comparison. This can be used by the client + to detect write collisions, including collisions for which the + desired behavior was to overwrite the old version. + +In addition, the storage server provides several methods to access these +share objects: + + * allocate_mutable_slot(storage_index, sharenums=SetOf(int)) + * returns DictOf(int, MutableSlot) + * get_mutable_slot(storage_index) + * returns DictOf(int, MutableSlot) + * or raises KeyError + +We intend to add an interface which allows small slots to allocate-and-write +in a single call, as well as do update or read in a single call. The goal is +to allow a reasonably-sized dirnode to be created (or updated, or read) in +just one round trip (to all N shareholders in parallel). + +==== migrating shares ==== + +If a share must be migrated from one server to another, two values become +invalid: the write enabler (since it was computed for the old server), and +the lease renew/cancel tokens. + +One idea we have is to say that the migration process is obligated to replace +the write enabler with its hash (but leaving the old "write enabler node id" +in place, to remind it that this WE isn't its own). When a writer attempts to +modify a slot with the old write enabler, the server will reject the request +and include the old WE-nodeid in the rejection message. The writer should +then realize that the share has been migrated and try again with the hash of +their old write enabler. + +This process doesn't provide any means to fix up the write enabler, though, +requiring an extra roundtrip for the remainder of the slot's lifetime. It +might work better to have a call that allows the WE to be replaced, by +proving that the writer knows H(old-WE-nodeid,old-WE). If we leave the old WE +in place when migrating, this allows both writer and server to agree upon the +writer's authority, hopefully without granting the server any new authority +(or enabling it to trick a writer into revealing one). + +=== Code Details === + +The current FileNode class will be renamed ImmutableFileNode, and a new +MutableFileNode class will be created. Instances of this class will contain a +URI and a reference to the client (for peer selection and connection). The +methods of MutableFileNode are: + + * replace(newdata) -> OK, ConsistencyError, NotEnoughPeersError + * get() -> [deferred] newdata, NotEnoughPeersError + * if there are multiple retrieveable versions in the grid, get() returns + the first version it can reconstruct, and silently ignores the others. + In the future, a more advanced API will signal and provide access to + the multiple heads. + +The peer-selection and data-structure manipulation (and signing/verification) +steps will be implemented in a separate class in allmydata/mutable.py . + +=== SMDF Slot Format === + +This SMDF data lives inside a server-side MutableSlot container. The server +is oblivious to this format. + + # offset size name + 1 0 1 version byte, \x00 for this format + 2 1 8 sequence number. 2^64-1 must be handled specially, TBD + 3 9 32 "R" (root of share hash Merkle tree) + 4 41 18 encoding parameters: + 41 1 k + 42 1 N + 43 8 segment size + 51 8 data length + 5 59 32 offset table: + 91 4 (6) signature + 95 4 (7) share hash chain + 99 4 (8) share data + 103 8 (9) encrypted private key + 6 111 256 verification key (2048 RSA key 'n' value, e=3) + 7 367 256 signature= RSAenc(sig-key, H(version+seqnum+r+encparm)) + 8 623 (a) share hash chain + 9 ?? LEN share data +10 ?? 256 encrypted private key= AESenc(write-key, RSA 'd' value) + +(a) The share hash chain contains ceil(log(N)) hashes, each 32 bytes long. + This is the set of hashes necessary to validate this share's leaf in the + share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes. + +=== Recovery === + +The first line of defense against damage caused by colliding writes is the +Prime Coordination Directive: "Don't Do That". + +The second line of defense is to keep "S" (the number of competing versions) +lower than N/k. If this holds true, at least one competing version will have +k shares and thus be recoverable. Note that server unavailability counts +against us here: the old version stored on the unavailable server must be +included in the value of S. + +The third line of defense is our use of testv_and_writev() (described below), +which increases the convergence of simultaneous writes: one of the writers +will be favored (the one with the highest "R"), and that version is more +likely to be accepted than the others. This defense is least effective in the +pathological situation where S simultaneous writers are active, the one with +the lowest "R" writes to N-k+1 of the shares and then dies, then the one with +the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the +one with the highest "R" writes to k-1 shares and dies. Any other sequencing +will allow the highest "R" to write to at least k shares and establish a new +revision. + +The fourth line of defense is the fact that each client keeps writing until +at least one version has N shares. This uses additional servers, if +necessary, to make sure that either the client's version or some +newer/overriding version is highly available. + +The fifth line of defense is the recovery algorithm, which seeks to make sure +that at least *one* version is highly available, even if that version is +somebody else's. + +The write-shares-to-peers algorithm is as follows: + + * permute peers according to storage index + * walk through peers, trying to assign one share per peer + * for each peer: + * send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test + * this means that we will overwrite any old versions, and we will + overwrite simultaenous writers of the same version if our R is higher. + We will not overwrite writers using a higher seqnum. + * record the version that each share winds up with. If the write was + accepted, this is our own version. If it was rejected, read the + old_test_data to find out what version was retained. + * if old_test_data indicates the seqnum was equal or greater than our + own, mark the "Simultanous Writes Detected" flag, which will eventually + result in an error being reported to the writer (in their close() call). + * build a histogram of "R" values + * repeat until the histogram indicate that some version (possibly ours) + has N shares. Use new servers if necessary. + * If we run out of servers: + * if there are at least shares-of-happiness of any one version, we're + happy, so return. (the close() might still get an error) + * not happy, need to reinforce something, goto RECOVERY + +RECOVERY: + * read all shares, count the versions, identify the recoverable ones, + discard the unrecoverable ones. + * sort versions: locate max(seqnums), put all versions with that seqnum + in the list, sort by number of outstanding shares. Then put our own + version. (TODO: put versions with seqnum us ahead of us?). + * for each version: + * attempt to recover that version + * if not possible, remove it from the list, go to next one + * if recovered, start at beginning of peer list, push that version, + continue until N shares are placed + * if pushing our own version, bump up the seqnum to one higher than + the max seqnum we saw + * if we run out of servers: + * schedule retry and exponential backoff to repeat RECOVERY + * admit defeat after some period? presumeably the client will be shut down + eventually, maybe keep trying (once per hour?) until then. + + + + +== Medium Distributed Mutable Files == + +These are just like the SDMF case, but: + + * we use a Merkle hash tree over the blocks, instead of using a single flat + hash, to reduce the read-time alacrity + * we allow arbitrary writes to the file (i.e. seek() is provided, and + O_TRUNC is no longer required) + * we write more code on the client side (in the MutableFileNode class), to + first read each segment that a write must modify. This looks exactly like + the way a normal filesystem uses a block device, or how a CPU must perform + a cache-line fill before modifying a single word. + * we might implement some sort of copy-based atomic update server call, + to allow multiple writev() calls to appear atomic to any readers. + +MDMF slots provide fairly efficient in-place edits of very large files (a few +GB). Appending data is also fairly efficient, although each time a power of 2 +boundary is crossed, the entire file must effectively be re-uploaded, so if +the filesize is known in advance, that space ought to be pre-allocated. + +MDMF1 uses the Merkle tree to enable low-alacrity random-access reads. MDMF2 +adds cache-line reads to allow random-access writes. + +== Large Distributed Mutable Files == + +LDMF slots use a fundamentally different way to store the file, inspired by +Mercurial's "revlog" format. They enable very efficient insert/remove/replace +editing of arbitrary spans. Multiple versions of the file can be retained, in +a revision graph that can have multiple heads. Each revision can be +referenced by a cryptographic identifier. There are two forms of the URI, one +that means "most recent version", and a longer one that points to a specific +revision. + +Metadata can be attached to the revisions, like timestamps, to enable rolling +back an entire tree to a specific point in history. + +LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2 +provides explicit support for revision identifiers and branching. + +== TODO == + +fix gigantic RO-URI security bug, probably by adding a second secret + +how about: + * H(privkey+pubkey) -> writekey -> readkey -> storageindex + * RW-URI = writekey + * RO-URI = readkey + H(pubkey) + + +improve allocate-and-write or get-writer-buckets API to allow one-call (or +maybe two-call) updates. The challenge is in figuring out which shares are on +which machines. + +(eventually) define behavior when seqnum wraps. At the very least make sure +it can't cause a security problem. "the slot is worn out" is acceptable. + +(eventually) define share-migration WE-update protocol