From: Brian Warner Date: Fri, 11 Jan 2008 11:30:58 +0000 (-0700) Subject: docs/mutable-DSA.txt: update mutable.txt to reflect our proposed DSA-based mutable... X-Git-Url: https://git.rkrishnan.org/pf/content/en/service/contact.html?a=commitdiff_plain;h=db5f58f9d5bbd6eb282c200b8b4f12b41194ddb4;p=tahoe-lafs%2Ftahoe-lafs.git docs/mutable-DSA.txt: update mutable.txt to reflect our proposed DSA-based mutable file scheme (#217) --- diff --git a/docs/mutable-DSA.txt b/docs/mutable-DSA.txt new file mode 100644 index 00000000..568f3db4 --- /dev/null +++ b/docs/mutable-DSA.txt @@ -0,0 +1,708 @@ + +(protocol proposal, work-in-progress, not authoritative) + +(this document describes DSA-based mutable files, as opposed to the RSA-based +mutable files that were introduced in tahoe-0.7.0 . This proposal has not yet +been implemented. Please see mutable-DSA.svg for a quick picture of the +crypto scheme described herein) + += Mutable Files = + +Mutable File Slots are places with a stable identifier that can hold data +that changes over time. In contrast to CHK slots, for which the +URI/identifier is derived from the contents themselves, the Mutable File Slot +URI remains fixed for the life of the slot, regardless of what data is placed +inside it. + +Each mutable slot is referenced by two different URIs. The "read-write" URI +grants read-write access to its holder, allowing them to put whatever +contents they like into the slot. The "read-only" URI is less powerful, only +granting read access, and not enabling modification of the data. The +read-write URI can be turned into the read-only URI, but not the other way +around. + +The data in these slots is distributed over a number of servers, using the +same erasure coding that CHK files use, with 3-of-10 being a typical choice +of encoding parameters. The data is encrypted and signed in such a way that +only the holders of the read-write URI will be able to set the contents of +the slot, and only the holders of the read-only URI will be able to read +those contents. Holders of either URI will be able to validate the contents +as being written by someone with the read-write URI. The servers who hold the +shares cannot read or modify them: the worst they can do is deny service (by +deleting or corrupting the shares), or attempt a rollback attack (which can +only succeed with the cooperation of at least k servers). + +== Consistency vs Availability == + +There is an age-old battle between consistency and availability. Epic papers +have been written, elaborate proofs have been established, and generations of +theorists have learned that you cannot simultaneously achieve guaranteed +consistency with guaranteed reliability. In addition, the closer to 0 you get +on either axis, the cost and complexity of the design goes up. + +Tahoe's design goals are to largely favor design simplicity, then slightly +favor read availability, over the other criteria. + +As we develop more sophisticated mutable slots, the API may expose multiple +read versions to the application layer. The tahoe philosophy is to defer most +consistency recovery logic to the higher layers. Some applications have +effective ways to merge multiple versions, so inconsistency is not +necessarily a problem (i.e. directory nodes can usually merge multiple "add +child" operations). + +== The Prime Coordination Directive: "Don't Do That" == + +The current rule for applications which run on top of Tahoe is "do not +perform simultaneous uncoordinated writes". That means you need non-tahoe +means to make sure that two parties are not trying to modify the same mutable +slot at the same time. For example: + + * don't give the read-write URI to anyone else. Dirnodes in a private + directory generally satisfy this case, as long as you don't use two + clients on the same account at the same time + * if you give a read-write URI to someone else, stop using it yourself. An + inbox would be a good example of this. + * if you give a read-write URI to someone else, call them on the phone + before you write into it + * build an automated mechanism to have your agents coordinate writes. + For example, we expect a future release to include a FURL for a + "coordination server" in the dirnodes. The rule can be that you must + contact the coordination server and obtain a lock/lease on the file + before you're allowed to modify it. + +If you do not follow this rule, Bad Things will happen. The worst-case Bad +Thing is that the entire file will be lost. A less-bad Bad Thing is that one +or more of the simultaneous writers will lose their changes. An observer of +the file may not see monotonically-increasing changes to the file, i.e. they +may see version 1, then version 2, then 3, then 2 again. + +Tahoe takes some amount of care to reduce the badness of these Bad Things. +One way you can help nudge it from the "lose your file" case into the "lose +some changes" case is to reduce the number of competing versions: multiple +versions of the file that different parties are trying to establish as the +one true current contents. Each simultaneous writer counts as a "competing +version", as does the previous version of the file. If the count "S" of these +competing versions is larger than N/k, then the file runs the risk of being +lost completely. If at least one of the writers remains running after the +collision is detected, it will attempt to recover, but if S>(N/k) and all +writers crash after writing a few shares, the file will be lost. + + +== Small Distributed Mutable Files == + +SDMF slots are suitable for small (<1MB) files that are editing by rewriting +the entire file. The three operations are: + + * allocate (with initial contents) + * set (with new contents) + * get (old contents) + +The first use of SDMF slots will be to hold directories (dirnodes), which map +encrypted child names to rw-URI/ro-URI pairs. + +=== SDMF slots overview === + +Each SDMF slot is created with a DSA public/private key pair, using a +system-wide common modulus and generator, in which the private key is a +random 256 bit number, and the public key is a larger value (about 2048 bits) +that can be derived with a bit of math from the private key. The public key +is known as the "verification key", while the private key is called the +"signature key". + +The 256-bit signature key is used verbatim as the "write capability". This +can be converted into the 2048ish-bit verification key through a fairly cheap +set of modular exponentiation operations; this is done any time the holder of +the write-cap wants to read the data. (Note that the signature key can either +be a newly-generated random value, or the hash of something else, if we found +a need for a capability that's stronger than the write-cap). + +This results in a write-cap which is 256 bits long and can thus be expressed +in an ASCII/transport-safe encoded form (base62 encoding, fits in 72 +characters, including a local-node http: convenience prefix). + +The private key is hashed to form a 256-bit "salt". The public key is also +hashed to form a 256-bit "pubkey hash". These two values are concatenated, +hashed, and truncated to 192 bits to form the first 192 bits of the read-cap. +The pubkey hash is hashed by itself and truncated to 64 bits to form the last +64 bits of the read-cap. The full read-cap is 256 bits long, just like the +write-cap. + +The first 192 bits of the read-cap are hashed and truncated to form the first +64 bits of the storage index. The last 64 bits of the read-cap are hashed to +form the last 64 bits of the storage index. This gives us a 128-bit storage +index. + +The verification-cap is the first 64 bits of the storage index plus the +pubkey hash, 320 bits total. The verification-cap doesn't need to be +expressed in a printable transport-safe form, so it's ok that it's longer. + +The read-cap is hashed one way to form an AES encryption key that is used to +encrypt the salt; this key is called the "salt key". The encrypted salt is +stored in the share. The private key never changes, therefore the salt never +changes, and the salt key is only used for a single purpose, so there is no +need for an IV. + +The read-cap is hashed a different way to form the master data encryption +key. A random "data salt" is generated each time the share's contents are +replaced, and the master data encryption key is concatenated with the data +salt, then hashed, to form the AES CTR-mode "read key" that will be used to +encrypt the actual file data. This is to avoid key-reuse. An outstanding +issue is how to avoid key reuse when files are modified in place instead of +being replaced completely; this is not done in SDMF but might occur in MDMF. + +The private key is hashed one way to form the salt, and a different way to +form the "write enabler master". For each storage server on which a share is +kept, the write enabler master is concatenated with the server's nodeid and +hashed, and the result is called the "write enabler" for that particular +server. Note that multiple shares of the same slot stored on the same server +will all get the same write enabler, i.e. the write enabler is associated +with the "bucket", rather than the individual shares. + +The private key is hashed a third way to form the "data write key", which can +be used by applications which wish to store some data in a form that is only +available to those with a write-cap, and not to those with merely a read-cap. +This is used to implement transitive read-onlyness of dirnodes. + +The public key is stored on the servers, as is the encrypted salt, the +(non-encrypted) data salt, the encrypted data, and a signature. The container +records the write-enabler, but of course this is not visible to readers. To +make sure that every byte of the share can be verified by a holder of the +verify-cap (and also by the storage server itself), the signature covers the +version number, the sequence number, the root hash "R" of the share merkle +tree, the encoding parameters, and the encrypted salt. "R" itself covers the +hash trees and the share data. + +The read-write URI is just the private key. The read-only URI is the read-cap +key. The verify-only URI contains the the pubkey hash and the first 64 bits +of the storage index. + + FMW:b2a(privatekey) + FMR:b2a(readcap) + FMV:b2a(storageindex[:64])b2a(pubkey-hash) + +Note that this allows the read-only and verify-only URIs to be derived from +the read-write URI without actually retrieving any data from the share, but +instead by regenerating the public key from the private one. Uses of the +read-only or verify-only caps must validate the public key against their +pubkey hash (or its derivative) the first time they retrieve the pubkey, +before trusting any signatures they see. + +The SDMF slot is allocated by sending a request to the storage server with a +desired size, the storage index, and the write enabler for that server's +nodeid. If granted, the write enabler is stashed inside the slot's backing +store file. All further write requests must be accompanied by the write +enabler or they will not be honored. The storage server does not share the +write enabler with anyone else. + +The SDMF slot structure will be described in more detail below. The important +pieces are: + + * a sequence number + * a root hash "R" + * the data salt + * the encoding parameters (including k, N, file size, segment size) + * a signed copy of [seqnum,R,data_salt,encoding_params] (using signature key) + * the verification key (not encrypted) + * the share hash chain (part of a Merkle tree over the share hashes) + * the block hash tree (Merkle tree over blocks of share data) + * the share data itself (erasure-coding of read-key-encrypted file data) + * the salt, encrypted with the salt key + +The access pattern for read (assuming we hold the write-cap) is: + * generate public key from the private one + * hash private key to get the salt, hash public key, form read-cap + * form storage-index + * use storage-index to locate 'k' shares with identical 'R' values + * either get one share, read 'k' from it, then read k-1 shares + * or read, say, 5 shares, discover k, either get more or be finished + * or copy k into the URIs + * .. jump to "COMMON READ", below + +To read (assuming we only hold the read-cap), do: + * hash read-cap pieces to generate storage index and salt key + * use storage-index to locate 'k' shares with identical 'R' values + * retrieve verification key and encrypted salt + * decrypt salt + * hash decrypted salt and pubkey to generate another copy of the read-cap, + make sure they match (this validates the pubkey) + * .. jump to "COMMON READ" + + * COMMON READ: + * read seqnum, R, data salt, encoding parameters, signature + * verify signature against verification key + * hash data salt and read-cap to generate read-key + * read share data, compute block-hash Merkle tree and root "r" + * read share hash chain (leading from "r" to "R") + * validate share hash chain up to the root "R" + * submit share data to erasure decoding + * decrypt decoded data with read-key + * submit plaintext to application + +The access pattern for write is: + * generate pubkey, salt, read-cap, storage-index as in read case + * generate data salt for this update, generate read-key + * encrypt plaintext from application with read-key + * application can encrypt some data with the data-write-key to make it + only available to writers (used for transitively-readonly dirnodes) + * erasure-code crypttext to form shares + * split shares into blocks + * compute Merkle tree of blocks, giving root "r" for each share + * compute Merkle tree of shares, find root "R" for the file as a whole + * create share data structures, one per server: + * use seqnum which is one higher than the old version + * share hash chain has log(N) hashes, different for each server + * signed data is the same for each server + * include pubkey, encrypted salt, data salt + * now we have N shares and need homes for them + * walk through peers + * if share is not already present, allocate-and-set + * otherwise, try to modify existing share: + * send testv_and_writev operation to each one + * testv says to accept share if their(seqnum+R) <= our(seqnum+R) + * count how many servers wind up with which versions (histogram over R) + * keep going until N servers have the same version, or we run out of servers + * if any servers wound up with a different version, report error to + application + * if we ran out of servers, initiate recovery process (described below) + +==== Cryptographic Properties ==== + +This scheme protects the data's confidentiality with 192 bits of key +material, since the read-cap contains 192 secret bits (derived from an +encrypted salt, which is encrypted using those same 192 bits plus some +additional public material). + +The integrity of the data (assuming that the signature is valid) is protected +by the 256-bit hash which gets included in the signature. The privilege of +modifying the data (equivalent to the ability to form a valid signature) is +protected by a 256 bit random DSA private key, and the difficulty of +computing a discrete logarithm in a 2048-bit field. + +There are a few weaker denial-of-service attacks possible. If N-k+1 of the +shares are damaged or unavailable, the client will be unable to recover the +file. Any coalition of more than N-k shareholders will be able to effect this +attack by merely refusing to provide the desired share. The "write enabler" +shared secret protects existing shares from being displaced by new ones, +except by the holder of the write-cap. One server cannot affect the other +shares of the same file, once those other shares are in place. + +The worst DoS attack is the "roadblock attack", which must be made before +those shares get placed. Storage indexes are effectively random (being +derived from the hash of a random value), so they are not guessable before +the writer begins their upload, but there is a window of vulnerability during +the beginning of the upload, when some servers have heard about the storage +index but not all of them. + +The roadblock attack we want to prevent is when the first server that the +uploader contacts quickly runs to all the other selected servers and places a +bogus share under the same storage index, before the uploader can contact +them. These shares will normally be accepted, since storage servers create +new shares on demand. The bogus shares would have randomly-generated +write-enablers, which will of course be different than the real uploader's +write-enabler, since the malicious server does not know the write-cap. + +If this attack were successful, the uploader would be unable to place any of +their shares, because the slots have already been filled by the bogus shares. +The uploader would probably try for peers further and further away from the +desired location, but eventually they will hit a preconfigured distance limit +and give up. In addition, the further the writer searches, the less likely it +is that a reader will search as far. So a successful attack will either cause +the file to be uploaded but not be reachable, or it will cause the upload to +fail. + +If the uploader tries again (creating a new privkey), they may get lucky and +the malicious servers will appear later in the query list, giving sufficient +honest servers a chance to see their share before the malicious one manages +to place bogus ones. + +The first line of defense against this attack is the timing challenges: the +attacking server must be ready to act the moment a storage request arrives +(which will only occur for a certain percentage of all new-file uploads), and +only has a few seconds to act before the other servers will have allocated +the shares (and recorded the write-enabler, terminating the window of +vulnerability). + +The second line of defense is post-verification, and is possible because the +storage index is partially derived from the public key hash. A storage server +can, at any time, verify every public bit of the container as being signed by +the verification key (this operation is recommended as a continual background +process, when disk usage is minimal, to detect disk errors). The server can +also hash the verification key to derive 64 bits of the storage index. If it +detects that these 64 bits do not match (but the rest of the share validates +correctly), then the implication is that this share was stored to the wrong +storage index, either due to a bug or a roadblock attack. + +If an uploader finds that they are unable to place their shares because of +"bad write enabler errors" (as reported by the prospective storage servers), +it can "cry foul", and ask the storage server to perform this verification on +the share in question. If the pubkey and storage index do not match, the +storage server can delete the bogus share, thus allowing the real uploader to +place their share. Of course the origin of the offending bogus share should +be logged and reported to a central authority, so corrective measures can be +taken. It may be necessary to have this "cry foul" protocol include the new +write-enabler, to close the window during which the malicious server can +re-submit the bogus share during the adjudication process. + +If the problem persists, the servers can be placed into pre-verification +mode, in which this verification is performed on all potential shares before +being committed to disk. This mode is more CPU-intensive (since normally the +storage server ignores the contents of the container altogether), but would +solve the problem completely. + +The mere existence of these potential defenses should be sufficient to deter +any actual attacks. Note that the storage index only has 64 bits of +pubkey-derived data in it, which is below the usual crypto guidelines for +security factors. In this case it's a pre-image attack which would be needed, +rather than a collision, and the actual attack would be to find a keypair for +which the public key can be hashed three times to produce the desired portion +of the storage index. We believe that 64 bits of material is sufficiently +resistant to this form of pre-image attack to serve as a suitable deterrent. + + +=== Server Storage Protocol === + +The storage servers will provide a mutable slot container which is oblivious +to the details of the data being contained inside it. Each storage index +refers to a "bucket", and each bucket has one or more shares inside it. (In a +well-provisioned network, each bucket will have only one share). The bucket +is stored as a directory, using the base32-encoded storage index as the +directory name. Each share is stored in a single file, using the share number +as the filename. + +The container holds space for a container magic number (for versioning), the +write enabler, the nodeid which accepted the write enabler (used for share +migration, described below), a small number of lease structures, the embedded +data itself, and expansion space for additional lease structures. + + # offset size name + 1 0 32 magic verstr "tahoe mutable container v1" plus binary + 2 32 20 write enabler's nodeid + 3 52 32 write enabler + 4 84 8 data size (actual share data present) (a) + 5 92 8 offset of (8) count of extra leases (after data) + 6 100 368 four leases, 92 bytes each + 0 4 ownerid (0 means "no lease here") + 4 4 expiration timestamp + 8 32 renewal token + 40 32 cancel token + 72 20 nodeid which accepted the tokens + 7 468 (a) data + 8 ?? 4 count of extra leases + 9 ?? n*92 extra leases + +The "extra leases" field must be copied and rewritten each time the size of +the enclosed data changes. The hope is that most buckets will have four or +fewer leases and this extra copying will not usually be necessary. + +The (4) "data size" field contains the actual number of bytes of data present +in field (7), such that a client request to read beyond 504+(a) will result +in an error. This allows the client to (one day) read relative to the end of +the file. The container size (that is, (8)-(7)) might be larger, especially +if extra size was pre-allocated in anticipation of filling the container with +a lot of data. + +The offset in (5) points at the *count* of extra leases, at (8). The actual +leases (at (9)) begin 4 bytes later. If the container size changes, both (8) +and (9) must be relocated by copying. + +The server will honor any write commands that provide the write token and do +not exceed the server-wide storage size limitations. Read and write commands +MUST be restricted to the 'data' portion of the container: the implementation +of those commands MUST perform correct bounds-checking to make sure other +portions of the container are inaccessible to the clients. + +The two methods provided by the storage server on these "MutableSlot" share +objects are: + + * readv(ListOf(offset=int, length=int)) + * returns a list of bytestrings, of the various requested lengths + * offset < 0 is interpreted relative to the end of the data + * spans which hit the end of the data will return truncated data + + * testv_and_writev(write_enabler, test_vector, write_vector) + * this is a test-and-set operation which performs the given tests and only + applies the desired writes if all tests succeed. This is used to detect + simultaneous writers, and to reduce the chance that an update will lose + data recently written by some other party (written after the last time + this slot was read). + * test_vector=ListOf(TupleOf(offset, length, opcode, specimen)) + * the opcode is a string, from the set [gt, ge, eq, le, lt, ne] + * each element of the test vector is read from the slot's data and + compared against the specimen using the desired (in)equality. If all + tests evaluate True, the write is performed + * write_vector=ListOf(TupleOf(offset, newdata)) + * offset < 0 is not yet defined, it probably means relative to the + end of the data, which probably means append, but we haven't nailed + it down quite yet + * write vectors are executed in order, which specifies the results of + overlapping writes + * return value: + * error: OutOfSpace + * error: something else (io error, out of memory, whatever) + * (True, old_test_data): the write was accepted (test_vector passed) + * (False, old_test_data): the write was rejected (test_vector failed) + * both 'accepted' and 'rejected' return the old data that was used + for the test_vector comparison. This can be used by the client + to detect write collisions, including collisions for which the + desired behavior was to overwrite the old version. + +In addition, the storage server provides several methods to access these +share objects: + + * allocate_mutable_slot(storage_index, sharenums=SetOf(int)) + * returns DictOf(int, MutableSlot) + * get_mutable_slot(storage_index) + * returns DictOf(int, MutableSlot) + * or raises KeyError + +We intend to add an interface which allows small slots to allocate-and-write +in a single call, as well as do update or read in a single call. The goal is +to allow a reasonably-sized dirnode to be created (or updated, or read) in +just one round trip (to all N shareholders in parallel). + +==== migrating shares ==== + +If a share must be migrated from one server to another, two values become +invalid: the write enabler (since it was computed for the old server), and +the lease renew/cancel tokens. + +Suppose that a slot was first created on nodeA, and was thus initialized with +WE(nodeA) (= H(WEM+nodeA)). Later, for provisioning reasons, the share is +moved from nodeA to nodeB. + +Readers may still be able to find the share in its new home, depending upon +how many servers are present in the grid, where the new nodeid lands in the +permuted index for this particular storage index, and how many servers the +reading client is willing to contact. + +When a client attempts to write to this migrated share, it will get a "bad +write enabler" error, since the WE it computes for nodeB will not match the +WE(nodeA) that was embedded in the share. When this occurs, the "bad write +enabler" message must include the old nodeid (e.g. nodeA) that was in the +share. + +The client then computes H(nodeB+H(WEM+nodeA)), which is the same as +H(nodeB+WE(nodeA)). The client sends this along with the new WE(nodeB), which +is H(WEM+nodeB). Note that the client only sends WE(nodeB) to nodeB, never to +anyone else. Also note that the client does not send a value to nodeB that +would allow the node to impersonate the client to a third node: everything +sent to nodeB will include something specific to nodeB in it. + +The server locally computes H(nodeB+WE(nodeA)), using its own node id and the +old write enabler from the share. It compares this against the value supplied +by the client. If they match, this serves as proof that the client was able +to compute the old write enabler. The server then accepts the client's new +WE(nodeB) and writes it into the container. + +This WE-fixup process requires an extra round trip, and requires the error +message to include the old nodeid, but does not require any public key +operations on either client or server. + +Migrating the leases will require a similar protocol. This protocol will be +defined concretely at a later date. + +=== Code Details === + +The current FileNode class will be renamed ImmutableFileNode, and a new +MutableFileNode class will be created. Instances of this class will contain a +URI and a reference to the client (for peer selection and connection). The +methods of MutableFileNode are: + + * replace(newdata) -> OK, ConsistencyError, NotEnoughPeersError + * get() -> [deferred] newdata, NotEnoughPeersError + * if there are multiple retrieveable versions in the grid, get() returns + the first version it can reconstruct, and silently ignores the others. + In the future, a more advanced API will signal and provide access to + the multiple heads. + +The peer-selection and data-structure manipulation (and signing/verification) +steps will be implemented in a separate class in allmydata/mutable.py . + +=== SMDF Slot Format === + +This SMDF data lives inside a server-side MutableSlot container. The server +is generally oblivious to this format, but it may look inside the container +when verification is desired. + +This data is tightly packed. There are no gaps left between the different +fields, and the offset table is mainly present to allow future flexibility of +key sizes. + + # offset size name + 1 0 1 version byte, \x01 for this format + 2 1 8 sequence number. 2^64-1 must be handled specially, TBD + 3 9 32 "R" (root of share hash Merkle tree) + 4 41 32 data salt (readkey is H(readcap+data_salt)) + 5 73 32 encrypted salt (AESenc(key=H(readcap), salt) + 6 105 18 encoding parameters: + 105 1 k + 106 1 N + 107 8 segment size + 115 8 data length (of original plaintext) + 7 123 36 offset table: + 127 4 (9) signature + 131 4 (10) share hash chain + 135 4 (11) block hash tree + 139 4 (12) share data + 143 8 (13) EOF + 8 151 256 verification key (2048bit DSA key) + 9 407 40 signature=DSAsig(H([1,2,3,4,5,6])) +10 447 (a) share hash chain, encoded as: + "".join([pack(">H32s", shnum, hash) + for (shnum,hash) in needed_hashes]) +11 ?? (b) block hash tree, encoded as: + "".join([pack(">32s",hash) for hash in block_hash_tree]) +12 ?? LEN share data +13 ?? -- EOF + +(a) The share hash chain contains ceil(log(N)) hashes, each 32 bytes long. + This is the set of hashes necessary to validate this share's leaf in the + share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes. +(b) The block hash tree contains ceil(length/segsize) hashes, each 32 bytes + long. This is the set of hashes necessary to validate any given block of + share data up to the per-share root "r". Each "r" is a leaf of the share + has tree (with root "R"), from which a minimal subset of hashes is put in + the share hash chain in (8). + +=== Recovery === + +The first line of defense against damage caused by colliding writes is the +Prime Coordination Directive: "Don't Do That". + +The second line of defense is to keep "S" (the number of competing versions) +lower than N/k. If this holds true, at least one competing version will have +k shares and thus be recoverable. Note that server unavailability counts +against us here: the old version stored on the unavailable server must be +included in the value of S. + +The third line of defense is our use of testv_and_writev() (described below), +which increases the convergence of simultaneous writes: one of the writers +will be favored (the one with the highest "R"), and that version is more +likely to be accepted than the others. This defense is least effective in the +pathological situation where S simultaneous writers are active, the one with +the lowest "R" writes to N-k+1 of the shares and then dies, then the one with +the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the +one with the highest "R" writes to k-1 shares and dies. Any other sequencing +will allow the highest "R" to write to at least k shares and establish a new +revision. + +The fourth line of defense is the fact that each client keeps writing until +at least one version has N shares. This uses additional servers, if +necessary, to make sure that either the client's version or some +newer/overriding version is highly available. + +The fifth line of defense is the recovery algorithm, which seeks to make sure +that at least *one* version is highly available, even if that version is +somebody else's. + +The write-shares-to-peers algorithm is as follows: + + * permute peers according to storage index + * walk through peers, trying to assign one share per peer + * for each peer: + * send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test + * this means that we will overwrite any old versions, and we will + overwrite simultaenous writers of the same version if our R is higher. + We will not overwrite writers using a higher seqnum. + * record the version that each share winds up with. If the write was + accepted, this is our own version. If it was rejected, read the + old_test_data to find out what version was retained. + * if old_test_data indicates the seqnum was equal or greater than our + own, mark the "Simultanous Writes Detected" flag, which will eventually + result in an error being reported to the writer (in their close() call). + * build a histogram of "R" values + * repeat until the histogram indicate that some version (possibly ours) + has N shares. Use new servers if necessary. + * If we run out of servers: + * if there are at least shares-of-happiness of any one version, we're + happy, so return. (the close() might still get an error) + * not happy, need to reinforce something, goto RECOVERY + +RECOVERY: + * read all shares, count the versions, identify the recoverable ones, + discard the unrecoverable ones. + * sort versions: locate max(seqnums), put all versions with that seqnum + in the list, sort by number of outstanding shares. Then put our own + version. (TODO: put versions with seqnum us ahead of us?). + * for each version: + * attempt to recover that version + * if not possible, remove it from the list, go to next one + * if recovered, start at beginning of peer list, push that version, + continue until N shares are placed + * if pushing our own version, bump up the seqnum to one higher than + the max seqnum we saw + * if we run out of servers: + * schedule retry and exponential backoff to repeat RECOVERY + * admit defeat after some period? presumeably the client will be shut down + eventually, maybe keep trying (once per hour?) until then. + + + + +== Medium Distributed Mutable Files == + +These are just like the SDMF case, but: + + * we actually take advantage of the Merkle hash tree over the blocks, by + reading a single segment of data at a time (and its necessary hashes), to + reduce the read-time alacrity + * we allow arbitrary writes to the file (i.e. seek() is provided, and + O_TRUNC is no longer required) + * we write more code on the client side (in the MutableFileNode class), to + first read each segment that a write must modify. This looks exactly like + the way a normal filesystem uses a block device, or how a CPU must perform + a cache-line fill before modifying a single word. + * we might implement some sort of copy-based atomic update server call, + to allow multiple writev() calls to appear atomic to any readers. + +MDMF slots provide fairly efficient in-place edits of very large files (a few +GB). Appending data is also fairly efficient, although each time a power of 2 +boundary is crossed, the entire file must effectively be re-uploaded (because +the size of the block hash tree changes), so if the filesize is known in +advance, that space ought to be pre-allocated (by leaving extra space between +the block hash tree and the actual data). + +MDMF1 uses the Merkle tree to enable low-alacrity random-access reads. MDMF2 +adds cache-line reads to allow random-access writes. + +== Large Distributed Mutable Files == + +LDMF slots use a fundamentally different way to store the file, inspired by +Mercurial's "revlog" format. They enable very efficient insert/remove/replace +editing of arbitrary spans. Multiple versions of the file can be retained, in +a revision graph that can have multiple heads. Each revision can be +referenced by a cryptographic identifier. There are two forms of the URI, one +that means "most recent version", and a longer one that points to a specific +revision. + +Metadata can be attached to the revisions, like timestamps, to enable rolling +back an entire tree to a specific point in history. + +LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2 +provides explicit support for revision identifiers and branching. + +== TODO == + +improve allocate-and-write or get-writer-buckets API to allow one-call (or +maybe two-call) updates. The challenge is in figuring out which shares are on +which machines. First cut will have lots of round trips. + +(eventually) define behavior when seqnum wraps. At the very least make sure +it can't cause a security problem. "the slot is worn out" is acceptable. + +(eventually) define share-migration lease update protocol. Including the +nodeid who accepted the lease is useful, we can use the same protocol as we +do for updating the write enabler. However we need to know which lease to +update.. maybe send back a list of all old nodeids that we find, then try all +of them when we accept the update? + + We now do this in a specially-formatted IndexError exception: + "UNABLE to renew non-existent lease. I have leases accepted by " + + "nodeids: '12345','abcde','44221' ." + +Every node in a given tahoe grid must have the same common DSA moduli and +exponent, but different grids could use different parameters. We haven't +figured out how to define a "grid id" yet, but I think the DSA parameters +should be part of that identifier. In practical terms, this might mean that +the Introducer tells each node what parameters to use, or perhaps the node +could have a config file which specifies them instead.