--- /dev/null
+
+(protocol proposal, work-in-progress, not authoritative)
+
+= Mutable Files =
+
+Mutable File Slots are places with a stable identifier that can hold data
+that changes over time. In contrast to CHK slots, for which the
+URI/identifier is derived from the contents themselves, the Mutable File Slot
+URI remains fixed for the life of the slot, regardless of what data is placed
+inside it.
+
+Each mutable slot is referenced by two different URIs. The "read-write" URI
+grants read-write access to its holder, allowing them to put whatever
+contents they like into the slot. The "read-only" URI is less powerful, only
+granting read access, and not enabling modification of the data. The
+read-write URI can be turned into the read-only URI, but not the other way
+around.
+
+The data in these slots is distributed over a number of servers, using the
+same erasure coding that CHK files use, with 3-of-10 being a typical choice
+of encoding parameters. The data is encrypted and signed in such a way that
+only the holders of the read-write URI will be able to set the contents of
+the slot, and only the holders of the read-only URI will be able to read
+those contents. Holders of either URI will be able to validate the contents
+as being written by someone with the read-write URI. The servers who hold the
+shares cannot read or modify them: the worst they can do is deny service (by
+deleting or corrupting the shares), or attempt a rollback attack (which can
+only succeed with the cooperation of at least k servers).
+
+== Consistency vs Availability ==
+
+There is an age-old battle between consistency and availability. Epic papers
+have been written, elaborate proofs have been established, and generations of
+theorists have learned that you cannot simultaneously achieve guaranteed
+consistency and guaranteed availability. In addition, the closer you get to a
+perfect guarantee on either axis, the more the cost and complexity of the
+design goes up.
+
+Tahoe's design goals are, first, to favor design simplicity, and then to
+slightly favor read availability over the other criteria.
+
+As we develop more sophisticated mutable slots, the API may expose multiple
+read versions to the application layer. The Tahoe philosophy is to defer most
+consistency recovery logic to the higher layers. Some applications have
+effective ways to merge multiple versions, so inconsistency is not
+necessarily a problem (e.g. directory nodes can usually merge multiple "add
+child" operations).
+
+== The Prime Coordination Directive: "Don't Do That" ==
+
+The current rule for applications which run on top of Tahoe is "do not
+perform simultaneous uncoordinated writes". That means you need non-tahoe
+means to make sure that two parties are not trying to modify the same mutable
+slot at the same time. For example:
+
+ * don't give the read-write URI to anyone else. Dirnodes in a private
+ directory generally satisfy this case, as long as you don't use two
+ clients on the same account at the same time
+ * if you give a read-write URI to someone else, stop using it yourself. An
+ inbox would be a good example of this.
+ * if you give a read-write URI to someone else, call them on the phone
+ before you write into it
+ * build an automated mechanism to have your agents coordinate writes.
+ For example, we expect a future release to include a FURL for a
+ "coordination server" in the dirnodes. The rule can be that you must
+ contact the coordination server and obtain a lock/lease on the file
+ before you're allowed to modify it.
+
+If you do not follow this rule, Bad Things will happen. The worst-case Bad
+Thing is that the entire file will be lost. A less-bad Bad Thing is that one
+or more of the simultaneous writers will lose their changes. An observer of
+the file may not see monotonically-increasing versions of the file, e.g. they
+may see version 1, then version 2, then version 3, then version 2 again.
+
+Tahoe takes some amount of care to reduce the badness of these Bad Things.
+One way you can help nudge it from the "lose your file" case into the "lose
+some changes" case is to reduce the number of competing versions: multiple
+versions of the file that different parties are trying to establish as the
+one true current contents. Each simultaneous writer counts as a "competing
+version", as does the previous version of the file. If the count "S" of these
+competing versions is larger than N/k, then the file runs the risk of being
+lost completely. If at least one of the writers remains running after the
+collision is detected, it will attempt to recover, but if S>(N/k) and all
+writers crash after writing a few shares, the file will be lost.
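+
+As a concrete illustration of this threshold, here is a non-authoritative
+sketch in Python, using the usual 3-of-10 encoding; the helper name is ours,
+not part of any Tahoe API:
+
+    # Sketch: how many competing versions an N/k-encoded slot can tolerate
+    # before total loss becomes possible. Purely illustrative.
+    def max_safe_competing_versions(k: int, n: int) -> int:
+        # A version needs at least k shares to be recoverable. If the count
+        # S of competing versions is at most n/k, then by pigeonhole at
+        # least one version retains k or more of the n shares; beyond that,
+        # every version may fall below k shares.
+        return n // k
+
+    print(max_safe_competing_versions(3, 10))  # -> 3; with S > 3, loss is possible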
+
+
+== Small Distributed Mutable Files ==
+
+SDMF slots are suitable for small (<1MB) files that are edited by rewriting
+the entire file. The three operations are:
+
+ * allocate (with initial contents)
+ * set (with new contents)
+ * get (old contents)
+
+The first use of SDMF slots will be to hold directories (dirnodes), which map
+encrypted child names to rw-URI/ro-URI pairs.
+
+=== SDMF slots overview ===
+
+Each SDMF slot is created with a public/private key pair (known as the
+"verification key" and the "signature key"). The public key is hashed to form
+the "read key" (an AES symmetric key), and the read key is hashed to form the
+Storage Index (a unique string). The private key and public key are
+concatenated together and hashed to form the "write key". The write key is
+then hashed to form the "write enabler master". For each storage server on
+which a share is kept, the write enabler master is concatenated with the
+server's nodeid and hashed, and the result is called the "write enabler" for
+that particular server.
+
+The read-write URI consists of the write key and the storage index. The
+read-only URI contains just the read key.
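+
+A non-authoritative sketch of this derivation chain follows. The hash
+function, the absence of tagging, and the exact concatenation details are
+illustrative assumptions, not the actual Tahoe key-derivation rules:
+
+    import hashlib
+
+    def h(*parts: bytes) -> bytes:
+        # Illustrative hash; a real design would use tagged hashes.
+        return hashlib.sha256(b"".join(parts)).digest()
+
+    def derive_sdmf_secrets(pubkey: bytes, privkey: bytes, server_nodeids):
+        read_key = h(pubkey)                  # read key  = H(verification key)
+        storage_index = h(read_key)           # SI        = H(read key)
+        write_key = h(privkey, pubkey)        # write key = H(privkey + pubkey)
+        we_master = h(write_key)              # write enabler master
+        write_enablers = dict((nodeid, h(we_master, nodeid))
+                              for nodeid in server_nodeids)
+        return read_key, storage_index, write_key, write_enablers
+
+The read-write URI would then carry (write_key, storage_index), and the
+read-only URI just read_key, as described above.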
+
+The SDMF slot is allocated by sending a request to the storage server with a
+desired size, the storage index, and the write enabler for that server's
+nodeid. If granted, the write enabler is stashed inside the slot's backing
+store file. All further write requests must be accompanied by the write
+enabler or they will not be honored. The storage server does not share the
+write enabler with anyone else.
+
+The SDMF slot structure will be described in more detail below. The important
+pieces are:
+
+ * a sequence number
+ * a root hash "R"
+ * the encoding parameters (including k, N, and the file size)
+ * a signed copy of [seqnum,R,encoding_params], using the signature key
+ * the verification key (not encrypted)
+ * the share hash chain (part of a Merkle tree over the share hashes)
+ * the share data itself (erasure-coding of read-key-encrypted file data)
+ * the signature key, encrypted with the write key
+
+The access pattern for read is:
+ * use storage index to locate 'k' shares with identical 'R' values
+ * read verification key
+ * hash verification key, compare against read key
+ * OOPS!!! verification key is in the clear, so read key is too!! FIX!
+ * read seqnum, R, encoding parameters, signature
+ * verify signature
+ * read share data, hash
+ * read share hash chain
+ * validate share hash chain up to the root "R"
+ * submit share data to erasure decoding
+ * decrypt decoded data with read-key
+ * submit plaintext to application
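+
+The first step (find 'k' shares that agree on the same version) might be
+sketched as follows; the tuple layout is hypothetical, and the signature and
+hash-chain checks listed above are omitted here:
+
+    from collections import defaultdict
+
+    # share_info: list of (server_id, seqnum, R, share_num) tuples gathered
+    # from the initial reads; purely an illustrative structure.
+    def pick_recoverable_version(share_info, k):
+        by_version = defaultdict(list)
+        for server_id, seqnum, root_hash, share_num in share_info:
+            by_version[(seqnum, root_hash)].append((server_id, share_num))
+        # Prefer the highest (seqnum, R) version that has at least k shares.
+        for version in sorted(by_version, reverse=True):
+            if len(by_version[version]) >= k:
+                return version, by_version[version]
+        return None, []    # no recoverable version visible yet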
+
+The access pattern for write is:
+ * use the storage index to locate at least one share
+ * read verification key and encrypted signature key
+ * decrypt signature key using write-key
+ * hash the concatenated signature and verification keys, compare against write-key
+ * hash verification key to form read-key
+ * encrypt plaintext from application with read-key
+ * erasure-code crypttext to form shares
+ * compute Merkle tree of shares, find root "R"
+ * create share data structures, one per server:
+ * use seqnum which is one higher than the old version
+ * share hash chain has log(N) hashes, different for each server
+ * signed data is the same for each server
+ * now we have N shares and need homes for them
+ * walk through peers
+ * if share is not already present, allocate-and-set
+ * otherwise, try to modify existing share:
+ * send testv_and_writev operation to each one
+ * testv says to accept share if their(seqnum+R) <= our(seqnum+R)
+ * count how many servers wind up with which versions (histogram over R)
+ * keep going until N servers have the same version, or we run out of servers
+ * if any servers wound up with a different version, report error to
+ application
+ * if we ran out of servers, initiate recovery process (described below)
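+
+The "share hash chain has log(N) hashes, different for each server" step
+relies on a Merkle tree over the N share hashes. Below is a hedged sketch of
+building that tree and extracting each leaf's chain; the hash function and
+the pad-to-a-power-of-two rule are assumptions for illustration:
+
+    import hashlib
+
+    def _h(data: bytes) -> bytes:
+        return hashlib.sha256(data).digest()
+
+    def merkle_root_and_chains(share_hashes):
+        # Returns (root "R", chains) where chains[i] holds the sibling
+        # hashes needed to validate leaf i up to R: ceil(log2(N)) of them.
+        leaves = list(share_hashes)
+        while len(leaves) & (len(leaves) - 1):        # pad to a power of two
+            leaves.append(_h(b""))
+        levels = [leaves]
+        while len(levels[-1]) > 1:
+            prev = levels[-1]
+            levels.append([_h(prev[i] + prev[i + 1])
+                           for i in range(0, len(prev), 2)])
+        chains = []
+        for i in range(len(share_hashes)):
+            chain, idx = [], i
+            for level in levels[:-1]:
+                chain.append(level[idx ^ 1])          # sibling at this level
+                idx //= 2
+            chains.append(chain)
+        return levels[-1][0], chains
+
+For N=10 shares this yields four hashes per chain, matching the size quoted
+in the slot format below.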
+
+=== Server Storage Protocol ===
+
+The storage servers will provide a mutable slot container which is oblivious
+to the details of the data being contained inside it. Each storage index
+refers to a "bucket", and each bucket has one or more shares inside it. (In a
+well-provisioned network, each bucket will have only one share). The bucket
+is stored as a directory, using the base32-encoded storage index as the
+directory name. Each share is stored in a single file, using the share number
+as the filename.
+
+The container holds space for a container magic number (for versioning), the
+write enabler, the nodeid for which the write enabler was generated (for
+share migration, TBD), a small number of lease structures, the embedded data
+itself, and expansion space for additional lease structures.
+
+ #  offset  size  name
+ 1  0       32    magic verstr "tahoe mutable container v1" plus binary
+ 2  32      32    write enabler's nodeid
+ 3  64      32    write enabler
+ 4  96      8     offset of extra leases (after data)
+ 5  104     288   four leases:
+       0      4     ownerid (0 means "no lease here")
+       4      4     expiration timestamp
+       8      32    renewal token
+       40     32    cancel token
+ 6  392     ??    data
+ 7  ??      4     count of extra leases
+ 8  ??      n*72  extra leases
+
+The "extra leases" field must be copied and rewritten each time the size of
+the enclosed data changes. The hope is that most buckets will have four or
+fewer leases and this extra copying will not usually be necessary.
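+
+A hedged sketch of this container header, using Python's struct module. The
+big-endian integer encoding, the magic padding, and the empty-lease
+representation are assumptions; only the field order and sizes come from the
+table above:
+
+    import struct
+
+    MAGIC = b"tahoe mutable container v1".ljust(32, b"\x00")  # illustrative padding
+    LEASE = struct.Struct(">LL32s32s")     # ownerid, expiration, renew, cancel (72 bytes)
+    HEADER = struct.Struct(">32s32s32sQ")  # magic, WE nodeid, write enabler,
+                                           # offset of extra leases (104 bytes)
+
+    def pack_container_header(nodeid, write_enabler, data_length):
+        # The data region begins right after the four fixed lease slots, and
+        # the extra-lease region begins right after the data.
+        extra_lease_offset = HEADER.size + 4 * LEASE.size + data_length
+        header = HEADER.pack(MAGIC, nodeid, write_enabler, extra_lease_offset)
+        empty_lease = LEASE.pack(0, 0, b"\x00" * 32, b"\x00" * 32)  # ownerid 0 = no lease
+        return header + empty_lease * 4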
+
+The server will honor any write commands that provide the correct write
+enabler and do not exceed the server-wide storage size limitations. Read and
+write commands MUST be restricted to the 'data' portion of the container:
+the implementation of those commands MUST perform correct bounds-checking to
+make sure other portions of the container are inaccessible to the clients.
+
+The two methods provided by the storage server on these "MutableSlot" share
+objects are:
+
+ * readv(ListOf(offset=int, length=int))
+ * returns a list of bytestrings, of the various requested lengths
+ * offset < 0 is interpreted relative to the end of the data
+ * spans which hit the end of the data will return truncated data
+
+ * testv_and_writev(write_enabler, test_vector, write_vector)
+ * this is a test-and-set operation which performs the given tests and only
+ applies the desired writes if all tests succeed. This is used to detect
+ simultaneous writers, and to reduce the chance that an update will lose
+ data recently written by some other party (written after the last time
+ this slot was read).
+ * test_vector=ListOf(TupleOf(offset, length, opcode, specimen))
+ * the opcode is a string, from the set [gt, ge, eq, le, lt, ne]
+ * each element of the test vector is read from the slot's data and
+ compared against the specimen using the desired (in)equality. If all
+ tests evaluate True, the write is performed
+ * write_vector=ListOf(TupleOf(offset, newdata))
+ * offset < 0 is not yet defined, it probably means relative to the
+ end of the data, which probably means append, but we haven't nailed
+ it down quite yet
+ * write vectors are executed in order, which specifies the results of
+ overlapping writes
+ * return value:
+ * error: OutOfSpace
+ * error: something else (io error, out of memory, whatever)
+ * (True, old_test_data): the write was accepted (test_vector passed)
+ * (False, old_test_data): the write was rejected (test_vector failed)
+ * both 'accepted' and 'rejected' return the old data that was used
+ for the test_vector comparison. This can be used by the client
+ to detect write collisions, including collisions for which the
+ desired behavior was to overwrite the old version.
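+
+To make these semantics concrete, here is a toy, purely in-memory sketch of
+a share object implementing readv and testv_and_writev (plus the write
+enabler check). It is not the storage-server implementation; space limits,
+lease handling, and error signalling are omitted:
+
+    import operator
+
+    _OPS = {"lt": operator.lt, "le": operator.le, "eq": operator.eq,
+            "ne": operator.ne, "ge": operator.ge, "gt": operator.gt}
+
+    class ToySlotShare:
+        def __init__(self, write_enabler: bytes, data: bytes = b""):
+            self._we = write_enabler
+            self._data = bytearray(data)
+
+        def readv(self, read_vector):
+            out = []
+            for offset, length in read_vector:
+                if offset < 0:
+                    offset += len(self._data)   # negative offsets are end-relative
+                out.append(bytes(self._data[offset:offset + length]))
+            return out                          # slicing truncates at end-of-data
+
+        def testv_and_writev(self, write_enabler, test_vector, write_vector):
+            if write_enabler != self._we:
+                raise ValueError("bad write enabler")
+            old = [bytes(self._data[o:o + l]) for (o, l, _, _) in test_vector]
+            accepted = all(_OPS[op](bytes(self._data[o:o + l]), specimen)
+                           for (o, l, op, specimen) in test_vector)
+            if accepted:
+                for offset, newdata in write_vector:    # applied in order
+                    end = offset + len(newdata)
+                    if end > len(self._data):
+                        self._data.extend(b"\x00" * (end - len(self._data)))
+                    self._data[offset:end] = newdata
+            return accepted, old
+
+A client could then express "accept only if old(seqnum+R) <= our(seqnum+R)"
+as a single test-vector element covering the seqnum and R fields, e.g.
+test_vector=[(1, 40, "le", our_seqnum_and_R)] against the SDMF layout shown
+later (this byte-wise comparison matches numeric order only if the seqnum is
+stored big-endian, which is an assumption here).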
+
+In addition, the storage server provides several methods to access these
+share objects:
+
+ * allocate_mutable_slot(storage_index, sharenums=SetOf(int))
+ * returns DictOf(int, MutableSlot)
+ * get_mutable_slot(storage_index)
+ * returns DictOf(int, MutableSlot)
+ * or raises KeyError
+
+We intend to add an interface which allows small slots to allocate-and-write
+in a single call, as well as do update or read in a single call. The goal is
+to allow a reasonably-sized dirnode to be created (or updated, or read) in
+just one round trip (to all N shareholders in parallel).
+
+==== migrating shares ====
+
+If a share must be migrated from one server to another, two values become
+invalid: the write enabler (since it was computed for the old server), and
+the lease renew/cancel tokens.
+
+One idea we have is to say that the migration process is obligated to replace
+the write enabler with its hash (but leaving the old "write enabler node id"
+in place, to remind it that this WE isn't its own). When a writer attempts to
+modify a slot with the old write enabler, the server will reject the request
+and include the old WE-nodeid in the rejection message. The writer should
+then realize that the share has been migrated and try again with the hash of
+their old write enabler.
+
+This process doesn't provide any means to fix up the write enabler, though,
+requiring an extra roundtrip for the remainder of the slot's lifetime. It
+might work better to have a call that allows the WE to be replaced, by
+proving that the writer knows H(old-WE-nodeid,old-WE). If we leave the old WE
+in place when migrating, this allows both writer and server to agree upon the
+writer's authority, hopefully without granting the server any new authority
+(or enabling it to trick a writer into revealing one).
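+
+A hedged sketch of these two migration ideas (the hash choice and message
+framing are assumptions; neither call exists yet):
+
+    import hashlib, hmac
+
+    def h(*parts: bytes) -> bytes:
+        return hashlib.sha256(b"".join(parts)).digest()
+
+    # Idea 1: the migrating server stores H(old WE); a rejected writer, told
+    # the old WE-nodeid, retries with the hash of its old write enabler.
+    def migrated_write_enabler(old_we: bytes) -> bytes:
+        return h(old_we)
+
+    # Idea 2: an explicit WE-update call, where the writer proves knowledge
+    # of the old WE by sending H(old-WE-nodeid, old-WE); the server, which
+    # kept the old WE in place when migrating, recomputes and compares.
+    def server_accepts_we_update(old_we: bytes, old_we_nodeid: bytes,
+                                 writer_proof: bytes) -> bool:
+        return hmac.compare_digest(writer_proof, h(old_we_nodeid, old_we))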
+
+=== Code Details ===
+
+The current FileNode class will be renamed ImmutableFileNode, and a new
+MutableFileNode class will be created. Instances of this class will contain a
+URI and a reference to the client (for peer selection and connection). The
+methods of MutableFileNode are:
+
+ * replace(newdata) -> OK, ConsistencyError, NotEnoughPeersError
+ * get() -> [deferred] newdata, NotEnoughPeersError
+ * if there are multiple retrievable versions in the grid, get() returns
+ the first version it can reconstruct, and silently ignores the others.
+ In the future, a more advanced API will signal and provide access to
+ the multiple heads.
+
+The peer-selection and data-structure manipulation (and signing/verification)
+steps will be implemented in a separate class in allmydata/mutable.py .
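+
+A minimal interface sketch of the proposed class (the Deferred-based return
+values follow the "[deferred]" note above; the exception classes are not
+defined in this sketch):
+
+    from twisted.internet import defer
+
+    class MutableFileNode:
+        """Skeleton only: peer selection, signing, and the SDMF wire format
+        are handled by the separate class in allmydata/mutable.py."""
+
+        def __init__(self, uri, client):
+            self._uri = uri        # RW- or RO- URI for the slot
+            self._client = client  # used for peer selection and connections
+
+        def replace(self, newdata) -> defer.Deferred:
+            # fires with None, or errbacks with ConsistencyError or
+            # NotEnoughPeersError
+            raise NotImplementedError
+
+        def get(self) -> defer.Deferred:
+            # fires with the first reconstructable version's contents, or
+            # errbacks with NotEnoughPeersError
+            raise NotImplementedError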
+
+=== SDMF Slot Format ===
+
+This SDMF data lives inside a server-side MutableSlot container. The server
+is oblivious to this format.
+
+ #   offset  size  name
+ 1   0       1     version byte, \x00 for this format
+ 2   1       8     sequence number. 2^64-1 must be handled specially, TBD
+ 3   9       32    "R" (root of share hash Merkle tree)
+ 4   41      18    encoding parameters:
+        41     1     k
+        42     1     N
+        43     8     segment size
+        51     8     data length
+ 5   59      32    offset table:
+        91     4     (6) signature
+        95     4     (7) share hash chain
+        99     4     (8) share data
+        103    8     (9) encrypted private key
+ 6   111     256   verification key (2048 RSA key 'n' value, e=3)
+ 7   367     256   signature= RSAenc(sig-key, H(version+seqnum+r+encparm))
+ 8   623     (a)   share hash chain
+ 9   ??      LEN   share data
+ 10  ??      256   encrypted private key= AESenc(write-key, RSA 'd' value)
+
+(a) The share hash chain contains ceil(log2(N)) hashes, each 32 bytes long.
+ This is the set of hashes necessary to validate this share's leaf in the
+ share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes.
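+
+A hedged sketch of parsing just the fixed-size prefix of this layout (the
+first 59 bytes); big-endian integer encoding is an assumption, since the
+format above does not pin it down:
+
+    import struct
+
+    # version, seqnum, R, k, N, segment size, data length  (59 bytes total)
+    _SDMF_PREFIX = struct.Struct(">B Q 32s B B Q Q")
+
+    def parse_sdmf_prefix(share: bytes):
+        (version, seqnum, root_hash,
+         k, n, segsize, datalen) = _SDMF_PREFIX.unpack_from(share, 0)
+        if version != 0x00:
+            raise ValueError("unknown SDMF version byte: %d" % version)
+        return dict(seqnum=seqnum, R=root_hash, k=k, N=n,
+                    segment_size=segsize, data_length=datalen)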
+
+=== Recovery ===
+
+The first line of defense against damage caused by colliding writes is the
+Prime Coordination Directive: "Don't Do That".
+
+The second line of defense is to keep "S" (the number of competing versions)
+lower than N/k. If this holds true, at least one competing version will have
+k shares and thus be recoverable. Note that server unavailability counts
+against us here: the old version stored on the unavailable server must be
+included in the value of S.
+
+The third line of defense is our use of testv_and_writev() (described below),
+which increases the convergence of simultaneous writes: one of the writers
+will be favored (the one with the highest "R"), and that version is more
+likely to be accepted than the others. This defense is least effective in the
+pathological situation where S simultaneous writers are active, the one with
+the lowest "R" writes to N-k+1 of the shares and then dies, then the one with
+the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the
+one with the highest "R" writes to k-1 shares and dies. Any other sequencing
+will allow the highest "R" to write to at least k shares and establish a new
+revision.
+
+The fourth line of defense is the fact that each client keeps writing until
+at least one version has N shares. This uses additional servers, if
+necessary, to make sure that either the client's version or some
+newer/overriding version is highly available.
+
+The fifth line of defense is the recovery algorithm, which seeks to make sure
+that at least *one* version is highly available, even if that version is
+somebody else's.
+
+The write-shares-to-peers algorithm is as follows:
+
+ * permute peers according to storage index
+ * walk through peers, trying to assign one share per peer
+ * for each peer:
+ * send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test
+ * this means that we will overwrite any old versions, and we will
+ overwrite simultaneous writers of the same version if our R is higher.
+ We will not overwrite writers using a higher seqnum.
+ * record the version that each share winds up with. If the write was
+ accepted, this is our own version. If it was rejected, read the
+ old_test_data to find out what version was retained.
+ * if old_test_data indicates the seqnum was equal or greater than our
+ own, mark the "Simultaneous Writes Detected" flag, which will eventually
+ result in an error being reported to the writer (in their close() call).
+ * build a histogram of "R" values
+ * repeat until the histogram indicates that some version (possibly ours)
+ has N shares. Use new servers if necessary.
+ * If we run out of servers:
+ * if there are at least shares-of-happiness of any one version, we're
+ happy, so return. (the close() might still get an error)
+ * not happy, need to reinforce something, goto RECOVERY
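+
+The "permute peers according to storage index" step above might look like
+the following; the specific hash is an assumption, and the only point is
+that each slot gets its own stable ordering of the peers:
+
+    import hashlib
+
+    def permute_peers(storage_index: bytes, peerids):
+        # Stable, pseudo-random per-slot ordering, so the shares of
+        # different slots spread across different servers.
+        return sorted(peerids,
+                      key=lambda peerid: hashlib.sha256(storage_index + peerid).digest())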
+
+RECOVERY:
+ * read all shares, count the versions, identify the recoverable ones,
+ discard the unrecoverable ones.
+ * sort versions: locate max(seqnums), put all versions with that seqnum
+ in the list, sort by number of outstanding shares. Then put our own
+ version. (TODO: put versions with seqnum <max but >us ahead of us?).
+ * for each version:
+ * attempt to recover that version
+ * if not possible, remove it from the list, go to next one
+ * if recovered, start at beginning of peer list, push that version,
+ continue until N shares are placed
+ * if pushing our own version, bump up the seqnum to one higher than
+ the max seqnum we saw
+ * if we run out of servers:
+ * schedule retry and exponential backoff to repeat RECOVERY
+ * admit defeat after some period? presumably the client will be shut down
+ eventually, maybe keep trying (once per hour?) until then.
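+
+The version-ordering step of RECOVERY might be sketched as follows; the
+(seqnum, R) -> share-count mapping and the "most widely placed first" tie
+break are assumptions:
+
+    def recovery_order(version_counts, our_version):
+        # version_counts: dict mapping (seqnum, R) to the number of shares
+        # currently placed for that version (recoverable versions only).
+        max_seq = max(seq for (seq, _) in version_counts)
+        newest = sorted((v for v in version_counts if v[0] == max_seq),
+                        key=lambda v: version_counts[v], reverse=True)
+        ordered = [v for v in newest if v != our_version]
+        ordered.append(our_version)    # our own version goes last
+        return ordered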
+
+== Medium Distributed Mutable Files ==
+
+These are just like the SDMF case, but:
+
+ * we use a Merkle hash tree over the blocks, instead of using a single flat
+ hash, to reduce the read-time alacrity
+ * we allow arbitrary writes to the file (i.e. seek() is provided, and
+ O_TRUNC is no longer required)
+ * we write more code on the client side (in the MutableFileNode class), to
+ first read each segment that a write must modify. This looks exactly like
+ the way a normal filesystem uses a block device, or how a CPU must perform
+ a cache-line fill before modifying a single word.
+ * we might implement some sort of copy-based atomic update server call,
+ to allow multiple writev() calls to appear atomic to any readers.
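+
+A hedged sketch of the cache-line-fill analogy from the list above: to
+modify part of a segment, the client first reads the whole segment, splices
+in the new bytes, and writes the whole segment back. The read_segment and
+write_segment callables are hypothetical client-side helpers, and only
+writes contained within a single segment are handled:
+
+    def patch_segment(read_segment, write_segment, segsize, offset, newdata):
+        segnum = offset // segsize
+        seg = bytearray(read_segment(segnum))    # "cache-line fill"
+        start = offset - segnum * segsize
+        assert start + len(newdata) <= segsize   # single-segment writes only
+        seg[start:start + len(newdata)] = newdata
+        write_segment(segnum, bytes(seg))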
+
+MDMF slots provide fairly efficient in-place edits of very large files (a few
+GB). Appending data is also fairly efficient, although each time a power of 2
+boundary is crossed, the entire file must effectively be re-uploaded, so if
+the filesize is known in advance, that space ought to be pre-allocated.
+
+MDMF1 uses the Merkle tree to enable low-alacrity random-access reads. MDMF2
+adds cache-line reads to allow random-access writes.
+
+== Large Distributed Mutable Files ==
+
+LDMF slots use a fundamentally different way to store the file, inspired by
+Mercurial's "revlog" format. They enable very efficient insert/remove/replace
+editing of arbitrary spans. Multiple versions of the file can be retained, in
+a revision graph that can have multiple heads. Each revision can be
+referenced by a cryptographic identifier. There are two forms of the URI, one
+that means "most recent version", and a longer one that points to a specific
+revision.
+
+Metadata can be attached to the revisions, like timestamps, to enable rolling
+back an entire tree to a specific point in history.
+
+LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2
+provides explicit support for revision identifiers and branching.
+
+== TODO ==
+
+fix gigantic RO-URI security bug, probably by adding a second secret
+
+how about:
+ * H(privkey+pubkey) -> writekey -> readkey -> storageindex
+ * RW-URI = writekey
+ * RO-URI = readkey + H(pubkey)
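+
+A hedged sketch of that proposed derivation (SHA-256 again standing in for
+whatever tagged hash would actually be used):
+
+    import hashlib
+
+    def h(*parts: bytes) -> bytes:
+        return hashlib.sha256(b"".join(parts)).digest()
+
+    def derive_repaired_uris(privkey: bytes, pubkey: bytes):
+        write_key = h(privkey, pubkey)     # H(privkey+pubkey) -> writekey
+        read_key = h(write_key)            # writekey -> readkey
+        storage_index = h(read_key)        # readkey -> storageindex
+        rw_uri = write_key                 # RW-URI = writekey
+        ro_uri = read_key + h(pubkey)      # RO-URI = readkey + H(pubkey)
+        return rw_uri, ro_uri, storage_index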
+
+
+improve allocate-and-write or get-writer-buckets API to allow one-call (or
+maybe two-call) updates. The challenge is in figuring out which shares are on
+which machines.
+
+(eventually) define behavior when seqnum wraps. At the very least make sure
+it can't cause a security problem. "the slot is worn out" is acceptable.
+
+(eventually) define share-migration WE-update protocol