been implemented. Please see mutable-DSA.svg for a quick picture of the
crypto scheme described herein)
-= Mutable Files =
-
-Mutable File Slots are places with a stable identifier that can hold data
-that changes over time. In contrast to CHK slots, for which the
-URI/identifier is derived from the contents themselves, the Mutable File Slot
-URI remains fixed for the life of the slot, regardless of what data is placed
-inside it.
-
-Each mutable slot is referenced by two different URIs. The "read-write" URI
-grants read-write access to its holder, allowing them to put whatever
-contents they like into the slot. The "read-only" URI is less powerful, only
-granting read access, and not enabling modification of the data. The
-read-write URI can be turned into the read-only URI, but not the other way
-around.
-
-The data in these slots is distributed over a number of servers, using the
-same erasure coding that CHK files use, with 3-of-10 being a typical choice
-of encoding parameters. The data is encrypted and signed in such a way that
-only the holders of the read-write URI will be able to set the contents of
-the slot, and only the holders of the read-only URI will be able to read
-those contents. Holders of either URI will be able to validate the contents
-as being written by someone with the read-write URI. The servers who hold the
-shares cannot read or modify them: the worst they can do is deny service (by
-deleting or corrupting the shares), or attempt a rollback attack (which can
-only succeed with the cooperation of at least k servers).
-
-== Consistency vs Availability ==
-
-There is an age-old battle between consistency and availability. Epic papers
-have been written, elaborate proofs have been established, and generations of
-theorists have learned that you cannot simultaneously achieve guaranteed
-consistency with guaranteed availability. In addition, the closer you get to
-a guarantee on either axis, the more the cost and complexity of the design
-goes up.
-
-Tahoe's design goals are to largely favor design simplicity, then slightly
-favor read availability, over the other criteria.
-
-As we develop more sophisticated mutable slots, the API may expose multiple
-read versions to the application layer. The tahoe philosophy is to defer most
-consistency recovery logic to the higher layers. Some applications have
-effective ways to merge multiple versions, so inconsistency is not
-necessarily a problem (e.g. directory nodes can usually merge multiple "add
-child" operations).
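-
-As a rough sketch of that kind of merge (the dict-based dirnode
-representation below is purely illustrative, not the real dirnode format),
-two writers who each added a different child can be reconciled by taking
-the union of their tables:
-
-  # Toy model: a dirnode version is a dict mapping child name to its
-  # (rw-URI, ro-URI) pair. Both writers started from the same ancestor.
-  ancestor = {"photos": ("URI:rw1", "URI:ro1")}
-  version_a = dict(ancestor, music=("URI:rw2", "URI:ro2"))
-  version_b = dict(ancestor, docs=("URI:rw3", "URI:ro3"))
-
-  merged = dict(ancestor)
-  merged.update(version_a)
-  merged.update(version_b)  # "add child" operations commute
-  assert set(merged) == {"photos", "music", "docs"}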
-
-== The Prime Coordination Directive: "Don't Do That" ==
-
-The current rule for applications which run on top of Tahoe is "do not
-perform simultaneous uncoordinated writes". That means you need non-tahoe
-means to make sure that two parties are not trying to modify the same mutable
-slot at the same time. For example:
-
- * don't give the read-write URI to anyone else. Dirnodes in a private
- directory generally satisfy this case, as long as you don't use two
- clients on the same account at the same time
- * if you give a read-write URI to someone else, stop using it yourself. An
- inbox would be a good example of this.
- * if you give a read-write URI to someone else, call them on the phone
- before you write into it
- * build an automated mechanism to have your agents coordinate writes.
- For example, we expect a future release to include a FURL for a
- "coordination server" in the dirnodes. The rule can be that you must
- contact the coordination server and obtain a lock/lease on the file
- before you're allowed to modify it.
-
-If you do not follow this rule, Bad Things will happen. The worst-case Bad
-Thing is that the entire file will be lost. A less-bad Bad Thing is that one
-or more of the simultaneous writers will lose their changes. An observer of
-the file may not see monotonically-increasing changes to the file, i.e. they
-may see version 1, then version 2, then 3, then 2 again.
-
-Tahoe takes some amount of care to reduce the badness of these Bad Things.
-One way you can help nudge it from the "lose your file" case into the "lose
-some changes" case is to reduce the number of competing versions: multiple
-versions of the file that different parties are trying to establish as the
-one true current contents. Each simultaneous writer counts as a "competing
-version", as does the previous version of the file. If the count "S" of these
-competing versions is larger than N/k, then the file runs the risk of being
-lost completely. If at least one of the writers remains running after the
-collision is detected, it will attempt to recover, but if S>(N/k) and all
-writers crash after writing a few shares, the file will be lost.
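-
-To make the N/k bound concrete, here is the pigeonhole argument in a few
-lines (a sketch which assumes all N share slots end up holding one of the
-S competing versions):
-
-  # With k-of-N encoding there are N share slots. If S*k <= N, then by
-  # pigeonhole at least one version must hold k or more shares and thus
-  # remain recoverable; once S exceeds N/k that guarantee disappears.
-  def guaranteed_recoverable(S, k=3, N=10):
-      return S * k <= N
-
-  assert guaranteed_recoverable(S=3)      # some version gets >= 4 shares
-  assert not guaranteed_recoverable(S=4)  # no guarantee any more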
-
-
-== Small Distributed Mutable Files ==
-
-SDMF slots are suitable for small (<1MB) files that are edited by rewriting
-the entire file. The three operations are:
-
- * allocate (with initial contents)
- * set (with new contents)
- * get (old contents)
-
-The first use of SDMF slots will be to hold directories (dirnodes), which map
-encrypted child names to rw-URI/ro-URI pairs.
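-
-As a sketch of the contract these three operations offer (ToySDMFSlot is a
-made-up, purely in-memory stand-in that ignores encoding, encryption, and
-servers):
-
-  class ToySDMFSlot:
-      def __init__(self, initial_contents):  # allocate
-          self._contents = initial_contents
-
-      def set(self, new_contents):           # replace the entire file
-          self._contents = new_contents
-
-      def get(self):                         # read the entire file
-          return self._contents
-
-  slot = ToySDMFSlot(b"version 1 of a dirnode")
-  slot.set(b"version 2 of a dirnode")
-  assert slot.get() == b"version 2 of a dirnode"
-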
+This file shows only the differences from RSA-based mutable files to
+(EC)DSA-based mutable files. You have to read and understand mutable.txt before
+reading this file (mutable-DSA.txt).
=== SDMF slots overview ===
of the storage index. We believe that 64 bits of material is sufficiently
resistant to this form of pre-image attack to serve as a suitable deterrent.
-
-=== Server Storage Protocol ===
-
-The storage servers will provide a mutable slot container which is oblivious
-to the details of the data being contained inside it. Each storage index
-refers to a "bucket", and each bucket has one or more shares inside it. (In a
-well-provisioned network, each bucket will have only one share). The bucket
-is stored as a directory, using the base32-encoded storage index as the
-directory name. Each share is stored in a single file, using the share number
-as the filename.
-
-The container holds space for a container magic number (for versioning), the
-write enabler, the nodeid which accepted the write enabler (used for share
-migration, described below), a small number of lease structures, the embedded
-data itself, and expansion space for additional lease structures.
-
- #   offset  size   name
- 1   0       32     magic verstr "tahoe mutable container v1" plus binary
- 2   32      20     write enabler's nodeid
- 3   52      32     write enabler
- 4   84      8      data size (actual share data present) (a)
- 5   92      8      offset of (8) count of extra leases (after data)
- 6   100     368    four leases, 92 bytes each
-                      0   4   ownerid (0 means "no lease here")
-                      4   4   expiration timestamp
-                      8  32   renewal token
-                     40  32   cancel token
-                     72  20   nodeid which accepted the tokens
- 7   468     (a)    data
- 8   ??      4      count of extra leases
- 9   ??      n*92   extra leases
-
-The "extra leases" field must be copied and rewritten each time the size of
-the enclosed data changes. The hope is that most buckets will have four or
-fewer leases and this extra copying will not usually be necessary.
-
-The (4) "data size" field contains the actual number of bytes of data present
-in field (7), such that a client request to read beyond 468+(a) will result
-in an error. This allows the client to (one day) read relative to the end of
-the file. The space reserved for data inside the container (that is,
-(8)-(7)) might be larger, especially if extra size was pre-allocated in
-anticipation of filling the container with a lot of data.
-
-The offset in (5) points at the *count* of extra leases, at (8). The actual
-leases (at (9)) begin 4 bytes later. If the container size changes, both (8)
-and (9) must be relocated by copying.
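-
-To make the fixed offsets concrete, here is a sketch of the same layout
-using Python's struct module (the sizes and offsets come from the table
-above; the big-endian, unsigned encodings are assumptions, and this is not
-the actual server code):
-
-  import struct
-
-  # Fields 1-5: magic (32), write enabler's nodeid (20), write enabler
-  # (32), data size (8), offset of the extra-lease count (8).
-  HEADER = ">32s20s32sQQ"
-  HEADER_SIZE = struct.calcsize(HEADER)       # 100
-  # Field 6: four lease slots of 92 bytes each (ownerid, expiration,
-  # renewal token, cancel token, accepting nodeid).
-  LEASE = ">LL32s32s20s"
-  LEASE_SIZE = struct.calcsize(LEASE)         # 92
-  DATA_OFFSET = HEADER_SIZE + 4 * LEASE_SIZE  # 468, where field 7 starts
-
-  def unpack_header(container):
-      """Return (magic, WE nodeid, WE, data size, extra-lease offset)."""
-      return struct.unpack(HEADER, container[:HEADER_SIZE])
-
-  assert DATA_OFFSET == 468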
-
-The server will honor any write commands that provide the write enabler and
-do not exceed the server-wide storage size limitations. Read and write commands
-MUST be restricted to the 'data' portion of the container: the implementation
-of those commands MUST perform correct bounds-checking to make sure other
-portions of the container are inaccessible to the clients.
-
-The two methods provided by the storage server on these "MutableSlot" share
-objects are:
-
- * readv(ListOf(offset=int, length=int))
- * returns a list of bytestrings, of the various requested lengths
- * offset < 0 is interpreted relative to the end of the data
- * spans which hit the end of the data will return truncated data
-
- * testv_and_writev(write_enabler, test_vector, write_vector)
- * this is a test-and-set operation which performs the given tests and only
- applies the desired writes if all tests succeed. This is used to detect
- simultaneous writers, and to reduce the chance that an update will lose
- data recently written by some other party (written after the last time
- this slot was read).
- * test_vector=ListOf(TupleOf(offset, length, opcode, specimen))
- * the opcode is a string, from the set [gt, ge, eq, le, lt, ne]
- * each element of the test vector is read from the slot's data and
- compared against the specimen using the desired (in)equality. If all
- tests evaluate True, the write is performed
- * write_vector=ListOf(TupleOf(offset, newdata))
- * offset < 0 is not yet defined; it probably means relative to the
- end of the data, which probably means append, but we haven't nailed
- it down quite yet
- * write vectors are executed in order, which specifies the results of
- overlapping writes
- * return value:
- * error: OutOfSpace
- * error: something else (io error, out of memory, whatever)
- * (True, old_test_data): the write was accepted (test_vector passed)
- * (False, old_test_data): the write was rejected (test_vector failed)
- * both 'accepted' and 'rejected' return the old data that was used
- for the test_vector comparison. This can be used by the client
- to detect write collisions, including collisions for which the
- desired behavior was to overwrite the old version.
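-
-As a client-side sketch of driving this interface (the 'slot' object is a
-stand-in for a remote MutableSlot reference, and the checkstring layout is
-an assumption; the version test Tahoe actually uses is described under
-"Recovery" below):
-
-  # Assumes the share begins with a fixed-width, big-endian encoding of
-  # the sequence number followed by the root hash "R", so that comparing
-  # that prefix as bytes matches comparing (seqnum, R) numerically.
-  def push_new_version(slot, write_enabler, new_checkstring, new_share):
-      test_vector = [
-          # (offset, length, opcode, specimen): the old prefix must be
-          # less than or equal to the version we are about to write.
-          (0, len(new_checkstring), "le", new_checkstring),
-      ]
-      write_vector = [(0, new_share)]
-      accepted, old_test_data = slot.testv_and_writev(
-          write_enabler, test_vector, write_vector)
-      if not accepted:
-          # Someone else holds a newer version; let the caller recover.
-          raise RuntimeError("write collision: %r" % (old_test_data,))
-      return old_test_data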
-
-In addition, the storage server provides several methods to access these
-share objects:
-
- * allocate_mutable_slot(storage_index, sharenums=SetOf(int))
- * returns DictOf(int, MutableSlot)
- * get_mutable_slot(storage_index)
- * returns DictOf(int, MutableSlot)
- * or raises KeyError
-
-We intend to add an interface which allows small slots to allocate-and-write
-in a single call, as well as do update or read in a single call. The goal is
-to allow a reasonably-sized dirnode to be created (or updated, or read) in
-just one round trip (to all N shareholders in parallel).
-
-==== migrating shares ====
-
-If a share must be migrated from one server to another, two values become
-invalid: the write enabler (since it was computed for the old server), and
-the lease renew/cancel tokens.
-
-Suppose that a slot was first created on nodeA, and was thus initialized with
-WE(nodeA) (= H(WEM+nodeA), where WEM is the client's write-enabler master
-secret). Later, for provisioning reasons, the share is moved from nodeA to
-nodeB.
-
-Readers may still be able to find the share in its new home, depending upon
-how many servers are present in the grid, where the new nodeid lands in the
-permuted index for this particular storage index, and how many servers the
-reading client is willing to contact.
-
-When a client attempts to write to this migrated share, it will get a "bad
-write enabler" error, since the WE it computes for nodeB will not match the
-WE(nodeA) that was embedded in the share. When this occurs, the "bad write
-enabler" message must include the old nodeid (e.g. nodeA) that was in the
-share.
-
-The client then computes H(nodeB+H(WEM+nodeA)), which is the same as
-H(nodeB+WE(nodeA)). The client sends this along with the new WE(nodeB), which
-is H(WEM+nodeB). Note that the client only sends WE(nodeB) to nodeB, never to
-anyone else. Also note that the client does not send a value to nodeB that
-would allow the node to impersonate the client to a third node: everything
-sent to nodeB will include something specific to nodeB in it.
-
-The server locally computes H(nodeB+WE(nodeA)), using its own node id and the
-old write enabler from the share. It compares this against the value supplied
-by the client. If they match, this serves as proof that the client was able
-to compute the old write enabler. The server then accepts the client's new
-WE(nodeB) and writes it into the container.
-
-This WE-fixup process requires an extra round trip, and requires the error
-message to include the old nodeid, but does not require any public key
-operations on either client or server.
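-
-The hash dance can be written out in a few lines. This sketch uses SHA-256
-and plain concatenation as placeholders; the actual hash function and
-message framing are not pinned down here:
-
-  from hashlib import sha256
-
-  def H(*pieces):
-      return sha256(b"".join(pieces)).digest()
-
-  WEM = b"write enabler master secret"   # known only to the writer
-  nodeA, nodeB = b"A" * 20, b"B" * 20    # placeholder 20-byte nodeids
-
-  WE_A = H(WEM, nodeA)   # write enabler embedded in the share on nodeA
-  WE_B = H(WEM, nodeB)   # write enabler the client wants nodeB to adopt
-
-  # After the "bad write enabler" error names nodeA, the client sends
-  # this proof (together with WE_B) to nodeB, and only to nodeB:
-  client_proof = H(nodeB, WE_A)
-
-  # nodeB reads the old write enabler out of the migrated share and
-  # checks the proof with its own nodeid; a match shows the client could
-  # compute WE(nodeA), so nodeB accepts and stores WE_B.
-  old_we_from_share = WE_A
-  assert H(nodeB, old_we_from_share) == client_proof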
-
-Migrating the leases will require a similar protocol. This protocol will be
-defined concretely at a later date.
-
-=== Code Details ===
-
-The current FileNode class will be renamed ImmutableFileNode, and a new
-MutableFileNode class will be created. Instances of this class will contain a
-URI and a reference to the client (for peer selection and connection). The
-methods of MutableFileNode are:
-
- * replace(newdata) -> OK, ConsistencyError, NotEnoughSharesError
- * get() -> [deferred] newdata, NotEnoughSharesError
- * if there are multiple retrievable versions in the grid, get() returns
- the first version it can reconstruct, and silently ignores the others.
- In the future, a more advanced API will signal and provide access to
- the multiple heads.
-
-The peer-selection and data-structure manipulation (and signing/verification)
-steps will be implemented in a separate class in allmydata/mutable.py .
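-
-A sketch of how application code might drive this interface, using the
-Twisted-style deferreds implied by the signatures above (names are
-illustrative only):
-
-  def replace_and_reread(node, newdata):
-      # replace() may errback with ConsistencyError or
-      # NotEnoughSharesError; we simply let those propagate.
-      d = node.replace(newdata)
-      d.addCallback(lambda _: node.get())  # fires with the new contents
-      return d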
-
=== SDMF Slot Format ===
This SDMF data lives inside a server-side MutableSlot container. The server
hash tree (with root "R"), from which a minimal subset of hashes is put in
the share hash chain in (8).
-=== Recovery ===
-
-The first line of defense against damage caused by colliding writes is the
-Prime Coordination Directive: "Don't Do That".
-
-The second line of defense is to keep "S" (the number of competing versions)
-lower than N/k. If this holds true, at least one competing version will have
-k shares and thus be recoverable. Note that server unavailability counts
-against us here: the old version stored on the unavailable server must be
-included in the value of S.
-
-The third line of defense is our use of testv_and_writev() (described below),
-which increases the convergence of simultaneous writes: one of the writers
-will be favored (the one with the highest "R"), and that version is more
-likely to be accepted than the others. This defense is least effective in the
-pathological situation where S simultaneous writers are active, the one with
-the lowest "R" writes to N-k+1 of the shares and then dies, then the one with
-the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the
-one with the highest "R" writes to k-1 shares and dies. Any other sequencing
-will allow the highest "R" to write to at least k shares and establish a new
-revision.
-
-The fourth line of defense is the fact that each client keeps writing until
-at least one version has N shares. This uses additional servers, if
-necessary, to make sure that either the client's version or some
-newer/overriding version is highly available.
-
-The fifth line of defense is the recovery algorithm, which seeks to make sure
-that at least *one* version is highly available, even if that version is
-somebody else's.
-
-The write-shares-to-peers algorithm is as follows:
-
- * permute peers according to storage index
- * walk through peers, trying to assign one share per peer
- * for each peer:
- * send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test
- * this means that we will overwrite any old versions, and we will
- overwrite simultaneous writers of the same version if our R is higher.
- We will not overwrite writers using a higher seqnum.
- * record the version that each share winds up with. If the write was
- accepted, this is our own version. If it was rejected, read the
- old_test_data to find out what version was retained.
- * if old_test_data indicates the seqnum was equal or greater than our
- own, mark the "Simultaneous Writes Detected" flag, which will eventually
- result in an error being reported to the writer (in their close() call).
- * build a histogram of "R" values
- * repeat until the histogram indicates that some version (possibly ours)
- has N shares. Use new servers if necessary.
- * If we run out of servers:
- * if there are at least shares-of-happiness of any one version, we're
- happy, so return. (the close() might still get an error)
- * not happy, need to reinforce something, goto RECOVERY
-
-RECOVERY:
- * read all shares, count the versions, identify the recoverable ones,
- discard the unrecoverable ones.
- * sort versions: locate max(seqnums), put all versions with that seqnum
- in the list, sort by number of outstanding shares. Then put our own
- version. (TODO: put versions with seqnum <max but >us ahead of us?).
- * for each version:
- * attempt to recover that version
- * if not possible, remove it from the list, go to next one
- * if recovered, start at beginning of peer list, push that version,
- continue until N shares are placed
- * if pushing our own version, bump up the seqnum to one higher than
- the max seqnum we saw
- * if we run out of servers:
- * schedule retry and exponential backoff to repeat RECOVERY
- * admit defeat after some period? presumably the client will be shut down
- eventually, maybe keep trying (once per hour?) until then.
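-
-As a sketch of the bookkeeping in the loops above (the share numbers and
-(seqnum, R) pairs are made up for illustration):
-
-  from collections import Counter
-
-  def summarize(share_versions, k=3, N=10):
-      # share_versions maps sharenum -> the (seqnum, R) that share holds
-      counts = Counter(share_versions.values())
-      recoverable = [v for v, n in counts.items() if n >= k]
-      fully_placed = [v for v, n in counts.items() if n >= N]
-      return counts, recoverable, fully_placed
-
-  shares = {0: (4, "R1"), 1: (4, "R1"), 2: (4, "R2"),
-            3: (4, "R1"), 4: (3, "Rold")}
-  counts, recoverable, fully_placed = summarize(shares)
-  assert (4, "R1") in recoverable  # three shares: this version survives
-  assert not fully_placed          # keep pushing until something has N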
-
-
-
-
-== Medium Distributed Mutable Files ==
-
-These are just like the SDMF case, but:
-
- * we actually take advantage of the Merkle hash tree over the blocks, by
- reading a single segment of data at a time (and its necessary hashes), to
- reduce the read-time alacrity
- * we allow arbitrary writes to the file (i.e. seek() is provided, and
- O_TRUNC is no longer required)
- * we write more code on the client side (in the MutableFileNode class), to
- first read each segment that a write must modify. This looks exactly like
- the way a normal filesystem uses a block device, or how a CPU must perform
- a cache-line fill before modifying a single word.
- * we might implement some sort of copy-based atomic update server call,
- to allow multiple writev() calls to appear atomic to any readers.
-
-MDMF slots provide fairly efficient in-place edits of very large files (a few
-GB). Appending data is also fairly efficient, although each time a power of 2
-boundary is crossed, the entire file must effectively be re-uploaded (because
-the size of the block hash tree changes), so if the filesize is known in
-advance, that space ought to be pre-allocated (by leaving extra space between
-the block hash tree and the actual data).
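-
-A quick sketch of why the power-of-2 boundary matters (the 128KiB segment
-size and the pad-to-a-power-of-two tree shape are assumptions made for
-illustration):
-
-  import math
-
-  def block_hash_tree_nodes(filesize, segment_size=128 * 1024):
-      # One leaf per segment, padded up to a power of two; a full binary
-      # tree with P leaves has 2*P-1 nodes.
-      leaves = max(1, math.ceil(filesize / segment_size))
-      padded = 1 << (leaves - 1).bit_length()
-      return 2 * padded - 1
-
-  seg = 128 * 1024
-  # Growing from 9 to 16 segments leaves the tree the same size...
-  assert block_hash_tree_nodes(9 * seg) == block_hash_tree_nodes(16 * seg)
-  # ...but crossing to 17 segments doubles the leaf count, so everything
-  # stored after the tree has to move.
-  assert block_hash_tree_nodes(17 * seg) > block_hash_tree_nodes(16 * seg)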
-
-MDMF1 uses the Merkle tree to enable low-alacrity random-access reads. MDMF2
-adds cache-line reads to allow random-access writes.
-
-== Large Distributed Mutable Files ==
-
-LDMF slots use a fundamentally different way to store the file, inspired by
-Mercurial's "revlog" format. They enable very efficient insert/remove/replace
-editing of arbitrary spans. Multiple versions of the file can be retained, in
-a revision graph that can have multiple heads. Each revision can be
-referenced by a cryptographic identifier. There are two forms of the URI, one
-that means "most recent version", and a longer one that points to a specific
-revision.
-
-Metadata can be attached to the revisions, like timestamps, to enable rolling
-back an entire tree to a specific point in history.
-
-LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2
-provides explicit support for revision identifiers and branching.
-
== TODO ==
-improve allocate-and-write or get-writer-buckets API to allow one-call (or
-maybe two-call) updates. The challenge is in figuring out which shares are on
-which machines. First cut will have lots of round trips.
-
-(eventually) define behavior when seqnum wraps. At the very least make sure
-it can't cause a security problem. "the slot is worn out" is acceptable.
-
-(eventually) define share-migration lease update protocol. Including the
-nodeid which accepted the lease is useful: we can use the same protocol as we
-do for updating the write enabler. However, we need to know which lease to
-update... maybe send back a list of all old nodeids that we find, then try
-all of them when we accept the update?
-
- We now do this in a specially-formatted IndexError exception:
- "UNABLE to renew non-existent lease. I have leases accepted by " +
- "nodeids: '12345','abcde','44221' ."
-
Every node in a given tahoe grid must have the same common DSA moduli and
exponent, but different grids could use different parameters. We haven't
figured out how to define a "grid id" yet, but I think the DSA parameters