docs/specifications/mutable.rst

   1 =============
   2 Mutable Files
   3 =============
   4
   5 This describes the "RSA-based mutable files" which were shipped in Tahoe v0.8.0.
   6
   7 1.  `Consistency vs. Availability`_
   8 2.  `The Prime Coordination Directive: "Don't Do That"`_
   9 3.  `Small Distributed Mutable Files`_
  10
  11     1. `SDMF slots overview`_
  12     2. `Server Storage Protocol`_
  13     3. `Code Details`_
  14     4. `SMDF Slot Format`_
  15     5. `Recovery`_
  16
  17 4.  `Medium Distributed Mutable Files`_
  18 5.  `Large Distributed Mutable Files`_
  19 6.  `TODO`_
  20
  21 Mutable File Slots are places with a stable identifier that can hold data
  22 that changes over time. In contrast to CHK slots, for which the
  23 URI/identifier is derived from the contents themselves, the Mutable File Slot
  24 URI remains fixed for the life of the slot, regardless of what data is placed
  25 inside it.
  26
  27 Each mutable slot is referenced by two different URIs. The "read-write" URI
  28 grants read-write access to its holder, allowing them to put whatever
  29 contents they like into the slot. The "read-only" URI is less powerful, only
  30 granting read access, and not enabling modification of the data. The
  31 read-write URI can be turned into the read-only URI, but not the other way
  32 around.
  33
  34 The data in these slots is distributed over a number of servers, using the
  35 same erasure coding that CHK files use, with 3-of-10 being a typical choice
  36 of encoding parameters. The data is encrypted and signed in such a way that
  37 only the holders of the read-write URI will be able to set the contents of
  38 the slot, and only the holders of the read-only URI will be able to read
  39 those contents. Holders of either URI will be able to validate the contents
  40 as being written by someone with the read-write URI. The servers who hold the
  41 shares cannot read or modify them: the worst they can do is deny service (by
  42 deleting or corrupting the shares), or attempt a rollback attack (which can
  43 only succeed with the cooperation of at least k servers).
  44
  45 Consistency vs. Availability
  46 ============================
  47
  48 There is an age-old battle between consistency and availability. Epic papers
  49 have been written, elaborate proofs have been established, and generations of
  50 theorists have learned that you cannot simultaneously achieve guaranteed
  51 consistency with guaranteed reliability. In addition, the closer to 0 you get
  52 on either axis, the cost and complexity of the design goes up.
  53
  54 Tahoe's design goals are to largely favor design simplicity, then slightly
  55 favor read availability, over the other criteria.
  56
  57 As we develop more sophisticated mutable slots, the API may expose multiple
  58 read versions to the application layer. The tahoe philosophy is to defer most
  59 consistency recovery logic to the higher layers. Some applications have
  60 effective ways to merge multiple versions, so inconsistency is not
  61 necessarily a problem (i.e. directory nodes can usually merge multiple "add
  62 child" operations).
  63
  64 The Prime Coordination Directive: "Don't Do That"
  65 =================================================
  66
  67 The current rule for applications which run on top of Tahoe is "do not
  68 perform simultaneous uncoordinated writes". That means you need non-tahoe
  69 means to make sure that two parties are not trying to modify the same mutable
  70 slot at the same time. For example:
  71
  72 * don't give the read-write URI to anyone else. Dirnodes in a private
  73   directory generally satisfy this case, as long as you don't use two
  74   clients on the same account at the same time
  75 * if you give a read-write URI to someone else, stop using it yourself. An
  76   inbox would be a good example of this.
  77 * if you give a read-write URI to someone else, call them on the phone
  78   before you write into it
  79 * build an automated mechanism to have your agents coordinate writes.
  80   For example, we expect a future release to include a FURL for a
  81   "coordination server" in the dirnodes. The rule can be that you must
  82   contact the coordination server and obtain a lock/lease on the file
  83   before you're allowed to modify it.
  84
  85 If you do not follow this rule, Bad Things will happen. The worst-case Bad
  86 Thing is that the entire file will be lost. A less-bad Bad Thing is that one
  87 or more of the simultaneous writers will lose their changes. An observer of
  88 the file may not see monotonically-increasing changes to the file, i.e. they
  89 may see version 1, then version 2, then 3, then 2 again.
  90
  91 Tahoe takes some amount of care to reduce the badness of these Bad Things.
  92 One way you can help nudge it from the "lose your file" case into the "lose
  93 some changes" case is to reduce the number of competing versions: multiple
  94 versions of the file that different parties are trying to establish as the
  95 one true current contents. Each simultaneous writer counts as a "competing
  96 version", as does the previous version of the file. If the count "S" of these
  97 competing versions is larger than N/k, then the file runs the risk of being
  98 lost completely. [TODO] If at least one of the writers remains running after
  99 the collision is detected, it will attempt to recover, but if S>(N/k) and all
 100 writers crash after writing a few shares, the file will be lost.
 101
 102 Note that Tahoe uses serialization internally to make sure that a single
 103 Tahoe node will not perform simultaneous modifications to a mutable file. It
 104 accomplishes this by using a weakref cache of the MutableFileNode (so that
 105 there will never be two distinct MutableFileNodes for the same file), and by
 106 forcing all mutable file operations to obtain a per-node lock before they
 107 run. The Prime Coordination Directive therefore applies to inter-node
 108 conflicts, not intra-node ones.
 109
 110
 111 Small Distributed Mutable Files
 112 ===============================
 113
 114 SDMF slots are suitable for small (<1MB) files that are editing by rewriting
 115 the entire file. The three operations are:
 116
 117  * allocate (with initial contents)
 118  * set (with new contents)
 119  * get (old contents)
 120
 121 The first use of SDMF slots will be to hold directories (dirnodes), which map
 122 encrypted child names to rw-URI/ro-URI pairs.
 123
 124 SDMF slots overview
 125 -------------------
 126
 127 Each SDMF slot is created with a public/private key pair. The public key is
 128 known as the "verification key", while the private key is called the
 129 "signature key". The private key is hashed and truncated to 16 bytes to form
 130 the "write key" (an AES symmetric key). The write key is then hashed and
 131 truncated to form the "read key". The read key is hashed and truncated to
 132 form the 16-byte "storage index" (a unique string used as an index to locate
 133 stored data).
 134
 135 The public key is hashed by itself to form the "verification key hash".
 136
 137 The write key is hashed a different way to form the "write enabler master".
 138 For each storage server on which a share is kept, the write enabler master is
 139 concatenated with the server's nodeid and hashed, and the result is called
 140 the "write enabler" for that particular server. Note that multiple shares of
 141 the same slot stored on the same server will all get the same write enabler,
 142 i.e. the write enabler is associated with the "bucket", rather than the
 143 individual shares.
 144
 145 The private key is encrypted (using AES in counter mode) by the write key,
 146 and the resulting crypttext is stored on the servers. so it will be
 147 retrievable by anyone who knows the write key. The write key is not used to
 148 encrypt anything else, and the private key never changes, so we do not need
 149 an IV for this purpose.
 150
 151 The actual data is encrypted (using AES in counter mode) with a key derived
 152 by concatenating the readkey with the IV, the hashing the results and
 153 truncating to 16 bytes. The IV is randomly generated each time the slot is
 154 updated, and stored next to the encrypted data.
 155
 156 The read-write URI consists of the write key and the verification key hash.
 157 The read-only URI contains the read key and the verification key hash. The
 158 verify-only URI contains the storage index and the verification key hash.
 159
 160 ::
 161
 162  URI:SSK-RW:b2a(writekey):b2a(verification_key_hash)
 163  URI:SSK-RO:b2a(readkey):b2a(verification_key_hash)
 164  URI:SSK-Verify:b2a(storage_index):b2a(verification_key_hash)
 165
 166 Note that this allows the read-only and verify-only URIs to be derived from
 167 the read-write URI without actually retrieving the public keys. Also note
 168 that it means the read-write agent must validate both the private key and the
 169 public key when they are first fetched. All users validate the public key in
 170 exactly the same way.
 171
 172 The SDMF slot is allocated by sending a request to the storage server with a
 173 desired size, the storage index, and the write enabler for that server's
 174 nodeid. If granted, the write enabler is stashed inside the slot's backing
 175 store file. All further write requests must be accompanied by the write
 176 enabler or they will not be honored. The storage server does not share the
 177 write enabler with anyone else.
 178
 179 The SDMF slot structure will be described in more detail below. The important
 180 pieces are:
 181
 182 * a sequence number
 183 * a root hash "R"
 184 * the encoding parameters (including k, N, file size, segment size)
 185 * a signed copy of [seqnum,R,encoding_params], using the signature key
 186 * the verification key (not encrypted)
 187 * the share hash chain (part of a Merkle tree over the share hashes)
 188 * the block hash tree (Merkle tree over blocks of share data)
 189 * the share data itself (erasure-coding of read-key-encrypted file data)
 190 * the signature key, encrypted with the write key
 191
 192 The access pattern for read is:
 193
 194 * hash read-key to get storage index
 195 * use storage index to locate 'k' shares with identical 'R' values
 196
 197   * either get one share, read 'k' from it, then read k-1 shares
 198   * or read, say, 5 shares, discover k, either get more or be finished
 199   * or copy k into the URIs
 200
 201 * read verification key
 202 * hash verification key, compare against verification key hash
 203 * read seqnum, R, encoding parameters, signature
 204 * verify signature against verification key
 205 * read share data, compute block-hash Merkle tree and root "r"
 206 * read share hash chain (leading from "r" to "R")
 207 * validate share hash chain up to the root "R"
 208 * submit share data to erasure decoding
 209 * decrypt decoded data with read-key
 210 * submit plaintext to application
 211
 212 The access pattern for write is:
 213
 214 * hash write-key to get read-key, hash read-key to get storage index
 215 * use the storage index to locate at least one share
 216 * read verification key and encrypted signature key
 217 * decrypt signature key using write-key
 218 * hash signature key, compare against write-key
 219 * hash verification key, compare against verification key hash
 220 * encrypt plaintext from application with read-key
 221
 222   * application can encrypt some data with the write-key to make it only
 223     available to writers (use this for transitive read-onlyness of dirnodes)
 224
 225 * erasure-code crypttext to form shares
 226 * split shares into blocks
 227 * compute Merkle tree of blocks, giving root "r" for each share
 228 * compute Merkle tree of shares, find root "R" for the file as a whole
 229 * create share data structures, one per server:
 230
 231   * use seqnum which is one higher than the old version
 232   * share hash chain has log(N) hashes, different for each server
 233   * signed data is the same for each server
 234
 235 * now we have N shares and need homes for them
 236 * walk through peers
 237
 238   * if share is not already present, allocate-and-set
 239   * otherwise, try to modify existing share:
 240   * send testv_and_writev operation to each one
 241   * testv says to accept share if their(seqnum+R) <= our(seqnum+R)
 242   * count how many servers wind up with which versions (histogram over R)
 243   * keep going until N servers have the same version, or we run out of servers
 244
 245     * if any servers wound up with a different version, report error to
 246       application
 247     * if we ran out of servers, initiate recovery process (described below)
 248
 249 Server Storage Protocol
 250 -----------------------
 251
 252 The storage servers will provide a mutable slot container which is oblivious
 253 to the details of the data being contained inside it. Each storage index
 254 refers to a "bucket", and each bucket has one or more shares inside it. (In a
 255 well-provisioned network, each bucket will have only one share). The bucket
 256 is stored as a directory, using the base32-encoded storage index as the
 257 directory name. Each share is stored in a single file, using the share number
 258 as the filename.
 259
 260 The container holds space for a container magic number (for versioning), the
 261 write enabler, the nodeid which accepted the write enabler (used for share
 262 migration, described below), a small number of lease structures, the embedded
 263 data itself, and expansion space for additional lease structures::
 264
 265  #   offset    size    name
 266  1   0         32      magic verstr "Tahoe mutable container v1\n\x75\x09\x44\x03\x8e"
 267  2   32        20      write enabler's nodeid
 268  3   52        32      write enabler
 269  4   84        8       data size (actual share data present) (a)
 270  5   92        8       offset of (8) count of extra leases (after data)
 271  6   100       368     four leases, 92 bytes each
 272                         0    4   ownerid (0 means "no lease here")
 273                         4    4   expiration timestamp
 274                         8   32   renewal token
 275                         40  32   cancel token
 276                         72  20   nodeid which accepted the tokens
 277  7   468       (a)     data
 278  8   ??        4       count of extra leases
 279  9   ??        n*92    extra leases
 280
 281 The "extra leases" field must be copied and rewritten each time the size of
 282 the enclosed data changes. The hope is that most buckets will have four or
 283 fewer leases and this extra copying will not usually be necessary.
 284
 285 The (4) "data size" field contains the actual number of bytes of data present
 286 in field (7), such that a client request to read beyond 504+(a) will result
 287 in an error. This allows the client to (one day) read relative to the end of
 288 the file. The container size (that is, (8)-(7)) might be larger, especially
 289 if extra size was pre-allocated in anticipation of filling the container with
 290 a lot of data.
 291
 292 The offset in (5) points at the *count* of extra leases, at (8). The actual
 293 leases (at (9)) begin 4 bytes later. If the container size changes, both (8)
 294 and (9) must be relocated by copying.
 295
 296 The server will honor any write commands that provide the write token and do
 297 not exceed the server-wide storage size limitations. Read and write commands
 298 MUST be restricted to the 'data' portion of the container: the implementation
 299 of those commands MUST perform correct bounds-checking to make sure other
 300 portions of the container are inaccessible to the clients.
 301
 302 The two methods provided by the storage server on these "MutableSlot" share
 303 objects are:
 304
 305 * readv(ListOf(offset=int, length=int))
 306
 307   * returns a list of bytestrings, of the various requested lengths
 308   * offset < 0 is interpreted relative to the end of the data
 309   * spans which hit the end of the data will return truncated data
 310
 311 * testv_and_writev(write_enabler, test_vector, write_vector)
 312
 313   * this is a test-and-set operation which performs the given tests and only
 314     applies the desired writes if all tests succeed. This is used to detect
 315     simultaneous writers, and to reduce the chance that an update will lose
 316     data recently written by some other party (written after the last time
 317     this slot was read).
 318   * test_vector=ListOf(TupleOf(offset, length, opcode, specimen))
 319   * the opcode is a string, from the set [gt, ge, eq, le, lt, ne]
 320   * each element of the test vector is read from the slot's data and
 321     compared against the specimen using the desired (in)equality. If all
 322     tests evaluate True, the write is performed
 323   * write_vector=ListOf(TupleOf(offset, newdata))
 324
 325     * offset < 0 is not yet defined, it probably means relative to the
 326       end of the data, which probably means append, but we haven't nailed
 327       it down quite yet
 328     * write vectors are executed in order, which specifies the results of
 329       overlapping writes
 330
 331   * return value:
 332
 333     * error: OutOfSpace
 334     * error: something else (io error, out of memory, whatever)
 335     * (True, old_test_data): the write was accepted (test_vector passed)
 336     * (False, old_test_data): the write was rejected (test_vector failed)
 337
 338       * both 'accepted' and 'rejected' return the old data that was used
 339         for the test_vector comparison. This can be used by the client
 340         to detect write collisions, including collisions for which the
 341         desired behavior was to overwrite the old version.
 342
 343 In addition, the storage server provides several methods to access these
 344 share objects:
 345
 346 * allocate_mutable_slot(storage_index, sharenums=SetOf(int))
 347
 348   * returns DictOf(int, MutableSlot)
 349
 350 * get_mutable_slot(storage_index)
 351
 352   * returns DictOf(int, MutableSlot)
 353   * or raises KeyError
 354
 355 We intend to add an interface which allows small slots to allocate-and-write
 356 in a single call, as well as do update or read in a single call. The goal is
 357 to allow a reasonably-sized dirnode to be created (or updated, or read) in
 358 just one round trip (to all N shareholders in parallel).
 359
 360 migrating shares
 361 ````````````````
 362
 363 If a share must be migrated from one server to another, two values become
 364 invalid: the write enabler (since it was computed for the old server), and
 365 the lease renew/cancel tokens.
 366
 367 Suppose that a slot was first created on nodeA, and was thus initialized with
 368 WE(nodeA) (= H(WEM+nodeA)). Later, for provisioning reasons, the share is
 369 moved from nodeA to nodeB.
 370
 371 Readers may still be able to find the share in its new home, depending upon
 372 how many servers are present in the grid, where the new nodeid lands in the
 373 permuted index for this particular storage index, and how many servers the
 374 reading client is willing to contact.
 375
 376 When a client attempts to write to this migrated share, it will get a "bad
 377 write enabler" error, since the WE it computes for nodeB will not match the
 378 WE(nodeA) that was embedded in the share. When this occurs, the "bad write
 379 enabler" message must include the old nodeid (e.g. nodeA) that was in the
 380 share.
 381
 382 The client then computes H(nodeB+H(WEM+nodeA)), which is the same as
 383 H(nodeB+WE(nodeA)). The client sends this along with the new WE(nodeB), which
 384 is H(WEM+nodeB). Note that the client only sends WE(nodeB) to nodeB, never to
 385 anyone else. Also note that the client does not send a value to nodeB that
 386 would allow the node to impersonate the client to a third node: everything
 387 sent to nodeB will include something specific to nodeB in it.
 388
 389 The server locally computes H(nodeB+WE(nodeA)), using its own node id and the
 390 old write enabler from the share. It compares this against the value supplied
 391 by the client. If they match, this serves as proof that the client was able
 392 to compute the old write enabler. The server then accepts the client's new
 393 WE(nodeB) and writes it into the container.
 394
 395 This WE-fixup process requires an extra round trip, and requires the error
 396 message to include the old nodeid, but does not require any public key
 397 operations on either client or server.
 398
 399 Migrating the leases will require a similar protocol. This protocol will be
 400 defined concretely at a later date.
 401
 402 Code Details
 403 ------------
 404
 405 The MutableFileNode class is used to manipulate mutable files (as opposed to
 406 ImmutableFileNodes). These are initially generated with
 407 client.create_mutable_file(), and later recreated from URIs with
 408 client.create_node_from_uri(). Instances of this class will contain a URI and
 409 a reference to the client (for peer selection and connection).
 410
 411 NOTE: this section is out of date. Please see src/allmydata/interfaces.py
 412 (the section on IMutableFilesystemNode) for more accurate information.
 413
 414 The methods of MutableFileNode are:
 415
 416 * download_to_data() -> [deferred] newdata, NotEnoughSharesError
 417
 418   * if there are multiple retrieveable versions in the grid, get() returns
 419     the first version it can reconstruct, and silently ignores the others.
 420     In the future, a more advanced API will signal and provide access to
 421     the multiple heads.
 422
 423 * update(newdata) -> OK, UncoordinatedWriteError, NotEnoughSharesError
 424 * overwrite(newdata) -> OK, UncoordinatedWriteError, NotEnoughSharesError
 425
 426 download_to_data() causes a new retrieval to occur, pulling the current
 427 contents from the grid and returning them to the caller. At the same time,
 428 this call caches information about the current version of the file. This
 429 information will be used in a subsequent call to update(), and if another
 430 change has occured between the two, this information will be out of date,
 431 triggering the UncoordinatedWriteError.
 432
 433 update() is therefore intended to be used just after a download_to_data(), in
 434 the following pattern::
 435
 436  d = mfn.download_to_data()
 437  d.addCallback(apply_delta)
 438  d.addCallback(mfn.update)
 439
 440 If the update() call raises UCW, then the application can simply return an
 441 error to the user ("you violated the Prime Coordination Directive"), and they
 442 can try again later. Alternatively, the application can attempt to retry on
 443 its own. To accomplish this, the app needs to pause, download the new
 444 (post-collision and post-recovery) form of the file, reapply their delta,
 445 then submit the update request again. A randomized pause is necessary to
 446 reduce the chances of colliding a second time with another client that is
 447 doing exactly the same thing::
 448
 449  d = mfn.download_to_data()
 450  d.addCallback(apply_delta)
 451  d.addCallback(mfn.update)
 452  def _retry(f):
 453    f.trap(UncoordinatedWriteError)
 454    d1 = pause(random.uniform(5, 20))
 455    d1.addCallback(lambda res: mfn.download_to_data())
 456    d1.addCallback(apply_delta)
 457    d1.addCallback(mfn.update)
 458    return d1
 459  d.addErrback(_retry)
 460
 461 Enthusiastic applications can retry multiple times, using a randomized
 462 exponential backoff between each. A particularly enthusiastic application can
 463 retry forever, but such apps are encouraged to provide a means to the user of
 464 giving up after a while.
 465
 466 UCW does not mean that the update was not applied, so it is also a good idea
 467 to skip the retry-update step if the delta was already applied::
 468
 469  d = mfn.download_to_data()
 470  d.addCallback(apply_delta)
 471  d.addCallback(mfn.update)
 472  def _retry(f):
 473    f.trap(UncoordinatedWriteError)
 474    d1 = pause(random.uniform(5, 20))
 475    d1.addCallback(lambda res: mfn.download_to_data())
 476    def _maybe_apply_delta(contents):
 477      new_contents = apply_delta(contents)
 478      if new_contents != contents:
 479        return mfn.update(new_contents)
 480    d1.addCallback(_maybe_apply_delta)
 481    return d1
 482  d.addErrback(_retry)
 483
 484 update() is the right interface to use for delta-application situations, like
 485 directory nodes (in which apply_delta might be adding or removing child
 486 entries from a serialized table).
 487
 488 Note that any uncoordinated write has the potential to lose data. We must do
 489 more analysis to be sure, but it appears that two clients who write to the
 490 same mutable file at the same time (even if both eventually retry) will, with
 491 high probability, result in one client observing UCW and the other silently
 492 losing their changes. It is also possible for both clients to observe UCW.
 493 The moral of the story is that the Prime Coordination Directive is there for
 494 a reason, and that recovery/UCW/retry is not a subsitute for write
 495 coordination.
 496
 497 overwrite() tells the client to ignore this cached version information, and
 498 to unconditionally replace the mutable file's contents with the new data.
 499 This should not be used in delta application, but rather in situations where
 500 you want to replace the file's contents with completely unrelated ones. When
 501 raw files are uploaded into a mutable slot through the Tahoe-LAFS web-API
 502 (using POST and the ?mutable=true argument), they are put in place with
 503 overwrite().
 504
 505 The peer-selection and data-structure manipulation (and signing/verification)
 506 steps will be implemented in a separate class in allmydata/mutable.py .
 507
 508 SMDF Slot Format
 509 ----------------
 510
 511 This SMDF data lives inside a server-side MutableSlot container. The server
 512 is oblivious to this format.
 513
 514 This data is tightly packed. In particular, the share data is defined to run
 515 all the way to the beginning of the encrypted private key (the encprivkey
 516 offset is used both to terminate the share data and to begin the encprivkey).
 517
 518 ::
 519
 520   #    offset   size    name
 521   1    0        1       version byte, \x00 for this format
 522   2    1        8       sequence number. 2^64-1 must be handled specially, TBD
 523   3    9        32      "R" (root of share hash Merkle tree)
 524   4    41       16      IV (share data is AES(H(readkey+IV)) )
 525   5    57       18      encoding parameters:
 526         57       1        k
 527         58       1        N
 528         59       8        segment size
 529         67       8        data length (of original plaintext)
 530   6    75       32      offset table:
 531         75       4        (8) signature
 532         79       4        (9) share hash chain
 533         83       4        (10) block hash tree
 534         87       4        (11) share data
 535         91       8        (12) encrypted private key
 536         99       8        (13) EOF
 537   7    107      436ish  verification key (2048 RSA key)
 538   8    543ish   256ish  signature=RSAsign(sigkey, H(version+seqnum+r+IV+encparm))
 539   9    799ish   (a)     share hash chain, encoded as:
 540                          "".join([pack(">H32s", shnum, hash)
 541                                   for (shnum,hash) in needed_hashes])
 542  10    (927ish) (b)     block hash tree, encoded as:
 543                          "".join([pack(">32s",hash) for hash in block_hash_tree])
 544  11    (935ish) LEN     share data (no gap between this and encprivkey)
 545  12    ??       1216ish encrypted private key= AESenc(write-key, RSA-key)
 546  13    ??       --      EOF
 547
 548  (a) The share hash chain contains ceil(log(N)) hashes, each 32 bytes long.
 549     This is the set of hashes necessary to validate this share's leaf in the
 550     share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes.
 551  (b) The block hash tree contains ceil(length/segsize) hashes, each 32 bytes
 552     long. This is the set of hashes necessary to validate any given block of
 553     share data up to the per-share root "r". Each "r" is a leaf of the share
 554     has tree (with root "R"), from which a minimal subset of hashes is put in
 555     the share hash chain in (8).
 556
 557 Recovery
 558 --------
 559
 560 The first line of defense against damage caused by colliding writes is the
 561 Prime Coordination Directive: "Don't Do That".
 562
 563 The second line of defense is to keep "S" (the number of competing versions)
 564 lower than N/k. If this holds true, at least one competing version will have
 565 k shares and thus be recoverable. Note that server unavailability counts
 566 against us here: the old version stored on the unavailable server must be
 567 included in the value of S.
 568
 569 The third line of defense is our use of testv_and_writev() (described below),
 570 which increases the convergence of simultaneous writes: one of the writers
 571 will be favored (the one with the highest "R"), and that version is more
 572 likely to be accepted than the others. This defense is least effective in the
 573 pathological situation where S simultaneous writers are active, the one with
 574 the lowest "R" writes to N-k+1 of the shares and then dies, then the one with
 575 the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the
 576 one with the highest "R" writes to k-1 shares and dies. Any other sequencing
 577 will allow the highest "R" to write to at least k shares and establish a new
 578 revision.
 579
 580 The fourth line of defense is the fact that each client keeps writing until
 581 at least one version has N shares. This uses additional servers, if
 582 necessary, to make sure that either the client's version or some
 583 newer/overriding version is highly available.
 584
 585 The fifth line of defense is the recovery algorithm, which seeks to make sure
 586 that at least *one* version is highly available, even if that version is
 587 somebody else's.
 588
 589 The write-shares-to-peers algorithm is as follows:
 590
 591 * permute peers according to storage index
 592 * walk through peers, trying to assign one share per peer
 593 * for each peer:
 594
 595   * send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test
 596
 597     * this means that we will overwrite any old versions, and we will
 598       overwrite simultaenous writers of the same version if our R is higher.
 599       We will not overwrite writers using a higher seqnum.
 600
 601   * record the version that each share winds up with. If the write was
 602     accepted, this is our own version. If it was rejected, read the
 603     old_test_data to find out what version was retained.
 604   * if old_test_data indicates the seqnum was equal or greater than our
 605     own, mark the "Simultanous Writes Detected" flag, which will eventually
 606     result in an error being reported to the writer (in their close() call).
 607   * build a histogram of "R" values
 608   * repeat until the histogram indicate that some version (possibly ours)
 609     has N shares. Use new servers if necessary.
 610   * If we run out of servers:
 611
 612     * if there are at least shares-of-happiness of any one version, we're
 613       happy, so return. (the close() might still get an error)
 614     * not happy, need to reinforce something, goto RECOVERY
 615
 616 Recovery:
 617
 618 * read all shares, count the versions, identify the recoverable ones,
 619   discard the unrecoverable ones.
 620 * sort versions: locate max(seqnums), put all versions with that seqnum
 621   in the list, sort by number of outstanding shares. Then put our own
 622   version. (TODO: put versions with seqnum <max but >us ahead of us?).
 623 * for each version:
 624
 625   * attempt to recover that version
 626   * if not possible, remove it from the list, go to next one
 627   * if recovered, start at beginning of peer list, push that version,
 628     continue until N shares are placed
 629   * if pushing our own version, bump up the seqnum to one higher than
 630     the max seqnum we saw
 631   * if we run out of servers:
 632
 633     * schedule retry and exponential backoff to repeat RECOVERY
 634
 635   * admit defeat after some period? presumeably the client will be shut down
 636     eventually, maybe keep trying (once per hour?) until then.
 637
 638
 639 Medium Distributed Mutable Files
 640 ================================
 641
 642 These are just like the SDMF case, but:
 643
 644 * we actually take advantage of the Merkle hash tree over the blocks, by
 645   reading a single segment of data at a time (and its necessary hashes), to
 646   reduce the read-time alacrity
 647 * we allow arbitrary writes to the file (i.e. seek() is provided, and
 648   O_TRUNC is no longer required)
 649 * we write more code on the client side (in the MutableFileNode class), to
 650   first read each segment that a write must modify. This looks exactly like
 651   the way a normal filesystem uses a block device, or how a CPU must perform
 652   a cache-line fill before modifying a single word.
 653 * we might implement some sort of copy-based atomic update server call,
 654   to allow multiple writev() calls to appear atomic to any readers.
 655
 656 MDMF slots provide fairly efficient in-place edits of very large files (a few
 657 GB). Appending data is also fairly efficient, although each time a power of 2
 658 boundary is crossed, the entire file must effectively be re-uploaded (because
 659 the size of the block hash tree changes), so if the filesize is known in
 660 advance, that space ought to be pre-allocated (by leaving extra space between
 661 the block hash tree and the actual data).
 662
 663 MDMF1 uses the Merkle tree to enable low-alacrity random-access reads. MDMF2
 664 adds cache-line reads to allow random-access writes.
 665
 666 Large Distributed Mutable Files
 667 ===============================
 668
 669 LDMF slots use a fundamentally different way to store the file, inspired by
 670 Mercurial's "revlog" format. They enable very efficient insert/remove/replace
 671 editing of arbitrary spans. Multiple versions of the file can be retained, in
 672 a revision graph that can have multiple heads. Each revision can be
 673 referenced by a cryptographic identifier. There are two forms of the URI, one
 674 that means "most recent version", and a longer one that points to a specific
 675 revision.
 676
 677 Metadata can be attached to the revisions, like timestamps, to enable rolling
 678 back an entire tree to a specific point in history.
 679
 680 LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2
 681 provides explicit support for revision identifiers and branching.
 682
 683 TODO
 684 ====
 685
 686 improve allocate-and-write or get-writer-buckets API to allow one-call (or
 687 maybe two-call) updates. The challenge is in figuring out which shares are on
 688 which machines. First cut will have lots of round trips.
 689
 690 (eventually) define behavior when seqnum wraps. At the very least make sure
 691 it can't cause a security problem. "the slot is worn out" is acceptable.
 692
 693 (eventually) define share-migration lease update protocol. Including the
 694 nodeid who accepted the lease is useful, we can use the same protocol as we
 695 do for updating the write enabler. However we need to know which lease to
 696 update.. maybe send back a list of all old nodeids that we find, then try all
 697 of them when we accept the update?
 698
 699 We now do this in a specially-formatted IndexError exception:
 700  "UNABLE to renew non-existent lease. I have leases accepted by " +
 701  "nodeids: '12345','abcde','44221' ."
 702
 703 confirm that a repairer can regenerate shares without the private key. Hmm,
 704 without the write-enabler they won't be able to write those shares to the
 705 servers.. although they could add immutable new shares to new servers.