docs/specifications/mutable.rst

   1 .. -*- coding: utf-8-with-signature -*-
   2
   3 =============
   4 Mutable Files
   5 =============
   6
   7 1.  `Mutable Formats`_
   8 2.  `Consistency vs. Availability`_
   9 3.  `The Prime Coordination Directive: "Don't Do That"`_
  10 4.  `Small Distributed Mutable Files`_
  11
  12     1. `SDMF slots overview`_
  13     2. `Server Storage Protocol`_
  14     3. `Code Details`_
  15     4. `SMDF Slot Format`_
  16     5. `Recovery`_
  17
  18 5.  `Medium Distributed Mutable Files`_
  19 6.  `Large Distributed Mutable Files`_
  20 7.  `TODO`_
  21
  22 Mutable files are places with a stable identifier that can hold data that
  23 changes over time. In contrast to immutable slots, for which the
  24 identifier/capability is derived from the contents themselves, the mutable
  25 file identifier remains fixed for the life of the slot, regardless of what
  26 data is placed inside it.
  27
  28 Each mutable file is referenced by two different caps. The "read-write" cap
  29 grants read-write access to its holder, allowing them to put whatever
  30 contents they like into the slot. The "read-only" cap is less powerful, only
  31 granting read access, and not enabling modification of the data. The
  32 read-write cap can be turned into the read-only cap, but not the other way
  33 around.
  34
  35 The data in these files is distributed over a number of servers, using the
  36 same erasure coding that immutable files use, with 3-of-10 being a typical
  37 choice of encoding parameters. The data is encrypted and signed in such a way
  38 that only the holders of the read-write cap will be able to set the contents
  39 of the slot, and only the holders of the read-only cap will be able to read
  40 those contents. Holders of either cap will be able to validate the contents
  41 as being written by someone with the read-write cap. The servers who hold the
  42 shares are not automatically given the ability read or modify them: the worst
  43 they can do is deny service (by deleting or corrupting the shares), or
  44 attempt a rollback attack (which can only succeed with the cooperation of at
  45 least k servers).
  46
  47
  48 Mutable Formats
  49 ===============
  50
  51 History
  52 -------
  53
  54 When mutable files first shipped in Tahoe-0.8.0 (15-Feb-2008), the only
  55 version available was "SDMF", described below. This was a
  56 limited-functionality placeholder, intended to be replaced with
  57 improved-efficiency "MDMF" files shortly afterwards. The development process
  58 took longer than expected, and MDMF didn't ship until Tahoe-1.9.0
  59 (31-Oct-2011), and even then it was opt-in (not used by default).
  60
  61 SDMF was intended for relatively small mutable files, up to a few megabytes.
  62 It uses only one segment, so alacrity (the measure of how quickly the first
  63 byte of plaintext is returned to the client) suffers, as the whole file must
  64 be downloaded even if you only want to get a single byte. The memory used by
  65 both clients and servers also scales with the size of the file, instead of
  66 being limited to the half-a-MB-or-so that immutable file operations use, so
  67 large files cause significant memory usage. To discourage the use of SDMF
  68 outside it's design parameters, the early versions of Tahoe enforced a
  69 maximum size on mutable files (maybe 10MB). Since most directories are built
  70 out of mutable files, this imposed a limit of about 30k entries per
  71 directory. In subsequent releases, this limit was removed, but the
  72 performance problems inherent in the SDMF implementation remained.
  73
  74 In the summer of 2010, Google-Summer-of-Code student Kevan Carstensen took on
  75 the project of finally implementing MDMF. Because of my (Brian) design
  76 mistake in SDMF (not including a separate encryption seed in each segment),
  77 the share format for SDMF could not be used for MDMF, resulting in a larger
  78 gap between the two implementations (my original intention had been to make
  79 SDMF a clean subset of MDMF, where any single-segment MDMF file could be
  80 handled by the old SDMF code). In the fall of 2011, Kevan's code was finally
  81 integrated, and first made available in the Tahoe-1.9.0 release.
  82
  83 SDMF vs. MDMF
  84 -------------
  85
  86 The improvement of MDMF is the use of multiple segments: individual 128-KiB
  87 sections of the file can be retrieved or modified independently. The
  88 improvement can be seen when fetching just a portion of the file (using a
  89 Range: header on the webapi), or when modifying a portion (again with a
  90 Range: header). It can also be seen indirectly when fetching the whole file:
  91 the first segment of data should be delivered faster from a large MDMF file
  92 than from an SDMF file, although the overall download will then proceed at
  93 the same rate.
  94
  95 We've decided to make it opt-in for now: mutable files default to
  96 SDMF format unless explicitly configured to use MDMF, either in ``tahoe.cfg``
  97 (see `<configuration.rst>`__) or in the WUI or CLI command that created a
  98 new mutable file.
  99
 100 The code can read and modify existing files of either format without user
 101 intervention. We expect to make MDMF the default in a subsequent release,
 102 perhaps 2.0.
 103
 104 Which format should you use? SDMF works well for files up to a few MB, and
 105 can be handled by older versions (Tahoe-1.8.3 and earlier). If you do not
 106 need to support older clients, want to efficiently work with mutable files,
 107 and have code which will use Range: headers that make partial reads and
 108 writes, then MDMF is for you.
 109
 110
 111 Consistency vs. Availability
 112 ============================
 113
 114 There is an age-old battle between consistency and availability. Epic papers
 115 have been written, elaborate proofs have been established, and generations of
 116 theorists have learned that you cannot simultaneously achieve guaranteed
 117 consistency with guaranteed reliability. In addition, the closer to 0 you get
 118 on either axis, the cost and complexity of the design goes up.
 119
 120 Tahoe's design goals are to largely favor design simplicity, then slightly
 121 favor read availability, over the other criteria.
 122
 123 As we develop more sophisticated mutable slots, the API may expose multiple
 124 read versions to the application layer. The tahoe philosophy is to defer most
 125 consistency recovery logic to the higher layers. Some applications have
 126 effective ways to merge multiple versions, so inconsistency is not
 127 necessarily a problem (i.e. directory nodes can usually merge multiple
 128 "add child" operations).
 129
 130
 131 The Prime Coordination Directive: "Don't Do That"
 132 =================================================
 133
 134 The current rule for applications which run on top of Tahoe is "do not
 135 perform simultaneous uncoordinated writes". That means you need non-tahoe
 136 means to make sure that two parties are not trying to modify the same mutable
 137 slot at the same time. For example:
 138
 139 * don't give the read-write URI to anyone else. Dirnodes in a private
 140   directory generally satisfy this case, as long as you don't use two
 141   clients on the same account at the same time
 142 * if you give a read-write URI to someone else, stop using it yourself. An
 143   inbox would be a good example of this.
 144 * if you give a read-write URI to someone else, call them on the phone
 145   before you write into it
 146 * build an automated mechanism to have your agents coordinate writes.
 147   For example, we expect a future release to include a FURL for a
 148   "coordination server" in the dirnodes. The rule can be that you must
 149   contact the coordination server and obtain a lock/lease on the file
 150   before you're allowed to modify it.
 151
 152 If you do not follow this rule, Bad Things will happen. The worst-case Bad
 153 Thing is that the entire file will be lost. A less-bad Bad Thing is that one
 154 or more of the simultaneous writers will lose their changes. An observer of
 155 the file may not see monotonically-increasing changes to the file, i.e. they
 156 may see version 1, then version 2, then 3, then 2 again.
 157
 158 Tahoe takes some amount of care to reduce the badness of these Bad Things.
 159 One way you can help nudge it from the "lose your file" case into the "lose
 160 some changes" case is to reduce the number of competing versions: multiple
 161 versions of the file that different parties are trying to establish as the
 162 one true current contents. Each simultaneous writer counts as a "competing
 163 version", as does the previous version of the file. If the count "S" of these
 164 competing versions is larger than N/k, then the file runs the risk of being
 165 lost completely. [TODO] If at least one of the writers remains running after
 166 the collision is detected, it will attempt to recover, but if S>(N/k) and all
 167 writers crash after writing a few shares, the file will be lost.
 168
 169 Note that Tahoe uses serialization internally to make sure that a single
 170 Tahoe node will not perform simultaneous modifications to a mutable file. It
 171 accomplishes this by using a weakref cache of the MutableFileNode (so that
 172 there will never be two distinct MutableFileNodes for the same file), and by
 173 forcing all mutable file operations to obtain a per-node lock before they
 174 run. The Prime Coordination Directive therefore applies to inter-node
 175 conflicts, not intra-node ones.
 176
 177
 178 Small Distributed Mutable Files
 179 ===============================
 180
 181 SDMF slots are suitable for small (<1MB) files that are editing by rewriting
 182 the entire file. The three operations are:
 183
 184  * allocate (with initial contents)
 185  * set (with new contents)
 186  * get (old contents)
 187
 188 The first use of SDMF slots will be to hold directories (dirnodes), which map
 189 encrypted child names to rw-URI/ro-URI pairs.
 190
 191 SDMF slots overview
 192 -------------------
 193
 194 Each SDMF slot is created with a public/private key pair. The public key is
 195 known as the "verification key", while the private key is called the
 196 "signature key". The private key is hashed and truncated to 16 bytes to form
 197 the "write key" (an AES symmetric key). The write key is then hashed and
 198 truncated to form the "read key". The read key is hashed and truncated to
 199 form the 16-byte "storage index" (a unique string used as an index to locate
 200 stored data).
 201
 202 The public key is hashed by itself to form the "verification key hash".
 203
 204 The write key is hashed a different way to form the "write enabler master".
 205 For each storage server on which a share is kept, the write enabler master is
 206 concatenated with the server's nodeid and hashed, and the result is called
 207 the "write enabler" for that particular server. Note that multiple shares of
 208 the same slot stored on the same server will all get the same write enabler,
 209 i.e. the write enabler is associated with the "bucket", rather than the
 210 individual shares.
 211
 212 The private key is encrypted (using AES in counter mode) by the write key,
 213 and the resulting crypttext is stored on the servers. so it will be
 214 retrievable by anyone who knows the write key. The write key is not used to
 215 encrypt anything else, and the private key never changes, so we do not need
 216 an IV for this purpose.
 217
 218 The actual data is encrypted (using AES in counter mode) with a key derived
 219 by concatenating the readkey with the IV, the hashing the results and
 220 truncating to 16 bytes. The IV is randomly generated each time the slot is
 221 updated, and stored next to the encrypted data.
 222
 223 The read-write URI consists of the write key and the verification key hash.
 224 The read-only URI contains the read key and the verification key hash. The
 225 verify-only URI contains the storage index and the verification key hash.
 226
 227 ::
 228
 229  URI:SSK-RW:b2a(writekey):b2a(verification_key_hash)
 230  URI:SSK-RO:b2a(readkey):b2a(verification_key_hash)
 231  URI:SSK-Verify:b2a(storage_index):b2a(verification_key_hash)
 232
 233 Note that this allows the read-only and verify-only URIs to be derived from
 234 the read-write URI without actually retrieving the public keys. Also note
 235 that it means the read-write agent must validate both the private key and the
 236 public key when they are first fetched. All users validate the public key in
 237 exactly the same way.
 238
 239 The SDMF slot is allocated by sending a request to the storage server with a
 240 desired size, the storage index, and the write enabler for that server's
 241 nodeid. If granted, the write enabler is stashed inside the slot's backing
 242 store file. All further write requests must be accompanied by the write
 243 enabler or they will not be honored. The storage server does not share the
 244 write enabler with anyone else.
 245
 246 The SDMF slot structure will be described in more detail below. The important
 247 pieces are:
 248
 249 * a sequence number
 250 * a root hash "R"
 251 * the encoding parameters (including k, N, file size, segment size)
 252 * a signed copy of [seqnum,R,encoding_params], using the signature key
 253 * the verification key (not encrypted)
 254 * the share hash chain (part of a Merkle tree over the share hashes)
 255 * the block hash tree (Merkle tree over blocks of share data)
 256 * the share data itself (erasure-coding of read-key-encrypted file data)
 257 * the signature key, encrypted with the write key
 258
 259 The access pattern for read is:
 260
 261 * hash read-key to get storage index
 262 * use storage index to locate 'k' shares with identical 'R' values
 263
 264   * either get one share, read 'k' from it, then read k-1 shares
 265   * or read, say, 5 shares, discover k, either get more or be finished
 266   * or copy k into the URIs
 267
 268 * read verification key
 269 * hash verification key, compare against verification key hash
 270 * read seqnum, R, encoding parameters, signature
 271 * verify signature against verification key
 272 * read share data, compute block-hash Merkle tree and root "r"
 273 * read share hash chain (leading from "r" to "R")
 274 * validate share hash chain up to the root "R"
 275 * submit share data to erasure decoding
 276 * decrypt decoded data with read-key
 277 * submit plaintext to application
 278
 279 The access pattern for write is:
 280
 281 * hash write-key to get read-key, hash read-key to get storage index
 282 * use the storage index to locate at least one share
 283 * read verification key and encrypted signature key
 284 * decrypt signature key using write-key
 285 * hash signature key, compare against write-key
 286 * hash verification key, compare against verification key hash
 287 * encrypt plaintext from application with read-key
 288
 289   * application can encrypt some data with the write-key to make it only
 290     available to writers (use this for transitive read-onlyness of dirnodes)
 291
 292 * erasure-code crypttext to form shares
 293 * split shares into blocks
 294 * compute Merkle tree of blocks, giving root "r" for each share
 295 * compute Merkle tree of shares, find root "R" for the file as a whole
 296 * create share data structures, one per server:
 297
 298   * use seqnum which is one higher than the old version
 299   * share hash chain has log(N) hashes, different for each server
 300   * signed data is the same for each server
 301
 302 * now we have N shares and need homes for them
 303 * walk through peers
 304
 305   * if share is not already present, allocate-and-set
 306   * otherwise, try to modify existing share:
 307   * send testv_and_writev operation to each one
 308   * testv says to accept share if their(seqnum+R) <= our(seqnum+R)
 309   * count how many servers wind up with which versions (histogram over R)
 310   * keep going until N servers have the same version, or we run out of servers
 311
 312     * if any servers wound up with a different version, report error to
 313       application
 314     * if we ran out of servers, initiate recovery process (described below)
 315
 316 Server Storage Protocol
 317 -----------------------
 318
 319 The storage servers will provide a mutable slot container which is oblivious
 320 to the details of the data being contained inside it. Each storage index
 321 refers to a "bucket", and each bucket has one or more shares inside it. (In a
 322 well-provisioned network, each bucket will have only one share). The bucket
 323 is stored as a directory, using the base32-encoded storage index as the
 324 directory name. Each share is stored in a single file, using the share number
 325 as the filename.
 326
 327 The container holds space for a container magic number (for versioning), the
 328 write enabler, the nodeid which accepted the write enabler (used for share
 329 migration, described below), a small number of lease structures, the embedded
 330 data itself, and expansion space for additional lease structures::
 331
 332  #   offset    size    name
 333  1   0         32      magic verstr "Tahoe mutable container v1\n\x75\x09\x44\x03\x8e"
 334  2   32        20      write enabler's nodeid
 335  3   52        32      write enabler
 336  4   84        8       data size (actual share data present) (a)
 337  5   92        8       offset of (8) count of extra leases (after data)
 338  6   100       368     four leases, 92 bytes each
 339                         0    4   ownerid (0 means "no lease here")
 340                         4    4   expiration timestamp
 341                         8   32   renewal token
 342                         40  32   cancel token
 343                         72  20   nodeid which accepted the tokens
 344  7   468       (a)     data
 345  8   ??        4       count of extra leases
 346  9   ??        n*92    extra leases
 347
 348 The "extra leases" field must be copied and rewritten each time the size of
 349 the enclosed data changes. The hope is that most buckets will have four or
 350 fewer leases and this extra copying will not usually be necessary.
 351
 352 The (4) "data size" field contains the actual number of bytes of data present
 353 in field (7), such that a client request to read beyond 504+(a) will result
 354 in an error. This allows the client to (one day) read relative to the end of
 355 the file. The container size (that is, (8)-(7)) might be larger, especially
 356 if extra size was pre-allocated in anticipation of filling the container with
 357 a lot of data.
 358
 359 The offset in (5) points at the *count* of extra leases, at (8). The actual
 360 leases (at (9)) begin 4 bytes later. If the container size changes, both (8)
 361 and (9) must be relocated by copying.
 362
 363 The server will honor any write commands that provide the write token and do
 364 not exceed the server-wide storage size limitations. Read and write commands
 365 MUST be restricted to the 'data' portion of the container: the implementation
 366 of those commands MUST perform correct bounds-checking to make sure other
 367 portions of the container are inaccessible to the clients.
 368
 369 The two methods provided by the storage server on these "MutableSlot" share
 370 objects are:
 371
 372 * readv(ListOf(offset=int, length=int))
 373
 374   * returns a list of bytestrings, of the various requested lengths
 375   * offset < 0 is interpreted relative to the end of the data
 376   * spans which hit the end of the data will return truncated data
 377
 378 * testv_and_writev(write_enabler, test_vector, write_vector)
 379
 380   * this is a test-and-set operation which performs the given tests and only
 381     applies the desired writes if all tests succeed. This is used to detect
 382     simultaneous writers, and to reduce the chance that an update will lose
 383     data recently written by some other party (written after the last time
 384     this slot was read).
 385   * test_vector=ListOf(TupleOf(offset, length, opcode, specimen))
 386   * the opcode is a string, from the set [gt, ge, eq, le, lt, ne]
 387   * each element of the test vector is read from the slot's data and
 388     compared against the specimen using the desired (in)equality. If all
 389     tests evaluate True, the write is performed
 390   * write_vector=ListOf(TupleOf(offset, newdata))
 391
 392     * offset < 0 is not yet defined, it probably means relative to the
 393       end of the data, which probably means append, but we haven't nailed
 394       it down quite yet
 395     * write vectors are executed in order, which specifies the results of
 396       overlapping writes
 397
 398   * return value:
 399
 400     * error: OutOfSpace
 401     * error: something else (io error, out of memory, whatever)
 402     * (True, old_test_data): the write was accepted (test_vector passed)
 403     * (False, old_test_data): the write was rejected (test_vector failed)
 404
 405       * both 'accepted' and 'rejected' return the old data that was used
 406         for the test_vector comparison. This can be used by the client
 407         to detect write collisions, including collisions for which the
 408         desired behavior was to overwrite the old version.
 409
 410 In addition, the storage server provides several methods to access these
 411 share objects:
 412
 413 * allocate_mutable_slot(storage_index, sharenums=SetOf(int))
 414
 415   * returns DictOf(int, MutableSlot)
 416
 417 * get_mutable_slot(storage_index)
 418
 419   * returns DictOf(int, MutableSlot)
 420   * or raises KeyError
 421
 422 We intend to add an interface which allows small slots to allocate-and-write
 423 in a single call, as well as do update or read in a single call. The goal is
 424 to allow a reasonably-sized dirnode to be created (or updated, or read) in
 425 just one round trip (to all N shareholders in parallel).
 426
 427 migrating shares
 428 ````````````````
 429
 430 If a share must be migrated from one server to another, two values become
 431 invalid: the write enabler (since it was computed for the old server), and
 432 the lease renew/cancel tokens.
 433
 434 Suppose that a slot was first created on nodeA, and was thus initialized with
 435 WE(nodeA) (= H(WEM+nodeA)). Later, for provisioning reasons, the share is
 436 moved from nodeA to nodeB.
 437
 438 Readers may still be able to find the share in its new home, depending upon
 439 how many servers are present in the grid, where the new nodeid lands in the
 440 permuted index for this particular storage index, and how many servers the
 441 reading client is willing to contact.
 442
 443 When a client attempts to write to this migrated share, it will get a "bad
 444 write enabler" error, since the WE it computes for nodeB will not match the
 445 WE(nodeA) that was embedded in the share. When this occurs, the "bad write
 446 enabler" message must include the old nodeid (e.g. nodeA) that was in the
 447 share.
 448
 449 The client then computes H(nodeB+H(WEM+nodeA)), which is the same as
 450 H(nodeB+WE(nodeA)). The client sends this along with the new WE(nodeB), which
 451 is H(WEM+nodeB). Note that the client only sends WE(nodeB) to nodeB, never to
 452 anyone else. Also note that the client does not send a value to nodeB that
 453 would allow the node to impersonate the client to a third node: everything
 454 sent to nodeB will include something specific to nodeB in it.
 455
 456 The server locally computes H(nodeB+WE(nodeA)), using its own node id and the
 457 old write enabler from the share. It compares this against the value supplied
 458 by the client. If they match, this serves as proof that the client was able
 459 to compute the old write enabler. The server then accepts the client's new
 460 WE(nodeB) and writes it into the container.
 461
 462 This WE-fixup process requires an extra round trip, and requires the error
 463 message to include the old nodeid, but does not require any public key
 464 operations on either client or server.
 465
 466 Migrating the leases will require a similar protocol. This protocol will be
 467 defined concretely at a later date.
 468
 469 Code Details
 470 ------------
 471
 472 The MutableFileNode class is used to manipulate mutable files (as opposed to
 473 ImmutableFileNodes). These are initially generated with
 474 client.create_mutable_file(), and later recreated from URIs with
 475 client.create_node_from_uri(). Instances of this class will contain a URI and
 476 a reference to the client (for peer selection and connection).
 477
 478 NOTE: this section is out of date. Please see src/allmydata/interfaces.py
 479 (the section on IMutableFilesystemNode) for more accurate information.
 480
 481 The methods of MutableFileNode are:
 482
 483 * download_to_data() -> [deferred] newdata, NotEnoughSharesError
 484
 485   * if there are multiple retrieveable versions in the grid, get() returns
 486     the first version it can reconstruct, and silently ignores the others.
 487     In the future, a more advanced API will signal and provide access to
 488     the multiple heads.
 489
 490 * update(newdata) -> OK, UncoordinatedWriteError, NotEnoughSharesError
 491 * overwrite(newdata) -> OK, UncoordinatedWriteError, NotEnoughSharesError
 492
 493 download_to_data() causes a new retrieval to occur, pulling the current
 494 contents from the grid and returning them to the caller. At the same time,
 495 this call caches information about the current version of the file. This
 496 information will be used in a subsequent call to update(), and if another
 497 change has occured between the two, this information will be out of date,
 498 triggering the UncoordinatedWriteError.
 499
 500 update() is therefore intended to be used just after a download_to_data(), in
 501 the following pattern::
 502
 503  d = mfn.download_to_data()
 504  d.addCallback(apply_delta)
 505  d.addCallback(mfn.update)
 506
 507 If the update() call raises UCW, then the application can simply return an
 508 error to the user ("you violated the Prime Coordination Directive"), and they
 509 can try again later. Alternatively, the application can attempt to retry on
 510 its own. To accomplish this, the app needs to pause, download the new
 511 (post-collision and post-recovery) form of the file, reapply their delta,
 512 then submit the update request again. A randomized pause is necessary to
 513 reduce the chances of colliding a second time with another client that is
 514 doing exactly the same thing::
 515
 516  d = mfn.download_to_data()
 517  d.addCallback(apply_delta)
 518  d.addCallback(mfn.update)
 519  def _retry(f):
 520    f.trap(UncoordinatedWriteError)
 521    d1 = pause(random.uniform(5, 20))
 522    d1.addCallback(lambda res: mfn.download_to_data())
 523    d1.addCallback(apply_delta)
 524    d1.addCallback(mfn.update)
 525    return d1
 526  d.addErrback(_retry)
 527
 528 Enthusiastic applications can retry multiple times, using a randomized
 529 exponential backoff between each. A particularly enthusiastic application can
 530 retry forever, but such apps are encouraged to provide a means to the user of
 531 giving up after a while.
 532
 533 UCW does not mean that the update was not applied, so it is also a good idea
 534 to skip the retry-update step if the delta was already applied::
 535
 536  d = mfn.download_to_data()
 537  d.addCallback(apply_delta)
 538  d.addCallback(mfn.update)
 539  def _retry(f):
 540    f.trap(UncoordinatedWriteError)
 541    d1 = pause(random.uniform(5, 20))
 542    d1.addCallback(lambda res: mfn.download_to_data())
 543    def _maybe_apply_delta(contents):
 544      new_contents = apply_delta(contents)
 545      if new_contents != contents:
 546        return mfn.update(new_contents)
 547    d1.addCallback(_maybe_apply_delta)
 548    return d1
 549  d.addErrback(_retry)
 550
 551 update() is the right interface to use for delta-application situations, like
 552 directory nodes (in which apply_delta might be adding or removing child
 553 entries from a serialized table).
 554
 555 Note that any uncoordinated write has the potential to lose data. We must do
 556 more analysis to be sure, but it appears that two clients who write to the
 557 same mutable file at the same time (even if both eventually retry) will, with
 558 high probability, result in one client observing UCW and the other silently
 559 losing their changes. It is also possible for both clients to observe UCW.
 560 The moral of the story is that the Prime Coordination Directive is there for
 561 a reason, and that recovery/UCW/retry is not a subsitute for write
 562 coordination.
 563
 564 overwrite() tells the client to ignore this cached version information, and
 565 to unconditionally replace the mutable file's contents with the new data.
 566 This should not be used in delta application, but rather in situations where
 567 you want to replace the file's contents with completely unrelated ones. When
 568 raw files are uploaded into a mutable slot through the Tahoe-LAFS web-API
 569 (using POST and the ?mutable=true argument), they are put in place with
 570 overwrite().
 571
 572 The peer-selection and data-structure manipulation (and signing/verification)
 573 steps will be implemented in a separate class in allmydata/mutable.py .
 574
 575 SMDF Slot Format
 576 ----------------
 577
 578 This SMDF data lives inside a server-side MutableSlot container. The server
 579 is oblivious to this format.
 580
 581 This data is tightly packed. In particular, the share data is defined to run
 582 all the way to the beginning of the encrypted private key (the encprivkey
 583 offset is used both to terminate the share data and to begin the encprivkey).
 584
 585 ::
 586
 587   #    offset   size    name
 588   1    0        1       version byte, \x00 for this format
 589   2    1        8       sequence number. 2^64-1 must be handled specially, TBD
 590   3    9        32      "R" (root of share hash Merkle tree)
 591   4    41       16      IV (share data is AES(H(readkey+IV)) )
 592   5    57       18      encoding parameters:
 593         57       1        k
 594         58       1        N
 595         59       8        segment size
 596         67       8        data length (of original plaintext)
 597   6    75       32      offset table:
 598         75       4        (8) signature
 599         79       4        (9) share hash chain
 600         83       4        (10) block hash tree
 601         87       4        (11) share data
 602         91       8        (12) encrypted private key
 603         99       8        (13) EOF
 604   7    107      436ish  verification key (2048 RSA key)
 605   8    543ish   256ish  signature=RSAsign(sigkey, H(version+seqnum+r+IV+encparm))
 606   9    799ish   (a)     share hash chain, encoded as:
 607                          "".join([pack(">H32s", shnum, hash)
 608                                   for (shnum,hash) in needed_hashes])
 609  10    (927ish) (b)     block hash tree, encoded as:
 610                          "".join([pack(">32s",hash) for hash in block_hash_tree])
 611  11    (935ish) LEN     share data (no gap between this and encprivkey)
 612  12    ??       1216ish encrypted private key= AESenc(write-key, RSA-key)
 613  13    ??       --      EOF
 614
 615  (a) The share hash chain contains ceil(log(N)) hashes, each 32 bytes long.
 616     This is the set of hashes necessary to validate this share's leaf in the
 617     share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes.
 618  (b) The block hash tree contains ceil(length/segsize) hashes, each 32 bytes
 619     long. This is the set of hashes necessary to validate any given block of
 620     share data up to the per-share root "r". Each "r" is a leaf of the share
 621     has tree (with root "R"), from which a minimal subset of hashes is put in
 622     the share hash chain in (8).
 623
 624 Recovery
 625 --------
 626
 627 The first line of defense against damage caused by colliding writes is the
 628 Prime Coordination Directive: "Don't Do That".
 629
 630 The second line of defense is to keep "S" (the number of competing versions)
 631 lower than N/k. If this holds true, at least one competing version will have
 632 k shares and thus be recoverable. Note that server unavailability counts
 633 against us here: the old version stored on the unavailable server must be
 634 included in the value of S.
 635
 636 The third line of defense is our use of testv_and_writev() (described below),
 637 which increases the convergence of simultaneous writes: one of the writers
 638 will be favored (the one with the highest "R"), and that version is more
 639 likely to be accepted than the others. This defense is least effective in the
 640 pathological situation where S simultaneous writers are active, the one with
 641 the lowest "R" writes to N-k+1 of the shares and then dies, then the one with
 642 the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the
 643 one with the highest "R" writes to k-1 shares and dies. Any other sequencing
 644 will allow the highest "R" to write to at least k shares and establish a new
 645 revision.
 646
 647 The fourth line of defense is the fact that each client keeps writing until
 648 at least one version has N shares. This uses additional servers, if
 649 necessary, to make sure that either the client's version or some
 650 newer/overriding version is highly available.
 651
 652 The fifth line of defense is the recovery algorithm, which seeks to make sure
 653 that at least *one* version is highly available, even if that version is
 654 somebody else's.
 655
 656 The write-shares-to-peers algorithm is as follows:
 657
 658 * permute peers according to storage index
 659 * walk through peers, trying to assign one share per peer
 660 * for each peer:
 661
 662   * send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test
 663
 664     * this means that we will overwrite any old versions, and we will
 665       overwrite simultaenous writers of the same version if our R is higher.
 666       We will not overwrite writers using a higher seqnum.
 667
 668   * record the version that each share winds up with. If the write was
 669     accepted, this is our own version. If it was rejected, read the
 670     old_test_data to find out what version was retained.
 671   * if old_test_data indicates the seqnum was equal or greater than our
 672     own, mark the "Simultanous Writes Detected" flag, which will eventually
 673     result in an error being reported to the writer (in their close() call).
 674   * build a histogram of "R" values
 675   * repeat until the histogram indicate that some version (possibly ours)
 676     has N shares. Use new servers if necessary.
 677   * If we run out of servers:
 678
 679     * if there are at least shares-of-happiness of any one version, we're
 680       happy, so return. (the close() might still get an error)
 681     * not happy, need to reinforce something, goto RECOVERY
 682
 683 Recovery:
 684
 685 * read all shares, count the versions, identify the recoverable ones,
 686   discard the unrecoverable ones.
 687 * sort versions: locate max(seqnums), put all versions with that seqnum
 688   in the list, sort by number of outstanding shares. Then put our own
 689   version. (TODO: put versions with seqnum <max but >us ahead of us?).
 690 * for each version:
 691
 692   * attempt to recover that version
 693   * if not possible, remove it from the list, go to next one
 694   * if recovered, start at beginning of peer list, push that version,
 695     continue until N shares are placed
 696   * if pushing our own version, bump up the seqnum to one higher than
 697     the max seqnum we saw
 698   * if we run out of servers:
 699
 700     * schedule retry and exponential backoff to repeat RECOVERY
 701
 702   * admit defeat after some period? presumeably the client will be shut down
 703     eventually, maybe keep trying (once per hour?) until then.
 704
 705
 706 Medium Distributed Mutable Files
 707 ================================
 708
 709 These are just like the SDMF case, but:
 710
 711 * We actually take advantage of the Merkle hash tree over the blocks, by
 712   reading a single segment of data at a time (and its necessary hashes), to
 713   reduce the read-time alacrity.
 714 * We allow arbitrary writes to any range of the file.
 715 * We add more code to first read each segment that a write must modify.
 716   This looks exactly like the way a normal filesystem uses a block device,
 717   or how a CPU must perform a cache-line fill before modifying a single word.
 718 * We might implement some sort of copy-based atomic update server call,
 719   to allow multiple writev() calls to appear atomic to any readers.
 720
 721 MDMF slots provide fairly efficient in-place edits of very large files (a few
 722 GB). Appending data is also fairly efficient.
 723
 724
 725 Large Distributed Mutable Files
 726 ===============================
 727
 728 LDMF slots (not implemented) would use a fundamentally different way to store
 729 the file, inspired by Mercurial's "revlog" format. This would enable very
 730 efficient insert/remove/replace editing of arbitrary spans. Multiple versions
 731 of the file can be retained, in a revision graph that can have multiple heads.
 732 Each revision can be referenced by a cryptographic identifier. There are two
 733 forms of the URI, one that means "most recent version", and a longer one that
 734 points to a specific revision.
 735
 736 Metadata can be attached to the revisions, like timestamps, to enable rolling
 737 back an entire tree to a specific point in history.
 738
 739 LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2
 740 provides explicit support for revision identifiers and branching.
 741
 742
 743 TODO
 744 ====
 745
 746 improve allocate-and-write or get-writer-buckets API to allow one-call (or
 747 maybe two-call) updates. The challenge is in figuring out which shares are on
 748 which machines. First cut will have lots of round trips.
 749
 750 (eventually) define behavior when seqnum wraps. At the very least make sure
 751 it can't cause a security problem. "the slot is worn out" is acceptable.
 752
 753 (eventually) define share-migration lease update protocol. Including the
 754 nodeid who accepted the lease is useful, we can use the same protocol as we
 755 do for updating the write enabler. However we need to know which lease to
 756 update.. maybe send back a list of all old nodeids that we find, then try all
 757 of them when we accept the update?
 758
 759 We now do this in a specially-formatted IndexError exception:
 760  "UNABLE to renew non-existent lease. I have leases accepted by " +
 761  "nodeids: '12345','abcde','44221' ."
 762
 763 confirm that a repairer can regenerate shares without the private key. Hmm,
 764 without the write-enabler they won't be able to write those shares to the
 765 servers.. although they could add immutable new shares to new servers.