docs/specifications/dirnodes.rst

   1 .. -*- coding: utf-8-with-signature -*-
   2
   3 ==========================
   4 Tahoe-LAFS Directory Nodes
   5 ==========================
   6
   7 As explained in the architecture docs, Tahoe-LAFS can be roughly viewed as
   8 a collection of three layers. The lowest layer is the key-value store: it
   9 provides operations that accept files and upload them to the grid, creating
  10 a URI in the process which securely references the file's contents.
  11 The middle layer is the filesystem, creating a structure of directories and
  12 filenames resembling the traditional unix/windows filesystems. The top layer
  13 is the application layer, which uses the lower layers to provide useful
  14 services to users, like a backup application, or a way to share files with
  15 friends.
  16
  17 This document examines the middle layer, the "filesystem".
  18
  19 1.  `Key-value Store Primitives`_
  20 2.  `Filesystem goals`_
  21 3.  `Dirnode goals`_
  22 4.  `Dirnode secret values`_
  23 5.  `Dirnode storage format`_
  24 6.  `Dirnode sizes, mutable-file initial read sizes`_
  25 7.  `Design Goals, redux`_
  26
  27     1. `Confidentiality leaks in the storage servers`_
  28     2. `Integrity failures in the storage servers`_
  29     3. `Improving the efficiency of dirnodes`_
  30     4. `Dirnode expiration and leases`_
  31
  32 8.  `Starting Points: root dirnodes`_
  33 9.  `Mounting and Sharing Directories`_
  34 10. `Revocation`_
  35
  36 Key-value Store Primitives
  37 ==========================
  38
  39 In the lowest layer (key-value store), there are two operations that reference
  40 immutable data (which we refer to as "CHK URIs" or "CHK read-capabilities" or
  41 "CHK read-caps"). One puts data into the grid (but only if it doesn't exist
  42 already), the other retrieves it::
  43
  44  chk_uri = put(data)
  45  data = get(chk_uri)
  46
  47 We also have three operations which reference mutable data (which we refer to
  48 as "mutable slots", or "mutable write-caps and read-caps", or sometimes "SSK
  49 slots"). One creates a slot with some initial contents, a second replaces the
  50 contents of a pre-existing slot, and the third retrieves the contents::
  51
  52  mutable_uri = create(initial_data)
  53  replace(mutable_uri, new_data)
  54  data = get(mutable_uri)
  55
  56 Filesystem Goals
  57 ================
  58
  59 The main goal for the middle (filesystem) layer is to give users a way to
  60 organize the data that they have uploaded into the grid. The traditional way
  61 to do this in computer filesystems is to put this data into files, give those
  62 files names, and collect these names into directories.
  63
  64 Each directory is a set of name-entry pairs, each of which maps a "child name"
  65 to a directory entry pointing to an object of some kind. Those child objects
  66 might be files, or they might be other directories. Each directory entry also
  67 contains metadata.
  68
  69 The directory structure is therefore a directed graph of nodes, in which each
  70 node might be a directory node or a file node. All file nodes are terminal
  71 nodes.
  72
  73 Dirnode Goals
  74 =============
  75
  76 What properties might be desirable for these directory nodes? In no
  77 particular order:
  78
  79 1. functional. Code which does not work doesn't count.
  80 2. easy to document, explain, and understand
  81 3. confidential: it should not be possible for others to see the contents of
  82    a directory
  83 4. integrity: it should not be possible for others to modify the contents
  84    of a directory
  85 5. available: directories should survive host failure, just like files do
  86 6. efficient: in storage, communication bandwidth, number of round-trips
  87 7. easy to delegate individual directories in a flexible way
  88 8. updateness: everybody looking at a directory should see the same contents
  89 9. monotonicity: everybody looking at a directory should see the same
  90    sequence of updates
  91
  92 Some of these goals are mutually exclusive. For example, availability and
  93 consistency are opposing, so it is not possible to achieve #5 and #8 at the
  94 same time. Moreover, it takes a more complex architecture to get close to the
  95 available-and-consistent ideal, so #2/#6 is in opposition to #5/#8.
  96
  97 Tahoe-LAFS v0.7.0 introduced distributed mutable files, which use public-key
  98 cryptography for integrity, and erasure coding for availability. These
  99 achieve roughly the same properties as immutable CHK files, but their
 100 contents can be replaced without changing their identity. Dirnodes are then
 101 just a special way of interpreting the contents of a specific mutable file.
 102 Earlier releases used a "vdrive server": this server was abolished in the
 103 v0.7.0 release.
 104
 105 For details of how mutable files work, please see mutable.rst_ in this
 106 directory.
 107
 108 For releases since v0.7.0, we achieve most of our desired properties. The
 109 integrity and availability of dirnodes is equivalent to that of regular
 110 (immutable) files, with the exception that there are more simultaneous-update
 111 failure modes for mutable slots. Delegation is quite strong: you can give
 112 read-write or read-only access to any subtree, and the data format used for
 113 dirnodes is such that read-only access is transitive: i.e. if you grant Bob
 114 read-only access to a parent directory, then Bob will get read-only access
 115 (and *not* read-write access) to its children.
 116
 117 Relative to the previous "vdrive-server" based scheme, the current
 118 distributed dirnode approach gives better availability, but cannot guarantee
 119 updateness quite as well, and requires far more network traffic for each
 120 retrieval and update. Mutable files are somewhat less available than
 121 immutable files, simply because of the increased number of combinations
 122 (shares of an immutable file are either present or not, whereas there are
 123 multiple versions of each mutable file, and you might have some shares of
 124 version 1 and other shares of version 2). In extreme cases of simultaneous
 125 update, mutable files might suffer from non-monotonicity.
 126
 127 .. _mutable.rst: mutable.rst
 128
 129
 130 Dirnode secret values
 131 =====================
 132
 133 As mentioned before, dirnodes are simply a special way to interpret the
 134 contents of a mutable file, so the secret keys and capability strings
 135 described in mutable.rst_ are all the same. Each dirnode contains an RSA
 136 public/private keypair, and the holder of the "write capability" will be able
 137 to retrieve the private key (as well as the AES encryption key used for the
 138 data itself). The holder of the "read capability" will be able to obtain the
 139 public key and the AES data key, but not the RSA private key needed to modify
 140 the data.
 141
 142 The "write capability" for a dirnode grants read-write access to its
 143 contents. This is expressed on concrete form as the "dirnode write cap": a
 144 printable string which contains the necessary secrets to grant this access.
 145 Likewise, the "read capability" grants read-only access to a dirnode, and can
 146 be represented by a "dirnode read cap" string.
 147
 148 For example,
 149 URI:DIR2:swdi8ge1s7qko45d3ckkyw1aac%3Aar8r5j99a4mezdojejmsfp4fj1zeky9gjigyrid4urxdimego68o
 150 is a write-capability URI, while
 151 URI:DIR2-RO:buxjqykt637u61nnmjg7s8zkny:ar8r5j99a4mezdojejmsfp4fj1zeky9gjigyrid4urxdimego68o
 152 is a read-capability URI, both for the same dirnode.
 153
 154
 155 Dirnode storage format
 156 ======================
 157
 158 Each dirnode is stored in a single mutable file, distributed in the Tahoe-LAFS
 159 grid. The contents of this file are a serialized list of netstrings, one per
 160 child. Each child is a list of four netstrings: (name, rocap, rwcap,
 161 metadata). (Remember that the contents of the mutable file are encrypted by
 162 the read-cap, so this section describes the plaintext contents of the mutable
 163 file, *after* it has been decrypted by the read-cap.)
 164
 165 The name is simple a UTF-8 -encoded child name. The 'rocap' is a read-only
 166 capability URI to that child, either an immutable (CHK) file, a mutable file,
 167 or a directory. It is also possible to store 'unknown' URIs that are not
 168 recognized by the current version of Tahoe-LAFS. The 'rwcap' is a read-write
 169 capability URI for that child, encrypted with the dirnode's write-cap: this
 170 enables the "transitive readonlyness" property, described further below. The
 171 'metadata' is a JSON-encoded dictionary of type,value metadata pairs. Some
 172 metadata keys are pre-defined, the rest are left up to the application.
 173
 174 Each rwcap is stored as IV + ciphertext + MAC. The IV is a 16-byte random
 175 value. The ciphertext is obtained by using AES in CTR mode on the rwcap URI
 176 string, using a key that is formed from a tagged hash of the IV and the
 177 dirnode's writekey. The MAC is written only for compatibility with older
 178 Tahoe-LAFS versions and is no longer verified.
 179
 180 If Bob has read-only access to the 'bar' directory, and he adds it as a child
 181 to the 'foo' directory, then he will put the read-only cap for 'bar' in both
 182 the rwcap and rocap slots (encrypting the rwcap contents as described above).
 183 If he has full read-write access to 'bar', then he will put the read-write
 184 cap in the 'rwcap' slot, and the read-only cap in the 'rocap' slot. Since
 185 other users who have read-only access to 'foo' will be unable to decrypt its
 186 rwcap slot, this limits those users to read-only access to 'bar' as well,
 187 thus providing the transitive readonlyness that we desire.
 188
 189 Dirnode sizes, mutable-file initial read sizes
 190 ==============================================
 191
 192 How big are dirnodes? When reading dirnode data out of mutable files, how
 193 large should our initial read be? If we guess exactly, we can read a dirnode
 194 in a single round-trip, and update one in two RTT. If we guess too high,
 195 we'll waste some amount of bandwidth. If we guess low, we need to make a
 196 second pass to get the data (or the encrypted privkey, for writes), which
 197 will cost us at least another RTT.
 198
 199 Assuming child names are between 10 and 99 characters long, how long are the
 200 various pieces of a dirnode?
 201
 202 ::
 203
 204  netstring(name) ~= 4+len(name)
 205  chk-cap = 97 (for 4-char filesizes)
 206  dir-rw-cap = 88
 207  dir-ro-cap = 91
 208  netstring(cap) = 4+len(cap)
 209  encrypted(cap) = 16+cap+32
 210  JSON({}) = 2
 211  JSON({ctime=float,mtime=float,'tahoe':{linkcrtime=float,linkmotime=float}}): 137
 212  netstring(metadata) = 4+137 = 141
 213
 214 so a CHK entry is::
 215
 216  5+ 4+len(name) + 4+97 + 5+16+97+32 + 4+137
 217
 218 And a 15-byte filename gives a 416-byte entry. When the entry points at a
 219 subdirectory instead of a file, the entry is a little bit smaller. So an
 220 empty directory uses 0 bytes, a directory with one child uses about 416
 221 bytes, a directory with two children uses about 832, etc.
 222
 223 When the dirnode data is encoding using our default 3-of-10, that means we
 224 get 139ish bytes of data in each share per child.
 225
 226 The pubkey, signature, and hashes form the first 935ish bytes of the
 227 container, then comes our data, then about 1216 bytes of encprivkey. So if we
 228 read the first::
 229
 230  1kB: we get 65bytes of dirnode data : only empty directories
 231  2kB: 1065bytes: about 8
 232  3kB: 2065bytes: about 15 entries, or 6 entries plus the encprivkey
 233  4kB: 3065bytes: about 22 entries, or about 13 plus the encprivkey
 234
 235 So we've written the code to do an initial read of 4kB from each share when
 236 we read the mutable file, which should give good performance (one RTT) for
 237 small directories.
 238
 239
 240 Design Goals, redux
 241 ===================
 242
 243 How well does this design meet the goals?
 244
 245 1. functional: YES: the code works and has extensive unit tests
 246 2. documentable: YES: this document is the existence proof
 247 3. confidential: YES: see below
 248 4. integrity: MOSTLY: a coalition of storage servers can rollback individual
 249    mutable files, but not a single one. No server can
 250    substitute fake data as genuine.
 251 5. availability: YES: as long as 'k' storage servers are present and have
 252    the same version of the mutable file, the dirnode will
 253    be available.
 254 6. efficient: MOSTLY:
 255      network: single dirnode lookup is very efficient, since clients can
 256        fetch specific keys rather than being required to get or set
 257        the entire dirnode each time. Traversing many directories
 258        takes a lot of roundtrips, and these can't be collapsed with
 259        promise-pipelining because the intermediate values must only
 260        be visible to the client. Modifying many dirnodes at once
 261        (e.g. importing a large pre-existing directory tree) is pretty
 262        slow, since each graph edge must be created independently.
 263      storage: each child has a separate IV, which makes them larger than
 264        if all children were aggregated into a single encrypted string
 265 7. delegation: VERY: each dirnode is a completely independent object,
 266    to which clients can be granted separate read-write or
 267    read-only access
 268 8. updateness: VERY: with only a single point of access, and no caching,
 269    each client operation starts by fetching the current
 270    value, so there are no opportunities for staleness
 271 9. monotonicity: VERY: the single point of access also protects against
 272    retrograde motion
 273
 274
 275
 276 Confidentiality leaks in the storage servers
 277 --------------------------------------------
 278
 279 Dirnode (and the mutable files upon which they are based) are very private
 280 against other clients: traffic between the client and the storage servers is
 281 protected by the Foolscap SSL connection, so they can observe very little.
 282 Storage index values are hashes of secrets and thus unguessable, and they are
 283 not made public, so other clients cannot snoop through encrypted dirnodes
 284 that they have not been told about.
 285
 286 Storage servers can observe access patterns and see ciphertext, but they
 287 cannot see the plaintext (of child names, metadata, or URIs). If an attacker
 288 operates a significant number of storage servers, they can infer the shape of
 289 the directory structure by assuming that directories are usually accessed
 290 from root to leaf in rapid succession. Since filenames are usually much
 291 shorter than read-caps and write-caps, the attacker can use the length of the
 292 ciphertext to guess the number of children of each node, and might be able to
 293 guess the length of the child names (or at least their sum). From this, the
 294 attacker may be able to build up a graph with the same shape as the plaintext
 295 filesystem, but with unlabeled edges and unknown file contents.
 296
 297
 298 Integrity failures in the storage servers
 299 -----------------------------------------
 300
 301 The mutable file's integrity mechanism (RSA signature on the hash of the file
 302 contents) prevents the storage server from modifying the dirnode's contents
 303 without detection. Therefore the storage servers can make the dirnode
 304 unavailable, but not corrupt it.
 305
 306 A sufficient number of colluding storage servers can perform a rollback
 307 attack: replace all shares of the whole mutable file with an earlier version.
 308 To prevent this, when retrieving the contents of a mutable file, the
 309 client queries more servers than necessary and uses the highest available
 310 version number. This insures that one or two misbehaving storage servers
 311 cannot cause this rollback on their own.
 312
 313
 314 Improving the efficiency of dirnodes
 315 ------------------------------------
 316
 317 The current mutable-file -based dirnode scheme suffers from certain
 318 inefficiencies. A very large directory (with thousands or millions of
 319 children) will take a significant time to extract any single entry, because
 320 the whole file must be downloaded first, then parsed and searched to find the
 321 desired child entry. Likewise, modifying a single child will require the
 322 whole file to be re-uploaded.
 323
 324 The current design assumes (and in some cases, requires) that dirnodes remain
 325 small. The mutable files on which dirnodes are based are currently using
 326 "SDMF" ("Small Distributed Mutable File") design rules, which state that the
 327 size of the data shall remain below one megabyte. More advanced forms of
 328 mutable files (MDMF and LDMF) are in the design phase to allow efficient
 329 manipulation of larger mutable files. This would reduce the work needed to
 330 modify a single entry in a large directory.
 331
 332 Judicious caching may help improve the reading-large-directory case. Some
 333 form of mutable index at the beginning of the dirnode might help as well. The
 334 MDMF design rules allow for efficient random-access reads from the middle of
 335 the file, which would give the index something useful to point at.
 336
 337 The current SDMF design generates a new RSA public/private keypair for each
 338 directory. This takes considerable time and CPU effort, generally one or two
 339 seconds per directory. We have designed (but not yet built) a DSA-based
 340 mutable file scheme which will use shared parameters to reduce the
 341 directory-creation effort to a bare minimum (picking a random number instead
 342 of generating two random primes).
 343
 344 When a backup program is run for the first time, it needs to copy a large
 345 amount of data from a pre-existing filesystem into reliable storage. This
 346 means that a large and complex directory structure needs to be duplicated in
 347 the dirnode layer. With the one-object-per-dirnode approach described here,
 348 this requires as many operations as there are edges in the imported
 349 filesystem graph.
 350
 351 Another approach would be to aggregate multiple directories into a single
 352 storage object. This object would contain a serialized graph rather than a
 353 single name-to-child dictionary. Most directory operations would fetch the
 354 whole block of data (and presumeably cache it for a while to avoid lots of
 355 re-fetches), and modification operations would need to replace the whole
 356 thing at once. This "realm" approach would have the added benefit of
 357 combining more data into a single encrypted bundle (perhaps hiding the shape
 358 of the graph from a determined attacker), and would reduce round-trips when
 359 performing deep directory traversals (assuming the realm was already cached).
 360 It would also prevent fine-grained rollback attacks from working: a coalition
 361 of storage servers could change the entire realm to look like an earlier
 362 state, but it could not independently roll back individual directories.
 363
 364 The drawbacks of this aggregation would be that small accesses (adding a
 365 single child, looking up a single child) would require pulling or pushing a
 366 lot of unrelated data, increasing network overhead (and necessitating
 367 test-and-set semantics for the modification side, which increases the chances
 368 that a user operation will fail, making it more challenging to provide
 369 promises of atomicity to the user).
 370
 371 It would also make it much more difficult to enable the delegation
 372 ("sharing") of specific directories. Since each aggregate "realm" provides
 373 all-or-nothing access control, the act of delegating any directory from the
 374 middle of the realm would require the realm first be split into the upper
 375 piece that isn't being shared and the lower piece that is. This splitting
 376 would have to be done in response to what is essentially a read operation,
 377 which is not traditionally supposed to be a high-effort action. On the other
 378 hand, it may be possible to aggregate the ciphertext, but use distinct
 379 encryption keys for each component directory, to get the benefits of both
 380 schemes at once.
 381
 382
 383 Dirnode expiration and leases
 384 -----------------------------
 385
 386 Dirnodes are created any time a client wishes to add a new directory. How
 387 long do they live? What's to keep them from sticking around forever, taking
 388 up space that nobody can reach any longer?
 389
 390 Mutable files are created with limited-time "leases", which keep the shares
 391 alive until the last lease has expired or been cancelled. Clients which know
 392 and care about specific dirnodes can ask to keep them alive for a while, by
 393 renewing a lease on them (with a typical period of one month). Clients are
 394 expected to assist in the deletion of dirnodes by canceling their leases as
 395 soon as they are done with them. This means that when a client unlinks a
 396 directory, it should also cancel its lease on that directory. When the lease
 397 count on a given share goes to zero, the storage server can delete the
 398 related storage. Multiple clients may all have leases on the same dirnode:
 399 the server may delete the shares only after all of the leases have gone away.
 400
 401 We expect that clients will periodically create a "manifest": a list of
 402 so-called "refresh capabilities" for all of the dirnodes and files that they
 403 can reach. They will give this manifest to the "repairer", which is a service
 404 that keeps files (and dirnodes) alive on behalf of clients who cannot take on
 405 this responsibility for themselves. These refresh capabilities include the
 406 storage index, but do *not* include the readkeys or writekeys, so the
 407 repairer does not get to read the files or directories that it is helping to
 408 keep alive.
 409
 410 After each change to the user's vdrive, the client creates a manifest and
 411 looks for differences from their previous version. Anything which was removed
 412 prompts the client to send out lease-cancellation messages, allowing the data
 413 to be deleted.
 414
 415
 416 Starting Points: root dirnodes
 417 ==============================
 418
 419 Any client can record the URI of a directory node in some external form (say,
 420 in a local file) and use it as the starting point of later traversal. Each
 421 Tahoe-LAFS user is expected to create a new (unattached) dirnode when they first
 422 start using the grid, and record its URI for later use.
 423
 424 Mounting and Sharing Directories
 425 ================================
 426
 427 The biggest benefit of this dirnode approach is that sharing individual
 428 directories is almost trivial. Alice creates a subdirectory that she wants to
 429 use to share files with Bob. This subdirectory is attached to Alice's
 430 filesystem at "~alice/share-with-bob". She asks her filesystem for the
 431 read-write directory URI for that new directory, and emails it to Bob. When
 432 Bob receives the URI, he asks his own local vdrive to attach the given URI,
 433 perhaps at a place named "~bob/shared-with-alice". Every time either party
 434 writes a file into this directory, the other will be able to read it. If
 435 Alice prefers, she can give a read-only URI to Bob instead, and then Bob will
 436 be able to read files but not change the contents of the directory. Neither
 437 Alice nor Bob will get access to any files above the mounted directory: there
 438 are no 'parent directory' pointers. If Alice creates a nested set of
 439 directories, "~alice/share-with-bob/subdir2", and gives a read-only URI to
 440 share-with-bob to Bob, then Bob will be unable to write to either
 441 share-with-bob/ or subdir2/.
 442
 443 A suitable UI needs to be created to allow users to easily perform this
 444 sharing action: dragging a folder their vdrive to an IM or email user icon,
 445 for example. The UI will need to give the sending user an opportunity to
 446 indicate whether they want to grant read-write or read-only access to the
 447 recipient. The recipient then needs an interface to drag the new folder into
 448 their vdrive and give it a home.
 449
 450 Revocation
 451 ==========
 452
 453 When Alice decides that she no longer wants Bob to be able to access the
 454 shared directory, what should she do? Suppose she's shared this folder with
 455 both Bob and Carol, and now she wants Carol to retain access to it but Bob to
 456 be shut out. Ideally Carol should not have to do anything: her access should
 457 continue unabated.
 458
 459 The current plan is to have her client create a deep copy of the folder in
 460 question, delegate access to the new folder to the remaining members of the
 461 group (Carol), asking the lucky survivors to replace their old reference with
 462 the new one. Bob may still have access to the old folder, but he is now the
 463 only one who cares: everyone else has moved on, and he will no longer be able
 464 to see their new changes. In a strict sense, this is the strongest form of
 465 revocation that can be accomplished: there is no point trying to force Bob to
 466 forget about the files that he read a moment before being kicked out. In
 467 addition it must be noted that anyone who can access the directory can proxy
 468 for Bob, reading files to him and accepting changes whenever he wants.
 469 Preventing delegation between communication parties is just as pointless as
 470 asking Bob to forget previously accessed files. However, there may be value
 471 to configuring the UI to ask Carol to not share files with Bob, or to
 472 removing all files from Bob's view at the same time his access is revoked.
 473