docs/architecture.txt

   1
   2    Tahoe-LAFS Architecture
   3
   4 (See the docs/specifications directory for more details.)
   5
   6 OVERVIEW
   7
   8 At a high-level this system consists of three layers: the key-value store,
   9 the filesystem, and the application.
  10
  11 The lowest layer is the key-value store, which is a distributed hashtable
  12 mapping from capabilities to data.  The capabilities are relatively short
  13 ASCII strings, each used as a reference to an arbitrary-length sequence of
  14 data bytes, and are like a URI for that data. This data is encrypted and
  15 distributed across a number of nodes, such that it will survive the loss of
  16 most of the nodes.
  17
  18 The middle layer is the decentralized filesystem: a directed graph in which
  19 the intermediate nodes are directories and the leaf nodes are files. The leaf
  20 nodes contain only the file data -- they contain no metadata about the file
  21 other than the length in bytes.  The edges leading to leaf nodes have metadata
  22 attached to them about the file they point to.  Therefore, the same file may
  23 be associated with different metadata if it is dereferenced through different
  24 edges.
  25
  26 The top layer consists of the applications using the filesystem.
  27 Allmydata.com uses it for a backup service: the application periodically
  28 copies files from the local disk onto the decentralized filesystem.  We later
  29 provide read-only access to those files, allowing users to recover them.  The
  30 filesystem can be used by other applications, too.
  31
  32
  33 THE GRID OF STORAGE SERVERS
  34
  35 A key-value store is implemented by a collection of peer nodes -- processes
  36 running on computers -- called a "grid". (The term "grid" is also used
  37 loosely for the filesystem supported by these nodes.) The nodes in a grid
  38 establish TCP connections to each other using Foolscap, a secure
  39 remote-message-passing library.
  40
  41 Each node offers certain services to the others. The primary service is that
  42 of the storage server, which holds data in the form of "shares".  Shares are
  43 encoded pieces of files.  There are a configurable number of shares for each
  44 file, 10 by default.  Normally, each share is stored on a separate server, but
  45 a single server can hold multiple shares for a single file.
  46
  47 Nodes learn about each other through an "introducer". Each node connects to a
  48 central introducer at startup, and receives a list of all other nodes from
  49 it. Each node then connects to all other nodes, creating a fully-connected
  50 topology.  In the current release, nodes behind NAT boxes will connect to all
  51 nodes that they can open connections to, but they cannot open connections to
  52 other nodes behind NAT boxes.  Therefore, the more nodes behind NAT boxes, the
  53 less the topology resembles the intended fully-connected topology.
  54
  55 The introducer in nominally a single point of failure, in that clients who
  56 never see the introducer will be unable to connect to any storage servers.
  57 But once a client has been introduced to everybody, they do not need the
  58 introducer again until they are restarted. The danger of a SPOF is further
  59 reduced in other ways. First, the introducer is defined by a hostname and a
  60 private key, which are easy to move to a new host in case the original one
  61 suffers an unrecoverable hardware problem. Second, even if the private key is
  62 lost, clients can be reconfigured with a new introducer.furl that points to a
  63 new one. Finally, we have plans to decentralize introduction, allowing any
  64 node to tell a new client about all the others. With decentralized
  65 "gossip-based" introduction, simply knowing how to contact any one node will
  66 be enough to contact all of them.
  67
  68
  69 FILE ENCODING
  70
  71 When a node stores a file on its grid, it first encrypts the file, using a key
  72 that is optionally derived from the hash of the file itself.  It then segments
  73 the encrypted file into small pieces, in order to reduce the memory footprint,
  74 and to decrease the lag between initiating a download and receiving the first
  75 part of the file; for example the lag between hitting "play" and a movie
  76 actually starting.
  77
  78 The node then erasure-codes each segment, producing blocks such that only a
  79 subset of them are needed to reconstruct the segment. It sends one block from
  80 each segment to a given server. The set of blocks on a given server
  81 constitutes a "share". Only a subset of the shares (3 out of 10, by default)
  82 are needed to reconstruct the file.
  83
  84 A tagged hash of the encryption key is used to form the "storage index", which
  85 is used for both server selection (described below) and to index shares within
  86 the Storage Servers on the selected nodes.
  87
  88 Hashes are computed while the shares are being produced, to validate the
  89 ciphertext and the shares themselves. Merkle hash trees are used to enable
  90 validation of individual segments of ciphertext without requiring the
  91 download/decoding of the whole file. These hashes go into the "Capability
  92 Extension Block", which will be stored with each share.
  93
  94 The capability contains the encryption key, the hash of the Capability
  95 Extension Block, and any encoding parameters necessary to perform the eventual
  96 decoding process.  For convenience, it also contains the size of the file
  97 being stored.
  98
  99
 100 On the download side, the node that wishes to turn a capability into a
 101 sequence of bytes will obtain the necessary shares from remote nodes, break
 102 them into blocks, use erasure-decoding to turn them into segments of
 103 ciphertext, use the decryption key to convert that into plaintext, then emit
 104 the plaintext bytes to the output target (which could be a file on disk, or it
 105 could be streamed directly to a web browser or media player).
 106
 107 All hashes use SHA-256, and a different tag is used for each purpose.
 108 Netstrings are used where necessary to insure these tags cannot be confused
 109 with the data to be hashed. All encryption uses AES in CTR mode. The erasure
 110 coding is performed with zfec.
 111
 112 A Merkle Hash Tree is used to validate the encoded blocks before they are fed
 113 into the decode process, and a transverse tree is used to validate the shares
 114 as they are retrieved. A third merkle tree is constructed over the plaintext
 115 segments, and a fourth is constructed over the ciphertext segments.  All
 116 necessary hashes are stored with the shares, and the hash tree roots are put
 117 in the Capability Extension Block. The final hash of the extension block goes
 118 into the capability itself.
 119
 120 Note that the number of shares created is fixed at the time the file is
 121 uploaded: it is not possible to create additional shares later. The use of a
 122 top-level hash tree also requires that nodes create all shares at once, even
 123 if they don't intend to upload some of them, otherwise the hashroot cannot be
 124 calculated correctly.
 125
 126
 127 CAPABILITIES
 128
 129 Capabilities to immutable files represent a specific set of bytes.  Think of
 130 it like a hash function: you feed in a bunch of bytes, and you get out a
 131 capability, which is deterministically derived from the input data: changing
 132 even one bit of the input data will result in a completely different
 133 capability.
 134
 135 Read-only capabilities to mutable files represent the ability to get a set of
 136 bytes representing some version of the file, most likely the latest version.
 137 Each read-only capability is unique. In fact, each mutable file has a unique
 138 public/private key pair created when the mutable file is created, and the
 139 read-only capability to that file includes a secure hash of the public key.
 140
 141 Read-write capabilities to mutable files represent the ability to read the
 142 file (just like a read-only capability) and also to write a new version of
 143 the file, overwriting any extant version. Read-write capabilities are unique
 144 -- each one includes the secure hash of the private key associated with that
 145 mutable file.
 146
 147 The capability provides both "location" and "identification": you can use it
 148 to retrieve a set of bytes, and then you can use it to validate ("identify")
 149 that these potential bytes are indeed the ones that you were looking for.
 150
 151 The "key-value store" layer is insufficient to provide a usable filesystem,
 152 which requires human-meaningful names. Capabilities sit on the
 153 "global+secure" edge of Zooko's Triangle[1]. They are self-authenticating,
 154 meaning that nobody can trick you into using a file that doesn't match the
 155 capability you used to refer to that file.
 156
 157
 158 SERVER SELECTION
 159
 160 When a file is uploaded, the encoded shares are sent to other nodes. But to
 161 which ones? The "server selection" algorithm is used to make this choice.
 162
 163 In the current version, the storage index is used to consistently-permute the
 164 set of all peer nodes (by sorting the peer nodes by
 165 HASH(storage_index+peerid)). Each file gets a different permutation, which
 166 (on average) will evenly distribute shares among the grid and avoid hotspots.
 167
 168 We use this permuted list of nodes to ask each node, in turn, if it will hold
 169 a share for us, by sending an 'allocate_buckets() query' to each one. Some
 170 will say yes, others (those who are full) will say no: when a node refuses our
 171 request, we just take that share to the next node on the list. We keep going
 172 until we run out of shares to place. At the end of the process, we'll have a
 173 table that maps each share number to a node, and then we can begin the
 174 encode+push phase, using the table to decide where each share should be sent.
 175
 176 Most of the time, this will result in one share per node, which gives us
 177 maximum reliability (since it disperses the failures as widely as possible).
 178 If there are fewer useable nodes than there are shares, we'll be forced to
 179 loop around, eventually giving multiple shares to a single node. This reduces
 180 reliability, so it isn't the sort of thing we want to happen all the time, and
 181 either indicates that the default encoding parameters are set incorrectly
 182 (creating more shares than you have nodes), or that the grid does not have
 183 enough space (many nodes are full). But apart from that, it doesn't hurt. If
 184 we have to loop through the node list a second time, we accelerate the query
 185 process, by asking each node to hold multiple shares on the second pass. In
 186 most cases, this means we'll never send more than two queries to any given
 187 node.
 188
 189 If a node is unreachable, or has an error, or refuses to accept any of our
 190 shares, we remove them from the permuted list, so we won't query them a second
 191 time for this file. If a node already has shares for the file we're uploading
 192 (or if someone else is currently sending them shares), we add that information
 193 to the share-to-peer-node table. This lets us do less work for files which have
 194 been uploaded once before, while making sure we still wind up with as many
 195 shares as we desire.
 196
 197 If we are unable to place every share that we want, but we still managed to
 198 place a quantity known as "shares of happiness", we'll do the upload anyways.
 199 If we cannot place at least this many, the upload is declared a failure.
 200
 201 The current defaults use k=3, shares_of_happiness=7, and N=10, meaning that
 202 we'll try to place 10 shares, we'll be happy if we can place 7, and we need to
 203 get back any 3 to recover the file. This results in a 3.3x expansion
 204 factor. In general, you should set N about equal to the number of nodes in
 205 your grid, then set N/k to achieve your desired availability goals.
 206
 207 When downloading a file, the current release just asks all known nodes for any
 208 shares they might have, chooses the minimal necessary subset, then starts
 209 downloading and processing those shares. A later release will use the full
 210 algorithm to reduce the number of queries that must be sent out. This
 211 algorithm uses the same consistent-hashing permutation as on upload, but stops
 212 after it has located k shares (instead of all N). This reduces the number of
 213 queries that must be sent before downloading can begin.
 214
 215 The actual number of queries is directly related to the availability of the
 216 nodes and the degree of overlap between the node list used at upload and at
 217 download. For stable grids, this overlap is very high, and usually the first k
 218 queries will result in shares. The number of queries grows as the stability
 219 decreases. Some limits may be imposed in large grids to avoid querying a
 220 million nodes; this provides a tradeoff between the work spent to discover
 221 that a file is unrecoverable and the probability that a retrieval will fail
 222 when it could have succeeded if we had just tried a little bit harder. The
 223 appropriate value of this tradeoff will depend upon the size of the grid, and
 224 will change over time.
 225
 226 Other peer-node selection algorithms are possible. One earlier version (known
 227 as "Tahoe 3") used the permutation to place the nodes around a large ring,
 228 distributed shares evenly around the same ring, then walks clockwise from 0
 229 with a basket: each time we encounter a share, put it in the basket, each time
 230 we encounter a node, give them as many shares from our basket as they'll
 231 accept. This reduced the number of queries (usually to 1) for small grids
 232 (where N is larger than the number of nodes), but resulted in extremely
 233 non-uniform share distribution, which significantly hurt reliability
 234 (sometimes the permutation resulted in most of the shares being dumped on a
 235 single node).
 236
 237 Another algorithm (known as "denver airport"[2]) uses the permuted hash to
 238 decide on an approximate target for each share, then sends lease requests via
 239 Chord routing. The request includes the contact information of the uploading
 240 node, and asks that the node which eventually accepts the lease should contact
 241 the uploader directly. The shares are then transferred over direct connections
 242 rather than through multiple Chord hops. Download uses the same approach. This
 243 allows nodes to avoid maintaining a large number of long-term connections, at
 244 the expense of complexity, latency, and reliability.
 245
 246
 247 SWARMING DOWNLOAD, TRICKLING UPLOAD
 248
 249 Because the shares being downloaded are distributed across a large number of
 250 nodes, the download process will pull from many of them at the same time. The
 251 current encoding parameters require 3 shares to be retrieved for each segment,
 252 which means that up to 3 nodes will be used simultaneously. For larger
 253 networks, 8-of-22 encoding could be used, meaning 8 nodes can be used
 254 simultaneously. This allows the download process to use the sum of the
 255 available nodes' upload bandwidths, resulting in downloads that take full
 256 advantage of the common 8x disparity between download and upload bandwith on
 257 modern ADSL lines.
 258
 259 On the other hand, uploads are hampered by the need to upload encoded shares
 260 that are larger than the original data (3.3x larger with the current default
 261 encoding parameters), through the slow end of the asymmetric connection. This
 262 means that on a typical 8x ADSL line, uploading a file will take about 32
 263 times longer than downloading it again later.
 264
 265 Smaller expansion ratios can reduce this upload penalty, at the expense of
 266 reliability. See RELIABILITY, below. By using an "upload helper", this penalty
 267 is eliminated: the client does a 1x upload of encrypted data to the helper,
 268 then the helper performs encoding and pushes the shares to the storage
 269 servers. This is an improvement if the helper has significantly higher upload
 270 bandwidth than the client, so it makes the most sense for a commercially-run
 271 grid for which all of the storage servers are in a colo facility with high
 272 interconnect bandwidth. In this case, the helper is placed in the same
 273 facility, so the helper-to-storage-server bandwidth is huge.
 274
 275 See "helper.txt" for details about the upload helper.
 276
 277
 278 THE FILESYSTEM LAYER
 279
 280 The "filesystem" layer is responsible for mapping human-meaningful pathnames
 281 (directories and filenames) to pieces of data. The actual bytes inside these
 282 files are referenced by capability, but the filesystem layer is where the
 283 directory names, file names, and metadata are kept.
 284
 285 The filesystem layer is a graph of directories. Each directory contains a
 286 table of named children. These children are either other directories or
 287 files. All children are referenced by their capability.
 288
 289 A directory has two forms of capability: read-write caps and read-only
 290 caps. The table of children inside the directory has a read-write and
 291 read-only capability for each child. If you have a read-only capability for a
 292 given directory, you will not be able to access the read-write capability of
 293 its children.  This results in "transitively read-only" directory access.
 294
 295 By having two different capabilities, you can choose which you want to share
 296 with someone else. If you create a new directory and share the read-write
 297 capability for it with a friend, then you will both be able to modify its
 298 contents. If instead you give them the read-only capability, then they will
 299 *not* be able to modify the contents. Any capability that you receive can be
 300 linked in to any directory that you can modify, so very powerful
 301 shared+published directory structures can be built from these components.
 302
 303 This structure enable individual users to have their own personal space, with
 304 links to spaces that are shared with specific other users, and other spaces
 305 that are globally visible.
 306
 307
 308 LEASES, REFRESHING, GARBAGE COLLECTION
 309
 310 When a file or directory in the virtual filesystem is no longer referenced,
 311 the space that its shares occupied on each storage server can be freed,
 312 making room for other shares. Tahoe currently uses a garbage collection
 313 ("GC") mechanism to implement this space-reclamation process. Each share has
 314 one or more "leases", which are managed by clients who want the
 315 file/directory to be retained. The storage server accepts each share for a
 316 pre-defined period of time, and is allowed to delete the share if all of the
 317 leases are cancelled or allowed to expire.
 318
 319 Garbage collection is not enabled by default: storage servers will not delete
 320 shares without being explicitly configured to do so. When GC is enabled,
 321 clients are responsible for renewing their leases on a periodic basis at
 322 least frequently enough to prevent any of the leases from expiring before the
 323 next renewal pass.
 324
 325 See docs/garbage-collection.txt for further information, and how to configure
 326 garbage collection.
 327
 328
 329 FILE REPAIRER
 330
 331 Shares may go away because the storage server hosting them has suffered a
 332 failure: either temporary downtime (affecting availability of the file), or a
 333 permanent data loss (affecting the reliability of the file). Hard drives
 334 crash, power supplies explode, coffee spills, and asteroids strike. The goal
 335 of a robust distributed filesystem is to survive these setbacks.
 336
 337 To work against this slow, continual loss of shares, a File Checker is used to
 338 periodically count the number of shares still available for any given file. A
 339 more extensive form of checking known as the File Verifier can download the
 340 ciphertext of the target file and perform integrity checks (using strong
 341 hashes) to make sure the data is stil intact. When the file is found to have
 342 decayed below some threshold, the File Repairer can be used to regenerate and
 343 re-upload the missing shares. These processes are conceptually distinct (the
 344 repairer is only run if the checker/verifier decides it is necessary), but in
 345 practice they will be closely related, and may run in the same process.
 346
 347 The repairer process does not get the full capability of the file to be
 348 maintained: it merely gets the "repairer capability" subset, which does not
 349 include the decryption key. The File Verifier uses that data to find out which
 350 nodes ought to hold shares for this file, and to see if those nodes are still
 351 around and willing to provide the data. If the file is not healthy enough, the
 352 File Repairer is invoked to download the ciphertext, regenerate any missing
 353 shares, and upload them to new nodes. The goal of the File Repairer is to
 354 finish up with a full set of "N" shares.
 355
 356 There are a number of engineering issues to be resolved here. The bandwidth,
 357 disk IO, and CPU time consumed by the verification/repair process must be
 358 balanced against the robustness that it provides to the grid. The nodes
 359 involved in repair will have very different access patterns than normal nodes,
 360 such that these processes may need to be run on hosts with more memory or
 361 network connectivity than usual. The frequency of repair will directly affect
 362 the resources consumed. In some cases, verification of multiple files can be
 363 performed at the same time, and repair of files can be delegated off to other
 364 nodes.
 365
 366 The security model we are currently using assumes that nodes who claim to hold
 367 a share will actually provide it when asked. (We validate the data they
 368 provide before using it in any way, but if enough nodes claim to hold the data
 369 and are wrong, the file will not be repaired, and may decay beyond
 370 recoverability). There are several interesting approaches to mitigate this
 371 threat, ranging from challenges to provide a keyed hash of the allegedly-held
 372 data (using "buddy nodes", in which two nodes hold the same block, and check
 373 up on each other), to reputation systems, or even the original Mojo Nation
 374 economic model.
 375
 376
 377 SECURITY
 378
 379 The design goal for this project is that an attacker may be able to deny
 380 service (i.e. prevent you from recovering a file that was uploaded earlier)
 381 but can accomplish none of the following three attacks:
 382
 383  1) violate confidentiality: the attacker gets to view data to which you have
 384     not granted them access
 385  2) violate consistency: the attacker convinces you that the wrong data is
 386     actually the data you were intending to retrieve
 387  3) violate mutability: the attacker gets to modify a directory (either the
 388     pathnames or the file contents) to which you have not given them
 389     mutability rights
 390
 391 Data validity and consistency (the promise that the downloaded data will match
 392 the originally uploaded data) is provided by the hashes embedded in the
 393 capability. Data confidentiality (the promise that the data is only readable
 394 by people with the capability) is provided by the encryption key embedded in
 395 the capability. Data availability (the hope that data which has been uploaded
 396 in the past will be downloadable in the future) is provided by the grid, which
 397 distributes failures in a way that reduces the correlation between individual
 398 node failure and overall file recovery failure, and by the erasure-coding
 399 technique used to generate shares.
 400
 401 Many of these security properties depend upon the usual cryptographic
 402 assumptions: the resistance of AES and RSA to attack, the resistance of
 403 SHA-256 to pre-image attacks, and upon the proximity of 2^-128 and 2^-256 to
 404 zero. A break in AES would allow a confidentiality violation, a pre-image
 405 break in SHA-256 would allow a consistency violation, and a break in RSA
 406 would allow a mutability violation. The discovery of a collision in SHA-256
 407 is unlikely to allow much, but could conceivably allow a consistency
 408 violation in data that was uploaded by the attacker. If SHA-256 is
 409 threatened, further analysis will be warranted.
 410
 411 There is no attempt made to provide anonymity, neither of the origin of a
 412 piece of data nor the identity of the subsequent downloaders. In general,
 413 anyone who already knows the contents of a file will be in a strong position
 414 to determine who else is uploading or downloading it. Also, it is quite easy
 415 for a sufficiently large coalition of nodes to correlate the set of nodes who
 416 are all uploading or downloading the same file, even if the attacker does not
 417 know the contents of the file in question.
 418
 419 Also note that the file size and (when convergence is being used) a keyed hash
 420 of the plaintext are not protected. Many people can determine the size of the
 421 file you are accessing, and if they already know the contents of a given file,
 422 they will be able to determine that you are uploading or downloading the same
 423 one.
 424
 425 A likely enhancement is the ability to use distinct encryption keys for each
 426 file, avoiding the file-correlation attacks at the expense of increased
 427 storage consumption. This is known as "non-convergent" encoding.
 428
 429 The capability-based security model is used throughout this project. Directory
 430 operations are expressed in terms of distinct read- and write- capabilities.
 431 Knowing the read-capability of a file is equivalent to the ability to read the
 432 corresponding data. The capability to validate the correctness of a file is
 433 strictly weaker than the read-capability (possession of read-capability
 434 automatically grants you possession of validate-capability, but not vice
 435 versa). These capabilities may be expressly delegated (irrevocably) by simply
 436 transferring the relevant secrets.
 437
 438 The application layer can provide whatever access model is desired, built on
 439 top of this capability access model.  The first big user of this system so far
 440 is allmydata.com.  The allmydata.com access model currently works like a
 441 normal web site, using username and password to give a user access to her
 442 virtual drive.  In addition, allmydata.com users can share individual files
 443 (using a file sharing interface built on top of the immutable file read
 444 capabilities).
 445
 446
 447 RELIABILITY
 448
 449 File encoding and peer-node selection parameters can be adjusted to achieve
 450 different goals. Each choice results in a number of properties; there are many
 451 tradeoffs.
 452
 453 First, some terms: the erasure-coding algorithm is described as K-out-of-N
 454 (for this release, the default values are K=3 and N=10). Each grid will have
 455 some number of nodes; this number will rise and fall over time as nodes join,
 456 drop out, come back, and leave forever. Files are of various sizes, some are
 457 popular, others are rare. Nodes have various capacities, variable
 458 upload/download bandwidths, and network latency. Most of the mathematical
 459 models that look at node failure assume some average (and independent)
 460 probability 'P' of a given node being available: this can be high (servers
 461 tend to be online and available >90% of the time) or low (laptops tend to be
 462 turned on for an hour then disappear for several days). Files are encoded in
 463 segments of a given maximum size, which affects memory usage.
 464
 465 The ratio of N/K is the "expansion factor". Higher expansion factors improve
 466 reliability very quickly (the binomial distribution curve is very sharp), but
 467 consumes much more grid capacity. When P=50%, the absolute value of K affects
 468 the granularity of the binomial curve (1-out-of-2 is much worse than
 469 50-out-of-100), but high values asymptotically approach a constant (i.e.
 470 500-of-1000 is not much better than 50-of-100). When P is high and the
 471 expansion factor is held at a constant, higher values of K and N give much
 472 better reliability (for P=99%, 50-out-of-100 is much much better than 5-of-10,
 473 roughly 10^50 times better), because there are more shares that can be lost
 474 without losing the file.
 475
 476 Likewise, the total number of nodes in the network affects the same
 477 granularity: having only one node means a single point of failure, no matter
 478 how many copies of the file you make. Independent nodes (with uncorrelated
 479 failures) are necessary to hit the mathematical ideals: if you have 100 nodes
 480 but they are all in the same office building, then a single power failure will
 481 take out all of them at once. The "Sybil Attack" is where a single attacker
 482 convinces you that they are actually multiple servers, so that you think you
 483 are using a large number of independent nodes, but in fact you have a single
 484 point of failure (where the attacker turns off all their machines at
 485 once). Large grids, with lots of truly independent nodes, will enable the use
 486 of lower expansion factors to achieve the same reliability, but will increase
 487 overhead because each node needs to know something about every other, and the
 488 rate at which nodes come and go will be higher (requiring network maintenance
 489 traffic). Also, the File Repairer work will increase with larger grids,
 490 although then the job can be distributed out to more nodes.
 491
 492 Higher values of N increase overhead: more shares means more Merkle hashes
 493 that must be included with the data, and more nodes to contact to retrieve the
 494 shares. Smaller segment sizes reduce memory usage (since each segment must be
 495 held in memory while erasure coding runs) and improves "alacrity" (since
 496 downloading can validate a smaller piece of data faster, delivering it to the
 497 target sooner), but also increase overhead (because more blocks means more
 498 Merkle hashes to validate them).
 499
 500 In general, small private grids should work well, but the participants will
 501 have to decide between storage overhead and reliability. Large stable grids
 502 will be able to reduce the expansion factor down to a bare minimum while still
 503 retaining high reliability, but large unstable grids (where nodes are coming
 504 and going very quickly) may require more repair/verification bandwidth than
 505 actual upload/download traffic.
 506
 507 Tahoe nodes that run a webserver have a page dedicated to provisioning
 508 decisions: this tool may help you evaluate different expansion factors and
 509 view the disk consumption of each. It is also acquiring some sections with
 510 availability/reliability numbers, as well as preliminary cost analysis data.
 511 This tool will continue to evolve as our analysis improves.
 512
 513 ------------------------------
 514
 515 [1]: http://en.wikipedia.org/wiki/Zooko%27s_triangle
 516
 517 [2]: all of these names are derived from the location where they were
 518      concocted, in this case in a car ride from Boulder to DEN. To be
 519      precise, "Tahoe 1" was an unworkable scheme in which everyone who holds
 520      shares for a given file would form a sort of cabal which kept track of
 521      all the others, "Tahoe 2" is the first-100-nodes in the permuted hash
 522      described in this document, and "Tahoe 3" (or perhaps "Potrero hill 1")
 523      was the abandoned ring-with-many-hands approach.
 524