docs/architecture.rst

   1 =======================
   2 Tahoe-LAFS Architecture
   3 =======================
   4
   5 1.  `Overview`_
   6 2.  `The Key-Value Store`_
   7 3.  `File Encoding`_
   8 4.  `Capabilities`_
   9 5.  `Server Selection`_
  10 6.  `Swarming Download, Trickling Upload`_
  11 7.  `The Filesystem Layer`_
  12 8.  `Leases, Refreshing, Garbage Collection`_
  13 9.  `File Repairer`_
  14 10. `Security`_
  15 11. `Reliability`_
  16
  17
  18 Overview
  19 ========
  20
  21 (See the `docs/specifications directory <specifications>`_ for more details.)
  22
  23 There are three layers: the key-value store, the filesystem, and the
  24 application.
  25
  26 The lowest layer is the key-value store. The keys are "capabilities" -- short
  27 ASCII strings -- and the values are sequences of data bytes. This data is
  28 encrypted and distributed across a number of nodes, such that it will survive
  29 the loss of most of the nodes. There are no hard limits on the size of the
  30 values, but there may be performance issues with extremely large values (just
  31 due to the limitation of network bandwidth). In practice, values as small as
  32 a few bytes and as large as tens of gigabytes are in common use.
  33
  34 The middle layer is the decentralized filesystem: a directed graph in which
  35 the intermediate nodes are directories and the leaf nodes are files. The leaf
  36 nodes contain only the data -- they contain no metadata other than the length
  37 in bytes. The edges leading to leaf nodes have metadata attached to them
  38 about the file they point to. Therefore, the same file may be associated with
  39 different metadata if it is referred to through different edges.
  40
  41 The top layer consists of the applications using the filesystem.
  42 Allmydata.com uses it for a backup service: the application periodically
  43 copies files from the local disk onto the decentralized filesystem. We later
  44 provide read-only access to those files, allowing users to recover them.
  45 There are several other applications built on top of the Tahoe-LAFS
  46 filesystem (see the `RelatedProjects
  47 <http://tahoe-lafs.org/trac/tahoe-lafs/wiki/RelatedProjects>`_ page of the
  48 wiki for a list).
  49
  50
  51 The Key-Value Store
  52 ===================
  53
  54 The key-value store is implemented by a grid of Tahoe-LAFS storage servers --
  55 user-space processes. Tahoe-LAFS storage clients communicate with the storage
  56 servers over TCP.
  57
  58 Storage servers hold data in the form of "shares". Shares are encoded pieces
  59 of files. There are a configurable number of shares for each file, 10 by
  60 default. Normally, each share is stored on a separate server, but in some
  61 cases a single server can hold multiple shares of a file.
  62
  63 Nodes learn about each other through an "introducer". Each server connects to
  64 the introducer at startup and announces its presence. Each client connects to
  65 the introducer at startup, and receives a list of all servers from it. Each
  66 client then connects to every server, creating a "bi-clique" topology. In the
  67 current release, nodes behind NAT boxes will connect to all nodes that they
  68 can open connections to, but they cannot open connections to other nodes
  69 behind NAT boxes. Therefore, the more nodes behind NAT boxes, the less the
  70 topology resembles the intended bi-clique topology.
  71
  72 The introducer is a Single Point of Failure ("SPoF"), in that clients who
  73 never connect to the introducer will be unable to connect to any storage
  74 servers, but once a client has been introduced to everybody, it does not need
  75 the introducer again until it is restarted. The danger of a SPoF is further
  76 reduced in two ways. First, the introducer is defined by a hostname and a
  77 private key, which are easy to move to a new host in case the original one
  78 suffers an unrecoverable hardware problem. Second, even if the private key is
  79 lost, clients can be reconfigured to use a new introducer.
  80
  81 For future releases, we have plans to decentralize introduction, allowing any
  82 server to tell a new client about all the others.
  83
  84
  85 File Encoding
  86 =============
  87
  88 When a client stores a file on the grid, it first encrypts the file. It then
  89 breaks the encrypted file into small segments, in order to reduce the memory
  90 footprint, and to decrease the lag between initiating a download and
  91 receiving the first part of the file; for example the lag between hitting
  92 "play" and a movie actually starting.
  93
  94 The client then erasure-codes each segment, producing blocks of which only a
  95 subset are needed to reconstruct the segment (3 out of 10, with the default
  96 settings).
  97
  98 It sends one block from each segment to a given server. The set of blocks on
  99 a given server constitutes a "share". Therefore a subset f the shares (3 out
 100 of 10, by default) are needed to reconstruct the file.
 101
 102 A hash of the encryption key is used to form the "storage index", which is
 103 used for both server selection (described below) and to index shares within
 104 the Storage Servers on the selected nodes.
 105
 106 The client computes secure hashes of the ciphertext and of the shares. It
 107 uses Merkle Trees so that it is possible to verify the correctness of a
 108 subset of the data without requiring all of the data. For example, this
 109 allows you to verify the correctness of the first segment of a movie file and
 110 then begin playing the movie file in your movie viewer before the entire
 111 movie file has been downloaded.
 112
 113 These hashes are stored in a small datastructure named the Capability
 114 Extension Block which is stored on the storage servers alongside each share.
 115
 116 The capability contains the encryption key, the hash of the Capability
 117 Extension Block, and any encoding parameters necessary to perform the
 118 eventual decoding process. For convenience, it also contains the size of the
 119 file being stored.
 120
 121 To download, the client that wishes to turn a capability into a sequence of
 122 bytes will obtain the blocks from storage servers, use erasure-decoding to
 123 turn them into segments of ciphertext, use the decryption key to convert that
 124 into plaintext, then emit the plaintext bytes to the output target.
 125
 126
 127 Capabilities
 128 ============
 129
 130 Capabilities to immutable files represent a specific set of bytes. Think of
 131 it like a hash function: you feed in a bunch of bytes, and you get out a
 132 capability, which is deterministically derived from the input data: changing
 133 even one bit of the input data will result in a completely different
 134 capability.
 135
 136 Read-only capabilities to mutable files represent the ability to get a set of
 137 bytes representing some version of the file, most likely the latest version.
 138 Each read-only capability is unique. In fact, each mutable file has a unique
 139 public/private key pair created when the mutable file is created, and the
 140 read-only capability to that file includes a secure hash of the public key.
 141
 142 Read-write capabilities to mutable files represent the ability to read the
 143 file (just like a read-only capability) and also to write a new version of
 144 the file, overwriting any extant version. Read-write capabilities are unique
 145 -- each one includes the secure hash of the private key associated with that
 146 mutable file.
 147
 148 The capability provides both "location" and "identification": you can use it
 149 to retrieve a set of bytes, and then you can use it to validate ("identify")
 150 that these potential bytes are indeed the ones that you were looking for.
 151
 152 The "key-value store" layer doesn't include human-meaningful names.
 153 Capabilities sit on the "global+secure" edge of `Zooko's Triangle`_. They are
 154 self-authenticating, meaning that nobody can trick you into accepting a file
 155 that doesn't match the capability you used to refer to that file. The
 156 filesystem layer (described below) adds human-meaningful names atop the
 157 key-value layer.
 158
 159 .. _`Zooko's Triangle`: http://en.wikipedia.org/wiki/Zooko%27s_triangle
 160
 161
 162 Server Selection
 163 ================
 164
 165 When a file is uploaded, the encoded shares are sent to some servers. But to
 166 which ones? The "server selection" algorithm is used to make this choice.
 167
 168 The storage index is used to consistently-permute the set of all servers nodes
 169 (by sorting them by ``HASH(storage_index+nodeid)``). Each file gets a different
 170 permutation, which (on average) will evenly distribute shares among the grid
 171 and avoid hotspots. Each server has announced its available space when it
 172 connected to the introducer, and we use that available space information to
 173 remove any servers that cannot hold an encoded share for our file. Then we ask
 174 some of the servers thus removed if they are already holding any encoded shares
 175 for our file; we use this information later. (We ask any servers which are in
 176 the first 2*``N`` elements of the permuted list.)
 177
 178 We then use the permuted list of servers to ask each server, in turn, if it
 179 will hold a share for us (a share that was not reported as being already
 180 present when we talked to the full servers earlier, and that we have not
 181 already planned to upload to a different server). We plan to send a share to a
 182 server by sending an 'allocate_buckets() query' to the server with the number
 183 of that share. Some will say yes they can hold that share, others (those who
 184 have become full since they announced their available space) will say no; when
 185 a server refuses our request, we take that share to the next server on the
 186 list. In the response to allocate_buckets() the server will also inform us of
 187 any shares of that file that it already has. We keep going until we run out of
 188 shares that need to be stored. At the end of the process, we'll have a table
 189 that maps each share number to a server, and then we can begin the encode and
 190 push phase, using the table to decide where each share should be sent.
 191
 192 Most of the time, this will result in one share per server, which gives us
 193 maximum reliability.  If there are fewer writable servers than there are
 194 unstored shares, we'll be forced to loop around, eventually giving multiple
 195 shares to a single server.
 196
 197 If we have to loop through the node list a second time, we accelerate the query
 198 process, by asking each node to hold multiple shares on the second pass. In
 199 most cases, this means we'll never send more than two queries to any given
 200 node.
 201
 202 If a server is unreachable, or has an error, or refuses to accept any of our
 203 shares, we remove it from the permuted list, so we won't query it again for
 204 this file. If a server already has shares for the file we're uploading, we add
 205 that information to the share-to-server table. This lets us do less work for
 206 files which have been uploaded once before, while making sure we still wind up
 207 with as many shares as we desire.
 208
 209 Before a file upload is called successful, it has to pass an upload health
 210 check. For immutable files, we check to see that a condition called
 211 'servers-of-happiness' is satisfied. When satisfied, 'servers-of-happiness'
 212 assures us that enough pieces of the file are distributed across enough
 213 servers on the grid to ensure that the availability of the file will not be
 214 affected if a few of those servers later fail. For mutable files and
 215 directories, we check to see that all of the encoded shares generated during
 216 the upload process were successfully placed on the grid. This is a weaker
 217 check than 'servers-of-happiness'; it does not consider any information about
 218 how the encoded shares are placed on the grid, and cannot detect situations in
 219 which all or a majority of the encoded shares generated during the upload
 220 process reside on only one storage server. We hope to extend
 221 'servers-of-happiness' to mutable files in a future release of Tahoe-LAFS. If,
 222 at the end of the upload process, the appropriate upload health check fails,
 223 the upload is considered a failure.
 224
 225 The current defaults use ``k``=3, ``servers_of_happiness``=7, and ``N``=10.
 226 ``N``=10 means that we'll try to place 10 shares. ``k``=3 means that we need
 227 any three shares to recover the file. ``servers_of_happiness``=7 means that
 228 we'll consider an immutable file upload to be successful if we can place shares
 229 on enough servers that there are 7 different servers, the correct functioning
 230 of any ``k`` of which guarantee the availability of the immutable file.
 231
 232 ``N``=10 and ``k``=3 means there is a 3.3x expansion factor. On a small grid, you
 233 should set ``N`` about equal to the number of storage servers in your grid; on a
 234 large grid, you might set it to something smaller to avoid the overhead of
 235 contacting every server to place a file. In either case, you should then set ``k``
 236 such that ``N``/``k`` reflects your desired availability goals. The best value for
 237 ``servers_of_happiness`` will depend on how you use Tahoe-LAFS. In a friendnet
 238 with a variable number of servers, it might make sense to set it to the smallest
 239 number of servers that you expect to have online and accepting shares at any
 240 given time. In a stable environment without much server churn, it may make
 241 sense to set ``servers_of_happiness`` = ``N``.
 242
 243 When downloading a file, the current version just asks all known servers for
 244 any shares they might have. Once it has received enough responses that it
 245 knows where to find the needed k shares, it downloads at least the first
 246 segment from those servers. This means that it tends to download shares from
 247 the fastest servers. If some servers had more than one share, it will continue
 248 sending "Do You Have Block" requests to other servers, so that it can download
 249 subsequent segments from distinct servers (sorted by their DYHB round-trip
 250 times), if possible.
 251
 252   *future work*
 253
 254   A future release will use the server selection algorithm to reduce the
 255   number of queries that must be sent out.
 256
 257   Other peer-node selection algorithms are possible. One earlier version
 258   (known as "Tahoe 3") used the permutation to place the nodes around a large
 259   ring, distributed the shares evenly around the same ring, then walked
 260   clockwise from 0 with a basket. Each time it encountered a share, it put it
 261   in the basket, each time it encountered a server, give it as many shares
 262   from the basket as they'd accept. This reduced the number of queries
 263   (usually to 1) for small grids (where ``N`` is larger than the number of
 264   nodes), but resulted in extremely non-uniform share distribution, which
 265   significantly hurt reliability (sometimes the permutation resulted in most
 266   of the shares being dumped on a single node).
 267
 268   Another algorithm (known as "denver airport" [#naming]_) uses the permuted hash to
 269   decide on an approximate target for each share, then sends lease requests
 270   via Chord routing. The request includes the contact information of the
 271   uploading node, and asks that the node which eventually accepts the lease
 272   should contact the uploader directly. The shares are then transferred over
 273   direct connections rather than through multiple Chord hops. Download uses
 274   the same approach. This allows nodes to avoid maintaining a large number of
 275   long-term connections, at the expense of complexity and latency.
 276
 277 .. [#naming]  all of these names are derived from the location where they were
 278         concocted, in this case in a car ride from Boulder to DEN. To be
 279         precise, "Tahoe 1" was an unworkable scheme in which everyone who holds
 280         shares for a given file would form a sort of cabal which kept track of
 281         all the others, "Tahoe 2" is the first-100-nodes in the permuted hash
 282         described in this document, and "Tahoe 3" (or perhaps "Potrero hill 1")
 283         was the abandoned ring-with-many-hands approach.
 284
 285
 286 Swarming Download, Trickling Upload
 287 ===================================
 288
 289 Because the shares being downloaded are distributed across a large number of
 290 nodes, the download process will pull from many of them at the same time. The
 291 current encoding parameters require 3 shares to be retrieved for each
 292 segment, which means that up to 3 nodes will be used simultaneously. For
 293 larger networks, 8-of-22 encoding could be used, meaning 8 nodes can be used
 294 simultaneously. This allows the download process to use the sum of the
 295 available nodes' upload bandwidths, resulting in downloads that take full
 296 advantage of the common 8x disparity between download and upload bandwith on
 297 modern ADSL lines.
 298
 299 On the other hand, uploads are hampered by the need to upload encoded shares
 300 that are larger than the original data (3.3x larger with the current default
 301 encoding parameters), through the slow end of the asymmetric connection. This
 302 means that on a typical 8x ADSL line, uploading a file will take about 32
 303 times longer than downloading it again later.
 304
 305 Smaller expansion ratios can reduce this upload penalty, at the expense of
 306 reliability (see RELIABILITY, below). By using an "upload helper", this
 307 penalty is eliminated: the client does a 1x upload of encrypted data to the
 308 helper, then the helper performs encoding and pushes the shares to the
 309 storage servers. This is an improvement if the helper has significantly
 310 higher upload bandwidth than the client, so it makes the most sense for a
 311 commercially-run grid for which all of the storage servers are in a colo
 312 facility with high interconnect bandwidth. In this case, the helper is placed
 313 in the same facility, so the helper-to-storage-server bandwidth is huge.
 314
 315 See `<helper.rst>`_ for details about the upload helper.
 316
 317
 318 The Filesystem Layer
 319 ====================
 320
 321 The "filesystem" layer is responsible for mapping human-meaningful pathnames
 322 (directories and filenames) to pieces of data. The actual bytes inside these
 323 files are referenced by capability, but the filesystem layer is where the
 324 directory names, file names, and metadata are kept.
 325
 326 The filesystem layer is a graph of directories. Each directory contains a
 327 table of named children. These children are either other directories or
 328 files. All children are referenced by their capability.
 329
 330 A directory has two forms of capability: read-write caps and read-only caps.
 331 The table of children inside the directory has a read-write and read-only
 332 capability for each child. If you have a read-only capability for a given
 333 directory, you will not be able to access the read-write capability of its
 334 children. This results in "transitively read-only" directory access.
 335
 336 By having two different capabilities, you can choose which you want to share
 337 with someone else. If you create a new directory and share the read-write
 338 capability for it with a friend, then you will both be able to modify its
 339 contents. If instead you give them the read-only capability, then they will
 340 *not* be able to modify the contents. Any capability that you receive can be
 341 linked in to any directory that you can modify, so very powerful
 342 shared+published directory structures can be built from these components.
 343
 344 This structure enable individual users to have their own personal space, with
 345 links to spaces that are shared with specific other users, and other spaces
 346 that are globally visible.
 347
 348
 349 Leases, Refreshing, Garbage Collection
 350 ======================================
 351
 352 When a file or directory in the virtual filesystem is no longer referenced,
 353 the space that its shares occupied on each storage server can be freed,
 354 making room for other shares. Tahoe-LAFS uses a garbage collection ("GC")
 355 mechanism to implement this space-reclamation process. Each share has one or
 356 more "leases", which are managed by clients who want the file/directory to be
 357 retained. The storage server accepts each share for a pre-defined period of
 358 time, and is allowed to delete the share if all of the leases are cancelled
 359 or allowed to expire.
 360
 361 Garbage collection is not enabled by default: storage servers will not delete
 362 shares without being explicitly configured to do so. When GC is enabled,
 363 clients are responsible for renewing their leases on a periodic basis at
 364 least frequently enough to prevent any of the leases from expiring before the
 365 next renewal pass.
 366
 367 See `<garbage-collection.rst>`_ for further information, and for how to
 368 configure garbage collection.
 369
 370
 371 File Repairer
 372 =============
 373
 374 Shares may go away because the storage server hosting them has suffered a
 375 failure: either temporary downtime (affecting availability of the file), or a
 376 permanent data loss (affecting the preservation of the file). Hard drives
 377 crash, power supplies explode, coffee spills, and asteroids strike. The goal
 378 of a robust distributed filesystem is to survive these setbacks.
 379
 380 To work against this slow, continual loss of shares, a File Checker is used
 381 to periodically count the number of shares still available for any given
 382 file. A more extensive form of checking known as the File Verifier can
 383 download the ciphertext of the target file and perform integrity checks
 384 (using strong hashes) to make sure the data is stil intact. When the file is
 385 found to have decayed below some threshold, the File Repairer can be used to
 386 regenerate and re-upload the missing shares. These processes are conceptually
 387 distinct (the repairer is only run if the checker/verifier decides it is
 388 necessary), but in practice they will be closely related, and may run in the
 389 same process.
 390
 391 The repairer process does not get the full capability of the file to be
 392 maintained: it merely gets the "repairer capability" subset, which does not
 393 include the decryption key. The File Verifier uses that data to find out
 394 which nodes ought to hold shares for this file, and to see if those nodes are
 395 still around and willing to provide the data. If the file is not healthy
 396 enough, the File Repairer is invoked to download the ciphertext, regenerate
 397 any missing shares, and upload them to new nodes. The goal of the File
 398 Repairer is to finish up with a full set of ``N`` shares.
 399
 400 There are a number of engineering issues to be resolved here. The bandwidth,
 401 disk IO, and CPU time consumed by the verification/repair process must be
 402 balanced against the robustness that it provides to the grid. The nodes
 403 involved in repair will have very different access patterns than normal
 404 nodes, such that these processes may need to be run on hosts with more memory
 405 or network connectivity than usual. The frequency of repair will directly
 406 affect the resources consumed. In some cases, verification of multiple files
 407 can be performed at the same time, and repair of files can be delegated off
 408 to other nodes.
 409
 410   *future work*
 411
 412   Currently there are two modes of checking on the health of your file:
 413   "Checker" simply asks storage servers which shares they have and does
 414   nothing to try to verify that they aren't lying. "Verifier" downloads and
 415   cryptographically verifies every bit of every share of the file from every
 416   server, which costs a lot of network and CPU. A future improvement would be
 417   to make a random-sampling verifier which downloads and cryptographically
 418   verifies only a few randomly-chosen blocks from each server. This would
 419   require much less network and CPU but it could make it extremely unlikely
 420   that any sort of corruption -- even malicious corruption intended to evade
 421   detection -- would evade detection. This would be an instance of a
 422   cryptographic notion called "Proof of Retrievability". Note that to implement
 423   this requires no change to the server or to the cryptographic data structure
 424   -- with the current data structure and the current protocol it is up to the
 425   client which blocks they choose to download, so this would be solely a change
 426   in client behavior.
 427
 428
 429 Security
 430 ========
 431
 432 The design goal for this project is that an attacker may be able to deny
 433 service (i.e. prevent you from recovering a file that was uploaded earlier)
 434 but can accomplish none of the following three attacks:
 435
 436 1) violate confidentiality: the attacker gets to view data to which you have
 437    not granted them access
 438 2) violate integrity: the attacker convinces you that the wrong data is
 439    actually the data you were intending to retrieve
 440 3) violate unforgeability: the attacker gets to modify a mutable file or
 441    directory (either the pathnames or the file contents) to which you have
 442    not given them write permission
 443
 444 Integrity (the promise that the downloaded data will match the uploaded data)
 445 is provided by the hashes embedded in the capability (for immutable files) or
 446 the digital signature (for mutable files). Confidentiality (the promise that
 447 the data is only readable by people with the capability) is provided by the
 448 encryption key embedded in the capability (for both immutable and mutable
 449 files). Data availability (the hope that data which has been uploaded in the
 450 past will be downloadable in the future) is provided by the grid, which
 451 distributes failures in a way that reduces the correlation between individual
 452 node failure and overall file recovery failure, and by the erasure-coding
 453 technique used to generate shares.
 454
 455 Many of these security properties depend upon the usual cryptographic
 456 assumptions: the resistance of AES and RSA to attack, the resistance of
 457 SHA-256 to collision attacks and pre-image attacks, and upon the proximity of
 458 2^-128 and 2^-256 to zero. A break in AES would allow a confidentiality
 459 violation, a collision break in SHA-256 would allow a consistency violation,
 460 and a break in RSA would allow a mutability violation.
 461
 462 There is no attempt made to provide anonymity, neither of the origin of a
 463 piece of data nor the identity of the subsequent downloaders. In general,
 464 anyone who already knows the contents of a file will be in a strong position
 465 to determine who else is uploading or downloading it. Also, it is quite easy
 466 for a sufficiently large coalition of nodes to correlate the set of nodes who
 467 are all uploading or downloading the same file, even if the attacker does not
 468 know the contents of the file in question.
 469
 470 Also note that the file size and (when convergence is being used) a keyed
 471 hash of the plaintext are not protected. Many people can determine the size
 472 of the file you are accessing, and if they already know the contents of a
 473 given file, they will be able to determine that you are uploading or
 474 downloading the same one.
 475
 476 The capability-based security model is used throughout this project.
 477 Directory operations are expressed in terms of distinct read- and write-
 478 capabilities. Knowing the read-capability of a file is equivalent to the
 479 ability to read the corresponding data. The capability to validate the
 480 correctness of a file is strictly weaker than the read-capability (possession
 481 of read-capability automatically grants you possession of
 482 validate-capability, but not vice versa). These capabilities may be expressly
 483 delegated (irrevocably) by simply transferring the relevant secrets.
 484
 485 The application layer can provide whatever access model is desired, built on
 486 top of this capability access model. The first big user of this system so far
 487 is allmydata.com. The allmydata.com access model currently works like a
 488 normal web site, using username and password to give a user access to her
 489 "virtual drive". In addition, allmydata.com users can share individual files
 490 (using a file sharing interface built on top of the immutable file read
 491 capabilities).
 492
 493
 494 Reliability
 495 ===========
 496
 497 File encoding and peer-node selection parameters can be adjusted to achieve
 498 different goals. Each choice results in a number of properties; there are
 499 many tradeoffs.
 500
 501 First, some terms: the erasure-coding algorithm is described as ``k``-out-of-``N``
 502 (for this release, the default values are ``k``=3 and ``N``=10). Each grid will
 503 have some number of nodes; this number will rise and fall over time as nodes
 504 join, drop out, come back, and leave forever. Files are of various sizes, some
 505 are popular, others are unpopular. Nodes have various capacities, variable
 506 upload/download bandwidths, and network latency. Most of the mathematical
 507 models that look at node failure assume some average (and independent)
 508 probability 'P' of a given node being available: this can be high (servers
 509 tend to be online and available >90% of the time) or low (laptops tend to be
 510 turned on for an hour then disappear for several days). Files are encoded in
 511 segments of a given maximum size, which affects memory usage.
 512
 513 The ratio of ``N``/``k`` is the "expansion factor". Higher expansion factors
 514 improve reliability very quickly (the binomial distribution curve is very sharp),
 515 but consumes much more grid capacity. When P=50%, the absolute value of ``k``
 516 affects the granularity of the binomial curve (1-out-of-2 is much worse than
 517 50-out-of-100), but high values asymptotically approach a constant (i.e.
 518 500-of-1000 is not much better than 50-of-100). When P is high and the
 519 expansion factor is held at a constant, higher values of ``k`` and ``N`` give
 520 much better reliability (for P=99%, 50-out-of-100 is much much better than
 521 5-of-10, roughly 10^50 times better), because there are more shares that can
 522 be lost without losing the file.
 523
 524 Likewise, the total number of nodes in the network affects the same
 525 granularity: having only one node means a single point of failure, no matter
 526 how many copies of the file you make. Independent nodes (with uncorrelated
 527 failures) are necessary to hit the mathematical ideals: if you have 100 nodes
 528 but they are all in the same office building, then a single power failure
 529 will take out all of them at once. The "Sybil Attack" is where a single
 530 attacker convinces you that they are actually multiple servers, so that you
 531 think you are using a large number of independent nodes, but in fact you have
 532 a single point of failure (where the attacker turns off all their machines at
 533 once). Large grids, with lots of truly independent nodes, will enable the use
 534 of lower expansion factors to achieve the same reliability, but will increase
 535 overhead because each node needs to know something about every other, and the
 536 rate at which nodes come and go will be higher (requiring network maintenance
 537 traffic). Also, the File Repairer work will increase with larger grids,
 538 although then the job can be distributed out to more nodes.
 539
 540 Higher values of ``N`` increase overhead: more shares means more Merkle hashes
 541 that must be included with the data, and more nodes to contact to retrieve
 542 the shares. Smaller segment sizes reduce memory usage (since each segment
 543 must be held in memory while erasure coding runs) and improves "alacrity"
 544 (since downloading can validate a smaller piece of data faster, delivering it
 545 to the target sooner), but also increase overhead (because more blocks means
 546 more Merkle hashes to validate them).
 547
 548 In general, small private grids should work well, but the participants will
 549 have to decide between storage overhead and reliability. Large stable grids
 550 will be able to reduce the expansion factor down to a bare minimum while
 551 still retaining high reliability, but large unstable grids (where nodes are
 552 coming and going very quickly) may require more repair/verification bandwidth
 553 than actual upload/download traffic.
 554
 555 Tahoe-LAFS nodes that run a webserver have a page dedicated to provisioning
 556 decisions: this tool may help you evaluate different expansion factors and
 557 view the disk consumption of each. It is also acquiring some sections with
 558 availability/reliability numbers, as well as preliminary cost analysis data.
 559 This tool will continue to evolve as our analysis improves.