docs/architecture.txt

   1
   2    Allmydata "Tahoe" Architecture
   3
   4 OVERVIEW
   5
   6 At a high-level this system consists of three layers: the grid, the
   7 filesystem, and the application.
   8
   9 The lowest layer is the "grid", a mapping from capabilities to data.
  10 The capabilities are relatively short ascii strings, each used as a
  11 reference to an arbitrary-length sequence of data bytes, and are like a
  12 URI for that data. This data is encrypted and distributed across a
  13 number of nodes, such that it will survive the loss of most of the
  14 nodes.
  15
  16 The middle layer is the decentralized filesystem: a directed graph in
  17 which the intermediate nodes are directories and the leaf nodes are
  18 files. The leaf nodes contain only the file data -- they contain no
  19 metadata about the file other than the length in bytes.  The edges leading to
  20 leaf nodes have metadata attached to them about the file they point
  21 to.  Therefore, the same file may be associated with different
  22 metadata if it is dereferenced through different edges.
  23
  24 The top layer consists of the applications using the filesystem.
  25 Allmydata.com uses it for a backup service: the application
  26 periodically copies files from the local disk onto the decentralized
  27 filesystem.  We later provide read-only access to those files, allowing
  28 users to recover them.  The filesystem can be used by other
  29 applications, too.
  30
  31
  32 THE GRID OF STORAGE SERVERS
  33
  34 The grid is composed of peer nodes -- processes running on
  35 computers.  They establish TCP connections to each other using Foolscap, a
  36 secure remote message passing library.
  37
  38 Each peer offers certain services to the others. The primary service
  39 is that of the storage server, which holds data in the form of
  40 "shares".  Shares are encoded pieces of files.  There are a
  41 configurable number of shares for each file, 10 by default.  Normally,
  42 each share is stored on a separate server, but a single server can
  43 hold multiple shares for a single file.
  44
  45 Peers learn about each other through an "introducer". Each peer
  46 connects to a central introducer at startup, and receives a list of
  47 all other peers from it. Each peer then connects to all other peers,
  48 creating a fully-connected topology.  In the current release, nodes
  49 behind NAT boxes will connect to all nodes that they can open
  50 connections to, but they cannot open connections to other nodes behind
  51 NAT boxes.  Therefore, the more nodes behind NAT boxes, the less the
  52 topology resembles the intended fully-connected topology.
  53
  54 The introducer in nominally a single point of failure, in that clients who
  55 never see the introducer will be unable to connect to any storage servers.
  56 But once a client has been introduced to everybody, they do not need the
  57 introducer again until they are restarted. The danger of a SPOF is further
  58 reduced in other ways. First, the introducer is defined by a hostname and a
  59 private key, which are easy to move to a new host in case the original one
  60 suffers an unrecoverable hardware problem. Second, even if the private key is
  61 lost, clients can be reconfigured with a new introducer.furl that points to a
  62 new one. Finally, we have plans to decentralize introduction, allowing any
  63 node to tell a new client about all the others. With decentralized
  64 "gossip-based" introduction, simply knowing how to contact any one node will
  65 be enough to contact all of them.
  66
  67 FILE ENCODING
  68
  69 When a peer stores a file on the grid, it first encrypts the file,
  70 using a key that is optionally derived from the hash of the file
  71 itself.  It then segments the encrypted file into small pieces, in
  72 order to reduce the memory footprint, and to decrease the lag between
  73 initiating a download and receiving the first part of the file; for
  74 example the lag between hitting "play" and a movie actually starting.
  75
  76 The peer then erasure-codes each segment, producing blocks such that
  77 only a subset of them are needed to reconstruct the segment. It sends
  78 one block from each segment to a given server. The set of blocks on a
  79 given server constitutes a "share". Only a subset of the shares (3 out
  80 of 10, by default) are needed to reconstruct the file.
  81
  82 A tagged hash of the encryption key is used to form the "storage
  83 index", which is used for both server selection (described below) and
  84 to index shares within the Storage Servers on the selected peers.
  85
  86 Hashes are computed while the shares are being produced, to validate
  87 the ciphertext and the shares themselves. Merkle hash trees are used to
  88 enable validation of individual segments of ciphertext without requiring
  89 the download/decoding of the whole file. These hashes go into the
  90 "Capability Extension Block", which will be stored with each share.
  91
  92 The capability contains the encryption key, the hash of the Capability
  93 Extension Block, and any encoding parameters necessary to perform the
  94 eventual decoding process.  For convenience, it also contains the size
  95 of the file being stored.
  96
  97
  98 On the download side, the node that wishes to turn a capability into a
  99 sequence of bytes will obtain the necessary shares from remote nodes, break
 100 them into blocks, use erasure-decoding to turn them into segments of
 101 ciphertext, use the decryption key to convert that into plaintext, then emit
 102 the plaintext bytes to the output target (which could be a file on disk, or
 103 it could be streamed directly to a web browser or media player).
 104
 105 All hashes use SHA-256, and a different tag is used for each purpose.
 106 Netstrings are used where necessary to insure these tags cannot be confused
 107 with the data to be hashed. All encryption uses AES in CTR mode. The erasure
 108 coding is performed with zfec.
 109
 110 A Merkle Hash Tree is used to validate the encoded blocks before they are fed
 111 into the decode process, and a transverse tree is used to validate the shares
 112 as they are retrieved. A third merkle tree is constructed over the plaintext
 113 segments, and a fourth is constructed over the ciphertext segments.  All
 114 necessary hashes are stored with the shares, and the hash tree roots are put
 115 in the Capability Extension Block. The final hash of the extension block goes
 116 into the capability itself.
 117
 118 Note that the number of shares created is fixed at the time the file is
 119 uploaded: it is not possible to create additional shares later. The use of a
 120 top-level hash tree also requires that nodes create all shares at once, even
 121 if they don't intend to upload some of them, otherwise the hashroot cannot be
 122 calculated correctly.
 123
 124
 125 CAPABILITIES
 126
 127 Capabilities to immutable files represent a specific set of bytes.  Think of
 128 it like a hash function: you feed in a bunch of bytes, and you get out a
 129 capability, which is deterministically derived from the input data: changing
 130 even one bit of the input data will result in a completely different
 131 capability.
 132
 133 Read-only capabilities to mutable files represent the ability to get a set of
 134 bytes representing some version of the file, most likely the latest version.
 135 Each read-only capability is unique. In fact, each mutable file has a unique
 136 public/private key pair created when the mutable file is created, and the
 137 read-only capability to that file includes a secure hash of the public key.
 138
 139 Read-write capabilities to mutable files represent the ability to read the
 140 file (just like a read-only capability) and also to write a new version of
 141 the file, overwriting any extant version.  Read-write capabilities are unique
 142 -- each one includes the secure hash of the private key associated with that
 143 mutable file.
 144
 145 The capability provides both "location" and "identification": you can use it
 146 to retrieve a set of bytes, and then you can use it to validate ("identify")
 147 that these potential bytes are indeed the ones that you were looking for.
 148
 149 The "grid" layer is insufficient to provide a virtual drive: an actual
 150 filesystem requires human-meaningful names.  Capabilities sit on the
 151 "global+secure" edge of Zooko's Triangle[1]. They are self-authenticating,
 152 meaning that nobody can trick you into using a file that doesn't match the
 153 capability you used to refer to that file.
 154
 155
 156 SERVER SELECTION
 157
 158 When a file is uploaded, the encoded shares are sent to other peers. But to
 159 which ones? The "server selection" algorithm is used to make this choice.
 160
 161 In the current version, the storage index is used to consistently-permute the
 162 set of all peers (by sorting the peers by HASH(storage_index+peerid)). Each
 163 file gets a different permutation, which (on average) will evenly distribute
 164 shares among the grid and avoid hotspots.
 165
 166 We use this permuted list of peers to ask each peer, in turn, if it will hold
 167 a share for us, by sending an 'allocate_buckets() query' to each one. Some
 168 will say yes, others (those who are full) will say no: when a peer refuses
 169 our request, we just take that share to the next peer on the list. We keep
 170 going until we run out of shares to place. At the end of the process, we'll
 171 have a table that maps each share number to a peer, and then we can begin the
 172 encode+push phase, using the table to decide where each share should be sent.
 173
 174 Most of the time, this will result in one share per peer, which gives us
 175 maximum reliability (since it disperses the failures as widely as possible).
 176 If there are fewer useable peers than there are shares, we'll be forced to
 177 loop around, eventually giving multiple shares to a single peer. This reduces
 178 reliability, so it isn't the sort of thing we want to happen all the time,
 179 and either indicates that the default encoding parameters are set incorrectly
 180 (creating more shares than you have peers), or that the grid does not have
 181 enough space (many peers are full). But apart from that, it doesn't hurt. If
 182 we have to loop through the peer list a second time, we accelerate the query
 183 process, by asking each peer to hold multiple shares on the second pass. In
 184 most cases, this means we'll never send more than two queries to any given
 185 peer.
 186
 187 If a peer is unreachable, or has an error, or refuses to accept any of our
 188 shares, we remove them from the permuted list, so we won't query them a
 189 second time for this file. If a peer already has shares for the file we're
 190 uploading (or if someone else is currently sending them shares), we add that
 191 information to the share-to-peer table. This lets us do less work for files
 192 which have been uploaded once before, while making sure we still wind up with
 193 as many shares as we desire.
 194
 195 If we are unable to place every share that we want, but we still managed to
 196 place a quantity known as "shares of happiness", we'll do the upload anyways.
 197 If we cannot place at least this many, the upload is declared a failure.
 198
 199 The current defaults use k=3, shares_of_happiness=7, and N=10, meaning that
 200 we'll try to place 10 shares, we'll be happy if we can place 7, and we need
 201 to get back any 3 to recover the file. This results in a 3.3x expansion
 202 factor. In general, you should set N about equal to the number of peers in
 203 your grid, then set N/k to achieve your desired availability goals.
 204
 205 When downloading a file, the current release just asks all known peers for
 206 any shares they might have, chooses the minimal necessary subset, then starts
 207 downloading and processing those shares. A later release will use the full
 208 algorithm to reduce the number of queries that must be sent out. This
 209 algorithm uses the same consistent-hashing permutation as on upload, but
 210 stops after it has located k shares (instead of all N). This reduces the
 211 number of queries that must be sent before downloading can begin.
 212
 213 The actual number of queries is directly related to the availability of the
 214 peers and the degree of overlap between the peerlist used at upload and at
 215 download. For stable grids, this overlap is very high, and usually the first
 216 k queries will result in shares. The number of queries grows as the stability
 217 decreases. Some limits may be imposed in large grids to avoid querying a
 218 million peers; this provides a tradeoff between the work spent to discover
 219 that a file is unrecoverable and the probability that a retrieval will fail
 220 when it could have succeeded if we had just tried a little bit harder. The
 221 appropriate value of this tradeoff will depend upon the size of the grid, and
 222 will change over time.
 223
 224 Other peer selection algorithms are possible. One earlier version (known as
 225 "tahoe 3") used the permutation to place the peers around a large ring,
 226 distributed shares evenly around the same ring, then walks clockwise from 0
 227 with a basket: each time we encounter a share, put it in the basket, each
 228 time we encounter a peer, give them as many shares from our basket as they'll
 229 accept. This reduced the number of queries (usually to 1) for small grids
 230 (where N is larger than the number of peers), but resulted in extremely
 231 non-uniform share distribution, which significantly hurt reliability
 232 (sometimes the permutation resulted in most of the shares being dumped on a
 233 single peer).
 234
 235 Another algorithm (known as "denver airport"[2]) uses the permuted hash to
 236 decide on an approximate target for each share, then sends lease requests via
 237 Chord routing. The request includes the contact information of the uploading
 238 node, and asks that the node which eventually accepts the lease should
 239 contact the uploader directly. The shares are then transferred over direct
 240 connections rather than through multiple Chord hops. Download uses the same
 241 approach. This allows nodes to avoid maintaining a large number of long-term
 242 connections, at the expense of complexity, latency, and reliability.
 243
 244
 245 SWARMING DOWNLOAD, TRICKLING UPLOAD
 246
 247 Because the shares being downloaded are distributed across a large number of
 248 peers, the download process will pull from many of them at the same time. The
 249 current encoding parameters require 3 shares to be retrieved for each
 250 segment, which means that up to 3 peers will be used simultaneously. For
 251 larger networks, 8-of-22 encoding could be used, meaning 8 peers can be used
 252 simultaneously. This allows the download process to use the sum of the
 253 available peers' upload bandwidths, resulting in downloads that take full
 254 advantage of the common 8x disparity between download and upload bandwith on
 255 modern ADSL lines.
 256
 257 On the other hand, uploads are hampered by the need to upload encoded shares
 258 that are larger than the original data (3.3x larger with the current default
 259 encoding parameters), through the slow end of the asymmetric connection. This
 260 means that on a typical 8x ADSL line, uploading a file will take about 32
 261 times longer than downloading it again later.
 262
 263 Smaller expansion ratios can reduce this upload penalty, at the expense of
 264 reliability. See RELIABILITY, below. By using an "upload helper", this
 265 penalty is eliminated: the client does a 1x upload of encrypted data to the
 266 helper, then the helper performs encoding and pushes the shares to the
 267 storage servers. This is an improvement if the helper has significantly
 268 higher upload bandwidth than the client, so it makes the most sense for a
 269 commercially-run grid for which all of the storage servers are in a colo
 270 facility with high interconnect bandwidth. In this case, the helper is placed
 271 in the same facility, so the helper-to-storage-server bandwidth is huge.
 272
 273 See "helper.txt" for details about the upload helper.
 274
 275
 276 THE FILESYSTEM LAYER
 277
 278 The "filesystem" layer is responsible for mapping human-meaningful
 279 pathnames (directories and filenames) to pieces of data. The actual bytes
 280 inside these files are referenced by capability, but the filesystem layer is
 281 where the directory names, file names, and metadata are kept.
 282
 283 The filesystem layer is a graph of directories. Each directory contains a table
 284 of named children. These children are either other directories or files. All
 285 children are referenced by their capability.
 286
 287 A directory has two forms of capability: read-write caps and read-only caps. The
 288 table of children inside the directory has a read-write and read-only capability
 289 for each child. If you have a read-only capability for a given directory, you will
 290 not be able to access the read-write capability of its children.  This results
 291 in "transitively read-only" directory access.
 292
 293 By having two different capabilities, you can choose which you want to share
 294 with someone else. If you create a new directory and share the read-write
 295 capability for it with a friend, then you will both be able to modify its
 296 contents. If instead you give them the read-only capability, then they will
 297 *not* be able to modify the contents. Any capability that you receive can be
 298 linked in to any directory that you can modify, so very powerful
 299 shared+published directory structures can be built from these components.
 300
 301 This structure enable individual users to have their own personal space, with
 302 links to spaces that are shared with specific other users, and other spaces
 303 that are globally visible.
 304
 305
 306 LEASES, REFRESHING, GARBAGE COLLECTION, QUOTAS
 307
 308 THIS SECTION IS OUT OF DATE.  Since we wrote this we've changed our minds about how we intend to implement these features.  Neither the old design, documented below, nor the new one, documented on the tahoe-dev mailing list and the wiki and the issue tracker, have actually been implemented yet.
 309
 310 Shares are uploaded to a storage server, but they do not necessarily stay
 311 there forever. We are anticipating three main share-lifetime management modes
 312 for Tahoe: 1) per-share leases which expire, 2) per-account timers which
 313 expire and cancel all leases for the account, and 3) centralized account
 314 management without expiration timers.
 315
 316 To be clear, none of these have been implemented yet. The
 317 http://allmydata.org/trac/tahoe/wiki/QuotaManagement "Quota Management" wiki
 318 page describes some of our plans for managing data lifetime and limited-space
 319 user accounts.
 320
 321 Multiple clients may be interested in a given share, for example if two
 322 clients uploaded the same file, or if two clients are sharing a directory and
 323 both want to make sure the files therein remain available. Consequently, each
 324 share (technically each "bucket", which may contain multiple shares for a
 325 single storage index) has a set of leases, one per client. One way to
 326 visualize this is with a large table, with shares (i.e. buckets, or storage
 327 indices, or files) as the rows, and accounts as columns. Each square of this
 328 table might hold a lease.
 329
 330 Using limited-duration leases reduces the storage consumed by clients who
 331 have (for whatever reason) forgotten about the share they once cared about.
 332 Clients are supposed to explicitly cancel leases for every file that they
 333 remove from their vdrive, and when the last lease is removed on a share, the
 334 storage server deletes that share. However, the storage server might be
 335 offline when the client deletes the file, or the client might experience a
 336 bug or a race condition that results in forgetting about the file. Using
 337 leases that expire unless otherwise renewed ensures that these lost files
 338 will not consume storage space forever. On the other hand, they require
 339 periodic maintenance, which can become prohibitively expensive for large
 340 grids. In addition, clients who go offline for a while are then obligated to
 341 get someone else to keep their files alive for them.
 342
 343
 344 In the first mode, each client holds a limited-duration lease on each share
 345 (typically one month), and clients are obligated to periodically renew these
 346 leases to keep them from expiring (typically once a week). In this mode, the
 347 storage server does not know anything about which client is which: it only
 348 knows about leases.
 349
 350 In the second mode, each server maintains a list of clients and which leases
 351 they hold. This is called the "account list", and each time a client wants to
 352 upload a share or establish a lease, it provides credentials to allow the
 353 server to know which Account it will be using. Rather than putting individual
 354 timers on each lease, the server puts a timer on the Account. When the
 355 account expires, all of the associated leases are cancelled.
 356
 357 In this mode, clients are obligated to renew the Account periodically, but
 358 not the (thousands of) individual share leases. Clients which forget about
 359 files are still incurring a storage cost for those files. An occasional
 360 reconcilliation process (in which the client presents the storage server with
 361 a list of all the files it cares about, and the server removes leases for
 362 anything that isn't on the list) can be used to free this storage, but the
 363 effort involved is large, so reconcilliation must be done very infrequently.
 364
 365 Our plan is to have the clients create their own Accounts, based upon the
 366 possession of a private key. Clients can create as many accounts as they
 367 wish, but they are responsible for their own maintenance. Servers can add up
 368 all the leases for each account and present a report of usage, in bytes per
 369 account. This is intended for friendnet scenarios where it would be nice to
 370 know how much space your friends are consuming on your disk.
 371
 372 In the third mode, the Account objects are centrally managed, and are not
 373 expired by the storage servers. In this mode, the client presents credentials
 374 that are issued by a central authority, such as a signed message which the
 375 storage server can verify. The storage used by this account is not freed
 376 unless and until the central account manager says so.
 377
 378 This mode is more appropriate for a commercial offering, in which use of the
 379 storage servers is contingent upon a monthly fee, or other membership
 380 criteria. Being able to ask the storage usage for each account (or establish
 381 limits on it) helps to enforce whatever kind of membership policy is desired.
 382
 383
 384 Each lease is created with a pair of secrets: the "renew secret" and the
 385 "cancel secret". These are just random-looking strings, derived by hashing
 386 other higher-level secrets, starting with a per-client master secret. Anyone
 387 who knows the secret is allowed to restart the expiration timer, or cancel
 388 the lease altogether. Having these be individual values allows the original
 389 uploading node to delegate these capabilities to others.
 390
 391 In the current release, clients provide lease secrets to the storage server,
 392 and each lease contains an expiration time, but there is no facility to
 393 actually expire leases, nor are there explicit owners (the "ownerid" field of
 394 each lease is always set to zero). In addition, many features have not been
 395 implemented yet: the client should claim leases on files which are added to
 396 the vdrive by linking (as opposed to uploading), and the client should cancel
 397 leases on files which are removed from the vdrive, but neither has been
 398 written yet. This means that shares are not ever deleted in this
 399 release. (Note, however, that if read-cap to a file is deleted then it will no
 400 longer be possible to decrypt that file, even if the shares which contain
 401 the erasure-coded ciphertext still exist.)
 402
 403
 404 FILE REPAIRER
 405
 406 Shares may go away because the storage server hosting them has suffered a
 407 failure: either temporary downtime (affecting availability of the file), or a
 408 permanent data loss (affecting the reliability of the file). Hard drives
 409 crash, power supplies explode, coffee spills, and asteroids strike. The goal
 410 of a robust distributed filesystem is to survive these setbacks.
 411
 412 To work against this slow, continual loss of shares, a File Checker is used
 413 to periodically count the number of shares still available for any given
 414 file. A more extensive form of checking known as the File Verifier can
 415 download the ciphertext of the target file and perform integrity checks (using
 416 strong hashes) to make sure the data is stil intact. When the file is found
 417 to have decayed below some threshold, the File Repairer can be used to
 418 regenerate and re-upload the missing shares. These processes are conceptually
 419 distinct (the repairer is only run if the checker/verifier decides it is
 420 necessary), but in practice they will be closely related, and may run in the
 421 same process.
 422
 423 The repairer process does not get the full capability of the file to be
 424 maintained: it merely gets the "repairer capability" subset, which does not
 425 include the decryption key. The File Verifier uses that data to find out
 426 which peers ought to hold shares for this file, and to see if those peers are
 427 still around and willing to provide the data. If the file is not healthy
 428 enough, the File Repairer is invoked to download the ciphertext, regenerate
 429 any missing shares, and upload them to new peers. The goal of the File
 430 Repairer is to finish up with a full set of "N" shares.
 431
 432 There are a number of engineering issues to be resolved here. The bandwidth,
 433 disk IO, and CPU time consumed by the verification/repair process must be
 434 balanced against the robustness that it provides to the grid. The nodes
 435 involved in repair will have very different access patterns than normal
 436 nodes, such that these processes may need to be run on hosts with more memory
 437 or network connectivity than usual. The frequency of repair will directly
 438 affect the resources consumed. In some cases, verification of multiple files
 439 can be performed at the same time, and repair of files can be delegated off
 440 to other nodes.
 441
 442 The security model we are currently using assumes that peers who claim to
 443 hold a share will actually provide it when asked. (We validate the data they
 444 provide before using it in any way, but if enough peers claim to hold the
 445 data and are wrong, the file will not be repaired, and may decay beyond
 446 recoverability). There are several interesting approaches to mitigate this
 447 threat, ranging from challenges to provide a keyed hash of the allegedly-held
 448 data (using "buddy nodes", in which two peers hold the same block, and check
 449 up on each other), to reputation systems, or even the original Mojo Nation
 450 economic model.
 451
 452
 453 SECURITY
 454
 455 The design goal for this project is that an attacker may be able to deny
 456 service (i.e. prevent you from recovering a file that was uploaded earlier)
 457 but can accomplish none of the following three attacks:
 458
 459  1) violate confidentiality: the attacker gets to view data to which you have
 460     not granted them access
 461  2) violate consistency: the attacker convinces you that the wrong data is
 462     actually the data you were intending to retrieve
 463  3) violate mutability: the attacker gets to modify a directory (either the
 464     pathnames or the file contents) to which you have not given them
 465     mutability rights
 466
 467 Data validity and consistency (the promise that the downloaded data will
 468 match the originally uploaded data) is provided by the hashes embedded in the
 469 capability. Data confidentiality (the promise that the data is only readable
 470 by people with the capability) is provided by the encryption key embedded in
 471 the capability. Data availability (the hope that data which has been uploaded
 472 in the past will be downloadable in the future) is provided by the grid,
 473 which distributes failures in a way that reduces the correlation between
 474 individual node failure and overall file recovery failure, and by the
 475 erasure-coding technique used to generate shares.
 476
 477 Many of these security properties depend upon the usual cryptographic
 478 assumptions: the resistance of AES and RSA to attack, the resistance of
 479 SHA256 to pre-image attacks, and upon the proximity of 2^-128 and 2^-256 to
 480 zero. A break in AES would allow a confidentiality violation, a pre-image
 481 break in SHA256 would allow a consistency violation, and a break in RSA would
 482 allow a mutability violation. The discovery of a collision in SHA256 is
 483 unlikely to allow much, but could conceivably allow a consistency violation
 484 in data that was uploaded by the attacker. If SHA256 is threatened, further
 485 analysis will be warranted.
 486
 487 There is no attempt made to provide anonymity, neither of the origin of a
 488 piece of data nor the identity of the subsequent downloaders. In general,
 489 anyone who already knows the contents of a file will be in a strong position
 490 to determine who else is uploading or downloading it. Also, it is quite easy
 491 for a sufficiently-large coalition of nodes to correlate the set of peers who
 492 are all uploading or downloading the same file, even if the attacker does not
 493 know the contents of the file in question.
 494
 495 Also note that the file size and (when convergence is being used) a keyed
 496 hash of the plaintext are not protected. Many people can determine the size
 497 of the file you are accessing, and if they already know the contents of a
 498 given file, they will be able to determine that you are uploading or
 499 downloading the same one.
 500
 501 A likely enhancement is the ability to use distinct encryption keys for each
 502 file, avoiding the file-correlation attacks at the expense of increased
 503 storage consumption. This is known as "non-convergent" encoding.
 504
 505 The capability-based security model is used throughout this project. Directory
 506 operations are expressed in terms of distinct read- and write- capabilities.
 507 Knowing the read-capability of a file is equivalent to the ability to read
 508 the corresponding data. The capability to validate the correctness of a file
 509 is strictly weaker than the read-capability (possession of read-capability
 510 automatically grants you possession of validate-capability, but not vice
 511 versa). These capabilities may be expressly delegated (irrevocably) by simply
 512 transferring the relevant secrets.
 513
 514 The application layer can provide whatever access model is desired, built on
 515 top of this capability access model.  The first big user of this system so far
 516 is allmydata.com.  The allmydata.com access model currently works like a normal
 517 web site, using username and password to give a user access to her virtual
 518 drive.  In addition, allmydata.com users can share individual files (using a
 519 file sharing interface built on top of the immutable file read capabilities).
 520
 521
 522 RELIABILITY
 523
 524 File encoding and peer selection parameters can be adjusted to achieve
 525 different goals. Each choice results in a number of properties; there are
 526 many tradeoffs.
 527
 528 First, some terms: the erasure-coding algorithm is described as K-out-of-N
 529 (for this release, the default values are K=3 and N=10). Each grid will have
 530 some number of peers; this number will rise and fall over time as peers join,
 531 drop out, come back, and leave forever. Files are of various sizes, some are
 532 popular, others are rare. Peers have various capacities, variable
 533 upload/download bandwidths, and network latency. Most of the mathematical
 534 models that look at peer failure assume some average (and independent)
 535 probability 'P' of a given peer being available: this can be high (servers
 536 tend to be online and available >90% of the time) or low (laptops tend to be
 537 turned on for an hour then disappear for several days). Files are encoded in
 538 segments of a given maximum size, which affects memory usage.
 539
 540 The ratio of N/K is the "expansion factor". Higher expansion factors improve
 541 reliability very quickly (the binomial distribution curve is very sharp), but
 542 consumes much more grid capacity. When P=50%, the absolute value of K affects
 543 the granularity of the binomial curve (1-out-of-2 is much worse than
 544 50-out-of-100), but high values asymptotically approach a constant (i.e.
 545 500-of-1000 is not much better than 50-of-100). When P is high and the
 546 expansion factor is held at a constant, higher values of K and N give much
 547 better reliability (for P=99%, 50-out-of-100 is much much better than
 548 5-of-10, roughly 10^50 times better), because there are more shares that can
 549 be lost without losing the file.
 550
 551 Likewise, the total number of peers in the network affects the same
 552 granularity: having only one peer means a single point of failure, no matter
 553 how many copies of the file you make. Independent peers (with uncorrelated
 554 failures) are necessary to hit the mathematical ideals: if you have 100 nodes
 555 but they are all in the same office building, then a single power failure
 556 will take out all of them at once. The "Sybil Attack" is where a single
 557 attacker convinces you that they are actually multiple servers, so that you
 558 think you are using a large number of independent peers, but in fact you have
 559 a single point of failure (where the attacker turns off all their machines at
 560 once). Large grids, with lots of truly-independent peers, will enable the use
 561 of lower expansion factors to achieve the same reliability, but will increase
 562 overhead because each peer needs to know something about every other, and the
 563 rate at which peers come and go will be higher (requiring network maintenance
 564 traffic). Also, the File Repairer work will increase with larger grids,
 565 although then the job can be distributed out to more peers.
 566
 567 Higher values of N increase overhead: more shares means more Merkle hashes
 568 that must be included with the data, and more peers to contact to retrieve
 569 the shares. Smaller segment sizes reduce memory usage (since each segment
 570 must be held in memory while erasure coding runs) and improves "alacrity"
 571 (since downloading can validate a smaller piece of data faster, delivering it
 572 to the target sooner), but also increase overhead (because more blocks means
 573 more Merkle hashes to validate them).
 574
 575 In general, small private grids should work well, but the participants will
 576 have to decide between storage overhead and reliability. Large stable grids
 577 will be able to reduce the expansion factor down to a bare minimum while
 578 still retaining high reliability, but large unstable grids (where nodes are
 579 coming and going very quickly) may require more repair/verification bandwidth
 580 than actual upload/download traffic.
 581
 582 Tahoe nodes that run a webserver have a page dedicated to provisioning
 583 decisions: this tool may help you evaluate different expansion factors and
 584 view the disk consumption of each. It is also acquiring some sections with
 585 availability/reliability numbers, as well as preliminary cost analysis data.
 586 This tool will continue to evolve as our analysis improves.
 587
 588 ------------------------------
 589
 590 [1]: http://en.wikipedia.org/wiki/Zooko%27s_triangle
 591
 592 [2]: all of these names are derived from the location where they were
 593      concocted, in this case in a car ride from Boulder to DEN. To be
 594      precise, "tahoe 1" was an unworkable scheme in which everyone who holds
 595      shares for a given file would form a sort of cabal which kept track of
 596      all the others, "tahoe 2" is the first-100-peers in the permuted hash
 597      described in this document, and "tahoe 3" (or perhaps "potrero hill 1")
 598      was the abandoned ring-with-many-hands approach.
 599