From: Brian Warner Date: Tue, 2 Feb 2010 05:53:04 +0000 (-0800) Subject: architecture.txt: remove trailing whitespace, wrap lines: no content changes X-Git-Tag: allmydata-tahoe-1.6.0~2 X-Git-Url: https://git.rkrishnan.org/listings/reliability?a=commitdiff_plain;h=479492b1a97291212a7655bd10cff6c75e37d893;p=tahoe-lafs%2Ftahoe-lafs.git architecture.txt: remove trailing whitespace, wrap lines: no content changes --- diff --git a/docs/architecture.txt b/docs/architecture.txt index c1cc2153..8a70609b 100644 --- a/docs/architecture.txt +++ b/docs/architecture.txt @@ -13,33 +13,33 @@ ascii strings -- and the values are sequences of data bytes. This data is encrypted and distributed across a number of nodes, such that it will survive the loss of most of the nodes. There are no hard limits on the size of the values, but there may be performance issues with extremely large values (just -due to the limitation of network bandwidth). In practice, values as small as a -few bytes and as large as tens of gigabytes are in common use. +due to the limitation of network bandwidth). In practice, values as small as +a few bytes and as large as tens of gigabytes are in common use. The middle layer is the decentralized filesystem: a directed graph in which the intermediate nodes are directories and the leaf nodes are files. The leaf nodes contain only the data -- they contain no metadata other than the length -in bytes. The edges leading to leaf nodes have metadata attached to them -about the file they point to. Therefore, the same file may be associated with +in bytes. The edges leading to leaf nodes have metadata attached to them +about the file they point to. Therefore, the same file may be associated with different metadata if it is referred to through different edges. The top layer consists of the applications using the filesystem. Allmydata.com uses it for a backup service: the application periodically -copies files from the local disk onto the decentralized filesystem. We later +copies files from the local disk onto the decentralized filesystem. We later provide read-only access to those files, allowing users to recover them. -There are several other applications built on top of the Tahoe-LAFS filesystem -(see the RelatedProjects page of the wiki for a list). +There are several other applications built on top of the Tahoe-LAFS +filesystem (see the RelatedProjects page of the wiki for a list). THE KEY-VALUE STORE The key-value store is implemented by a grid of Tahoe-LAFS storage servers -- -user-space processes. Tahoe-LAFS storage clients communicate with the storage +user-space processes. Tahoe-LAFS storage clients communicate with the storage servers over TCP. -Storage servers hold data in the form of "shares". Shares are encoded pieces -of files. There are a configurable number of shares for each file, 10 by -default. Normally, each share is stored on a separate server, but in some +Storage servers hold data in the form of "shares". Shares are encoded pieces +of files. There are a configurable number of shares for each file, 10 by +default. Normally, each share is stored on a separate server, but in some cases a single server can hold multiple shares of a file. Nodes learn about each other through an "introducer". Each server connects to @@ -48,7 +48,7 @@ the introducer at startup, and receives a list of all servers from it. Each client then connects to every server, creating a "bi-clique" topology. In the current release, nodes behind NAT boxes will connect to all nodes that they can open connections to, but they cannot open connections to other nodes -behind NAT boxes. Therefore, the more nodes behind NAT boxes, the less the +behind NAT boxes. Therefore, the more nodes behind NAT boxes, the less the topology resembles the intended bi-clique topology. The introducer is a Single Point of Failure ("SPoF"), in that clients who @@ -68,36 +68,36 @@ FILE ENCODING When a client stores a file on the grid, it first encrypts the file. It then breaks the encrypted file into small segments, in order to reduce the memory -footprint, and to decrease the lag between initiating a download and receiving -the first part of the file; for example the lag between hitting "play" and a -movie actually starting. +footprint, and to decrease the lag between initiating a download and +receiving the first part of the file; for example the lag between hitting +"play" and a movie actually starting. The client then erasure-codes each segment, producing blocks of which only a subset are needed to reconstruct the segment (3 out of 10, with the default settings). -It sends one block from each segment to a given server. The set of blocks on a -given server constitutes a "share". Therefore a subset f the shares (3 out of 10, -by default) are needed to reconstruct the file. +It sends one block from each segment to a given server. The set of blocks on +a given server constitutes a "share". Therefore a subset f the shares (3 out +of 10, by default) are needed to reconstruct the file. -A hash of the encryption key is used to form the "storage index", which is used -for both server selection (described below) and to index shares within the -Storage Servers on the selected nodes. +A hash of the encryption key is used to form the "storage index", which is +used for both server selection (described below) and to index shares within +the Storage Servers on the selected nodes. -The client computes secure hashes of the ciphertext and of the shares. It uses -Merkle Trees so that it is possible to verify the correctness of a subset of -the data without requiring all of the data. For example, this allows you to -verify the correctness of the first segment of a movie file and then begin -playing the movie file in your movie viewer before the entire movie file has -been downloaded. +The client computes secure hashes of the ciphertext and of the shares. It +uses Merkle Trees so that it is possible to verify the correctness of a +subset of the data without requiring all of the data. For example, this +allows you to verify the correctness of the first segment of a movie file and +then begin playing the movie file in your movie viewer before the entire +movie file has been downloaded. These hashes are stored in a small datastructure named the Capability Extension Block which is stored on the storage servers alongside each share. The capability contains the encryption key, the hash of the Capability -Extension Block, and any encoding parameters necessary to perform the eventual -decoding process. For convenience, it also contains the size of the file -being stored. +Extension Block, and any encoding parameters necessary to perform the +eventual decoding process. For convenience, it also contains the size of the +file being stored. To download, the client that wishes to turn a capability into a sequence of bytes will obtain the blocks from storage servers, use erasure-decoding to @@ -107,7 +107,7 @@ into plaintext, then emit the plaintext bytes to the output target. CAPABILITIES -Capabilities to immutable files represent a specific set of bytes. Think of +Capabilities to immutable files represent a specific set of bytes. Think of it like a hash function: you feed in a bunch of bytes, and you get out a capability, which is deterministically derived from the input data: changing even one bit of the input data will result in a completely different @@ -129,12 +129,12 @@ The capability provides both "location" and "identification": you can use it to retrieve a set of bytes, and then you can use it to validate ("identify") that these potential bytes are indeed the ones that you were looking for. -The "key-value store" layer doesn't include human-meaningful -names. Capabilities sit on the "global+secure" edge of Zooko's -Triangle[1]. They are self-authenticating, meaning that nobody can trick you -into accepting a file that doesn't match the capability you used to refer to -that file. The filesystem layer (described below) adds human-meaningful names -atop the key-value layer. +The "key-value store" layer doesn't include human-meaningful names. +Capabilities sit on the "global+secure" edge of Zooko's Triangle[1]. They are +self-authenticating, meaning that nobody can trick you into accepting a file +that doesn't match the capability you used to refer to that file. The +filesystem layer (described below) adds human-meaningful names atop the +key-value layer. SERVER SELECTION @@ -149,18 +149,18 @@ HASH(storage_index+peerid)). Each file gets a different permutation, which We use this permuted list of nodes to ask each node, in turn, if it will hold a share for us, by sending an 'allocate_buckets() query' to each one. Some -will say yes, others (those who are full) will say no: when a node refuses our -request, we just take that share to the next node on the list. We keep going -until we run out of shares to place. At the end of the process, we'll have a -table that maps each share number to a node, and then we can begin the +will say yes, others (those who are full) will say no: when a node refuses +our request, we just take that share to the next node on the list. We keep +going until we run out of shares to place. At the end of the process, we'll +have a table that maps each share number to a node, and then we can begin the encode+push phase, using the table to decide where each share should be sent. Most of the time, this will result in one share per node, which gives us maximum reliability (since it disperses the failures as widely as possible). If there are fewer useable nodes than there are shares, we'll be forced to loop around, eventually giving multiple shares to a single node. This reduces -reliability, so it isn't the sort of thing we want to happen all the time, and -either indicates that the default encoding parameters are set incorrectly +reliability, so it isn't the sort of thing we want to happen all the time, +and either indicates that the default encoding parameters are set incorrectly (creating more shares than you have nodes), or that the grid does not have enough space (many nodes are full). But apart from that, it doesn't hurt. If we have to loop through the node list a second time, we accelerate the query @@ -169,62 +169,62 @@ most cases, this means we'll never send more than two queries to any given node. If a node is unreachable, or has an error, or refuses to accept any of our -shares, we remove them from the permuted list, so we won't query them a second -time for this file. If a node already has shares for the file we're uploading -(or if someone else is currently sending them shares), we add that information -to the share-to-peer-node table. This lets us do less work for files which have -been uploaded once before, while making sure we still wind up with as many -shares as we desire. +shares, we remove them from the permuted list, so we won't query them a +second time for this file. If a node already has shares for the file we're +uploading (or if someone else is currently sending them shares), we add that +information to the share-to-peer-node table. This lets us do less work for +files which have been uploaded once before, while making sure we still wind +up with as many shares as we desire. If we are unable to place every share that we want, but we still managed to place a quantity known as "shares of happiness", we'll do the upload anyways. If we cannot place at least this many, the upload is declared a failure. The current defaults use k=3, shares_of_happiness=7, and N=10, meaning that -we'll try to place 10 shares, we'll be happy if we can place 7, and we need to -get back any 3 to recover the file. This results in a 3.3x expansion +we'll try to place 10 shares, we'll be happy if we can place 7, and we need +to get back any 3 to recover the file. This results in a 3.3x expansion factor. In general, you should set N about equal to the number of nodes in your grid, then set N/k to achieve your desired availability goals. When downloading a file, the current version just asks all known servers for -any shares they might have. Once it has received enough responses that it knows -where to find the needed k shares, it downloads the shares from those +any shares they might have. Once it has received enough responses that it +knows where to find the needed k shares, it downloads the shares from those servers. (This means that it tends to download shares from the fastest servers.) *future work* - A future release will use the server selection algorithm to reduce the number - of queries that must be sent out. + A future release will use the server selection algorithm to reduce the + number of queries that must be sent out. - Other peer-node selection algorithms are possible. One earlier version (known - as "Tahoe 3") used the permutation to place the nodes around a large ring, - distributed the shares evenly around the same ring, then walked clockwise from 0 - with a basket. Each time it encountered a share, it put it in the basket, each - time it encountered a server, give it as many shares from the basket as they'd - accept. This reduced the number of queries (usually to 1) for small grids - (where N is larger than the number of nodes), but resulted in extremely - non-uniform share distribution, which significantly hurt reliability - (sometimes the permutation resulted in most of the shares being dumped on a - single node). + Other peer-node selection algorithms are possible. One earlier version + (known as "Tahoe 3") used the permutation to place the nodes around a large + ring, distributed the shares evenly around the same ring, then walked + clockwise from 0 with a basket. Each time it encountered a share, it put it + in the basket, each time it encountered a server, give it as many shares + from the basket as they'd accept. This reduced the number of queries + (usually to 1) for small grids (where N is larger than the number of + nodes), but resulted in extremely non-uniform share distribution, which + significantly hurt reliability (sometimes the permutation resulted in most + of the shares being dumped on a single node). Another algorithm (known as "denver airport"[2]) uses the permuted hash to - decide on an approximate target for each share, then sends lease requests via - Chord routing. The request includes the contact information of the uploading - node, and asks that the node which eventually accepts the lease should - contact the uploader directly. The shares are then transferred over direct - connections rather than through multiple Chord hops. Download uses the same - approach. This allows nodes to avoid maintaining a large number of long-term - connections, at the expense of complexity and latency. + decide on an approximate target for each share, then sends lease requests + via Chord routing. The request includes the contact information of the + uploading node, and asks that the node which eventually accepts the lease + should contact the uploader directly. The shares are then transferred over + direct connections rather than through multiple Chord hops. Download uses + the same approach. This allows nodes to avoid maintaining a large number of + long-term connections, at the expense of complexity and latency. SWARMING DOWNLOAD, TRICKLING UPLOAD Because the shares being downloaded are distributed across a large number of nodes, the download process will pull from many of them at the same time. The -current encoding parameters require 3 shares to be retrieved for each segment, -which means that up to 3 nodes will be used simultaneously. For larger -networks, 8-of-22 encoding could be used, meaning 8 nodes can be used +current encoding parameters require 3 shares to be retrieved for each +segment, which means that up to 3 nodes will be used simultaneously. For +larger networks, 8-of-22 encoding could be used, meaning 8 nodes can be used simultaneously. This allows the download process to use the sum of the available nodes' upload bandwidths, resulting in downloads that take full advantage of the common 8x disparity between download and upload bandwith on @@ -237,14 +237,14 @@ means that on a typical 8x ADSL line, uploading a file will take about 32 times longer than downloading it again later. Smaller expansion ratios can reduce this upload penalty, at the expense of -reliability (see RELIABILITY, below). By using an "upload helper", this penalty -is eliminated: the client does a 1x upload of encrypted data to the helper, -then the helper performs encoding and pushes the shares to the storage -servers. This is an improvement if the helper has significantly higher upload -bandwidth than the client, so it makes the most sense for a commercially-run -grid for which all of the storage servers are in a colo facility with high -interconnect bandwidth. In this case, the helper is placed in the same -facility, so the helper-to-storage-server bandwidth is huge. +reliability (see RELIABILITY, below). By using an "upload helper", this +penalty is eliminated: the client does a 1x upload of encrypted data to the +helper, then the helper performs encoding and pushes the shares to the +storage servers. This is an improvement if the helper has significantly +higher upload bandwidth than the client, so it makes the most sense for a +commercially-run grid for which all of the storage servers are in a colo +facility with high interconnect bandwidth. In this case, the helper is placed +in the same facility, so the helper-to-storage-server bandwidth is huge. See "helper.txt" for details about the upload helper. @@ -260,11 +260,11 @@ The filesystem layer is a graph of directories. Each directory contains a table of named children. These children are either other directories or files. All children are referenced by their capability. -A directory has two forms of capability: read-write caps and read-only -caps. The table of children inside the directory has a read-write and -read-only capability for each child. If you have a read-only capability for a -given directory, you will not be able to access the read-write capability of -its children. This results in "transitively read-only" directory access. +A directory has two forms of capability: read-write caps and read-only caps. +The table of children inside the directory has a read-write and read-only +capability for each child. If you have a read-only capability for a given +directory, you will not be able to access the read-write capability of its +children. This results in "transitively read-only" directory access. By having two different capabilities, you can choose which you want to share with someone else. If you create a new directory and share the read-write @@ -281,14 +281,14 @@ that are globally visible. LEASES, REFRESHING, GARBAGE COLLECTION -When a file or directory in the virtual filesystem is no longer referenced, the -space that its shares occupied on each storage server can be freed, making room -for other shares. Tahoe-LAFS uses a garbage collection ("GC") mechanism to -implement this space-reclamation process. Each share has one or more "leases", -which are managed by clients who want the file/directory to be retained. The -storage server accepts each share for a pre-defined period of time, and is -allowed to delete the share if all of the leases are cancelled or allowed to -expire. +When a file or directory in the virtual filesystem is no longer referenced, +the space that its shares occupied on each storage server can be freed, +making room for other shares. Tahoe-LAFS uses a garbage collection ("GC") +mechanism to implement this space-reclamation process. Each share has one or +more "leases", which are managed by clients who want the file/directory to be +retained. The storage server accepts each share for a pre-defined period of +time, and is allowed to delete the share if all of the leases are cancelled +or allowed to expire. Garbage collection is not enabled by default: storage servers will not delete shares without being explicitly configured to do so. When GC is enabled, @@ -308,40 +308,41 @@ permanent data loss (affecting the preservation of the file). Hard drives crash, power supplies explode, coffee spills, and asteroids strike. The goal of a robust distributed filesystem is to survive these setbacks. -To work against this slow, continual loss of shares, a File Checker is used to -periodically count the number of shares still available for any given file. A -more extensive form of checking known as the File Verifier can download the -ciphertext of the target file and perform integrity checks (using strong -hashes) to make sure the data is stil intact. When the file is found to have -decayed below some threshold, the File Repairer can be used to regenerate and -re-upload the missing shares. These processes are conceptually distinct (the -repairer is only run if the checker/verifier decides it is necessary), but in -practice they will be closely related, and may run in the same process. +To work against this slow, continual loss of shares, a File Checker is used +to periodically count the number of shares still available for any given +file. A more extensive form of checking known as the File Verifier can +download the ciphertext of the target file and perform integrity checks +(using strong hashes) to make sure the data is stil intact. When the file is +found to have decayed below some threshold, the File Repairer can be used to +regenerate and re-upload the missing shares. These processes are conceptually +distinct (the repairer is only run if the checker/verifier decides it is +necessary), but in practice they will be closely related, and may run in the +same process. The repairer process does not get the full capability of the file to be maintained: it merely gets the "repairer capability" subset, which does not -include the decryption key. The File Verifier uses that data to find out which -nodes ought to hold shares for this file, and to see if those nodes are still -around and willing to provide the data. If the file is not healthy enough, the -File Repairer is invoked to download the ciphertext, regenerate any missing -shares, and upload them to new nodes. The goal of the File Repairer is to -finish up with a full set of "N" shares. +include the decryption key. The File Verifier uses that data to find out +which nodes ought to hold shares for this file, and to see if those nodes are +still around and willing to provide the data. If the file is not healthy +enough, the File Repairer is invoked to download the ciphertext, regenerate +any missing shares, and upload them to new nodes. The goal of the File +Repairer is to finish up with a full set of "N" shares. There are a number of engineering issues to be resolved here. The bandwidth, disk IO, and CPU time consumed by the verification/repair process must be balanced against the robustness that it provides to the grid. The nodes -involved in repair will have very different access patterns than normal nodes, -such that these processes may need to be run on hosts with more memory or -network connectivity than usual. The frequency of repair will directly affect -the resources consumed. In some cases, verification of multiple files can be -performed at the same time, and repair of files can be delegated off to other -nodes. +involved in repair will have very different access patterns than normal +nodes, such that these processes may need to be run on hosts with more memory +or network connectivity than usual. The frequency of repair will directly +affect the resources consumed. In some cases, verification of multiple files +can be performed at the same time, and repair of files can be delegated off +to other nodes. *future work* Currently there are two modes of checking on the health of your file: - "Checker" simply asks storage servers which shares they have and does nothing - to try to verify that they aren't lying. "Verifier" downloads and + "Checker" simply asks storage servers which shares they have and does + nothing to try to verify that they aren't lying. "Verifier" downloads and cryptographically verifies every bit of every share of the file from every server, which costs a lot of network and CPU. A future improvement would be to make a random-sampling verifier which downloads and cryptographically @@ -362,8 +363,8 @@ but can accomplish none of the following three attacks: not granted them access 2) violate integrity: the attacker convinces you that the wrong data is actually the data you were intending to retrieve - 3) violate unforgeability: the attacker gets to modify a mutable file or - directory (either the pathnames or the file contents) to which you have + 3) violate unforgeability: the attacker gets to modify a mutable file or + directory (either the pathnames or the file contents) to which you have not given them write permission Integrity (the promise that the downloaded data will match the uploaded data) @@ -378,11 +379,11 @@ node failure and overall file recovery failure, and by the erasure-coding technique used to generate shares. Many of these security properties depend upon the usual cryptographic -assumptions: the resistance of AES and RSA to attack, the resistance of SHA-256 -to collision attacks and pre-image attacks, and upon the proximity of 2^-128 -and 2^-256 to zero. A break in AES would allow a confidentiality violation, a -collision break in SHA-256 would allow a consistency violation, and a break in -RSA would allow a mutability violation. +assumptions: the resistance of AES and RSA to attack, the resistance of +SHA-256 to collision attacks and pre-image attacks, and upon the proximity of +2^-128 and 2^-256 to zero. A break in AES would allow a confidentiality +violation, a collision break in SHA-256 would allow a consistency violation, +and a break in RSA would allow a mutability violation. There is no attempt made to provide anonymity, neither of the origin of a piece of data nor the identity of the subsequent downloaders. In general, @@ -392,46 +393,47 @@ for a sufficiently large coalition of nodes to correlate the set of nodes who are all uploading or downloading the same file, even if the attacker does not know the contents of the file in question. -Also note that the file size and (when convergence is being used) a keyed hash -of the plaintext are not protected. Many people can determine the size of the -file you are accessing, and if they already know the contents of a given file, -they will be able to determine that you are uploading or downloading the same -one. - -The capability-based security model is used throughout this project. Directory -operations are expressed in terms of distinct read- and write- capabilities. -Knowing the read-capability of a file is equivalent to the ability to read the -corresponding data. The capability to validate the correctness of a file is -strictly weaker than the read-capability (possession of read-capability -automatically grants you possession of validate-capability, but not vice -versa). These capabilities may be expressly delegated (irrevocably) by simply -transferring the relevant secrets. +Also note that the file size and (when convergence is being used) a keyed +hash of the plaintext are not protected. Many people can determine the size +of the file you are accessing, and if they already know the contents of a +given file, they will be able to determine that you are uploading or +downloading the same one. + +The capability-based security model is used throughout this project. +Directory operations are expressed in terms of distinct read- and write- +capabilities. Knowing the read-capability of a file is equivalent to the +ability to read the corresponding data. The capability to validate the +correctness of a file is strictly weaker than the read-capability (possession +of read-capability automatically grants you possession of +validate-capability, but not vice versa). These capabilities may be expressly +delegated (irrevocably) by simply transferring the relevant secrets. The application layer can provide whatever access model is desired, built on -top of this capability access model. The first big user of this system so far -is allmydata.com. The allmydata.com access model currently works like a normal -web site, using username and password to give a user access to her "virtual -drive". In addition, allmydata.com users can share individual files (using a -file sharing interface built on top of the immutable file read capabilities). +top of this capability access model. The first big user of this system so far +is allmydata.com. The allmydata.com access model currently works like a +normal web site, using username and password to give a user access to her +"virtual drive". In addition, allmydata.com users can share individual files +(using a file sharing interface built on top of the immutable file read +capabilities). RELIABILITY File encoding and peer-node selection parameters can be adjusted to achieve -different goals. Each choice results in a number of properties; there are many -tradeoffs. +different goals. Each choice results in a number of properties; there are +many tradeoffs. -First, some terms: the erasure-coding algorithm is described as K-out-of-N (for -this release, the default values are K=3 and N=10). Each grid will have some -number of nodes; this number will rise and fall over time as nodes join, drop -out, come back, and leave forever. Files are of various sizes, some are +First, some terms: the erasure-coding algorithm is described as K-out-of-N +(for this release, the default values are K=3 and N=10). Each grid will have +some number of nodes; this number will rise and fall over time as nodes join, +drop out, come back, and leave forever. Files are of various sizes, some are popular, others are unpopular. Nodes have various capacities, variable upload/download bandwidths, and network latency. Most of the mathematical models that look at node failure assume some average (and independent) -probability 'P' of a given node being available: this can be high (servers tend -to be online and available >90% of the time) or low (laptops tend to be turned -on for an hour then disappear for several days). Files are encoded in segments -of a given maximum size, which affects memory usage. +probability 'P' of a given node being available: this can be high (servers +tend to be online and available >90% of the time) or low (laptops tend to be +turned on for an hour then disappear for several days). Files are encoded in +segments of a given maximum size, which affects memory usage. The ratio of N/K is the "expansion factor". Higher expansion factors improve reliability very quickly (the binomial distribution curve is very sharp), but @@ -440,19 +442,19 @@ the granularity of the binomial curve (1-out-of-2 is much worse than 50-out-of-100), but high values asymptotically approach a constant (i.e. 500-of-1000 is not much better than 50-of-100). When P is high and the expansion factor is held at a constant, higher values of K and N give much -better reliability (for P=99%, 50-out-of-100 is much much better than 5-of-10, -roughly 10^50 times better), because there are more shares that can be lost -without losing the file. +better reliability (for P=99%, 50-out-of-100 is much much better than +5-of-10, roughly 10^50 times better), because there are more shares that can +be lost without losing the file. Likewise, the total number of nodes in the network affects the same granularity: having only one node means a single point of failure, no matter how many copies of the file you make. Independent nodes (with uncorrelated failures) are necessary to hit the mathematical ideals: if you have 100 nodes -but they are all in the same office building, then a single power failure will -take out all of them at once. The "Sybil Attack" is where a single attacker -convinces you that they are actually multiple servers, so that you think you -are using a large number of independent nodes, but in fact you have a single -point of failure (where the attacker turns off all their machines at +but they are all in the same office building, then a single power failure +will take out all of them at once. The "Sybil Attack" is where a single +attacker convinces you that they are actually multiple servers, so that you +think you are using a large number of independent nodes, but in fact you have +a single point of failure (where the attacker turns off all their machines at once). Large grids, with lots of truly independent nodes, will enable the use of lower expansion factors to achieve the same reliability, but will increase overhead because each node needs to know something about every other, and the @@ -461,23 +463,23 @@ traffic). Also, the File Repairer work will increase with larger grids, although then the job can be distributed out to more nodes. Higher values of N increase overhead: more shares means more Merkle hashes -that must be included with the data, and more nodes to contact to retrieve the -shares. Smaller segment sizes reduce memory usage (since each segment must be -held in memory while erasure coding runs) and improves "alacrity" (since -downloading can validate a smaller piece of data faster, delivering it to the -target sooner), but also increase overhead (because more blocks means more -Merkle hashes to validate them). +that must be included with the data, and more nodes to contact to retrieve +the shares. Smaller segment sizes reduce memory usage (since each segment +must be held in memory while erasure coding runs) and improves "alacrity" +(since downloading can validate a smaller piece of data faster, delivering it +to the target sooner), but also increase overhead (because more blocks means +more Merkle hashes to validate them). In general, small private grids should work well, but the participants will have to decide between storage overhead and reliability. Large stable grids -will be able to reduce the expansion factor down to a bare minimum while still -retaining high reliability, but large unstable grids (where nodes are coming -and going very quickly) may require more repair/verification bandwidth than -actual upload/download traffic. +will be able to reduce the expansion factor down to a bare minimum while +still retaining high reliability, but large unstable grids (where nodes are +coming and going very quickly) may require more repair/verification bandwidth +than actual upload/download traffic. Tahoe-LAFS nodes that run a webserver have a page dedicated to provisioning -decisions: this tool may help you evaluate different expansion factors and view -the disk consumption of each. It is also acquiring some sections with +decisions: this tool may help you evaluate different expansion factors and +view the disk consumption of each. It is also acquiring some sections with availability/reliability numbers, as well as preliminary cost analysis data. This tool will continue to evolve as our analysis improves.