From da13cacd4159d8f6a16938679e8efffa4370e92b Mon Sep 17 00:00:00 2001 From: Zooko O'Whielacronx Date: Mon, 1 Feb 2010 20:51:17 -0800 Subject: [PATCH] docs: a few edits to architecture.txt, most significantly highlighting "future work" to avoid confusing it with the current version, and adding a "future work" about a random-sampling Proof of Retrievability verifier --- docs/architecture.txt | 183 ++++++++++++++++++++---------------------- 1 file changed, 85 insertions(+), 98 deletions(-) diff --git a/docs/architecture.txt b/docs/architecture.txt index c62f1c59..a0abbf97 100644 --- a/docs/architecture.txt +++ b/docs/architecture.txt @@ -187,45 +187,35 @@ factor. In general, you should set N about equal to the number of nodes in your grid, then set N/k to achieve your desired availability goals. When downloading a file, the current version just asks all known servers for -any shares they might have. and then downloads the shares from the first servers that - -chooses the minimal necessary subset, then starts -change downloading and processing those shares. A future release will use the -server selection algorithm to reduce the number of queries that must be sent -out. This algorithm uses the same consistent-hashing permutation as on upload, -but stops after it has located k shares (instead of all N). This reduces the -number of queries that must be sent before downloading can begin. - -The actual number of queries is directly related to the availability of the -nodes and the degree of overlap between the node list used at upload and at -download. For stable grids, this overlap is very high, and usually the first k -queries will result in shares. The number of queries grows as the stability -decreases. Some limits may be imposed in large grids to avoid querying a -million nodes; this provides a tradeoff between the work spent to discover -that a file is unrecoverable and the probability that a retrieval will fail -when it could have succeeded if we had just tried a little bit harder. The -appropriate value of this tradeoff will depend upon the size of the grid, and -will change over time. - -Other peer-node selection algorithms are possible. One earlier version (known -as "Tahoe 3") used the permutation to place the nodes around a large ring, -distributed shares evenly around the same ring, then walks clockwise from 0 -with a basket: each time we encounter a share, put it in the basket, each time -we encounter a node, give them as many shares from our basket as they'll -accept. This reduced the number of queries (usually to 1) for small grids -(where N is larger than the number of nodes), but resulted in extremely -non-uniform share distribution, which significantly hurt reliability -(sometimes the permutation resulted in most of the shares being dumped on a -single node). - -Another algorithm (known as "denver airport"[2]) uses the permuted hash to -decide on an approximate target for each share, then sends lease requests via -Chord routing. The request includes the contact information of the uploading -node, and asks that the node which eventually accepts the lease should contact -the uploader directly. The shares are then transferred over direct connections -rather than through multiple Chord hops. Download uses the same approach. This -allows nodes to avoid maintaining a large number of long-term connections, at -the expense of complexity, latency, and reliability. +any shares they might have. Once it has received enough responses that it knows +where to find the needed k shares, it downloads the shares from those +servers. (This means that it tends to download shares from the fastest +servers.) + + *future work* + + A future release will use the server selection algorithm to reduce the number + of queries that must be sent out. + + Other peer-node selection algorithms are possible. One earlier version (known + as "Tahoe 3") used the permutation to place the nodes around a large ring, + distributed shares evenly around the same ring, then walks clockwise from 0 + with a basket: each time we encounter a share, put it in the basket, each + time we encounter a node, give them as many shares from our basket as they'll + accept. This reduced the number of queries (usually to 1) for small grids + (where N is larger than the number of nodes), but resulted in extremely + non-uniform share distribution, which significantly hurt reliability + (sometimes the permutation resulted in most of the shares being dumped on a + single node). + + Another algorithm (known as "denver airport"[2]) uses the permuted hash to + decide on an approximate target for each share, then sends lease requests via + Chord routing. The request includes the contact information of the uploading + node, and asks that the node which eventually accepts the lease should + contact the uploader directly. The shares are then transferred over direct + connections rather than through multiple Chord hops. Download uses the same + approach. This allows nodes to avoid maintaining a large number of long-term + connections, at the expense of complexity and latency. SWARMING DOWNLOAD, TRICKLING UPLOAD @@ -247,7 +237,7 @@ means that on a typical 8x ADSL line, uploading a file will take about 32 times longer than downloading it again later. Smaller expansion ratios can reduce this upload penalty, at the expense of -reliability. See RELIABILITY, below. By using an "upload helper", this penalty +reliability (see RELIABILITY, below). By using an "upload helper", this penalty is eliminated: the client does a 1x upload of encrypted data to the helper, then the helper performs encoding and pushes the shares to the storage servers. This is an improvement if the helper has significantly higher upload @@ -291,14 +281,14 @@ that are globally visible. LEASES, REFRESHING, GARBAGE COLLECTION -When a file or directory in the virtual filesystem is no longer referenced, -the space that its shares occupied on each storage server can be freed, -making room for other shares. Tahoe currently uses a garbage collection -("GC") mechanism to implement this space-reclamation process. Each share has -one or more "leases", which are managed by clients who want the -file/directory to be retained. The storage server accepts each share for a -pre-defined period of time, and is allowed to delete the share if all of the -leases are cancelled or allowed to expire. +When a file or directory in the virtual filesystem is no longer referenced, the +space that its shares occupied on each storage server can be freed, making room +for other shares. Tahoe-LAFS uses a garbage collection ("GC") mechanism to +implement this space-reclamation process. Each share has one or more "leases", +which are managed by clients who want the file/directory to be retained. The +storage server accepts each share for a pre-defined period of time, and is +allowed to delete the share if all of the leases are cancelled or allowed to +expire. Garbage collection is not enabled by default: storage servers will not delete shares without being explicitly configured to do so. When GC is enabled, @@ -314,7 +304,7 @@ FILE REPAIRER Shares may go away because the storage server hosting them has suffered a failure: either temporary downtime (affecting availability of the file), or a -permanent data loss (affecting the reliability of the file). Hard drives +permanent data loss (affecting the preservation of the file). Hard drives crash, power supplies explode, coffee spills, and asteroids strike. The goal of a robust distributed filesystem is to survive these setbacks. @@ -347,15 +337,19 @@ the resources consumed. In some cases, verification of multiple files can be performed at the same time, and repair of files can be delegated off to other nodes. -The security model we are currently using assumes that nodes who claim to hold -a share will actually provide it when asked. (We validate the data they -provide before using it in any way, but if enough nodes claim to hold the data -and are wrong, the file will not be repaired, and may decay beyond -recoverability). There are several interesting approaches to mitigate this -threat, ranging from challenges to provide a keyed hash of the allegedly-held -data (using "buddy nodes", in which two nodes hold the same block, and check -up on each other), to reputation systems, or even the original Mojo Nation -economic model. + *future work* + + Currently there are two modes of checking on the health of your file: + "Checker" simply asks storage servers which shares they have and does nothing + to try to verify that they aren't lying. "Verifier" downloads and + cryptographically verifies every bit of every share of the file from every + server, which costs a lot of network and CPU. A future improvement would be + to make a random-sampling verifier which downloads and cryptographically + verifies only a few randomly-chosen blocks from each server. This would + require much less network and CPU but it could make it extremely unlikely + that any sort of corruption -- even malicious corruption intended to evade + detection -- would evade detection. This would be an instance of a + cryptographic notion called "Proof of Retrievability". SECURITY @@ -366,31 +360,29 @@ but can accomplish none of the following three attacks: 1) violate confidentiality: the attacker gets to view data to which you have not granted them access - 2) violate consistency: the attacker convinces you that the wrong data is + 2) violate integrity: the attacker convinces you that the wrong data is actually the data you were intending to retrieve - 3) violate mutability: the attacker gets to modify a directory (either the - pathnames or the file contents) to which you have not given them - mutability rights - -Data validity and consistency (the promise that the downloaded data will match -the originally uploaded data) is provided by the hashes embedded in the -capability. Data confidentiality (the promise that the data is only readable -by people with the capability) is provided by the encryption key embedded in -the capability. Data availability (the hope that data which has been uploaded -in the past will be downloadable in the future) is provided by the grid, which + 3) violate unforgeability: the attacker gets to modify a mutable file or + directory (either the pathnames or the file contents) to which you have + not given them write permission + +Integrity (the promise that the downloaded data will match the uploaded data) +is provided by the hashes embedded in the capability (for immutable files) or +the digital signature (for mutable files). Confidentiality (the promise that +the data is only readable by people with the capability) is provided by the +encryption key embedded in the capability (for both immutable and mutable +files). Data availability (the hope that data which has been uploaded in the +past will be downloadable in the future) is provided by the grid, which distributes failures in a way that reduces the correlation between individual node failure and overall file recovery failure, and by the erasure-coding technique used to generate shares. Many of these security properties depend upon the usual cryptographic -assumptions: the resistance of AES and RSA to attack, the resistance of -SHA-256 to pre-image attacks, and upon the proximity of 2^-128 and 2^-256 to -zero. A break in AES would allow a confidentiality violation, a pre-image -break in SHA-256 would allow a consistency violation, and a break in RSA -would allow a mutability violation. The discovery of a collision in SHA-256 -is unlikely to allow much, but could conceivably allow a consistency -violation in data that was uploaded by the attacker. If SHA-256 is -threatened, further analysis will be warranted. +assumptions: the resistance of AES and RSA to attack, the resistance of SHA-256 +to collision attacks and pre-image attacks, and upon the proximity of 2^-128 +and 2^-256 to zero. A break in AES would allow a confidentiality violation, a +collision break in SHA-256 would allow a consistency violation, and a break in +RSA would allow a mutability violation. There is no attempt made to provide anonymity, neither of the origin of a piece of data nor the identity of the subsequent downloaders. In general, @@ -406,10 +398,6 @@ file you are accessing, and if they already know the contents of a given file, they will be able to determine that you are uploading or downloading the same one. -A likely enhancement is the ability to use distinct encryption keys for each -file, avoiding the file-correlation attacks at the expense of increased -storage consumption. This is known as "non-convergent" encoding. - The capability-based security model is used throughout this project. Directory operations are expressed in terms of distinct read- and write- capabilities. Knowing the read-capability of a file is equivalent to the ability to read the @@ -421,11 +409,10 @@ transferring the relevant secrets. The application layer can provide whatever access model is desired, built on top of this capability access model. The first big user of this system so far -is allmydata.com. The allmydata.com access model currently works like a -normal web site, using username and password to give a user access to her -virtual drive. In addition, allmydata.com users can share individual files -(using a file sharing interface built on top of the immutable file read -capabilities). +is allmydata.com. The allmydata.com access model currently works like a normal +web site, using username and password to give a user access to her "virtual +drive". In addition, allmydata.com users can share individual files (using a +file sharing interface built on top of the immutable file read capabilities). RELIABILITY @@ -434,17 +421,17 @@ File encoding and peer-node selection parameters can be adjusted to achieve different goals. Each choice results in a number of properties; there are many tradeoffs. -First, some terms: the erasure-coding algorithm is described as K-out-of-N -(for this release, the default values are K=3 and N=10). Each grid will have -some number of nodes; this number will rise and fall over time as nodes join, -drop out, come back, and leave forever. Files are of various sizes, some are -popular, others are rare. Nodes have various capacities, variable +First, some terms: the erasure-coding algorithm is described as K-out-of-N (for +this release, the default values are K=3 and N=10). Each grid will have some +number of nodes; this number will rise and fall over time as nodes join, drop +out, come back, and leave forever. Files are of various sizes, some are +popular, others are unpopular. Nodes have various capacities, variable upload/download bandwidths, and network latency. Most of the mathematical models that look at node failure assume some average (and independent) -probability 'P' of a given node being available: this can be high (servers -tend to be online and available >90% of the time) or low (laptops tend to be -turned on for an hour then disappear for several days). Files are encoded in -segments of a given maximum size, which affects memory usage. +probability 'P' of a given node being available: this can be high (servers tend +to be online and available >90% of the time) or low (laptops tend to be turned +on for an hour then disappear for several days). Files are encoded in segments +of a given maximum size, which affects memory usage. The ratio of N/K is the "expansion factor". Higher expansion factors improve reliability very quickly (the binomial distribution curve is very sharp), but @@ -488,9 +475,9 @@ retaining high reliability, but large unstable grids (where nodes are coming and going very quickly) may require more repair/verification bandwidth than actual upload/download traffic. -Tahoe nodes that run a webserver have a page dedicated to provisioning -decisions: this tool may help you evaluate different expansion factors and -view the disk consumption of each. It is also acquiring some sections with +Tahoe-LAFS nodes that run a webserver have a page dedicated to provisioning +decisions: this tool may help you evaluate different expansion factors and view +the disk consumption of each. It is also acquiring some sections with availability/reliability numbers, as well as preliminary cost analysis data. This tool will continue to evolve as our analysis improves. -- 2.45.2