.. -*- coding: utf-8-with-signature -*-

=======================
Tahoe-LAFS Architecture
=======================

Overview
========

(See the `docs/specifications directory`_ for more details.)
There are three layers: the key-value store, the filesystem, and the
application.
The lowest layer is the key-value store. The keys are "capabilities" -- short
ASCII strings -- and the values are sequences of data bytes. This data is
encrypted and distributed across a number of nodes, such that it will survive
the loss of most of the nodes. There are no hard limits on the size of the
values, but there may be performance issues with extremely large values (just
different metadata if it is referred to through different edges.
The top layer consists of the applications using the filesystem.
Allmydata.com used it for a backup service: the application periodically
copied files from the local disk onto the decentralized filesystem, and later
provided read-only access to those files, allowing users to recover them.
There are several other applications built on top of the Tahoe-LAFS
filesystem (see the RelatedProjects_ page of the wiki for a list).

.. _docs/specifications directory: specifications
.. _RelatedProjects: https://tahoe-lafs.org/trac/tahoe-lafs/wiki/RelatedProjects


The Key-Value Store
===================

the Storage Servers on the selected nodes.
The client computes secure hashes of the ciphertext and of the shares. It
uses `Merkle Trees`_ so that it is possible to verify the correctness of a
subset of the data without requiring all of the data. For example, this
allows you to verify the correctness of the first segment of a movie file and
then begin playing the movie file in your movie viewer before the entire
turn them into segments of ciphertext, use the decryption key to convert that
into plaintext, then emit the plaintext bytes to the output target.

.. _`Merkle Trees`: http://systems.cs.colorado.edu/grunwald/Classes/Fall2003-InformationStorage/Papers/merkle-tree.pdf

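The subset-verification property can be illustrated with a toy hash tree.
This is only a sketch of the idea (Tahoe's real hash trees define their own
layout and tagged hashes), using SHA-256 over an in-memory list of segments:

.. code-block:: python

    import hashlib

    def _h(data):
        return hashlib.sha256(data).digest()

    def build_tree(segments):
        """Return the tree as a list of levels, leaf hashes first, root last."""
        level = [_h(s) for s in segments]
        levels = [level]
        while len(level) > 1:
            if len(level) % 2:              # duplicate the last hash on odd levels
                level = level + [level[-1]]
            level = [_h(level[i] + level[i + 1])
                     for i in range(0, len(level), 2)]
            levels.append(level)
        return levels

    def audit_path(levels, index):
        """Collect the sibling hashes needed to check one leaf against the root."""
        path = []
        for level in levels[:-1]:
            if len(level) % 2:
                level = level + [level[-1]]
            path.append(level[index ^ 1])   # sibling at this level
            index //= 2
        return path

    def verify(segment, index, path, root):
        """Recompute the root from one segment plus its audit path."""
        node = _h(segment)
        for sibling in path:
            node = _h(node + sibling) if index % 2 == 0 else _h(sibling + node)
            index //= 2
        return node == root

A downloader holding only the root hash can validate any one segment after
fetching a logarithmic number of sibling hashes, which is what lets playback
begin before the whole file has arrived.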

Capabilities
============

filesystem layer (described below) adds human-meaningful names atop the
key-value layer.
.. _`Zooko's Triangle`: https://en.wikipedia.org/wiki/Zooko%27s_triangle


Server Selection
================

remove any servers that cannot hold an encoded share for our file. Then we ask
some of the servers thus removed if they are already holding any encoded shares
for our file; we use this information later. (We ask any servers which are in
the first ``2*N`` elements of the permuted list.)
We then use the permuted list of servers to ask each server, in turn, if it
will hold a share for us (a share that was not reported as being already
at the end of the upload process, the appropriate upload health check fails,
the upload is considered a failure.
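The per-file permutation can be sketched in a few lines. The hash construction
here is illustrative only (the real implementation defines its own tagged hash
of the server id and storage index), but it shows the key property: every file
gets its own stable ordering of the same server list:

.. code-block:: python

    import hashlib

    def permuted_servers(storage_index, server_ids):
        """Sort servers by a hash of (server id, storage index); each file's
        storage index yields its own deterministic ordering of the grid."""
        return sorted(server_ids,
                      key=lambda sid: hashlib.sha256(sid + storage_index).digest())

An uploader walks this list in order, asking each server to hold a share; a
downloader computing the same permutation will query the most likely
share-holders first.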

The current defaults use ``k`` = 3, ``servers_of_happiness`` = 7, and ``N`` = 10.
``N`` = 10 means that we'll try to place 10 shares. ``k`` = 3 means that we need
any three shares to recover the file. ``servers_of_happiness`` = 7 means that
we'll consider an immutable file upload to be successful if we can place shares
on enough servers that there are 7 different servers, the correct functioning
of any ``k`` of which guarantees the availability of the immutable file.
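These three parameters correspond to the encoding settings in the ``[client]``
section of each node's ``tahoe.cfg``, shown here with the default values just
described::

    [client]
    shares.needed = 3
    shares.happy = 7
    shares.total = 10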

``N`` = 10 and ``k`` = 3 means there is a 3.3x expansion factor. On a small
grid, you should set ``N`` about equal to the number of storage servers in your
grid; on a large grid, you might set it to something smaller to avoid the
overhead of contacting every server to place a file. In either case, you should
then set ``k`` such that ``N``/``k`` reflects your desired availability goals.
The best value for ``servers_of_happiness`` will depend on how you use
Tahoe-LAFS. In a friendnet with a variable number of servers, it might make
sense to set it to the smallest number of servers that you expect to have
online and accepting shares at any given time. In a stable environment without
much server churn, it may make sense to set ``servers_of_happiness`` = ``N``.

When downloading a file, the current version just asks all known servers for
any shares they might have. Once it has received enough responses that it
clockwise from 0 with a basket. Each time it encountered a share, it put it
in the basket; each time it encountered a server, it gave that server as many
shares from the basket as it would accept. This reduced the number of queries
 (usually to 1) for small grids (where ``N`` is larger than the number of
nodes), but resulted in extremely non-uniform share distribution, which
significantly hurt reliability (sometimes the permutation resulted in most
of the shares being dumped on a single node).
times longer than downloading it again later.
Smaller expansion ratios can reduce this upload penalty, at the expense of
reliability (see `Reliability`_, below). By using an "upload helper", this
penalty is eliminated: the client does a 1x upload of encrypted data to the
helper, then the helper performs encoding and pushes the shares to the
storage servers. This is an improvement if the helper has significantly
facility with high interconnect bandwidth. In this case, the helper is placed
in the same facility, so the helper-to-storage-server bandwidth is huge.
See helper.rst_ for details about the upload helper.

.. _helper.rst: helper.rst


The Filesystem Layer
====================

least frequently enough to prevent any of the leases from expiring before the
next renewal pass.
See garbage-collection.rst_ for further information, and for how to configure
garbage collection.

.. _garbage-collection.rst: garbage-collection.rst


File Repairer
=============

to periodically count the number of shares still available for any given
file. A more extensive form of checking known as the File Verifier can
download the ciphertext of the target file and perform integrity checks
(using strong hashes) to make sure the data is still intact. When the file is
found to have decayed below some threshold, the File Repairer can be used to
regenerate and re-upload the missing shares. These processes are conceptually
distinct (the repairer is only run if the checker/verifier decides it is
still around and willing to provide the data. If the file is not healthy
enough, the File Repairer is invoked to download the ciphertext, regenerate
any missing shares, and upload them to new nodes. The goal of the File
Repairer is to finish up with a full set of ``N`` shares.
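The checker/repairer decision can be sketched as follows. The function and
field names here are hypothetical, not Tahoe's API; the point is the distinct
thresholds involved (recoverability at ``k``, health at some higher count,
and a full set of ``N``):

.. code-block:: python

    def check_shares(found_shares, k=3, n=10, healthy_threshold=7):
        """Toy health check: `found_shares` is the set of distinct share
        numbers the checker located on the grid."""
        recoverable = len(found_shares) >= k      # can we still read the file?
        healthy = len(found_shares) >= healthy_threshold
        return {
            "recoverable": recoverable,
            "needs_repair": recoverable and not healthy,
            # a repairer would regenerate these and upload them to new nodes:
            "missing_shares": (set(range(n)) - set(found_shares)
                               if recoverable else set()),
        }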
There are a number of engineering issues to be resolved here. The bandwidth,
disk IO, and CPU time consumed by the verification/repair process must be
different goals. Each choice results in a number of properties; there are
many tradeoffs.

First, some terms: the erasure-coding algorithm is described as
``k``-out-of-``N`` (for this release, the default values are ``k`` = 3 and
``N`` = 10). Each grid will have some number of nodes; this number will rise
and fall over time as nodes join, drop out, come back, and leave forever.
Files are of various sizes, some are popular, others are unpopular. Nodes have
various capacities, variable
upload/download bandwidths, and network latency. Most of the mathematical
models that look at node failure assume some average (and independent)
probability 'P' of a given node being available: this can be high (servers
turned on for an hour then disappear for several days). Files are encoded in
segments of a given maximum size, which affects memory usage.
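The ``k``-out-of-``N`` idea in its smallest form is a 2-out-of-3 XOR code:
split the data in half and add a parity share, so that any two of the three
shares suffice. (Tahoe-LAFS itself uses Reed-Solomon coding, which generalizes
this to arbitrary ``k`` and ``N``; this toy is only for intuition.)

.. code-block:: python

    def encode_2_of_3(data):
        """Toy 2-out-of-3 erasure code: shares are [first half, second half,
        XOR parity]; any two shares recover the data."""
        if len(data) % 2:
            data += b"\x00"                 # pad to an even length for this toy
        half = len(data) // 2
        a, b = data[:half], data[half:]
        parity = bytes(x ^ y for x, y in zip(a, b))
        return [a, b, parity]

    def decode_2_of_3(shares, length):
        """Recover from any 2 of the 3 shares, given the original length.
        `shares` maps share number (0, 1, or 2) to share bytes."""
        if 0 in shares and 1 in shares:
            a, b = shares[0], shares[1]
        elif 0 in shares:                   # reconstruct b = a XOR parity
            a = shares[0]
            b = bytes(x ^ y for x, y in zip(a, shares[2]))
        else:                               # reconstruct a = b XOR parity
            b = shares[1]
            a = bytes(x ^ y for x, y in zip(b, shares[2]))
        return (a + b)[:length]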

The ratio of ``N``/``k`` is the "expansion factor". Higher expansion factors
improve reliability very quickly (the binomial distribution curve is very
sharp), but consume much more grid capacity. When P=50%, the absolute value of
``k`` affects the granularity of the binomial curve (1-out-of-2 is much worse
than
50-out-of-100), but high values asymptotically approach a constant (i.e.
500-of-1000 is not much better than 50-of-100). When P is high and the
expansion factor is held constant, higher values of ``k`` and ``N`` give
much better reliability (for P=99%, 50-out-of-100 is much much better than
5-of-10, roughly 10^50 times better), because there are more shares that can
be lost without losing the file.
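The binomial arithmetic behind these comparisons is straightforward to check.
With independent per-share survival probability ``p``, the file survives when
at least ``k`` of its ``N`` shares do:

.. code-block:: python

    from math import comb

    def survival(k, n, p):
        """P(at least k of n shares survive), each independently with prob. p."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(k, n + 1))

At p = 0.99, the failure probability of 5-out-of-10 is about 2e-10, while the
failure probability of 50-out-of-100 is so small that it underflows ordinary
floating point; this is the sharpness of the binomial curve referred to above.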
how many copies of the file you make. Independent nodes (with uncorrelated
failures) are necessary to hit the mathematical ideals: if you have 100 nodes
but they are all in the same office building, then a single power failure
will take out all of them at once. Pseudospoofing, also called a "Sybil
Attack", is where a single attacker convinces you that they are actually
multiple servers, so that you think you are using a large number of
independent nodes, but in fact you have a single point of failure (where the
attacker turns off all their machines at once). Large grids, with lots of
truly independent nodes, will enable the use of lower expansion factors to
achieve the same reliability, but will increase overhead because each node
needs to know something about every other, and the rate at which nodes come
and go will be higher (requiring network maintenance traffic). Also, the File
Repairer work will increase with larger grids, although then the job can be
distributed out to more nodes.

Higher values of ``N`` increase overhead: more shares means more Merkle hashes
that must be included with the data, and more nodes to contact to retrieve
the shares. Smaller segment sizes reduce memory usage (since each segment
must be held in memory while erasure coding runs) and improve "alacrity"
still retaining high reliability, but large unstable grids (where nodes are
coming and going very quickly) may require more repair/verification bandwidth
than actual upload/download traffic.