From: Brian Warner Date: Fri, 20 Apr 2007 06:43:47 +0000 (-0700) Subject: add architecture/code-layout documents describing our current architecture and a... X-Git-Tag: tahoe_v0.1.0-0-UNSTABLE~27 X-Git-Url: https://git.rkrishnan.org/provisioning?a=commitdiff_plain;h=159a3fc6783016f093f1aabc14262e522cc20681;p=tahoe-lafs%2Ftahoe-lafs.git add architecture/code-layout documents describing our current architecture and a bit of our future plans --- diff --git a/docs/architecture.txt b/docs/architecture.txt new file mode 100644 index 00000000..8f705cb0 --- /dev/null +++ b/docs/architecture.txt @@ -0,0 +1,387 @@ + + Allmydata "Tahoe" Architecture + +OVERVIEW + +The high-level view of this system consists of three layers: the mesh, the +virtual drive, and the application that sits on top. + +The lowest layer is the "mesh" or "cloud", basically a DHT (Distributed Hash +Table) which maps URIs to data. The URIs are relatively-short ascii strings +(currently about 140 bytes), and they are used as references to an immutable +arbitrary-length sequence of data bytes. This data is distributed around the +cloud in a large number of nodes, such that a statistically unlikely number +of nodes would have to be unavailable for the data to be unavailable. + +The middle layer is the virtual drive: a tree-shaped data structure in which +the intermediate nodes are directories and the leaf nodes are files. Each +file contains both the URI of the file's data and all the necessary metadata +(MIME type, filename, ctime/mtime, etc) required to present the file to a +user in a meaningful way (displaying it in a web browser, or on a desktop). + +The top layer is where the applications that use this virtual drive operate. +Allmydata uses this for a backup service, in which the application copies the +files to be backed up from the local disk into the virtual drive on a +periodic basis. By providing read-only access to the same virtual drive +later, a user can recover older versions of their files. Other sorts of +applications can run on top of the virtual drive, of course, anything that +has a use for a secure, robust, distributed filestore. + +Note: some of the description below indicates design targets rather than +actual code present in the current release. Please take a look at roadmap.txt +to get an idea of how much of this has been implemented so far. + + +THE BIG CLOUD OF PEERS + +Underlying the mesh/cloud is a large collection of peer nodes. These are +processes running on a wide variety of computers, all of which know about +each other in some way or another. They establish TCP connections to one +another using Foolscap, an encrypted+authenticated remote message passing +library (using TLS connections and self-authenticating identifiers called +"FURLs"). + +Each peer offers certain services to the others. The primary service is the +StorageServer, which offers to hold data for a limited period of time (a +"lease"). Each StorageServer has a quota, and it will reject lease requests +that would cause it to consume more space than it wants to provide. When a +lease expires, the data is deleted. Peers might renew their leases. + +This storage is used to hold "shares", which are themselves used to store +files in the mesh. There are many shares for each file, typically around 100 +(the exact number depends upon the tradeoffs made between reliability, +overhead, and storage space consumed). The files are indexed by a piece of +the URI called the "verifierid", which is derived from the contents of the +file. Leases are indexed by verifierid, and a single StorageServer may hold +multiple shares for the corresponding file. Multiple peers can hold leases on +the same file, in which case the shares will be kept alive until the last +lease expires. The typical lease is expected to be for one month: enough time +for interested parties to renew it, but not so long that abandoned data +consumes unreasonable space. Peers are expected to "delete" (drop leases) on +data that they know they no longer want: lease expiration is meant as a +safety measure. + +In this release, peers learn about each other through the "introducer". Each +peer connects to this central introducer at startup, and receives a list of +all other peers from it. Each peer then connects to all other peers, creating +a full-mesh topology. Future versions will reduce the number of connections +considerably, to enable the mesh to scale larger than a full-mesh allows. + + +FILE ENCODING + +When a file is to be added to the mesh, it is first encrypted using a key +that is derived from the hash of the file itself. The encrypted file is then +broken up into segments so it can be processed in small pieces (to minimize +the memory footprint of both encode and decode operations, and to increase +the so-called "alacrity": how quickly can the download operation provide +validated data to the user). Each segment is erasure coded, which creates +encoded blocks that are larger than the input segment, such that only a +subset of the output blocks are required to reconstruct the segment. These +blocks are then combined into "shares", such that a subset of the shares can +be used to reconstruct the whole file. The shares are then deposited in +StorageServers in other peers. + +A tagged hash of the original file is called the "fileid", while a +differently-tagged hash of the original file provides the encryption key. A +tagged hash of the *encrypted* file is called the "verifierid", and is used +for both peer selection (described below) and to index shares within the +StorageServers on the selected peers. + +The URI contains the verifierid, the encryption key, any encoding parameters +necessary to perform the eventual decoding process, and some additional +hashes that allow the download process to validate the data it receives. + +On the download side, the node that wishes to turn a URI into a sequence of +bytes will obtain the necessary shares from remote nodes, break them into +blocks, use erasure-decoding to turn them into segments of crypttext, use the +decryption key to convert that into plaintext, then emit the plaintext bytes +to the output target (which could be a file on disk, or it could be streamed +directly to a web browser or media player). + +All hashes use SHA256, and a different tag is used for each purpose. +Netstrings are used where necessary to insure these tags cannot be confused +with the data to be hashed. All encryption uses AES in CTR mode. The erasure +coding is performed with zfec (a python wrapper around Rizzo's FEC library). +A Merkle Hash Tree is used to validate the encoded blocks before they are fed +into the decode process, and a second tree is used to validate the shares +before they are retrieved. The hash tree root is put into the URI. + + +URIs + +Each URI represents a specific set of bytes. Think of it like a hash +function: you feed in a bunch of bytes, and you get out a URI. The URI is +deterministically derived from the input data: changing even one bit of the +input data will result in a drastically different URI. The URI provides both +"identification" and "location": you can use it to locate a set of bytes that +are probably the same as the original file, and you can also use it to +validate that these potential bytes are indeed the ones that you were looking +for. + +URIs refer to an immutable set of bytes. If you modify a file and upload the +new one to the mesh, you will get a different URI. URIs do not represent +filenames at all, just the data that a filename might point to at some given +point in time. This is why the "mesh" layer is insufficient to provide a +virtual drive: an actual filesystem requires human-meaningful names and +mutability, while URIs provide neither. URIs sit on the "global+secure" edge +of Zooko's Triangle[1]. They are self-authenticating, meaning that nobody can +trick you into using the wrong data. + +The URI should be considered as a "read capability" for the corresponding +data: anyone who knows the full URI has the ability to read the given data. +There is a subset of the URI (which leaves out the encryption key and fileid) +which is called the "verification capability": it allows the holder to +retrieve and validate the crypttext, but not the plaintext. Once the +crypttext is available, the erasure-coded shares can be regenerated. This +will allow a file-repair process to maintain and improve the robustness of +files without being able to read their contents. + +The lease mechanism will also involve a "delete" capability, by which a peer +which uploaded a file can indicate that they don't want it anymore. It is not +truly a delete capability because other peers might be holding leases on the +same data, and it should not be deleted until the lease count (i.e. reference +count) goes to zero, so perhaps "cancel-the-lease capability" is more +accurate. The plan is to store this capability next to the URI in the virtual +drive structure. + + +PEER SELECTION + +When a file is uploaded, the encoded shares are sent to other peers. But to +which ones? The "peer selection" algorithm is used to make this choice. + +In the current version, the verifierid is used to consistently-permute the +set of all peers (by sorting the peers by HASH(verifierid+peerid)). This +places the peers around a 2^256-sized ring, like the rim of a big clock. The +100-or-so shares are then placed around the same ring (at 0, 1/100*2^256, +2/100*2^256, ... 99/100*2^256). Imagine that we start at 0 with an empty +basket in hand and proceed clockwise. When we come to a share, we pick it up +and put it in the basket. When we come to a peer, we ask that peer if they +will give us a lease for every share in our basket. + +The peer will grant us leases for some of those shares and reject others (if +they are full or almost full). If they reject all our requests, we remove +them from the ring, because they are full and thus unhelpful. Each share they +accept is removed from the basket. The remainder stay in the basket as we +continue walking clockwise. + +We keep walking, accumulating shares and distributing them to peers, until +either we find a home for all shares, or there are no peers left in the ring +(because they are all full). If we run out of peers before we run out of +shares, the upload may be considered a failure, depending upon how many +shares we were able to place. The current parameters try to place 100 shares, +of which 25 must be retrievable to recover the file, and the peer selection +algorithm is happy if it was able to place at least 75 shares. These numbers +are adjustable: 25-out-of-100 means an expansion factor of 4x (every file in +the mesh consumes four times as much space when totalled across all +StorageServers), but is highly reliable (the actual reliability is a binomial +distribution function of the expected availability of the individual peers, +but in general it goes up very quickly with the expansion factor). + +If the file has been uploaded before (or if two uploads are happening at the +same time), a peer might already have shares for the same file we are +proposing to send to them. In this case, those shares are removed from the +list and assumed to be available (or will be soon). This reduces the number +of uploads that must be performed. + +When downloading a file, the current release just asks all known peers for +any shares they might have, chooses the minimal necessary subset, then starts +downloading and processing those shares. A later release will use the full +algorithm to reduce the number of queries that must be sent out. This +algorithm uses the same consistent-hashing permutation as on upload, but +instead of one walker with one basket, we have 100 walkers (one per share). +They each proceed clockwise until they find a peer: this peer is the most +likely to be the same one to which the share was originally uploaded, and is +put on the "A" list. The next peer that each walker encounters is put on the +"B" list, etc. All the "A" list peers are asked for any shares they might +have. If enough of them can provide a share, the download phase begins and +those shares are retrieved and decoded. If not, the "B" list peers are +contacted, etc. This routine will eventually find all the peers that have +shares, and will find them quickly if there is significant overlap between +the set of peers that were present when the file was uploaded and the set of +peers that are present as it is downloaded (i.e. if the "peerlist stability" +is high). Some limits may be imposed in large meshes to avoid querying a +million peers; this provides a tradeoff between the work spent to discover +that a file is unrecoverable and the probability that a retrieval will fail +when it couldhave succeeded if we had just tried a little bit harder. The +appropriate value of this tradeoff will depend upon the size of the mesh. + +Other peer selection algorithms are being evaluated. One of them (known as +"tahoe 2") uses the same consistent hash, starts at 0 and requests one lease +per peer until it gets 100 of them. This is likely to get better overlap +(since a single insertion or deletion will still leave 99 overlapping peers), +but is non-ideal in other ways (TODO: what were they?). + +Another algorithm (known as "denver airport"[2]) uses the permuted hash to +decide on an approximate target for each share, then sends lease requests via +Chord routing (to avoid maintaining a large number of long-term connections). +The request includes the contact information of the uploading node, and asks +that the node which eventually accepts the lease should contact the uploader +directly. The shares are then transferred over direct connections rather than +through multiple Chord hops. Download uses the same approach. + + +FILETREE: THE VIRTUAL DRIVE LAYER + +The "virtual drive" layer is responsible for mapping human-meaningful +pathnames (directories and filenames) to pieces of data. The actual bytes +inside these files are referenced by URI, but the "filetree" is where the +directory names, file names, and metadata are kept. + +The current release has a very simplistic filetree model. There is a single +globally-shared directory structure, which maps filename to URI. This +structure is maintained in a central node (which happens to be the same node +that houses the Introducer), by writing URIs to files in a local filesystem. + +A future release (probably the next one) will offer each application the +ability to have a separate file tree. Each tree can reference others. Some +trees are redirections, while others actually contain subdirectories full of +filenames. The redirections may be mutable by some users but not by others, +allowing both read-only and read-write views of the same data. This will +enable individual users to have their own personal space, with links to +spaces that are shared with specific other users, and other spaces that are +globally visible. Eventually the application layer will present these pieces +in a way that allows the sharing of a specific file or the creation of a +"virtual CD" as easily as dragging a folder onto a user icon. + +The URIs described above are "Content Hash Key" (CHK) identifiers[3], in +which the identifier refers to a specific sequence of bytes. In this project, +CHK identifiers are used for both files and immutable directories (the tree +of directory and file nodes are serialized into a sequence of bytes, which is +then uploaded and turned into a URI). There is a separate kind of upload, not +yet implemented, called SSK (short for Signed Subspace Key), in which the URI +refers to a mutable slot. Some users have a write-capability to this slot, +allowing them to change the data that it refers to. Others only have a +read-capability, merely letting them read the current contents. These SSK +slots can be used to provide mutability in the filetree, so that users can +actually change the contents of their virtual drive. Redirection nodes can +also provide mutability, such as a central service which allows a user to set +the current URI of their top-level filetree. SSK slots provide a +decentralized way to accomplish this mutability, whereas centralized +redirection nodes are more vulnerable to single-point-of-failure issues. + + +FILE REPAIRER + +Each node is expected to explicitly drop leases on files that it knows it no +longer wants (the "delete" operation). Nodes are also expected to renew +leases on files that still exist in their filetrees. When nodes are offline +for an extended period of time, their files may decay (both because of leases +expiring and because of StorageServers going offline). A File Verifier is +used to check on the health of any given file, and a File Repairer is used to +to keep desired files alive. The two are conceptually distinct (the repairer +is run if the verifier decides it is necessary), but in practice they will be +closely related, and may run in the same process. + +The repairer process does not get the full URI of the file to be maintained: +it merely gets the "repairer capability" subset, which does not include the +decryption key. The File Verifier uses that data to find out which peers +ought to hold shares for this file, and to see if those peers are still +around and willing to provide the data. If the file is not healthy enough, +the File Repairer is invoked to download the crypttext, regenerate any +missing shares, and upload them to new peers. The goal of the File Repairer +is to finish up with a full set of 100 shares. + +There are a number of engineering issues to be resolved here. The bandwidth, +disk IO, and CPU time consumed by the verification/repair process must be +balanced against the robustness that it provides to the mesh. The nodes +involved in repair will have very different access patterns than normal +nodes, such that these processes may need to be run on hosts with more memory +or network connectivity than usual. The frequency of repair runs directly +affects the resources consumed. In some cases, verification of multiple files +can be performed at the same time, and repair of files can be delegated off +to other nodes. + +The security model we are currently using assumes that peers who claim to +hold a share will actually provide it when asked. (We validate the data they +provide before using it in any way, but if enough peers claim to hold the +data and are wrong, the file will not be repaired, and may decay beyond +recoverability). There are several interesting approaches to mitigate this +threat, ranging from challenges to provide a keyed hash of the allegedly-held +data (using "buddy nodes", in which two peers hold the same block, and check +up on each other), to the original MojoNation economic model. + + +SECURITY + +Data validity (the promise that the downloaded data will match the originally +uploaded data) is provided by the hash embedded the URI. Data security (the +promise that the data is only readable by people with the URI) is provided by +the encryption key embedded in the URI. Data availability (the hope that data +which has been uploaded in the past will be downloadable in the future) is +provided by the mesh, which distributes failures in a way that reduces the +correspondence between individual node failure and file recovery failure. + +The capability-based security model is used throughout this project. Filetree +operations are expressed in terms of distinct read and write capabilities. +The URI of a file is the read-capability: knowing the URI is equivalent to +the ability to read the corresponding data. + + +RELIABILITY + +File encoding and peer selection parameters can be adjusted to achieve +different goals. Each choice results in a number of properties; there are +many tradeoffs. + +First, some terms: the erasure-coding algorithm is described as K-out-of-N +(for this release, the default values are K=25 and N=100). Each mesh will +have some number of peers; this number will rise and fall over time as peers +join, drop out, come back, and leave forever. Files are of various sizes, +some are popular, others are rare. Peers have various capacities, variable +upload/download bandwidths, and network latency. Most of the mathematical +models that look at peer failure assume some average (and independent) +probability 'P' of a given peer being available: this can be high (servers +tend to be online and available >90% of the time) or low (laptops tend to be +turned on for an hour then disappear for several days). Files are encoded in +segments of a given maximum size, which affects memory usage. + +The ratio of N/K is the "expansion factor". Higher expansion factors improve +reliability very quickly (the binomial distribution curve is very sharp), but +consumes much more mesh capacity. The absolute value of K affects the +granularity of the binomial curve (1-out-of-2 is much worse than +50-out-of-100), but high values asymptotically approach a constant that +depends upon 'P' (i.e. 500-of-1000 is not much better than 50-of-100). + +Likewise, the total number of peers in the network affects the same +granularity: having only one peer means a single point of failure, no matter +how many copies of the file you make. Independent peers (with uncorrelated +failures) are necessary to hit the mathematical ideals: if you have 100 nodes +but they are all in the same office building, then a single power failure +will take out all of them at once. The "Sybil Attack" is where a single +attacker convinces you that they are actually multiple servers, so that you +think you are using a large number of independent peers, but in fact you have +a single point of failure (where the attacker turns off all their machines at +once). Large meshes, with lots of truly-independent peers, will enable the +use of lower expansion factors to achieve the same reliability, but increase +overhead because each peer needs to know something about every other, and the +rate at which peers come and go will be higher (requiring network maintenance +traffic). Also, the File Repairer work will increase with larger meshes, +although then the job can be distributed out to more peers. + +Higher values of N increase overhead: more shares means more Merkle hashes +that must be included with the data, and more peers to contact to retrieve +the shares. Smaller segment sizes reduce memory usage (since each segment +must be held in memory while erasure coding runs) and increases "alacrity" +(since downloading can validate a smaller piece of data faster, delivering it +to the target sooner), but also increase overhead (because more blocks means +more Merkle hashes to validate them). + +In general, small private meshes should work well, but the participants will +have to decide between storage overhead and reliability. Large stable meshes +will be able to reduce the expansion factor down to a bare minimum while +still retaining high reliability, but large unstable meshes (where nodes are +coming and going very quickly) may require more repair/verification bandwidth +than actual upload/download traffic. + + +------------------------------ + +[1]: http://en.wikipedia.org/wiki/Zooko%27s_triangle +[2]: all of these names are derived from the location where they were + concocted, in this case in a car ride from Boulder to DEN +[3]: the terms CHK and SSK come from Freenet, + http://wiki.freenetproject.org/FreenetCHKPages , + although we use "SSK" in a slightly different way + diff --git a/docs/codemap.txt b/docs/codemap.txt new file mode 100644 index 00000000..a26f51a6 --- /dev/null +++ b/docs/codemap.txt @@ -0,0 +1,74 @@ + +CODE OVERVIEW + +A brief map to where the code lives in this distribution: + + src/zfec: the erasure-coding library, turns data into shares and back again. + When installed, this provides the 'zfec' package. + + src/Crypto: a modified version of PyCrypto, which includes a patch to + greatly improve the speed of CTR mode, which unfortunately makes + it incompatible with the normal version of PyCrypto. When + installed, this provides the 'allmydata.Crypto' package. + + src/allmydata: the bulk of the code for this project. When installed, this + provides the 'allmydata' package + +Within src/allmydata/ : + + interfaces.py: declaration of zope.interface-style Interfaces for most + components, also defines Foolscap RemoteInterfaces for + all remotely-accessible components + + node.py: the base Node, which handles connection establishment and + application startup + + client.py, queen.py: two specialized subclasses of Node, for users + and the central introducer/vdrive handler, respectively + + introducer.py: node introduction handlers, client is used by client.py, + server is used by queen.py + + storageserver.py: provides storage services to other nodes + + codec.py: low-level erasure coding, wraps zfec + + encode.py: handles turning data into shares and blocks, computes hash trees + upload.py: upload-side peer selection, reading data from upload sources + + download.py: download-side peer selection, share retrieval, decoding + + filetable.py, vdrive.py: implements the current one-global-vdrive layer, + part runs on client nodes, part runs on the + central vdrive handler (aka the 'queen') + + webish.py, web/*.xhtml: provides the web frontend, using a Nevow server + + workqueue.py, filetree/*.py: building blocks for the future filetree work + + hashtree.py: Merkle hash tree classes + + debugshell.py, manhole.py: SSH-connected python shell, for debug purposes + + uri.py: URI packing/parsing routines + + util/*.py: misc utility classes + + test/*.py: unit tests + + +Both the client and the central-queen node runs as a tree of +(twisted.application.service) Service instances. The Foolscap "Tub" is one of +these. Client nodes have an Uploader service and a Downloader service that +turn data into URIs and back again. They also have a VDrive service which +provides access to the single global shared filesystem. + +The Uploader is given an "upload source" (which could be an open filehandle, +a filename on local disk, or even a string), and returns a Deferred that +fires with the URI once the upload is complete. The Downloader is given a URI +and a "download target" (an open filehandle, filename, or instructions to +provide a string), and fires a Deferred with a target-specific value when the +download is complete. The source/target API is intended to make it easy to +stream the incoming data to a media player or HTTP connection without having +to consume a lot of memory for buffering. +