From: Zooko O'Whielacronx
Date: Fri, 21 Sep 2007 21:12:26 +0000 (-0700)
Subject: a few edits to architecture.txt and related docs
X-Git-Tag: allmydata-tahoe-0.6.0~25
X-Git-Url: https://git.rkrishnan.org/specifications/something?a=commitdiff_plain;h=f5518eca922fa7f71b79011a2364c93c31194186;p=tahoe-lafs%2Ftahoe-lafs.git

a few edits to architecture.txt and related docs
---

diff --git a/docs/architecture.txt b/docs/architecture.txt
index c1c4cfcc..047c362d 100644
--- a/docs/architecture.txt
+++ b/docs/architecture.txt
@@ -9,9 +9,10 @@ virtual drive, and the application that sits on top.
 The lowest layer is the "grid", basically a DHT (Distributed Hash Table) which
 maps URIs to data. The URIs are relatively short ascii strings (currently
 about 140 bytes), and each is used as a reference to an immutable
-arbitrary-length sequence of data bytes. This data is distributed around the
-grid across a large number of nodes, such that a statistically unlikely number
-of nodes would have to be unavailable for the data to become unavailable.
+arbitrary-length sequence of data bytes. This data is encrypted and
+distributed around the grid across a large number of nodes, such that a
+statistically unlikely number of nodes would have to be unavailable for the
+data to become unavailable.
 
 The middle layer is the virtual drive: a tree-shaped data structure in which
 the intermediate nodes are directories and the leaf nodes are files. Each
@@ -27,9 +28,9 @@ later, a user can recover older versions of their files.
 Other sorts of applications can run on top of the virtual drive, of course --
 anything that has a use for a secure, robust, distributed filestore.
 
-Note: some of the description below indicates design targets rather than
-actual code present in the current release. Please take a look at roadmap.txt
-to get an idea of how much of this has been implemented so far.
+Note: some of the text below describes design targets rather than actual code
+present in the current release. Please take a look at roadmap.txt to get an
+idea of how much of this has been implemented so far.
 
 
 THE BIG GRID OF PEERS
@@ -46,11 +47,11 @@ StorageServer, which offers to hold data for a limited period of time (a
 that would cause it to consume more space than it wants to provide. When a
 lease expires, the data is deleted. Peers might renew their leases.
 
-This storage is used to hold "shares", which are themselves used to store
-files in the grid. There are many shares for each file, typically between 10
-and 100 (the exact number depends upon the tradeoffs made between
-reliability, overhead, and storage space consumed). The files are indexed by
-a "StorageIndex", which is derived from the encryption key, which may be
+This storage is used to hold "shares", which are encoded pieces of files in
+the grid. There are many shares for each file, typically between 10 and 100
+(the exact number depends upon the tradeoffs made between reliability,
+overhead, and storage space consumed). The files are indexed by a
+"StorageIndex", which is derived from the encryption key, which may be
 randomly generated or it may be derived from the contents of the file. Leases
 are indexed by StorageIndex, and a single StorageServer may hold multiple
 shares for the corresponding file. Multiple peers can hold leases on the same
@@ -90,7 +91,7 @@ be used to reconstruct the whole file. The shares are then deposited in
 StorageServers in other peers.
 
 A tagged hash of the encryption key is used to form the "storage index",
-which is used for both peer selection (described below) and to index shares
+which is used for both server selection (described below) and to index shares
 within the StorageServers on the selected peers.
 
 A variety of hashes are computed while the shares are being produced, to
@@ -173,10 +174,10 @@ accurate. The plan is to store this capability next to the URI in the virtual
 drive structure.
 
 
-PEER SELECTION
+SERVER SELECTION
 
 When a file is uploaded, the encoded shares are sent to other peers. But to
-which ones? The "peer selection" algorithm is used to make this choice.
+which ones? The "server selection" algorithm is used to make this choice.
 
 In the current version, the verifierid is used to consistently-permute the
 set of all peers (by sorting the peers by HASH(verifierid+peerid)). Each file
diff --git a/docs/codemap.txt b/docs/codemap.txt
index 891731ef..00aade5b 100644
--- a/docs/codemap.txt
+++ b/docs/codemap.txt
@@ -3,12 +3,9 @@ CODE OVERVIEW
 
 A brief map to where the code lives in this distribution:
 
- src/zfec: the erasure-coding library, turns data into shares and back again.
-           When installed, this provides the 'zfec' package.
-
- src/allmydata: the bulk of the code for this project. When installed, this
-                provides the 'allmydata' package. This includes a few pieces
-                copied from the PyCrypto package, in allmydata/Crypto/* .
+ src/allmydata: the code for this project. When installed, this provides the
+                'allmydata' package. This includes a few pieces copied from
+                the PyCrypto package, in allmydata/Crypto/* .
 
 Within src/allmydata/ :
 
@@ -29,12 +26,13 @@ Within src/allmydata/ :
 
  storageserver.py: provides storage services to other nodes
 
- codec.py: low-level erasure coding, wraps zfec
+ codec.py: low-level erasure coding, wraps the zfec library
 
  encode.py: handles turning data into shares and blocks, computes hash trees
 
- upload.py: upload-side peer selection, reading data from upload sources
- download.py: download-side peer selection, share retrieval, decoding
+ upload.py: upload server selection, reading data from upload sources
+
+ download.py: download server selection, share retrieval, decoding
 
  dirnode.py: implements the directory nodes. One part runs on the
              global vdrive server, the other runs inside a client
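
Note on the SERVER SELECTION hunk: the context lines describe permuting the
set of all peers by sorting them on HASH(verifierid+peerid). A minimal sketch
of that idea follows. It is only an illustration, not the project's code:
SHA-256 standing in for HASH, byte-string ids, and the function name
permuted_peers are all assumptions that do not come from the patch.

```python
import hashlib
from typing import List


def permuted_peers(verifierid: bytes, peerids: List[bytes]) -> List[bytes]:
    """Return the peers sorted by HASH(verifierid + peerid).

    Assumption: SHA-256 stands in for HASH; the patch does not name the
    hash function that is actually used.
    """
    return sorted(
        peerids,
        key=lambda peerid: hashlib.sha256(verifierid + peerid).digest(),
    )


# Hypothetical ids, for illustration only.
peers = [b"peer-a", b"peer-b", b"peer-c", b"peer-d"]
order_1 = permuted_peers(b"file-1", peers)
order_2 = permuted_peers(b"file-2", peers)
```

Because the sort key depends only on the two ids, every node that knows the
same peer set computes the same ordering for a given verifierid, while
different files generally get different orderings.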