From: Brian Warner Date: Sun, 22 Feb 2009 06:40:54 +0000 (-0700) Subject: docs: move many specification-like documents into specifications/ X-Git-Url: https://git.rkrishnan.org/pf/%3C?a=commitdiff_plain;h=4ab3397992245cdff207041a424df14b450186dd;p=tahoe-lafs%2Ftahoe-lafs.git docs: move many specification-like documents into specifications/ --- diff --git a/docs/CHK-hashes.svg b/docs/CHK-hashes.svg deleted file mode 100644 index 22bd524f..00000000 --- a/docs/CHK-hashes.svg +++ /dev/null @@ -1,723 +0,0 @@ - - - - - - - - - - - - - - - - image/svg+xml - - - - - - - - data(plaintext) - - - - data(crypttext) - - - shares - - - - - - - - - - - - - - - - - - - - - - plaintexthash tree - crypttexthash tree - sharehash tree - - URI Extension Block - plaintext root - plaintext (flat) hash - crypttext root - crypttext (flat) hash - share root - - - - - - - - - - URI - encryptionkey - storageindex - UEBhash - - - - - - AES - - - - - - - FEC - - - - - - A - B : - B is derived from A by hashing, therefore B validates A - - A - B : - B is derived from A by encryption or erasure coding - - A - B : - A is used as an index to retrieve data B - SHARE - CHK File Hashes - - diff --git a/docs/Makefile b/docs/Makefile index 04db86d8..49007217 100644 --- a/docs/Makefile +++ b/docs/Makefile @@ -1,7 +1,5 @@ -SOURCES = CHK-hashes.svg file-encoding1.svg file-encoding2.svg \ - file-encoding3.svg file-encoding4.svg file-encoding5.svg \ - file-encoding6.svg subtree1.svg lease-tradeoffs.svg +SOURCES = subtree1.svg lease-tradeoffs.svg PNGS = $(patsubst %.svg,%.png,$(SOURCES)) EPSS = $(patsubst %.svg,%.eps,$(SOURCES)) diff --git a/docs/URI-extension.txt b/docs/URI-extension.txt deleted file mode 100644 index 8ec383e0..00000000 --- a/docs/URI-extension.txt +++ /dev/null @@ -1,61 +0,0 @@ - -"URI Extension Block" - -This block is a serialized dictionary with string keys and string values -(some of which represent numbers, some of which are SHA-256 hashes). All -buckets hold an identical copy. The hash of the serialized data is kept in -the URI. - -The download process must obtain a valid copy of this data before any -decoding can take place. The download process must also obtain other data -before incremental validation can be performed. Full-file validation (for -clients who do not wish to do incremental validation) can be performed solely -with the data from this block. - -At the moment, this data block contains the following keys (and an estimate -on their sizes): - - size 5 - segment_size 7 - num_segments 2 - needed_shares 2 - total_shares 3 - - codec_name 3 - codec_params 5+1+2+1+3=12 - tail_codec_params 12 - - share_root_hash 32 (binary) or 52 (base32-encoded) each - plaintext_hash - plaintext_root_hash - crypttext_hash - crypttext_root_hash - -Some pieces are needed elsewhere (size should be visible without pulling the -block, the Tahoe3 algorithm needs total_shares to find the right peers, all -peer selection algorithms need needed_shares to ask a minimal set of peers). -Some pieces are arguably redundant but are convenient to have present -(test_encode.py makes use of num_segments). - -The rule for this data block is that it should be a constant size for all -files, regardless of file size. Therefore hash trees (which have a size that -depends linearly upon the number of segments) are stored elsewhere in the -bucket, with only the hash tree root stored in this data block. - -This block will be serialized as follows: - - assert that all keys match ^[a-zA-z_\-]+$ - sort all the keys lexicographically - for k in keys: - write("%s:" % k) - write(netstring(data[k])) - - -Serialized size: - - dense binary (but decimal) packing: 160+46=206 - including 'key:' (185) and netstring (6*3+7*4=46) on values: 231 - including 'key:%d\n' (185+13=198) and printable values (46+5*52=306)=504 - -We'll go with the 231-sized block, and provide a tool to dump it as text if -we really want one. diff --git a/docs/dirnodes.txt b/docs/dirnodes.txt deleted file mode 100644 index adc8fcab..00000000 --- a/docs/dirnodes.txt +++ /dev/null @@ -1,433 +0,0 @@ - -= Tahoe Directory Nodes = - -As explained in the architecture docs, Tahoe can be roughly viewed as a -collection of three layers. The lowest layer is the distributed filestore, or -DHT: it provides operations that accept files and upload them to the mesh, -creating a URI in the process which securely references the file's contents. -The middle layer is the filesystem, creating a structure of directories and -filenames resembling the traditional unix/windows filesystems. The top layer -is the application layer, which uses the lower layers to provide useful -services to users, like a backup application, or a way to share files with -friends. - -This document examines the middle layer, the "filesystem". - -== DHT Primitives == - -In the lowest layer (DHT), there are two operations that reference immutable -data (which we refer to as "CHK URIs" or "CHK read-capabilities" or "CHK -read-caps"). One puts data into the grid (but only if it doesn't exist -already), the other retrieves it: - - chk_uri = put(data) - data = get(chk_uri) - -We also have three operations which reference mutable data (which we refer to -as "mutable slots", or "mutable write-caps and read-caps", or sometimes "SSK -slots"). One creates a slot with some initial contents, a second replaces the -contents of a pre-existing slot, and the third retrieves the contents: - - mutable_uri = create(initial_data) - replace(mutable_uri, new_data) - data = get(mutable_uri) - -== Filesystem Goals == - -The main goal for the middle (filesystem) layer is to give users a way to -organize the data that they have uploaded into the mesh. The traditional way -to do this in computer filesystems is to put this data into files, give those -files names, and collect these names into directories. - -Each directory is a series of name-value pairs, which maps "child name" to an -object of some kind. Those child objects might be files, or they might be -other directories. - -The directory structure is therefore a directed graph of nodes, in which each -node might be a directory node or a file node. All file nodes are terminal -nodes. - -== Dirnode Goals == - -What properties might be desirable for these directory nodes? In no -particular order: - - 1: functional. Code which does not work doesn't count. - 2: easy to document, explain, and understand - 3: confidential: it should not be possible for others to see the contents of - a directory - 4: integrity: it should not be possible for others to modify the contents - of a directory - 5: available: directories should survive host failure, just like files do - 6: efficient: in storage, communication bandwidth, number of round-trips - 7: easy to delegate individual directories in a flexible way - 8: updateness: everybody looking at a directory should see the same contents - 9: monotonicity: everybody looking at a directory should see the same - sequence of updates - -Some of these goals are mutually exclusive. For example, availability and -consistency are opposing, so it is not possible to achieve #5 and #8 at the -same time. Moreover, it takes a more complex architecture to get close to the -available-and-consistent ideal, so #2/#6 is in opposition to #5/#8. - -Tahoe-0.7.0 introduced distributed mutable files, which use public key -cryptography for integrity, and erasure coding for availability. These -achieve roughly the same properties as immutable CHK files, but their -contents can be replaced without changing their identity. Dirnodes are then -just a special way of interpreting the contents of a specific mutable file. -Earlier releases used a "vdrive server": this server was abolished in the -0.7.0 release. - -For details of how mutable files work, please see "mutable.txt" in this -directory. - -For the current 0.7.0 release, we achieve most of our desired properties. The -integrity and availability of dirnodes is equivalent to that of regular -(immutable) files, with the exception that there are more simultaneous-update -failure modes for mutable slots. Delegation is quite strong: you can give -read-write or read-only access to any subtree, and the data format used for -dirnodes is such that read-only access is transitive: i.e. if you grant Bob -read-only access to a parent directory, then Bob will get read-only access -(and *not* read-write access) to its children. - -Relative to the previous "vdrive-server" based scheme, the current -distributed dirnode approach gives better availability, but cannot guarantee -updateness quite as well, and requires far more network traffic for each -retrieval and update. Mutable files are somewhat less available than -immutable files, simply because of the increased number of combinations -(shares of an immutable file are either present or not, whereas there are -multiple versions of each mutable file, and you might have some shares of -version 1 and other shares of version 2). In extreme cases of simultaneous -update, mutable files might suffer from non-monotonicity. - - -== Dirnode secret values == - -As mentioned before, dirnodes are simply a special way to interpret the -contents of a mutable file, so the secret keys and capability strings -described in "mutable.txt" are all the same. Each dirnode contains an RSA -public/private keypair, and the holder of the "write capability" will be able -to retrieve the private key (as well as the AES encryption key used for the -data itself). The holder of the "read capability" will be able to obtain the -public key and the AES data key, but not the RSA private key needed to modify -the data. - -The "write capability" for a dirnode grants read-write access to its -contents. This is expressed on concrete form as the "dirnode write cap": a -printable string which contains the necessary secrets to grant this access. -Likewise, the "read capability" grants read-only access to a dirnode, and can -be represented by a "dirnode read cap" string. - -For example, -URI:DIR2:swdi8ge1s7qko45d3ckkyw1aac%3Aar8r5j99a4mezdojejmsfp4fj1zeky9gjigyrid4urxdimego68o -is a write-capability URI, while -URI:DIR2-RO:buxjqykt637u61nnmjg7s8zkny:ar8r5j99a4mezdojejmsfp4fj1zeky9gjigyrid4urxdimego68o -is a read-capability URI, both for the same dirnode. - - -== Dirnode storage format == - -Each dirnode is stored in a single mutable file, distributed in the Tahoe -grid. The contents of this file are a serialized list of netstrings, one per -child. Each child is a list of four netstrings: (name, rocap, rwcap, -metadata). (remember that the contents of the mutable file are encrypted by -the read-cap, so this section describes the plaintext contents of the mutable -file, *after* it has been decrypted by the read-cap). - -The name is simple a UTF-8 -encoded child name. The 'rocap' is a read-only -capability URI to that child, either an immutable (CHK) file, a mutable file, -or a directory. The 'rwcap' is a read-write capability URI for that child, -encrypted with the dirnode's write-cap: this enables the "transitive -readonlyness" property, described further below. The 'metadata' is a -JSON-encoded dictionary of type,value metadata pairs. Some metadata keys are -pre-defined, the rest are left up to the application. - -Each rwcap is stored as IV + ciphertext + MAC. The IV is a 16-byte random -value. The ciphertext is obtained by using AES in CTR mode on the rwcap URI -string, using a key that is formed from a tagged hash of the IV and the -dirnode's writekey. The MAC is a 32-byte SHA-256 -based HMAC (using that same -AES key) over the (IV+ciphertext) pair. - -If Bob has read-only access to the 'bar' directory, and he adds it as a child -to the 'foo' directory, then he will put the read-only cap for 'bar' in both -the rwcap and rocap slots (encrypting the rwcap contents as described above). -If he has full read-write access to 'bar', then he will put the read-write -cap in the 'rwcap' slot, and the read-only cap in the 'rocap' slot. Since -other users who have read-only access to 'foo' will be unable to decrypt its -rwcap slot, this limits those users to read-only access to 'bar' as well, -thus providing the transitive readonlyness that we desire. - -=== Dirnode sizes, mutable-file initial read sizes === - -How big are dirnodes? When reading dirnode data out of mutable files, how -large should our initial read be? If we guess exactly, we can read a dirnode -in a single round-trip, and update one in two RTT. If we guess too high, -we'll waste some amount of bandwidth. If we guess low, we need to make a -second pass to get the data (or the encrypted privkey, for writes), which -will cost us at least another RTT. - -Assuming child names are between 10 and 99 characters long, how long are the -various pieces of a dirnode? - - netstring(name) ~= 4+len(name) - chk-cap = 97 (for 4-char filesizes) - dir-rw-cap = 88 - dir-ro-cap = 91 - netstring(cap) = 4+len(cap) - encrypted(cap) = 16+cap+32 - JSON({}) = 2 - JSON({ctime=float,mtime=float}): 57 - netstring(metadata) = 4+57 = 61 - -so a CHK entry is: - 5+ 4+len(name) + 4+97 + 5+16+97+32 + 4+57 -And a 15-byte filename gives a 336-byte entry. When the entry points at a -subdirectory instead of a file, the entry is a little bit smaller. So an -empty directory uses 0 bytes, a directory with one child uses about 336 -bytes, a directory with two children uses about 672, etc. - -When the dirnode data is encoding using our default 3-of-10, that means we -get 112ish bytes of data in each share per child. - -The pubkey, signature, and hashes form the first 935ish bytes of the -container, then comes our data, then about 1216 bytes of encprivkey. So if we -read the first: - - 1kB: we get 65bytes of dirnode data : only empty directories - 1kiB: 89bytes of dirnode data : maybe one short-named subdir - 2kB: 1065bytes: about 9 entries - 3kB: 2065bytes: about 18 entries, or 7.5 entries plus the encprivkey - 4kB: 3065bytes: about 27 entries, or about 16.5 plus the encprivkey - -So we've written the code to do an initial read of 2kB from each share when -we read the mutable file, which should give good performance (one RTT) for -small directories. - - -== Design Goals, redux == - -How well does this design meet the goals? - - #1 functional: YES: the code works and has extensive unit tests - #2 documentable: YES: this document is the existence proof - #3 confidential: YES: see below - #4 integrity: MOSTLY: a coalition of storage servers can rollback individual - mutable files, but not a single one. No server can - substitute fake data as genuine. - #5 availability: YES: as long as 'k' storage servers are present and have - the same version of the mutable file, the dirnode will - be available. - #6 efficient: MOSTLY: - network: single dirnode lookup is very efficient, since clients can - fetch specific keys rather than being required to get or set - the entire dirnode each time. Traversing many directories - takes a lot of roundtrips, and these can't be collapsed with - promise-pipelining because the intermediate values must only - be visible to the client. Modifying many dirnodes at once - (e.g. importing a large pre-existing directory tree) is pretty - slow, since each graph edge must be created independently. - storage: each child has a separate IV, which makes them larger than - if all children were aggregated into a single encrypted string - #7 delegation: VERY: each dirnode is a completely independent object, - to which clients can be granted separate read-write or - read-only access - #8 updateness: VERY: with only a single point of access, and no caching, - each client operation starts by fetching the current - value, so there are no opportunities for staleness - #9 monotonicity: VERY: the single point of access also protects against - retrograde motion - - - -=== Confidentiality leaks in the vdrive server === - -Dirnode (and the mutable files upon which they are based) are very private -against other clients: traffic between the client and the storage servers is -protected by the Foolscap SSL connection, so they can observe very little. -Storage index values are hashes of secrets and thus unguessable, and they are -not made public, so other clients cannot snoop through encrypted dirnodes -that they have not been told about. - -Storage servers can observe access patterns and see ciphertext, but they -cannot see the plaintext (of child names, metadata, or URIs). If an attacker -operates a significant number of storage servers, they can infer the shape of -the directory structure by assuming that directories are usually accessed -from root to leaf in rapid succession. Since filenames are usually much -shorter than read-caps and write-caps, the attacker can use the length of the -ciphertext to guess the number of children of each node, and might be able to -guess the length of the child names (or at least their sum). From this, the -attacker may be able to build up a graph with the same shape as the plaintext -filesystem, but with unlabeled edges and unknown file contents. - - -=== Integrity failures in the vdrive server === - -The mutable file's integrity mechanism (RSA signature on the hash of the file -contents) prevents the storage server from modifying the dirnode's contents -without detection. Therefore the storage servers can make the dirnode -unavailable, but not corrupt it. - -A sufficient number of colluding storage servers can perform a rollback -attack: replace all shares of the whole mutable file with an earlier version. -TODO: To prevent this, when retrieving the contents of a mutable file, the -client should query more servers than necessary and use the highest available -version number. This insures that one or two misbehaving storage servers -cannot cause this rollback on their own. - - -=== Improving the efficiency of dirnodes === - -The current mutable-file -based dirnode scheme suffers from certain -inefficiencies. A very large directory (with thousands or millions of -children) will take a significant time to extract any single entry, because -the whole file must be downloaded first, then parsed and searched to find the -desired child entry. Likewise, modifying a single child will require the -whole file to be re-uploaded. - -The current design assumes (and in some cases, requires) that dirnodes remain -small. The mutable files on which dirnodes are based are currently using -"SDMF" ("Small Distributed Mutable File") design rules, which state that the -size of the data shall remain below one megabyte. More advanced forms of -mutable files (MDMF and LDMF) are in the design phase to allow efficient -manipulation of larger mutable files. This would reduce the work needed to -modify a single entry in a large directory. - -Judicious caching may help improve the reading-large-directory case. Some -form of mutable index at the beginning of the dirnode might help as well. The -MDMF design rules allow for efficient random-access reads from the middle of -the file, which would give the index something useful to point at. - -The current SDMF design generates a new RSA public/private keypair for each -directory. This takes considerable time and CPU effort, generally one or two -seconds per directory. We have designed (but not yet built) a DSA-based -mutable file scheme which will use shared parameters to reduce the -directory-creation effort to a bare minimum (picking a random number instead -of generating two random primes). - - -When a backup program is run for the first time, it needs to copy a large -amount of data from a pre-existing filesystem into reliable storage. This -means that a large and complex directory structure needs to be duplicated in -the dirnode layer. With the one-object-per-dirnode approach described here, -this requires as many operations as there are edges in the imported -filesystem graph. - -Another approach would be to aggregate multiple directories into a single -storage object. This object would contain a serialized graph rather than a -single name-to-child dictionary. Most directory operations would fetch the -whole block of data (and presumeably cache it for a while to avoid lots of -re-fetches), and modification operations would need to replace the whole -thing at once. This "realm" approach would have the added benefit of -combining more data into a single encrypted bundle (perhaps hiding the shape -of the graph from a determined attacker), and would reduce round-trips when -performing deep directory traversals (assuming the realm was already cached). -It would also prevent fine-grained rollback attacks from working: a coalition -of storage servers could change the entire realm to look like an earlier -state, but it could not independently roll back individual directories. - -The drawbacks of this aggregation would be that small accesses (adding a -single child, looking up a single child) would require pulling or pushing a -lot of unrelated data, increasing network overhead (and necessitating -test-and-set semantics for the modification side, which increases the chances -that a user operation will fail, making it more challenging to provide -promises of atomicity to the user). - -It would also make it much more difficult to enable the delegation -("sharing") of specific directories. Since each aggregate "realm" provides -all-or-nothing access control, the act of delegating any directory from the -middle of the realm would require the realm first be split into the upper -piece that isn't being shared and the lower piece that is. This splitting -would have to be done in response to what is essentially a read operation, -which is not traditionally supposed to be a high-effort action. On the other -hand, it may be possible to aggregate the ciphertext, but use distinct -encryption keys for each component directory, to get the benefits of both -schemes at once. - - -=== Dirnode expiration and leases === - -Dirnodes are created any time a client wishes to add a new directory. How -long do they live? What's to keep them from sticking around forever, taking -up space that nobody can reach any longer? - -Mutable files are created with limited-time "leases", which keep the shares -alive until the last lease has expired or been cancelled. Clients which know -and care about specific dirnodes can ask to keep them alive for a while, by -renewing a lease on them (with a typical period of one month). Clients are -expected to assist in the deletion of dirnodes by canceling their leases as -soon as they are done with them. This means that when a client deletes a -directory, it should also cancel its lease on that directory. When the lease -count on a given share goes to zero, the storage server can delete the -related storage. Multiple clients may all have leases on the same dirnode: -the server may delete the shares only after all of the leases have gone away. - -We expect that clients will periodically create a "manifest": a list of -so-called "refresh capabilities" for all of the dirnodes and files that they -can reach. They will give this manifest to the "repairer", which is a service -that keeps files (and dirnodes) alive on behalf of clients who cannot take on -this responsibility for themselves. These refresh capabilities include the -storage index, but do *not* include the readkeys or writekeys, so the -repairer does not get to read the files or directories that it is helping to -keep alive. - -After each change to the user's vdrive, the client creates a manifest and -looks for differences from their previous version. Anything which was removed -prompts the client to send out lease-cancellation messages, allowing the data -to be deleted. - - -== Starting Points: root dirnodes == - -Any client can record the URI of a directory node in some external form (say, -in a local file) and use it as the starting point of later traversal. Each -Tahoe user is expected to create a new (unattached) dirnode when they first -start using the grid, and record its URI for later use. - -== Mounting and Sharing Directories == - -The biggest benefit of this dirnode approach is that sharing individual -directories is almost trivial. Alice creates a subdirectory that she wants to -use to share files with Bob. This subdirectory is attached to Alice's -filesystem at "~alice/share-with-bob". She asks her filesystem for the -read-write directory URI for that new directory, and emails it to Bob. When -Bob receives the URI, he asks his own local vdrive to attach the given URI, -perhaps at a place named "~bob/shared-with-alice". Every time either party -writes a file into this directory, the other will be able to read it. If -Alice prefers, she can give a read-only URI to Bob instead, and then Bob will -be able to read files but not change the contents of the directory. Neither -Alice nor Bob will get access to any files above the mounted directory: there -are no 'parent directory' pointers. If Alice creates a nested set of -directories, "~alice/share-with-bob/subdir2", and gives a read-only URI to -share-with-bob to Bob, then Bob will be unable to write to either -share-with-bob/ or subdir2/. - -A suitable UI needs to be created to allow users to easily perform this -sharing action: dragging a folder their vdrive to an IM or email user icon, -for example. The UI will need to give the sending user an opportunity to -indicate whether they want to grant read-write or read-only access to the -recipient. The recipient then needs an interface to drag the new folder into -their vdrive and give it a home. - -== Revocation == - -When Alice decides that she no longer wants Bob to be able to access the -shared directory, what should she do? Suppose she's shared this folder with -both Bob and Carol, and now she wants Carol to retain access to it but Bob to -be shut out. Ideally Carol should not have to do anything: her access should -continue unabated. - -The current plan is to have her client create a deep copy of the folder in -question, delegate access to the new folder to the remaining members of the -group (Carol), asking the lucky survivors to replace their old reference with -the new one. Bob may still have access to the old folder, but he is now the -only one who cares: everyone else has moved on, and he will no longer be able -to see their new changes. In a strict sense, this is the strongest form of -revocation that can be accomplished: there is no point trying to force Bob to -forget about the files that he read a moment before being kicked out. In -addition it must be noted that anyone who can access the directory can proxy -for Bob, reading files to him and accepting changes whenever he wants. -Preventing delegation between communication parties is just as pointless as -asking Bob to forget previously accessed files. However, there may be value -to configuring the UI to ask Carol to not share files with Bob, or to -removing all files from Bob's view at the same time his access is revoked. - diff --git a/docs/file-encoding.txt b/docs/file-encoding.txt deleted file mode 100644 index 23862ead..00000000 --- a/docs/file-encoding.txt +++ /dev/null @@ -1,148 +0,0 @@ - -== FileEncoding == - -When the client wishes to upload an immutable file, the first step is to -decide upon an encryption key. There are two methods: convergent or random. -The goal of the convergent-key method is to make sure that multiple uploads -of the same file will result in only one copy on the grid, whereas the -random-key method does not provide this "convergence" feature. - -The convergent-key method computes the SHA-256d hash of a single-purpose tag, -the encoding parameters, a "convergence secret", and the contents of the -file. It uses a portion of the resulting hash as the AES encryption key. -There are security concerns with using convergence this approach (the -"partial-information guessing attack", please see ticket #365 for some -references), so Tahoe uses a separate (randomly-generated) "convergence -secret" for each node, stored in NODEDIR/private/convergence . The encoding -parameters (k, N, and the segment size) are included in the hash to make sure -that two different encodings of the same file will get different keys. This -method requires an extra IO pass over the file, to compute this key, and -encryption cannot be started until the pass is complete. This means that the -convergent-key method will require at least two total passes over the file. - -The random-key method simply chooses a random encryption key. Convergence is -disabled, however this method does not require a separate IO pass, so upload -can be done with a single pass. This mode makes it easier to perform -streaming upload. - -Regardless of which method is used to generate the key, the plaintext file is -encrypted (using AES in CTR mode) to produce a ciphertext. This ciphertext is -then erasure-coded and uploaded to the servers. Two hashes of the ciphertext -are generated as the encryption proceeds: a flat hash of the whole -ciphertext, and a Merkle tree. These are used to verify the correctness of -the erasure decoding step, and can be used by a "verifier" process to make -sure the file is intact without requiring the decryption key. - -The encryption key is hashed (with SHA-256d and a single-purpose tag) to -produce the "Storage Index". This Storage Index (or SI) is used to identify -the shares produced by the method described below. The grid can be thought of -as a large table that maps Storage Index to a ciphertext. Since the -ciphertext is stored as erasure-coded shares, it can also be thought of as a -table that maps SI to shares. - -Anybody who knows a Storage Index can retrieve the associated ciphertext: -ciphertexts are not secret. - - -[[Image(file-encoding1.png)]] - -The ciphertext file is then broken up into segments. The last segment is -likely to be shorter than the rest. Each segment is erasure-coded into a -number of "blocks". This takes place one segment at a time. (In fact, -encryption and erasure-coding take place at the same time, once per plaintext -segment). Larger segment sizes result in less overhead overall, but increase -both the memory footprint and the "alacrity" (the number of bytes we have to -receive before we can deliver validated plaintext to the user). The current -default segment size is 128KiB. - -One block from each segment is sent to each shareholder (aka leaseholder, -aka landlord, aka storage node, aka peer). The "share" held by each remote -shareholder is nominally just a collection of these blocks. The file will -be recoverable when a certain number of shares have been retrieved. - -[[Image(file-encoding2.png)]] - -The blocks are hashed as they are generated and transmitted. These -block hashes are put into a Merkle hash tree. When the last share has been -created, the merkle tree is completed and delivered to the peer. Later, when -we retrieve these blocks, the peer will send many of the merkle hash tree -nodes ahead of time, so we can validate each block independently. - -The root of this block hash tree is called the "block root hash" and -used in the next step. - -[[Image(file-encoding3.png)]] - -There is a higher-level Merkle tree called the "share hash tree". Its leaves -are the block root hashes from each share. The root of this tree is called -the "share root hash" and is included in the "URI Extension Block", aka UEB. -The ciphertext hash and Merkle tree are also put here, along with the -original file size, and the encoding parameters. The UEB contains all the -non-secret values that could be put in the URI, but would have made the URI -too big. So instead, the UEB is stored with the share, and the hash of the -UEB is put in the URI. - -The URI then contains the secret encryption key and the UEB hash. It also -contains the basic encoding parameters (k and N) and the file size, to make -download more efficient (by knowing the number of required shares ahead of -time, sufficient download queries can be generated in parallel). - -The URI (also known as the immutable-file read-cap, since possessing it -grants the holder the capability to read the file's plaintext) is then -represented as a (relatively) short printable string like so: - - URI:CHK:auxet66ynq55naiy2ay7cgrshm:6rudoctmbxsmbg7gwtjlimd6umtwrrsxkjzthuldsmo4nnfoc6fa:3:10:1000000 - -[[Image(file-encoding4.png)]] - -During download, when a peer begins to transmit a share, it first transmits -all of the parts of the share hash tree that are necessary to validate its -block root hash. Then it transmits the portions of the block hash tree -that are necessary to validate the first block. Then it transmits the -first block. It then continues this loop: transmitting any portions of the -block hash tree to validate block#N, then sending block#N. - -[[Image(file-encoding5.png)]] - -So the "share" that is sent to the remote peer actually consists of three -pieces, sent in a specific order as they become available, and retrieved -during download in a different order according to when they are needed. - -The first piece is the blocks themselves, one per segment. The last -block will likely be shorter than the rest, because the last segment is -probably shorter than the rest. The second piece is the block hash tree, -consisting of a total of two SHA-1 hashes per block. The third piece is a -hash chain from the share hash tree, consisting of log2(numshares) hashes. - -During upload, all blocks are sent first, followed by the block hash -tree, followed by the share hash chain. During download, the share hash chain -is delivered first, followed by the block root hash. The client then uses -the hash chain to validate the block root hash. Then the peer delivers -enough of the block hash tree to validate the first block, followed by -the first block itself. The block hash chain is used to validate the -block, then it is passed (along with the first block from several other -peers) into decoding, to produce the first segment of crypttext, which is -then decrypted to produce the first segment of plaintext, which is finally -delivered to the user. - -[[Image(file-encoding6.png)]] - -== Hashes == - -All hashes use SHA-256d, as defined in Practical Cryptography (by Ferguson -and Schneier). All hashes use a single-purpose tag, e.g. the hash that -converts an encryption key into a storage index is defined as follows: - - SI = SHA256d(netstring("allmydata_immutable_key_to_storage_index_v1") + key) - -When two separate values need to be combined together in a hash, we wrap each -in a netstring. - -Using SHA-256d (instead of plain SHA-256) guards against length-extension -attacks. Using the tag protects our Merkle trees against attacks in which the -hash of a leaf is confused with a hash of two children (allowing an attacker -to generate corrupted data that nevertheless appears to be valid), and is -simply good "cryptograhic hygiene". The "Chosen Protocol Attack" by Kelsey, -Schneier, and Wagner (http://www.schneier.com/paper-chosen-protocol.html) is -relevant. Putting the tag in a netstring guards against attacks that seek to -confuse the end of the tag with the beginning of the subsequent value. diff --git a/docs/file-encoding1.svg b/docs/file-encoding1.svg deleted file mode 100644 index 06b702a2..00000000 --- a/docs/file-encoding1.svg +++ /dev/null @@ -1,435 +0,0 @@ - - - - - - - - - - - - - - - - - image/svg+xml - - - - - - - - FILE (plaintext) - - - - convergentencryptionkey - - - - AES-CTR - - - - FILE (crypttext) - - - - - tag - - - - storageindex - - - - SHA-256 - - - - SHA-256 - - - - - - tag - - - - - encoding parameters - - - - - - randomencryptionkey - - - - or - - - - - - diff --git a/docs/file-encoding2.svg b/docs/file-encoding2.svg deleted file mode 100644 index 6db3de37..00000000 --- a/docs/file-encoding2.svg +++ /dev/null @@ -1,922 +0,0 @@ - - - - - - - - - - - - - image/svg+xml - - - - - - - - FILE (crypttext) - - - - segA - - - - segB - - - - segC - - - - - - - segD - - - - FEC - - - block - A1 - - block - A2 - - block - A3 - - block - A4 - - - - - - - - FEC - - - block - B1 - - block - B2 - - block - B3 - - block - B4 - - - - - - - - FEC - - - block - C1 - - block - C2 - - block - C3 - - block - C4 - - - - - - - - FEC - - - block - D1 - - block - D2 - - block - D3 - - block - D4 - - - - - - - share4 - - peer 4 - - - diff --git a/docs/file-encoding3.svg b/docs/file-encoding3.svg deleted file mode 100644 index fb5fd4c0..00000000 --- a/docs/file-encoding3.svg +++ /dev/null @@ -1,484 +0,0 @@ - - - - - - - - - - - - - image/svg+xml - - - - - - - - SHA - - - - SHA - - - - SHA - - - - SHA - - - - SHA - - - - SHA - - - - SHA - - - share - A4 - - share - B4 - - share - C4 - - share - D4 - - share4 - - peer 4 - - - - - - - - - - - - - Merkle Tree - block hash tree - "block root hash" - - diff --git a/docs/file-encoding4.svg b/docs/file-encoding4.svg deleted file mode 100644 index f4b21d02..00000000 --- a/docs/file-encoding4.svg +++ /dev/null @@ -1,675 +0,0 @@ - - - - - - - - - - - - - - image/svg+xml - - - - - - blockroot hashes - - - SHA - - - - s1 - - - - s2 - - - - s3 - - - - s4 - - - - SHA - - - - SHA - - - - - - - - shares - - - share1 - - - - share2 - - - - share3 - - - - share4 - - - - - - - Merkle Tree - share hash tree - "share root hash" - - URI Extension Block - - - file size - - - - encoding parameters - - - - share root hash - - - - URI / "file read-cap" - - UEB hash - - - - encryption key - - - - - SHA - - - - - - other hashes - - - diff --git a/docs/file-encoding5.svg b/docs/file-encoding5.svg deleted file mode 100644 index a20a1369..00000000 --- a/docs/file-encoding5.svg +++ /dev/null @@ -1,585 +0,0 @@ - - - - - - - - - - - - - image/svg+xml - - - - - - blockroot hashes - - - SHA - - - - s1 - - - - s2 - - - - s3 - - - - s4 - - - - SHA - - - - SHA - - - - - - - - share hash tree - - - SHA - - - - s5 - - - - s6 - - - - s7 - - - - s8 - - - - SHA - - - - SHA - - - - - - - - Merkle Tree - "share root hash" - - - SHA - - - - - merkle hash chainto validate s1 - - - diff --git a/docs/file-encoding6.svg b/docs/file-encoding6.svg deleted file mode 100644 index 09ced3fe..00000000 --- a/docs/file-encoding6.svg +++ /dev/null @@ -1,760 +0,0 @@ - - - - - - - - - - - - - image/svg+xml - - - - - - - - SHA - - - - SHA - - - - SHA - - - - SHA - - - - SHA - - - - SHA - - - share - A4 - - share - B4 - - share - C4 - - share - D4 - share4 - - peer 4 - - - - - - - - - - - - - Merkle Tree - block hash tree - "block root hash" - blockroot hashes - - - SHA - - - - s1 - - - - s2 - - - - s3 - - - - s4 - - - - SHA - - - - SHA - - - - - - - - - Merkle Tree - share hash tree - "share root hash" - - - merkle hash chainto validate s4 - - - - s4 - - - diff --git a/docs/mut.svg b/docs/mut.svg deleted file mode 100644 index 3db01b8e..00000000 --- a/docs/mut.svg +++ /dev/null @@ -1,1602 +0,0 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - image/svg+xml - - - - - - - - - - - - - - - shares - - - - - - - - - - - - - - - - - - - - - - - - Merkle Tree - - - - - - AES-CTR - - - - - SHA256d - - - - - SHA256d - - - - - SHA256d - - - - - - - - - - FEC - - - - - - - - - - - - - - - - - - - - - - - - salt - - - - encryption key - - - - - - - write key - read key - - verifying (public) key - signing (private) key - - - - - encrypted signing key - - - - verify cap - - - read-write cap - - verify cap - - - - write key - read-only cap - verify cap - - - - read key - plaintext - ciphertext - - - - - - - - - - - - - - - - - SHA256dtruncated - - - - - SHA256dtruncated - - - - - SHA256dtruncated - - - - - SHA256dtruncated - - - - - AES-CTR - - - share 1 - - - share 2 - - - share 3 - - - share 4 - - - - - - diff --git a/docs/mutable.txt b/docs/mutable.txt deleted file mode 100644 index 40a5374b..00000000 --- a/docs/mutable.txt +++ /dev/null @@ -1,648 +0,0 @@ - -This describes the "RSA-based mutable files" which were shipped in Tahoe v0.8.0. - -= Mutable Files = - -Mutable File Slots are places with a stable identifier that can hold data -that changes over time. In contrast to CHK slots, for which the -URI/identifier is derived from the contents themselves, the Mutable File Slot -URI remains fixed for the life of the slot, regardless of what data is placed -inside it. - -Each mutable slot is referenced by two different URIs. The "read-write" URI -grants read-write access to its holder, allowing them to put whatever -contents they like into the slot. The "read-only" URI is less powerful, only -granting read access, and not enabling modification of the data. The -read-write URI can be turned into the read-only URI, but not the other way -around. - -The data in these slots is distributed over a number of servers, using the -same erasure coding that CHK files use, with 3-of-10 being a typical choice -of encoding parameters. The data is encrypted and signed in such a way that -only the holders of the read-write URI will be able to set the contents of -the slot, and only the holders of the read-only URI will be able to read -those contents. Holders of either URI will be able to validate the contents -as being written by someone with the read-write URI. The servers who hold the -shares cannot read or modify them: the worst they can do is deny service (by -deleting or corrupting the shares), or attempt a rollback attack (which can -only succeed with the cooperation of at least k servers). - -== Consistency vs Availability == - -There is an age-old battle between consistency and availability. Epic papers -have been written, elaborate proofs have been established, and generations of -theorists have learned that you cannot simultaneously achieve guaranteed -consistency with guaranteed reliability. In addition, the closer to 0 you get -on either axis, the cost and complexity of the design goes up. - -Tahoe's design goals are to largely favor design simplicity, then slightly -favor read availability, over the other criteria. - -As we develop more sophisticated mutable slots, the API may expose multiple -read versions to the application layer. The tahoe philosophy is to defer most -consistency recovery logic to the higher layers. Some applications have -effective ways to merge multiple versions, so inconsistency is not -necessarily a problem (i.e. directory nodes can usually merge multiple "add -child" operations). - -== The Prime Coordination Directive: "Don't Do That" == - -The current rule for applications which run on top of Tahoe is "do not -perform simultaneous uncoordinated writes". That means you need non-tahoe -means to make sure that two parties are not trying to modify the same mutable -slot at the same time. For example: - - * don't give the read-write URI to anyone else. Dirnodes in a private - directory generally satisfy this case, as long as you don't use two - clients on the same account at the same time - * if you give a read-write URI to someone else, stop using it yourself. An - inbox would be a good example of this. - * if you give a read-write URI to someone else, call them on the phone - before you write into it - * build an automated mechanism to have your agents coordinate writes. - For example, we expect a future release to include a FURL for a - "coordination server" in the dirnodes. The rule can be that you must - contact the coordination server and obtain a lock/lease on the file - before you're allowed to modify it. - -If you do not follow this rule, Bad Things will happen. The worst-case Bad -Thing is that the entire file will be lost. A less-bad Bad Thing is that one -or more of the simultaneous writers will lose their changes. An observer of -the file may not see monotonically-increasing changes to the file, i.e. they -may see version 1, then version 2, then 3, then 2 again. - -Tahoe takes some amount of care to reduce the badness of these Bad Things. -One way you can help nudge it from the "lose your file" case into the "lose -some changes" case is to reduce the number of competing versions: multiple -versions of the file that different parties are trying to establish as the -one true current contents. Each simultaneous writer counts as a "competing -version", as does the previous version of the file. If the count "S" of these -competing versions is larger than N/k, then the file runs the risk of being -lost completely. [TODO] If at least one of the writers remains running after -the collision is detected, it will attempt to recover, but if S>(N/k) and all -writers crash after writing a few shares, the file will be lost. - -Note that Tahoe uses serialization internally to make sure that a single -Tahoe node will not perform simultaneous modifications to a mutable file. It -accomplishes this by using a weakref cache of the MutableFileNode (so that -there will never be two distinct MutableFileNodes for the same file), and by -forcing all mutable file operations to obtain a per-node lock before they -run. The Prime Coordination Directive therefore applies to inter-node -conflicts, not intra-node ones. - - -== Small Distributed Mutable Files == - -SDMF slots are suitable for small (<1MB) files that are editing by rewriting -the entire file. The three operations are: - - * allocate (with initial contents) - * set (with new contents) - * get (old contents) - -The first use of SDMF slots will be to hold directories (dirnodes), which map -encrypted child names to rw-URI/ro-URI pairs. - -=== SDMF slots overview === - -Each SDMF slot is created with a public/private key pair. The public key is -known as the "verification key", while the private key is called the -"signature key". The private key is hashed and truncated to 16 bytes to form -the "write key" (an AES symmetric key). The write key is then hashed and -truncated to form the "read key". The read key is hashed and truncated to -form the 16-byte "storage index" (a unique string used as an index to locate -stored data). - -The public key is hashed by itself to form the "verification key hash". - -The write key is hashed a different way to form the "write enabler master". -For each storage server on which a share is kept, the write enabler master is -concatenated with the server's nodeid and hashed, and the result is called -the "write enabler" for that particular server. Note that multiple shares of -the same slot stored on the same server will all get the same write enabler, -i.e. the write enabler is associated with the "bucket", rather than the -individual shares. - -The private key is encrypted (using AES in counter mode) by the write key, -and the resulting crypttext is stored on the servers. so it will be -retrievable by anyone who knows the write key. The write key is not used to -encrypt anything else, and the private key never changes, so we do not need -an IV for this purpose. - -The actual data is encrypted (using AES in counter mode) with a key derived -by concatenating the readkey with the IV, the hashing the results and -truncating to 16 bytes. The IV is randomly generated each time the slot is -updated, and stored next to the encrypted data. - -The read-write URI consists of the write key and the verification key hash. -The read-only URI contains the read key and the verification key hash. The -verify-only URI contains the storage index and the verification key hash. - - URI:SSK-RW:b2a(writekey):b2a(verification_key_hash) - URI:SSK-RO:b2a(readkey):b2a(verification_key_hash) - URI:SSK-Verify:b2a(storage_index):b2a(verification_key_hash) - -Note that this allows the read-only and verify-only URIs to be derived from -the read-write URI without actually retrieving the public keys. Also note -that it means the read-write agent must validate both the private key and the -public key when they are first fetched. All users validate the public key in -exactly the same way. - -The SDMF slot is allocated by sending a request to the storage server with a -desired size, the storage index, and the write enabler for that server's -nodeid. If granted, the write enabler is stashed inside the slot's backing -store file. All further write requests must be accompanied by the write -enabler or they will not be honored. The storage server does not share the -write enabler with anyone else. - -The SDMF slot structure will be described in more detail below. The important -pieces are: - - * a sequence number - * a root hash "R" - * the encoding parameters (including k, N, file size, segment size) - * a signed copy of [seqnum,R,encoding_params], using the signature key - * the verification key (not encrypted) - * the share hash chain (part of a Merkle tree over the share hashes) - * the block hash tree (Merkle tree over blocks of share data) - * the share data itself (erasure-coding of read-key-encrypted file data) - * the signature key, encrypted with the write key - -The access pattern for read is: - * hash read-key to get storage index - * use storage index to locate 'k' shares with identical 'R' values - * either get one share, read 'k' from it, then read k-1 shares - * or read, say, 5 shares, discover k, either get more or be finished - * or copy k into the URIs - * read verification key - * hash verification key, compare against verification key hash - * read seqnum, R, encoding parameters, signature - * verify signature against verification key - * read share data, compute block-hash Merkle tree and root "r" - * read share hash chain (leading from "r" to "R") - * validate share hash chain up to the root "R" - * submit share data to erasure decoding - * decrypt decoded data with read-key - * submit plaintext to application - -The access pattern for write is: - * hash write-key to get read-key, hash read-key to get storage index - * use the storage index to locate at least one share - * read verification key and encrypted signature key - * decrypt signature key using write-key - * hash signature key, compare against write-key - * hash verification key, compare against verification key hash - * encrypt plaintext from application with read-key - * application can encrypt some data with the write-key to make it only - available to writers (use this for transitive read-onlyness of dirnodes) - * erasure-code crypttext to form shares - * split shares into blocks - * compute Merkle tree of blocks, giving root "r" for each share - * compute Merkle tree of shares, find root "R" for the file as a whole - * create share data structures, one per server: - * use seqnum which is one higher than the old version - * share hash chain has log(N) hashes, different for each server - * signed data is the same for each server - * now we have N shares and need homes for them - * walk through peers - * if share is not already present, allocate-and-set - * otherwise, try to modify existing share: - * send testv_and_writev operation to each one - * testv says to accept share if their(seqnum+R) <= our(seqnum+R) - * count how many servers wind up with which versions (histogram over R) - * keep going until N servers have the same version, or we run out of servers - * if any servers wound up with a different version, report error to - application - * if we ran out of servers, initiate recovery process (described below) - -=== Server Storage Protocol === - -The storage servers will provide a mutable slot container which is oblivious -to the details of the data being contained inside it. Each storage index -refers to a "bucket", and each bucket has one or more shares inside it. (In a -well-provisioned network, each bucket will have only one share). The bucket -is stored as a directory, using the base32-encoded storage index as the -directory name. Each share is stored in a single file, using the share number -as the filename. - -The container holds space for a container magic number (for versioning), the -write enabler, the nodeid which accepted the write enabler (used for share -migration, described below), a small number of lease structures, the embedded -data itself, and expansion space for additional lease structures. - - # offset size name - 1 0 32 magic verstr "tahoe mutable container v1" plus binary - 2 32 20 write enabler's nodeid - 3 52 32 write enabler - 4 84 8 data size (actual share data present) (a) - 5 92 8 offset of (8) count of extra leases (after data) - 6 100 368 four leases, 92 bytes each - 0 4 ownerid (0 means "no lease here") - 4 4 expiration timestamp - 8 32 renewal token - 40 32 cancel token - 72 20 nodeid which accepted the tokens - 7 468 (a) data - 8 ?? 4 count of extra leases - 9 ?? n*92 extra leases - -The "extra leases" field must be copied and rewritten each time the size of -the enclosed data changes. The hope is that most buckets will have four or -fewer leases and this extra copying will not usually be necessary. - -The (4) "data size" field contains the actual number of bytes of data present -in field (7), such that a client request to read beyond 504+(a) will result -in an error. This allows the client to (one day) read relative to the end of -the file. The container size (that is, (8)-(7)) might be larger, especially -if extra size was pre-allocated in anticipation of filling the container with -a lot of data. - -The offset in (5) points at the *count* of extra leases, at (8). The actual -leases (at (9)) begin 4 bytes later. If the container size changes, both (8) -and (9) must be relocated by copying. - -The server will honor any write commands that provide the write token and do -not exceed the server-wide storage size limitations. Read and write commands -MUST be restricted to the 'data' portion of the container: the implementation -of those commands MUST perform correct bounds-checking to make sure other -portions of the container are inaccessible to the clients. - -The two methods provided by the storage server on these "MutableSlot" share -objects are: - - * readv(ListOf(offset=int, length=int)) - * returns a list of bytestrings, of the various requested lengths - * offset < 0 is interpreted relative to the end of the data - * spans which hit the end of the data will return truncated data - - * testv_and_writev(write_enabler, test_vector, write_vector) - * this is a test-and-set operation which performs the given tests and only - applies the desired writes if all tests succeed. This is used to detect - simultaneous writers, and to reduce the chance that an update will lose - data recently written by some other party (written after the last time - this slot was read). - * test_vector=ListOf(TupleOf(offset, length, opcode, specimen)) - * the opcode is a string, from the set [gt, ge, eq, le, lt, ne] - * each element of the test vector is read from the slot's data and - compared against the specimen using the desired (in)equality. If all - tests evaluate True, the write is performed - * write_vector=ListOf(TupleOf(offset, newdata)) - * offset < 0 is not yet defined, it probably means relative to the - end of the data, which probably means append, but we haven't nailed - it down quite yet - * write vectors are executed in order, which specifies the results of - overlapping writes - * return value: - * error: OutOfSpace - * error: something else (io error, out of memory, whatever) - * (True, old_test_data): the write was accepted (test_vector passed) - * (False, old_test_data): the write was rejected (test_vector failed) - * both 'accepted' and 'rejected' return the old data that was used - for the test_vector comparison. This can be used by the client - to detect write collisions, including collisions for which the - desired behavior was to overwrite the old version. - -In addition, the storage server provides several methods to access these -share objects: - - * allocate_mutable_slot(storage_index, sharenums=SetOf(int)) - * returns DictOf(int, MutableSlot) - * get_mutable_slot(storage_index) - * returns DictOf(int, MutableSlot) - * or raises KeyError - -We intend to add an interface which allows small slots to allocate-and-write -in a single call, as well as do update or read in a single call. The goal is -to allow a reasonably-sized dirnode to be created (or updated, or read) in -just one round trip (to all N shareholders in parallel). - -==== migrating shares ==== - -If a share must be migrated from one server to another, two values become -invalid: the write enabler (since it was computed for the old server), and -the lease renew/cancel tokens. - -Suppose that a slot was first created on nodeA, and was thus initialized with -WE(nodeA) (= H(WEM+nodeA)). Later, for provisioning reasons, the share is -moved from nodeA to nodeB. - -Readers may still be able to find the share in its new home, depending upon -how many servers are present in the grid, where the new nodeid lands in the -permuted index for this particular storage index, and how many servers the -reading client is willing to contact. - -When a client attempts to write to this migrated share, it will get a "bad -write enabler" error, since the WE it computes for nodeB will not match the -WE(nodeA) that was embedded in the share. When this occurs, the "bad write -enabler" message must include the old nodeid (e.g. nodeA) that was in the -share. - -The client then computes H(nodeB+H(WEM+nodeA)), which is the same as -H(nodeB+WE(nodeA)). The client sends this along with the new WE(nodeB), which -is H(WEM+nodeB). Note that the client only sends WE(nodeB) to nodeB, never to -anyone else. Also note that the client does not send a value to nodeB that -would allow the node to impersonate the client to a third node: everything -sent to nodeB will include something specific to nodeB in it. - -The server locally computes H(nodeB+WE(nodeA)), using its own node id and the -old write enabler from the share. It compares this against the value supplied -by the client. If they match, this serves as proof that the client was able -to compute the old write enabler. The server then accepts the client's new -WE(nodeB) and writes it into the container. - -This WE-fixup process requires an extra round trip, and requires the error -message to include the old nodeid, but does not require any public key -operations on either client or server. - -Migrating the leases will require a similar protocol. This protocol will be -defined concretely at a later date. - -=== Code Details === - -The MutableFileNode class is used to manipulate mutable files (as opposed to -ImmutableFileNodes). These are initially generated with -client.create_mutable_file(), and later recreated from URIs with -client.create_node_from_uri(). Instances of this class will contain a URI and -a reference to the client (for peer selection and connection). - -NOTE: this section is out of date. Please see src/allmydata/interfaces.py -(the section on IMutableFilesystemNode) for more accurate information. - -The methods of MutableFileNode are: - - * download_to_data() -> [deferred] newdata, NotEnoughSharesError - * if there are multiple retrieveable versions in the grid, get() returns - the first version it can reconstruct, and silently ignores the others. - In the future, a more advanced API will signal and provide access to - the multiple heads. - * update(newdata) -> OK, UncoordinatedWriteError, NotEnoughSharesError - * overwrite(newdata) -> OK, UncoordinatedWriteError, NotEnoughSharesError - -download_to_data() causes a new retrieval to occur, pulling the current -contents from the grid and returning them to the caller. At the same time, -this call caches information about the current version of the file. This -information will be used in a subsequent call to update(), and if another -change has occured between the two, this information will be out of date, -triggering the UncoordinatedWriteError. - -update() is therefore intended to be used just after a download_to_data(), in -the following pattern: - - d = mfn.download_to_data() - d.addCallback(apply_delta) - d.addCallback(mfn.update) - -If the update() call raises UCW, then the application can simply return an -error to the user ("you violated the Prime Coordination Directive"), and they -can try again later. Alternatively, the application can attempt to retry on -its own. To accomplish this, the app needs to pause, download the new -(post-collision and post-recovery) form of the file, reapply their delta, -then submit the update request again. A randomized pause is necessary to -reduce the chances of colliding a second time with another client that is -doing exactly the same thing: - - d = mfn.download_to_data() - d.addCallback(apply_delta) - d.addCallback(mfn.update) - def _retry(f): - f.trap(UncoordinatedWriteError) - d1 = pause(random.uniform(5, 20)) - d1.addCallback(lambda res: mfn.download_to_data()) - d1.addCallback(apply_delta) - d1.addCallback(mfn.update) - return d1 - d.addErrback(_retry) - -Enthusiastic applications can retry multiple times, using a randomized -exponential backoff between each. A particularly enthusiastic application can -retry forever, but such apps are encouraged to provide a means to the user of -giving up after a while. - -UCW does not mean that the update was not applied, so it is also a good idea -to skip the retry-update step if the delta was already applied: - - d = mfn.download_to_data() - d.addCallback(apply_delta) - d.addCallback(mfn.update) - def _retry(f): - f.trap(UncoordinatedWriteError) - d1 = pause(random.uniform(5, 20)) - d1.addCallback(lambda res: mfn.download_to_data()) - def _maybe_apply_delta(contents): - new_contents = apply_delta(contents) - if new_contents != contents: - return mfn.update(new_contents) - d1.addCallback(_maybe_apply_delta) - return d1 - d.addErrback(_retry) - -update() is the right interface to use for delta-application situations, like -directory nodes (in which apply_delta might be adding or removing child -entries from a serialized table). - -Note that any uncoordinated write has the potential to lose data. We must do -more analysis to be sure, but it appears that two clients who write to the -same mutable file at the same time (even if both eventually retry) will, with -high probability, result in one client observing UCW and the other silently -losing their changes. It is also possible for both clients to observe UCW. -The moral of the story is that the Prime Coordination Directive is there for -a reason, and that recovery/UCW/retry is not a subsitute for write -coordination. - -overwrite() tells the client to ignore this cached version information, and -to unconditionally replace the mutable file's contents with the new data. -This should not be used in delta application, but rather in situations where -you want to replace the file's contents with completely unrelated ones. When -raw files are uploaded into a mutable slot through the tahoe webapi (using -POST and the ?mutable=true argument), they are put in place with overwrite(). - - - -The peer-selection and data-structure manipulation (and signing/verification) -steps will be implemented in a separate class in allmydata/mutable.py . - -=== SMDF Slot Format === - -This SMDF data lives inside a server-side MutableSlot container. The server -is oblivious to this format. - -This data is tightly packed. In particular, the share data is defined to run -all the way to the beginning of the encrypted private key (the encprivkey -offset is used both to terminate the share data and to begin the encprivkey). - - # offset size name - 1 0 1 version byte, \x00 for this format - 2 1 8 sequence number. 2^64-1 must be handled specially, TBD - 3 9 32 "R" (root of share hash Merkle tree) - 4 41 16 IV (share data is AES(H(readkey+IV)) ) - 5 57 18 encoding parameters: - 57 1 k - 58 1 N - 59 8 segment size - 67 8 data length (of original plaintext) - 6 75 32 offset table: - 75 4 (8) signature - 79 4 (9) share hash chain - 83 4 (10) block hash tree - 87 4 (11) share data - 91 8 (12) encrypted private key - 99 8 (13) EOF - 7 107 436ish verification key (2048 RSA key) - 8 543ish 256ish signature=RSAenc(sigkey, H(version+seqnum+r+IV+encparm)) - 9 799ish (a) share hash chain, encoded as: - "".join([pack(">H32s", shnum, hash) - for (shnum,hash) in needed_hashes]) -10 (927ish) (b) block hash tree, encoded as: - "".join([pack(">32s",hash) for hash in block_hash_tree]) -11 (935ish) LEN share data (no gap between this and encprivkey) -12 ?? 1216ish encrypted private key= AESenc(write-key, RSA-key) -13 ?? -- EOF - -(a) The share hash chain contains ceil(log(N)) hashes, each 32 bytes long. - This is the set of hashes necessary to validate this share's leaf in the - share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes. -(b) The block hash tree contains ceil(length/segsize) hashes, each 32 bytes - long. This is the set of hashes necessary to validate any given block of - share data up to the per-share root "r". Each "r" is a leaf of the share - has tree (with root "R"), from which a minimal subset of hashes is put in - the share hash chain in (8). - -=== Recovery === - -The first line of defense against damage caused by colliding writes is the -Prime Coordination Directive: "Don't Do That". - -The second line of defense is to keep "S" (the number of competing versions) -lower than N/k. If this holds true, at least one competing version will have -k shares and thus be recoverable. Note that server unavailability counts -against us here: the old version stored on the unavailable server must be -included in the value of S. - -The third line of defense is our use of testv_and_writev() (described below), -which increases the convergence of simultaneous writes: one of the writers -will be favored (the one with the highest "R"), and that version is more -likely to be accepted than the others. This defense is least effective in the -pathological situation where S simultaneous writers are active, the one with -the lowest "R" writes to N-k+1 of the shares and then dies, then the one with -the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the -one with the highest "R" writes to k-1 shares and dies. Any other sequencing -will allow the highest "R" to write to at least k shares and establish a new -revision. - -The fourth line of defense is the fact that each client keeps writing until -at least one version has N shares. This uses additional servers, if -necessary, to make sure that either the client's version or some -newer/overriding version is highly available. - -The fifth line of defense is the recovery algorithm, which seeks to make sure -that at least *one* version is highly available, even if that version is -somebody else's. - -The write-shares-to-peers algorithm is as follows: - - * permute peers according to storage index - * walk through peers, trying to assign one share per peer - * for each peer: - * send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test - * this means that we will overwrite any old versions, and we will - overwrite simultaenous writers of the same version if our R is higher. - We will not overwrite writers using a higher seqnum. - * record the version that each share winds up with. If the write was - accepted, this is our own version. If it was rejected, read the - old_test_data to find out what version was retained. - * if old_test_data indicates the seqnum was equal or greater than our - own, mark the "Simultanous Writes Detected" flag, which will eventually - result in an error being reported to the writer (in their close() call). - * build a histogram of "R" values - * repeat until the histogram indicate that some version (possibly ours) - has N shares. Use new servers if necessary. - * If we run out of servers: - * if there are at least shares-of-happiness of any one version, we're - happy, so return. (the close() might still get an error) - * not happy, need to reinforce something, goto RECOVERY - -RECOVERY: - * read all shares, count the versions, identify the recoverable ones, - discard the unrecoverable ones. - * sort versions: locate max(seqnums), put all versions with that seqnum - in the list, sort by number of outstanding shares. Then put our own - version. (TODO: put versions with seqnum us ahead of us?). - * for each version: - * attempt to recover that version - * if not possible, remove it from the list, go to next one - * if recovered, start at beginning of peer list, push that version, - continue until N shares are placed - * if pushing our own version, bump up the seqnum to one higher than - the max seqnum we saw - * if we run out of servers: - * schedule retry and exponential backoff to repeat RECOVERY - * admit defeat after some period? presumeably the client will be shut down - eventually, maybe keep trying (once per hour?) until then. - - - - -== Medium Distributed Mutable Files == - -These are just like the SDMF case, but: - - * we actually take advantage of the Merkle hash tree over the blocks, by - reading a single segment of data at a time (and its necessary hashes), to - reduce the read-time alacrity - * we allow arbitrary writes to the file (i.e. seek() is provided, and - O_TRUNC is no longer required) - * we write more code on the client side (in the MutableFileNode class), to - first read each segment that a write must modify. This looks exactly like - the way a normal filesystem uses a block device, or how a CPU must perform - a cache-line fill before modifying a single word. - * we might implement some sort of copy-based atomic update server call, - to allow multiple writev() calls to appear atomic to any readers. - -MDMF slots provide fairly efficient in-place edits of very large files (a few -GB). Appending data is also fairly efficient, although each time a power of 2 -boundary is crossed, the entire file must effectively be re-uploaded (because -the size of the block hash tree changes), so if the filesize is known in -advance, that space ought to be pre-allocated (by leaving extra space between -the block hash tree and the actual data). - -MDMF1 uses the Merkle tree to enable low-alacrity random-access reads. MDMF2 -adds cache-line reads to allow random-access writes. - -== Large Distributed Mutable Files == - -LDMF slots use a fundamentally different way to store the file, inspired by -Mercurial's "revlog" format. They enable very efficient insert/remove/replace -editing of arbitrary spans. Multiple versions of the file can be retained, in -a revision graph that can have multiple heads. Each revision can be -referenced by a cryptographic identifier. There are two forms of the URI, one -that means "most recent version", and a longer one that points to a specific -revision. - -Metadata can be attached to the revisions, like timestamps, to enable rolling -back an entire tree to a specific point in history. - -LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2 -provides explicit support for revision identifiers and branching. - -== TODO == - -improve allocate-and-write or get-writer-buckets API to allow one-call (or -maybe two-call) updates. The challenge is in figuring out which shares are on -which machines. First cut will have lots of round trips. - -(eventually) define behavior when seqnum wraps. At the very least make sure -it can't cause a security problem. "the slot is worn out" is acceptable. - -(eventually) define share-migration lease update protocol. Including the -nodeid who accepted the lease is useful, we can use the same protocol as we -do for updating the write enabler. However we need to know which lease to -update.. maybe send back a list of all old nodeids that we find, then try all -of them when we accept the update? - - We now do this in a specially-formatted IndexError exception: - "UNABLE to renew non-existent lease. I have leases accepted by " + - "nodeids: '12345','abcde','44221' ." - -confirm that a repairer can regenerate shares without the private key. Hmm, -without the write-enabler they won't be able to write those shares to the -servers.. although they could add immutable new shares to new servers. diff --git a/docs/specifications/CHK-hashes.svg b/docs/specifications/CHK-hashes.svg new file mode 100644 index 00000000..22bd524f --- /dev/null +++ b/docs/specifications/CHK-hashes.svg @@ -0,0 +1,723 @@ + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + + data(plaintext) + + + + data(crypttext) + + + shares + + + + + + + + + + + + + + + + + + + + + + plaintexthash tree + crypttexthash tree + sharehash tree + + URI Extension Block + plaintext root + plaintext (flat) hash + crypttext root + crypttext (flat) hash + share root + + + + + + + + + + URI + encryptionkey + storageindex + UEBhash + + + + + + AES + + + + + + + FEC + + + + + + A + B : + B is derived from A by hashing, therefore B validates A + + A + B : + B is derived from A by encryption or erasure coding + + A + B : + A is used as an index to retrieve data B + SHARE + CHK File Hashes + + diff --git a/docs/specifications/Makefile b/docs/specifications/Makefile new file mode 100644 index 00000000..2625e04d --- /dev/null +++ b/docs/specifications/Makefile @@ -0,0 +1,19 @@ +SOURCES = CHK-hashes.svg file-encoding1.svg file-encoding2.svg \ + file-encoding3.svg file-encoding4.svg file-encoding5.svg \ + file-encoding6.svg + +PNGS = $(patsubst %.svg,%.png,$(SOURCES)) +EPSS = $(patsubst %.svg,%.eps,$(SOURCES)) + +.PHONY: images-png images-eps +all: $(PNGS) $(EPSS) +images-png: $(PNGS) +images-eps: $(EPSS) + +%.png: %.svg + inkscape -b white -d 90 -D --export-png $@ $< +%.eps: %.svg + inkscape --export-eps $@ $< + +clean: + rm -f *.png *.eps diff --git a/docs/specifications/URI-extension.txt b/docs/specifications/URI-extension.txt new file mode 100644 index 00000000..8ec383e0 --- /dev/null +++ b/docs/specifications/URI-extension.txt @@ -0,0 +1,61 @@ + +"URI Extension Block" + +This block is a serialized dictionary with string keys and string values +(some of which represent numbers, some of which are SHA-256 hashes). All +buckets hold an identical copy. The hash of the serialized data is kept in +the URI. + +The download process must obtain a valid copy of this data before any +decoding can take place. The download process must also obtain other data +before incremental validation can be performed. Full-file validation (for +clients who do not wish to do incremental validation) can be performed solely +with the data from this block. + +At the moment, this data block contains the following keys (and an estimate +on their sizes): + + size 5 + segment_size 7 + num_segments 2 + needed_shares 2 + total_shares 3 + + codec_name 3 + codec_params 5+1+2+1+3=12 + tail_codec_params 12 + + share_root_hash 32 (binary) or 52 (base32-encoded) each + plaintext_hash + plaintext_root_hash + crypttext_hash + crypttext_root_hash + +Some pieces are needed elsewhere (size should be visible without pulling the +block, the Tahoe3 algorithm needs total_shares to find the right peers, all +peer selection algorithms need needed_shares to ask a minimal set of peers). +Some pieces are arguably redundant but are convenient to have present +(test_encode.py makes use of num_segments). + +The rule for this data block is that it should be a constant size for all +files, regardless of file size. Therefore hash trees (which have a size that +depends linearly upon the number of segments) are stored elsewhere in the +bucket, with only the hash tree root stored in this data block. + +This block will be serialized as follows: + + assert that all keys match ^[a-zA-z_\-]+$ + sort all the keys lexicographically + for k in keys: + write("%s:" % k) + write(netstring(data[k])) + + +Serialized size: + + dense binary (but decimal) packing: 160+46=206 + including 'key:' (185) and netstring (6*3+7*4=46) on values: 231 + including 'key:%d\n' (185+13=198) and printable values (46+5*52=306)=504 + +We'll go with the 231-sized block, and provide a tool to dump it as text if +we really want one. diff --git a/docs/specifications/dirnodes.txt b/docs/specifications/dirnodes.txt new file mode 100644 index 00000000..adc8fcab --- /dev/null +++ b/docs/specifications/dirnodes.txt @@ -0,0 +1,433 @@ + += Tahoe Directory Nodes = + +As explained in the architecture docs, Tahoe can be roughly viewed as a +collection of three layers. The lowest layer is the distributed filestore, or +DHT: it provides operations that accept files and upload them to the mesh, +creating a URI in the process which securely references the file's contents. +The middle layer is the filesystem, creating a structure of directories and +filenames resembling the traditional unix/windows filesystems. The top layer +is the application layer, which uses the lower layers to provide useful +services to users, like a backup application, or a way to share files with +friends. + +This document examines the middle layer, the "filesystem". + +== DHT Primitives == + +In the lowest layer (DHT), there are two operations that reference immutable +data (which we refer to as "CHK URIs" or "CHK read-capabilities" or "CHK +read-caps"). One puts data into the grid (but only if it doesn't exist +already), the other retrieves it: + + chk_uri = put(data) + data = get(chk_uri) + +We also have three operations which reference mutable data (which we refer to +as "mutable slots", or "mutable write-caps and read-caps", or sometimes "SSK +slots"). One creates a slot with some initial contents, a second replaces the +contents of a pre-existing slot, and the third retrieves the contents: + + mutable_uri = create(initial_data) + replace(mutable_uri, new_data) + data = get(mutable_uri) + +== Filesystem Goals == + +The main goal for the middle (filesystem) layer is to give users a way to +organize the data that they have uploaded into the mesh. The traditional way +to do this in computer filesystems is to put this data into files, give those +files names, and collect these names into directories. + +Each directory is a series of name-value pairs, which maps "child name" to an +object of some kind. Those child objects might be files, or they might be +other directories. + +The directory structure is therefore a directed graph of nodes, in which each +node might be a directory node or a file node. All file nodes are terminal +nodes. + +== Dirnode Goals == + +What properties might be desirable for these directory nodes? In no +particular order: + + 1: functional. Code which does not work doesn't count. + 2: easy to document, explain, and understand + 3: confidential: it should not be possible for others to see the contents of + a directory + 4: integrity: it should not be possible for others to modify the contents + of a directory + 5: available: directories should survive host failure, just like files do + 6: efficient: in storage, communication bandwidth, number of round-trips + 7: easy to delegate individual directories in a flexible way + 8: updateness: everybody looking at a directory should see the same contents + 9: monotonicity: everybody looking at a directory should see the same + sequence of updates + +Some of these goals are mutually exclusive. For example, availability and +consistency are opposing, so it is not possible to achieve #5 and #8 at the +same time. Moreover, it takes a more complex architecture to get close to the +available-and-consistent ideal, so #2/#6 is in opposition to #5/#8. + +Tahoe-0.7.0 introduced distributed mutable files, which use public key +cryptography for integrity, and erasure coding for availability. These +achieve roughly the same properties as immutable CHK files, but their +contents can be replaced without changing their identity. Dirnodes are then +just a special way of interpreting the contents of a specific mutable file. +Earlier releases used a "vdrive server": this server was abolished in the +0.7.0 release. + +For details of how mutable files work, please see "mutable.txt" in this +directory. + +For the current 0.7.0 release, we achieve most of our desired properties. The +integrity and availability of dirnodes is equivalent to that of regular +(immutable) files, with the exception that there are more simultaneous-update +failure modes for mutable slots. Delegation is quite strong: you can give +read-write or read-only access to any subtree, and the data format used for +dirnodes is such that read-only access is transitive: i.e. if you grant Bob +read-only access to a parent directory, then Bob will get read-only access +(and *not* read-write access) to its children. + +Relative to the previous "vdrive-server" based scheme, the current +distributed dirnode approach gives better availability, but cannot guarantee +updateness quite as well, and requires far more network traffic for each +retrieval and update. Mutable files are somewhat less available than +immutable files, simply because of the increased number of combinations +(shares of an immutable file are either present or not, whereas there are +multiple versions of each mutable file, and you might have some shares of +version 1 and other shares of version 2). In extreme cases of simultaneous +update, mutable files might suffer from non-monotonicity. + + +== Dirnode secret values == + +As mentioned before, dirnodes are simply a special way to interpret the +contents of a mutable file, so the secret keys and capability strings +described in "mutable.txt" are all the same. Each dirnode contains an RSA +public/private keypair, and the holder of the "write capability" will be able +to retrieve the private key (as well as the AES encryption key used for the +data itself). The holder of the "read capability" will be able to obtain the +public key and the AES data key, but not the RSA private key needed to modify +the data. + +The "write capability" for a dirnode grants read-write access to its +contents. This is expressed on concrete form as the "dirnode write cap": a +printable string which contains the necessary secrets to grant this access. +Likewise, the "read capability" grants read-only access to a dirnode, and can +be represented by a "dirnode read cap" string. + +For example, +URI:DIR2:swdi8ge1s7qko45d3ckkyw1aac%3Aar8r5j99a4mezdojejmsfp4fj1zeky9gjigyrid4urxdimego68o +is a write-capability URI, while +URI:DIR2-RO:buxjqykt637u61nnmjg7s8zkny:ar8r5j99a4mezdojejmsfp4fj1zeky9gjigyrid4urxdimego68o +is a read-capability URI, both for the same dirnode. + + +== Dirnode storage format == + +Each dirnode is stored in a single mutable file, distributed in the Tahoe +grid. The contents of this file are a serialized list of netstrings, one per +child. Each child is a list of four netstrings: (name, rocap, rwcap, +metadata). (remember that the contents of the mutable file are encrypted by +the read-cap, so this section describes the plaintext contents of the mutable +file, *after* it has been decrypted by the read-cap). + +The name is simple a UTF-8 -encoded child name. The 'rocap' is a read-only +capability URI to that child, either an immutable (CHK) file, a mutable file, +or a directory. The 'rwcap' is a read-write capability URI for that child, +encrypted with the dirnode's write-cap: this enables the "transitive +readonlyness" property, described further below. The 'metadata' is a +JSON-encoded dictionary of type,value metadata pairs. Some metadata keys are +pre-defined, the rest are left up to the application. + +Each rwcap is stored as IV + ciphertext + MAC. The IV is a 16-byte random +value. The ciphertext is obtained by using AES in CTR mode on the rwcap URI +string, using a key that is formed from a tagged hash of the IV and the +dirnode's writekey. The MAC is a 32-byte SHA-256 -based HMAC (using that same +AES key) over the (IV+ciphertext) pair. + +If Bob has read-only access to the 'bar' directory, and he adds it as a child +to the 'foo' directory, then he will put the read-only cap for 'bar' in both +the rwcap and rocap slots (encrypting the rwcap contents as described above). +If he has full read-write access to 'bar', then he will put the read-write +cap in the 'rwcap' slot, and the read-only cap in the 'rocap' slot. Since +other users who have read-only access to 'foo' will be unable to decrypt its +rwcap slot, this limits those users to read-only access to 'bar' as well, +thus providing the transitive readonlyness that we desire. + +=== Dirnode sizes, mutable-file initial read sizes === + +How big are dirnodes? When reading dirnode data out of mutable files, how +large should our initial read be? If we guess exactly, we can read a dirnode +in a single round-trip, and update one in two RTT. If we guess too high, +we'll waste some amount of bandwidth. If we guess low, we need to make a +second pass to get the data (or the encrypted privkey, for writes), which +will cost us at least another RTT. + +Assuming child names are between 10 and 99 characters long, how long are the +various pieces of a dirnode? + + netstring(name) ~= 4+len(name) + chk-cap = 97 (for 4-char filesizes) + dir-rw-cap = 88 + dir-ro-cap = 91 + netstring(cap) = 4+len(cap) + encrypted(cap) = 16+cap+32 + JSON({}) = 2 + JSON({ctime=float,mtime=float}): 57 + netstring(metadata) = 4+57 = 61 + +so a CHK entry is: + 5+ 4+len(name) + 4+97 + 5+16+97+32 + 4+57 +And a 15-byte filename gives a 336-byte entry. When the entry points at a +subdirectory instead of a file, the entry is a little bit smaller. So an +empty directory uses 0 bytes, a directory with one child uses about 336 +bytes, a directory with two children uses about 672, etc. + +When the dirnode data is encoding using our default 3-of-10, that means we +get 112ish bytes of data in each share per child. + +The pubkey, signature, and hashes form the first 935ish bytes of the +container, then comes our data, then about 1216 bytes of encprivkey. So if we +read the first: + + 1kB: we get 65bytes of dirnode data : only empty directories + 1kiB: 89bytes of dirnode data : maybe one short-named subdir + 2kB: 1065bytes: about 9 entries + 3kB: 2065bytes: about 18 entries, or 7.5 entries plus the encprivkey + 4kB: 3065bytes: about 27 entries, or about 16.5 plus the encprivkey + +So we've written the code to do an initial read of 2kB from each share when +we read the mutable file, which should give good performance (one RTT) for +small directories. + + +== Design Goals, redux == + +How well does this design meet the goals? + + #1 functional: YES: the code works and has extensive unit tests + #2 documentable: YES: this document is the existence proof + #3 confidential: YES: see below + #4 integrity: MOSTLY: a coalition of storage servers can rollback individual + mutable files, but not a single one. No server can + substitute fake data as genuine. + #5 availability: YES: as long as 'k' storage servers are present and have + the same version of the mutable file, the dirnode will + be available. + #6 efficient: MOSTLY: + network: single dirnode lookup is very efficient, since clients can + fetch specific keys rather than being required to get or set + the entire dirnode each time. Traversing many directories + takes a lot of roundtrips, and these can't be collapsed with + promise-pipelining because the intermediate values must only + be visible to the client. Modifying many dirnodes at once + (e.g. importing a large pre-existing directory tree) is pretty + slow, since each graph edge must be created independently. + storage: each child has a separate IV, which makes them larger than + if all children were aggregated into a single encrypted string + #7 delegation: VERY: each dirnode is a completely independent object, + to which clients can be granted separate read-write or + read-only access + #8 updateness: VERY: with only a single point of access, and no caching, + each client operation starts by fetching the current + value, so there are no opportunities for staleness + #9 monotonicity: VERY: the single point of access also protects against + retrograde motion + + + +=== Confidentiality leaks in the vdrive server === + +Dirnode (and the mutable files upon which they are based) are very private +against other clients: traffic between the client and the storage servers is +protected by the Foolscap SSL connection, so they can observe very little. +Storage index values are hashes of secrets and thus unguessable, and they are +not made public, so other clients cannot snoop through encrypted dirnodes +that they have not been told about. + +Storage servers can observe access patterns and see ciphertext, but they +cannot see the plaintext (of child names, metadata, or URIs). If an attacker +operates a significant number of storage servers, they can infer the shape of +the directory structure by assuming that directories are usually accessed +from root to leaf in rapid succession. Since filenames are usually much +shorter than read-caps and write-caps, the attacker can use the length of the +ciphertext to guess the number of children of each node, and might be able to +guess the length of the child names (or at least their sum). From this, the +attacker may be able to build up a graph with the same shape as the plaintext +filesystem, but with unlabeled edges and unknown file contents. + + +=== Integrity failures in the vdrive server === + +The mutable file's integrity mechanism (RSA signature on the hash of the file +contents) prevents the storage server from modifying the dirnode's contents +without detection. Therefore the storage servers can make the dirnode +unavailable, but not corrupt it. + +A sufficient number of colluding storage servers can perform a rollback +attack: replace all shares of the whole mutable file with an earlier version. +TODO: To prevent this, when retrieving the contents of a mutable file, the +client should query more servers than necessary and use the highest available +version number. This insures that one or two misbehaving storage servers +cannot cause this rollback on their own. + + +=== Improving the efficiency of dirnodes === + +The current mutable-file -based dirnode scheme suffers from certain +inefficiencies. A very large directory (with thousands or millions of +children) will take a significant time to extract any single entry, because +the whole file must be downloaded first, then parsed and searched to find the +desired child entry. Likewise, modifying a single child will require the +whole file to be re-uploaded. + +The current design assumes (and in some cases, requires) that dirnodes remain +small. The mutable files on which dirnodes are based are currently using +"SDMF" ("Small Distributed Mutable File") design rules, which state that the +size of the data shall remain below one megabyte. More advanced forms of +mutable files (MDMF and LDMF) are in the design phase to allow efficient +manipulation of larger mutable files. This would reduce the work needed to +modify a single entry in a large directory. + +Judicious caching may help improve the reading-large-directory case. Some +form of mutable index at the beginning of the dirnode might help as well. The +MDMF design rules allow for efficient random-access reads from the middle of +the file, which would give the index something useful to point at. + +The current SDMF design generates a new RSA public/private keypair for each +directory. This takes considerable time and CPU effort, generally one or two +seconds per directory. We have designed (but not yet built) a DSA-based +mutable file scheme which will use shared parameters to reduce the +directory-creation effort to a bare minimum (picking a random number instead +of generating two random primes). + + +When a backup program is run for the first time, it needs to copy a large +amount of data from a pre-existing filesystem into reliable storage. This +means that a large and complex directory structure needs to be duplicated in +the dirnode layer. With the one-object-per-dirnode approach described here, +this requires as many operations as there are edges in the imported +filesystem graph. + +Another approach would be to aggregate multiple directories into a single +storage object. This object would contain a serialized graph rather than a +single name-to-child dictionary. Most directory operations would fetch the +whole block of data (and presumeably cache it for a while to avoid lots of +re-fetches), and modification operations would need to replace the whole +thing at once. This "realm" approach would have the added benefit of +combining more data into a single encrypted bundle (perhaps hiding the shape +of the graph from a determined attacker), and would reduce round-trips when +performing deep directory traversals (assuming the realm was already cached). +It would also prevent fine-grained rollback attacks from working: a coalition +of storage servers could change the entire realm to look like an earlier +state, but it could not independently roll back individual directories. + +The drawbacks of this aggregation would be that small accesses (adding a +single child, looking up a single child) would require pulling or pushing a +lot of unrelated data, increasing network overhead (and necessitating +test-and-set semantics for the modification side, which increases the chances +that a user operation will fail, making it more challenging to provide +promises of atomicity to the user). + +It would also make it much more difficult to enable the delegation +("sharing") of specific directories. Since each aggregate "realm" provides +all-or-nothing access control, the act of delegating any directory from the +middle of the realm would require the realm first be split into the upper +piece that isn't being shared and the lower piece that is. This splitting +would have to be done in response to what is essentially a read operation, +which is not traditionally supposed to be a high-effort action. On the other +hand, it may be possible to aggregate the ciphertext, but use distinct +encryption keys for each component directory, to get the benefits of both +schemes at once. + + +=== Dirnode expiration and leases === + +Dirnodes are created any time a client wishes to add a new directory. How +long do they live? What's to keep them from sticking around forever, taking +up space that nobody can reach any longer? + +Mutable files are created with limited-time "leases", which keep the shares +alive until the last lease has expired or been cancelled. Clients which know +and care about specific dirnodes can ask to keep them alive for a while, by +renewing a lease on them (with a typical period of one month). Clients are +expected to assist in the deletion of dirnodes by canceling their leases as +soon as they are done with them. This means that when a client deletes a +directory, it should also cancel its lease on that directory. When the lease +count on a given share goes to zero, the storage server can delete the +related storage. Multiple clients may all have leases on the same dirnode: +the server may delete the shares only after all of the leases have gone away. + +We expect that clients will periodically create a "manifest": a list of +so-called "refresh capabilities" for all of the dirnodes and files that they +can reach. They will give this manifest to the "repairer", which is a service +that keeps files (and dirnodes) alive on behalf of clients who cannot take on +this responsibility for themselves. These refresh capabilities include the +storage index, but do *not* include the readkeys or writekeys, so the +repairer does not get to read the files or directories that it is helping to +keep alive. + +After each change to the user's vdrive, the client creates a manifest and +looks for differences from their previous version. Anything which was removed +prompts the client to send out lease-cancellation messages, allowing the data +to be deleted. + + +== Starting Points: root dirnodes == + +Any client can record the URI of a directory node in some external form (say, +in a local file) and use it as the starting point of later traversal. Each +Tahoe user is expected to create a new (unattached) dirnode when they first +start using the grid, and record its URI for later use. + +== Mounting and Sharing Directories == + +The biggest benefit of this dirnode approach is that sharing individual +directories is almost trivial. Alice creates a subdirectory that she wants to +use to share files with Bob. This subdirectory is attached to Alice's +filesystem at "~alice/share-with-bob". She asks her filesystem for the +read-write directory URI for that new directory, and emails it to Bob. When +Bob receives the URI, he asks his own local vdrive to attach the given URI, +perhaps at a place named "~bob/shared-with-alice". Every time either party +writes a file into this directory, the other will be able to read it. If +Alice prefers, she can give a read-only URI to Bob instead, and then Bob will +be able to read files but not change the contents of the directory. Neither +Alice nor Bob will get access to any files above the mounted directory: there +are no 'parent directory' pointers. If Alice creates a nested set of +directories, "~alice/share-with-bob/subdir2", and gives a read-only URI to +share-with-bob to Bob, then Bob will be unable to write to either +share-with-bob/ or subdir2/. + +A suitable UI needs to be created to allow users to easily perform this +sharing action: dragging a folder their vdrive to an IM or email user icon, +for example. The UI will need to give the sending user an opportunity to +indicate whether they want to grant read-write or read-only access to the +recipient. The recipient then needs an interface to drag the new folder into +their vdrive and give it a home. + +== Revocation == + +When Alice decides that she no longer wants Bob to be able to access the +shared directory, what should she do? Suppose she's shared this folder with +both Bob and Carol, and now she wants Carol to retain access to it but Bob to +be shut out. Ideally Carol should not have to do anything: her access should +continue unabated. + +The current plan is to have her client create a deep copy of the folder in +question, delegate access to the new folder to the remaining members of the +group (Carol), asking the lucky survivors to replace their old reference with +the new one. Bob may still have access to the old folder, but he is now the +only one who cares: everyone else has moved on, and he will no longer be able +to see their new changes. In a strict sense, this is the strongest form of +revocation that can be accomplished: there is no point trying to force Bob to +forget about the files that he read a moment before being kicked out. In +addition it must be noted that anyone who can access the directory can proxy +for Bob, reading files to him and accepting changes whenever he wants. +Preventing delegation between communication parties is just as pointless as +asking Bob to forget previously accessed files. However, there may be value +to configuring the UI to ask Carol to not share files with Bob, or to +removing all files from Bob's view at the same time his access is revoked. + diff --git a/docs/specifications/file-encoding.txt b/docs/specifications/file-encoding.txt new file mode 100644 index 00000000..23862ead --- /dev/null +++ b/docs/specifications/file-encoding.txt @@ -0,0 +1,148 @@ + +== FileEncoding == + +When the client wishes to upload an immutable file, the first step is to +decide upon an encryption key. There are two methods: convergent or random. +The goal of the convergent-key method is to make sure that multiple uploads +of the same file will result in only one copy on the grid, whereas the +random-key method does not provide this "convergence" feature. + +The convergent-key method computes the SHA-256d hash of a single-purpose tag, +the encoding parameters, a "convergence secret", and the contents of the +file. It uses a portion of the resulting hash as the AES encryption key. +There are security concerns with using convergence this approach (the +"partial-information guessing attack", please see ticket #365 for some +references), so Tahoe uses a separate (randomly-generated) "convergence +secret" for each node, stored in NODEDIR/private/convergence . The encoding +parameters (k, N, and the segment size) are included in the hash to make sure +that two different encodings of the same file will get different keys. This +method requires an extra IO pass over the file, to compute this key, and +encryption cannot be started until the pass is complete. This means that the +convergent-key method will require at least two total passes over the file. + +The random-key method simply chooses a random encryption key. Convergence is +disabled, however this method does not require a separate IO pass, so upload +can be done with a single pass. This mode makes it easier to perform +streaming upload. + +Regardless of which method is used to generate the key, the plaintext file is +encrypted (using AES in CTR mode) to produce a ciphertext. This ciphertext is +then erasure-coded and uploaded to the servers. Two hashes of the ciphertext +are generated as the encryption proceeds: a flat hash of the whole +ciphertext, and a Merkle tree. These are used to verify the correctness of +the erasure decoding step, and can be used by a "verifier" process to make +sure the file is intact without requiring the decryption key. + +The encryption key is hashed (with SHA-256d and a single-purpose tag) to +produce the "Storage Index". This Storage Index (or SI) is used to identify +the shares produced by the method described below. The grid can be thought of +as a large table that maps Storage Index to a ciphertext. Since the +ciphertext is stored as erasure-coded shares, it can also be thought of as a +table that maps SI to shares. + +Anybody who knows a Storage Index can retrieve the associated ciphertext: +ciphertexts are not secret. + + +[[Image(file-encoding1.png)]] + +The ciphertext file is then broken up into segments. The last segment is +likely to be shorter than the rest. Each segment is erasure-coded into a +number of "blocks". This takes place one segment at a time. (In fact, +encryption and erasure-coding take place at the same time, once per plaintext +segment). Larger segment sizes result in less overhead overall, but increase +both the memory footprint and the "alacrity" (the number of bytes we have to +receive before we can deliver validated plaintext to the user). The current +default segment size is 128KiB. + +One block from each segment is sent to each shareholder (aka leaseholder, +aka landlord, aka storage node, aka peer). The "share" held by each remote +shareholder is nominally just a collection of these blocks. The file will +be recoverable when a certain number of shares have been retrieved. + +[[Image(file-encoding2.png)]] + +The blocks are hashed as they are generated and transmitted. These +block hashes are put into a Merkle hash tree. When the last share has been +created, the merkle tree is completed and delivered to the peer. Later, when +we retrieve these blocks, the peer will send many of the merkle hash tree +nodes ahead of time, so we can validate each block independently. + +The root of this block hash tree is called the "block root hash" and +used in the next step. + +[[Image(file-encoding3.png)]] + +There is a higher-level Merkle tree called the "share hash tree". Its leaves +are the block root hashes from each share. The root of this tree is called +the "share root hash" and is included in the "URI Extension Block", aka UEB. +The ciphertext hash and Merkle tree are also put here, along with the +original file size, and the encoding parameters. The UEB contains all the +non-secret values that could be put in the URI, but would have made the URI +too big. So instead, the UEB is stored with the share, and the hash of the +UEB is put in the URI. + +The URI then contains the secret encryption key and the UEB hash. It also +contains the basic encoding parameters (k and N) and the file size, to make +download more efficient (by knowing the number of required shares ahead of +time, sufficient download queries can be generated in parallel). + +The URI (also known as the immutable-file read-cap, since possessing it +grants the holder the capability to read the file's plaintext) is then +represented as a (relatively) short printable string like so: + + URI:CHK:auxet66ynq55naiy2ay7cgrshm:6rudoctmbxsmbg7gwtjlimd6umtwrrsxkjzthuldsmo4nnfoc6fa:3:10:1000000 + +[[Image(file-encoding4.png)]] + +During download, when a peer begins to transmit a share, it first transmits +all of the parts of the share hash tree that are necessary to validate its +block root hash. Then it transmits the portions of the block hash tree +that are necessary to validate the first block. Then it transmits the +first block. It then continues this loop: transmitting any portions of the +block hash tree to validate block#N, then sending block#N. + +[[Image(file-encoding5.png)]] + +So the "share" that is sent to the remote peer actually consists of three +pieces, sent in a specific order as they become available, and retrieved +during download in a different order according to when they are needed. + +The first piece is the blocks themselves, one per segment. The last +block will likely be shorter than the rest, because the last segment is +probably shorter than the rest. The second piece is the block hash tree, +consisting of a total of two SHA-1 hashes per block. The third piece is a +hash chain from the share hash tree, consisting of log2(numshares) hashes. + +During upload, all blocks are sent first, followed by the block hash +tree, followed by the share hash chain. During download, the share hash chain +is delivered first, followed by the block root hash. The client then uses +the hash chain to validate the block root hash. Then the peer delivers +enough of the block hash tree to validate the first block, followed by +the first block itself. The block hash chain is used to validate the +block, then it is passed (along with the first block from several other +peers) into decoding, to produce the first segment of crypttext, which is +then decrypted to produce the first segment of plaintext, which is finally +delivered to the user. + +[[Image(file-encoding6.png)]] + +== Hashes == + +All hashes use SHA-256d, as defined in Practical Cryptography (by Ferguson +and Schneier). All hashes use a single-purpose tag, e.g. the hash that +converts an encryption key into a storage index is defined as follows: + + SI = SHA256d(netstring("allmydata_immutable_key_to_storage_index_v1") + key) + +When two separate values need to be combined together in a hash, we wrap each +in a netstring. + +Using SHA-256d (instead of plain SHA-256) guards against length-extension +attacks. Using the tag protects our Merkle trees against attacks in which the +hash of a leaf is confused with a hash of two children (allowing an attacker +to generate corrupted data that nevertheless appears to be valid), and is +simply good "cryptograhic hygiene". The "Chosen Protocol Attack" by Kelsey, +Schneier, and Wagner (http://www.schneier.com/paper-chosen-protocol.html) is +relevant. Putting the tag in a netstring guards against attacks that seek to +confuse the end of the tag with the beginning of the subsequent value. diff --git a/docs/specifications/file-encoding1.svg b/docs/specifications/file-encoding1.svg new file mode 100644 index 00000000..06b702a2 --- /dev/null +++ b/docs/specifications/file-encoding1.svg @@ -0,0 +1,435 @@ + + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + + FILE (plaintext) + + + + convergentencryptionkey + + + + AES-CTR + + + + FILE (crypttext) + + + + + tag + + + + storageindex + + + + SHA-256 + + + + SHA-256 + + + + + + tag + + + + + encoding parameters + + + + + + randomencryptionkey + + + + or + + + + + + diff --git a/docs/specifications/file-encoding2.svg b/docs/specifications/file-encoding2.svg new file mode 100644 index 00000000..6db3de37 --- /dev/null +++ b/docs/specifications/file-encoding2.svg @@ -0,0 +1,922 @@ + + + + + + + + + + + + + image/svg+xml + + + + + + + + FILE (crypttext) + + + + segA + + + + segB + + + + segC + + + + + + + segD + + + + FEC + + + block + A1 + + block + A2 + + block + A3 + + block + A4 + + + + + + + + FEC + + + block + B1 + + block + B2 + + block + B3 + + block + B4 + + + + + + + + FEC + + + block + C1 + + block + C2 + + block + C3 + + block + C4 + + + + + + + + FEC + + + block + D1 + + block + D2 + + block + D3 + + block + D4 + + + + + + + share4 + + peer 4 + + + diff --git a/docs/specifications/file-encoding3.svg b/docs/specifications/file-encoding3.svg new file mode 100644 index 00000000..fb5fd4c0 --- /dev/null +++ b/docs/specifications/file-encoding3.svg @@ -0,0 +1,484 @@ + + + + + + + + + + + + + image/svg+xml + + + + + + + + SHA + + + + SHA + + + + SHA + + + + SHA + + + + SHA + + + + SHA + + + + SHA + + + share + A4 + + share + B4 + + share + C4 + + share + D4 + + share4 + + peer 4 + + + + + + + + + + + + + Merkle Tree + block hash tree + "block root hash" + + diff --git a/docs/specifications/file-encoding4.svg b/docs/specifications/file-encoding4.svg new file mode 100644 index 00000000..f4b21d02 --- /dev/null +++ b/docs/specifications/file-encoding4.svg @@ -0,0 +1,675 @@ + + + + + + + + + + + + + + image/svg+xml + + + + + + blockroot hashes + + + SHA + + + + s1 + + + + s2 + + + + s3 + + + + s4 + + + + SHA + + + + SHA + + + + + + + + shares + + + share1 + + + + share2 + + + + share3 + + + + share4 + + + + + + + Merkle Tree + share hash tree + "share root hash" + + URI Extension Block + + + file size + + + + encoding parameters + + + + share root hash + + + + URI / "file read-cap" + + UEB hash + + + + encryption key + + + + + SHA + + + + + + other hashes + + + diff --git a/docs/specifications/file-encoding5.svg b/docs/specifications/file-encoding5.svg new file mode 100644 index 00000000..a20a1369 --- /dev/null +++ b/docs/specifications/file-encoding5.svg @@ -0,0 +1,585 @@ + + + + + + + + + + + + + image/svg+xml + + + + + + blockroot hashes + + + SHA + + + + s1 + + + + s2 + + + + s3 + + + + s4 + + + + SHA + + + + SHA + + + + + + + + share hash tree + + + SHA + + + + s5 + + + + s6 + + + + s7 + + + + s8 + + + + SHA + + + + SHA + + + + + + + + Merkle Tree + "share root hash" + + + SHA + + + + + merkle hash chainto validate s1 + + + diff --git a/docs/specifications/file-encoding6.svg b/docs/specifications/file-encoding6.svg new file mode 100644 index 00000000..09ced3fe --- /dev/null +++ b/docs/specifications/file-encoding6.svg @@ -0,0 +1,760 @@ + + + + + + + + + + + + + image/svg+xml + + + + + + + + SHA + + + + SHA + + + + SHA + + + + SHA + + + + SHA + + + + SHA + + + share + A4 + + share + B4 + + share + C4 + + share + D4 + share4 + + peer 4 + + + + + + + + + + + + + Merkle Tree + block hash tree + "block root hash" + blockroot hashes + + + SHA + + + + s1 + + + + s2 + + + + s3 + + + + s4 + + + + SHA + + + + SHA + + + + + + + + + Merkle Tree + share hash tree + "share root hash" + + + merkle hash chainto validate s4 + + + + s4 + + + diff --git a/docs/specifications/mut.svg b/docs/specifications/mut.svg new file mode 100644 index 00000000..3db01b8e --- /dev/null +++ b/docs/specifications/mut.svg @@ -0,0 +1,1602 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + shares + + + + + + + + + + + + + + + + + + + + + + + + Merkle Tree + + + + + + AES-CTR + + + + + SHA256d + + + + + SHA256d + + + + + SHA256d + + + + + + + + + + FEC + + + + + + + + + + + + + + + + + + + + + + + + salt + + + + encryption key + + + + + + + write key + read key + + verifying (public) key + signing (private) key + + + + + encrypted signing key + + + + verify cap + + + read-write cap + + verify cap + + + + write key + read-only cap + verify cap + + + + read key + plaintext + ciphertext + + + + + + + + + + + + + + + + + SHA256dtruncated + + + + + SHA256dtruncated + + + + + SHA256dtruncated + + + + + SHA256dtruncated + + + + + AES-CTR + + + share 1 + + + share 2 + + + share 3 + + + share 4 + + + + + + diff --git a/docs/specifications/mutable.txt b/docs/specifications/mutable.txt new file mode 100644 index 00000000..40a5374b --- /dev/null +++ b/docs/specifications/mutable.txt @@ -0,0 +1,648 @@ + +This describes the "RSA-based mutable files" which were shipped in Tahoe v0.8.0. + += Mutable Files = + +Mutable File Slots are places with a stable identifier that can hold data +that changes over time. In contrast to CHK slots, for which the +URI/identifier is derived from the contents themselves, the Mutable File Slot +URI remains fixed for the life of the slot, regardless of what data is placed +inside it. + +Each mutable slot is referenced by two different URIs. The "read-write" URI +grants read-write access to its holder, allowing them to put whatever +contents they like into the slot. The "read-only" URI is less powerful, only +granting read access, and not enabling modification of the data. The +read-write URI can be turned into the read-only URI, but not the other way +around. + +The data in these slots is distributed over a number of servers, using the +same erasure coding that CHK files use, with 3-of-10 being a typical choice +of encoding parameters. The data is encrypted and signed in such a way that +only the holders of the read-write URI will be able to set the contents of +the slot, and only the holders of the read-only URI will be able to read +those contents. Holders of either URI will be able to validate the contents +as being written by someone with the read-write URI. The servers who hold the +shares cannot read or modify them: the worst they can do is deny service (by +deleting or corrupting the shares), or attempt a rollback attack (which can +only succeed with the cooperation of at least k servers). + +== Consistency vs Availability == + +There is an age-old battle between consistency and availability. Epic papers +have been written, elaborate proofs have been established, and generations of +theorists have learned that you cannot simultaneously achieve guaranteed +consistency with guaranteed reliability. In addition, the closer to 0 you get +on either axis, the cost and complexity of the design goes up. + +Tahoe's design goals are to largely favor design simplicity, then slightly +favor read availability, over the other criteria. + +As we develop more sophisticated mutable slots, the API may expose multiple +read versions to the application layer. The tahoe philosophy is to defer most +consistency recovery logic to the higher layers. Some applications have +effective ways to merge multiple versions, so inconsistency is not +necessarily a problem (i.e. directory nodes can usually merge multiple "add +child" operations). + +== The Prime Coordination Directive: "Don't Do That" == + +The current rule for applications which run on top of Tahoe is "do not +perform simultaneous uncoordinated writes". That means you need non-tahoe +means to make sure that two parties are not trying to modify the same mutable +slot at the same time. For example: + + * don't give the read-write URI to anyone else. Dirnodes in a private + directory generally satisfy this case, as long as you don't use two + clients on the same account at the same time + * if you give a read-write URI to someone else, stop using it yourself. An + inbox would be a good example of this. + * if you give a read-write URI to someone else, call them on the phone + before you write into it + * build an automated mechanism to have your agents coordinate writes. + For example, we expect a future release to include a FURL for a + "coordination server" in the dirnodes. The rule can be that you must + contact the coordination server and obtain a lock/lease on the file + before you're allowed to modify it. + +If you do not follow this rule, Bad Things will happen. The worst-case Bad +Thing is that the entire file will be lost. A less-bad Bad Thing is that one +or more of the simultaneous writers will lose their changes. An observer of +the file may not see monotonically-increasing changes to the file, i.e. they +may see version 1, then version 2, then 3, then 2 again. + +Tahoe takes some amount of care to reduce the badness of these Bad Things. +One way you can help nudge it from the "lose your file" case into the "lose +some changes" case is to reduce the number of competing versions: multiple +versions of the file that different parties are trying to establish as the +one true current contents. Each simultaneous writer counts as a "competing +version", as does the previous version of the file. If the count "S" of these +competing versions is larger than N/k, then the file runs the risk of being +lost completely. [TODO] If at least one of the writers remains running after +the collision is detected, it will attempt to recover, but if S>(N/k) and all +writers crash after writing a few shares, the file will be lost. + +Note that Tahoe uses serialization internally to make sure that a single +Tahoe node will not perform simultaneous modifications to a mutable file. It +accomplishes this by using a weakref cache of the MutableFileNode (so that +there will never be two distinct MutableFileNodes for the same file), and by +forcing all mutable file operations to obtain a per-node lock before they +run. The Prime Coordination Directive therefore applies to inter-node +conflicts, not intra-node ones. + + +== Small Distributed Mutable Files == + +SDMF slots are suitable for small (<1MB) files that are editing by rewriting +the entire file. The three operations are: + + * allocate (with initial contents) + * set (with new contents) + * get (old contents) + +The first use of SDMF slots will be to hold directories (dirnodes), which map +encrypted child names to rw-URI/ro-URI pairs. + +=== SDMF slots overview === + +Each SDMF slot is created with a public/private key pair. The public key is +known as the "verification key", while the private key is called the +"signature key". The private key is hashed and truncated to 16 bytes to form +the "write key" (an AES symmetric key). The write key is then hashed and +truncated to form the "read key". The read key is hashed and truncated to +form the 16-byte "storage index" (a unique string used as an index to locate +stored data). + +The public key is hashed by itself to form the "verification key hash". + +The write key is hashed a different way to form the "write enabler master". +For each storage server on which a share is kept, the write enabler master is +concatenated with the server's nodeid and hashed, and the result is called +the "write enabler" for that particular server. Note that multiple shares of +the same slot stored on the same server will all get the same write enabler, +i.e. the write enabler is associated with the "bucket", rather than the +individual shares. + +The private key is encrypted (using AES in counter mode) by the write key, +and the resulting crypttext is stored on the servers. so it will be +retrievable by anyone who knows the write key. The write key is not used to +encrypt anything else, and the private key never changes, so we do not need +an IV for this purpose. + +The actual data is encrypted (using AES in counter mode) with a key derived +by concatenating the readkey with the IV, the hashing the results and +truncating to 16 bytes. The IV is randomly generated each time the slot is +updated, and stored next to the encrypted data. + +The read-write URI consists of the write key and the verification key hash. +The read-only URI contains the read key and the verification key hash. The +verify-only URI contains the storage index and the verification key hash. + + URI:SSK-RW:b2a(writekey):b2a(verification_key_hash) + URI:SSK-RO:b2a(readkey):b2a(verification_key_hash) + URI:SSK-Verify:b2a(storage_index):b2a(verification_key_hash) + +Note that this allows the read-only and verify-only URIs to be derived from +the read-write URI without actually retrieving the public keys. Also note +that it means the read-write agent must validate both the private key and the +public key when they are first fetched. All users validate the public key in +exactly the same way. + +The SDMF slot is allocated by sending a request to the storage server with a +desired size, the storage index, and the write enabler for that server's +nodeid. If granted, the write enabler is stashed inside the slot's backing +store file. All further write requests must be accompanied by the write +enabler or they will not be honored. The storage server does not share the +write enabler with anyone else. + +The SDMF slot structure will be described in more detail below. The important +pieces are: + + * a sequence number + * a root hash "R" + * the encoding parameters (including k, N, file size, segment size) + * a signed copy of [seqnum,R,encoding_params], using the signature key + * the verification key (not encrypted) + * the share hash chain (part of a Merkle tree over the share hashes) + * the block hash tree (Merkle tree over blocks of share data) + * the share data itself (erasure-coding of read-key-encrypted file data) + * the signature key, encrypted with the write key + +The access pattern for read is: + * hash read-key to get storage index + * use storage index to locate 'k' shares with identical 'R' values + * either get one share, read 'k' from it, then read k-1 shares + * or read, say, 5 shares, discover k, either get more or be finished + * or copy k into the URIs + * read verification key + * hash verification key, compare against verification key hash + * read seqnum, R, encoding parameters, signature + * verify signature against verification key + * read share data, compute block-hash Merkle tree and root "r" + * read share hash chain (leading from "r" to "R") + * validate share hash chain up to the root "R" + * submit share data to erasure decoding + * decrypt decoded data with read-key + * submit plaintext to application + +The access pattern for write is: + * hash write-key to get read-key, hash read-key to get storage index + * use the storage index to locate at least one share + * read verification key and encrypted signature key + * decrypt signature key using write-key + * hash signature key, compare against write-key + * hash verification key, compare against verification key hash + * encrypt plaintext from application with read-key + * application can encrypt some data with the write-key to make it only + available to writers (use this for transitive read-onlyness of dirnodes) + * erasure-code crypttext to form shares + * split shares into blocks + * compute Merkle tree of blocks, giving root "r" for each share + * compute Merkle tree of shares, find root "R" for the file as a whole + * create share data structures, one per server: + * use seqnum which is one higher than the old version + * share hash chain has log(N) hashes, different for each server + * signed data is the same for each server + * now we have N shares and need homes for them + * walk through peers + * if share is not already present, allocate-and-set + * otherwise, try to modify existing share: + * send testv_and_writev operation to each one + * testv says to accept share if their(seqnum+R) <= our(seqnum+R) + * count how many servers wind up with which versions (histogram over R) + * keep going until N servers have the same version, or we run out of servers + * if any servers wound up with a different version, report error to + application + * if we ran out of servers, initiate recovery process (described below) + +=== Server Storage Protocol === + +The storage servers will provide a mutable slot container which is oblivious +to the details of the data being contained inside it. Each storage index +refers to a "bucket", and each bucket has one or more shares inside it. (In a +well-provisioned network, each bucket will have only one share). The bucket +is stored as a directory, using the base32-encoded storage index as the +directory name. Each share is stored in a single file, using the share number +as the filename. + +The container holds space for a container magic number (for versioning), the +write enabler, the nodeid which accepted the write enabler (used for share +migration, described below), a small number of lease structures, the embedded +data itself, and expansion space for additional lease structures. + + # offset size name + 1 0 32 magic verstr "tahoe mutable container v1" plus binary + 2 32 20 write enabler's nodeid + 3 52 32 write enabler + 4 84 8 data size (actual share data present) (a) + 5 92 8 offset of (8) count of extra leases (after data) + 6 100 368 four leases, 92 bytes each + 0 4 ownerid (0 means "no lease here") + 4 4 expiration timestamp + 8 32 renewal token + 40 32 cancel token + 72 20 nodeid which accepted the tokens + 7 468 (a) data + 8 ?? 4 count of extra leases + 9 ?? n*92 extra leases + +The "extra leases" field must be copied and rewritten each time the size of +the enclosed data changes. The hope is that most buckets will have four or +fewer leases and this extra copying will not usually be necessary. + +The (4) "data size" field contains the actual number of bytes of data present +in field (7), such that a client request to read beyond 504+(a) will result +in an error. This allows the client to (one day) read relative to the end of +the file. The container size (that is, (8)-(7)) might be larger, especially +if extra size was pre-allocated in anticipation of filling the container with +a lot of data. + +The offset in (5) points at the *count* of extra leases, at (8). The actual +leases (at (9)) begin 4 bytes later. If the container size changes, both (8) +and (9) must be relocated by copying. + +The server will honor any write commands that provide the write token and do +not exceed the server-wide storage size limitations. Read and write commands +MUST be restricted to the 'data' portion of the container: the implementation +of those commands MUST perform correct bounds-checking to make sure other +portions of the container are inaccessible to the clients. + +The two methods provided by the storage server on these "MutableSlot" share +objects are: + + * readv(ListOf(offset=int, length=int)) + * returns a list of bytestrings, of the various requested lengths + * offset < 0 is interpreted relative to the end of the data + * spans which hit the end of the data will return truncated data + + * testv_and_writev(write_enabler, test_vector, write_vector) + * this is a test-and-set operation which performs the given tests and only + applies the desired writes if all tests succeed. This is used to detect + simultaneous writers, and to reduce the chance that an update will lose + data recently written by some other party (written after the last time + this slot was read). + * test_vector=ListOf(TupleOf(offset, length, opcode, specimen)) + * the opcode is a string, from the set [gt, ge, eq, le, lt, ne] + * each element of the test vector is read from the slot's data and + compared against the specimen using the desired (in)equality. If all + tests evaluate True, the write is performed + * write_vector=ListOf(TupleOf(offset, newdata)) + * offset < 0 is not yet defined, it probably means relative to the + end of the data, which probably means append, but we haven't nailed + it down quite yet + * write vectors are executed in order, which specifies the results of + overlapping writes + * return value: + * error: OutOfSpace + * error: something else (io error, out of memory, whatever) + * (True, old_test_data): the write was accepted (test_vector passed) + * (False, old_test_data): the write was rejected (test_vector failed) + * both 'accepted' and 'rejected' return the old data that was used + for the test_vector comparison. This can be used by the client + to detect write collisions, including collisions for which the + desired behavior was to overwrite the old version. + +In addition, the storage server provides several methods to access these +share objects: + + * allocate_mutable_slot(storage_index, sharenums=SetOf(int)) + * returns DictOf(int, MutableSlot) + * get_mutable_slot(storage_index) + * returns DictOf(int, MutableSlot) + * or raises KeyError + +We intend to add an interface which allows small slots to allocate-and-write +in a single call, as well as do update or read in a single call. The goal is +to allow a reasonably-sized dirnode to be created (or updated, or read) in +just one round trip (to all N shareholders in parallel). + +==== migrating shares ==== + +If a share must be migrated from one server to another, two values become +invalid: the write enabler (since it was computed for the old server), and +the lease renew/cancel tokens. + +Suppose that a slot was first created on nodeA, and was thus initialized with +WE(nodeA) (= H(WEM+nodeA)). Later, for provisioning reasons, the share is +moved from nodeA to nodeB. + +Readers may still be able to find the share in its new home, depending upon +how many servers are present in the grid, where the new nodeid lands in the +permuted index for this particular storage index, and how many servers the +reading client is willing to contact. + +When a client attempts to write to this migrated share, it will get a "bad +write enabler" error, since the WE it computes for nodeB will not match the +WE(nodeA) that was embedded in the share. When this occurs, the "bad write +enabler" message must include the old nodeid (e.g. nodeA) that was in the +share. + +The client then computes H(nodeB+H(WEM+nodeA)), which is the same as +H(nodeB+WE(nodeA)). The client sends this along with the new WE(nodeB), which +is H(WEM+nodeB). Note that the client only sends WE(nodeB) to nodeB, never to +anyone else. Also note that the client does not send a value to nodeB that +would allow the node to impersonate the client to a third node: everything +sent to nodeB will include something specific to nodeB in it. + +The server locally computes H(nodeB+WE(nodeA)), using its own node id and the +old write enabler from the share. It compares this against the value supplied +by the client. If they match, this serves as proof that the client was able +to compute the old write enabler. The server then accepts the client's new +WE(nodeB) and writes it into the container. + +This WE-fixup process requires an extra round trip, and requires the error +message to include the old nodeid, but does not require any public key +operations on either client or server. + +Migrating the leases will require a similar protocol. This protocol will be +defined concretely at a later date. + +=== Code Details === + +The MutableFileNode class is used to manipulate mutable files (as opposed to +ImmutableFileNodes). These are initially generated with +client.create_mutable_file(), and later recreated from URIs with +client.create_node_from_uri(). Instances of this class will contain a URI and +a reference to the client (for peer selection and connection). + +NOTE: this section is out of date. Please see src/allmydata/interfaces.py +(the section on IMutableFilesystemNode) for more accurate information. + +The methods of MutableFileNode are: + + * download_to_data() -> [deferred] newdata, NotEnoughSharesError + * if there are multiple retrieveable versions in the grid, get() returns + the first version it can reconstruct, and silently ignores the others. + In the future, a more advanced API will signal and provide access to + the multiple heads. + * update(newdata) -> OK, UncoordinatedWriteError, NotEnoughSharesError + * overwrite(newdata) -> OK, UncoordinatedWriteError, NotEnoughSharesError + +download_to_data() causes a new retrieval to occur, pulling the current +contents from the grid and returning them to the caller. At the same time, +this call caches information about the current version of the file. This +information will be used in a subsequent call to update(), and if another +change has occured between the two, this information will be out of date, +triggering the UncoordinatedWriteError. + +update() is therefore intended to be used just after a download_to_data(), in +the following pattern: + + d = mfn.download_to_data() + d.addCallback(apply_delta) + d.addCallback(mfn.update) + +If the update() call raises UCW, then the application can simply return an +error to the user ("you violated the Prime Coordination Directive"), and they +can try again later. Alternatively, the application can attempt to retry on +its own. To accomplish this, the app needs to pause, download the new +(post-collision and post-recovery) form of the file, reapply their delta, +then submit the update request again. A randomized pause is necessary to +reduce the chances of colliding a second time with another client that is +doing exactly the same thing: + + d = mfn.download_to_data() + d.addCallback(apply_delta) + d.addCallback(mfn.update) + def _retry(f): + f.trap(UncoordinatedWriteError) + d1 = pause(random.uniform(5, 20)) + d1.addCallback(lambda res: mfn.download_to_data()) + d1.addCallback(apply_delta) + d1.addCallback(mfn.update) + return d1 + d.addErrback(_retry) + +Enthusiastic applications can retry multiple times, using a randomized +exponential backoff between each. A particularly enthusiastic application can +retry forever, but such apps are encouraged to provide a means to the user of +giving up after a while. + +UCW does not mean that the update was not applied, so it is also a good idea +to skip the retry-update step if the delta was already applied: + + d = mfn.download_to_data() + d.addCallback(apply_delta) + d.addCallback(mfn.update) + def _retry(f): + f.trap(UncoordinatedWriteError) + d1 = pause(random.uniform(5, 20)) + d1.addCallback(lambda res: mfn.download_to_data()) + def _maybe_apply_delta(contents): + new_contents = apply_delta(contents) + if new_contents != contents: + return mfn.update(new_contents) + d1.addCallback(_maybe_apply_delta) + return d1 + d.addErrback(_retry) + +update() is the right interface to use for delta-application situations, like +directory nodes (in which apply_delta might be adding or removing child +entries from a serialized table). + +Note that any uncoordinated write has the potential to lose data. We must do +more analysis to be sure, but it appears that two clients who write to the +same mutable file at the same time (even if both eventually retry) will, with +high probability, result in one client observing UCW and the other silently +losing their changes. It is also possible for both clients to observe UCW. +The moral of the story is that the Prime Coordination Directive is there for +a reason, and that recovery/UCW/retry is not a subsitute for write +coordination. + +overwrite() tells the client to ignore this cached version information, and +to unconditionally replace the mutable file's contents with the new data. +This should not be used in delta application, but rather in situations where +you want to replace the file's contents with completely unrelated ones. When +raw files are uploaded into a mutable slot through the tahoe webapi (using +POST and the ?mutable=true argument), they are put in place with overwrite(). + + + +The peer-selection and data-structure manipulation (and signing/verification) +steps will be implemented in a separate class in allmydata/mutable.py . + +=== SMDF Slot Format === + +This SMDF data lives inside a server-side MutableSlot container. The server +is oblivious to this format. + +This data is tightly packed. In particular, the share data is defined to run +all the way to the beginning of the encrypted private key (the encprivkey +offset is used both to terminate the share data and to begin the encprivkey). + + # offset size name + 1 0 1 version byte, \x00 for this format + 2 1 8 sequence number. 2^64-1 must be handled specially, TBD + 3 9 32 "R" (root of share hash Merkle tree) + 4 41 16 IV (share data is AES(H(readkey+IV)) ) + 5 57 18 encoding parameters: + 57 1 k + 58 1 N + 59 8 segment size + 67 8 data length (of original plaintext) + 6 75 32 offset table: + 75 4 (8) signature + 79 4 (9) share hash chain + 83 4 (10) block hash tree + 87 4 (11) share data + 91 8 (12) encrypted private key + 99 8 (13) EOF + 7 107 436ish verification key (2048 RSA key) + 8 543ish 256ish signature=RSAenc(sigkey, H(version+seqnum+r+IV+encparm)) + 9 799ish (a) share hash chain, encoded as: + "".join([pack(">H32s", shnum, hash) + for (shnum,hash) in needed_hashes]) +10 (927ish) (b) block hash tree, encoded as: + "".join([pack(">32s",hash) for hash in block_hash_tree]) +11 (935ish) LEN share data (no gap between this and encprivkey) +12 ?? 1216ish encrypted private key= AESenc(write-key, RSA-key) +13 ?? -- EOF + +(a) The share hash chain contains ceil(log(N)) hashes, each 32 bytes long. + This is the set of hashes necessary to validate this share's leaf in the + share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes. +(b) The block hash tree contains ceil(length/segsize) hashes, each 32 bytes + long. This is the set of hashes necessary to validate any given block of + share data up to the per-share root "r". Each "r" is a leaf of the share + has tree (with root "R"), from which a minimal subset of hashes is put in + the share hash chain in (8). + +=== Recovery === + +The first line of defense against damage caused by colliding writes is the +Prime Coordination Directive: "Don't Do That". + +The second line of defense is to keep "S" (the number of competing versions) +lower than N/k. If this holds true, at least one competing version will have +k shares and thus be recoverable. Note that server unavailability counts +against us here: the old version stored on the unavailable server must be +included in the value of S. + +The third line of defense is our use of testv_and_writev() (described below), +which increases the convergence of simultaneous writes: one of the writers +will be favored (the one with the highest "R"), and that version is more +likely to be accepted than the others. This defense is least effective in the +pathological situation where S simultaneous writers are active, the one with +the lowest "R" writes to N-k+1 of the shares and then dies, then the one with +the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the +one with the highest "R" writes to k-1 shares and dies. Any other sequencing +will allow the highest "R" to write to at least k shares and establish a new +revision. + +The fourth line of defense is the fact that each client keeps writing until +at least one version has N shares. This uses additional servers, if +necessary, to make sure that either the client's version or some +newer/overriding version is highly available. + +The fifth line of defense is the recovery algorithm, which seeks to make sure +that at least *one* version is highly available, even if that version is +somebody else's. + +The write-shares-to-peers algorithm is as follows: + + * permute peers according to storage index + * walk through peers, trying to assign one share per peer + * for each peer: + * send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test + * this means that we will overwrite any old versions, and we will + overwrite simultaenous writers of the same version if our R is higher. + We will not overwrite writers using a higher seqnum. + * record the version that each share winds up with. If the write was + accepted, this is our own version. If it was rejected, read the + old_test_data to find out what version was retained. + * if old_test_data indicates the seqnum was equal or greater than our + own, mark the "Simultanous Writes Detected" flag, which will eventually + result in an error being reported to the writer (in their close() call). + * build a histogram of "R" values + * repeat until the histogram indicate that some version (possibly ours) + has N shares. Use new servers if necessary. + * If we run out of servers: + * if there are at least shares-of-happiness of any one version, we're + happy, so return. (the close() might still get an error) + * not happy, need to reinforce something, goto RECOVERY + +RECOVERY: + * read all shares, count the versions, identify the recoverable ones, + discard the unrecoverable ones. + * sort versions: locate max(seqnums), put all versions with that seqnum + in the list, sort by number of outstanding shares. Then put our own + version. (TODO: put versions with seqnum us ahead of us?). + * for each version: + * attempt to recover that version + * if not possible, remove it from the list, go to next one + * if recovered, start at beginning of peer list, push that version, + continue until N shares are placed + * if pushing our own version, bump up the seqnum to one higher than + the max seqnum we saw + * if we run out of servers: + * schedule retry and exponential backoff to repeat RECOVERY + * admit defeat after some period? presumeably the client will be shut down + eventually, maybe keep trying (once per hour?) until then. + + + + +== Medium Distributed Mutable Files == + +These are just like the SDMF case, but: + + * we actually take advantage of the Merkle hash tree over the blocks, by + reading a single segment of data at a time (and its necessary hashes), to + reduce the read-time alacrity + * we allow arbitrary writes to the file (i.e. seek() is provided, and + O_TRUNC is no longer required) + * we write more code on the client side (in the MutableFileNode class), to + first read each segment that a write must modify. This looks exactly like + the way a normal filesystem uses a block device, or how a CPU must perform + a cache-line fill before modifying a single word. + * we might implement some sort of copy-based atomic update server call, + to allow multiple writev() calls to appear atomic to any readers. + +MDMF slots provide fairly efficient in-place edits of very large files (a few +GB). Appending data is also fairly efficient, although each time a power of 2 +boundary is crossed, the entire file must effectively be re-uploaded (because +the size of the block hash tree changes), so if the filesize is known in +advance, that space ought to be pre-allocated (by leaving extra space between +the block hash tree and the actual data). + +MDMF1 uses the Merkle tree to enable low-alacrity random-access reads. MDMF2 +adds cache-line reads to allow random-access writes. + +== Large Distributed Mutable Files == + +LDMF slots use a fundamentally different way to store the file, inspired by +Mercurial's "revlog" format. They enable very efficient insert/remove/replace +editing of arbitrary spans. Multiple versions of the file can be retained, in +a revision graph that can have multiple heads. Each revision can be +referenced by a cryptographic identifier. There are two forms of the URI, one +that means "most recent version", and a longer one that points to a specific +revision. + +Metadata can be attached to the revisions, like timestamps, to enable rolling +back an entire tree to a specific point in history. + +LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2 +provides explicit support for revision identifiers and branching. + +== TODO == + +improve allocate-and-write or get-writer-buckets API to allow one-call (or +maybe two-call) updates. The challenge is in figuring out which shares are on +which machines. First cut will have lots of round trips. + +(eventually) define behavior when seqnum wraps. At the very least make sure +it can't cause a security problem. "the slot is worn out" is acceptable. + +(eventually) define share-migration lease update protocol. Including the +nodeid who accepted the lease is useful, we can use the same protocol as we +do for updating the write enabler. However we need to know which lease to +update.. maybe send back a list of all old nodeids that we find, then try all +of them when we accept the update? + + We now do this in a specially-formatted IndexError exception: + "UNABLE to renew non-existent lease. I have leases accepted by " + + "nodeids: '12345','abcde','44221' ." + +confirm that a repairer can regenerate shares without the private key. Hmm, +without the write-enabler they won't be able to write those shares to the +servers.. although they could add immutable new shares to new servers. diff --git a/docs/specifications/uri.txt b/docs/specifications/uri.txt new file mode 100644 index 00000000..5599fa19 --- /dev/null +++ b/docs/specifications/uri.txt @@ -0,0 +1,187 @@ + += Tahoe URIs = + +Each file and directory in a Tahoe filesystem is described by a "URI". There +are different kinds of URIs for different kinds of objects, and there are +different kinds of URIs to provide different kinds of access to those +objects. Each URI is a string representation of a "capability" or "cap", and +there are read-caps, write-caps, verify-caps, and others. + +Each URI provides both '''location''' and '''identification''' properties. +'''location''' means that holding the URI is sufficient to locate the data it +represents (this means it contains a storage index or a lookup key, whatever +is necessary to find the place or places where the data is being kept). +'''identification''' means that the URI also serves to validate the data: an +attacker who wants to trick you into into using the wrong data will be +limited in their abilities by the identification properties of the URI. + +Some URIs are subsets of others. In particular, if you know a URI which +allows you to modify some object, you can produce a weaker read-only URI and +give it to someone else, and they will be able to read that object but not +modify it. Directories, for example, have a read-cap which is derived from +the write-cap: anyone with read/write access to the directory can produce a +limited URI that grants read-only access, but not the other way around. + +source:src/allmydata/uri.py is the main place where URIs are processed. It is +the authoritative definition point for all the the URI types described +herein. + +== File URIs == + +The lowest layer of the Tahoe architecture (the "grid") is reponsible for +mapping URIs to data. This is basically a distributed hash table, in which +the URI is the key, and some sequence of bytes is the value. + +There are two kinds of entries in this table: immutable and mutable. For +immutable entries, the URI represents a fixed chunk of data. The URI itself +is derived from the data when it is uploaded into the grid, and can be used +to locate and download that data from the grid at some time in the future. + +For mutable entries, the URI identifies a "slot" or "container", which can be +filled with different pieces of data at different times. + +It is important to note that the "files" described by these URIs are just a +bunch of bytes, and that __no__ filenames or other metadata is retained at +this layer. The vdrive layer (which sits above the grid layer) is entirely +responsible for directories and filenames and the like. + +=== CHI URIs === + +CHK (Content Hash Keyed) files are immutable sequences of bytes. They are +uploaded in a distributed fashion using a "storage index" (for the "location" +property), and encrypted using a "read key". A secure hash of the data is +computed to help validate the data afterwards (providing the "identification" +property). All of these pieces, plus information about the file's size and +the number of shares into which it has been distributed, are put into the +"CHK" uri. The storage index is derived by hashing the read key (using a +tagged SHA-256d hash, then truncated to 128 bits), so it does not need to be +physically present in the URI. + +The current format for CHK URIs is the concatenation of the following +strings: + + URI:CHK:(key):(hash):(needed-shares):(total-shares):(size) + +Where (key) is the base32 encoding of the 16-byte AES read key, (hash) is the +base32 encoding of the SHA-256 hash of the URI Extension Block, +(needed-shares) is an ascii decimal representation of the number of shares +required to reconstruct this file, (total-shares) is the same representation +of the total number of shares created, and (size) is an ascii decimal +representation of the size of the data represented by this URI. All base32 +encodings are expressed in lower-case, with the trailing '=' signs removed. + +For example, the following is a CHK URI, generated from the contents of the +architecture.txt document that lives next to this one in the source tree: + +URI:CHK:ihrbeov7lbvoduupd4qblysj7a:bg5agsdt62jb34hxvxmdsbza6do64f4fg5anxxod2buttbo6udzq:3:10:28733 + +Historical note: The name "CHK" is somewhat inaccurate and continues to be +used for historical reasons. "Content Hash Key" means that the encryption key +is derived by hashing the contents, which gives the useful property that +encoding the same file twice will result in the same URI. However, this is an +optional step: by passing a different flag to the appropriate API call, Tahoe +will generate a random encryption key instead of hashing the file: this gives +the useful property that the URI or storage index does not reveal anything +about the file's contents (except filesize), which improves privacy. The +URI:CHK: prefix really indicates that an immutable file is in use, without +saying anything about how the key was derived. + +=== LIT URIs === + +LITeral files are also an immutable sequence of bytes, but they are so short +that the data is stored inside the URI itself. These are used for files of 55 +bytes or shorter, which is the point at which the LIT URI is the same length +as a CHK URI would be. + +LIT URIs do not require an upload or download phase, as their data is stored +directly in the URI. + +The format of a LIT URI is simply a fixed prefix concatenated with the base32 +encoding of the file's data: + + URI:LIT:bjuw4y3movsgkidbnrwg26lemf2gcl3xmvrc6kropbuhi3lmbi + +The LIT URI for an empty file is "URI:LIT:", and the LIT URI for a 5-byte +file that contains the string "hello" is "URI:LIT:nbswy3dp". + +=== Mutable File URIs === + +The other kind of DHT entry is the "mutable slot", in which the URI names a +container to which data can be placed and retrieved without changing the +identity of the container. + +These slots have write-caps (which allow read/write access), read-caps (which +only allow read-access), and verify-caps (which allow a file checker/repairer +to confirm that the contents exist, but does not let it decrypt the +contents). + +Mutable slots use public key technology to provide data integrity, and put a +hash of the public key in the URI. As a result, the data validation is +limited to confirming that the data retrieved matches _some_ data that was +uploaded in the past, but not _which_ version of that data. + +The format of the write-cap for mutable files is: + + URI:SSK:(writekey):(fingerprint) + +Where (writekey) is the base32 encoding of the 16-byte AES encryption key +that is used to encrypt the RSA private key, and (fingerprint) is the base32 +encoded 32-byte SHA-256 hash of the RSA public key. For more details about +the way these keys are used, please see docs/mutable.txt . + +The format for mutable read-caps is: + + URI:SSK-RO:(readkey):(fingerprint) + +The read-cap is just like the write-cap except it contains the other AES +encryption key: the one used for encrypting the mutable file's contents. This +second key is derived by hashing the writekey, which allows the holder of a +write-cap to produce a read-cap, but not the other way around. The +fingerprint is the same in both caps. + +Historical note: the "SSK" prefix is a perhaps-inaccurate reference to +"Sub-Space Keys" from the Freenet project, which uses a vaguely similar +structure to provide mutable file access. + +== Directory URIs == + +The grid layer provides a mapping from URI to data. To turn this into a graph +of directories and files, the "vdrive" layer (which sits on top of the grid +layer) needs to keep track of "directory nodes", or "dirnodes" for short. +source:docs/dirnodes.txt describes how these work. + +Dirnodes are contained inside mutable files, and are thus simply a particular +way to interpret the contents of these files. As a result, a directory +write-cap looks a lot like a mutable-file write-cap: + + URI:DIR2:(writekey):(fingerprint) + +Likewise directory read-caps (which provide read-only access to the +directory) look much like mutable-file read-caps: + + URI:DIR2-RO:(readkey):(fingerprint) + +Historical note: the "DIR2" prefix is used because the non-distributed +dirnodes in earlier Tahoe releases had already claimed the "DIR" prefix. + +== Internal Usage of URIs == + +The classes in source:src/allmydata/uri.py are used to pack and unpack these +various kinds of URIs. Three Interfaces are defined (IURI, IFileURI, and +IDirnodeURI) which are implemented by these classes, and string-to-URI-class +conversion routines have been registered as adapters, so that code which +wants to extract e.g. the size of a CHK or LIT uri can do: + +{{{ +print IFileURI(uri).get_size() +}}} + +If the URI does not represent a CHK or LIT uri (for example, if it was for a +directory instead), the adaptation will fail, raising a TypeError inside the +IFileURI() call. + +Several utility methods are provided on these objects. The most important is +{{{ to_string() }}}, which returns the string form of the URI. Therefore {{{ +IURI(uri).to_string == uri }}} is true for any valid URI. See the IURI class +in source:src/allmydata/interfaces.py for more details. + diff --git a/docs/uri.txt b/docs/uri.txt deleted file mode 100644 index 5599fa19..00000000 --- a/docs/uri.txt +++ /dev/null @@ -1,187 +0,0 @@ - -= Tahoe URIs = - -Each file and directory in a Tahoe filesystem is described by a "URI". There -are different kinds of URIs for different kinds of objects, and there are -different kinds of URIs to provide different kinds of access to those -objects. Each URI is a string representation of a "capability" or "cap", and -there are read-caps, write-caps, verify-caps, and others. - -Each URI provides both '''location''' and '''identification''' properties. -'''location''' means that holding the URI is sufficient to locate the data it -represents (this means it contains a storage index or a lookup key, whatever -is necessary to find the place or places where the data is being kept). -'''identification''' means that the URI also serves to validate the data: an -attacker who wants to trick you into into using the wrong data will be -limited in their abilities by the identification properties of the URI. - -Some URIs are subsets of others. In particular, if you know a URI which -allows you to modify some object, you can produce a weaker read-only URI and -give it to someone else, and they will be able to read that object but not -modify it. Directories, for example, have a read-cap which is derived from -the write-cap: anyone with read/write access to the directory can produce a -limited URI that grants read-only access, but not the other way around. - -source:src/allmydata/uri.py is the main place where URIs are processed. It is -the authoritative definition point for all the the URI types described -herein. - -== File URIs == - -The lowest layer of the Tahoe architecture (the "grid") is reponsible for -mapping URIs to data. This is basically a distributed hash table, in which -the URI is the key, and some sequence of bytes is the value. - -There are two kinds of entries in this table: immutable and mutable. For -immutable entries, the URI represents a fixed chunk of data. The URI itself -is derived from the data when it is uploaded into the grid, and can be used -to locate and download that data from the grid at some time in the future. - -For mutable entries, the URI identifies a "slot" or "container", which can be -filled with different pieces of data at different times. - -It is important to note that the "files" described by these URIs are just a -bunch of bytes, and that __no__ filenames or other metadata is retained at -this layer. The vdrive layer (which sits above the grid layer) is entirely -responsible for directories and filenames and the like. - -=== CHI URIs === - -CHK (Content Hash Keyed) files are immutable sequences of bytes. They are -uploaded in a distributed fashion using a "storage index" (for the "location" -property), and encrypted using a "read key". A secure hash of the data is -computed to help validate the data afterwards (providing the "identification" -property). All of these pieces, plus information about the file's size and -the number of shares into which it has been distributed, are put into the -"CHK" uri. The storage index is derived by hashing the read key (using a -tagged SHA-256d hash, then truncated to 128 bits), so it does not need to be -physically present in the URI. - -The current format for CHK URIs is the concatenation of the following -strings: - - URI:CHK:(key):(hash):(needed-shares):(total-shares):(size) - -Where (key) is the base32 encoding of the 16-byte AES read key, (hash) is the -base32 encoding of the SHA-256 hash of the URI Extension Block, -(needed-shares) is an ascii decimal representation of the number of shares -required to reconstruct this file, (total-shares) is the same representation -of the total number of shares created, and (size) is an ascii decimal -representation of the size of the data represented by this URI. All base32 -encodings are expressed in lower-case, with the trailing '=' signs removed. - -For example, the following is a CHK URI, generated from the contents of the -architecture.txt document that lives next to this one in the source tree: - -URI:CHK:ihrbeov7lbvoduupd4qblysj7a:bg5agsdt62jb34hxvxmdsbza6do64f4fg5anxxod2buttbo6udzq:3:10:28733 - -Historical note: The name "CHK" is somewhat inaccurate and continues to be -used for historical reasons. "Content Hash Key" means that the encryption key -is derived by hashing the contents, which gives the useful property that -encoding the same file twice will result in the same URI. However, this is an -optional step: by passing a different flag to the appropriate API call, Tahoe -will generate a random encryption key instead of hashing the file: this gives -the useful property that the URI or storage index does not reveal anything -about the file's contents (except filesize), which improves privacy. The -URI:CHK: prefix really indicates that an immutable file is in use, without -saying anything about how the key was derived. - -=== LIT URIs === - -LITeral files are also an immutable sequence of bytes, but they are so short -that the data is stored inside the URI itself. These are used for files of 55 -bytes or shorter, which is the point at which the LIT URI is the same length -as a CHK URI would be. - -LIT URIs do not require an upload or download phase, as their data is stored -directly in the URI. - -The format of a LIT URI is simply a fixed prefix concatenated with the base32 -encoding of the file's data: - - URI:LIT:bjuw4y3movsgkidbnrwg26lemf2gcl3xmvrc6kropbuhi3lmbi - -The LIT URI for an empty file is "URI:LIT:", and the LIT URI for a 5-byte -file that contains the string "hello" is "URI:LIT:nbswy3dp". - -=== Mutable File URIs === - -The other kind of DHT entry is the "mutable slot", in which the URI names a -container to which data can be placed and retrieved without changing the -identity of the container. - -These slots have write-caps (which allow read/write access), read-caps (which -only allow read-access), and verify-caps (which allow a file checker/repairer -to confirm that the contents exist, but does not let it decrypt the -contents). - -Mutable slots use public key technology to provide data integrity, and put a -hash of the public key in the URI. As a result, the data validation is -limited to confirming that the data retrieved matches _some_ data that was -uploaded in the past, but not _which_ version of that data. - -The format of the write-cap for mutable files is: - - URI:SSK:(writekey):(fingerprint) - -Where (writekey) is the base32 encoding of the 16-byte AES encryption key -that is used to encrypt the RSA private key, and (fingerprint) is the base32 -encoded 32-byte SHA-256 hash of the RSA public key. For more details about -the way these keys are used, please see docs/mutable.txt . - -The format for mutable read-caps is: - - URI:SSK-RO:(readkey):(fingerprint) - -The read-cap is just like the write-cap except it contains the other AES -encryption key: the one used for encrypting the mutable file's contents. This -second key is derived by hashing the writekey, which allows the holder of a -write-cap to produce a read-cap, but not the other way around. The -fingerprint is the same in both caps. - -Historical note: the "SSK" prefix is a perhaps-inaccurate reference to -"Sub-Space Keys" from the Freenet project, which uses a vaguely similar -structure to provide mutable file access. - -== Directory URIs == - -The grid layer provides a mapping from URI to data. To turn this into a graph -of directories and files, the "vdrive" layer (which sits on top of the grid -layer) needs to keep track of "directory nodes", or "dirnodes" for short. -source:docs/dirnodes.txt describes how these work. - -Dirnodes are contained inside mutable files, and are thus simply a particular -way to interpret the contents of these files. As a result, a directory -write-cap looks a lot like a mutable-file write-cap: - - URI:DIR2:(writekey):(fingerprint) - -Likewise directory read-caps (which provide read-only access to the -directory) look much like mutable-file read-caps: - - URI:DIR2-RO:(readkey):(fingerprint) - -Historical note: the "DIR2" prefix is used because the non-distributed -dirnodes in earlier Tahoe releases had already claimed the "DIR" prefix. - -== Internal Usage of URIs == - -The classes in source:src/allmydata/uri.py are used to pack and unpack these -various kinds of URIs. Three Interfaces are defined (IURI, IFileURI, and -IDirnodeURI) which are implemented by these classes, and string-to-URI-class -conversion routines have been registered as adapters, so that code which -wants to extract e.g. the size of a CHK or LIT uri can do: - -{{{ -print IFileURI(uri).get_size() -}}} - -If the URI does not represent a CHK or LIT uri (for example, if it was for a -directory instead), the adaptation will fail, raising a TypeError inside the -IFileURI() call. - -Several utility methods are provided on these objects. The most important is -{{{ to_string() }}}, which returns the string form of the URI. Therefore {{{ -IURI(uri).to_string == uri }}} is true for any valid URI. See the IURI class -in source:src/allmydata/interfaces.py for more details. -