From: david-sarah Date: Sun, 12 Dec 2010 01:02:51 +0000 (-0800) Subject: Move .txt files in docs/frontends and docs/specifications to .rst. refs #1225 X-Git-Url: https://git.rkrishnan.org/?a=commitdiff_plain;h=1d5c705201fdcf0eaf490290eb6abd01fa1f68dc;p=tahoe-lafs%2Ftahoe-lafs.git Move .txt files in docs/frontends and docs/specifications to .rst. refs #1225 --- diff --git a/docs/frontends/CLI.rst b/docs/frontends/CLI.rst new file mode 100644 index 00000000..743b8871 --- /dev/null +++ b/docs/frontends/CLI.rst @@ -0,0 +1,548 @@ +====================== +The Tahoe CLI commands +====================== + +1. `Overview`_ +2. `CLI Command Overview`_ +3. `Node Management`_ +4. `Filesystem Manipulation`_ + + 1. `Starting Directories`_ + 2. `Command Syntax Summary`_ + 3. `Command Examples`_ + +5. `Storage Grid Maintenance`_ +6. `Debugging`_ + + +Overview +======== + +Tahoe provides a single executable named "``tahoe``", which can be used to +create and manage client/server nodes, manipulate the filesystem, and perform +several debugging/maintenance tasks. + +This executable lives in the source tree at "``bin/tahoe``". Once you've done a +build (by running "make"), ``bin/tahoe`` can be run in-place: if it discovers +that it is being run from within a Tahoe source tree, it will modify sys.path +as necessary to use all the source code and dependent libraries contained in +that tree. + +If you've installed Tahoe (using "``make install``", or by installing a binary +package), then the tahoe executable will be available somewhere else, perhaps +in ``/usr/bin/tahoe``. In this case, it will use your platform's normal +PYTHONPATH search paths to find the tahoe code and other libraries. + + +CLI Command Overview +==================== + +The "``tahoe``" tool provides access to three categories of commands. 
+ +* node management: create a client/server node, start/stop/restart it +* filesystem manipulation: list files, upload, download, delete, rename +* debugging: unpack cap-strings, examine share files + +To get a list of all commands, just run "``tahoe``" with no additional +arguments. "``tahoe --help``" might also provide something useful. + +Running "``tahoe --version``" will display a list of version strings, starting +with the "allmydata" module (which contains the majority of the Tahoe +functionality) and including versions for a number of dependent libraries, +like Twisted, Foolscap, pycryptopp, and zfec. + + +Node Management +=============== + +"``tahoe create-node [NODEDIR]``" is the basic make-a-new-node command. It +creates a new directory and populates it with files that will allow the +"``tahoe start``" command to use it later on. This command creates nodes that +have client functionality (upload/download files), web API services +(controlled by the 'webport' file), and storage services (unless +"--no-storage" is specified). + +NODEDIR defaults to ~/.tahoe/ , and newly-created nodes default to +publishing a web server on port 3456 (limited to the loopback interface, at +127.0.0.1, to restrict access to other programs on the same host). All of the +other "``tahoe``" subcommands use corresponding defaults. + +"``tahoe create-client [NODEDIR]``" creates a node with no storage service. +That is, it behaves like "``tahoe create-node --no-storage [NODEDIR]``". +(This is a change from versions prior to 1.6.0.) + +"``tahoe create-introducer [NODEDIR]``" is used to create the Introducer node. +This node provides introduction services and nothing else. When started, this +node will produce an introducer.furl, which should be published to all +clients. + +"``tahoe create-key-generator [NODEDIR]``" is used to create a special +"key-generation" service, which allows a client to offload their RSA key +generation to a separate process. 
Since RSA key generation takes several +seconds, and must be done each time a directory is created, moving it to a +separate process allows the first process (perhaps a busy wapi server) to +continue servicing other requests. The key generator exports a FURL that can +be copied into a node to enable this functionality. + +"``tahoe run [NODEDIR]``" will start a previously-created node in the foreground. + +"``tahoe start [NODEDIR]``" will launch a previously-created node. It will launch +the node into the background, using the standard Twisted "twistd" +daemon-launching tool. On some platforms (including Windows) this command is +unable to run a daemon in the background; in that case it behaves in the +same way as "``tahoe run``". + +"``tahoe stop [NODEDIR]``" will shut down a running node. + +"``tahoe restart [NODEDIR]``" will stop and then restart a running node. This is +most often used by developers who have just modified the code and want to +start using their changes. + + +Filesystem Manipulation +======================= + +These commands let you examine a Tahoe filesystem, providing basic +list/upload/download/delete/rename/mkdir functionality. They can be used as +primitives by other scripts. Most of these commands are fairly thin wrappers +around wapi calls. + +By default, all filesystem-manipulation commands look in ~/.tahoe/ to figure +out which Tahoe node they should use. When the CLI command uses wapi calls, +it will use ~/.tahoe/node.url for this purpose: a running Tahoe node that +provides a wapi port will write its URL into this file. If you want to use +a node on some other host, just create ~/.tahoe/ and copy that node's wapi +URL into this file, and the CLI commands will contact that node instead of a +local one. + +These commands also use a table of "aliases" to figure out which directory +they ought to use as a starting point. This is explained in more detail below.
+ +As of Tahoe v1.7, passing non-ASCII characters to the CLI should work, +except on Windows. The command-line arguments are assumed to use the +character encoding specified by the current locale. + +Starting Directories +-------------------- + +As described in architecture.txt, the Tahoe distributed filesystem consists +of a collection of directories and files, each of which has a "read-cap" or a +"write-cap" (also known as a URI). Each directory is simply a table that maps +a name to a child file or directory, and this table is turned into a string +and stored in a mutable file. The whole set of directory and file "nodes" are +connected together into a directed graph. + +To use this collection of files and directories, you need to choose a +starting point: some specific directory that we will refer to as a +"starting directory". For a given starting directory, the "``ls +[STARTING_DIR]:``" command would list the contents of this directory, +the "``ls [STARTING_DIR]:dir1``" command would look inside this directory +for a child named "dir1" and list its contents, "``ls +[STARTING_DIR]:dir1/subdir2``" would look two levels deep, etc. + +Note that there is no real global "root" directory, but instead each +starting directory provides a different, possibly overlapping +perspective on the graph of files and directories. + +Each tahoe node remembers a list of starting points, named "aliases", +in a file named ~/.tahoe/private/aliases . These aliases are short UTF-8 +encoded strings that stand in for a directory read- or write- cap. If +you use the command line "``ls``" without any "[STARTING_DIR]:" argument, +then it will use the default alias, which is "tahoe", therefore "``tahoe +ls``" has the same effect as "``tahoe ls tahoe:``". The same goes for the +other commands which can reasonably use a default alias: get, put, +mkdir, mv, and rm. 
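The default-alias rule described above can be sketched in a few lines of Python. This is only an illustration: ``split_alias`` is a hypothetical helper, not part of the Tahoe code, and it deliberately ignores raw cap strings and local-file paths.

```python
DEFAULT_ALIAS = "tahoe"  # the default alias named in the text above


def split_alias(arg, default=DEFAULT_ALIAS):
    """Split an '[alias:]path' argument into (alias, path).

    Hypothetical helper for illustration only. An argument with no
    'alias:' prefix falls back to the default alias.
    """
    if ":" in arg:
        # Split at the first colon only; the remainder is the path.
        alias, path = arg.split(":", 1)
        return alias, path
    return default, arg


print(split_alias("tahoe:dir1/subdir2"))
print(split_alias("subdir"))
```

Under this sketch, "subdir" and "tahoe:subdir" resolve to the same pair, which mirrors the statement that "``tahoe ls``" behaves like "``tahoe ls tahoe:``".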
+ +For backwards compatibility with Tahoe-1.0, if the "tahoe:" alias is not +found in ~/.tahoe/private/aliases, the CLI will use the contents of +~/.tahoe/private/root_dir.cap instead. Tahoe-1.0 had only a single starting +point, and stored it in this root_dir.cap file, so Tahoe-1.1 will use it if +necessary. However, once you've set a "tahoe:" alias with "``tahoe add-alias``", +that will override anything in the old root_dir.cap file. + +The Tahoe CLI commands use the same filename syntax as scp and rsync +-- an optional "alias:" prefix, followed by the pathname or filename. +Some commands (like "tahoe cp") use the lack of an alias to mean that +you want to refer to a local file, instead of something from the tahoe +virtual filesystem. [TODO] Another way to indicate this is to start +the pathname with a dot, slash, or tilde. + +When you're dealing with a single starting directory, the "tahoe:" alias is +all you need. But when you want to refer to something that isn't yet +attached to the graph rooted at that starting directory, you need to +refer to it by its capability. The way to do that is either to use its +capability directly as an argument on the command line, or to add an +alias to it, with the "tahoe add-alias" command. Once you've added an +alias, you can use that alias as an argument to commands. + +The best way to get started with Tahoe is to create a node, start it, then +use the following command to create a new directory and set it as your +"tahoe:" alias:: + + tahoe create-alias tahoe + +After that you can use "``tahoe ls tahoe:``" and +"``tahoe cp local.txt tahoe:``", and both will refer to the directory that +you've just created.
+ +SECURITY NOTE: For users of shared systems +`````````````````````````````````````````` + +Another way to achieve the same effect as the above "tahoe create-alias" +command is:: + + tahoe add-alias tahoe `tahoe mkdir` + +However, command-line arguments are visible to other users (through the +'ps' command, or the Windows Process Explorer tool), so if you are using a +tahoe node on a shared host, your login neighbors will be able to see (and +capture) any directory caps that you set up with the "``tahoe add-alias``" +command. + +The "``tahoe create-alias``" command avoids this problem by creating a new +directory and putting the cap into your aliases file for you. Alternatively, +you can edit the NODEDIR/private/aliases file directly, by adding a line like +this:: + + fun: URI:DIR2:ovjy4yhylqlfoqg2vcze36dhde:4d4f47qko2xm5g7osgo2yyidi5m4muyo2vjjy53q4vjju2u55mfa + +By entering the dircap through the editor, the command-line arguments are +bypassed, and other users will not be able to see them. Once you've added the +alias, no other secrets are passed through the command line, so this +vulnerability becomes less significant: they can still see your filenames and +other arguments you type there, but not the caps that Tahoe uses to permit +access to your files and directories. 
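To make the aliases file format shown above concrete, here is a minimal Python sketch that reads "name: cap" lines into a dictionary. ``parse_aliases`` is a hypothetical name, and skipping blank lines is an assumption. The real CLI may accept more than this.

```python
def parse_aliases(text):
    """Parse 'name: cap' lines into a dict of alias name -> cap string.

    Hypothetical helper for illustration; blank lines are skipped
    (an assumption, not something the text above specifies).
    """
    aliases = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at the first colon only, because caps themselves contain colons.
        name, cap = line.split(":", 1)
        aliases[name.strip()] = cap.strip()
    return aliases


sample = (
    "fun: URI:DIR2:ovjy4yhylqlfoqg2vcze36dhde:"
    "4d4f47qko2xm5g7osgo2yyidi5m4muyo2vjjy53q4vjju2u55mfa\n"
)
print(parse_aliases(sample))
```

Splitting at the first colon only is the important detail: the cap string contains several colons of its own.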
+ + +Command Syntax Summary +---------------------- + +tahoe add-alias alias cap + +tahoe create-alias alias + +tahoe list-aliases + +tahoe mkdir + +tahoe mkdir [alias:]path + +tahoe ls [alias:][path] + +tahoe webopen [alias:][path] + +tahoe put [--mutable] [localfrom:-] + +tahoe put [--mutable] [localfrom:-] [alias:]to + +tahoe put [--mutable] [localfrom:-] [alias:]subdir/to + +tahoe put [--mutable] [localfrom:-] dircap:to + +tahoe put [--mutable] [localfrom:-] dircap:./subdir/to + +tahoe put [localfrom:-] mutable-file-writecap + +tahoe get [alias:]from [localto:-] + +tahoe cp [-r] [alias:]frompath [alias:]topath + +tahoe rm [alias:]what + +tahoe mv [alias:]from [alias:]to + +tahoe ln [alias:]from [alias:]to + +tahoe backup localfrom [alias:]to + +Command Examples +---------------- + +``tahoe mkdir`` + + This creates a new empty unlinked directory, and prints its write-cap to + stdout. The new directory is not attached to anything else. + +``tahoe add-alias fun DIRCAP`` + + An example would be:: + + tahoe add-alias fun URI:DIR2:ovjy4yhylqlfoqg2vcze36dhde:4d4f47qko2xm5g7osgo2yyidi5m4muyo2vjjy53q4vjju2u55mfa + + This creates an alias "fun:" and configures it to use the given directory + cap. Once this is done, "tahoe ls fun:" will list the contents of this + directory. Use "tahoe add-alias tahoe DIRCAP" to set the contents of the + default "tahoe:" alias. + +``tahoe create-alias fun`` + + This combines "``tahoe mkdir``" and "``tahoe add-alias``" into a single step. + +``tahoe list-aliases`` + + This displays a table of all configured aliases. + +``tahoe mkdir subdir`` + +``tahoe mkdir /subdir`` + + These both create a new empty directory and attach it to your root with the + name "subdir". + +``tahoe ls`` + +``tahoe ls /`` + +``tahoe ls tahoe:`` + +``tahoe ls tahoe:/`` + + All four list the root directory of your personal virtual filesystem. + +``tahoe ls subdir`` + + This lists a subdirectory of your filesystem.
+ +``tahoe webopen`` + +``tahoe webopen tahoe:`` + +``tahoe webopen tahoe:subdir/`` + +``tahoe webopen subdir/`` + + This uses the python 'webbrowser' module to cause a local web browser to + open to the web page for the given directory. This page offers interfaces to + add, download, rename, and delete files in the directory. If not given an + alias or path, opens "tahoe:", the root dir of the default alias. + +``tahoe put file.txt`` + +``tahoe put ./file.txt`` + +``tahoe put /tmp/file.txt`` + +``tahoe put ~/file.txt`` + + These upload the local file into the grid, and print the new read-cap to + stdout. The uploaded file is not attached to any directory. All one-argument + forms of "``tahoe put``" perform an unlinked upload. + +``tahoe put -`` + +``tahoe put`` + + These also perform an unlinked upload, but the data to be uploaded is taken + from stdin. + +``tahoe put file.txt uploaded.txt`` + +``tahoe put file.txt tahoe:uploaded.txt`` + + These upload the local file and add it to your root with the name + "uploaded.txt". + +``tahoe put file.txt subdir/foo.txt`` + +``tahoe put - subdir/foo.txt`` + +``tahoe put file.txt tahoe:subdir/foo.txt`` + +``tahoe put file.txt DIRCAP:./foo.txt`` + +``tahoe put file.txt DIRCAP:./subdir/foo.txt`` + + These upload the named file and attach it to a subdirectory of the given + root directory, under the name "foo.txt". Note that to use a directory + write-cap instead of an alias, you must use ":./" as a separator, rather + than ":", to help the CLI parser figure out where the dircap ends. When the + source file is named "-", the contents are taken from stdin. + +``tahoe put file.txt --mutable`` + + Create a new mutable file, fill it with the contents of file.txt, and print + the new write-cap to stdout. + +``tahoe put file.txt MUTABLE-FILE-WRITECAP`` + + Replace the contents of the given mutable file with the contents of file.txt + and print the same write-cap to stdout.
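The ":./" separator rule for directory caps in the "``tahoe put``" examples above is easier to see in code. The following Python sketch shows why a plain ":" is ambiguous when the prefix is a cap rather than an alias. ``split_dircap_target`` and the shortened cap string are hypothetical. This is not how the CLI parser is actually written.

```python
def split_dircap_target(arg):
    """Split 'DIRCAP:./path' into (dircap, path) at the first ':./'.

    Hypothetical helper for illustration. A cap contains colons of its
    own, so splitting on ':' alone could not tell where the cap ends.
    """
    sep = ":./"
    if sep in arg:
        dircap, path = arg.split(sep, 1)
        return dircap, path
    # No separator found: treat the whole argument as the cap.
    return arg, ""


# Shortened, made-up cap string used only for the demonstration.
print(split_dircap_target("URI:DIR2:aaaa:bbbb:./subdir/foo.txt"))
```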
+ +``tahoe cp file.txt tahoe:uploaded.txt`` + +``tahoe cp file.txt tahoe:`` + +``tahoe cp file.txt tahoe:/`` + +``tahoe cp ./file.txt tahoe:`` + + These upload the local file and add it to your root with the name + "uploaded.txt". + +``tahoe cp tahoe:uploaded.txt downloaded.txt`` + +``tahoe cp tahoe:uploaded.txt ./downloaded.txt`` + +``tahoe cp tahoe:uploaded.txt /tmp/downloaded.txt`` + +``tahoe cp tahoe:uploaded.txt ~/downloaded.txt`` + + This downloads the named file from your tahoe root, and puts the result on + your local filesystem. + +``tahoe cp tahoe:uploaded.txt fun:stuff.txt`` + + This copies a file from your tahoe root to a different virtual directory, + set up earlier with "tahoe add-alias fun DIRCAP". + +``tahoe rm uploaded.txt`` + +``tahoe rm tahoe:uploaded.txt`` + + This deletes a file from your tahoe root. + +``tahoe mv uploaded.txt renamed.txt`` + +``tahoe mv tahoe:uploaded.txt tahoe:renamed.txt`` + + These rename a file within your tahoe root directory. + +``tahoe mv uploaded.txt fun:`` + +``tahoe mv tahoe:uploaded.txt fun:`` + +``tahoe mv tahoe:uploaded.txt fun:uploaded.txt`` + + These move a file from your tahoe root directory to the virtual directory + set up earlier with "tahoe add-alias fun DIRCAP" + +``tahoe backup ~ work:backups`` + + This command performs a full versioned backup of every file and directory + underneath your "~" home directory, placing an immutable timestamped + snapshot in e.g. work:backups/Archives/2009-02-06_04:00:05Z/ (note that the + timestamp is in UTC, hence the "Z" suffix), and a link to the latest + snapshot in work:backups/Latest/ . This command uses a small SQLite database + known as the "backupdb", stored in ~/.tahoe/private/backupdb.sqlite, to + remember which local files have been backed up already, and will avoid + uploading files that have already been backed up. It compares timestamps and + filesizes when making this comparison. It also re-uses existing directories + which have identical contents. 
This lets it run faster and reduces the + number of directories created. + + If you reconfigure your client node to switch to a different grid, you + should delete the stale backupdb.sqlite file, to force "tahoe backup" to + upload all files to the new grid. + +``tahoe backup --exclude=*~ ~ work:backups`` + + Same as above, but this time the backup process will ignore any + filename that will end with '~'. '--exclude' will accept any standard + unix shell-style wildcards, have a look at + http://docs.python.org/library/fnmatch.html for a more detailed + reference. You may give multiple '--exclude' options. Please pay + attention that the pattern will be matched against any level of the + directory tree, it's still impossible to specify absolute path exclusions. + +``tahoe backup --exclude-from=/path/to/filename ~ work:backups`` + + '--exclude-from' is similar to '--exclude', but reads exclusion + patterns from '/path/to/filename', one per line. + +``tahoe backup --exclude-vcs ~ work:backups`` + + This command will ignore any known file or directory that's used by + version control systems to store metadata. The excluded names are: + + * CVS + * RCS + * SCCS + * .git + * .gitignore + * .cvsignore + * .svn + * .arch-ids + * {arch} + * =RELEASE-ID + * =meta-update + * =update + * .bzr + * .bzrignore + * .bzrtags + * .hg + * .hgignore + * _darcs + +Storage Grid Maintenance +======================== + +``tahoe manifest tahoe:`` + +``tahoe manifest --storage-index tahoe:`` + +``tahoe manifest --verify-cap tahoe:`` + +``tahoe manifest --repair-cap tahoe:`` + +``tahoe manifest --raw tahoe:`` + + This performs a recursive walk of the given directory, visiting every file + and directory that can be reached from that point. It then emits one line to + stdout for each object it encounters. + + The default behavior is to print the access cap string (like URI:CHK:.. or + URI:DIR2:..), followed by a space, followed by the full path name. 
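A script consuming the default "``tahoe manifest``" output could split each line as described above: cap string, one space, then the path. The Python sketch below assumes that cap strings never contain spaces, so only the first space is treated as the delimiter. ``parse_manifest_line`` and the sample line are hypothetical.

```python
def parse_manifest_line(line):
    """Split one default-format manifest line into (cap, path).

    Hypothetical helper for illustration. Assumes the cap contains no
    spaces, so the path may contain spaces without being truncated.
    """
    cap, _, path = line.rstrip("\n").partition(" ")
    return cap, path


# Made-up sample line, shaped like the default output described above.
print(parse_manifest_line("URI:CHK:abc subdir/my file.txt\n"))
```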
+ + If --storage-index is added, each line will instead contain the object's + storage index. This (string) value is useful to determine which share files + (on the server) are associated with this directory tree. The --verify-cap + and --repair-cap options are similar, but emit a verify-cap and repair-cap, + respectively. If --raw is provided instead, the output will be a + JSON-encoded dictionary that includes keys for pathnames, storage index + strings, and cap strings. The last line of the --raw output will be a JSON + encoded deep-stats dictionary. + +``tahoe stats tahoe:`` + + This performs a recursive walk of the given directory, visiting every file + and directory that can be reached from that point. It gathers statistics on + the sizes of the objects it encounters, and prints a summary to stdout. + + +Debugging +========= + +For a list of all debugging commands, use "tahoe debug". + +"``tahoe debug find-shares STORAGEINDEX NODEDIRS..``" will look through one or +more storage nodes for the share files that are providing storage for the +given storage index. + +"``tahoe debug catalog-shares NODEDIRS..``" will look through one or more +storage nodes and locate every single share they contain. It produces a report +on stdout with one line per share, describing what kind of share it is, the +storage index, the size of the file it is used for, etc. It may be useful to +concatenate these reports from all storage hosts and use them to look for +anomalies. + +"``tahoe debug dump-share SHAREFILE``" will take the name of a single share file +(as found by "tahoe debug find-shares") and print a summary of its contents to +stdout. This includes a list of leases, summaries of the hash tree, and +information from the UEB (URI Extension Block). For mutable file shares, it +will describe which version (seqnum and root-hash) is being stored in this +share.
+ +"``tahoe debug dump-cap CAP``" will take a URI (a file read-cap, or a directory +read- or write- cap) and unpack it into separate pieces. The most useful +aspect of this command is to reveal the storage index for any given URI. This +can be used to locate the share files that are holding the encoded+encrypted +data for this file. + +"``tahoe debug repl``" will launch an interactive python interpreter in which +the Tahoe packages and modules are available on sys.path (e.g. by using 'import +allmydata'). This is most useful from a source tree: it simply sets the +PYTHONPATH correctly and runs the 'python' executable. + +"``tahoe debug corrupt-share SHAREFILE``" will flip a bit in the given +sharefile. This can be used to test the client-side verification/repair code. +Obviously, this command should not be used during normal operation. diff --git a/docs/frontends/CLI.txt b/docs/frontends/CLI.txt deleted file mode 100644 index 743b8871..00000000 --- a/docs/frontends/CLI.txt +++ /dev/null @@ -1,548 +0,0 @@ -====================== -The Tahoe CLI commands -====================== - -1. `Overview`_ -2. `CLI Command Overview`_ -3. `Node Management`_ -4. `Filesystem Manipulation`_ - - 1. `Starting Directories`_ - 2. `Command Syntax Summary`_ - 3. `Command Examples`_ - -5. `Storage Grid Maintenance`_ -6. `Debugging`_ - - -Overview -======== - -Tahoe provides a single executable named "``tahoe``", which can be used to -create and manage client/server nodes, manipulate the filesystem, and perform -several debugging/maintenance tasks. - -This executable lives in the source tree at "``bin/tahoe``". Once you've done a -build (by running "make"), ``bin/tahoe`` can be run in-place: if it discovers -that it is being run from within a Tahoe source tree, it will modify sys.path -as necessary to use all the source code and dependent libraries contained in -that tree. 
- -If you've installed Tahoe (using "``make install``", or by installing a binary -package), then the tahoe executable will be available somewhere else, perhaps -in ``/usr/bin/tahoe``. In this case, it will use your platform's normal -PYTHONPATH search paths to find the tahoe code and other libraries. - - -CLI Command Overview -==================== - -The "``tahoe``" tool provides access to three categories of commands. - -* node management: create a client/server node, start/stop/restart it -* filesystem manipulation: list files, upload, download, delete, rename -* debugging: unpack cap-strings, examine share files - -To get a list of all commands, just run "``tahoe``" with no additional -arguments. "``tahoe --help``" might also provide something useful. - -Running "``tahoe --version``" will display a list of version strings, starting -with the "allmydata" module (which contains the majority of the Tahoe -functionality) and including versions for a number of dependent libraries, -like Twisted, Foolscap, pycryptopp, and zfec. - - -Node Management -=============== - -"``tahoe create-node [NODEDIR]``" is the basic make-a-new-node command. It -creates a new directory and populates it with files that will allow the -"``tahoe start``" command to use it later on. This command creates nodes that -have client functionality (upload/download files), web API services -(controlled by the 'webport' file), and storage services (unless -"--no-storage" is specified). - -NODEDIR defaults to ~/.tahoe/ , and newly-created nodes default to -publishing a web server on port 3456 (limited to the loopback interface, at -127.0.0.1, to restrict access to other programs on the same host). All of the -other "``tahoe``" subcommands use corresponding defaults. - -"``tahoe create-client [NODEDIR]``" creates a node with no storage service. -That is, it behaves like "``tahoe create-node --no-storage [NODEDIR]``". -(This is a change from versions prior to 1.6.0.) 
- -"``tahoe create-introducer [NODEDIR]``" is used to create the Introducer node. -This node provides introduction services and nothing else. When started, this -node will produce an introducer.furl, which should be published to all -clients. - -"``tahoe create-key-generator [NODEDIR]``" is used to create a special -"key-generation" service, which allows a client to offload their RSA key -generation to a separate process. Since RSA key generation takes several -seconds, and must be done each time a directory is created, moving it to a -separate process allows the first process (perhaps a busy wapi server) to -continue servicing other requests. The key generator exports a FURL that can -be copied into a node to enable this functionality. - -"``tahoe run [NODEDIR]``" will start a previously-created node in the foreground. - -"``tahoe start [NODEDIR]``" will launch a previously-created node. It will launch -the node into the background, using the standard Twisted "twistd" -daemon-launching tool. On some platforms (including Windows) this command is -unable to run a daemon in the background; in that case it behaves in the -same way as "``tahoe run``". - -"``tahoe stop [NODEDIR]``" will shut down a running node. - -"``tahoe restart [NODEDIR]``" will stop and then restart a running node. This is -most often used by developers who have just modified the code and want to -start using their changes. - - -Filesystem Manipulation -======================= - -These commands let you exmaine a Tahoe filesystem, providing basic -list/upload/download/delete/rename/mkdir functionality. They can be used as -primitives by other scripts. Most of these commands are fairly thin wrappers -around wapi calls. - -By default, all filesystem-manipulation commands look in ~/.tahoe/ to figure -out which Tahoe node they should use. 
When the CLI command uses wapi calls, -it will use ~/.tahoe/node.url for this purpose: a running Tahoe node that -provides a wapi port will write its URL into this file. If you want to use -a node on some other host, just create ~/.tahoe/ and copy that node's wapi -URL into this file, and the CLI commands will contact that node instead of a -local one. - -These commands also use a table of "aliases" to figure out which directory -they ought to use a starting point. This is explained in more detail below. - -As of Tahoe v1.7, passing non-ASCII characters to the CLI should work, -except on Windows. The command-line arguments are assumed to use the -character encoding specified by the current locale. - -Starting Directories --------------------- - -As described in architecture.txt, the Tahoe distributed filesystem consists -of a collection of directories and files, each of which has a "read-cap" or a -"write-cap" (also known as a URI). Each directory is simply a table that maps -a name to a child file or directory, and this table is turned into a string -and stored in a mutable file. The whole set of directory and file "nodes" are -connected together into a directed graph. - -To use this collection of files and directories, you need to choose a -starting point: some specific directory that we will refer to as a -"starting directory". For a given starting directory, the "``ls -[STARTING_DIR]:``" command would list the contents of this directory, -the "``ls [STARTING_DIR]:dir1``" command would look inside this directory -for a child named "dir1" and list its contents, "``ls -[STARTING_DIR]:dir1/subdir2``" would look two levels deep, etc. - -Note that there is no real global "root" directory, but instead each -starting directory provides a different, possibly overlapping -perspective on the graph of files and directories. - -Each tahoe node remembers a list of starting points, named "aliases", -in a file named ~/.tahoe/private/aliases . 
These aliases are short UTF-8 -encoded strings that stand in for a directory read- or write- cap. If -you use the command line "``ls``" without any "[STARTING_DIR]:" argument, -then it will use the default alias, which is "tahoe", therefore "``tahoe -ls``" has the same effect as "``tahoe ls tahoe:``". The same goes for the -other commands which can reasonably use a default alias: get, put, -mkdir, mv, and rm. - -For backwards compatibility with Tahoe-1.0, if the "tahoe": alias is not -found in ~/.tahoe/private/aliases, the CLI will use the contents of -~/.tahoe/private/root_dir.cap instead. Tahoe-1.0 had only a single starting -point, and stored it in this root_dir.cap file, so Tahoe-1.1 will use it if -necessary. However, once you've set a "tahoe:" alias with "``tahoe set-alias``", -that will override anything in the old root_dir.cap file. - -The Tahoe CLI commands use the same filename syntax as scp and rsync --- an optional "alias:" prefix, followed by the pathname or filename. -Some commands (like "tahoe cp") use the lack of an alias to mean that -you want to refer to a local file, instead of something from the tahoe -virtual filesystem. [TODO] Another way to indicate this is to start -the pathname with a dot, slash, or tilde. - -When you're dealing a single starting directory, the "tahoe:" alias is -all you need. But when you want to refer to something that isn't yet -attached to the graph rooted at that starting directory, you need to -refer to it by its capability. The way to do that is either to use its -capability directory as an argument on the command line, or to add an -alias to it, with the "tahoe add-alias" command. Once you've added an -alias, you can use that alias as an argument to commands. 
- -The best way to get started with Tahoe is to create a node, start it, then -use the following command to create a new directory and set it as your -"tahoe:" alias:: - - tahoe create-alias tahoe - -After that you can use "``tahoe ls tahoe:``" and -"``tahoe cp local.txt tahoe:``", and both will refer to the directory that -you've just created. - -SECURITY NOTE: For users of shared systems -`````````````````````````````````````````` - -Another way to achieve the same effect as the above "tahoe create-alias" -command is:: - - tahoe add-alias tahoe `tahoe mkdir` - -However, command-line arguments are visible to other users (through the -'ps' command, or the Windows Process Explorer tool), so if you are using a -tahoe node on a shared host, your login neighbors will be able to see (and -capture) any directory caps that you set up with the "``tahoe add-alias``" -command. - -The "``tahoe create-alias``" command avoids this problem by creating a new -directory and putting the cap into your aliases file for you. Alternatively, -you can edit the NODEDIR/private/aliases file directly, by adding a line like -this:: - - fun: URI:DIR2:ovjy4yhylqlfoqg2vcze36dhde:4d4f47qko2xm5g7osgo2yyidi5m4muyo2vjjy53q4vjju2u55mfa - -By entering the dircap through the editor, the command-line arguments are -bypassed, and other users will not be able to see them. Once you've added the -alias, no other secrets are passed through the command line, so this -vulnerability becomes less significant: they can still see your filenames and -other arguments you type there, but not the caps that Tahoe uses to permit -access to your files and directories. 
- - -Command Syntax Summary ----------------------- - -tahoe add-alias alias cap - -tahoe create-alias alias - -tahoe list-aliases - -tahoe mkdir - -tahoe mkdir [alias:]path - -tahoe ls [alias:][path] - -tahoe webopen [alias:][path] - -tahoe put [--mutable] [localfrom:-] - -tahoe put [--mutable] [localfrom:-] [alias:]to - -tahoe put [--mutable] [localfrom:-] [alias:]subdir/to - -tahoe put [--mutable] [localfrom:-] dircap:to - -tahoe put [--mutable] [localfrom:-] dircap:./subdir/to - -tahoe put [localfrom:-] mutable-file-writecap - -tahoe get [alias:]from [localto:-] - -tahoe cp [-r] [alias:]frompath [alias:]topath - -tahoe rm [alias:]what - -tahoe mv [alias:]from [alias:]to - -tahoe ln [alias:]from [alias:]to - -tahoe backup localfrom [alias:]to - -Command Examples ----------------- - -``tahoe mkdir`` - - This creates a new empty unlinked directory, and prints its write-cap to - stdout. The new directory is not attached to anything else. - -``tahoe add-alias fun DIRCAP`` - - An example would be:: - - tahoe add-alias fun URI:DIR2:ovjy4yhylqlfoqg2vcze36dhde:4d4f47qko2xm5g7osgo2yyidi5m4muyo2vjjy53q4vjju2u55mfa - - This creates an alias "fun:" and configures it to use the given directory - cap. Once this is done, "tahoe ls fun:" will list the contents of this - directory. Use "tahoe add-alias tahoe DIRCAP" to set the contents of the - default "tahoe:" alias. - -``tahoe create-alias fun`` - - This combines "``tahoe mkdir``" and "``tahoe add-alias``" into a single step. - -``tahoe list-aliases`` - - This displays a table of all configured aliases. - -``tahoe mkdir subdir`` - -``tahoe mkdir /subdir`` - - This both create a new empty directory and attaches it to your root with the - name "subdir". - -``tahoe ls`` - -``tahoe ls /`` - -``tahoe ls tahoe:`` - -``tahoe ls tahoe:/`` - - All four list the root directory of your personal virtual filesystem. - -``tahoe ls subdir`` - - This lists a subdirectory of your filesystem. 
- -``tahoe webopen`` - -``tahoe webopen tahoe:`` - -``tahoe webopen tahoe:subdir/`` - -``tahoe webopen subdir/`` - - This uses the python 'webbrowser' module to cause a local web browser to - open to the web page for the given directory. This page offers interfaces to - add, download, rename, and delete files in the directory. If not given an - alias or path, opens "tahoe:", the root dir of the default alias. - -``tahoe put file.txt`` - -``tahoe put ./file.txt`` - -``tahoe put /tmp/file.txt`` - -``tahoe put ~/file.txt`` - - These upload the local file into the grid, and print the new read-cap to - stdout. The uploaded file is not attached to any directory. All one-argument - forms of "``tahoe put``" perform an unlinked upload. - -``tahoe put -`` - -``tahoe put`` - - These also perform an unlinked upload, but the data to be uploaded is taken - from stdin. - -``tahoe put file.txt uploaded.txt`` - -``tahoe put file.txt tahoe:uploaded.txt`` - - These upload the local file and add it to your root with the name - "uploaded.txt". - -``tahoe put file.txt subdir/foo.txt`` - -``tahoe put - subdir/foo.txt`` - -``tahoe put file.txt tahoe:subdir/foo.txt`` - -``tahoe put file.txt DIRCAP:./foo.txt`` - -``tahoe put file.txt DIRCAP:./subdir/foo.txt`` - - These upload the named file and attach it to a subdirectory of the given - root directory, under the name "foo.txt". Note that to use a directory - write-cap instead of an alias, you must use ":./" as a separator, rather - than ":", to help the CLI parser figure out where the dircap ends. When the - source file is named "-", the contents are taken from stdin. - -``tahoe put file.txt --mutable`` - - Create a new mutable file, fill it with the contents of file.txt, and print - the new write-cap to stdout. - -``tahoe put file.txt MUTABLE-FILE-WRITECAP`` - - Replace the contents of the given mutable file with the contents of file.txt - and print the same write-cap to stdout.
- -``tahoe cp file.txt tahoe:uploaded.txt`` - -``tahoe cp file.txt tahoe:`` - -``tahoe cp file.txt tahoe:/`` - -``tahoe cp ./file.txt tahoe:`` - - These upload the local file and add it to your root with the name - "uploaded.txt". - -``tahoe cp tahoe:uploaded.txt downloaded.txt`` - -``tahoe cp tahoe:uploaded.txt ./downloaded.txt`` - -``tahoe cp tahoe:uploaded.txt /tmp/downloaded.txt`` - -``tahoe cp tahoe:uploaded.txt ~/downloaded.txt`` - - This downloads the named file from your tahoe root, and puts the result on - your local filesystem. - -``tahoe cp tahoe:uploaded.txt fun:stuff.txt`` - - This copies a file from your tahoe root to a different virtual directory, - set up earlier with "tahoe add-alias fun DIRCAP". - -``tahoe rm uploaded.txt`` - -``tahoe rm tahoe:uploaded.txt`` - - This deletes a file from your tahoe root. - -``tahoe mv uploaded.txt renamed.txt`` - -``tahoe mv tahoe:uploaded.txt tahoe:renamed.txt`` - - These rename a file within your tahoe root directory. - -``tahoe mv uploaded.txt fun:`` - -``tahoe mv tahoe:uploaded.txt fun:`` - -``tahoe mv tahoe:uploaded.txt fun:uploaded.txt`` - - These move a file from your tahoe root directory to the virtual directory - set up earlier with "tahoe add-alias fun DIRCAP" - -``tahoe backup ~ work:backups`` - - This command performs a full versioned backup of every file and directory - underneath your "~" home directory, placing an immutable timestamped - snapshot in e.g. work:backups/Archives/2009-02-06_04:00:05Z/ (note that the - timestamp is in UTC, hence the "Z" suffix), and a link to the latest - snapshot in work:backups/Latest/ . This command uses a small SQLite database - known as the "backupdb", stored in ~/.tahoe/private/backupdb.sqlite, to - remember which local files have been backed up already, and will avoid - uploading files that have already been backed up. It compares timestamps and - filesizes when making this comparison. It also re-uses existing directories - which have identical contents. 
This lets it run faster and reduces the - number of directories created. - - If you reconfigure your client node to switch to a different grid, you - should delete the stale backupdb.sqlite file, to force "tahoe backup" to - upload all files to the new grid. - -``tahoe backup --exclude=*~ ~ work:backups`` - - Same as above, but this time the backup process will ignore any - filename that will end with '~'. '--exclude' will accept any standard - unix shell-style wildcards, have a look at - http://docs.python.org/library/fnmatch.html for a more detailed - reference. You may give multiple '--exclude' options. Please pay - attention that the pattern will be matched against any level of the - directory tree, it's still impossible to specify absolute path exclusions. - -``tahoe backup --exclude-from=/path/to/filename ~ work:backups`` - - '--exclude-from' is similar to '--exclude', but reads exclusion - patterns from '/path/to/filename', one per line. - -``tahoe backup --exclude-vcs ~ work:backups`` - - This command will ignore any known file or directory that's used by - version control systems to store metadata. The excluded names are: - - * CVS - * RCS - * SCCS - * .git - * .gitignore - * .cvsignore - * .svn - * .arch-ids - * {arch} - * =RELEASE-ID - * =meta-update - * =update - * .bzr - * .bzrignore - * .bzrtags - * .hg - * .hgignore - * _darcs - -Storage Grid Maintenance -======================== - -``tahoe manifest tahoe:`` - -``tahoe manifest --storage-index tahoe:`` - -``tahoe manifest --verify-cap tahoe:`` - -``tahoe manifest --repair-cap tahoe:`` - -``tahoe manifest --raw tahoe:`` - - This performs a recursive walk of the given directory, visiting every file - and directory that can be reached from that point. It then emits one line to - stdout for each object it encounters. - - The default behavior is to print the access cap string (like URI:CHK:.. or - URI:DIR2:..), followed by a space, followed by the full path name. 
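Since each line of the default output is a cap string, then a space, then the full path name, a script consuming the manifest only needs to split on the first space (paths may themselves contain spaces). The following is a minimal sketch under that assumption; the function name and the sample cap strings are illustrative placeholders, not part of Tahoe.

```python
def parse_manifest_line(line):
    """Split one default-format manifest line into (cap, path).

    Assumes the "CAP PATH" layout described above. Only the first space
    separates the two fields, so spaces inside the path are preserved.
    """
    line = line.rstrip("\n")
    cap, _, path = line.partition(" ")
    return cap, path


# Placeholder caps for illustration only; real caps are much longer.
file_entry = parse_manifest_line("URI:CHK:abc docs/my file.txt\n")
root_entry = parse_manifest_line("URI:DIR2:xyz \n")
```

The root of the walk has an empty path in this sketch, which is why the second sample line ends right after the separator.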
- - If --storage-index is added, each line will instead contain the object's - storage index. This (string) value is useful to determine which share files - (on the server) are associated with this directory tree. The --verify-cap - and --repair-cap options are similar, but emit a verify-cap and repair-cap, - respectively. If --raw is provided instead, the output will be a - JSON-encoded dictionary that includes keys for pathnames, storage index - strings, and cap strings. The last line of the --raw output will be a JSON - encoded deep-stats dictionary. - -``tahoe stats tahoe:`` - - This performs a recursive walk of the given directory, visiting every file - and directory that can be reached from that point. It gathers statistics on - the sizes of the objects it encounters, and prints a summary to stdout. - - -Debugging -========= - -For a list of all debugging commands, use "tahoe debug". - -"``tahoe debug find-shares STORAGEINDEX NODEDIRS..``" will look through one or -more storage nodes for the share files that are providing storage for the -given storage index. - -"``tahoe debug catalog-shares NODEDIRS..``" will look through one or more -storage nodes and locate every single share they contain. It produces a report -on stdout with one line per share, describing what kind of share it is, the -storage index, the size of the file it is used for, etc. It may be useful to -concatenate these reports from all storage hosts and use it to look for -anomalies. - -"``tahoe debug dump-share SHAREFILE``" will take the name of a single share file -(as found by "tahoe debug find-shares") and print a summary of its contents to -stdout. This includes a list of leases, summaries of the hash tree, and -information from the UEB (URI Extension Block). For mutable file shares, it -will describe which version (seqnum and root-hash) is being stored in this -share.
- -"``tahoe debug dump-cap CAP``" will take a URI (a file read-cap, or a directory -read- or write- cap) and unpack it into separate pieces. The most useful -aspect of this command is to reveal the storage index for any given URI. This -can be used to locate the share files that are holding the encoded+encrypted -data for this file. - -"``tahoe debug repl``" will launch an interactive python interpreter in which -the Tahoe packages and modules are available on sys.path (e.g. by using 'import -allmydata'). This is most useful from a source tree: it simply sets the -PYTHONPATH correctly and runs the 'python' executable. - -"``tahoe debug corrupt-share SHAREFILE``" will flip a bit in the given -sharefile. This can be used to test the client-side verification/repair code. -Obviously, this command should not be used during normal operation. diff --git a/docs/frontends/FTP-and-SFTP.rst b/docs/frontends/FTP-and-SFTP.rst new file mode 100644 index 00000000..230dca31 --- /dev/null +++ b/docs/frontends/FTP-and-SFTP.rst @@ -0,0 +1,240 @@ +================================= +Tahoe-LAFS FTP and SFTP Frontends +================================= + +1. `FTP/SFTP Background`_ +2. `Tahoe-LAFS Support`_ +3. `Creating an Account File`_ +4. `Configuring FTP Access`_ +5. `Configuring SFTP Access`_ +6. `Dependencies`_ +7. `Immutable and mutable files`_ +8. `Known Issues`_ + + +FTP/SFTP Background +=================== + +FTP is the venerable internet file-transfer protocol, first developed in +1971. The FTP server usually listens on port 21. A separate connection is +used for the actual data transfers, either in the same direction as the +initial client-to-server connection (for PASV mode), or in the reverse +direction (for PORT mode). Connections are unencrypted, so passwords, file +names, and file contents are visible to eavesdroppers. + +SFTP is the modern replacement, developed as part of the SSH "secure shell" +protocol, and runs as a subchannel of the regular SSH connection.
The SSH +server usually listens on port 22. All connections are encrypted. + +Both FTP and SFTP were developed assuming a UNIX-like server, with accounts +and passwords, octal file modes (user/group/other, read/write/execute), and +ctime/mtime timestamps. + +Tahoe-LAFS Support +================== + +All Tahoe-LAFS client nodes can run a frontend FTP server, allowing regular FTP +clients (like /usr/bin/ftp, ncftp, and countless others) to access the +virtual filesystem. They can also run an SFTP server, so SFTP clients (like +/usr/bin/sftp, the sshfs FUSE plugin, and others) can too. These frontends +sit at the same level as the webapi interface. + +Since Tahoe-LAFS does not use user accounts or passwords, the FTP/SFTP servers +must be configured with a way to first authenticate a user (confirm that a +prospective client has a legitimate claim to whatever authorities we might +grant a particular user), and second to decide what root directory cap should +be granted to the authenticated username. A username and password is used +for this purpose. (The SFTP protocol is also capable of using client +RSA or DSA public keys, but this is not currently implemented.) + +Tahoe-LAFS provides two mechanisms to perform this user-to-rootcap mapping. The +first is a simple flat file with one account per line. The second is an +HTTP-based login mechanism, backed by a simple PHP script and a database. The +latter form is used by allmydata.com to provide secure access to customer +rootcaps.
+ +Creating an Account File +======================== + +To use the first form, create a file (probably in +BASEDIR/private/ftp.accounts) in which each non-comment/non-blank line is a +space-separated line of (USERNAME, PASSWORD, ROOTCAP), like so:: + + % cat BASEDIR/private/ftp.accounts + # This is a password line, (username, password, rootcap) + alice password URI:DIR2:ioej8xmzrwilg772gzj4fhdg7a:wtiizszzz2rgmczv4wl6bqvbv33ag4kvbr6prz3u6w3geixa6m6a + bob sekrit URI:DIR2:6bdmeitystckbl9yqlw7g56f4e:serp5ioqxnh34mlbmzwvkp3odehsyrr7eytt5f64we3k9hhcrcja + +Future versions of Tahoe-LAFS may support using client public keys for SFTP. +The words "ssh-rsa" and "ssh-dsa" after the username are reserved to specify +the public key format, so users cannot have a password equal to either of +these strings. + +Now add an 'accounts.file' directive to your tahoe.cfg file, as described +in the next sections. + +Configuring FTP Access +====================== + +To enable the FTP server with an accounts file, add the following lines to +the BASEDIR/tahoe.cfg file:: + + [ftpd] + enabled = true + port = tcp:8021:interface=127.0.0.1 + accounts.file = private/ftp.accounts + +The FTP server will listen on the given port number and on the loopback +interface only. The "accounts.file" pathname will be interpreted +relative to the node's BASEDIR. + +To enable the FTP server with an account server instead, provide the URL of +that server in an "accounts.url" directive:: + + [ftpd] + enabled = true + port = tcp:8021:interface=127.0.0.1 + accounts.url = https://example.com/login + +You can provide both accounts.file and accounts.url, although it probably +isn't very useful except for testing. + +FTP provides no security, and so your password or caps could be eavesdropped +if you connect to the FTP server remotely. The examples above include +":interface=127.0.0.1" in the "port" option, which causes the server to only +accept connections from localhost. 
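Because exposing the FTP port beyond localhost would leak passwords and caps, it can be worth checking a configuration before starting the node. The following is a minimal sketch that reads the ``[ftpd]`` section shown above with Python 3's standard ``configparser`` module; it is not how the Tahoe-LAFS node itself parses tahoe.cfg, and the variable names are illustrative only.

```python
import configparser

# The [ftpd] example from above, inlined so the sketch is self-contained.
cfg_text = """
[ftpd]
enabled = true
port = tcp:8021:interface=127.0.0.1
accounts.file = private/ftp.accounts
"""

parser = configparser.ConfigParser()
parser.read_string(cfg_text)
ftpd = parser["ftpd"]

enabled = ftpd.getboolean("enabled")

# The "port" value is a colon-separated string; the loopback restriction
# appears as its own "interface=127.0.0.1" field.
port_fields = ftpd["port"].split(":")
loopback_only = "interface=127.0.0.1" in port_fields

if enabled and not loopback_only:
    print("warning: FTP server is not restricted to the loopback interface")
```

With the example configuration the check passes silently; removing ":interface=127.0.0.1" from the "port" line would trigger the warning.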
+ +Configuring SFTP Access +======================= + +The Tahoe-LAFS SFTP server requires a host keypair, just like the regular SSH +server. It is important to give each server a distinct keypair, to prevent +one server from masquerading as a different one. The first time a client +program talks to a given server, it will store the host key it receives, and +will complain if a subsequent connection uses a different key. This reduces +the opportunity for man-in-the-middle attacks to just the first connection. + +Exercise caution when connecting to the SFTP server remotely. The AES +implementation used by the SFTP code does not have defenses against timing +attacks. The code for encrypting the SFTP connection was not written by the +Tahoe-LAFS team, and we have not reviewed it as carefully as we have reviewed +the code for encrypting files and directories in Tahoe-LAFS itself. If you +can connect to the SFTP server (which is provided by the Tahoe-LAFS gateway) +only from a client on the same host, then you would be safe from any problem +with the SFTP connection security. The examples given below enforce this +policy by including ":interface=127.0.0.1" in the "port" option, which +causes the server to only accept connections from localhost. + +You will use directives in the tahoe.cfg file to tell the SFTP code where to +find these keys. To create one, use the ``ssh-keygen`` tool (which comes with +the standard openssh client distribution):: + + % cd BASEDIR + % ssh-keygen -f private/ssh_host_rsa_key + +The server private key file must not have a passphrase. + +Then, to enable the SFTP server with an accounts file, add the following +lines to the BASEDIR/tahoe.cfg file:: + + [sftpd] + enabled = true + port = tcp:8022:interface=127.0.0.1 + host_pubkey_file = private/ssh_host_rsa_key.pub + host_privkey_file = private/ssh_host_rsa_key + accounts.file = private/ftp.accounts + +The SFTP server will listen on the given port number and on the loopback +interface only.
The "accounts.file" pathname will be interpreted +relative to the node's BASEDIR. + +Or, to use an account server instead, do this:: + + [sftpd] + enabled = true + port = tcp:8022:interface=127.0.0.1 + host_pubkey_file = private/ssh_host_rsa_key.pub + host_privkey_file = private/ssh_host_rsa_key + accounts.url = https://example.com/login + +You can provide both accounts.file and accounts.url, although it probably +isn't very useful except for testing. + +For further information on SFTP compatibility and known issues with various +clients and with the sshfs filesystem, see +http://tahoe-lafs.org/trac/tahoe-lafs/wiki/SftpFrontend . + +Dependencies +============ + +The Tahoe-LAFS SFTP server requires the Twisted "Conch" component (a "conch" is +a twisted shell, get it?). Many Linux distributions package the Conch code +separately: debian puts it in the "python-twisted-conch" package. Conch +requires the "pycrypto" package, which is a Python+C implementation of many +cryptographic functions (the debian package is named "python-crypto"). + +Note that "pycrypto" is different than the "pycryptopp" package that Tahoe-LAFS +uses (which is a Python wrapper around the C++ -based Crypto++ library, a +library that is frequently installed as /usr/lib/libcryptopp.a, to avoid +problems with non-alphanumerics in filenames). + +The FTP server requires code in Twisted that enables asynchronous closing of +file-upload operations. This code was landed to Twisted's SVN trunk in r28453 +on 23-Feb-2010, slightly too late for the Twisted-10.0 release, but it should +be present in the next release after that. To use Tahoe-LAFS's FTP server with +Twisted-10.0 or earlier, you will need to apply the patch attached to +http://twistedmatrix.com/trac/ticket/3462 . The Tahoe-LAFS node will refuse to +start the FTP server unless it detects the necessary support code in Twisted. +This patch is not needed for SFTP. 
+ +Immutable and Mutable Files +=========================== + +All files created via SFTP (and FTP) are immutable files. However, files +can only be created in writeable directories, which allows the directory +entry to be relinked to a different file. Normally, when the path of an +immutable file is opened for writing by SFTP, the directory entry is +relinked to another file with the newly written contents when the file +handle is closed. The old file is still present on the grid, and any other +caps to it will remain valid. (See docs/garbage-collection.txt for how to +reclaim the space used by files that are no longer needed.) + +The 'no-write' metadata field of a directory entry can override this +behaviour. If the 'no-write' field holds a true value, then a permission +error will occur when trying to write to the file, even if it is in a +writeable directory. This does not prevent the directory entry from being +unlinked or replaced. + +When using sshfs, the 'no-write' field can be set by clearing the 'w' +bits in the Unix permissions, for example using the command +'chmod 444 path/to/file'. Note that this does not mean that arbitrary +combinations of Unix permissions are supported. If the 'w' bits are +cleared on a link to a mutable file or directory, that link will become +read-only. + +If SFTP is used to write to an existing mutable file, it will publish a +new version when the file handle is closed. + +Known Issues +============ + +Mutable files are not supported by the FTP frontend (`ticket #680 +`_). Currently, a directory +containing mutable files cannot even be listed over FTP. + +The FTP frontend sometimes fails to report errors, for example if an upload +fails because it does not meet the "servers of happiness" threshold (`ticket #1081 +`_). Upload errors also may not +be reported when writing files using SFTP via sshfs (`ticket #1059 +`_). + +Non-ASCII filenames are not supported by FTP (`ticket #682 +`_).
They can be used +with SFTP only if the client encodes filenames as UTF-8 (`ticket #1089 +`_). + +The gateway node may incur a memory leak when accessing many files via SFTP +(`ticket #1045 `_). + +For other known issues in SFTP, see +. diff --git a/docs/frontends/FTP-and-SFTP.txt b/docs/frontends/FTP-and-SFTP.txt deleted file mode 100644 index 230dca31..00000000 --- a/docs/frontends/FTP-and-SFTP.txt +++ /dev/null @@ -1,240 +0,0 @@ -================================= -Tahoe-LAFS FTP and SFTP Frontends -================================= - -1. `FTP/SFTP Background`_ -2. `Tahoe-LAFS Support`_ -3. `Creating an Account File`_ -4. `Configuring FTP Access`_ -5. `Configuring SFTP Access`_ -6. `Dependencies`_ -7. `Immutable and mutable files`_ -8. `Known Issues`_ - - -FTP/SFTP Background -=================== - -FTP is the venerable internet file-transfer protocol, first developed in -1971. The FTP server usually listens on port 21. A separate connection is -used for the actual data transfers, either in the same direction as the -initial client-to-server connection (for PASV mode), or in the reverse -direction (for PORT mode). Connections are unencrypted, so passwords, file -names, and file contents are visible to eavesdroppers. - -SFTP is the modern replacement, developed as part of the SSH "secure shell" -protocol, and runs as a subchannel of the regular SSH connection. The SSH -server usually listens on port 22. All connections are encrypted. - -Both FTP and SFTP were developed assuming a UNIX-like server, with accounts -and passwords, octal file modes (user/group/other, read/write/execute), and -ctime/mtime timestamps. - -Tahoe-LAFS Support -================== - -All Tahoe-LAFS client nodes can run a frontend FTP server, allowing regular FTP -clients (like /usr/bin/ftp, ncftp, and countless others) to access the -virtual filesystem. They can also run an SFTP server, so SFTP clients (like -/usr/bin/sftp, the sshfs FUSE plugin, and others) can too.
These frontends -sit at the same level as the webapi interface. - -Since Tahoe-LAFS does not use user accounts or passwords, the FTP/SFTP servers -must be configured with a way to first authenticate a user (confirm that a -prospective client has a legitimate claim to whatever authorities we might -grant a particular user), and second to decide what root directory cap should -be granted to the authenticated username. A username and password is used -for this purpose. (The SFTP protocol is also capable of using client -RSA or DSA public keys, but this is not currently implemented.) - -Tahoe-LAFS provides two mechanisms to perform this user-to-rootcap mapping. The -first is a simple flat file with one account per line. The second is an -HTTP-based login mechanism, backed by a simple PHP script and a database. The -latter form is used by allmydata.com to provide secure access to customer -rootcaps. - -Creating an Account File -======================== - -To use the first form, create a file (probably in -BASEDIR/private/ftp.accounts) in which each non-comment/non-blank line is a -space-separated line of (USERNAME, PASSWORD, ROOTCAP), like so:: - - % cat BASEDIR/private/ftp.accounts - # This is a password line, (username, password, rootcap) - alice password URI:DIR2:ioej8xmzrwilg772gzj4fhdg7a:wtiizszzz2rgmczv4wl6bqvbv33ag4kvbr6prz3u6w3geixa6m6a - bob sekrit URI:DIR2:6bdmeitystckbl9yqlw7g56f4e:serp5ioqxnh34mlbmzwvkp3odehsyrr7eytt5f64we3k9hhcrcja - -Future versions of Tahoe-LAFS may support using client public keys for SFTP. -The words "ssh-rsa" and "ssh-dsa" after the username are reserved to specify -the public key format, so users cannot have a password equal to either of -these strings. - -Now add an 'accounts.file' directive to your tahoe.cfg file, as described -in the next sections.
- -Configuring FTP Access -====================== - -To enable the FTP server with an accounts file, add the following lines to -the BASEDIR/tahoe.cfg file:: - - [ftpd] - enabled = true - port = tcp:8021:interface=127.0.0.1 - accounts.file = private/ftp.accounts - -The FTP server will listen on the given port number and on the loopback -interface only. The "accounts.file" pathname will be interpreted -relative to the node's BASEDIR. - -To enable the FTP server with an account server instead, provide the URL of -that server in an "accounts.url" directive:: - - [ftpd] - enabled = true - port = tcp:8021:interface=127.0.0.1 - accounts.url = https://example.com/login - -You can provide both accounts.file and accounts.url, although it probably -isn't very useful except for testing. - -FTP provides no security, and so your password or caps could be eavesdropped -if you connect to the FTP server remotely. The examples above include -":interface=127.0.0.1" in the "port" option, which causes the server to only -accept connections from localhost. - -Configuring SFTP Access -======================= - -The Tahoe-LAFS SFTP server requires a host keypair, just like the regular SSH -server. It is important to give each server a distinct keypair, to prevent -one server from masquerading as a different one. The first time a client -program talks to a given server, it will store the host key it receives, and -will complain if a subsequent connection uses a different key. This reduces -the opportunity for man-in-the-middle attacks to just the first connection. - -Exercise caution when connecting to the SFTP server remotely. The AES -implementation used by the SFTP code does not have defenses against timing -attacks. The code for encrypting the SFTP connection was not written by the -Tahoe-LAFS team, and we have not reviewed it as carefully as we have reviewed -the code for encrypting files and directories in Tahoe-LAFS itself.
If you -can connect to the SFTP server (which is provided by the Tahoe-LAFS gateway) -only from a client on the same host, then you would be safe from any problem -with the SFTP connection security. The examples given below enforce this -policy by including ":interface=127.0.0.1" in the "port" option, which -causes the server to only accept connections from localhost. - -You will use directives in the tahoe.cfg file to tell the SFTP code where to -find these keys. To create one, use the ``ssh-keygen`` tool (which comes with -the standard openssh client distribution):: - - % cd BASEDIR - % ssh-keygen -f private/ssh_host_rsa_key - -The server private key file must not have a passphrase. - -Then, to enable the SFTP server with an accounts file, add the following -lines to the BASEDIR/tahoe.cfg file:: - - [sftpd] - enabled = true - port = tcp:8022:interface=127.0.0.1 - host_pubkey_file = private/ssh_host_rsa_key.pub - host_privkey_file = private/ssh_host_rsa_key - accounts.file = private/ftp.accounts - -The SFTP server will listen on the given port number and on the loopback -interface only. The "accounts.file" pathname will be interpreted -relative to the node's BASEDIR. - -Or, to use an account server instead, do this:: - - [sftpd] - enabled = true - port = tcp:8022:interface=127.0.0.1 - host_pubkey_file = private/ssh_host_rsa_key.pub - host_privkey_file = private/ssh_host_rsa_key - accounts.url = https://example.com/login - -You can provide both accounts.file and accounts.url, although it probably -isn't very useful except for testing. - -For further information on SFTP compatibility and known issues with various -clients and with the sshfs filesystem, see -http://tahoe-lafs.org/trac/tahoe-lafs/wiki/SftpFrontend . - -Dependencies -============ - -The Tahoe-LAFS SFTP server requires the Twisted "Conch" component (a "conch" is -a twisted shell, get it?). Many Linux distributions package the Conch code -separately: debian puts it in the "python-twisted-conch" package. 
Conch -requires the "pycrypto" package, which is a Python+C implementation of many -cryptographic functions (the debian package is named "python-crypto"). - -Note that "pycrypto" is different than the "pycryptopp" package that Tahoe-LAFS -uses (which is a Python wrapper around the C++ -based Crypto++ library, a -library that is frequently installed as /usr/lib/libcryptopp.a, to avoid -problems with non-alphanumerics in filenames). - -The FTP server requires code in Twisted that enables asynchronous closing of -file-upload operations. This code was landed to Twisted's SVN trunk in r28453 -on 23-Feb-2010, slightly too late for the Twisted-10.0 release, but it should -be present in the next release after that. To use Tahoe-LAFS's FTP server with -Twisted-10.0 or earlier, you will need to apply the patch attached to -http://twistedmatrix.com/trac/ticket/3462 . The Tahoe-LAFS node will refuse to -start the FTP server unless it detects the necessary support code in Twisted. -This patch is not needed for SFTP. - -Immutable and Mutable Files -=========================== - -All files created via SFTP (and FTP) are immutable files. However, files -can only be created in writeable directories, which allows the directory -entry to be relinked to a different file. Normally, when the path of an -immutable file is opened for writing by SFTP, the directory entry is -relinked to another file with the newly written contents when the file -handle is closed. The old file is still present on the grid, and any other -caps to it will remain valid. (See docs/garbage-collection.txt for how to -reclaim the space used by files that are no longer needed.) - -The 'no-write' metadata field of a directory entry can override this -behaviour. If the 'no-write' field holds a true value, then a permission -error will occur when trying to write to the file, even if it is in a -writeable directory. This does not prevent the directory entry from being -unlinked or replaced. 
- -When using sshfs, the 'no-write' field can be set by clearing the 'w' -bits in the Unix permissions, for example using the command -'chmod 444 path/to/file'. Note that this does not mean that arbitrary -combinations of Unix permissions are supported. If the 'w' bits are -cleared on a link to a mutable file or directory, that link will become -read-only. - -If SFTP is used to write to an existing mutable file, it will publish a -new version when the file handle is closed. - -Known Issues -============ - -Mutable files are not supported by the FTP frontend (`ticket #680 -`_). Currently, a directory -containing mutable files cannot even be listed over FTP. - -The FTP frontend sometimes fails to report errors, for example if an upload -fails because it does not meet the "servers of happiness" threshold (`ticket #1081 -`_). Upload errors also may not -be reported when writing files using SFTP via sshfs (`ticket #1059 -`_). - -Non-ASCII filenames are not supported by FTP (`ticket #682 -`_). They can be used -with SFTP only if the client encodes filenames as UTF-8 (`ticket #1089 -`_). - -The gateway node may incur a memory leak when accessing many files via SFTP -(`ticket #1045 `_). - -For other known issues in SFTP, see -. diff --git a/docs/frontends/download-status.rst b/docs/frontends/download-status.rst new file mode 100644 index 00000000..315b6a3b --- /dev/null +++ b/docs/frontends/download-status.rst @@ -0,0 +1,135 @@ +=============== +Download status +=============== + + +Introduction +============ + +The WUI will display the "status" of uploads and downloads. + +The Welcome Page has a link entitled "Recent Uploads and Downloads" +which goes to this URL: + +http://$GATEWAY/status + +Each entry in the list of recent operations has a "status" link which +will take you to a page describing that operation. + +For immutable downloads, the page has a lot of information, and this +document is to explain what it all means.
It was written by Brian +Warner, who wrote the v1.8.0 downloader code and the code which +generates this status report about the v1.8.0 downloader's +behavior. Brian posted it to the trac: +http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1169#comment:1 + +Then Zooko lightly edited it while copying it into the docs/ +directory. + +What's involved in a download? +============================== + +Downloads are triggered by read() calls, each with a starting offset (defaults +to 0) and a length (defaults to the whole file). A regular webapi GET request +will result in a whole-file read() call. + +Each read() call turns into an ordered sequence of get_segment() calls. A +whole-file read will fetch all segments, in order, but partial reads or +multiple simultaneous reads will result in random-access of segments. Segment +reads always return ciphertext: the layer above that (in read()) is responsible +for decryption. + +Before we can satisfy any segment reads, we need to find some shares. ("DYHB" +is an abbreviation for "Do You Have Block", and is the message we send to +storage servers to ask them if they have any shares for us. The name is +historical, from Mojo Nation/Mnet/Mountain View, but nicely distinctive. +Tahoe-LAFS's actual message name is remote_get_buckets().) Responses come +back eventually, or don't. + +Once we get enough positive DYHB responses, we have enough shares to start +downloading. We send "block requests" for various pieces of the share. +Responses come back eventually, or don't. + +When we get enough block-request responses for a given segment, we can decode +the data and satisfy the segment read. + +When the segment read completes, some or all of the segment data is used to +satisfy the read() call (if the read call started or ended in the middle of a +segment, we'll only use part of the data, otherwise we'll use all of it).
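The relationship between a read() call and the segments it needs can be illustrated with a little arithmetic. The following is a minimal sketch, assuming fixed-size segments and a read range that has already been clamped to the file size; the function names and the ``segment_size`` parameter are hypothetical and are not taken from the downloader code.

```python
def segments_for_read(offset, length, segment_size):
    """Return the segment numbers, in order, that cover a read range.

    Assumes every segment holds segment_size bytes and that the caller
    has already clamped (offset, length) to the size of the file.
    """
    if length <= 0:
        return []
    first = offset // segment_size
    last = (offset + length - 1) // segment_size
    return list(range(first, last + 1))


def used_slice(segnum, offset, length, segment_size):
    """Return the (start, end) byte range of one segment that the read uses."""
    seg_start = segnum * segment_size
    start = max(offset, seg_start) - seg_start
    end = min(offset + length, seg_start + segment_size) - seg_start
    return (start, end)


# A 10-byte read from offset 0 with 4-byte segments touches segments 0..2,
# and only the first two bytes of the final segment are used.
whole = segments_for_read(0, 10, 4)
tail = used_slice(2, 0, 10, 4)
```

A read that starts and ends inside a single segment fetches that whole segment but uses only part of it, which is the partial-use case described in the last paragraph above.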
+ +Data on the download-status page +================================ + +DYHB Requests +------------- + +This shows every Do-You-Have-Block query sent to storage servers and their +results. Each line shows the following: + +* the serverid to which the request was sent +* the time at which the request was sent. Note that all timestamps are + relative to the start of the first read() call and indicated with a "+" sign +* the time at which the response was received (if ever) +* the share numbers that the server has, if any +* the elapsed time taken by the request + +Also, each line is colored according to the serverid. This color is also used +in the "Requests" section below. + +Read Events +----------- + +This shows all the FileNode read() calls and their overall results. Each line +shows: + +* the range of the file that was requested (as [OFFSET:+LENGTH]). A whole-file + GET will start at 0 and read the entire file. +* the time at which the read() was made +* the time at which the request finished, either because the last byte of data + was returned to the read() caller, or because they cancelled the read by + calling stopProducing (i.e. closing the HTTP connection) +* the number of bytes returned to the caller so far +* the time spent on the read, so far +* the total time spent in AES decryption +* total time spent paused by the client (pauseProducing), generally because the + HTTP connection filled up, which most streaming media players will do to + limit how much data they have to buffer +* effective speed of the read(), not including paused time + +Segment Events +-------------- + +This shows each get_segment() call and its resolution. This table is not well +organized, and my post-1.8.0 work will clean it up a lot. In its present form, +it records "request" and "delivery" events separately, indicated by the "type" +column. + +Each request shows the segment number being requested and the time at which the +get_segment() call was made. 
+ +Each delivery shows: + +* segment number +* range of file data (as [OFFSET:+SIZE]) delivered +* elapsed time spent doing ZFEC decoding +* overall elapsed time fetching the segment +* effective speed of the segment fetch + +Requests +-------- + +This shows every block-request sent to the storage servers. Each line shows: + +* the server to which the request was sent +* which share number it is referencing +* the portion of the share data being requested (as [OFFSET:+SIZE]) +* the time the request was sent +* the time the response was received (if ever) +* the amount of data that was received (which might be less than SIZE if we + tried to read off the end of the share) +* the elapsed time for the request (RTT=Round-Trip-Time) + +Also note that each Request line is colored according to the serverid it was +sent to. And all timestamps are shown relative to the start of the first +read() call: for example the first DYHB message was sent at +0.001393s about +1.4 milliseconds after the read() call started everything off. diff --git a/docs/frontends/download-status.txt b/docs/frontends/download-status.txt deleted file mode 100644 index 315b6a3b..00000000 --- a/docs/frontends/download-status.txt +++ /dev/null @@ -1,135 +0,0 @@ -=============== -Download status -=============== - - -Introduction -============ - -The WUI will display the "status" of uploads and downloads. - -The Welcome Page has a link entitled "Recent Uploads and Downloads" -which goes to this URL: - -http://$GATEWAY/status - -Each entry in the list of recent operations has a "status" link which -will take you to a page describing that operation. - -For immutable downloads, the page has a lot of information, and this -document is to explain what it all means. It was written by Brian -Warner, who wrote the v1.8.0 downloader code and the code which -generates this status report about the v1.8.0 downloader's -behavior. 
Brian posted it to the trac: -http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1169#comment:1 - -Then Zooko lightly edited it while copying it into the docs/ -directory. - -What's involved in a download? -============================== - -Downloads are triggered by read() calls, each with a starting offset (defaults -to 0) and a length (defaults to the whole file). A regular webapi GET request -will result in a whole-file read() call. - -Each read() call turns into an ordered sequence of get_segment() calls. A -whole-file read will fetch all segments, in order, but partial reads or -multiple simultaneous reads will result in random-access of segments. Segment -reads always return ciphertext: the layer above that (in read()) is responsible -for decryption. - -Before we can satisfy any segment reads, we need to find some shares. ("DYHB" -is an abbreviation for "Do You Have Block", and is the message we send to -storage servers to ask them if they have any shares for us. The name is -historical, from Mojo Nation/Mnet/Mountain View, but nicely distinctive. -Tahoe-LAFS's actual message name is remote_get_buckets().). Responses come -back eventually, or don't. - -Once we get enough positive DYHB responses, we have enough shares to start -downloading. We send "block requests" for various pieces of the share. -Responses come back eventually, or don't. - -When we get enough block-request responses for a given segment, we can decode -the data and satisfy the segment read. - -When the segment read completes, some or all of the segment data is used to -satisfy the read() call (if the read call started or ended in the middle of a -segment, we'll only use part of the data, otherwise we'll use all of it). - -Data on the download-status page -================================ - -DYHB Requests -------------- - -This shows every Do-You-Have-Block query sent to storage servers and their -results. 
Each line shows the following: - -* the serverid to which the request was sent -* the time at which the request was sent. Note that all timestamps are - relative to the start of the first read() call and indicated with a "+" sign -* the time at which the response was received (if ever) -* the share numbers that the server has, if any -* the elapsed time taken by the request - -Also, each line is colored according to the serverid. This color is also used -in the "Requests" section below. - -Read Events ------------ - -This shows all the FileNode read() calls and their overall results. Each line -shows: - -* the range of the file that was requested (as [OFFSET:+LENGTH]). A whole-file - GET will start at 0 and read the entire file. -* the time at which the read() was made -* the time at which the request finished, either because the last byte of data - was returned to the read() caller, or because they cancelled the read by - calling stopProducing (i.e. closing the HTTP connection) -* the number of bytes returned to the caller so far -* the time spent on the read, so far -* the total time spent in AES decryption -* total time spend paused by the client (pauseProducing), generally because the - HTTP connection filled up, which most streaming media players will do to - limit how much data they have to buffer -* effective speed of the read(), not including paused time - -Segment Events --------------- - -This shows each get_segment() call and its resolution. This table is not well -organized, and my post-1.8.0 work will clean it up a lot. In its present form, -it records "request" and "delivery" events separately, indicated by the "type" -column. - -Each request shows the segment number being requested and the time at which the -get_segment() call was made. 
- -Each delivery shows: - -* segment number -* range of file data (as [OFFSET:+SIZE]) delivered -* elapsed time spent doing ZFEC decoding -* overall elapsed time fetching the segment -* effective speed of the segment fetch - -Requests --------- - -This shows every block-request sent to the storage servers. Each line shows: - -* the server to which the request was sent -* which share number it is referencing -* the portion of the share data being requested (as [OFFSET:+SIZE]) -* the time the request was sent -* the time the response was received (if ever) -* the amount of data that was received (which might be less than SIZE if we - tried to read off the end of the share) -* the elapsed time for the request (RTT=Round-Trip-Time) - -Also note that each Request line is colored according to the serverid it was -sent to. And all timestamps are shown relative to the start of the first -read() call: for example the first DYHB message was sent at +0.001393s about -1.4 milliseconds after the read() call started everything off. diff --git a/docs/frontends/webapi.rst b/docs/frontends/webapi.rst new file mode 100644 index 00000000..31924bcc --- /dev/null +++ b/docs/frontends/webapi.rst @@ -0,0 +1,1963 @@ +========================== +The Tahoe REST-ful Web API +========================== + +1. `Enabling the web-API port`_ +2. `Basic Concepts: GET, PUT, DELETE, POST`_ +3. `URLs`_ + + 1. `Child Lookup`_ + +4. `Slow Operations, Progress, and Cancelling`_ +5. `Programmatic Operations`_ + + 1. `Reading a file`_ + 2. `Writing/Uploading a File`_ + 3. `Creating a New Directory`_ + 4. `Get Information About A File Or Directory (as JSON)`_ + 5. `Attaching an existing File or Directory by its read- or write-cap`_ + 6. `Adding multiple files or directories to a parent directory at once`_ + 7. `Deleting a File or Directory`_ + +6. `Browser Operations: Human-Oriented Interfaces`_ + + 1. `Viewing A Directory (as HTML)`_ + 2. `Viewing/Downloading a File`_ + 3. 
`Get Information About A File Or Directory (as HTML)`_ + 4. `Creating a Directory`_ + 5. `Uploading a File`_ + 6. `Attaching An Existing File Or Directory (by URI)`_ + 7. `Deleting A Child`_ + 8. `Renaming A Child`_ + 9. `Other Utilities`_ + 10. `Debugging and Testing Features`_ + +7. `Other Useful Pages`_ +8. `Static Files in /public_html`_ +9. `Safety and security issues -- names vs. URIs`_ +10. `Concurrency Issues`_ + +Enabling the web-API port +========================= + +Every Tahoe node is capable of running a built-in HTTP server. To enable +this, just write a port number into the "[node]web.port" line of your node's +tahoe.cfg file. For example, writing "web.port = 3456" into the "[node]" +section of $NODEDIR/tahoe.cfg will cause the node to run a webserver on port +3456. + +This string is actually a Twisted "strports" specification, meaning you can +get more control over the interface to which the server binds by supplying +additional arguments. For more details, see the documentation on +`twisted.application.strports +`_. + +Writing "tcp:3456:interface=127.0.0.1" into the web.port line does the same +but binds to the loopback interface, ensuring that only the programs on the +local host can connect. Using "ssl:3456:privateKey=mykey.pem:certKey=cert.pem" +runs an SSL server. + +This webport can be set when the node is created by passing a --webport +option to the 'tahoe create-node' command. By default, the node listens on +port 3456, on the loopback (127.0.0.1) interface. + +Basic Concepts: GET, PUT, DELETE, POST +====================================== + +As described in `architecture.rst`_, each file and directory in a Tahoe virtual +filesystem is referenced by an identifier that combines the designation of +the object with the authority to do something with it (such as read or modify +the contents). This identifier is called a "read-cap" or "write-cap", +depending upon whether it enables read-only or read-write access. 
These +"caps" are also referred to as URIs. + +.. _architecture.rst: http://tahoe-lafs.org/source/tahoe-lafs/trunk/docs/architecture.rst + +The Tahoe web-based API is "REST-ful", meaning it implements the concepts of +"REpresentational State Transfer": the original scheme by which the World +Wide Web was intended to work. Each object (file or directory) is referenced +by a URL that includes the read- or write- cap. HTTP methods (GET, PUT, and +DELETE) are used to manipulate these objects. You can think of the URL as a +noun, and the method as a verb. + +In REST, the GET method is used to retrieve information about an object, or +to retrieve some representation of the object itself. When the object is a +file, the basic GET method will simply return the contents of that file. +Other variations (generally implemented by adding query parameters to the +URL) will return information about the object, such as metadata. GET +operations are required to have no side-effects. + +PUT is used to upload new objects into the filesystem, or to replace an +existing object. DELETE is used to delete objects from the filesystem. Both +PUT and DELETE are required to be idempotent: performing the same operation +multiple times must have the same side-effects as only performing it once. + +POST is used for more complicated actions that cannot be expressed as a GET, +PUT, or DELETE. POST operations can be thought of as a method call: sending +some message to the object referenced by the URL. In Tahoe, POST is also used +for operations that must be triggered by an HTML form (including upload and +delete), because otherwise a regular web browser has no way to accomplish +these tasks. In general, everything that can be done with a PUT or DELETE can +also be done with a POST. + +Tahoe's web API is designed for two different kinds of consumer. The first is +a program that needs to manipulate the virtual file system. Such programs are +expected to use the RESTful interface described above. 
The second is a human +using a standard web browser to work with the filesystem. This user is given +a series of HTML pages with links to download files, and forms that use POST +actions to upload, rename, and delete files. + +When an error occurs, the HTTP response code will be set to an appropriate +400-series code (like 404 Not Found for an unknown childname, or 400 Bad Request +when the parameters to a webapi operation are invalid), and the HTTP response +body will usually contain a few lines of explanation as to the cause of the +error and possible responses. Unusual exceptions may result in a 500 Internal +Server Error as a catch-all, with a default response body containing +a Nevow-generated HTML-ized representation of the Python exception stack trace +that caused the problem. CLI programs which want to copy the response body to +stderr should provide an "Accept: text/plain" header to their requests to get +a plain text stack trace instead. If the Accept header contains ``*/*``, or +``text/*``, or text/html (or if there is no Accept header), HTML tracebacks will +be generated. + +URLs +==== + +Tahoe uses a variety of read- and write- caps to identify files and +directories. The most common of these is the "immutable file read-cap", which +is used for most uploaded files. These read-caps look like the following:: + + URI:CHK:ime6pvkaxuetdfah2p2f35pe54:4btz54xk3tew6nd4y2ojpxj4m6wxjqqlwnztgre6gnjgtucd5r4a:3:10:202 + +The next most common is a "directory write-cap", which provides both read and +write access to a directory, and look like this:: + + URI:DIR2:djrdkfawoqihigoett4g6auz6a:jx5mplfpwexnoqff7y5e4zjus4lidm76dcuarpct7cckorh2dpgq + +There are also "directory read-caps", which start with "URI:DIR2-RO:", and +give read-only access to a directory. Finally there are also mutable file +read- and write- caps, which start with "URI:SSK", and give access to mutable +files. 
+ +(Later versions of Tahoe will make these strings shorter, and will remove the +unfortunate colons, which must be escaped when these caps are embedded in +URLs.) + +To refer to any Tahoe object through the web API, you simply need to combine +a prefix (which indicates the HTTP server to use) with the cap (which +indicates which object inside that server to access). Since the default Tahoe +webport is 3456, the most common prefix is one that will use a local node +listening on this port:: + + http://127.0.0.1:3456/uri/ + $CAP + +So, to access the directory named above (which happens to be the +publicly-writeable sample directory on the Tahoe test grid, described at +http://allmydata.org/trac/tahoe/wiki/TestGrid), the URL would be:: + + http://127.0.0.1:3456/uri/URI%3ADIR2%3Adjrdkfawoqihigoett4g6auz6a%3Ajx5mplfpwexnoqff7y5e4zjus4lidm76dcuarpct7cckorh2dpgq/ + +(note that the colons in the directory-cap are url-encoded into "%3A" +sequences). + +Likewise, to access the file named above, use:: + + http://127.0.0.1:3456/uri/URI%3ACHK%3Aime6pvkaxuetdfah2p2f35pe54%3A4btz54xk3tew6nd4y2ojpxj4m6wxjqqlwnztgre6gnjgtucd5r4a%3A3%3A10%3A202 + +In the rest of this document, we'll use "$DIRCAP" as shorthand for a read-cap +or write-cap that refers to a directory, and "$FILECAP" to abbreviate a cap +that refers to a file (whether mutable or immutable). So those URLs above can +be abbreviated as:: + + http://127.0.0.1:3456/uri/$DIRCAP/ + http://127.0.0.1:3456/uri/$FILECAP + +The operation summaries below will abbreviate these further, by eliding the +server prefix. They will be displayed like this:: + + /uri/$DIRCAP/ + /uri/$FILECAP + + +Child Lookup +------------ + +Tahoe directories contain named child entries, just like directories in a regular +local filesystem. These child entries, called "dirnodes", consist of a name, +metadata, a write slot, and a read slot. 
The write and read slots normally contain +a write-cap and read-cap referring to the same object, which can be either a file +or a subdirectory. The write slot may be empty (actually, both may be empty, +but that is unusual). + +If you have a Tahoe URL that refers to a directory, and want to reference a +named child inside it, just append the child name to the URL. For example, if +our sample directory contains a file named "welcome.txt", we can refer to +that file with:: + + http://127.0.0.1:3456/uri/$DIRCAP/welcome.txt + +(or http://127.0.0.1:3456/uri/URI%3ADIR2%3Adjrdkfawoqihigoett4g6auz6a%3Ajx5mplfpwexnoqff7y5e4zjus4lidm76dcuarpct7cckorh2dpgq/welcome.txt) + +Multiple levels of subdirectories can be handled this way:: + + http://127.0.0.1:3456/uri/$DIRCAP/tahoe-source/docs/webapi.txt + +In this document, when we need to refer to a URL that references a file using +this child-of-some-directory format, we'll use the following string:: + + /uri/$DIRCAP/[SUBDIRS../]FILENAME + +The "[SUBDIRS../]" part means that there are zero or more (optional) +subdirectory names in the middle of the URL. The "FILENAME" at the end means +that this whole URL refers to a file of some sort, rather than to a +directory. + +When we need to refer specifically to a directory in this way, we'll write:: + + /uri/$DIRCAP/[SUBDIRS../]SUBDIR + + +Note that all components of pathnames in URLs are required to be UTF-8 +encoded, so "resume.doc" (with an acute accent on both E's) would be accessed +with:: + + http://127.0.0.1:3456/uri/$DIRCAP/r%C3%A9sum%C3%A9.doc + +Also note that the filenames inside upload POST forms are interpreted using +whatever character set was provided in the conventional '_charset' field, and +defaults to UTF-8 if not otherwise specified. The JSON representation of each +directory contains native unicode strings. Tahoe directories are specified to +contain unicode filenames, and cannot contain binary strings that are not +representable as such. 
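The URL rules above (escape the colons in the cap, UTF-8 encode each path component) can be sketched with the Python standard library. The helper name `tahoe_url` is hypothetical and only illustrates the encoding; it is not part of Tahoe.

```python
from urllib.parse import quote


def tahoe_url(prefix, cap, path=()):
    """Join a server prefix, a cap, and optional child names into a URL.

    Every component is percent-encoded with no "safe" characters, so the
    colons in the cap become %3A and non-ASCII names are encoded as UTF-8.
    """
    parts = [quote(cap, safe="")]
    parts.extend(quote(name, safe="") for name in path)
    return prefix.rstrip("/") + "/uri/" + "/".join(parts)


dircap = ("URI:DIR2:djrdkfawoqihigoett4g6auz6a:"
          "jx5mplfpwexnoqff7y5e4zjus4lidm76dcuarpct7cckorh2dpgq")
welcome = tahoe_url("http://127.0.0.1:3456", dircap, ["welcome.txt"])
resume = tahoe_url("http://127.0.0.1:3456", dircap, ["r\u00e9sum\u00e9.doc"])
```

Passing `safe=""` matters here: the default for `quote` leaves "/" unescaped, which would let a child name containing a slash change the path structure.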
+ +All Tahoe operations that refer to existing files or directories must include +a suitable read- or write- cap in the URL: the webapi server won't add one +for you. If you don't know the cap, you can't access the file. This allows +the security properties of Tahoe caps to be extended across the webapi +interface. + +Slow Operations, Progress, and Cancelling +========================================= + +Certain operations can be expected to take a long time. The "t=deep-check", +described below, will recursively visit every file and directory reachable +from a given starting point, which can take minutes or even hours for +extremely large directory structures. A single long-running HTTP request is a +fragile thing: proxies, NAT boxes, browsers, and users may all grow impatient +with waiting and give up on the connection. + +For this reason, long-running operations have an "operation handle", which +can be used to poll for status/progress messages while the operation +proceeds. This handle can also be used to cancel the operation. These handles +are created by the client, and passed in as an "ophandle=" query argument +to the POST or PUT request which starts the operation. The following +operations can then be used to retrieve status: + +``GET /operations/$HANDLE?output=HTML (with or without t=status)`` + +``GET /operations/$HANDLE?output=JSON (same)`` + + These two retrieve the current status of the given operation. Each operation + presents a different sort of information, but in general the page retrieved + will indicate: + + * whether the operation is complete, or if it is still running + * how much of the operation is complete, and how much is left, if possible + + Note that the final status output can be quite large: a deep-manifest of a + directory structure with 300k directories and 200k unique files is about + 275MB of JSON, and might take two minutes to generate. For this reason, the + full status is not provided until the operation has completed. 
+ + The HTML form will include a meta-refresh tag, which will cause a regular + web browser to reload the status page about 60 seconds later. This tag will + be removed once the operation has completed. + + There may be more status information available under + /operations/$HANDLE/$ETC : i.e., the handle forms the root of a URL space. + +``POST /operations/$HANDLE?t=cancel`` + + This terminates the operation, and returns an HTML page explaining what was + cancelled. If the operation handle has already expired (see below), this + POST will return a 404, which indicates that the operation is no longer + running (either it was completed or terminated). The response body will be + the same as a GET /operations/$HANDLE on this operation handle, and the + handle will be expired immediately afterwards. + +The operation handle will eventually expire, to avoid consuming an unbounded +amount of memory. The handle's time-to-live can be reset at any time, by +passing a retain-for= argument (with a count of seconds) to either the +initial POST that starts the operation, or the subsequent GET request which +asks about the operation. For example, if a 'GET +/operations/$HANDLE?output=JSON&retain-for=600' query is performed, the +handle will remain active for 600 seconds (10 minutes) after the GET was +received. + +In addition, if the GET includes a release-after-complete=True argument, and +the operation has completed, the operation handle will be released +immediately. + +If a retain-for= argument is not used, the default handle lifetimes are: + + * handles will remain valid at least until their operation finishes + * uncollected handles for finished operations (i.e. handles for + operations that have finished but for which the GET page has not been + accessed since completion) will remain valid for four days, or for + the total time consumed by the operation, whichever is greater. + * collected handles (i.e. 
the GET page has been retrieved at least once + since the operation completed) will remain valid for one day. + +Many "slow" operations can begin to use unacceptable amounts of memory when +operating on large directory structures. The memory usage increases when the +ophandle is polled, as the results must be copied into a JSON string, sent +over the wire, then parsed by a client. So, as an alternative, many "slow" +operations have streaming equivalents. These equivalents do not use operation +handles. Instead, they emit line-oriented status results immediately. Client +code can cancel the operation by simply closing the HTTP connection. + +Programmatic Operations +======================= + +Now that we know how to build URLs that refer to files and directories in a +Tahoe virtual filesystem, what sorts of operations can we do with those URLs? +This section contains a catalog of GET, PUT, DELETE, and POST operations that +can be performed on these URLs. This set of operations is aimed at programs +that use HTTP to communicate with a Tahoe node. A later section describes +operations that are intended for web browsers. + +Reading A File +-------------- + +``GET /uri/$FILECAP`` + +``GET /uri/$DIRCAP/[SUBDIRS../]FILENAME`` + + This will retrieve the contents of the given file. The HTTP response body + will contain the sequence of bytes that make up the file. + + To view files in a web browser, you may want more control over the + Content-Type and Content-Disposition headers. Please see the next section + "Browser Operations", for details on how to modify these URLs for that + purpose. + +Writing/Uploading A File +------------------------ + +``PUT /uri/$FILECAP`` + +``PUT /uri/$DIRCAP/[SUBDIRS../]FILENAME`` + + Upload a file, using the data from the HTTP request body, and add whatever + child links and subdirectories are necessary to make the file available at + the given location. 
Once this operation succeeds, a GET on the same URL will + retrieve the same contents that were just uploaded. This will create any + necessary intermediate subdirectories. + + To use the /uri/$FILECAP form, $FILECAP must be a write-cap for a mutable file. + + In the /uri/$DIRCAP/[SUBDIRS../]FILENAME form, if the target file is a + writeable mutable file, that file's contents will be overwritten in-place. If + it is a read-cap for a mutable file, an error will occur. If it is an + immutable file, the old file will be discarded, and a new one will be put in + its place. + + When creating a new file, if "mutable=true" is in the query arguments, the + operation will create a mutable file instead of an immutable one. + + This returns the file-cap of the resulting file. If a new file was created + by this method, the HTTP response code (as dictated by rfc2616) will be set + to 201 CREATED. If an existing file was replaced or modified, the response + code will be 200 OK. + + Note that the 'curl -T localfile http://127.0.0.1:3456/uri/$DIRCAP/foo.txt' + command can be used to invoke this operation. + +``PUT /uri`` + + This uploads a file, and produces a file-cap for the contents, but does not + attach the file into the filesystem. No directories will be modified by + this operation. The file-cap is returned as the body of the HTTP response. + + If "mutable=true" is in the query arguments, the operation will create a + mutable file, and return its write-cap in the HTTP response. The default is + to create an immutable file, returning the read-cap as a response. + +Creating A New Directory +------------------------ + +``POST /uri?t=mkdir`` + +``PUT /uri?t=mkdir`` + + Create a new empty directory and return its write-cap as the HTTP response + body. This does not make the newly created directory visible from the + filesystem. The "PUT" operation is provided for backwards compatibility: + new code should use POST. 
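The upload and unlinked-mkdir requests described above might be prepared from Python roughly as follows. This is a sketch using only the standard library's `urllib`; the helper names are hypothetical, and nothing is sent until a request object is passed to `urllib.request.urlopen()` against a running node.

```python
import urllib.request
from urllib.parse import quote


def upload_request(base, dircap, path, data, mutable=False):
    """PUT /uri/$DIRCAP/[SUBDIRS../]FILENAME with the file contents as body."""
    parts = [quote(dircap, safe="")] + [quote(p, safe="") for p in path]
    url = base + "/uri/" + "/".join(parts)
    if mutable:
        url += "?mutable=true"  # only meaningful when a new file is created
    return urllib.request.Request(url, data=data, method="PUT")


def unlinked_upload_request(base, data, mutable=False):
    """PUT /uri: upload without attaching the file to any directory."""
    url = base + "/uri" + ("?mutable=true" if mutable else "")
    return urllib.request.Request(url, data=data, method="PUT")


def unlinked_mkdir_request(base):
    """POST /uri?t=mkdir: create an empty directory attached to nothing."""
    return urllib.request.Request(base + "/uri?t=mkdir", data=b"", method="POST")
```

Per the descriptions above, the response body of each of these carries the resulting cap, and for the directory-relative PUT a 201 status indicates a newly created file while 200 indicates a replaced or modified one.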
+ +``POST /uri?t=mkdir-with-children`` + + Create a new directory, populated with a set of child nodes, and return its + write-cap as the HTTP response body. The new directory is not attached to + any other directory: the returned write-cap is the only reference to it. + + Initial children are provided as the body of the POST form (this is more + efficient than doing separate mkdir and set_children operations). If the + body is empty, the new directory will be empty. If not empty, the body will + be interpreted as a UTF-8 JSON-encoded dictionary of children with which the + new directory should be populated, using the same format as would be + returned in the 'children' value of the t=json GET request, described below. + Each dictionary key should be a child name, and each value should be a list + of [TYPE, PROPDICT], where PROPDICT contains "rw_uri", "ro_uri", and + "metadata" keys (all others are ignored). For example, the POST request body + could be:: + + { + "Fran\u00e7ais": [ "filenode", { + "ro_uri": "URI:CHK:...", + "size": bytes, + "metadata": { + "ctime": 1202777696.7564139, + "mtime": 1202777696.7564139, + "tahoe": { + "linkcrtime": 1202777696.7564139, + "linkmotime": 1202777696.7564139 + } } } ], + "subdir": [ "dirnode", { + "rw_uri": "URI:DIR2:...", + "ro_uri": "URI:DIR2-RO:...", + "metadata": { + "ctime": 1202778102.7589991, + "mtime": 1202778111.2160511, + "tahoe": { + "linkcrtime": 1202777696.7564139, + "linkmotime": 1202777696.7564139 + } } } ] + } + + For forward-compatibility, a mutable directory can also contain caps in + a format that is unknown to the webapi server. When such caps are retrieved + from a mutable directory in a "ro_uri" field, they will be prefixed with + the string "ro.", indicating that they must not be decoded without + checking that they are read-only. The "ro." prefix must not be stripped + off without performing this check. (Future versions of the webapi server + will perform it where necessary.) 
+ + If both the "rw_uri" and "ro_uri" fields are present in a given PROPDICT, + and the webapi server recognizes the rw_uri as a write cap, then it will + reset the ro_uri to the corresponding read cap and discard the original + contents of ro_uri (in order to ensure that the two caps correspond to the + same object and that the ro_uri is in fact read-only). However this may not + happen for caps in a format unknown to the webapi server. Therefore, when + writing a directory the webapi client should ensure that the contents + of "rw_uri" and "ro_uri" for a given PROPDICT are a consistent + (write cap, read cap) pair if possible. If the webapi client only has + one cap and does not know whether it is a write cap or read cap, then + it is acceptable to set "rw_uri" to that cap and omit "ro_uri". The + client must not put a write cap into a "ro_uri" field. + + The metadata may have a "no-write" field. If this is set to true in the + metadata of a link, it will not be possible to open that link for writing + via the SFTP frontend; see `FTP-and-SFTP.rst`_ for details. + Also, if the "no-write" field is set to true in the metadata of a link to + a mutable child, it will cause the link to be diminished to read-only. + + .. _FTP-and-SFTP.rst: http://tahoe-lafs.org/source/tahoe-lafs/trunk/docs/frontents/FTP-and-SFTP.rst + + Note that the webapi-using client application must not provide the + "Content-Type: multipart/form-data" header that usually accompanies HTML + form submissions, since the body is not formatted this way. Doing so will + cause a server error as the lower-level code misparses the request body. + + Child file names should each be expressed as a unicode string, then used as + keys of the dictionary. The dictionary should then be converted into JSON, + and the resulting string encoded into UTF-8. This UTF-8 bytestring should + then be used as the POST body. 
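The encoding steps for the t=mkdir-with-children body (unicode child names as keys, JSON conversion, then UTF-8) can be sketched as below. The helper names `child_entry` and `children_body` are hypothetical, and the cap strings are the same elided placeholders used in the example above.

```python
import json


def child_entry(node_type, ro_uri=None, rw_uri=None, metadata=None):
    """Build one [TYPE, PROPDICT] value for the children dictionary."""
    propdict = {"metadata": metadata or {}}
    # Only include the caps the caller actually has; a write cap must
    # never be placed in the "ro_uri" field.
    if rw_uri is not None:
        propdict["rw_uri"] = rw_uri
    if ro_uri is not None:
        propdict["ro_uri"] = ro_uri
    return [node_type, propdict]


def children_body(children):
    """Serialize {unicode child name: [TYPE, PROPDICT]} to a UTF-8 JSON body."""
    return json.dumps(children).encode("utf-8")


children = {
    "Fran\u00e7ais": child_entry("filenode", ro_uri="URI:CHK:..."),
    "subdir": child_entry("dirnode", rw_uri="URI:DIR2:...",
                          ro_uri="URI:DIR2-RO:..."),
}
body = children_body(children)
```

The resulting bytes are used directly as the POST body, without a "Content-Type: multipart/form-data" header.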
+ +``POST /uri?t=mkdir-immutable`` + + Like t=mkdir-with-children above, but the new directory will be + deep-immutable. This means that the directory itself is immutable, and that + it can only contain objects that are treated as being deep-immutable, like + immutable files, literal files, and deep-immutable directories. + + For forward-compatibility, a deep-immutable directory can also contain caps + in a format that is unknown to the webapi server. When such caps are retrieved + from a deep-immutable directory in a "ro_uri" field, they will be prefixed + with the string "imm.", indicating that they must not be decoded without + checking that they are immutable. The "imm." prefix must not be stripped + off without performing this check. (Future versions of the webapi server + will perform it where necessary.) + + The cap for each child may be given either in the "rw_uri" or "ro_uri" + field of the PROPDICT (not both). If a cap is given in the "rw_uri" field, + then the webapi server will check that it is an immutable read-cap of a + *known* format, and give an error if it is not. If a cap is given in the + "ro_uri" field, then the webapi server will still check whether known + caps are immutable, but for unknown caps it will simply assume that the + cap can be stored, as described above. Note that an attacker would be + able to store any cap in an immutable directory, so this check when + creating the directory is only to help non-malicious clients to avoid + accidentally giving away more authority than intended. + + A non-empty request body is mandatory, since after the directory is created, + it will not be possible to add more children to it. + +``POST /uri/$DIRCAP/[SUBDIRS../]SUBDIR?t=mkdir`` + +``PUT /uri/$DIRCAP/[SUBDIRS../]SUBDIR?t=mkdir`` + + Create new directories as necessary to make sure that the named target + ($DIRCAP/SUBDIRS../SUBDIR) is a directory. This will create additional + intermediate mutable directories as necessary. 
If the named target directory + already exists, this will make no changes to it. + + If the final directory is created, it will be empty. + + This operation will return an error if a blocking file is present at any of + the parent names, preventing the server from creating the necessary parent + directory; or if it would require changing an immutable directory. + + The write-cap of the new directory will be returned as the HTTP response + body. + +``POST /uri/$DIRCAP/[SUBDIRS../]SUBDIR?t=mkdir-with-children`` + + Like /uri?t=mkdir-with-children, but the final directory is created as a + child of an existing mutable directory. This will create additional + intermediate mutable directories as necessary. If the final directory is + created, it will be populated with initial children from the POST request + body, as described above. + + This operation will return an error if a blocking file is present at any of + the parent names, preventing the server from creating the necessary parent + directory; or if it would require changing an immutable directory; or if + the immediate parent directory already has a child named SUBDIR. + +``POST /uri/$DIRCAP/[SUBDIRS../]SUBDIR?t=mkdir-immutable`` + + Like /uri?t=mkdir-immutable, but the final directory is created as a child + of an existing mutable directory. The final directory will be deep-immutable, + and will be populated with the children specified as a JSON dictionary in + the POST request body. + + In Tahoe 1.6 this operation creates intermediate mutable directories if + necessary, but that behaviour should not be relied on; see ticket #920. + + This operation will return an error if the parent directory is immutable, + or already has a child named SUBDIR. + +``POST /uri/$DIRCAP/[SUBDIRS../]?t=mkdir&name=NAME`` + + Create a new empty mutable directory and attach it to the given existing + directory. This will create additional intermediate directories as necessary.
+ + This operation will return an error if a blocking file is present at any of + the parent names, preventing the server from creating the necessary parent + directory, or if it would require changing any immutable directory. + + The URL of this operation points to the parent of the bottommost new directory, + whereas the /uri/$DIRCAP/[SUBDIRS../]SUBDIR?t=mkdir operation above has a URL + that points directly to the bottommost new directory. + +``POST /uri/$DIRCAP/[SUBDIRS../]?t=mkdir-with-children&name=NAME`` + + Like /uri/$DIRCAP/[SUBDIRS../]?t=mkdir&name=NAME, but the new directory will + be populated with initial children via the POST request body. This command + will create additional intermediate mutable directories as necessary. + + This operation will return an error if a blocking file is present at any of + the parent names, preventing the server from creating the necessary parent + directory; or if it would require changing an immutable directory; or if + the immediate parent directory already has a child named NAME. + + Note that the name= argument must be passed as a queryarg, because the POST + request body is used for the initial children JSON. + +``POST /uri/$DIRCAP/[SUBDIRS../]?t=mkdir-immutable&name=NAME`` + + Like /uri/$DIRCAP/[SUBDIRS../]?t=mkdir-with-children&name=NAME, but the + final directory will be deep-immutable. The children are specified as a + JSON dictionary in the POST request body. Again, the name= argument must be + passed as a queryarg. + + In Tahoe 1.6 this operation creates intermediate mutable directories if + necessary, but that behaviour should not be relied on; see ticket #920. + + This operation will return an error if the parent directory is immutable, + or already has a child named NAME.
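The two URL shapes used by the mkdir operations above differ only in where the new directory's name goes: as the last path segment, or as a name= query argument on the parent's URL. The sketch below builds both forms. The function names and the placeholder cap are hypothetical, and percent-encoding the cap and path segments is an assumption about what a careful client would do rather than a documented requirement.

```python
from urllib.parse import quote, urlencode


def mkdir_path_url(dircap, path_segments, t="mkdir"):
    """URL that points directly at the bottommost new directory."""
    parts = [quote(dircap, safe="")] + [quote(seg, safe="") for seg in path_segments]
    return "/uri/" + "/".join(parts) + "?" + urlencode({"t": t})


def mkdir_name_url(dircap, parent_segments, name, t="mkdir"):
    """URL that points at the parent; the new child is named via name=."""
    parts = [quote(dircap, safe="")] + [quote(seg, safe="") for seg in parent_segments]
    return "/uri/" + "/".join(parts) + "/?" + urlencode({"t": t, "name": name})


# "URI:DIR2:placeholder" is a made-up stand-in for a real directory cap.
direct = mkdir_path_url("URI:DIR2:placeholder", ["docs", "2010 notes"])
via_parent = mkdir_name_url("URI:DIR2:placeholder", ["docs"], "new dir",
                            t="mkdir-immutable")
```

With the name= form the POST body stays free for the initial-children JSON, which is why mkdir-with-children and mkdir-immutable require the name as a query argument.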
+ +Get Information About A File Or Directory (as JSON) +--------------------------------------------------- + +``GET /uri/$FILECAP?t=json`` + +``GET /uri/$DIRCAP?t=json`` + +``GET /uri/$DIRCAP/[SUBDIRS../]SUBDIR?t=json`` + +``GET /uri/$DIRCAP/[SUBDIRS../]FILENAME?t=json`` + + This returns a machine-parseable JSON-encoded description of the given + object. The JSON always contains a list, and the first element of the list is + always a flag that indicates whether the referenced object is a file or a + directory. If it is a capability to a file, then the information includes + file size and URI, like this:: + + GET /uri/$FILECAP?t=json : + + [ "filenode", { + "ro_uri": file_uri, + "verify_uri": verify_uri, + "size": bytes, + "mutable": false + } ] + + If it is a capability to a directory followed by a path from that directory + to a file, then the information also includes metadata from the link to the + file in the parent directory, like this:: + + GET /uri/$DIRCAP/[SUBDIRS../]FILENAME?t=json + + [ "filenode", { + "ro_uri": file_uri, + "verify_uri": verify_uri, + "size": bytes, + "mutable": false, + "metadata": { + "ctime": 1202777696.7564139, + "mtime": 1202777696.7564139, + "tahoe": { + "linkcrtime": 1202777696.7564139, + "linkmotime": 1202777696.7564139 + } } } ] + + If it is a directory, then it includes information about the children of + this directory, as a mapping from child name to a set of data about the + child (the same data that would appear in a corresponding GET?t=json of the + child itself). The child entries also include metadata about each child, + including link-creation- and link-change- timestamps. 
The output looks like + this:: + + GET /uri/$DIRCAP?t=json : + GET /uri/$DIRCAP/[SUBDIRS../]SUBDIR?t=json : + + [ "dirnode", { + "rw_uri": read_write_uri, + "ro_uri": read_only_uri, + "verify_uri": verify_uri, + "mutable": true, + "children": { + "foo.txt": [ "filenode", { + "ro_uri": uri, + "size": bytes, + "metadata": { + "ctime": 1202777696.7564139, + "mtime": 1202777696.7564139, + "tahoe": { + "linkcrtime": 1202777696.7564139, + "linkmotime": 1202777696.7564139 + } } } ], + "subdir": [ "dirnode", { + "rw_uri": rwuri, + "ro_uri": rouri, + "metadata": { + "ctime": 1202778102.7589991, + "mtime": 1202778111.2160511, + "tahoe": { + "linkcrtime": 1202777696.7564139, + "linkmotime": 1202777696.7564139 + } } } ] + } } ] + + In the above example, note how 'children' is a dictionary in which the keys + are child names and the values depend upon whether the child is a file or a + directory. The value is mostly the same as the JSON representation of the + child object (except that directories do not recurse -- the "children" + entry of the child is omitted, and the directory view includes the metadata + that is stored on the directory edge). + + The rw_uri field will be present in the information about a directory + if and only if you have read-write access to that directory. The verify_uri + field will be present if and only if the object has a verify-cap + (non-distributed LIT files do not have verify-caps). 
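A client consuming the t=json output described above might summarize a node as in the following sketch. The function name `describe_node`, the summary keys, and the sample caps are hypothetical; only the [type, info] list shape and the "children", "rw_uri", and "verify_uri" fields come from the description above.

```python
import json


def describe_node(json_text):
    """Summarize a t=json response: node type, cap presence, child types."""
    node_type, info = json.loads(json_text)  # always a two-element list
    summary = {
        "type": node_type,
        # For a directory, rw_uri is present only with read-write access.
        "has_rw_uri": "rw_uri" in info,
        # verify_uri is absent for objects without a verify-cap.
        "has_verify_uri": "verify_uri" in info,
    }
    if node_type == "dirnode":
        # Each child value is itself a [type, info] pair; directories do
        # not recurse, so only the child's own type is reported here.
        summary["children"] = {
            name: child[0] for name, child in info.get("children", {}).items()
        }
    return summary


# Sample input with made-up placeholder caps.
sample = json.dumps(["dirnode", {
    "rw_uri": "URI:DIR2:placeholder",
    "ro_uri": "URI:DIR2-RO:placeholder",
    "mutable": True,
    "children": {
        "foo.txt": ["filenode", {"ro_uri": "URI:CHK:placeholder", "size": 1234}],
        "subdir": ["dirnode", {"ro_uri": "URI:DIR2-RO:placeholder2"}],
    },
}])
summary = describe_node(sample)
```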
+ + If the cap is of an unknown format, then the file size and verify_uri will + not be available:: + + GET /uri/$UNKNOWNCAP?t=json : + + [ "unknown", { + "ro_uri": unknown_read_uri + } ] + + GET /uri/$DIRCAP/[SUBDIRS../]UNKNOWNCHILDNAME?t=json : + + [ "unknown", { + "rw_uri": unknown_write_uri, + "ro_uri": unknown_read_uri, + "mutable": true, + "metadata": { + "ctime": 1202777696.7564139, + "mtime": 1202777696.7564139, + "tahoe": { + "linkcrtime": 1202777696.7564139, + "linkmotime": 1202777696.7564139 + } } } ] + + As in the case of file nodes, the metadata will only be present when the + capability is to a directory followed by a path. The "mutable" field is also + not always present; when it is absent, the mutability of the object is not + known. + +About the metadata +`````````````````` + +The value of the 'tahoe':'linkmotime' key is updated whenever a link to a +child is set. The value of the 'tahoe':'linkcrtime' key is updated whenever +a link to a child is created -- i.e. when there was not previously a link +under that name. + +Note, however, that if the edge in the Tahoe filesystem points to a mutable +file and the contents of that mutable file are changed, then the +'tahoe':'linkmotime' value on that edge will *not* be updated, since the +edge itself wasn't updated -- only the mutable file was. + +The timestamps are represented as a number of seconds since the UNIX epoch +(1970-01-01 00:00:00 UTC), with leap seconds not being counted in the long +term. + +In Tahoe earlier than v1.4.0, 'mtime' and 'ctime' keys were populated +instead of the 'tahoe':'linkmotime' and 'tahoe':'linkcrtime' keys. Starting +in Tahoe v1.4.0, the 'linkmotime'/'linkcrtime' keys in the 'tahoe' sub-dict +are populated. However, prior to Tahoe v1.7beta, a bug caused the 'tahoe' +sub-dict to be deleted by webapi requests in which new metadata is +specified, and not to be added to existing child links that lack it.
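Since the link timestamps are plain seconds since the UNIX epoch, and the 'tahoe' sub-dict may be missing on links written by older versions, a reader of this metadata has to handle both cases. The following is one possible sketch; the helper name `link_times` and the choice to return None when the sub-dict is absent are assumptions made for illustration.

```python
from datetime import datetime, timezone


def link_times(metadata):
    """Return the link timestamps as UTC datetimes, or None without 'tahoe'."""
    tahoe = metadata.get("tahoe")
    if tahoe is None:
        # Older links may carry only 'ctime'/'mtime', whose meaning is ambiguous.
        return None
    return {
        key: datetime.fromtimestamp(tahoe[key], tz=timezone.utc)
        for key in ("linkcrtime", "linkmotime")
        if key in tahoe
    }


# Metadata shaped like the examples in this section.
times = link_times({
    "ctime": 1202777696.7564139,
    "mtime": 1202777696.7564139,
    "tahoe": {"linkcrtime": 1202777696.7564139,
              "linkmotime": 1202777696.7564139},
})
```

Returning None rather than falling back to 'ctime' keeps the caller from silently treating an ambiguous value as a link-creation time.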
+ +From Tahoe v1.7.0 onward, the 'mtime' and 'ctime' fields are no longer +populated or updated (see ticket #924), except by "tahoe backup" as +explained below. For backward compatibility, when an existing link is +updated and 'tahoe':'linkcrtime' is not present in the previous metadata +but 'ctime' is, the old value of 'ctime' is used as the new value of +'tahoe':'linkcrtime'. + +The reason we added the new fields in Tahoe v1.4.0 is that there is a +"set_children" API (described below) which you can use to overwrite the +values of the 'mtime'/'ctime' pair, and this API is used by the +"tahoe backup" command (in Tahoe v1.3.0 and later) to set the 'mtime' and +'ctime' values when backing up files from a local filesystem into the +Tahoe filesystem. As of Tahoe v1.4.0, the set_children API cannot be used +to set anything under the 'tahoe' key of the metadata dict -- if you +include 'tahoe' keys in your 'metadata' arguments then it will silently +ignore those keys. + +Therefore, if the 'tahoe' sub-dict is present, you can rely on the +'linkcrtime' and 'linkmotime' values therein to have the semantics described +above. (This is assuming that only official Tahoe clients have been used to +write those links, and that their system clocks were set to what you expected +-- there is nothing preventing someone from editing their Tahoe client or +writing their own Tahoe client which would overwrite those values however +they like, and there is nothing to constrain their system clock from taking +any value.) + +When an edge is created or updated by "tahoe backup", the 'mtime' and +'ctime' keys on that edge are set as follows: + +* 'mtime' is set to the timestamp read from the local filesystem for the + "mtime" of the local file in question, which means the last time the + contents of that file were changed. + +* On Windows, 'ctime' is set to the creation timestamp for the file + read from the local filesystem. 
On other platforms, 'ctime' is set to + the UNIX "ctime" of the local file, which means the last time that + either the contents or the metadata of the local file was changed. + +There are several ways that the 'ctime' field could be confusing: + +1. You might be confused about whether it reflects the time of the creation + of a link in the Tahoe filesystem (by a version of Tahoe < v1.7.0) or a + timestamp copied in by "tahoe backup" from a local filesystem. + +2. You might be confused about whether it is a copy of the file creation + time (if "tahoe backup" was run on a Windows system) or of the last + contents-or-metadata change (if "tahoe backup" was run on a different + operating system). + +3. You might be confused by the fact that changing the contents of a + mutable file in Tahoe doesn't have any effect on any links pointing at + that file in any directories, although "tahoe backup" sets the link + 'ctime'/'mtime' to reflect timestamps about the local file corresponding + to the Tahoe file to which the link points. + +4. Also, quite apart from Tahoe, you might be confused about the meaning + of the "ctime" in UNIX local filesystems, which people sometimes think + means file creation time, but which actually means, in UNIX local + filesystems, the most recent time that the file contents or the file + metadata (such as owner, permission bits, extended attributes, etc.) + has changed. Note that although "ctime" does not mean file creation time + in UNIX, links created by a version of Tahoe prior to v1.7.0, and never + written by "tahoe backup", will have 'ctime' set to the link creation + time. + + +Attaching an existing File or Directory by its read- or write-cap +----------------------------------------------------------------- + +``PUT /uri/$DIRCAP/[SUBDIRS../]CHILDNAME?t=uri`` + + This attaches a child object (either a file or directory) to a specified + location in the virtual filesystem. 
The child object is referenced by its + read- or write- cap, as provided in the HTTP request body. This will create + intermediate directories as necessary. + + This is similar to a UNIX hardlink: by referencing a previously-uploaded file + (or previously-created directory) instead of uploading/creating a new one, + you can create two references to the same object. + + The read- or write- cap of the child is provided in the body of the HTTP + request, and this same cap is returned in the response body. + + The default behavior is to overwrite any existing object at the same + location. To prevent this (and make the operation return an error instead + of overwriting), add a "replace=false" argument, as "?t=uri&replace=false". + With replace=false, this operation will return an HTTP 409 "Conflict" error + if there is already an object at the given location, rather than + overwriting the existing object. To allow the operation to overwrite a + file, but return an error when trying to overwrite a directory, use + "replace=only-files" (this behavior is closer to the traditional UNIX "mv" + command). Note that "true", "t", and "1" are all synonyms for "True", and + "false", "f", and "0" are synonyms for "False", and the parameter is + case-insensitive. + + Note that this operation does not take its child cap in the form of + separate "rw_uri" and "ro_uri" fields. Therefore, it cannot accept a + child cap in a format unknown to the webapi server, unless its URI + starts with "ro." or "imm.". This restriction is necessary because the + server is not able to attenuate an unknown write cap to a read cap. + Unknown URIs starting with "ro." or "imm.", on the other hand, are + assumed to represent read caps. The client should not prefix a write + cap with "ro." or "imm." and pass it to this operation, since that + would result in granting the cap's write authority to holders of the + directory read cap. 
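The accepted spellings of the replace= argument described above can be captured in a small parser. This is a sketch of the documented value set only; the function name `parse_replace` and the decision to raise ValueError on anything else are hypothetical and say nothing about how the node itself reports bad values.

```python
def parse_replace(value):
    """Map a replace= query value to True, False, or "only-files"."""
    lowered = value.lower()  # the parameter is case-insensitive
    if lowered in ("true", "t", "1"):
        return True
    if lowered in ("false", "f", "0"):
        return False
    if lowered == "only-files":
        return "only-files"
    raise ValueError("unrecognized replace= value: %r" % (value,))


overwrite_anything = parse_replace("T")
never_overwrite = parse_replace("0")
files_only = parse_replace("Only-Files")
```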
+ +Adding multiple files or directories to a parent directory at once +------------------------------------------------------------------ + +``POST /uri/$DIRCAP/[SUBDIRS..]?t=set_children`` + +``POST /uri/$DIRCAP/[SUBDIRS..]?t=set-children`` (Tahoe >= v1.6) + + This command adds multiple children to a directory in a single operation. + It reads the request body and interprets it as a JSON-encoded description + of the child names and read/write-caps that should be added. + + The body should be a JSON-encoded dictionary, in the same format as the + "children" value returned by the "GET /uri/$DIRCAP?t=json" operation + described above. In this format, each key is a child name, and the + corresponding value is a tuple of (type, childinfo). "type" is ignored, and + "childinfo" is a dictionary that contains "rw_uri", "ro_uri", and + "metadata" keys. You can take the output of "GET /uri/$DIRCAP1?t=json" and + use it as the input to "POST /uri/$DIRCAP2?t=set_children" to make DIR2 + look very much like DIR1 (except for any existing children of DIR2 that + were not overwritten, and any existing "tahoe" metadata keys as described + below). + + When the set_children request contains a child name that already exists in + the target directory, this command defaults to overwriting that child with + the new value (both child cap and metadata, but if the JSON data does not + contain a "metadata" key, the old child's metadata is preserved). The + command takes a boolean "overwrite=" query argument to control this + behavior. If you use "?t=set_children&overwrite=false", then an attempt to + replace an existing child will instead cause an error. + + Any "tahoe" key in the new child's "metadata" value is ignored. Any + existing "tahoe" metadata is preserved. The metadata["tahoe"] value is + reserved for metadata generated by the tahoe node itself. The only two keys + currently placed here are "linkcrtime" and "linkmotime".
For details, see + the section above entitled "Get Information About A File Or Directory (as + JSON)", in the "About the metadata" subsection. + + Note that this command was introduced with the name "set_children", which + uses an underscore rather than a hyphen as other multi-word command names + do. The variant with a hyphen is now accepted, but clients that desire + backward compatibility should continue to use "set_children". + + +Deleting a File or Directory +---------------------------- + +``DELETE /uri/$DIRCAP/[SUBDIRS../]CHILDNAME`` + + This removes the given name from its parent directory. CHILDNAME is the + name to be removed, and $DIRCAP/SUBDIRS.. indicates the directory that will + be modified. + + Note that this does not actually delete the file or directory that the name + points to from the tahoe grid -- it only removes the named reference from + this directory. If there are other names in this directory or in other + directories that point to the resource, then it will remain accessible + through those paths. Even if all names pointing to this object are removed + from their parent directories, someone with possession of its read-cap + can continue to access the object through that cap. + + The object will only become completely unreachable once 1: there are no + reachable directories that reference it, and 2: nobody is holding a read- + or write- cap to the object. (This behavior is very similar to the way + hardlinks and anonymous files work in traditional UNIX filesystems.) + + This operation will not modify more than a single directory. Intermediate + directories which were implicitly created by PUT or POST methods will *not* + be automatically removed by DELETE. + + This method returns the file- or directory- cap of the object that was just + removed.
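For a non-browser client, the DELETE operation above amounts to an HTTP DELETE on the child's URL. The sketch below only constructs the request object and does not send it. The helper name, the placeholder cap, and the percent-encoding of the cap are assumptions; the base URL reflects the default loopback web port mentioned in the CLI documentation.

```python
from urllib.parse import quote
from urllib.request import Request


def build_delete_request(base_url, dircap, path_segments):
    """Build (but do not send) a DELETE request that unlinks one child name."""
    parts = [quote(dircap, safe="")] + [quote(seg, safe="") for seg in path_segments]
    url = base_url.rstrip("/") + "/uri/" + "/".join(parts)
    return Request(url, method="DELETE")


# "URI:DIR2:placeholder" is a made-up stand-in for a real directory write-cap.
req = build_delete_request("http://127.0.0.1:3456", "URI:DIR2:placeholder",
                           ["photos", "old.jpg"])
```

Passing `req` to `urllib.request.urlopen` would issue the request; according to the description above, the response body would then carry the cap of the object that was unlinked, and only the "photos" directory would be modified.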
+ +Browser Operations: Human-oriented interfaces +============================================= + +This section describes the HTTP operations that provide support for humans +running a web browser. Most of these operations use HTML forms that use POST +to drive the Tahoe node. This section is intended for HTML authors who want +to write web pages that contain forms and buttons which manipulate the Tahoe +filesystem. + +Note that for all POST operations, the arguments listed can be provided +either as URL query arguments or as form body fields. URL query arguments are +separated from the main URL by "?", and from each other by "&". For example, +"POST /uri/$DIRCAP?t=upload&mutable=true". Form body fields are usually +specified by using <input type="hidden"> elements. For clarity, the +descriptions below display the most significant arguments as URL query args. + +Viewing A Directory (as HTML) +----------------------------- + +``GET /uri/$DIRCAP/[SUBDIRS../]`` + + This returns an HTML page, intended to be displayed to a human by a web + browser, which contains HREF links to all files and directories reachable + from this directory. These HREF links do not have a t= argument, meaning + that a human who follows them will get pages also meant for a human. It also + contains forms to upload new files, and to delete files and directories. + Those forms use POST methods to do their job. + +Viewing/Downloading a File +-------------------------- + +``GET /uri/$FILECAP`` + +``GET /uri/$DIRCAP/[SUBDIRS../]FILENAME`` + + This will retrieve the contents of the given file. The HTTP response body + will contain the sequence of bytes that make up the file. + + If you want the HTTP response to include a useful Content-Type header, + either use the second form (which starts with a $DIRCAP), or add a + "filename=foo" query argument, like "GET /uri/$FILECAP?filename=foo.jpg".
+ The bare "GET /uri/$FILECAP" does not give the Tahoe node enough information + to determine a Content-Type (since Tahoe immutable files are merely + sequences of bytes, not typed+named file objects). + + If the URL has both filename= and "save=true" in the query arguments, then + the server will add a "Content-Disposition: attachment" header, along with a + filename= parameter. When a user clicks on such a link, most browsers will + offer to let the user save the file instead of displaying it inline (indeed, + most browsers will refuse to display it inline). "true", "t", "1", and other + case-insensitive equivalents are all treated the same. + + Character-set handling in URLs and HTTP headers is a dubious art [1]_. For + maximum compatibility, Tahoe simply copies the bytes from the filename= + argument into the Content-Disposition header's filename= parameter, without + trying to interpret them in any particular way. + + +``GET /named/$FILECAP/FILENAME`` + + This is an alternate download form which makes it easier to get the correct + filename. The Tahoe server will provide the contents of the given file, with + a Content-Type header derived from the given filename. This form is used to + get browsers to use the "Save Link As" feature correctly, and also helps + command-line tools like "wget" and "curl" use the right filename. Note that + this form can *only* be used with file caps; it is an error to use a + directory cap after the /named/ prefix. + +Get Information About A File Or Directory (as HTML) +--------------------------------------------------- + +``GET /uri/$FILECAP?t=info`` + +``GET /uri/$DIRCAP/?t=info`` + +``GET /uri/$DIRCAP/[SUBDIRS../]SUBDIR/?t=info`` + +``GET /uri/$DIRCAP/[SUBDIRS../]FILENAME?t=info`` + + This returns a human-oriented HTML page with more detail about the selected + file or directory object.
This page contains the following items: + + * object size + * storage index + * JSON representation + * raw contents (text/plain) + * access caps (URIs): verify-cap, read-cap, write-cap (for mutable objects) + * check/verify/repair form + * deep-check/deep-size/deep-stats/manifest (for directories) + * replace-contents form (for mutable files) + +Creating a Directory +-------------------- + +``POST /uri?t=mkdir`` + + This creates a new empty directory, but does not attach it to the virtual + filesystem. + + If a "redirect_to_result=true" argument is provided, then the HTTP response + will cause the web browser to be redirected to a /uri/$DIRCAP page that + gives access to the newly-created directory. If you bookmark this page, + you'll be able to get back to the directory again in the future. This is the + recommended way to start working with a Tahoe server: create a new unlinked + directory (using redirect_to_result=true), then bookmark the resulting + /uri/$DIRCAP page. There is a "create directory" button on the Welcome page + to invoke this action. + + If "redirect_to_result=true" is not provided (or is given a value of + "false"), then the HTTP response body will simply be the write-cap of the + new directory. + +``POST /uri/$DIRCAP/[SUBDIRS../]?t=mkdir&name=CHILDNAME`` + + This creates a new empty directory as a child of the designated SUBDIR. This + will create additional intermediate directories as necessary. + + If a "when_done=URL" argument is provided, the HTTP response will cause the + web browser to redirect to the given URL. This provides a convenient way to + return the browser to the directory that was just modified. Without a + when_done= argument, the HTTP response will simply contain the write-cap of + the directory that was just created. + + +Uploading a File +---------------- + +``POST /uri?t=upload`` + + This uploads a file, and produces a file-cap for the contents, but does not + attach the file into the filesystem.
No directories will be modified by + this operation. + + The file must be provided as the "file" field of an HTML encoded form body, + produced in response to an HTML form like this:: + +
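+ <form action="/uri" method="POST" enctype="multipart/form-data"> + <input type="hidden" name="t" value="upload" /> + <input type="file" name="file" /> + <input type="submit" value="Upload Unlinked" /> + </form>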
+ + If a "when_done=URL" argument is provided, the response body will cause the + browser to redirect to the given URL. If the when_done= URL has the string + "%(uri)s" in it, that string will be replaced by a URL-escaped form of the + newly created file-cap. (Note that without this substitution, there is no + way to access the file that was just uploaded). + + The default (in the absence of when_done=) is to return an HTML page that + describes the results of the upload. This page will contain information + about which storage servers were used for the upload, how long each + operation took, etc. + + If a "mutable=true" argument is provided, the operation will create a + mutable file, and the response body will contain the write-cap instead of + the upload results page. The default is to create an immutable file, + returning the upload results page as a response. + + +``POST /uri/$DIRCAP/[SUBDIRS../]?t=upload`` + + This uploads a file, and attaches it as a new child of the given directory, + which must be mutable. The file must be provided as the "file" field of an + HTML-encoded form body, produced in response to an HTML form like this:: + +
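+ <form action="." method="POST" enctype="multipart/form-data"> + <input type="hidden" name="t" value="upload" /> + <input type="file" name="file" /> + <input type="submit" value="Upload" /> + </form>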
+ + A "name=" argument can be provided to specify the new child's name, + otherwise it will be taken from the "filename" field of the upload form + (most web browsers will copy the last component of the original file's + pathname into this field). To avoid confusion, name= is not allowed to + contain a slash. + + If there is already a child with that name, and it is a mutable file, then + its contents are replaced with the data being uploaded. If it is not a + mutable file, the default behavior is to remove the existing child before + creating a new one. To prevent this (and make the operation return an error + instead of overwriting the old child), add a "replace=false" argument, as + "?t=upload&replace=false". With replace=false, this operation will return an + HTTP 409 "Conflict" error if there is already an object at the given + location, rather than overwriting the existing object. Note that "true", + "t", and "1" are all synonyms for "True", and "false", "f", and "0" are + synonyms for "False". The parameter is case-insensitive. + + This will create additional intermediate directories as necessary, although + since it is expected to be triggered by a form that was retrieved by "GET + /uri/$DIRCAP/[SUBDIRS../]", it is likely that the parent directory will + already exist. + + If a "mutable=true" argument is provided, any new file that is created will + be a mutable file instead of an immutable one. <input type="checkbox" name="mutable" /> will give the user a way to set this option. + + If a "when_done=URL" argument is provided, the HTTP response will cause the + web browser to redirect to the given URL. This provides a convenient way to + return the browser to the directory that was just modified. Without a + when_done= argument, the HTTP response will simply contain the file-cap of + the file that was just uploaded (a write-cap for mutable files, or a + read-cap for immutable files).
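A program that wants to drive the t=upload operation without a browser has to produce the same multipart/form-data body that an HTML form would. The sketch below hand-builds such a body with a "t" field and a "file" field, as described above. The function name, the fixed boundary used in the example, and the application/octet-stream part type are hypothetical choices, and the filename is inserted without any escaping, which a real client would need to handle.

```python
import uuid


def build_upload_form(filename, data, extra_fields=None, boundary=None):
    """Return (content_type, body) for a t=upload multipart/form-data POST."""
    boundary = boundary or uuid.uuid4().hex
    marker = b"--" + boundary.encode("ascii")
    fields = {"t": "upload"}
    fields.update(extra_fields or {})  # e.g. {"mutable": "true"}
    lines = []
    for key, value in fields.items():
        lines += [
            marker,
            ('Content-Disposition: form-data; name="%s"' % key).encode("utf-8"),
            b"",
            value.encode("utf-8"),
        ]
    # The file contents go in the field named "file".
    lines += [
        marker,
        ('Content-Disposition: form-data; name="file"; filename="%s"'
         % filename).encode("utf-8"),
        b"Content-Type: application/octet-stream",
        b"",
        data,
        marker + b"--",
        b"",
    ]
    return "multipart/form-data; boundary=" + boundary, b"\r\n".join(lines)


content_type, body = build_upload_form(
    "hello.txt", b"hello", extra_fields={"mutable": "true"}, boundary="BOUNDARY")
```

The returned `content_type` would be sent as the Content-Type header of the POST, with `body` as the request body.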
+ +``POST /uri/$DIRCAP/[SUBDIRS../]FILENAME?t=upload`` + + This also uploads a file and attaches it as a new child of the given + directory, which must be mutable. It is a slight variant of the previous + operation, as the URL refers to the target file rather than the parent + directory. It is otherwise identical: this accepts mutable= and when_done= + arguments too. + +``POST /uri/$FILECAP?t=upload`` + + This modifies the contents of an existing mutable file in-place. An error is + signalled if $FILECAP does not refer to a mutable file. It behaves just like + the "PUT /uri/$FILECAP" form, but uses a POST for the benefit of HTML forms + in a web browser. + +Attaching An Existing File Or Directory (by URI) +------------------------------------------------ + +``POST /uri/$DIRCAP/[SUBDIRS../]?t=uri&name=CHILDNAME&uri=CHILDCAP`` + + This attaches a given read- or write- cap "CHILDCAP" to the designated + directory, with a specified child name. This behaves much like the PUT t=uri + operation, and is a lot like a UNIX hardlink. It is subject to the same + restrictions as that operation on the use of cap formats unknown to the + webapi server. + + This will create additional intermediate directories as necessary, although + since it is expected to be triggered by a form that was retrieved by "GET + /uri/$DIRCAP/[SUBDIRS../]", it is likely that the parent directory will + already exist. + + This accepts the same replace= argument as POST t=upload. + +Deleting A Child +---------------- + +``POST /uri/$DIRCAP/[SUBDIRS../]?t=delete&name=CHILDNAME`` + + This instructs the node to remove a child object (file or subdirectory) from + the given directory, which must be mutable. Note that the entire subtree is + unlinked from the parent. Unlike deleting a subdirectory in a UNIX local + filesystem, the subtree need not be empty; if it isn't, then other references + into the subtree will see that the child subdirectories are not modified by + this operation. 
Only the link from the given directory to its child is severed. + +Renaming A Child +---------------- + +``POST /uri/$DIRCAP/[SUBDIRS../]?t=rename&from_name=OLD&to_name=NEW`` + + This instructs the node to rename a child of the given directory, which must + be mutable. This has a similar effect to removing the child, then adding the + same child-cap under the new name, except that it preserves metadata. This + operation cannot move the child to a different directory. + + This operation will replace any existing child of the new name, making it + behave like the UNIX "``mv -f``" command. + +Other Utilities +--------------- + +``GET /uri?uri=$CAP`` + + This causes a redirect to /uri/$CAP, and retains any additional query + arguments (like filename= or save=). This is for the convenience of web + forms which allow the user to paste in a read- or write- cap (obtained + through some out-of-band channel, like IM or email). + + Note that this form merely redirects to the specific file or directory + indicated by the $CAP: unlike the GET /uri/$DIRCAP form, you cannot + traverse to children by appending additional path segments to the URL. + +``GET /uri/$DIRCAP/[SUBDIRS../]?t=rename-form&name=$CHILDNAME`` + + This provides a useful facility to browser-based user interfaces. It + returns a page containing a form targeting the "POST $DIRCAP t=rename" + functionality described above, with the provided $CHILDNAME present in the + 'from_name' field of that form. I.e. this presents a form offering to + rename $CHILDNAME, requesting the new name, and submitting POST rename. + +``GET /uri/$DIRCAP/[SUBDIRS../]CHILDNAME?t=uri`` + + This returns the file- or directory- cap for the specified object. + +``GET /uri/$DIRCAP/[SUBDIRS../]CHILDNAME?t=readonly-uri`` + + This returns a read-only file- or directory- cap for the specified object. + If the object is an immutable file, this will return the same value as + t=uri.
+ +Debugging and Testing Features +------------------------------ + +These URLs are less likely to be helpful to the casual Tahoe user, and are +mainly intended for developers. + +``POST $URL?t=check`` + + This triggers the FileChecker to determine the current "health" of the + given file or directory, by counting how many shares are available. The + page that is returned will display the results. This can be used as a "show + me detailed information about this file" page. + + If a verify=true argument is provided, the node will perform a more + intensive check, downloading and verifying every single bit of every share. + + If an add-lease=true argument is provided, the node will also add (or + renew) a lease to every share it encounters. Each lease will keep the share + alive for a certain period of time (one month by default). Once the last + lease expires or is explicitly cancelled, the storage server is allowed to + delete the share. + + If an output=JSON argument is provided, the response will be + machine-readable JSON instead of human-oriented HTML. The data is a + dictionary with the following keys:: + + storage-index: a base32-encoded string with the object's storage index, + or an empty string for LIT files + summary: a string, with a one-line summary of the stats of the file + results: a dictionary that describes the state of the file. For LIT files, + this dictionary has only the 'healthy' key, which will always be + True. For distributed files, this dictionary has the following + keys: + count-shares-good: the number of good shares that were found + count-shares-needed: 'k', the number of shares required for recovery + count-shares-expected: 'N', the number of total shares generated + count-good-share-hosts: this was intended to be the number of distinct + storage servers with good shares. It is currently + (as of Tahoe-LAFS v1.8.0) computed incorrectly; + see ticket #1115.
+ count-wrong-shares: for mutable files, the number of shares for + versions other than the 'best' one (highest + sequence number, highest roothash). These are + either old ... + count-recoverable-versions: for mutable files, the number of + recoverable versions of the file. For + a healthy file, this will equal 1. + count-unrecoverable-versions: for mutable files, the number of + unrecoverable versions of the file. + For a healthy file, this will be 0. + count-corrupt-shares: the number of shares with integrity failures + list-corrupt-shares: a list of "share locators", one for each share + that was found to be corrupt. Each share locator + is a list of (serverid, storage_index, sharenum). + needs-rebalancing: (bool) True if there are multiple shares on a single + storage server, indicating a reduction in reliability + that could be resolved by moving shares to new + servers. + servers-responding: list of base32-encoded storage server identifiers, + one for each server which responded to the share + query. + healthy: (bool) True if the file is completely healthy, False otherwise. + Healthy files have at least N good shares. Overlapping shares + do not currently cause a file to be marked unhealthy. If there + are at least N good shares, then corrupt shares do not cause the + file to be marked unhealthy, although the corrupt shares will be + listed in the results (list-corrupt-shares) and should be manually + removed to avoid wasting time in subsequent downloads (as the + downloader rediscovers the corruption and uses alternate shares). + Future compatibility: the meaning of this field may change to + reflect whether the servers-of-happiness criterion is met + (see ticket #614). + sharemap: dict mapping share identifier to list of serverids + (base32-encoded strings). This indicates which servers are + holding which shares. For immutable files, the shareid is + an integer (the share number, from 0 to N-1).
For + mutable files, it is a string of the form + 'seq%d-%s-sh%d', containing the sequence number, the + roothash, and the share number. + +``POST $URL?t=start-deep-check`` (must add &ophandle=XYZ) + + This initiates a recursive walk of all files and directories reachable from + the target, performing a check on each one just like t=check. The result + page will contain a summary of the results, including details on any + file/directory that was not fully healthy. + + t=start-deep-check can only be invoked on a directory. An error (400 + BAD_REQUEST) will be signalled if it is invoked on a file. The recursive + walker will deal with loops safely. + + This accepts the same verify= and add-lease= arguments as t=check. + + Since this operation can take a long time (perhaps a second per object), + the ophandle= argument is required (see "Slow Operations, Progress, and + Cancelling" above). The response to this POST will be a redirect to the + corresponding /operations/$HANDLE page (with output=HTML or output=JSON to + match the output= argument given to the POST). The deep-check operation + will continue to run in the background, and the /operations page should be + used to find out when the operation is done. + + Detailed check results for non-healthy files and directories will be + available under /operations/$HANDLE/$STORAGEINDEX, and the HTML status will + contain links to these detailed results. + + The HTML /operations/$HANDLE page for incomplete operations will contain a + meta-refresh tag, set to 60 seconds, so that a browser which uses + deep-check will automatically poll until the operation has completed. + + The JSON page (/operations/$HANDLE?output=JSON) will contain a + machine-readable JSON dictionary with the following keys:: + + finished: a boolean, True if the operation is complete, else False. Some + of the remaining keys may not be present until the operation + is complete.
+ root-storage-index: a base32-encoded string with the storage index of the + starting point of the deep-check operation + count-objects-checked: count of how many objects were checked. Note that + non-distributed objects (i.e. small immutable LIT + files) are not checked, since for these objects, + the data is contained entirely in the URI. + count-objects-healthy: how many of those objects were completely healthy + count-objects-unhealthy: how many were damaged in some way + count-corrupt-shares: how many shares were found to have corruption, + summed over all objects examined + list-corrupt-shares: a list of "share identifiers", one for each share + that was found to be corrupt. Each share identifier + is a list of (serverid, storage_index, sharenum). + list-unhealthy-files: a list of (pathname, check-results) tuples, for + each file that was not fully healthy. 'pathname' is + a list of strings (which can be joined by "/" + characters to turn it into a single string), + relative to the directory on which deep-check was + invoked. The 'check-results' field is the same as + that returned by t=check&output=JSON, described + above. + stats: a dictionary with the same keys as the t=start-deep-stats command + (described below) + +``POST $URL?t=stream-deep-check`` + + This initiates a recursive walk of all files and directories reachable from + the target, performing a check on each one just like t=check. For each + unique object (duplicates are skipped), a single line of JSON is emitted to + the HTTP response channel (or an error indication, see below). When the walk + is complete, a final line of JSON is emitted which contains the accumulated + file-size/count "deep-stats" data. + + This command takes the same arguments as t=start-deep-check. + + A CLI tool can split the response stream on newlines into "response units", + and parse each response unit as JSON. 
Each such parsed unit will be a + dictionary, and will contain at least the "type" key: a string, one of + "file", "directory", or "stats". + + For all units that have a type of "file" or "directory", the dictionary will + contain the following keys:: + + "path": a list of strings, with the path that is traversed to reach the + object + "cap": a write-cap URI for the file or directory, if available, else a + read-cap URI + "verifycap": a verify-cap URI for the file or directory + "repaircap": a URI for the weakest cap that can still be used to repair + the object + "storage-index": a base32 storage index for the object + "check-results": a copy of the dictionary which would be returned by + t=check&output=json, with three top-level keys: + "storage-index", "summary", and "results", and a variety + of counts and sharemaps in the "results" value. + + Note that non-distributed files (i.e. LIT files) will have values of None + for verifycap, repaircap, and storage-index, since these files can neither + be verified nor repaired, and are not stored on the storage servers. + Likewise the check-results dictionary will be limited: an empty string for + storage-index, and a results dictionary with only the "healthy" key. + + The last unit in the stream will have a type of "stats", and will contain + the keys described in the "start-deep-stats" operation, below. + + If any errors occur during the traversal (specifically if a directory is + unrecoverable, such that further traversal is not possible), an error + indication is written to the response body, instead of the usual line of + JSON. This error indication line will begin with the string "ERROR:" (in all + caps), and contain a summary of the error on the rest of the line. The + remaining lines of the response body will be a python exception. The client + application should look for the ERROR: and stop processing JSON as soon as + it is seen.
Note that neither a file being unrecoverable nor a directory + merely being unhealthy will cause traversal to stop. The line just before + the ERROR: will describe the directory that was untraversable, since the + unit is emitted to the HTTP response body before the child is traversed. + + +``POST $URL?t=check&repair=true`` + + This performs a health check of the given file or directory, and if the + checker determines that the object is not healthy (some shares are missing + or corrupted), it will perform a "repair". During repair, any missing + shares will be regenerated and uploaded to new servers. + + This accepts the same verify=true and add-lease= arguments as t=check. When + an output=JSON argument is provided, the machine-readable JSON response + will contain the following keys:: + + storage-index: a base32-encoded string with the object's storage index, + or an empty string for LIT files + repair-attempted: (bool) True if repair was attempted + repair-successful: (bool) True if repair was attempted and the file was + fully healthy afterwards. False if no repair was + attempted, or if a repair attempt failed. + pre-repair-results: a dictionary that describes the state of the file + before any repair was performed. This contains exactly + the same keys as the 'results' value of the t=check + response, described above. + post-repair-results: a dictionary that describes the state of the file + after any repair was performed. If no repair was + performed, post-repair-results and pre-repair-results + will be the same. This contains exactly the same keys + as the 'results' value of the t=check response, + described above. + +``POST $URL?t=start-deep-check&repair=true`` (must add &ophandle=XYZ) + + This triggers a recursive walk of all files and directories, performing a + t=check&repair=true on each one. + + Like t=start-deep-check without the repair= argument, this can only be + invoked on a directory.
An error (400 BAD_REQUEST) will be signalled if it + is invoked on a file. The recursive walker will deal with loops safely. + + This accepts the same verify= and add-lease= arguments as + t=start-deep-check. It uses the same ophandle= mechanism as + start-deep-check. When an output=JSON argument is provided, the response + will contain the following keys:: + + finished: (bool) True if the operation has completed, else False + root-storage-index: a base32-encoded string with the storage index of the + starting point of the deep-check operation + count-objects-checked: count of how many objects were checked + + count-objects-healthy-pre-repair: how many of those objects were completely + healthy, before any repair + count-objects-unhealthy-pre-repair: how many were damaged in some way + count-objects-healthy-post-repair: how many of those objects were completely + healthy, after any repair + count-objects-unhealthy-post-repair: how many were damaged in some way + + count-repairs-attempted: repairs were attempted on this many objects. + count-repairs-successful: how many repairs resulted in healthy objects + count-repairs-unsuccessful: how many repairs did not result in + completely healthy objects + count-corrupt-shares-pre-repair: how many shares were found to have + corruption, summed over all objects + examined, before any repair + count-corrupt-shares-post-repair: how many shares were found to have + corruption, summed over all objects + examined, after any repair + list-corrupt-shares: a list of "share identifiers", one for each share + that was found to be corrupt (before any repair). + Each share identifier is a list of (serverid, + storage_index, sharenum). + list-remaining-corrupt-shares: like list-corrupt-shares, but mutable shares + that were successfully repaired are not + included. These are shares that need + manual processing. Since immutable shares + cannot be modified by clients, all corruption + in immutable shares will be listed here.
+ list-unhealthy-files: a list of (pathname, check-results) tuples, for + each file that was not fully healthy. 'pathname' is + relative to the directory on which deep-check was + invoked. The 'check-results' field is the same as + that returned by t=check&repair=true&output=JSON, + described above. + stats: a dictionary with the same keys as the t=start-deep-stats command + (described below) + +``POST $URL?t=stream-deep-check&repair=true`` + + This triggers a recursive walk of all files and directories, performing a + t=check&repair=true on each one. For each unique object (duplicates are + skipped), a single line of JSON is emitted to the HTTP response channel (or + an error indication). When the walk is complete, a final line of JSON is + emitted which contains the accumulated file-size/count "deep-stats" data. + + This emits the same data as t=stream-deep-check (without the repair=true), + except that the "check-results" field is replaced with a + "check-and-repair-results" field, which contains the keys returned by + t=check&repair=true&output=json (i.e. repair-attempted, repair-successful, + pre-repair-results, and post-repair-results). The output does not contain + the summary dictionary that is provided by t=start-deep-check&repair=true + (the one with count-objects-checked and list-unhealthy-files), since the + receiving client is expected to calculate those values itself from the + stream of per-object check-and-repair-results. + + Note that the "ERROR:" indication will only be emitted if traversal stops, + which will only occur if an unrecoverable directory is encountered. If a + file or directory repair fails, the traversal will continue, and the repair + failure will be indicated in the JSON data (in the "repair-successful" key). + +``POST $DIRURL?t=start-manifest`` (must add &ophandle=XYZ) + + This operation generates a "manifest" of the given directory tree, mostly + for debugging.
This is a table of (path, filecap/dircap), for every object + reachable from the starting directory. The path will be slash-joined, and + the filecap/dircap will contain a link to the object in question. This page + gives immediate access to every object in the virtual filesystem subtree. + + This operation uses the same ophandle= mechanism as deep-check. The + corresponding /operations/$HANDLE page has three different forms. The + default is output=HTML. + + If output=text is added to the query args, the results will be a text/plain + list. The first line is special: it is either "finished: yes" or "finished: + no"; if the operation is not finished, you must periodically reload the + page until it completes. The rest of the results are a plaintext list, with + one file/dir per line, slash-separated, with the filecap/dircap separated + by a space. + + If output=JSON is added to the queryargs, then the results will be a + JSON-formatted dictionary with six keys. Note that because large directory + structures can result in very large JSON results, the full results will not + be available until the operation is complete (i.e. until output["finished"] + is True):: + + finished (bool): if False then you must reload the page until True + origin_si (base32 str): the storage index of the starting point + manifest: list of (path, cap) tuples, where path is a list of strings. + verifycaps: list of (printable) verify cap strings + storage-index: list of (base32) storage index strings + stats: a dictionary with the same keys as the t=start-deep-stats command + (described below) + +``POST $DIRURL?t=start-deep-size`` (must add &ophandle=XYZ) + + This operation generates a number (in bytes) containing the sum of the + filesize of all directories and immutable files reachable from the given + directory. This is a rough lower bound of the total space consumed by this + subtree. 
It does not include space consumed by mutable files, nor does it + take expansion or encoding overhead into account. Later versions of the + code may improve this estimate upwards. + + The /operations/$HANDLE status output consists of two lines of text:: + + finished: yes + size: 1234 + +``POST $DIRURL?t=start-deep-stats`` (must add &ophandle=XYZ) + + This operation performs a recursive walk of all files and directories + reachable from the given directory, and generates a collection of + statistics about those objects. + + The result (obtained from the /operations/$OPHANDLE page) is a + JSON-serialized dictionary with the following keys (note that some of these + keys may be missing until 'finished' is True):: + + finished: (bool) True if the operation has finished, else False + count-immutable-files: count of how many CHK files are in the set + count-mutable-files: same, for mutable files (does not include directories) + count-literal-files: same, for LIT files (data contained inside the URI) + count-files: sum of the above three + count-directories: count of directories + count-unknown: count of unrecognized objects (perhaps from the future) + size-immutable-files: total bytes for all CHK files in the set, =deep-size + size-mutable-files (TODO): same, for current version of all mutable files + size-literal-files: same, for LIT files + size-directories: size of directories (includes size-literal-files) + size-files-histogram: list of (minsize, maxsize, count) buckets, + with a histogram of filesizes, 5dB/bucket, + for both literal and immutable files + largest-directory: number of children in the largest directory + largest-immutable-file: number of bytes in the largest CHK file + + size-mutable-files is not implemented, because it would require extra + queries to each mutable file to get their size. This may be implemented in + the future. 
+ + Assuming no sharing, the basic space consumed by a single root directory is + the sum of size-immutable-files, size-mutable-files, and size-directories. + The actual disk space used by the shares is larger, because of the + following sources of overhead:: + + integrity data + expansion due to erasure coding + share management data (leases) + backend (ext3) minimum block size + +``POST $URL?t=stream-manifest`` + + This operation performs a recursive walk of all files and directories + reachable from the given starting point. For each such unique object + (duplicates are skipped), a single line of JSON is emitted to the HTTP + response channel (or an error indication, see below). When the walk is + complete, a final line of JSON is emitted which contains the accumulated + file-size/count "deep-stats" data. + + A CLI tool can split the response stream on newlines into "response units", + and parse each response unit as JSON. Each such parsed unit will be a + dictionary, and will contain at least the "type" key: a string, one of + "file", "directory", or "stats". + + For all units that have a type of "file" or "directory", the dictionary will + contain the following keys:: + + "path": a list of strings, with the path that is traversed to reach the + object + "cap": a write-cap URI for the file or directory, if available, else a + read-cap URI + "verifycap": a verify-cap URI for the file or directory + "repaircap": a URI for the weakest cap that can still be used to repair + the object + "storage-index": a base32 storage index for the object + + Note that non-distributed files (i.e. LIT files) will have values of None + for verifycap, repaircap, and storage-index, since these files can neither + be verified nor repaired, and are not stored on the storage servers. + + The last unit in the stream will have a type of "stats", and will contain + the keys described in the "start-deep-stats" operation, above.
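The newline-splitting approach described for the streaming operations can be sketched as follows. This is a minimal illustration, not the code used by the Tahoe CLI: the function name ``parse_stream`` and the sample cap strings are hypothetical, and the body is assumed to have been read in full as text.

```python
import json


def parse_stream(body):
    """Split a stream-manifest style response body into parsed units.

    Returns (units, error). 'units' is a list of dictionaries, one per
    JSON line. 'error' is None, or the text from the "ERROR:" line to the
    end of the body if traversal stopped early.
    """
    units = []
    lines = body.split("\n")
    for index, line in enumerate(lines):
        if line.startswith("ERROR:"):
            # Everything from here on is an error summary followed by a
            # Python exception, not JSON, so stop parsing.
            return units, "\n".join(lines[index:])
        if not line.strip():
            continue  # tolerate blank lines, such as a trailing newline
        units.append(json.loads(line))
    return units, None


# Hypothetical sample data; real caps are much longer strings.
sample = "\n".join([
    json.dumps({"type": "directory", "path": [], "cap": "URI:DIR2:example"}),
    json.dumps({"type": "file", "path": ["a.txt"], "cap": "URI:CHK:example"}),
    json.dumps({"type": "stats", "count-files": 1, "count-directories": 1}),
]) + "\n"

units, error = parse_stream(sample)
```

A caller would typically treat the unit whose "type" is "stats" as the end-of-walk summary, and treat a non-None error as a sign that the preceding unit describes the directory that could not be traversed.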
+ + If any errors occur during the traversal (specifically if a directory is + unrecoverable, such that further traversal is not possible), an error + indication is written to the response body, instead of the usual line of + JSON. This error indication line will begin with the string "ERROR:" (in all + caps), and contain a summary of the error on the rest of the line. The + remaining lines of the response body will be a python exception. The client + application should look for the ERROR: and stop processing JSON as soon as + it is seen. The line just before the ERROR: will describe the directory that + was untraversable, since the manifest entry is emitted to the HTTP response + body before the child is traversed. + +Other Useful Pages +================== + +The portion of the web namespace that begins with "/uri" (and "/named") is +dedicated to giving users (both humans and programs) access to the Tahoe +virtual filesystem. The rest of the namespace provides status information +about the state of the Tahoe node. + +``GET /`` (the root page) + +This is the "Welcome Page", and contains a few distinct sections:: + + Node information: library versions, local nodeid, services being provided. + + Filesystem Access Forms: create a new directory, view a file/directory by + URI, upload a file (unlinked), download a file by + URI. + + Grid Status: introducer information, helper information, connected storage + servers. + +``GET /status/`` + + This page lists all active uploads and downloads, and contains a short list + of recent upload/download operations. Each operation has a link to a page + that describes file sizes, servers that were involved, and the time consumed + in each phase of the operation. + + A GET of /status/?t=json will contain a machine-readable subset of the same + data. It returns a JSON-encoded dictionary. The only key defined at this + time is "active", with a value that is a list of operation dictionaries, one + for each active operation. 
Once an operation is completed, it will no longer + appear in data["active"] . + + Each op-dict contains a "type" key, one of "upload", "download", + "mapupdate", "publish", or "retrieve" (the first two are for immutable + files, while the latter three are for mutable files and directories). + + The "upload" op-dict will contain the following keys:: + + type (string): "upload" + storage-index-string (string): a base32-encoded storage index + total-size (int): total size of the file + status (string): current status of the operation + progress-hash (float): 1.0 when the file has been hashed + progress-ciphertext (float): 1.0 when the file has been encrypted. + progress-encode-push (float): 1.0 when the file has been encoded and + pushed to the storage servers. For helper + uploads, the ciphertext value climbs to 1.0 + first, then encoding starts. For unassisted + uploads, ciphertext and encode-push progress + will climb at the same pace. + + The "download" op-dict will contain the following keys:: + + type (string): "download" + storage-index-string (string): a base32-encoded storage index + total-size (int): total size of the file + status (string): current status of the operation + progress (float): 1.0 when the file has been fully downloaded + + Front-ends which want to report progress information are advised to simply + average together all the progress-* indicators. A slightly more accurate + value can be found by ignoring the progress-hash value (since the current + implementation hashes synchronously, so clients will probably never see + progress-hash!=1.0). + +``GET /provisioning/`` + + This page provides a basic tool to predict the likely storage and bandwidth + requirements of a large Tahoe grid. It provides forms to input things like + total number of users, number of files per user, average file size, number + of servers, expansion ratio, hard drive failure rate, etc. 
It then provides + numbers like how many disks per server will be needed, how many read + operations per second should be expected, and the likely MTBF for files in + the grid. This information is very preliminary, and the model upon which it + is based still needs a lot of work. + +``GET /helper_status/`` + + If the node is running a helper (i.e. if [helper]enabled is set to True in + tahoe.cfg), then this page will provide a list of all the helper operations + currently in progress. If "?t=json" is added to the URL, it will return a + JSON-formatted list of helper statistics, which can then be used to produce + graphs to indicate how busy the helper is. + +``GET /statistics/`` + + This page provides "node statistics", which are collected from a variety of + sources:: + + load_monitor: every second, the node schedules a timer for one second in + the future, then measures how late the subsequent callback + is. The "load_average" is this tardiness, measured in + seconds, averaged over the last minute. It is an indication + of a busy node, one which is doing more work than can be + completed in a timely fashion. The "max_load" value is the + highest value that has been seen in the last 60 seconds. + + cpu_monitor: every minute, the node uses time.clock() to measure how much + CPU time it has used, and it uses this value to produce + 1min/5min/15min moving averages. These values range from 0% + (0.0) to 100% (1.0), and indicate what fraction of the CPU + has been used by the Tahoe node. Not all operating systems + provide meaningful data to time.clock(): they may report 100% + CPU usage at all times. 
+ + uploader: this counts how many immutable files (and bytes) have been + uploaded since the node was started + + downloader: this counts how many immutable files have been downloaded + since the node was started + + publishes: this counts how many mutable files (including directories) have + been modified since the node was started + + retrieves: this counts how many mutable files (including directories) have + been read since the node was started + + There are other statistics that are tracked by the node. The "raw stats" + section shows a formatted dump of all of them. + + By adding "?t=json" to the URL, the node will return a JSON-formatted + dictionary of stats values, which can be used by other tools to produce + graphs of node behavior. The misc/munin/ directory in the source + distribution provides some tools to produce these graphs. + +``GET /`` (introducer status) + + For Introducer nodes, the welcome page displays information about both + clients and servers which are connected to the introducer. Servers make + "service announcements", and these are listed in a table. Clients will + subscribe to hear about service announcements, and these subscriptions are + listed in a separate table. Both tables contain information about what + version of Tahoe is being run by the remote node, their advertised and + outbound IP addresses, their nodeid and nickname, and how long they have + been available. + + By adding "?t=json" to the URL, the node will return a JSON-formatted + dictionary of stats values, which can be used to produce graphs of connected + clients over time. 
This dictionary has the following keys:: + + ["subscription_summary"] : a dictionary mapping service name (like + "storage") to an integer with the number of + clients that have subscribed to hear about that + service + ["announcement_summary"] : a dictionary mapping service name to an integer + with the number of servers which are announcing + that service + ["announcement_distinct_hosts"] : a dictionary mapping service name to an + integer which represents the number of + distinct hosts that are providing that + service. If two servers have announced + FURLs which use the same hostnames (but + different ports and tubids), they are + considered to be on the same host. + + +Static Files in /public_html +============================ + +The webapi server will take any request for a URL that starts with /static +and serve it from a configurable directory which defaults to +$BASEDIR/public_html . This is configured by setting the "[node]web.static" +value in $BASEDIR/tahoe.cfg . If this is left at the default value of +"public_html", then http://localhost:3456/static/subdir/foo.html will be +served with the contents of the file $BASEDIR/public_html/subdir/foo.html . + +This can be useful to serve a javascript application which provides a +prettier front-end to the rest of the Tahoe webapi. + + +Safety and security issues -- names vs. URIs +============================================ + +Summary: use explicit file- and dir- caps whenever possible, to reduce the +potential for surprises when the filesystem structure is changed. + +Tahoe provides a mutable filesystem, but the ways that the filesystem can +change are limited. The only thing that can change is that the mapping from +child names to child objects that each directory contains can be changed by +adding a new child name pointing to an object, removing an existing child name, +or changing an existing child name to point to a different object. 
+ +Obviously if you query Tahoe for information about the filesystem and then act +to change the filesystem (such as by getting a listing of the contents of a +directory and then adding a file to the directory), then the filesystem might +have been changed after you queried it and before you acted upon it. However, +if you use the URI instead of the pathname of an object when you act upon the +object, then the only change that can happen is if the object is a directory +then the set of child names it has might be different. If, on the other hand, +you act upon the object using its pathname, then a different object might be in +that place, which can result in more kinds of surprises. + +For example, suppose you are writing code which recursively downloads the +contents of a directory. The first thing your code does is fetch the listing +of the contents of the directory. For each child that it fetched, if that +child is a file then it downloads the file, and if that child is a directory +then it recurses into that directory. Now, if the download and the recurse +actions are performed using the child's name, then the results might be +wrong, because for example a child name that pointed to a sub-directory when +you listed the directory might have been changed to point to a file (in which +case your attempt to recurse into it would result in an error and the file +would be skipped), or a child name that pointed to a file when you listed the +directory might now point to a sub-directory (in which case your attempt to +download the child would result in a file containing HTML text describing the +sub-directory!). + +If your recursive algorithm uses the uri of the child instead of the name of +the child, then those kinds of mistakes just can't happen. Note that both the +child's name and the child's URI are included in the results of listing the +parent directory, so it isn't any harder to use the URI for this purpose. 
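The recursive-download example above can be sketched in a way that addresses every object by URI rather than by name. This is only an illustration under stated assumptions: ``walk_by_uri``, ``list_children`` and ``visit_file`` are hypothetical names, and the listing is assumed to have already been reduced to a mapping from child name to a (kind, URI) pair; a real client would obtain that mapping from the webapi directory listing.

```python
def walk_by_uri(dir_uri, list_children, visit_file):
    """Recursively visit files, addressing every object by URI, not by name.

    list_children(uri) returns a dict mapping child name to (kind, child_uri),
    where kind is "file" or "directory". Both callables are hypothetical
    stand-ins for real webapi requests.
    """
    seen = set()

    def recurse(uri, path):
        if uri in seen:
            return  # guard against loops in the directory graph
        seen.add(uri)
        for name, (kind, child_uri) in sorted(list_children(uri).items()):
            child_path = path + [name]
            if kind == "directory":
                # Recurse using the URI captured in the listing, so a later
                # rename or replacement of 'name' cannot redirect the walk.
                recurse(child_uri, child_path)
            else:
                visit_file(child_path, child_uri)

    recurse(dir_uri, [])


# An in-memory stand-in for a grid, with a loop from "docs/up" to the root.
grid = {
    "dir-root": {"docs": ("directory", "dir-docs"), "a.txt": ("file", "file-a")},
    "dir-docs": {"b.txt": ("file", "file-b"), "up": ("directory", "dir-root")},
}
found = []
walk_by_uri("dir-root", grid.__getitem__,
            lambda path, uri: found.append(("/".join(path), uri)))
```

Because the kind and the URI come from the same listing entry, the walker never downloads a directory as if it were a file, which is the surprise the name-based version is exposed to.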
+ +The read and write caps in a given directory node are separate URIs, and +can't be assumed to point to the same object even if they were retrieved in +the same operation (although the webapi server attempts to ensure this +in most cases). If you need to rely on that property, you should explicitly +verify it. More generally, you should not make assumptions about the +internal consistency of the contents of mutable directories. As a result +of the signatures on mutable object versions, it is guaranteed that a given +version was written in a single update, but -- as in the case of a file -- +the contents may have been chosen by a malicious writer in a way that is +designed to confuse applications that rely on their consistency. + +In general, use names if you want "whatever object (whether file or +directory) is found by following this name (or sequence of names) when my +request reaches the server". Use URIs if you want "this particular object". + +Concurrency Issues +================== + +Tahoe uses both mutable and immutable files. Mutable files can be created +explicitly by doing an upload with ?mutable=true added, or implicitly by +creating a new directory (since a directory is just a special way to +interpret a given mutable file). + +Mutable files suffer from the same consistency-vs-availability tradeoff that +all distributed data storage systems face. It is not possible to +simultaneously achieve perfect consistency and perfect availability in the +face of network partitions (servers being unreachable or faulty). + +Tahoe tries to achieve a reasonable compromise, but there is a basic rule in +place, known as the Prime Coordination Directive: "Don't Do That". What this +means is that if write-access to a mutable file is available to several +parties, then those parties are responsible for coordinating their activities +to avoid multiple simultaneous updates. 
This could be achieved by having +these parties talk to each other and using some sort of locking mechanism, or +by serializing all changes through a single writer. + +The consequences of performing uncoordinated writes can vary. Some of the +writers may lose their changes, as somebody else wins the race condition. In +many cases the file will be left in an "unhealthy" state, meaning that there +are not as many redundant shares as we would like (reducing the reliability +of the file against server failures). In the worst case, the file can be left +in such an unhealthy state that no version is recoverable, even the old ones. +It is this small possibility of data loss that prompts us to issue the Prime +Coordination Directive. + +Tahoe nodes implement internal serialization to make sure that a single Tahoe +node cannot conflict with itself. For example, it is safe to issue two +directory modification requests to a single tahoe node's webapi server at the +same time, because the Tahoe node will internally delay one of them until +after the other has finished being applied. (This feature was introduced in +Tahoe-1.1; back with Tahoe-1.0 the web client was responsible for serializing +web requests themselves). + +For more details, please see the "Consistency vs Availability" and "The Prime +Coordination Directive" sections of mutable.txt, in the same directory as +this file. + + +.. [1] URLs and HTTP and UTF-8, Oh My + + HTTP does not provide a mechanism to specify the character set used to + encode non-ascii names in URLs (rfc2396#2.1). We prefer the convention that + the filename= argument shall be a URL-encoded UTF-8 encoded unicode object. + For example, suppose we want to provoke the server into using a filename of + "f i a n c e-acute e" (i.e. F I A N C U+00E9 E). The UTF-8 encoding of this + is 0x66 0x69 0x61 0x6e 0x63 0xc3 0xa9 0x65 (or "fianc\xC3\xA9e", as python's + repr() function would show). 
To encode this into a URL, the non-printable + characters must be escaped with the urlencode '%XX' mechanism, giving us + "fianc%C3%A9e". Thus, the first line of the HTTP request will be "GET + /uri/CAP...?save=true&filename=fianc%C3%A9e HTTP/1.1". Not all browsers + provide this: IE7 uses the Latin-1 encoding, which is fianc%E9e. + + The response header will need to indicate a non-ASCII filename. The actual + mechanism to do this is not clear. For ASCII filenames, the response header + would look like:: + + Content-Disposition: attachment; filename="english.txt" + + If Tahoe were to enforce the utf-8 convention, it would need to decode the + URL argument into a unicode string, and then encode it back into a sequence + of bytes when creating the response header. One possibility would be to use + unencoded utf-8. Developers suggest that IE7 might accept this:: + + #1: Content-Disposition: attachment; filename="fianc\xC3\xA9e" + (note, the last four bytes of that line, not including the newline, are + 0xC3 0xA9 0x65 0x22) + + RFC2231#4 (dated 1997) suggests that the following might work, and some + developers (http://markmail.org/message/dsjyokgl7hv64ig3) have reported that + it is supported by Firefox (but not IE7):: + + #2: Content-Disposition: attachment; filename*=utf-8''fianc%C3%A9e + + My reading of RFC2616#19.5.1 (which defines Content-Disposition) says that + the filename= parameter is defined to be wrapped in quotes (presumably to + allow spaces without breaking the parsing of subsequent parameters), which + would give us:: + + #3: Content-Disposition: attachment; filename*=utf-8''"fianc%C3%A9e" + + However this is contrary to the examples in the email thread listed above.
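The encodings discussed in this footnote can be reproduced with Python's standard library. This is only a sketch of the client-side convention: the function names are illustrative, and nothing here reflects how the Tahoe server itself builds headers.

```python
from urllib.parse import quote


def encode_filename(name, encoding="utf-8"):
    """Percent-encode a filename for use in a filename= URL argument."""
    return quote(name, safe="", encoding=encoding)


def rfc2231_disposition(name):
    """Build the value of option #2 above (the RFC 2231 filename* form)."""
    return "attachment; filename*=utf-8''" + encode_filename(name)


# The preferred UTF-8 convention versus the Latin-1 form attributed to IE7.
utf8_form = encode_filename("fianc\u00e9e")
latin1_form = encode_filename("fianc\u00e9e", encoding="latin-1")
```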
+ + Developers report that IE7 (when it is configured for UTF-8 URL encoding, + which is not the default in asian countries), will accept:: + + #4: Content-Disposition: attachment; filename=fianc%C3%A9e + + However, for maximum compatibility, Tahoe simply copies bytes from the URL + into the response header, rather than enforcing the utf-8 convention. This + means it does not try to decode the filename from the URL argument, nor does + it encode the filename into the response header. diff --git a/docs/frontends/webapi.txt b/docs/frontends/webapi.txt deleted file mode 100644 index 31924bcc..00000000 --- a/docs/frontends/webapi.txt +++ /dev/null @@ -1,1963 +0,0 @@ -========================== -The Tahoe REST-ful Web API -========================== - -1. `Enabling the web-API port`_ -2. `Basic Concepts: GET, PUT, DELETE, POST`_ -3. `URLs`_ - - 1. `Child Lookup`_ - -4. `Slow Operations, Progress, and Cancelling`_ -5. `Programmatic Operations`_ - - 1. `Reading a file`_ - 2. `Writing/Uploading a File`_ - 3. `Creating a New Directory`_ - 4. `Get Information About A File Or Directory (as JSON)`_ - 5. `Attaching an existing File or Directory by its read- or write-cap`_ - 6. `Adding multiple files or directories to a parent directory at once`_ - 7. `Deleting a File or Directory`_ - -6. `Browser Operations: Human-Oriented Interfaces`_ - - 1. `Viewing A Directory (as HTML)`_ - 2. `Viewing/Downloading a File`_ - 3. `Get Information About A File Or Directory (as HTML)`_ - 4. `Creating a Directory`_ - 5. `Uploading a File`_ - 6. `Attaching An Existing File Or Directory (by URI)`_ - 7. `Deleting A Child`_ - 8. `Renaming A Child`_ - 9. `Other Utilities`_ - 10. `Debugging and Testing Features`_ - -7. `Other Useful Pages`_ -8. `Static Files in /public_html`_ -9. `Safety and security issues -- names vs. URIs`_ -10. `Concurrency Issues`_ - -Enabling the web-API port -========================= - -Every Tahoe node is capable of running a built-in HTTP server. 
To enable -this, just write a port number into the "[node]web.port" line of your node's -tahoe.cfg file. For example, writing "web.port = 3456" into the "[node]" -section of $NODEDIR/tahoe.cfg will cause the node to run a webserver on port -3456. - -This string is actually a Twisted "strports" specification, meaning you can -get more control over the interface to which the server binds by supplying -additional arguments. For more details, see the documentation on -`twisted.application.strports -`_. - -Writing "tcp:3456:interface=127.0.0.1" into the web.port line does the same -but binds to the loopback interface, ensuring that only the programs on the -local host can connect. Using "ssl:3456:privateKey=mykey.pem:certKey=cert.pem" -runs an SSL server. - -This webport can be set when the node is created by passing a --webport -option to the 'tahoe create-node' command. By default, the node listens on -port 3456, on the loopback (127.0.0.1) interface. - -Basic Concepts: GET, PUT, DELETE, POST -====================================== - -As described in `architecture.rst`_, each file and directory in a Tahoe virtual -filesystem is referenced by an identifier that combines the designation of -the object with the authority to do something with it (such as read or modify -the contents). This identifier is called a "read-cap" or "write-cap", -depending upon whether it enables read-only or read-write access. These -"caps" are also referred to as URIs. - -.. _architecture.rst: http://tahoe-lafs.org/source/tahoe-lafs/trunk/docs/architecture.rst - -The Tahoe web-based API is "REST-ful", meaning it implements the concepts of -"REpresentational State Transfer": the original scheme by which the World -Wide Web was intended to work. Each object (file or directory) is referenced -by a URL that includes the read- or write- cap. HTTP methods (GET, PUT, and -DELETE) are used to manipulate these objects. You can think of the URL as a -noun, and the method as a verb. 
- -In REST, the GET method is used to retrieve information about an object, or -to retrieve some representation of the object itself. When the object is a -file, the basic GET method will simply return the contents of that file. -Other variations (generally implemented by adding query parameters to the -URL) will return information about the object, such as metadata. GET -operations are required to have no side-effects. - -PUT is used to upload new objects into the filesystem, or to replace an -existing object. DELETE is used to delete objects from the filesystem. Both -PUT and DELETE are required to be idempotent: performing the same operation -multiple times must have the same side-effects as only performing it once. - -POST is used for more complicated actions that cannot be expressed as a GET, -PUT, or DELETE. POST operations can be thought of as a method call: sending -some message to the object referenced by the URL. In Tahoe, POST is also used -for operations that must be triggered by an HTML form (including upload and -delete), because otherwise a regular web browser has no way to accomplish -these tasks. In general, everything that can be done with a PUT or DELETE can -also be done with a POST. - -Tahoe's web API is designed for two different kinds of consumer. The first is -a program that needs to manipulate the virtual file system. Such programs are -expected to use the RESTful interface described above. The second is a human -using a standard web browser to work with the filesystem. This user is given -a series of HTML pages with links to download files, and forms that use POST -actions to upload, rename, and delete files.
- -When an error occurs, the HTTP response code will be set to an appropriate -400-series code (like 404 Not Found for an unknown childname, or 400 Bad Request -when the parameters to a webapi operation are invalid), and the HTTP response -body will usually contain a few lines of explanation as to the cause of the -error and possible responses. Unusual exceptions may result in a 500 Internal -Server Error as a catch-all, with a default response body containing -a Nevow-generated HTML-ized representation of the Python exception stack trace -that caused the problem. CLI programs which want to copy the response body to -stderr should provide an "Accept: text/plain" header to their requests to get -a plain text stack trace instead. If the Accept header contains ``*/*``, or -``text/*``, or text/html (or if there is no Accept header), HTML tracebacks will -be generated. - -URLs -==== - -Tahoe uses a variety of read- and write- caps to identify files and -directories. The most common of these is the "immutable file read-cap", which -is used for most uploaded files. These read-caps look like the following:: - - URI:CHK:ime6pvkaxuetdfah2p2f35pe54:4btz54xk3tew6nd4y2ojpxj4m6wxjqqlwnztgre6gnjgtucd5r4a:3:10:202 - -The next most common is a "directory write-cap", which provides both read and -write access to a directory, and look like this:: - - URI:DIR2:djrdkfawoqihigoett4g6auz6a:jx5mplfpwexnoqff7y5e4zjus4lidm76dcuarpct7cckorh2dpgq - -There are also "directory read-caps", which start with "URI:DIR2-RO:", and -give read-only access to a directory. Finally there are also mutable file -read- and write- caps, which start with "URI:SSK", and give access to mutable -files. - -(Later versions of Tahoe will make these strings shorter, and will remove the -unfortunate colons, which must be escaped when these caps are embedded in -URLs.) 
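Because the colons in a cap must be escaped when the cap is embedded in a URL, a client typically percent-encodes the whole cap before appending it to the node's `/uri/` prefix. The helper below is a small sketch of that step; the function name is illustrative, and the default base assumes a local node on the default webport 3456. Optional child names are encoded the same way, as UTF-8.

```python
from urllib.parse import quote


def cap_url(cap, *children, base="http://127.0.0.1:3456"):
    """Return a web-API URL for a cap, optionally followed by child names.

    safe="" makes quote() escape ":" and "/" too, so each cap or child
    name stays a single path segment.
    """
    segments = [quote(cap, safe="")] + [quote(c, safe="") for c in children]
    return base + "/uri/" + "/".join(segments)


dir_url = cap_url(
    "URI:DIR2:djrdkfawoqihigoett4g6auz6a:"
    "jx5mplfpwexnoqff7y5e4zjus4lidm76dcuarpct7cckorh2dpgq"
)
```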
- -To refer to any Tahoe object through the web API, you simply need to combine -a prefix (which indicates the HTTP server to use) with the cap (which -indicates which object inside that server to access). Since the default Tahoe -webport is 3456, the most common prefix is one that will use a local node -listening on this port:: - - http://127.0.0.1:3456/uri/ + $CAP - -So, to access the directory named above (which happens to be the -publically-writeable sample directory on the Tahoe test grid, described at -http://allmydata.org/trac/tahoe/wiki/TestGrid), the URL would be:: - - http://127.0.0.1:3456/uri/URI%3ADIR2%3Adjrdkfawoqihigoett4g6auz6a%3Ajx5mplfpwexnoqff7y5e4zjus4lidm76dcuarpct7cckorh2dpgq/ - -(note that the colons in the directory-cap are url-encoded into "%3A" -sequences). - -Likewise, to access the file named above, use:: - - http://127.0.0.1:3456/uri/URI%3ACHK%3Aime6pvkaxuetdfah2p2f35pe54%3A4btz54xk3tew6nd4y2ojpxj4m6wxjqqlwnztgre6gnjgtucd5r4a%3A3%3A10%3A202 - -In the rest of this document, we'll use "$DIRCAP" as shorthand for a read-cap -or write-cap that refers to a directory, and "$FILECAP" to abbreviate a cap -that refers to a file (whether mutable or immutable). So those URLs above can -be abbreviated as:: - - http://127.0.0.1:3456/uri/$DIRCAP/ - http://127.0.0.1:3456/uri/$FILECAP - -The operation summaries below will abbreviate these further, by eliding the -server prefix. They will be displayed like this:: - - /uri/$DIRCAP/ - /uri/$FILECAP - - -Child Lookup ------------- - -Tahoe directories contain named child entries, just like directories in a regular -local filesystem. These child entries, called "dirnodes", consist of a name, -metadata, a write slot, and a read slot. The write and read slots normally contain -a write-cap and read-cap referring to the same object, which can be either a file -or a subdirectory. The write slot may be empty (actually, both may be empty, -but that is unusual). 
- -If you have a Tahoe URL that refers to a directory, and want to reference a -named child inside it, just append the child name to the URL. For example, if -our sample directory contains a file named "welcome.txt", we can refer to -that file with:: - - http://127.0.0.1:3456/uri/$DIRCAP/welcome.txt - -(or http://127.0.0.1:3456/uri/URI%3ADIR2%3Adjrdkfawoqihigoett4g6auz6a%3Ajx5mplfpwexnoqff7y5e4zjus4lidm76dcuarpct7cckorh2dpgq/welcome.txt) - -Multiple levels of subdirectories can be handled this way:: - - http://127.0.0.1:3456/uri/$DIRCAP/tahoe-source/docs/webapi.txt - -In this document, when we need to refer to a URL that references a file using -this child-of-some-directory format, we'll use the following string:: - - /uri/$DIRCAP/[SUBDIRS../]FILENAME - -The "[SUBDIRS../]" part means that there are zero or more (optional) -subdirectory names in the middle of the URL. The "FILENAME" at the end means -that this whole URL refers to a file of some sort, rather than to a -directory. - -When we need to refer specifically to a directory in this way, we'll write:: - - /uri/$DIRCAP/[SUBDIRS../]SUBDIR - - -Note that all components of pathnames in URLs are required to be UTF-8 -encoded, so "resume.doc" (with an acute accent on both E's) would be accessed -with:: - - http://127.0.0.1:3456/uri/$DIRCAP/r%C3%A9sum%C3%A9.doc - -Also note that the filenames inside upload POST forms are interpreted using -whatever character set was provided in the conventional '_charset' field, and -defaults to UTF-8 if not otherwise specified. The JSON representation of each -directory contains native unicode strings. Tahoe directories are specified to -contain unicode filenames, and cannot contain binary strings that are not -representable as such. - -All Tahoe operations that refer to existing files or directories must include -a suitable read- or write- cap in the URL: the webapi server won't add one -for you. If you don't know the cap, you can't access the file. 
This allows -the security properties of Tahoe caps to be extended across the webapi -interface. - -Slow Operations, Progress, and Cancelling -========================================= - -Certain operations can be expected to take a long time. The "t=deep-check", -described below, will recursively visit every file and directory reachable -from a given starting point, which can take minutes or even hours for -extremely large directory structures. A single long-running HTTP request is a -fragile thing: proxies, NAT boxes, browsers, and users may all grow impatient -with waiting and give up on the connection. - -For this reason, long-running operations have an "operation handle", which -can be used to poll for status/progress messages while the operation -proceeds. This handle can also be used to cancel the operation. These handles -are created by the client, and passed in as an "ophandle=" query argument -to the POST or PUT request which starts the operation. The following -operations can then be used to retrieve status: - -``GET /operations/$HANDLE?output=HTML (with or without t=status)`` - -``GET /operations/$HANDLE?output=JSON (same)`` - - These two retrieve the current status of the given operation. Each operation - presents a different sort of information, but in general the page retrieved - will indicate: - - * whether the operation is complete, or if it is still running - * how much of the operation is complete, and how much is left, if possible - - Note that the final status output can be quite large: a deep-manifest of a - directory structure with 300k directories and 200k unique files is about - 275MB of JSON, and might take two minutes to generate. For this reason, the - full status is not provided until the operation has completed. - - The HTML form will include a meta-refresh tag, which will cause a regular - web browser to reload the status page about 60 seconds later. This tag will - be removed once the operation has completed.
- - There may be more status information available under - /operations/$HANDLE/$ETC : i.e., the handle forms the root of a URL space. - -``POST /operations/$HANDLE?t=cancel`` - - This terminates the operation, and returns an HTML page explaining what was - cancelled. If the operation handle has already expired (see below), this - POST will return a 404, which indicates that the operation is no longer - running (either it was completed or terminated). The response body will be - the same as a GET /operations/$HANDLE on this operation handle, and the - handle will be expired immediately afterwards. - -The operation handle will eventually expire, to avoid consuming an unbounded -amount of memory. The handle's time-to-live can be reset at any time, by -passing a retain-for= argument (with a count of seconds) to either the -initial POST that starts the operation, or the subsequent GET request which -asks about the operation. For example, if a 'GET -/operations/$HANDLE?output=JSON&retain-for=600' query is performed, the -handle will remain active for 600 seconds (10 minutes) after the GET was -received. - -In addition, if the GET includes a release-after-complete=True argument, and -the operation has completed, the operation handle will be released -immediately. - -If a retain-for= argument is not used, the default handle lifetimes are: - - * handles will remain valid at least until their operation finishes - * uncollected handles for finished operations (i.e. handles for - operations that have finished but for which the GET page has not been - accessed since completion) will remain valid for four days, or for - the total time consumed by the operation, whichever is greater. - * collected handles (i.e. the GET page has been retrieved at least once - since the operation completed) will remain valid for one day. - -Many "slow" operations can begin to use unacceptable amounts of memory when -operating on large directory structures. 
The memory usage increases when the -ophandle is polled, as the results must be copied into a JSON string, sent -over the wire, then parsed by a client. So, as an alternative, many "slow" -operations have streaming equivalents. These equivalents do not use operation -handles. Instead, they emit line-oriented status results immediately. Client -code can cancel the operation by simply closing the HTTP connection. - -Programmatic Operations -======================= - -Now that we know how to build URLs that refer to files and directories in a -Tahoe virtual filesystem, what sorts of operations can we do with those URLs? -This section contains a catalog of GET, PUT, DELETE, and POST operations that -can be performed on these URLs. This set of operations are aimed at programs -that use HTTP to communicate with a Tahoe node. A later section describes -operations that are intended for web browsers. - -Reading A File --------------- - -``GET /uri/$FILECAP`` - -``GET /uri/$DIRCAP/[SUBDIRS../]FILENAME`` - - This will retrieve the contents of the given file. The HTTP response body - will contain the sequence of bytes that make up the file. - - To view files in a web browser, you may want more control over the - Content-Type and Content-Disposition headers. Please see the next section - "Browser Operations", for details on how to modify these URLs for that - purpose. - -Writing/Uploading A File ------------------------- - -``PUT /uri/$FILECAP`` - -``PUT /uri/$DIRCAP/[SUBDIRS../]FILENAME`` - - Upload a file, using the data from the HTTP request body, and add whatever - child links and subdirectories are necessary to make the file available at - the given location. Once this operation succeeds, a GET on the same URL will - retrieve the same contents that were just uploaded. This will create any - necessary intermediate subdirectories. - - To use the /uri/$FILECAP form, $FILECAP must be a write-cap for a mutable file. 
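As an alternative to `curl`, the directory-relative PUT form can be prepared with Python's standard library. This is a hedged sketch: the function name and the `application/octet-stream` header are choices made for the illustration (the header only stops `urllib` from adding its default form content type), and the request is built but not sent.

```python
from urllib.parse import quote
from urllib.request import Request


def put_request(dircap, path, data, base="http://127.0.0.1:3456"):
    """Build a PUT /uri/$DIRCAP/[SUBDIRS../]FILENAME request.

    path is a list of child names (subdirectories, then the filename);
    data is the file contents as bytes.
    """
    segments = [quote(dircap, safe="")] + [quote(p, safe="") for p in path]
    return Request(
        base + "/uri/" + "/".join(segments),
        data=data,
        method="PUT",
        # Illustrative choice: avoid urllib's default form content type.
        headers={"Content-Type": "application/octet-stream"},
    )
```

Passing the result to `urllib.request.urlopen()` against a running node would perform the upload.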
- - In the /uri/$DIRCAP/[SUBDIRS../]FILENAME form, if the target file is a - writeable mutable file, that file's contents will be overwritten in-place. If - it is a read-cap for a mutable file, an error will occur. If it is an - immutable file, the old file will be discarded, and a new one will be put in - its place. - - When creating a new file, if "mutable=true" is in the query arguments, the - operation will create a mutable file instead of an immutable one. - - This returns the file-cap of the resulting file. If a new file was created - by this method, the HTTP response code (as dictated by rfc2616) will be set - to 201 CREATED. If an existing file was replaced or modified, the response - code will be 200 OK. - - Note that the 'curl -T localfile http://127.0.0.1:3456/uri/$DIRCAP/foo.txt' - command can be used to invoke this operation. - -``PUT /uri`` - - This uploads a file, and produces a file-cap for the contents, but does not - attach the file into the filesystem. No directories will be modified by - this operation. The file-cap is returned as the body of the HTTP response. - - If "mutable=true" is in the query arguments, the operation will create a - mutable file, and return its write-cap in the HTTP response. The default is - to create an immutable file, returning the read-cap as a response. - -Creating A New Directory ------------------------- - -``POST /uri?t=mkdir`` - -``PUT /uri?t=mkdir`` - - Create a new empty directory and return its write-cap as the HTTP response - body. This does not make the newly created directory visible from the - filesystem. The "PUT" operation is provided for backwards compatibility: - new code should use POST. - -``POST /uri?t=mkdir-with-children`` - - Create a new directory, populated with a set of child nodes, and return its - write-cap as the HTTP response body. The new directory is not attached to - any other directory: the returned write-cap is the only reference to it.
- - Initial children are provided as the body of the POST form (this is more - efficient than doing separate mkdir and set_children operations). If the - body is empty, the new directory will be empty. If not empty, the body will - be interpreted as a UTF-8 JSON-encoded dictionary of children with which the - new directory should be populated, using the same format as would be - returned in the 'children' value of the t=json GET request, described below. - Each dictionary key should be a child name, and each value should be a list - of [TYPE, PROPDICT], where PROPDICT contains "rw_uri", "ro_uri", and - "metadata" keys (all others are ignored). For example, the PUT request body - could be:: - - { - "Fran\u00e7ais": [ "filenode", { - "ro_uri": "URI:CHK:...", - "size": bytes, - "metadata": { - "ctime": 1202777696.7564139, - "mtime": 1202777696.7564139, - "tahoe": { - "linkcrtime": 1202777696.7564139, - "linkmotime": 1202777696.7564139 - } } } ], - "subdir": [ "dirnode", { - "rw_uri": "URI:DIR2:...", - "ro_uri": "URI:DIR2-RO:...", - "metadata": { - "ctime": 1202778102.7589991, - "mtime": 1202778111.2160511, - "tahoe": { - "linkcrtime": 1202777696.7564139, - "linkmotime": 1202777696.7564139 - } } } ] - } - - For forward-compatibility, a mutable directory can also contain caps in - a format that is unknown to the webapi server. When such caps are retrieved - from a mutable directory in a "ro_uri" field, they will be prefixed with - the string "ro.", indicating that they must not be decoded without - checking that they are read-only. The "ro." prefix must not be stripped - off without performing this check. (Future versions of the webapi server - will perform it where necessary.) 
- - If both the "rw_uri" and "ro_uri" fields are present in a given PROPDICT, - and the webapi server recognizes the rw_uri as a write cap, then it will - reset the ro_uri to the corresponding read cap and discard the original - contents of ro_uri (in order to ensure that the two caps correspond to the - same object and that the ro_uri is in fact read-only). However this may not - happen for caps in a format unknown to the webapi server. Therefore, when - writing a directory the webapi client should ensure that the contents - of "rw_uri" and "ro_uri" for a given PROPDICT are a consistent - (write cap, read cap) pair if possible. If the webapi client only has - one cap and does not know whether it is a write cap or read cap, then - it is acceptable to set "rw_uri" to that cap and omit "ro_uri". The - client must not put a write cap into a "ro_uri" field. - - The metadata may have a "no-write" field. If this is set to true in the - metadata of a link, it will not be possible to open that link for writing - via the SFTP frontend; see `FTP-and-SFTP.rst`_ for details. - Also, if the "no-write" field is set to true in the metadata of a link to - a mutable child, it will cause the link to be diminished to read-only. - - .. _FTP-and-SFTP.rst: http://tahoe-lafs.org/source/tahoe-lafs/trunk/docs/frontents/FTP-and-SFTP.rst - - Note that the webapi-using client application must not provide the - "Content-Type: multipart/form-data" header that usually accompanies HTML - form submissions, since the body is not formatted this way. Doing so will - cause a server error as the lower-level code misparses the request body. - - Child file names should each be expressed as a unicode string, then used as - keys of the dictionary. The dictionary should then be converted into JSON, - and the resulting string encoded into UTF-8. This UTF-8 bytestring should - then be used as the POST body. 
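The body-construction steps described above (unicode child names as keys, JSON encoding, then UTF-8) can be sketched as follows. The function name is illustrative. Dropping keys other than "rw_uri", "ro_uri", and "metadata" is optional, since the server is documented to ignore them; it is done here only to keep the body small.

```python
import json

KEPT_KEYS = ("rw_uri", "ro_uri", "metadata")


def children_body(children):
    """Encode {name: (TYPE, PROPDICT)} as a t=mkdir-with-children POST body."""
    body = {}
    for name, (nodetype, props) in children.items():
        kept = {k: v for k, v in props.items() if k in KEPT_KEYS}
        body[name] = [nodetype, kept]
    # json.dumps returns str; the request body must be UTF-8 bytes.
    return json.dumps(body).encode("utf-8")
```

The resulting bytes would be sent as the raw POST body, without a "Content-Type: multipart/form-data" header.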
- -``POST /uri?t=mkdir-immutable`` - - Like t=mkdir-with-children above, but the new directory will be - deep-immutable. This means that the directory itself is immutable, and that - it can only contain objects that are treated as being deep-immutable, like - immutable files, literal files, and deep-immutable directories. - - For forward-compatibility, a deep-immutable directory can also contain caps - in a format that is unknown to the webapi server. When such caps are retrieved - from a deep-immutable directory in a "ro_uri" field, they will be prefixed - with the string "imm.", indicating that they must not be decoded without - checking that they are immutable. The "imm." prefix must not be stripped - off without performing this check. (Future versions of the webapi server - will perform it where necessary.) - - The cap for each child may be given either in the "rw_uri" or "ro_uri" - field of the PROPDICT (not both). If a cap is given in the "rw_uri" field, - then the webapi server will check that it is an immutable read-cap of a - *known* format, and give an error if it is not. If a cap is given in the - "ro_uri" field, then the webapi server will still check whether known - caps are immutable, but for unknown caps it will simply assume that the - cap can be stored, as described above. Note that an attacker would be - able to store any cap in an immutable directory, so this check when - creating the directory is only to help non-malicious clients to avoid - accidentally giving away more authority than intended. - - A non-empty request body is mandatory, since after the directory is created, - it will not be possible to add more children to it. - -``POST /uri/$DIRCAP/[SUBDIRS../]SUBDIR?t=mkdir`` - -``PUT /uri/$DIRCAP/[SUBDIRS../]SUBDIR?t=mkdir`` - - Create new directories as necessary to make sure that the named target - ($DIRCAP/SUBDIRS../SUBDIR) is a directory. This will create additional - intermediate mutable directories as necessary. 
If the named target directory - already exists, this will make no changes to it. - - If the final directory is created, it will be empty. - - This operation will return an error if a blocking file is present at any of - the parent names, preventing the server from creating the necessary parent - directory; or if it would require changing an immutable directory. - - The write-cap of the new directory will be returned as the HTTP response - body. - -``POST /uri/$DIRCAP/[SUBDIRS../]SUBDIR?t=mkdir-with-children`` - - Like /uri?t=mkdir-with-children, but the final directory is created as a - child of an existing mutable directory. This will create additional - intermediate mutable directories as necessary. If the final directory is - created, it will be populated with initial children from the POST request - body, as described above. - - This operation will return an error if a blocking file is present at any of - the parent names, preventing the server from creating the necessary parent - directory; or if it would require changing an immutable directory; or if - the immediate parent directory already has a child named SUBDIR. - -``POST /uri/$DIRCAP/[SUBDIRS../]SUBDIR?t=mkdir-immutable`` - - Like /uri?t=mkdir-immutable, but the final directory is created as a child - of an existing mutable directory. The final directory will be deep-immutable, - and will be populated with the children specified as a JSON dictionary in - the POST request body. - - In Tahoe 1.6 this operation creates intermediate mutable directories if - necessary, but that behaviour should not be relied on; see ticket #920. - - This operation will return an error if the parent directory is immutable, - or already has a child named SUBDIR. - -``POST /uri/$DIRCAP/[SUBDIRS../]?t=mkdir&name=NAME`` - - Create a new empty mutable directory and attach it to the given existing - directory. This will create additional intermediate directories as necessary.
- - This operation will return an error if a blocking file is present at any of - the parent names, preventing the server from creating the necessary parent - directory, or if it would require changing any immutable directory. - - The URL of this operation points to the parent of the bottommost new directory, - whereas the /uri/$DIRCAP/[SUBDIRS../]SUBDIR?t=mkdir operation above has a URL - that points directly to the bottommost new directory. - -``POST /uri/$DIRCAP/[SUBDIRS../]?t=mkdir-with-children&name=NAME`` - - Like /uri/$DIRCAP/[SUBDIRS../]?t=mkdir&name=NAME, but the new directory will - be populated with initial children via the POST request body. This command - will create additional intermediate mutable directories as necessary. - - This operation will return an error if a blocking file is present at any of - the parent names, preventing the server from creating the necessary parent - directory; or if it would require changing an immutable directory; or if - the immediate parent directory already has a child named NAME. - - Note that the name= argument must be passed as a queryarg, because the POST - request body is used for the initial children JSON. - -``POST /uri/$DIRCAP/[SUBDIRS../]?t=mkdir-immutable&name=NAME`` - - Like /uri/$DIRCAP/[SUBDIRS../]?t=mkdir-with-children&name=NAME, but the - final directory will be deep-immutable. The children are specified as a - JSON dictionary in the POST request body. Again, the name= argument must be - passed as a queryarg. - - In Tahoe 1.6 this operation creates intermediate mutable directories if - necessary, but that behaviour should not be relied on; see ticket #920. - - This operation will return an error if the parent directory is immutable, - or already has a child named NAME.
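For the name= variants above, the new directory's name travels in the query string rather than the path. A small sketch of building such a URL follows; the helper name and the default base (a local node on port 3456) are assumptions for the illustration, and `urlencode` is used so that non-ASCII names are percent-encoded as UTF-8.

```python
from urllib.parse import quote, urlencode


def mkdir_child_url(dircap, name, t="mkdir", base="http://127.0.0.1:3456"):
    """Build the URL for POST /uri/$DIRCAP/?t=...&name=NAME."""
    query = urlencode({"t": t, "name": name})
    return base + "/uri/" + quote(dircap, safe="") + "/?" + query
```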
- -Get Information About A File Or Directory (as JSON) ---------------------------------------------------- - -``GET /uri/$FILECAP?t=json`` - -``GET /uri/$DIRCAP?t=json`` - -``GET /uri/$DIRCAP/[SUBDIRS../]SUBDIR?t=json`` - -``GET /uri/$DIRCAP/[SUBDIRS../]FILENAME?t=json`` - - This returns a machine-parseable JSON-encoded description of the given - object. The JSON always contains a list, and the first element of the list is - always a flag that indicates whether the referenced object is a file or a - directory. If it is a capability to a file, then the information includes - file size and URI, like this:: - - GET /uri/$FILECAP?t=json : - - [ "filenode", { - "ro_uri": file_uri, - "verify_uri": verify_uri, - "size": bytes, - "mutable": false - } ] - - If it is a capability to a directory followed by a path from that directory - to a file, then the information also includes metadata from the link to the - file in the parent directory, like this:: - - GET /uri/$DIRCAP/[SUBDIRS../]FILENAME?t=json - - [ "filenode", { - "ro_uri": file_uri, - "verify_uri": verify_uri, - "size": bytes, - "mutable": false, - "metadata": { - "ctime": 1202777696.7564139, - "mtime": 1202777696.7564139, - "tahoe": { - "linkcrtime": 1202777696.7564139, - "linkmotime": 1202777696.7564139 - } } } ] - - If it is a directory, then it includes information about the children of - this directory, as a mapping from child name to a set of data about the - child (the same data that would appear in a corresponding GET?t=json of the - child itself). The child entries also include metadata about each child, - including link-creation- and link-change- timestamps. 
The output looks like - this:: - - GET /uri/$DIRCAP?t=json : - GET /uri/$DIRCAP/[SUBDIRS../]SUBDIR?t=json : - - [ "dirnode", { - "rw_uri": read_write_uri, - "ro_uri": read_only_uri, - "verify_uri": verify_uri, - "mutable": true, - "children": { - "foo.txt": [ "filenode", { - "ro_uri": uri, - "size": bytes, - "metadata": { - "ctime": 1202777696.7564139, - "mtime": 1202777696.7564139, - "tahoe": { - "linkcrtime": 1202777696.7564139, - "linkmotime": 1202777696.7564139 - } } } ], - "subdir": [ "dirnode", { - "rw_uri": rwuri, - "ro_uri": rouri, - "metadata": { - "ctime": 1202778102.7589991, - "mtime": 1202778111.2160511, - "tahoe": { - "linkcrtime": 1202777696.7564139, - "linkmotime": 1202777696.7564139 - } } } ] - } } ] - - In the above example, note how 'children' is a dictionary in which the keys - are child names and the values depend upon whether the child is a file or a - directory. The value is mostly the same as the JSON representation of the - child object (except that directories do not recurse -- the "children" - entry of the child is omitted, and the directory view includes the metadata - that is stored on the directory edge). - - The rw_uri field will be present in the information about a directory - if and only if you have read-write access to that directory. The verify_uri - field will be present if and only if the object has a verify-cap - (non-distributed LIT files do not have verify-caps). 
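As a rough sketch of how a client might consume this structure, the snippet below unpacks the two-element list and applies the presence rules for rw_uri and verify_uri described above. The function name `summarize_node` and the sample caps are illustrative placeholders, not part of the web API.

```python
import json


def summarize_node(json_text):
    """Summarize a t=json response: node type, cap availability, children."""
    # The response is always a list: [type_flag, info_dict].
    nodetype, info = json.loads(json_text)
    summary = {
        "type": nodetype,
        # Per the text above: rw_uri is only present with read-write access.
        "has_rw_uri": "rw_uri" in info,
        # LIT files have no verify-cap, so this key may be missing.
        "has_verify_cap": "verify_uri" in info,
    }
    if nodetype == "dirnode":
        # Each child value is itself a [type_flag, info_dict] pair.
        summary["children"] = {
            name: child[0] for name, child in info.get("children", {}).items()
        }
    return summary


SAMPLE = """
[ "dirnode", {
    "ro_uri": "URI:DIR2-RO:placeholder",
    "verify_uri": "URI:DIR2-Verifier:placeholder",
    "mutable": true,
    "children": {
      "foo.txt": [ "filenode", { "ro_uri": "URI:CHK:placeholder", "size": 12 } ],
      "subdir": [ "dirnode", { "ro_uri": "URI:DIR2-RO:other" } ]
    } } ]
"""

result = summarize_node(SAMPLE)
print(result)
```

In this sample the directory was fetched through a read-only cap, so no rw_uri is present and `has_rw_uri` comes out False.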
- - If the cap is of an unknown format, then the file size and verify_uri will - not be available:: - - GET /uri/$UNKNOWNCAP?t=json : - - [ "unknown", { - "ro_uri": unknown_read_uri - } ] - - GET /uri/$DIRCAP/[SUBDIRS../]UNKNOWNCHILDNAME?t=json : - - [ "unknown", { - "rw_uri": unknown_write_uri, - "ro_uri": unknown_read_uri, - "mutable": true, - "metadata": { - "ctime": 1202777696.7564139, - "mtime": 1202777696.7564139, - "tahoe": { - "linkcrtime": 1202777696.7564139, - "linkmotime": 1202777696.7564139 - } } } ] - - As in the case of file nodes, the metadata will only be present when the - capability is to a directory followed by a path. The "mutable" field is also - not always present; when it is absent, the mutability of the object is not - known. - -About the metadata -`````````````````` - -The value of the 'tahoe':'linkmotime' key is updated whenever a link to a -child is set. The value of the 'tahoe':'linkcrtime' key is updated whenever -a link to a child is created -- i.e. when there was not previously a link -under that name. - -Note however, that if the edge in the Tahoe filesystem points to a mutable -file and the contents of that mutable file is changed, then the -'tahoe':'linkmotime' value on that edge will *not* be updated, since the -edge itself wasn't updated -- only the mutable file was. - -The timestamps are represented as a number of seconds since the UNIX epoch -(1970-01-01 00:00:00 UTC), with leap seconds not being counted in the long -term. - -In Tahoe earlier than v1.4.0, 'mtime' and 'ctime' keys were populated -instead of the 'tahoe':'linkmotime' and 'tahoe':'linkcrtime' keys. Starting -in Tahoe v1.4.0, the 'linkmotime'/'linkcrtime' keys in the 'tahoe' sub-dict -are populated. However, prior to Tahoe v1.7beta, a bug caused the 'tahoe' -sub-dict to be deleted by webapi requests in which new metadata is -specified, and not to be added to existing child links that lack it. 
- -From Tahoe v1.7.0 onward, the 'mtime' and 'ctime' fields are no longer -populated or updated (see ticket #924), except by "tahoe backup" as -explained below. For backward compatibility, when an existing link is -updated and 'tahoe':'linkcrtime' is not present in the previous metadata -but 'ctime' is, the old value of 'ctime' is used as the new value of -'tahoe':'linkcrtime'. - -The reason we added the new fields in Tahoe v1.4.0 is that there is a -"set_children" API (described below) which you can use to overwrite the -values of the 'mtime'/'ctime' pair, and this API is used by the -"tahoe backup" command (in Tahoe v1.3.0 and later) to set the 'mtime' and -'ctime' values when backing up files from a local filesystem into the -Tahoe filesystem. As of Tahoe v1.4.0, the set_children API cannot be used -to set anything under the 'tahoe' key of the metadata dict -- if you -include 'tahoe' keys in your 'metadata' arguments then it will silently -ignore those keys. - -Therefore, if the 'tahoe' sub-dict is present, you can rely on the -'linkcrtime' and 'linkmotime' values therein to have the semantics described -above. (This is assuming that only official Tahoe clients have been used to -write those links, and that their system clocks were set to what you expected --- there is nothing preventing someone from editing their Tahoe client or -writing their own Tahoe client which would overwrite those values however -they like, and there is nothing to constrain their system clock from taking -any value.) - -When an edge is created or updated by "tahoe backup", the 'mtime' and -'ctime' keys on that edge are set as follows: - -* 'mtime' is set to the timestamp read from the local filesystem for the - "mtime" of the local file in question, which means the last time the - contents of that file were changed. - -* On Windows, 'ctime' is set to the creation timestamp for the file - read from the local filesystem. 
On other platforms, 'ctime' is set to - the UNIX "ctime" of the local file, which means the last time that - either the contents or the metadata of the local file was changed. - -There are several ways that the 'ctime' field could be confusing: - -1. You might be confused about whether it reflects the time of the creation - of a link in the Tahoe filesystem (by a version of Tahoe < v1.7.0) or a - timestamp copied in by "tahoe backup" from a local filesystem. - -2. You might be confused about whether it is a copy of the file creation - time (if "tahoe backup" was run on a Windows system) or of the last - contents-or-metadata change (if "tahoe backup" was run on a different - operating system). - -3. You might be confused by the fact that changing the contents of a - mutable file in Tahoe doesn't have any effect on any links pointing at - that file in any directories, although "tahoe backup" sets the link - 'ctime'/'mtime' to reflect timestamps about the local file corresponding - to the Tahoe file to which the link points. - -4. Also, quite apart from Tahoe, you might be confused about the meaning - of the "ctime" in UNIX local filesystems, which people sometimes think - means file creation time, but which actually means, in UNIX local - filesystems, the most recent time that the file contents or the file - metadata (such as owner, permission bits, extended attributes, etc.) - has changed. Note that although "ctime" does not mean file creation time - in UNIX, links created by a version of Tahoe prior to v1.7.0, and never - written by "tahoe backup", will have 'ctime' set to the link creation - time. - - -Attaching an existing File or Directory by its read- or write-cap ------------------------------------------------------------------ - -``PUT /uri/$DIRCAP/[SUBDIRS../]CHILDNAME?t=uri`` - - This attaches a child object (either a file or directory) to a specified - location in the virtual filesystem. 
The child object is referenced by its - read- or write- cap, as provided in the HTTP request body. This will create - intermediate directories as necessary. - - This is similar to a UNIX hardlink: by referencing a previously-uploaded file - (or previously-created directory) instead of uploading/creating a new one, - you can create two references to the same object. - - The read- or write- cap of the child is provided in the body of the HTTP - request, and this same cap is returned in the response body. - - The default behavior is to overwrite any existing object at the same - location. To prevent this (and make the operation return an error instead - of overwriting), add a "replace=false" argument, as "?t=uri&replace=false". - With replace=false, this operation will return an HTTP 409 "Conflict" error - if there is already an object at the given location, rather than - overwriting the existing object. To allow the operation to overwrite a - file, but return an error when trying to overwrite a directory, use - "replace=only-files" (this behavior is closer to the traditional UNIX "mv" - command). Note that "true", "t", and "1" are all synonyms for "True", and - "false", "f", and "0" are synonyms for "False", and the parameter is - case-insensitive. - - Note that this operation does not take its child cap in the form of - separate "rw_uri" and "ro_uri" fields. Therefore, it cannot accept a - child cap in a format unknown to the webapi server, unless its URI - starts with "ro." or "imm.". This restriction is necessary because the - server is not able to attenuate an unknown write cap to a read cap. - Unknown URIs starting with "ro." or "imm.", on the other hand, are - assumed to represent read caps. The client should not prefix a write - cap with "ro." or "imm." and pass it to this operation, since that - would result in granting the cap's write authority to holders of the - directory read cap. 
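The replace= rules above can be modelled in a few lines. This is only a sketch of the documented behaviour, not the node's implementation; the function names are invented for the example, and `existing` stands for whatever already occupies the target name (None, "file", or "directory").

```python
def parse_replace(value="true"):
    """Map a replace= argument to True, False, or "only-files"."""
    lowered = value.lower()  # the parameter is case-insensitive
    if lowered in ("true", "t", "1"):
        return True
    if lowered in ("false", "f", "0"):
        return False
    if lowered == "only-files":
        return "only-files"
    raise ValueError("unrecognized replace= value: %r" % (value,))


def overwrite_allowed(replace, existing):
    """Return True if attaching the child would be permitted."""
    mode = parse_replace(replace)
    if existing is None or mode is True:
        return True  # nothing in the way, or overwriting is the default
    # replace=only-files may overwrite a file but never a directory.
    return mode == "only-files" and existing == "file"


print(overwrite_allowed("only-files", "directory"))
```

When `overwrite_allowed` is False for replace=false, the text above says the real operation answers with HTTP 409 "Conflict"; the exact status used for a refused replace=only-files is not stated here, so the sketch returns a plain boolean instead.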
- -Adding multiple files or directories to a parent directory at once ------------------------------------------------------------------- - -``POST /uri/$DIRCAP/[SUBDIRS..]?t=set_children`` - -``POST /uri/$DIRCAP/[SUBDIRS..]?t=set-children`` (Tahoe >= v1.6) - - This command adds multiple children to a directory in a single operation. - It reads the request body and interprets it as a JSON-encoded description - of the child names and read/write-caps that should be added. - - The body should be a JSON-encoded dictionary, in the same format as the - "children" value returned by the "GET /uri/$DIRCAP?t=json" operation - described above. In this format, each key is a child name, and the - corresponding value is a tuple of (type, childinfo). "type" is ignored, and - "childinfo" is a dictionary that contains "rw_uri", "ro_uri", and - "metadata" keys. You can take the output of "GET /uri/$DIRCAP1?t=json" and - use it as the input to "POST /uri/$DIRCAP2?t=set_children" to make DIR2 - look very much like DIR1 (except for any existing children of DIR2 that - were not overwritten, and any existing "tahoe" metadata keys as described - below). - - When the set_children request contains a child name that already exists in - the target directory, this command defaults to overwriting that child with - the new value (both child cap and metadata, but if the JSON data does not - contain a "metadata" key, the old child's metadata is preserved). The - command takes a boolean "overwrite=" query argument to control this - behavior. If you use "?t=set_children&overwrite=false", then an attempt to - replace an existing child will instead cause an error. - - Any "tahoe" key in the new child's "metadata" value is ignored. Any - existing "tahoe" metadata is preserved. The metadata["tahoe"] value is - reserved for metadata generated by the tahoe node itself. The only two keys - currently placed here are "linkcrtime" and "linkmotime".
For details, see - the section above entitled "Get Information About A File Or Directory (as - JSON)", in the "About the metadata" subsection. - - Note that this command was introduced with the name "set_children", which - uses an underscore rather than a hyphen as other multi-word command names - do. The variant with a hyphen is now accepted, but clients that desire - backward compatibility should continue to use "set_children". - - -Deleting a File or Directory ----------------------------- - -``DELETE /uri/$DIRCAP/[SUBDIRS../]CHILDNAME`` - - This removes the given name from its parent directory. CHILDNAME is the - name to be removed, and $DIRCAP/SUBDIRS.. indicates the directory that will - be modified. - - Note that this does not actually delete the file or directory that the name - points to from the tahoe grid -- it only removes the named reference from - this directory. If there are other names in this directory or in other - directories that point to the resource, then it will remain accessible - through those paths. Even if all names pointing to this object are removed - from their parent directories, then someone with possession of its read-cap - can continue to access the object through that cap. - - The object will only become completely unreachable once 1: there are no - reachable directories that reference it, and 2: nobody is holding a read- - or write- cap to the object. (This behavior is very similar to the way - hardlinks and anonymous files work in traditional UNIX filesystems). - - This operation will not modify more than a single directory. Intermediate - directories which were implicitly created by PUT or POST methods will *not* - be automatically removed by DELETE. - - This method returns the file- or directory- cap of the object that was just - removed. 
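Returning to the set_children operation described before the DELETE section: the "take the t=json output of DIR1 and feed it to DIR2" workflow can be sketched as below. The helper name `set_children_body` and the sample caps are placeholders chosen for this example, and the snippet stops at producing the request body rather than contacting a node.

```python
import json


def set_children_body(dir_json_text, names=None):
    """Turn a directory's t=json output into a t=set_children request body.

    names, if given, restricts the copy to the listed child names.
    """
    nodetype, info = json.loads(dir_json_text)
    if nodetype != "dirnode":
        raise ValueError("expected a dirnode description")
    children = info.get("children", {})
    if names is not None:
        children = {name: children[name] for name in names}
    # The body is just the "children" mapping: name -> [type, childinfo].
    return json.dumps(children).encode("utf-8")


DIR1_JSON = """
[ "dirnode", {
    "ro_uri": "URI:DIR2-RO:placeholder",
    "children": {
      "a.txt": [ "filenode", { "ro_uri": "URI:CHK:one",
                               "metadata": { "tahoe": { "linkcrtime": 1.0 } } } ],
      "b.txt": [ "filenode", { "ro_uri": "URI:CHK:two" } ]
    } } ]
"""

body = set_children_body(DIR1_JSON, names=["b.txt"])
print(body)
```

The "tahoe" sub-dict is left in place when copying, since the text above says the receiving node ignores it; appending overwrite=false to the query would make collisions with existing children an error instead of a replacement.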
- -Browser Operations: Human-oriented interfaces -============================================= - -This section describes the HTTP operations that provide support for humans -running a web browser. Most of these operations use HTML forms that use POST -to drive the Tahoe node. This section is intended for HTML authors who want -to write web pages that contain forms and buttons which manipulate the Tahoe -filesystem. - -Note that for all POST operations, the arguments listed can be provided -either as URL query arguments or as form body fields. URL query arguments are -separated from the main URL by "?", and from each other by "&". For example, -"POST /uri/$DIRCAP?t=upload&mutable=true". Form body fields are usually -specified by using <input type="hidden"> elements. For clarity, the -descriptions below display the most significant arguments as URL query args. - -Viewing A Directory (as HTML) ------------------------------ - -``GET /uri/$DIRCAP/[SUBDIRS../]`` - - This returns an HTML page, intended to be displayed to a human by a web - browser, which contains HREF links to all files and directories reachable - from this directory. These HREF links do not have a t= argument, meaning - that a human who follows them will get pages also meant for a human. It also - contains forms to upload new files, and to delete files and directories. - Those forms use POST methods to do their job. - -Viewing/Downloading a File --------------------------- - -``GET /uri/$FILECAP`` - -``GET /uri/$DIRCAP/[SUBDIRS../]FILENAME`` - - This will retrieve the contents of the given file. The HTTP response body - will contain the sequence of bytes that make up the file. - - If you want the HTTP response to include a useful Content-Type header, - either use the second form (which starts with a $DIRCAP), or add a - "filename=foo" query argument, like "GET /uri/$FILECAP?filename=foo.jpg".
- The bare "GET /uri/$FILECAP" does not give the Tahoe node enough information - to determine a Content-Type (since Tahoe immutable files are merely - sequences of bytes, not typed+named file objects). - - If the URL has both filename= and "save=true" in the query arguments, then - the server will add a "Content-Disposition: attachment" header, along with a - filename= parameter. When a user clicks on such a link, most browsers will - offer to let the user save the file instead of displaying it inline (indeed, - most browsers will refuse to display it inline). "true", "t", "1", and other - case-insensitive equivalents are all treated the same. - - Character-set handling in URLs and HTTP headers is a dubious art [1]_. For - maximum compatibility, Tahoe simply copies the bytes from the filename= - argument into the Content-Disposition header's filename= parameter, without - trying to interpret them in any particular way. - - -``GET /named/$FILECAP/FILENAME`` - - This is an alternate download form which makes it easier to get the correct - filename. The Tahoe server will provide the contents of the given file, with - a Content-Type header derived from the given filename. This form is used to - get browsers to use the "Save Link As" feature correctly, and also helps - command-line tools like "wget" and "curl" use the right filename. Note that - this form can *only* be used with file caps; it is an error to use a - directory cap after the /named/ prefix. - -Get Information About A File Or Directory (as HTML) ---------------------------------------------------- - -``GET /uri/$FILECAP?t=info`` - -``GET /uri/$DIRCAP/?t=info`` - -``GET /uri/$DIRCAP/[SUBDIRS../]SUBDIR/?t=info`` - -``GET /uri/$DIRCAP/[SUBDIRS../]FILENAME?t=info`` - - This returns a human-oriented HTML page with more detail about the selected - file or directory object.
This page contains the following items: - - * object size - * storage index - * JSON representation - * raw contents (text/plain) - * access caps (URIs): verify-cap, read-cap, write-cap (for mutable objects) - * check/verify/repair form - * deep-check/deep-size/deep-stats/manifest (for directories) - * replace-contents form (for mutable files) - -Creating a Directory --------------------- - -``POST /uri?t=mkdir`` - - This creates a new empty directory, but does not attach it to the virtual - filesystem. - - If a "redirect_to_result=true" argument is provided, then the HTTP response - will cause the web browser to be redirected to a /uri/$DIRCAP page that - gives access to the newly-created directory. If you bookmark this page, - you'll be able to get back to the directory again in the future. This is the - recommended way to start working with a Tahoe server: create a new unlinked - directory (using redirect_to_result=true), then bookmark the resulting - /uri/$DIRCAP page. There is a "create directory" button on the Welcome page - to invoke this action. - - If "redirect_to_result=true" is not provided (or is given a value of - "false"), then the HTTP response body will simply be the write-cap of the - new directory. - -``POST /uri/$DIRCAP/[SUBDIRS../]?t=mkdir&name=CHILDNAME`` - - This creates a new empty directory as a child of the designated SUBDIR. This - will create additional intermediate directories as necessary. - - If a "when_done=URL" argument is provided, the HTTP response will cause the - web browser to redirect to the given URL. This provides a convenient way to - return the browser to the directory that was just modified. Without a - when_done= argument, the HTTP response will simply contain the write-cap of - the directory that was just created. - - -Uploading a File ----------------- - -``POST /uri?t=upload`` - - This uploads a file, and produces a file-cap for the contents, but does not - attach the file into the filesystem.
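No directories will be modified by - this operation. - - The file must be provided as the "file" field of an HTML encoded form body, - produced in response to an HTML form like this:: - - <form action="/uri" method="POST" enctype="multipart/form-data">
- <input type="hidden" name="t" value="upload" /> - <input type="file" name="file" /> - <input type="submit" value="Upload Unlinked" /> - </form>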
- - If a "when_done=URL" argument is provided, the response body will cause the - browser to redirect to the given URL. If the when_done= URL has the string - "%(uri)s" in it, that string will be replaced by a URL-escaped form of the - newly created file-cap. (Note that without this substitution, there is no - way to access the file that was just uploaded). - - The default (in the absence of when_done=) is to return an HTML page that - describes the results of the upload. This page will contain information - about which storage servers were used for the upload, how long each - operation took, etc. - - If a "mutable=true" argument is provided, the operation will create a - mutable file, and the response body will contain the write-cap instead of - the upload results page. The default is to create an immutable file, - returning the upload results page as a response. - - -``POST /uri/$DIRCAP/[SUBDIRS../]?t=upload`` - - This uploads a file, and attaches it as a new child of the given directory, - which must be mutable. The file must be provided as the "file" field of an - HTML-encoded form body, produced in response to an HTML form like this:: - - <form action="." method="POST" enctype="multipart/form-data">
- <input type="hidden" name="t" value="upload" /> - <input type="file" name="file" /> - <input type="submit" value="Upload" /> - </form>
- - A "name=" argument can be provided to specify the new child's name, - otherwise it will be taken from the "filename" field of the upload form - (most web browsers will copy the last component of the original file's - pathname into this field). To avoid confusion, name= is not allowed to - contain a slash. - - If there is already a child with that name, and it is a mutable file, then - its contents are replaced with the data being uploaded. If it is not a - mutable file, the default behavior is to remove the existing child before - creating a new one. To prevent this (and make the operation return an error - instead of overwriting the old child), add a "replace=false" argument, as - "?t=upload&replace=false". With replace=false, this operation will return an - HTTP 409 "Conflict" error if there is already an object at the given - location, rather than overwriting the existing object. Note that "true", - "t", and "1" are all synonyms for "True", and "false", "f", and "0" are - synonyms for "False". The parameter is case-insensitive. - - This will create additional intermediate directories as necessary, although - since it is expected to be triggered by a form that was retrieved by "GET - /uri/$DIRCAP/[SUBDIRS../]", it is likely that the parent directory will - already exist. - - If a "mutable=true" argument is provided, any new file that is created will - be a mutable file instead of an immutable one. <input type="checkbox" name="mutable" /> will give the user a way to set this option. - - If a "when_done=URL" argument is provided, the HTTP response will cause the - web browser to redirect to the given URL. This provides a convenient way to - return the browser to the directory that was just modified. Without a - when_done= argument, the HTTP response will simply contain the file-cap of - the file that was just uploaded (a write-cap for mutable files, or a - read-cap for immutable files).
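For a non-browser client, the form submission described above amounts to a multipart/form-data body with a "t" field and a "file" field. The following is a minimal hand-rolled sketch under that assumption; `encode_upload_form` is a hypothetical helper, the boundary is passed in explicitly to keep the output predictable, and filenames containing double quotes are not escaped.

```python
def encode_upload_form(filename, data, boundary, mutable=False):
    """Return (content_type, body) mimicking a browser's t=upload form post."""
    fields = [("t", "upload")]
    if mutable:
        fields.append(("mutable", "true"))
    delimiter = b"--" + boundary.encode("ascii")
    parts = []
    for key, value in fields:
        parts += [
            delimiter,
            b'Content-Disposition: form-data; name="%s"' % key.encode("ascii"),
            b"",
            value.encode("ascii"),
        ]
    # The upload itself goes in the "file" field; its filename parameter is
    # what the node falls back to when no name= argument is given.
    parts += [
        delimiter,
        b'Content-Disposition: form-data; name="file"; filename="%s"'
        % filename.encode("utf-8"),
        b"Content-Type: application/octet-stream",
        b"",
        data,
        delimiter + b"--",
        b"",
    ]
    content_type = "multipart/form-data; boundary=" + boundary
    return content_type, b"\r\n".join(parts)


content_type, body = encode_upload_form("hello.txt", b"hello", "XYZ")
print(content_type)
```

A real client would pick a boundary that cannot occur in the data (for example a random hex string) and send `body` with the returned Content-Type header to the directory URL.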
- -``POST /uri/$DIRCAP/[SUBDIRS../]FILENAME?t=upload`` - - This also uploads a file and attaches it as a new child of the given - directory, which must be mutable. It is a slight variant of the previous - operation, as the URL refers to the target file rather than the parent - directory. It is otherwise identical: this accepts mutable= and when_done= - arguments too. - -``POST /uri/$FILECAP?t=upload`` - - This modifies the contents of an existing mutable file in-place. An error is - signalled if $FILECAP does not refer to a mutable file. It behaves just like - the "PUT /uri/$FILECAP" form, but uses a POST for the benefit of HTML forms - in a web browser. - -Attaching An Existing File Or Directory (by URI) ------------------------------------------------- - -``POST /uri/$DIRCAP/[SUBDIRS../]?t=uri&name=CHILDNAME&uri=CHILDCAP`` - - This attaches a given read- or write- cap "CHILDCAP" to the designated - directory, with a specified child name. This behaves much like the PUT t=uri - operation, and is a lot like a UNIX hardlink. It is subject to the same - restrictions as that operation on the use of cap formats unknown to the - webapi server. - - This will create additional intermediate directories as necessary, although - since it is expected to be triggered by a form that was retrieved by "GET - /uri/$DIRCAP/[SUBDIRS../]", it is likely that the parent directory will - already exist. - - This accepts the same replace= argument as POST t=upload. - -Deleting A Child ----------------- - -``POST /uri/$DIRCAP/[SUBDIRS../]?t=delete&name=CHILDNAME`` - - This instructs the node to remove a child object (file or subdirectory) from - the given directory, which must be mutable. Note that the entire subtree is - unlinked from the parent. Unlike deleting a subdirectory in a UNIX local - filesystem, the subtree need not be empty; if it isn't, then other references - into the subtree will see that the child subdirectories are not modified by - this operation. 
Only the link from the given directory to its child is severed. - -Renaming A Child ----------------- - -``POST /uri/$DIRCAP/[SUBDIRS../]?t=rename&from_name=OLD&to_name=NEW`` - - This instructs the node to rename a child of the given directory, which must - be mutable. This has a similar effect to removing the child, then adding the - same child-cap under the new name, except that it preserves metadata. This - operation cannot move the child to a different directory. - - This operation will replace any existing child of the new name, making it - behave like the UNIX "``mv -f``" command. - -Other Utilities ---------------- - -``GET /uri?uri=$CAP`` - - This causes a redirect to /uri/$CAP, and retains any additional query - arguments (like filename= or save=). This is for the convenience of web - forms which allow the user to paste in a read- or write- cap (obtained - through some out-of-band channel, like IM or email). - - Note that this form merely redirects to the specific file or directory - indicated by the $CAP: unlike the GET /uri/$DIRCAP form, you cannot - traverse to children by appending additional path segments to the URL. - -``GET /uri/$DIRCAP/[SUBDIRS../]?t=rename-form&name=$CHILDNAME`` - - This provides a useful facility to browser-based user interfaces. It - returns a page containing a form targeting the "POST $DIRCAP t=rename" - functionality described above, with the provided $CHILDNAME present in the - 'from_name' field of that form. I.e. this presents a form offering to - rename $CHILDNAME, requesting the new name, and submitting POST rename. - -``GET /uri/$DIRCAP/[SUBDIRS../]CHILDNAME?t=uri`` - - This returns the file- or directory- cap for the specified object. - -``GET /uri/$DIRCAP/[SUBDIRS../]CHILDNAME?t=readonly-uri`` - - This returns a read-only file- or directory- cap for the specified object. - If the object is an immutable file, this will return the same value as - t=uri.
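The "GET /uri?uri=$CAP" form above exists so that a pasted cap can be submitted as an ordinary query argument. A small sketch of building such a link, assuming standard percent-encoding of the cap; `paste_cap_url` and the cap string are placeholders for illustration:

```python
from urllib.parse import urlencode


def paste_cap_url(cap, **extra):
    """Build a /uri?uri=CAP link, carrying along extra query arguments."""
    params = {"uri": cap}
    # Arguments such as filename= or save= are retained across the redirect.
    params.update(extra)
    return "/uri?" + urlencode(params)


link = paste_cap_url("URI:CHK:placeholder", filename="photo.jpg", save="true")
print(link)
```

Because the cap sits in the query string rather than the path, no child path segments can be appended to it, which matches the limitation noted above.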
- -Debugging and Testing Features ------------------------------- - -These URLs are less likely to be helpful to the casual Tahoe user, and are -mainly intended for developers. - -``POST $URL?t=check`` - - This triggers the FileChecker to determine the current "health" of the - given file or directory, by counting how many shares are available. The - page that is returned will display the results. This can be used as a "show - me detailed information about this file" page. - - If a verify=true argument is provided, the node will perform a more - intensive check, downloading and verifying every single bit of every share. - - If an add-lease=true argument is provided, the node will also add (or - renew) a lease to every share it encounters. Each lease will keep the share - alive for a certain period of time (one month by default). Once the last - lease expires or is explicitly cancelled, the storage server is allowed to - delete the share. - - If an output=JSON argument is provided, the response will be - machine-readable JSON instead of human-oriented HTML. The data is a - dictionary with the following keys:: - - storage-index: a base32-encoded string with the object's storage index, - or an empty string for LIT files - summary: a string, with a one-line summary of the stats of the file - results: a dictionary that describes the state of the file. For LIT files, - this dictionary has only the 'healthy' key, which will always be - True. For distributed files, this dictionary has the following - keys: - count-shares-good: the number of good shares that were found - count-shares-needed: 'k', the number of shares required for recovery - count-shares-expected: 'N', the number of total shares generated - count-good-share-hosts: this was intended to be the number of distinct - storage servers with good shares. It is currently - (as of Tahoe-LAFS v1.8.0) computed incorrectly; - see ticket #1115.
- count-wrong-shares: for mutable files, the number of shares for - versions other than the 'best' one (highest - sequence number, highest roothash). These are - either old ... - count-recoverable-versions: for mutable files, the number of - recoverable versions of the file. For - a healthy file, this will equal 1. - count-unrecoverable-versions: for mutable files, the number of - unrecoverable versions of the file. - For a healthy file, this will be 0. - count-corrupt-shares: the number of shares with integrity failures - list-corrupt-shares: a list of "share locators", one for each share - that was found to be corrupt. Each share locator - is a list of (serverid, storage_index, sharenum). - needs-rebalancing: (bool) True if there are multiple shares on a single - storage server, indicating a reduction in reliability - that could be resolved by moving shares to new - servers. - servers-responding: list of base32-encoded storage server identifiers, - one for each server which responded to the share - query. - healthy: (bool) True if the file is completely healthy, False otherwise. - Healthy files have at least N good shares. Overlapping shares - do not currently cause a file to be marked unhealthy. If there - are at least N good shares, then corrupt shares do not cause the - file to be marked unhealthy, although the corrupt shares will be - listed in the results (list-corrupt-shares) and should be manually - removed to avoid wasting time in subsequent downloads (as the - downloader rediscovers the corruption and uses alternate shares). - Future compatibility: the meaning of this field may change to - reflect whether the servers-of-happiness criterion is met - (see ticket #614). - sharemap: dict mapping share identifier to list of serverids - (base32-encoded strings). This indicates which servers are - holding which shares. For immutable files, the shareid is - an integer (the share number, from 0 to N-1).
For - mutable files, it is a string of the form - 'seq%d-%s-sh%d', containing the sequence number, the - roothash, and the share number. - -``POST $URL?t=start-deep-check`` (must add &ophandle=XYZ) - - This initiates a recursive walk of all files and directories reachable from - the target, performing a check on each one just like t=check. The result - page will contain a summary of the results, including details on any - file/directory that was not fully healthy. - - t=start-deep-check can only be invoked on a directory. An error (400 - BAD_REQUEST) will be signalled if it is invoked on a file. The recursive - walker will deal with loops safely. - - This accepts the same verify= and add-lease= arguments as t=check. - - Since this operation can take a long time (perhaps a second per object), - the ophandle= argument is required (see "Slow Operations, Progress, and - Cancelling" above). The response to this POST will be a redirect to the - corresponding /operations/$HANDLE page (with output=HTML or output=JSON to - match the output= argument given to the POST). The deep-check operation - will continue to run in the background, and the /operations page should be - used to find out when the operation is done. - - Detailed check results for non-healthy files and directories will be - available under /operations/$HANDLE/$STORAGEINDEX, and the HTML status will - contain links to these detailed results. - - The HTML /operations/$HANDLE page for incomplete operations will contain a - meta-refresh tag, set to 60 seconds, so that a browser which uses - deep-check will automatically poll until the operation has completed. - - The JSON page (/operations/$HANDLE?output=JSON) will contain a - machine-readable JSON dictionary with the following keys:: - - finished: a boolean, True if the operation is complete, else False. Some - of the remaining keys may not be present until the operation - is complete.
- root-storage-index: a base32-encoded string with the storage index of the - starting point of the deep-check operation - count-objects-checked: count of how many objects were checked. Note that - non-distributed objects (i.e. small immutable LIT - files) are not checked, since for these objects, - the data is contained entirely in the URI. - count-objects-healthy: how many of those objects were completely healthy - count-objects-unhealthy: how many were damaged in some way - count-corrupt-shares: how many shares were found to have corruption, - summed over all objects examined - list-corrupt-shares: a list of "share identifiers", one for each share - that was found to be corrupt. Each share identifier - is a list of (serverid, storage_index, sharenum). - list-unhealthy-files: a list of (pathname, check-results) tuples, for - each file that was not fully healthy. 'pathname' is - a list of strings (which can be joined by "/" - characters to turn it into a single string), - relative to the directory on which deep-check was - invoked. The 'check-results' field is the same as - that returned by t=check&output=JSON, described - above. - stats: a dictionary with the same keys as the t=start-deep-stats command - (described below) - -``POST $URL?t=stream-deep-check`` - - This initiates a recursive walk of all files and directories reachable from - the target, performing a check on each one just like t=check. For each - unique object (duplicates are skipped), a single line of JSON is emitted to - the HTTP response channel (or an error indication, see below). When the walk - is complete, a final line of JSON is emitted which contains the accumulated - file-size/count "deep-stats" data. - - This command takes the same arguments as t=start-deep-check. - - A CLI tool can split the response stream on newlines into "response units", - and parse each response unit as JSON. 
Each such parsed unit will be a - dictionary, and will contain at least the "type" key: a string, one of - "file", "directory", or "stats". - - For all units that have a type of "file" or "directory", the dictionary will - contain the following keys:: - - "path": a list of strings, with the path that is traversed to reach the - object - "cap": a write-cap URI for the file or directory, if available, else a - read-cap URI - "verifycap": a verify-cap URI for the file or directory - "repaircap": an URI for the weakest cap that can still be used to repair - the object - "storage-index": a base32 storage index for the object - "check-results": a copy of the dictionary which would be returned by - t=check&output=json, with three top-level keys: - "storage-index", "summary", and "results", and a variety - of counts and sharemaps in the "results" value. - - Note that non-distributed files (i.e. LIT files) will have values of None - for verifycap, repaircap, and storage-index, since these files can neither - be verified nor repaired, and are not stored on the storage servers. - Likewise the check-results dictionary will be limited: an empty string for - storage-index, and a results dictionary with only the "healthy" key. - - The last unit in the stream will have a type of "stats", and will contain - the keys described in the "start-deep-stats" operation, below. - - If any errors occur during the traversal (specifically if a directory is - unrecoverable, such that further traversal is not possible), an error - indication is written to the response body, instead of the usual line of - JSON. This error indication line will begin with the string "ERROR:" (in all - caps), and contain a summary of the error on the rest of the line. The - remaining lines of the response body will be a python exception. The client - application should look for the ERROR: and stop processing JSON as soon as - it is seen. 
Note that neither a file being unrecoverable nor a directory - merely being unhealthy will cause traversal to stop. The line just before - the ERROR: will describe the directory that was untraversable, since the - unit is emitted to the HTTP response body before the child is traversed. - - -``POST $URL?t=check&repair=true`` - - This performs a health check of the given file or directory, and if the - checker determines that the object is not healthy (some shares are missing - or corrupted), it will perform a "repair". During repair, any missing - shares will be regenerated and uploaded to new servers. - - This accepts the same verify=true and add-lease= arguments as t=check. When - an output=JSON argument is provided, the machine-readable JSON response - will contain the following keys:: - - storage-index: a base32-encoded string with the object's storage index, - or an empty string for LIT files - repair-attempted: (bool) True if repair was attempted - repair-successful: (bool) True if repair was attempted and the file was - fully healthy afterwards. False if no repair was - attempted, or if a repair attempt failed. - pre-repair-results: a dictionary that describes the state of the file - before any repair was performed. This contains exactly - the same keys as the 'results' value of the t=check - response, described above. - post-repair-results: a dictionary that describes the state of the file - after any repair was performed. If no repair was - performed, post-repair-results and pre-repair-results - will be the same. This contains exactly the same keys - as the 'results' value of the t=check response, - described above. - -``POST $URL?t=start-deep-check&repair=true`` (must add &ophandle=XYZ) - - This triggers a recursive walk of all files and directories, performing a - t=check&repair=true on each one. - - Like t=start-deep-check without the repair= argument, this can only be - invoked on a directory.
An error (400 BAD_REQUEST) will be signalled if it - is invoked on a file. The recursive walker will deal with loops safely. - - This accepts the same verify= and add-lease= arguments as - t=start-deep-check. It uses the same ophandle= mechanism as - start-deep-check. When an output=JSON argument is provided, the response - will contain the following keys:: - - finished: (bool) True if the operation has completed, else False - root-storage-index: a base32-encoded string with the storage index of the - starting point of the deep-check operation - count-objects-checked: count of how many objects were checked - - count-objects-healthy-pre-repair: how many of those objects were completely - healthy, before any repair - count-objects-unhealthy-pre-repair: how many were damaged in some way - count-objects-healthy-post-repair: how many of those objects were completely - healthy, after any repair - count-objects-unhealthy-post-repair: how many were damaged in some way - - count-repairs-attempted: repairs were attempted on this many objects. - count-repairs-successful: how many repairs resulted in healthy objects - count-repairs-unsuccessful: how many repairs did not result in - completely healthy objects - count-corrupt-shares-pre-repair: how many shares were found to have - corruption, summed over all objects - examined, before any repair - count-corrupt-shares-post-repair: how many shares were found to have - corruption, summed over all objects - examined, after any repair - list-corrupt-shares: a list of "share identifiers", one for each share - that was found to be corrupt (before any repair). - Each share identifier is a list of (serverid, - storage_index, sharenum). - list-remaining-corrupt-shares: like list-corrupt-shares, but mutable shares - that were successfully repaired are not - included. These are shares that need - manual processing. Since immutable shares - cannot be modified by clients, all corruption - in immutable shares will be listed here.
- list-unhealthy-files: a list of (pathname, check-results) tuples, for - each file that was not fully healthy. 'pathname' is - relative to the directory on which deep-check was - invoked. The 'check-results' field is the same as - that returned by t=check&repair=true&output=JSON, - described above. - stats: a dictionary with the same keys as the t=start-deep-stats command - (described below) - -``POST $URL?t=stream-deep-check&repair=true`` - - This triggers a recursive walk of all files and directories, performing a - t=check&repair=true on each one. For each unique object (duplicates are - skipped), a single line of JSON is emitted to the HTTP response channel (or - an error indication). When the walk is complete, a final line of JSON is - emitted which contains the accumulated file-size/count "deep-stats" data. - - This emits the same data as t=stream-deep-check (without the repair=true), - except that the "check-results" field is replaced with a - "check-and-repair-results" field, which contains the keys returned by - t=check&repair=true&output=json (i.e. repair-attempted, repair-successful, - pre-repair-results, and post-repair-results). The output does not contain - the summary dictionary that is provided by t=start-deep-check&repair=true - (the one with count-objects-checked and list-unhealthy-files), since the - receiving client is expected to calculate those values itself from the - stream of per-object check-and-repair-results. - - Note that the "ERROR:" indication will only be emitted if traversal stops, - which will only occur if an unrecoverable directory is encountered. If a - file or directory repair fails, the traversal will continue, and the repair - failure will be indicated in the JSON data (in the "repair-successful" key). - -``POST $DIRURL?t=start-manifest`` (must add &ophandle=XYZ) - - This operation generates a "manifest" of the given directory tree, mostly - for debugging.
This is a table of (path, filecap/dircap), for every object - reachable from the starting directory. The path will be slash-joined, and - the filecap/dircap will contain a link to the object in question. This page - gives immediate access to every object in the virtual filesystem subtree. - - This operation uses the same ophandle= mechanism as deep-check. The - corresponding /operations/$HANDLE page has three different forms. The - default is output=HTML. - - If output=text is added to the query args, the results will be a text/plain - list. The first line is special: it is either "finished: yes" or "finished: - no"; if the operation is not finished, you must periodically reload the - page until it completes. The rest of the results are a plaintext list, with - one file/dir per line, slash-separated, with the filecap/dircap separated - by a space. - - If output=JSON is added to the queryargs, then the results will be a - JSON-formatted dictionary with six keys. Note that because large directory - structures can result in very large JSON results, the full results will not - be available until the operation is complete (i.e. until output["finished"] - is True):: - - finished (bool): if False then you must reload the page until True - origin_si (base32 str): the storage index of the starting point - manifest: list of (path, cap) tuples, where path is a list of strings. - verifycaps: list of (printable) verify cap strings - storage-index: list of (base32) storage index strings - stats: a dictionary with the same keys as the t=start-deep-stats command - (described below) - -``POST $DIRURL?t=start-deep-size`` (must add &ophandle=XYZ) - - This operation generates a number (in bytes) containing the sum of the - filesize of all directories and immutable files reachable from the given - directory. This is a rough lower bound of the total space consumed by this - subtree. 
It does not include space consumed by mutable files, nor does it - take expansion or encoding overhead into account. Later versions of the - code may improve this estimate upwards. - - The /operations/$HANDLE status output consists of two lines of text:: - - finished: yes - size: 1234 - -``POST $DIRURL?t=start-deep-stats`` (must add &ophandle=XYZ) - - This operation performs a recursive walk of all files and directories - reachable from the given directory, and generates a collection of - statistics about those objects. - - The result (obtained from the /operations/$OPHANDLE page) is a - JSON-serialized dictionary with the following keys (note that some of these - keys may be missing until 'finished' is True):: - - finished: (bool) True if the operation has finished, else False - count-immutable-files: count of how many CHK files are in the set - count-mutable-files: same, for mutable files (does not include directories) - count-literal-files: same, for LIT files (data contained inside the URI) - count-files: sum of the above three - count-directories: count of directories - count-unknown: count of unrecognized objects (perhaps from the future) - size-immutable-files: total bytes for all CHK files in the set, =deep-size - size-mutable-files (TODO): same, for current version of all mutable files - size-literal-files: same, for LIT files - size-directories: size of directories (includes size-literal-files) - size-files-histogram: list of (minsize, maxsize, count) buckets, - with a histogram of filesizes, 5dB/bucket, - for both literal and immutable files - largest-directory: number of children in the largest directory - largest-immutable-file: number of bytes in the largest CHK file - - size-mutable-files is not implemented, because it would require extra - queries to each mutable file to get their size. This may be implemented in - the future. 
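As a rough sketch of how a client might combine the size keys listed above: the function below adds size-immutable-files, size-mutable-files, and size-directories from a finished deep-stats dictionary, counting size-mutable-files as zero when it is absent (it is not implemented at this time). The function name ``basic_space_consumed`` is hypothetical and is not part of Tahoe; the result ignores all share overhead.

```python
def basic_space_consumed(stats):
    """Sum the size-* keys of a deep-stats dictionary (hypothetical helper).

    Returns None while the operation is unfinished, because some keys
    may be missing until 'finished' is True.
    """
    if not stats.get("finished", False):
        return None
    keys = ("size-immutable-files", "size-mutable-files", "size-directories")
    # size-mutable-files is not implemented, so it may be missing: count it as 0.
    return sum(stats.get(key, 0) for key in keys)
```

A caller would typically poll the /operations/$OPHANDLE page until this stops returning None.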
- - Assuming no sharing, the basic space consumed by a single root directory is - the sum of size-immutable-files, size-mutable-files, and size-directories. - The actual disk space used by the shares is larger, because of the - following sources of overhead:: - - integrity data - expansion due to erasure coding - share management data (leases) - backend (ext3) minimum block size - -``POST $URL?t=stream-manifest`` - - This operation performs a recursive walk of all files and directories - reachable from the given starting point. For each such unique object - (duplicates are skipped), a single line of JSON is emitted to the HTTP - response channel (or an error indication, see below). When the walk is - complete, a final line of JSON is emitted which contains the accumulated - file-size/count "deep-stats" data. - - A CLI tool can split the response stream on newlines into "response units", - and parse each response unit as JSON. Each such parsed unit will be a - dictionary, and will contain at least the "type" key: a string, one of - "file", "directory", or "stats". - - For all units that have a type of "file" or "directory", the dictionary will - contain the following keys:: - - "path": a list of strings, with the path that is traversed to reach the - object - "cap": a write-cap URI for the file or directory, if available, else a - read-cap URI - "verifycap": a verify-cap URI for the file or directory - "repaircap": an URI for the weakest cap that can still be used to repair - the object - "storage-index": a base32 storage index for the object - - Note that non-distributed files (i.e. LIT files) will have values of None - for verifycap, repaircap, and storage-index, since these files can neither - be verified nor repaired, and are not stored on the storage servers. - - The last unit in the stream will have a type of "stats", and will contain - the keys described in the "start-deep-stats" operation, below. 
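The newline-splitting approach described above can be sketched as follows. This is a minimal illustration, not code from Tahoe: ``parse_deep_stream`` is a hypothetical name, and it assumes the caller has already decoded the response body into text lines. It stops at the first line that begins with "ERROR:", as this operation's error convention requires.

```python
import json


def parse_deep_stream(lines):
    """Parse a stream-manifest style response into (units, error).

    'lines' is any iterable of text lines. 'error' is None unless an
    "ERROR:" line was seen, in which case it holds the error summary.
    """
    units = []
    error = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines between units
        if line.startswith("ERROR:"):
            # Everything after this line is a Python exception, not JSON.
            error = line[len("ERROR:"):].strip()
            break
        units.append(json.loads(line))
    return units, error
```

When no error occurs, the last element of ``units`` is the unit whose "type" is "stats"; when an error occurs, the last element describes the directory that could not be traversed.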
- - If any errors occur during the traversal (specifically if a directory is - unrecoverable, such that further traversal is not possible), an error - indication is written to the response body, instead of the usual line of - JSON. This error indication line will begin with the string "ERROR:" (in all - caps), and contain a summary of the error on the rest of the line. The - remaining lines of the response body will be a python exception. The client - application should look for the ERROR: and stop processing JSON as soon as - it is seen. The line just before the ERROR: will describe the directory that - was untraversable, since the manifest entry is emitted to the HTTP response - body before the child is traversed. - -Other Useful Pages -================== - -The portion of the web namespace that begins with "/uri" (and "/named") is -dedicated to giving users (both humans and programs) access to the Tahoe -virtual filesystem. The rest of the namespace provides status information -about the state of the Tahoe node. - -``GET /`` (the root page) - -This is the "Welcome Page", and contains a few distinct sections:: - - Node information: library versions, local nodeid, services being provided. - - Filesystem Access Forms: create a new directory, view a file/directory by - URI, upload a file (unlinked), download a file by - URI. - - Grid Status: introducer information, helper information, connected storage - servers. - -``GET /status/`` - - This page lists all active uploads and downloads, and contains a short list - of recent upload/download operations. Each operation has a link to a page - that describes file sizes, servers that were involved, and the time consumed - in each phase of the operation. - - A GET of /status/?t=json will contain a machine-readable subset of the same - data. It returns a JSON-encoded dictionary. The only key defined at this - time is "active", with a value that is a list of operation dictionaries, one - for each active operation. 
Once an operation is completed, it will no longer - appear in data["active"] . - - Each op-dict contains a "type" key, one of "upload", "download", - "mapupdate", "publish", or "retrieve" (the first two are for immutable - files, while the latter three are for mutable files and directories). - - The "upload" op-dict will contain the following keys:: - - type (string): "upload" - storage-index-string (string): a base32-encoded storage index - total-size (int): total size of the file - status (string): current status of the operation - progress-hash (float): 1.0 when the file has been hashed - progress-ciphertext (float): 1.0 when the file has been encrypted. - progress-encode-push (float): 1.0 when the file has been encoded and - pushed to the storage servers. For helper - uploads, the ciphertext value climbs to 1.0 - first, then encoding starts. For unassisted - uploads, ciphertext and encode-push progress - will climb at the same pace. - - The "download" op-dict will contain the following keys:: - - type (string): "download" - storage-index-string (string): a base32-encoded storage index - total-size (int): total size of the file - status (string): current status of the operation - progress (float): 1.0 when the file has been fully downloaded - - Front-ends which want to report progress information are advised to simply - average together all the progress-* indicators. A slightly more accurate - value can be found by ignoring the progress-hash value (since the current - implementation hashes synchronously, so clients will probably never see - progress-hash!=1.0). - -``GET /provisioning/`` - - This page provides a basic tool to predict the likely storage and bandwidth - requirements of a large Tahoe grid. It provides forms to input things like - total number of users, number of files per user, average file size, number - of servers, expansion ratio, hard drive failure rate, etc. 
It then provides - numbers like how many disks per server will be needed, how many read - operations per second should be expected, and the likely MTBF for files in - the grid. This information is very preliminary, and the model upon which it - is based still needs a lot of work. - -``GET /helper_status/`` - - If the node is running a helper (i.e. if [helper]enabled is set to True in - tahoe.cfg), then this page will provide a list of all the helper operations - currently in progress. If "?t=json" is added to the URL, it will return a - JSON-formatted list of helper statistics, which can then be used to produce - graphs to indicate how busy the helper is. - -``GET /statistics/`` - - This page provides "node statistics", which are collected from a variety of - sources:: - - load_monitor: every second, the node schedules a timer for one second in - the future, then measures how late the subsequent callback - is. The "load_average" is this tardiness, measured in - seconds, averaged over the last minute. It is an indication - of a busy node, one which is doing more work than can be - completed in a timely fashion. The "max_load" value is the - highest value that has been seen in the last 60 seconds. - - cpu_monitor: every minute, the node uses time.clock() to measure how much - CPU time it has used, and it uses this value to produce - 1min/5min/15min moving averages. These values range from 0% - (0.0) to 100% (1.0), and indicate what fraction of the CPU - has been used by the Tahoe node. Not all operating systems - provide meaningful data to time.clock(): they may report 100% - CPU usage at all times. 
- - uploader: this counts how many immutable files (and bytes) have been - uploaded since the node was started - - downloader: this counts how many immutable files have been downloaded - since the node was started - - publishes: this counts how many mutable files (including directories) have - been modified since the node was started - - retrieves: this counts how many mutable files (including directories) have - been read since the node was started - - There are other statistics that are tracked by the node. The "raw stats" - section shows a formatted dump of all of them. - - By adding "?t=json" to the URL, the node will return a JSON-formatted - dictionary of stats values, which can be used by other tools to produce - graphs of node behavior. The misc/munin/ directory in the source - distribution provides some tools to produce these graphs. - -``GET /`` (introducer status) - - For Introducer nodes, the welcome page displays information about both - clients and servers which are connected to the introducer. Servers make - "service announcements", and these are listed in a table. Clients will - subscribe to hear about service announcements, and these subscriptions are - listed in a separate table. Both tables contain information about what - version of Tahoe is being run by the remote node, their advertised and - outbound IP addresses, their nodeid and nickname, and how long they have - been available. - - By adding "?t=json" to the URL, the node will return a JSON-formatted - dictionary of stats values, which can be used to produce graphs of connected - clients over time. 
This dictionary has the following keys:: - - ["subscription_summary"] : a dictionary mapping service name (like - "storage") to an integer with the number of - clients that have subscribed to hear about that - service - ["announcement_summary"] : a dictionary mapping service name to an integer - with the number of servers which are announcing - that service - ["announcement_distinct_hosts"] : a dictionary mapping service name to an - integer which represents the number of - distinct hosts that are providing that - service. If two servers have announced - FURLs which use the same hostnames (but - different ports and tubids), they are - considered to be on the same host. - - -Static Files in /public_html -============================ - -The webapi server will take any request for a URL that starts with /static -and serve it from a configurable directory which defaults to -$BASEDIR/public_html . This is configured by setting the "[node]web.static" -value in $BASEDIR/tahoe.cfg . If this is left at the default value of -"public_html", then http://localhost:3456/static/subdir/foo.html will be -served with the contents of the file $BASEDIR/public_html/subdir/foo.html . - -This can be useful to serve a javascript application which provides a -prettier front-end to the rest of the Tahoe webapi. - - -Safety and security issues -- names vs. URIs -============================================ - -Summary: use explicit file- and dir- caps whenever possible, to reduce the -potential for surprises when the filesystem structure is changed. - -Tahoe provides a mutable filesystem, but the ways that the filesystem can -change are limited. The only thing that can change is that the mapping from -child names to child objects that each directory contains can be changed by -adding a new child name pointing to an object, removing an existing child name, -or changing an existing child name to point to a different object. 
- -Obviously if you query Tahoe for information about the filesystem and then act -to change the filesystem (such as by getting a listing of the contents of a -directory and then adding a file to the directory), then the filesystem might -have been changed after you queried it and before you acted upon it. However, -if you use the URI instead of the pathname of an object when you act upon the -object, then the only change that can happen is if the object is a directory -then the set of child names it has might be different. If, on the other hand, -you act upon the object using its pathname, then a different object might be in -that place, which can result in more kinds of surprises. - -For example, suppose you are writing code which recursively downloads the -contents of a directory. The first thing your code does is fetch the listing -of the contents of the directory. For each child that it fetched, if that -child is a file then it downloads the file, and if that child is a directory -then it recurses into that directory. Now, if the download and the recurse -actions are performed using the child's name, then the results might be -wrong, because for example a child name that pointed to a sub-directory when -you listed the directory might have been changed to point to a file (in which -case your attempt to recurse into it would result in an error and the file -would be skipped), or a child name that pointed to a file when you listed the -directory might now point to a sub-directory (in which case your attempt to -download the child would result in a file containing HTML text describing the -sub-directory!). - -If your recursive algorithm uses the uri of the child instead of the name of -the child, then those kinds of mistakes just can't happen. Note that both the -child's name and the child's URI are included in the results of listing the -parent directory, so it isn't any harder to use the URI for this purpose. 
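The URI-based recursion described above might look like the following sketch. The names here are hypothetical: ``list_children`` stands for any caller-supplied function that takes a directory cap and returns a mapping from child name to a (kind, cap) pair, and ``visit_file`` is whatever the caller wants done with each file. Neither is a Tahoe API; a real client would build them on top of the webapi. The walk only ever follows caps taken from a listing, never names, and it remembers visited directory caps so that loops are handled safely.

```python
def walk_by_uri(root_cap, list_children, visit_file):
    """Visit every file reachable from root_cap, following caps only."""
    seen = set()
    stack = [((), root_cap)]
    while stack:
        path, dircap = stack.pop()
        if dircap in seen:
            continue  # already traversed: avoids infinite loops
        seen.add(dircap)
        for name, (kind, cap) in sorted(list_children(dircap).items()):
            child_path = path + (name,)
            if kind == "directory":
                # Recurse using the cap from the listing, not the name.
                stack.append((child_path, cap))
            else:
                visit_file(child_path, cap)
```

Because each child is reached through the cap recorded at listing time, a later rename or replacement of that child name in the parent cannot redirect the walk to a different object.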
- -The read and write caps in a given directory node are separate URIs, and -can't be assumed to point to the same object even if they were retrieved in -the same operation (although the webapi server attempts to ensure this -in most cases). If you need to rely on that property, you should explicitly -verify it. More generally, you should not make assumptions about the -internal consistency of the contents of mutable directories. As a result -of the signatures on mutable object versions, it is guaranteed that a given -version was written in a single update, but -- as in the case of a file -- -the contents may have been chosen by a malicious writer in a way that is -designed to confuse applications that rely on their consistency. - -In general, use names if you want "whatever object (whether file or -directory) is found by following this name (or sequence of names) when my -request reaches the server". Use URIs if you want "this particular object". - -Concurrency Issues -================== - -Tahoe uses both mutable and immutable files. Mutable files can be created -explicitly by doing an upload with ?mutable=true added, or implicitly by -creating a new directory (since a directory is just a special way to -interpret a given mutable file). - -Mutable files suffer from the same consistency-vs-availability tradeoff that -all distributed data storage systems face. It is not possible to -simultaneously achieve perfect consistency and perfect availability in the -face of network partitions (servers being unreachable or faulty). - -Tahoe tries to achieve a reasonable compromise, but there is a basic rule in -place, known as the Prime Coordination Directive: "Don't Do That". What this -means is that if write-access to a mutable file is available to several -parties, then those parties are responsible for coordinating their activities -to avoid multiple simultaneous updates. 
This could be achieved by having -these parties talk to each other and using some sort of locking mechanism, or -by serializing all changes through a single writer. - -The consequences of performing uncoordinated writes can vary. Some of the -writers may lose their changes, as somebody else wins the race condition. In -many cases the file will be left in an "unhealthy" state, meaning that there -are not as many redundant shares as we would like (reducing the reliability -of the file against server failures). In the worst case, the file can be left -in such an unhealthy state that no version is recoverable, even the old ones. -It is this small possibility of data loss that prompts us to issue the Prime -Coordination Directive. - -Tahoe nodes implement internal serialization to make sure that a single Tahoe -node cannot conflict with itself. For example, it is safe to issue two -directory modification requests to a single tahoe node's webapi server at the -same time, because the Tahoe node will internally delay one of them until -after the other has finished being applied. (This feature was introduced in -Tahoe-1.1; back with Tahoe-1.0 the web client was responsible for serializing -web requests themselves). - -For more details, please see the "Consistency vs Availability" and "The Prime -Coordination Directive" sections of mutable.txt, in the same directory as -this file. - - -.. [1] URLs and HTTP and UTF-8, Oh My - - HTTP does not provide a mechanism to specify the character set used to - encode non-ascii names in URLs (rfc2396#2.1). We prefer the convention that - the filename= argument shall be a URL-encoded UTF-8 encoded unicode object. - For example, suppose we want to provoke the server into using a filename of - "f i a n c e-acute e" (i.e. F I A N C U+00E9 E). The UTF-8 encoding of this - is 0x66 0x69 0x61 0x6e 0x63 0xc3 0xa9 0x65 (or "fianc\xC3\xA9e", as python's - repr() function would show). 
To encode this into a URL, the non-printable - characters must be escaped with the urlencode '%XX' mechanism, giving us - "fianc%C3%A9e". Thus, the first line of the HTTP request will be "GET - /uri/CAP...?save=true&filename=fianc%C3%A9e HTTP/1.1". Not all browsers - provide this: IE7 uses the Latin-1 encoding, which is fianc%E9e. - - The response header will need to indicate a non-ASCII filename. The actual - mechanism to do this is not clear. For ASCII filenames, the response header - would look like:: - - Content-Disposition: attachment; filename="english.txt" - - If Tahoe were to enforce the utf-8 convention, it would need to decode the - URL argument into a unicode string, and then encode it back into a sequence - of bytes when creating the response header. One possibility would be to use - unencoded utf-8. Developers suggest that IE7 might accept this:: - - #1: Content-Disposition: attachment; filename="fianc\xC3\xA9e" - (note, the last four bytes of that line, not including the newline, are - 0xC3 0xA9 0x65 0x22) - - RFC2231#4 (dated 1997): suggests that the following might work, and some - developers (http://markmail.org/message/dsjyokgl7hv64ig3) have reported that - it is supported by firefox (but not IE7):: - - #2: Content-Disposition: attachment; filename*=utf-8''fianc%C3%A9e - - My reading of RFC2616#19.5.1 (which defines Content-Disposition) says that - the filename= parameter is defined to be wrapped in quotes (presumably to - allow spaces without breaking the parsing of subsequent parameters), which - would give us:: - - #3: Content-Disposition: attachment; filename*=utf-8''"fianc%C3%A9e" - - However this is contrary to the examples in the email thread listed above.
- - Developers report that IE7 (when it is configured for UTF-8 URL encoding, - which is not the default in asian countries), will accept:: - - #4: Content-Disposition: attachment; filename=fianc%C3%A9e - - However, for maximum compatibility, Tahoe simply copies bytes from the URL - into the response header, rather than enforcing the utf-8 convention. This - means it does not try to decode the filename from the URL argument, nor does - it encode the filename into the response header. diff --git a/docs/specifications/URI-extension.rst b/docs/specifications/URI-extension.rst new file mode 100644 index 00000000..6d40652e --- /dev/null +++ b/docs/specifications/URI-extension.rst @@ -0,0 +1,62 @@ +=================== +URI Extension Block +=================== + +This block is a serialized dictionary with string keys and string values +(some of which represent numbers, some of which are SHA-256 hashes). All +buckets hold an identical copy. The hash of the serialized data is kept in +the URI. + +The download process must obtain a valid copy of this data before any +decoding can take place. The download process must also obtain other data +before incremental validation can be performed. Full-file validation (for +clients who do not wish to do incremental validation) can be performed solely +with the data from this block. + +At the moment, this data block contains the following keys (and an estimate +on their sizes):: + + size 5 + segment_size 7 + num_segments 2 + needed_shares 2 + total_shares 3 + + codec_name 3 + codec_params 5+1+2+1+3=12 + tail_codec_params 12 + + share_root_hash 32 (binary) or 52 (base32-encoded) each + plaintext_hash + plaintext_root_hash + crypttext_hash + crypttext_root_hash + +Some pieces are needed elsewhere (size should be visible without pulling the +block, the Tahoe3 algorithm needs total_shares to find the right peers, all +peer selection algorithms need needed_shares to ask a minimal set of peers). 
+Some pieces are arguably redundant but are convenient to have present +(test_encode.py makes use of num_segments). + +The rule for this data block is that it should be a constant size for all +files, regardless of file size. Therefore hash trees (which have a size that +depends linearly upon the number of segments) are stored elsewhere in the +bucket, with only the hash tree root stored in this data block. + +This block will be serialized as follows:: + + assert that all keys match ^[a-zA-Z_\-]+$ + sort all the keys lexicographically + for k in keys: + write("%s:" % k) + write(netstring(data[k])) + + +Serialized size:: + + dense binary (but decimal) packing: 160+46=206 + including 'key:' (185) and netstring (6*3+7*4=46) on values: 231 + including 'key:%d\n' (185+13=198) and printable values (46+5*52=306)=504 + +We'll go with the 231-sized block, and provide a tool to dump it as text if +we really want one. diff --git a/docs/specifications/URI-extension.txt b/docs/specifications/URI-extension.txt deleted file mode 100644 index 6d40652e..00000000 --- a/docs/specifications/URI-extension.txt +++ /dev/null @@ -1,62 +0,0 @@ -=================== -URI Extension Block -=================== - -This block is a serialized dictionary with string keys and string values -(some of which represent numbers, some of which are SHA-256 hashes). All -buckets hold an identical copy. The hash of the serialized data is kept in -the URI. - -The download process must obtain a valid copy of this data before any -decoding can take place. The download process must also obtain other data -before incremental validation can be performed. Full-file validation (for -clients who do not wish to do incremental validation) can be performed solely -with the data from this block.
- -At the moment, this data block contains the following keys (and an estimate -on their sizes):: - - size 5 - segment_size 7 - num_segments 2 - needed_shares 2 - total_shares 3 - - codec_name 3 - codec_params 5+1+2+1+3=12 - tail_codec_params 12 - - share_root_hash 32 (binary) or 52 (base32-encoded) each - plaintext_hash - plaintext_root_hash - crypttext_hash - crypttext_root_hash - -Some pieces are needed elsewhere (size should be visible without pulling the -block, the Tahoe3 algorithm needs total_shares to find the right peers, all -peer selection algorithms need needed_shares to ask a minimal set of peers). -Some pieces are arguably redundant but are convenient to have present -(test_encode.py makes use of num_segments). - -The rule for this data block is that it should be a constant size for all -files, regardless of file size. Therefore hash trees (which have a size that -depends linearly upon the number of segments) are stored elsewhere in the -bucket, with only the hash tree root stored in this data block. - -This block will be serialized as follows:: - - assert that all keys match ^[a-zA-z_\-]+$ - sort all the keys lexicographically - for k in keys: - write("%s:" % k) - write(netstring(data[k])) - - -Serialized size:: - - dense binary (but decimal) packing: 160+46=206 - including 'key:' (185) and netstring (6*3+7*4=46) on values: 231 - including 'key:%d\n' (185+13=198) and printable values (46+5*52=306)=504 - -We'll go with the 231-sized block, and provide a tool to dump it as text if -we really want one. diff --git a/docs/specifications/dirnodes.rst b/docs/specifications/dirnodes.rst new file mode 100644 index 00000000..129e4997 --- /dev/null +++ b/docs/specifications/dirnodes.rst @@ -0,0 +1,469 @@ +========================== +Tahoe-LAFS Directory Nodes +========================== + +As explained in the architecture docs, Tahoe-LAFS can be roughly viewed as +a collection of three layers. 
The lowest layer is the key-value store: it +provides operations that accept files and upload them to the grid, creating +a URI in the process which securely references the file's contents. +The middle layer is the filesystem, creating a structure of directories and +filenames resembling the traditional unix/windows filesystems. The top layer +is the application layer, which uses the lower layers to provide useful +services to users, like a backup application, or a way to share files with +friends. + +This document examines the middle layer, the "filesystem". + +1. `Key-value Store Primitives`_ +2. `Filesystem goals`_ +3. `Dirnode goals`_ +4. `Dirnode secret values`_ +5. `Dirnode storage format`_ +6. `Dirnode sizes, mutable-file initial read sizes`_ +7. `Design Goals, redux`_ + + 1. `Confidentiality leaks in the storage servers`_ + 2. `Integrity failures in the storage servers`_ + 3. `Improving the efficiency of dirnodes`_ + 4. `Dirnode expiration and leases`_ + +8. `Starting Points: root dirnodes`_ +9. `Mounting and Sharing Directories`_ +10. `Revocation`_ + +Key-value Store Primitives +========================== + +In the lowest layer (key-value store), there are two operations that reference +immutable data (which we refer to as "CHK URIs" or "CHK read-capabilities" or +"CHK read-caps"). One puts data into the grid (but only if it doesn't exist +already), the other retrieves it:: + + chk_uri = put(data) + data = get(chk_uri) + +We also have three operations which reference mutable data (which we refer to +as "mutable slots", or "mutable write-caps and read-caps", or sometimes "SSK +slots"). 
One creates a slot with some initial contents, a second replaces the +contents of a pre-existing slot, and the third retrieves the contents:: + + mutable_uri = create(initial_data) + replace(mutable_uri, new_data) + data = get(mutable_uri) + +Filesystem Goals +================ + +The main goal for the middle (filesystem) layer is to give users a way to +organize the data that they have uploaded into the grid. The traditional way +to do this in computer filesystems is to put this data into files, give those +files names, and collect these names into directories. + +Each directory is a set of name-entry pairs, each of which maps a "child name" +to a directory entry pointing to an object of some kind. Those child objects +might be files, or they might be other directories. Each directory entry also +contains metadata. + +The directory structure is therefore a directed graph of nodes, in which each +node might be a directory node or a file node. All file nodes are terminal +nodes. + +Dirnode Goals +============= + +What properties might be desirable for these directory nodes? In no +particular order: + +1. functional. Code which does not work doesn't count. +2. easy to document, explain, and understand +3. confidential: it should not be possible for others to see the contents of + a directory +4. integrity: it should not be possible for others to modify the contents + of a directory +5. available: directories should survive host failure, just like files do +6. efficient: in storage, communication bandwidth, number of round-trips +7. easy to delegate individual directories in a flexible way +8. updateness: everybody looking at a directory should see the same contents +9. monotonicity: everybody looking at a directory should see the same + sequence of updates + +Some of these goals are mutually exclusive. For example, availability and +consistency are opposing, so it is not possible to achieve #5 and #8 at the +same time. 
Moreover, it takes a more complex architecture to get close to the +available-and-consistent ideal, so #2/#6 is in opposition to #5/#8. + +Tahoe-LAFS v0.7.0 introduced distributed mutable files, which use public-key +cryptography for integrity, and erasure coding for availability. These +achieve roughly the same properties as immutable CHK files, but their +contents can be replaced without changing their identity. Dirnodes are then +just a special way of interpreting the contents of a specific mutable file. +Earlier releases used a "vdrive server": this server was abolished in the +v0.7.0 release. + +For details of how mutable files work, please see "mutable.txt" in this +directory. + +For releases since v0.7.0, we achieve most of our desired properties. The +integrity and availability of dirnodes is equivalent to that of regular +(immutable) files, with the exception that there are more simultaneous-update +failure modes for mutable slots. Delegation is quite strong: you can give +read-write or read-only access to any subtree, and the data format used for +dirnodes is such that read-only access is transitive: i.e. if you grant Bob +read-only access to a parent directory, then Bob will get read-only access +(and *not* read-write access) to its children. + +Relative to the previous "vdrive-server" based scheme, the current +distributed dirnode approach gives better availability, but cannot guarantee +updateness quite as well, and requires far more network traffic for each +retrieval and update. Mutable files are somewhat less available than +immutable files, simply because of the increased number of combinations +(shares of an immutable file are either present or not, whereas there are +multiple versions of each mutable file, and you might have some shares of +version 1 and other shares of version 2). In extreme cases of simultaneous +update, mutable files might suffer from non-monotonicity. 
+ + +Dirnode secret values +===================== + +As mentioned before, dirnodes are simply a special way to interpret the +contents of a mutable file, so the secret keys and capability strings +described in "mutable.txt" are all the same. Each dirnode contains an RSA +public/private keypair, and the holder of the "write capability" will be able +to retrieve the private key (as well as the AES encryption key used for the +data itself). The holder of the "read capability" will be able to obtain the +public key and the AES data key, but not the RSA private key needed to modify +the data. + +The "write capability" for a dirnode grants read-write access to its +contents. This is expressed in concrete form as the "dirnode write cap": a +printable string which contains the necessary secrets to grant this access. +Likewise, the "read capability" grants read-only access to a dirnode, and can +be represented by a "dirnode read cap" string. + +For example, +URI:DIR2:swdi8ge1s7qko45d3ckkyw1aac%3Aar8r5j99a4mezdojejmsfp4fj1zeky9gjigyrid4urxdimego68o +is a write-capability URI, while +URI:DIR2-RO:buxjqykt637u61nnmjg7s8zkny:ar8r5j99a4mezdojejmsfp4fj1zeky9gjigyrid4urxdimego68o +is a read-capability URI, both for the same dirnode. + + +Dirnode storage format +====================== + +Each dirnode is stored in a single mutable file, distributed in the Tahoe-LAFS +grid. The contents of this file are a serialized list of netstrings, one per +child. Each child is a list of four netstrings: (name, rocap, rwcap, +metadata). (Remember that the contents of the mutable file are encrypted by +the read-cap, so this section describes the plaintext contents of the mutable +file, *after* it has been decrypted by the read-cap.) + +The name is simply a UTF-8-encoded child name. The 'rocap' is a read-only +capability URI to that child, either an immutable (CHK) file, a mutable file, +or a directory.
It is also possible to store 'unknown' URIs that are not +recognized by the current version of Tahoe-LAFS. The 'rwcap' is a read-write +capability URI for that child, encrypted with the dirnode's write-cap: this +enables the "transitive readonlyness" property, described further below. The +'metadata' is a JSON-encoded dictionary of type,value metadata pairs. Some +metadata keys are pre-defined, the rest are left up to the application. + +Each rwcap is stored as IV + ciphertext + MAC. The IV is a 16-byte random +value. The ciphertext is obtained by using AES in CTR mode on the rwcap URI +string, using a key that is formed from a tagged hash of the IV and the +dirnode's writekey. The MAC is written only for compatibility with older +Tahoe-LAFS versions and is no longer verified. + +If Bob has read-only access to the 'bar' directory, and he adds it as a child +to the 'foo' directory, then he will put the read-only cap for 'bar' in both +the rwcap and rocap slots (encrypting the rwcap contents as described above). +If he has full read-write access to 'bar', then he will put the read-write +cap in the 'rwcap' slot, and the read-only cap in the 'rocap' slot. Since +other users who have read-only access to 'foo' will be unable to decrypt its +rwcap slot, this limits those users to read-only access to 'bar' as well, +thus providing the transitive readonlyness that we desire. + +Dirnode sizes, mutable-file initial read sizes +============================================== + +How big are dirnodes? When reading dirnode data out of mutable files, how +large should our initial read be? If we guess exactly, we can read a dirnode +in a single round-trip, and update one in two RTT. If we guess too high, +we'll waste some amount of bandwidth. If we guess low, we need to make a +second pass to get the data (or the encrypted privkey, for writes), which +will cost us at least another RTT. 
+ +Assuming child names are between 10 and 99 characters long, how long are the +various pieces of a dirnode? + +:: + + netstring(name) ~= 4+len(name) + chk-cap = 97 (for 4-char filesizes) + dir-rw-cap = 88 + dir-ro-cap = 91 + netstring(cap) = 4+len(cap) + encrypted(cap) = 16+cap+32 + JSON({}) = 2 + JSON({ctime=float,mtime=float,'tahoe':{linkcrtime=float,linkmotime=float}}): 137 + netstring(metadata) = 4+137 = 141 + +so a CHK entry is:: + + 5+ 4+len(name) + 4+97 + 5+16+97+32 + 4+137 + +And a 15-byte filename gives a 416-byte entry. When the entry points at a +subdirectory instead of a file, the entry is a little bit smaller. So an +empty directory uses 0 bytes, a directory with one child uses about 416 +bytes, a directory with two children uses about 832, etc. + +When the dirnode data is encoded using our default 3-of-10, that means we +get 139ish bytes of data in each share per child. + +The pubkey, signature, and hashes form the first 935ish bytes of the +container, then comes our data, then about 1216 bytes of encprivkey. So if we +read the first:: + + 1kB: we get 65bytes of dirnode data : only empty directories + 2kB: 1065bytes: about 8 + 3kB: 2065bytes: about 15 entries, or 6 entries plus the encprivkey + 4kB: 3065bytes: about 22 entries, or about 13 plus the encprivkey + +So we've written the code to do an initial read of 4kB from each share when +we read the mutable file, which should give good performance (one RTT) for +small directories. + + +Design Goals, redux +=================== + +How well does this design meet the goals? + +1. functional: YES: the code works and has extensive unit tests +2. documentable: YES: this document is the existence proof +3. confidential: YES: see below +4. integrity: MOSTLY: a coalition of storage servers can roll back individual + mutable files, but not a single one. No server can + substitute fake data as genuine. +5.
 availability: YES: as long as 'k' storage servers are present and have + the same version of the mutable file, the dirnode will + be available. +6. efficient: MOSTLY: + network: single dirnode lookup is very efficient, since clients can + fetch specific keys rather than being required to get or set + the entire dirnode each time. Traversing many directories + takes a lot of roundtrips, and these can't be collapsed with + promise-pipelining because the intermediate values must only + be visible to the client. Modifying many dirnodes at once + (e.g. importing a large pre-existing directory tree) is pretty + slow, since each graph edge must be created independently. + storage: each child has a separate IV, which makes them larger than + if all children were aggregated into a single encrypted string +7. delegation: VERY: each dirnode is a completely independent object, + to which clients can be granted separate read-write or + read-only access +8. updateness: VERY: with only a single point of access, and no caching, + each client operation starts by fetching the current + value, so there are no opportunities for staleness +9. monotonicity: VERY: the single point of access also protects against + retrograde motion + + + +Confidentiality leaks in the storage servers +-------------------------------------------- + +Dirnodes (and the mutable files upon which they are based) are very private +against other clients: traffic between the client and the storage servers is +protected by the Foolscap SSL connection, so they can observe very little. +Storage index values are hashes of secrets and thus unguessable, and they are +not made public, so other clients cannot snoop through encrypted dirnodes +that they have not been told about. + +Storage servers can observe access patterns and see ciphertext, but they +cannot see the plaintext (of child names, metadata, or URIs).
 If an attacker +operates a significant number of storage servers, they can infer the shape of +the directory structure by assuming that directories are usually accessed +from root to leaf in rapid succession. Since filenames are usually much +shorter than read-caps and write-caps, the attacker can use the length of the +ciphertext to guess the number of children of each node, and might be able to +guess the length of the child names (or at least their sum). From this, the +attacker may be able to build up a graph with the same shape as the plaintext +filesystem, but with unlabeled edges and unknown file contents. + + +Integrity failures in the storage servers +----------------------------------------- + +The mutable file's integrity mechanism (RSA signature on the hash of the file +contents) prevents the storage server from modifying the dirnode's contents +without detection. Therefore the storage servers can make the dirnode +unavailable, but not corrupt it. + +A sufficient number of colluding storage servers can perform a rollback +attack: replace all shares of the whole mutable file with an earlier version. +To prevent this, when retrieving the contents of a mutable file, the +client queries more servers than necessary and uses the highest available +version number. This ensures that one or two misbehaving storage servers +cannot cause this rollback on their own. + + +Improving the efficiency of dirnodes +------------------------------------ + +The current mutable-file-based dirnode scheme suffers from certain +inefficiencies. A very large directory (with thousands or millions of +children) will take a significant time to extract any single entry, because +the whole file must be downloaded first, then parsed and searched to find the +desired child entry. Likewise, modifying a single child will require the +whole file to be re-uploaded. + +The current design assumes (and in some cases, requires) that dirnodes remain +small.
 The mutable files on which dirnodes are based are currently using +"SDMF" ("Small Distributed Mutable File") design rules, which state that the +size of the data shall remain below one megabyte. More advanced forms of +mutable files (MDMF and LDMF) are in the design phase to allow efficient +manipulation of larger mutable files. This would reduce the work needed to +modify a single entry in a large directory. + +Judicious caching may help improve the reading-large-directory case. Some +form of mutable index at the beginning of the dirnode might help as well. The +MDMF design rules allow for efficient random-access reads from the middle of +the file, which would give the index something useful to point at. + +The current SDMF design generates a new RSA public/private keypair for each +directory. This takes considerable time and CPU effort, generally one or two +seconds per directory. We have designed (but not yet built) a DSA-based +mutable file scheme which will use shared parameters to reduce the +directory-creation effort to a bare minimum (picking a random number instead +of generating two random primes). + +When a backup program is run for the first time, it needs to copy a large +amount of data from a pre-existing filesystem into reliable storage. This +means that a large and complex directory structure needs to be duplicated in +the dirnode layer. With the one-object-per-dirnode approach described here, +this requires as many operations as there are edges in the imported +filesystem graph. + +Another approach would be to aggregate multiple directories into a single +storage object. This object would contain a serialized graph rather than a +single name-to-child dictionary. Most directory operations would fetch the +whole block of data (and presumably cache it for a while to avoid lots of +re-fetches), and modification operations would need to replace the whole +thing at once.
This "realm" approach would have the added benefit of +combining more data into a single encrypted bundle (perhaps hiding the shape +of the graph from a determined attacker), and would reduce round-trips when +performing deep directory traversals (assuming the realm was already cached). +It would also prevent fine-grained rollback attacks from working: a coalition +of storage servers could change the entire realm to look like an earlier +state, but it could not independently roll back individual directories. + +The drawbacks of this aggregation would be that small accesses (adding a +single child, looking up a single child) would require pulling or pushing a +lot of unrelated data, increasing network overhead (and necessitating +test-and-set semantics for the modification side, which increases the chances +that a user operation will fail, making it more challenging to provide +promises of atomicity to the user). + +It would also make it much more difficult to enable the delegation +("sharing") of specific directories. Since each aggregate "realm" provides +all-or-nothing access control, the act of delegating any directory from the +middle of the realm would require the realm first be split into the upper +piece that isn't being shared and the lower piece that is. This splitting +would have to be done in response to what is essentially a read operation, +which is not traditionally supposed to be a high-effort action. On the other +hand, it may be possible to aggregate the ciphertext, but use distinct +encryption keys for each component directory, to get the benefits of both +schemes at once. + + +Dirnode expiration and leases +----------------------------- + +Dirnodes are created any time a client wishes to add a new directory. How +long do they live? What's to keep them from sticking around forever, taking +up space that nobody can reach any longer? 
+ +Mutable files are created with limited-time "leases", which keep the shares +alive until the last lease has expired or been cancelled. Clients which know +and care about specific dirnodes can ask to keep them alive for a while, by +renewing a lease on them (with a typical period of one month). Clients are +expected to assist in the deletion of dirnodes by canceling their leases as +soon as they are done with them. This means that when a client deletes a +directory, it should also cancel its lease on that directory. When the lease +count on a given share goes to zero, the storage server can delete the +related storage. Multiple clients may all have leases on the same dirnode: +the server may delete the shares only after all of the leases have gone away. + +We expect that clients will periodically create a "manifest": a list of +so-called "refresh capabilities" for all of the dirnodes and files that they +can reach. They will give this manifest to the "repairer", which is a service +that keeps files (and dirnodes) alive on behalf of clients who cannot take on +this responsibility for themselves. These refresh capabilities include the +storage index, but do *not* include the readkeys or writekeys, so the +repairer does not get to read the files or directories that it is helping to +keep alive. + +After each change to the user's vdrive, the client creates a manifest and +looks for differences from their previous version. Anything which was removed +prompts the client to send out lease-cancellation messages, allowing the data +to be deleted. + + +Starting Points: root dirnodes +============================== + +Any client can record the URI of a directory node in some external form (say, +in a local file) and use it as the starting point of later traversal. Each +Tahoe-LAFS user is expected to create a new (unattached) dirnode when they first +start using the grid, and record its URI for later use. 
+ +Mounting and Sharing Directories +================================ + +The biggest benefit of this dirnode approach is that sharing individual +directories is almost trivial. Alice creates a subdirectory that she wants to +use to share files with Bob. This subdirectory is attached to Alice's +filesystem at "~alice/share-with-bob". She asks her filesystem for the +read-write directory URI for that new directory, and emails it to Bob. When +Bob receives the URI, he asks his own local vdrive to attach the given URI, +perhaps at a place named "~bob/shared-with-alice". Every time either party +writes a file into this directory, the other will be able to read it. If +Alice prefers, she can give a read-only URI to Bob instead, and then Bob will +be able to read files but not change the contents of the directory. Neither +Alice nor Bob will get access to any files above the mounted directory: there +are no 'parent directory' pointers. If Alice creates a nested set of +directories, "~alice/share-with-bob/subdir2", and gives a read-only URI to +share-with-bob to Bob, then Bob will be unable to write to either +share-with-bob/ or subdir2/. + +A suitable UI needs to be created to allow users to easily perform this +sharing action: dragging a folder from their vdrive to an IM or email user icon, +for example. The UI will need to give the sending user an opportunity to +indicate whether they want to grant read-write or read-only access to the +recipient. The recipient then needs an interface to drag the new folder into +their vdrive and give it a home. + +Revocation +========== + +When Alice decides that she no longer wants Bob to be able to access the +shared directory, what should she do? Suppose she's shared this folder with +both Bob and Carol, and now she wants Carol to retain access to it but Bob to +be shut out. Ideally Carol should not have to do anything: her access should +continue unabated.
+ +The current plan is to have her client create a deep copy of the folder in +question, delegate access to the new folder to the remaining members of the +group (Carol), asking the lucky survivors to replace their old reference with +the new one. Bob may still have access to the old folder, but he is now the +only one who cares: everyone else has moved on, and he will no longer be able +to see their new changes. In a strict sense, this is the strongest form of +revocation that can be accomplished: there is no point trying to force Bob to +forget about the files that he read a moment before being kicked out. In +addition it must be noted that anyone who can access the directory can proxy +for Bob, reading files to him and accepting changes whenever he wants. +Preventing delegation between communication parties is just as pointless as +asking Bob to forget previously accessed files. However, there may be value +to configuring the UI to ask Carol to not share files with Bob, or to +removing all files from Bob's view at the same time his access is revoked. + diff --git a/docs/specifications/dirnodes.txt b/docs/specifications/dirnodes.txt deleted file mode 100644 index 129e4997..00000000 --- a/docs/specifications/dirnodes.txt +++ /dev/null @@ -1,469 +0,0 @@ -========================== -Tahoe-LAFS Directory Nodes -========================== - -As explained in the architecture docs, Tahoe-LAFS can be roughly viewed as -a collection of three layers. The lowest layer is the key-value store: it -provides operations that accept files and upload them to the grid, creating -a URI in the process which securely references the file's contents. -The middle layer is the filesystem, creating a structure of directories and -filenames resembling the traditional unix/windows filesystems. The top layer -is the application layer, which uses the lower layers to provide useful -services to users, like a backup application, or a way to share files with -friends. 
- -This document examines the middle layer, the "filesystem". - -1. `Key-value Store Primitives`_ -2. `Filesystem goals`_ -3. `Dirnode goals`_ -4. `Dirnode secret values`_ -5. `Dirnode storage format`_ -6. `Dirnode sizes, mutable-file initial read sizes`_ -7. `Design Goals, redux`_ - - 1. `Confidentiality leaks in the storage servers`_ - 2. `Integrity failures in the storage servers`_ - 3. `Improving the efficiency of dirnodes`_ - 4. `Dirnode expiration and leases`_ - -8. `Starting Points: root dirnodes`_ -9. `Mounting and Sharing Directories`_ -10. `Revocation`_ - -Key-value Store Primitives -========================== - -In the lowest layer (key-value store), there are two operations that reference -immutable data (which we refer to as "CHK URIs" or "CHK read-capabilities" or -"CHK read-caps"). One puts data into the grid (but only if it doesn't exist -already), the other retrieves it:: - - chk_uri = put(data) - data = get(chk_uri) - -We also have three operations which reference mutable data (which we refer to -as "mutable slots", or "mutable write-caps and read-caps", or sometimes "SSK -slots"). One creates a slot with some initial contents, a second replaces the -contents of a pre-existing slot, and the third retrieves the contents:: - - mutable_uri = create(initial_data) - replace(mutable_uri, new_data) - data = get(mutable_uri) - -Filesystem Goals -================ - -The main goal for the middle (filesystem) layer is to give users a way to -organize the data that they have uploaded into the grid. The traditional way -to do this in computer filesystems is to put this data into files, give those -files names, and collect these names into directories. - -Each directory is a set of name-entry pairs, each of which maps a "child name" -to a directory entry pointing to an object of some kind. Those child objects -might be files, or they might be other directories. Each directory entry also -contains metadata. 
- -The directory structure is therefore a directed graph of nodes, in which each -node might be a directory node or a file node. All file nodes are terminal -nodes. - -Dirnode Goals -============= - -What properties might be desirable for these directory nodes? In no -particular order: - -1. functional. Code which does not work doesn't count. -2. easy to document, explain, and understand -3. confidential: it should not be possible for others to see the contents of - a directory -4. integrity: it should not be possible for others to modify the contents - of a directory -5. available: directories should survive host failure, just like files do -6. efficient: in storage, communication bandwidth, number of round-trips -7. easy to delegate individual directories in a flexible way -8. updateness: everybody looking at a directory should see the same contents -9. monotonicity: everybody looking at a directory should see the same - sequence of updates - -Some of these goals are mutually exclusive. For example, availability and -consistency are opposing, so it is not possible to achieve #5 and #8 at the -same time. Moreover, it takes a more complex architecture to get close to the -available-and-consistent ideal, so #2/#6 is in opposition to #5/#8. - -Tahoe-LAFS v0.7.0 introduced distributed mutable files, which use public-key -cryptography for integrity, and erasure coding for availability. These -achieve roughly the same properties as immutable CHK files, but their -contents can be replaced without changing their identity. Dirnodes are then -just a special way of interpreting the contents of a specific mutable file. -Earlier releases used a "vdrive server": this server was abolished in the -v0.7.0 release. - -For details of how mutable files work, please see "mutable.txt" in this -directory. - -For releases since v0.7.0, we achieve most of our desired properties. 
The -integrity and availability of dirnodes are equivalent to those of regular -(immutable) files, with the exception that there are more simultaneous-update -failure modes for mutable slots. Delegation is quite strong: you can give -read-write or read-only access to any subtree, and the data format used for -dirnodes is such that read-only access is transitive: i.e. if you grant Bob -read-only access to a parent directory, then Bob will get read-only access -(and *not* read-write access) to its children. - -Relative to the previous "vdrive-server" based scheme, the current -distributed dirnode approach gives better availability, but cannot guarantee -updateness quite as well, and requires far more network traffic for each -retrieval and update. Mutable files are somewhat less available than -immutable files, simply because of the increased number of combinations -(shares of an immutable file are either present or not, whereas there are -multiple versions of each mutable file, and you might have some shares of -version 1 and other shares of version 2). In extreme cases of simultaneous -update, mutable files might suffer from non-monotonicity. - - -Dirnode secret values -===================== - -As mentioned before, dirnodes are simply a special way to interpret the -contents of a mutable file, so the secret keys and capability strings -described in "mutable.txt" are all the same. Each dirnode contains an RSA -public/private keypair, and the holder of the "write capability" will be able -to retrieve the private key (as well as the AES encryption key used for the -data itself). The holder of the "read capability" will be able to obtain the -public key and the AES data key, but not the RSA private key needed to modify -the data. - -The "write capability" for a dirnode grants read-write access to its -contents. This is expressed in concrete form as the "dirnode write cap": a -printable string which contains the necessary secrets to grant this access.
-Likewise, the "read capability" grants read-only access to a dirnode, and can -be represented by a "dirnode read cap" string. - -For example, -URI:DIR2:swdi8ge1s7qko45d3ckkyw1aac%3Aar8r5j99a4mezdojejmsfp4fj1zeky9gjigyrid4urxdimego68o -is a write-capability URI, while -URI:DIR2-RO:buxjqykt637u61nnmjg7s8zkny:ar8r5j99a4mezdojejmsfp4fj1zeky9gjigyrid4urxdimego68o -is a read-capability URI, both for the same dirnode. - - -Dirnode storage format -====================== - -Each dirnode is stored in a single mutable file, distributed in the Tahoe-LAFS -grid. The contents of this file are a serialized list of netstrings, one per -child. Each child is a list of four netstrings: (name, rocap, rwcap, -metadata). (Remember that the contents of the mutable file are encrypted by -the read-cap, so this section describes the plaintext contents of the mutable -file, *after* it has been decrypted by the read-cap.) - -The name is simply a UTF-8-encoded child name. The 'rocap' is a read-only -capability URI to that child, either an immutable (CHK) file, a mutable file, -or a directory. It is also possible to store 'unknown' URIs that are not -recognized by the current version of Tahoe-LAFS. The 'rwcap' is a read-write -capability URI for that child, encrypted with the dirnode's write-cap: this -enables the "transitive readonlyness" property, described further below. The -'metadata' is a JSON-encoded dictionary of type,value metadata pairs. Some -metadata keys are pre-defined, the rest are left up to the application. - -Each rwcap is stored as IV + ciphertext + MAC. The IV is a 16-byte random -value. The ciphertext is obtained by using AES in CTR mode on the rwcap URI -string, using a key that is formed from a tagged hash of the IV and the -dirnode's writekey. The MAC is written only for compatibility with older -Tahoe-LAFS versions and is no longer verified.
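The entry layout described above (four netstrings per child, wrapped in one outer netstring) can be sketched as follows. This is a minimal illustration, not the Tahoe-LAFS implementation: the function names are hypothetical, the capability strings are placeholders, and the stored rwcap is treated as opaque bytes (no AES-CTR encryption is performed here).

```python
import json
import os


def netstring(data: bytes) -> bytes:
    """Encode bytes as a netstring: b"<length>:<data>,"."""
    return str(len(data)).encode("ascii") + b":" + data + b","


def read_netstring(buf: bytes, pos: int = 0):
    """Decode one netstring starting at pos; return (payload, next_pos)."""
    colon = buf.index(b":", pos)
    length = int(buf[pos:colon])
    start, end = colon + 1, colon + 1 + length
    if buf[end:end + 1] != b",":
        raise ValueError("netstring is missing its trailing comma")
    return buf[start:end], end + 1


def pack_child(name: str, rocap: bytes, stored_rwcap: bytes, metadata: dict) -> bytes:
    """Serialize one child as (name, rocap, rwcap, metadata) netstrings."""
    fields = (
        name.encode("utf-8"),
        rocap,
        stored_rwcap,  # already in IV + ciphertext + MAC form
        json.dumps(metadata).encode("utf-8"),
    )
    return netstring(b"".join(netstring(f) for f in fields))


def unpack_child(entry: bytes):
    """Inverse of pack_child for a single entry."""
    body, _ = read_netstring(entry)
    fields, pos = [], 0
    while pos < len(body):
        field, pos = read_netstring(body, pos)
        fields.append(field)
    name, rocap, stored_rwcap, metadata = fields
    return name.decode("utf-8"), rocap, stored_rwcap, json.loads(metadata)


# Placeholder values only: these are not real capabilities or real ciphertext.
rocap = b"URI:CHK:placeholder-read-cap"
stored_rwcap = os.urandom(16) + b"<ciphertext>" + bytes(32)  # IV + ciphertext + MAC
entry = pack_child("notes.txt", rocap, stored_rwcap, {"mtime": 1292115771.0})
assert unpack_child(entry) == ("notes.txt", rocap, stored_rwcap, {"mtime": 1292115771.0})
```

Because every field carries its own length prefix, a reader can split an entry without knowing anything about the contents of the caps or the metadata.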
- -If Bob has read-only access to the 'bar' directory, and he adds it as a child -to the 'foo' directory, then he will put the read-only cap for 'bar' in both -the rwcap and rocap slots (encrypting the rwcap contents as described above). -If he has full read-write access to 'bar', then he will put the read-write -cap in the 'rwcap' slot, and the read-only cap in the 'rocap' slot. Since -other users who have read-only access to 'foo' will be unable to decrypt its -rwcap slot, this limits those users to read-only access to 'bar' as well, -thus providing the transitive readonlyness that we desire. - -Dirnode sizes, mutable-file initial read sizes -============================================== - -How big are dirnodes? When reading dirnode data out of mutable files, how -large should our initial read be? If we guess exactly, we can read a dirnode -in a single round-trip, and update one in two RTT. If we guess too high, -we'll waste some amount of bandwidth. If we guess low, we need to make a -second pass to get the data (or the encrypted privkey, for writes), which -will cost us at least another RTT. - -Assuming child names are between 10 and 99 characters long, how long are the -various pieces of a dirnode? - -:: - - netstring(name) ~= 4+len(name) - chk-cap = 97 (for 4-char filesizes) - dir-rw-cap = 88 - dir-ro-cap = 91 - netstring(cap) = 4+len(cap) - encrypted(cap) = 16+cap+32 - JSON({}) = 2 - JSON({ctime=float,mtime=float,'tahoe':{linkcrtime=float,linkmotime=float}}): 137 - netstring(metadata) = 4+137 = 141 - -so a CHK entry is:: - - 5+ 4+len(name) + 4+97 + 5+16+97+32 + 4+137 - -And a 15-byte filename gives a 416-byte entry. When the entry points at a -subdirectory instead of a file, the entry is a little bit smaller. So an -empty directory uses 0 bytes, a directory with one child uses about 416 -bytes, a directory with two children uses about 832, etc. 
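The per-entry arithmetic above can be reproduced with a short calculation. This is only a sketch of the estimate as the text states it: the constants are the figures quoted above, and the function names are illustrative.

```python
# Figures quoted in the text; names are illustrative, not Tahoe-LAFS code.
CHK_CAP_LEN = 97    # chk-cap length, for 4-char filesizes
IV_LEN = 16         # random IV stored in front of the encrypted rwcap
MAC_LEN = 32        # MAC stored after the encrypted rwcap
METADATA_LEN = 137  # JSON metadata with ctime, mtime and the 'tahoe' sub-dict


def chk_entry_size(name_len: int) -> int:
    """Estimated serialized size of one CHK child entry, term by term."""
    outer = 5                                    # outer netstring overhead
    name = 4 + name_len                          # netstring(name)
    rocap = 4 + CHK_CAP_LEN                      # netstring(rocap)
    rwcap = 5 + IV_LEN + CHK_CAP_LEN + MAC_LEN   # netstring(encrypted rwcap)
    metadata = 4 + METADATA_LEN                  # netstring(metadata)
    return outer + name + rocap + rwcap + metadata


def directory_size(num_children: int, name_len: int = 15) -> int:
    """Estimated plaintext size of a directory of CHK children."""
    return num_children * chk_entry_size(name_len)


print(chk_entry_size(15))   # 416
print(directory_size(2))    # 832
```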
- -When the dirnode data is encoded using our default 3-of-10, that means we -get 139ish bytes of data in each share per child. - -The pubkey, signature, and hashes form the first 935ish bytes of the -container, then comes our data, then about 1216 bytes of encprivkey. So if we -read the first:: - - 1kB: we get 65bytes of dirnode data : only empty directories - 2kB: 1065bytes: about 8 - 3kB: 2065bytes: about 15 entries, or 6 entries plus the encprivkey - 4kB: 3065bytes: about 22 entries, or about 13 plus the encprivkey - -So we've written the code to do an initial read of 4kB from each share when -we read the mutable file, which should give good performance (one RTT) for -small directories. - - -Design Goals, redux -=================== - -How well does this design meet the goals? - -1. functional: YES: the code works and has extensive unit tests -2. documentable: YES: this document is the existence proof -3. confidential: YES: see below -4. integrity: MOSTLY: a coalition of storage servers can roll back individual - mutable files, but not a single one. No server can - substitute fake data as genuine. -5. availability: YES: as long as 'k' storage servers are present and have - the same version of the mutable file, the dirnode will - be available. -6. efficient: MOSTLY: - network: single dirnode lookup is very efficient, since clients can - fetch specific keys rather than being required to get or set - the entire dirnode each time. Traversing many directories - takes a lot of roundtrips, and these can't be collapsed with - promise-pipelining because the intermediate values must only - be visible to the client. Modifying many dirnodes at once - (e.g. importing a large pre-existing directory tree) is pretty - slow, since each graph edge must be created independently. - storage: each child has a separate IV, which makes them larger than - if all children were aggregated into a single encrypted string -7.
delegation: VERY: each dirnode is a completely independent object, - to which clients can be granted separate read-write or - read-only access -8. updateness: VERY: with only a single point of access, and no caching, - each client operation starts by fetching the current - value, so there are no opportunities for staleness -9. monotonicity: VERY: the single point of access also protects against - retrograde motion - - - -Confidentiality leaks in the storage servers --------------------------------------------- - -Dirnodes (and the mutable files upon which they are based) are very private -against other clients: traffic between the client and the storage servers is -protected by the Foolscap SSL connection, so other clients can observe very little. -Storage index values are hashes of secrets and thus unguessable, and they are -not made public, so other clients cannot snoop through encrypted dirnodes -that they have not been told about. - -Storage servers can observe access patterns and see ciphertext, but they -cannot see the plaintext (of child names, metadata, or URIs). If an attacker -operates a significant number of storage servers, they can infer the shape of -the directory structure by assuming that directories are usually accessed -from root to leaf in rapid succession. Since filenames are usually much -shorter than read-caps and write-caps, the attacker can use the length of the -ciphertext to guess the number of children of each node, and might be able to -guess the length of the child names (or at least their sum). From this, the -attacker may be able to build up a graph with the same shape as the plaintext -filesystem, but with unlabeled edges and unknown file contents. - - -Integrity failures in the storage servers ------------------------------------------ - -The mutable file's integrity mechanism (RSA signature on the hash of the file -contents) prevents the storage server from modifying the dirnode's contents -without detection.
Therefore the storage servers can make the dirnode -unavailable, but not corrupt it. - -A sufficient number of colluding storage servers can perform a rollback -attack: replace all shares of the whole mutable file with an earlier version. -To prevent this, when retrieving the contents of a mutable file, the -client queries more servers than necessary and uses the highest available -version number. This ensures that one or two misbehaving storage servers -cannot cause this rollback on their own. - - -Improving the efficiency of dirnodes ------------------------------------- - -The current mutable-file-based dirnode scheme suffers from certain -inefficiencies. A very large directory (with thousands or millions of -children) will take a significant time to extract any single entry, because -the whole file must be downloaded first, then parsed and searched to find the -desired child entry. Likewise, modifying a single child will require the -whole file to be re-uploaded. - -The current design assumes (and in some cases, requires) that dirnodes remain -small. The mutable files on which dirnodes are based are currently using -"SDMF" ("Small Distributed Mutable File") design rules, which state that the -size of the data shall remain below one megabyte. More advanced forms of -mutable files (MDMF and LDMF) are in the design phase to allow efficient -manipulation of larger mutable files. This would reduce the work needed to -modify a single entry in a large directory. - -Judicious caching may help improve the reading-large-directory case. Some -form of mutable index at the beginning of the dirnode might help as well. The -MDMF design rules allow for efficient random-access reads from the middle of -the file, which would give the index something useful to point at. - -The current SDMF design generates a new RSA public/private keypair for each -directory. This takes considerable time and CPU effort, generally one or two -seconds per directory.
We have designed (but not yet built) a DSA-based -mutable file scheme which will use shared parameters to reduce the -directory-creation effort to a bare minimum (picking a random number instead -of generating two random primes). - -When a backup program is run for the first time, it needs to copy a large -amount of data from a pre-existing filesystem into reliable storage. This -means that a large and complex directory structure needs to be duplicated in -the dirnode layer. With the one-object-per-dirnode approach described here, -this requires as many operations as there are edges in the imported -filesystem graph. - -Another approach would be to aggregate multiple directories into a single -storage object. This object would contain a serialized graph rather than a -single name-to-child dictionary. Most directory operations would fetch the -whole block of data (and presumably cache it for a while to avoid lots of -re-fetches), and modification operations would need to replace the whole -thing at once. This "realm" approach would have the added benefit of -combining more data into a single encrypted bundle (perhaps hiding the shape -of the graph from a determined attacker), and would reduce round-trips when -performing deep directory traversals (assuming the realm was already cached). -It would also prevent fine-grained rollback attacks from working: a coalition -of storage servers could change the entire realm to look like an earlier -state, but it could not independently roll back individual directories. - -The drawbacks of this aggregation would be that small accesses (adding a -single child, looking up a single child) would require pulling or pushing a -lot of unrelated data, increasing network overhead (and necessitating -test-and-set semantics for the modification side, which increases the chances -that a user operation will fail, making it more challenging to provide -promises of atomicity to the user).
- -It would also make it much more difficult to enable the delegation -("sharing") of specific directories. Since each aggregate "realm" provides -all-or-nothing access control, the act of delegating any directory from the -middle of the realm would require the realm first be split into the upper -piece that isn't being shared and the lower piece that is. This splitting -would have to be done in response to what is essentially a read operation, -which is not traditionally supposed to be a high-effort action. On the other -hand, it may be possible to aggregate the ciphertext, but use distinct -encryption keys for each component directory, to get the benefits of both -schemes at once. - - -Dirnode expiration and leases ------------------------------ - -Dirnodes are created any time a client wishes to add a new directory. How -long do they live? What's to keep them from sticking around forever, taking -up space that nobody can reach any longer? - -Mutable files are created with limited-time "leases", which keep the shares -alive until the last lease has expired or been cancelled. Clients which know -and care about specific dirnodes can ask to keep them alive for a while, by -renewing a lease on them (with a typical period of one month). Clients are -expected to assist in the deletion of dirnodes by canceling their leases as -soon as they are done with them. This means that when a client deletes a -directory, it should also cancel its lease on that directory. When the lease -count on a given share goes to zero, the storage server can delete the -related storage. Multiple clients may all have leases on the same dirnode: -the server may delete the shares only after all of the leases have gone away. - -We expect that clients will periodically create a "manifest": a list of -so-called "refresh capabilities" for all of the dirnodes and files that they -can reach. 
They will give this manifest to the "repairer", which is a service -that keeps files (and dirnodes) alive on behalf of clients who cannot take on -this responsibility for themselves. These refresh capabilities include the -storage index, but do *not* include the readkeys or writekeys, so the -repairer does not get to read the files or directories that it is helping to -keep alive. - -After each change to the user's vdrive, the client creates a manifest and -looks for differences from their previous version. Anything which was removed -prompts the client to send out lease-cancellation messages, allowing the data -to be deleted. - - -Starting Points: root dirnodes -============================== - -Any client can record the URI of a directory node in some external form (say, -in a local file) and use it as the starting point of later traversal. Each -Tahoe-LAFS user is expected to create a new (unattached) dirnode when they first -start using the grid, and record its URI for later use. - -Mounting and Sharing Directories -================================ - -The biggest benefit of this dirnode approach is that sharing individual -directories is almost trivial. Alice creates a subdirectory that she wants to -use to share files with Bob. This subdirectory is attached to Alice's -filesystem at "~alice/share-with-bob". She asks her filesystem for the -read-write directory URI for that new directory, and emails it to Bob. When -Bob receives the URI, he asks his own local vdrive to attach the given URI, -perhaps at a place named "~bob/shared-with-alice". Every time either party -writes a file into this directory, the other will be able to read it. If -Alice prefers, she can give a read-only URI to Bob instead, and then Bob will -be able to read files but not change the contents of the directory. Neither -Alice nor Bob will get access to any files above the mounted directory: there -are no 'parent directory' pointers. 
If Alice creates a nested set of -directories, "~alice/share-with-bob/subdir2", and gives a read-only URI to -share-with-bob to Bob, then Bob will be unable to write to either -share-with-bob/ or subdir2/. - -A suitable UI needs to be created to allow users to easily perform this -sharing action: dragging a folder from their vdrive to an IM or email user icon, -for example. The UI will need to give the sending user an opportunity to -indicate whether they want to grant read-write or read-only access to the -recipient. The recipient then needs an interface to drag the new folder into -their vdrive and give it a home. - -Revocation -========== - -When Alice decides that she no longer wants Bob to be able to access the -shared directory, what should she do? Suppose she's shared this folder with -both Bob and Carol, and now she wants Carol to retain access to it but Bob to -be shut out. Ideally Carol should not have to do anything: her access should -continue unabated. - -The current plan is to have her client create a deep copy of the folder in -question, delegate access to the new folder to the remaining members of the -group (Carol), asking the lucky survivors to replace their old reference with -the new one. Bob may still have access to the old folder, but he is now the -only one who cares: everyone else has moved on, and he will no longer be able -to see their new changes. In a strict sense, this is the strongest form of -revocation that can be accomplished: there is no point trying to force Bob to -forget about the files that he read a moment before being kicked out. In -addition it must be noted that anyone who can access the directory can proxy -for Bob, reading files to him and accepting changes whenever he wants. -Preventing delegation between communicating parties is just as pointless as -asking Bob to forget previously accessed files.
However, there may be value -to configuring the UI to ask Carol to not share files with Bob, or to -removing all files from Bob's view at the same time his access is revoked. - diff --git a/docs/specifications/file-encoding.rst b/docs/specifications/file-encoding.rst new file mode 100644 index 00000000..1f2ee748 --- /dev/null +++ b/docs/specifications/file-encoding.rst @@ -0,0 +1,150 @@ +============= +File Encoding +============= + +When the client wishes to upload an immutable file, the first step is to +decide upon an encryption key. There are two methods: convergent or random. +The goal of the convergent-key method is to make sure that multiple uploads +of the same file will result in only one copy on the grid, whereas the +random-key method does not provide this "convergence" feature. + +The convergent-key method computes the SHA-256d hash of a single-purpose tag, +the encoding parameters, a "convergence secret", and the contents of the +file. It uses a portion of the resulting hash as the AES encryption key. +There are security concerns with this convergence approach (the +"partial-information guessing attack", please see ticket #365 for some +references), so Tahoe uses a separate (randomly-generated) "convergence +secret" for each node, stored in NODEDIR/private/convergence. The encoding +parameters (k, N, and the segment size) are included in the hash to make sure +that two different encodings of the same file will get different keys. This +method requires an extra IO pass over the file, to compute this key, and +encryption cannot be started until the pass is complete. This means that the +convergent-key method will require at least two total passes over the file. + +The random-key method simply chooses a random encryption key. Convergence is +disabled; however, this method does not require a separate IO pass, so upload +can be done with a single pass. This mode makes it easier to perform +streaming upload.
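The difference between the two key-selection methods can be sketched as below. Several details are assumptions made only for illustration and are not taken from the text: the tag string, the way the encoding parameters are serialized, the netstring wrapping of each hashed field, and the choice of the first 16 bytes of the digest as "a portion of the resulting hash".

```python
import hashlib
import os

KEY_LEN = 16  # assumption: a 128-bit AES key
EXAMPLE_TAG = b"example_convergent_key_tag_v1"  # hypothetical tag, not Tahoe's


def netstring(data: bytes) -> bytes:
    """Encode bytes as a netstring: b"<length>:<data>,"."""
    return str(len(data)).encode("ascii") + b":" + data + b","


def convergent_key(secret: bytes, k: int, n: int, segment_size: int, chunks) -> bytes:
    """Derive a key from tag, encoding parameters, secret, and file contents.

    Consuming `chunks` is the extra IO pass: encryption cannot begin
    until the whole file has been hashed.
    """
    params = b"%d,%d,%d" % (k, n, segment_size)  # hypothetical serialization
    h = hashlib.sha256()
    h.update(netstring(EXAMPLE_TAG))
    h.update(netstring(params))
    h.update(netstring(secret))
    for chunk in chunks:
        h.update(chunk)
    # SHA-256d: hash the first digest once more, then keep part of the result.
    return hashlib.sha256(h.digest()).digest()[:KEY_LEN]


def random_key() -> bytes:
    """Random-key method: no pass over the file, and no convergence."""
    return os.urandom(KEY_LEN)


secret = os.urandom(32)  # stands in for the per-node convergence secret
a = convergent_key(secret, 3, 10, 128 * 1024, [b"hello ", b"world"])
b = convergent_key(secret, 3, 10, 128 * 1024, [b"hello world"])
c = convergent_key(secret, 4, 10, 128 * 1024, [b"hello world"])
assert a == b   # same file, secret, and parameters give the same key
assert a != c   # a different encoding gives a different key
```

Two nodes with different convergence secrets derive different keys for the same file, which is what limits convergence to uploads that share a secret.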
+ +Regardless of which method is used to generate the key, the plaintext file is +encrypted (using AES in CTR mode) to produce a ciphertext. This ciphertext is +then erasure-coded and uploaded to the servers. Two hashes of the ciphertext +are generated as the encryption proceeds: a flat hash of the whole +ciphertext, and a Merkle tree. These are used to verify the correctness of +the erasure decoding step, and can be used by a "verifier" process to make +sure the file is intact without requiring the decryption key. + +The encryption key is hashed (with SHA-256d and a single-purpose tag) to +produce the "Storage Index". This Storage Index (or SI) is used to identify +the shares produced by the method described below. The grid can be thought of +as a large table that maps Storage Index to a ciphertext. Since the +ciphertext is stored as erasure-coded shares, it can also be thought of as a +table that maps SI to shares. + +Anybody who knows a Storage Index can retrieve the associated ciphertext: +ciphertexts are not secret. + +.. image:: file-encoding1.svg + +The ciphertext file is then broken up into segments. The last segment is +likely to be shorter than the rest. Each segment is erasure-coded into a +number of "blocks". This takes place one segment at a time. (In fact, +encryption and erasure-coding take place at the same time, once per plaintext +segment). Larger segment sizes result in less overhead overall, but increase +both the memory footprint and the "alacrity" (the number of bytes we have to +receive before we can deliver validated plaintext to the user). The current +default segment size is 128KiB. + +One block from each segment is sent to each shareholder (aka leaseholder, +aka landlord, aka storage node, aka peer). The "share" held by each remote +shareholder is nominally just a collection of these blocks. The file will +be recoverable when a certain number of shares have been retrieved. + +.. 
image:: file-encoding2.svg + +The blocks are hashed as they are generated and transmitted. These +block hashes are put into a Merkle hash tree. When the last share has been +created, the merkle tree is completed and delivered to the peer. Later, when +we retrieve these blocks, the peer will send many of the merkle hash tree +nodes ahead of time, so we can validate each block independently. + +The root of this block hash tree is called the "block root hash" and +used in the next step. + +.. image:: file-encoding3.svg + +There is a higher-level Merkle tree called the "share hash tree". Its leaves +are the block root hashes from each share. The root of this tree is called +the "share root hash" and is included in the "URI Extension Block", aka UEB. +The ciphertext hash and Merkle tree are also put here, along with the +original file size, and the encoding parameters. The UEB contains all the +non-secret values that could be put in the URI, but would have made the URI +too big. So instead, the UEB is stored with the share, and the hash of the +UEB is put in the URI. + +The URI then contains the secret encryption key and the UEB hash. It also +contains the basic encoding parameters (k and N) and the file size, to make +download more efficient (by knowing the number of required shares ahead of +time, sufficient download queries can be generated in parallel). + +The URI (also known as the immutable-file read-cap, since possessing it +grants the holder the capability to read the file's plaintext) is then +represented as a (relatively) short printable string like so:: + + URI:CHK:auxet66ynq55naiy2ay7cgrshm:6rudoctmbxsmbg7gwtjlimd6umtwrrsxkjzthuldsmo4nnfoc6fa:3:10:1000000 + +.. image:: file-encoding4.svg + +During download, when a peer begins to transmit a share, it first transmits +all of the parts of the share hash tree that are necessary to validate its +block root hash. Then it transmits the portions of the block hash tree +that are necessary to validate the first block. 
Then it transmits the +first block. It then continues this loop: transmitting any portions of the +block hash tree to validate block#N, then sending block#N. + +.. image:: file-encoding5.svg + +So the "share" that is sent to the remote peer actually consists of three +pieces, sent in a specific order as they become available, and retrieved +during download in a different order according to when they are needed. + +The first piece is the blocks themselves, one per segment. The last +block will likely be shorter than the rest, because the last segment is +probably shorter than the rest. The second piece is the block hash tree, +consisting of a total of two SHA-1 hashes per block. The third piece is a +hash chain from the share hash tree, consisting of log2(numshares) hashes. + +During upload, all blocks are sent first, followed by the block hash +tree, followed by the share hash chain. During download, the share hash chain +is delivered first, followed by the block root hash. The client then uses +the hash chain to validate the block root hash. Then the peer delivers +enough of the block hash tree to validate the first block, followed by +the first block itself. The block hash chain is used to validate the +block, then it is passed (along with the first block from several other +peers) into decoding, to produce the first segment of crypttext, which is +then decrypted to produce the first segment of plaintext, which is finally +delivered to the user. + +.. image:: file-encoding6.svg + +Hashes +====== + +All hashes use SHA-256d, as defined in Practical Cryptography (by Ferguson +and Schneier). All hashes use a single-purpose tag, e.g. the hash that +converts an encryption key into a storage index is defined as follows:: + + SI = SHA256d(netstring("allmydata_immutable_key_to_storage_index_v1") + key) + +When two separate values need to be combined together in a hash, we wrap each +in a netstring. 
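The storage-index formula above translates directly into a few lines of code. This is a minimal sketch that follows the formula as written: the helper names are illustrative, and the full 32-byte digest is returned because the text does not say whether the result is shortened.

```python
import hashlib

SI_TAG = b"allmydata_immutable_key_to_storage_index_v1"  # tag quoted in the text


def netstring(data: bytes) -> bytes:
    """Encode bytes as a netstring: b"<length>:<data>,"."""
    return str(len(data)).encode("ascii") + b":" + data + b","


def sha256d(data: bytes) -> bytes:
    """SHA-256d: SHA-256 applied to the SHA-256 digest of the data."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def storage_index(key: bytes) -> bytes:
    """SI = SHA256d(netstring(tag) + key), as in the formula above."""
    return sha256d(netstring(SI_TAG) + key)


# Wrapping each value in a netstring keeps field boundaries unambiguous:
# plain concatenation cannot tell ("ab", "c") apart from ("a", "bc").
assert b"ab" + b"c" == b"a" + b"bc"
assert netstring(b"ab") + netstring(b"c") != netstring(b"a") + netstring(b"bc")

print(storage_index(bytes(16)).hex())  # digest for an all-zero example key
```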
+ +Using SHA-256d (instead of plain SHA-256) guards against length-extension +attacks. Using the tag protects our Merkle trees against attacks in which the +hash of a leaf is confused with a hash of two children (allowing an attacker +to generate corrupted data that nevertheless appears to be valid), and is +simply good "cryptographic hygiene". The `"Chosen Protocol Attack" by Kelsey, +Schneier, and Wagner `_ is +relevant. Putting the tag in a netstring guards against attacks that seek to +confuse the end of the tag with the beginning of the subsequent value. + diff --git a/docs/specifications/file-encoding.txt b/docs/specifications/file-encoding.txt deleted file mode 100644 index 1f2ee748..00000000 --- a/docs/specifications/file-encoding.txt +++ /dev/null @@ -1,150 +0,0 @@ -============= -File Encoding -============= - -When the client wishes to upload an immutable file, the first step is to -decide upon an encryption key. There are two methods: convergent or random. -The goal of the convergent-key method is to make sure that multiple uploads -of the same file will result in only one copy on the grid, whereas the -random-key method does not provide this "convergence" feature. - -The convergent-key method computes the SHA-256d hash of a single-purpose tag, -the encoding parameters, a "convergence secret", and the contents of the -file. It uses a portion of the resulting hash as the AES encryption key. -There are security concerns with this convergence approach (the -"partial-information guessing attack", please see ticket #365 for some -references), so Tahoe uses a separate (randomly-generated) "convergence -secret" for each node, stored in NODEDIR/private/convergence. The encoding -parameters (k, N, and the segment size) are included in the hash to make sure -that two different encodings of the same file will get different keys.
This -method requires an extra IO pass over the file, to compute this key, and -encryption cannot be started until the pass is complete. This means that the -convergent-key method will require at least two total passes over the file. - -The random-key method simply chooses a random encryption key. Convergence is -disabled, however this method does not require a separate IO pass, so upload -can be done with a single pass. This mode makes it easier to perform -streaming upload. - -Regardless of which method is used to generate the key, the plaintext file is -encrypted (using AES in CTR mode) to produce a ciphertext. This ciphertext is -then erasure-coded and uploaded to the servers. Two hashes of the ciphertext -are generated as the encryption proceeds: a flat hash of the whole -ciphertext, and a Merkle tree. These are used to verify the correctness of -the erasure decoding step, and can be used by a "verifier" process to make -sure the file is intact without requiring the decryption key. - -The encryption key is hashed (with SHA-256d and a single-purpose tag) to -produce the "Storage Index". This Storage Index (or SI) is used to identify -the shares produced by the method described below. The grid can be thought of -as a large table that maps Storage Index to a ciphertext. Since the -ciphertext is stored as erasure-coded shares, it can also be thought of as a -table that maps SI to shares. - -Anybody who knows a Storage Index can retrieve the associated ciphertext: -ciphertexts are not secret. - -.. image:: file-encoding1.svg - -The ciphertext file is then broken up into segments. The last segment is -likely to be shorter than the rest. Each segment is erasure-coded into a -number of "blocks". This takes place one segment at a time. (In fact, -encryption and erasure-coding take place at the same time, once per plaintext -segment). 
Larger segment sizes result in less overhead overall, but increase -both the memory footprint and the "alacrity" (the number of bytes we have to -receive before we can deliver validated plaintext to the user). The current -default segment size is 128KiB. - -One block from each segment is sent to each shareholder (aka leaseholder, -aka landlord, aka storage node, aka peer). The "share" held by each remote -shareholder is nominally just a collection of these blocks. The file will -be recoverable when a certain number of shares have been retrieved. - -.. image:: file-encoding2.svg - -The blocks are hashed as they are generated and transmitted. These -block hashes are put into a Merkle hash tree. When the last share has been -created, the merkle tree is completed and delivered to the peer. Later, when -we retrieve these blocks, the peer will send many of the merkle hash tree -nodes ahead of time, so we can validate each block independently. - -The root of this block hash tree is called the "block root hash" and -used in the next step. - -.. image:: file-encoding3.svg - -There is a higher-level Merkle tree called the "share hash tree". Its leaves -are the block root hashes from each share. The root of this tree is called -the "share root hash" and is included in the "URI Extension Block", aka UEB. -The ciphertext hash and Merkle tree are also put here, along with the -original file size, and the encoding parameters. The UEB contains all the -non-secret values that could be put in the URI, but would have made the URI -too big. So instead, the UEB is stored with the share, and the hash of the -UEB is put in the URI. - -The URI then contains the secret encryption key and the UEB hash. It also -contains the basic encoding parameters (k and N) and the file size, to make -download more efficient (by knowing the number of required shares ahead of -time, sufficient download queries can be generated in parallel). 
- -The URI (also known as the immutable-file read-cap, since possessing it -grants the holder the capability to read the file's plaintext) is then -represented as a (relatively) short printable string like so:: - - URI:CHK:auxet66ynq55naiy2ay7cgrshm:6rudoctmbxsmbg7gwtjlimd6umtwrrsxkjzthuldsmo4nnfoc6fa:3:10:1000000 - -.. image:: file-encoding4.svg - -During download, when a peer begins to transmit a share, it first transmits -all of the parts of the share hash tree that are necessary to validate its -block root hash. Then it transmits the portions of the block hash tree -that are necessary to validate the first block. Then it transmits the -first block. It then continues this loop: transmitting any portions of the -block hash tree to validate block#N, then sending block#N. - -.. image:: file-encoding5.svg - -So the "share" that is sent to the remote peer actually consists of three -pieces, sent in a specific order as they become available, and retrieved -during download in a different order according to when they are needed. - -The first piece is the blocks themselves, one per segment. The last -block will likely be shorter than the rest, because the last segment is -probably shorter than the rest. The second piece is the block hash tree, -consisting of a total of two SHA-1 hashes per block. The third piece is a -hash chain from the share hash tree, consisting of log2(numshares) hashes. - -During upload, all blocks are sent first, followed by the block hash -tree, followed by the share hash chain. During download, the share hash chain -is delivered first, followed by the block root hash. The client then uses -the hash chain to validate the block root hash. Then the peer delivers -enough of the block hash tree to validate the first block, followed by -the first block itself. 
The block hash chain is used to validate the -block, then it is passed (along with the first block from several other -peers) into decoding, to produce the first segment of crypttext, which is -then decrypted to produce the first segment of plaintext, which is finally -delivered to the user. - -.. image:: file-encoding6.svg - -Hashes -====== - -All hashes use SHA-256d, as defined in Practical Cryptography (by Ferguson -and Schneier). All hashes use a single-purpose tag, e.g. the hash that -converts an encryption key into a storage index is defined as follows:: - - SI = SHA256d(netstring("allmydata_immutable_key_to_storage_index_v1") + key) - -When two separate values need to be combined together in a hash, we wrap each -in a netstring. - -Using SHA-256d (instead of plain SHA-256) guards against length-extension -attacks. Using the tag protects our Merkle trees against attacks in which the -hash of a leaf is confused with a hash of two children (allowing an attacker -to generate corrupted data that nevertheless appears to be valid), and is -simply good "cryptographic hygiene". The `"Chosen Protocol Attack" by Kelsey, -Schneier, and Wagner `_ is -relevant. Putting the tag in a netstring guards against attacks that seek to -confuse the end of the tag with the beginning of the subsequent value. - diff --git a/docs/specifications/mutable.rst b/docs/specifications/mutable.rst new file mode 100644 index 00000000..0d7e71e5 --- /dev/null +++ b/docs/specifications/mutable.rst @@ -0,0 +1,704 @@ +============= +Mutable Files +============= + +This describes the "RSA-based mutable files" which were shipped in Tahoe v0.8.0. + +1. `Consistency vs. Availability`_ +2. `The Prime Coordination Directive: "Don't Do That"`_ +3. `Small Distributed Mutable Files`_ + + 1. `SDMF slots overview`_ + 2. `Server Storage Protocol`_ + 3. `Code Details`_ + 4. `SMDF Slot Format`_ + 5. `Recovery`_ + +4. `Medium Distributed Mutable Files`_ +5. `Large Distributed Mutable Files`_ +6.
`TODO`_ + +Mutable File Slots are places with a stable identifier that can hold data +that changes over time. In contrast to CHK slots, for which the +URI/identifier is derived from the contents themselves, the Mutable File Slot +URI remains fixed for the life of the slot, regardless of what data is placed +inside it. + +Each mutable slot is referenced by two different URIs. The "read-write" URI +grants read-write access to its holder, allowing them to put whatever +contents they like into the slot. The "read-only" URI is less powerful, only +granting read access, and not enabling modification of the data. The +read-write URI can be turned into the read-only URI, but not the other way +around. + +The data in these slots is distributed over a number of servers, using the +same erasure coding that CHK files use, with 3-of-10 being a typical choice +of encoding parameters. The data is encrypted and signed in such a way that +only the holders of the read-write URI will be able to set the contents of +the slot, and only the holders of the read-only URI will be able to read +those contents. Holders of either URI will be able to validate the contents +as being written by someone with the read-write URI. The servers who hold the +shares cannot read or modify them: the worst they can do is deny service (by +deleting or corrupting the shares), or attempt a rollback attack (which can +only succeed with the cooperation of at least k servers). + +Consistency vs. Availability +============================ + +There is an age-old battle between consistency and availability. Epic papers +have been written, elaborate proofs have been established, and generations of +theorists have learned that you cannot simultaneously achieve guaranteed +consistency with guaranteed reliability. In addition, the closer to 0 you get +on either axis, the more the cost and complexity of the design go up.
+ +Tahoe's design goals are to largely favor design simplicity, then slightly +favor read availability, over the other criteria. + +As we develop more sophisticated mutable slots, the API may expose multiple +read versions to the application layer. The tahoe philosophy is to defer most +consistency recovery logic to the higher layers. Some applications have +effective ways to merge multiple versions, so inconsistency is not +necessarily a problem (i.e. directory nodes can usually merge multiple "add +child" operations). + +The Prime Coordination Directive: "Don't Do That" +================================================= + +The current rule for applications which run on top of Tahoe is "do not +perform simultaneous uncoordinated writes". That means you need non-tahoe +means to make sure that two parties are not trying to modify the same mutable +slot at the same time. For example: + +* don't give the read-write URI to anyone else. Dirnodes in a private + directory generally satisfy this case, as long as you don't use two + clients on the same account at the same time +* if you give a read-write URI to someone else, stop using it yourself. An + inbox would be a good example of this. +* if you give a read-write URI to someone else, call them on the phone + before you write into it +* build an automated mechanism to have your agents coordinate writes. + For example, we expect a future release to include a FURL for a + "coordination server" in the dirnodes. The rule can be that you must + contact the coordination server and obtain a lock/lease on the file + before you're allowed to modify it. + +If you do not follow this rule, Bad Things will happen. The worst-case Bad +Thing is that the entire file will be lost. A less-bad Bad Thing is that one +or more of the simultaneous writers will lose their changes. An observer of +the file may not see monotonically-increasing changes to the file, i.e. they +may see version 1, then version 2, then 3, then 2 again. 
+ +Tahoe takes some amount of care to reduce the badness of these Bad Things. +One way you can help nudge it from the "lose your file" case into the "lose +some changes" case is to reduce the number of competing versions: multiple +versions of the file that different parties are trying to establish as the +one true current contents. Each simultaneous writer counts as a "competing +version", as does the previous version of the file. If the count "S" of these +competing versions is larger than N/k, then the file runs the risk of being +lost completely. [TODO] If at least one of the writers remains running after +the collision is detected, it will attempt to recover, but if S>(N/k) and all +writers crash after writing a few shares, the file will be lost. + +Note that Tahoe uses serialization internally to make sure that a single +Tahoe node will not perform simultaneous modifications to a mutable file. It +accomplishes this by using a weakref cache of the MutableFileNode (so that +there will never be two distinct MutableFileNodes for the same file), and by +forcing all mutable file operations to obtain a per-node lock before they +run. The Prime Coordination Directive therefore applies to inter-node +conflicts, not intra-node ones. + + +Small Distributed Mutable Files +=============================== + +SDMF slots are suitable for small (<1MB) files that are edited by rewriting +the entire file. The three operations are: + + * allocate (with initial contents) + * set (with new contents) + * get (old contents) + +The first use of SDMF slots will be to hold directories (dirnodes), which map +encrypted child names to rw-URI/ro-URI pairs. + +SDMF slots overview +------------------- + +Each SDMF slot is created with a public/private key pair. The public key is +known as the "verification key", while the private key is called the +"signature key". The private key is hashed and truncated to 16 bytes to form +the "write key" (an AES symmetric key).
The write key is then hashed and +truncated to form the "read key". The read key is hashed and truncated to +form the 16-byte "storage index" (a unique string used as an index to locate +stored data). + +The public key is hashed by itself to form the "verification key hash". + +The write key is hashed a different way to form the "write enabler master". +For each storage server on which a share is kept, the write enabler master is +concatenated with the server's nodeid and hashed, and the result is called +the "write enabler" for that particular server. Note that multiple shares of +the same slot stored on the same server will all get the same write enabler, +i.e. the write enabler is associated with the "bucket", rather than the +individual shares. + +The private key is encrypted (using AES in counter mode) by the write key, +and the resulting crypttext is stored on the servers, so it will be +retrievable by anyone who knows the write key. The write key is not used to +encrypt anything else, and the private key never changes, so we do not need +an IV for this purpose. + +The actual data is encrypted (using AES in counter mode) with a key derived +by concatenating the readkey with the IV, then hashing the result and +truncating to 16 bytes. The IV is randomly generated each time the slot is +updated, and stored next to the encrypted data. + +The read-write URI consists of the write key and the verification key hash. +The read-only URI contains the read key and the verification key hash. The +verify-only URI contains the storage index and the verification key hash. + +:: + + URI:SSK-RW:b2a(writekey):b2a(verification_key_hash) + URI:SSK-RO:b2a(readkey):b2a(verification_key_hash) + URI:SSK-Verify:b2a(storage_index):b2a(verification_key_hash) + +Note that this allows the read-only and verify-only URIs to be derived from +the read-write URI without actually retrieving the public keys.
Also note +that it means the read-write agent must validate both the private key and the +public key when they are first fetched. All users validate the public key in +exactly the same way. + +The SDMF slot is allocated by sending a request to the storage server with a +desired size, the storage index, and the write enabler for that server's +nodeid. If granted, the write enabler is stashed inside the slot's backing +store file. All further write requests must be accompanied by the write +enabler or they will not be honored. The storage server does not share the +write enabler with anyone else. + +The SDMF slot structure will be described in more detail below. The important +pieces are: + +* a sequence number +* a root hash "R" +* the encoding parameters (including k, N, file size, segment size) +* a signed copy of [seqnum,R,encoding_params], using the signature key +* the verification key (not encrypted) +* the share hash chain (part of a Merkle tree over the share hashes) +* the block hash tree (Merkle tree over blocks of share data) +* the share data itself (erasure-coding of read-key-encrypted file data) +* the signature key, encrypted with the write key + +The access pattern for read is: + +* hash read-key to get storage index +* use storage index to locate 'k' shares with identical 'R' values + + * either get one share, read 'k' from it, then read k-1 shares + * or read, say, 5 shares, discover k, either get more or be finished + * or copy k into the URIs + +* read verification key +* hash verification key, compare against verification key hash +* read seqnum, R, encoding parameters, signature +* verify signature against verification key +* read share data, compute block-hash Merkle tree and root "r" +* read share hash chain (leading from "r" to "R") +* validate share hash chain up to the root "R" +* submit share data to erasure decoding +* decrypt decoded data with read-key +* submit plaintext to application + +The access pattern for write is: + +* hash 
write-key to get read-key, hash read-key to get storage index +* use the storage index to locate at least one share +* read verification key and encrypted signature key +* decrypt signature key using write-key +* hash signature key, compare against write-key +* hash verification key, compare against verification key hash +* encrypt plaintext from application with read-key + + * application can encrypt some data with the write-key to make it only + available to writers (use this for transitive read-onlyness of dirnodes) + +* erasure-code crypttext to form shares +* split shares into blocks +* compute Merkle tree of blocks, giving root "r" for each share +* compute Merkle tree of shares, find root "R" for the file as a whole +* create share data structures, one per server: + + * use seqnum which is one higher than the old version + * share hash chain has log(N) hashes, different for each server + * signed data is the same for each server + +* now we have N shares and need homes for them +* walk through peers + + * if share is not already present, allocate-and-set + * otherwise, try to modify existing share: + * send testv_and_writev operation to each one + * testv says to accept share if their(seqnum+R) <= our(seqnum+R) + * count how many servers wind up with which versions (histogram over R) + * keep going until N servers have the same version, or we run out of servers + + * if any servers wound up with a different version, report error to + application + * if we ran out of servers, initiate recovery process (described below) + +Server Storage Protocol +----------------------- + +The storage servers will provide a mutable slot container which is oblivious +to the details of the data being contained inside it. Each storage index +refers to a "bucket", and each bucket has one or more shares inside it. (In a +well-provisioned network, each bucket will have only one share). 
The bucket +is stored as a directory, using the base32-encoded storage index as the +directory name. Each share is stored in a single file, using the share number +as the filename. + +The container holds space for a container magic number (for versioning), the +write enabler, the nodeid which accepted the write enabler (used for share +migration, described below), a small number of lease structures, the embedded +data itself, and expansion space for additional lease structures:: + + # offset size name + 1 0 32 magic verstr "tahoe mutable container v1" plus binary + 2 32 20 write enabler's nodeid + 3 52 32 write enabler + 4 84 8 data size (actual share data present) (a) + 5 92 8 offset of (8) count of extra leases (after data) + 6 100 368 four leases, 92 bytes each + 0 4 ownerid (0 means "no lease here") + 4 4 expiration timestamp + 8 32 renewal token + 40 32 cancel token + 72 20 nodeid which accepted the tokens + 7 468 (a) data + 8 ?? 4 count of extra leases + 9 ?? n*92 extra leases + +The "extra leases" field must be copied and rewritten each time the size of +the enclosed data changes. The hope is that most buckets will have four or +fewer leases and this extra copying will not usually be necessary. + +The (4) "data size" field contains the actual number of bytes of data present +in field (7), such that a client request to read beyond 504+(a) will result +in an error. This allows the client to (one day) read relative to the end of +the file. The container size (that is, (8)-(7)) might be larger, especially +if extra size was pre-allocated in anticipation of filling the container with +a lot of data. + +The offset in (5) points at the *count* of extra leases, at (8). The actual +leases (at (9)) begin 4 bytes later. If the container size changes, both (8) +and (9) must be relocated by copying. + +The server will honor any write commands that provide the write token and do +not exceed the server-wide storage size limitations. 
Read and write commands +MUST be restricted to the 'data' portion of the container: the implementation +of those commands MUST perform correct bounds-checking to make sure other +portions of the container are inaccessible to the clients. + +The two methods provided by the storage server on these "MutableSlot" share +objects are: + +* readv(ListOf(offset=int, length=int)) + + * returns a list of bytestrings, of the various requested lengths + * offset < 0 is interpreted relative to the end of the data + * spans which hit the end of the data will return truncated data + +* testv_and_writev(write_enabler, test_vector, write_vector) + + * this is a test-and-set operation which performs the given tests and only + applies the desired writes if all tests succeed. This is used to detect + simultaneous writers, and to reduce the chance that an update will lose + data recently written by some other party (written after the last time + this slot was read). + * test_vector=ListOf(TupleOf(offset, length, opcode, specimen)) + * the opcode is a string, from the set [gt, ge, eq, le, lt, ne] + * each element of the test vector is read from the slot's data and + compared against the specimen using the desired (in)equality. If all + tests evaluate True, the write is performed + * write_vector=ListOf(TupleOf(offset, newdata)) + + * offset < 0 is not yet defined, it probably means relative to the + end of the data, which probably means append, but we haven't nailed + it down quite yet + * write vectors are executed in order, which specifies the results of + overlapping writes + + * return value: + + * error: OutOfSpace + * error: something else (io error, out of memory, whatever) + * (True, old_test_data): the write was accepted (test_vector passed) + * (False, old_test_data): the write was rejected (test_vector failed) + + * both 'accepted' and 'rejected' return the old data that was used + for the test_vector comparison. 
This can be used by the client + to detect write collisions, including collisions for which the + desired behavior was to overwrite the old version. + +In addition, the storage server provides several methods to access these +share objects: + +* allocate_mutable_slot(storage_index, sharenums=SetOf(int)) + + * returns DictOf(int, MutableSlot) + +* get_mutable_slot(storage_index) + + * returns DictOf(int, MutableSlot) + * or raises KeyError + +We intend to add an interface which allows small slots to allocate-and-write +in a single call, as well as do update or read in a single call. The goal is +to allow a reasonably-sized dirnode to be created (or updated, or read) in +just one round trip (to all N shareholders in parallel). + +migrating shares +```````````````` + +If a share must be migrated from one server to another, two values become +invalid: the write enabler (since it was computed for the old server), and +the lease renew/cancel tokens. + +Suppose that a slot was first created on nodeA, and was thus initialized with +WE(nodeA) (= H(WEM+nodeA)). Later, for provisioning reasons, the share is +moved from nodeA to nodeB. + +Readers may still be able to find the share in its new home, depending upon +how many servers are present in the grid, where the new nodeid lands in the +permuted index for this particular storage index, and how many servers the +reading client is willing to contact. + +When a client attempts to write to this migrated share, it will get a "bad +write enabler" error, since the WE it computes for nodeB will not match the +WE(nodeA) that was embedded in the share. When this occurs, the "bad write +enabler" message must include the old nodeid (e.g. nodeA) that was in the +share. + +The client then computes H(nodeB+H(WEM+nodeA)), which is the same as +H(nodeB+WE(nodeA)). The client sends this along with the new WE(nodeB), which +is H(WEM+nodeB). Note that the client only sends WE(nodeB) to nodeB, never to +anyone else. 
Also note that the client does not send a value to nodeB that +would allow the node to impersonate the client to a third node: everything +sent to nodeB will include something specific to nodeB in it. + +The server locally computes H(nodeB+WE(nodeA)), using its own node id and the +old write enabler from the share. It compares this against the value supplied +by the client. If they match, this serves as proof that the client was able +to compute the old write enabler. The server then accepts the client's new +WE(nodeB) and writes it into the container. + +This WE-fixup process requires an extra round trip, and requires the error +message to include the old nodeid, but does not require any public key +operations on either client or server. + +Migrating the leases will require a similar protocol. This protocol will be +defined concretely at a later date. + +Code Details +------------ + +The MutableFileNode class is used to manipulate mutable files (as opposed to +ImmutableFileNodes). These are initially generated with +client.create_mutable_file(), and later recreated from URIs with +client.create_node_from_uri(). Instances of this class will contain a URI and +a reference to the client (for peer selection and connection). + +NOTE: this section is out of date. Please see src/allmydata/interfaces.py +(the section on IMutableFilesystemNode) for more accurate information. + +The methods of MutableFileNode are: + +* download_to_data() -> [deferred] newdata, NotEnoughSharesError + + * if there are multiple retrievable versions in the grid, get() returns + the first version it can reconstruct, and silently ignores the others. + In the future, a more advanced API will signal and provide access to + the multiple heads.
+ +* update(newdata) -> OK, UncoordinatedWriteError, NotEnoughSharesError +* overwrite(newdata) -> OK, UncoordinatedWriteError, NotEnoughSharesError + +download_to_data() causes a new retrieval to occur, pulling the current +contents from the grid and returning them to the caller. At the same time, +this call caches information about the current version of the file. This +information will be used in a subsequent call to update(), and if another +change has occurred between the two, this information will be out of date, +triggering the UncoordinatedWriteError. + +update() is therefore intended to be used just after a download_to_data(), in +the following pattern:: + + d = mfn.download_to_data() + d.addCallback(apply_delta) + d.addCallback(mfn.update) + +If the update() call raises UCW, then the application can simply return an +error to the user ("you violated the Prime Coordination Directive"), and they +can try again later. Alternatively, the application can attempt to retry on +its own. To accomplish this, the app needs to pause, download the new +(post-collision and post-recovery) form of the file, reapply their delta, +then submit the update request again. A randomized pause is necessary to +reduce the chances of colliding a second time with another client that is +doing exactly the same thing:: + + d = mfn.download_to_data() + d.addCallback(apply_delta) + d.addCallback(mfn.update) + def _retry(f): + f.trap(UncoordinatedWriteError) + d1 = pause(random.uniform(5, 20)) + d1.addCallback(lambda res: mfn.download_to_data()) + d1.addCallback(apply_delta) + d1.addCallback(mfn.update) + return d1 + d.addErrback(_retry) + +Enthusiastic applications can retry multiple times, using a randomized +exponential backoff between each. A particularly enthusiastic application can +retry forever, but such apps are encouraged to provide a means to the user of +giving up after a while.
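The randomized exponential backoff mentioned above could look roughly like the following. This is a synchronous stand-in for the Deferred chain, written only to show the delay schedule and the give-up point. ``backoff_delays`` and ``update_with_retry`` are hypothetical names, the ``UncoordinatedWriteError`` class here is a local placeholder for Tahoe's exception, and the caller supplies ``download``, ``apply_delta``, ``update``, and ``sleep``.

```python
import random


class UncoordinatedWriteError(Exception):
    """Local placeholder for Tahoe's exception of the same name."""


def backoff_delays(base=5.0, factor=2.0, max_retries=5, rng=random):
    """Yield randomized pauses (seconds) whose ceiling grows each retry."""
    ceiling = base
    for _ in range(max_retries):
        # Pick uniformly in [ceiling/2, ceiling] so that two colliding
        # clients are unlikely to wake up at the same moment.
        yield rng.uniform(ceiling / 2, ceiling)
        ceiling *= factor


def update_with_retry(download, apply_delta, update, sleep, **backoff):
    """Run download/apply/update, pausing and retrying after each UCW."""
    delays = backoff_delays(**backoff)
    while True:
        try:
            return update(apply_delta(download()))
        except UncoordinatedWriteError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: report the collision to the user
            sleep(delay)
```

Capping the number of retries gives the user the "give up after a while" escape hatch. A Twisted version would return a Deferred from the pause instead of blocking in ``sleep``.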
+ +UCW does not mean that the update was not applied, so it is also a good idea +to skip the retry-update step if the delta was already applied:: + + d = mfn.download_to_data() + d.addCallback(apply_delta) + d.addCallback(mfn.update) + def _retry(f): + f.trap(UncoordinatedWriteError) + d1 = pause(random.uniform(5, 20)) + d1.addCallback(lambda res: mfn.download_to_data()) + def _maybe_apply_delta(contents): + new_contents = apply_delta(contents) + if new_contents != contents: + return mfn.update(new_contents) + d1.addCallback(_maybe_apply_delta) + return d1 + d.addErrback(_retry) + +update() is the right interface to use for delta-application situations, like +directory nodes (in which apply_delta might be adding or removing child +entries from a serialized table). + +Note that any uncoordinated write has the potential to lose data. We must do +more analysis to be sure, but it appears that two clients who write to the +same mutable file at the same time (even if both eventually retry) will, with +high probability, result in one client observing UCW and the other silently +losing their changes. It is also possible for both clients to observe UCW. +The moral of the story is that the Prime Coordination Directive is there for +a reason, and that recovery/UCW/retry is not a substitute for write +coordination. + +overwrite() tells the client to ignore this cached version information, and +to unconditionally replace the mutable file's contents with the new data. +This should not be used in delta application, but rather in situations where +you want to replace the file's contents with completely unrelated ones. When +raw files are uploaded into a mutable slot through the tahoe webapi (using +POST and the ?mutable=true argument), they are put in place with overwrite(). + +The peer-selection and data-structure manipulation (and signing/verification) +steps will be implemented in a separate class in allmydata/mutable.py .
+ +SMDF Slot Format +---------------- + +This SMDF data lives inside a server-side MutableSlot container. The server +is oblivious to this format. + +This data is tightly packed. In particular, the share data is defined to run +all the way to the beginning of the encrypted private key (the encprivkey +offset is used both to terminate the share data and to begin the encprivkey). + +:: + + # offset size name + 1 0 1 version byte, \x00 for this format + 2 1 8 sequence number. 2^64-1 must be handled specially, TBD + 3 9 32 "R" (root of share hash Merkle tree) + 4 41 16 IV (share data is AES(H(readkey+IV)) ) + 5 57 18 encoding parameters: + 57 1 k + 58 1 N + 59 8 segment size + 67 8 data length (of original plaintext) + 6 75 32 offset table: + 75 4 (8) signature + 79 4 (9) share hash chain + 83 4 (10) block hash tree + 87 4 (11) share data + 91 8 (12) encrypted private key + 99 8 (13) EOF + 7 107 436ish verification key (2048 RSA key) + 8 543ish 256ish signature=RSAenc(sigkey, H(version+seqnum+r+IV+encparm)) + 9 799ish (a) share hash chain, encoded as: + "".join([pack(">H32s", shnum, hash) + for (shnum,hash) in needed_hashes]) + 10 (927ish) (b) block hash tree, encoded as: + "".join([pack(">32s",hash) for hash in block_hash_tree]) + 11 (935ish) LEN share data (no gap between this and encprivkey) + 12 ?? 1216ish encrypted private key= AESenc(write-key, RSA-key) + 13 ?? -- EOF + + (a) The share hash chain contains ceil(log(N)) hashes, each 32 bytes long. + This is the set of hashes necessary to validate this share's leaf in the + share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes. + (b) The block hash tree contains ceil(length/segsize) hashes, each 32 bytes + long. This is the set of hashes necessary to validate any given block of + share data up to the per-share root "r". Each "r" is a leaf of the share + hash tree (with root "R"), from which a minimal subset of hashes is put in + the share hash chain in (9).
+ +Recovery +-------- + +The first line of defense against damage caused by colliding writes is the +Prime Coordination Directive: "Don't Do That". + +The second line of defense is to keep "S" (the number of competing versions) +lower than N/k. If this holds true, at least one competing version will have +k shares and thus be recoverable. Note that server unavailability counts +against us here: the old version stored on the unavailable server must be +included in the value of S. + +The third line of defense is our use of testv_and_writev() (described below), +which increases the convergence of simultaneous writes: one of the writers +will be favored (the one with the highest "R"), and that version is more +likely to be accepted than the others. This defense is least effective in the +pathological situation where S simultaneous writers are active, the one with +the lowest "R" writes to N-k+1 of the shares and then dies, then the one with +the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the +one with the highest "R" writes to k-1 shares and dies. Any other sequencing +will allow the highest "R" to write to at least k shares and establish a new +revision. + +The fourth line of defense is the fact that each client keeps writing until +at least one version has N shares. This uses additional servers, if +necessary, to make sure that either the client's version or some +newer/overriding version is highly available. + +The fifth line of defense is the recovery algorithm, which seeks to make sure +that at least *one* version is highly available, even if that version is +somebody else's. 
+ +The write-shares-to-peers algorithm is as follows: + +* permute peers according to storage index +* walk through peers, trying to assign one share per peer +* for each peer: + + * send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test + + * this means that we will overwrite any old versions, and we will + overwrite simultaneous writers of the same version if our R is higher. + We will not overwrite writers using a higher seqnum. + + * record the version that each share winds up with. If the write was + accepted, this is our own version. If it was rejected, read the + old_test_data to find out what version was retained. + * if old_test_data indicates the seqnum was equal to or greater than our + own, mark the "Simultaneous Writes Detected" flag, which will eventually + result in an error being reported to the writer (in their close() call). + * build a histogram of "R" values + * repeat until the histogram indicates that some version (possibly ours) + has N shares. Use new servers if necessary. + * If we run out of servers: + + * if there are at least shares-of-happiness of any one version, we're + happy, so return. (the close() might still get an error) + * not happy, need to reinforce something, goto RECOVERY + +Recovery: + +* read all shares, count the versions, identify the recoverable ones, + discard the unrecoverable ones. +* sort versions: locate max(seqnums), put all versions with that seqnum + in the list, sort by number of outstanding shares. Then put our own + version. (TODO: put versions with seqnum <max but >us ahead of us?).
+* for each version: + + * attempt to recover that version + * if not possible, remove it from the list, go to next one + * if recovered, start at beginning of peer list, push that version, + continue until N shares are placed + * if pushing our own version, bump up the seqnum to one higher than + the max seqnum we saw + * if we run out of servers: + + * schedule retry and exponential backoff to repeat RECOVERY + + * admit defeat after some period? presumably the client will be shut down + eventually, maybe keep trying (once per hour?) until then. + + +Medium Distributed Mutable Files +================================ + +These are just like the SDMF case, but: + +* we actually take advantage of the Merkle hash tree over the blocks, by + reading a single segment of data at a time (and its necessary hashes), to + reduce the read-time alacrity +* we allow arbitrary writes to the file (i.e. seek() is provided, and + O_TRUNC is no longer required) +* we write more code on the client side (in the MutableFileNode class), to + first read each segment that a write must modify. This looks exactly like + the way a normal filesystem uses a block device, or how a CPU must perform + a cache-line fill before modifying a single word. +* we might implement some sort of copy-based atomic update server call, + to allow multiple writev() calls to appear atomic to any readers. + +MDMF slots provide fairly efficient in-place edits of very large files (a few +GB). Appending data is also fairly efficient, although each time a power of 2 +boundary is crossed, the entire file must effectively be re-uploaded (because +the size of the block hash tree changes), so if the filesize is known in +advance, that space ought to be pre-allocated (by leaving extra space between +the block hash tree and the actual data). + +MDMF1 uses the Merkle tree to enable low-alacrity random-access reads. MDMF2 +adds cache-line reads to allow random-access writes.
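The "read each segment that a write must modify" step described for MDMF can be sketched as a segment-level read-modify-write. This is a minimal in-memory illustration, not the MutableFileNode implementation: `write_at` and the list-of-bytes segment store are hypothetical stand-ins, and the sketch assumes the write stays inside already-existing, fixed-size segments (no appends, no hashing, no encryption).

```python
def write_at(segments, segsize, offset, data):
    """Apply `data` at byte `offset` to a list of fixed-size segments.

    Each affected segment is first read in full (the "cache-line fill"),
    patched in memory, then stored back. Returns the indices of the
    segments that were rewritten. Hypothetical helper for illustration.
    """
    touched = []
    pos = offset
    consumed = 0
    while consumed < len(data):
        index, start = divmod(pos, segsize)
        # Take only as many bytes as fit in the rest of this segment.
        n = min(segsize - start, len(data) - consumed)
        old = bytearray(segments[index])           # read the whole segment
        old[start:start + n] = data[consumed:consumed + n]
        segments[index] = bytes(old)               # write the whole segment back
        touched.append(index)
        pos += n
        consumed += n
    return touched
```

A 3-byte write that straddles a segment boundary rewrites two whole segments and leaves the rest untouched, which is why small edits to a large file stay cheap in this scheme.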
+ +Large Distributed Mutable Files +=============================== + +LDMF slots use a fundamentally different way to store the file, inspired by +Mercurial's "revlog" format. They enable very efficient insert/remove/replace +editing of arbitrary spans. Multiple versions of the file can be retained, in +a revision graph that can have multiple heads. Each revision can be +referenced by a cryptographic identifier. There are two forms of the URI, one +that means "most recent version", and a longer one that points to a specific +revision. + +Metadata can be attached to the revisions, like timestamps, to enable rolling +back an entire tree to a specific point in history. + +LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2 +provides explicit support for revision identifiers and branching. + +TODO +==== + +improve allocate-and-write or get-writer-buckets API to allow one-call (or +maybe two-call) updates. The challenge is in figuring out which shares are on +which machines. First cut will have lots of round trips. + +(eventually) define behavior when seqnum wraps. At the very least make sure +it can't cause a security problem. "the slot is worn out" is acceptable. + +(eventually) define share-migration lease update protocol. Including the +nodeid who accepted the lease is useful, we can use the same protocol as we +do for updating the write enabler. However we need to know which lease to +update.. maybe send back a list of all old nodeids that we find, then try all +of them when we accept the update? + +We now do this in a specially-formatted IndexError exception: + "UNABLE to renew non-existent lease. I have leases accepted by " + + "nodeids: '12345','abcde','44221' ." + +confirm that a repairer can regenerate shares without the private key. Hmm, +without the write-enabler they won't be able to write those shares to the +servers.. although they could add immutable new shares to new servers. 
diff --git a/docs/specifications/mutable.txt b/docs/specifications/mutable.txt deleted file mode 100644 index 0d7e71e5..00000000 --- a/docs/specifications/mutable.txt +++ /dev/null @@ -1,704 +0,0 @@ -============= -Mutable Files -============= - -This describes the "RSA-based mutable files" which were shipped in Tahoe v0.8.0. - -1. `Consistency vs. Availability`_ -2. `The Prime Coordination Directive: "Don't Do That"`_ -3. `Small Distributed Mutable Files`_ - - 1. `SDMF slots overview`_ - 2. `Server Storage Protocol`_ - 3. `Code Details`_ - 4. `SMDF Slot Format`_ - 5. `Recovery`_ - -4. `Medium Distributed Mutable Files`_ -5. `Large Distributed Mutable Files`_ -6. `TODO`_ - -Mutable File Slots are places with a stable identifier that can hold data -that changes over time. In contrast to CHK slots, for which the -URI/identifier is derived from the contents themselves, the Mutable File Slot -URI remains fixed for the life of the slot, regardless of what data is placed -inside it. - -Each mutable slot is referenced by two different URIs. The "read-write" URI -grants read-write access to its holder, allowing them to put whatever -contents they like into the slot. The "read-only" URI is less powerful, only -granting read access, and not enabling modification of the data. The -read-write URI can be turned into the read-only URI, but not the other way -around. - -The data in these slots is distributed over a number of servers, using the -same erasure coding that CHK files use, with 3-of-10 being a typical choice -of encoding parameters. The data is encrypted and signed in such a way that -only the holders of the read-write URI will be able to set the contents of -the slot, and only the holders of the read-only URI will be able to read -those contents. Holders of either URI will be able to validate the contents -as being written by someone with the read-write URI. 
The servers who hold the -shares cannot read or modify them: the worst they can do is deny service (by -deleting or corrupting the shares), or attempt a rollback attack (which can -only succeed with the cooperation of at least k servers). - -Consistency vs. Availability -============================ - -There is an age-old battle between consistency and availability. Epic papers -have been written, elaborate proofs have been established, and generations of -theorists have learned that you cannot simultaneously achieve guaranteed -consistency with guaranteed reliability. In addition, the closer to 0 you get -on either axis, the cost and complexity of the design goes up. - -Tahoe's design goals are to largely favor design simplicity, then slightly -favor read availability, over the other criteria. - -As we develop more sophisticated mutable slots, the API may expose multiple -read versions to the application layer. The tahoe philosophy is to defer most -consistency recovery logic to the higher layers. Some applications have -effective ways to merge multiple versions, so inconsistency is not -necessarily a problem (i.e. directory nodes can usually merge multiple "add -child" operations). - -The Prime Coordination Directive: "Don't Do That" -================================================= - -The current rule for applications which run on top of Tahoe is "do not -perform simultaneous uncoordinated writes". That means you need non-tahoe -means to make sure that two parties are not trying to modify the same mutable -slot at the same time. For example: - -* don't give the read-write URI to anyone else. Dirnodes in a private - directory generally satisfy this case, as long as you don't use two - clients on the same account at the same time -* if you give a read-write URI to someone else, stop using it yourself. An - inbox would be a good example of this. 
-* if you give a read-write URI to someone else, call them on the phone - before you write into it -* build an automated mechanism to have your agents coordinate writes. - For example, we expect a future release to include a FURL for a - "coordination server" in the dirnodes. The rule can be that you must - contact the coordination server and obtain a lock/lease on the file - before you're allowed to modify it. - -If you do not follow this rule, Bad Things will happen. The worst-case Bad -Thing is that the entire file will be lost. A less-bad Bad Thing is that one -or more of the simultaneous writers will lose their changes. An observer of -the file may not see monotonically-increasing changes to the file, i.e. they -may see version 1, then version 2, then 3, then 2 again. - -Tahoe takes some amount of care to reduce the badness of these Bad Things. -One way you can help nudge it from the "lose your file" case into the "lose -some changes" case is to reduce the number of competing versions: multiple -versions of the file that different parties are trying to establish as the -one true current contents. Each simultaneous writer counts as a "competing -version", as does the previous version of the file. If the count "S" of these -competing versions is larger than N/k, then the file runs the risk of being -lost completely. [TODO] If at least one of the writers remains running after -the collision is detected, it will attempt to recover, but if S>(N/k) and all -writers crash after writing a few shares, the file will be lost. - -Note that Tahoe uses serialization internally to make sure that a single -Tahoe node will not perform simultaneous modifications to a mutable file. It -accomplishes this by using a weakref cache of the MutableFileNode (so that -there will never be two distinct MutableFileNodes for the same file), and by -forcing all mutable file operations to obtain a per-node lock before they -run. 
The Prime Coordination Directive therefore applies to inter-node -conflicts, not intra-node ones. - - -Small Distributed Mutable Files -=============================== - -SDMF slots are suitable for small (<1MB) files that are editing by rewriting -the entire file. The three operations are: - - * allocate (with initial contents) - * set (with new contents) - * get (old contents) - -The first use of SDMF slots will be to hold directories (dirnodes), which map -encrypted child names to rw-URI/ro-URI pairs. - -SDMF slots overview -------------------- - -Each SDMF slot is created with a public/private key pair. The public key is -known as the "verification key", while the private key is called the -"signature key". The private key is hashed and truncated to 16 bytes to form -the "write key" (an AES symmetric key). The write key is then hashed and -truncated to form the "read key". The read key is hashed and truncated to -form the 16-byte "storage index" (a unique string used as an index to locate -stored data). - -The public key is hashed by itself to form the "verification key hash". - -The write key is hashed a different way to form the "write enabler master". -For each storage server on which a share is kept, the write enabler master is -concatenated with the server's nodeid and hashed, and the result is called -the "write enabler" for that particular server. Note that multiple shares of -the same slot stored on the same server will all get the same write enabler, -i.e. the write enabler is associated with the "bucket", rather than the -individual shares. - -The private key is encrypted (using AES in counter mode) by the write key, -and the resulting crypttext is stored on the servers. so it will be -retrievable by anyone who knows the write key. The write key is not used to -encrypt anything else, and the private key never changes, so we do not need -an IV for this purpose. 
- -The actual data is encrypted (using AES in counter mode) with a key derived -by concatenating the readkey with the IV, the hashing the results and -truncating to 16 bytes. The IV is randomly generated each time the slot is -updated, and stored next to the encrypted data. - -The read-write URI consists of the write key and the verification key hash. -The read-only URI contains the read key and the verification key hash. The -verify-only URI contains the storage index and the verification key hash. - -:: - - URI:SSK-RW:b2a(writekey):b2a(verification_key_hash) - URI:SSK-RO:b2a(readkey):b2a(verification_key_hash) - URI:SSK-Verify:b2a(storage_index):b2a(verification_key_hash) - -Note that this allows the read-only and verify-only URIs to be derived from -the read-write URI without actually retrieving the public keys. Also note -that it means the read-write agent must validate both the private key and the -public key when they are first fetched. All users validate the public key in -exactly the same way. - -The SDMF slot is allocated by sending a request to the storage server with a -desired size, the storage index, and the write enabler for that server's -nodeid. If granted, the write enabler is stashed inside the slot's backing -store file. All further write requests must be accompanied by the write -enabler or they will not be honored. The storage server does not share the -write enabler with anyone else. - -The SDMF slot structure will be described in more detail below. 
The important -pieces are: - -* a sequence number -* a root hash "R" -* the encoding parameters (including k, N, file size, segment size) -* a signed copy of [seqnum,R,encoding_params], using the signature key -* the verification key (not encrypted) -* the share hash chain (part of a Merkle tree over the share hashes) -* the block hash tree (Merkle tree over blocks of share data) -* the share data itself (erasure-coding of read-key-encrypted file data) -* the signature key, encrypted with the write key - -The access pattern for read is: - -* hash read-key to get storage index -* use storage index to locate 'k' shares with identical 'R' values - - * either get one share, read 'k' from it, then read k-1 shares - * or read, say, 5 shares, discover k, either get more or be finished - * or copy k into the URIs - -* read verification key -* hash verification key, compare against verification key hash -* read seqnum, R, encoding parameters, signature -* verify signature against verification key -* read share data, compute block-hash Merkle tree and root "r" -* read share hash chain (leading from "r" to "R") -* validate share hash chain up to the root "R" -* submit share data to erasure decoding -* decrypt decoded data with read-key -* submit plaintext to application - -The access pattern for write is: - -* hash write-key to get read-key, hash read-key to get storage index -* use the storage index to locate at least one share -* read verification key and encrypted signature key -* decrypt signature key using write-key -* hash signature key, compare against write-key -* hash verification key, compare against verification key hash -* encrypt plaintext from application with read-key - - * application can encrypt some data with the write-key to make it only - available to writers (use this for transitive read-onlyness of dirnodes) - -* erasure-code crypttext to form shares -* split shares into blocks -* compute Merkle tree of blocks, giving root "r" for each share -* compute 
Merkle tree of shares, find root "R" for the file as a whole -* create share data structures, one per server: - - * use seqnum which is one higher than the old version - * share hash chain has log(N) hashes, different for each server - * signed data is the same for each server - -* now we have N shares and need homes for them -* walk through peers - - * if share is not already present, allocate-and-set - * otherwise, try to modify existing share: - * send testv_and_writev operation to each one - * testv says to accept share if their(seqnum+R) <= our(seqnum+R) - * count how many servers wind up with which versions (histogram over R) - * keep going until N servers have the same version, or we run out of servers - - * if any servers wound up with a different version, report error to - application - * if we ran out of servers, initiate recovery process (described below) - -Server Storage Protocol ------------------------ - -The storage servers will provide a mutable slot container which is oblivious -to the details of the data being contained inside it. Each storage index -refers to a "bucket", and each bucket has one or more shares inside it. (In a -well-provisioned network, each bucket will have only one share). The bucket -is stored as a directory, using the base32-encoded storage index as the -directory name. Each share is stored in a single file, using the share number -as the filename. 
- -The container holds space for a container magic number (for versioning), the -write enabler, the nodeid which accepted the write enabler (used for share -migration, described below), a small number of lease structures, the embedded -data itself, and expansion space for additional lease structures:: - - # offset size name - 1 0 32 magic verstr "tahoe mutable container v1" plus binary - 2 32 20 write enabler's nodeid - 3 52 32 write enabler - 4 84 8 data size (actual share data present) (a) - 5 92 8 offset of (8) count of extra leases (after data) - 6 100 368 four leases, 92 bytes each - 0 4 ownerid (0 means "no lease here") - 4 4 expiration timestamp - 8 32 renewal token - 40 32 cancel token - 72 20 nodeid which accepted the tokens - 7 468 (a) data - 8 ?? 4 count of extra leases - 9 ?? n*92 extra leases - -The "extra leases" field must be copied and rewritten each time the size of -the enclosed data changes. The hope is that most buckets will have four or -fewer leases and this extra copying will not usually be necessary. - -The (4) "data size" field contains the actual number of bytes of data present -in field (7), such that a client request to read beyond 504+(a) will result -in an error. This allows the client to (one day) read relative to the end of -the file. The container size (that is, (8)-(7)) might be larger, especially -if extra size was pre-allocated in anticipation of filling the container with -a lot of data. - -The offset in (5) points at the *count* of extra leases, at (8). The actual -leases (at (9)) begin 4 bytes later. If the container size changes, both (8) -and (9) must be relocated by copying. - -The server will honor any write commands that provide the write token and do -not exceed the server-wide storage size limitations. 
Read and write commands -MUST be restricted to the 'data' portion of the container: the implementation -of those commands MUST perform correct bounds-checking to make sure other -portions of the container are inaccessible to the clients. - -The two methods provided by the storage server on these "MutableSlot" share -objects are: - -* readv(ListOf(offset=int, length=int)) - - * returns a list of bytestrings, of the various requested lengths - * offset < 0 is interpreted relative to the end of the data - * spans which hit the end of the data will return truncated data - -* testv_and_writev(write_enabler, test_vector, write_vector) - - * this is a test-and-set operation which performs the given tests and only - applies the desired writes if all tests succeed. This is used to detect - simultaneous writers, and to reduce the chance that an update will lose - data recently written by some other party (written after the last time - this slot was read). - * test_vector=ListOf(TupleOf(offset, length, opcode, specimen)) - * the opcode is a string, from the set [gt, ge, eq, le, lt, ne] - * each element of the test vector is read from the slot's data and - compared against the specimen using the desired (in)equality. If all - tests evaluate True, the write is performed - * write_vector=ListOf(TupleOf(offset, newdata)) - - * offset < 0 is not yet defined, it probably means relative to the - end of the data, which probably means append, but we haven't nailed - it down quite yet - * write vectors are executed in order, which specifies the results of - overlapping writes - - * return value: - - * error: OutOfSpace - * error: something else (io error, out of memory, whatever) - * (True, old_test_data): the write was accepted (test_vector passed) - * (False, old_test_data): the write was rejected (test_vector failed) - - * both 'accepted' and 'rejected' return the old data that was used - for the test_vector comparison. 
This can be used by the client - to detect write collisions, including collisions for which the - desired behavior was to overwrite the old version. - -In addition, the storage server provides several methods to access these -share objects: - -* allocate_mutable_slot(storage_index, sharenums=SetOf(int)) - - * returns DictOf(int, MutableSlot) - -* get_mutable_slot(storage_index) - - * returns DictOf(int, MutableSlot) - * or raises KeyError - -We intend to add an interface which allows small slots to allocate-and-write -in a single call, as well as do update or read in a single call. The goal is -to allow a reasonably-sized dirnode to be created (or updated, or read) in -just one round trip (to all N shareholders in parallel). - -migrating shares -```````````````` - -If a share must be migrated from one server to another, two values become -invalid: the write enabler (since it was computed for the old server), and -the lease renew/cancel tokens. - -Suppose that a slot was first created on nodeA, and was thus initialized with -WE(nodeA) (= H(WEM+nodeA)). Later, for provisioning reasons, the share is -moved from nodeA to nodeB. - -Readers may still be able to find the share in its new home, depending upon -how many servers are present in the grid, where the new nodeid lands in the -permuted index for this particular storage index, and how many servers the -reading client is willing to contact. - -When a client attempts to write to this migrated share, it will get a "bad -write enabler" error, since the WE it computes for nodeB will not match the -WE(nodeA) that was embedded in the share. When this occurs, the "bad write -enabler" message must include the old nodeid (e.g. nodeA) that was in the -share. - -The client then computes H(nodeB+H(WEM+nodeA)), which is the same as -H(nodeB+WE(nodeA)). The client sends this along with the new WE(nodeB), which -is H(WEM+nodeB). Note that the client only sends WE(nodeB) to nodeB, never to -anyone else. 
Also note that the client does not send a value to nodeB that -would allow the node to impersonate the client to a third node: everything -sent to nodeB will include something specific to nodeB in it. - -The server locally computes H(nodeB+WE(nodeA)), using its own node id and the -old write enabler from the share. It compares this against the value supplied -by the client. If they match, this serves as proof that the client was able -to compute the old write enabler. The server then accepts the client's new -WE(nodeB) and writes it into the container. - -This WE-fixup process requires an extra round trip, and requires the error -message to include the old nodeid, but does not require any public key -operations on either client or server. - -Migrating the leases will require a similar protocol. This protocol will be -defined concretely at a later date. - -Code Details ------------- - -The MutableFileNode class is used to manipulate mutable files (as opposed to -ImmutableFileNodes). These are initially generated with -client.create_mutable_file(), and later recreated from URIs with -client.create_node_from_uri(). Instances of this class will contain a URI and -a reference to the client (for peer selection and connection). - -NOTE: this section is out of date. Please see src/allmydata/interfaces.py -(the section on IMutableFilesystemNode) for more accurate information. - -The methods of MutableFileNode are: - -* download_to_data() -> [deferred] newdata, NotEnoughSharesError - - * if there are multiple retrieveable versions in the grid, get() returns - the first version it can reconstruct, and silently ignores the others. - In the future, a more advanced API will signal and provide access to - the multiple heads. 
- -* update(newdata) -> OK, UncoordinatedWriteError, NotEnoughSharesError -* overwrite(newdata) -> OK, UncoordinatedWriteError, NotEnoughSharesError - -download_to_data() causes a new retrieval to occur, pulling the current -contents from the grid and returning them to the caller. At the same time, -this call caches information about the current version of the file. This -information will be used in a subsequent call to update(), and if another -change has occured between the two, this information will be out of date, -triggering the UncoordinatedWriteError. - -update() is therefore intended to be used just after a download_to_data(), in -the following pattern:: - - d = mfn.download_to_data() - d.addCallback(apply_delta) - d.addCallback(mfn.update) - -If the update() call raises UCW, then the application can simply return an -error to the user ("you violated the Prime Coordination Directive"), and they -can try again later. Alternatively, the application can attempt to retry on -its own. To accomplish this, the app needs to pause, download the new -(post-collision and post-recovery) form of the file, reapply their delta, -then submit the update request again. A randomized pause is necessary to -reduce the chances of colliding a second time with another client that is -doing exactly the same thing:: - - d = mfn.download_to_data() - d.addCallback(apply_delta) - d.addCallback(mfn.update) - def _retry(f): - f.trap(UncoordinatedWriteError) - d1 = pause(random.uniform(5, 20)) - d1.addCallback(lambda res: mfn.download_to_data()) - d1.addCallback(apply_delta) - d1.addCallback(mfn.update) - return d1 - d.addErrback(_retry) - -Enthusiastic applications can retry multiple times, using a randomized -exponential backoff between each. A particularly enthusiastic application can -retry forever, but such apps are encouraged to provide a means to the user of -giving up after a while. 
- -UCW does not mean that the update was not applied, so it is also a good idea -to skip the retry-update step if the delta was already applied:: - - d = mfn.download_to_data() - d.addCallback(apply_delta) - d.addCallback(mfn.update) - def _retry(f): - f.trap(UncoordinatedWriteError) - d1 = pause(random.uniform(5, 20)) - d1.addCallback(lambda res: mfn.download_to_data()) - def _maybe_apply_delta(contents): - new_contents = apply_delta(contents) - if new_contents != contents: - return mfn.update(new_contents) - d1.addCallback(_maybe_apply_delta) - return d1 - d.addErrback(_retry) - -update() is the right interface to use for delta-application situations, like -directory nodes (in which apply_delta might be adding or removing child -entries from a serialized table). - -Note that any uncoordinated write has the potential to lose data. We must do -more analysis to be sure, but it appears that two clients who write to the -same mutable file at the same time (even if both eventually retry) will, with -high probability, result in one client observing UCW and the other silently -losing their changes. It is also possible for both clients to observe UCW. -The moral of the story is that the Prime Coordination Directive is there for -a reason, and that recovery/UCW/retry is not a subsitute for write -coordination. - -overwrite() tells the client to ignore this cached version information, and -to unconditionally replace the mutable file's contents with the new data. -This should not be used in delta application, but rather in situations where -you want to replace the file's contents with completely unrelated ones. When -raw files are uploaded into a mutable slot through the tahoe webapi (using -POST and the ?mutable=true argument), they are put in place with overwrite(). - -The peer-selection and data-structure manipulation (and signing/verification) -steps will be implemented in a separate class in allmydata/mutable.py . 
- -SMDF Slot Format ----------------- - -This SMDF data lives inside a server-side MutableSlot container. The server -is oblivious to this format. - -This data is tightly packed. In particular, the share data is defined to run -all the way to the beginning of the encrypted private key (the encprivkey -offset is used both to terminate the share data and to begin the encprivkey). - -:: - - # offset size name - 1 0 1 version byte, \x00 for this format - 2 1 8 sequence number. 2^64-1 must be handled specially, TBD - 3 9 32 "R" (root of share hash Merkle tree) - 4 41 16 IV (share data is AES(H(readkey+IV)) ) - 5 57 18 encoding parameters: - 57 1 k - 58 1 N - 59 8 segment size - 67 8 data length (of original plaintext) - 6 75 32 offset table: - 75 4 (8) signature - 79 4 (9) share hash chain - 83 4 (10) block hash tree - 87 4 (11) share data - 91 8 (12) encrypted private key - 99 8 (13) EOF - 7 107 436ish verification key (2048 RSA key) - 8 543ish 256ish signature=RSAenc(sigkey, H(version+seqnum+r+IV+encparm)) - 9 799ish (a) share hash chain, encoded as: - "".join([pack(">H32s", shnum, hash) - for (shnum,hash) in needed_hashes]) - 10 (927ish) (b) block hash tree, encoded as: - "".join([pack(">32s",hash) for hash in block_hash_tree]) - 11 (935ish) LEN share data (no gap between this and encprivkey) - 12 ?? 1216ish encrypted private key= AESenc(write-key, RSA-key) - 13 ?? -- EOF - - (a) The share hash chain contains ceil(log(N)) hashes, each 32 bytes long. - This is the set of hashes necessary to validate this share's leaf in the - share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes. - (b) The block hash tree contains ceil(length/segsize) hashes, each 32 bytes - long. This is the set of hashes necessary to validate any given block of - share data up to the per-share root "r". Each "r" is a leaf of the share - has tree (with root "R"), from which a minimal subset of hashes is put in - the share hash chain in (8). 
- -Recovery --------- - -The first line of defense against damage caused by colliding writes is the -Prime Coordination Directive: "Don't Do That". - -The second line of defense is to keep "S" (the number of competing versions) -lower than N/k. If this holds true, at least one competing version will have -k shares and thus be recoverable. Note that server unavailability counts -against us here: the old version stored on the unavailable server must be -included in the value of S. - -The third line of defense is our use of testv_and_writev() (described below), -which increases the convergence of simultaneous writes: one of the writers -will be favored (the one with the highest "R"), and that version is more -likely to be accepted than the others. This defense is least effective in the -pathological situation where S simultaneous writers are active, the one with -the lowest "R" writes to N-k+1 of the shares and then dies, then the one with -the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the -one with the highest "R" writes to k-1 shares and dies. Any other sequencing -will allow the highest "R" to write to at least k shares and establish a new -revision. - -The fourth line of defense is the fact that each client keeps writing until -at least one version has N shares. This uses additional servers, if -necessary, to make sure that either the client's version or some -newer/overriding version is highly available. - -The fifth line of defense is the recovery algorithm, which seeks to make sure -that at least *one* version is highly available, even if that version is -somebody else's. 
- -The write-shares-to-peers algorithm is as follows: - -* permute peers according to storage index -* walk through peers, trying to assign one share per peer -* for each peer: - - * send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test - - * this means that we will overwrite any old versions, and we will - overwrite simultaenous writers of the same version if our R is higher. - We will not overwrite writers using a higher seqnum. - - * record the version that each share winds up with. If the write was - accepted, this is our own version. If it was rejected, read the - old_test_data to find out what version was retained. - * if old_test_data indicates the seqnum was equal or greater than our - own, mark the "Simultanous Writes Detected" flag, which will eventually - result in an error being reported to the writer (in their close() call). - * build a histogram of "R" values - * repeat until the histogram indicate that some version (possibly ours) - has N shares. Use new servers if necessary. - * If we run out of servers: - - * if there are at least shares-of-happiness of any one version, we're - happy, so return. (the close() might still get an error) - * not happy, need to reinforce something, goto RECOVERY - -Recovery: - -* read all shares, count the versions, identify the recoverable ones, - discard the unrecoverable ones. -* sort versions: locate max(seqnums), put all versions with that seqnum - in the list, sort by number of outstanding shares. Then put our own - version. (TODO: put versions with seqnum us ahead of us?). 
-* for each version: - - * attempt to recover that version - * if not possible, remove it from the list, go to next one - * if recovered, start at beginning of peer list, push that version, - continue until N shares are placed - * if pushing our own version, bump up the seqnum to one higher than - the max seqnum we saw - * if we run out of servers: - - * schedule retry and exponential backoff to repeat RECOVERY - - * admit defeat after some period? presumably the client will be shut down - eventually, maybe keep trying (once per hour?) until then. - - -Medium Distributed Mutable Files -================================ - -These are just like the SDMF case, but: - -* we actually take advantage of the Merkle hash tree over the blocks, by - reading a single segment of data at a time (and its necessary hashes), to - reduce the read-time alacrity -* we allow arbitrary writes to the file (i.e. seek() is provided, and - O_TRUNC is no longer required) -* we write more code on the client side (in the MutableFileNode class), to - first read each segment that a write must modify. This looks exactly like - the way a normal filesystem uses a block device, or how a CPU must perform - a cache-line fill before modifying a single word. -* we might implement some sort of copy-based atomic update server call, - to allow multiple writev() calls to appear atomic to any readers. - -MDMF slots provide fairly efficient in-place edits of very large files (a few -GB). Appending data is also fairly efficient, although each time a power of 2 -boundary is crossed, the entire file must effectively be re-uploaded (because -the size of the block hash tree changes), so if the filesize is known in -advance, that space ought to be pre-allocated (by leaving extra space between -the block hash tree and the actual data). - -MDMF1 uses the Merkle tree to enable low-alacrity random-access reads. MDMF2 -adds cache-line reads to allow random-access writes. 
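The remark about power-of-2 boundaries can be illustrated with a small calculation. This is only a sketch, assuming the block hash tree is a complete binary Merkle tree whose leaf count is padded up to the next power of two; the helper names are hypothetical and do not come from the Tahoe source.

```python
def padded_leaves(num_segments):
    """Smallest power of two that is >= num_segments (assumed padding rule)."""
    leaves = 1
    while leaves < num_segments:
        leaves *= 2
    return leaves


def hash_tree_nodes(num_segments):
    """Total node count of a complete binary tree over the padded leaves."""
    return 2 * padded_leaves(num_segments) - 1


# Appending within the same power-of-two bracket leaves the tree size alone;
# crossing the boundary (8 -> 9 segments) roughly doubles it.
for segments in (5, 8, 9, 16, 17):
    print(segments, hash_tree_nodes(segments))
```

Under this assumption, a file that grows from 8 to 9 segments needs a larger hash tree, which is why pre-allocating space for the tree avoids shifting the data that follows it.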
- -Large Distributed Mutable Files -=============================== - -LDMF slots use a fundamentally different way to store the file, inspired by -Mercurial's "revlog" format. They enable very efficient insert/remove/replace -editing of arbitrary spans. Multiple versions of the file can be retained, in -a revision graph that can have multiple heads. Each revision can be -referenced by a cryptographic identifier. There are two forms of the URI, one -that means "most recent version", and a longer one that points to a specific -revision. - -Metadata can be attached to the revisions, like timestamps, to enable rolling -back an entire tree to a specific point in history. - -LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2 -provides explicit support for revision identifiers and branching. - -TODO -==== - -improve allocate-and-write or get-writer-buckets API to allow one-call (or -maybe two-call) updates. The challenge is in figuring out which shares are on -which machines. First cut will have lots of round trips. - -(eventually) define behavior when seqnum wraps. At the very least make sure -it can't cause a security problem. "the slot is worn out" is acceptable. - -(eventually) define share-migration lease update protocol. Including the -nodeid who accepted the lease is useful, we can use the same protocol as we -do for updating the write enabler. However we need to know which lease to -update.. maybe send back a list of all old nodeids that we find, then try all -of them when we accept the update? - -We now do this in a specially-formatted IndexError exception: - "UNABLE to renew non-existent lease. I have leases accepted by " + - "nodeids: '12345','abcde','44221' ." - -confirm that a repairer can regenerate shares without the private key. Hmm, -without the write-enabler they won't be able to write those shares to the -servers.. although they could add immutable new shares to new servers. 
diff --git a/docs/specifications/outline.rst b/docs/specifications/outline.rst new file mode 100644 index 00000000..9ec69bff --- /dev/null +++ b/docs/specifications/outline.rst @@ -0,0 +1,221 @@ +============================== +Specification Document Outline +============================== + +While we do not yet have a clear set of specification documents for Tahoe +(explaining the file formats, so that others can write interoperable +implementations), this document is intended to lay out an outline for what +these specs ought to contain. Think of this as the ISO 7-Layer Model for +Tahoe. + +We currently imagine 4 documents. + +1. `#1: Share Format, Encoding Algorithm`_ +2. `#2: Share Exchange Protocol`_ +3. `#3: Server Selection Algorithm, filecap format`_ +4. `#4: Directory Format`_ + +#1: Share Format, Encoding Algorithm +==================================== + +This document will describe the way that files are encrypted and encoded into +shares. It will include a specification of the share format, and explain both +the encoding and decoding algorithms. It will cover both mutable and +immutable files. + +The immutable encoding algorithm, as described by this document, will start +with a plaintext series of bytes, encoding parameters "k" and "N", and either +an encryption key or a mechanism for deterministically deriving the key from +the plaintext (the CHK specification). The algorithm will end with a set of N +shares, and a set of values that must be included in the filecap to provide +confidentiality (the encryption key) and integrity (the UEB hash). + +The immutable decoding algorithm will start with the filecap values (key and +UEB hash) and "k" shares. It will explain how to validate the shares against +the integrity information, how to reverse the erasure-coding, and how to +decrypt the resulting ciphertext. It will result in the original plaintext +bytes (or some subrange thereof). + +The sections on mutable files will contain similar information. 
+ +This document is *not* responsible for explaining the filecap format, since +full filecaps may need to contain additional information as described in +document #3. Likewise it is not responsible for explaining where to put the +generated shares or where to find them again later. + +It is also not responsible for explaining the access control mechanisms +surrounding share upload, download, or modification ("Accounting" is the +business of controlling share upload to conserve space, and mutable file +shares require some sort of access control to prevent non-writecap holders +from destroying shares). We don't yet have a document dedicated to explaining +these, but let's call it "Access Control" for now. + + +#2: Share Exchange Protocol +=========================== + +This document explains the wire-protocol used to upload, download, and modify +shares on the various storage servers. + +Given the N shares created by the algorithm described in document #1, and a +set of servers who are willing to accept those shares, the protocols in this +document will be sufficient to get the shares onto the servers. Likewise, +given a set of servers who hold at least k shares, these protocols will be +enough to retrieve the shares necessary to begin the decoding process +described in document #1. The notion of a "storage index" is used to +reference a particular share: the storage index is generated by the encoding +process described in document #1. + +This document does *not* describe how to identify or choose those servers, +rather it explains what to do once they have been selected (by the mechanisms +in document #3). + +This document also explains the protocols that a client uses to ask a server +whether or not it is willing to accept an uploaded share, and whether it has +a share available for download. These protocols will be used by the +mechanisms in document #3 to help decide where the shares should be placed. 
+ +Where cryptographic mechanisms are necessary to implement access-control +policy, this document will explain those mechanisms. + +In the future, Tahoe will be able to use multiple protocols to speak to +storage servers. There will be alternative forms of this document, one for +each protocol. The first one to be written will describe the Foolscap-based +protocol that tahoe currently uses, but we anticipate a subsequent one to +describe a more HTTP-based protocol. + +#3: Server Selection Algorithm, filecap format +============================================== + +This document has two interrelated purposes. With a deeper understanding of +the issues, we may be able to separate these more cleanly in the future. + +The first purpose is to explain the server selection algorithm. Given a set +of N shares, where should those shares be uploaded? Given some information +stored about a previously-uploaded file, how should a downloader locate and +recover at least k shares? Given a previously-uploaded mutable file, how +should a modifier locate all (or most of) the shares with a reasonable amount +of work? 
+ +This question implies many things, all of which should be explained in this +document: + +* the notion of a "grid", nominally a set of servers who could potentially + hold shares, which might change over time +* a way to configure which grid should be used +* a way to discover which servers are a part of that grid +* a way to decide which servers are reliable enough to be worth sending + shares +* an algorithm to handle servers which refuse shares +* a way for a downloader to locate which servers have shares +* a way to choose which shares should be used for download + +The server-selection algorithm has several obviously competing goals: + +* minimize the amount of work that must be done during upload +* minimize the total storage resources used +* avoid "hot spots", balance load among multiple servers +* maximize the chance that enough shares will be downloadable later, by + uploading lots of shares, and by placing them on reliable servers +* minimize the work that the future downloader must do +* tolerate temporary server failures, permanent server departure, and new + server insertions +* minimize the amount of information that must be added to the filecap + +The server-selection algorithm is defined in some context: some set of +expectations about the servers or grid with which it is expected to operate. +Different algorithms are appropriate for different situations, so there will +be multiple alternatives of this document. + +The first version of this document will describe the algorithm that the +current (1.3.0) release uses, which is heavily weighted towards the two main +use case scenarios for which Tahoe has been designed: the small, stable +friendnet, and the allmydata.com managed grid. In both cases, we assume that +the storage servers are online most of the time, they are uniformly highly +reliable, and that the set of servers does not change very rapidly. 
The +server-selection algorithm for this environment uses a permuted server list +to achieve load-balancing, uses all servers identically, and derives the +permutation key from the storage index to avoid adding a new field to the +filecap. + +An alternative algorithm could give clients more precise control over share +placement, for example by a user who wished to make sure that k+1 shares are +located in each datacenter (to allow downloads to take place using only local +bandwidth). This algorithm could skip the permuted list and use other +mechanisms to accomplish load-balancing (or ignore the issue altogether). It +could add additional information to the filecap (like a list of which servers +received the shares) in lieu of performing a search at download time, perhaps +at the expense of allowing a repairer to move shares to a new server after +the initial upload. It might make up for this by storing "location hints" +next to each share, to indicate where other shares are likely to be found, +and obligating the repairer to update these hints. + +The second purpose of this document is to explain the format of the file +capability string (or "filecap" for short). There are multiple kinds of +capabilities (read-write, read-only, verify-only, repaircap, lease-renewal +cap, traverse-only, etc). There are multiple ways to represent the filecap +(compressed binary, human-readable, clickable-HTTP-URL, "tahoe:" URL, etc), +but they must all contain enough information to reliably retrieve a file +(given some context, of course). It must at least contain the confidentiality +and integrity information from document #1 (i.e. the encryption key and the +UEB hash). It must also contain whatever additional information the +upload-time server-selection algorithm generated that will be required by the +downloader. + +For some server-selection algorithms, the additional information will be +minimal. 
For example, the 1.3.0 release uses the hash of the encryption key +as a storage index, and uses the storage index to permute the server list, +and uses an Introducer to learn the current list of servers. This allows a +"close-enough" list of servers to be compressed into a filecap field that is +already required anyways (the encryption key). It also adds k and N to the +filecap, to speed up the downloader's search (the downloader knows how many +shares it needs, so it can send out multiple queries in parallel). + +But other server-selection algorithms might require more information. Each +variant of this document will explain how to encode that additional +information into the filecap, and how to extract and use that information at +download time. + +These two purposes are interrelated. A filecap that is interpreted in the +context of the allmydata.com commercial grid, which uses tahoe-1.3.0, implies +a specific peer-selection algorithm, a specific Introducer, and therefore a +fairly-specific set of servers to query for shares. A filecap which is meant +to be interpreted on a different sort of grid would need different +information. + +Some filecap formats can be designed to contain more information (and depend +less upon context), such as the way an HTTP URL implies the existence of a +single global DNS system. Ideally a tahoe filecap should be able to specify +which "grid" it lives in, with enough information to allow a compatible +implementation of Tahoe to locate that grid and retrieve the file (regardless +of which server-selection algorithm was used for upload). + +This more-universal format might come at the expense of reliability, however. +Tahoe-1.3.0 filecaps do not contain hostnames, because the failure of DNS or +an individual host might then impact file availability (however the +Introducer contains DNS names or IP addresses). 
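The permuted-list idea described above (storage index derived from the encryption key, then used to order the servers) can be sketched as follows. This is a minimal illustration, not the 1.3.0 algorithm itself: the use of SHA-256, the 16-byte truncation, the concatenation order, and both function names are assumptions made for the example.

```python
import hashlib


def storage_index_from_key(encryption_key):
    """Derive a storage index by hashing the key (illustrative derivation)."""
    return hashlib.sha256(encryption_key).digest()[:16]


def permuted_servers(server_ids, storage_index):
    """Order servers by a hash of (storage index + server id).

    Every client that knows the same server set and storage index computes
    the same ordering, so nothing extra needs to be stored in the filecap.
    """
    return sorted(
        server_ids,
        key=lambda sid: hashlib.sha256(storage_index + sid).digest(),
    )


servers = [b"server-a", b"server-b", b"server-c", b"server-d"]
index = storage_index_from_key(b"example encryption key")
print(permuted_servers(servers, index))
```

Because the ordering depends on the storage index, different files start their walk at different servers, which spreads load without any per-file placement record.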
+ +#4: Directory Format +==================== + +Tahoe directories are a special way of interpreting and managing the contents +of a file (either mutable or immutable). These "dirnode" files are basically +serialized tables that map child name to filecap/dircap. This document +describes the format of these files. + +Tahoe-1.3.0 directories are "transitively readonly", which is accomplished by +applying an additional layer of encryption to the list of child writecaps. +The key for this encryption is derived from the containing file's writecap. +This document must explain how to derive this key and apply it to the +appropriate portion of the table. + +Future versions of the directory format are expected to contain +"deep-traversal caps", which allow verification/repair of files without +exposing their plaintext to the repair agent. This document will be +responsible for explaining traversal caps too. + +Future versions of the directory format will probably contain an index and +more advanced data structures (for efficiency and fast lookups), instead of a +simple flat list of (childname, childcap). This document will also need to +describe metadata formats, including what access-control policies are defined +for the metadata. diff --git a/docs/specifications/outline.txt b/docs/specifications/outline.txt deleted file mode 100644 index 9ec69bff..00000000 --- a/docs/specifications/outline.txt +++ /dev/null @@ -1,221 +0,0 @@ -============================== -Specification Document Outline -============================== - -While we do not yet have a clear set of specification documents for Tahoe -(explaining the file formats, so that others can write interoperable -implementations), this document is intended to lay out an outline for what -these specs ought to contain. Think of this as the ISO 7-Layer Model for -Tahoe. - -We currently imagine 4 documents. - -1. `#1: Share Format, Encoding Algorithm`_ -2. `#2: Share Exchange Protocol`_ -3. 
`#3: Server Selection Algorithm, filecap format`_ -4. `#4: Directory Format`_ - -#1: Share Format, Encoding Algorithm -==================================== - -This document will describe the way that files are encrypted and encoded into -shares. It will include a specification of the share format, and explain both -the encoding and decoding algorithms. It will cover both mutable and -immutable files. - -The immutable encoding algorithm, as described by this document, will start -with a plaintext series of bytes, encoding parameters "k" and "N", and either -an encryption key or a mechanism for deterministically deriving the key from -the plaintext (the CHK specification). The algorithm will end with a set of N -shares, and a set of values that must be included in the filecap to provide -confidentiality (the encryption key) and integrity (the UEB hash). - -The immutable decoding algorithm will start with the filecap values (key and -UEB hash) and "k" shares. It will explain how to validate the shares against -the integrity information, how to reverse the erasure-coding, and how to -decrypt the resulting ciphertext. It will result in the original plaintext -bytes (or some subrange thereof). - -The sections on mutable files will contain similar information. - -This document is *not* responsible for explaining the filecap format, since -full filecaps may need to contain additional information as described in -document #3. Likewise it is not responsible for explaining where to put the -generated shares or where to find them again later. - -It is also not responsible for explaining the access control mechanisms -surrounding share upload, download, or modification ("Accounting" is the -business of controlling share upload to conserve space, and mutable file -shares require some sort of access control to prevent non-writecap holders -from destroying shares). We don't yet have a document dedicated to explaining -these, but let's call it "Access Control" for now. 
- - -#2: Share Exchange Protocol -=========================== - -This document explains the wire-protocol used to upload, download, and modify -shares on the various storage servers. - -Given the N shares created by the algorithm described in document #1, and a -set of servers who are willing to accept those shares, the protocols in this -document will be sufficient to get the shares onto the servers. Likewise, -given a set of servers who hold at least k shares, these protocols will be -enough to retrieve the shares necessary to begin the decoding process -described in document #1. The notion of a "storage index" is used to -reference a particular share: the storage index is generated by the encoding -process described in document #1. - -This document does *not* describe how to identify or choose those servers, -rather it explains what to do once they have been selected (by the mechanisms -in document #3). - -This document also explains the protocols that a client uses to ask a server -whether or not it is willing to accept an uploaded share, and whether it has -a share available for download. These protocols will be used by the -mechanisms in document #3 to help decide where the shares should be placed. - -Where cryptographic mechanisms are necessary to implement access-control -policy, this document will explain those mechanisms. - -In the future, Tahoe will be able to use multiple protocols to speak to -storage servers. There will be alternative forms of this document, one for -each protocol. The first one to be written will describe the Foolscap-based -protocol that tahoe currently uses, but we anticipate a subsequent one to -describe a more HTTP-based protocol. - -#3: Server Selection Algorithm, filecap format -============================================== - -This document has two interrelated purposes. With a deeper understanding of -the issues, we may be able to separate these more cleanly in the future. 
- -The first purpose is to explain the server selection algorithm. Given a set -of N shares, where should those shares be uploaded? Given some information -stored about a previously-uploaded file, how should a downloader locate and -recover at least k shares? Given a previously-uploaded mutable file, how -should a modifier locate all (or most of) the shares with a reasonable amount -of work? - -This question implies many things, all of which should be explained in this -document: - -* the notion of a "grid", nominally a set of servers who could potentially - hold shares, which might change over time -* a way to configure which grid should be used -* a way to discover which servers are a part of that grid -* a way to decide which servers are reliable enough to be worth sending - shares -* an algorithm to handle servers which refuse shares -* a way for a downloader to locate which servers have shares -* a way to choose which shares should be used for download - -The server-selection algorithm has several obviously competing goals: - -* minimize the amount of work that must be done during upload -* minimize the total storage resources used -* avoid "hot spots", balance load among multiple servers -* maximize the chance that enough shares will be downloadable later, by - uploading lots of shares, and by placing them on reliable servers -* minimize the work that the future downloader must do -* tolerate temporary server failures, permanent server departure, and new - server insertions -* minimize the amount of information that must be added to the filecap - -The server-selection algorithm is defined in some context: some set of -expectations about the servers or grid with which it is expected to operate. -Different algorithms are appropriate for different situations, so there will -be multiple alternatives of this document. 
- -The first version of this document will describe the algorithm that the -current (1.3.0) release uses, which is heavily weighted towards the two main -use case scenarios for which Tahoe has been designed: the small, stable -friendnet, and the allmydata.com managed grid. In both cases, we assume that -the storage servers are online most of the time, they are uniformly highly -reliable, and that the set of servers does not change very rapidly. The -server-selection algorithm for this environment uses a permuted server list -to achieve load-balancing, uses all servers identically, and derives the -permutation key from the storage index to avoid adding a new field to the -filecap. - -An alternative algorithm could give clients more precise control over share -placement, for example by a user who wished to make sure that k+1 shares are -located in each datacenter (to allow downloads to take place using only local -bandwidth). This algorithm could skip the permuted list and use other -mechanisms to accomplish load-balancing (or ignore the issue altogether). It -could add additional information to the filecap (like a list of which servers -received the shares) in lieu of performing a search at download time, perhaps -at the expense of allowing a repairer to move shares to a new server after -the initial upload. It might make up for this by storing "location hints" -next to each share, to indicate where other shares are likely to be found, -and obligating the repairer to update these hints. - -The second purpose of this document is to explain the format of the file -capability string (or "filecap" for short). There are multiple kinds of -capabilities (read-write, read-only, verify-only, repaircap, lease-renewal -cap, traverse-only, etc). There are multiple ways to represent the filecap -(compressed binary, human-readable, clickable-HTTP-URL, "tahoe:" URL, etc), -but they must all contain enough information to reliably retrieve a file -(given some context, of course). 
It must at least contain the confidentiality -and integrity information from document #1 (i.e. the encryption key and the -UEB hash). It must also contain whatever additional information the -upload-time server-selection algorithm generated that will be required by the -downloader. - -For some server-selection algorithms, the additional information will be -minimal. For example, the 1.3.0 release uses the hash of the encryption key -as a storage index, and uses the storage index to permute the server list, -and uses an Introducer to learn the current list of servers. This allows a -"close-enough" list of servers to be compressed into a filecap field that is -already required anyways (the encryption key). It also adds k and N to the -filecap, to speed up the downloader's search (the downloader knows how many -shares it needs, so it can send out multiple queries in parallel). - -But other server-selection algorithms might require more information. Each -variant of this document will explain how to encode that additional -information into the filecap, and how to extract and use that information at -download time. - -These two purposes are interrelated. A filecap that is interpreted in the -context of the allmydata.com commercial grid, which uses tahoe-1.3.0, implies -a specific peer-selection algorithm, a specific Introducer, and therefore a -fairly-specific set of servers to query for shares. A filecap which is meant -to be interpreted on a different sort of grid would need different -information. - -Some filecap formats can be designed to contain more information (and depend -less upon context), such as the way an HTTP URL implies the existence of a -single global DNS system. Ideally a tahoe filecap should be able to specify -which "grid" it lives in, with enough information to allow a compatible -implementation of Tahoe to locate that grid and retrieve the file (regardless -of which server-selection algorithm was used for upload). 
- -This more-universal format might come at the expense of reliability, however. -Tahoe-1.3.0 filecaps do not contain hostnames, because the failure of DNS or -an individual host might then impact file availability (however the -Introducer contains DNS names or IP addresses). - -#4: Directory Format -==================== - -Tahoe directories are a special way of interpreting and managing the contents -of a file (either mutable or immutable). These "dirnode" files are basically -serialized tables that map child name to filecap/dircap. This document -describes the format of these files. - -Tahoe-1.3.0 directories are "transitively readonly", which is accomplished by -applying an additional layer of encryption to the list of child writecaps. -The key for this encryption is derived from the containing file's writecap. -This document must explain how to derive this key and apply it to the -appropriate portion of the table. - -Future versions of the directory format are expected to contain -"deep-traversal caps", which allow verification/repair of files without -exposing their plaintext to the repair agent. This document will be -responsible for explaining traversal caps too. - -Future versions of the directory format will probably contain an index and -more advanced data structures (for efficiency and fast lookups), instead of a -simple flat list of (childname, childcap). This document will also need to -describe metadata formats, including what access-control policies are defined -for the metadata. 
diff --git a/docs/specifications/servers-of-happiness.rst b/docs/specifications/servers-of-happiness.rst new file mode 100644 index 00000000..7f0029be --- /dev/null +++ b/docs/specifications/servers-of-happiness.rst @@ -0,0 +1,90 @@ +==================== +Servers of Happiness +==================== + +When you upload a file to a Tahoe-LAFS grid, you expect that it will +stay there for a while, and that it will do so even if a few of the +peers on the grid stop working, or if something else goes wrong. An +upload health metric helps to make sure that this actually happens. +An upload health metric is a test that looks at a file on a Tahoe-LAFS +grid and says whether or not that file is healthy; that is, whether it +is distributed on the grid in such a way as to ensure that it will +probably survive in good enough shape to be recoverable, even if a few +things go wrong between the time of the test and the time that it is +recovered. Our current upload health metric for immutable files is called +'servers-of-happiness'; its predecessor was called 'shares-of-happiness'. + +shares-of-happiness used the number of encoded shares generated by a +file upload to say whether or not it was healthy. If there were more +shares than a user-configurable threshold, the file was reported to be +healthy; otherwise, it was reported to be unhealthy. In normal +situations, the upload process would distribute shares fairly evenly +over the peers in the grid, and in that case shares-of-happiness +worked fine. However, because it only considered the number of shares, +and not where they were on the grid, it could not detect situations +where a file was unhealthy because most or all of the shares generated +from the file were stored on one or two peers. + +servers-of-happiness addresses this by extending the share-focused +upload health metric to also consider the location of the shares on +grid. 
servers-of-happiness looks at the mapping of peers to the shares +that they hold, and compares the cardinality of the largest happy subset +of those to a user-configurable threshold. A happy subset of peers has +the property that any k (where k is as in k-of-n encoding) peers within +the subset can reconstruct the source file. This definition of file +health provides a stronger assurance of file availability over time; +with 3-of-10 encoding, and happy=7, a healthy file is still guaranteed +to be available even if 4 peers fail. + +Measuring Servers of Happiness +============================== + +We calculate servers-of-happiness by computing a matching on a +bipartite graph that is related to the layout of shares on the grid. +One set of vertices is the peers on the grid, and one set of vertices is +the shares. An edge connects a peer and a share if the peer will (or +does, for existing shares) hold the share. The size of the maximum +matching on this graph is the size of the largest happy peer set that +exists for the upload. + +First, note that a bipartite matching of size n corresponds to a happy +subset of size n. This is because a bipartite matching of size n implies +that there are n peers such that each peer holds a share that no other +peer holds. Then any k of those peers collectively hold k distinct +shares, and can restore the file. + +A bipartite matching of size n is not necessary for a happy subset of +size n, however (so it is not correct to say that the size of the +maximum matching on this graph is the size of the largest happy subset +of peers that exists for the upload). For example, consider a file with +k = 3, and suppose that each peer has all three of those pieces. Then, +since any peer from the original upload can restore the file, if there +are 10 peers holding shares, and the happiness threshold is 7, the +upload should be declared happy, because there is a happy subset of size +10, and 10 > 7. 
However, since a maximum matching on the bipartite graph +related to this layout has only 3 edges, Tahoe-LAFS declares the upload +unhealthy. Though it is not unhealthy, a share layout like this example +is inefficient; for k = 3, and if there are n peers, it corresponds to +an expansion factor of 10x. Layouts that are declared healthy by the +bipartite graph matching approach have the property that they correspond +to uploads that are either already relatively efficient in their +utilization of space, or can be made to be so by deleting shares; and +that place all of the shares that they generate, enabling redistribution +of shares later without having to re-encode the file. Also, it is +computationally reasonable to compute a maximum matching in a bipartite +graph, and there are well-studied algorithms to do that. + +Issues +====== + +The uploader is good at detecting unhealthy upload layouts, but it +doesn't always know how to make an unhealthy upload into a healthy +upload if it is possible to do so; it attempts to redistribute shares to +achieve happiness, but only in certain circumstances. The redistribution +algorithm isn't optimal, either, so even in these cases it will not +always find a happy layout if one can be arrived at through +redistribution. We are investigating improvements to address these +issues. + +We don't use servers-of-happiness for mutable files yet; this fix will +likely come in Tahoe-LAFS version 1.8. diff --git a/docs/specifications/servers-of-happiness.txt b/docs/specifications/servers-of-happiness.txt deleted file mode 100644 index 7f0029be..00000000 --- a/docs/specifications/servers-of-happiness.txt +++ /dev/null @@ -1,90 +0,0 @@ -==================== -Servers of Happiness -==================== - -When you upload a file to a Tahoe-LAFS grid, you expect that it will -stay there for a while, and that it will do so even if a few of the -peers on the grid stop working, or if something else goes wrong. 
An
-upload health metric helps to make sure that this actually happens.
-An upload health metric is a test that looks at a file on a Tahoe-LAFS
-grid and says whether or not that file is healthy; that is, whether it
-is distributed on the grid in such a way as to ensure that it will
-probably survive in good enough shape to be recoverable, even if a few
-things go wrong between the time of the test and the time that it is
-recovered. Our current upload health metric for immutable files is called
-'servers-of-happiness'; its predecessor was called 'shares-of-happiness'.
-
-shares-of-happiness used the number of encoded shares generated by a
-file upload to say whether or not it was healthy. If there were more
-shares than a user-configurable threshold, the file was reported to be
-healthy; otherwise, it was reported to be unhealthy. In normal
-situations, the upload process would distribute shares fairly evenly
-over the peers in the grid, and in that case shares-of-happiness
-worked fine. However, because it only considered the number of shares,
-and not where they were on the grid, it could not detect situations
-where a file was unhealthy because most or all of the shares generated
-from the file were stored on one or two peers.
-
-servers-of-happiness addresses this by extending the share-focused
-upload health metric to also consider the location of the shares on
-grid. servers-of-happiness looks at the mapping of peers to the shares
-that they hold, and compares the cardinality of the largest happy subset
-of those to a user-configurable threshold. A happy subset of peers has
-the property that any k (where k is as in k-of-n encoding) peers within
-the subset can reconstruct the source file. This definition of file
-health provides a stronger assurance of file availability over time;
-with 3-of-10 encoding, and happy=7, a healthy file is still guaranteed
-to be available even if 4 peers fail.
-
-Measuring Servers of Happiness
-==============================
-
-We calculate servers-of-happiness by computing a matching on a
-bipartite graph that is related to the layout of shares on the grid.
-One set of vertices is the peers on the grid, and one set of vertices is
-the shares. An edge connects a peer and a share if the peer will (or
-does, for existing shares) hold the share. The size of the maximum
-matching on this graph is the size of the largest happy peer set that
-exists for the upload.
-
-First, note that a bipartite matching of size n corresponds to a happy
-subset of size n. This is because a bipartite matching of size n implies
-that there are n peers such that each peer holds a share that no other
-peer holds. Then any k of those peers collectively hold k distinct
-shares, and can restore the file.
-
-A bipartite matching of size n is not necessary for a happy subset of
-size n, however (so it is not correct to say that the size of the
-maximum matching on this graph is the size of the largest happy subset
-of peers that exists for the upload). For example, consider a file with
-k = 3, and suppose that each peer has all three of those pieces. Then,
-since any peer from the original upload can restore the file, if there
-are 10 peers holding shares, and the happiness threshold is 7, the
-upload should be declared happy, because there is a happy subset of size
-10, and 10 > 7. However, since a maximum matching on the bipartite graph
-related to this layout has only 3 edges, Tahoe-LAFS declares the upload
-unhealthy. Though it is not unhealthy, a share layout like this example
-is inefficient; for k = 3, and if there are 10 peers, it corresponds to
-an expansion factor of 10x.
Layouts that are declared healthy by the
-bipartite graph matching approach have the property that they correspond
-to uploads that are either already relatively efficient in their
-utilization of space, or can be made to be so by deleting shares; and
-that place all of the shares that they generate, enabling redistribution
-of shares later without having to re-encode the file. Also, it is
-computationally reasonable to compute a maximum matching in a bipartite
-graph, and there are well-studied algorithms to do that.
-
-Issues
-======
-
-The uploader is good at detecting unhealthy upload layouts, but it
-doesn't always know how to make an unhealthy upload into a healthy
-upload if it is possible to do so; it attempts to redistribute shares to
-achieve happiness, but only in certain circumstances. The redistribution
-algorithm isn't optimal, either, so even in these cases it will not
-always find a happy layout if one can be arrived at through
-redistribution. We are investigating improvements to address these
-issues.
-
-We don't use servers-of-happiness for mutable files yet; this fix will
-likely come in Tahoe-LAFS version 1.8.
diff --git a/docs/specifications/uri.rst b/docs/specifications/uri.rst
new file mode 100644
index 00000000..91f8cc28
--- /dev/null
+++ b/docs/specifications/uri.rst
@@ -0,0 +1,201 @@
+==========
+Tahoe URIs
+==========
+
+1. `File URIs`_
+
+   1. `CHK URIs`_
+   2. `LIT URIs`_
+   3. `Mutable File URIs`_
+
+2. `Directory URIs`_
+3. `Internal Usage of URIs`_
+
+Each file and directory in a Tahoe filesystem is described by a "URI". There
+are different kinds of URIs for different kinds of objects, and there are
+different kinds of URIs to provide different kinds of access to those
+objects. Each URI is a string representation of a "capability" or "cap", and
+there are read-caps, write-caps, verify-caps, and others.
+
+Each URI provides both ``location`` and ``identification`` properties.
+``location`` means that holding the URI is sufficient to locate the data it
+represents (this means it contains a storage index or a lookup key, whatever
+is necessary to find the place or places where the data is being kept).
+``identification`` means that the URI also serves to validate the data: an
+attacker who wants to trick you into using the wrong data will be
+limited in their abilities by the identification properties of the URI.
+
+Some URIs are subsets of others. In particular, if you know a URI which
+allows you to modify some object, you can produce a weaker read-only URI and
+give it to someone else, and they will be able to read that object but not
+modify it. Directories, for example, have a read-cap which is derived from
+the write-cap: anyone with read/write access to the directory can produce a
+limited URI that grants read-only access, but not the other way around.
+
+src/allmydata/uri.py is the main place where URIs are processed. It is
+the authoritative definition point for all the URI types described
+herein.
+
+File URIs
+=========
+
+The lowest layer of the Tahoe architecture (the "grid") is responsible for
+mapping URIs to data. This is basically a distributed hash table, in which
+the URI is the key, and some sequence of bytes is the value.
+
+There are two kinds of entries in this table: immutable and mutable. For
+immutable entries, the URI represents a fixed chunk of data. The URI itself
+is derived from the data when it is uploaded into the grid, and can be used
+to locate and download that data from the grid at some time in the future.
+
+For mutable entries, the URI identifies a "slot" or "container", which can be
+filled with different pieces of data at different times.
+
+It is important to note that the "files" described by these URIs are just a
+bunch of bytes, and that **no** filenames or other metadata is retained at
+this layer.
The vdrive layer (which sits above the grid layer) is entirely
+responsible for directories and filenames and the like.
+
+CHK URIs
+--------
+
+CHK (Content Hash Keyed) files are immutable sequences of bytes. They are
+uploaded in a distributed fashion using a "storage index" (for the "location"
+property), and encrypted using a "read key". A secure hash of the data is
+computed to help validate the data afterwards (providing the "identification"
+property). All of these pieces, plus information about the file's size and
+the number of shares into which it has been distributed, are put into the
+"CHK" uri. The storage index is derived by hashing the read key (using a
+tagged SHA-256d hash, then truncated to 128 bits), so it does not need to be
+physically present in the URI.
+
+The current format for CHK URIs is the concatenation of the following
+strings::
+
+  URI:CHK:(key):(hash):(needed-shares):(total-shares):(size)
+
+Where (key) is the base32 encoding of the 16-byte AES read key, (hash) is the
+base32 encoding of the SHA-256 hash of the URI Extension Block,
+(needed-shares) is an ascii decimal representation of the number of shares
+required to reconstruct this file, (total-shares) is the same representation
+of the total number of shares created, and (size) is an ascii decimal
+representation of the size of the data represented by this URI. All base32
+encodings are expressed in lower-case, with the trailing '=' signs removed.
+
+For example, the following is a CHK URI, generated from the contents of the
+architecture.txt document that lives next to this one in the source tree::
+
+  URI:CHK:ihrbeov7lbvoduupd4qblysj7a:bg5agsdt62jb34hxvxmdsbza6do64f4fg5anxxod2buttbo6udzq:3:10:28733
+
+Historical note: The name "CHK" is somewhat inaccurate and continues to be
+used for historical reasons. "Content Hash Key" means that the encryption key
+is derived by hashing the contents, which gives the useful property that
+encoding the same file twice will result in the same URI. However, this is an
+optional step: by passing a different flag to the appropriate API call, Tahoe
+will generate a random encryption key instead of hashing the file: this gives
+the useful property that the URI or storage index does not reveal anything
+about the file's contents (except filesize), which improves privacy. The
+URI:CHK: prefix really indicates that an immutable file is in use, without
+saying anything about how the key was derived.
+
+LIT URIs
+--------
+
+LITeral files are also an immutable sequence of bytes, but they are so short
+that the data is stored inside the URI itself. These are used for files of 55
+bytes or shorter, which is the point at which the LIT URI is the same length
+as a CHK URI would be.
+
+LIT URIs do not require an upload or download phase, as their data is stored
+directly in the URI.
+
+The format of a LIT URI is simply a fixed prefix concatenated with the base32
+encoding of the file's data::
+
+  URI:LIT:bjuw4y3movsgkidbnrwg26lemf2gcl3xmvrc6kropbuhi3lmbi
+
+The LIT URI for an empty file is "URI:LIT:", and the LIT URI for a 5-byte
+file that contains the string "hello" is "URI:LIT:nbswy3dp".
+
+Mutable File URIs
+-----------------
+
+The other kind of DHT entry is the "mutable slot", in which the URI names a
+container to which data can be placed and retrieved without changing the
+identity of the container.
+
+These slots have write-caps (which allow read/write access), read-caps (which
+only allow read-access), and verify-caps (which allow a file checker/repairer
+to confirm that the contents exist, but do not let it decrypt the
+contents).
+
+Mutable slots use public key technology to provide data integrity, and put a
+hash of the public key in the URI.
As a result, the data validation is
+limited to confirming that the data retrieved matches *some* data that was
+uploaded in the past, but not *which* version of that data.
+
+The format of the write-cap for mutable files is::
+
+  URI:SSK:(writekey):(fingerprint)
+
+Where (writekey) is the base32 encoding of the 16-byte AES encryption key
+that is used to encrypt the RSA private key, and (fingerprint) is the base32
+encoded 32-byte SHA-256 hash of the RSA public key. For more details about
+the way these keys are used, please see docs/mutable.txt .
+
+The format for mutable read-caps is::
+
+  URI:SSK-RO:(readkey):(fingerprint)
+
+The read-cap is just like the write-cap except it contains the other AES
+encryption key: the one used for encrypting the mutable file's contents. This
+second key is derived by hashing the writekey, which allows the holder of a
+write-cap to produce a read-cap, but not the other way around. The
+fingerprint is the same in both caps.
+
+Historical note: the "SSK" prefix is a perhaps-inaccurate reference to
+"Sub-Space Keys" from the Freenet project, which uses a vaguely similar
+structure to provide mutable file access.
+
+Directory URIs
+==============
+
+The grid layer provides a mapping from URI to data. To turn this into a graph
+of directories and files, the "vdrive" layer (which sits on top of the grid
+layer) needs to keep track of "directory nodes", or "dirnodes" for short.
+docs/dirnodes.txt describes how these work.
+
+Dirnodes are contained inside mutable files, and are thus simply a particular
+way to interpret the contents of these files.
As a result, a directory
+write-cap looks a lot like a mutable-file write-cap::
+
+  URI:DIR2:(writekey):(fingerprint)
+
+Likewise directory read-caps (which provide read-only access to the
+directory) look much like mutable-file read-caps::
+
+  URI:DIR2-RO:(readkey):(fingerprint)
+
+Historical note: the "DIR2" prefix is used because the non-distributed
+dirnodes in earlier Tahoe releases had already claimed the "DIR" prefix.
+
+Internal Usage of URIs
+======================
+
+The classes in source:src/allmydata/uri.py are used to pack and unpack these
+various kinds of URIs. Three Interfaces are defined (IURI, IFileURI, and
+IDirnodeURI) which are implemented by these classes, and string-to-URI-class
+conversion routines have been registered as adapters, so that code which
+wants to extract e.g. the size of a CHK or LIT uri can do::
+
+  print IFileURI(uri).get_size()
+
+If the URI does not represent a CHK or LIT uri (for example, if it was for a
+directory instead), the adaptation will fail, raising a TypeError inside the
+IFileURI() call.
+
+Several utility methods are provided on these objects. The most important is
+``to_string()``, which returns the string form of the URI. Therefore
+``IURI(uri).to_string() == uri`` is true for any valid URI. See the IURI class
+in source:src/allmydata/interfaces.py for more details.
+
diff --git a/docs/specifications/uri.txt b/docs/specifications/uri.txt
deleted file mode 100644
index 91f8cc28..00000000
--- a/docs/specifications/uri.txt
+++ /dev/null
@@ -1,201 +0,0 @@
-==========
-Tahoe URIs
-==========
-
-1. `File URIs`_
-
-   1. `CHK URIs`_
-   2. `LIT URIs`_
-   3. `Mutable File URIs`_
-
-2. `Directory URIs`_
-3. `Internal Usage of URIs`_
-
-Each file and directory in a Tahoe filesystem is described by a "URI". There
-are different kinds of URIs for different kinds of objects, and there are
-different kinds of URIs to provide different kinds of access to those
-objects.
Each URI is a string representation of a "capability" or "cap", and
-there are read-caps, write-caps, verify-caps, and others.
-
-Each URI provides both ``location`` and ``identification`` properties.
-``location`` means that holding the URI is sufficient to locate the data it
-represents (this means it contains a storage index or a lookup key, whatever
-is necessary to find the place or places where the data is being kept).
-``identification`` means that the URI also serves to validate the data: an
-attacker who wants to trick you into using the wrong data will be
-limited in their abilities by the identification properties of the URI.
-
-Some URIs are subsets of others. In particular, if you know a URI which
-allows you to modify some object, you can produce a weaker read-only URI and
-give it to someone else, and they will be able to read that object but not
-modify it. Directories, for example, have a read-cap which is derived from
-the write-cap: anyone with read/write access to the directory can produce a
-limited URI that grants read-only access, but not the other way around.
-
-src/allmydata/uri.py is the main place where URIs are processed. It is
-the authoritative definition point for all the URI types described
-herein.
-
-File URIs
-=========
-
-The lowest layer of the Tahoe architecture (the "grid") is responsible for
-mapping URIs to data. This is basically a distributed hash table, in which
-the URI is the key, and some sequence of bytes is the value.
-
-There are two kinds of entries in this table: immutable and mutable. For
-immutable entries, the URI represents a fixed chunk of data. The URI itself
-is derived from the data when it is uploaded into the grid, and can be used
-to locate and download that data from the grid at some time in the future.
-
-For mutable entries, the URI identifies a "slot" or "container", which can be
-filled with different pieces of data at different times.
-
-It is important to note that the "files" described by these URIs are just a
-bunch of bytes, and that **no** filenames or other metadata is retained at
-this layer. The vdrive layer (which sits above the grid layer) is entirely
-responsible for directories and filenames and the like.
-
-CHK URIs
---------
-
-CHK (Content Hash Keyed) files are immutable sequences of bytes. They are
-uploaded in a distributed fashion using a "storage index" (for the "location"
-property), and encrypted using a "read key". A secure hash of the data is
-computed to help validate the data afterwards (providing the "identification"
-property). All of these pieces, plus information about the file's size and
-the number of shares into which it has been distributed, are put into the
-"CHK" uri. The storage index is derived by hashing the read key (using a
-tagged SHA-256d hash, then truncated to 128 bits), so it does not need to be
-physically present in the URI.
-
-The current format for CHK URIs is the concatenation of the following
-strings::
-
-  URI:CHK:(key):(hash):(needed-shares):(total-shares):(size)
-
-Where (key) is the base32 encoding of the 16-byte AES read key, (hash) is the
-base32 encoding of the SHA-256 hash of the URI Extension Block,
-(needed-shares) is an ascii decimal representation of the number of shares
-required to reconstruct this file, (total-shares) is the same representation
-of the total number of shares created, and (size) is an ascii decimal
-representation of the size of the data represented by this URI. All base32
-encodings are expressed in lower-case, with the trailing '=' signs removed.
-
-For example, the following is a CHK URI, generated from the contents of the
-architecture.txt document that lives next to this one in the source tree::
-
-  URI:CHK:ihrbeov7lbvoduupd4qblysj7a:bg5agsdt62jb34hxvxmdsbza6do64f4fg5anxxod2buttbo6udzq:3:10:28733
-
-Historical note: The name "CHK" is somewhat inaccurate and continues to be
-used for historical reasons. "Content Hash Key" means that the encryption key
-is derived by hashing the contents, which gives the useful property that
-encoding the same file twice will result in the same URI. However, this is an
-optional step: by passing a different flag to the appropriate API call, Tahoe
-will generate a random encryption key instead of hashing the file: this gives
-the useful property that the URI or storage index does not reveal anything
-about the file's contents (except filesize), which improves privacy. The
-URI:CHK: prefix really indicates that an immutable file is in use, without
-saying anything about how the key was derived.
-
-LIT URIs
---------
-
-LITeral files are also an immutable sequence of bytes, but they are so short
-that the data is stored inside the URI itself. These are used for files of 55
-bytes or shorter, which is the point at which the LIT URI is the same length
-as a CHK URI would be.
-
-LIT URIs do not require an upload or download phase, as their data is stored
-directly in the URI.
-
-The format of a LIT URI is simply a fixed prefix concatenated with the base32
-encoding of the file's data::
-
-  URI:LIT:bjuw4y3movsgkidbnrwg26lemf2gcl3xmvrc6kropbuhi3lmbi
-
-The LIT URI for an empty file is "URI:LIT:", and the LIT URI for a 5-byte
-file that contains the string "hello" is "URI:LIT:nbswy3dp".
-
-Mutable File URIs
------------------
-
-The other kind of DHT entry is the "mutable slot", in which the URI names a
-container to which data can be placed and retrieved without changing the
-identity of the container.
-
-These slots have write-caps (which allow read/write access), read-caps (which
-only allow read-access), and verify-caps (which allow a file checker/repairer
-to confirm that the contents exist, but do not let it decrypt the
-contents).
-
-Mutable slots use public key technology to provide data integrity, and put a
-hash of the public key in the URI.
As a result, the data validation is
-limited to confirming that the data retrieved matches *some* data that was
-uploaded in the past, but not *which* version of that data.
-
-The format of the write-cap for mutable files is::
-
-  URI:SSK:(writekey):(fingerprint)
-
-Where (writekey) is the base32 encoding of the 16-byte AES encryption key
-that is used to encrypt the RSA private key, and (fingerprint) is the base32
-encoded 32-byte SHA-256 hash of the RSA public key. For more details about
-the way these keys are used, please see docs/mutable.txt .
-
-The format for mutable read-caps is::
-
-  URI:SSK-RO:(readkey):(fingerprint)
-
-The read-cap is just like the write-cap except it contains the other AES
-encryption key: the one used for encrypting the mutable file's contents. This
-second key is derived by hashing the writekey, which allows the holder of a
-write-cap to produce a read-cap, but not the other way around. The
-fingerprint is the same in both caps.
-
-Historical note: the "SSK" prefix is a perhaps-inaccurate reference to
-"Sub-Space Keys" from the Freenet project, which uses a vaguely similar
-structure to provide mutable file access.
-
-Directory URIs
-==============
-
-The grid layer provides a mapping from URI to data. To turn this into a graph
-of directories and files, the "vdrive" layer (which sits on top of the grid
-layer) needs to keep track of "directory nodes", or "dirnodes" for short.
-docs/dirnodes.txt describes how these work.
-
-Dirnodes are contained inside mutable files, and are thus simply a particular
-way to interpret the contents of these files.
As a result, a directory
-write-cap looks a lot like a mutable-file write-cap::
-
-  URI:DIR2:(writekey):(fingerprint)
-
-Likewise directory read-caps (which provide read-only access to the
-directory) look much like mutable-file read-caps::
-
-  URI:DIR2-RO:(readkey):(fingerprint)
-
-Historical note: the "DIR2" prefix is used because the non-distributed
-dirnodes in earlier Tahoe releases had already claimed the "DIR" prefix.
-
-Internal Usage of URIs
-======================
-
-The classes in source:src/allmydata/uri.py are used to pack and unpack these
-various kinds of URIs. Three Interfaces are defined (IURI, IFileURI, and
-IDirnodeURI) which are implemented by these classes, and string-to-URI-class
-conversion routines have been registered as adapters, so that code which
-wants to extract e.g. the size of a CHK or LIT uri can do::
-
-  print IFileURI(uri).get_size()
-
-If the URI does not represent a CHK or LIT uri (for example, if it was for a
-directory instead), the adaptation will fail, raising a TypeError inside the
-IFileURI() call.
-
-Several utility methods are provided on these objects. The most important is
-``to_string()``, which returns the string form of the URI. Therefore
-``IURI(uri).to_string() == uri`` is true for any valid URI. See the IURI class
-in source:src/allmydata/interfaces.py for more details.
-