From: Brian Warner Date: Tue, 3 Jun 2008 06:07:02 +0000 (-0700) Subject: docs: move files that are about future plans into docs/proposed/, to clearly separate... X-Git-Tag: allmydata-tahoe-1.1.0~56 X-Git-Url: https://git.rkrishnan.org/listings/reliability?a=commitdiff_plain;h=91565f465d6fb307564af85ffc91c27a5e8b9e0c;p=tahoe-lafs%2Ftahoe-lafs.git docs: move files that are about future plans into docs/proposed/, to clearly separate them from descriptions of the present codebase --- diff --git a/docs/accounts-introducer.txt b/docs/accounts-introducer.txt deleted file mode 100644 index 36a5a56f..00000000 --- a/docs/accounts-introducer.txt +++ /dev/null @@ -1,134 +0,0 @@ -This is a proposal for handing accounts and quotas in Tahoe. Nothing is final -yet.. we are still evaluating the options. - - -= Account Management: Introducer-based = - -A Tahoe grid can be configured in several different modes. The simplest mode -(which is also the default) is completely permissive: all storage servers -will accept shares from all clients, and no attempt is made to keep track of -who is storing what. Access to the grid is mostly equivalent to having access -to the Introducer (or convincing one of the existing members to give you a -list of all their storage server FURLs). - -This mode, while a good starting point, does not accomodate any sort of -auditing or quota management. Even in a small friendnet, operators might like -to know how much of their storage space is being consumed by Alice, so they -might be able to ask her to cut back when overall disk usage is getting to -high. In a larger commercial deployment, a service provider needs to be able -to get accurate usage numbers so they can bill the user appropriately. In -addition, the operator may want the ability to delete all of Bob's shares -(i.e. cancel any outstanding leases) when he terminates his account. - -There are several lease-management/garbage-collection/deletion strategies -possible for a Tahoe grid, but the most efficient ones require knowledge of -lease ownership, so that renewals and expiration can take place on a -per-account basis rather than a (more numerous) per-share basis. - -== Accounts == - -To accomplish this, "Accounts" can be established in a Tahoe grid. There is -nominally one account per human user of the grid, but of course a user might -use multiple accounts, or an account might be shared between multiple users. -The Account is the smallest unit of quota and lease management. - -Accounts are created by an "Account Manager". In a commercial network there -will be just one (centralized) account manager, and all storage nodes will be -configured to require a valid account before providing storage services. In a -friendnet, each peer can run their own account manager, and servers will -accept accounts from any of the managers (this mode is permissive but allows -quota-tracking of non-malicious users). - -The account manager is free to manage the accounts as it pleases. Large -systems will probably use a database to correlate things like username, -storage consumed, billing status, etc. - -== Overview == - -The Account Manager ("AM") replaces the normal Introducer node: grids which -use an Account Manager will not run an Introducer, and the participating -nodes will not be configured with an "introducer.furl". - -Instead, each client will be configured with a different "account.furl", -which gives that client access to a specific account. These account FURLs -point to an object inside the Account Manager which exists solely for the -benefit of that one account. When the client needs access to storage servers, -it will use this account object to acquire personalized introductions to a -per-account "Personal Storage Server" facet, one per storage server node. For -example, Alice would wind up with PSS[1A] on server 1, and PSS[2A] on server -2. Bob would get PSS[1B] and PSS[2B]. - -These PSS facets provide the same remote methods as the old generic SS facet, -except that every time they create a lease object, the account information of -the holder is recorded in that lease. The client stores a list of these PSS -facet FURLs in persistent storage, and uses them in the "get_permuted_peers" -function that all uploads and downloads use to figure out who to talk to when -looking for shares or shareholders. - -Each Storage Server has a private facet that it gives to the Account Manager. -This facet allows the AM to create PSS facets for a specific account. In -particular, the AM tells the SS "please create account number 42, and tell me -the PSS FURL that I should give to the client". The SS creates an object -which remembers the account number, creates a FURL for it, and returns the -FURL. - -If there is a single central account manager, then account numbers can be -small integers. (if there are multiple ones, they need to be large random -strings to ensure uniqueness). To avoid requiring large (accounts*servers) -lookup tables, a given account should use the same identifer for all the -servers it talks to. When this can be done, the PSS and Account FURLs are -generated as MAC'ed copies of the account number. - -More specifically, the PSS FURL is a MAC'ed copy of the account number: each -SS has a private secret "S", and it creates a string "%d-%s" % (accountnum, -b2a(hash(S+accountnum))) to use as the swissnum part of the FURL. The SS uses -tub.registerNameLookupHandler to add a function that tries to validate -inbound FURLs against this scheme: if successful, it creates a new PSS object -with the account number stashed inside. This allows the server to minimize -their per-user storage requirements but still insure that PSS FURLs are -unguessable. - -Account FURLs are created by the Account Manager in a similar fashion, using -a MAC of the account number. The Account Manager can use the same account -number to index other information in a database, like account status, billing -status, etc. - -The mechanism by which Account FURLs are minted is left up to the account -manager, but the simple AM that the 'tahoe create-account-manager' command -makes has a "new-account" FURL which accepts a username and creates an -account for them. The 'tahoe create-account' command is a CLI frontend to -this facility. In a friendnet, you could publish this FURL to your friends, -allowing everyone to make their own account. In a commercial grid, this -facility would be reserved use by the same code which handles billing. - - -== Creating the Account Manager == - -The 'tahoe create-account-manager' command is used to create a simple account -manager node. When started, this node will write several FURLs to its -private/ directory, some of which should be provided to other services. - - * new-account.furl : this FURL allows the holder to create new accounts - * manage-accounts.furl : this FURL allows the holder to list and modify - all existing accounts - * serverdesk.furl : this FURL is used by storage servers to make themselves - available to all account holders - - -== Configuring the Storage Servers == - -To use an account manager, each storage server node should be given access to -the AM's serverdesk (by simply copying "serverdesk.furl" into the storage -server's base directory). In addition, it should *not* be given an -introducer.furl . The serverdesk FURL tells the SS that it should allow the -AM to create PSS facets for each account, and the lack of an introducer FURL -tells the SS to not make its generic SS facet available to anyone. The -combination means that clients must acquire PSS facets instead of using the -generic one. - -== Configuring Clients == - -Each client should be configured to use a specific account by copying their -account FURL into their basedir, in a file named "account.furl". In addition, -these client nodes should *not* have an "introducer.furl". This combination -tells the client to ask the AM for ... diff --git a/docs/accounts-pubkey.txt b/docs/accounts-pubkey.txt deleted file mode 100644 index 11d28043..00000000 --- a/docs/accounts-pubkey.txt +++ /dev/null @@ -1,636 +0,0 @@ -This is a proposal for handing accounts and quotas in Tahoe. Nothing is final -yet.. we are still evaluating the options. - - -= Accounts = - -The basic Tahoe account is defined by a DSA key pair. The holder of the -private key has the ability to consume storage in conjunction with a specific -account number. - -The Account Server has a long-term keypair. Valid accounts are marked as such -by the Account Server's signature on a "membership card", which binds a -specific pubkey to an account number and declares that this pair is a valid -account. - -Each Storage Server which participages in the AS's domain will have the AS's -pubkey in its list of valid AS keys, and will thus accept membership cards -that were signed by that AS. If the SS accepts multiple ASs, then it will -give each a distinct number, and leases will be labled with an (AS#,Account#) -pair. If there is only one AS, then leases will be labeled with just the -Account#. - -Each client node is given the FURL of their personal Account object. The -Account will accept a DSA public key and return a signed membership card that -authorizes the corresponding private key to consume storage on behalf of the -account. The client will create its own DSA keypair the first time it -connects to the Account, and will then use the resulting membership card for -all subsequent storage operations. - -== Storage Server Goals == - -The Storage Server cares about two things: - - 1: maintaining an accurate refcount on each bucket, so it can delete the - bucket when the refcount goes to zero - 2: being able to answer questions about aggregate usage per account - -The SS conceptually maintains a big matrix of lease information: one column -per account, one row per storage index. The cells contain a boolean -(has-lease or no-lease). If the grid uses per-lease timers, then each -has-lease cell also contains a lease timer. - -This matrix may be stored in a variety of ways: entries in each share file, -or items in a SQL database, according to the desired tradeoff between -complexity, robustness, read speed, and write speed. - -Each client (by virtue of their knowledge of an authorized private key) gets -to manipulate their column of this matrix in any way they like: add lease, -renew lease, delete lease. (TODO: for reconcilliation purposes, the should -also be able to enumerate leases). - -== Storage Operations == - -Side-effect-causing storage operations come in three forms: - - 1: allocate bucket / add lease to existing bucket - arguments: storage_index=, storage_server=, ueb_hash=, account= - 2: renew lease - arguments: storage_index=, storage_server=, account= - 3: cancel lease - arguments: storage_index=, storage_server=, account= - -(where lease renewal is only relevant for grids which use per-lease timers). -Clients do add-lease when they upload a file, and cancel-lease when they -remove their last reference to it. - -Storage Servers publish a "public storage port" through the introducer, which -does not actually enable storage operations, but is instead used in a -rights-amplification pattern to grant authorized parties access to a -"personal storage server facet". This personal facet is the one that -implements allocate_bucket. All clients get access to the same public storage -port, which means that we can improve the introduction mechanism later (to -use a gossip-based protocol) without affecting the authority-granting -protocols. - -The public storage port accepts signed messages asking for storage authority. -It responds by creating a personal facet and making it available to the -requester. The account number is curried into the facet, so that all -lease-creating operations will record this account number into the lease. By -restricting the nature of the personal facets that a client can access, we -restrict them to using their designated account number. - - -======================================== - -There are two kinds of signed messages: use (other names: connection, -FURLification, activation, reification, grounding, specific-making, ?), and -delegation. The FURLification message results in a FURL that points to an -object which can actually accept RIStorageServer methods. The delegation -message results in a new signed message. - -The furlification message looks like: - - (pubkey, signed(serialized({limitations}, beneficiary_furl))) - -The delegation message looks like: - - (pubkey, signed(serialized({limitations}, delegate_pubkey))) - -The limitations dict indicates what the resulting connection or delegation -can be used for. All limitations for the cert chain are applied, and the -result must be restricted to their overall minimum. - -The following limitation keys are defined: - - 'account': a number. All resulting leases must be tagged with this account - number. A chain with multiple distinct 'account' limitations is - an error (the result will not permit leases) - 'SI': a storage index (binary string). Leases may only be created for this - specific storage index, no other. - 'serverid': a peerid (binary string). Leases may only be created on the - storage server identified by this serverid. - 'UEB_hash': (binary string): Leases may only be created for shares which - contain a matching UEB_hash. Note: this limitation is a nuisance - to implement correctly: it requires that the storage server - parse the share and verify all hashes. - 'before': a timestamp (seconds since epoch). All leases must be made before - this time. In addition, all liverefs and FURLs must expire and - cease working at this time. - 'server_size': a number, measuring share size (in bytes). A storage server - which sees this message should keep track of how much storage - space has been consumed using this liveref/FURL, and throw - an exception when receiving a lease request that would bring - this total above 'server_size'. Note: this limitation is - a nuisance to implement (it works best if 'before' is used - and provides a short lifetime). - -Actually, let's merge the two, and put the type in the limitations dict. -'furl_to' and 'delegate_key' are mutually exclusive. - - 'furl_to': (string): Used only on furlification messages. This requests the - recipient to create an object which implements the given access, - then send a FURL which references this object to an - RIFURLReceiver.furl() call at the given 'furl_to' FURL: - facet = create_storage_facet(limitations) - facet_furl = tub.registerReference(facet) - d = tub.getReference(limitations['furl_to']) - d.addCallback(lambda rref: rref.furl(facet_furl)) - The facet_furl should be persistent, so to reduce storage space, - facet_furl should contain an HMAC'ed list of all limitations, and - create_storage_facet() should be deferred until the client - actually tries to use the furl. This leads to 150-200 byte base32 - swissnums. - 'delegate_key': (binary string, a DSA pubkey). Used only on delegation - messages. This requests all observers to accept messages - signed by the given public key and to apply the associated - limitations. - -I also want to keep the message size small, so I'm going to define a custom -netstring-based encoding format for it (JSON expands binary data by about -3.5x). Each dict entry will be encoded as netstring(key)+netstring(value). -The container is responsible for providing the size of this serialized -structure. - -The actual message will then look like: - -def make_message(privkey, limitations): - message_to_sign = "".join([ netstring(k) + netstring(v) - for k,v in limitations ]) - signature = privkey.sign(message_to_sign) - pubkey = privkey.get_public_key() - msg = netstring(message_to_sign) + netstring(signature) + netstring(pubkey) - return msg - -The deserialization code MUST throw an exception if the same limitations key -appears twice, to ensure that everybody interprets the dict the same way. - -These messages are passed over foolscap connections as a single string. They -are also saved to disk in this format. Code should only store them in a -deserialized form if the signature has been verified, the cert chain -verified, and the limitations accumulated. - - -The membership card is just the following: - - membership_card = make_message(account_server_privkey, - {'account': account_number, - 'before': time.time() + 1*MONTH, - 'delegate_key': client_pubkey}) - -This card is provided on demand by the given user's Account facet, for -whatever pubkey they submit. - -When a client learns about a new storage server, they create a new receiver -object (and stash the peerid in it), and submit the following message to the -RIStorageServerWelcome.get_personal_facet() method: - - mymsg = make_message(client_privkey, {'furl_to': receiver_furl}) - send(membership_card, mymsg) - -(note that the receiver_furl will probably not have a routeable address, but -this won't matter because the client is already attached, so foolscap can use -the existing connection.) - -The server will validate the cert chain (see below) and wind up with a -complete list of limitations that are to be applied to the facet it will -provide to the caller. This list must combine limitations from the entire -chain: in particular it must enforce the account= limitation from the -membership card. - -The server will then serialize this limitation dict into a string, compute a -fixed-size HMAC code using a server-private secret, then base32 encode the -(hmac+limitstring) value (and prepend a "0-" version indicator). The -resulting string is used as the swissnum portion of the FURL that is sent to -the furl_to target. - -Later, when the client tries to dereference this FURL, a -Tub.registerNameLookupHandler hook will notice the attempt, claim the "0-" -namespace, base32decode the string, check the HMAC, decode the limitation -dict, then create and return an RIStorageServer facet with these limitations. - -The client should cache the (peerid, FURL) mapping in persistent storage. -Later, when it learns about this storage server again, it will use the cached -FURL instead of signing another message. If the getReference or the storage -operation fails with StorageAuthorityExpiredError, the cache entry should be -removed and the client should sign a new message to obtain a new one. - - (security note: an evil storage server can take 'mymsg' and present it to - someone else, but other servers will only send the resulting authority to - the client's receiver_furl, so the evil server cannot benefit from this. The - receiver object has the serverid curried into it, so the evil server can - only affect the client's mapping for this one serverid, not anything else, - so the server cannot hurt the client in any way other than denying service - to itself. It might be a good idea to include serverid= in the message, but - it isn't clear that it really helps anything). - -When the client wants to use a Helper, it needs to delegate some amount of -storage authority to the helper. The first phase has the client send the -storage index to the helper, so it can query servers and decide whether the -file needs to be uploaded or not. If it decides yes, the Helper creates a new -Uploader object and a receiver object, and sends the Uploader liveref and the -receiver FURL to the client. - -The client then creates a message for the helper to use: - - helper_msg = make_message(client_privkey, {'furl_to': helper_rx_furl, - 'SI': storage_index, - 'before': time.time() + 1*DAY, #? - 'server_size': filesize/k+overhead, - }) - -The client then sends (membership_card, helper_msg) to the helper. The Helper -sends (membership_card, helper_msg) to each storage server that it needs to -use for the upload. This gives the Helper access to a limited facet on each -storage server. This facet gives the helper the authority to upload data for -a specific storage index, for a limited time, using leases that are tagged by -the user's account number. The helper cannot use the client's storage -authority for any other file. The size limit prevents the helper from storing -some other (larger) file of its own using this authority. The time -restriction allows the storage servers to expire their 'server_size' table -entry quickly, and prevents the helper from hanging on to the storage -authority indefinitely. - -The Helper only gets one furl_to target, which must be used for multiple SS -peerids. The helper's receiver must parse the FURL that gets returned to -determine which server is which. [problems: an evil server could deliver a -bogus FURL which points to a different server. The Helper might reject the -real server's good FURL as a duplicate. This allows an evil server to block -access to a good server. Queries could be sent sequentially, which would -partially mitigate this problem (an evil server could send multiple -requests). Better: if the cert-chain send message could include a nonce, -which is supposed to be returned with the FURL, then the helper could use -this to correlate sends and receives.] - -=== repair caps === - -There are three basic approaches to provide a Repairer with the storage -authority that it needs. The first is to give the Repairer complete -authority: allow it to place leases for whatever account number it wishes. -This is simple and requires the least overhead, but of course it give the -Repairer the ability to abuse everyone's quota. The second is to give the -Repairer no user authority: instead, give the repairer its own account, and -build it to keep track of which leases it is holding on behalf of one of its -customers. This repairer will slowly accumulate quota space over time, as it -creates new shares to replace ones that have decayed. Eventually, when the -client comes back online, the client should establish its own leases on these -new shares and allow the repairer to cancel its temporary ones. - -The third approach is in between the other two: give the repairer some -limited authority over the customer's account, but not enough to let it -consume the user's whole quota. - -To create the storage-authority portion of a (one-month) repair-cap, the -client creates a new DSA keypair (repair_privkey, repair_pubkey), and then -creates a signed message and bundles it into the repaircap: - - repair_msg = make_message(client_privkey, {'delegate_key': repair_pubkey, - 'SI': storage_index, - 'UEB_hash': file_ueb_hash}) - repair_cap = (verify_cap, repair_privkey, (membership_card, repair_msg)) - -This gives the holder of the repair cap a time-limited authority to upload -shares for the given storage index which contain the given data. This -prohibits the repair-cap from being used to upload or repair any other file. - -When the repairer needs to upload a new share, it will use the delegated key -to create its own signed message: - - upload_msg = make_message(repair_privkey, {'furl_to': repairer_rx_furl}) - send(membership_card, repair_msg, upload_msg) - -The biggest problem with the low-authority approaches is the expiration time -of the membership card, which limits the duration for which the repair-cap -authority is valid. It would be nice if repair-caps could last a long time, -years perhaps, so that clients can be offline for a similar period of time. -However to retain a reasonable revocation interval for users, the membership -card's before= timeout needs to be closer to a month. [it might be reasonable -to use some sort of rights-amplification: the repairer has a special cert -which allows it to remove the before= value from a chain]. - - -=== chain verification === - -The server will create a chain that starts with the AS's certificate: an -unsigned message which derives its authority from being manually placed in -the SS's configdir. The only limitation in the AS certificate will be on some -kind of meta-account, in case we want to use multiple account servers and -allow their account numbers to live in distinct number spaces (think -sub-accounts or business partners to buy storage in bulk and resell it to -users). The rest of the chain comes directly from what the client sent. - -The server walks the chain, keeping an accumulated limitations dictionary -along the way. At each step it knows the pubkey that was delegated by the -previous step. - -== client config == - -Clients are configured with an Account FURL that points to a private facet on -the Account Server. The client generates a private key at startup. It sends -the pubkey to the AS facet, which will return a signed delegate_key message -(the "membership card") that grants the client's privkey any storage -authority it wishes (as long as the account number is set to a specific -value). - -The client stores this membership card in private/membership.cert . - - -RIStorageServer messages will accept an optional account= argument. If left -unspecified, the value is taken from the limitations that were curried into -the SS facet. In all cases, the value used must meet those limitations. The -value must not be None: Helpers/Repairers or other super-powered storage -clients are obligated to specify an account number. - -== server config == - -Storage servers are configured with an unsigned root authority message. This -is like the output of make_message(account_server_privkey, {}) but has empty -'signature' and 'pubkey' strings. This root goes into -NODEDIR/storage_authority_root.cert . It is prepended to all chains that -arrive. - - [if/when we accept multiple authorities, storage_authority_root.cert will - turn into a storage_authority_root/ directory with *.cert files, and each - arriving chain will cause a search through these root certs for a matching - pubkey. The empty limitations will be replaced by {domain=X}, which is used - as a sort of meta-account.. the details depend upon whether we express - account numbers as an int (with various ranges) or as a tuple] - -The root authority message is published by the Account Server through its web -interface, and also into a local file: NODEDIR/storage_authority_root.cert . -The admin of the storage server is responsible for copying this file into -place, thus enabling clients to use storage services. - - ----------------------------------------- - --- Text beyond this point is out-of-date, and exists purely for background -- - -Each storage server offers a "public storage port", which only accepts signed -messages. The Introducer mechanism exists to give clients a reference to a -set of these public storage ports. All clients get access to the same ports. -If clients did all their work themselves, these public storage ports would be -enough, and no further code would be necessary (all storage requests would we -signed the same way). - -Fundamentally, each storage request must be signed by the account's private -key, giving the SS an authenticated Account Number to go with the request. -This is used to index the correct cell in the lease matrix. The holder of the -account privkey is allowed to manipulate their column of the matrix in any -way they like: add leases, renew leases, delete leases. (TODO: for -reconcilliation purposes, they should also be able to enumerate leases). The -storage request is sent in the form of a signed request message, accompanied -by the membership card. For example: - - req = SIGN("allocate SI=123 SSID=abc", accountprivkey) , membership_card - -> RemoteBucketWriter reference - -Upon receipt of this request, the storage server will return a reference to a -RemoteBucketWriter object, which the client can use to fill and close the -bucket. The SS must perform two DSA signature verifications before accepting -this request. The first is to validate the membership card: the Account -Server's pubkey is used to verify the membership card's signature, from which -an account pubkey and account# is extracted. The second is to validate the -request: the account pubkey is used to verify the request signature. If both -are valid, the full request (with account# and storage index) is delivered to -the internal StorageServer object. - -Note that the signed request message includes the Storage Server's node ID, -to prevent this storage server from taking the signed message and echoing to -other storage servers. Each SS will ignore any request that is not addressed -to the right SSID. Also note that the SI= and SSID= fields may contain -wildcards, if the signing client so chooses. - -== Caching Signature Verification == - -We add some complexity to this simple model to achieve two goals: to enable -fine-grained delegation of storage capabilities (specifically for renewers -and repairers), and to reduce the number of public-key crypto operations that -must be performed. - -The first enhancement is to allow the SS to cache the results of the -verification step. To do this, the client creates a signed message which asks -the SS to return a FURL of an object which can be used to execute further -operations *without* a DSA signature. The FURL is expected to contain a -MAC'ed string that contains the account# and the argument restrictions, -effectively currying a subset of arguments into the RemoteReference. Clients -which do all their operations themselves would use this to obtain a private -storage port for each public storage port, stashing the FURLs in a local -table, and then later storage operations would be done to those FURLs instead -of creating signed requests. For example: - - req = SIGN("FURL(allocate SI=* SSID=abc)", accountprivkey), membership_card - -> FURL - Tub.getReference(FURL).allocate(SI=123) -> RemoteBucketWriter reference - -== Renewers and Repairers - -A brief digression is in order, to motivate the other enhancement. The -"manifest" is a list of caps, one for each node that is reachable from the -user's root directory/directories. The client is expected to generate the -manifest on a periodic basis (perhaps once a day), and to keep track of which -files/dirnodes have been added and removed. Items which have been removed -must be explicitly dereferenced to reclaim their storage space. For grids -which use per-file lease timers, the manifest is used to drive the Renewer: a -process which renews the lease timers on a periodic basis (perhaps once a -week). The manifest can also be used to drive a Checker, which in turn feeds -work into the Repairer. - -The manifest should contain the minimum necessary authority to do its job, -which generally means it contains the "verify cap" for each node. For -immutable files, the verify cap contains the storage index and the UEB hash: -enough information to retrieve and validate the ciphertext but not enough to -decrypt it. For mutable files, the verify cap contains the storage index and -the pubkey hash, which also serves to retrieve and validate ciphertext but -not decrypt it. - -If the client does its own Renewing and Repairing, then a verifycap-based -manifest is sufficient. However, if the user wants to be able to turn their -computer off for a few months and still keep their files around, they need to -delegate this job off to some other willing node. In a commercial network, -there will be centralized (and perhaps trusted) Renewer/Repairer nodes, but -in a friendnet these may not be available, and the user will depend upon one -of their friends being willing to run this service for them while they are -away. In either of these cases, the verifycaps are not enough: the Renewer -will need additional authority to renew the client's leases, and the Repairer -will need the authority to create new shares (in the client's name) when -necessary. - -A trusted central service could be given all-account superpowers, allowing it -to exercise storage authority on behalf of all users as it pleases. If this -is the case, the verifycaps are sufficient. But if we desire to grant less -authority to the Renewer/Repairer, then we need a mechanism to attenuate this -authority. - -The usual objcap approach is to create a proxy: an intermediate object which -itself is given full authority, but which is unwilling to exercise more than -a portion of that authority in response to incoming requests. The -not-fully-trusted service is then only given access to the proxy, not the -final authority. For example: - - class Proxy(RemoteReference): - def __init__(self, original, storage_index): - self.original = original - self.storage_index = storage_index - def remote_renew_leases(self): - return self.original.renew_leases(self.storage_index) - renewer.grant(Proxy(target, "abcd")) - -But this approach interposes the proxy in the calling chain, requiring the -machine which hosts the proxy to be available and on-line at all times, which -runs opposite to our use case (turning the client off for a month). - -== Creating Attenuated Authorities == - -The other enhancement is to use more public-key operations to allow the -delegation of reduced authority to external helper services. Specifically, we -want to give then Renewer the ability to renew leases for a specific file, -rather than giving it lease-renewal power for all files. Likewise, the -Repairer should have the ability to create new shares, but only for the file -that is being repaired, not for unrelated files. - -If we do not mind giving the storage servers the ability to replay their -inbound message to other storage servers, then the client can simply generate -a signed message with a wildcard SSID= argument and leave it in the care of -the Renewer or Repairer. For example, the Renewer would get: - - SIGN("renew-lease SI=123 SSID=*", accountprivkey), membership_card - -Then, when the Renewer needed to renew a lease, it would deliver this signed -request message to the storage server. The SS would verify the signatures -just as if the message came from the original client, find them good, and -perform the desired operation. With this approach, the manifest that is -delivered to the remote Renewer process needs to include a signed -lease-renewal request for each file: we use the term "renew-cap" for this -combined (verifycap + signed lease-renewal request) message. Likewise the -"repair-cap" would be the verifycap plus a signed allocate-bucket message. A -renew-cap manifest would be enough for a remote Renewer to do its job, a -repair-cap manifest would provide a remote Repairer with enough authority, -and a cancel-cap manifest would be used for a remote Canceller (used, e.g., -to make sure that file has been dereferenced even if the client does not -stick around long enough to track down and inform all of the storage servers -involved). - -The only concern is that the SS could also take this exact same renew-lease -message and deliver it to other storage servers. This wouldn't cause a -concern for mere lease renewal, but the allocate-share message might be a bit -less comfortable (you might not want to grant the first storage server the -ability to claim space in your name on all other storage servers). - -Ideally we'd like to send a different message to each storage server, each -narrowed in scope to a single SSID, since then none of these messages would -be useful on any other SS. If the client knew the identities of all the -storage servers in the system ahead of time, it might create a whole slew of -signed messages, but a) this is a lot of signatures, only a fraction of which -will ever actually be used, and b) new servers might be introduced after the -manifest is created, particularly if we're talking about repair-caps instead -of renewal-caps. The Renewer can't generate these one-per-SSID messages from -the SSID=* message, because it doesn't have a privkey to make the correct -signatures. So without some other mechanism, we're stuck with these -relatively coarse authorities. - -If we want to limit this sort of authority, then we need to introduce a new -method. The client begins by generating a new DSA keypair. Then it signs a -message that declares the new pubkey to be valid for a specific subset of -storage operations (such as "renew-lease SI=123 SSID=*"). Then it delivers -the new privkey, the declaration message, and the membership card to the -Renewer. The renewer uses the new privkey to sign its own one-per-SSID -request message for each server, then sends the (signed request, declaration, -membership card) triple to the server. The server needs to perform three -verification checks per message: first the membership card, then the -declaration message, then the actual request message. - -== Other Enhancements == - -If a given authority is likely to be used multiple times, the same -give-me-a-FURL trick can be used to cut down on the number of public key -operations that must be performed. This is trickier with the per-SI messages. - -When storing the manifest, things like the membership card should be -amortized across a set of common entries. An isolated renew-cap needs to -contain the verifycap, the signed renewal request, and the membership card. -But a manifest with a thousand entries should only include one copy of the -membership card. - -It might be sensible to define a signed renewal request that grants authority -for a set of storage indicies, so that the signature can be shared among -several entries (to save space and perhaps processing time). The request -could include a Bloom filter of authorized SI values: when the request is -actually sent to the server, the renewer would add a list of actual SI values -to renew, and the server would accept all that are contained in the filter. - -== Revocation == - -The lifetime of the storage authority included in the manifest's renew-caps -or repair-caps will determine the lifetime of those caps. In particular, if -we implement account revocation by using time-limited membership cards -(requiring the client to get a new card once a month), then the repair-caps -won't work for more than a month, which kind of defeats the purpose. - -A related issue is the FURL-shortcut: the MAC'ed message needs to include a -validity period of some sort, and if the client tries to use a old FURL they -should get an error message that will prompt them to try and acquire a newer -one. - ------------------------------- - -The client can produce a repair-cap manifest for a specific Repairer's -pubkey, so it can produce a signed message that includes the pubkey (instead -of needing to generate a new privkey just for this purpose). The result is -not a capability, since it can only be used by the holder of the -corresponding privkey. - -So the generic form of the storage operation message is the request (which -has all the argument values filled in), followed by a chain of -authorizations. The first authorization must be signed by the Account -Server's key. Each authorization must be signed by the key mentioned in the -previous one. Each one adds a new limitation on the power of the following -ones. The actual request is bounded by all the limitations of the chain. - -The membership card is an authorization that simply limits the account number -that can be used: "op=* SI=* SSID=* account=4 signed-by=CLIENT-PUBKEY". - -So a repair manifest created for a Repairer with pubkey ABCD could consist of -a list of verifycaps plus a single authorization (using a Bloom filter to -identify the SIs that were allowed): - - SIGN("allocate SI=[bloom] SSID=* signed-by=ABCD") - -If/when the Repairer needed to allocate a share, it would use its own privkey -to sign an additional message and send the whole list to the SS: - - request=allocate SI=1234 SSID=EEFS account=4 shnum=2 - SIGN("allocate SI=1234 SSID=EEFS", ABCD) - SIGN("allocate SI=[bloom] SSID=* signed-by=ABCD", clientkey) - membership: SIGN("op=* SI=* SSID=* account=4 signed-by=clientkey", ASkey) - [implicit]: ASkey - ----------------------------------------- - -Things would be a lot simpler if the Repairer (actually the Re-Leaser) had -everybody's account authority. - -One simplifying approach: the Repairer/Re-Leaser has its own account, and the -shares it creates are leased under that account number. The R/R keeps track -of which leases it has created for whom. When the client eventually comes -back online, it is told to perform a re-leasing run, and after that occurs -the R/R can cancel its own temporary leases. - -This would effectively transfer storage quota from the original client to the -R/R over time (as shares are regenerated by the R/R while the client remains -offline). If the R/R is centrally managed, the quota mechanism can sum the -R/R's numbers with the SS's numbers when determining how much storage is -consumed by any given account. Not quite as clean as storing the exact -information in the SS's lease tables directly, but: - - * the R/R no longer needs any special account authority (it merely needs an - accurate account number, which can be supplied by giving the client a - specific facet that is bound to that account number) - * the verify-cap manifest is sufficient to perform repair - * no extra DSA keys are necessary - * account authority could be implemented with either DSA keys or personal SS - facets: i.e. we don't need the delegability aspects of DSA keys for use by - the repair mechanism (we might still want them to simplify introduction). - -I *think* this would eliminate all that complexity of chained authorization -messages. diff --git a/docs/backupdb.txt b/docs/backupdb.txt deleted file mode 100644 index c9618e6d..00000000 --- a/docs/backupdb.txt +++ /dev/null @@ -1,188 +0,0 @@ -= PRELIMINARY = - -This document is a description of a feature which is not yet implemented, -added here to solicit feedback and to describe future plans. This document is -subject to revision or withdrawal at any moment. Until this notice is -removed, consider this entire document to be a figment of your imagination. - -= The Tahoe BackupDB = - -To speed up backup operations, Tahoe maintains a small database known as the -"backupdb". This is used to avoid re-uploading files which have already been -uploaded recently. - -This database lives in ~/.tahoe/private/backupdb.sqlite, and is a SQLite -single-file database. It is used by the "tahoe backup" command, and by the -"tahoe cp" command when the --use-backupdb option is included. - -The purpose of this database is specifically to manage the file-to-cap -translation (the "upload" step). It does not address directory updates. - -The overall goal of optimizing backup is to reduce the work required when the -source disk has not changed since the last backup. In the ideal case, running -"tahoe backup" twice in a row, with no intervening changes to the disk, will -not require any network traffic. - -This database is optional. If it is deleted, the worst effect is that a -subsequent backup operation may use more effort (network bandwidth, CPU -cycles, and disk IO) than it would have without the backupdb. - -== Schema == - -The database contains the following tables: - -CREATE TABLE version -( - version integer # contains one row, set to 0 -); - -CREATE TABLE last_upload -( - path varchar(1024), # index, this is os.path.abspath(fn) - size integer, # os.stat(fn)[stat.ST_SIZE] - mtime number, # os.stat(fn)[stat.ST_MTIME] - fileid integer -); - -CREATE TABLE caps -( - fileid integer PRIMARY KEY AUTOINCREMENT, - filecap varchar(256), # URI:CHK:... - last_uploaded timestamp, - last_checked timestamp -); - -CREATE TABLE keys_to_files -( - readkey varchar(256) PRIMARY KEY, # index, AES key portion of filecap - fileid integer -); - -Notes: if we extend the backupdb to assist with directory maintenance (see -below), we may need paths in multiple places, so it would make sense to -create a table for them, and change the last_upload table to refer to a -pathid instead of an absolute path: - -CREATE TABLE paths -( - path varchar(1024), # index - pathid integer PRIMARY KEY AUTOINCREMENT -); - -== Operation == - -The upload process starts with a pathname (like ~/.emacs) and wants to end up -with a file-cap (like URI:CHK:...). - -The first step is to convert the path to an absolute form -(/home/warner/emacs) and do a lookup in the last_upload table. If the path is -not present in this table, the file must be uploaded. The upload process is: - - 1. record the file's size and modification time - 2. upload the file into the grid, obtaining an immutable file read-cap - 3. add an entry to the 'caps' table, with the read-cap, and the current time - 4. extract the read-key from the read-cap, add an entry to 'keys_to_files' - 5. add an entry to 'last_upload' - -If the path *is* present in 'last_upload', the easy-to-compute identifying -information is compared: file size and modification time. If these differ, -the file must be uploaded. The row is removed from the last_upload table, and -the upload process above is followed. - -If the path is present but the mtime differs, the file may have changed. If -the size differs, then the file has certainly changed. The client will -compute the CHK read-key for the file by hashing its contents, using exactly -the same algorithm as the node does when it uploads a file (including -~/.tahoe/private/convergence). It then checks the 'keys_to_files' table to -see if this file has been uploaded before: perhaps the file was moved from -elsewhere on the disk. If no match is found, the file must be uploaded, so -the upload process above is follwed. - -If the read-key *is* found in the 'keys_to_files' table, then the file has -been uploaded before, but we should consider performing a file check / verify -operation to make sure we can skip a new upload. The fileid is used to -retrieve the entry from the 'caps' table, and the last_checked timestamp is -examined. If this timestamp is too old, a filecheck operation should be -performed, and the file repaired if the results are not satisfactory. A -"random early check" algorithm should be used, in which a check is performed -with a probability that increases with the age of the previous results. E.g. -files that were last checked within a month are not checked, files that were -checked 5 weeks ago are re-checked with 25% probability, 6 weeks with 50%, -more than 8 weeks are always checked. This reduces the "thundering herd" of -filechecks-on-everything that would otherwise result when a backup operation -is run one month after the original backup. The readkey can be submitted to -the upload operation, to remove a duplicate hashing pass through the file and -reduce the disk IO. In a future version of the storage server protocol, this -could also improve the "streamingness" of the upload process. - -If the file's size and mtime match, the file is considered to be unmodified, -and the last_checked timestamp from the 'caps' table is examined as above -(possibly resulting in a filecheck or repair). The --no-timestamps option -disables this check: this removes the danger of false-positives (i.e. not -uploading a new file, because it appeared to be the same as a previously -uploaded one), but increases the amount of disk IO that must be performed -(every byte of every file must be hashed to compute the readkey). - -This algorithm is summarized in the following pseudocode: - -{{{ - def backup(path): - abspath = os.path.abspath(path) - result = check_for_upload(abspath) - now = time.time() - if result == MUST_UPLOAD: - filecap = upload(abspath, key=result.readkey) - fileid = db("INSERT INTO caps (filecap, last_uploaded, last_checked)", - (filecap, now, now)) - db("INSERT INTO keys_to_files", (result.readkey, filecap)) - db("INSERT INTO last_upload", (abspath,current_size,current_mtime,fileid)) - if result in (MOVED, ALREADY_UPLOADED): - age = now - result.last_checked - probability = (age - 1*MONTH) / 1*MONTH - probability = min(max(probability, 0.0), 1.0) - if random.random() < probability: - do_filecheck(result.filecap) - if result == MOVED: - db("INSERT INTO last_upload", - (abspath, current_size, current_mtime, result.fileid)) - - - def check_for_upload(abspath): - row = db("SELECT (size,mtime,fileid) FROM last_upload WHERE path == %s" - % abspath) - if not row: - return check_moved(abspath) - current_size = os.stat(abspath)[stat.ST_SIZE] - current_mtime = os.stat(abspath)[stat.ST_MTIME] - (last_size,last_mtime,last_fileid) = row - if file_changed(current_size, last_size, current_mtime, last_mtime): - db("DELETE FROM last_upload WHERE fileid=%s" % fileid) - return check_moved(abspath) - (filecap, last_checked) = db("SELECT (filecap, last_checked) FROM caps" + - " WHERE fileid == %s" % last_fileid) - return ALREADY_UPLOADED(filecap=filecap, last_checked=last_checked) - - def file_changed(current_size, last_size, current_mtime, last_mtime): - if last_size != current_size: - return True - if NO_TIMESTAMPS: - return True - if last_mtime != current_mtime: - return True - return False - - def check_moved(abspath): - readkey = hash_with_convergence(abspath) - fileid = db("SELECT (fileid) FROM keys_to_files WHERE readkey == %s"%readkey) - if not fileid: - return MUST_UPLOAD(readkey=readkey) - (filecap, last_checked) = db("SELECT (filecap, last_checked) FROM caps" + - " WHERE fileid == %s" % fileid) - return MOVED(fileid=fileid, filecap=filecap, last_checked=last_checked) - - def do_filecheck(filecap): - health = check(filecap) - if health < DESIRED: - repair(filecap) - -}}} diff --git a/docs/denver.txt b/docs/denver.txt deleted file mode 100644 index 5aa9893b..00000000 --- a/docs/denver.txt +++ /dev/null @@ -1,182 +0,0 @@ -The "Denver Airport" Protocol - - (discussed whilst returning robk to DEN, 12/1/06) - -This is a scaling improvement on the "Select Peers" phase of Tahoe2. The -problem it tries to address is the storage and maintenance of the 1M-long -peer list, and the relative difficulty of gathering long-term reliability -information on a useful numbers of those peers. - -In DEN, each node maintains a Chord-style set of connections to other nodes: -log2(N) "finger" connections to distant peers (the first of which is halfway -across the ring, the second is 1/4 across, then 1/8th, etc). These -connections need to be kept alive with relatively short timeouts (5s?), so -any breaks can be rejoined quickly. In addition to the finger connections, -each node must also remain aware of K "successor" nodes (those which are -immediately clockwise of the starting point). The node is not required to -maintain connections to these, but it should remain informed about their -contact information, so that it can create connections when necessary. We -probably need a connection open to the immediate successor at all times. - -Since inbound connections exist too, each node has something like 2*log2(N) -plus up to 2*K connections. - -Each node keeps history of uptime/availability of the nodes that it remains -connected to. Each message that is sent to these peers includes an estimate -of that peer's availability from the point of view of the outside world. The -receiving node will average these reports together to determine what kind of -reliability they should announce to anyone they accept leases for. This -reliability is expressed as a percentage uptime: P=1.0 means the peer is -available 24/7, P=0.0 means it is almost never reachable. - - -When a node wishes to publish a file, it creates a list of (verifierid, -sharenum) tuples, and computes a hash of each tuple. These hashes then -represent starting points for the landlord search: - - starting_points = [(sharenum,sha(verifierid + str(sharenum))) - for sharenum in range(256)] - -The node then constructs a reservation message that contains enough -information for the potential landlord to evaluate the lease, *and* to make a -connection back to the starting node: - - message = [verifierid, sharesize, requestor_furl, starting_points] - -The node looks through its list of finger connections and splits this message -into up to log2(N) smaller messages, each of which contains only the starting -points that should be sent to that finger connection. Specifically we sent a -starting_point to a finger A if the nodeid of that finger is <= the -starting_point and if the next finger B is > starting_point. Each message -sent out can contain multiple starting_points, each for a different share. - -When a finger node receives this message, it performs the same splitting -algorithm, sending each starting_point to other fingers. Eventually a -starting_point is received by a node that knows that the starting_point lies -between itself and its immediate successor. At this point the message -switches from the "hop" mode (following fingers) to the "search" mode -(following successors). - -While in "search" mode, each node interprets the message as a lease request. -It checks its storage pool to see if it can accomodate the reservation. If -so, it uses requestor_furl to contact the originator and announces its -willingness to host the given sharenum. This message will include the -reliability measurement derived from the host's counterclockwise neighbors. - -If the recipient cannot host the share, it forwards the request on to the -next successor, which repeats the cycle. Each message has a maximum hop count -which limits the number of peers which may be searched before giving up. If a -node sees itself to be the last such hop, it must establish a connection to -the originator and let them know that this sharenum could not be hosted. - -The originator sends out something like 100 or 200 starting points, and -expects to get back responses (positive or negative) in a reasonable amount -of time. (perhaps if we receive half of the responses in time T, wait for a -total of 2T for the remaining ones). If no response is received with the -timeout, either re-send the requests for those shares (to different fingers) -or send requests for completely different shares. - -Each share represents some fraction of a point "S", such that the points for -enough shares to reconstruct the whole file total to 1.0 points. I.e., if we -construct 100 shares such that we need 25 of them to reconstruct the file, -then each share represents .04 points. - -As the positive responses come in, we accumulate two counters: the capacity -counter (which gets a full S points for each positive response), and the -reliability counter (which gets S*(reliability-of-host) points). The capacity -counter is not allowed to go above some limit (like 4x), as determined by -provisioning. The node keeps adding leases until the reliability counter has -gone above some other threshold (larger but close to 1.0). - -[ at download time, each host will be able to provide the share back with - probability P times an exponential decay factor related to peer death. Sum - these probabilities to get the average number of shares that will be - available. The interesting thing is actually the distribution of these - probabilities, and what threshold you have to pick to get a sufficiently - high chance of recovering the file. If there are N identical peers with - probability P, the number of recovered shares will have a gaussian - distribution with an average of N*P and a stddev of (??). The PMF of this - function is an S-curve, with a sharper slope when N is large. The - probability of recovering the file is the value of this S curve at the - threshold value (the number of necessary shares). - - P is not actually constant across all peers, rather we assume that it has - its own distribution: maybe gaussian, more likely exponential (power law). - This changes the shape of the S-curve. Assuming that we can characterize - the distribution of P with perhaps two parameters (say meanP and stddevP), - the S-curve is a function of meanP, stddevP, N, and threshold... - - To get 99.99% or 99.999% recoverability, we must choose a threshold value - high enough to accomodate the random variations and uncertainty about the - real values of P for each of the hosts we've selected. By counting - reliability points, we are trying to estimate meanP/stddevP, so we know - which S-curve to look at. The threshold is fixed at 1.0, since that's what - erasure coding tells us we need to recover the file. The job is then to add - hosts (increasing N and possibly changing meanP/stddevP) until our - recoverability probability is as high as we want. -] - -The originator takes all acceptance messages and adds them in order to the -list of landlords that will be used to host the file. It stops when it gets -enough reliability points. Note that it does *not* discriminate against -unreliable hosts: they are less likely to have been found in the first place, -so we don't need to discriminate against them a second time. We do, however, -use the reliability points to acknowledge that sending data to an unreliable -peer is not as useful as sending it to a reliable one (there is still value -in doing so, though). The remaining reservation-acceptance messages are -cancelled and then put aside: if we need to make a second pass, we ask those -peers first. - -Shares are then created and published as in Tahoe2. If we lose a connection -during the encoding, that share is lost. If we lose enough shares, we might -want to generate more to make up for them: this is done by using the leftover -acceptance messages first, then triggering a new Chord search for the -as-yet-unaccepted sharenums. These new peers will get shares from all -segments that have not yet been finished, then a second pass will be made to -catch them up on the earlier segments. - -Properties of this approach: - the total number of peers that each node must know anything about is bounded - to something like 2*log2(N) + K, probably on the order of 50 to 100 total. - This is the biggest advantage, since in tahoe2 each node must know at least - the nodeid of all 1M peers. The maintenance traffic should be much less as a - result. - - each node must maintain open (keep-alived) connections to something like - 2*log2(N) peers. In tahoe2, this number is 0 (well, probably 1 for the - introducer). - - during upload, each node must actively use 100 connections to a random set - of peers to push data (just like tahoe2). - - The probability that any given share-request gets a response is equal to the - number of hops it travels through times the chance that a peer dies while - holding on to the message. This should be pretty small, as the message - should only be held by a peer for a few seconds (more if their network is - busy). In tahoe2, each share-request always gets a response, since they are - made directly to the target. - -I visualize the peer-lookup process as the originator creating a -message-in-a-bottle for each share. Each message says "Dear Sir/Madam, I -would like to store X bytes of data for file Y (share #Z) on a system close -to (but not below) nodeid STARTING_POINT. If you find this amenable, please -contact me at FURL so we can make arrangements.". These messages are then -bundled together according to their rough destination (STARTING_POINT) and -sent somewhere in the right direction. - -Download happens the same way: lookup messages are disseminated towards the -STARTING_POINT and then search one successor at a time from there. There are -two ways that the share might go missing: if the node is now offline (or has -for some reason lost its shares), or if new nodes have joined since the -original upload and the search depth (maximum hop count) is too small to -accomodate the churn. Both result in the same amount of localized traffic. In -the latter case, a storage node might want to migrate the share closer to the -starting point, or perhaps just send them a note to remember a pointer for -the share. - -Checking: anyone who wishes to do a filecheck needs to send out a lookup -message for every potential share. These lookup messages could have a higher -search depth than usual. It would be useful to know how many peers each -message went through before being returned: this might be useful to perform -repair by instructing the old host (which is further from the starting point -than you'd like) to push their share closer towards the starting point. diff --git a/docs/mutable-DSA.svg b/docs/mutable-DSA.svg deleted file mode 100644 index 6870d834..00000000 --- a/docs/mutable-DSA.svg +++ /dev/null @@ -1,1144 +0,0 @@ - - - - - - - - - - - - - - - - - - - image/svg+xml - - - - - - - - - DSA private key - (256 bit string) - - DSA public key - (2048+ bit string) - - - salt - - - read-cap - - - - - - + - - - - - - - - 192 - - - - 64 - - (math) - - - AES - - - - H - H - - encryptedsalt - - - data - - crypttext - - - AES - readkey - - - - - shares - - - otherstuff - - - DSA - - - - - - signature - private key - - 256 - - - pubkey hash - 256 - - FEC - Hmerkletrees - H - write-cap - - - - storage index - - - - 64 - SI:A - - - - 64 - SI:B - - - - verify cap - H - H - H - H - H - H - - - pubkey hash - 256 - - - - 64 - SI:A - - - : stored in share - H - - - deep-verify cap - - 192 - 64 - - - H - H - - - - AES - writekey - - - - - AES - deepverifykey - - H - - diff --git a/docs/mutable-DSA.txt b/docs/mutable-DSA.txt deleted file mode 100644 index 73f3eb78..00000000 --- a/docs/mutable-DSA.txt +++ /dev/null @@ -1,346 +0,0 @@ - -(protocol proposal, work-in-progress, not authoritative) - -(this document describes DSA-based mutable files, as opposed to the RSA-based -mutable files that were introduced in tahoe-0.7.0 . This proposal has not yet -been implemented. Please see mutable-DSA.svg for a quick picture of the -crypto scheme described herein) - -This file shows only the differences from RSA-based mutable files to -(EC)DSA-based mutable files. You have to read and understand mutable.txt before -reading this file (mutable-DSA.txt). - -=== SDMF slots overview === - -Each SDMF slot is created with a DSA public/private key pair, using a -system-wide common modulus and generator, in which the private key is a -random 256 bit number, and the public key is a larger value (about 2048 bits) -that can be derived with a bit of math from the private key. The public key -is known as the "verification key", while the private key is called the -"signature key". - -The 256-bit signature key is used verbatim as the "write capability". This -can be converted into the 2048ish-bit verification key through a fairly cheap -set of modular exponentiation operations; this is done any time the holder of -the write-cap wants to read the data. (Note that the signature key can either -be a newly-generated random value, or the hash of something else, if we found -a need for a capability that's stronger than the write-cap). - -This results in a write-cap which is 256 bits long and can thus be expressed -in an ASCII/transport-safe encoded form (base62 encoding, fits in 72 -characters, including a local-node http: convenience prefix). - -The private key is hashed to form a 256-bit "salt". The public key is also -hashed to form a 256-bit "pubkey hash". These two values are concatenated, -hashed, and truncated to 192 bits to form the first 192 bits of the read-cap. -The pubkey hash is hashed by itself and truncated to 64 bits to form the last -64 bits of the read-cap. The full read-cap is 256 bits long, just like the -write-cap. - -The first 192 bits of the read-cap are hashed and truncated to form the first -192 bits of the "traversal cap". The last 64 bits of the read-cap are hashed -to form the last 64 bits of the traversal cap. This gives us a 256-bit -traversal cap. - -The first 192 bits of the traversal-cap are hashed and truncated to form the -first 64 bits of the storage index. The last 64 bits of the traversal-cap are -hashed to form the last 64 bits of the storage index. This gives us a 128-bit -storage index. - -The verification-cap is the first 64 bits of the storage index plus the -pubkey hash, 320 bits total. The verification-cap doesn't need to be -expressed in a printable transport-safe form, so it's ok that it's longer. - -The read-cap is hashed one way to form an AES encryption key that is used to -encrypt the salt; this key is called the "salt key". The encrypted salt is -stored in the share. The private key never changes, therefore the salt never -changes, and the salt key is only used for a single purpose, so there is no -need for an IV. - -The read-cap is hashed a different way to form the master data encryption -key. A random "data salt" is generated each time the share's contents are -replaced, and the master data encryption key is concatenated with the data -salt, then hashed, to form the AES CTR-mode "read key" that will be used to -encrypt the actual file data. This is to avoid key-reuse. An outstanding -issue is how to avoid key reuse when files are modified in place instead of -being replaced completely; this is not done in SDMF but might occur in MDMF. - -The master data encryption key is used to encrypt data that should be visible -to holders of a write-cap or a read-cap, but not to holders of a -traversal-cap. - -The private key is hashed one way to form the salt, and a different way to -form the "write enabler master". For each storage server on which a share is -kept, the write enabler master is concatenated with the server's nodeid and -hashed, and the result is called the "write enabler" for that particular -server. Note that multiple shares of the same slot stored on the same server -will all get the same write enabler, i.e. the write enabler is associated -with the "bucket", rather than the individual shares. - -The private key is hashed a third way to form the "data write key", which can -be used by applications which wish to store some data in a form that is only -available to those with a write-cap, and not to those with merely a read-cap. -This is used to implement transitive read-onlyness of dirnodes. - -The traversal cap is hashed to work the "traversal key", which can be used by -applications that wish to store data in a form that is available to holders -of a write-cap, read-cap, or traversal-cap. - -The idea is that dirnodes will store child write-caps under the writekey, -child names and read-caps under the read-key, and verify-caps (for files) or -deep-verify-caps (for directories) under the traversal key. This would give -the holder of a root deep-verify-cap the ability to create a verify manifest -for everything reachable from the root, but not the ability to see any -plaintext or filenames. This would make it easier to delegate filechecking -and repair to a not-fully-trusted agent. - -The public key is stored on the servers, as is the encrypted salt, the -(non-encrypted) data salt, the encrypted data, and a signature. The container -records the write-enabler, but of course this is not visible to readers. To -make sure that every byte of the share can be verified by a holder of the -verify-cap (and also by the storage server itself), the signature covers the -version number, the sequence number, the root hash "R" of the share merkle -tree, the encoding parameters, and the encrypted salt. "R" itself covers the -hash trees and the share data. - -The read-write URI is just the private key. The read-only URI is the read-cap -key. The deep-verify URI is the traversal-cap. The verify-only URI contains -the the pubkey hash and the first 64 bits of the storage index. - - FMW:b2a(privatekey) - FMR:b2a(readcap) - FMT:b2a(traversalcap) - FMV:b2a(storageindex[:64])b2a(pubkey-hash) - -Note that this allows the read-only, deep-verify, and verify-only URIs to be -derived from the read-write URI without actually retrieving any data from the -share, but instead by regenerating the public key from the private one. Users -of the read-only, deep-verify, or verify-only caps must validate the public -key against their pubkey hash (or its derivative) the first time they -retrieve the pubkey, before trusting any signatures they see. - -The SDMF slot is allocated by sending a request to the storage server with a -desired size, the storage index, and the write enabler for that server's -nodeid. If granted, the write enabler is stashed inside the slot's backing -store file. All further write requests must be accompanied by the write -enabler or they will not be honored. The storage server does not share the -write enabler with anyone else. - -The SDMF slot structure will be described in more detail below. The important -pieces are: - - * a sequence number - * a root hash "R" - * the data salt - * the encoding parameters (including k, N, file size, segment size) - * a signed copy of [seqnum,R,data_salt,encoding_params] (using signature key) - * the verification key (not encrypted) - * the share hash chain (part of a Merkle tree over the share hashes) - * the block hash tree (Merkle tree over blocks of share data) - * the share data itself (erasure-coding of read-key-encrypted file data) - * the salt, encrypted with the salt key - -The access pattern for read (assuming we hold the write-cap) is: - * generate public key from the private one - * hash private key to get the salt, hash public key, form read-cap - * form storage-index - * use storage-index to locate 'k' shares with identical 'R' values - * either get one share, read 'k' from it, then read k-1 shares - * or read, say, 5 shares, discover k, either get more or be finished - * or copy k into the URIs - * .. jump to "COMMON READ", below - -To read (assuming we only hold the read-cap), do: - * hash read-cap pieces to generate storage index and salt key - * use storage-index to locate 'k' shares with identical 'R' values - * retrieve verification key and encrypted salt - * decrypt salt - * hash decrypted salt and pubkey to generate another copy of the read-cap, - make sure they match (this validates the pubkey) - * .. jump to "COMMON READ" - - * COMMON READ: - * read seqnum, R, data salt, encoding parameters, signature - * verify signature against verification key - * hash data salt and read-cap to generate read-key - * read share data, compute block-hash Merkle tree and root "r" - * read share hash chain (leading from "r" to "R") - * validate share hash chain up to the root "R" - * submit share data to erasure decoding - * decrypt decoded data with read-key - * submit plaintext to application - -The access pattern for write is: - * generate pubkey, salt, read-cap, storage-index as in read case - * generate data salt for this update, generate read-key - * encrypt plaintext from application with read-key - * application can encrypt some data with the data-write-key to make it - only available to writers (used for transitively-readonly dirnodes) - * erasure-code crypttext to form shares - * split shares into blocks - * compute Merkle tree of blocks, giving root "r" for each share - * compute Merkle tree of shares, find root "R" for the file as a whole - * create share data structures, one per server: - * use seqnum which is one higher than the old version - * share hash chain has log(N) hashes, different for each server - * signed data is the same for each server - * include pubkey, encrypted salt, data salt - * now we have N shares and need homes for them - * walk through peers - * if share is not already present, allocate-and-set - * otherwise, try to modify existing share: - * send testv_and_writev operation to each one - * testv says to accept share if their(seqnum+R) <= our(seqnum+R) - * count how many servers wind up with which versions (histogram over R) - * keep going until N servers have the same version, or we run out of servers - * if any servers wound up with a different version, report error to - application - * if we ran out of servers, initiate recovery process (described below) - -==== Cryptographic Properties ==== - -This scheme protects the data's confidentiality with 192 bits of key -material, since the read-cap contains 192 secret bits (derived from an -encrypted salt, which is encrypted using those same 192 bits plus some -additional public material). - -The integrity of the data (assuming that the signature is valid) is protected -by the 256-bit hash which gets included in the signature. The privilege of -modifying the data (equivalent to the ability to form a valid signature) is -protected by a 256 bit random DSA private key, and the difficulty of -computing a discrete logarithm in a 2048-bit field. - -There are a few weaker denial-of-service attacks possible. If N-k+1 of the -shares are damaged or unavailable, the client will be unable to recover the -file. Any coalition of more than N-k shareholders will be able to effect this -attack by merely refusing to provide the desired share. The "write enabler" -shared secret protects existing shares from being displaced by new ones, -except by the holder of the write-cap. One server cannot affect the other -shares of the same file, once those other shares are in place. - -The worst DoS attack is the "roadblock attack", which must be made before -those shares get placed. Storage indexes are effectively random (being -derived from the hash of a random value), so they are not guessable before -the writer begins their upload, but there is a window of vulnerability during -the beginning of the upload, when some servers have heard about the storage -index but not all of them. - -The roadblock attack we want to prevent is when the first server that the -uploader contacts quickly runs to all the other selected servers and places a -bogus share under the same storage index, before the uploader can contact -them. These shares will normally be accepted, since storage servers create -new shares on demand. The bogus shares would have randomly-generated -write-enablers, which will of course be different than the real uploader's -write-enabler, since the malicious server does not know the write-cap. - -If this attack were successful, the uploader would be unable to place any of -their shares, because the slots have already been filled by the bogus shares. -The uploader would probably try for peers further and further away from the -desired location, but eventually they will hit a preconfigured distance limit -and give up. In addition, the further the writer searches, the less likely it -is that a reader will search as far. So a successful attack will either cause -the file to be uploaded but not be reachable, or it will cause the upload to -fail. - -If the uploader tries again (creating a new privkey), they may get lucky and -the malicious servers will appear later in the query list, giving sufficient -honest servers a chance to see their share before the malicious one manages -to place bogus ones. - -The first line of defense against this attack is the timing challenges: the -attacking server must be ready to act the moment a storage request arrives -(which will only occur for a certain percentage of all new-file uploads), and -only has a few seconds to act before the other servers will have allocated -the shares (and recorded the write-enabler, terminating the window of -vulnerability). - -The second line of defense is post-verification, and is possible because the -storage index is partially derived from the public key hash. A storage server -can, at any time, verify every public bit of the container as being signed by -the verification key (this operation is recommended as a continual background -process, when disk usage is minimal, to detect disk errors). The server can -also hash the verification key to derive 64 bits of the storage index. If it -detects that these 64 bits do not match (but the rest of the share validates -correctly), then the implication is that this share was stored to the wrong -storage index, either due to a bug or a roadblock attack. - -If an uploader finds that they are unable to place their shares because of -"bad write enabler errors" (as reported by the prospective storage servers), -it can "cry foul", and ask the storage server to perform this verification on -the share in question. If the pubkey and storage index do not match, the -storage server can delete the bogus share, thus allowing the real uploader to -place their share. Of course the origin of the offending bogus share should -be logged and reported to a central authority, so corrective measures can be -taken. It may be necessary to have this "cry foul" protocol include the new -write-enabler, to close the window during which the malicious server can -re-submit the bogus share during the adjudication process. - -If the problem persists, the servers can be placed into pre-verification -mode, in which this verification is performed on all potential shares before -being committed to disk. This mode is more CPU-intensive (since normally the -storage server ignores the contents of the container altogether), but would -solve the problem completely. - -The mere existence of these potential defenses should be sufficient to deter -any actual attacks. Note that the storage index only has 64 bits of -pubkey-derived data in it, which is below the usual crypto guidelines for -security factors. In this case it's a pre-image attack which would be needed, -rather than a collision, and the actual attack would be to find a keypair for -which the public key can be hashed three times to produce the desired portion -of the storage index. We believe that 64 bits of material is sufficiently -resistant to this form of pre-image attack to serve as a suitable deterrent. - -=== SMDF Slot Format === - -This SMDF data lives inside a server-side MutableSlot container. The server -is generally oblivious to this format, but it may look inside the container -when verification is desired. - -This data is tightly packed. There are no gaps left between the different -fields, and the offset table is mainly present to allow future flexibility of -key sizes. - - # offset size name - 1 0 1 version byte, \x01 for this format - 2 1 8 sequence number. 2^64-1 must be handled specially, TBD - 3 9 32 "R" (root of share hash Merkle tree) - 4 41 32 data salt (readkey is H(readcap+data_salt)) - 5 73 32 encrypted salt (AESenc(key=H(readcap), salt) - 6 105 18 encoding parameters: - 105 1 k - 106 1 N - 107 8 segment size - 115 8 data length (of original plaintext) - 7 123 36 offset table: - 127 4 (9) signature - 131 4 (10) share hash chain - 135 4 (11) block hash tree - 139 4 (12) share data - 143 8 (13) EOF - 8 151 256 verification key (2048bit DSA key) - 9 407 40 signature=DSAsig(H([1,2,3,4,5,6])) -10 447 (a) share hash chain, encoded as: - "".join([pack(">H32s", shnum, hash) - for (shnum,hash) in needed_hashes]) -11 ?? (b) block hash tree, encoded as: - "".join([pack(">32s",hash) for hash in block_hash_tree]) -12 ?? LEN share data -13 ?? -- EOF - -(a) The share hash chain contains ceil(log(N)) hashes, each 32 bytes long. - This is the set of hashes necessary to validate this share's leaf in the - share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes. -(b) The block hash tree contains ceil(length/segsize) hashes, each 32 bytes - long. This is the set of hashes necessary to validate any given block of - share data up to the per-share root "r". Each "r" is a leaf of the share - has tree (with root "R"), from which a minimal subset of hashes is put in - the share hash chain in (8). - -== TODO == - -Every node in a given tahoe grid must have the same common DSA moduli and -exponent, but different grids could use different parameters. We haven't -figured out how to define a "grid id" yet, but I think the DSA parameters -should be part of that identifier. In practical terms, this might mean that -the Introducer tells each node what parameters to use, or perhaps the node -could have a config file which specifies them instead. diff --git a/docs/proposed/accounts-introducer.txt b/docs/proposed/accounts-introducer.txt new file mode 100644 index 00000000..36a5a56f --- /dev/null +++ b/docs/proposed/accounts-introducer.txt @@ -0,0 +1,134 @@ +This is a proposal for handing accounts and quotas in Tahoe. Nothing is final +yet.. we are still evaluating the options. + + += Account Management: Introducer-based = + +A Tahoe grid can be configured in several different modes. The simplest mode +(which is also the default) is completely permissive: all storage servers +will accept shares from all clients, and no attempt is made to keep track of +who is storing what. Access to the grid is mostly equivalent to having access +to the Introducer (or convincing one of the existing members to give you a +list of all their storage server FURLs). + +This mode, while a good starting point, does not accomodate any sort of +auditing or quota management. Even in a small friendnet, operators might like +to know how much of their storage space is being consumed by Alice, so they +might be able to ask her to cut back when overall disk usage is getting to +high. In a larger commercial deployment, a service provider needs to be able +to get accurate usage numbers so they can bill the user appropriately. In +addition, the operator may want the ability to delete all of Bob's shares +(i.e. cancel any outstanding leases) when he terminates his account. + +There are several lease-management/garbage-collection/deletion strategies +possible for a Tahoe grid, but the most efficient ones require knowledge of +lease ownership, so that renewals and expiration can take place on a +per-account basis rather than a (more numerous) per-share basis. + +== Accounts == + +To accomplish this, "Accounts" can be established in a Tahoe grid. There is +nominally one account per human user of the grid, but of course a user might +use multiple accounts, or an account might be shared between multiple users. +The Account is the smallest unit of quota and lease management. + +Accounts are created by an "Account Manager". In a commercial network there +will be just one (centralized) account manager, and all storage nodes will be +configured to require a valid account before providing storage services. In a +friendnet, each peer can run their own account manager, and servers will +accept accounts from any of the managers (this mode is permissive but allows +quota-tracking of non-malicious users). + +The account manager is free to manage the accounts as it pleases. Large +systems will probably use a database to correlate things like username, +storage consumed, billing status, etc. + +== Overview == + +The Account Manager ("AM") replaces the normal Introducer node: grids which +use an Account Manager will not run an Introducer, and the participating +nodes will not be configured with an "introducer.furl". + +Instead, each client will be configured with a different "account.furl", +which gives that client access to a specific account. These account FURLs +point to an object inside the Account Manager which exists solely for the +benefit of that one account. When the client needs access to storage servers, +it will use this account object to acquire personalized introductions to a +per-account "Personal Storage Server" facet, one per storage server node. For +example, Alice would wind up with PSS[1A] on server 1, and PSS[2A] on server +2. Bob would get PSS[1B] and PSS[2B]. + +These PSS facets provide the same remote methods as the old generic SS facet, +except that every time they create a lease object, the account information of +the holder is recorded in that lease. The client stores a list of these PSS +facet FURLs in persistent storage, and uses them in the "get_permuted_peers" +function that all uploads and downloads use to figure out who to talk to when +looking for shares or shareholders. + +Each Storage Server has a private facet that it gives to the Account Manager. +This facet allows the AM to create PSS facets for a specific account. In +particular, the AM tells the SS "please create account number 42, and tell me +the PSS FURL that I should give to the client". The SS creates an object +which remembers the account number, creates a FURL for it, and returns the +FURL. + +If there is a single central account manager, then account numbers can be +small integers. (if there are multiple ones, they need to be large random +strings to ensure uniqueness). To avoid requiring large (accounts*servers) +lookup tables, a given account should use the same identifer for all the +servers it talks to. When this can be done, the PSS and Account FURLs are +generated as MAC'ed copies of the account number. + +More specifically, the PSS FURL is a MAC'ed copy of the account number: each +SS has a private secret "S", and it creates a string "%d-%s" % (accountnum, +b2a(hash(S+accountnum))) to use as the swissnum part of the FURL. The SS uses +tub.registerNameLookupHandler to add a function that tries to validate +inbound FURLs against this scheme: if successful, it creates a new PSS object +with the account number stashed inside. This allows the server to minimize +their per-user storage requirements but still insure that PSS FURLs are +unguessable. + +Account FURLs are created by the Account Manager in a similar fashion, using +a MAC of the account number. The Account Manager can use the same account +number to index other information in a database, like account status, billing +status, etc. + +The mechanism by which Account FURLs are minted is left up to the account +manager, but the simple AM that the 'tahoe create-account-manager' command +makes has a "new-account" FURL which accepts a username and creates an +account for them. The 'tahoe create-account' command is a CLI frontend to +this facility. In a friendnet, you could publish this FURL to your friends, +allowing everyone to make their own account. In a commercial grid, this +facility would be reserved use by the same code which handles billing. + + +== Creating the Account Manager == + +The 'tahoe create-account-manager' command is used to create a simple account +manager node. When started, this node will write several FURLs to its +private/ directory, some of which should be provided to other services. + + * new-account.furl : this FURL allows the holder to create new accounts + * manage-accounts.furl : this FURL allows the holder to list and modify + all existing accounts + * serverdesk.furl : this FURL is used by storage servers to make themselves + available to all account holders + + +== Configuring the Storage Servers == + +To use an account manager, each storage server node should be given access to +the AM's serverdesk (by simply copying "serverdesk.furl" into the storage +server's base directory). In addition, it should *not* be given an +introducer.furl . The serverdesk FURL tells the SS that it should allow the +AM to create PSS facets for each account, and the lack of an introducer FURL +tells the SS to not make its generic SS facet available to anyone. The +combination means that clients must acquire PSS facets instead of using the +generic one. + +== Configuring Clients == + +Each client should be configured to use a specific account by copying their +account FURL into their basedir, in a file named "account.furl". In addition, +these client nodes should *not* have an "introducer.furl". This combination +tells the client to ask the AM for ... diff --git a/docs/proposed/accounts-pubkey.txt b/docs/proposed/accounts-pubkey.txt new file mode 100644 index 00000000..11d28043 --- /dev/null +++ b/docs/proposed/accounts-pubkey.txt @@ -0,0 +1,636 @@ +This is a proposal for handing accounts and quotas in Tahoe. Nothing is final +yet.. we are still evaluating the options. + + += Accounts = + +The basic Tahoe account is defined by a DSA key pair. The holder of the +private key has the ability to consume storage in conjunction with a specific +account number. + +The Account Server has a long-term keypair. Valid accounts are marked as such +by the Account Server's signature on a "membership card", which binds a +specific pubkey to an account number and declares that this pair is a valid +account. + +Each Storage Server which participages in the AS's domain will have the AS's +pubkey in its list of valid AS keys, and will thus accept membership cards +that were signed by that AS. If the SS accepts multiple ASs, then it will +give each a distinct number, and leases will be labled with an (AS#,Account#) +pair. If there is only one AS, then leases will be labeled with just the +Account#. + +Each client node is given the FURL of their personal Account object. The +Account will accept a DSA public key and return a signed membership card that +authorizes the corresponding private key to consume storage on behalf of the +account. The client will create its own DSA keypair the first time it +connects to the Account, and will then use the resulting membership card for +all subsequent storage operations. + +== Storage Server Goals == + +The Storage Server cares about two things: + + 1: maintaining an accurate refcount on each bucket, so it can delete the + bucket when the refcount goes to zero + 2: being able to answer questions about aggregate usage per account + +The SS conceptually maintains a big matrix of lease information: one column +per account, one row per storage index. The cells contain a boolean +(has-lease or no-lease). If the grid uses per-lease timers, then each +has-lease cell also contains a lease timer. + +This matrix may be stored in a variety of ways: entries in each share file, +or items in a SQL database, according to the desired tradeoff between +complexity, robustness, read speed, and write speed. + +Each client (by virtue of their knowledge of an authorized private key) gets +to manipulate their column of this matrix in any way they like: add lease, +renew lease, delete lease. (TODO: for reconcilliation purposes, the should +also be able to enumerate leases). + +== Storage Operations == + +Side-effect-causing storage operations come in three forms: + + 1: allocate bucket / add lease to existing bucket + arguments: storage_index=, storage_server=, ueb_hash=, account= + 2: renew lease + arguments: storage_index=, storage_server=, account= + 3: cancel lease + arguments: storage_index=, storage_server=, account= + +(where lease renewal is only relevant for grids which use per-lease timers). +Clients do add-lease when they upload a file, and cancel-lease when they +remove their last reference to it. + +Storage Servers publish a "public storage port" through the introducer, which +does not actually enable storage operations, but is instead used in a +rights-amplification pattern to grant authorized parties access to a +"personal storage server facet". This personal facet is the one that +implements allocate_bucket. All clients get access to the same public storage +port, which means that we can improve the introduction mechanism later (to +use a gossip-based protocol) without affecting the authority-granting +protocols. + +The public storage port accepts signed messages asking for storage authority. +It responds by creating a personal facet and making it available to the +requester. The account number is curried into the facet, so that all +lease-creating operations will record this account number into the lease. By +restricting the nature of the personal facets that a client can access, we +restrict them to using their designated account number. + + +======================================== + +There are two kinds of signed messages: use (other names: connection, +FURLification, activation, reification, grounding, specific-making, ?), and +delegation. The FURLification message results in a FURL that points to an +object which can actually accept RIStorageServer methods. The delegation +message results in a new signed message. + +The furlification message looks like: + + (pubkey, signed(serialized({limitations}, beneficiary_furl))) + +The delegation message looks like: + + (pubkey, signed(serialized({limitations}, delegate_pubkey))) + +The limitations dict indicates what the resulting connection or delegation +can be used for. All limitations for the cert chain are applied, and the +result must be restricted to their overall minimum. + +The following limitation keys are defined: + + 'account': a number. All resulting leases must be tagged with this account + number. A chain with multiple distinct 'account' limitations is + an error (the result will not permit leases) + 'SI': a storage index (binary string). Leases may only be created for this + specific storage index, no other. + 'serverid': a peerid (binary string). Leases may only be created on the + storage server identified by this serverid. + 'UEB_hash': (binary string): Leases may only be created for shares which + contain a matching UEB_hash. Note: this limitation is a nuisance + to implement correctly: it requires that the storage server + parse the share and verify all hashes. + 'before': a timestamp (seconds since epoch). All leases must be made before + this time. In addition, all liverefs and FURLs must expire and + cease working at this time. + 'server_size': a number, measuring share size (in bytes). A storage server + which sees this message should keep track of how much storage + space has been consumed using this liveref/FURL, and throw + an exception when receiving a lease request that would bring + this total above 'server_size'. Note: this limitation is + a nuisance to implement (it works best if 'before' is used + and provides a short lifetime). + +Actually, let's merge the two, and put the type in the limitations dict. +'furl_to' and 'delegate_key' are mutually exclusive. + + 'furl_to': (string): Used only on furlification messages. This requests the + recipient to create an object which implements the given access, + then send a FURL which references this object to an + RIFURLReceiver.furl() call at the given 'furl_to' FURL: + facet = create_storage_facet(limitations) + facet_furl = tub.registerReference(facet) + d = tub.getReference(limitations['furl_to']) + d.addCallback(lambda rref: rref.furl(facet_furl)) + The facet_furl should be persistent, so to reduce storage space, + facet_furl should contain an HMAC'ed list of all limitations, and + create_storage_facet() should be deferred until the client + actually tries to use the furl. This leads to 150-200 byte base32 + swissnums. + 'delegate_key': (binary string, a DSA pubkey). Used only on delegation + messages. This requests all observers to accept messages + signed by the given public key and to apply the associated + limitations. + +I also want to keep the message size small, so I'm going to define a custom +netstring-based encoding format for it (JSON expands binary data by about +3.5x). Each dict entry will be encoded as netstring(key)+netstring(value). +The container is responsible for providing the size of this serialized +structure. + +The actual message will then look like: + +def make_message(privkey, limitations): + message_to_sign = "".join([ netstring(k) + netstring(v) + for k,v in limitations ]) + signature = privkey.sign(message_to_sign) + pubkey = privkey.get_public_key() + msg = netstring(message_to_sign) + netstring(signature) + netstring(pubkey) + return msg + +The deserialization code MUST throw an exception if the same limitations key +appears twice, to ensure that everybody interprets the dict the same way. + +These messages are passed over foolscap connections as a single string. They +are also saved to disk in this format. Code should only store them in a +deserialized form if the signature has been verified, the cert chain +verified, and the limitations accumulated. + + +The membership card is just the following: + + membership_card = make_message(account_server_privkey, + {'account': account_number, + 'before': time.time() + 1*MONTH, + 'delegate_key': client_pubkey}) + +This card is provided on demand by the given user's Account facet, for +whatever pubkey they submit. + +When a client learns about a new storage server, they create a new receiver +object (and stash the peerid in it), and submit the following message to the +RIStorageServerWelcome.get_personal_facet() method: + + mymsg = make_message(client_privkey, {'furl_to': receiver_furl}) + send(membership_card, mymsg) + +(note that the receiver_furl will probably not have a routeable address, but +this won't matter because the client is already attached, so foolscap can use +the existing connection.) + +The server will validate the cert chain (see below) and wind up with a +complete list of limitations that are to be applied to the facet it will +provide to the caller. This list must combine limitations from the entire +chain: in particular it must enforce the account= limitation from the +membership card. + +The server will then serialize this limitation dict into a string, compute a +fixed-size HMAC code using a server-private secret, then base32 encode the +(hmac+limitstring) value (and prepend a "0-" version indicator). The +resulting string is used as the swissnum portion of the FURL that is sent to +the furl_to target. + +Later, when the client tries to dereference this FURL, a +Tub.registerNameLookupHandler hook will notice the attempt, claim the "0-" +namespace, base32decode the string, check the HMAC, decode the limitation +dict, then create and return an RIStorageServer facet with these limitations. + +The client should cache the (peerid, FURL) mapping in persistent storage. +Later, when it learns about this storage server again, it will use the cached +FURL instead of signing another message. If the getReference or the storage +operation fails with StorageAuthorityExpiredError, the cache entry should be +removed and the client should sign a new message to obtain a new one. + + (security note: an evil storage server can take 'mymsg' and present it to + someone else, but other servers will only send the resulting authority to + the client's receiver_furl, so the evil server cannot benefit from this. The + receiver object has the serverid curried into it, so the evil server can + only affect the client's mapping for this one serverid, not anything else, + so the server cannot hurt the client in any way other than denying service + to itself. It might be a good idea to include serverid= in the message, but + it isn't clear that it really helps anything). + +When the client wants to use a Helper, it needs to delegate some amount of +storage authority to the helper. The first phase has the client send the +storage index to the helper, so it can query servers and decide whether the +file needs to be uploaded or not. If it decides yes, the Helper creates a new +Uploader object and a receiver object, and sends the Uploader liveref and the +receiver FURL to the client. + +The client then creates a message for the helper to use: + + helper_msg = make_message(client_privkey, {'furl_to': helper_rx_furl, + 'SI': storage_index, + 'before': time.time() + 1*DAY, #? + 'server_size': filesize/k+overhead, + }) + +The client then sends (membership_card, helper_msg) to the helper. The Helper +sends (membership_card, helper_msg) to each storage server that it needs to +use for the upload. This gives the Helper access to a limited facet on each +storage server. This facet gives the helper the authority to upload data for +a specific storage index, for a limited time, using leases that are tagged by +the user's account number. The helper cannot use the client's storage +authority for any other file. The size limit prevents the helper from storing +some other (larger) file of its own using this authority. The time +restriction allows the storage servers to expire their 'server_size' table +entry quickly, and prevents the helper from hanging on to the storage +authority indefinitely. + +The Helper only gets one furl_to target, which must be used for multiple SS +peerids. The helper's receiver must parse the FURL that gets returned to +determine which server is which. [problems: an evil server could deliver a +bogus FURL which points to a different server. The Helper might reject the +real server's good FURL as a duplicate. This allows an evil server to block +access to a good server. Queries could be sent sequentially, which would +partially mitigate this problem (an evil server could send multiple +requests). Better: if the cert-chain send message could include a nonce, +which is supposed to be returned with the FURL, then the helper could use +this to correlate sends and receives.] + +=== repair caps === + +There are three basic approaches to provide a Repairer with the storage +authority that it needs. The first is to give the Repairer complete +authority: allow it to place leases for whatever account number it wishes. +This is simple and requires the least overhead, but of course it give the +Repairer the ability to abuse everyone's quota. The second is to give the +Repairer no user authority: instead, give the repairer its own account, and +build it to keep track of which leases it is holding on behalf of one of its +customers. This repairer will slowly accumulate quota space over time, as it +creates new shares to replace ones that have decayed. Eventually, when the +client comes back online, the client should establish its own leases on these +new shares and allow the repairer to cancel its temporary ones. + +The third approach is in between the other two: give the repairer some +limited authority over the customer's account, but not enough to let it +consume the user's whole quota. + +To create the storage-authority portion of a (one-month) repair-cap, the +client creates a new DSA keypair (repair_privkey, repair_pubkey), and then +creates a signed message and bundles it into the repaircap: + + repair_msg = make_message(client_privkey, {'delegate_key': repair_pubkey, + 'SI': storage_index, + 'UEB_hash': file_ueb_hash}) + repair_cap = (verify_cap, repair_privkey, (membership_card, repair_msg)) + +This gives the holder of the repair cap a time-limited authority to upload +shares for the given storage index which contain the given data. This +prohibits the repair-cap from being used to upload or repair any other file. + +When the repairer needs to upload a new share, it will use the delegated key +to create its own signed message: + + upload_msg = make_message(repair_privkey, {'furl_to': repairer_rx_furl}) + send(membership_card, repair_msg, upload_msg) + +The biggest problem with the low-authority approaches is the expiration time +of the membership card, which limits the duration for which the repair-cap +authority is valid. It would be nice if repair-caps could last a long time, +years perhaps, so that clients can be offline for a similar period of time. +However to retain a reasonable revocation interval for users, the membership +card's before= timeout needs to be closer to a month. [it might be reasonable +to use some sort of rights-amplification: the repairer has a special cert +which allows it to remove the before= value from a chain]. + + +=== chain verification === + +The server will create a chain that starts with the AS's certificate: an +unsigned message which derives its authority from being manually placed in +the SS's configdir. The only limitation in the AS certificate will be on some +kind of meta-account, in case we want to use multiple account servers and +allow their account numbers to live in distinct number spaces (think +sub-accounts or business partners to buy storage in bulk and resell it to +users). The rest of the chain comes directly from what the client sent. + +The server walks the chain, keeping an accumulated limitations dictionary +along the way. At each step it knows the pubkey that was delegated by the +previous step. + +== client config == + +Clients are configured with an Account FURL that points to a private facet on +the Account Server. The client generates a private key at startup. It sends +the pubkey to the AS facet, which will return a signed delegate_key message +(the "membership card") that grants the client's privkey any storage +authority it wishes (as long as the account number is set to a specific +value). + +The client stores this membership card in private/membership.cert . + + +RIStorageServer messages will accept an optional account= argument. If left +unspecified, the value is taken from the limitations that were curried into +the SS facet. In all cases, the value used must meet those limitations. The +value must not be None: Helpers/Repairers or other super-powered storage +clients are obligated to specify an account number. + +== server config == + +Storage servers are configured with an unsigned root authority message. This +is like the output of make_message(account_server_privkey, {}) but has empty +'signature' and 'pubkey' strings. This root goes into +NODEDIR/storage_authority_root.cert . It is prepended to all chains that +arrive. + + [if/when we accept multiple authorities, storage_authority_root.cert will + turn into a storage_authority_root/ directory with *.cert files, and each + arriving chain will cause a search through these root certs for a matching + pubkey. The empty limitations will be replaced by {domain=X}, which is used + as a sort of meta-account.. the details depend upon whether we express + account numbers as an int (with various ranges) or as a tuple] + +The root authority message is published by the Account Server through its web +interface, and also into a local file: NODEDIR/storage_authority_root.cert . +The admin of the storage server is responsible for copying this file into +place, thus enabling clients to use storage services. + + +---------------------------------------- + +-- Text beyond this point is out-of-date, and exists purely for background -- + +Each storage server offers a "public storage port", which only accepts signed +messages. The Introducer mechanism exists to give clients a reference to a +set of these public storage ports. All clients get access to the same ports. +If clients did all their work themselves, these public storage ports would be +enough, and no further code would be necessary (all storage requests would we +signed the same way). + +Fundamentally, each storage request must be signed by the account's private +key, giving the SS an authenticated Account Number to go with the request. +This is used to index the correct cell in the lease matrix. The holder of the +account privkey is allowed to manipulate their column of the matrix in any +way they like: add leases, renew leases, delete leases. (TODO: for +reconcilliation purposes, they should also be able to enumerate leases). The +storage request is sent in the form of a signed request message, accompanied +by the membership card. For example: + + req = SIGN("allocate SI=123 SSID=abc", accountprivkey) , membership_card + -> RemoteBucketWriter reference + +Upon receipt of this request, the storage server will return a reference to a +RemoteBucketWriter object, which the client can use to fill and close the +bucket. The SS must perform two DSA signature verifications before accepting +this request. The first is to validate the membership card: the Account +Server's pubkey is used to verify the membership card's signature, from which +an account pubkey and account# is extracted. The second is to validate the +request: the account pubkey is used to verify the request signature. If both +are valid, the full request (with account# and storage index) is delivered to +the internal StorageServer object. + +Note that the signed request message includes the Storage Server's node ID, +to prevent this storage server from taking the signed message and echoing to +other storage servers. Each SS will ignore any request that is not addressed +to the right SSID. Also note that the SI= and SSID= fields may contain +wildcards, if the signing client so chooses. + +== Caching Signature Verification == + +We add some complexity to this simple model to achieve two goals: to enable +fine-grained delegation of storage capabilities (specifically for renewers +and repairers), and to reduce the number of public-key crypto operations that +must be performed. + +The first enhancement is to allow the SS to cache the results of the +verification step. To do this, the client creates a signed message which asks +the SS to return a FURL of an object which can be used to execute further +operations *without* a DSA signature. The FURL is expected to contain a +MAC'ed string that contains the account# and the argument restrictions, +effectively currying a subset of arguments into the RemoteReference. Clients +which do all their operations themselves would use this to obtain a private +storage port for each public storage port, stashing the FURLs in a local +table, and then later storage operations would be done to those FURLs instead +of creating signed requests. For example: + + req = SIGN("FURL(allocate SI=* SSID=abc)", accountprivkey), membership_card + -> FURL + Tub.getReference(FURL).allocate(SI=123) -> RemoteBucketWriter reference + +== Renewers and Repairers + +A brief digression is in order, to motivate the other enhancement. The +"manifest" is a list of caps, one for each node that is reachable from the +user's root directory/directories. The client is expected to generate the +manifest on a periodic basis (perhaps once a day), and to keep track of which +files/dirnodes have been added and removed. Items which have been removed +must be explicitly dereferenced to reclaim their storage space. For grids +which use per-file lease timers, the manifest is used to drive the Renewer: a +process which renews the lease timers on a periodic basis (perhaps once a +week). The manifest can also be used to drive a Checker, which in turn feeds +work into the Repairer. + +The manifest should contain the minimum necessary authority to do its job, +which generally means it contains the "verify cap" for each node. For +immutable files, the verify cap contains the storage index and the UEB hash: +enough information to retrieve and validate the ciphertext but not enough to +decrypt it. For mutable files, the verify cap contains the storage index and +the pubkey hash, which also serves to retrieve and validate ciphertext but +not decrypt it. + +If the client does its own Renewing and Repairing, then a verifycap-based +manifest is sufficient. However, if the user wants to be able to turn their +computer off for a few months and still keep their files around, they need to +delegate this job off to some other willing node. In a commercial network, +there will be centralized (and perhaps trusted) Renewer/Repairer nodes, but +in a friendnet these may not be available, and the user will depend upon one +of their friends being willing to run this service for them while they are +away. In either of these cases, the verifycaps are not enough: the Renewer +will need additional authority to renew the client's leases, and the Repairer +will need the authority to create new shares (in the client's name) when +necessary. + +A trusted central service could be given all-account superpowers, allowing it +to exercise storage authority on behalf of all users as it pleases. If this +is the case, the verifycaps are sufficient. But if we desire to grant less +authority to the Renewer/Repairer, then we need a mechanism to attenuate this +authority. + +The usual objcap approach is to create a proxy: an intermediate object which +itself is given full authority, but which is unwilling to exercise more than +a portion of that authority in response to incoming requests. The +not-fully-trusted service is then only given access to the proxy, not the +final authority. For example: + + class Proxy(RemoteReference): + def __init__(self, original, storage_index): + self.original = original + self.storage_index = storage_index + def remote_renew_leases(self): + return self.original.renew_leases(self.storage_index) + renewer.grant(Proxy(target, "abcd")) + +But this approach interposes the proxy in the calling chain, requiring the +machine which hosts the proxy to be available and on-line at all times, which +runs opposite to our use case (turning the client off for a month). + +== Creating Attenuated Authorities == + +The other enhancement is to use more public-key operations to allow the +delegation of reduced authority to external helper services. Specifically, we +want to give then Renewer the ability to renew leases for a specific file, +rather than giving it lease-renewal power for all files. Likewise, the +Repairer should have the ability to create new shares, but only for the file +that is being repaired, not for unrelated files. + +If we do not mind giving the storage servers the ability to replay their +inbound message to other storage servers, then the client can simply generate +a signed message with a wildcard SSID= argument and leave it in the care of +the Renewer or Repairer. For example, the Renewer would get: + + SIGN("renew-lease SI=123 SSID=*", accountprivkey), membership_card + +Then, when the Renewer needed to renew a lease, it would deliver this signed +request message to the storage server. The SS would verify the signatures +just as if the message came from the original client, find them good, and +perform the desired operation. With this approach, the manifest that is +delivered to the remote Renewer process needs to include a signed +lease-renewal request for each file: we use the term "renew-cap" for this +combined (verifycap + signed lease-renewal request) message. Likewise the +"repair-cap" would be the verifycap plus a signed allocate-bucket message. A +renew-cap manifest would be enough for a remote Renewer to do its job, a +repair-cap manifest would provide a remote Repairer with enough authority, +and a cancel-cap manifest would be used for a remote Canceller (used, e.g., +to make sure that file has been dereferenced even if the client does not +stick around long enough to track down and inform all of the storage servers +involved). + +The only concern is that the SS could also take this exact same renew-lease +message and deliver it to other storage servers. This wouldn't cause a +concern for mere lease renewal, but the allocate-share message might be a bit +less comfortable (you might not want to grant the first storage server the +ability to claim space in your name on all other storage servers). + +Ideally we'd like to send a different message to each storage server, each +narrowed in scope to a single SSID, since then none of these messages would +be useful on any other SS. If the client knew the identities of all the +storage servers in the system ahead of time, it might create a whole slew of +signed messages, but a) this is a lot of signatures, only a fraction of which +will ever actually be used, and b) new servers might be introduced after the +manifest is created, particularly if we're talking about repair-caps instead +of renewal-caps. The Renewer can't generate these one-per-SSID messages from +the SSID=* message, because it doesn't have a privkey to make the correct +signatures. So without some other mechanism, we're stuck with these +relatively coarse authorities. + +If we want to limit this sort of authority, then we need to introduce a new +method. The client begins by generating a new DSA keypair. Then it signs a +message that declares the new pubkey to be valid for a specific subset of +storage operations (such as "renew-lease SI=123 SSID=*"). Then it delivers +the new privkey, the declaration message, and the membership card to the +Renewer. The renewer uses the new privkey to sign its own one-per-SSID +request message for each server, then sends the (signed request, declaration, +membership card) triple to the server. The server needs to perform three +verification checks per message: first the membership card, then the +declaration message, then the actual request message. + +== Other Enhancements == + +If a given authority is likely to be used multiple times, the same +give-me-a-FURL trick can be used to cut down on the number of public key +operations that must be performed. This is trickier with the per-SI messages. + +When storing the manifest, things like the membership card should be +amortized across a set of common entries. An isolated renew-cap needs to +contain the verifycap, the signed renewal request, and the membership card. +But a manifest with a thousand entries should only include one copy of the +membership card. + +It might be sensible to define a signed renewal request that grants authority +for a set of storage indicies, so that the signature can be shared among +several entries (to save space and perhaps processing time). The request +could include a Bloom filter of authorized SI values: when the request is +actually sent to the server, the renewer would add a list of actual SI values +to renew, and the server would accept all that are contained in the filter. + +== Revocation == + +The lifetime of the storage authority included in the manifest's renew-caps +or repair-caps will determine the lifetime of those caps. In particular, if +we implement account revocation by using time-limited membership cards +(requiring the client to get a new card once a month), then the repair-caps +won't work for more than a month, which kind of defeats the purpose. + +A related issue is the FURL-shortcut: the MAC'ed message needs to include a +validity period of some sort, and if the client tries to use a old FURL they +should get an error message that will prompt them to try and acquire a newer +one. + +------------------------------ + +The client can produce a repair-cap manifest for a specific Repairer's +pubkey, so it can produce a signed message that includes the pubkey (instead +of needing to generate a new privkey just for this purpose). The result is +not a capability, since it can only be used by the holder of the +corresponding privkey. + +So the generic form of the storage operation message is the request (which +has all the argument values filled in), followed by a chain of +authorizations. The first authorization must be signed by the Account +Server's key. Each authorization must be signed by the key mentioned in the +previous one. Each one adds a new limitation on the power of the following +ones. The actual request is bounded by all the limitations of the chain. + +The membership card is an authorization that simply limits the account number +that can be used: "op=* SI=* SSID=* account=4 signed-by=CLIENT-PUBKEY". + +So a repair manifest created for a Repairer with pubkey ABCD could consist of +a list of verifycaps plus a single authorization (using a Bloom filter to +identify the SIs that were allowed): + + SIGN("allocate SI=[bloom] SSID=* signed-by=ABCD") + +If/when the Repairer needed to allocate a share, it would use its own privkey +to sign an additional message and send the whole list to the SS: + + request=allocate SI=1234 SSID=EEFS account=4 shnum=2 + SIGN("allocate SI=1234 SSID=EEFS", ABCD) + SIGN("allocate SI=[bloom] SSID=* signed-by=ABCD", clientkey) + membership: SIGN("op=* SI=* SSID=* account=4 signed-by=clientkey", ASkey) + [implicit]: ASkey + +---------------------------------------- + +Things would be a lot simpler if the Repairer (actually the Re-Leaser) had +everybody's account authority. + +One simplifying approach: the Repairer/Re-Leaser has its own account, and the +shares it creates are leased under that account number. The R/R keeps track +of which leases it has created for whom. When the client eventually comes +back online, it is told to perform a re-leasing run, and after that occurs +the R/R can cancel its own temporary leases. + +This would effectively transfer storage quota from the original client to the +R/R over time (as shares are regenerated by the R/R while the client remains +offline). If the R/R is centrally managed, the quota mechanism can sum the +R/R's numbers with the SS's numbers when determining how much storage is +consumed by any given account. Not quite as clean as storing the exact +information in the SS's lease tables directly, but: + + * the R/R no longer needs any special account authority (it merely needs an + accurate account number, which can be supplied by giving the client a + specific facet that is bound to that account number) + * the verify-cap manifest is sufficient to perform repair + * no extra DSA keys are necessary + * account authority could be implemented with either DSA keys or personal SS + facets: i.e. we don't need the delegability aspects of DSA keys for use by + the repair mechanism (we might still want them to simplify introduction). + +I *think* this would eliminate all that complexity of chained authorization +messages. diff --git a/docs/proposed/backupdb.txt b/docs/proposed/backupdb.txt new file mode 100644 index 00000000..c9618e6d --- /dev/null +++ b/docs/proposed/backupdb.txt @@ -0,0 +1,188 @@ += PRELIMINARY = + +This document is a description of a feature which is not yet implemented, +added here to solicit feedback and to describe future plans. This document is +subject to revision or withdrawal at any moment. Until this notice is +removed, consider this entire document to be a figment of your imagination. + += The Tahoe BackupDB = + +To speed up backup operations, Tahoe maintains a small database known as the +"backupdb". This is used to avoid re-uploading files which have already been +uploaded recently. + +This database lives in ~/.tahoe/private/backupdb.sqlite, and is a SQLite +single-file database. It is used by the "tahoe backup" command, and by the +"tahoe cp" command when the --use-backupdb option is included. + +The purpose of this database is specifically to manage the file-to-cap +translation (the "upload" step). It does not address directory updates. + +The overall goal of optimizing backup is to reduce the work required when the +source disk has not changed since the last backup. In the ideal case, running +"tahoe backup" twice in a row, with no intervening changes to the disk, will +not require any network traffic. + +This database is optional. If it is deleted, the worst effect is that a +subsequent backup operation may use more effort (network bandwidth, CPU +cycles, and disk IO) than it would have without the backupdb. + +== Schema == + +The database contains the following tables: + +CREATE TABLE version +( + version integer # contains one row, set to 0 +); + +CREATE TABLE last_upload +( + path varchar(1024), # index, this is os.path.abspath(fn) + size integer, # os.stat(fn)[stat.ST_SIZE] + mtime number, # os.stat(fn)[stat.ST_MTIME] + fileid integer +); + +CREATE TABLE caps +( + fileid integer PRIMARY KEY AUTOINCREMENT, + filecap varchar(256), # URI:CHK:... + last_uploaded timestamp, + last_checked timestamp +); + +CREATE TABLE keys_to_files +( + readkey varchar(256) PRIMARY KEY, # index, AES key portion of filecap + fileid integer +); + +Notes: if we extend the backupdb to assist with directory maintenance (see +below), we may need paths in multiple places, so it would make sense to +create a table for them, and change the last_upload table to refer to a +pathid instead of an absolute path: + +CREATE TABLE paths +( + path varchar(1024), # index + pathid integer PRIMARY KEY AUTOINCREMENT +); + +== Operation == + +The upload process starts with a pathname (like ~/.emacs) and wants to end up +with a file-cap (like URI:CHK:...). + +The first step is to convert the path to an absolute form +(/home/warner/emacs) and do a lookup in the last_upload table. If the path is +not present in this table, the file must be uploaded. The upload process is: + + 1. record the file's size and modification time + 2. upload the file into the grid, obtaining an immutable file read-cap + 3. add an entry to the 'caps' table, with the read-cap, and the current time + 4. extract the read-key from the read-cap, add an entry to 'keys_to_files' + 5. add an entry to 'last_upload' + +If the path *is* present in 'last_upload', the easy-to-compute identifying +information is compared: file size and modification time. If these differ, +the file must be uploaded. The row is removed from the last_upload table, and +the upload process above is followed. + +If the path is present but the mtime differs, the file may have changed. If +the size differs, then the file has certainly changed. The client will +compute the CHK read-key for the file by hashing its contents, using exactly +the same algorithm as the node does when it uploads a file (including +~/.tahoe/private/convergence). It then checks the 'keys_to_files' table to +see if this file has been uploaded before: perhaps the file was moved from +elsewhere on the disk. If no match is found, the file must be uploaded, so +the upload process above is follwed. + +If the read-key *is* found in the 'keys_to_files' table, then the file has +been uploaded before, but we should consider performing a file check / verify +operation to make sure we can skip a new upload. The fileid is used to +retrieve the entry from the 'caps' table, and the last_checked timestamp is +examined. If this timestamp is too old, a filecheck operation should be +performed, and the file repaired if the results are not satisfactory. A +"random early check" algorithm should be used, in which a check is performed +with a probability that increases with the age of the previous results. E.g. +files that were last checked within a month are not checked, files that were +checked 5 weeks ago are re-checked with 25% probability, 6 weeks with 50%, +more than 8 weeks are always checked. This reduces the "thundering herd" of +filechecks-on-everything that would otherwise result when a backup operation +is run one month after the original backup. The readkey can be submitted to +the upload operation, to remove a duplicate hashing pass through the file and +reduce the disk IO. In a future version of the storage server protocol, this +could also improve the "streamingness" of the upload process. + +If the file's size and mtime match, the file is considered to be unmodified, +and the last_checked timestamp from the 'caps' table is examined as above +(possibly resulting in a filecheck or repair). The --no-timestamps option +disables this check: this removes the danger of false-positives (i.e. not +uploading a new file, because it appeared to be the same as a previously +uploaded one), but increases the amount of disk IO that must be performed +(every byte of every file must be hashed to compute the readkey). + +This algorithm is summarized in the following pseudocode: + +{{{ + def backup(path): + abspath = os.path.abspath(path) + result = check_for_upload(abspath) + now = time.time() + if result == MUST_UPLOAD: + filecap = upload(abspath, key=result.readkey) + fileid = db("INSERT INTO caps (filecap, last_uploaded, last_checked)", + (filecap, now, now)) + db("INSERT INTO keys_to_files", (result.readkey, filecap)) + db("INSERT INTO last_upload", (abspath,current_size,current_mtime,fileid)) + if result in (MOVED, ALREADY_UPLOADED): + age = now - result.last_checked + probability = (age - 1*MONTH) / 1*MONTH + probability = min(max(probability, 0.0), 1.0) + if random.random() < probability: + do_filecheck(result.filecap) + if result == MOVED: + db("INSERT INTO last_upload", + (abspath, current_size, current_mtime, result.fileid)) + + + def check_for_upload(abspath): + row = db("SELECT (size,mtime,fileid) FROM last_upload WHERE path == %s" + % abspath) + if not row: + return check_moved(abspath) + current_size = os.stat(abspath)[stat.ST_SIZE] + current_mtime = os.stat(abspath)[stat.ST_MTIME] + (last_size,last_mtime,last_fileid) = row + if file_changed(current_size, last_size, current_mtime, last_mtime): + db("DELETE FROM last_upload WHERE fileid=%s" % fileid) + return check_moved(abspath) + (filecap, last_checked) = db("SELECT (filecap, last_checked) FROM caps" + + " WHERE fileid == %s" % last_fileid) + return ALREADY_UPLOADED(filecap=filecap, last_checked=last_checked) + + def file_changed(current_size, last_size, current_mtime, last_mtime): + if last_size != current_size: + return True + if NO_TIMESTAMPS: + return True + if last_mtime != current_mtime: + return True + return False + + def check_moved(abspath): + readkey = hash_with_convergence(abspath) + fileid = db("SELECT (fileid) FROM keys_to_files WHERE readkey == %s"%readkey) + if not fileid: + return MUST_UPLOAD(readkey=readkey) + (filecap, last_checked) = db("SELECT (filecap, last_checked) FROM caps" + + " WHERE fileid == %s" % fileid) + return MOVED(fileid=fileid, filecap=filecap, last_checked=last_checked) + + def do_filecheck(filecap): + health = check(filecap) + if health < DESIRED: + repair(filecap) + +}}} diff --git a/docs/proposed/denver.txt b/docs/proposed/denver.txt new file mode 100644 index 00000000..5aa9893b --- /dev/null +++ b/docs/proposed/denver.txt @@ -0,0 +1,182 @@ +The "Denver Airport" Protocol + + (discussed whilst returning robk to DEN, 12/1/06) + +This is a scaling improvement on the "Select Peers" phase of Tahoe2. The +problem it tries to address is the storage and maintenance of the 1M-long +peer list, and the relative difficulty of gathering long-term reliability +information on a useful numbers of those peers. + +In DEN, each node maintains a Chord-style set of connections to other nodes: +log2(N) "finger" connections to distant peers (the first of which is halfway +across the ring, the second is 1/4 across, then 1/8th, etc). These +connections need to be kept alive with relatively short timeouts (5s?), so +any breaks can be rejoined quickly. In addition to the finger connections, +each node must also remain aware of K "successor" nodes (those which are +immediately clockwise of the starting point). The node is not required to +maintain connections to these, but it should remain informed about their +contact information, so that it can create connections when necessary. We +probably need a connection open to the immediate successor at all times. + +Since inbound connections exist too, each node has something like 2*log2(N) +plus up to 2*K connections. + +Each node keeps history of uptime/availability of the nodes that it remains +connected to. Each message that is sent to these peers includes an estimate +of that peer's availability from the point of view of the outside world. The +receiving node will average these reports together to determine what kind of +reliability they should announce to anyone they accept leases for. This +reliability is expressed as a percentage uptime: P=1.0 means the peer is +available 24/7, P=0.0 means it is almost never reachable. + + +When a node wishes to publish a file, it creates a list of (verifierid, +sharenum) tuples, and computes a hash of each tuple. These hashes then +represent starting points for the landlord search: + + starting_points = [(sharenum,sha(verifierid + str(sharenum))) + for sharenum in range(256)] + +The node then constructs a reservation message that contains enough +information for the potential landlord to evaluate the lease, *and* to make a +connection back to the starting node: + + message = [verifierid, sharesize, requestor_furl, starting_points] + +The node looks through its list of finger connections and splits this message +into up to log2(N) smaller messages, each of which contains only the starting +points that should be sent to that finger connection. Specifically we sent a +starting_point to a finger A if the nodeid of that finger is <= the +starting_point and if the next finger B is > starting_point. Each message +sent out can contain multiple starting_points, each for a different share. + +When a finger node receives this message, it performs the same splitting +algorithm, sending each starting_point to other fingers. Eventually a +starting_point is received by a node that knows that the starting_point lies +between itself and its immediate successor. At this point the message +switches from the "hop" mode (following fingers) to the "search" mode +(following successors). + +While in "search" mode, each node interprets the message as a lease request. +It checks its storage pool to see if it can accomodate the reservation. If +so, it uses requestor_furl to contact the originator and announces its +willingness to host the given sharenum. This message will include the +reliability measurement derived from the host's counterclockwise neighbors. + +If the recipient cannot host the share, it forwards the request on to the +next successor, which repeats the cycle. Each message has a maximum hop count +which limits the number of peers which may be searched before giving up. If a +node sees itself to be the last such hop, it must establish a connection to +the originator and let them know that this sharenum could not be hosted. + +The originator sends out something like 100 or 200 starting points, and +expects to get back responses (positive or negative) in a reasonable amount +of time. (perhaps if we receive half of the responses in time T, wait for a +total of 2T for the remaining ones). If no response is received with the +timeout, either re-send the requests for those shares (to different fingers) +or send requests for completely different shares. + +Each share represents some fraction of a point "S", such that the points for +enough shares to reconstruct the whole file total to 1.0 points. I.e., if we +construct 100 shares such that we need 25 of them to reconstruct the file, +then each share represents .04 points. + +As the positive responses come in, we accumulate two counters: the capacity +counter (which gets a full S points for each positive response), and the +reliability counter (which gets S*(reliability-of-host) points). The capacity +counter is not allowed to go above some limit (like 4x), as determined by +provisioning. The node keeps adding leases until the reliability counter has +gone above some other threshold (larger but close to 1.0). + +[ at download time, each host will be able to provide the share back with + probability P times an exponential decay factor related to peer death. Sum + these probabilities to get the average number of shares that will be + available. The interesting thing is actually the distribution of these + probabilities, and what threshold you have to pick to get a sufficiently + high chance of recovering the file. If there are N identical peers with + probability P, the number of recovered shares will have a gaussian + distribution with an average of N*P and a stddev of (??). The PMF of this + function is an S-curve, with a sharper slope when N is large. The + probability of recovering the file is the value of this S curve at the + threshold value (the number of necessary shares). + + P is not actually constant across all peers, rather we assume that it has + its own distribution: maybe gaussian, more likely exponential (power law). + This changes the shape of the S-curve. Assuming that we can characterize + the distribution of P with perhaps two parameters (say meanP and stddevP), + the S-curve is a function of meanP, stddevP, N, and threshold... + + To get 99.99% or 99.999% recoverability, we must choose a threshold value + high enough to accomodate the random variations and uncertainty about the + real values of P for each of the hosts we've selected. By counting + reliability points, we are trying to estimate meanP/stddevP, so we know + which S-curve to look at. The threshold is fixed at 1.0, since that's what + erasure coding tells us we need to recover the file. The job is then to add + hosts (increasing N and possibly changing meanP/stddevP) until our + recoverability probability is as high as we want. +] + +The originator takes all acceptance messages and adds them in order to the +list of landlords that will be used to host the file. It stops when it gets +enough reliability points. Note that it does *not* discriminate against +unreliable hosts: they are less likely to have been found in the first place, +so we don't need to discriminate against them a second time. We do, however, +use the reliability points to acknowledge that sending data to an unreliable +peer is not as useful as sending it to a reliable one (there is still value +in doing so, though). The remaining reservation-acceptance messages are +cancelled and then put aside: if we need to make a second pass, we ask those +peers first. + +Shares are then created and published as in Tahoe2. If we lose a connection +during the encoding, that share is lost. If we lose enough shares, we might +want to generate more to make up for them: this is done by using the leftover +acceptance messages first, then triggering a new Chord search for the +as-yet-unaccepted sharenums. These new peers will get shares from all +segments that have not yet been finished, then a second pass will be made to +catch them up on the earlier segments. + +Properties of this approach: + the total number of peers that each node must know anything about is bounded + to something like 2*log2(N) + K, probably on the order of 50 to 100 total. + This is the biggest advantage, since in tahoe2 each node must know at least + the nodeid of all 1M peers. The maintenance traffic should be much less as a + result. + + each node must maintain open (keep-alived) connections to something like + 2*log2(N) peers. In tahoe2, this number is 0 (well, probably 1 for the + introducer). + + during upload, each node must actively use 100 connections to a random set + of peers to push data (just like tahoe2). + + The probability that any given share-request gets a response is equal to the + number of hops it travels through times the chance that a peer dies while + holding on to the message. This should be pretty small, as the message + should only be held by a peer for a few seconds (more if their network is + busy). In tahoe2, each share-request always gets a response, since they are + made directly to the target. + +I visualize the peer-lookup process as the originator creating a +message-in-a-bottle for each share. Each message says "Dear Sir/Madam, I +would like to store X bytes of data for file Y (share #Z) on a system close +to (but not below) nodeid STARTING_POINT. If you find this amenable, please +contact me at FURL so we can make arrangements.". These messages are then +bundled together according to their rough destination (STARTING_POINT) and +sent somewhere in the right direction. + +Download happens the same way: lookup messages are disseminated towards the +STARTING_POINT and then search one successor at a time from there. There are +two ways that the share might go missing: if the node is now offline (or has +for some reason lost its shares), or if new nodes have joined since the +original upload and the search depth (maximum hop count) is too small to +accomodate the churn. Both result in the same amount of localized traffic. In +the latter case, a storage node might want to migrate the share closer to the +starting point, or perhaps just send them a note to remember a pointer for +the share. + +Checking: anyone who wishes to do a filecheck needs to send out a lookup +message for every potential share. These lookup messages could have a higher +search depth than usual. It would be useful to know how many peers each +message went through before being returned: this might be useful to perform +repair by instructing the old host (which is further from the starting point +than you'd like) to push their share closer towards the starting point. diff --git a/docs/proposed/mutable-DSA.svg b/docs/proposed/mutable-DSA.svg new file mode 100644 index 00000000..6870d834 --- /dev/null +++ b/docs/proposed/mutable-DSA.svg @@ -0,0 +1,1144 @@ + + + + + + + + + + + + + + + + + + + image/svg+xml + + + + + + + + + DSA private key + (256 bit string) + + DSA public key + (2048+ bit string) + + + salt + + + read-cap + + + + + + + + + + + + + + + 192 + + + + 64 + + (math) + + + AES + + + + H + H + + encryptedsalt + + + data + + crypttext + + + AES + readkey + + + + + shares + + + otherstuff + + + DSA + + + + + + signature + private key + + 256 + + + pubkey hash + 256 + + FEC + Hmerkletrees + H + write-cap + + + + storage index + + + + 64 + SI:A + + + + 64 + SI:B + + + + verify cap + H + H + H + H + H + H + + + pubkey hash + 256 + + + + 64 + SI:A + + + : stored in share + H + + + deep-verify cap + + 192 + 64 + + + H + H + + + + AES + writekey + + + + + AES + deepverifykey + + H + + diff --git a/docs/proposed/mutable-DSA.txt b/docs/proposed/mutable-DSA.txt new file mode 100644 index 00000000..73f3eb78 --- /dev/null +++ b/docs/proposed/mutable-DSA.txt @@ -0,0 +1,346 @@ + +(protocol proposal, work-in-progress, not authoritative) + +(this document describes DSA-based mutable files, as opposed to the RSA-based +mutable files that were introduced in tahoe-0.7.0 . This proposal has not yet +been implemented. Please see mutable-DSA.svg for a quick picture of the +crypto scheme described herein) + +This file shows only the differences from RSA-based mutable files to +(EC)DSA-based mutable files. You have to read and understand mutable.txt before +reading this file (mutable-DSA.txt). + +=== SDMF slots overview === + +Each SDMF slot is created with a DSA public/private key pair, using a +system-wide common modulus and generator, in which the private key is a +random 256 bit number, and the public key is a larger value (about 2048 bits) +that can be derived with a bit of math from the private key. The public key +is known as the "verification key", while the private key is called the +"signature key". + +The 256-bit signature key is used verbatim as the "write capability". This +can be converted into the 2048ish-bit verification key through a fairly cheap +set of modular exponentiation operations; this is done any time the holder of +the write-cap wants to read the data. (Note that the signature key can either +be a newly-generated random value, or the hash of something else, if we found +a need for a capability that's stronger than the write-cap). + +This results in a write-cap which is 256 bits long and can thus be expressed +in an ASCII/transport-safe encoded form (base62 encoding, fits in 72 +characters, including a local-node http: convenience prefix). + +The private key is hashed to form a 256-bit "salt". The public key is also +hashed to form a 256-bit "pubkey hash". These two values are concatenated, +hashed, and truncated to 192 bits to form the first 192 bits of the read-cap. +The pubkey hash is hashed by itself and truncated to 64 bits to form the last +64 bits of the read-cap. The full read-cap is 256 bits long, just like the +write-cap. + +The first 192 bits of the read-cap are hashed and truncated to form the first +192 bits of the "traversal cap". The last 64 bits of the read-cap are hashed +to form the last 64 bits of the traversal cap. This gives us a 256-bit +traversal cap. + +The first 192 bits of the traversal-cap are hashed and truncated to form the +first 64 bits of the storage index. The last 64 bits of the traversal-cap are +hashed to form the last 64 bits of the storage index. This gives us a 128-bit +storage index. + +The verification-cap is the first 64 bits of the storage index plus the +pubkey hash, 320 bits total. The verification-cap doesn't need to be +expressed in a printable transport-safe form, so it's ok that it's longer. + +The read-cap is hashed one way to form an AES encryption key that is used to +encrypt the salt; this key is called the "salt key". The encrypted salt is +stored in the share. The private key never changes, therefore the salt never +changes, and the salt key is only used for a single purpose, so there is no +need for an IV. + +The read-cap is hashed a different way to form the master data encryption +key. A random "data salt" is generated each time the share's contents are +replaced, and the master data encryption key is concatenated with the data +salt, then hashed, to form the AES CTR-mode "read key" that will be used to +encrypt the actual file data. This is to avoid key-reuse. An outstanding +issue is how to avoid key reuse when files are modified in place instead of +being replaced completely; this is not done in SDMF but might occur in MDMF. + +The master data encryption key is used to encrypt data that should be visible +to holders of a write-cap or a read-cap, but not to holders of a +traversal-cap. + +The private key is hashed one way to form the salt, and a different way to +form the "write enabler master". For each storage server on which a share is +kept, the write enabler master is concatenated with the server's nodeid and +hashed, and the result is called the "write enabler" for that particular +server. Note that multiple shares of the same slot stored on the same server +will all get the same write enabler, i.e. the write enabler is associated +with the "bucket", rather than the individual shares. + +The private key is hashed a third way to form the "data write key", which can +be used by applications which wish to store some data in a form that is only +available to those with a write-cap, and not to those with merely a read-cap. +This is used to implement transitive read-onlyness of dirnodes. + +The traversal cap is hashed to work the "traversal key", which can be used by +applications that wish to store data in a form that is available to holders +of a write-cap, read-cap, or traversal-cap. + +The idea is that dirnodes will store child write-caps under the writekey, +child names and read-caps under the read-key, and verify-caps (for files) or +deep-verify-caps (for directories) under the traversal key. This would give +the holder of a root deep-verify-cap the ability to create a verify manifest +for everything reachable from the root, but not the ability to see any +plaintext or filenames. This would make it easier to delegate filechecking +and repair to a not-fully-trusted agent. + +The public key is stored on the servers, as is the encrypted salt, the +(non-encrypted) data salt, the encrypted data, and a signature. The container +records the write-enabler, but of course this is not visible to readers. To +make sure that every byte of the share can be verified by a holder of the +verify-cap (and also by the storage server itself), the signature covers the +version number, the sequence number, the root hash "R" of the share merkle +tree, the encoding parameters, and the encrypted salt. "R" itself covers the +hash trees and the share data. + +The read-write URI is just the private key. The read-only URI is the read-cap +key. The deep-verify URI is the traversal-cap. The verify-only URI contains +the the pubkey hash and the first 64 bits of the storage index. + + FMW:b2a(privatekey) + FMR:b2a(readcap) + FMT:b2a(traversalcap) + FMV:b2a(storageindex[:64])b2a(pubkey-hash) + +Note that this allows the read-only, deep-verify, and verify-only URIs to be +derived from the read-write URI without actually retrieving any data from the +share, but instead by regenerating the public key from the private one. Users +of the read-only, deep-verify, or verify-only caps must validate the public +key against their pubkey hash (or its derivative) the first time they +retrieve the pubkey, before trusting any signatures they see. + +The SDMF slot is allocated by sending a request to the storage server with a +desired size, the storage index, and the write enabler for that server's +nodeid. If granted, the write enabler is stashed inside the slot's backing +store file. All further write requests must be accompanied by the write +enabler or they will not be honored. The storage server does not share the +write enabler with anyone else. + +The SDMF slot structure will be described in more detail below. The important +pieces are: + + * a sequence number + * a root hash "R" + * the data salt + * the encoding parameters (including k, N, file size, segment size) + * a signed copy of [seqnum,R,data_salt,encoding_params] (using signature key) + * the verification key (not encrypted) + * the share hash chain (part of a Merkle tree over the share hashes) + * the block hash tree (Merkle tree over blocks of share data) + * the share data itself (erasure-coding of read-key-encrypted file data) + * the salt, encrypted with the salt key + +The access pattern for read (assuming we hold the write-cap) is: + * generate public key from the private one + * hash private key to get the salt, hash public key, form read-cap + * form storage-index + * use storage-index to locate 'k' shares with identical 'R' values + * either get one share, read 'k' from it, then read k-1 shares + * or read, say, 5 shares, discover k, either get more or be finished + * or copy k into the URIs + * .. jump to "COMMON READ", below + +To read (assuming we only hold the read-cap), do: + * hash read-cap pieces to generate storage index and salt key + * use storage-index to locate 'k' shares with identical 'R' values + * retrieve verification key and encrypted salt + * decrypt salt + * hash decrypted salt and pubkey to generate another copy of the read-cap, + make sure they match (this validates the pubkey) + * .. jump to "COMMON READ" + + * COMMON READ: + * read seqnum, R, data salt, encoding parameters, signature + * verify signature against verification key + * hash data salt and read-cap to generate read-key + * read share data, compute block-hash Merkle tree and root "r" + * read share hash chain (leading from "r" to "R") + * validate share hash chain up to the root "R" + * submit share data to erasure decoding + * decrypt decoded data with read-key + * submit plaintext to application + +The access pattern for write is: + * generate pubkey, salt, read-cap, storage-index as in read case + * generate data salt for this update, generate read-key + * encrypt plaintext from application with read-key + * application can encrypt some data with the data-write-key to make it + only available to writers (used for transitively-readonly dirnodes) + * erasure-code crypttext to form shares + * split shares into blocks + * compute Merkle tree of blocks, giving root "r" for each share + * compute Merkle tree of shares, find root "R" for the file as a whole + * create share data structures, one per server: + * use seqnum which is one higher than the old version + * share hash chain has log(N) hashes, different for each server + * signed data is the same for each server + * include pubkey, encrypted salt, data salt + * now we have N shares and need homes for them + * walk through peers + * if share is not already present, allocate-and-set + * otherwise, try to modify existing share: + * send testv_and_writev operation to each one + * testv says to accept share if their(seqnum+R) <= our(seqnum+R) + * count how many servers wind up with which versions (histogram over R) + * keep going until N servers have the same version, or we run out of servers + * if any servers wound up with a different version, report error to + application + * if we ran out of servers, initiate recovery process (described below) + +==== Cryptographic Properties ==== + +This scheme protects the data's confidentiality with 192 bits of key +material, since the read-cap contains 192 secret bits (derived from an +encrypted salt, which is encrypted using those same 192 bits plus some +additional public material). + +The integrity of the data (assuming that the signature is valid) is protected +by the 256-bit hash which gets included in the signature. The privilege of +modifying the data (equivalent to the ability to form a valid signature) is +protected by a 256 bit random DSA private key, and the difficulty of +computing a discrete logarithm in a 2048-bit field. + +There are a few weaker denial-of-service attacks possible. If N-k+1 of the +shares are damaged or unavailable, the client will be unable to recover the +file. Any coalition of more than N-k shareholders will be able to effect this +attack by merely refusing to provide the desired share. The "write enabler" +shared secret protects existing shares from being displaced by new ones, +except by the holder of the write-cap. One server cannot affect the other +shares of the same file, once those other shares are in place. + +The worst DoS attack is the "roadblock attack", which must be made before +those shares get placed. Storage indexes are effectively random (being +derived from the hash of a random value), so they are not guessable before +the writer begins their upload, but there is a window of vulnerability during +the beginning of the upload, when some servers have heard about the storage +index but not all of them. + +The roadblock attack we want to prevent is when the first server that the +uploader contacts quickly runs to all the other selected servers and places a +bogus share under the same storage index, before the uploader can contact +them. These shares will normally be accepted, since storage servers create +new shares on demand. The bogus shares would have randomly-generated +write-enablers, which will of course be different than the real uploader's +write-enabler, since the malicious server does not know the write-cap. + +If this attack were successful, the uploader would be unable to place any of +their shares, because the slots have already been filled by the bogus shares. +The uploader would probably try for peers further and further away from the +desired location, but eventually they will hit a preconfigured distance limit +and give up. In addition, the further the writer searches, the less likely it +is that a reader will search as far. So a successful attack will either cause +the file to be uploaded but not be reachable, or it will cause the upload to +fail. + +If the uploader tries again (creating a new privkey), they may get lucky and +the malicious servers will appear later in the query list, giving sufficient +honest servers a chance to see their share before the malicious one manages +to place bogus ones. + +The first line of defense against this attack is the timing challenges: the +attacking server must be ready to act the moment a storage request arrives +(which will only occur for a certain percentage of all new-file uploads), and +only has a few seconds to act before the other servers will have allocated +the shares (and recorded the write-enabler, terminating the window of +vulnerability). + +The second line of defense is post-verification, and is possible because the +storage index is partially derived from the public key hash. A storage server +can, at any time, verify every public bit of the container as being signed by +the verification key (this operation is recommended as a continual background +process, when disk usage is minimal, to detect disk errors). The server can +also hash the verification key to derive 64 bits of the storage index. If it +detects that these 64 bits do not match (but the rest of the share validates +correctly), then the implication is that this share was stored to the wrong +storage index, either due to a bug or a roadblock attack. + +If an uploader finds that they are unable to place their shares because of +"bad write enabler errors" (as reported by the prospective storage servers), +it can "cry foul", and ask the storage server to perform this verification on +the share in question. If the pubkey and storage index do not match, the +storage server can delete the bogus share, thus allowing the real uploader to +place their share. Of course the origin of the offending bogus share should +be logged and reported to a central authority, so corrective measures can be +taken. It may be necessary to have this "cry foul" protocol include the new +write-enabler, to close the window during which the malicious server can +re-submit the bogus share during the adjudication process. + +If the problem persists, the servers can be placed into pre-verification +mode, in which this verification is performed on all potential shares before +being committed to disk. This mode is more CPU-intensive (since normally the +storage server ignores the contents of the container altogether), but would +solve the problem completely. + +The mere existence of these potential defenses should be sufficient to deter +any actual attacks. Note that the storage index only has 64 bits of +pubkey-derived data in it, which is below the usual crypto guidelines for +security factors. In this case it's a pre-image attack which would be needed, +rather than a collision, and the actual attack would be to find a keypair for +which the public key can be hashed three times to produce the desired portion +of the storage index. We believe that 64 bits of material is sufficiently +resistant to this form of pre-image attack to serve as a suitable deterrent. + +=== SMDF Slot Format === + +This SMDF data lives inside a server-side MutableSlot container. The server +is generally oblivious to this format, but it may look inside the container +when verification is desired. + +This data is tightly packed. There are no gaps left between the different +fields, and the offset table is mainly present to allow future flexibility of +key sizes. + + # offset size name + 1 0 1 version byte, \x01 for this format + 2 1 8 sequence number. 2^64-1 must be handled specially, TBD + 3 9 32 "R" (root of share hash Merkle tree) + 4 41 32 data salt (readkey is H(readcap+data_salt)) + 5 73 32 encrypted salt (AESenc(key=H(readcap), salt) + 6 105 18 encoding parameters: + 105 1 k + 106 1 N + 107 8 segment size + 115 8 data length (of original plaintext) + 7 123 36 offset table: + 127 4 (9) signature + 131 4 (10) share hash chain + 135 4 (11) block hash tree + 139 4 (12) share data + 143 8 (13) EOF + 8 151 256 verification key (2048bit DSA key) + 9 407 40 signature=DSAsig(H([1,2,3,4,5,6])) +10 447 (a) share hash chain, encoded as: + "".join([pack(">H32s", shnum, hash) + for (shnum,hash) in needed_hashes]) +11 ?? (b) block hash tree, encoded as: + "".join([pack(">32s",hash) for hash in block_hash_tree]) +12 ?? LEN share data +13 ?? -- EOF + +(a) The share hash chain contains ceil(log(N)) hashes, each 32 bytes long. + This is the set of hashes necessary to validate this share's leaf in the + share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes. +(b) The block hash tree contains ceil(length/segsize) hashes, each 32 bytes + long. This is the set of hashes necessary to validate any given block of + share data up to the per-share root "r". Each "r" is a leaf of the share + has tree (with root "R"), from which a minimal subset of hashes is put in + the share hash chain in (8). + +== TODO == + +Every node in a given tahoe grid must have the same common DSA moduli and +exponent, but different grids could use different parameters. We haven't +figured out how to define a "grid id" yet, but I think the DSA parameters +should be part of that identifier. In practical terms, this might mean that +the Introducer tells each node what parameters to use, or perhaps the node +could have a config file which specifies them instead.