From: Zooko O'Whielacronx
Date: Fri, 15 Oct 2010 06:02:02 +0000 (-0700)
Subject: doc: add explanation of the motivation for the surprising and awkward API to erasure...
X-Git-Tag: trac-4800~42
X-Git-Url: https://git.rkrishnan.org/specifications/%5B/%5D%20/uri/cyclelanguage?a=commitdiff_plain;h=0c2397523b6d50cfe933b3132cefbe375eb53fae;p=tahoe-lafs%2Ftahoe-lafs.git

doc: add explanation of the motivation for the surprising and awkward API to erasure coding
---

diff --git a/src/allmydata/interfaces.py b/src/allmydata/interfaces.py
index 696c2f51..c5a47e12 100644
--- a/src/allmydata/interfaces.py
+++ b/src/allmydata/interfaces.py
@@ -1163,20 +1163,56 @@ class ICodecEncoder(Interface):
         encode(), unless of course it already happens to be an even multiple
         of required_shares in length.)
 
-        ALSO: the requirement to break up your data into 'required_shares'
-        chunks before calling encode() feels a bit surprising, at least from
-        the point of view of a user who doesn't know how FEC works. It feels
-        like an implementation detail that has leaked outside the
-        abstraction barrier. Can you imagine a use case in which the data to
-        be encoded might already be available in pre-segmented chunks, such
-        that it is faster or less work to make encode() take a list rather
-        than splitting a single string?
-
-        ALSO ALSO: I think 'inshares' is a misleading term, since encode()
-        is supposed to *produce* shares, so what it *accepts* should be
-        something other than shares. Other places in this interface use the
-        word 'data' for that-which-is-not-shares.. maybe we should use that
-        term?
+        Note: the requirement to break up your data into
+        'required_shares' chunks of exactly the right length before
+        calling encode() is surprising from point of view of a user
+        who doesn't know how FEC works. It feels like an
+        implementation detail that has leaked outside the abstraction
+        barrier. Is there a use case in which the data to be encoded
+        might already be available in pre-segmented chunks, such that
+        it is faster or less work to make encode() take a list rather
+        than splitting a single string?
+
+        Yes, there is: suppose you are uploading a file with K=64,
+        N=128, segsize=262,144. Then each in-share will be of size
+        4096. If you use this .encode() API then your code could first
+        read each successive 4096-byte chunk from the file and store
+        each one in a Python string and store each such Python string
+        in a Python list. Then you could call .encode(), passing that
+        list as "inshares". The encoder would generate the other 64
+        "secondary shares" and return to you a new list containing
+        references to the same 64 Python strings that you passed in
+        (as the primary shares) plus references to the new 64 Python
+        strings.
+
+        (You could even imagine that your code could use readv() so
+        that the operating system can arrange to get all of those
+        bytes copied from the file into the Python list of Python
+        strings as efficiently as possible instead of having a loop
+        written in C or in Python to copy the next part of the file
+        into the next string.)
+
+        On the other hand if you instead use the .encode_proposal()
+        API (above), then your code can first read in all of the
+        262,144 bytes of the segment from the file into a Python
+        string, then call .encode_proposal() passing the segment data
+        as the "data" argument. The encoder would basically first
+        split the "data" argument into a list of 64 in-shares of 4096
+        byte each, and then do the same thing that .encode() does. So
+        this would result in a little bit more copying of data and a
+        little bit higher of a "maximum memory usage" during the
+        process, although it might or might not make a practical
+        difference for our current use cases.
+
+        Note that "inshares" is a strange name for the parameter if
+        you think of the parameter as being just for feeding in data
+        to the codec. It makes more sense if you think of the result
+        of this encoding as being the set of shares from inshares plus
+        an extra set of "secondary shares" (or "check shares"). It is
+        a surprising name! If the API is going to be surprising then
+        the name should be surprising. If we switch to
+        encode_proposal() above then we should also switch to an
+        unsurprising name.
 
         'desired_share_ids', if provided, is required to be a sequence of
         ints, each of which is required to be >= 0 and < max_shares. If not
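
The trade-off the new docstring text describes (passing pre-segmented chunks to .encode() versus handing one string to .encode_proposal()) can be sketched in Python. This is only an illustration under stated assumptions, not the Tahoe-LAFS codec: toy_encode, toy_encode_proposal, and read_inshares are hypothetical stand-ins, the "secondary shares" are placeholders rather than real erasure-coded check blocks, and Python 3 bytes objects are used where the docstring speaks of Python strings.

```python
import io

# Parameters taken from the docstring's example.
K, N, SEGSIZE = 64, 128, 262144
SHARE_SIZE = SEGSIZE // K  # 4096 bytes per in-share


def toy_encode(inshares):
    """Hypothetical stand-in for .encode(); NOT real erasure coding.

    Returns the K primary shares (the very same objects that were passed
    in) followed by N - K placeholder "secondary shares".
    """
    assert len(inshares) == K
    assert all(len(s) == SHARE_SIZE for s in inshares)
    # Placeholder check shares: each primary share XORed with 0xFF.
    # A real codec would compute FEC check blocks here instead.
    secondary = [bytes(b ^ 0xFF for b in s) for s in inshares[: N - K]]
    return list(inshares) + secondary


def toy_encode_proposal(data):
    """Hypothetical stand-in for .encode_proposal(): split, then encode."""
    assert len(data) == SEGSIZE
    # Each slice is a fresh 4096-byte copy; this is the extra copying
    # (and extra peak memory) that the docstring mentions.
    inshares = [data[i:i + SHARE_SIZE] for i in range(0, SEGSIZE, SHARE_SIZE)]
    return toy_encode(inshares)


def read_inshares(f):
    """Read one segment from f as K separate SHARE_SIZE-byte chunks."""
    return [f.read(SHARE_SIZE) for _ in range(K)]


# One segment of sample data, held in memory to stand in for a file.
segment = bytes(range(256)) * (SEGSIZE // 256)

# Style 1: read pre-segmented chunks and pass the list straight in.
chunks = read_inshares(io.BytesIO(segment))
shares_a = toy_encode(chunks)

# Style 2: read the whole segment, let the encoder split it.
shares_b = toy_encode_proposal(io.BytesIO(segment).read(SEGSIZE))
```

In the first style the primary shares in the result are the same objects the caller read from the file, so no per-share copy is made after reading. In the second style the segment is held once as a whole and once more as 64 slices while encoding runs.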