From: Patrick R McDonald Date: Wed, 6 Jun 2012 02:35:23 +0000 (-0500) Subject: Added docs/specifications/backends/raic.rst for ticket #1760 X-Git-Url: https://git.rkrishnan.org/Site/Content/reliability?a=commitdiff_plain;h=2f9f85341368655402b3768c2cfd8e6abd3c6311;p=tahoe-lafs%2Ftahoe-lafs.git Added docs/specifications/backends/raic.rst for ticket #1760 --- diff --git a/docs/specifications/backends/raic.rst b/docs/specifications/backends/raic.rst new file mode 100644 index 00000000..90f4b806 --- /dev/null +++ b/docs/specifications/backends/raic.rst @@ -0,0 +1,405 @@ + +============================================================= +Redundant Array of Independent Clouds: Share To Cloud Mapping +============================================================= + + +Introduction +============ + +This document describes a proposed design for the mapping of LAFS shares to +objects in a cloud storage service. It also analyzes the costs for each of the +functional requirements, including network, disk, storage and API usage costs. + + +Terminology +=========== + +*LAFS share* + A Tahoe-LAFS share representing part of a file after encryption and + erasure encoding. + +*LAFS shareset* + The set of shares stored by a LAFS storage server for a given storage index. + The shares within a shareset are numbered by a small integer. + +*Cloud storage service* + A service such as Amazon S3 `²`_, Rackspace Cloud Files `³`_, + Google Cloud Storage `⁴`_, or Windows Azure `⁵`_, that provides cloud storage. + +*Cloud storage interface* + A protocol interface supported by a cloud storage service, such as the + S3 interface `⁶`_, the OpenStack Object Storage interface `⁷`_, the + Google Cloud Storage interface `⁸`_, or the Azure interface `⁹`_. There may be + multiple services implementing a given cloud storage interface. In this design, + only REST-based APIs `¹⁰`_ over HTTP will be used as interfaces. + +*Cloud object* + A file-like abstraction provided by a cloud storage service, storing a + sequence of bytes. Cloud objects are mutable in the sense that the contents + and metadata of the cloud object with a given name in a given cloud container + can be replaced. Cloud objects are called “blobs” in the Azure interface, + and “objects” in the other interfaces. + +*Cloud container* + A container for cloud objects provided by a cloud service. Cloud containers + are called “buckets” in the S3 and Google Cloud Storage interfaces, and + “containers” in the Azure and OpenStack Storage interfaces. + + +Functional Requirements +======================= + +* *Upload*: a LAFS share can be uploaded to an appropriately configured + Tahoe-LAFS storage server and the data is stored to the cloud + storage service. + + * *Scalable shares*: there is no hard limit on the size of LAFS share + that can be uploaded. + + If the cloud storage interface offers scalable files, then this could be + implemented by using that feature of the specific cloud storage + interface. Alternately, it could be implemented by mapping from the LAFS + abstraction of an unlimited-size immutable share to a set of size-limited + cloud objects. + + * *Streaming upload*: the size of the LAFS share that is uploaded + can exceed the amount of RAM and even the amount of direct attached + storage on the storage server. I.e., the storage server is required to + stream the data directly to the ultimate cloud storage service while + processing it, instead of to buffer the data until the client is finished + uploading and then transfer the data to the cloud storage service. + +* *Download*: a LAFS share can be downloaded from an appropriately + configured Tahoe-LAFS storage server, and the data is loaded from the + cloud storage service. + + * *Streaming download*: the size of the LAFS share that is + downloaded can exceed the amount of RAM and even the amount of direct + attached storage on the storage server. I.e. the storage server is + required to stream the data directly to the client while processing it, + instead of to buffer the data until the cloud storage service is finished + serving and then transfer the data to the client. + +* *Modify*: a LAFS share can have part of its contents modified. + + If the cloud storage interface offers scalable mutable files, then this + could be implemented by using that feature of the specific cloud storage + interface. Alternately, it could be implemented by mapping from the LAFS + abstraction of an unlimited-size mutable share to a set of size-limited + cloud objects. + + * *Efficient modify*: the size of the LAFS share being + modified can exceed the amount of RAM and even the amount of direct + attached storage on the storage server. I.e. the storage server is + required to download, patch, and upload only the segment(s) of the share + that are being modified, instead of to download, patch, and upload the + entire share. + +* *Tracking leases*: The Tahoe-LAFS storage server is required to track when + each share has its lease renewed so that unused shares (shares whose lease + has not been renewed within a time limit, e.g. 30 days) can be garbage + collected. This does not necessarily require code specific to each cloud + storage interface, because the lease tracking can be performed in the + storage server's generic component rather than in the component supporting + each interface. + + +Mapping +======= + +This section describes the mapping between LAFS shares and cloud objects. + +A LAFS share will be split into one or more “chunks” that are each stored in a +cloud object. A LAFS share of size `C` bytes will be stored as `ceiling(C / chunksize)` +chunks. The last chunk has a size between 1 and `chunksize` bytes inclusive. +(It is not possible for `C` to be zero, because valid shares always have a header, +so, there is at least one chunk for each share.) + +For an existing share, the chunk size is determined by the size of the first +chunk. For a new share, it is a parameter that may depend on the storage +interface. It is an error for any chunk to be larger than the first chunk, or +for any chunk other than the last to be smaller than the first chunk. +If a mutable share with total size less than the default chunk size for the +storage interface is being modified, the new contents are split using the +default chunk size. + + *Rationale*: this design allows the `chunksize` parameter to be changed for + new shares written via a particular storage interface, without breaking + compatibility with existing stored shares. All cloud storage interfaces + return the sizes of cloud objects with requests to list objects, and so + the size of the first chunk can be determined without an additional request. + +The name of the cloud object for chunk `i` > 0 of a LAFS share with storage index +`STORAGEINDEX` and share number `SHNUM`, will be + + shares/`ST`/`STORAGEINDEX`/`SHNUM.i` + +where `ST` is the first two characters of `STORAGEINDEX`. When `i` is 0, the +`.0` is omitted. + + *Rationale*: this layout maintains compatibility with data stored by the + prototype S3 backend, for which Least Authority Enterprises has existing + customers. This prototype always used a single cloud object to store each + share, with name + + shares/`ST`/`STORAGEINDEX`/`SHNUM` + + By using the same prefix “shares/`ST`/`STORAGEINDEX`/” for old and new layouts, + the storage server can obtain a list of cloud objects associated with a given + shareset without having to know the layout in advance, and without having to + make multiple API requests. This also simplifies sharing of test code between the + disk and cloud backends. + +Mutable and immutable shares will be “chunked” in the same way. + + +Rationale for Chunking +---------------------- + +Limiting the amount of data received or sent in a single request has the +following advantages: + +* It is unnecessary to write separate code to take advantage of the + “large object” features of each cloud storage interface, which differ + significantly in their design. +* Data needed for each PUT request can be discarded after it completes. + If a PUT request fails, it can be retried while only holding the data + for that request in memory. + + +Costs +===== + +In this section we analyze the costs of the proposed design in terms of network, +disk, memory, cloud storage, and API usage. + + +Network usage—bandwidth and number-of-round-trips +------------------------------------------------- + +When a Tahoe-LAFS storage client allocates a new share on a storage server, +the backend will request a list of the existing cloud objects with the +appropriate prefix. This takes one HTTP request in the common case, but may +take more for the S3 interface, which has a limit of 1000 objects returned in +a single “GET Bucket” request. + +If the share is to be read, the client will make a number of calls each +specifying the offset and length of the required span of bytes. On the first +request that overlaps a given chunk of the share, the server will make an +HTTP GET request for that cloud object. The server may also speculatively +make GET requests for cloud objects that are likely to be needed soon (which +can be predicted since reads are normally sequential), in order to reduce +latency. + +Each read will be satisfied as soon as the corresponding data is available, +without waiting for the rest of the chunk, in order to minimize read latency. + +All four cloud storage interfaces support GET requests using the +Range HTTP header. This could be used to optimize reads where the +Tahoe-LAFS storage client requires only part of a share. + +If the share is to be written, the server will make an HTTP PUT request for +each chunk that has been completed. Tahoe-LAFS clients only write immutable +shares sequentially, and so we can rely on that property to simplify the +implementation. + +When modifying shares of an existing mutable file, the storage server will +be able to make PUT requests only for chunks that have changed. +(Current Tahoe-LAFS v1.9 clients will not take advantage of this ability, but +future versions will probably do so for MDMF files.) + +In some cases, it may be necessary to retry a request (see the `Structure of +Implementation`_ section below). In the case of a PUT request, at the point +at which a retry is needed, the new chunk contents to be stored will still be +in memory and so this is not problematic. + +In the absence of retries, the maximum number of GET requests that will be made +when downloading a file, or the maximum number of PUT requests when uploading +or modifying a file, will be equal to the number of chunks in the file. + +If the new mutable share content has fewer chunks than the old content, +then the remaining cloud objects for old chunks must be deleted (using one +HTTP request each). When reading a share, the backend must tolerate the case +where these cloud objects have not been deleted successfully. + +The last write to a share will be reported as successful only when all +corresponding HTTP PUTs and DELETEs have completed successfully. + + + +Disk usage (local to the storage server) +---------------------------------------- + +It is never necessary for the storage server to write the content of share +chunks to local disk, either when they are read or when they are written. Each +chunk is held only in memory. + +A proposed change to the Tahoe-LAFS storage server implementation uses a sqlite +database to store metadata about shares. In that case the same database would +be used for the cloud backend. This would enable lease tracking to be implemented +in the same way for disk and cloud backends. + + +Memory usage +------------ + +The use of chunking simplifies bounding the memory usage of the storage server +when handling files that may be larger than memory. However, this depends on +limiting the number of chunks that are simultaneously held in memory. +Multiple chunks can be held in memory either because of pipelining of requests +for a single share, or because multiple shares are being read or written +(possibly by multiple clients). + +For immutable shares, the Tahoe-LAFS storage protocol requires the client to +specify in advance the maximum amount of data it will write. Also, a cooperative +client (including all existing released versions of the Tahoe-LAFS code) will +limit the amount of data that is pipelined, currently to 50 KiB. Since the chunk +size will be greater than that, it is possible to ensure that for each allocation, +the maximum chunk data memory usage is the lesser of two chunks, and the allocation +size. (There is some additional overhead but it is small compared to the chunk +data.) If the maximum memory usage of a new allocation would exceed the memory +available, the allocation can be delayed or possibly denied, so that the total +memory usage is bounded. + +It is not clear that the existing protocol allows allocations for mutable +shares to be bounded in general; this may be addressed in a future protocol change. + +The above discussion assumes that clients do not maliciously send large +messages as a denial-of-service attack. Foolscap (the protocol layer underlying +the Tahoe-LAFS storage protocol) does not attempt to resist denial of service. + + +Storage +------- + +The storage requirements, including not-yet-collected garbage shares, are +the same as for the Tahoe-LAFS disk backend. That is, the total size of cloud +objects stored is equal to the total size of shares that the disk backend +would store. + +Erasure coding causes the size of shares for each file to be a +factor `shares.total` / `shares.needed` times the file size, plus overhead +that is logarithmic in the file size `¹¹`_. + + +API usage +--------- + +Cloud storage backends typically charge a small fee per API request. The number of +requests to the cloud storage service for various operations is discussed under +“network usage” above. + + +Structure of Implementation +=========================== + +A generic “cloud backend”, based on the prototype S3 backend but with support +for chunking as described above, will be written. + +An instance of the cloud backend can be attached to one of several +“cloud interface adapters”, one for each cloud storage interface. These +adapters will operate only on chunks, and need not distinguish between +mutable and immutable shares. They will be a relatively “thin” abstraction +layer over the HTTP APIs of each cloud storage interface, similar to the +S3Bucket abstraction in the prototype. + +For some cloud storage services it may be necessary to transparently retry +requests in order to recover from transient failures. (Although the erasure +coding may enable a file to be retrieved even when shares are not stored by or +not readable from all cloud storage services used in a Tahoe-LAFS grid, it may +be desirable to retry cloud storage service requests in order to improve overall +reliability.) Support for this will be implemented in the generic cloud backend, +and used whenever a cloud storage adaptor reports a transient failure. Our +experience with the prototype suggests that it is necessary to retry on transient +failures for Amazon's S3 service. + +There will also be a “mock” cloud interface adaptor, based on the prototype's +MockS3Bucket. This allows tests of the generic cloud backend to be run without +a connection to a real cloud service. The mock adaptor will be able to simulate +transient and non-transient failures. + + +Known Issues +============ + +This design worsens a known “write hole” issue in Tahoe-LAFS when updating +the contents of mutable files. An update to a mutable file can require changing +the contents of multiple chunks, and if the client fails or is disconnected +during the operation the resulting state of the stored cloud objects may be +inconsistent—no longer containing all of the old version, but not yet containing +all of the new version. A mutable share can be left in an inconsistent state +even by the existing Tahoe-LAFS disk backend if it fails during a write, but +that has a smaller chance of occurrence because the current client behavior +leads to mutable shares being written to disk in a single system call. + +The best fix for this issue probably requires changing the Tahoe-LAFS storage +protocol, perhaps by extending it to use a two-phase or three-phase commit +(ticket #1755). + + + +References +=========== + +¹ omitted + +.. _²: + +² “Amazon S3” Amazon (2012) + + https://aws.amazon.com/s3/ + +.. _³: + +³ “Rackspace Cloud Files” Rackspace (2012) + + https://www.rackspace.com/cloud/cloud_hosting_products/files/ + +.. _⁴: + +⁴ “Google Cloud Storage” Google (2012) + + https://developers.google.com/storage/ + +.. _⁵: + +⁵ “Windows Azure Storage” Microsoft (2012) + + https://www.windowsazure.com/en-us/develop/net/fundamentals/cloud-storage/ + +.. _⁶: + +⁶ “Amazon Simple Storage Service (Amazon S3) API Reference: REST API” Amazon (2012) + + http://docs.amazonwebservices.com/AmazonS3/latest/API/APIRest.html + +.. _⁷: + +⁷ “OpenStack Object Storage” openstack.org (2012) + + http://openstack.org/projects/storage/ + +.. _⁸: + +⁸ “Google Cloud Storage Reference Guide” Google (2012) + + https://developers.google.com/storage/docs/reference-guide + +.. _⁹: + +⁹ “Windows Azure Storage Services REST API Reference” Microsoft (2012) + + http://msdn.microsoft.com/en-us/library/windowsazure/dd179355.aspx + +.. _¹⁰: + +¹⁰ “Representational state transfer” English Wikipedia (2012) + + https://en.wikipedia.org/wiki/Representational_state_transfer + +.. _¹¹: + +¹¹ “Performance costs for some common operations” tahoe-lafs.org (2012) + + https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/performance.rst