=============================================================
Redundant Array of Independent Clouds: Share To Cloud Mapping
=============================================================

This document describes a proposed design for the mapping of LAFS shares to
objects in a cloud storage service. It also analyzes the costs for each of the
functional requirements, including network, disk, memory, cloud storage, and
API usage costs.

Terminology
===========

*LAFS share*
  A Tahoe-LAFS share representing part of a file after encryption and
  erasure coding.

*LAFS shareset*
  The set of shares stored by a LAFS storage server for a given storage index.
  The shares within a shareset are numbered by a small integer.

*Cloud storage service*
  A service such as Amazon S3 `²`_, Rackspace Cloud Files `³`_,
  Google Cloud Storage `⁴`_, or Windows Azure `⁵`_, that provides cloud storage.

*Cloud storage interface*
  A protocol interface supported by a cloud storage service, such as the
  S3 interface `⁶`_, the OpenStack Object Storage interface `⁷`_, the
  Google Cloud Storage interface `⁸`_, or the Azure interface `⁹`_. There may be
  multiple services implementing a given cloud storage interface. In this design,
  only REST-based APIs `¹⁰`_ over HTTP will be used as interfaces.

*Cloud object*
  A file-like abstraction provided by a cloud storage service, storing a
  sequence of bytes. Cloud objects are mutable in the sense that the contents
  and metadata of the cloud object with a given name in a given cloud container
  can be replaced. Cloud objects are called “blobs” in the Azure interface,
  and “objects” in the other interfaces.

*Cloud container*
  A container for cloud objects provided by a cloud storage service. Cloud
  containers are called “buckets” in the S3 and Google Cloud Storage interfaces,
  and “containers” in the Azure and OpenStack Object Storage interfaces.

Functional Requirements
=======================

* *Upload*: a LAFS share can be uploaded to an appropriately configured
  Tahoe-LAFS storage server, and the data is stored in the cloud
  storage service.

* *Scalable shares*: there is no hard limit on the size of a LAFS share
  that can be stored.

  If the cloud storage interface offers scalable files, then this could be
  implemented by using that feature of the specific cloud storage
  interface. Alternatively, it could be implemented by mapping from the LAFS
  abstraction of an unlimited-size immutable share to a set of size-limited
  cloud objects.

* *Streaming upload*: the size of the LAFS share that is uploaded
  can exceed the amount of RAM and even the amount of direct attached
  storage on the storage server. That is, the storage server is required to
  stream the data directly to the ultimate cloud storage service while
  processing it, rather than buffering the data until the client is finished
  uploading and then transferring the data to the cloud storage service.

* *Download*: a LAFS share can be downloaded from an appropriately
  configured Tahoe-LAFS storage server, and the data is loaded from the
  cloud storage service.

* *Streaming download*: the size of the LAFS share that is
  downloaded can exceed the amount of RAM and even the amount of direct
  attached storage on the storage server. That is, the storage server is
  required to stream the data directly to the client while processing it,
  rather than buffering the data until the cloud storage service is finished
  serving and then transferring the data to the client.

* *Modify*: a LAFS share can have part of its contents modified.

  If the cloud storage interface offers scalable mutable files, then this
  could be implemented by using that feature of the specific cloud storage
  interface. Alternatively, it could be implemented by mapping from the LAFS
  abstraction of an unlimited-size mutable share to a set of size-limited
  cloud objects.

* *Efficient modify*: the size of the LAFS share being
  modified can exceed the amount of RAM and even the amount of direct
  attached storage on the storage server. That is, the storage server is
  required to download, patch, and upload only the segment(s) of the share
  that are being modified, rather than downloading, patching, and uploading
  the entire share.

* *Tracking leases*: the Tahoe-LAFS storage server is required to track when
  each share has its lease renewed, so that unused shares (shares whose lease
  has not been renewed within a time limit, e.g. 30 days) can be garbage
  collected. This does not necessarily require code specific to each cloud
  storage interface, because the lease tracking can be performed in the
  storage server's generic component rather than in the component supporting
  each cloud storage interface.

Mapping
=======

This section describes the mapping between LAFS shares and cloud objects.

A LAFS share will be split into one or more “chunks” that are each stored in a
cloud object. A LAFS share of size `C` bytes will be stored as `ceiling(C / chunksize)`
chunks. The last chunk has a size between 1 and `chunksize` bytes inclusive.
(It is not possible for `C` to be zero, because valid shares always have a header,
so there is at least one chunk for each share.)

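As an illustrative sketch of the arithmetic above (the helper names here are
hypothetical, not part of the proposed interface)::

  def num_chunks(share_size, chunksize):
      """Number of cloud objects for a share of share_size bytes (> 0)."""
      assert share_size > 0, "valid shares always have a header"
      return -(-share_size // chunksize)   # ceiling division

  def last_chunk_size(share_size, chunksize):
      """Size of the final chunk, between 1 and chunksize inclusive."""
      remainder = share_size % chunksize
      return remainder if remainder != 0 else chunksize

  # Example: a 25 MiB share with a 10 MiB chunk size occupies 3 cloud
  # objects, and the last chunk holds 5 MiB.
  assert num_chunks(25 * 2**20, 10 * 2**20) == 3
  assert last_chunk_size(25 * 2**20, 10 * 2**20) == 5 * 2**20
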
For an existing share, the chunk size is determined by the size of the first
chunk. For a new share, it is a parameter that may depend on the storage
interface. It is an error for any chunk to be larger than the first chunk, or
for any chunk other than the last to be smaller than the first chunk.
If a mutable share with total size less than the default chunk size for the
storage interface is being modified, the new contents are split using the
default chunk size.

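A minimal sketch of these validity rules (the helper is hypothetical)::

  def check_chunk_sizes(chunk_sizes):
      """Validate the ordered sizes of an existing share's chunks, and
      return the share's chunk size (the size of its first chunk)."""
      chunksize = chunk_sizes[0]
      for i, size in enumerate(chunk_sizes):
          if size > chunksize:
              raise ValueError("chunk %d is larger than the first chunk" % i)
          if size < chunksize and i < len(chunk_sizes) - 1:
              raise ValueError("non-final chunk %d is smaller than the first chunk" % i)
      return chunksize
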
*Rationale*: this design allows the `chunksize` parameter to be changed for
new shares written via a particular storage interface, without breaking
compatibility with existing stored shares. All cloud storage interfaces
return the sizes of cloud objects with requests to list objects, and so
the size of the first chunk can be determined without an additional request.

The name of the cloud object for chunk `i` > 0 of a LAFS share with storage index
`STORAGEINDEX` and share number `SHNUM` will be

  shares/`ST`/`STORAGEINDEX`/`SHNUM.i`

where `ST` is the first two characters of `STORAGEINDEX`. When `i` is 0, the
`.i` suffix is omitted.

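A sketch of this naming rule (the function name and the short placeholder
storage index are illustrative only)::

  def chunk_object_name(storageindex, shnum, i):
      """Cloud object name for chunk i of share shnum; the .i suffix
      is omitted for chunk 0."""
      prefix = "shares/%s/%s/" % (storageindex[:2], storageindex)
      if i == 0:
          return prefix + str(shnum)
      return prefix + "%d.%d" % (shnum, i)

  assert chunk_object_name("aabbccdd", 3, 0) == "shares/aa/aabbccdd/3"
  assert chunk_object_name("aabbccdd", 3, 2) == "shares/aa/aabbccdd/3.2"
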
*Rationale*: this layout maintains compatibility with data stored by the
prototype S3 backend, for which Least Authority Enterprises has existing
customers. This prototype always used a single cloud object to store each
share, with the name

  shares/`ST`/`STORAGEINDEX`/`SHNUM`

By using the same prefix “shares/`ST`/`STORAGEINDEX`/” for old and new layouts,
the storage server can obtain a list of cloud objects associated with a given
shareset without having to know the layout in advance, and without having to
make multiple API requests. This also simplifies sharing of test code between the
disk and cloud backends.

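The following sketch shows how a listing under that prefix could be grouped by
share number without knowing the layout in advance (all names are hypothetical)::

  def parse_shareset_listing(object_names, storageindex):
      """Map each share number to its set of chunk indices, handling both
      the old single-object layout and the new chunked layout."""
      prefix = "shares/%s/%s/" % (storageindex[:2], storageindex)
      chunks = {}
      for name in object_names:
          shnum, _, i = name[len(prefix):].partition(".")
          chunks.setdefault(int(shnum), set()).add(int(i) if i else 0)
      return chunks

  listing = ["shares/aa/aabbccdd/3", "shares/aa/aabbccdd/3.1"]
  assert parse_shareset_listing(listing, "aabbccdd") == {3: {0, 1}}
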
Mutable and immutable shares will be “chunked” in the same way.

Rationale for Chunking
----------------------

Limiting the amount of data received or sent in a single request has the
following advantages:

* It is unnecessary to write separate code to take advantage of the
  “large object” features of each cloud storage interface, which differ
  significantly in their design.

* Data needed for each PUT request can be discarded after it completes.
  If a PUT request fails, it can be retried while only holding the data
  for that request in memory.

Costs
=====

In this section we analyze the costs of the proposed design in terms of network,
disk, memory, cloud storage, and API usage.

Network usage: bandwidth and number-of-round-trips
--------------------------------------------------

When a Tahoe-LAFS storage client allocates a new share on a storage server,
the backend will request a list of the existing cloud objects with the
appropriate prefix. This takes one HTTP request in the common case, but may
take more for the S3 interface, which has a limit of 1000 objects returned in
a single “GET Bucket” request.

If the share is to be read, the client will make a number of calls, each
specifying the offset and length of the required span of bytes. On the first
request that overlaps a given chunk of the share, the server will make an
HTTP GET request for that cloud object. The server may also speculatively
make GET requests for cloud objects that are likely to be needed soon (which
can be predicted since reads are normally sequential), in order to reduce
latency.

Each read will be satisfied as soon as the corresponding data is available,
without waiting for the rest of the chunk, in order to minimize read latency.

All four cloud storage interfaces support GET requests using the
Range HTTP header. This could be used to optimize reads where the
Tahoe-LAFS storage client requires only part of a share.

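For instance, a partial read of one chunk could be expressed with a standard
Range header; the URL and the omitted authentication are placeholders::

  import urllib.request

  def get_byte_range(url, first, last):
      """Fetch bytes first..last (inclusive) of a cloud object.
      Authentication headers are omitted for brevity."""
      req = urllib.request.Request(
          url, headers={"Range": "bytes=%d-%d" % (first, last)})
      with urllib.request.urlopen(req) as resp:
          # A server honouring the Range header replies 206 Partial Content.
          return resp.read()
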
If the share is to be written, the server will make an HTTP PUT request for
each chunk that has been completed. Tahoe-LAFS clients only write immutable
shares sequentially, and so we can rely on that property to simplify the
implementation.

When modifying shares of an existing mutable file, the storage server will
be able to make PUT requests only for chunks that have changed.
(Current Tahoe-LAFS v1.9 clients will not take advantage of this ability, but
future versions will probably do so for MDMF files.)

In some cases, it may be necessary to retry a request (see the `Structure of
Implementation`_ section below). In the case of a PUT request, at the point
at which a retry is needed, the new chunk contents to be stored will still be
in memory and so this is not problematic.

In the absence of retries, the maximum number of GET requests that will be made
when downloading a file, or the maximum number of PUT requests when uploading
or modifying a file, will be equal to the number of chunks in the file.

If the new mutable share content has fewer chunks than the old content,
then the remaining cloud objects for old chunks must be deleted (using one
HTTP request each). When reading a share, the backend must tolerate the case
where these cloud objects have not been deleted successfully.

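A sketch of computing which old chunk objects are left over after a shrinking
write (the helper name is hypothetical)::

  def chunk_indices_to_delete(old_size, new_size, chunksize):
      """Chunk indices whose cloud objects become unused when a mutable
      share shrinks from old_size to new_size bytes."""
      old_chunks = -(-old_size // chunksize)   # ceiling division
      new_chunks = -(-new_size // chunksize)
      return range(new_chunks, old_chunks)

  # Shrinking a share from 5 chunks to 2 leaves chunks 2, 3 and 4 to delete.
  assert list(chunk_indices_to_delete(50, 20, 10)) == [2, 3, 4]
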
The last write to a share will be reported as successful only when all
corresponding HTTP PUTs and DELETEs have completed successfully.

Disk usage (local to the storage server)
----------------------------------------

It is never necessary for the storage server to write the content of share
chunks to local disk, either when they are read or when they are written. Each
chunk is held only in memory.

A proposed change to the Tahoe-LAFS storage server implementation uses an SQLite
database to store metadata about shares. In that case the same database would
be used for the cloud backend. This would enable lease tracking to be implemented
in the same way for disk and cloud backends.

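As a sketch only (the actual schema is not specified in this document; the
table and function names are hypothetical), lease renewal times might be
tracked like this::

  import sqlite3, time

  conn = sqlite3.connect("shares-example.sqlite")   # hypothetical filename
  conn.execute("""CREATE TABLE IF NOT EXISTS leases
                  (storage_index TEXT, shnum INTEGER, renewal_time REAL,
                   PRIMARY KEY (storage_index, shnum))""")

  def renew_lease(storage_index, shnum):
      conn.execute("INSERT OR REPLACE INTO leases VALUES (?, ?, ?)",
                   (storage_index, shnum, time.time()))

  def expired_shares(age_limit=30 * 24 * 3600):
      """Shares whose lease was not renewed within age_limit seconds."""
      cutoff = time.time() - age_limit
      return conn.execute("SELECT storage_index, shnum FROM leases "
                          "WHERE renewal_time < ?", (cutoff,)).fetchall()
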
Memory usage
------------

The use of chunking simplifies bounding the memory usage of the storage server
when handling files that may be larger than memory. However, this depends on
limiting the number of chunks that are simultaneously held in memory.
Multiple chunks can be held in memory either because of pipelining of requests
for a single share, or because multiple shares are being read or written
(possibly by multiple clients).

For immutable shares, the Tahoe-LAFS storage protocol requires the client to
specify in advance the maximum amount of data it will write. Also, a cooperative
client (including all existing released versions of the Tahoe-LAFS code) will
limit the amount of data that is pipelined, currently to 50 KiB. Since the chunk
size will be greater than that, it is possible to ensure that for each allocation,
the maximum chunk data memory usage is the lesser of two chunks and the allocation
size. (There is some additional overhead but it is small compared to the chunk
data.) If the maximum memory usage of a new allocation would exceed the memory
available, the allocation can be delayed or possibly denied, so that the total
memory usage is bounded.

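A sketch of that bound and the resulting admission check (names hypothetical)::

  PIPELINE_LIMIT = 50 * 1024   # bytes pipelined by cooperative clients

  def max_chunk_memory(allocated_size, chunksize):
      """Worst-case chunk data held in memory for one immutable allocation:
      the lesser of two chunks and the allocated size, given that
      PIPELINE_LIMIT < chunksize."""
      assert PIPELINE_LIMIT < chunksize
      return min(2 * chunksize, allocated_size)

  def can_admit(allocated_size, chunksize, memory_available):
      """Delay or deny the allocation if it could exceed available memory."""
      return max_chunk_memory(allocated_size, chunksize) <= memory_available
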
It is not clear that the existing protocol allows allocations for mutable
shares to be bounded in general; this may be addressed in a future protocol change.

The above discussion assumes that clients do not maliciously send large
messages as a denial-of-service attack. Foolscap (the protocol layer underlying
the Tahoe-LAFS storage protocol) does not attempt to resist denial of service.

Storage usage
-------------

The storage requirements, including not-yet-collected garbage shares, are
the same as for the Tahoe-LAFS disk backend. That is, the total size of cloud
objects stored is equal to the total size of shares that the disk backend
would store.

Erasure coding causes the size of shares for each file to be a
factor `shares.total` / `shares.needed` times the file size, plus overhead
that is logarithmic in the file size `¹¹`_.

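For example, with Tahoe-LAFS's default 3-of-10 encoding parameters, the
expansion works out as follows (a back-of-the-envelope sketch)::

  shares_needed, shares_total = 3, 10
  file_size = 2**30                                 # a 1 GiB file
  total_share_bytes = file_size * shares_total // shares_needed
  # total_share_bytes is about 3.33 GiB: 10/3 times the file size,
  # plus the logarithmic overhead described above.
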
API usage
---------

Cloud storage services typically charge a small fee per API request. The number of
requests to the cloud storage service for various operations is discussed under
“network usage” above.

Structure of Implementation
===========================

A generic “cloud backend”, based on the prototype S3 backend but with support
for chunking as described above, will be written.

An instance of the cloud backend can be attached to one of several
“cloud interface adapters”, one for each cloud storage interface. These
adapters will operate only on chunks, and need not distinguish between
mutable and immutable shares. They will be a relatively “thin” abstraction
layer over the HTTP APIs of each cloud storage interface, similar to the
S3Bucket abstraction in the prototype.

For some cloud storage services it may be necessary to transparently retry
requests in order to recover from transient failures. (Although the erasure
coding may enable a file to be retrieved even when shares are not stored by or
not readable from all cloud storage services used in a Tahoe-LAFS grid, it may
be desirable to retry cloud storage service requests in order to improve overall
reliability.) Support for this will be implemented in the generic cloud backend,
and used whenever a cloud interface adapter reports a transient failure. Our
experience with the prototype suggests that it is necessary to retry on transient
failures for Amazon's S3 service.

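A sketch of such a retry wrapper in the generic backend (the exception and
function names are illustrative, not the actual interface)::

  import time

  class TransientFailure(Exception):
      """Reported by a cloud interface adapter for errors worth retrying."""

  def with_retries(operation, max_tries=5, first_delay=1.0):
      """Call an idempotent request function, retrying on transient
      failures with exponential backoff."""
      delay = first_delay
      for attempt in range(max_tries):
          try:
              return operation()
          except TransientFailure:
              if attempt == max_tries - 1:
                  raise
              time.sleep(delay)
              delay *= 2
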
There will also be a “mock” cloud interface adapter, based on the prototype's
MockS3Bucket. This allows tests of the generic cloud backend to be run without
a connection to a real cloud service. The mock adapter will be able to simulate
transient and non-transient failures.

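A minimal sketch of such a mock (class and method names are illustrative, not
the prototype's actual interface)::

  import random

  class TransientFailure(Exception):   # as in the retry sketch above
      pass

  class MockContainer:
      """In-memory stand-in for a cloud container that can inject
      simulated transient failures into backend tests."""
      def __init__(self, failure_rate=0.0):
          self.objects = {}
          self.failure_rate = failure_rate

      def _maybe_fail(self):
          if random.random() < self.failure_rate:
              raise TransientFailure("simulated transient failure")

      def put_object(self, name, data):
          self._maybe_fail()
          self.objects[name] = data

      def get_object(self, name):
          self._maybe_fail()
          return self.objects[name]
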
This design worsens a known “write hole” issue in Tahoe-LAFS when updating
the contents of mutable files. An update to a mutable file can require
changing the contents of multiple chunks, and if the client fails or is
disconnected during the operation, the resulting state of the stored cloud
objects may be inconsistent: no longer containing all of the old version, but
not yet containing all of the new version. A mutable share can be left in an
inconsistent state even by the existing Tahoe-LAFS disk backend if it fails
during a write, but that has a smaller chance of occurrence because the
current client behavior leads to mutable shares being written to disk in a
single write operation.

The best fix for this issue probably requires changing the Tahoe-LAFS storage
protocol, perhaps by extending it to use a two-phase or three-phase commit
protocol.

References
==========

.. _²:

² “Amazon S3” Amazon (2012)

   https://aws.amazon.com/s3/

.. _³:

³ “Rackspace Cloud Files” Rackspace (2012)

   https://www.rackspace.com/cloud/cloud_hosting_products/files/

.. _⁴:

⁴ “Google Cloud Storage” Google (2012)

   https://developers.google.com/storage/

.. _⁵:

⁵ “Windows Azure Storage” Microsoft (2012)

   https://www.windowsazure.com/en-us/develop/net/fundamentals/cloud-storage/

.. _⁶:

⁶ “Amazon Simple Storage Service (Amazon S3) API Reference: REST API” Amazon (2012)

   http://docs.amazonwebservices.com/AmazonS3/latest/API/APIRest.html

.. _⁷:

⁷ “OpenStack Object Storage” openstack.org (2012)

   http://openstack.org/projects/storage/

.. _⁸:

⁸ “Google Cloud Storage Reference Guide” Google (2012)

   https://developers.google.com/storage/docs/reference-guide

.. _⁹:

⁹ “Windows Azure Storage Services REST API Reference” Microsoft (2012)

   http://msdn.microsoft.com/en-us/library/windowsazure/dd179355.aspx

.. _¹⁰:

¹⁰ “Representational state transfer” English Wikipedia (2012)

   https://en.wikipedia.org/wiki/Representational_state_transfer

.. _¹¹:

¹¹ “Performance costs for some common operations” tahoe-lafs.org (2012)

   https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/performance.rst