=============================================================
Redundant Array of Independent Clouds: Share To Cloud Mapping
=============================================================

This document describes a proposed design for the mapping of LAFS shares to
objects in a cloud storage service. It also analyzes the costs for each of the
functional requirements, including network, disk, memory, cloud storage, and
API usage costs.

Terminology
===========

*LAFS share*
  A Tahoe-LAFS share representing part of a file after encryption and
  erasure coding.

*LAFS shareset*
  The set of shares stored by a LAFS storage server for a given storage index.
  The shares within a shareset are numbered by a small integer.

*Cloud storage service*
  A service such as Amazon S3 `²`_, Rackspace Cloud Files `³`_,
  Google Cloud Storage `⁴`_, or Windows Azure `⁵`_, that provides cloud storage.

*Cloud storage interface*
  A protocol interface supported by a cloud storage service, such as the
  S3 interface `⁶`_, the OpenStack Object Storage interface `⁷`_, the
  Google Cloud Storage interface `⁸`_, or the Azure interface `⁹`_. There may be
  multiple services implementing a given cloud storage interface. In this design,
  only REST-based APIs `¹⁰`_ over HTTP will be used as interfaces.

*Cloud object*
  A file-like abstraction provided by a cloud storage service, storing a
  sequence of bytes. Cloud objects are mutable in the sense that the contents
  and metadata of the cloud object with a given name in a given cloud container
  can be replaced. Cloud objects are called “blobs” in the Azure interface,
  and “objects” in the other interfaces.

*Cloud container*
  A container for cloud objects provided by a cloud storage service. Cloud
  containers are called “buckets” in the S3 and Google Cloud Storage interfaces,
  and “containers” in the Azure and OpenStack Object Storage interfaces.

Functional Requirements
=======================

* *Upload*: a LAFS share can be uploaded to an appropriately configured
  Tahoe-LAFS storage server, and the data is stored in the cloud
  storage service.

* *Scalable shares*: there is no hard limit on the size of a LAFS share
  that can be stored.

  If the cloud storage interface offers scalable files, then this could be
  implemented by using that feature of the specific cloud storage
  interface. Alternatively, it could be implemented by mapping from the LAFS
  abstraction of an unlimited-size immutable share to a set of size-limited
  cloud objects.

* *Streaming upload*: the size of the LAFS share that is uploaded
  can exceed the amount of RAM and even the amount of direct attached
  storage on the storage server. That is, the storage server is required to
  stream the data directly to the ultimate cloud storage service while
  processing it, rather than buffering the data until the client is finished
  uploading and then transferring the data to the cloud storage service.

* *Download*: a LAFS share can be downloaded from an appropriately
  configured Tahoe-LAFS storage server, and the data is loaded from the
  cloud storage service.

* *Streaming download*: the size of the LAFS share that is
  downloaded can exceed the amount of RAM and even the amount of direct
  attached storage on the storage server. That is, the storage server is
  required to stream the data directly to the client while processing it,
  rather than buffering the data until the cloud storage service is finished
  serving and then transferring the data to the client.

* *Modify*: a LAFS share can have part of its contents modified.

  If the cloud storage interface offers scalable mutable files, then this
  could be implemented by using that feature of the specific cloud storage
  interface. Alternatively, it could be implemented by mapping from the LAFS
  abstraction of an unlimited-size mutable share to a set of size-limited
  cloud objects.

* *Efficient modify*: the size of the LAFS share being
  modified can exceed the amount of RAM and even the amount of direct
  attached storage on the storage server. That is, the storage server is
  required to download, patch, and upload only the segment(s) of the share
  that are being modified, rather than downloading, patching, and uploading
  the entire share.

* *Tracking leases*: the Tahoe-LAFS storage server is required to track when
  each share has its lease renewed, so that unused shares (shares whose lease
  has not been renewed within a time limit, e.g. 30 days) can be garbage
  collected. This does not necessarily require code specific to each cloud
  storage interface, because the lease tracking can be performed in the
  storage server's generic component rather than in the component supporting
  each cloud storage interface.

Mapping
=======

This section describes the mapping between LAFS shares and cloud objects.

A LAFS share will be split into one or more “chunks” that are each stored in a
cloud object. A LAFS share of size `C` bytes will be stored as `ceiling(C / chunksize)`
chunks. The last chunk has a size between 1 and `chunksize` bytes inclusive.
(It is not possible for `C` to be zero, because valid shares always have a header,
so there is at least one chunk for each share.)

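As an illustrative sketch of the arithmetic above (the helper names here are
hypothetical, not part of the proposed interface)::

  def num_chunks(share_size, chunksize):
      """Number of cloud objects for a share of share_size bytes (> 0)."""
      assert share_size > 0, "valid shares always have a header"
      return -(-share_size // chunksize)   # ceiling division

  def last_chunk_size(share_size, chunksize):
      """Size of the final chunk, between 1 and chunksize inclusive."""
      remainder = share_size % chunksize
      return remainder if remainder != 0 else chunksize

  # Example: a 25 MiB share with a 10 MiB chunk size occupies 3 cloud
  # objects, and the last chunk holds 5 MiB.
  assert num_chunks(25 * 2**20, 10 * 2**20) == 3
  assert last_chunk_size(25 * 2**20, 10 * 2**20) == 5 * 2**20
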
For an existing share, the chunk size is determined by the size of the first
chunk. For a new share, it is a parameter that may depend on the storage
interface. It is an error for any chunk to be larger than the first chunk, or
for any chunk other than the last to be smaller than the first chunk.
If a mutable share with total size less than the default chunk size for the
storage interface is being modified, the new contents are split using the
default chunk size.

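A minimal sketch of these validity rules (the helper is hypothetical)::

  def check_chunk_sizes(chunk_sizes):
      """Validate the ordered sizes of an existing share's chunks, and
      return the share's chunk size (the size of its first chunk)."""
      chunksize = chunk_sizes[0]
      for i, size in enumerate(chunk_sizes):
          if size > chunksize:
              raise ValueError("chunk %d is larger than the first chunk" % i)
          if size < chunksize and i < len(chunk_sizes) - 1:
              raise ValueError("non-final chunk %d is smaller than the first chunk" % i)
      return chunksize
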
*Rationale*: this design allows the `chunksize` parameter to be changed for
new shares written via a particular storage interface, without breaking
compatibility with existing stored shares. All cloud storage interfaces
return the sizes of cloud objects with requests to list objects, and so
the size of the first chunk can be determined without an additional request.

The name of the cloud object for chunk `i` > 0 of a LAFS share with storage index
`STORAGEINDEX` and share number `SHNUM` will be

  shares/`ST`/`STORAGEINDEX`/`SHNUM.i`

where `ST` is the first two characters of `STORAGEINDEX`. When `i` is 0, the
`.i` suffix is omitted.

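A sketch of this naming rule (the function name and the short placeholder
storage index are illustrative only)::

  def chunk_object_name(storageindex, shnum, i):
      """Cloud object name for chunk i of share shnum; the .i suffix
      is omitted for chunk 0."""
      prefix = "shares/%s/%s/" % (storageindex[:2], storageindex)
      if i == 0:
          return prefix + str(shnum)
      return prefix + "%d.%d" % (shnum, i)

  assert chunk_object_name("aabbccdd", 3, 0) == "shares/aa/aabbccdd/3"
  assert chunk_object_name("aabbccdd", 3, 2) == "shares/aa/aabbccdd/3.2"
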
*Rationale*: this layout maintains compatibility with data stored by the
prototype S3 backend, for which Least Authority Enterprises has existing
customers. This prototype always used a single cloud object to store each
share, with the name

  shares/`ST`/`STORAGEINDEX`/`SHNUM`

By using the same prefix “shares/`ST`/`STORAGEINDEX`/” for old and new layouts,
the storage server can obtain a list of cloud objects associated with a given
shareset without having to know the layout in advance, and without having to
make multiple API requests. This also simplifies sharing of test code between the
disk and cloud backends.

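The following sketch shows how a listing under that prefix could be grouped by
share number without knowing the layout in advance (all names are hypothetical)::

  def parse_shareset_listing(object_names, storageindex):
      """Map each share number to its set of chunk indices, handling both
      the old single-object layout and the new chunked layout."""
      prefix = "shares/%s/%s/" % (storageindex[:2], storageindex)
      chunks = {}
      for name in object_names:
          shnum, _, i = name[len(prefix):].partition(".")
          chunks.setdefault(int(shnum), set()).add(int(i) if i else 0)
      return chunks

  listing = ["shares/aa/aabbccdd/3", "shares/aa/aabbccdd/3.1"]
  assert parse_shareset_listing(listing, "aabbccdd") == {3: {0, 1}}
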
Mutable and immutable shares will be “chunked” in the same way.

Rationale for Chunking
----------------------

Limiting the amount of data received or sent in a single request has the
following advantages:

* It is unnecessary to write separate code to take advantage of the
  “large object” features of each cloud storage interface, which differ
  significantly in their design.

* Data needed for each PUT request can be discarded after it completes.
  If a PUT request fails, it can be retried while only holding the data
  for that request in memory.

Costs
=====

In this section we analyze the costs of the proposed design in terms of network,
disk, memory, cloud storage, and API usage.

Network usage: bandwidth and number-of-round-trips
--------------------------------------------------

When a Tahoe-LAFS storage client allocates a new share on a storage server,
the backend will request a list of the existing cloud objects with the
appropriate prefix. This takes one HTTP request in the common case, but may
take more for the S3 interface, which has a limit of 1000 objects returned in
a single “GET Bucket” request.

If the share is to be read, the client will make a number of calls, each
specifying the offset and length of the required span of bytes. On the first
request that overlaps a given chunk of the share, the server will make an
HTTP GET request for that cloud object. The server may also speculatively
make GET requests for cloud objects that are likely to be needed soon (which
can be predicted since reads are normally sequential), in order to reduce
latency.

Each read will be satisfied as soon as the corresponding data is available,
without waiting for the rest of the chunk, in order to minimize read latency.

All four cloud storage interfaces support GET requests using the
Range HTTP header. This could be used to optimize reads where the
Tahoe-LAFS storage client requires only part of a share.

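For instance, a partial read of one chunk could be expressed with a standard
Range header; the URL and the omitted authentication are placeholders::

  import urllib.request

  def get_byte_range(url, first, last):
      """Fetch bytes first..last (inclusive) of a cloud object.
      Authentication headers are omitted for brevity."""
      req = urllib.request.Request(
          url, headers={"Range": "bytes=%d-%d" % (first, last)})
      with urllib.request.urlopen(req) as resp:
          # A server honouring the Range header replies 206 Partial Content.
          return resp.read()
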
If the share is to be written, the server will make an HTTP PUT request for
each chunk that has been completed. Tahoe-LAFS clients only write immutable
shares sequentially, and so we can rely on that property to simplify the
implementation.

When modifying shares of an existing mutable file, the storage server will
be able to make PUT requests only for chunks that have changed.
(Current Tahoe-LAFS v1.9 clients will not take advantage of this ability, but
future versions will probably do so for MDMF files.)

In some cases, it may be necessary to retry a request (see the `Structure of
Implementation`_ section below). In the case of a PUT request, at the point
at which a retry is needed, the new chunk contents to be stored will still be
in memory and so this is not problematic.

In the absence of retries, the maximum number of GET requests that will be made
when downloading a file, or the maximum number of PUT requests when uploading
or modifying a file, will be equal to the number of chunks in the file.

If the new mutable share content has fewer chunks than the old content,
then the remaining cloud objects for old chunks must be deleted (using one
HTTP request each). When reading a share, the backend must tolerate the case
where these cloud objects have not been deleted successfully.

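A sketch of computing which old chunk objects are left over after a shrinking
write (the helper name is hypothetical)::

  def chunk_indices_to_delete(old_size, new_size, chunksize):
      """Chunk indices whose cloud objects become unused when a mutable
      share shrinks from old_size to new_size bytes."""
      old_chunks = -(-old_size // chunksize)   # ceiling division
      new_chunks = -(-new_size // chunksize)
      return range(new_chunks, old_chunks)

  # Shrinking a share from 5 chunks to 2 leaves chunks 2, 3 and 4 to delete.
  assert list(chunk_indices_to_delete(50, 20, 10)) == [2, 3, 4]
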
The last write to a share will be reported as successful only when all
corresponding HTTP PUTs and DELETEs have completed successfully.

Disk usage (local to the storage server)
----------------------------------------

It is never necessary for the storage server to write the content of share
chunks to local disk, either when they are read or when they are written. Each
chunk is held only in memory.

A proposed change to the Tahoe-LAFS storage server implementation uses an SQLite
database to store metadata about shares. In that case the same database would
be used for the cloud backend. This would enable lease tracking to be implemented
in the same way for disk and cloud backends.

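As a sketch only (the actual schema is not specified in this document; the
table and function names are hypothetical), lease renewal times might be
tracked like this::

  import sqlite3, time

  conn = sqlite3.connect("shares-example.sqlite")   # hypothetical filename
  conn.execute("""CREATE TABLE IF NOT EXISTS leases
                  (storage_index TEXT, shnum INTEGER, renewal_time REAL,
                   PRIMARY KEY (storage_index, shnum))""")

  def renew_lease(storage_index, shnum):
      conn.execute("INSERT OR REPLACE INTO leases VALUES (?, ?, ?)",
                   (storage_index, shnum, time.time()))

  def expired_shares(age_limit=30 * 24 * 3600):
      """Shares whose lease was not renewed within age_limit seconds."""
      cutoff = time.time() - age_limit
      return conn.execute("SELECT storage_index, shnum FROM leases "
                          "WHERE renewal_time < ?", (cutoff,)).fetchall()
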
Memory usage
------------

The use of chunking simplifies bounding the memory usage of the storage server
when handling files that may be larger than memory. However, this depends on
limiting the number of chunks that are simultaneously held in memory.
Multiple chunks can be held in memory either because of pipelining of requests
for a single share, or because multiple shares are being read or written
(possibly by multiple clients).

For immutable shares, the Tahoe-LAFS storage protocol requires the client to
specify in advance the maximum amount of data it will write. Also, a cooperative
client (including all existing released versions of the Tahoe-LAFS code) will
limit the amount of data that is pipelined, currently to 50 KiB. Since the chunk
size will be greater than that, it is possible to ensure that for each allocation,
the maximum chunk data memory usage is the lesser of two chunks and the allocation
size. (There is some additional overhead but it is small compared to the chunk
data.) If the maximum memory usage of a new allocation would exceed the memory
available, the allocation can be delayed or possibly denied, so that the total
memory usage is bounded.

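A sketch of that bound and the resulting admission check (names hypothetical)::

  PIPELINE_LIMIT = 50 * 1024   # bytes pipelined by cooperative clients

  def max_chunk_memory(allocated_size, chunksize):
      """Worst-case chunk data held in memory for one immutable allocation:
      the lesser of two chunks and the allocated size, given that
      PIPELINE_LIMIT < chunksize."""
      assert PIPELINE_LIMIT < chunksize
      return min(2 * chunksize, allocated_size)

  def can_admit(allocated_size, chunksize, memory_available):
      """Delay or deny the allocation if it could exceed available memory."""
      return max_chunk_memory(allocated_size, chunksize) <= memory_available
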
It is not clear that the existing protocol allows allocations for mutable
shares to be bounded in general; this may be addressed in a future protocol change.

The above discussion assumes that clients do not maliciously send large
messages as a denial-of-service attack. Foolscap (the protocol layer underlying
the Tahoe-LAFS storage protocol) does not attempt to resist denial of service.

Storage usage
-------------

The storage requirements, including not-yet-collected garbage shares, are
the same as for the Tahoe-LAFS disk backend. That is, the total size of cloud
objects stored is equal to the total size of shares that the disk backend
would store.

Erasure coding causes the size of shares for each file to be a
factor `shares.total` / `shares.needed` times the file size, plus overhead
that is logarithmic in the file size `¹¹`_.

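For example, with Tahoe-LAFS's default 3-of-10 encoding parameters, the
expansion works out as follows (a back-of-the-envelope sketch)::

  shares_needed, shares_total = 3, 10
  file_size = 2**30                                 # a 1 GiB file
  total_share_bytes = file_size * shares_total // shares_needed
  # total_share_bytes is about 3.33 GiB: 10/3 times the file size,
  # plus the logarithmic overhead described above.
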
API usage
---------

Cloud storage services typically charge a small fee per API request. The number of
requests to the cloud storage service for various operations is discussed under
“network usage” above.

Structure of Implementation
===========================

A generic “cloud backend”, based on the prototype S3 backend but with support
for chunking as described above, will be written.

An instance of the cloud backend can be attached to one of several
“cloud interface adapters”, one for each cloud storage interface. These
adapters will operate only on chunks, and need not distinguish between
mutable and immutable shares. They will be a relatively “thin” abstraction
layer over the HTTP APIs of each cloud storage interface, similar to the
S3Bucket abstraction in the prototype.

For some cloud storage services it may be necessary to transparently retry
requests in order to recover from transient failures. (Although the erasure
coding may enable a file to be retrieved even when shares are not stored by or
not readable from all cloud storage services used in a Tahoe-LAFS grid, it may
be desirable to retry cloud storage service requests in order to improve overall
reliability.) Support for this will be implemented in the generic cloud backend,
and used whenever a cloud interface adapter reports a transient failure. Our
experience with the prototype suggests that it is necessary to retry on transient
failures for Amazon's S3 service.

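A sketch of such a retry wrapper in the generic backend (the exception and
function names are illustrative, not the actual interface)::

  import time

  class TransientFailure(Exception):
      """Reported by a cloud interface adapter for errors worth retrying."""

  def with_retries(operation, max_tries=5, first_delay=1.0):
      """Call an idempotent request function, retrying on transient
      failures with exponential backoff."""
      delay = first_delay
      for attempt in range(max_tries):
          try:
              return operation()
          except TransientFailure:
              if attempt == max_tries - 1:
                  raise
              time.sleep(delay)
              delay *= 2
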
There will also be a “mock” cloud interface adapter, based on the prototype's
MockS3Bucket. This allows tests of the generic cloud backend to be run without
a connection to a real cloud service. The mock adapter will be able to simulate
transient and non-transient failures.

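A minimal sketch of such a mock (class and method names are illustrative, not
the prototype's actual interface)::

  import random

  class TransientFailure(Exception):   # as in the retry sketch above
      pass

  class MockContainer:
      """In-memory stand-in for a cloud container that can inject
      simulated transient failures into backend tests."""
      def __init__(self, failure_rate=0.0):
          self.objects = {}
          self.failure_rate = failure_rate

      def _maybe_fail(self):
          if random.random() < self.failure_rate:
              raise TransientFailure("simulated transient failure")

      def put_object(self, name, data):
          self._maybe_fail()
          self.objects[name] = data

      def get_object(self, name):
          self._maybe_fail()
          return self.objects[name]
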
This design worsens a known “write hole” issue in Tahoe-LAFS when updating
the contents of mutable files. An update to a mutable file can require
changing the contents of multiple chunks, and if the client fails or is
disconnected during the operation, the resulting state of the stored cloud
objects may be inconsistent: no longer containing all of the old version, but
not yet containing all of the new version. A mutable share can be left in an
inconsistent state even by the existing Tahoe-LAFS disk backend if it fails
during a write, but that has a smaller chance of occurrence because the
current client behavior leads to mutable shares being written to disk in a
single write operation.

The best fix for this issue probably requires changing the Tahoe-LAFS storage
protocol, perhaps by extending it to use a two-phase or three-phase commit
protocol.

References
==========

.. _²:

² “Amazon S3” Amazon (2012)

   https://aws.amazon.com/s3/

.. _³:

³ “Rackspace Cloud Files” Rackspace (2012)

   https://www.rackspace.com/cloud/cloud_hosting_products/files/

.. _⁴:

⁴ “Google Cloud Storage” Google (2012)

   https://developers.google.com/storage/

.. _⁵:

⁵ “Windows Azure Storage” Microsoft (2012)

   https://www.windowsazure.com/en-us/develop/net/fundamentals/cloud-storage/

.. _⁶:

⁶ “Amazon Simple Storage Service (Amazon S3) API Reference: REST API” Amazon (2012)

   http://docs.amazonwebservices.com/AmazonS3/latest/API/APIRest.html

.. _⁷:

⁷ “OpenStack Object Storage” openstack.org (2012)

   http://openstack.org/projects/storage/

.. _⁸:

⁸ “Google Cloud Storage Reference Guide” Google (2012)

   https://developers.google.com/storage/docs/reference-guide

.. _⁹:

⁹ “Windows Azure Storage Services REST API Reference” Microsoft (2012)

   http://msdn.microsoft.com/en-us/library/windowsazure/dd179355.aspx

.. _¹⁰:

¹⁰ “Representational state transfer” English Wikipedia (2012)

   https://en.wikipedia.org/wiki/Representational_state_transfer

.. _¹¹:

¹¹ “Performance costs for some common operations” tahoe-lafs.org (2012)

   https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/performance.rst