zfec/README.txt

   1  * Intro and Licence
   2
   3 This package implements an "erasure code", or "forward error correction
   4 code".
   5
   6 This program is free software; you can redistribute it and/or modify it under
   7 the terms of the GNU General Public License as published by the Free Software
   8 Foundation; either version 2 of the License, or (at your option) any later
   9 version, with the added permission that, if you become obligated to release a
  10 derived work under this licence (as per section 2.b), you may delay the
  11 fulfillment of this obligation for up to 12 months.  If you are obligated to
  12 release code under section 2.b of this licence, you are obligated to release
  13 it under these same terms, including the 12-month grace period clause.  See
  14 the COPYING file for details.
  15
  16 The most widely known example of an erasure code is the RAID-5 algorithm
  17 which makes it so that in the event of the loss of any one hard drive, the
  18 stored data can be completely recovered.  The algorithm in the zfec package
  19 has a similar effect, but instead of recovering from the loss of only a
  20 single element, it can be parameterized to choose in advance the number of
  21 elements whose loss it can tolerate.
  22
  23 This package is largely based on the old "fec" library by Luigi Rizzo et al.,
  24 which is a mature and optimized implementation of erasure coding.  The zfec
  25 package makes several changes from the original "fec" package, including
  26 addition of the Python API, refactoring of the C API to support zero-copy
  27 operation, a few clean-ups and micro-optimizations of the core code itself,
  28 and the addition of a command-line tool named "zfec".
  29
  30
  31  * Installation
  32
  33 This package is managed with the "setuptools" package management tool.  To
  34 build and install the package directly into your system, just run "python
  35 ./setup.py install".  If you prefer to keep the package limited to a specific
  36 directory so that you can manage it yourself (perhaps by using the "GNU
  37 stow") tool, then give it these arguments: "python ./setup.py install
  38 --single-version-externally-managed
  39 --record=${specificdirectory}/zfec-install.log --prefix=${specificdirectory}"
  40
  41 To run the self-tests, execute "python ./setup.py test" (or if you have
  42 Twisted Python installed, you can run "trial zfec" for nicer output and test
  43 options.)
  44
  45
  46  * Community
  47
  48 The source is currently available via darcs on the web with the command:
  49
  50 darcs get http://allmydata.org/source/zfec
  51
  52 More information on darcs is available at http://darcs.net
  53
  54 Please join the zfec mailing list and submit patches:
  55
  56 <http://allmydata.org/cgi-bin/mailman/listinfo/zfec-dev>
  57
  58
  59  * Overview
  60
  61 This package performs two operations, encoding and decoding.  Encoding takes
  62 some input data and expands its size by producing extra "check blocks", also
  63 called "secondary blocks".  Decoding takes some data -- any combination of
  64 blocks of the original data (called "primary blocks") and "secondary blocks",
  65 and produces the original data.
  66
  67 The encoding is parameterized by two integers, k and m.  m is the total number
  68 of blocks produced, and k is how many of those blocks are necessary to
  69 reconstruct the original data.  m is required to be at least 1 and at most 256,
  70 and k is required to be at least 1 and at most m.
  71
  72 (Note that when k == m then there is no point in doing erasure coding -- it
  73 degenerates to the equivalent of the Unix "split" utility which simply splits
  74 the input into successive segments.  Similarly, when k == 1 it degenerates to
  75 the equivalent of the unix "cp" utility -- each block is a complete copy of the
  76 input data.  The "zfec" command-line tool does not implement these degenerate
  77 cases.)
  78
  79 Note that each "primary block" is a segment of the original data, so its size
  80 is 1/k'th of the size of original data, and each "secondary block" is of the
  81 same size, so the total space used by all the blocks is m/k times the size of
  82 the original data (plus some padding to fill out the last primary block to be
  83 the same size as all the others).  In addition to the data contained in the
  84 blocks themselves there are also a few pieces of metadata which are necessary
  85 for later reconstruction.  Those pieces are: 1.  the value of K, 2.  the value
  86 of M,  3.  the sharenum of each block,  4.  the number of bytes of padding
  87 that were used.  The "zfec" command-line tool compresses these pieces of data
  88 and prepends them to the beginning of each share, so each the sharefile
  89 produced by the "zfec" command-line tool is between one and four bytes larger
  90 than the share data alone.
  91
  92 The decoding step requires as input k of the blocks which were produced by the
  93 encoding step.  The decoding step produces as output the data that was earlier
  94 input to the encoding step.
  95
  96
  97  * Command-Line Tool
  98
  99 NOTE: the format of the sharefiles was changed in zfec v1.1 to allow K == 1
 100 and K == M.  This change of the format of sharefiles means that zfec >= v1.1
 101 cannot read sharefiles produced by zfec < v1.1.
 102
 103 The bin/ directory contains two Unix-style, command-line tools "zfec" and
 104 "zunfec".  Execute "zfec --help" or "zunfec --help" for usage instructions.
 105
 106 Note: a Unix-style tool like "zfec" does only one thing -- in this case
 107 erasure coding -- and leaves other tasks to other tools.  Other Unix-style
 108 tools that go well with zfec include "GNU tar" for archiving multiple files
 109 and directories into one file, "rzip" or "lrzip" for compression, and "GNU
 110 Privacy Guard" for encryption or "sha256sum" for integrity.  It is important
 111 to do things in order: first archive, then compress, then either encrypt or
 112 sha256sum, then erasure code.  Note that if GNU Privacy Guard is used for
 113 privacy, then it will also ensure integrity, so the use of sha256sum is
 114 unnecessary in that case.
 115
 116
 117  * Performance Measurements
 118
 119 On my Athlon 64 2.4 GHz workstation (running Linux), the "zfec" command-line
 120 tool encoded a 160 MB file with m=100, k=94 (about 6% redundancy) in 3.9
 121 seconds, where the "par2" tool encoded the file with about 6% redundancy in
 122 27 seconds.  zfec encoded the same file with m=12, k=6 (100% redundancy) in
 123 4.1 seconds, where par2 encoded it with about 100% redundancy in 7 minutes
 124 and 56 seconds.
 125
 126 The underlying C library in benchmark mode encoded from a file at about
 127 4.9 million bytes per second and decoded at about 5.8 million bytes per second.
 128
 129 On Peter's fancy Intel Mac laptop (2.16 GHz Core Duo), it encoded from a file
 130 at about 6.2 million bytes per second.
 131
 132 On my even fancier Intel Mac laptop (2.33 GHz Core Duo), it encoded from a file
 133 at about 6.8 million bytes per second.
 134
 135 On my old PowerPC G4 867 MHz Mac laptop, it encoded from a file at about 1.3
 136 million bytes per second.
 137
 138
 139  * API
 140
 141 Each block is associated with "blocknum".  The blocknum of each primary block is
 142 its index (starting from zero), so the 0'th block is the first primary block,
 143 which is the first few bytes of the file, the 1'st block is the next primary
 144 block, which is the next few bytes of the file, and so on.  The last primary
 145 block has blocknum k-1.  The blocknum of each secondary block is an arbitrary
 146 integer between k and 255 inclusive.  (When using the Python API, if you don't
 147 specify which blocknums you want for your secondary blocks when invoking
 148 encode(), then it will by default provide the blocks with ids from k to m-1
 149 inclusive.)
 150
 151  ** C API
 152
 153 fec_encode() takes as input an array of k pointers, where each pointer points
 154 to a memory buffer containing the input data (i.e., the i'th buffer contains
 155 the i'th primary block).  There is also a second parameter which is an array of
 156 the blocknums of the secondary blocks which are to be produced.  (Each element
 157 in that array is required to be the blocknum of a secondary block, i.e. it is
 158 required to be >= k and < m.)
 159
 160 The output from fec_encode() is the requested set of secondary blocks which are
 161 written into output buffers provided by the caller.
 162
 163 fec_decode() takes as input an array of k pointers, where each pointer points
 164 to a buffer containing a block.  There is also a separate input parameter which
 165 is an array of blocknums, indicating the blocknum of each of the blocks which is
 166 being passed in.
 167
 168 The output from fec_decode() is the set of primary blocks which were missing
 169 from the input and had to be reconstructed.  These reconstructed blocks are
 170 written into putput buffers provided by the caller.
 171
 172  ** Python API
 173
 174 encode() and decode() take as input a sequence of k buffers, where a "sequence"
 175 is any object that implements the Python sequence protocol (such as a list or
 176 tuple) and a "buffer" is any object that implements the Python buffer protocol
 177 (such as a string or array).  The contents that are required to be present in
 178 these buffers are the same as for the C API.
 179
 180 encode() also takes a list of desired blocknums.  Unlike the C API, the Python
 181 API accepts blocknums of primary blocks as well as secondary blocks in its list
 182 of desired blocknums.  encode() returns a list of buffer objects which contain
 183 the blocks requested.  For each requested block which is a primary block, the
 184 resulting list contains a reference to the apppropriate primary block from the
 185 input list.  For each requested block which is a secondary block, the list
 186 contains a newly created string object containing that block.
 187
 188 decode() also takes a list of integers indicating the blocknums of the blocks
 189 being passed int.  decode() returns a list of buffer objects which contain all
 190 of the primary blocks of the original data (in order).  For each primary block
 191 which was present in the input list, then the result list simply contains a
 192 reference to the object that was passed in the input list.  For each primary
 193 block which was not present in the input, the result list contains a newly
 194 created string object containing that primary block.
 195
 196 Beware of a "gotcha" that can result from the combination of mutable data and
 197 the fact that the Python API returns references to inputs when possible.
 198
 199 Returning references to its inputs is efficient since it avoids making an
 200 unnecessary copy of the data, but if the object which was passed as input is
 201 mutable and if that object is mutated after the call to zfec returns, then the
 202 result from zfec -- which is just a reference to that same object -- will also
 203 be mutated.  This subtlety is the price you pay for avoiding data copying.  If
 204 you don't want to have to worry about this then you can simply use immutable
 205 objects (e.g. Python strings) to hold the data that you pass to zfec.
 206
 207
 208  * Utilities
 209
 210 The filefec.py module has a utility function for efficiently reading a file
 211 and encoding it piece by piece.  This module is used by the "zfec" and
 212 "zunfec" command-line tools from the bin/ directory.
 213
 214
 215  * Dependencies
 216
 217 A C compiler is required.  To use the Python API or the command-line tools a
 218 Python interpreter is also required.  We have tested it with Python v2.4 and
 219 v2.5.
 220
 221
 222  * Acknowledgements
 223
 224 Thanks to the author of the original fec lib, Luigi Rizzo, and the folks that
 225 contributed to it: Phil Karn, Robert Morelos-Zaragoza, Hari Thirumoorthy, and
 226 Dan Rubenstein.  Thanks to the Mnet hackers who wrote an earlier Python
 227 wrapper, especially Myers Carpenter and Hauke Johannknecht.  Thanks to Brian
 228 Warner and Amber O'Whielacronx for help with the API, documentation,
 229 debugging, compression, and unit tests.  Thanks to the creators of GCC
 230 (starting with Richard M. Stallman) and Valgrind (starting with Julian Seward)
 231 for a pair of excellent tools.  Thanks to my coworkers at Allmydata --
 232 http://allmydata.com -- Fabrice Grinda, Peter Secor, Rob Kinninmont, Brian
 233 Warner, Zandr Milewski, Justin Boreta, Mark Meras for sponsoring this work and
 234 releasing it under a Free Software licence.
 235
 236
 237 Enjoy!
 238
 239 Zooko Wilcox-O'Hearn
 240 2007-10-01
 241 Boulder, Colorado