zfec -- efficient, portable erasure coding tool
===============================================

Generate redundant blocks of information such that if some of the blocks are
lost then the original data can be recovered from the remaining blocks. This
package includes command-line tools, a C API, a Python API, and a Haskell API.

This package implements an "erasure code", or "forward error correction
code".

You may use this package under the GNU General Public License, version 2 or,
at your option, any later version. You may use this package under the
Transitive Grace Period Public Licence, version 1.0 or, at your option, any
later version. (You may choose to use this package under the terms of either
licence, at your option.) See the file COPYING.GPL for the terms of the GNU
General Public License, version 2. See the file COPYING.TGPPL.html for the
terms of the Transitive Grace Period Public Licence, version 1.0.

The most widely known example of an erasure code is the RAID-5 algorithm,
which ensures that if any one hard drive is lost, the stored data can be
completely recovered. The algorithm in the zfec package has a similar effect,
but instead of tolerating the loss of only a single element, it lets you
choose in advance the number of elements whose loss it can tolerate.

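The single-loss case can be illustrated with plain XOR parity. The following
is only a minimal sketch of the RAID-5 idea, not the algorithm used by zfec;
the helper name ``xor_blocks`` and the sample data are made up for
illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data_blocks = [b"abcd", b"efgh", b"ijkl"]   # three "drives" holding data
parity = xor_blocks(data_blocks)            # one extra "drive" holding parity

# Lose any single block (here the second one) and rebuild it from the rest.
survivors = [data_blocks[0], data_blocks[2], parity]
recovered = xor_blocks(survivors)
```

XOR parity can only repair one missing block, which is exactly the limitation
that a parameterized erasure code removes.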
This package is largely based on the old "fec" library by Luigi Rizzo et al.,
which is a mature and optimized implementation of erasure coding. The zfec
package makes several changes from the original "fec" package, including
addition of the Python API, refactoring of the C API to support zero-copy
operation, a few clean-ups and optimizations of the core code itself, and the
addition of a command-line tool named "zfec".

This package is managed with the "setuptools" package management tool. To
build and install the package directly into your system, just run ``python
./setup.py install``. If you prefer to keep the package limited to a
specific directory so that you can manage it yourself (perhaps by using the
"GNU stow" tool), then give it these arguments: ``python ./setup.py install
--single-version-externally-managed
--record=${specificdirectory}/zfec-install.log
--prefix=${specificdirectory}``

To run the self-tests, execute ``python ./setup.py test``. (If you have
Twisted Python installed, you can instead run ``trial zfec`` for nicer output
and test options.) This will run the tests of the C API, the Python API, and
the command-line tools.

To run the tests of the Haskell API: ``runhaskell haskell/test/FECTest.hs``

Note that in order to run the Haskell API tests you must first install the
library, because the interpreter cannot process FEC.hs, which takes a
reference to an FFI function.

The source is currently available via darcs on the web with the command::

  darcs get https://tahoe-lafs.org/source/zfec/trunk

More information on darcs is available at http://darcs.net

Please post about zfec to the Tahoe-LAFS mailing list and contribute patches:

<https://tahoe-lafs.org/cgi-bin/mailman/listinfo/tahoe-dev>

This package performs two operations, encoding and decoding. Encoding takes
some input data and expands its size by producing extra "check blocks", also
called "secondary blocks". Decoding takes any combination of blocks of the
original data (called "primary blocks") and "secondary blocks", and produces
the original data.

The encoding is parameterized by two integers, k and m. m is the total
number of blocks produced, and k is how many of those blocks are necessary to
reconstruct the original data. m is required to be at least 1 and at most
256, and k is required to be at least 1 and at most m.

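The constraints on k and m can be written down as a small check. This is a
sketch only: ``check_params`` and ``tolerated_losses`` are hypothetical
helper names, not functions provided by zfec.

```python
def check_params(k, m):
    """Hypothetical helper: raise ValueError unless 1 <= k <= m <= 256."""
    if not 1 <= m <= 256:
        raise ValueError("m must be between 1 and 256, got %r" % (m,))
    if not 1 <= k <= m:
        raise ValueError("k must be between 1 and m, got %r" % (k,))
    return k, m


def tolerated_losses(k, m):
    """Number of blocks that may be lost while k of the m blocks remain."""
    check_params(k, m)
    return m - k
```

For example, ``tolerated_losses(3, 10)`` is 7: any 3 of the 10 blocks are
enough, so up to 7 may be lost.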
(Note that when k == m then there is no point in doing erasure coding -- it
degenerates to the equivalent of the Unix "split" utility which simply splits
the input into successive segments. Similarly, when k == 1 it degenerates to
the equivalent of the Unix "cp" utility -- each block is a complete copy of
the input data.)

Note that each "primary block" is a segment of the original data, so its size
is 1/k'th of the size of the original data, and each "secondary block" is of
the same size, so the total space used by all the blocks is m/k times the size
of the original data (plus some padding to fill out the last primary block to
be the same size as all the others). In addition to the data contained in the
blocks themselves there are also a few pieces of metadata which are necessary
for later reconstruction. Those pieces are:

1. the value of k,
2. the value of m,
3. the sharenum of each block,
4. the number of bytes of padding that were used.

The "zfec" command-line tool compresses these pieces of data and prepends
them to the beginning of each share, so each sharefile produced by the "zfec"
command-line tool is between one and four bytes larger than the share data
alone.

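The size arithmetic above can be checked with a few lines of Python. This is
a back-of-the-envelope sketch; ``block_layout`` is a hypothetical name, and
the result ignores the one to four bytes of metadata that the command-line
tool prepends to each sharefile.

```python
def block_layout(data_len, k, m):
    """Return (block_size, padding, total_bytes) for data_len input bytes."""
    block_size = -(-data_len // k)         # ceiling of data_len / k
    padding = block_size * k - data_len    # bytes needed to fill out the data
    total_bytes = block_size * m           # all m blocks have the same size
    return block_size, padding, total_bytes


print(block_layout(1000, 3, 10))   # (334, 2, 3340)
```

With 1000 bytes of input, k=3 and m=10, each block holds 334 bytes, 2 bytes
of padding are needed, and the blocks together occupy 3340 bytes, which is
roughly m/k = 3.33 times the input size.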
The decoding step requires as input k of the blocks which were produced by
the encoding step. The decoding step produces as output the data that was
earlier input to the encoding step.

The bin/ directory contains two Unix-style, command-line tools "zfec" and
"zunfec". Execute ``zfec --help`` or ``zunfec --help`` for usage
instructions.

Note: a Unix-style tool like "zfec" does only one thing -- in this case
erasure coding -- and leaves other tasks to other tools. Other Unix-style
tools that go well with zfec include `GNU tar`_ for archiving multiple files
and directories into one file, `lzip`_ for compression, and `GNU Privacy
Guard`_ for encryption or `sha256sum`_ for integrity. It is important to do
things in order: first archive, then compress, then either encrypt or
integrity-check, then erasure code. Note that if GNU Privacy Guard is used
for privacy, then it will also ensure integrity, so the use of sha256sum is
unnecessary in that case. Note also that you need to do integrity checking
(such as with sha256sum) on the blocks that result from the erasure coding in
*addition* to doing it on the file contents! (There are two different subtle
failure modes -- see "more than one file can match an immutable file cap" on
the `Hack Tahoe-LAFS!`_ Hall of Fame.)

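The advice to check integrity both on the file contents and on the individual
blocks could look roughly like this. It is a sketch using only Python's
standard ``hashlib`` module; the function names and the manifest layout are
assumptions, and the "blocks" here are plain slices standing in for real zfec
output.

```python
import hashlib


def sha256_hex(data):
    """Return the hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def make_manifest(file_contents, blocks):
    """Record one digest for the whole file and one digest per block."""
    return {
        "file": sha256_hex(file_contents),
        "blocks": [sha256_hex(block) for block in blocks],
    }


def intact_blocknums(manifest, blocks):
    """Return the indices of the blocks whose digests still match."""
    return [
        i for i, block in enumerate(blocks)
        if sha256_hex(block) == manifest["blocks"][i]
    ]


# Stand-in data: these "blocks" are slices, not blocks produced by zfec.
contents = b"archived, compressed file contents"
blocks = [contents[0:12], contents[12:24], contents[24:]]
manifest = make_manifest(contents, blocks)

# Simulate corruption of the second block while it was in storage.
stored = [blocks[0], b"X" + blocks[1][1:], blocks[2]]
usable = intact_blocknums(manifest, stored)   # excludes the corrupted block
```

Only blocks that pass the per-block check should be handed to the decoder,
and the decoded output should then be compared against the whole-file digest.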
The `Tahoe-LAFS`_ project uses zfec as part of a complete distributed
filesystem with integrated encryption, integrity, remote distribution of the
blocks, directory structure, backup of changed files or directories, access
control, immutable files and directories, proof-of-retrievability, and repair
of damaged files and directories.

.. _GNU tar: http://directory.fsf.org/project/tar/
.. _lzip: http://www.nongnu.org/lzip/lzip.html
.. _GNU Privacy Guard: http://gnupg.org/
.. _sha256sum: http://www.gnu.org/software/coreutils/
.. _Tahoe-LAFS: https://tahoe-lafs.org
.. _Hack Tahoe-LAFS!: https://tahoe-lafs.org/hacktahoelafs/

To run the benchmarks, execute the included bench/bench_zfec.py script with
optional --k= and --m= arguments.

On my Athlon 64 2.4 GHz workstation (running Linux), the "zfec" command-line
tool encoded a 160 MB file with m=100, k=94 (about 6% redundancy) in 3.9
seconds, where the "par2" tool encoded the file with about 6% redundancy in
27 seconds. zfec encoded the same file with m=12, k=6 (100% redundancy) in
4.1 seconds, where par2 encoded it with about 100% redundancy in 7 minutes

The underlying C library in benchmark mode encoded from a file at about 4.9
million bytes per second and decoded at about 5.8 million bytes per second.

On Peter's fancy Intel Mac laptop (2.16 GHz Core Duo), it encoded from a file
at about 6.2 million bytes per second.

On my even fancier Intel Mac laptop (2.33 GHz Core Duo), it encoded from a
file at about 6.8 million bytes per second.

On my old PowerPC G4 867 MHz Mac laptop, it encoded from a file at about 1.3
million bytes per second.

Here is a paper analyzing the performance of various erasure codes and their
implementations, including zfec:

http://www.usenix.org/events/fast09/tech/full_papers/plank/plank.pdf

Zfec shows good performance on different machines and with different values
of K and M. It also has a nice small memory footprint.

Each block is associated with a "blocknum". The blocknum of each primary
block is its index (starting from zero), so the 0th block is the first primary
block, which is the first few bytes of the file, the 1st block is the next
primary block, which is the next few bytes of the file, and so on. The last
primary block has blocknum k-1. The blocknum of each secondary block is an
arbitrary integer between k and 255 inclusive. (When using the Python API,
if you don't specify which secondary blocks you want when invoking encode(),
then it will by default provide the blocks with blocknums from k to m-1
inclusive.)

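The numbering rules can be restated as a few predicates. This is a sketch
that only mirrors the description above; the function names are hypothetical
and are not part of the zfec API.

```python
def is_primary(blocknum, k):
    """Primary blocks are numbered 0 .. k-1."""
    return 0 <= blocknum < k


def is_secondary(blocknum, k):
    """Secondary blocks are numbered k .. 255 inclusive."""
    return k <= blocknum <= 255


def default_secondary_blocknums(k, m):
    """Secondary blocknums described above as the default: k .. m-1."""
    return list(range(k, m))


print(default_secondary_blocknums(3, 7))   # [3, 4, 5, 6]
```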
fec_encode() takes as input an array of k pointers, where each pointer
points to a memory buffer containing the input data (i.e., the i'th buffer
contains the i'th primary block). There is also a second parameter which
is an array of the blocknums of the secondary blocks which are to be
produced. (Each element in that array is required to be the blocknum of a
secondary block, i.e. it is required to be >= k and < m.)

The output from fec_encode() is the requested set of secondary blocks which
are written into output buffers provided by the caller.

Note that this fec_encode() is a "low-level" API in that it requires the
input data to be provided in a set of memory buffers of exactly the right
sizes. If you are starting instead with a single buffer containing all of
the data then please see easyfec.py's "class Encoder" as an example of how
to split a single large buffer into the appropriate set of input buffers
for fec_encode(). If you are starting with a file on disk, then please see
filefec.py's encode_file_stringy_easyfec() for an example of how to read
the data from a file and pass it to "class Encoder". The Python interface
provides these higher-level operations, as does the Haskell interface. If
you implement functions to do these higher-level tasks in other languages,
please send a patch to tahoe-dev@tahoe-lafs.org so that your API can be
included in future releases of zfec.

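Splitting one large buffer into k equally sized input buffers might look like
the sketch below. It is not the code from easyfec.py: the function name is
hypothetical, and padding with zero bytes is an assumption made for the
example.

```python
def split_into_primary_blocks(data, k, pad=b"\x00"):
    """Split data into k equally sized blocks, padding the tail if needed.

    Returns (blocks, padding_length). The pad byte is an assumption.
    """
    block_size = -(-len(data) // k)            # ceiling of len(data) / k
    padding_length = block_size * k - len(data)
    padded = data + pad * padding_length
    blocks = [padded[i * block_size:(i + 1) * block_size] for i in range(k)]
    return blocks, padding_length


blocks, padding_length = split_into_primary_blocks(b"hello world", 3)
# blocks is [b"hell", b"o wo", b"rld\x00"] and padding_length is 1
```

The padding length has to be remembered alongside the blocks so that the
padding can be stripped again after decoding.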
fec_decode() takes as input an array of k pointers, where each pointer
points to a buffer containing a block. There is also a separate input
parameter which is an array of blocknums, indicating the blocknum of each
of the blocks which is being passed in.

The output from fec_decode() is the set of primary blocks which were
missing from the input and had to be reconstructed. These reconstructed
blocks are written into output buffers provided by the caller.

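Which primary blocks have to be reconstructed follows directly from the
blocknums passed in. The sketch below only illustrates that bookkeeping in
Python; ``primaries_to_reconstruct`` is a hypothetical name and does not call
fec_decode().

```python
def primaries_to_reconstruct(blocknums, k):
    """Return the primary blocknums that are absent from the input."""
    if len(blocknums) != k or len(set(blocknums)) != k:
        raise ValueError("exactly k distinct blocknums are required")
    return sorted(set(range(k)) - set(blocknums))


# With k=3, passing in blocks 0, 2 and 5 means primary block 1 is missing.
missing = primaries_to_reconstruct([0, 2, 5], 3)
```

The number of missing primary blocks also tells the caller how many output
buffers to provide.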
encode() and decode() take as input a sequence of k buffers, where a
"sequence" is any object that implements the Python sequence protocol (such
as a list or tuple) and a "buffer" is any object that implements the Python
buffer protocol (such as a string or array). The contents that are
required to be present in these buffers are the same as for the C API.

encode() also takes a list of desired blocknums. Unlike the C API, the
Python API accepts blocknums of primary blocks as well as secondary blocks
in its list of desired blocknums. encode() returns a list of buffer
objects which contain the blocks requested. For each requested block which
is a primary block, the resulting list contains a reference to the
appropriate primary block from the input list. For each requested block
which is a secondary block, the list contains a newly created string object
containing that block.

decode() also takes a list of integers indicating the blocknums of the
blocks being passed in. decode() returns a list of buffer objects which
contain all of the primary blocks of the original data (in order). For
each primary block which was present in the input list, the result list
simply contains a reference to the object that was passed in the input
list. For each primary block which was not present in the input, the
result list contains a newly created string object containing that primary
block.

Beware of a "gotcha" that can result from the combination of mutable data
and the fact that the Python API returns references to inputs when
possible.

Returning references to its inputs is efficient since it avoids making an
unnecessary copy of the data, but if the object which was passed as input
is mutable and if that object is mutated after the call to zfec returns,
then the result from zfec -- which is just a reference to that same object
-- will also be mutated. This subtlety is the price you pay for avoiding
data copying. If you don't want to have to worry about this then you can
simply use immutable objects (e.g. Python strings) to hold the data that
you pass to zfec.

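The aliasing effect can be reproduced without zfec at all. In this sketch
``return_references`` is a hypothetical stand-in for any API that hands back
its inputs without copying, and Python 3 ``bytearray`` and ``bytes`` objects
play the roles of the mutable and immutable buffers.

```python
def return_references(blocks, wanted):
    """Hypothetical stand-in for an API that returns inputs without copying."""
    return [blocks[i] for i in wanted]


# A mutable input: the result is the very same object as the input.
mutable_block = bytearray(b"primary block 0")
result = return_references([mutable_block], [0])
mutable_block[0:7] = b"CHANGED"          # mutate the input after the call
mutated_view = bytes(result[0])          # the "result" has changed as well

# An immutable input cannot be altered in place, so the result stays valid.
immutable_block = b"primary block 0"
safe_result = return_references([immutable_block], [0])
immutable_block = b"CHANGED block 0"     # rebinding creates a new object
```

After this runs, ``mutated_view`` starts with ``CHANGED`` even though the
result list was never touched, while ``safe_result[0]`` still holds the
original bytes.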
The Haskell code is fully Haddocked. To generate the documentation, run
``runhaskell Setup.lhs haddock``.

The filefec.py module has a utility function for efficiently reading a file
and encoding it piece by piece. This module is used by the "zfec" and
"zunfec" command-line tools from the bin/ directory.

A C compiler is required. To use the Python API or the command-line tools a
Python interpreter is also required. We have tested it with Python v2.4,
v2.5, v2.6, and v2.7. For the Haskell interface, GHC >= 6.8.1 is required.

Thanks to the author of the original fec lib, Luigi Rizzo, and the folks that
contributed to it: Phil Karn, Robert Morelos-Zaragoza, Hari Thirumoorthy, and
Dan Rubenstein. Thanks to the Mnet hackers who wrote an earlier Python
wrapper, especially Myers Carpenter and Hauke Johannknecht. Thanks to Brian
Warner and Amber O'Whielacronx for help with the API, documentation,
debugging, compression, and unit tests. Thanks to Adam Langley for improving
the C API and contributing the Haskell API. Thanks to the creators of GCC
(starting with Richard M. Stallman) and Valgrind (starting with Julian
Seward) for a pair of excellent tools. Thanks to my coworkers at Allmydata
-- http://allmydata.com -- Fabrice Grinda, Peter Secor, Rob Kinninmont, Brian
Warner, Zandr Milewski, Justin Boreta, Mark Meras for sponsoring this work
and releasing it under a Free Software licence. Thanks to Jack Lloyd, Samuel
Neves, and David-Sarah Hopwood.