docs/proposed/magic-folder/remote-to-local-sync.rst

   1 Magic Folder design for remote-to-local sync
   2 ============================================
   3
   4 Scope
   5 -----
   6
   7 In this Objective we will design remote-to-local synchronization:
   8
   9 * How to efficiently determine which objects (files and directories) have
  10   to be downloaded in order to bring the current local filesystem into sync
  11   with the newly-discovered version of the remote filesystem.
  12 * How to distinguish overwrites, in which the remote side was aware of
  13   your most recent version and overwrote it with a new version, from
  14   conflicts, in which the remote side was unaware of your most recent
  15   version when it published its new version. The latter needs to be raised
  16   to the user as an issue the user will have to resolve and the former must
  17   not bother the user.
  18 * How to overwrite the (stale) local versions of those objects with the
  19   newly acquired objects, while preserving backed-up versions of those
  20   overwritten objects in case the user didn't want this overwrite and wants
  21   to recover the old version.
  22
  23 Tickets on the Tahoe-LAFS trac with the `otf-magic-folder-objective4`_
  24 keyword are within the scope of the remote-to-local synchronization
  25 design.
  26
  27 .. _otf-magic-folder-objective4: https://tahoe-lafs.org/trac/tahoe-lafs/query?status=!closed&keywords=~otf-magic-folder-objective4
  28
  29
  30 Glossary
  31 ''''''''
  32
  33 Object: a file or directory
  34
  35 DMD: distributed mutable directory
  36
  37 Folder: an abstract directory that is synchronized between clients.
  38 (A folder is not the same as the directory corresponding to it on
  39 any particular client, nor is it the same as a DMD.)
  40
  41 Descendant: a direct or indirect child in a directory or folder tree
  42
  43 Subfolder: a folder that is a descendant of a magic folder
  44
  45 Subpath: the path from a magic folder to one of its descendants
  46
  47 Write: a modification to a local filesystem object by a client
  48
  49 Read: a read from a local filesystem object by a client
  50
  51 Upload: an upload of a local object to the Tahoe-LAFS file store
  52
  53 Download: a download from the Tahoe-LAFS file store to a local object
  54
  55 Pending notification: a local filesystem change that has been detected
  56 but not yet processed.
  57
  58
  59 Representing the Magic Folder in Tahoe-LAFS
  60 -------------------------------------------
  61
  62 Unlike the local case where we use inotify or ReadDirectoryChangesW to
  63 detect filesystem changes, we have no mechanism to register a monitor for
  64 changes to a Tahoe-LAFS directory. Therefore, we must periodically poll
  65 for changes.
  66
  67 An important constraint on the solution is Tahoe-LAFS' "`write
  68 coordination directive`_", which prohibits concurrent writes by different
  69 storage clients to the same mutable object:
  70
  71     Tahoe does not provide locking of mutable files and directories. If
  72     there is more than one simultaneous attempt to change a mutable file
  73     or directory, then an UncoordinatedWriteError may result. This might,
  74     in rare cases, cause the file or directory contents to be accidentally
  75     deleted.  The user is expected to ensure that there is at most one
  76     outstanding write or update request for a given file or directory at
  77     a time.  One convenient way to accomplish this is to make a different
  78     file or directory for each person or process that wants to write.
  79
  80 .. _`write coordination directive`: ../../write_coordination.rst
  81
  82 Since it is a goal to allow multiple users to write to a Magic Folder,
  83 if the write coordination directive remains the same as above, then we
  84 will not be able to implement the Magic Folder as a single Tahoe-LAFS
  85 DMD. In general therefore, we will have multiple DMDs —spread across
  86 clients— that together represent the Magic Folder. Each client polls
  87 the other clients' DMDs in order to detect remote changes.
  88
  89 Six possible designs were considered for the representation of subfolders
  90 of the Magic Folder:
  91
  92 1. All subfolders written by a given Magic Folder client are collapsed
  93 into a single client DMD, containing immutable files. The child name of
  94 each file encodes the full subpath of that file relative to the Magic
  95 Folder.
  96
  97 2. The DMD tree under a client DMD is a direct copy of the folder tree
  98 written by that client to the Magic Folder. Not all subfolders have
  99 corresponding DMDs; only those to which that client has written files or
 100 child subfolders.
 101
 102 3. The directory tree under a client DMD is a ``tahoe backup`` structure
 103 containing immutable snapshots of the folder tree written by that client
 104 to the Magic Folder. As in design 2, only objects written by that client
 105 are present.
 106
 107 4. *Each* client DMD contains an eventually consistent mirror of all
 108 files and folders written by *any* Magic Folder client. Thus each client
 109 must also copy changes made by other Magic Folder clients to its own
 110 client DMD.
 111
 112 5. *Each* client DMD contains a ``tahoe backup`` structure containing
 113 immutable snapshots of all files and folders written by *any* Magic
 114 Folder client. Thus each client must also create another snapshot in its
 115 own client DMD when changes are made by another client. (It can potentially
 116 batch changes, subject to latency requirements.)
 117
 118 6. The write coordination problem is solved by implementing `two-phase
 119 commit`_. Then, the representation consists of a single DMD tree which is
 120 written by all clients.
 121
 122 .. _`two-phase commit`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1755
 123
 124 Here is a summary of advantages and disadvantages of each design:
 125
 126 +----------------------------+
 127 | Key                        |
 128 +=======+====================+
 129 | \+\+  | major advantage    |
 130 +-------+--------------------+
 131 | \+    | minor advantage    |
 132 +-------+--------------------+
 133 | ‒     | minor disadvantage |
 134 +-------+--------------------+
 135 | ‒ ‒   | major disadvantage |
 136 +-------+--------------------+
 137 | ‒ ‒ ‒ | showstopper        |
 138 +-------+--------------------+
 139
 140
 141 123456+: All designs have the property that a recursive add-lease
 142 operation starting from the parent Tahoe-LAFS DMD will find all of the
 143 files and directories used in the Magic Folder representation. Therefore
 144 the representation is compatible with `garbage collection`_, even when a
 145 pre-Magic-Folder client does the lease marking.
 146
 147 .. _`garbage collection`: https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/garbage-collection.rst
 148
 149 123456+: All designs avoid "breaking" pre-Magic-Folder clients that read
 150 a directory or file that is part of the representation.
 151
 152 456++: Only these designs allow a readcap to one of the client
 153 directories —or one of their subdirectories— to be directly shared
 154 with other Tahoe-LAFS clients (not necessarily Magic Folder clients),
 155 so that such a client sees all of the contents of the Magic Folder.
 156 Note that this was not a requirement of the OTF proposal, although it
 157 is useful.
 158
 159 135+: A Magic Folder client has only one mutable Tahoe-LAFS object to
 160 monitor per other client. This minimizes communication bandwidth for
 161 polling, or alternatively the latency possible for a given polling
 162 bandwidth.
 163
 164 1236+: A client does not need to make changes to its own DMD that repeat
 165 changes that another Magic Folder client had previously made. This reduces
 166 write bandwidth and complexity.
 167
 168 1‒: If the Magic Folder has many subfolders, their files will all be
 169 collapsed into the same DMD, which could get quite large. In practice a
 170 single DMD can easily handle the number of files expected to be written
 171 by a client, so this is unlikely to be a significant issue.
 172
 173 123‒ ‒ ‒: In these designs, the set of files in a Magic Folder is
 174 represented as the union of the files in all client DMDs. However,
 175 when a file is modified by more than one client, it will be linked
 176 from multiple client DMDs. We therefore need a mechanism, such as a
 177 version number or a monotonically increasing timestamp, to determine
 178 which copy takes priority.
 179
 180 35‒ ‒: When a Magic Folder client detects a remote change, it must
 181 traverse an immutable directory structure to see what has changed.
 182 Completely unchanged subtrees will have the same URI, allowing some of
 183 this traversal to be shortcutted.
 184
 185 24‒ ‒ ‒: When a Magic Folder client detects a remote change, it must
 186 traverse a mutable directory structure to see what has changed. This is
 187 more complex and less efficient than traversing an immutable structure,
 188 because shortcutting is not possible (each DMD retains the same URI even
 189 if a descendant object has changed), and because the structure may change
 190 while it is being traversed. Also the traversal needs to be robust
 191 against cycles, which can only occur in mutable structures.
 192
 193 45‒ ‒: When a change occurs in one Magic Folder client, it will propagate
 194 to all the other clients. Each client will therefore see multiple
 195 representation changes for a single logical change to the Magic Folder
 196 contents, and must suppress the duplicates. This is particularly
 197 problematic for design 4 where it interacts with the preceding issue.
 198
 199 4‒ ‒ ‒, 5‒ ‒: There is the potential for client DMDs to get "out of sync"
 200 with each other, potentially for long periods if errors occur. Thus each
 201 client must be able to "repair" its client directory (and its
 202 subdirectory structure) concurrently with performing its own writes. This
 203 is a significant complexity burden and may introduce failure modes that
 204 could not otherwise happen.
 205
 206 6‒ ‒ ‒: While two-phase commit is a well-established protocol, its
 207 application to Tahoe-LAFS requires significant design work, and may still
 208 leave some corner cases of the write coordination problem unsolved.
 209
 210
 211 +------------------------------------------------+-----------------------------------------+
 212 | Design Property                                | Designs Proposed                        |
 213 +================================================+======+======+======+======+======+======+
 214 | **advantages**                                 | *1*  | *2*  | *3*  | *4*  | *5*  | *6*  |
 215 +------------------------------------------------+------+------+------+------+------+------+
 216 | Compatible with garbage collection             |\+    |\+    |\+    |\+    |\+    |\+    |
 217 +------------------------------------------------+------+------+------+------+------+------+
 218 | Does not break old clients                     |\+    |\+    |\+    |\+    |\+    |\+    |
 219 +------------------------------------------------+------+------+------+------+------+------+
 220 | Allows direct sharing                          |      |      |      |\+\+  |\+\+  |\+\+  |
 221 +------------------------------------------------+------+------+------+------+------+------+
 222 | Efficient use of bandwidth                     |\+    |      |\+    |      |\+    |      |
 223 +------------------------------------------------+------+------+------+------+------+------+
 224 | No repeated changes                            |\+    |\+    |\+    |      |      |\+    |
 225 +------------------------------------------------+------+------+------+------+------+------+
 226 | **disadvantages**                              | *1*  | *2*  | *3*  | *4*  | *5*  | *6*  |
 227 +------------------------------------------------+------+------+------+------+------+------+
 228 | Can result in large DMDs                       |‒     |      |      |      |      |      |
 229 +------------------------------------------------+------+------+------+------+------+------+
 230 | Need version number to determine priority      |‒     |‒     |‒     |      |      |      |
 231 +------------------------------------------------+------+------+------+------+------+------+
 232 | Must traverse immutable directory structure    |      |      |‒ ‒   |      |‒ ‒   |      |
 233 +------------------------------------------------+------+------+------+------+------+------+
 234 | Must traverse mutable directory structure      |      |‒ ‒   |      |‒ ‒   |      |      |
 235 +------------------------------------------------+------+------+------+------+------+------+
 236 | Must suppress duplicate representation changes |      |      |      |‒ ‒   |‒ ‒   |      |
 237 +------------------------------------------------+------+------+------+------+------+------+
 238 | "Out of sync" problem                          |      |      |      |‒ ‒ ‒ |‒ ‒   |      |
 239 +------------------------------------------------+------+------+------+------+------+------+
 240 | Unsolved design problems                       |      |      |      |      |      |‒ ‒ ‒ |
 241 +------------------------------------------------+------+------+------+------+------+------+
 242
 243
 244 Evaluation of designs
 245 '''''''''''''''''''''
 246
 247 Designs 2 and 3 have no significant advantages over design 1, while
 248 requiring higher polling bandwidth and greater complexity due to the need
 249 to create subdirectories. These designs were therefore rejected.
 250
 251 Design 4 was rejected due to the out-of-sync problem, which is severe
 252 and possibly unsolvable for mutable structures.
 253
 254 For design 5, the out-of-sync problem is still present but possibly
 255 solvable. However, design 5 is substantially more complex, less efficient
 256 in bandwidth/latency, and less scalable in number of clients and
 257 subfolders than design 1. It only gains over design 1 on the ability to
 258 share directory readcaps to the Magic Folder (or subfolders), which was
 259 not a requirement. It would be possible to implement this feature in
 260 future by switching to design 6.
 261
 262 For the time being, however, design 6 was considered out-of-scope for
 263 this project.
 264
 265 Therefore, design 1 was chosen. That is:
 266
 267     All subfolders written by a given Magic Folder client are collapsed
 268     into a single client DMD, containing immutable files. The child name
 269     of each file encodes the full subpath of that file relative to the
 270     Magic Folder.
 271
 272 Each directory entry in a DMD also stores a version number, so that the
 273 latest version of a file is well-defined when it has been modified by
 274 multiple clients.
 275
 276 To enable representing empty directories, a client that creates a
 277 directory should link a corresponding zero-length file in its DMD,
 278 at a name that ends with the encoded directory separator character.
 279
 280 We want to enable dynamic configuration of the set of clients subscribed
 281 to a Magic Folder, without having to reconfigure or restart each client
 282 when another client joins. To support this, we have a single parent DMD
 283 that links to all of the client DMDs, named by their client nicknames.
 284 Then it is possible to change the contents of the parent DMD in order to
 285 add clients. Note that a client DMD should not be unlinked from the
 286 parent directory unless all of its files are first copied to some other
 287 client DMD.
 288
 289 A client needs to be able to write to its own DMD, and read from other DMDs.
 290 To be consistent with the `Principle of Least Authority`_, each client's
 291 reference to its own DMD is a write capability, whereas its reference
 292 to the parent DMD is a read capability. The latter transitively grants
 293 read access to all of the other client DMDs and the files linked from
 294 them, as required.
 295
 296 .. _`Principle of Least Authority`: http://www.eros-os.org/papers/secnotsep.pdf
 297
 298 Design and implementation of the user interface for maintaining this
 299 DMD structure and configuration will be addressed in Objectives 5 and 6.
 300
 301 During operation, each client will poll for changes on other clients
 302 at a predetermined frequency. On each poll, it will reread the parent DMD
 303 (to allow for added or removed clients), and then read each client DMD
 304 linked from the parent.
 305
 306 "Hidden" files, and files with names matching the patterns used for backup,
 307 temporary, and conflicted files, will be ignored, i.e. not synchronized
 308 in either direction. A file is hidden if it has a filename beginning with
 309 "." (on any platform), or has the hidden or system attribute on Windows.
 310
 311
 312 Conflict Detection and Resolution
 313 ---------------------------------
 314
 315 The combination of local filesystems and distributed objects is
 316 an example of shared state concurrency, which is highly error-prone
 317 and can result in race conditions that are complex to analyze.
 318 Unfortunately we have no option but to use shared state in this
 319 situation.
 320
 321 We call the resulting design issues "dragons" (as in "Here be dragons"),
 322 which as a convenient mnemonic we have named after the classical
 323 Greek elements Earth, Fire, Air, and Water.
 324
 325 Note: all filenames used in the following sections are examples,
 326 and the filename patterns we use in the actual implementation may
 327 differ. The actual patterns will probably include timestamps, and
 328 for conflicted files, the nickname of the client that last changed
 329 the file.
 330
 331
 332 Earth Dragons: Collisions between local filesystem operations and downloads
 333 '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
 334
 335 Write/download collisions
 336 ~~~~~~~~~~~~~~~~~~~~~~~~~
 337
 338 Suppose that Alice's Magic Folder client is about to write a
 339 version of ``foo`` that it has downloaded in response to a remote
 340 change.
 341
 342 The criteria for distinguishing overwrites from conflicts are
 343 described later in the `Fire Dragons`_ section. Suppose that the
 344 remote change has been initially classified as an overwrite.
 345 (As we will see, it may be reclassified in some circumstances.)
 346
 347 .. _`Fire Dragons`: #fire-dragons-distinguishing-conflicts-from-overwrites
 348
 349 A *write/download collision* occurs when another program writes
 350 to ``foo`` in the local filesystem, concurrently with the new
 351 version being written by the Magic Folder client. We need to
 352 ensure that this does not cause data loss, as far as possible.
 353
 354 An important constraint on the design is that on Windows, it is
 355 not possible to rename a file to the same name as an existing
 356 file in that directory. Also, on Windows it may not be possible to
 357 delete or rename a file that has been opened by another process
 358 (depending on the sharing flags specified by that process).
 359 Therefore we need to consider carefully how to handle failure
 360 conditions.
 361
 362 In our proposed design, Alice's Magic Folder client follows
 363 this procedure for an overwrite in response to a remote change:
 364
 365 1. Write a temporary file, say ``.foo.tmp``.
 366 2. Use the procedure described in the `Fire Dragons_` section
 367    to obtain an initial classification as an overwrite or a
 368    conflict. (This takes as input the ``last_downloaded_uri``
 369    field from the directory entry of the changed ``foo``.)
 370 3. Set the ``mtime`` of the replacement file to be *T* seconds
 371    before the current local time.
 372 4. Perform a ''file replacement'' operation (explained below)
 373    with backup filename ``foo.backup``, replaced file ``foo``,
 374    and replacement file ``.foo.tmp``. If any step of this
 375    operation fails, reclassify as a conflict and stop.
 376
 377 To reclassify as a conflict, attempt to rename ``.foo.tmp`` to
 378 ``foo.conflicted``, suppressing errors.
 379
 380 The implementation of file replacement differs between Unix
 381 and Windows. On Unix, it can be implemented as follows:
 382
 383 * 4a. Set the permissions of the replacement file to be the
 384   same as the replaced file, bitwise-or'd with octal 600
 385   (``rw-------``).
 386 * 4b. Attempt to move the replaced file (``foo``) to the
 387   backup filename (``foo.backup``).
 388 * 4c. Attempt to create a hard link at the replaced filename
 389   (``foo``) pointing to the replacement file (``.foo.tmp``).
 390 * 4d. Attempt to unlink the replacement file (``.foo.tmp``),
 391   suppressing errors.
 392
 393 Note that, if there is no conflict, the entry for ``foo``
 394 recorded in the `magic folder db`_ will reflect the ``mtime``
 395 set in step 3. The link operation in step 4c will cause an
 396 ``IN_CREATE`` event for ``foo``, but this will not trigger an
 397 upload, because the metadata recorded in the database entry
 398 will exactly match the metadata for the file's inode on disk.
 399 (The two hard links — ``foo`` and, while it still exists,
 400 ``.foo.tmp`` — share the same inode and therefore the same
 401 metadata.)
 402
 403 .. _`magic folder db`: filesystem_integration.rst#local-scanning-and-database
 404
 405 On Windows, file replacement can be implemented as a single
 406 call to the `ReplaceFileW`_ API (with the
 407 ``REPLACEFILE_IGNORE_MERGE_ERRORS`` flag).
 408
 409 Similar to the Unix case, the `ReplaceFileW`_ operation will
 410 cause a change notification for ``foo``. The replaced ``foo``
 411 has the same ``mtime`` as the replacement file, and so this
 412 notification will not trigger an unwanted upload.
 413
 414 .. _`ReplaceFileW`: https://msdn.microsoft.com/en-us/library/windows/desktop/aa365512%28v=vs.85%29.aspx
 415
 416 To determine whether this procedure adequately protects against data
 417 loss, we need to consider what happens if another process attempts to
 418 update ``foo``, for example by renaming ``foo.other`` to ``foo``.
 419 This requires us to analyze all possible interleavings between the
 420 operations performed by the Magic Folder client and the other process.
 421 (Note that atomic operations on a directory are totally ordered.)
 422 The set of possible interleavings differs between Windows and Unix.
 423
 424 On Unix, we have:
 425
 426 * Interleaving A: the other process' rename precedes our rename in
 427   step 4b, and we get an ``IN_MOVED_TO`` event for its rename by
 428   step 2. Then we reclassify as a conflict; its changes end up at
 429   ``foo`` and ours end up at ``foo.conflicted``. This avoids data
 430   loss.
 431
 432 * Interleaving B: its rename precedes ours in step 4b, and we do
 433   not get an event for its rename by step 2. Its changes end up at
 434   ``foo.backup``, and ours end up at ``foo`` after being linked there
 435   in step 4c. This avoids data loss.
 436
 437 * Interleaving C: its rename happens between our rename in step 4b,
 438   and our link operation in step 4c of the file replacement. The
 439   latter fails with an ``EEXIST`` error because ``foo`` already
 440   exists. We reclassify as a conflict; the old version ends up at
 441   ``foo.backup``, the other process' changes end up at ``foo``, and
 442   ours at ``foo.conflicted``. This avoids data loss.
 443
 444 * Interleaving D: its rename happens after our link in step 4c,
 445   and causes an ``IN_MOVED_TO`` event for ``foo``. Its rename also
 446   changes the ``mtime`` for ``foo`` so that it is different from
 447   the ``mtime`` calculated in step 3, and therefore different
 448   from the metadata recorded for ``foo`` in the magic folder db.
 449   (Assuming no system clock changes, its rename will set an ``mtime``
 450   timestamp corresponding to a time after step 4c, which is not
 451   equal to the timestamp *T* seconds before step 4a, provided that
 452   *T* seconds is sufficiently greater than the timestamp granularity.)
 453   Therefore, an upload will be triggered for ``foo`` after its
 454   change, which is correct and avoids data loss.
 455
 456 On Windows, the internal implementation of `ReplaceFileW`_ is similar
 457 to what we have described above for Unix; it works like this:
 458
 459 * 4a′. Copy metadata (which does not include ``mtime``) from the
 460   replaced file (``foo``) to the replacement file (``.foo.tmp``).
 461
 462 * 4b′. Attempt to move the replaced file (``foo``) onto the
 463   backup filename (``foo.backup``), deleting the latter if it
 464   already exists.
 465
 466 * 4c′. Attempt to move the replacement file (``.foo.tmp``) to the
 467   replaced filename (``foo``); fail if the destination already
 468   exists.
 469
 470 Notice that this is essentially the same as the algorithm we use
 471 for Unix, but steps 4c and 4d on Unix are combined into a single
 472 step 4c′. (If there is a failure at steps 4c′ after step 4b′ has
 473 completed, the `ReplaceFileW`_ call will fail with return code
 474 ``ERROR_UNABLE_TO_MOVE_REPLACEMENT_2``. However, it is still
 475 preferable to use this API over two `MoveFileExW`_ calls, because
 476 it retains the attributes and ACLs of ``foo`` where possible.)
 477
 478 However, on Windows the other application will not be able to
 479 directly rename ``foo.other`` onto ``foo`` (which would fail because
 480 the destination already exists); it will have to rename or delete
 481 ``foo`` first. Without loss of generality, let's say ``foo`` is
 482 deleted. This complicates the interleaving analysis, because we
 483 have two operations done by the other process interleaving with
 484 three done by the magic folder process (rather than one operation
 485 interleaving with four as on Unix). The cases are:
 486
 487 * Interleaving A′: the other process' deletion of ``foo`` and its
 488   rename of ``foo.other`` to ``foo`` both precede our rename in
 489   step 4b. We get an event corresponding to its rename by step 2.
 490   Then we reclassify as a conflict; its changes end up at ``foo``
 491   and ours end up at ``foo.conflicted``. This avoids data loss.
 492
 493 * Interleaving B′: the other process' deletion of ``foo`` and its
 494   rename of ``foo.other`` to ``foo`` both precede our rename in
 495   step 4b. We do not get an event for its rename by step 2.
 496   Its changes end up at ``foo.backup``, and ours end up at ``foo``
 497   after being moved there in step 4c′. This avoids data loss.
 498
 499 * Interleaving C′: the other process' deletion of ``foo`` precedes
 500   our rename of ``foo`` to ``foo.backup`` done by `ReplaceFileW`_,
 501   but its rename of ``foo.other`` to ``foo`` does not, so we get
 502   an ``ERROR_FILE_NOT_FOUND`` error from `ReplaceFileW`_ indicating
 503   that the replaced file does not exist. Then we reclassify as a
 504   conflict; the other process' changes end up at ``foo`` (after
 505   it has renamed ``foo.other`` to ``foo``) and our changes end up
 506   at ``foo.conflicted``. This avoids data loss.
 507
 508 * Interleaving D′: the other process' deletion and/or rename happen
 509   during the call to `ReplaceFileW`_, causing the latter to fail.
 510   There are two subcases:
 511
 512   * if the error is ``ERROR_UNABLE_TO_MOVE_REPLACEMENT_2``, then
 513     ``foo`` is renamed to ``foo.backup`` and ``.foo.tmp`` remains
 514     at its original name after the call.
 515   * for all other errors, ``foo`` and ``.foo.tmp`` both remain at
 516     their original names after the call.
 517
 518   In both subcases, we reclassify as a conflict and rename ``.foo.tmp``
 519   to ``foo.conflicted``. This avoids data loss.
 520
 521 * Interleaving E′: the other process' deletion of ``foo`` and attempt
 522   to rename ``foo.other`` to ``foo`` both happen after all internal
 523   operations of `ReplaceFileW`_ have completed. This causes deletion
 524   and rename events for ``foo`` (which will in practice be merged due
 525   to the pending delay, although we don't rely on that for correctness).
 526   The rename also changes the ``mtime`` for ``foo`` so that it is
 527   different from the ``mtime`` calculated in step 3, and therefore
 528   different from the metadata recorded for ``foo`` in the magic folder
 529   db. (Assuming no system clock changes, its rename will set an
 530   ``mtime`` timestamp corresponding to a time after the internal
 531   operations of `ReplaceFileW`_ have completed, which is not equal to
 532   the timestamp *T* seconds before `ReplaceFileW`_ is called, provided
 533   that *T* seconds is sufficiently greater than the timestamp
 534   granularity.) Therefore, an upload will be triggered for ``foo``
 535   after its change, which is correct and avoids data loss.
 536
 537 .. _`MoveFileExW`: https://msdn.microsoft.com/en-us/library/windows/desktop/aa365240%28v=vs.85%29.aspx
 538
 539 We also need to consider what happens if another process opens ``foo``
 540 and writes to it directly, rather than renaming another file onto it:
 541
 542 * On Unix, open file handles refer to inodes, not paths. If the other
 543   process opens ``foo`` before it has been renamed to ``foo.backup``,
 544   and then closes the file, changes will have been written to the file
 545   at the same inode, even if that inode is now linked at ``foo.backup``.
 546   This avoids data loss.
 547
 548 * On Windows, we have two subcases, depending on whether the sharing
 549   flags specified by the other process when it opened its file handle
 550   included ``FILE_SHARE_DELETE``. (This flag covers both deletion and
 551   rename operations.)
 552
 553   i.  If the sharing flags *do not* allow deletion/renaming, the
 554       `ReplaceFileW`_ operation will fail without renaming ``foo``.
 555       In this case we will end up with ``foo`` changed by the other
 556       process, and the downloaded file still in ``foo.tmp``.
 557       This avoids data loss.
 558
 559   ii. If the sharing flags *do* allow deletion/renaming, then
 560       data loss or corruption may occur. This is unavoidable and
 561       can be attributed to other process making a poor choice of
 562       sharing flags (either explicitly if it used `CreateFile`_, or
 563       via whichever higher-level API it used).
 564
 565 .. _`CreateFile`: https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858%28v=vs.85%29.aspx
 566
 567 Note that it is possible that another process tries to open the file
 568 between steps 4b and 4c (or 4b′ and 4c′ on Windows). In this case the
 569 open will fail because ``foo`` does not exist. Nevertheless, no data
 570 will be lost, and in many cases the user will be able to retry the
 571 operation.
 572
 573 Above we only described the case where the download was initially
 574 classified as an overwrite. If it was classed as a conflict, the
 575 procedure is the same except that we choose a unique filename
 576 for the conflicted file (say, ``foo.conflicted_unique``). We write
 577 the new contents to ``.foo.tmp`` and then rename it to
 578 ``foo.conflicted_unique`` in such a way that the rename will fail
 579 if the destination already exists. (On Windows this is a simple
 580 rename; on Unix it can be implemented as a link operation followed
 581 by an unlink, similar to steps 4c and 4d above.) If this fails
 582 because another process wrote ``foo.conflicted_unique`` after we
 583 chose the filename, then we retry with a different filename.
 584
 585
 586 Read/download collisions
 587 ~~~~~~~~~~~~~~~~~~~~~~~~
 588
 589 A *read/download collision* occurs when another program reads
 590 from ``foo`` in the local filesystem, concurrently with the new
 591 version being written by the Magic Folder client. We want to
 592 ensure that any successful attempt to read the file by the other
 593 program obtains a consistent view of its contents.
 594
 595 On Unix, the above procedure for writing downloads is sufficient
 596 to achieve this. There are three cases:
 597
 598 * A. The other process opens ``foo`` for reading before it is
 599   renamed to ``foo.backup``. Then the file handle will continue to
 600   refer to the old file across the rename, and the other process
 601   will read the old contents.
 602
 603 * B. The other process attempts to open ``foo`` after it has been
 604   renamed to ``foo.backup``, and before it is linked in step c.
 605   The open call fails, which is acceptable.
 606
 607 * C. The other process opens ``foo`` after it has been linked to
 608   the new file. Then it will read the new contents.
 609
 610 On Windows, the analysis is very similar, but case A′ needs to
 611 be split into two subcases, depending on the sharing mode the other
 612 process uses when opening the file for reading:
 613
 614 * A′. The other process opens ``foo`` before the Magic Folder
 615   client's attempt to rename ``foo`` to ``foo.backup`` (as part
 616   of the implementation of `ReplaceFileW`_). The subcases are:
 617
 618   i.  The other process uses sharing flags that deny deletion and
 619       renames. The `ReplaceFileW`_ call fails, and the download is
 620       reclassified as a conflict. The downloaded file ends up at
 621       ``foo.conflicted``, which is correct.
 622
 623   ii. The other process uses sharing flags that allow deletion
 624       and renames. The `ReplaceFileW`_ call succeeds, and the
 625       other process reads inconsistent data. This can be attributed
 626       to a poor choice of sharing flags by the other process.
 627
 628 * B′. The other process attempts to open ``foo`` at the point
 629   during the `ReplaceFileW`_ call where it does not exist.
 630   The open call fails, which is acceptable.
 631
 632 * C′. The other process opens ``foo`` after it has been linked to
 633   the new file. Then it will read the new contents.
 634
 635
 636 For both write/download and read/download collisions, we have
 637 considered only interleavings with a single other process, and
 638 only the most common possibilities for the other process'
 639 interaction with the file. If multiple other processes are
 640 involved, or if a process performs operations other than those
 641 considered, then we cannot say much about the outcome in general;
 642 however, we believe that such cases will be much less common.
 643
 644
 645
 646 Fire Dragons: Distinguishing conflicts from overwrites
 647 ''''''''''''''''''''''''''''''''''''''''''''''''''''''
 648
 649 When synchronizing a file that has changed remotely, the Magic Folder
 650 client needs to distinguish between overwrites, in which the remote
 651 side was aware of your most recent version and overwrote it with a
 652 new version, and conflicts, in which the remote side was unaware of
 653 your most recent version when it published its new version. Those two
 654 cases have to be handled differently — the latter needs to be raised
 655 to the user as an issue the user will have to resolve and the former
 656 must not bother the user.
 657
 658 For example, suppose that Alice's Magic Folder client sees a change
 659 to ``foo`` in Bob's DMD. If the version it downloads from Bob's DMD
 660 is "based on" the version currently in Alice's local filesystem at
 661 the time Alice's client attempts to write the downloaded file, then
 662 it is an overwrite. Otherwise it is initially classified as a
 663 conflict.
 664
 665 This initial classification is used by the procedure for writing a
 666 file described in the `Earth Dragons`_ section above. As explained
 667 in that section, we may reclassify an overwrite as a conflict if an
 668 error occurs during the write procedure.
 669
 670 .. _`Earth Dragons`: #earth-dragons-collisions-between-local-filesystem-operations-and-downloads
 671
 672 In order to implement this policy, we need to specify how the
 673 "based on" relation between file versions is recorded and updated.
 674
 675 We propose to record this information:
 676
 677 * in the `magic folder db`_, for local files;
 678 * in the Tahoe-LAFS directory metadata, for files stored in the
 679   Magic Folder.
 680
 681 In the magic folder db we will add a *last-downloaded record*,
 682 consisting of ``last_downloaded_uri`` and ``last_downloaded_timestamp``
 683 fields, for each path stored in the database. Whenever a Magic Folder
 684 client downloads a file, it stores the downloaded version's URI and
 685 the current local timestamp in this record. Since only immutable
 686 files are used, the URI will be an immutable file URI, which is
 687 deterministically and uniquely derived from the file contents and
 688 the Tahoe-LAFS node's `convergence secret`_.
 689
 690 (Note that the last-downloaded record is updated regardless of
 691 whether the download is an overwrite or a conflict. The rationale
 692 for this to avoid "conflict loops" between clients, where every
 693 new version after the first conflict would be considered as another
 694 conflict.)
 695
 696 .. _`convergence secret`: https://tahoe-lafs.org/trac/tahoe-lafs/browser/docs/convergence-secret.rst
 697
 698 Later, in response to a local filesystem change at a given path, the
 699 Magic Folder client reads the last-downloaded record associated with
 700 that path (if any) from the database and then uploads the current
 701 file. When it links the uploaded file into its client DMD, it
 702 includes the ``last_downloaded_uri`` field in the metadata of the
 703 directory entry, overwriting any existing field of that name. If
 704 there was no last-downloaded record associated with the path, this
 705 field is omitted.
 706
 707 Note that ``last_downloaded_uri`` field does *not* record the URI of
 708 the uploaded file (which would be redundant); it records the URI of
 709 the last download before the local change that caused the upload.
 710 The field will be absent if the file has never been downloaded by
 711 this client (i.e. if it was created on this client and no change
 712 by any other client has been detected).
 713
 714 A possible refinement also takes into account the
 715 ``last_downloaded_timestamp`` field from the magic folder db, and
 716 compares it to the timestamp of the change that caused the upload
 717 (which should be later, assuming no system clock changes).
 718 If the duration between these timestamps is very short, then we
 719 are uncertain about whether the process on Bob's system that wrote
 720 the local file could have taken into account the last download.
 721 We can use this information to be conservative about treating
 722 changes as conflicts. So, if the duration is less than a configured
 723 threshold, we omit the ``last_downloaded_uri`` field from the
 724 metadata. This will have the effect of making other clients treat
 725 this change as a conflict whenever they already have a copy of the
 726 file.
 727
 728 Now we are ready to describe the algorithm for determining whether a
 729 download for the file ``foo`` is an overwrite or a conflict (refining
 730 step 2 of the procedure from the `Earth Dragons`_ section).
 731
 732 Let ``last_downloaded_uri`` be the field of that name obtained from
 733 the directory entry metadata for ``foo`` in Bob's DMD (this field
 734 may be absent). Then the algorithm is:
 735
 736 * 2a. If Alice has no local copy of ``foo``, classify as an overwrite.
 737
 738 * 2b. Otherwise, "stat" ``foo`` to get its *current statinfo* (size
 739   in bytes, ``mtime``, and ``ctime``).
 740
 741 * 2c. Read the following information for the path ``foo`` from the
 742   local magic folder db:
 743
 744   * the *last-uploaded statinfo*, if any (this is the size in
 745     bytes, ``mtime``, and ``ctime`` stored in the ``local_files``
 746     table when the file was last uploaded);
 747   * the ``filecap`` field of the ``caps`` table for this file,
 748     which is the URI under which the file was last uploaded.
 749     Call this ``last_uploaded_uri``.
 750
 751 * 2d. If any of the following are true, then classify as a conflict:
 752
 753   * there are pending notifications of changes to ``foo``;
 754   * the last-uploaded statinfo is either absent, or different
 755     from the current statinfo;
 756   * either ``last_downloaded_uri`` or ``last_uploaded_uri``
 757     (or both) are absent, or they are different.
 758
 759   Otherwise, classify as an overwrite.
 760
 761
 762 Air Dragons: Collisions between local writes and uploads
 763 ''''''''''''''''''''''''''''''''''''''''''''''''''''''''
 764
 765 Short of filesystem-specific features on Unix or the `shadow copy service`_
 766 on Windows (which is per-volume and therefore difficult to use in this
 767 context), there is no way to *read* the whole contents of a file
 768 atomically. Therefore, when we read a file in order to upload it, we
 769 may read an inconsistent version if it was also being written locally.
 770
 771 .. _`shadow copy service`: https://technet.microsoft.com/en-us/library/ee923636%28v=ws.10%29.aspx
 772
 773 A well-behaved application can avoid this problem for its writes:
 774
 775 * On Unix, if another process modifies a file by renaming a temporary
 776   file onto it, then we will consistently read either the old contents
 777   or the new contents.
 778 * On Windows, if the other process uses sharing flags to deny reads
 779   while it is writing a file, then we will consistently read either
 780   the old contents or the new contents, unless a sharing error occurs.
 781   In the case of a sharing error we should retry later, up to a
 782   maximum number of retries.
 783
 784 In the case of a not-so-well-behaved application writing to a file
 785 at the same time we read from it, the magic folder will still be
 786 eventually consistent, but inconsistent versions may be visible to
 787 other users' clients.
 788
 789 In Objective 2 we implemented a delay, called the *pending delay*,
 790 after the notification of a filesystem change and before the file is
 791 read in order to upload it (Tahoe-LAFS ticket `#1440`_). If another
 792 change notification occurs within the pending delay time, the delay
 793 is restarted. This helps to some extent because it means that if
 794 files are written more quickly than the pending delay and less
 795 frequently than the pending delay, we shouldn't encounter this
 796 inconsistency.
 797
 798 .. _`#1440`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1440
 799
 800 The likelihood of inconsistency could be further reduced, even for
 801 writes by not-so-well-behaved applications, by delaying the actual
 802 upload for a further period —called the *stability delay*— after the
 803 file has finished being read. If a notification occurs between the
 804 end of the pending delay and the end of the stability delay, then
 805 the read would be aborted and the notification requeued.
 806
 807 This would have the effect of ensuring that no write notifications
 808 have been received for the file during a time window that brackets
 809 the period when it was being read, with margin before and after
 810 this period defined by the pending and stability delays. The delays
 811 are intended to account for asynchronous notification of events, and
 812 caching in the filesystem.
 813
 814 Note however that we cannot guarantee that the delays will be long
 815 enough to prevent inconsistency in any particular case. Also, the
 816 stability delay would potentially affect performance significantly
 817 because (unlike the pending delay) it is not overlapped when there
 818 are multiple files on the upload queue. This performance impact
 819 could be mitigated by uploading files in parallel where possible
 820 (Tahoe-LAFS ticket `#1459`_).
 821
 822 We have not yet decided whether to implement the stability delay, and
 823 it is not planned to be implemented for the OTF objective 4 milestone.
 824 Ticket `#2431`_ has been opened to track this idea.
 825
 826 .. _`#1459`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1459
 827 .. _`#2431`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/2431
 828
 829 Note that the situation of both a local process and the Magic Folder
 830 client reading a file at the same time cannot cause any inconsistency.
 831
 832
 833 Water Dragons: Handling deletion and renames
 834 ''''''''''''''''''''''''''''''''''''''''''''
 835
 836 Deletion of a file
 837 ~~~~~~~~~~~~~~~~~~
 838
 839 When a file is deleted from the filesystem of a Magic Folder client,
 840 the most intuitive behavior is for it also to be deleted under that
 841 name from other clients. To avoid data loss, the other clients should
 842 actually rename their copies to a backup (``*.old``) filename.
 843
 844 It would not be sufficient for a Magic Folder client that deletes
 845 a file to implement this simply by removing the directory entry from
 846 its DMD. Indeed, the entry may not exist in the client's DMD if it
 847 has never previously changed the file.
 848
 849 Instead, the client links a zero-length file into its DMD and sets
 850 ``deleted: true`` in the directory entry metadata. Other clients
 851 take this as a signal to rename their copies to the backup filename.
 852
 853 Note that the entry for this zero-length file has a version number as
 854 usual, and later versions may restore the file.
 855
 856 When a Magic Folder client restarts, we can detect files that had
 857 been downloaded but were deleted while it was not running, because
 858 their paths will have last-downloaded records in the magic folder db
 859 without any corresponding local file.
 860
 861 Deletion of a directory
 862 ~~~~~~~~~~~~~~~~~~~~~~~
 863
 864 Local filesystems (unlike a Tahoe-LAFS filesystem) normally cannot
 865 unlink a directory that has any remaining children. Therefore a
 866 Magic Folder client cannot delete local copies of directories in
 867 general, because they will typically contain backup files. This must
 868 be done manually on each client if desired.
 869
 870 Nevertheless, a Magic Folder client that deletes a directory should
 871 set ``deleted: true`` on the metadata entry for the corresponding
 872 zero-length file. This avoids the directory being recreated after
 873 it has been manually deleted from a client.
 874
 875 Renaming
 876 ~~~~~~~~
 877
 878 It is sufficient to handle renaming of a file by treating it as a
 879 deletion and an addition under the new name.
 880
 881 This also applies to directories, although users may find the
 882 resulting behavior unintuitive: all of the files under the old name
 883 will be renamed to backup filenames, and a new directory structure
 884 created under the new name. We believe this is the best that can be
 885 done without imposing unreasonable implementation complexity.
 886
 887
 888 Summary
 889 -------
 890
 891 This completes the design of remote-to-local synchronization.
 892 We realize that it may seem very complicated. Anecdotally, proprietary
 893 filesystem synchronization designs we are aware of, such as Dropbox,
 894 are said to incur similar or greater design complexity.