docs/proposed/magic-folder/remote-to-local-sync.rst

   1 Magic Folder design for remote-to-local sync
   2 ============================================
   3
   4 Scope
   5 -----
   6
   7 In this Objective we will design remote-to-local synchronization:
   8
   9 * How to efficiently determine which objects (files and directories) have
  10   to be downloaded in order to bring the current local filesystem into sync
  11   with the newly-discovered version of the remote filesystem.
  12 * How to distinguish overwrites, in which the remote side was aware of
  13   your most recent version and overwrote it with a new version, from
  14   conflicts, in which the remote side was unaware of your most recent
  15   version when it published its new version. The latter needs to be raised
  16   to the user as an issue the user will have to resolve and the former must
  17   not bother the user.
  18 * How to overwrite the (stale) local versions of those objects with the
  19   newly acquired objects, while preserving backed-up versions of those
  20   overwritten objects in case the user didn't want this overwrite and wants
  21   to recover the old version.
  22
  23 Tickets on the Tahoe-LAFS trac with the `otf-magic-folder-objective4`_
  24 keyword are within the scope of the remote-to-local synchronization
  25 design.
  26
  27 .. _otf-magic-folder-objective4: https://tahoe-lafs.org/trac/tahoe-lafs/query?status=!closed&keywords=~otf-magic-folder-objective4
  28
  29
  30 Glossary
  31 ''''''''
  32
  33 Object: a file or directory
  34
  35 DMD: distributed mutable directory
  36
  37 Folder: an abstract directory that is synchronized between clients.
  38 (A folder is not the same as the directory corresponding to it on
  39 any particular client, nor is it the same as a DMD.)
  40
  41 Collective: the set of clients subscribed to a given Magic Folder.
  42
  43 Descendant: a direct or indirect child in a directory or folder tree
  44
  45 Subfolder: a folder that is a descendant of a magic folder
  46
  47 Subpath: the path from a magic folder to one of its descendants
  48
  49 Write: a modification to a local filesystem object by a client
  50
  51 Read: a read from a local filesystem object by a client
  52
  53 Upload: an upload of a local object to the Tahoe-LAFS file store
  54
  55 Download: a download from the Tahoe-LAFS file store to a local object
  56
  57 Pending notification: a local filesystem change that has been detected
  58 but not yet processed.
  59
  60
  61 Representing the Magic Folder in Tahoe-LAFS
  62 -------------------------------------------
  63
  64 Unlike the local case where we use inotify or ReadDirectoryChangesW to
  65 detect filesystem changes, we have no mechanism to register a monitor for
  66 changes to a Tahoe-LAFS directory. Therefore, we must periodically poll
  67 for changes.
  68
  69 An important constraint on the solution is Tahoe-LAFS' "`write
  70 coordination directive`_", which prohibits concurrent writes by different
  71 storage clients to the same mutable object:
  72
  73     Tahoe does not provide locking of mutable files and directories. If
  74     there is more than one simultaneous attempt to change a mutable file
  75     or directory, then an UncoordinatedWriteError may result. This might,
  76     in rare cases, cause the file or directory contents to be accidentally
  77     deleted.  The user is expected to ensure that there is at most one
  78     outstanding write or update request for a given file or directory at
  79     a time.  One convenient way to accomplish this is to make a different
  80     file or directory for each person or process that wants to write.
  81
  82 .. _`write coordination directive`: ../../write_coordination.rst
  83
  84 Since it is a goal to allow multiple users to write to a Magic Folder,
  85 if the write coordination directive remains the same as above, then we
  86 will not be able to implement the Magic Folder as a single Tahoe-LAFS
  87 DMD. In general therefore, we will have multiple DMDs —spread across
  88 clients— that together represent the Magic Folder. Each client in a
  89 Magic Folder collective polls the other clients' DMDs in order to detect
  90 remote changes.
  91
  92 Six possible designs were considered for the representation of subfolders
  93 of the Magic Folder:
  94
  95 1. All subfolders written by a given Magic Folder client are collapsed
  96 into a single client DMD, containing immutable files. The child name of
  97 each file encodes the full subpath of that file relative to the Magic
  98 Folder.
  99
 100 2. The DMD tree under a client DMD is a direct copy of the folder tree
 101 written by that client to the Magic Folder. Not all subfolders have
 102 corresponding DMDs; only those to which that client has written files or
 103 child subfolders.
 104
 105 3. The directory tree under a client DMD is a ``tahoe backup`` structure
 106 containing immutable snapshots of the folder tree written by that client
 107 to the Magic Folder. As in design 2, only objects written by that client
 108 are present.
 109
 110 4. *Each* client DMD contains an eventually consistent mirror of all
 111 files and folders written by *any* Magic Folder client. Thus each client
 112 must also copy changes made by other Magic Folder clients to its own
 113 client DMD.
 114
 115 5. *Each* client DMD contains a ``tahoe backup`` structure containing
 116 immutable snapshots of all files and folders written by *any* Magic
 117 Folder client. Thus each client must also create another snapshot in its
 118 own client DMD when changes are made by another client. (It can potentially
 119 batch changes, subject to latency requirements.)
 120
 121 6. The write coordination problem is solved by implementing `two-phase
 122 commit`_. Then, the representation consists of a single DMD tree which is
 123 written by all clients.
 124
 125 .. _`two-phase commit`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1755
 126
 127 Here is a summary of advantages and disadvantages of each design:
 128
 129 +----------------------------+
 130 | Key                        |
 131 +=======+====================+
 132 | \+\+  | major advantage    |
 133 +-------+--------------------+
 134 | \+    | minor advantage    |
 135 +-------+--------------------+
 136 | ‒     | minor disadvantage |
 137 +-------+--------------------+
 138 | ‒ ‒   | major disadvantage |
 139 +-------+--------------------+
 140 | ‒ ‒ ‒ | showstopper        |
 141 +-------+--------------------+
 142
 143
 144 123456+: All designs have the property that a recursive add-lease
 145 operation starting from a *collective directory* containing all of
 146 the client DMDs, will find all of the files and directories used in
 147 the Magic Folder representation. Therefore the representation is
 148 compatible with `garbage collection`_, even when a pre-Magic-Folder
 149 client does the lease marking.
 150
 151 .. _`garbage collection`: https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/garbage-collection.rst
 152
 153 123456+: All designs avoid "breaking" pre-Magic-Folder clients that read
 154 a directory or file that is part of the representation.
 155
 156 456++: Only these designs allow a readcap to one of the client
 157 directories —or one of their subdirectories— to be directly shared
 158 with other Tahoe-LAFS clients (not necessarily Magic Folder clients),
 159 so that such a client sees all of the contents of the Magic Folder.
 160 Note that this was not a requirement of the OTF proposal, although it
 161 is useful.
 162
 163 135+: A Magic Folder client has only one mutable Tahoe-LAFS object to
 164 monitor per other client. This minimizes communication bandwidth for
 165 polling, or alternatively the latency possible for a given polling
 166 bandwidth.
 167
 168 1236+: A client does not need to make changes to its own DMD that repeat
 169 changes that another Magic Folder client had previously made. This reduces
 170 write bandwidth and complexity.
 171
 172 1‒: If the Magic Folder has many subfolders, their files will all be
 173 collapsed into the same DMD, which could get quite large. In practice a
 174 single DMD can easily handle the number of files expected to be written
 175 by a client, so this is unlikely to be a significant issue.
 176
 177 123‒ ‒: In these designs, the set of files in a Magic Folder is
 178 represented as the union of the files in all client DMDs. However,
 179 when a file is modified by more than one client, it will be linked
 180 from multiple client DMDs. We therefore need a mechanism, such as a
 181 version number or a monotonically increasing timestamp, to determine
 182 which copy takes priority.
 183
 184 35‒ ‒: When a Magic Folder client detects a remote change, it must
 185 traverse an immutable directory structure to see what has changed.
 186 Completely unchanged subtrees will have the same URI, allowing some of
 187 this traversal to be shortcutted.
 188
 189 24‒ ‒ ‒: When a Magic Folder client detects a remote change, it must
 190 traverse a mutable directory structure to see what has changed. This is
 191 more complex and less efficient than traversing an immutable structure,
 192 because shortcutting is not possible (each DMD retains the same URI even
 193 if a descendant object has changed), and because the structure may change
 194 while it is being traversed. Also the traversal needs to be robust
 195 against cycles, which can only occur in mutable structures.
 196
 197 45‒ ‒: When a change occurs in one Magic Folder client, it will propagate
 198 to all the other clients. Each client will therefore see multiple
 199 representation changes for a single logical change to the Magic Folder
 200 contents, and must suppress the duplicates. This is particularly
 201 problematic for design 4 where it interacts with the preceding issue.
 202
 203 4‒ ‒ ‒, 5‒ ‒: There is the potential for client DMDs to get "out of sync"
 204 with each other, potentially for long periods if errors occur. Thus each
 205 client must be able to "repair" its client directory (and its
 206 subdirectory structure) concurrently with performing its own writes. This
 207 is a significant complexity burden and may introduce failure modes that
 208 could not otherwise happen.
 209
 210 6‒ ‒ ‒: While two-phase commit is a well-established protocol, its
 211 application to Tahoe-LAFS requires significant design work, and may still
 212 leave some corner cases of the write coordination problem unsolved.
 213
 214
 215 +------------------------------------------------+-----------------------------------------+
 216 | Design Property                                | Designs Proposed                        |
 217 +================================================+======+======+======+======+======+======+
 218 | **advantages**                                 | *1*  | *2*  | *3*  | *4*  | *5*  | *6*  |
 219 +------------------------------------------------+------+------+------+------+------+------+
 220 | Compatible with garbage collection             |\+    |\+    |\+    |\+    |\+    |\+    |
 221 +------------------------------------------------+------+------+------+------+------+------+
 222 | Does not break old clients                     |\+    |\+    |\+    |\+    |\+    |\+    |
 223 +------------------------------------------------+------+------+------+------+------+------+
 224 | Allows direct sharing                          |      |      |      |\+\+  |\+\+  |\+\+  |
 225 +------------------------------------------------+------+------+------+------+------+------+
 226 | Efficient use of bandwidth                     |\+    |      |\+    |      |\+    |      |
 227 +------------------------------------------------+------+------+------+------+------+------+
 228 | No repeated changes                            |\+    |\+    |\+    |      |      |\+    |
 229 +------------------------------------------------+------+------+------+------+------+------+
 230 | **disadvantages**                              | *1*  | *2*  | *3*  | *4*  | *5*  | *6*  |
 231 +------------------------------------------------+------+------+------+------+------+------+
 232 | Can result in large DMDs                       |‒     |      |      |      |      |      |
 233 +------------------------------------------------+------+------+------+------+------+------+
 234 | Need version number to determine priority      |‒ ‒   |‒ ‒   |‒ ‒   |      |      |      |
 235 +------------------------------------------------+------+------+------+------+------+------+
 236 | Must traverse immutable directory structure    |      |      |‒ ‒   |      |‒ ‒   |      |
 237 +------------------------------------------------+------+------+------+------+------+------+
 238 | Must traverse mutable directory structure      |      |‒ ‒ ‒ |      |‒ ‒ ‒ |      |      |
 239 +------------------------------------------------+------+------+------+------+------+------+
 240 | Must suppress duplicate representation changes |      |      |      |‒ ‒   |‒ ‒   |      |
 241 +------------------------------------------------+------+------+------+------+------+------+
 242 | "Out of sync" problem                          |      |      |      |‒ ‒ ‒ |‒ ‒   |      |
 243 +------------------------------------------------+------+------+------+------+------+------+
 244 | Unsolved design problems                       |      |      |      |      |      |‒ ‒ ‒ |
 245 +------------------------------------------------+------+------+------+------+------+------+
 246
 247
 248 Evaluation of designs
 249 '''''''''''''''''''''
 250
 251 Designs 2 and 3 have no significant advantages over design 1, while
 252 requiring higher polling bandwidth and greater complexity due to the need
 253 to create subdirectories. These designs were therefore rejected.
 254
 255 Design 4 was rejected due to the out-of-sync problem, which is severe
 256 and possibly unsolvable for mutable structures.
 257
 258 For design 5, the out-of-sync problem is still present but possibly
 259 solvable. However, design 5 is substantially more complex, less efficient
 260 in bandwidth/latency, and less scalable in number of clients and
 261 subfolders than design 1. It only gains over design 1 on the ability to
 262 share directory readcaps to the Magic Folder (or subfolders), which was
 263 not a requirement. It would be possible to implement this feature in
 264 future by switching to design 6.
 265
 266 For the time being, however, design 6 was considered out-of-scope for
 267 this project.
 268
 269 Therefore, design 1 was chosen. That is:
 270
 271     All subfolders written by a given Magic Folder client are collapsed
 272     into a single client DMD, containing immutable files. The child name
 273     of each file encodes the full subpath of that file relative to the
 274     Magic Folder.
 275
 276 Each directory entry in a DMD also stores a version number, so that the
 277 latest version of a file is well-defined when it has been modified by
 278 multiple clients.
 279
 280 To enable representing empty directories, a client that creates a
 281 directory should link a corresponding zero-length file in its DMD,
 282 at a name that ends with the encoded directory separator character.
 283
 284 We want to enable dynamic configuration of the membership of a Magic
 285 Folder collective, without having to reconfigure or restart each client
 286 when another client joins. To support this, we have a single collective
 287 directory that links to all of the client DMDs, named by their client
 288 nicknames. If the collective directory is mutable, then it is possible
 289 to change its contents in order to add clients. Note that a client DMD
 290 should not be unlinked from the collective directory unless all of its
 291 files are first copied to some other client DMD.
 292
 293 A client needs to be able to write to its own DMD, and read from other DMDs.
 294 To be consistent with the `Principle of Least Authority`_, each client's
 295 reference to its own DMD is a write capability, whereas its reference
 296 to the collective directory is a read capability. The latter transitively
 297 grants read access to all of the other client DMDs and the files linked
 298 from them, as required.
 299
 300 .. _`Principle of Least Authority`: http://www.eros-os.org/papers/secnotsep.pdf
 301
 302 Design and implementation of the user interface for maintaining this
 303 DMD structure and configuration will be addressed in Objectives 5 and 6.
 304
 305 During operation, each client will poll for changes on other clients
 306 at a predetermined frequency. On each poll, it will reread the collective
 307 directory (to allow for added or removed clients), and then read each
 308 client DMD linked from it.
 309
 310 "Hidden" files, and files with names matching the patterns used for backup,
 311 temporary, and conflicted files, will be ignored, i.e. not synchronized
 312 in either direction. A file is hidden if it has a filename beginning with
 313 "." (on any platform), or has the hidden or system attribute on Windows.
 314
 315
 316 Conflict Detection and Resolution
 317 ---------------------------------
 318
 319 The combination of local filesystems and distributed objects is
 320 an example of shared state concurrency, which is highly error-prone
 321 and can result in race conditions that are complex to analyze.
 322 Unfortunately we have no option but to use shared state in this
 323 situation.
 324
 325 We call the resulting design issues "dragons" (as in "Here be dragons"),
 326 which as a convenient mnemonic we have named after the classical
 327 Greek elements Earth, Fire, Air, and Water.
 328
 329 Note: all filenames used in the following sections are examples,
 330 and the filename patterns we use in the actual implementation may
 331 differ. The actual patterns will probably include timestamps, and
 332 for conflicted files, the nickname of the client that last changed
 333 the file.
 334
 335
 336 Earth Dragons: Collisions between local filesystem operations and downloads
 337 '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
 338
 339 Write/download collisions
 340 ~~~~~~~~~~~~~~~~~~~~~~~~~
 341
 342 Suppose that Alice's Magic Folder client is about to write a
 343 version of ``foo`` that it has downloaded in response to a remote
 344 change.
 345
 346 The criteria for distinguishing overwrites from conflicts are
 347 described later in the `Fire Dragons`_ section. Suppose that the
 348 remote change has been initially classified as an overwrite.
 349 (As we will see, it may be reclassified in some circumstances.)
 350
 351 .. _`Fire Dragons`: #fire-dragons-distinguishing-conflicts-from-overwrites
 352
 353 Note that writing a file that does not already have an entry in
 354 the `magic folder db`_ is initially classed as an overwrite.
 355
 356 A *write/download collision* occurs when another program writes
 357 to ``foo`` in the local filesystem, concurrently with the new
 358 version being written by the Magic Folder client. We need to
 359 ensure that this does not cause data loss, as far as possible.
 360
 361 An important constraint on the design is that on Windows, it is
 362 not possible to rename a file to the same name as an existing
 363 file in that directory. Also, on Windows it may not be possible to
 364 delete or rename a file that has been opened by another process
 365 (depending on the sharing flags specified by that process).
 366 Therefore we need to consider carefully how to handle failure
 367 conditions.
 368
 369 In our proposed design, Alice's Magic Folder client follows
 370 this procedure for an overwrite in response to a remote change:
 371
 372 1. Write a temporary file, say ``.foo.tmp``.
 373 2. Use the procedure described in the `Fire Dragons_` section
 374    to obtain an initial classification as an overwrite or a
 375    conflict. (This takes as input the ``last_downloaded_uri``
 376    field from the directory entry of the changed ``foo``.)
 377 3. Set the ``mtime`` of the replacement file to be *T* seconds
 378    before the current local time. Stat the replacement file
 379    to obtain its ``mtime`` and ``ctime`` as stored in the local
 380    filesystem, and update the file's last-seen statinfo in
 381    the magic folder db with this information. (Note that the
 382    retrieved ``mtime`` may differ from the one that was set due
 383    to rounding.)
 384 4. Perform a ''file replacement'' operation (explained below)
 385    with backup filename ``foo.backup``, replaced file ``foo``,
 386    and replacement file ``.foo.tmp``. If any step of this
 387    operation fails, reclassify as a conflict and stop.
 388
 389 To reclassify as a conflict, attempt to rename ``.foo.tmp`` to
 390 ``foo.conflicted``, suppressing errors.
 391
 392 The implementation of file replacement differs between Unix
 393 and Windows. On Unix, it can be implemented as follows:
 394
 395 * 4a. Stat the replaced path, and set the permissions of the
 396   replacement file to be the same as the replaced file,
 397   bitwise-or'd with octal 600 (``rw-------``). If the replaced
 398   file does not exist, set the permissions according to the
 399   user's umask. If there is a directory at the replaced path,
 400   fail.
 401 * 4b. Attempt to move the replaced file (``foo``) to the
 402   backup filename (``foo.backup``). If an ``ENOENT`` error
 403   occurs because the replaced file does not exist, ignore this
 404   error and continue with steps 4c and 4d.
 405 * 4c. Attempt to create a hard link at the replaced filename
 406   (``foo``) pointing to the replacement file (``.foo.tmp``).
 407 * 4d. Attempt to unlink the replacement file (``.foo.tmp``),
 408   suppressing errors.
 409
 410 Note that, if there is no conflict, the entry for ``foo``
 411 recorded in the `magic folder db`_ will reflect the ``mtime``
 412 set in step 3. The move operation in step 4b will cause a
 413 ``MOVED_FROM`` event for ``foo``, and the link operation in
 414 step 4c will cause an ``IN_CREATE`` event for ``foo``.
 415 However, these events will not trigger an upload, because they
 416 are guaranteed to be processed only after the file replacement
 417 has finished, at which point the last-seen statinfo recorded
 418 in the database entry will exactly match the metadata for the
 419 file's inode on disk. (The two hard links — ``foo`` and, while
 420 it still exists, ``.foo.tmp`` — share the same inode and
 421 therefore the same metadata.)
 422
 423 .. _`magic folder db`: filesystem_integration.rst#local-scanning-and-database
 424
 425 On Windows, file replacement can be implemented by a call to
 426 the `ReplaceFileW`_ API (with the
 427 ``REPLACEFILE_IGNORE_MERGE_ERRORS`` flag). If an error occurs
 428 because the replaced file does not exist, then we ignore this
 429 error and attempt to move the replacement file to the replaced
 430 file.
 431
 432 Similar to the Unix case, the `ReplaceFileW`_ operation will
 433 cause one or more change notifications for ``foo``. The replaced
 434 ``foo`` has the same ``mtime`` as the replacement file, and so any
 435 such notification(s) will not trigger an unwanted upload.
 436
 437 .. _`ReplaceFileW`: https://msdn.microsoft.com/en-us/library/windows/desktop/aa365512%28v=vs.85%29.aspx
 438
 439 To determine whether this procedure adequately protects against data
 440 loss, we need to consider what happens if another process attempts to
 441 update ``foo``, for example by renaming ``foo.other`` to ``foo``.
 442 This requires us to analyze all possible interleavings between the
 443 operations performed by the Magic Folder client and the other process.
 444 (Note that atomic operations on a directory are totally ordered.)
 445 The set of possible interleavings differs between Windows and Unix.
 446
 447 On Unix, for the case where the replaced file already exists, we have:
 448
 449 * Interleaving A: the other process' rename precedes our rename in
 450   step 4b, and we get an ``IN_MOVED_TO`` event for its rename by
 451   step 2. Then we reclassify as a conflict; its changes end up at
 452   ``foo`` and ours end up at ``foo.conflicted``. This avoids data
 453   loss.
 454
 455 * Interleaving B: its rename precedes ours in step 4b, and we do
 456   not get an event for its rename by step 2. Its changes end up at
 457   ``foo.backup``, and ours end up at ``foo`` after being linked there
 458   in step 4c. This avoids data loss.
 459
 460 * Interleaving C: its rename happens between our rename in step 4b,
 461   and our link operation in step 4c of the file replacement. The
 462   latter fails with an ``EEXIST`` error because ``foo`` already
 463   exists. We reclassify as a conflict; the old version ends up at
 464   ``foo.backup``, the other process' changes end up at ``foo``, and
 465   ours at ``foo.conflicted``. This avoids data loss.
 466
 467 * Interleaving D: its rename happens after our link in step 4c,
 468   and causes an ``IN_MOVED_TO`` event for ``foo``. Its rename also
 469   changes the ``mtime`` for ``foo`` so that it is different from
 470   the ``mtime`` calculated in step 3, and therefore different
 471   from the metadata recorded for ``foo`` in the magic folder db.
 472   (Assuming no system clock changes, its rename will set an ``mtime``
 473   timestamp corresponding to a time after step 4c, which is not
 474   equal to the timestamp *T* seconds before step 4a, provided that
 475   *T* seconds is sufficiently greater than the timestamp granularity.)
 476   Therefore, an upload will be triggered for ``foo`` after its
 477   change, which is correct and avoids data loss.
 478
 479 If the replaced file did not already exist, an ``ENOENT`` error
 480 occurs at step 4b, and we continue with steps 4c and 4d. The other
 481 process' rename races with our link operation in step 4c. If the
 482 other process wins the race then the effect is similar to
 483 Interleaving C, and if we win the race this it is similar to
 484 Interleaving D. Either case avoids data loss.
 485
 486
 487 On Windows, the internal implementation of `ReplaceFileW`_ is similar
 488 to what we have described above for Unix; it works like this:
 489
 490 * 4a′. Copy metadata (which does not include ``mtime``) from the
 491   replaced file (``foo``) to the replacement file (``.foo.tmp``).
 492
 493 * 4b′. Attempt to move the replaced file (``foo``) onto the
 494   backup filename (``foo.backup``), deleting the latter if it
 495   already exists.
 496
 497 * 4c′. Attempt to move the replacement file (``.foo.tmp``) to the
 498   replaced filename (``foo``); fail if the destination already
 499   exists.
 500
 501 Notice that this is essentially the same as the algorithm we use
 502 for Unix, but steps 4c and 4d on Unix are combined into a single
 503 step 4c′. (If there is a failure at steps 4c′ after step 4b′ has
 504 completed, the `ReplaceFileW`_ call will fail with return code
 505 ``ERROR_UNABLE_TO_MOVE_REPLACEMENT_2``. However, it is still
 506 preferable to use this API over two `MoveFileExW`_ calls, because
 507 it retains the attributes and ACLs of ``foo`` where possible.
 508 Also note that if the `ReplaceFileW`_ call fails with
 509 ``ERROR_FILE_NOT_FOUND`` because the replaced file does not exist,
 510 then the replacment operation ignores this error and continues with
 511 the equivalent of step 4c′, as on Unix.)
 512
 513 However, on Windows the other application will not be able to
 514 directly rename ``foo.other`` onto ``foo`` (which would fail because
 515 the destination already exists); it will have to rename or delete
 516 ``foo`` first. Without loss of generality, let's say ``foo`` is
 517 deleted. This complicates the interleaving analysis, because we
 518 have two operations done by the other process interleaving with
 519 three done by the magic folder process (rather than one operation
 520 interleaving with four as on Unix).
 521
 522 So on Windows, for the case where the replaced file already exists,
 523 we have:
 524
 525 * Interleaving A′: the other process' deletion of ``foo`` and its
 526   rename of ``foo.other`` to ``foo`` both precede our rename in
 527   step 4b. We get an event corresponding to its rename by step 2.
 528   Then we reclassify as a conflict; its changes end up at ``foo``
 529   and ours end up at ``foo.conflicted``. This avoids data loss.
 530
 531 * Interleaving B′: the other process' deletion of ``foo`` and its
 532   rename of ``foo.other`` to ``foo`` both precede our rename in
 533   step 4b. We do not get an event for its rename by step 2.
 534   Its changes end up at ``foo.backup``, and ours end up at ``foo``
 535   after being moved there in step 4c′. This avoids data loss.
 536
 537 * Interleaving C′: the other process' deletion of ``foo`` precedes
 538   our rename of ``foo`` to ``foo.backup`` done by `ReplaceFileW`_,
 539   but its rename of ``foo.other`` to ``foo`` does not, so we get
 540   an ``ERROR_FILE_NOT_FOUND`` error from `ReplaceFileW`_ indicating
 541   that the replaced file does not exist. We ignore this error and
 542   attempt to move ``foo.tmp`` to ``foo``, racing with the other
 543   process which is attempting to move ``foo.other`` to ``foo``.
 544   If we win the race, then our changes end up at ``foo``, and the
 545   other process' move fails. If the other process wins the race,
 546   then its changes end up at ``foo``, our move fails, and we
 547   reclassify as a conflict, so that our changes end up at
 548   ``foo.conflicted``. Either possibility avoids data loss.
 549
 550 * Interleaving D′: the other process' deletion and/or rename happen
 551   during the call to `ReplaceFileW`_, causing the latter to fail.
 552   There are two subcases:
 553
 554   * if the error is ``ERROR_UNABLE_TO_MOVE_REPLACEMENT_2``, then
 555     ``foo`` is renamed to ``foo.backup`` and ``.foo.tmp`` remains
 556     at its original name after the call.
 557   * for all other errors, ``foo`` and ``.foo.tmp`` both remain at
 558     their original names after the call.
 559
 560   In both subcases, we reclassify as a conflict and rename ``.foo.tmp``
 561   to ``foo.conflicted``. This avoids data loss.
 562
 563 * Interleaving E′: the other process' deletion of ``foo`` and attempt
 564   to rename ``foo.other`` to ``foo`` both happen after all internal
 565   operations of `ReplaceFileW`_ have completed. This causes deletion
 566   and rename events for ``foo`` (which will in practice be merged due
 567   to the pending delay, although we don't rely on that for correctness).
 568   The rename also changes the ``mtime`` for ``foo`` so that it is
 569   different from the ``mtime`` calculated in step 3, and therefore
 570   different from the metadata recorded for ``foo`` in the magic folder
 571   db. (Assuming no system clock changes, its rename will set an
 572   ``mtime`` timestamp corresponding to a time after the internal
 573   operations of `ReplaceFileW`_ have completed, which is not equal to
 574   the timestamp *T* seconds before `ReplaceFileW`_ is called, provided
 575   that *T* seconds is sufficiently greater than the timestamp
 576   granularity.) Therefore, an upload will be triggered for ``foo``
 577   after its change, which is correct and avoids data loss.
 578
 579 .. _`MoveFileExW`: https://msdn.microsoft.com/en-us/library/windows/desktop/aa365240%28v=vs.85%29.aspx
 580
 581 If the replaced file did not already exist, we get an
 582 ``ERROR_FILE_NOT_FOUND`` error from `ReplaceFileW`_, and attempt to
 583 move ``foo.tmp`` to ``foo``. This is similar to Interleaving C, and
 584 either possibility for the resulting race avoids data loss.
 585
 586 We also need to consider what happens if another process opens ``foo``
 587 and writes to it directly, rather than renaming another file onto it:
 588
 589 * On Unix, open file handles refer to inodes, not paths. If the other
 590   process opens ``foo`` before it has been renamed to ``foo.backup``,
 591   and then closes the file, changes will have been written to the file
 592   at the same inode, even if that inode is now linked at ``foo.backup``.
 593   This avoids data loss.
 594
 595 * On Windows, we have two subcases, depending on whether the sharing
 596   flags specified by the other process when it opened its file handle
 597   included ``FILE_SHARE_DELETE``. (This flag covers both deletion and
 598   rename operations.)
 599
 600   i.  If the sharing flags *do not* allow deletion/renaming, the
 601       `ReplaceFileW`_ operation will fail without renaming ``foo``.
 602       In this case we will end up with ``foo`` changed by the other
 603       process, and the downloaded file still in ``foo.tmp``.
 604       This avoids data loss.
 605
 606   ii. If the sharing flags *do* allow deletion/renaming, then
 607       data loss or corruption may occur. This is unavoidable and
 608       can be attributed to other process making a poor choice of
 609       sharing flags (either explicitly if it used `CreateFile`_, or
 610       via whichever higher-level API it used).
 611
 612 .. _`CreateFile`: https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858%28v=vs.85%29.aspx
 613
 614 Note that it is possible that another process tries to open the file
 615 between steps 4b and 4c (or 4b′ and 4c′ on Windows). In this case the
 616 open will fail because ``foo`` does not exist. Nevertheless, no data
 617 will be lost, and in many cases the user will be able to retry the
 618 operation.
 619
 620 Above we only described the case where the download was initially
 621 classified as an overwrite. If it was classed as a conflict, the
 622 procedure is the same except that we choose a unique filename
 623 for the conflicted file (say, ``foo.conflicted_unique``). We write
 624 the new contents to ``.foo.tmp`` and then rename it to
 625 ``foo.conflicted_unique`` in such a way that the rename will fail
 626 if the destination already exists. (On Windows this is a simple
 627 rename; on Unix it can be implemented as a link operation followed
 628 by an unlink, similar to steps 4c and 4d above.) If this fails
 629 because another process wrote ``foo.conflicted_unique`` after we
 630 chose the filename, then we retry with a different filename.
 631
 632
 633 Read/download collisions
 634 ~~~~~~~~~~~~~~~~~~~~~~~~
 635
 636 A *read/download collision* occurs when another program reads
 637 from ``foo`` in the local filesystem, concurrently with the new
 638 version being written by the Magic Folder client. We want to
 639 ensure that any successful attempt to read the file by the other
 640 program obtains a consistent view of its contents.
 641
 642 On Unix, the above procedure for writing downloads is sufficient
 643 to achieve this. There are three cases:
 644
 645 * A. The other process opens ``foo`` for reading before it is
 646   renamed to ``foo.backup``. Then the file handle will continue to
 647   refer to the old file across the rename, and the other process
 648   will read the old contents.
 649
 650 * B. The other process attempts to open ``foo`` after it has been
 651   renamed to ``foo.backup``, and before it is linked in step c.
 652   The open call fails, which is acceptable.
 653
 654 * C. The other process opens ``foo`` after it has been linked to
 655   the new file. Then it will read the new contents.
 656
 657 On Windows, the analysis is very similar, but case A′ needs to
 658 be split into two subcases, depending on the sharing mode the other
 659 process uses when opening the file for reading:
 660
 661 * A′. The other process opens ``foo`` before the Magic Folder
 662   client's attempt to rename ``foo`` to ``foo.backup`` (as part
 663   of the implementation of `ReplaceFileW`_). The subcases are:
 664
 665   i.  The other process uses sharing flags that deny deletion and
 666       renames. The `ReplaceFileW`_ call fails, and the download is
 667       reclassified as a conflict. The downloaded file ends up at
 668       ``foo.conflicted``, which is correct.
 669
 670   ii. The other process uses sharing flags that allow deletion
 671       and renames. The `ReplaceFileW`_ call succeeds, and the
 672       other process reads inconsistent data. This can be attributed
 673       to a poor choice of sharing flags by the other process.
 674
 675 * B′. The other process attempts to open ``foo`` at the point
 676   during the `ReplaceFileW`_ call where it does not exist.
 677   The open call fails, which is acceptable.
 678
 679 * C′. The other process opens ``foo`` after it has been linked to
 680   the new file. Then it will read the new contents.
 681
 682
 683 For both write/download and read/download collisions, we have
 684 considered only interleavings with a single other process, and
 685 only the most common possibilities for the other process'
 686 interaction with the file. If multiple other processes are
 687 involved, or if a process performs operations other than those
 688 considered, then we cannot say much about the outcome in general;
 689 however, we believe that such cases will be much less common.
 690
 691
 692
 693 Fire Dragons: Distinguishing conflicts from overwrites
 694 ''''''''''''''''''''''''''''''''''''''''''''''''''''''
 695
 696 When synchronizing a file that has changed remotely, the Magic Folder
 697 client needs to distinguish between overwrites, in which the remote
 698 side was aware of your most recent version (if any) and overwrote it
 699 with a new version, and conflicts, in which the remote side was unaware
 700 of your most recent version when it published its new version. Those two
 701 cases have to be handled differently — the latter needs to be raised
 702 to the user as an issue the user will have to resolve and the former
 703 must not bother the user.
 704
 705 For example, suppose that Alice's Magic Folder client sees a change
 706 to ``foo`` in Bob's DMD. If the version it downloads from Bob's DMD
 707 is "based on" the version currently in Alice's local filesystem at
 708 the time Alice's client attempts to write the downloaded file ‒or if
 709 there is no existing version in Alice's local filesystem at that time‒
 710 then it is an overwrite. Otherwise it is initially classified as a
 711 conflict.
 712
 713 This initial classification is used by the procedure for writing a
 714 file described in the `Earth Dragons`_ section above. As explained
 715 in that section, we may reclassify an overwrite as a conflict if an
 716 error occurs during the write procedure.
 717
 718 .. _`Earth Dragons`: #earth-dragons-collisions-between-local-filesystem-operations-and-downloads
 719
 720 In order to implement this policy, we need to specify how the
 721 "based on" relation between file versions is recorded and updated.
 722
 723 We propose to record this information:
 724
 725 * in the `magic folder db`_, for local files;
 726 * in the Tahoe-LAFS directory metadata, for files stored in the
 727   Magic Folder.
 728
 729 In the magic folder db we will add a *last-downloaded record*,
 730 consisting of ``last_downloaded_uri`` and ``last_downloaded_timestamp``
 731 fields, for each path stored in the database. Whenever a Magic Folder
 732 client downloads a file, it stores the downloaded version's URI and
 733 the current local timestamp in this record. Since only immutable
 734 files are used, the URI will be an immutable file URI, which is
 735 deterministically and uniquely derived from the file contents and
 736 the Tahoe-LAFS node's `convergence secret`_.
 737
 738 (Note that the last-downloaded record is updated regardless of
 739 whether the download is an overwrite or a conflict. The rationale
 740 for this to avoid "conflict loops" between clients, where every
 741 new version after the first conflict would be considered as another
 742 conflict.)
 743
 744 .. _`convergence secret`: https://tahoe-lafs.org/trac/tahoe-lafs/browser/docs/convergence-secret.rst
 745
 746 Later, in response to a local filesystem change at a given path, the
 747 Magic Folder client reads the last-downloaded record associated with
 748 that path (if any) from the database and then uploads the current
 749 file. When it links the uploaded file into its client DMD, it
 750 includes the ``last_downloaded_uri`` field in the metadata of the
 751 directory entry, overwriting any existing field of that name. If
 752 there was no last-downloaded record associated with the path, this
 753 field is omitted.
 754
 755 Note that ``last_downloaded_uri`` field does *not* record the URI of
 756 the uploaded file (which would be redundant); it records the URI of
 757 the last download before the local change that caused the upload.
 758 The field will be absent if the file has never been downloaded by
 759 this client (i.e. if it was created on this client and no change
 760 by any other client has been detected).
 761
 762 A possible refinement also takes into account the
 763 ``last_downloaded_timestamp`` field from the magic folder db, and
 764 compares it to the timestamp of the change that caused the upload
 765 (which should be later, assuming no system clock changes).
 766 If the duration between these timestamps is very short, then we
 767 are uncertain about whether the process on Bob's system that wrote
 768 the local file could have taken into account the last download.
 769 We can use this information to be conservative about treating
 770 changes as conflicts. So, if the duration is less than a configured
 771 threshold, we omit the ``last_downloaded_uri`` field from the
 772 metadata. This will have the effect of making other clients treat
 773 this change as a conflict whenever they already have a copy of the
 774 file.
 775
 776 Conflict/overwrite decision algorithm
 777 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 778
 779 Now we are ready to describe the algorithm for determining whether a
 780 download for the file ``foo`` is an overwrite or a conflict (refining
 781 step 2 of the procedure from the `Earth Dragons`_ section).
 782
 783 Let ``last_downloaded_uri`` be the field of that name obtained from
 784 the directory entry metadata for ``foo`` in Bob's DMD (this field
 785 may be absent). Then the algorithm is:
 786
 787 * 2a. Attempt to "stat" ``foo`` to get its *current statinfo* (size
 788   in bytes, ``mtime``, and ``ctime``). If Alice has no local copy
 789   of ``foo``, classify as an overwrite.
 790
 791 * 2b. Read the following information for the path ``foo`` from the
 792   local magic folder db:
 793
 794   * the *last-seen statinfo*, if any (this is the size in
 795     bytes, ``mtime``, and ``ctime`` stored in the ``local_files``
 796     table when the file was last uploaded);
 797   * the ``last_uploaded_uri`` field of the ``local_files`` table
 798     for this file, which is the URI under which the file was last
 799     uploaded.
 800
 801 * 2c. If any of the following are true, then classify as a conflict:
 802
 803   * i. there are pending notifications of changes to ``foo``;
 804   * ii. the last-seen statinfo is either absent (i.e. there is
 805     no entry in the database for this path), or different from the
 806     current statinfo;
 807   * iii. either ``last_downloaded_uri`` or ``last_uploaded_uri``
 808     (or both) are absent, or they are different.
 809
 810   Otherwise, classify as an overwrite.
 811
 812
 813 Air Dragons: Collisions between local writes and uploads
 814 ''''''''''''''''''''''''''''''''''''''''''''''''''''''''
 815
 816 Short of filesystem-specific features on Unix or the `shadow copy service`_
 817 on Windows (which is per-volume and therefore difficult to use in this
 818 context), there is no way to *read* the whole contents of a file
 819 atomically. Therefore, when we read a file in order to upload it, we
 820 may read an inconsistent version if it was also being written locally.
 821
 822 .. _`shadow copy service`: https://technet.microsoft.com/en-us/library/ee923636%28v=ws.10%29.aspx
 823
 824 A well-behaved application can avoid this problem for its writes:
 825
 826 * On Unix, if another process modifies a file by renaming a temporary
 827   file onto it, then we will consistently read either the old contents
 828   or the new contents.
 829 * On Windows, if the other process uses sharing flags to deny reads
 830   while it is writing a file, then we will consistently read either
 831   the old contents or the new contents, unless a sharing error occurs.
 832   In the case of a sharing error we should retry later, up to a
 833   maximum number of retries.
 834
 835 In the case of a not-so-well-behaved application writing to a file
 836 at the same time we read from it, the magic folder will still be
 837 eventually consistent, but inconsistent versions may be visible to
 838 other users' clients.
 839
 840 In Objective 2 we implemented a delay, called the *pending delay*,
 841 after the notification of a filesystem change and before the file is
 842 read in order to upload it (Tahoe-LAFS ticket `#1440`_). If another
 843 change notification occurs within the pending delay time, the delay
 844 is restarted. This helps to some extent because it means that if
 845 files are written more quickly than the pending delay and less
 846 frequently than the pending delay, we shouldn't encounter this
 847 inconsistency.
 848
 849 .. _`#1440`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1440
 850
 851 The likelihood of inconsistency could be further reduced, even for
 852 writes by not-so-well-behaved applications, by delaying the actual
 853 upload for a further period —called the *stability delay*— after the
 854 file has finished being read. If a notification occurs between the
 855 end of the pending delay and the end of the stability delay, then
 856 the read would be aborted and the notification requeued.
 857
 858 This would have the effect of ensuring that no write notifications
 859 have been received for the file during a time window that brackets
 860 the period when it was being read, with margin before and after
 861 this period defined by the pending and stability delays. The delays
 862 are intended to account for asynchronous notification of events, and
 863 caching in the filesystem.
 864
 865 Note however that we cannot guarantee that the delays will be long
 866 enough to prevent inconsistency in any particular case. Also, the
 867 stability delay would potentially affect performance significantly
 868 because (unlike the pending delay) it is not overlapped when there
 869 are multiple files on the upload queue. This performance impact
 870 could be mitigated by uploading files in parallel where possible
 871 (Tahoe-LAFS ticket `#1459`_).
 872
 873 We have not yet decided whether to implement the stability delay, and
 874 it is not planned to be implemented for the OTF objective 4 milestone.
 875 Ticket `#2431`_ has been opened to track this idea.
 876
 877 .. _`#1459`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1459
 878 .. _`#2431`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/2431
 879
 880 Note that the situation of both a local process and the Magic Folder
 881 client reading a file at the same time cannot cause any inconsistency.
 882
 883
 884 Water Dragons: Handling deletion and renames
 885 ''''''''''''''''''''''''''''''''''''''''''''
 886
 887 Deletion of a file
 888 ~~~~~~~~~~~~~~~~~~
 889
 890 When a file is deleted from the filesystem of a Magic Folder client,
 891 the most intuitive behavior is for it also to be deleted under that
 892 name from other clients. To avoid data loss, the other clients should
 893 actually rename their copies to a backup filename.
 894
 895 It would not be sufficient for a Magic Folder client that deletes
 896 a file to implement this simply by removing the directory entry from
 897 its DMD. Indeed, the entry may not exist in the client's DMD if it
 898 has never previously changed the file.
 899
 900 Instead, the client links a zero-length file into its DMD and sets
 901 ``deleted: true`` in the directory entry metadata. Other clients
 902 take this as a signal to rename their copies to the backup filename.
 903
 904 Note that the entry for this zero-length file has a version number as
 905 usual, and later versions may restore the file.
 906
 907 When the downloader deletes a file (or renames it to a filename
 908 ending in ``.backup``) in response to a remote change, a local
 909 filesystem notification will occur, and we must make sure that this
 910 is not treated as a local change. To do this we have the downloader
 911 set the ``size`` field in the magic folder db to ``None`` (SQL NULL)
 912 just before deleting the file, and suppress notifications for which
 913 the local file does not exist, and the recorded ``size`` field is
 914 ``None``.
 915
 916 When a Magic Folder client restarts, we can detect files that had
 917 been downloaded but were deleted while it was not running, because
 918 their paths will have last-downloaded records in the magic folder db
 919 with a ``size`` other than ``None``, and without any corresponding
 920 local file.
 921
 922 Deletion of a directory
 923 ~~~~~~~~~~~~~~~~~~~~~~~
 924
 925 Local filesystems (unlike a Tahoe-LAFS filesystem) normally cannot
 926 unlink a directory that has any remaining children. Therefore a
 927 Magic Folder client cannot delete local copies of directories in
 928 general, because they will typically contain backup files. This must
 929 be done manually on each client if desired.
 930
 931 Nevertheless, a Magic Folder client that deletes a directory should
 932 set ``deleted: true`` on the metadata entry for the corresponding
 933 zero-length file. This avoids the directory being recreated after
 934 it has been manually deleted from a client.
 935
 936 Renaming
 937 ~~~~~~~~
 938
 939 It is sufficient to handle renaming of a file by treating it as a
 940 deletion and an addition under the new name.
 941
 942 This also applies to directories, although users may find the
 943 resulting behavior unintuitive: all of the files under the old name
 944 will be renamed to backup filenames, and a new directory structure
 945 created under the new name. We believe this is the best that can be
 946 done without imposing unreasonable implementation complexity.
 947
 948
 949 Summary
 950 -------
 951
 952 This completes the design of remote-to-local synchronization.
 953 We realize that it may seem very complicated. Anecdotally, proprietary
 954 filesystem synchronization designs we are aware of, such as Dropbox,
 955 are said to incur similar or greater design complexity.