From: Daira Hopwood Date: Fri, 29 May 2015 02:03:46 +0000 (+0100) Subject: Magic Folder: add remote-to-local sync design doc. X-Git-Tag: allmydata-tahoe-1.10.1b1~13 X-Git-Url: https://git.rkrishnan.org/vdrive/%5B%5E?a=commitdiff_plain;h=97c33b175b880332b41ff453b8e254c667a512fa;p=tahoe-lafs%2Ftahoe-lafs.git Magic Folder: add remote-to-local sync design doc. Signed-off-by: Daira Hopwood --- diff --git a/docs/proposed/magic-folder/remote-to-local-sync.rst b/docs/proposed/magic-folder/remote-to-local-sync.rst new file mode 100644 index 00000000..8a65dfb9 --- /dev/null +++ b/docs/proposed/magic-folder/remote-to-local-sync.rst @@ -0,0 +1,894 @@ +Magic Folder design for remote-to-local sync +============================================ + +Scope +----- + +In this Objective we will design remote-to-local synchronization: + +* How to efficiently determine which objects (files and directories) have + to be downloaded in order to bring the current local filesystem into sync + with the newly-discovered version of the remote filesystem. +* How to distinguish overwrites, in which the remote side was aware of + your most recent version and overwrote it with a new version, from + conflicts, in which the remote side was unaware of your most recent + version when it published its new version. The latter needs to be raised + to the user as an issue the user will have to resolve and the former must + not bother the user. +* How to overwrite the (stale) local versions of those objects with the + newly acquired objects, while preserving backed-up versions of those + overwritten objects in case the user didn't want this overwrite and wants + to recover the old version. + +Tickets on the Tahoe-LAFS trac with the `otf-magic-folder-objective4`_ +keyword are within the scope of the remote-to-local synchronization +design. + +.. 
_otf-magic-folder-objective4: https://tahoe-lafs.org/trac/tahoe-lafs/query?status=!closed&keywords=~otf-magic-folder-objective4 + + +Glossary +'''''''' + +Object: a file or directory + +DMD: distributed mutable directory + +Folder: an abstract directory that is synchronized between clients. +(A folder is not the same as the directory corresponding to it on +any particular client, nor is it the same as a DMD.) + +Descendant: a direct or indirect child in a directory or folder tree + +Subfolder: a folder that is a descendant of a magic folder + +Subpath: the path from a magic folder to one of its descendants + +Write: a modification to a local filesystem object by a client + +Read: a read from a local filesystem object by a client + +Upload: an upload of a local object to the Tahoe-LAFS file store + +Download: a download from the Tahoe-LAFS file store to a local object + +Pending notification: a local filesystem change that has been detected +but not yet processed. + + +Representing the Magic Folder in Tahoe-LAFS +------------------------------------------- + +Unlike the local case where we use inotify or ReadDirectoryChangesW to +detect filesystem changes, we have no mechanism to register a monitor for +changes to a Tahoe-LAFS directory. Therefore, we must periodically poll +for changes. + +An important constraint on the solution is Tahoe-LAFS' "`write +coordination directive`_", which prohibits concurrent writes by different +storage clients to the same mutable object: + + Tahoe does not provide locking of mutable files and directories. If + there is more than one simultaneous attempt to change a mutable file + or directory, then an UncoordinatedWriteError may result. This might, + in rare cases, cause the file or directory contents to be accidentally + deleted. The user is expected to ensure that there is at most one + outstanding write or update request for a given file or directory at + a time. 
One convenient way to accomplish this is to make a different + file or directory for each person or process that wants to write. + +.. _`write coordination directive`: ../../write_coordination.rst + +Since it is a goal to allow multiple users to write to a Magic Folder, +if the write coordination directive remains the same as above, then we +will not be able to implement the Magic Folder as a single Tahoe-LAFS +DMD. In general therefore, we will have multiple DMDs —spread across +clients— that together represent the Magic Folder. Each client polls +the other clients' DMDs in order to detect remote changes. + +Six possible designs were considered for the representation of subfolders +of the Magic Folder: + +1. All subfolders written by a given Magic Folder client are collapsed +into a single client DMD, containing immutable files. The child name of +each file encodes the full subpath of that file relative to the Magic +Folder. + +2. The DMD tree under a client DMD is a direct copy of the folder tree +written by that client to the Magic Folder. Not all subfolders have +corresponding DMDs; only those to which that client has written files or +child subfolders. + +3. The directory tree under a client DMD is a ``tahoe backup`` structure +containing immutable snapshots of the folder tree written by that client +to the Magic Folder. As in design 2, only objects written by that client +are present. + +4. *Each* client DMD contains an eventually consistent mirror of all +files and folders written by *any* Magic Folder client. Thus each client +must also copy changes made by other Magic Folder clients to its own +client DMD. + +5. *Each* client DMD contains a ``tahoe backup`` structure containing +immutable snapshots of all files and folders written by *any* Magic +Folder client. Thus each client must also create another snapshot in its +own client DMD when changes are made by another client. (It can potentially +batch changes, subject to latency requirements.) + +6. 
The write coordination problem is solved by implementing `two-phase +commit`_. Then, the representation consists of a single DMD tree which is +written by all clients. + +.. _`two-phase commit`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1755 + +Here is a summary of advantages and disadvantages of each design: + ++----------------------------+ +| Key | ++=======+====================+ +| \+\+ | major advantage | ++-------+--------------------+ +| \+ | minor advantage | ++-------+--------------------+ +| ‒ | minor disadvantage | ++-------+--------------------+ +| ‒ ‒ | major disadvantage | ++-------+--------------------+ +| ‒ ‒ ‒ | showstopper | ++-------+--------------------+ + + +123456+: All designs have the property that a recursive add-lease +operation starting from the parent Tahoe-LAFS DMD will find all of the +files and directories used in the Magic Folder representation. Therefore +the representation is compatible with `garbage collection`_, even when a +pre-Magic-Folder client does the lease marking. + +.. _`garbage collection`: https://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/garbage-collection.rst + +123456+: All designs avoid "breaking" pre-Magic-Folder clients that read +a directory or file that is part of the representation. + +456++: Only these designs allow a readcap to one of the client +directories —or one of their subdirectories— to be directly shared +with other Tahoe-LAFS clients (not necessarily Magic Folder clients), +so that such a client sees all of the contents of the Magic Folder. +Note that this was not a requirement of the OTF proposal, although it +is useful. + +135+: A Magic Folder client has only one mutable Tahoe-LAFS object to +monitor per other client. This minimizes communication bandwidth for +polling, or alternatively the latency possible for a given polling +bandwidth. + +1236+: A client does not need to make changes to its own DMD that repeat +changes that another Magic Folder client had previously made. 
This reduces +write bandwidth and complexity. + +1‒: If the Magic Folder has many subfolders, their files will all be +collapsed into the same DMD, which could get quite large. In practice a +single DMD can easily handle the number of files expected to be written +by a client, so this is unlikely to be a significant issue. + +123‒ ‒ ‒: In these designs, the set of files in a Magic Folder is +represented as the union of the files in all client DMDs. However, +when a file is modified by more than one client, it will be linked +from multiple client DMDs. We therefore need a mechanism, such as a +version number or a monotonically increasing timestamp, to determine +which copy takes priority. + +35‒ ‒: When a Magic Folder client detects a remote change, it must +traverse an immutable directory structure to see what has changed. +Completely unchanged subtrees will have the same URI, allowing some of +this traversal to be shortcutted. + +24‒ ‒ ‒: When a Magic Folder client detects a remote change, it must +traverse a mutable directory structure to see what has changed. This is +more complex and less efficient than traversing an immutable structure, +because shortcutting is not possible (each DMD retains the same URI even +if a descendant object has changed), and because the structure may change +while it is being traversed. Also the traversal needs to be robust +against cycles, which can only occur in mutable structures. + +45‒ ‒: When a change occurs in one Magic Folder client, it will propagate +to all the other clients. Each client will therefore see multiple +representation changes for a single logical change to the Magic Folder +contents, and must suppress the duplicates. This is particularly +problematic for design 4 where it interacts with the preceding issue. + +4‒ ‒ ‒, 5‒ ‒: There is the potential for client DMDs to get "out of sync" +with each other, potentially for long periods if errors occur. 
Thus each +client must be able to "repair" its client directory (and its +subdirectory structure) concurrently with performing its own writes. This +is a significant complexity burden and may introduce failure modes that +could not otherwise happen. + +6‒ ‒ ‒: While two-phase commit is a well-established protocol, its +application to Tahoe-LAFS requires significant design work, and may still +leave some corner cases of the write coordination problem unsolved. + + ++------------------------------------------------+-----------------------------------------+ +| Design Property | Designs Proposed | ++================================================+======+======+======+======+======+======+ +| **advantages** | *1* | *2* | *3* | *4* | *5* | *6* | ++------------------------------------------------+------+------+------+------+------+------+ +| Compatible with garbage collection |\+ |\+ |\+ |\+ |\+ |\+ | ++------------------------------------------------+------+------+------+------+------+------+ +| Does not break old clients |\+ |\+ |\+ |\+ |\+ |\+ | ++------------------------------------------------+------+------+------+------+------+------+ +| Allows direct sharing | | | |\+\+ |\+\+ |\+\+ | ++------------------------------------------------+------+------+------+------+------+------+ +| Efficient use of bandwidth |\+ | |\+ | |\+ | | ++------------------------------------------------+------+------+------+------+------+------+ +| No repeated changes |\+ |\+ |\+ | | |\+ | ++------------------------------------------------+------+------+------+------+------+------+ +| **disadvantages** | *1* | *2* | *3* | *4* | *5* | *6* | ++------------------------------------------------+------+------+------+------+------+------+ +| Can result in large DMDs |‒ | | | | | | ++------------------------------------------------+------+------+------+------+------+------+ +| Need version number to determine priority |‒ |‒ |‒ | | | | 
++------------------------------------------------+------+------+------+------+------+------+ +| Must traverse immutable directory structure | | |‒ ‒ | |‒ ‒ | | ++------------------------------------------------+------+------+------+------+------+------+ +| Must traverse mutable directory structure | |‒ ‒ | |‒ ‒ | | | ++------------------------------------------------+------+------+------+------+------+------+ +| Must suppress duplicate representation changes | | | |‒ ‒ |‒ ‒ | | ++------------------------------------------------+------+------+------+------+------+------+ +| "Out of sync" problem | | | |‒ ‒ ‒ |‒ ‒ | | ++------------------------------------------------+------+------+------+------+------+------+ +| Unsolved design problems | | | | | |‒ ‒ ‒ | ++------------------------------------------------+------+------+------+------+------+------+ + + +Evaluation of designs +''''''''''''''''''''' + +Designs 2 and 3 have no significant advantages over design 1, while +requiring higher polling bandwidth and greater complexity due to the need +to create subdirectories. These designs were therefore rejected. + +Design 4 was rejected due to the out-of-sync problem, which is severe +and possibly unsolvable for mutable structures. + +For design 5, the out-of-sync problem is still present but possibly +solvable. However, design 5 is substantially more complex, less efficient +in bandwidth/latency, and less scalable in number of clients and +subfolders than design 1. It only gains over design 1 on the ability to +share directory readcaps to the Magic Folder (or subfolders), which was +not a requirement. It would be possible to implement this feature in +future by switching to design 6. + +For the time being, however, design 6 was considered out-of-scope for +this project. + +Therefore, design 1 was chosen. That is: + + All subfolders written by a given Magic Folder client are collapsed + into a single client DMD, containing immutable files. 
The child name + of each file encodes the full subpath of that file relative to the + Magic Folder. + +Each directory entry in a DMD also stores a version number, so that the +latest version of a file is well-defined when it has been modified by +multiple clients. + +To enable representing empty directories, a client that creates a +directory should link a corresponding zero-length file in its DMD, +at a name that ends with the encoded directory separator character. + +We want to enable dynamic configuration of the set of clients subscribed +to a Magic Folder, without having to reconfigure or restart each client +when another client joins. To support this, we have a single parent DMD +that links to all of the client DMDs, named by their client nicknames. +Then it is possible to change the contents of the parent DMD in order to +add clients. Note that a client DMD should not be unlinked from the +parent directory unless all of its files are first copied to some other +client DMD. + +A client needs to be able to write to its own DMD, and read from other DMDs. +To be consistent with the `Principle of Least Authority`_, each client's +reference to its own DMD is a write capability, whereas its reference +to the parent DMD is a read capability. The latter transitively grants +read access to all of the other client DMDs and the files linked from +them, as required. + +.. _`Principle of Least Authority`: http://www.eros-os.org/papers/secnotsep.pdf + +Design and implementation of the user interface for maintaining this +DMD structure and configuration will be addressed in Objectives 5 and 6. + +During operation, each client will poll for changes on other clients +at a predetermined frequency. On each poll, it will reread the parent DMD +(to allow for added or removed clients), and then read each client DMD +linked from the parent. + +"Hidden" files, and files with names matching the patterns used for backup, +temporary, and conflicted files, will be ignored, i.e. 
not synchronized +in either direction. A file is hidden if it has a filename beginning with +"." (on any platform), or has the hidden or system attribute on Windows. + + +Conflict Detection and Resolution +--------------------------------- + +The combination of local filesystems and distributed objects is +an example of shared state concurrency, which is highly error-prone +and can result in race conditions that are complex to analyze. +Unfortunately we have no option but to use shared state in this +situation. + +We call the resulting design issues "dragons" (as in "Here be dragons"), +which as a convenient mnemonic we have named after the classical +Greek elements Earth, Fire, Air, and Water. + +Note: all filenames used in the following sections are examples, +and the filename patterns we use in the actual implementation may +differ. The actual patterns will probably include timestamps, and +for conflicted files, the nickname of the client that last changed +the file. + + +Earth Dragons: Collisions between local filesystem operations and downloads +''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +Write/download collisions +~~~~~~~~~~~~~~~~~~~~~~~~~ + +Suppose that Alice's Magic Folder client is about to write a +version of ``foo`` that it has downloaded in response to a remote +change. + +The criteria for distinguishing overwrites from conflicts are +described later in the `Fire Dragons`_ section. Suppose that the +remote change has been initially classified as an overwrite. +(As we will see, it may be reclassified in some circumstances.) + +.. _`Fire Dragons`: #fire-dragons-distinguishing-conflicts-from-overwrites + +A *write/download collision* occurs when another program writes +to ``foo`` in the local filesystem, concurrently with the new +version being written by the Magic Folder client. We need to +ensure that this does not cause data loss, as far as possible. 
+ +An important constraint on the design is that on Windows, it is +not possible to rename a file to the same name as an existing +file in that directory. Also, on Windows it may not be possible to +delete or rename a file that has been opened by another process +(depending on the sharing flags specified by that process). +Therefore we need to consider carefully how to handle failure +conditions. + +In our proposed design, Alice's Magic Folder client follows +this procedure for an overwrite in response to a remote change: + +1. Write a temporary file, say ``.foo.tmp``. +2. Use the procedure described in the `Fire Dragons`_ section + to obtain an initial classification as an overwrite or a + conflict. (This takes as input the ``last_downloaded_uri`` + field from the directory entry of the changed ``foo``.) +3. Set the ``mtime`` of the replacement file to be *T* seconds + before the current local time. +4. Perform a *file replacement* operation (explained below) + with backup filename ``foo.backup``, replaced file ``foo``, + and replacement file ``.foo.tmp``. If any step of this + operation fails, reclassify as a conflict and stop. + +To reclassify as a conflict, attempt to rename ``.foo.tmp`` to +``foo.conflicted``, suppressing errors. + +The implementation of file replacement differs between Unix +and Windows. On Unix, it can be implemented as follows: + +* 4a. Set the permissions of the replacement file to be the + same as the replaced file, bitwise-or'd with octal 600 + (``rw-------``). +* 4b. Attempt to move the replaced file (``foo``) to the + backup filename (``foo.backup``). +* 4c. Attempt to create a hard link at the replaced filename + (``foo``) pointing to the replacement file (``.foo.tmp``). +* 4d. Attempt to unlink the replacement file (``.foo.tmp``), + suppressing errors. + +Note that, if there is no conflict, the entry for ``foo`` +recorded in the `magic folder db`_ will reflect the ``mtime`` +set in step 3. 
The link operation in step 4c will cause an +``IN_CREATE`` event for ``foo``, but this will not trigger an +upload, because the metadata recorded in the database entry +will exactly match the metadata for the file's inode on disk. +(The two hard links — ``foo`` and, while it still exists, +``.foo.tmp`` — share the same inode and therefore the same +metadata.) + +.. _`magic folder db`: filesystem_integration.rst#local-scanning-and-database + +On Windows, file replacement can be implemented as a single +call to the `ReplaceFileW`_ API (with the +``REPLACEFILE_IGNORE_MERGE_ERRORS`` flag). + +Similar to the Unix case, the `ReplaceFileW`_ operation will +cause a change notification for ``foo``. The replaced ``foo`` +has the same ``mtime`` as the replacement file, and so this +notification will not trigger an unwanted upload. + +.. _`ReplaceFileW`: https://msdn.microsoft.com/en-us/library/windows/desktop/aa365512%28v=vs.85%29.aspx + +To determine whether this procedure adequately protects against data +loss, we need to consider what happens if another process attempts to +update ``foo``, for example by renaming ``foo.other`` to ``foo``. +This requires us to analyze all possible interleavings between the +operations performed by the Magic Folder client and the other process. +(Note that atomic operations on a directory are totally ordered.) +The set of possible interleavings differs between Windows and Unix. + +On Unix, we have: + +* Interleaving A: the other process' rename precedes our rename in + step 4b, and we get an ``IN_MOVED_TO`` event for its rename by + step 2. Then we reclassify as a conflict; its changes end up at + ``foo`` and ours end up at ``foo.conflicted``. This avoids data + loss. + +* Interleaving B: its rename precedes ours in step 4b, and we do + not get an event for its rename by step 2. Its changes end up at + ``foo.backup``, and ours end up at ``foo`` after being linked there + in step 4c. This avoids data loss. 
+ +* Interleaving C: its rename happens between our rename in step 4b, + and our link operation in step 4c of the file replacement. The + latter fails with an ``EEXIST`` error because ``foo`` already + exists. We reclassify as a conflict; the old version ends up at + ``foo.backup``, the other process' changes end up at ``foo``, and + ours at ``foo.conflicted``. This avoids data loss. + +* Interleaving D: its rename happens after our link in step 4c, + and causes an ``IN_MOVED_TO`` event for ``foo``. Its rename also + changes the ``mtime`` for ``foo`` so that it is different from + the ``mtime`` calculated in step 3, and therefore different + from the metadata recorded for ``foo`` in the magic folder db. + (Assuming no system clock changes, its rename will set an ``mtime`` + timestamp corresponding to a time after step 4c, which is not + equal to the timestamp *T* seconds before step 4a, provided that + *T* seconds is sufficiently greater than the timestamp granularity.) + Therefore, an upload will be triggered for ``foo`` after its + change, which is correct and avoids data loss. + +On Windows, the internal implementation of `ReplaceFileW`_ is similar +to what we have described above for Unix; it works like this: + +* 4a′. Copy metadata (which does not include ``mtime``) from the + replaced file (``foo``) to the replacement file (``.foo.tmp``). + +* 4b′. Attempt to move the replaced file (``foo``) onto the + backup filename (``foo.backup``), deleting the latter if it + already exists. + +* 4c′. Attempt to move the replacement file (``.foo.tmp``) to the + replaced filename (``foo``); fail if the destination already + exists. + +Notice that this is essentially the same as the algorithm we use +for Unix, but steps 4c and 4d on Unix are combined into a single +step 4c′. (If there is a failure at step 4c′ after step 4b′ has +completed, the `ReplaceFileW`_ call will fail with return code +``ERROR_UNABLE_TO_MOVE_REPLACEMENT_2``. 
However, it is still +preferable to use this API over two `MoveFileExW`_ calls, because +it retains the attributes and ACLs of ``foo`` where possible.) + +However, on Windows the other application will not be able to +directly rename ``foo.other`` onto ``foo`` (which would fail because +the destination already exists); it will have to rename or delete +``foo`` first. Without loss of generality, let's say ``foo`` is +deleted. This complicates the interleaving analysis, because we +have two operations done by the other process interleaving with +three done by the magic folder process (rather than one operation +interleaving with four as on Unix). The cases are: + +* Interleaving A′: the other process' deletion of ``foo`` and its + rename of ``foo.other`` to ``foo`` both precede our rename in + step 4b. We get an event corresponding to its rename by step 2. + Then we reclassify as a conflict; its changes end up at ``foo`` + and ours end up at ``foo.conflicted``. This avoids data loss. + +* Interleaving B′: the other process' deletion of ``foo`` and its + rename of ``foo.other`` to ``foo`` both precede our rename in + step 4b. We do not get an event for its rename by step 2. + Its changes end up at ``foo.backup``, and ours end up at ``foo`` + after being moved there in step 4c′. This avoids data loss. + +* Interleaving C′: the other process' deletion of ``foo`` precedes + our rename of ``foo`` to ``foo.backup`` done by `ReplaceFileW`_, + but its rename of ``foo.other`` to ``foo`` does not, so we get + an ``ERROR_FILE_NOT_FOUND`` error from `ReplaceFileW`_ indicating + that the replaced file does not exist. Then we reclassify as a + conflict; the other process' changes end up at ``foo`` (after + it has renamed ``foo.other`` to ``foo``) and our changes end up + at ``foo.conflicted``. This avoids data loss. + +* Interleaving D′: the other process' deletion and/or rename happen + during the call to `ReplaceFileW`_, causing the latter to fail. 
+ There are two subcases: + + * if the error is ``ERROR_UNABLE_TO_MOVE_REPLACEMENT_2``, then + ``foo`` is renamed to ``foo.backup`` and ``.foo.tmp`` remains + at its original name after the call. + * for all other errors, ``foo`` and ``.foo.tmp`` both remain at + their original names after the call. + + In both subcases, we reclassify as a conflict and rename ``.foo.tmp`` + to ``foo.conflicted``. This avoids data loss. + +* Interleaving E′: the other process' deletion of ``foo`` and attempt + to rename ``foo.other`` to ``foo`` both happen after all internal + operations of `ReplaceFileW`_ have completed. This causes deletion + and rename events for ``foo`` (which will in practice be merged due + to the pending delay, although we don't rely on that for correctness). + The rename also changes the ``mtime`` for ``foo`` so that it is + different from the ``mtime`` calculated in step 3, and therefore + different from the metadata recorded for ``foo`` in the magic folder + db. (Assuming no system clock changes, its rename will set an + ``mtime`` timestamp corresponding to a time after the internal + operations of `ReplaceFileW`_ have completed, which is not equal to + the timestamp *T* seconds before `ReplaceFileW`_ is called, provided + that *T* seconds is sufficiently greater than the timestamp + granularity.) Therefore, an upload will be triggered for ``foo`` + after its change, which is correct and avoids data loss. + +.. _`MoveFileExW`: https://msdn.microsoft.com/en-us/library/windows/desktop/aa365240%28v=vs.85%29.aspx + +We also need to consider what happens if another process opens ``foo`` +and writes to it directly, rather than renaming another file onto it: + +* On Unix, open file handles refer to inodes, not paths. If the other + process opens ``foo`` before it has been renamed to ``foo.backup``, + and then closes the file, changes will have been written to the file + at the same inode, even if that inode is now linked at ``foo.backup``. 
+ This avoids data loss. + +* On Windows, we have two subcases, depending on whether the sharing + flags specified by the other process when it opened its file handle + included ``FILE_SHARE_DELETE``. (This flag covers both deletion and + rename operations.) + + i. If the sharing flags *do not* allow deletion/renaming, the + `ReplaceFileW`_ operation will fail without renaming ``foo``. + In this case we will end up with ``foo`` changed by the other + process, and the downloaded file still in ``.foo.tmp``. + This avoids data loss. + + ii. If the sharing flags *do* allow deletion/renaming, then + data loss or corruption may occur. This is unavoidable and + can be attributed to the other process making a poor choice of + sharing flags (either explicitly if it used `CreateFile`_, or + via whichever higher-level API it used). + +.. _`CreateFile`: https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858%28v=vs.85%29.aspx + +Note that it is possible that another process tries to open the file +between steps 4b and 4c (or 4b′ and 4c′ on Windows). In this case the +open will fail because ``foo`` does not exist. Nevertheless, no data +will be lost, and in many cases the user will be able to retry the +operation. + +Above we only described the case where the download was initially +classified as an overwrite. If it was classed as a conflict, the +procedure is the same except that we choose a unique filename +for the conflicted file (say, ``foo.conflicted_unique``). We write +the new contents to ``.foo.tmp`` and then rename it to +``foo.conflicted_unique`` in such a way that the rename will fail +if the destination already exists. (On Windows this is a simple +rename; on Unix it can be implemented as a link operation followed +by an unlink, similar to steps 4c and 4d above.) If this fails +because another process wrote ``foo.conflicted_unique`` after we +chose the filename, then we retry with a different filename. 
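The Unix write procedure from this section (step 3 and steps 4a to 4d, plus reclassification as a conflict) can be sketched roughly as follows. This is an illustrative sketch only, not the Magic Folder implementation: the function names, the ``t_seconds`` default, and the returned strings are assumptions made for the example, and it treats a missing ``foo`` as a failed replacement.

```python
import os
import stat
import time


def replace_file_unix(replaced, replacement, backup):
    """Hypothetical helper for steps 4a-4d; raises OSError if 4a, 4b or 4c fails."""
    # 4a. Give the replacement the replaced file's permissions, OR'd with 0o600.
    mode = stat.S_IMODE(os.stat(replaced).st_mode) | 0o600
    os.chmod(replacement, mode)
    # 4b. Move the replaced file (foo) to the backup filename (foo.backup).
    os.rename(replaced, backup)
    # 4c. Hard-link the replacement at the replaced filename. Unlike rename,
    #     link fails with EEXIST if another process has recreated foo.
    os.link(replacement, replaced)
    # 4d. Unlink the temporary name, suppressing errors.
    try:
        os.unlink(replacement)
    except OSError:
        pass


def write_downloaded_version(replaced, replacement, backup, conflicted,
                             t_seconds=2.0):
    """Apply an initially classified overwrite; return the final classification."""
    # Step 3: set mtime to T seconds before now (t_seconds is an assumed value).
    t = time.time() - t_seconds
    os.utime(replacement, (t, t))
    try:
        replace_file_unix(replaced, replacement, backup)
        return "overwrite"
    except OSError:
        # Reclassify as a conflict: rename .foo.tmp to foo.conflicted,
        # suppressing errors.
        try:
            os.rename(replacement, conflicted)
        except OSError:
            pass
        return "conflict"
```

The hard link in step 4c is what makes Interleaving C detectable: ``os.link`` refuses to replace an existing ``foo``, whereas ``os.rename`` would silently overwrite it.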
+ + +Read/download collisions +~~~~~~~~~~~~~~~~~~~~~~~~ + +A *read/download collision* occurs when another program reads +from ``foo`` in the local filesystem, concurrently with the new +version being written by the Magic Folder client. We want to +ensure that any successful attempt to read the file by the other +program obtains a consistent view of its contents. + +On Unix, the above procedure for writing downloads is sufficient +to achieve this. There are three cases: + +* A. The other process opens ``foo`` for reading before it is + renamed to ``foo.backup``. Then the file handle will continue to + refer to the old file across the rename, and the other process + will read the old contents. + +* B. The other process attempts to open ``foo`` after it has been + renamed to ``foo.backup``, and before it is linked in step 4c. + The open call fails, which is acceptable. + +* C. The other process opens ``foo`` after it has been linked to + the new file. Then it will read the new contents. + +On Windows, the analysis is very similar, but case A′ needs to +be split into two subcases, depending on the sharing mode the other +process uses when opening the file for reading: + +* A′. The other process opens ``foo`` before the Magic Folder + client's attempt to rename ``foo`` to ``foo.backup`` (as part + of the implementation of `ReplaceFileW`_). The subcases are: + + i. The other process uses sharing flags that deny deletion and + renames. The `ReplaceFileW`_ call fails, and the download is + reclassified as a conflict. The downloaded file ends up at + ``foo.conflicted``, which is correct. + + ii. The other process uses sharing flags that allow deletion + and renames. The `ReplaceFileW`_ call succeeds, and the + other process reads inconsistent data. This can be attributed + to a poor choice of sharing flags by the other process. + +* B′. The other process attempts to open ``foo`` at the point + during the `ReplaceFileW`_ call where it does not exist. 
+ The open call fails, which is acceptable. + +* C′. The other process opens ``foo`` after it has been linked to + the new file. Then it will read the new contents. + + +For both write/download and read/download collisions, we have +considered only interleavings with a single other process, and +only the most common possibilities for the other process' +interaction with the file. If multiple other processes are +involved, or if a process performs operations other than those +considered, then we cannot say much about the outcome in general; +however, we believe that such cases will be much less common. + + + +Fire Dragons: Distinguishing conflicts from overwrites +'''''''''''''''''''''''''''''''''''''''''''''''''''''' + +When synchronizing a file that has changed remotely, the Magic Folder +client needs to distinguish between overwrites, in which the remote +side was aware of your most recent version and overwrote it with a +new version, and conflicts, in which the remote side was unaware of +your most recent version when it published its new version. Those two +cases have to be handled differently — the latter needs to be raised +to the user as an issue the user will have to resolve and the former +must not bother the user. + +For example, suppose that Alice's Magic Folder client sees a change +to ``foo`` in Bob's DMD. If the version it downloads from Bob's DMD +is "based on" the version currently in Alice's local filesystem at +the time Alice's client attempts to write the downloaded file, then +it is an overwrite. Otherwise it is initially classified as a +conflict. + +This initial classification is used by the procedure for writing a +file described in the `Earth Dragons`_ section above. As explained +in that section, we may reclassify an overwrite as a conflict if an +error occurs during the write procedure. + +.. 
_`Earth Dragons`: #earth-dragons-collisions-between-local-filesystem-operations-and-downloads + +In order to implement this policy, we need to specify how the +"based on" relation between file versions is recorded and updated. + +We propose to record this information: + +* in the `magic folder db`_, for local files; +* in the Tahoe-LAFS directory metadata, for files stored in the + Magic Folder. + +In the magic folder db we will add a *last-downloaded record*, +consisting of ``last_downloaded_uri`` and ``last_downloaded_timestamp`` +fields, for each path stored in the database. Whenever a Magic Folder +client downloads a file, it stores the downloaded version's URI and +the current local timestamp in this record. Since only immutable +files are used, the URI will be an immutable file URI, which is +deterministically and uniquely derived from the file contents and +the Tahoe-LAFS node's `convergence secret`_. + +(Note that the last-downloaded record is updated regardless of +whether the download is an overwrite or a conflict. The rationale +for this is to avoid "conflict loops" between clients, where every +new version after the first conflict would be considered as another +conflict.) + +.. _`convergence secret`: https://tahoe-lafs.org/trac/tahoe-lafs/browser/docs/convergence-secret.rst + +Later, in response to a local filesystem change at a given path, the +Magic Folder client reads the last-downloaded record associated with +that path (if any) from the database and then uploads the current +file. When it links the uploaded file into its client DMD, it +includes the ``last_downloaded_uri`` field in the metadata of the +directory entry, overwriting any existing field of that name. If +there was no last-downloaded record associated with the path, this +field is omitted.
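As a rough illustration of the bookkeeping described above, the following Python sketch keeps last-downloaded records in an in-memory dict standing in for the magic folder db, and builds the directory entry metadata for an upload. The names ``LastDownloaded``, ``record_download`` and ``metadata_for_upload`` are hypothetical and are not part of Tahoe-LAFS; only the ``last_downloaded_uri`` and ``last_downloaded_timestamp`` field names come from this design.

```python
import time
from typing import Dict, NamedTuple, Optional


class LastDownloaded(NamedTuple):
    last_downloaded_uri: str
    last_downloaded_timestamp: float


# Hypothetical stand-in for the magic folder db: path -> last-downloaded record.
db: Dict[str, LastDownloaded] = {}


def record_download(path: str, uri: str, now: Optional[float] = None) -> None:
    """Store the downloaded version's URI and the local time of the download.

    This is called for every download, whether it was classified as an
    overwrite or as a conflict.
    """
    db[path] = LastDownloaded(uri, time.time() if now is None else now)


def metadata_for_upload(path: str, existing: Optional[dict] = None) -> dict:
    """Build directory entry metadata for linking an upload into the client DMD."""
    metadata = dict(existing or {})
    # Assumption: a stale value is dropped, so that the field is omitted
    # when there is no last-downloaded record for this path.
    metadata.pop("last_downloaded_uri", None)
    record = db.get(path)
    if record is not None:
        metadata["last_downloaded_uri"] = record.last_downloaded_uri
    return metadata
```

In this sketch any other metadata keys already on the directory entry are preserved, and only ``last_downloaded_uri`` is replaced or removed.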
+ +Note that the ``last_downloaded_uri`` field does *not* record the URI of +the uploaded file (which would be redundant); it records the URI of +the last download before the local change that caused the upload. +The field will be absent if the file has never been downloaded by +this client (i.e. if it was created on this client and no change +by any other client has been detected). + +A possible refinement also takes into account the +``last_downloaded_timestamp`` field from the magic folder db, and +compares it to the timestamp of the change that caused the upload +(which should be later, assuming no system clock changes). +If the duration between these timestamps is very short, then we +are uncertain about whether the process on Bob's system that wrote +the local file could have taken into account the last download. +We can use this information to err on the side of treating such +changes as conflicts. So, if the duration is less than a configured +threshold, we omit the ``last_downloaded_uri`` field from the +metadata. This will have the effect of making other clients treat +this change as a conflict whenever they already have a copy of the +file. + +Now we are ready to describe the algorithm for determining whether a +download for the file ``foo`` is an overwrite or a conflict (refining +step 2 of the procedure from the `Earth Dragons`_ section). + +Let ``last_downloaded_uri`` be the field of that name obtained from +the directory entry metadata for ``foo`` in Bob's DMD (this field +may be absent). Then the algorithm is: + +* 2a. If Alice has no local copy of ``foo``, classify as an overwrite. + +* 2b. Otherwise, "stat" ``foo`` to get its *current statinfo* (size + in bytes, ``mtime``, and ``ctime``). + +* 2c.
Read the following information for the path ``foo`` from the + local magic folder db: + + * the *last-uploaded statinfo*, if any (this is the size in + bytes, ``mtime``, and ``ctime`` stored in the ``local_files`` + table when the file was last uploaded); + * the ``filecap`` field of the ``caps`` table for this file, + which is the URI under which the file was last uploaded. + Call this ``last_uploaded_uri``. + +* 2d. If any of the following are true, then classify as a conflict: + + * there are pending notifications of changes to ``foo``; + * the last-uploaded statinfo is either absent, or different + from the current statinfo; + * either ``last_downloaded_uri`` or ``last_uploaded_uri`` + (or both) are absent, or they are different. + + Otherwise, classify as an overwrite. + + +Air Dragons: Collisions between local writes and uploads +'''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +Short of filesystem-specific features on Unix or the `shadow copy service`_ +on Windows (which is per-volume and therefore difficult to use in this +context), there is no way to *read* the whole contents of a file +atomically. Therefore, when we read a file in order to upload it, we +may read an inconsistent version if it was also being written locally. + +.. _`shadow copy service`: https://technet.microsoft.com/en-us/library/ee923636%28v=ws.10%29.aspx + +A well-behaved application can avoid this problem for its writes: + +* On Unix, if another process modifies a file by renaming a temporary + file onto it, then we will consistently read either the old contents + or the new contents. +* On Windows, if the other process uses sharing flags to deny reads + while it is writing a file, then we will consistently read either + the old contents or the new contents, unless a sharing error occurs. + In the case of a sharing error we should retry later, up to a + maximum number of retries. 
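The retry policy in the Windows case above could look roughly like the following Python sketch. It assumes that a sharing error surfaces as ``PermissionError`` when the file is opened or read; the function name ``read_for_upload`` and its parameters are hypothetical. A Twisted-based client would more likely reschedule the read than block in ``sleep``, so this only illustrates the bounded-retry logic.

```python
import time
from typing import Callable


def read_for_upload(path: str,
                    max_retries: int = 3,
                    retry_delay: float = 1.0,
                    opener: Callable = open,
                    sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Read a whole file, retrying a bounded number of times if access is denied."""
    attempts = 0
    while True:
        try:
            with opener(path, "rb") as f:
                return f.read()
        except PermissionError:
            # Assumption: the writer denied read sharing; try again later.
            attempts += 1
            if attempts > max_retries:
                raise
            sleep(retry_delay)
```

The ``opener`` and ``sleep`` parameters exist only so that the retry behaviour can be exercised without a real file or real delays.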
+ +In the case of a not-so-well-behaved application writing to a file +at the same time we read from it, the magic folder will still be +eventually consistent, but inconsistent versions may be visible to +other users' clients. + +In Objective 2 we implemented a delay, called the *pending delay*, +after the notification of a filesystem change and before the file is +read in order to upload it (Tahoe-LAFS ticket `#1440`_). If another +change notification occurs within the pending delay time, the delay +is restarted. This helps to some extent because it means that if +files are written more quickly than the pending delay and less +frequently than the pending delay, we shouldn't encounter this +inconsistency. + +.. _`#1440`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1440 + +The likelihood of inconsistency could be further reduced, even for +writes by not-so-well-behaved applications, by delaying the actual +upload for a further period —called the *stability delay*— after the +file has finished being read. If a notification occurs between the +end of the pending delay and the end of the stability delay, then +the read would be aborted and the notification requeued. + +This would have the effect of ensuring that no write notifications +have been received for the file during a time window that brackets +the period when it was being read, with margin before and after +this period defined by the pending and stability delays. The delays +are intended to account for asynchronous notification of events, and +caching in the filesystem. + +Note however that we cannot guarantee that the delays will be long +enough to prevent inconsistency in any particular case. Also, the +stability delay would potentially affect performance significantly +because (unlike the pending delay) it is not overlapped when there +are multiple files on the upload queue. This performance impact +could be mitigated by uploading files in parallel where possible +(Tahoe-LAFS ticket `#1459`_). 
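The pending delay described above behaves like a per-path timer that is restarted by each new notification. The following Python sketch models only that restart logic, with time passed in explicitly; the class ``PendingDelay`` and its methods are hypothetical names, not the Objective 2 implementation, and the stability delay is not modelled.

```python
from typing import Dict, List


class PendingDelay:
    """Track pending notifications and release paths once they have been quiet."""

    def __init__(self, pending_delay: float) -> None:
        self.pending_delay = pending_delay
        self._deadline: Dict[str, float] = {}

    def notify(self, path: str, now: float) -> None:
        # A new notification starts, or restarts, the delay for this path.
        self._deadline[path] = now + self.pending_delay

    def ready(self, now: float) -> List[str]:
        """Remove and return the paths whose delay has expired by `now`."""
        due = sorted(p for p, t in self._deadline.items() if t <= now)
        for p in due:
            del self._deadline[p]
        return due
```

A path is handed to the uploader only after a full pending delay has passed with no further notifications for it, which is why a burst of closely spaced writes results in a single read.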
+ +We have not yet decided whether to implement the stability delay, and +it is not planned to be implemented for the OTF objective 4 milestone. +Ticket `#2431`_ has been opened to track this idea. + +.. _`#1459`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1459 +.. _`#2431`: https://tahoe-lafs.org/trac/tahoe-lafs/ticket/2431 + +Note that the situation of both a local process and the Magic Folder +client reading a file at the same time cannot cause any inconsistency. + + +Water Dragons: Handling deletion and renames +'''''''''''''''''''''''''''''''''''''''''''' + +Deletion of a file +~~~~~~~~~~~~~~~~~~ + +When a file is deleted from the filesystem of a Magic Folder client, +the most intuitive behavior is for it also to be deleted under that +name from other clients. To avoid data loss, the other clients should +actually rename their copies to a backup (``*.old``) filename. + +It would not be sufficient for a Magic Folder client that deletes +a file to implement this simply by removing the directory entry from +its DMD. Indeed, the entry may not exist in the client's DMD if it +has never previously changed the file. + +Instead, the client links a zero-length file into its DMD and sets +``deleted: true`` in the directory entry metadata. Other clients +take this as a signal to rename their copies to the backup filename. + +Note that the entry for this zero-length file has a version number as +usual, and later versions may restore the file. + +When a Magic Folder client restarts, we can detect files that had +been downloaded but were deleted while it was not running, because +their paths will have last-downloaded records in the magic folder db +without any corresponding local file. + +Deletion of a directory +~~~~~~~~~~~~~~~~~~~~~~~ + +Local filesystems (unlike a Tahoe-LAFS filesystem) normally cannot +unlink a directory that has any remaining children. 
Therefore a +Magic Folder client cannot delete local copies of directories in +general, because they will typically contain backup files. This must +be done manually on each client if desired. + +Nevertheless, a Magic Folder client that deletes a directory should +set ``deleted: true`` on the metadata entry for the corresponding +zero-length file. This avoids the directory being recreated after +it has been manually deleted from a client. + +Renaming +~~~~~~~~ + +It is sufficient to handle renaming of a file by treating it as a +deletion and an addition under the new name. + +This also applies to directories, although users may find the +resulting behavior unintuitive: all of the files under the old name +will be renamed to backup filenames, and a new directory structure +created under the new name. We believe this is the best that can be +done without imposing unreasonable implementation complexity. + + +Summary +------- + +This completes the design of remote-to-local synchronization. +We realize that it may seem very complicated. Anecdotally, proprietary +filesystem synchronization designs we are aware of, such as Dropbox, +are said to incur similar or greater design complexity.