docs/logging.rst

   1 =============
   2 Tahoe Logging
   3 =============
   4
   5 1.  `Overview`_
   6 2.  `Realtime Logging`_
   7 3.  `Incidents`_
   8 4.  `Working with flogfiles`_
   9 5.  `Gatherers`_
  10
  11     1.  `Incident Gatherer`_
  12     2.  `Log Gatherer`_
  13
  14 6.  `Local twistd.log files`_
  15 7.  `Adding log messages`_
  16 8.  `Log Messages During Unit Tests`_
  17
  18 Overview
  19 ========
  20
  21 Tahoe uses the Foolscap logging mechanism (known as the "flog" subsystem) to
  22 record information about what is happening inside the Tahoe node. This is
  23 primarily for use by programmers and grid operators who want to find out what
  24 went wrong.
  25
  26 The Foolscap logging system is documented at
  27 `<http://foolscap.lothar.com/docs/logging.html>`_.
  28
  29 The Foolscap distribution includes a utility named "``flogtool``" that is
  30 used to get access to many Foolscap logging features. However, using this
  31 command directly on Tahoe log files may fail, due to use of an incorrect
  32 PYTHONPATH. Installing Foolscap v0.6.1 or later and then running
  33 ``bin/tahoe @flogtool`` from the root of a Tahoe-LAFS source distribution
  34 may avoid this problem (but only on Unix, not Windows).
  35
  36
  37 Realtime Logging
  38 ================
  39
  40 When you are working on Tahoe code, and want to see what the node is doing,
  41 the easiest tool to use is "``flogtool tail``". This connects to the Tahoe
  42 node and subscribes to hear about all log events. These events are then
  43 displayed to stdout, and optionally saved to a file.
  44
  45 "``flogtool tail``" connects to the "logport", for which the FURL is stored
  46 in ``BASEDIR/private/logport.furl``. The following command will connect to
  47 this port and start emitting log information::
  48
  49   flogtool tail BASEDIR/private/logport.furl
  50
  51 The ``--save-to FILENAME`` option will save all received events to a file,
  52 where they can be examined later with "``flogtool dump``" or "``flogtool
  53 web-viewer``". The ``--catch-up`` option will ask the node to dump all stored
  54 events before subscribing to new ones (without ``--catch-up``, you will only
  55 hear about events that occur after the tool has connected and subscribed).
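
     The options above can also be assembled from a script. The sketch below
     only builds the command line and does not run it. The helper name
     ``flogtool_tail_argv`` is hypothetical, and placing the options before the
     FURL-file argument is an assumption::

       import os.path

       def flogtool_tail_argv(basedir, save_to=None, catch_up=False):
           """Build (but do not run) a 'flogtool tail' command line."""
           argv = ["flogtool", "tail"]
           if save_to is not None:
               # keep a copy of the events for 'flogtool dump' later
               argv += ["--save-to", save_to]
           if catch_up:
               # also ask the node for events it has already stored
               argv.append("--catch-up")
           argv.append(os.path.join(basedir, "private", "logport.furl"))
           return argv

       print(" ".join(flogtool_tail_argv("BASEDIR", save_to="events.flog",
                                         catch_up=True)))

     The resulting list could be handed to ``subprocess.run()`` on a machine
     where ``flogtool`` is installed.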
  56
  57 Incidents
  58 =========
  59
  60 Foolscap keeps a short list of recent events in memory. When something goes
  61 wrong, it writes all the history it has (and everything that gets logged in
  62 the next few seconds) into a file called an "incident". Each incident goes
  63 into ``BASEDIR/logs/incidents/``, in a file named
  64 "``incident-TIMESTAMP-UNIQUE.flog.bz2``". The default definition of
  65 "something goes wrong" is the generation of a log event at the ``log.WEIRD``
  66 level or higher, but other criteria could be implemented.
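
     The mechanism described above can be modelled in a few lines. This is a toy
     sketch, not Foolscap's implementation: the class name ``IncidentSketch``,
     the event dictionary keys, and the numeric severity values are all
     illustrative assumptions, and the capture of events logged in the seconds
     after the trigger is left out::

       from collections import deque

       # Illustrative severity numbers; only their ordering matters here.
       NOISY, OPERATIONAL, WEIRD, BAD = 10, 20, 30, 40

       class IncidentSketch:
           """Keep a short history and snapshot it when a severe event occurs."""

           def __init__(self, history_size=5, threshold=WEIRD):
               self.recent = deque(maxlen=history_size)  # oldest events fall off
               self.threshold = threshold
               self.incidents = []

           def log(self, level, message):
               event = {"level": level, "message": message}
               self.recent.append(event)
               if level >= self.threshold:
                   # A real incident is written to a file and also includes
                   # the events of the next few seconds; both are omitted.
                   self.incidents.append({"trigger": event,
                                          "events": list(self.recent)})

       recorder = IncidentSketch(history_size=3)
       for i in range(5):
           recorder.log(OPERATIONAL, "routine event %d" % i)
       recorder.log(WEIRD, "something went wrong")
       print(len(recorder.incidents), len(recorder.incidents[0]["events"]))

     Only the last three events survive in the snapshot, because the history is
     bounded; this is why an incident holds recent events rather than the full
     log.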
  67
  68 The typical "incident report" we've seen in a large Tahoe grid is about 40kB
  69 compressed, representing about 1800 recent events.
  70
  71 These "flogfiles" have a similar format to the files saved by "``flogtool
  72 tail --save-to``". They are simply lists of log events, with a small header
  73 to indicate which event triggered the incident.
  74
  75 The "``flogtool dump FLOGFILE``" command will take one of these ``.flog.bz2``
  76 files and print its contents to stdout, one line per event. The raw event
  77 dictionaries can be dumped by using "``flogtool dump --verbose FLOGFILE``".
  78
  79 The "``flogtool web-viewer``" command can be used to examine the flogfile in
  80 a web browser. It runs a small HTTP server and emits the URL on stdout.  This
  81 view provides more structure than the output of "``flogtool dump``": the
  82 parent/child relationships of log events are displayed in a nested format.
  83 "``flogtool web-viewer``" is still fairly immature.
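
     To illustrate what a nested view of parent/child relationships means, the
     following sketch indents each event beneath its parent. It is not the
     web-viewer's code; the function name ``nested_lines`` and the ``num``,
     ``parent``, and ``message`` keys are assumptions made for the example::

       def nested_lines(events):
           """Return display lines with children indented under parents."""
           children = {}
           for ev in events:
               children.setdefault(ev.get("parent"), []).append(ev)
           lines = []

           def walk(parent_num, depth):
               for ev in children.get(parent_num, []):
                   lines.append("  " * depth + ev["message"])
                   walk(ev["num"], depth + 1)

           walk(None, 0)  # unparented events are the roots of the tree
           return lines

       events = [
           {"num": 1, "parent": None, "message": "download started"},
           {"num": 2, "parent": 1, "message": "hash fetch sent"},
           {"num": 3, "parent": 1, "message": "hash fetch received"},
           {"num": 4, "parent": None, "message": "another download started"},
       ]
       print("\n".join(nested_lines(events)))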
  84
  85 Working with flogfiles
  86 ======================
  87
  88 The "``flogtool filter``" command can be used to take a large flogfile
  89 (perhaps one created by the log-gatherer, see below) and copy a subset of
  90 events into a second file. This smaller flogfile may be easier to work with
  91 than the original. The arguments to "``flogtool filter``" specify filtering
  92 criteria: a predicate that each event must match to be copied into the target
  93 file. ``--before`` and ``--after`` are used to exclude events outside a given
  94 window of time. ``--above`` will retain events above a certain severity
  95 level. ``--from`` retains events sent by a specific tubid.
  96 ``--strip-facility`` removes events that were emitted with a given facility
  97 (like ``foolscap.negotiation`` or ``tahoe.upload``).
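
     The filtering criteria can be thought of as one predicate built from the
     individual options. The sketch below is not ``flogtool``'s code: the event
     keys (``time``, ``level``, ``from``, ``facility``) are hypothetical, and
     it assumes that ``--above`` keeps events at or above the given level and
     that ``--strip-facility`` matches on a facility prefix::

       def make_predicate(before=None, after=None, above=None,
                          from_tubid=None, strip_facility=None):
           """Return a function that says whether an event should be copied."""
           def keep(event):
               if before is not None and event["time"] >= before:
                   return False
               if after is not None and event["time"] <= after:
                   return False
               if above is not None and event["level"] < above:
                   return False
               if from_tubid is not None and event["from"] != from_tubid:
                   return False
               facility = event.get("facility") or ""
               if strip_facility is not None and facility.startswith(strip_facility):
                   return False
               return True
           return keep

       events = [
           {"time": 100.0, "level": 20, "from": "tubid-a",
            "facility": "tahoe.upload"},
           {"time": 150.0, "level": 30, "from": "tubid-a",
            "facility": "foolscap.negotiation"},
           {"time": 200.0, "level": 30, "from": "tubid-b",
            "facility": "tahoe.upload"},
           {"time": 250.0, "level": 40, "from": "tubid-a",
            "facility": "tahoe.upload"},
       ]
       keep = make_predicate(after=120.0, above=30,
                             strip_facility="foolscap.negotiation")
       subset = [ev for ev in events if keep(ev)]
       print([ev["time"] for ev in subset])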
  98
  99 Gatherers
 100 =========
 101
 102 In a deployed Tahoe grid, it is useful to get log information automatically
 103 transferred to a central log-gatherer host. This offloads the (admittedly
 104 modest) storage requirements to a different host and provides access to
 105 logfiles from multiple nodes (web-API, storage, or helper) in a single place.
 106
 107 There are two kinds of gatherers: "log gatherer" and "stats gatherer". Each
 108 produces a FURL which needs to be placed in the ``NODEDIR/tahoe.cfg`` file of
 109 each node that is to publish to the gatherer, under the keys
 110 "log_gatherer.furl" and "stats_gatherer.furl" respectively. When the Tahoe
 111 node starts, it will connect to the configured gatherers and offer its
 112 logport: the gatherer will then use the logport to subscribe to hear about
 113 events.
 114
 115 The gatherer will write to files in its working directory, which can then be
 116 examined with tools like "``flogtool dump``" as described above.
 117
 118 Incident Gatherer
 119 -----------------
 120
 121 The "incident gatherer" only collects Incidents: records of the log events
 122 that occurred just before and slightly after some high-level "trigger event"
 123 was recorded. Each incident is classified into a "category": a short string
 124 that summarizes what sort of problem took place. These classification
 125 functions are written after examining a new/unknown incident. The idea is to
 126 recognize when the same problem is happening multiple times.
 127
 128 A collection of classification functions that are useful for Tahoe nodes is
 129 provided in ``misc/incident-gatherer/support_classifiers.py``. There is
 130 roughly one category for each ``log.WEIRD``-or-higher level event in the
 131 Tahoe source code.
 132
 133 The incident gatherer is created with the "``flogtool
 134 create-incident-gatherer WORKDIR``" command, and started with "``tahoe
 135 start``". The generated "``gatherer.tac``" file should be modified to add
 136 classifier functions.
 137
 138 The incident gatherer writes incident names (which are simply the relative
 139 pathname of the ``incident-\*.flog.bz2`` file) into ``classified/CATEGORY``.
 140 For example, the ``classified/mutable-retrieve-uncoordinated-write-error``
 141 file contains a list of all incidents which were triggered by an
 142 uncoordinated write that was detected during mutable file retrieval (caused
 143 when somebody changed the contents of the mutable file in between the node's
 144 mapupdate step and the retrieve step). The ``classified/unknown`` file
 145 contains a list of all incidents that did not match any of the classification
 146 functions.
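
     The classification step can be sketched as follows. This is an
     illustration rather than the gatherer's own code: it assumes a classifier
     receives the trigger event as a dictionary and returns a category string or
     ``None``, that the first match wins, and that the trigger has a ``message``
     key. The incident names are made up::

       def classify_uncoordinated_write(trigger):
           """Return a category name, or None if this classifier does not apply."""
           if "uncoordinated write" in trigger.get("message", ""):
               return "mutable-retrieve-uncoordinated-write-error"
           return None

       def classify_all(incidents, classifiers):
           """Map category -> incident names; unmatched ones go to 'unknown'."""
           classified = {}
           for name, trigger in sorted(incidents.items()):
               category = "unknown"
               for classifier in classifiers:
                   result = classifier(trigger)
                   if result is not None:
                       category = result
                       break
               classified.setdefault(category, []).append(name)
           return classified

       incidents = {
           "node-a/incident-0001.flog.bz2":
               {"message": "uncoordinated write detected"},
           "node-b/incident-0002.flog.bz2":
               {"message": "some new problem"},
       }
       classified = classify_all(incidents, [classify_uncoordinated_write])
       for category in sorted(classified):
           print(category, classified[category])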
 147
 148 At startup, the incident gatherer will automatically reclassify any incident
 149 report which is not mentioned in any of the ``classified/\*`` files. So the
 150 usual workflow is to examine the incidents in ``classified/unknown``, add a
 151 new classification function, delete ``classified/unknown``, then restart the
 152 gatherer with "``tahoe restart WORKDIR``". The incidents which can be
 153 classified with the new functions will be added to their own
 154 ``classified/FOO`` lists, and the remaining ones will be put in
 155 ``classified/unknown``, where the process can be repeated until all events
 156 are classifiable.
 157
 158 The incident gatherer is still fairly immature: future versions will have a
 159 web interface and an RSS feed, so operations personnel can track problems in
 160 the storage grid.
 161
 162 In our experience, each incident takes about two seconds to transfer from the
 163 node that generated it to the gatherer. The gatherer will automatically catch
 164 up to any incidents which occurred while it was offline.
 165
 166 Log Gatherer
 167 ------------
 168
 169 The "Log Gatherer" subscribes to hear about every single event published by
 170 the connected nodes, regardless of severity. This server writes these log
 171 events into a large flogfile that is rotated (closed, compressed, and
 172 replaced with a new one) on a periodic basis. Each flogfile is named
 173 according to the range of time it represents, with names like
 174 "``from-2008-08-26-132256--to-2008-08-26-162256.flog.bz2``". The flogfiles
 175 contain events from many different sources, making it easier to correlate
 176 things that happened on multiple machines (such as comparing a client node
 177 making a request with the storage servers that respond to that request).
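
     Because the time range is encoded in the filename, a script can pick the
     right flogfile without opening it. The sketch below parses the name shown
     above; the function name ``flogfile_time_range`` is hypothetical, and the
     result is a pair of naive datetimes because the text does not say which
     timezone the names use::

       from datetime import datetime

       STAMP = "%Y-%m-%d-%H%M%S"

       def flogfile_time_range(filename):
           """Return (start, end) datetimes parsed from a gatherer flogfile name."""
           prefix, suffix = "from-", ".flog.bz2"
           if not (filename.startswith(prefix) and filename.endswith(suffix)):
               raise ValueError("unexpected flogfile name: %r" % (filename,))
           stem = filename[len(prefix):-len(suffix)]
           start, end = stem.split("--to-")
           return datetime.strptime(start, STAMP), datetime.strptime(end, STAMP)

       start, end = flogfile_time_range(
           "from-2008-08-26-132256--to-2008-08-26-162256.flog.bz2")
       print(start, end, end - start)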
 178
 179 Create the Log Gatherer with the "``flogtool create-gatherer WORKDIR``"
 180 command, and start it with "``tahoe start``". Then copy the contents of the
 181 ``log_gatherer.furl`` file it creates into the ``BASEDIR/tahoe.cfg`` file
 182 (under the key ``log_gatherer.furl`` of the section ``[node]``) of all nodes
 183 that should be sending it log events. (See `<configuration.rst>`_.)
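
     As a quick check that a node is configured as described, the key can be read
     back with Python's standard ``configparser`` module. This is a sketch: the
     FURL shown is a placeholder rather than a real value, and the function name
     ``log_gatherer_furl`` is hypothetical::

       import configparser

       # Placeholder configuration text; a real FURL comes from the gatherer.
       EXAMPLE_CFG = "[node]\nlog_gatherer.furl = pb://TUBID@HOST:PORT/SWISSNUM\n"

       def log_gatherer_furl(cfg_text):
           """Return the configured log gatherer FURL, or None if it is unset."""
           parser = configparser.ConfigParser(interpolation=None)
           parser.read_string(cfg_text)
           return parser.get("node", "log_gatherer.furl", fallback=None)

       print(log_gatherer_furl(EXAMPLE_CFG))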
 184
 185 The "``flogtool filter``" command, described above, is useful to cut down the
 186 potentially large flogfiles into a more focussed form.
 187
 188 Busy nodes, particularly web-API nodes which are performing recursive
 189 deep-size/deep-stats/deep-check operations, can produce a lot of log events.
 190 To avoid overwhelming the node (and using an unbounded amount of memory for
 191 the outbound TCP queue), publishing nodes will start dropping log events when
 192 the outbound queue grows too large. When this occurs, there will be gaps
 193 (non-sequential event numbers) in the log-gatherer's flogfiles.
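
     Such gaps can be found mechanically. The sketch below assumes that the
     event numbers of a single publishing node have already been extracted into
     a list; the function name ``find_gaps`` is hypothetical::

       def find_gaps(event_numbers):
           """Return (first_missing, last_missing) pairs for one node's events."""
           nums = sorted(event_numbers)
           gaps = []
           for prev, cur in zip(nums, nums[1:]):
               if cur - prev > 1:
                   gaps.append((prev + 1, cur - 1))
           return gaps

       print(find_gaps([1, 2, 3, 7, 8, 10]))

     Events from different nodes need to be checked separately, since the
     flogfile interleaves events from many sources.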
 194
 195 Local twistd.log files
 196 ======================
 197
 198 [TODO: not yet true, requires foolscap-0.3.1 and a change to ``allmydata.node``]
 199
 200 In addition to the foolscap-based event logs, certain high-level events will
 201 be recorded directly in human-readable text form, in the
 202 ``BASEDIR/logs/twistd.log`` file (and its rotated old versions:
 203 ``twistd.log.1``, ``twistd.log.2``, etc). These files do not contain as much
 204 information as the flogfiles available through the means described
 205 previously, but they are immediately available to the curious developer, and
 206 are retained until the twistd.log.NN files are explicitly deleted.
 207
 208 Only events at the ``log.OPERATIONAL`` level or higher are bridged to
 209 ``twistd.log`` (i.e. not the ``log.NOISY`` debugging events). In addition,
 210 foolscap internal events (like connection negotiation messages) are not
 211 bridged to ``twistd.log``.
 212
 213 Adding log messages
 214 ===================
 215
 216 When adding new code, the Tahoe developer should add a reasonable number of
 217 new log events. For details, please see the Foolscap logging documentation,
 218 but a few notes are worth stating here:
 219
 220 * use a facility prefix of "``tahoe.``", like "``tahoe.mutable.publish``"
 221
 222 * assign each severe (``log.WEIRD`` or higher) event a unique message
 223   identifier, as the ``umid=`` argument to the ``log.msg()`` call. The
 224   ``misc/coding_tools/make_umid`` script may be useful for this purpose.
 225   This will make it easier to write a classification function for these
 226   messages.
 227
 228 * use the ``parent=`` argument whenever the event is causally/temporally
 229   clustered with its parent. For example, a download process that involves
 230   three sequential hash fetches could announce the send and receipt of those
 231   hash-fetch messages with a ``parent=`` argument that ties them to the
 232   overall download process. However, each new web-API download request should
 233   be unparented.
 234
 235 * use the ``format=`` argument in preference to the ``message=`` argument.
 236   E.g. use ``log.msg(format="got %(n)d shares, need %(k)d", n=n, k=k)``
 237   instead of ``log.msg("got %d shares, need %d" % (n,k))``. This will allow
 238   later tools to analyze the event without needing to scrape/reconstruct the
 239   structured data out of the formatted string.
 240
 241 * pass extra information as extra keyword arguments, even if they aren't
 242   included in the ``format=`` string. This information will be displayed in
 243   the "``flogtool dump --verbose``" output, as well as being available to
 244   other tools. The ``umid=`` argument should be passed this way.
 245
 246 * use ``log.err`` for the catch-all ``addErrback`` that gets attached to the
 247   end of any given Deferred chain. When used in conjunction with
 248   ``FLOGTOTWISTED=1``, ``log.err()`` will tell Twisted about the error-nature
 249   of the log message, causing Trial to flunk the test (with an "ERROR"
 250   indication that prints a copy of the Failure, including a traceback).
 251   Don't use ``log.err`` for events that are ``BAD`` but handled (like hash
 252   failures: since these are often deliberately provoked by test code, they
 253   should not cause test failures): use ``log.msg(level=BAD)`` for those
 254   instead.
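
     The advantage of ``format=`` and extra keyword arguments can be seen without
     Foolscap at all. In the sketch below, plain dictionaries stand in for log
     events and ``render`` is a hypothetical helper; the ``umid`` value is made
     up for the example::

       def render(event):
           """Produce the human-readable text for an event dictionary."""
           if "format" in event:
               # the format string is filled in from the event's own fields
               return event["format"] % event
           return event.get("message", "")

       structured = {
           "facility": "tahoe.mutable.publish",
           "format": "got %(n)d shares, need %(k)d",
           "n": 3,
           "k": 7,
           "umid": "EXAMPLE",  # extra field, not part of the format string
       }
       preformatted = {"message": "got %d shares, need %d" % (3, 7)}

       # Both render to the same text...
       print(render(structured))
       print(render(preformatted))
       # ...but only the structured event still exposes the numbers directly.
       print(structured["k"] - structured["n"])

     A tool reading the second event would have to parse the numbers back out of
     the string, which is the scraping that the ``format=`` advice avoids.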
 255
 256
 257 Log Messages During Unit Tests
 258 ==============================
 259
 260 If a test is failing and you aren't sure why, start by enabling
 261 ``FLOGTOTWISTED=1`` like this::
 262
 263   make test FLOGTOTWISTED=1
 264
 265 With ``FLOGTOTWISTED=1``, sufficiently-important log events will be written
 266 into ``_trial_temp/test.log``, which may give you more ideas about why the
 267 test is failing. Note, however, that ``_trial_temp/test.log`` will not receive
 268 messages below the ``level=OPERATIONAL`` threshold, due to this issue:
 269 `<http://foolscap.lothar.com/trac/ticket/154>`_
 270
 271
 272 If that isn't enough, look at the detailed foolscap logging messages instead,
 273 by running the tests like this::
 274
 275   make test FLOGFILE=flog.out.bz2 FLOGLEVEL=1 FLOGTOTWISTED=1
 276
 277 The first environment variable will cause foolscap log events to be written
 278 to ``./flog.out.bz2`` (instead of merely being recorded in the circular
 279 buffers for the use of remote subscribers or incident reports). The second
 280 will cause all log events to be written out, not just the higher-severity
 281 ones. The third will cause twisted log events (like the markers that indicate
 282 when each unit test is starting and stopping) to be copied into the flogfile,
 283 making it easier to correlate log events with unit tests.
 284
 285 Enabling this form of logging appears to roughly double the runtime of the
 286 unit tests. The ``flog.out.bz2`` file is approximately 2MB.
 287
 288 You can then use "``flogtool dump``" or "``flogtool web-viewer``" on the
 289 resulting ``flog.out.bz2`` file.
 290
 291 ("``flogtool tail``" and the log-gatherer are not useful during unit tests,
 292 since there is no single Tub to which all the log messages are published).
 293
 294 It is possible for setting these environment variables to cause spurious test
 295 failures in tests with race condition bugs. All known instances of this have
 296 been fixed as of Tahoe-LAFS v1.7.1.