docs/logging.rst

   1 =============
   2 Tahoe Logging
   3 =============
   4
   5 1.  `Overview`_
   6 2.  `Realtime Logging`_
   7 3.  `Incidents`_
   8 4.  `Working with flogfiles`_
   9 5.  `Gatherers`_
  10
  11     1.  `Incident Gatherer`_
  12     2.  `Log Gatherer`_
  13
  14 6.  `Local twistd.log files`_
  15 7.  `Adding log messages`_
  16 8.  `Log Messages During Unit Tests`_
  17
  18 Overview
  19 ========
  20
  21 Tahoe uses the Foolscap logging mechanism (known as the "flog" subsystem) to
  22 record information about what is happening inside the Tahoe node. This is
  23 primarily for use by programmers and grid operators who want to find out what
  24 went wrong.
  25
  26 The foolscap logging system is documented at
  27 `<http://foolscap.lothar.com/docs/logging.html>`_.
  28
  29 The foolscap distribution includes a utility named "``flogtool``" (usually
  30 at ``/usr/bin/flogtool`` on Unix) which is used to get access to many
  31 foolscap logging features.
  32
  33 Realtime Logging
  34 ================
  35
  36 When you are working on Tahoe code, and want to see what the node is doing,
  37 the easiest tool to use is "``flogtool tail``". This connects to the Tahoe
  38 node and subscribes to hear about all log events. These events are then
  39 displayed to stdout, and optionally saved to a file.
  40
  41 "``flogtool tail``" connects to the "logport", for which the FURL is stored
  42 in ``BASEDIR/private/logport.furl`` . The following command will connect to
  43 this port and start emitting log information::
  44
  45   flogtool tail BASEDIR/private/logport.furl
  46
  47 The ``--save-to FILENAME`` option will save all received events to a file,
  48 where then can be examined later with "``flogtool dump``" or
  49 "``flogtool web-viewer``". The ``--catch-up`` option will ask the node to
  50 dump all stored events before subscribing to new ones (without ``--catch-up``,
  51 you will only hear about events that occur after the tool has connected and
  52 subscribed).
  53
  54 Incidents
  55 =========
  56
  57 Foolscap keeps a short list of recent events in memory. When something goes
  58 wrong, it writes all the history it has (and everything that gets logged in
  59 the next few seconds) into a file called an "incident". These files go into
  60 ``BASEDIR/logs/incidents/`` , in a file named
  61 "``incident-TIMESTAMP-UNIQUE.flog.bz2``". The default definition of
  62 "something goes wrong" is the generation of a log event at the ``log.WEIRD``
  63 level or higher, but other criteria could be implemented.
  64
  65 The typical "incident report" we've seen in a large Tahoe grid is about 40kB
  66 compressed, representing about 1800 recent events.
  67
  68 These "flogfiles" have a similar format to the files saved by
  69 "``flogtool tail --save-to``". They are simply lists of log events, with a
  70 small header to indicate which event triggered the incident.
  71
  72 The "``flogtool dump FLOGFILE``" command will take one of these ``.flog.bz2``
  73 files and print their contents to stdout, one line per event. The raw event
  74 dictionaries can be dumped by using "``flogtool dump --verbose FLOGFILE``".
  75
  76 The "``flogtool web-viewer``" command can be used to examine the flogfile
  77 in a web browser. It runs a small HTTP server and emits the URL on stdout.
  78 This view provides more structure than the output of "``flogtool dump``":
  79 the parent/child relationships of log events is displayed in a nested format.
  80 "``flogtool web-viewer``" is still fairly immature.
  81
  82 Working with flogfiles
  83 ======================
  84
  85 The "``flogtool filter``" command can be used to take a large flogfile
  86 (perhaps one created by the log-gatherer, see below) and copy a subset of
  87 events into a second file. This smaller flogfile may be easier to work with
  88 than the original. The arguments to "``flogtool filter``" specify filtering
  89 criteria: a predicate that each event must match to be copied into the
  90 target file. ``--before`` and ``--after`` are used to exclude events outside
  91 a given window of time. ``--above`` will retain events above a certain
  92 severity level. ``--from`` retains events send by a specific tubid.
  93 ``--strip-facility`` removes events that were emitted with a given facility
  94 (like ``foolscap.negotiation`` or ``tahoe.upload``).
  95
  96 Gatherers
  97 =========
  98
  99 In a deployed Tahoe grid, it is useful to get log information automatically
 100 transferred to a central log-gatherer host. This offloads the (admittedly
 101 modest) storage requirements to a different host and provides access to
 102 logfiles from multiple nodes (web-API, storage, or helper) in a single place.
 103
 104 There are two kinds of gatherers: "log gatherer" and "stats gatherer". Each
 105 produces a FURL which needs to be placed in the ``NODEDIR/tahoe.cfg`` file
 106 of each node that is to publish to the gatherer, under the keys
 107 "log_gatherer.furl" and "stats_gatherer.furl" respectively. When the Tahoe
 108 node starts, it will connect to the configured gatherers and offer its
 109 logport: the gatherer will then use the logport to subscribe to hear about
 110 events.
 111
 112 The gatherer will write to files in its working directory, which can then be
 113 examined with tools like "``flogtool dump``" as described above.
 114
 115 Incident Gatherer
 116 -----------------
 117
 118 The "incident gatherer" only collects Incidents: records of the log events
 119 that occurred just before and slightly after some high-level "trigger event"
 120 was recorded. Each incident is classified into a "category": a short string
 121 that summarizes what sort of problem took place. These classification
 122 functions are written after examining a new/unknown incident. The idea is to
 123 recognize when the same problem is happening multiple times.
 124
 125 A collection of classification functions that are useful for Tahoe nodes are
 126 provided in ``misc/incident-gatherer/support_classifiers.py`` . There is
 127 roughly one category for each ``log.WEIRD``-or-higher level event in the
 128 Tahoe source code.
 129
 130 The incident gatherer is created with the "``flogtool create-incident-gatherer
 131 WORKDIR``" command, and started with "``tahoe start``". The generated
 132 "``gatherer.tac``" file should be modified to add classifier functions.
 133
 134 The incident gatherer writes incident names (which are simply the relative
 135 pathname of the ``incident-\*.flog.bz2`` file) into ``classified/CATEGORY``.
 136 For example, the ``classified/mutable-retrieve-uncoordinated-write-error``
 137 file contains a list of all incidents which were triggered by an uncoordinated
 138 write that was detected during mutable file retrieval (caused when somebody
 139 changed the contents of the mutable file in between the node's mapupdate step
 140 and the retrieve step). The ``classified/unknown`` file contains a list of all
 141 incidents that did not match any of the classification functions.
 142
 143 At startup, the incident gatherer will automatically reclassify any incident
 144 report which is not mentioned in any of the ``classified/\*`` files. So the
 145 usual workflow is to examine the incidents in ``classified/unknown``, add a
 146 new classification function, delete ``classified/unknown``, then bound the
 147 gatherer with "``tahoe restart WORKDIR``". The incidents which can be
 148 classified with the new functions will be added to their own ``classified/FOO``
 149 lists, and the remaining ones will be put in ``classified/unknown``, where
 150 the process can be repeated until all events are classifiable.
 151
 152 The incident gatherer is still fairly immature: future versions will have a
 153 web interface and an RSS feed, so operations personnel can track problems in
 154 the storage grid.
 155
 156 In our experience, each incident takes about two seconds to transfer from
 157 the node that generated it to the gatherer. The gatherer will automatically
 158 catch up to any incidents which occurred while it is offline.
 159
 160 Log Gatherer
 161 ------------
 162
 163 The "Log Gatherer" subscribes to hear about every single event published by
 164 the connected nodes, regardless of severity. This server writes these log
 165 events into a large flogfile that is rotated (closed, compressed, and
 166 replaced with a new one) on a periodic basis. Each flogfile is named
 167 according to the range of time it represents, with names like
 168 "``from-2008-08-26-132256--to-2008-08-26-162256.flog.bz2``". The flogfiles
 169 contain events from many different sources, making it easier to correlate
 170 things that happened on multiple machines (such as comparing a client node
 171 making a request with the storage servers that respond to that request).
 172
 173 Create the Log Gatherer with the "``flogtool create-gatherer
 174 WORKDIR``" command, and start it with "``tahoe start``". Then copy the
 175 contents of the ``log_gatherer.furl`` file it creates into the
 176 ``BASEDIR/tahoe.cfg`` file (under the key ``log_gatherer.furl`` of the
 177 section ``[node]``) of all nodes that should be sending it log
 178 events. (See `<configuration.rst>`_.)
 179
 180 The "``flogtool filter``" command, described above, is useful to cut down the
 181 potentially large flogfiles into a more focussed form.
 182
 183 Busy nodes, particularly web-API nodes which are performing recursive
 184 deep-size/deep-stats/deep-check operations, can produce a lot of log events.
 185 To avoid overwhelming the node (and using an unbounded amount of memory for
 186 the outbound TCP queue), publishing nodes will start dropping log events when
 187 the outbound queue grows too large. When this occurs, there will be gaps
 188 (non-sequential event numbers) in the log-gatherer's flogfiles.
 189
 190 Local twistd.log files
 191 ======================
 192
 193 [TODO: not yet true, requires foolscap-0.3.1 and a change to ``allmydata.node``]
 194
 195 In addition to the foolscap-based event logs, certain high-level events will
 196 be recorded directly in human-readable text form, in the
 197 ``BASEDIR/logs/twistd.log`` file (and its rotated old versions: ``twistd.log.1``,
 198 ``twistd.log.2``, etc). This form does not contain as much information as the
 199 flogfiles available through the means described previously, but they are
 200 immediately available to the curious developer, and are retained until the
 201 twistd.log.NN files are explicitly deleted.
 202
 203 Only events at the ``log.OPERATIONAL`` level or higher are bridged to
 204 ``twistd.log`` (i.e. not the ``log.NOISY`` debugging events). In addition,
 205 foolscap internal events (like connection negotiation messages) are not
 206 bridged to ``twistd.log``.
 207
 208 Adding log messages
 209 ===================
 210
 211 When adding new code, the Tahoe developer should add a reasonable number of
 212 new log events. For details, please see the Foolscap logging documentation,
 213 but a few notes are worth stating here:
 214
 215 * use a facility prefix of "``tahoe.``", like "``tahoe.mutable.publish``"
 216
 217 * assign each severe (``log.WEIRD`` or higher) event a unique message
 218   identifier, as the ``umid=`` argument to the ``log.msg()`` call. The
 219   ``misc/coding_tools/make_umid`` script may be useful for this purpose.
 220   This will make it easier to write a classification function for these
 221   messages.
 222
 223 * use the ``parent=`` argument whenever the event is causally/temporally
 224   clustered with its parent. For example, a download process that involves
 225   three sequential hash fetches could announce the send and receipt of those
 226   hash-fetch messages with a ``parent=`` argument that ties them to the
 227   overall download process. However, each new web-API download request
 228   should be unparented.
 229
 230 * use the ``format=`` argument in preference to the ``message=`` argument.
 231   E.g. use ``log.msg(format="got %(n)d shares, need %(k)d", n=n, k=k)``
 232   instead of ``log.msg("got %d shares, need %d" % (n,k))``. This will allow
 233   later tools to analyze the event without needing to scrape/reconstruct
 234   the structured data out of the formatted string.
 235
 236 * Pass extra information as extra keyword arguments, even if they aren't
 237   included in the ``format=`` string. This information will be displayed in
 238   the "``flogtool dump --verbose``" output, as well as being available to
 239   other tools. The ``umid=`` argument should be passed this way.
 240
 241 * use ``log.err`` for the catch-all ``addErrback`` that gets attached to
 242   the end of any given Deferred chain. When used in conjunction with
 243   ``LOGTOTWISTED=1``, ``log.err()`` will tell Twisted about the error-nature
 244   of the log message, causing Trial to flunk the test (with an "ERROR"
 245   indication that prints a copy of the Failure, including a traceback).
 246   Don't use ``log.err`` for events that are ``BAD`` but handled (like hash
 247   failures: since these are often deliberately provoked by test code, they
 248   should not cause test failures): use ``log.msg(level=BAD)`` for those
 249   instead.
 250
 251
 252 Log Messages During Unit Tests
 253 ==============================
 254
 255 If a test is failing and you aren't sure why, start by enabling
 256 ``FLOGTOTWISTED=1`` like this::
 257
 258   make test FLOGTOTWISTED=1
 259
 260 With ``FLOGTOTWISTED=1``, sufficiently-important log events will be written
 261 into ``_trial_temp/test.log``, which may give you more ideas about why the
 262 test is failing. Note, however, that ``_trial_temp/log.out`` will not receive
 263 messages below the ``level=OPERATIONAL`` threshold, due to this issue:
 264 `<http://foolscap.lothar.com/trac/ticket/154>`_
 265
 266
 267 If that isn't enough, look at the detailed foolscap logging messages instead,
 268 by running the tests like this::
 269
 270   make test FLOGFILE=flog.out.bz2 FLOGLEVEL=1 FLOGTOTWISTED=1
 271
 272 The first environment variable will cause foolscap log events to be written
 273 to ``./flog.out.bz2`` (instead of merely being recorded in the circular buffers
 274 for the use of remote subscribers or incident reports). The second will cause
 275 all log events to be written out, not just the higher-severity ones. The
 276 third will cause twisted log events (like the markers that indicate when each
 277 unit test is starting and stopping) to be copied into the flogfile, making it
 278 easier to correlate log events with unit tests.
 279
 280 Enabling this form of logging appears to roughly double the runtime of the
 281 unit tests. The ``flog.out.bz2`` file is approximately 2MB.
 282
 283 You can then use "``flogtool dump``" or "``flogtool web-viewer``" on the
 284 resulting ``flog.out`` file.
 285
 286 ("``flogtool tail``" and the log-gatherer are not useful during unit tests,
 287 since there is no single Tub to which all the log messages are published).
 288
 289 It is possible for setting these environment variables to cause spurious test
 290 failures in tests with race condition bugs. All known instances of this have
 291 been fixed as of Tahoe-LAFS v1.7.1.