docs/logging.txt

   1 = Tahoe Logging =
   2
   3 1.  Overview
   4 2.  Realtime Logging
   5 3.  Incidents
   6 4.  Working with flogfiles
   7 5.  Gatherers
   8   5.1.  Incident Gatherer
   9   5.2.  Log Gatherer
  10 6.  Local twistd.log files
  11 7.  Adding log messages
  12 8.  Log Messages During Unit Tests
  13
  14 == Overview ==
  15
  16 Tahoe uses the Foolscap logging mechanism (known as the "flog" subsystem) to
  17 record information about what is happening inside the Tahoe node. This is
  18 primarily for use by programmers and grid operators who want to find out what
  19 went wrong.
  20
  21 The foolscap logging system is documented here:
  22
  23   http://foolscap.lothar.com/docs/logging.html
  24
  25 The foolscap distribution includes a utility named "flogtool" (usually at
  26 /usr/bin/flogtool) which is used to get access to many foolscap logging
  27 features.
  28
  29 == Realtime Logging ==
  30
  31 When you are working on Tahoe code, and want to see what the node is doing,
  32 the easiest tool to use is "flogtool tail". This connects to the Tahoe node
  33 and subscribes to hear about all log events. These events are then displayed
  34 to stdout, and optionally saved to a file.
  35
  36 "flogtool tail" connects to the "logport", for which the FURL is stored in
  37 BASEDIR/private/logport.furl . The following command will connect to this
  38 port and start emitting log information:
  39
  40  flogtool tail BASEDIR/private/logport.furl
  41
  42 The "--save-to FILENAME" option will save all received events to a file,
  43 where they can be examined later with "flogtool dump" or "flogtool
  44 web-viewer". The --catch-up flag will ask the node to dump all stored events
  45 before subscribing to new ones (without --catch-up, you will only hear about
  46 events that occur after the tool has connected and subscribed).
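
As a rough illustration of the command line described above, the following
Python sketch assembles the argument list for "flogtool tail". The helper name
flogtool_tail_argv is hypothetical (it is not part of Foolscap or Tahoe); only
the "--save-to" and "--catch-up" options and the private/logport.furl path
come from the text above.

```python
import os


def flogtool_tail_argv(basedir, save_to=None, catch_up=False):
    """Build the argv list for "flogtool tail" against one node.

    Hypothetical helper: only the option names and the FURL path are
    taken from the documentation above.
    """
    argv = ["flogtool", "tail"]
    if save_to is not None:
        # Save every received event so it can be examined later.
        argv += ["--save-to", save_to]
    if catch_up:
        # Ask the node for its stored events before subscribing.
        argv.append("--catch-up")
    argv.append(os.path.join(basedir, "private", "logport.furl"))
    return argv
```

Passing the resulting list to subprocess.call() would run the tool, assuming
flogtool is installed and on the PATH.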
  47
  48 == Incidents ==
  49
  50 Foolscap keeps a short list of recent events in memory. When something goes
  51 wrong, it writes all the history it has (and everything that gets logged in
  52 the next few seconds) into a file called an "incident". Each incident is
  53 written to BASEDIR/logs/incidents/ , in a file named
  54 "incident-TIMESTAMP-UNIQUE.flog.bz2". The default definition of "something
  55 goes wrong" is the generation of a log event at the log.WEIRD level or
  56 higher, but other criteria could be implemented.
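
The trigger mechanism described above can be pictured with a small sketch.
This is not Foolscap's implementation: the class name, the event layout and
the numeric severity values are all assumptions made for illustration, and
the sketch leaves out the "next few seconds" of trailing events.

```python
from collections import deque

# Illustrative severity numbers only; the real constants live in
# Foolscap's logging module and may differ.
OPERATIONAL, WEIRD = 20, 30


class IncidentSketch:
    """Keep recent events in memory and snapshot them on a severe event."""

    def __init__(self, history_size=100, trigger_level=WEIRD):
        # A bounded deque discards the oldest event once it is full.
        self.recent = deque(maxlen=history_size)
        self.trigger_level = trigger_level
        self.incidents = []  # each incident is a list of events

    def log(self, level, message):
        event = {"level": level, "message": message}
        self.recent.append(event)
        if level >= self.trigger_level:
            # Snapshot the stored history, trigger event included.
            self.incidents.append(list(self.recent))
        return event
```

With history_size=3, logging five OPERATIONAL events and then one WEIRD event
produces a single incident holding the last two ordinary events plus the
trigger.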
  57
  58 The typical "incident report" we've seen in a large Tahoe grid is about 40kB
  59 compressed, representing about 1800 recent events.
  60
  61 These "flogfiles" have a similar format to the files saved by "flogtool tail
  62 --save-to". They are simply lists of log events, with a small header to
  63 indicate which event triggered the incident.
  64
  65 The "flogtool dump FLOGFILE" command will take one of these .flog.bz2 files
  66 and print its contents to stdout, one line per event. The raw event
  67 dictionaries can be dumped by using "flogtool dump --verbose FLOGFILE".
  68
  69 The "flogtool web-viewer" command can be used to examine the flogfile in a
  70 web browser. It runs a small HTTP server and emits the URL on stdout. This
  71 view provides more structure than the output of "flogtool dump": the
  72 parent/child relationships of log events are displayed in a nested format.
  73 "flogtool web-viewer" is still fairly immature.
  74
  75 == Working with flogfiles ==
  76
  77 The "flogtool filter" command can be used to take a large flogfile (perhaps
  78 one created by the log-gatherer, see below) and copy a subset of events into
  79 a second file. This smaller flogfile may be easier to work with than the
  80 original. The arguments to "flogtool filter" specify filtering criteria: a
  81 predicate that each event must match to be copied into the target file.
  82 --before and --after are used to exclude events outside a given window of
  83 time. --above will retain events above a certain severity level. --from
  84 retains events sent by a specific tubid. --strip-facility removes events that
  85 were emitted with a given facility (like foolscap.negotiation or
  86 tahoe.upload).
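
To make the idea of a filtering predicate concrete, here is a minimal Python
sketch that combines the same kinds of criteria. It is not the flogtool
implementation. The event keys ("time", "level", "from", "facility"), the
treatment of boundary values, and the prefix match on facilities are
assumptions chosen for the example.

```python
def make_filter(before=None, after=None, above=None, from_tubid=None,
                strip_facilities=()):
    """Return a predicate that is true for events that should be kept.

    Every criterion is optional; an event must pass all of those given.
    """
    def keep(event):
        if before is not None and event["time"] >= before:
            return False  # at or after the end of the time window
        if after is not None and event["time"] <= after:
            return False  # at or before the start of the time window
        if above is not None and event["level"] < above:
            return False  # assumption: the threshold itself is kept
        if from_tubid is not None and event["from"] != from_tubid:
            return False  # sent by some other tub
        facility = event.get("facility", "")
        if any(facility.startswith(prefix) for prefix in strip_facilities):
            return False  # facility was asked to be stripped
        return True
    return keep
```

Copying a subset of events is then a matter of applying the predicate, for
example [e for e in events if keep(e)].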
  87
  88 == Gatherers ==
  89
  90 In a deployed Tahoe grid, it is useful to get log information automatically
  91 transferred to a central log-gatherer host. This offloads the (admittedly
  92 modest) storage requirements to a different host and provides access to
  93 logfiles from multiple (webapi/storage/helper) nodes in a single place.
  94
  95 There are two kinds of gatherers. Both produce a FURL which needs to be
  96 placed in the BASEDIR/log_gatherer.furl file (one FURL per line) of the nodes
  97 that are to publish their logs to the gatherer. When the Tahoe node starts,
  98 it will connect to the configured gatherers and offer its logport: the
  99 gatherer will then use the logport to subscribe to hear about events.
 100
 101 The gatherer will write to files in its working directory, which can then be
 102 examined with tools like "flogtool dump" as described above.
 103
 104 === Incident Gatherer ===
 105
 106 The "incident gatherer" only collects Incidents: records of the log events
 107 that occurred just before and slightly after some high-level "trigger event"
 108 was recorded. Each incident is classified into a "category": a short string
 109 that summarizes what sort of problem took place. These classification
 110 functions are written after examining a new/unknown incident. The idea is to
 111 recognize when the same problem is happening multiple times.
 112
 113 A collection of classification functions that are useful for Tahoe nodes is
 114 provided in misc/incident-gatherer/support_classifiers.py . There is roughly
 115 one category for each log.WEIRD-or-higher level event in the Tahoe source
 116 code.
 117
 118 The incident gatherer is created with the "flogtool create-incident-gatherer
 119 WORKDIR" command, and started with "tahoe start". The generated
 120 "gatherer.tac" file should be modified to add classifier functions.
 121
 122 The incident gatherer writes incident names (which are simply the relative
 123 pathname of the incident-*.flog.bz2 file) into classified/CATEGORY. For
 124 example, the classified/mutable-retrieve-uncoordinated-write-error file
 125 contains a list of all incidents which were triggered by an uncoordinated
 126 write that was detected during mutable file retrieval (caused when somebody
 127 changed the contents of the mutable file in between the node's mapupdate step
 128 and the retrieve step). The classified/unknown file contains a list of all
 129 incidents that did not match any of the classification functions.
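
The classification scheme can be sketched in a few lines of Python. Everything
here is illustrative: the classifier signature (a trigger-event dict in, a
category string or None out), the "first match wins" rule, the message-based
matching and the incident names are assumptions, not the gatherer's actual
code.

```python
def classify_uncoordinated_write(trigger):
    """Hypothetical classifier: match on text in the trigger's message."""
    if "uncoordinated write" in trigger.get("message", ""):
        return "mutable-retrieve-uncoordinated-write-error"
    return None


def classify(trigger, classifiers):
    """Return the first category any classifier assigns, else 'unknown'."""
    for classifier in classifiers:
        category = classifier(trigger)
        if category is not None:
            return category
    return "unknown"


def build_classified(incidents, classifiers):
    """Map category -> list of incident names, mirroring classified/*."""
    classified = {}
    for name, trigger in incidents:
        category = classify(trigger, classifiers)
        classified.setdefault(category, []).append(name)
    return classified
```

An incident that no classifier recognizes ends up under "unknown", which is
the list an operator would examine before writing a new classifier.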
 130
 131 At startup, the incident gatherer will automatically reclassify any incident
 132 report which is not mentioned in any of the classified/* files. So the usual
 133 workflow is to examine the incidents in classified/unknown, add a new
 134 classification function, delete classified/unknown, then restart the gatherer
 135 with "tahoe restart WORKDIR". The incidents which can be classified with the
 136 new functions will be added to their own classified/FOO lists, and the
 137 remaining ones will be put in classified/unknown, where the process can be
 138 repeated until all events are classifiable.
 139
 140 The incident gatherer is still fairly immature: future versions will have a
 141 web interface and an RSS feed, so operations personnel can track problems in
 142 the storage grid.
 143
 144 In our experience, each Incident takes about two seconds to transfer from the
 145 node which generated it to the gatherer. The gatherer will automatically
 146 catch up to any incidents which occurred while it was offline.
 147
 148 === Log Gatherer ===
 149
 150 The "Log Gatherer" subscribes to hear about every single event published by
 151 the connected nodes, regardless of severity. This server writes these log
 152 events into a large flogfile that is rotated (closed, compressed, and
 153 replaced with a new one) on a periodic basis. Each flogfile is named
 154 according to the range of time it represents, with names like
 155 "from-2008-08-26-132256--to-2008-08-26-162256.flog.bz2". The flogfiles
 156 contain events from many different sources, making it easier to correlate
 157 things that happened on multiple machines (such as comparing a client node
 158 making a request with the storage servers that respond to that request).
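
When hunting for the flogfile that covers a particular moment, it can help to
turn these names back into timestamps. The sketch below assumes the stamps
follow a YYYY-MM-DD-HHMMSS layout, as the example name suggests. The function
name is hypothetical, and the timezone of the stamps is not stated here, so
the results are naive datetimes.

```python
import re
from datetime import datetime

# Pattern inferred from the example filename above (an assumption).
_NAME_RE = re.compile(
    r"^from-(\d{4}-\d{2}-\d{2}-\d{6})--to-(\d{4}-\d{2}-\d{2}-\d{6})"
    r"\.flog\.bz2$")
_STAMP = "%Y-%m-%d-%H%M%S"


def parse_flogfile_name(name):
    """Return (start, end) datetimes for a rotated flogfile, or None."""
    match = _NAME_RE.match(name)
    if match is None:
        return None
    start, end = (datetime.strptime(s, _STAMP) for s in match.groups())
    return start, end
```

For the example name above, the two timestamps are exactly three hours apart.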
 159
 160 The Log Gatherer is created with the "flogtool create-gatherer WORKDIR"
 161 command, and started with "tahoe start". The log_gatherer.furl it creates
 162 then needs to be copied into the BASEDIR/log_gatherer.furl file of all nodes
 163 which should be sending it log events.
 164
 165 The "flogtool filter" command, described above, is useful to cut down the
 166 potentially-large flogfiles into a more narrowly-focused form.
 167
 168 Busy nodes, particularly webapi nodes which are performing recursive
 169 deep-size/deep-stats/deep-check operations, can produce a lot of log events.
 170 To avoid overwhelming the node (and using an unbounded amount of memory for
 171 the outbound TCP queue), publishing nodes will start dropping log events when
 172 the outbound queue grows too large. When this occurs, there will be gaps
 173 (non-sequential event numbers) in the log-gatherer's flogfiles.
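
Such gaps are easy to spot mechanically. The following sketch assumes the
event numbers from a single node have already been extracted, in order, as
plain integers; how they are stored inside a flogfile is not covered here, and
the function name is hypothetical.

```python
def find_gaps(event_numbers):
    """Return (last_seen, next_seen) pairs around each missing run.

    Assumes the numbers come from one node and are in increasing order.
    """
    gaps = []
    previous = None
    for number in event_numbers:
        if previous is not None and number != previous + 1:
            gaps.append((previous, number))
        previous = number
    return gaps
```

For example, the sequence 1, 2, 3, 7, 8, 12 yields two gaps: one between 3 and
7, and one between 8 and 12.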
 174
 175 == Local twistd.log files ==
 176
 177 [TODO: not yet true, requires foolscap-0.3.1 and a change to allmydata.node]
 178
 179 In addition to the foolscap-based event logs, certain high-level events will
 180 be recorded directly in human-readable text form, in the
 181 BASEDIR/logs/twistd.log file (and its rotated old versions: twistd.log.1,
 182 twistd.log.2, etc). This form does not contain as much information as the
 183 flogfiles available through the means described previously, but it is
 184 immediately available to the curious developer, and is retained until the
 185 twistd.log.NN files are explicitly deleted.
 186
 187 Only events at the log.OPERATIONAL level or higher are bridged to twistd.log
 188 (i.e. not the log.NOISY debugging events). In addition, foolscap internal
 189 events (like connection negotiation messages) are not bridged to twistd.log .
 190
 191 == Adding log messages ==
 192
 193 When adding new code, the Tahoe developer should add a reasonable number of
 194 new log events. For details, please see the Foolscap logging documentation,
 195 but a few notes are worth stating here:
 196
 197  * use a facility prefix of "tahoe.", like "tahoe.mutable.publish"
 198
 199  * assign each severe (log.WEIRD or higher) event a unique message
 200    identifier, as the umid= argument to the log.msg() call (the
 201    misc/coding_tools/make_umid script may be useful for this). This will
 202    make it easier to write a classification function for these messages.
 203
 204  * use the parent= argument whenever the event is causally/temporally
 205    clustered with its parent. For example, a download process that involves
 206    three sequential hash fetches could announce the send and receipt of those
 207    hash-fetch messages with a parent= argument that ties them to the overall
 208    download process. However, each new webapi download request should be
 209    unparented.
 210
 211  * use the format= argument in preference to the message= argument. E.g.
 212    use log.msg(format="got %(n)d shares, need %(k)d", n=n, k=k) instead of
 213    log.msg("got %d shares, need %d" % (n,k)). This will allow later tools to
 214    analyze the event without needing to scrape/reconstruct the structured
 215    data out of the formatted string.
 216
 217  * pass extra information as extra keyword arguments, even if they aren't
 218    included in the format= string. This information will be displayed in the
 219    "flogtool dump --verbose" output, as well as being available to other
 220    tools. The umid= argument should be passed this way.
 221
 222  * use log.err for the catch-all addErrback that gets attached to the end of
 223    any given Deferred chain. When used in conjunction with FLOGTOTWISTED=1,
 224    log.err() will tell Twisted about the error-nature of the log message,
 225    causing Trial to flunk the test (with an "ERROR" indication that prints a
 226    copy of the Failure, including a traceback). Don't use log.err for events
 227    that are BAD but handled (like hash failures: since these are often
 228    deliberately provoked by test code, they should not cause test failures):
 229    use log.msg(level=BAD) for those instead.
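
The reason for preferring format= can be shown without Foolscap at all. In the
sketch below, msg() and render() are stand-ins written for this example (they
are not the Foolscap API), and the umid value is made up. The point is only
that a structured event keeps its fields, while a pre-formatted string does
not.

```python
def msg(message=None, format=None, **kwargs):
    """Stand-in for a log call: return the event dict that would be kept."""
    event = dict(kwargs)  # extra keyword arguments travel with the event
    if format is not None:
        event["format"] = format
    else:
        event["message"] = message
    return event


def render(event):
    """Produce the human-readable text for an event."""
    if "format" in event:
        # Interpolate the event's own fields into the format string.
        return event["format"] % event
    return event["message"]


n, k = 3, 5
structured = msg(format="got %(n)d shares, need %(k)d", n=n, k=k,
                 umid="abc123")  # hypothetical umid
flat = msg("got %d shares, need %d" % (n, k))
```

Both events render to the same text, but only the structured one still
carries n, k and umid as separate values that a later tool can read directly.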
 230
 231
 232 == Log Messages During Unit Tests ==
 233
 234 *** WARNING: setting the environment variables below may cause some tests to   ***
 235 *** fail spuriously. See ticket #923 for the status of a fix for this problem. ***
 236
 237 If a test is failing and you aren't sure why, start by enabling
 238 FLOGTOTWISTED=1 like this:
 239
 240  make test FLOGTOTWISTED=1
 241
 242 With FLOGTOTWISTED=1, sufficiently-important log events will be written into
 243 _trial_temp/test.log, which may give you more ideas about why the test is
 244 failing.
 245
 246
 247 If that isn't enough, look at the detailed foolscap logging messages instead,
 248 by running the tests like this:
 249
 250  make test FLOGFILE=flog.out.bz2 FLOGLEVEL=1 FLOGTOTWISTED=1
 251
 252 The first environment variable will cause foolscap log events to be written
 253 to ./flog.out.bz2 (instead of merely being recorded in the circular buffers
 254 for the use of remote subscribers or incident reports). The second will cause
 255 all log events to be written out, not just the higher-severity ones. The
 256 third will cause twisted log events (like the markers that indicate when each
 257 unit test is starting and stopping) to be copied into the flogfile, making it
 258 easier to correlate log events with unit tests.
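
If the tests are launched from a script rather than by hand, the same three
variables can be set in a copy of the environment. This is a minimal sketch:
the helper name and its parameters are hypothetical, and only the variable
names FLOGFILE, FLOGLEVEL and FLOGTOTWISTED come from the text above.

```python
import os


def flog_test_env(flogfile=None, all_levels=False, to_twisted=True,
                  base=None):
    """Return a copy of the environment with the flog variables set."""
    env = dict(os.environ if base is None else base)
    if to_twisted:
        env["FLOGTOTWISTED"] = "1"
    if flogfile is not None:
        # Write foolscap log events to this file.
        env["FLOGFILE"] = flogfile
    if all_levels:
        # Record all events, not just the higher-severity ones.
        env["FLOGLEVEL"] = "1"
    return env
```

The result could then be handed to a test runner, for instance with
subprocess.call(["make", "test"], env=env), assuming the build passes the
environment through to the tests.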
 259
 260 Enabling this form of logging appears to roughly double the runtime of the
 261 unit tests. The flog.out.bz2 file is approximately 2MB.
 262
 263 You can then use "flogtool dump" or "flogtool web-viewer" on the resulting
 264 flog.out.bz2 file.
 265
 266 ("flogtool tail" and the log-gatherer are not useful during unit tests, since
 267 there is no single Tub to which all the log messages are published).