From: Brian Warner Date: Thu, 4 Sep 2008 00:25:31 +0000 (-0700) Subject: docs/logging.txt: explain tahoe/foolscap logging. Addresses #239. X-Git-Url: https://git.rkrishnan.org/frontends/specifications/reliability?a=commitdiff_plain;h=af07b9cad6c4125514d2ff5ce5198705c95ff7a3;p=tahoe-lafs%2Ftahoe-lafs.git docs/logging.txt: explain tahoe/foolscap logging. Addresses #239. --- diff --git a/docs/logging.txt b/docs/logging.txt new file mode 100644 index 00000000..bea2f6da --- /dev/null +++ b/docs/logging.txt @@ -0,0 +1,231 @@ += Tahoe Logging = + +Tahoe uses the Foolscap logging mechanism (known as the "flog" subsystem) to +record information about what is happening inside the Tahoe node. This is +primarily for use by programmers and grid operators who want to find out what +went wrong. + +The foolscap logging system is documented here: + + http://foolscap.lothar.com/docs/logging.html + +The foolscap distribution includes a utility named "flogtool" (usually at +/usr/bin/flogtool) which is used to get access to many foolscap logging +features. + +== Realtime Logging == + +When you are working on Tahoe code, and want to see what the node is doing, +the easiest tool to use is "flogtool tail". This connects to the tahoe node +and subscribes to hear about all log events. These events are then displayed +to stdout, and optionally saved to a file. + +"flogtool tail" connects to the "logport", for which the FURL is stored in +BASEDIR/private/logport.furl . The following command will connect to this +port and start emitting log information: + + flogtool tail BASEDIR/private/logport.furl + +The "--save-to FILENAME" option will save all received events to a file, +where then can be examined later with "flogtool dump" or "flogtool +web-viewer". The --catch-up flag will ask the node to dump all stored events +before subscribing to new ones (without --catch-up, you will only hear about +events that occur after the tool has connected and subscribed). + +== Incidents == + +Foolscap keeps a short list of recent events in memory. When something goes +wrong, it writes all the history it has (and everything that gets logged in +the next few seconds) into a file called an "incident". These files go into +BASEDIR/logs/incidents/ , in a file named +"incident-TIMESTAMP-UNIQUE.flog.bz2". The default definition of "something +goes wrong" is the generation of a log event at the log.WEIRD level or +higher, but other criteria could be implemented. + +The typical "incident report" we've seen in a large Tahoe grid is about 40kB +compressed, representing about 1800 recent events. + +These "flogfiles" have a similar format to the files saved by "flogtool tail +--save-to". They are simply lists of log events, with a small header to +indicate which event triggered the incident. + +The "flogtool dump FLOGFILE" command will take one of these .flog.bz2 files +and print their contents to stdout, one line per event. The raw event +dictionaries can be dumped by using "flogtool dump --verbose FLOGFILE". + +The "flogtool web-viewer" command can be used to examine the flogfile in a +web browser. It runs a small HTTP server and emits the URL on stdout. This +view provides more structure than the output of "flogtool dump": the +parent/child relationships of log events is displayed in a nested format. +"flogtool web-viewer" is still fairly immature. + +== Working with flogfiles == + +The "flogtool filter" command can be used to take a large flogfile (perhaps +one created by the log-gatherer, see below) and copy a subset of events into +a second file. This smaller flogfile may be easier to work with than the +original. The arguments to "flogtool filter" specify filtering criteria: a +predicate that each event must match to be copied into the target file. +--before and --after are used to exclude events outside a given window of +time. --above will retain events above a certain severity level. --from +retains events send by a specific tubid. --strip-facility removes events that +were emitted with a given facility (like foolscap.negotiation or +tahoe.upload). + +== Gatherers == + +In a deployed Tahoe grid, it is useful to get log information automatically +transferred to a central log-gatherer host. This offloads the (admittedly +modest) storage requirements to a different host and provides access to +logfiles from multiple nodes (webapi/storage/helper) nodes in a single place. + +There are two kinds of gatherers. Both produce a FURL which needs to be +placed in the NODEDIR/log_gatherer.furl file (one FURL per line) of the nodes +that are to publish their logs to the gatherer. When the Tahoe node starts, +it will connect to the configured gatherers and offer its logport: the +gatherer will then use the logport to subscribe to hear about events. + +The gatherer will write to files in its working directory, which can then be +examined with tools like "flogtool dump" as described above. + +=== Incident Gatherer === + +The "incident gatherer" only collects Incidents: records of the log events +that occurred just before and slightly after some high-level "trigger event" +was recorded. Each incident is classified into a "category": a short string +that summarizes what sort of problem took place. These classification +functions are written after examining a new/unknown incident. The idea is to +recognize when the same problem is happening multiple times. + +A collection of classification functions that are useful for Tahoe nodes are +provided in misc/incident-gatherer/support_classifiers.py . There is roughly +one category for each log.WEIRD-or-higher level event in the Tahoe source +code. + +The incident gatherer is created with the "flogtool create-incident-gatherer +WORKDIR" command, and started with "tahoe start". The generated +"gatherer.tac" file should be modified to add classifier functions. + +The incident gatherer writes incident names (which are simply the relative +pathname of the incident-*.flog.bz2 file) into classified/CATEGORY. For +example, the classified/mutable-retrieve-uncoordinated-write-error file +contains a list of all incidents which were triggered by an uncoordinated +write that was detected during mutable file retrieval (caused when somebody +changed the contents of the mutable file in between the node's mapupdate step +and the retrieve step). The classified/unknown file contains a list of all +incidents that did not match any of the classification functions. + +At startup, the incident gatherer will automatically reclassify any incident +report which is not mentioned in any of the classified/* files. So the usual +workflow is to examine the incidents in classified/unknown, add a new +classification function, delete classified/unknown, then bound the gatherer +with "tahoe restart WORKDIR". The incidents which can be classified with the +new functions will be added to their own classified/FOO lists, and the +remaining ones will be put in classified/unknown, where the process can be +repeated until all events are classifiable. + +The incident gatherer is still fairly immature: future versions will have a +web interface and an RSS feed, so operations personnel can track problems in +the storage grid. + +In our experience, each Incident takes about two seconds to transfer from the +node which generated it to the gatherer. The gatherer will automatically +catch up to any incidents which occurred while it is offline. + +=== Log Gatherer === + +The "Log Gatherer" subscribes to hear about every single event published by +the connected nodes, regardless of severity. This server writes these log +events into a large flogfile that is rotated (closed, compressed, and +replaced with a new one) on a periodic basis. Each flogfile is named +according to the range of time it represents, with names like +"from-2008-08-26-132256--to-2008-08-26-162256.flog.bz2". The flogfiles +contain events from many different sources, making it easier to correlate +things that happened on multiple machines (such as comparing a client node +making a request with the storage servers that respond to that request). + +The Log Gatherer is created with the "flogtool create-gatherer WORKDIR" +command, and started with "tahoe start". The log_gatherer.furl it creates +then needs to be copied into the BASEDIR/log_gatherer.furl file of all nodes +which should be sending it log events. + +The "flogtool filter" command, described above, is useful to cut down the +potentially-large flogfiles into more a narrowly-focussed form. + +Busy nodes, particularly webapi nodes which are performing recursive +deep-size/deep-stats/deep-check operations, can produce a lot of log events. +To avoid overwhelming the node (and using an unbounded amount of memory for +the outbound TCP queue), publishing nodes will start dropping log events when +the outbound queue grows too large. When this occurs, there will be gaps +(non-sequential event numbers) in the log-gatherer's flogfiles. + +== Local twistd.log files == + +[TODO: not yet true, requires foolscap-0.3.1 and a change to allmydata.node] + +In addition to the foolscap-based event logs, certain high-level events will +be recorded directly in human-readable text form, in the +BASEDIR/logs/twistd.log file (and its rotated old versions: twistd.log.1, +twistd.log.2, etc). This form does not contain as much information as the +flogfiles available through the means described previously, but they are +immediately available to the curious developer, and are retained until the +twistd.log.NN files are explicitly deleted. + +Only events at the log.OPERATIONAL level or higher are bridged to twistd.log +(i.e. not the log.NOISY debugging events). In addition, foolscap internal +events (like connection negotiation messages) are not bridged to twistd.log . + +== Adding log messages == + +When adding new code, the Tahoe developer should add a reasonable number of +new log events. For details, please see the Foolscap logging documentation, +but a few notes are worth stating here: + + * use a facility prefix of "tahoe.", like "tahoe.mutable.publish" + + * assign each severe (log.WEIRD or higher) event a unique message + identifier, as the umid= argument to the log.msg() call. The + misc/make_umid script may be useful for this purpose. This will make it + easier to write a classification function for these messages. + + * use the parent= argument whenever the event is causally/temporally + clustered with its parent. For example, a download process that involves + three sequential hash fetches could announce the send and receipt of those + hash-fetch messages with a parent= argument that ties them to the overall + download process. However, each new webapi download request should be + unparented. + + * use the format= argument in preference to the message= argument. E.g. + use log.msg(format="got %(n)d shares, need %(k)d", n=n, k=k) instead of + log.msg("got %d shares, need %d" % (n,k)). This will allow later tools to + analyze the event without needing to scrape/reconstruct the structured + data out of the formatted string. + + * Pass extra information as extra keyword arguments, even if they aren't + included in the format= string. This information will be displayed in the + "flogtool dump --verbose" output, as well as being available to other + tools. The umid= argument should be passed this way. + + +== Log Messages During Unit Tests == + +To get detailed log message out of unit tests, run the tests like this: + + make test FLOGFILE=flog.out.bz2 FLOGLEVEL=1 FLOGTWISTED=1 + +The first environment variable will cause foolscap log events to be written +to ./flog.out.bz2 (instead of merely being recorded in the circular buffers +for the use of remote subscribers or incident reports). The second will cause +all log events to be written out, not just the higher-severity ones. The +third will cause twisted log events (like the markers that indicate when each +unit test is starting and stopping) to be copied into the flogfile, making it +easier to correlate log events with unit tests. + +Enabling this form of logging appears to roughly double the runtime of the +unit tests. The flog.out.bz2 file is approximately 2MB. + +You can then use "flogtool dump" or "flogtool web-viewer" on the resulting +flog.out file. + +("flogtool tail" and the log-gatherer are not useful during unit tests, since +there is no single Tub to which all the log messages are published).