From: Brian Warner <warner@lothar.com>
Date: Wed, 23 Dec 2009 05:24:00 +0000 (-0500)
Subject: Add docs/stats.py, explaining Tahoe stats, the gatherer, and the munin plugins.
X-Git-Tag: trac-4200~68
X-Git-Url: https://git.rkrishnan.org/pf/content/en/?a=commitdiff_plain;h=950b1d80bb444d0c1640fc3e902b2b1f50b28a90;p=tahoe-lafs%2Ftahoe-lafs.git

Add docs/stats.py, explaining Tahoe stats, the gatherer, and the munin plugins.
---

diff --git a/docs/stats.txt b/docs/stats.txt
new file mode 100644
index 00000000..6e2e7de2
--- /dev/null
+++ b/docs/stats.txt
@@ -0,0 +1,259 @@
+= Tahoe Statistics =
+
+Each Tahoe node collects and publishes statistics about its operations as it
+runs. These include counters of how many files have been uploaded and
+downloaded, CPU usage information, performance numbers like latency of
+storage server operations, and available disk space.
+
+The easiest way to see the stats for any given node is to use the web
+interface. From the main "Welcome Page", follow the "Operational Statistics"
+link inside the small "This Client" box. If the welcome page lives at
+http://localhost:3456/, then the statistics page will live at
+http://localhost:3456/statistics . This presents a summary of the stats
+block, along with a copy of the raw counters. To obtain just the raw counters
+(in JSON format), use /statistics?t=json instead.
+
+= Statistics Categories =
+
+The stats dictionary contains two keys: 'counters' and 'stats'. 'counters'
+are strictly counters: they are reset to zero when the node is started, and
+grow upwards. 'stats' are non-incrementing values, used to measure the
+current state of various systems. Some stats are actually booleans, expressed
+as '1' for true and '0' for false (internal restrictions require all stats
+values to be numbers).
+
+Under both the 'counters' and 'stats' dictionaries, each individual stat has
+a key with a dot-separated name, breaking them up into groups like
+'cpu_monitor' and 'storage_server'.
+
+The currently available stats (as of release 1.6.0 or so) are described here:
+
+counters.storage_server.*: this group counts inbound storage-server
+                           operations. They are not provided by client-only
+                           nodes which have been configured to not run a
+                           storage server (with [storage]enabled=false in
+                           tahoe.cfg)
+  allocate, write, close, abort: these are for immutable file uploads.
+                                 'allocate' is incremented when a client asks
+                                 if it can upload a share to the server.
+                                 'write' is incremented for each chunk of
+                                 data written. 'close' is incremented when
+                                 the share is finished. 'abort' is
+                                 incremented if the client abandons the
+                                 upload.
+  get, read: these are for immutable file downloads. 'get' is incremented
+             when a client asks if the server has a specific share. 'read' is
+             incremented for each chunk of data read.
+  readv, writev: these are for mutable file creation, publish, and
+                 retrieve. 'readv' is incremented each time a client reads
+                 part of a mutable share. 'writev' is incremented each time a
+                 client sends a modification request.
+  add-lease, renew, cancel: these are for share lease modifications.
+                            'add-lease' is incremented when an 'add-lease'
+                            operation is performed (which either adds a new
+                            lease or renews an existing lease). 'renew' is
+                            for the 'renew-lease' operation (which can only
+                            be used to renew an existing one). 'cancel' is
+                            used for the 'cancel-lease' operation.
+  bytes_freed: this counts how many bytes were freed when a 'cancel-lease'
+               operation removed the last lease from a share and the share
+               was thus deleted.
+  bytes_added: this counts how many bytes were consumed by immutable share
+               uploads. It is incremented at the same time as the 'close'
+               counter.
+
+stats.storage_server.*:
+ allocated: this counts how many bytes are currently 'allocated', which
+            tracks the space that will eventually be consumed by immutable
+            share upload operations. The stat is increased as soon as the
+            upload begins (at the same time the 'allocate' counter is
+            incremented), and goes back to zero when the 'close' or 'abort'
+            message is received (at which point the 'disk_used' stat should
+            be incremented by the same amount).
+ disk_total
+ disk_used
+ disk_free_for_root
+ disk_free_for_nonroot
+ disk_avail
+ reserved_space: these all reflect disk-space usage policies and status.
+                 'disk_total' is the total size of disk where the storage
+                 server's BASEDIR/storage/shares directory lives, as reported
+                 by /bin/df or equivalent. 'disk_used', 'disk_free_for_root',
+                 and 'disk_free_for_nonroot' show related information.
+                 'reserved_space' reports the reservation configured by the
+                 tahoe.cfg [storage]reserved_space value. 'disk_avail'
+                 reports the remaining disk space available for the Tahoe
+                 server after subtracting reserved_space from
+                 disk_free_for_nonroot. All values are in bytes.
+ accepting_immutable_shares: this is '1' if the storage server is currently
+                             accepting uploads of immutable shares. It may be
+                             '0' if a server is disabled by configuration, or
+                             if the disk is full (i.e. disk_avail is less
+                             than reserved_space).
+ total_bucket_count: this counts the number of 'buckets' (i.e. unique
+                     storage-index values) currently managed by the storage
+                     server. It indicates roughly how many files are managed
+                     by the server.
+ latencies.*.*: these stats keep track of local disk latencies for
+                storage-server operations. A number of percentile values are
+                tracked for many operations. For example,
+                'storage_server.latencies.readv.50_0_percentile' records the
+                median response time for a 'readv' request. All values are in
+                seconds. These are recorded by the storage server, starting
+                from the time the request arrives (post-deserialization) and
+                ending when the response begins serialization. As such, they
+                are mostly useful for measuring disk speeds. The operations
+                tracked are the same as the counters.storage_server.* counter
+                values (allocate, write, close, get, read, add-lease, renew,
+                cancel, readv, writev). The percentile values tracked are:
+                mean, 01_0_percentile, 10_0_percentile, 50_0_percentile,
+                90_0_percentile, 95_0_percentile, 99_0_percentile,
+                99_9_percentile. (the last value, 99.9 percentile, means that
+                999 out of the last 1000 operations were faster than the
+                given number, and is the same threshold used by Amazon's
+                internal SLA, according to the Dynamo paper).
+
+counters.uploader.files_uploaded
+counters.uploader.bytes_uploaded
+counters.downloader.files_downloaded
+counters.downloader.bytes_downloaded
+
+ These count client activity: a Tahoe client will increment these when it
+ uploads or downloads an immutable file. 'files_uploaded' is incremented by
+ one for each operation, while 'bytes_uploaded' is incremented by the size of
+ the file.
+
+counters.mutable.files_published
+counters.mutable.bytes_published
+counters.mutable.files_retrieved
+counters.mutable.bytes_retrieved
+
+ These count client activity for mutable files. 'published' is the act of
+ changing an existing mutable file (or creating a brand-new mutable file).
+ 'retrieved' is the act of reading its current contents.
+
+counters.chk_upload_helper.*
+
+ These count activity of the "Helper", which receives ciphertext from clients
+ and performs erasure-coding and share upload for files that are not already
+ in the grid. The code which implements these counters is in
+ src/allmydata/immutable/offloaded.py .
+
+  upload_requests: incremented each time a client asks to upload a file
+  upload_already_present: incremented when the file is already in the grid
+  upload_need_upload: incremented when the file is not already in the grid
+  resumes: incremented when the helper already has partial ciphertext for
+           the requested upload, indicating that the client is resuming an
+           earlier upload
+  fetched_bytes: this counts how many bytes of ciphertext have been fetched
+                 from uploading clients
+  encoded_bytes: this counts how many bytes of ciphertext have been
+                 encoded and turned into successfully-uploaded shares. If no
+                 uploads have failed or been abandoned, encoded_bytes should
+                 eventually equal fetched_bytes.
+
+stats.chk_upload_helper.*
+
+ These also track Helper activity:
+
+  active_uploads: how many files are currently being uploaded. 0 when idle.
+  incoming_count: how many cache files are present in the incoming/ directory,
+                  which holds ciphertext files that are still being fetched
+                  from the client
+  incoming_size: total size of cache files in the incoming/ directory
+  incoming_size_old: total size of 'old' cache files (more than 48 hours)
+  encoding_count: how many cache files are present in the encoding/ directory,
+                  which holds ciphertext files that are being encoded and
+                  uploaded
+  encoding_size: total size of cache files in the encoding/ directory
+  encoding_size_old: total size of 'old' cache files (more than 48 hours)
+
+stats.node.uptime: how many seconds since the node process was started
+
+stats.cpu_monitor.*:
+  .1min_avg, .5min_avg, .15min_avg: estimate of what percentage of system
+                                    CPU time was consumed by the node process,
+                                    over the given time interval. Expressed
+                                    as a float, 0.0 for 0%, 1.0 for 100%
+  .total: estimate of total number of CPU seconds consumed by node since
+          the process was started. Ticket #472 indicates that .total may
+          sometimes be negative due to wraparound of the kernel's counter.
+
+stats.load_monitor.*:
+ When enabled, the "load monitor" continually schedules a one-second
+ callback, and measures how late the response is. This estimates system load
+ (if the system is idle, the response should be on time). This is only
+ enabled if a stats-gatherer is configured.
+
+ .avg_load: average "load" value (seconds late) over the last minute
+ .max_load: maximum "load" value over the last minute
+
+
+= Running a Tahoe Stats-Gatherer Service =
+
+The "stats-gatherer" is a simple daemon that periodically collects stats from
+several tahoe nodes. It could be useful, e.g., in a production environment,
+where you want to monitor dozens of storage servers from a central management
+host.
+
+The stats gatherer listens on a network port using the same Foolscap
+connection library that Tahoe clients use to connect to storage servers.
+Tahoe nodes can be configured to connect to the stats gatherer and publish
+their stats on a periodic basis. (In fact, what happens is that nodes connect
+to the gatherer and offer it a second FURL which points back to the node's
+"stats port", which the gatherer then uses to pull stats on a periodic basis.
+The initial connection is flipped to allow the nodes to live behind NAT
+boxes, as long as the stats-gatherer has a reachable IP address.)
+
+The stats-gatherer is created in the same fashion as regular tahoe client
+nodes and introducer nodes. Choose a base directory for the gatherer to live
+in (but do not create the directory). Then run:
+
+ tahoe create-stats-gatherer $BASEDIR
+
+and start it with "tahoe start $BASEDIR". Once running, the gatherer will
+write a FURL into $BASEDIR/stats_gatherer.furl .
+
+To configure a Tahoe client/server node to contact the stats gatherer, copy
+this FURL into the node's tahoe.cfg file, in a section named "[client]",
+under a key named "stats_gatherer.furl", like so:
+
+ [client]
+ stats_gatherer.furl = pb://qbo4ktl667zmtiuou6lwbjryli2brv6t@192.168.0.8:49997/wxycb4kaexzskubjnauxeoptympyf45y
+
+or simply copy the stats_gatherer.furl file into the node's base directory
+(next to the tahoe.cfg file): it will be interpreted in the same way.
+
+Once running, the stats gatherer will create a standard python "pickle" file
+in $BASEDIR/stats.pickle . Once a minute, the gatherer will pull stats
+information from every connected node and write them into the pickle. The
+pickle will contain a dictionary, in which node identifiers (known as "tubid"
+strings) are the keys, and the values are a dict with 'timestamp',
+'nickname', and 'stats' keys. d[tubid]['stats'] will contain the stats
+dictionary as made available at http://localhost:3456/statistics?t=json . The
+pickle file will only contain the most recent update from each node.
+
+Other tools can be built to examine these stats and render them into
+something useful. For example, a tool could sum the
+'storage_server.disk_avail' values from all servers to compute a
+total-disk-available number for the entire grid (however, the "disk watcher"
+daemon, in misc/spacetime/, is better suited for this specific task).
+
+= Using Munin To Graph Stats Values =
+
+The misc/munin/ directory contains various plugins to graph stats for Tahoe
+nodes. They are intended for use with the Munin system-management tool, which
+typically polls target systems every 5 minutes and produces a web page with
+graphs of various things over multiple time scales (last hour, last month,
+last year).
+
+Most of the plugins are designed to pull stats from a single Tahoe node, and
+are configured with the http://localhost:3456/statistics?t=json URL. The
+"tahoe_stats" plugin is designed to read from the pickle file created by the
+stats-gatherer. Some are to be used with the disk watcher, and a few (like
+tahoe_nodememory) are designed to watch the node processes directly (and must
+therefore run on the same host as the target node).
+
+Please see the docstrings at the beginning of each plugin for details, and
+the "tahoe-conf" file for notes about configuration and installing these
+plugins into a Munin environment.
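
Note (not part of the patch above): the stats.txt text describes the
gatherer's pickle as a dictionary keyed by tubid, whose values hold
'timestamp', 'nickname', and 'stats', where 'stats' is the same
counters/stats dictionary served at /statistics?t=json. The following is a
minimal sketch of the "sum disk_avail across the grid" tool mentioned there,
assuming exactly that layout. The function names and the sample data are
hypothetical, and a pickle written by an older Python may need extra
arguments to pickle.load() that are not shown here.

```python
import pickle


def load_gatherer_pickle(path):
    """Load $BASEDIR/stats.pickle.

    Assumption: the file is a plain pickled dict as described in stats.txt.
    Only unpickle files you trust; pickle can execute arbitrary code.
    """
    with open(path, "rb") as f:
        return pickle.load(f)


def total_disk_avail(gathered):
    """Sum 'storage_server.disk_avail' (bytes) over all gathered nodes.

    Nodes that do not run a storage server publish no such stat, so they
    contribute zero.
    """
    total = 0
    for tubid, entry in gathered.items():
        # entry['stats'] mirrors /statistics?t=json: 'counters' and 'stats'
        node_stats = entry["stats"].get("stats", {})
        total += node_stats.get("storage_server.disk_avail", 0)
    return total


# Hypothetical sample shaped like the gatherer's dictionary.
sample = {
    "tubid-a": {
        "timestamp": 0.0,
        "nickname": "server-a",
        "stats": {"counters": {},
                  "stats": {"storage_server.disk_avail": 1000}},
    },
    "tubid-b": {
        "timestamp": 0.0,
        "nickname": "server-b",
        "stats": {"counters": {},
                  "stats": {"storage_server.disk_avail": 500}},
    },
    "tubid-c": {  # client-only node: no storage_server stats at all
        "timestamp": 0.0,
        "nickname": "client-c",
        "stats": {"counters": {}, "stats": {"node.uptime": 42}},
    },
}

print(total_disk_avail(sample))
```

Against a real gatherer, total_disk_avail(load_gatherer_pickle(path)) would
be the equivalent call. As the document itself points out, the disk watcher
in misc/spacetime/ is the better fit for this particular task; the sketch
only illustrates the shape of the data.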