From: Brian Warner Date: Wed, 23 Dec 2009 05:24:00 +0000 (-0500) Subject: Add docs/stats.py, explaining Tahoe stats, the gatherer, and the munin plugins. X-Git-Tag: trac-4200~68 X-Git-Url: https://git.rkrishnan.org/junkers?a=commitdiff_plain;h=950b1d80bb444d0c1640fc3e902b2b1f50b28a90;p=tahoe-lafs%2Ftahoe-lafs.git Add docs/stats.py, explaining Tahoe stats, the gatherer, and the munin plugins. --- diff --git a/docs/stats.txt b/docs/stats.txt new file mode 100644 index 00000000..6e2e7de2 --- /dev/null +++ b/docs/stats.txt @@ -0,0 +1,259 @@ += Tahoe Statistics = + +Each Tahoe node collects and publishes statistics about its operations as it +runs. These include counters of how many files have been uploaded and +downloaded, CPU usage information, performance numbers like latency of +storage server operations, and available disk space. + +The easiest way to see the stats for any given node is use the web interface. +From the main "Welcome Page", follow the "Operational Statistics" link inside +the small "This Client" box. If the welcome page lives at +http://localhost:3456/, then the statistics page will live at +http://localhost:3456/statistics . This presents a summary of the stats +block, along with a copy of the raw counters. To obtain just the raw counters +(in JSON format), use /statistics?t=json instead. + += Statistics Categories = + +The stats dictionary contains two keys: 'counters' and 'stats'. 'counters' +are strictly counters: they are reset to zero when the node is started, and +grow upwards. 'stats' are non-incrementing values, used to measure the +current state of various systems. Some stats are actually booleans, expressed +as '1' for true and '0' for false (internal restrictions require all stats +values to be numbers). + +Under both the 'counters' and 'stats' dictionaries, each individual stat has +a key with a dot-separated name, breaking them up into groups like +'cpu_monitor' and 'storage_server'. + +The currently available stats (as of release 1.6.0 or so) are described here: + +counters.storage_server.*: this group counts inbound storage-server + operations. They are not provided by client-only + nodes which have been configured to not run a + storage server (with [storage]enabled=false in + tahoe.cfg) + allocate, write, close, abort: these are for immutable file uploads. + 'allocate' is incremented when a client asks + if it can upload a share to the server. + 'write' is incremented for each chunk of + data written. 'close' is incremented when + the share is finished. 'abort' is + incremented if the client abandons the + uploaed. + get, read: these are for immutable file downloads. 'get' is incremented + when a client asks if the server has a specific share. 'read' is + incremented for each chunk of data read. + readv, writev: these are for immutable file creation, publish, and + retrieve. 'readv' is incremented each time a client reads + part of a mutable share. 'writev' is incremented each time a + client sends a modification request. + add-lease, renew, cancel: these are for share lease modifications. + 'add-lease' is incremented when an 'add-lease' + operation is performed (which either adds a new + lease or renews an existing lease). 'renew' is + for the 'renew-lease' operation (which can only + be used to renew an existing one). 'cancel' is + used for the 'cancel-lease' operation. + bytes_freed: this counts how many bytes were freed when a 'cancel-lease' + operation removed the last lease from a share and the share + was thus deleted. + bytes_added: this counts how many bytes were consumed by immutable share + uploads. It is incremented at the same time as the 'close' + counter. + +stats.storage_server.*: + allocated: this counts how many bytes are currently 'allocated', which + tracks the space that will eventually be consumed by immutable + share upload operations. The stat is increased as soon as the + upload begins (at the same time the 'allocated' counter is + incremented), and goes back to zero when the 'close' or 'abort' + message is received (at which point the 'disk_used' stat should + incremented by the same amount). + disk_total + disk_used + disk_free_for_root + disk_free_for_nonroot + disk_avail + reserved_space: these all reflect disk-space usage policies and status. + 'disk_total' is the total size of disk where the storage + server's BASEDIR/storage/shares directory lives, as reported + by /bin/df or equivalent. 'disk_used', 'disk_free_for_root', + and 'disk_free_for_nonroot' show related information. + 'reserved_space' reports the reservation configured by the + tahoe.cfg [storage]reserved_space value. 'disk_avail' + reports the remaining disk space available for the Tahoe + server after subtracting reserved_space from disk_avail. All + values are in bytes. + accepting_immutable_shares: this is '1' if the storage server is currently + accepting uploads of immutable shares. It may be + '0' if a server is disabled by configuration, or + if the disk is full (i.e. disk_avail is less + than reserved_space). + total_bucket_count: this counts the number of 'buckets' (i.e. unique + storage-index values) currently managed by the storage + server. It indicates roughly how many files are managed + by the server. + latencies.*.*: these stats keep track of local disk latencies for + storage-server operations. A number of percentile values are + tracked for many operations. For example, + 'storage_server.latencies.readv.50_0_percentile' records the + median response time for a 'readv' request. All values are in + seconds. These are recorded by the storage server, starting + from the time the request arrives (post-deserialization) and + ending when the response begins serialization. As such, they + are mostly useful for measuring disk speeds. The operations + tracked are the same as the counters.storage_server.* counter + values (allocate, write, close, get, read, add-lease, renew, + cancel, readv, writev). The percentile values tracked are: + mean, 01_0_percentile, 10_0_percentile, 50_0_percentile, + 90_0_percentile, 95_0_percentile, 99_0_percentile, + 99_9_percentile. (the last value, 99.9 percentile, means that + 999 out of the last 1000 operations were faster than the + given number, and is the same threshold used by Amazon's + internal SLA, according to the Dynamo paper). + +counters.uploader.files_uploaded +counters.uploader.bytes_uploaded +counters.downloader.files_downloaded +counters.downloader.bytes_downloaded + + These count client activity: a Tahoe client will increment these when it + uploads or downloads an immutable file. 'files_uploaded' is incremented by + one for each operation, while 'bytes_uploaded' is incremented by the size of + the file. + +counters.mutable.files_published +counters.mutable.bytes_published +counters.mutable.files_retrieved +counters.mutable.bytes_retrieved + + These count client activity for mutable files. 'published' is the act of + changing an existing mutable file (or creating a brand-new mutable file). + 'retrieved' is the act of reading its current contents. + +counters.chk_upload_helper.* + + These count activity of the "Helper", which receives ciphertext from clients + and performs erasure-coding and share upload for files that are not already + in the grid. The code which implements these counters is in + src/allmydata/immutable/offloaded.py . + + upload_requests: incremented each time a client asks to upload a file + upload_already_present: incremented when the file is already in the grid + upload_need_upload: incremented when the file is not already in the grid + resumes: incremented when the helper already has partial ciphertext for + the requested upload, indicating that the client is resuming an + earlier upload + fetched_bytes: this counts how many bytes of ciphertext have been fetched + from uploading clients + encoded_bytes: this counts how many bytes of ciphertext have been + encoded and turned into successfully-uploaded shares. If no + uploads have failed or been abandoned, encoded_bytes should + eventually equal fetched_bytes. + +stats.chk_upload_helper.* + + These also track Helper activity: + + active_uploads: how many files are currently being uploaded. 0 when idle. + incoming_count: how many cache files are present in the incoming/ directory, + which holds ciphertext files that are still being fetched + from the client + incoming_size: total size of cache files in the incoming/ directory + incoming_size_old: total size of 'old' cache files (more than 48 hours) + encoding_count: how many cache files are present in the encoding/ directory, + which holds ciphertext files that are being encoded and + uploaded + encoding_size: total size of cache files in the encoding/ directory + encoding_size_old: total size of 'old' cache files (more than 48 hours) + +stats.node.uptime: how many seconds since the node process was started + +stats.cpu_monitor.*: + .1min_avg, 5min_avg, 15min_avg: estimate of what percentage of system CPU + time was consumed by the node process, over + the given time interval. Expressed as a + float, 0.0 for 0%, 1.0 for 100% + .total: estimate of total number of CPU seconds consumed by node since + the process was started. Ticket #472 indicates that .total may + sometimes be negative due to wraparound of the kernel's counter. + +stats.load_monitor.*: + When enabled, the "load monitor" continually schedules a one-second + callback, and measures how late the response is. This estimates system load + (if the system is idle, the response should be on time). This is only + enabled if a stats-gatherer is configured. + + .avg_load: average "load" value (seconds late) over the last minute + .max_load: maximum "load" value over the last minute + + += Running a Tahoe Stats-Gatherer Service = + +The "stats-gatherer" is a simple daemon that periodically collects stats from +several tahoe nodes. It could be useful, e.g., in a production environment, +where you want to monitor dozens of storage servers from a central management +host. + +The stats gatherer listens on a network port using the same Foolscap +connection library that Tahoe clients use to connect to storage servers. +Tahoe nodes can be configured to connect to the stats gatherer and publish +their stats on a periodic basis. (in fact, what happens is that nodes connect +to the gatherer and offer it a second FURL which points back to the node's +"stats port", which the gatherer then uses to pull stats on a periodic basis. +The initial connection is flipped to allow the nodes to live behind NAT +boxes, as long as the stats-gatherer has a reachable IP address) + +The stats-gatherer is created in the same fashion as regular tahoe client +nodes and introducer nodes. Choose a base directory for the gatherer to live +in (but do not create the directory). Then run: + + tahoe create-stats-gatherer $BASEDIR + +and start it with "tahoe start $BASEDIR". Once running, the gatherer will +write a FURL into $BASEDIR/stats_gatherer.furl . + +To configure a Tahoe client/server node to contact the stats gatherer, copy +this FURL into the node's tahoe.cfg file, in a section named "[client]", +under a key named "stats_gatherer.furl", like so: + + [client] + stats_gatherer.furl = pb://qbo4ktl667zmtiuou6lwbjryli2brv6t@192.168.0.8:49997/wxycb4kaexzskubjnauxeoptympyf45y + +or simply copy the stats_gatherer.furl file into the node's base directory +(next to the tahoe.cfg file): it will be interpreted in the same way. + +Once running, the stats gatherer will create a standard python "pickle" file +in $BASEDIR/stats.pickle . Once a minute, the gatherer will pull stats +information from every connected node and write them into the pickle. The +pickle will contain a dictionary, in which node identifiers (known as "tubid" +strings) are the keys, and the values are a dict with 'timestamp', +'nickname', and 'stats' keys. d[tubid][stats] will contain the stats +dictionary as made available at http://localhost:3456/statistics?t=json . The +pickle file will only contain the most recent update from each node. + +Other tools can be built to examine these stats and render them into +something useful. For example, a tool could sum the +"storage_server.disk_avail' values from all servers to compute a +total-disk-available number for the entire grid (however, the "disk watcher" +daemon, in misc/spacetime/, is better suited for this specific task). + += Using Munin To Graph Stats Values = + +The misc/munin/ directory contains various plugins to graph stats for Tahoe +nodes. They are intended for use with the Munin system-management tool, which +typically polls target systems every 5 minutes and produces a web page with +graphs of various things over multiple time scales (last hour, last month, +last year). + +Most of the plugins are designed to pull stats from a single Tahoe node, and +are configured with the http://localhost:3456/statistics?t=json URL. The +"tahoe_stats" plugin is designed to read from the pickle file created by the +stats-gatherer. Some are to be used with the disk watcher, and a few (like +tahoe_nodememory) are designed to watch the node processes directly (and must +therefore run on the same host as the target node). + +Please see the docstrings at the beginning of each plugin for details, and +the "tahoe-conf" file for notes about configuration and installing these +plugins into a Munin environment.