.. -*- coding: utf-8-with-signature -*-

2. `Statistics Categories`_
3. `Running a Tahoe Stats-Gatherer Service`_
4. `Using Munin To Graph Stats Values`_

Each Tahoe node collects and publishes statistics about its operations as it
runs. These include counters of how many files have been uploaded and
downloaded, CPU usage information, performance numbers like latency of
storage server operations, and available disk space.

The easiest way to see the stats for any given node is to use the web
interface. From the main "Welcome Page", follow the "Operational Statistics"
link inside the small "This Client" box. If the welcome page lives at
http://localhost:3456/, then the statistics page will live at
http://localhost:3456/statistics . This presents a summary of the stats
block, along with a copy of the raw counters. To obtain just the raw counters
(in JSON format), use /statistics?t=json instead.

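
As a rough illustration of the JSON interface, the following Python sketch
fetches /statistics?t=json and splits the result into its 'counters' and
'stats' dictionaries. The function names and the sample payload are
hypothetical, and the default URL assumes a node whose web interface listens
on localhost:3456.

```python
import json
from urllib.request import urlopen


def parse_stats(raw_json):
    """Split a /statistics?t=json payload into (counters, stats)."""
    data = json.loads(raw_json)
    return data.get("counters", {}), data.get("stats", {})


def fetch_stats(base_url="http://localhost:3456/", timeout=10):
    """Fetch and parse the raw stats from a node's web interface."""
    url = base_url.rstrip("/") + "/statistics?t=json"
    with urlopen(url, timeout=timeout) as response:
        return parse_stats(response.read().decode("utf-8"))


# Hypothetical sample payload, so the sketch runs without a live node.
sample = (
    '{"counters": {"uploader.files_uploaded": 3},'
    ' "stats": {"node.uptime": 42.5}}'
)
counters, stats = parse_stats(sample)
print(counters["uploader.files_uploaded"], stats["node.uptime"])
```

Against a running node, ``fetch_stats()`` would return the same pair of
dictionaries with that node's live values.
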
Statistics Categories
=====================

The stats dictionary contains two keys: 'counters' and 'stats'. 'counters'
are strictly counters: they are reset to zero when the node is started, and
grow upwards. 'stats' are non-incrementing values, used to measure the
current state of various systems. Some stats are actually booleans, expressed
as '1' for true and '0' for false (internal restrictions require all stats
values to be numbers).

Under both the 'counters' and 'stats' dictionaries, each individual stat has
a key with a dot-separated name, breaking them up into groups like
'cpu_monitor' and 'storage_server'.

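
Because the names are dot-separated, a consumer can regroup a flat dictionary
by its first name component. The sketch below is one way to do this; the
helper name and the sample values are hypothetical.

```python
from collections import defaultdict


def group_by_prefix(flat):
    """Group dot-separated stat names by their first component."""
    grouped = defaultdict(dict)
    for name, value in flat.items():
        group, _, rest = name.partition(".")
        grouped[group][rest] = value
    return dict(grouped)


# Hypothetical values. Booleans arrive as 1 or 0, since all values are numbers.
stats = {
    "cpu_monitor.1min_avg": 0.02,
    "storage_server.disk_avail": 1000000,
    "storage_server.accepting_immutable_shares": 1,
}
grouped = group_by_prefix(stats)
accepting = bool(grouped["storage_server"]["accepting_immutable_shares"])
print(sorted(grouped), accepting)
```
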
The currently available stats (as of release 1.6.0 or so) are described here:

**counters.storage_server.\***

    this group counts inbound storage-server operations. They are not provided
    by client-only nodes which have been configured to not run a storage server
    (with [storage]enabled=false in tahoe.cfg)

    allocate, write, close, abort
        these are for immutable file uploads. 'allocate' is incremented when a
        client asks if it can upload a share to the server. 'write' is
        incremented for each chunk of data written. 'close' is incremented when
        the share is finished. 'abort' is incremented if the client abandons

    get, read
        these are for immutable file downloads. 'get' is incremented
        when a client asks if the server has a specific share. 'read' is
        incremented for each chunk of data read.

    readv, writev
        these are for mutable file creation, publish, and retrieve. 'readv'
        is incremented each time a client reads part of a mutable share.
        'writev' is incremented each time a client sends a modification

    add-lease, renew, cancel
        these are for share lease modifications. 'add-lease' is incremented
        when an 'add-lease' operation is performed (which either adds a new
        lease or renews an existing lease). 'renew' is for the 'renew-lease'
        operation (which can only be used to renew an existing one). 'cancel'
        is used for the 'cancel-lease' operation.

    this counts how many bytes were freed when a 'cancel-lease'
    operation removed the last lease from a share and the share

    this counts how many bytes were consumed by immutable share
    uploads. It is incremented at the same time as the 'close'

**stats.storage_server.\***

    allocated
        this counts how many bytes are currently 'allocated', which
        tracks the space that will eventually be consumed by immutable
        share upload operations. The stat is increased as soon as the
        upload begins (at the same time the 'allocate' counter is
        incremented), and goes back to zero when the 'close' or 'abort'
        message is received (at which point the 'disk_used' stat should
        be incremented by the same amount).

    disk_total, disk_used, disk_free_for_root, disk_free_for_nonroot, disk_avail, reserved_space
        these all reflect disk-space usage policies and status.
        'disk_total' is the total size of the disk where the storage
        server's BASEDIR/storage/shares directory lives, as reported
        by /bin/df or equivalent. 'disk_used', 'disk_free_for_root',
        and 'disk_free_for_nonroot' show related information.
        'reserved_space' reports the reservation configured by the
        tahoe.cfg [storage]reserved_space value. 'disk_avail'
        reports the remaining disk space available for the Tahoe
        server after subtracting reserved_space from
        disk_free_for_nonroot. All

    accepting_immutable_shares
        this is '1' if the storage server is currently accepting uploads of
        immutable shares. It may be '0' if a server is disabled by
        configuration, or if the disk is full (i.e. disk_avail is less than

    this counts the number of 'buckets' (i.e. unique
    storage-index values) currently managed by the storage
    server. It indicates roughly how many files are managed

    these stats keep track of local disk latencies for
    storage-server operations. A number of percentile values are
    tracked for many operations. For example,
    'storage_server.latencies.readv.50_0_percentile' records the
    median response time for a 'readv' request. All values are in
    seconds. These are recorded by the storage server, starting
    from the time the request arrives (post-deserialization) and
    ending when the response begins serialization. As such, they
    are mostly useful for measuring disk speeds. The operations
    tracked are the same as the counters.storage_server.* counter
    values (allocate, write, close, get, read, add-lease, renew,
    cancel, readv, writev). The percentile values tracked are:
    mean, 01_0_percentile, 10_0_percentile, 50_0_percentile,
    90_0_percentile, 95_0_percentile, 99_0_percentile,
    99_9_percentile. (the last value, 99.9 percentile, means that
    999 out of the last 1000 operations were faster than the
    given number, and is the same threshold used by Amazon's
    internal SLA, according to the Dynamo paper).
    Percentiles are only reported in the case of a sufficient
    number of observations for unambiguous interpretation. For
    example, the 99.9th percentile is (at the level of thousandths
    precision) 9 thousandths greater than the 99th
    percentile for sample sizes greater than or equal to 1000,
    thus the 99.9th percentile is only reported for samples of 1000
    or more observations.

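
The rule that a percentile is reported only once enough observations exist
can be sketched as follows. This is an illustration, not the node's actual
code. Only the 1000-sample minimum for the 99.9th percentile comes from the
text above. The other minimums, the function name, and the use of ``None`` to
mean "not reported" are assumptions.

```python
# (fraction, stat name, minimum sample count before the value is reported)
# Only the 1000-sample rule for 99_9_percentile is stated in the text;
# the remaining minimums are assumed for the sake of the example.
PERCENTILES = [
    (0.01, "01_0_percentile", 100),
    (0.10, "10_0_percentile", 10),
    (0.50, "50_0_percentile", 10),
    (0.90, "90_0_percentile", 10),
    (0.95, "95_0_percentile", 20),
    (0.99, "99_0_percentile", 100),
    (0.999, "99_9_percentile", 1000),
]


def summarize_latencies(samples):
    """Return the mean and percentile latencies (seconds) for one operation."""
    if not samples:
        return {}
    ordered = sorted(samples)
    count = len(ordered)
    summary = {"mean": sum(ordered) / count}
    for fraction, name, minimum in PERCENTILES:
        if count >= minimum:
            index = min(count - 1, int(fraction * count))
            summary[name] = ordered[index]
        else:
            summary[name] = None  # too few observations to be meaningful
    return summary


summary = summarize_latencies([float(i) for i in range(20)])
print(summary["50_0_percentile"], summary["99_9_percentile"])
```
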
**counters.uploader.files_uploaded**

**counters.uploader.bytes_uploaded**

**counters.downloader.files_downloaded**

**counters.downloader.bytes_downloaded**

    These count client activity: a Tahoe client will increment these when it
    uploads or downloads an immutable file. 'files_uploaded' is incremented by
    one for each operation, while 'bytes_uploaded' is incremented by the size of

**counters.mutable.files_published**

**counters.mutable.bytes_published**

**counters.mutable.files_retrieved**

**counters.mutable.bytes_retrieved**

    These count client activity for mutable files. 'published' is the act of
    changing an existing mutable file (or creating a brand-new mutable file).
    'retrieved' is the act of reading its current contents.

**counters.chk_upload_helper.\***

    These count activity of the "Helper", which receives ciphertext from clients
    and performs erasure-coding and share upload for files that are not already
    in the grid. The code which implements these counters is in
    src/allmydata/immutable/offloaded.py .

    incremented each time a client asks to upload a file

    upload_already_present
        incremented when the file is already in the grid

    incremented when the file is not already in the grid

    incremented when the helper already has partial ciphertext for
    the requested upload, indicating that the client is resuming an

    fetched_bytes
        this counts how many bytes of ciphertext have been fetched
        from uploading clients

    encoded_bytes
        this counts how many bytes of ciphertext have been
        encoded and turned into successfully-uploaded shares. If no
        uploads have failed or been abandoned, encoded_bytes should
        eventually equal fetched_bytes.

**stats.chk_upload_helper.\***

    These also track Helper activity:

    how many files are currently being uploaded. 0 when idle.

    how many cache files are present in the incoming/ directory,
    which holds ciphertext files that are still being fetched

    total size of cache files in the incoming/ directory

    total size of 'old' cache files (more than 48 hours)

    how many cache files are present in the encoding/ directory,
    which holds ciphertext files that are being encoded and

    total size of cache files in the encoding/ directory

    total size of 'old' cache files (more than 48 hours)

**stats.node.uptime**

    how many seconds since the node process was started

**stats.cpu_monitor.\***

    1min_avg, 5min_avg, 15min_avg
        estimate of what percentage of system CPU time was consumed by the
        node process, over the given time interval. Expressed as a float, 0.0

    total
        estimate of total number of CPU seconds consumed by the node since
        the process was started. Ticket #472 indicates that .total may
        sometimes be negative due to wraparound of the kernel's counter.

**stats.load_monitor.\***

    When enabled, the "load monitor" continually schedules a one-second
    callback, and measures how late the response is. This estimates system load
    (if the system is idle, the response should be on time). This is only
    enabled if a stats-gatherer is configured.

    average "load" value (seconds late) over the last minute

    maximum "load" value over the last minute

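
The idea behind the load monitor (schedule a wakeup, then measure how late it
arrives) can be sketched in a few lines. This is only an approximation under
stated assumptions: it uses a blocking ``time.sleep`` in place of a callback
scheduled in the node's event loop, and the function name, interval, and
number of rounds are hypothetical.

```python
import time


def measure_lateness(interval=1.0, rounds=5):
    """Sleep `interval` seconds repeatedly; record how late each wakeup is."""
    lateness = []
    for _ in range(rounds):
        start = time.monotonic()
        time.sleep(interval)
        late = (time.monotonic() - start) - interval
        lateness.append(max(late, 0.0))  # clamp timer jitter below zero
    return lateness


# A short interval keeps the demonstration quick; the text describes one second.
samples = measure_lateness(interval=0.01, rounds=5)
avg_load = sum(samples) / len(samples)
max_load = max(samples)
print("avg %.6f s late, max %.6f s late" % (avg_load, max_load))
```

On an idle system both numbers should stay close to zero, and they grow as
the host becomes busier.
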
Running a Tahoe Stats-Gatherer Service
======================================

The "stats-gatherer" is a simple daemon that periodically collects stats from
several tahoe nodes. It could be useful, e.g., in a production environment,
where you want to monitor dozens of storage servers from a central management
host. It merely gathers statistics from many nodes into a single place: it
does not do any actual analysis.

The stats gatherer listens on a network port using the same Foolscap_
connection library that Tahoe clients use to connect to storage servers.
Tahoe nodes can be configured to connect to the stats gatherer and publish
their stats on a periodic basis. (In fact, what happens is that nodes connect
to the gatherer and offer it a second FURL which points back to the node's
"stats port", which the gatherer then uses to pull stats on a periodic basis.
The initial connection is flipped to allow the nodes to live behind NAT
boxes, as long as the stats-gatherer has a reachable IP address.)

.. _Foolscap: http://foolscap.lothar.com/trac

The stats-gatherer is created in the same fashion as regular tahoe client
nodes and introducer nodes. Choose a base directory for the gatherer to live
in (but do not create the directory). Then run:

::

   tahoe create-stats-gatherer $BASEDIR

and start it with "tahoe start $BASEDIR". Once running, the gatherer will
write a FURL into $BASEDIR/stats_gatherer.furl .

To configure a Tahoe client/server node to contact the stats gatherer, copy
this FURL into the node's tahoe.cfg file, in a section named "[client]",
under a key named "stats_gatherer.furl", like so:

::

    [client]
    stats_gatherer.furl = pb://qbo4ktl667zmtiuou6lwbjryli2brv6t@192.168.0.8:49997/wxycb4kaexzskubjnauxeoptympyf45y

or simply copy the stats_gatherer.furl file into the node's base directory
(next to the tahoe.cfg file): it will be interpreted in the same way.

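
To check that a node's configuration carries the gatherer FURL, the setting
can be read back with Python's standard ``configparser`` module. This is a
sketch that assumes tahoe.cfg parses as an ordinary INI file; the helper name
is hypothetical, and the sample text reuses the example FURL shown above.

```python
import configparser


def gatherer_furl(cfg_text):
    """Return the [client] stats_gatherer.furl value, or None if unset."""
    # RawConfigParser avoids '%' interpolation, which a FURL does not need.
    parser = configparser.RawConfigParser()
    parser.read_string(cfg_text)
    return parser.get("client", "stats_gatherer.furl", fallback=None)


SAMPLE_CFG = """\
[client]
stats_gatherer.furl = pb://qbo4ktl667zmtiuou6lwbjryli2brv6t@192.168.0.8:49997/wxycb4kaexzskubjnauxeoptympyf45y
"""

furl = gatherer_furl(SAMPLE_CFG)
print(furl)
```
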
The first time it is started, the gatherer will listen on a random unused TCP
port, so it should not conflict with anything else that you have running on
that host at that time. On subsequent runs, it will re-use the same port (to
keep its FURL consistent). To explicitly control which port it uses, write
the desired port number into a file named "portnum" (i.e. $BASEDIR/portnum),
and the next time the gatherer is started, it will start listening on the
given port. The portnum file is actually a "strports specification string",
as described in configuration.rst_.

Once running, the stats gatherer will create a standard python "pickle" file
in $BASEDIR/stats.pickle . Once a minute, the gatherer will pull stats
information from every connected node and write it into the pickle. The
pickle will contain a dictionary, in which node identifiers (known as "tubid"
strings) are the keys, and the values are a dict with 'timestamp',
'nickname', and 'stats' keys. d[tubid]['stats'] will contain the stats
dictionary as made available at http://localhost:3456/statistics?t=json . The
pickle file will only contain the most recent update from each node.

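
A small reader for this file might look like the sketch below. It builds a
stand-in pickle with hypothetical values so that it runs without a gatherer,
and the function names are illustrative. It assumes the pickle can be loaded
by the interpreter running the script; a pickle written by a different Python
version may need extra decoding options. Unpickling can execute arbitrary
code, so only load a stats.pickle from a gatherer you trust.

```python
import os
import pickle
import tempfile


def load_gathered_stats(path):
    """Load the {tubid: {'timestamp', 'nickname', 'stats'}} dictionary."""
    with open(path, "rb") as f:
        return pickle.load(f)


def describe_nodes(gathered):
    """Return one summary line per node, sorted by tubid."""
    lines = []
    for tubid, entry in sorted(gathered.items()):
        node_stats = entry["stats"].get("stats", {})
        uptime = node_stats.get("node.uptime")
        lines.append("%s (%s): uptime=%s" % (entry["nickname"], tubid, uptime))
    return lines


# Hypothetical contents standing in for a real $BASEDIR/stats.pickle.
demo = {
    "tubid1": {
        "timestamp": 1262304000.0,
        "nickname": "node-a",
        "stats": {"counters": {}, "stats": {"node.uptime": 3600}},
    },
}
with tempfile.TemporaryDirectory() as basedir:
    path = os.path.join(basedir, "stats.pickle")
    with open(path, "wb") as f:
        pickle.dump(demo, f)
    report = describe_nodes(load_gathered_stats(path))
print(report)
```
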
Other tools can be built to examine these stats and render them into
something useful. For example, a tool could sum the
'storage_server.disk_avail' values from all servers to compute a
total-disk-available number for the entire grid (however, the "disk watcher"
daemon, in misc/operations_helpers/spacetime/, is better suited for this
specific task).

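
The summing tool mentioned above could be as small as the following sketch.
It takes the dictionary stored in stats.pickle as input; the function name
and sample numbers are hypothetical, and nodes that do not report
'storage_server.disk_avail' (for example client-only nodes) are counted as
zero by assumption.

```python
def total_disk_avail(gathered):
    """Sum storage_server.disk_avail over every node in the gathered dict."""
    total = 0
    for entry in gathered.values():
        node_stats = entry["stats"].get("stats", {})
        total += node_stats.get("storage_server.disk_avail", 0)
    return total


# Hypothetical gathered data: two storage servers and one client-only node.
gathered = {
    "tubid1": {"timestamp": 0, "nickname": "server-a",
               "stats": {"counters": {}, "stats": {"storage_server.disk_avail": 100}}},
    "tubid2": {"timestamp": 0, "nickname": "server-b",
               "stats": {"counters": {}, "stats": {"storage_server.disk_avail": 250}}},
    "tubid3": {"timestamp": 0, "nickname": "client-only",
               "stats": {"counters": {}, "stats": {}}},
}
grid_total = total_disk_avail(gathered)
print(grid_total)
```
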
.. _configuration.rst: configuration.rst

Using Munin To Graph Stats Values
=================================

The misc/munin/ directory contains various plugins to graph stats for Tahoe
nodes. They are intended for use with the Munin_ system-management tool, which
typically polls target systems every 5 minutes and produces a web page with
graphs of various things over multiple time scales (last hour, last month,

Most of the plugins are designed to pull stats from a single Tahoe node, and
are configured with a URL such as http://localhost:3456/statistics?t=json .
The "tahoe_stats" plugin is designed to read from the pickle file created by
the stats-gatherer. Some plugins are to be used with the disk watcher, and a
few (like tahoe_nodememory) are designed to watch the node processes directly
(and must therefore run on the same host as the target node).

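
For orientation, a Munin plugin is a program that prints graph configuration
when called with the argument "config" and prints current values otherwise.
The sketch below shows that shape for a single counter. It is not one of the
plugins shipped in misc/munin/: the function names, field name, graph labels,
and sample payload are all hypothetical, and a real plugin would fetch the
JSON from the node instead of taking it as a parameter.

```python
import json


def munin_config():
    """Graph description printed when Munin calls the plugin with 'config'."""
    return "\n".join([
        "graph_title Tahoe files uploaded",
        "graph_vlabel files",
        "graph_category tahoe",
        "uploaded.label files uploaded",
        "uploaded.type DERIVE",  # graph the rate of change of the counter
        "uploaded.min 0",
    ])


def munin_values(raw_json):
    """Current value line, read from a /statistics?t=json payload."""
    counters = json.loads(raw_json).get("counters", {})
    return "uploaded.value %d" % counters.get("uploader.files_uploaded", 0)


def run_plugin(argv, raw_json):
    """Dispatch on the command line the way a Munin plugin does."""
    if len(argv) > 1 and argv[1] == "config":
        return munin_config()
    return munin_values(raw_json)


# Hypothetical payload standing in for a node's /statistics?t=json output.
sample = '{"counters": {"uploader.files_uploaded": 7}, "stats": {}}'
print(run_plugin(["tahoe_files_uploaded"], sample))
```
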
Please see the docstrings at the beginning of each plugin for details, and
the "tahoe-conf" file for notes about configuration and installing these
plugins into a Munin environment.

.. _Munin: http://munin-monitoring.org/