docs/stats.rst

   1 ================
   2 Tahoe Statistics
   3 ================
   4
   5 1. `Overview`_
   6 2. `Statistics Categories`_
   7 3. `Running a Tahoe Stats-Gatherer Service`_
   8 4. `Using Munin To Graph Stats Values`_
   9
  10 Overview
  11 ========
  12
  13 Each Tahoe node collects and publishes statistics about its operations as it
  14 runs. These include counters of how many files have been uploaded and
  15 downloaded, CPU usage information, performance numbers like latency of
  16 storage server operations, and available disk space.
  17
  18 The easiest way to see the stats for any given node is to use the web interface.
  19 From the main "Welcome Page", follow the "Operational Statistics" link inside
  20 the small "This Client" box. If the welcome page lives at
  21 http://localhost:3456/, then the statistics page will live at
  22 http://localhost:3456/statistics . This presents a summary of the stats
  23 block, along with a copy of the raw counters. To obtain just the raw counters
  24 (in JSON format), use /statistics?t=json instead.
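
As a rough illustration of consuming the raw counters, the following sketch fetches the JSON form and splits it into its two top-level dictionaries. The helper names ``parse_stats`` and ``fetch_stats`` and the sample payload are hypothetical; only the URL and the 'counters' and 'stats' keys come from this document.

```python
import json
from urllib.request import urlopen


def parse_stats(raw_json):
    """Split a /statistics?t=json payload into (counters, stats) dicts."""
    data = json.loads(raw_json)
    return data.get("counters", {}), data.get("stats", {})


def fetch_stats(base_url="http://localhost:3456/"):
    """Fetch and parse the raw stats of a running node (needs a live node)."""
    with urlopen(base_url.rstrip("/") + "/statistics?t=json") as response:
        return parse_stats(response.read().decode("utf-8"))


# Hypothetical sample payload, for illustration only.
sample = '{"counters": {"uploader.files_uploaded": 3}, "stats": {"node.uptime": 120.5}}'
counters, stats = parse_stats(sample)
print(counters["uploader.files_uploaded"], stats["node.uptime"])
```

Keeping the parsing separate from the network call makes it possible to exercise the parsing without a running node.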
  25
  26 Statistics Categories
  27 =====================
  28
  29 The stats dictionary contains two keys: 'counters' and 'stats'. 'counters'
  30 are strictly counters: they are reset to zero when the node is started, and
  31 grow upwards. 'stats' are non-incrementing values, used to measure the
  32 current state of various systems. Some stats are actually booleans, expressed
  33 as '1' for true and '0' for false (internal restrictions require all stats
  34 values to be numbers).
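
Because counters restart from zero whenever the node restarts, a tool that compares two samples has to allow for the value dropping. The sketch below shows one possible way to handle that, together with decoding the numeric booleans. Both helper names are hypothetical, and treating any decrease as a restart is an assumption of this sketch.

```python
def counter_delta(previous, current):
    """Return how much a counter grew between two samples.

    A decrease is assumed to mean the node restarted, in which case the
    current value is all the growth we can account for.
    """
    if current < previous:
        return current
    return current - previous


def as_bool(stat_value):
    """Decode a boolean stat, which is published as the number 1 or 0."""
    return stat_value == 1


print(counter_delta(10, 25))  # normal growth
print(counter_delta(25, 4))   # node restarted between samples
```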
  35
  36 Under both the 'counters' and 'stats' dictionaries, each individual stat has
  37 a key with a dot-separated name, breaking them up into groups like
  38 'cpu_monitor' and 'storage_server'.
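
The dot-separated naming makes it easy to regroup a flat dictionary by its first name component. A minimal sketch, assuming the keys are flat strings such as 'storage_server.disk_avail' (``group_by_prefix`` is a hypothetical helper):

```python
from collections import defaultdict


def group_by_prefix(flat):
    """Group dot-separated stat names by their first component."""
    groups = defaultdict(dict)
    for name, value in flat.items():
        # partition() splits at the first dot only, so nested names such
        # as 'latencies.readv.mean' stay intact inside their group.
        group, _, rest = name.partition(".")
        groups[group][rest] = value
    return dict(groups)


flat = {
    "cpu_monitor.total": 1.5,
    "storage_server.disk_avail": 10,
    "storage_server.latencies.readv.mean": 0.01,
}
grouped = group_by_prefix(flat)
print(sorted(grouped))
```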
  39
  40 The currently available stats (as of release 1.6.0 or so) are described here:
  41
  42 **counters.storage_server.\***
  43
  44     this group counts inbound storage-server operations. They are not provided
  45     by client-only nodes which have been configured to not run a storage server
  46     (with [storage]enabled=false in tahoe.cfg)
  47
  48     allocate, write, close, abort
  49         these are for immutable file uploads. 'allocate' is incremented when a
  50         client asks if it can upload a share to the server. 'write' is
  51         incremented for each chunk of data written. 'close' is incremented when
  52         the share is finished. 'abort' is incremented if the client abandons
  53         the upload.
  54
  55     get, read
  56         these are for immutable file downloads. 'get' is incremented
  57         when a client asks if the server has a specific share. 'read' is
  58         incremented for each chunk of data read.
  59
  60     readv, writev
  61         these are for mutable file creation, publish, and retrieve. 'readv'
  62         is incremented each time a client reads part of a mutable share.
  63         'writev' is incremented each time a client sends a modification
  64         request.
  65
  66     add-lease, renew, cancel
  67         these are for share lease modifications. 'add-lease' is incremented
  68         when an 'add-lease' operation is performed (which either adds a new
  69         lease or renews an existing lease). 'renew' is for the 'renew-lease'
  70         operation (which can only be used to renew an existing one). 'cancel'
  71         is used for the 'cancel-lease' operation.
  72
  73     bytes_freed
  74         this counts how many bytes were freed when a 'cancel-lease'
  75         operation removed the last lease from a share and the share
  76         was thus deleted.
  77
  78     bytes_added
  79         this counts how many bytes were consumed by immutable share
  80         uploads. It is incremented at the same time as the 'close'
  81         counter.
  82
  83 **stats.storage_server.\***
  84
  85     allocated
  86         this counts how many bytes are currently 'allocated', which
  87         tracks the space that will eventually be consumed by immutable
  88         share upload operations. The stat is increased as soon as the
  89         upload begins (at the same time the 'allocate' counter is
  90         incremented), and goes back to zero when the 'close' or 'abort'
  91         message is received (at which point the 'disk_used' stat should
  92         be incremented by the same amount).
  93
  94     disk_total, disk_used, disk_free_for_root, disk_free_for_nonroot, disk_avail, reserved_space
  95         these all reflect disk-space usage policies and status.
  96         'disk_total' is the total size of disk where the storage
  97         server's BASEDIR/storage/shares directory lives, as reported
  98         by /bin/df or equivalent. 'disk_used', 'disk_free_for_root',
  99         and 'disk_free_for_nonroot' show related information.
 100         'reserved_space' reports the reservation configured by the
 101         tahoe.cfg [storage]reserved_space value. 'disk_avail'
 102         reports the remaining disk space available for the Tahoe
  103         server after subtracting reserved_space from
  104         disk_free_for_nonroot. All values are in bytes.
 105
 106     accepting_immutable_shares
 107         this is '1' if the storage server is currently accepting uploads of
 108         immutable shares. It may be '0' if a server is disabled by
 109         configuration, or if the disk is full (i.e. disk_avail is less than
 110         reserved_space).
 111
 112     total_bucket_count
 113         this counts the number of 'buckets' (i.e. unique
 114         storage-index values) currently managed by the storage
 115         server. It indicates roughly how many files are managed
 116         by the server.
 117
 118     latencies.*.*
 119         these stats keep track of local disk latencies for
 120         storage-server operations. A number of percentile values are
 121         tracked for many operations. For example,
 122         'storage_server.latencies.readv.50_0_percentile' records the
 123         median response time for a 'readv' request. All values are in
 124         seconds. These are recorded by the storage server, starting
 125         from the time the request arrives (post-deserialization) and
 126         ending when the response begins serialization. As such, they
 127         are mostly useful for measuring disk speeds. The operations
 128         tracked are the same as the counters.storage_server.* counter
 129         values (allocate, write, close, get, read, add-lease, renew,
 130         cancel, readv, writev). The percentile values tracked are:
 131         mean, 01_0_percentile, 10_0_percentile, 50_0_percentile,
 132         90_0_percentile, 95_0_percentile, 99_0_percentile,
 133         99_9_percentile. (the last value, 99.9 percentile, means that
 134         999 out of the last 1000 operations were faster than the
 135         given number, and is the same threshold used by Amazon's
 136         internal SLA, according to the Dynamo paper).
  137         A percentile is only reported when there are enough
  138         observations to interpret it unambiguously. For example,
  139         the 99.9th percentile can only be distinguished from the
  140         99th percentile (it lies 9 thousandths above it) once the
  141         sample holds at least 1000 observations, so the 99.9th
  142         percentile is only reported for samples of 1000 or more
  143         observations.
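
The relationship between the disk-space stats above can be sketched as follows. This is not the node's actual code: it assumes that 'disk_avail' is the non-root free space minus the reservation (never below zero) and that 'disk_used' is the total minus the space free for root. The helper names are hypothetical, and ``os.statvfs`` is only available on Unix-like systems.

```python
import os


def compute_disk_stats(disk_total, free_for_root, free_for_nonroot, reserved_space):
    """Derive the disk stats from raw byte counts (all values in bytes)."""
    return {
        "disk_total": disk_total,
        "disk_used": disk_total - free_for_root,  # assumption of this sketch
        "disk_free_for_root": free_for_root,
        "disk_free_for_nonroot": free_for_nonroot,
        "reserved_space": reserved_space,
        # Space usable for shares once the reservation is honoured.
        "disk_avail": max(free_for_nonroot - reserved_space, 0),
    }


def disk_stats_for(path, reserved_space=0):
    """Gather the raw numbers for the filesystem holding `path` (Unix only)."""
    s = os.statvfs(path)
    return compute_disk_stats(
        s.f_frsize * s.f_blocks,  # total size
        s.f_frsize * s.f_bfree,   # free blocks, including root-only ones
        s.f_frsize * s.f_bavail,  # free blocks for unprivileged users
        reserved_space,
    )


print(compute_disk_stats(1000, 400, 350, 100)["disk_avail"])
```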
 144
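The percentile reporting rule described above can be illustrated with a small sketch. Only the stat names and the 1000-observation minimum for the 99.9th percentile come from this document. The other minimum sample sizes, the nearest-rank indexing, and the name ``summarize_latencies`` are assumptions made for illustration.

```python
# (stat name, per-mille rank, minimum number of observations).
# Only the 1000-sample minimum for 99_9_percentile is taken from the text;
# the other minimums are illustrative guesses.
PERCENTILES = [
    ("01_0_percentile", 10, 100),
    ("10_0_percentile", 100, 10),
    ("50_0_percentile", 500, 1),
    ("90_0_percentile", 900, 10),
    ("95_0_percentile", 950, 20),
    ("99_0_percentile", 990, 100),
    ("99_9_percentile", 999, 1000),
]


def summarize_latencies(samples):
    """Return the mean plus each percentile, or None where data is too thin."""
    if not samples:
        return {}
    ordered = sorted(samples)
    count = len(ordered)
    summary = {"mean": sum(ordered) / count}
    for name, permille, minimum in PERCENTILES:
        if count >= minimum:
            # Integer arithmetic avoids floating-point rounding of the rank.
            index = min(count * permille // 1000, count - 1)
            summary[name] = ordered[index]
        else:
            summary[name] = None
    return summary


full = summarize_latencies([float(i) for i in range(1000)])
thin = summarize_latencies([float(i) for i in range(999)])
print(full["99_9_percentile"], thin["99_9_percentile"])
```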
 145
 146 **counters.uploader.files_uploaded**
 147
 148 **counters.uploader.bytes_uploaded**
 149
 150 **counters.downloader.files_downloaded**
 151
 152 **counters.downloader.bytes_downloaded**
 153
 154     These count client activity: a Tahoe client will increment these when it
 155     uploads or downloads an immutable file. 'files_uploaded' is incremented by
 156     one for each operation, while 'bytes_uploaded' is incremented by the size of
 157     the file.
 158
 159 **counters.mutable.files_published**
 160
 161 **counters.mutable.bytes_published**
 162
 163 **counters.mutable.files_retrieved**
 164
 165 **counters.mutable.bytes_retrieved**
 166
  167     These count client activity for mutable files. 'published' is the act of
  168     changing an existing mutable file (or creating a brand-new mutable file).
  169     'retrieved' is the act of reading its current contents.
 170
 171 **counters.chk_upload_helper.\***
 172
 173     These count activity of the "Helper", which receives ciphertext from clients
 174     and performs erasure-coding and share upload for files that are not already
 175     in the grid. The code which implements these counters is in
 176     src/allmydata/immutable/offloaded.py .
 177
 178     upload_requests
 179         incremented each time a client asks to upload a file
  180     upload_already_present
  181         incremented when the file is already in the grid
 182     upload_need_upload
 183         incremented when the file is not already in the grid
 184
 185     resumes
 186         incremented when the helper already has partial ciphertext for
 187         the requested upload, indicating that the client is resuming an
 188         earlier upload
 189
 190     fetched_bytes
 191         this counts how many bytes of ciphertext have been fetched
 192         from uploading clients
 193
 194     encoded_bytes
 195         this counts how many bytes of ciphertext have been
 196         encoded and turned into successfully-uploaded shares. If no
 197         uploads have failed or been abandoned, encoded_bytes should
 198         eventually equal fetched_bytes.
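
Since 'encoded_bytes' should eventually catch up with 'fetched_bytes', the difference between the two gives a rough view of work that is still in flight or was abandoned. A minimal sketch, assuming the counters appear under flat dotted names in the 'counters' dictionary (``helper_backlog`` is a hypothetical helper):

```python
def helper_backlog(counters):
    """Bytes fetched by the Helper that have not (yet) been encoded."""
    fetched = counters.get("chk_upload_helper.fetched_bytes", 0)
    encoded = counters.get("chk_upload_helper.encoded_bytes", 0)
    return fetched - encoded


sample_counters = {
    "chk_upload_helper.fetched_bytes": 500,
    "chk_upload_helper.encoded_bytes": 300,
}
print(helper_backlog(sample_counters))
```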
 199
 200 **stats.chk_upload_helper.\***
 201
 202     These also track Helper activity:
 203
 204     active_uploads
 205         how many files are currently being uploaded. 0 when idle.
 206
 207     incoming_count
 208         how many cache files are present in the incoming/ directory,
 209         which holds ciphertext files that are still being fetched
 210         from the client
 211
 212     incoming_size
 213         total size of cache files in the incoming/ directory
 214
 215     incoming_size_old
 216         total size of 'old' cache files (more than 48 hours)
 217
 218     encoding_count
 219         how many cache files are present in the encoding/ directory,
 220         which holds ciphertext files that are being encoded and
 221         uploaded
 222
 223     encoding_size
 224         total size of cache files in the encoding/ directory
 225
 226     encoding_size_old
 227         total size of 'old' cache files (more than 48 hours)
 228
 229 **stats.node.uptime**
 230     how many seconds since the node process was started
 231
 232 **stats.cpu_monitor.\***
 233
 234     1min_avg, 5min_avg, 15min_avg
 235         estimate of what percentage of system CPU time was consumed by the
 236         node process, over the given time interval. Expressed as a float, 0.0
 237         for 0%, 1.0 for 100%
 238
 239     total
 240         estimate of total number of CPU seconds consumed by node since
 241         the process was started. Ticket #472 indicates that .total may
 242         sometimes be negative due to wraparound of the kernel's counter.
 243
 244 **stats.load_monitor.\***
 245
 246     When enabled, the "load monitor" continually schedules a one-second
 247     callback, and measures how late the response is. This estimates system load
 248     (if the system is idle, the response should be on time). This is only
 249     enabled if a stats-gatherer is configured.
 250
 251     avg_load
 252         average "load" value (seconds late) over the last minute
 253
 254     max_load
 255         maximum "load" value over the last minute
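
The idea behind the load monitor can be sketched outside of a node as follows. This is an illustration only: it substitutes a blocking ``time.sleep`` for whatever timer the node really uses, and the function names are hypothetical.

```python
import time


def measure_lateness(interval=1.0, samples=5):
    """Repeatedly wait `interval` seconds and record how late each wakeup is."""
    lateness = []
    for _ in range(samples):
        due = time.monotonic() + interval
        time.sleep(interval)
        # On an idle system the wakeup is close to on time, so this is near 0.
        lateness.append(max(time.monotonic() - due, 0.0))
    return lateness


def summarize_load(lateness):
    """Reduce lateness samples (in seconds) to the two published stats."""
    return {
        "avg_load": sum(lateness) / len(lateness),
        "max_load": max(lateness),
    }
```

Calling ``summarize_load(measure_lateness(interval=1.0, samples=60))`` would approximate the one-minute window described above, at the cost of blocking for a minute.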
 256
 257
 258 Running a Tahoe Stats-Gatherer Service
 259 ======================================
 260
 261 The "stats-gatherer" is a simple daemon that periodically collects stats from
 262 several tahoe nodes. It could be useful, e.g., in a production environment,
 263 where you want to monitor dozens of storage servers from a central management
  264 host. It merely gathers statistics from many nodes into a single place: it
 265 does not do any actual analysis.
 266
 267 The stats gatherer listens on a network port using the same Foolscap_
 268 connection library that Tahoe clients use to connect to storage servers.
 269 Tahoe nodes can be configured to connect to the stats gatherer and publish
 270 their stats on a periodic basis. (In fact, what happens is that nodes connect
 271 to the gatherer and offer it a second FURL which points back to the node's
 272 "stats port", which the gatherer then uses to pull stats on a periodic basis.
 273 The initial connection is flipped to allow the nodes to live behind NAT
 274 boxes, as long as the stats-gatherer has a reachable IP address.)
 275
 276 .. _Foolscap: http://foolscap.lothar.com/trac
 277
 278 The stats-gatherer is created in the same fashion as regular tahoe client
 279 nodes and introducer nodes. Choose a base directory for the gatherer to live
 280 in (but do not create the directory). Then run:
 281
 282 ::
 283
 284    tahoe create-stats-gatherer $BASEDIR
 285
 286 and start it with "tahoe start $BASEDIR". Once running, the gatherer will
 287 write a FURL into $BASEDIR/stats_gatherer.furl .
 288
 289 To configure a Tahoe client/server node to contact the stats gatherer, copy
 290 this FURL into the node's tahoe.cfg file, in a section named "[client]",
 291 under a key named "stats_gatherer.furl", like so:
 292
 293 ::
 294
 295     [client]
 296     stats_gatherer.furl = pb://qbo4ktl667zmtiuou6lwbjryli2brv6t@192.168.0.8:49997/wxycb4kaexzskubjnauxeoptympyf45y
 297
 298 or simply copy the stats_gatherer.furl file into the node's base directory
 299 (next to the tahoe.cfg file): it will be interpreted in the same way.
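
To check which gatherer a node is configured for, the setting can be read back with Python's standard ``configparser``. This is a sketch that assumes tahoe.cfg parses as a plain INI file; ``read_gatherer_furl`` is a hypothetical helper, and the FURL is the example value shown above.

```python
from configparser import ConfigParser


def read_gatherer_furl(cfg_text):
    """Return the [client] stats_gatherer.furl value, or None if unset."""
    # interpolation=None keeps any '%' characters in values literal.
    parser = ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return parser.get("client", "stats_gatherer.furl", fallback=None)


sample_cfg = (
    "[client]\n"
    "stats_gatherer.furl = pb://qbo4ktl667zmtiuou6lwbjryli2brv6t"
    "@192.168.0.8:49997/wxycb4kaexzskubjnauxeoptympyf45y\n"
)
print(read_gatherer_furl(sample_cfg))
```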
 300
 301 The first time it is started, the gatherer will listen on a random unused TCP
 302 port, so it should not conflict with anything else that you have running on
 303 that host at that time. On subsequent runs, it will re-use the same port (to
 304 keep its FURL consistent). To explicitly control which port it uses, write
  305 the desired port number into a file named "portnum" (i.e. $BASEDIR/portnum),
 306 and the next time the gatherer is started, it will start listening on the
 307 given port. The portnum file is actually a "strports specification string",
 308 as described in `docs/configuration.rst <configuration.rst>`_.
 309
 310 Once running, the stats gatherer will create a standard python "pickle" file
 311 in $BASEDIR/stats.pickle . Once a minute, the gatherer will pull stats
 312 information from every connected node and write them into the pickle. The
 313 pickle will contain a dictionary, in which node identifiers (known as "tubid"
 314 strings) are the keys, and the values are a dict with 'timestamp',
  315 'nickname', and 'stats' keys. d[tubid]['stats'] will contain the stats
 316 dictionary as made available at http://localhost:3456/statistics?t=json . The
 317 pickle file will only contain the most recent update from each node.
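
A reader for this file might look like the sketch below. The function names and the example data are hypothetical, and the layout is assumed to be exactly as described above. A pickle written by an older Python version may need extra arguments to ``pickle.load``, and pickles should only be loaded from files you trust.

```python
import pickle


def load_gatherer_stats(path):
    """Load $BASEDIR/stats.pickle (only unpickle files you trust)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def latest_reports(gathered):
    """Yield (tubid, nickname, timestamp, stats) for each reporting node."""
    for tubid, entry in sorted(gathered.items()):
        yield tubid, entry["nickname"], entry["timestamp"], entry["stats"]


# Hypothetical data in the documented layout, round-tripped through pickle
# in memory so the example does not need a real gatherer.
example = {
    "tubid1": {
        "timestamp": 1300000000.0,
        "nickname": "storage-1",
        "stats": {"counters": {}, "stats": {"node.uptime": 60}},
    },
}
gathered = pickle.loads(pickle.dumps(example))
for report in latest_reports(gathered):
    print(report[0], report[1])
```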
 318
 319 Other tools can be built to examine these stats and render them into
 320 something useful. For example, a tool could sum the
  321 'storage_server.disk_avail' values from all servers to compute a
 322 total-disk-available number for the entire grid (however, the "disk watcher"
 323 daemon, in misc/operations_helpers/spacetime/, is better suited for this specific task).
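
The summing tool mentioned above could be sketched like this. It assumes each node's entry holds the JSON stats dictionary under 'stats', which in turn holds the flat 'storage_server.disk_avail' name under its own 'stats' key. ``total_disk_avail`` and the sample data are hypothetical.

```python
def total_disk_avail(gathered):
    """Sum storage_server.disk_avail (bytes) over all gathered nodes."""
    total = 0
    for entry in gathered.values():
        node_stats = entry["stats"].get("stats", {})
        # Client-only nodes publish no storage_server stats; count them as 0.
        total += node_stats.get("storage_server.disk_avail", 0)
    return total


sample_grid = {
    "tubid-a": {"stats": {"stats": {"storage_server.disk_avail": 100}}},
    "tubid-b": {"stats": {"stats": {"storage_server.disk_avail": 250}}},
    "tubid-c": {"stats": {"stats": {"node.uptime": 42}}},
}
print(total_disk_avail(sample_grid))
```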
 324
 325 Using Munin To Graph Stats Values
 326 =================================
 327
 328 The misc/munin/ directory contains various plugins to graph stats for Tahoe
 329 nodes. They are intended for use with the Munin_ system-management tool, which
 330 typically polls target systems every 5 minutes and produces a web page with
 331 graphs of various things over multiple time scales (last hour, last month,
 332 last year).
 333
 334 .. _Munin: http://munin-monitoring.org/
 335
 336 Most of the plugins are designed to pull stats from a single Tahoe node, and
  337 are configured with a URL such as http://localhost:3456/statistics?t=json . The
 338 "tahoe_stats" plugin is designed to read from the pickle file created by the
 339 stats-gatherer. Some plugins are to be used with the disk watcher, and a few
 340 (like tahoe_nodememory) are designed to watch the node processes directly
 341 (and must therefore run on the same host as the target node).
 342
 343 Please see the docstrings at the beginning of each plugin for details, and
 344 the "tahoe-conf" file for notes about configuration and installing these
 345 plugins into a Munin environment.
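
For orientation, the general shape of a Munin plugin is sketched below. It is not one of the plugins in misc/munin/. A Munin plugin prints graph metadata when called with the ``config`` argument and prints ``field.value`` lines otherwise. The graph title, field name, and the ``render`` helper are made up for this example, and the stats are passed in rather than fetched.

```python
def render(argv, stats):
    """Return the lines a minimal uptime plugin would print for Munin."""
    if len(argv) > 1 and argv[1] == "config":
        # Metadata describing the graph and its single data series.
        return [
            "graph_title Tahoe node uptime",
            "graph_vlabel seconds",
            "uptime.label uptime",
        ]
    # Normal invocation: report the current value of each field.
    return ["uptime.value %s" % stats["node.uptime"]]


sample_stats = {"node.uptime": 120.5}
for line in render(["tahoe_uptime"], sample_stats):
    print(line)
```

A real plugin would obtain ``stats`` from the node's /statistics?t=json URL or from the gatherer's pickle file before printing.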