docs/stats.rst

.. -*- coding: utf-8-with-signature -*-

================
Tahoe Statistics
================

1. `Overview`_
2. `Statistics Categories`_
3. `Running a Tahoe Stats-Gatherer Service`_
4. `Using Munin To Graph Stats Values`_

Overview
========

Each Tahoe node collects and publishes statistics about its operations as it
runs. These include counters of how many files have been uploaded and
downloaded, CPU usage information, performance numbers like latency of
storage server operations, and available disk space.

The easiest way to see the stats for any given node is to use the web
interface. From the main "Welcome Page", follow the "Operational Statistics"
link inside the small "This Client" box. If the welcome page lives at
http://localhost:3456/, then the statistics page will live at
http://localhost:3456/statistics . This presents a summary of the stats
block, along with a copy of the raw counters. To obtain just the raw counters
(in JSON format), use /statistics?t=json instead.
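
As a rough illustration of the JSON interface, the following sketch fetches
/statistics?t=json and splits the result into its two top-level dictionaries.
The helper names 'parse_stats' and 'fetch_stats' are hypothetical, and the
default URL is only the example address used above; a real node may listen
elsewhere.

.. code-block:: python

    import json
    from urllib.request import urlopen


    def parse_stats(raw):
        """Decode the JSON text and return (counters, stats) dictionaries."""
        data = json.loads(raw)
        # Treating a missing key as an empty dict is an assumption of this sketch.
        return data.get("counters", {}), data.get("stats", {})


    def fetch_stats(base_url="http://localhost:3456/"):
        """Fetch and decode the raw statistics from a node's web interface."""
        with urlopen(base_url.rstrip("/") + "/statistics?t=json") as response:
            return parse_stats(response.read().decode("utf-8"))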

Statistics Categories
=====================

The stats dictionary contains two keys: 'counters' and 'stats'. 'counters'
are strictly counters: they are reset to zero when the node is started, and
grow upwards. 'stats' are non-incrementing values, used to measure the
current state of various systems. Some stats are actually booleans, expressed
as '1' for true and '0' for false (internal restrictions require all stats
values to be numbers).
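
Because counters only grow until the node restarts, a monitoring tool can
turn two successive samples into a rate. The sketch below is one possible way
to do that; the function name and the choice to treat a decreasing value as a
restart are assumptions, not part of Tahoe.

.. code-block:: python

    def counter_rate(previous, current, elapsed_seconds):
        """Return the per-second rate between two samples of one counter."""
        if elapsed_seconds <= 0:
            raise ValueError("elapsed_seconds must be positive")
        if current < previous:
            # The counter went backwards, so the node was presumably restarted
            # and the counter began again from zero.
            delta = current
        else:
            delta = current - previous
        return delta / elapsed_seconds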

Under both the 'counters' and 'stats' dictionaries, each individual stat has
a key with a dot-separated name, breaking them up into groups like
'cpu_monitor' and 'storage_server'.
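
To make the naming scheme concrete, this small sketch regroups a flat
dictionary by the first component of each dotted name. The function name is
hypothetical and the sample keys in the usage note are only examples.

.. code-block:: python

    from collections import defaultdict


    def group_by_prefix(flat):
        """Group a flat {dotted_name: value} dict by the first name component."""
        groups = defaultdict(dict)
        for name, value in flat.items():
            # "storage_server.allocated" -> group "storage_server", rest "allocated"
            group, _, rest = name.partition(".")
            groups[group][rest] = value
        return dict(groups)

Given the keys 'storage_server.allocated' and 'cpu_monitor.total', this
returns one inner dictionary per group, keyed by the remainder of each name.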

The currently available stats (as of release 1.6.0 or so) are described here:

**counters.storage_server.\***

    This group counts inbound storage-server operations. They are not provided
    by client-only nodes which have been configured to not run a storage server
    (with [storage]enabled=false in tahoe.cfg).

    allocate, write, close, abort
        these are for immutable file uploads. 'allocate' is incremented when a
        client asks if it can upload a share to the server. 'write' is
        incremented for each chunk of data written. 'close' is incremented when
        the share is finished. 'abort' is incremented if the client abandons
        the upload.

    get, read
        these are for immutable file downloads. 'get' is incremented
        when a client asks if the server has a specific share. 'read' is
        incremented for each chunk of data read.

    readv, writev
        these are for mutable file creation, publish, and retrieve. 'readv'
        is incremented each time a client reads part of a mutable share.
        'writev' is incremented each time a client sends a modification
        request.

    add-lease, renew, cancel
        these are for share lease modifications. 'add-lease' is incremented
        when an 'add-lease' operation is performed (which either adds a new
        lease or renews an existing lease). 'renew' is for the 'renew-lease'
        operation (which can only be used to renew an existing one). 'cancel'
        is used for the 'cancel-lease' operation.

    bytes_freed
        this counts how many bytes were freed when a 'cancel-lease'
        operation removed the last lease from a share and the share
        was thus deleted.

    bytes_added
        this counts how many bytes were consumed by immutable share
        uploads. It is incremented at the same time as the 'close'
        counter.

**stats.storage_server.\***

    allocated
        this counts how many bytes are currently 'allocated', which
        tracks the space that will eventually be consumed by immutable
        share upload operations. The stat is increased as soon as the
        upload begins (at the same time the 'allocate' counter is
        incremented), and goes back to zero when the 'close' or 'abort'
        message is received (at which point the 'disk_used' stat should
        be incremented by the same amount).

    disk_total, disk_used, disk_free_for_root, disk_free_for_nonroot, disk_avail, reserved_space
        these all reflect disk-space usage policies and status.
        'disk_total' is the total size of the disk where the storage
        server's BASEDIR/storage/shares directory lives, as reported
        by /bin/df or equivalent. 'disk_used', 'disk_free_for_root',
        and 'disk_free_for_nonroot' show related information.
        'reserved_space' reports the reservation configured by the
        tahoe.cfg [storage]reserved_space value. 'disk_avail'
        reports the remaining disk space available for the Tahoe
        server after subtracting reserved_space from
        disk_free_for_nonroot. All values are in bytes.

    accepting_immutable_shares
        this is '1' if the storage server is currently accepting uploads of
        immutable shares. It may be '0' if a server is disabled by
        configuration, or if the disk is full (i.e. no disk_avail remains
        once reserved_space has been set aside).

    total_bucket_count
        this counts the number of 'buckets' (i.e. unique
        storage-index values) currently managed by the storage
        server. It indicates roughly how many files are managed
        by the server.

    latencies.*.*
        these stats keep track of local disk latencies for
        storage-server operations. A number of percentile values are
        tracked for many operations. For example,
        'storage_server.latencies.readv.50_0_percentile' records the
        median response time for a 'readv' request. All values are in
        seconds. These are recorded by the storage server, starting
        from the time the request arrives (post-deserialization) and
        ending when the response begins serialization. As such, they
        are mostly useful for measuring disk speeds. The operations
        tracked are the same as the counters.storage_server.* counter
        values (allocate, write, close, get, read, add-lease, renew,
        cancel, readv, writev). The values tracked are: mean,
        01_0_percentile, 10_0_percentile, 50_0_percentile,
        90_0_percentile, 95_0_percentile, 99_0_percentile, and
        99_9_percentile. (The last value, 99.9 percentile, means that
        999 out of the last 1000 operations were faster than the
        given number, and is the same threshold used by Amazon's
        internal SLA, according to the Dynamo paper.)
        Percentiles are only reported in the case of a sufficient
        number of observations for unambiguous interpretation. For
        example, the 99.9th percentile is (at the level of thousandths
        precision) 9 thousandths greater than the 99th
        percentile for sample sizes greater than or equal to 1000,
        thus the 99.9th percentile is only reported for samples of 1000
        or more observations.
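
The reporting rule for latency percentiles can be sketched as follows. This
is not the storage server's code: the function name is made up, the index
arithmetic is one simple convention among several, and every minimum sample
count except the 1000 required for the 99.9th percentile is an assumption
chosen for illustration.

.. code-block:: python

    # (fraction, stat name, minimum number of samples before it is reported)
    # Only the 1000-sample minimum for 99_9_percentile comes from the text
    # above; the other minimums are illustrative assumptions.
    PERCENTILES = [
        (0.01, "01_0_percentile", 100),
        (0.10, "10_0_percentile", 10),
        (0.50, "50_0_percentile", 10),
        (0.90, "90_0_percentile", 10),
        (0.95, "95_0_percentile", 20),
        (0.99, "99_0_percentile", 100),
        (0.999, "99_9_percentile", 1000),
    ]


    def summarize_latencies(samples):
        """Return mean and percentiles, using None where samples are too few."""
        ordered = sorted(samples)
        count = len(ordered)
        summary = {"mean": sum(ordered) / count if count else None}
        for fraction, name, minimum in PERCENTILES:
            if count >= minimum:
                summary[name] = ordered[int(fraction * count)]
            else:
                # Too few observations to interpret this percentile.
                summary[name] = None
        return summary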


**counters.uploader.files_uploaded**

**counters.uploader.bytes_uploaded**

**counters.downloader.files_downloaded**

**counters.downloader.bytes_downloaded**

    These count client activity: a Tahoe client will increment these when it
    uploads or downloads an immutable file. 'files_uploaded' is incremented by
    one for each operation, while 'bytes_uploaded' is incremented by the size of
    the file.

**counters.mutable.files_published**

**counters.mutable.bytes_published**

**counters.mutable.files_retrieved**

**counters.mutable.bytes_retrieved**

    These count client activity for mutable files. 'published' is the act of
    changing an existing mutable file (or creating a brand-new mutable file).
    'retrieved' is the act of reading its current contents.

**counters.chk_upload_helper.\***

    These count activity of the "Helper", which receives ciphertext from clients
    and performs erasure-coding and share upload for files that are not already
    in the grid. The code which implements these counters is in
    src/allmydata/immutable/offloaded.py .

    upload_requests
        incremented each time a client asks to upload a file

    upload_already_present
        incremented when the file is already in the grid

    upload_need_upload
        incremented when the file is not already in the grid

    resumes
        incremented when the helper already has partial ciphertext for
        the requested upload, indicating that the client is resuming an
        earlier upload

    fetched_bytes
        this counts how many bytes of ciphertext have been fetched
        from uploading clients

    encoded_bytes
        this counts how many bytes of ciphertext have been
        encoded and turned into successfully-uploaded shares. If no
        uploads have failed or been abandoned, encoded_bytes should
        eventually equal fetched_bytes.

**stats.chk_upload_helper.\***

    These also track Helper activity:

    active_uploads
        how many files are currently being uploaded. 0 when idle.

    incoming_count
        how many cache files are present in the incoming/ directory,
        which holds ciphertext files that are still being fetched
        from the client

    incoming_size
        total size of cache files in the incoming/ directory

    incoming_size_old
        total size of 'old' cache files (more than 48 hours old) in the
        incoming/ directory

    encoding_count
        how many cache files are present in the encoding/ directory,
        which holds ciphertext files that are being encoded and
        uploaded

    encoding_size
        total size of cache files in the encoding/ directory

    encoding_size_old
        total size of 'old' cache files (more than 48 hours old) in the
        encoding/ directory

**stats.node.uptime**

    how many seconds since the node process was started

**stats.cpu_monitor.\***

    1min_avg, 5min_avg, 15min_avg
        estimate of what percentage of system CPU time was consumed by the
        node process, over the given time interval. Expressed as a float, 0.0
        for 0%, 1.0 for 100%

    total
        estimate of total number of CPU seconds consumed by node since
        the process was started. Ticket #472 indicates that .total may
        sometimes be negative due to wraparound of the kernel's counter.

**stats.load_monitor.\***

    When enabled, the "load monitor" continually schedules a one-second
    callback, and measures how late the response is. This estimates system load
    (if the system is idle, the response should be on time). This is only
    enabled if a stats-gatherer is configured.

    avg_load
        average "load" value (seconds late) over the last minute

    max_load
        maximum "load" value over the last minute
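
The load-monitor idea can be sketched without an event loop. The function
below is only an illustration: it sleeps for a fixed interval and records how
far past the deadline it woke up. The function name, the use of 'time.sleep'
in place of a scheduled callback, and the parameters are all assumptions.

.. code-block:: python

    import time


    def measure_lateness(interval=1.0, samples=60):
        """Return (avg_load, max_load) in seconds late over the given samples."""
        if samples < 1:
            raise ValueError("samples must be at least 1")
        lateness = []
        for _ in range(samples):
            expected = time.monotonic() + interval
            time.sleep(interval)
            # On an idle system the wake-up is nearly on time, so this is ~0.
            lateness.append(max(0.0, time.monotonic() - expected))
        return sum(lateness) / len(lateness), max(lateness)

With the defaults this takes about a minute to return, matching the
one-minute window of 'avg_load' and 'max_load'.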


Running a Tahoe Stats-Gatherer Service
======================================

The "stats-gatherer" is a simple daemon that periodically collects stats from
several tahoe nodes. It could be useful, e.g., in a production environment,
where you want to monitor dozens of storage servers from a central management
host. It merely gathers statistics from many nodes into a single place: it
does not do any actual analysis.

The stats gatherer listens on a network port using the same Foolscap_
connection library that Tahoe clients use to connect to storage servers.
Tahoe nodes can be configured to connect to the stats gatherer and publish
their stats on a periodic basis. (In fact, what happens is that nodes connect
to the gatherer and offer it a second FURL which points back to the node's
"stats port", which the gatherer then uses to pull stats on a periodic basis.
The initial connection is flipped to allow the nodes to live behind NAT
boxes, as long as the stats-gatherer has a reachable IP address.)

.. _Foolscap: http://foolscap.lothar.com/trac

The stats-gatherer is created in the same fashion as regular tahoe client
nodes and introducer nodes. Choose a base directory for the gatherer to live
in (but do not create the directory). Then run:

::

   tahoe create-stats-gatherer $BASEDIR

and start it with "tahoe start $BASEDIR". Once running, the gatherer will
write a FURL into $BASEDIR/stats_gatherer.furl .

To configure a Tahoe client/server node to contact the stats gatherer, copy
this FURL into the node's tahoe.cfg file, in a section named "[client]",
under a key named "stats_gatherer.furl", like so:

::

    [client]
    stats_gatherer.furl = pb://qbo4ktl667zmtiuou6lwbjryli2brv6t@192.168.0.8:49997/wxycb4kaexzskubjnauxeoptympyf45y

or simply copy the stats_gatherer.furl file into the node's base directory
(next to the tahoe.cfg file): it will be interpreted in the same way.
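
For scripted deployments, the same tahoe.cfg edit could be made with Python's
standard 'configparser' module. This is a sketch only: the helper name is
hypothetical, and rewriting the file this way drops any comments it
contained.

.. code-block:: python

    import configparser
    import io


    def add_gatherer_furl(cfg_text, furl):
        """Return tahoe.cfg text with [client]stats_gatherer.furl set to furl."""
        # interpolation=None keeps any '%' characters in existing values literal.
        parser = configparser.ConfigParser(interpolation=None)
        parser.read_string(cfg_text)
        if not parser.has_section("client"):
            parser.add_section("client")
        parser.set("client", "stats_gatherer.furl", furl)
        out = io.StringIO()
        parser.write(out)
        return out.getvalue()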

The first time it is started, the gatherer will listen on a random unused TCP
port, so it should not conflict with anything else that you have running on
that host at that time. On subsequent runs, it will re-use the same port (to
keep its FURL consistent). To explicitly control which port it uses, write
the desired port number into a file named "portnum" (i.e. $BASEDIR/portnum),
and the next time the gatherer is started, it will start listening on the
given port. The portnum file is actually a "strports specification string",
as described in configuration.rst_.

Once running, the stats gatherer will create a standard python "pickle" file
in $BASEDIR/stats.pickle . Once a minute, the gatherer will pull stats
information from every connected node and write it into the pickle. The
pickle will contain a dictionary, in which node identifiers (known as "tubid"
strings) are the keys, and the values are a dict with 'timestamp',
'nickname', and 'stats' keys. d[tubid]['stats'] will contain the stats
dictionary as made available at http://localhost:3456/statistics?t=json . The
pickle file will only contain the most recent update from each node.
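
A minimal sketch of reading that pickle file follows. The function names are
hypothetical, and the layout is taken only from the description above.
Unpickling can execute arbitrary code, so this should only be pointed at a
pickle written by a gatherer you trust; a file written by a Python 2 gatherer
may also need an 'encoding' argument to 'pickle.load', which is not handled
here.

.. code-block:: python

    import pickle
    import time


    def load_gatherer_stats(path):
        """Load the {tubid: {'timestamp', 'nickname', 'stats'}} dictionary."""
        with open(path, "rb") as f:
            return pickle.load(f)


    def describe_nodes(gathered, now=None):
        """Return sorted (nickname, tubid, age_in_seconds) tuples."""
        now = time.time() if now is None else now
        return sorted(
            (entry["nickname"], tubid, now - entry["timestamp"])
            for tubid, entry in gathered.items()
        )

The age value shows how stale each node's most recent update is, which can
help spot nodes that have stopped reporting.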

Other tools can be built to examine these stats and render them into
something useful. For example, a tool could sum the
'storage_server.disk_avail' values from all servers to compute a
total-disk-available number for the entire grid (however, the "disk watcher"
daemon, in misc/operations_helpers/spacetime/, is better suited for this
specific task).
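
The summing example could look like the sketch below. It assumes the layout
described earlier, in which each node's entry holds the JSON-style dictionary
with its own inner 'stats' key; the function name is hypothetical, and nodes
that do not report the value (such as client-only nodes) are skipped.

.. code-block:: python

    def total_disk_avail(gathered):
        """Sum storage_server.disk_avail (bytes) over all nodes that report it."""
        total = 0
        for entry in gathered.values():
            # entry["stats"] is the per-node dictionary; its inner "stats"
            # key holds the non-counter values.
            node_stats = entry.get("stats", {}).get("stats", {})
            total += node_stats.get("storage_server.disk_avail", 0)
        return total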

.. _configuration.rst: configuration.rst

Using Munin To Graph Stats Values
=================================

The misc/munin/ directory contains various plugins to graph stats for Tahoe
nodes. They are intended for use with the Munin_ system-management tool, which
typically polls target systems every 5 minutes and produces a web page with
graphs of various things over multiple time scales (last hour, last month,
last year).

Most of the plugins are designed to pull stats from a single Tahoe node, and
are configured with a URL such as http://localhost:3456/statistics?t=json .
The "tahoe_stats" plugin is designed to read from the pickle file created by
the stats-gatherer. Some plugins are to be used with the disk watcher, and a
few (like tahoe_nodememory) are designed to watch the node processes directly
(and must therefore run on the same host as the target node).
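
For orientation, a Munin plugin prints graph settings when run with the
'config' argument and prints 'field.value N' lines otherwise. The sketch
below follows that general protocol for a single counter. It is not one of
the plugins shipped in misc/munin/; the function names, graph labels, and
choice of counter are assumptions.

.. code-block:: python

    def config_lines():
        """Lines printed when Munin runs the plugin with the 'config' argument."""
        return [
            "graph_title Tahoe files uploaded",
            "graph_vlabel files per second",
            "graph_category tahoe",
            "files_uploaded.label files uploaded",
            # DERIVE with min 0 graphs a rate and discards the negative jump
            # seen when a restarted node resets the counter.
            "files_uploaded.type DERIVE",
            "files_uploaded.min 0",
        ]


    def value_lines(data):
        """Lines printed on a normal run, given the decoded JSON stats."""
        count = data.get("counters", {}).get("uploader.files_uploaded", 0)
        return ["files_uploaded.value %d" % count]


    def plugin_output(args, load_stats):
        """Return the text to print; load_stats is a callable returning stats."""
        if args and args[0] == "config":
            return "\n".join(config_lines())
        return "\n".join(value_lines(load_stats()))

A script installed as a plugin would print the result of calling
'plugin_output' with its command-line arguments and a callable that fetches
and decodes the node's /statistics?t=json URL.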

Please see the docstrings at the beginning of each plugin for details, and
the "tahoe-conf" file for notes about configuration and installing these
plugins into a Munin environment.

.. _Munin: http://munin-monitoring.org/