docs/stats.txt

   1 = Tahoe Statistics =
   2
   3 1. Overview
   4 2. Statistics Categories
   5 3. Running a Tahoe Stats-Gatherer Service
   6 4. Using Munin To Graph Stats Values
   7
   8 == Overview ==
   9
  10 Each Tahoe node collects and publishes statistics about its operations as it
  11 runs. These include counters of how many files have been uploaded and
  12 downloaded, CPU usage information, performance numbers like latency of
  13 storage server operations, and available disk space.
  14
  15 The easiest way to see the stats for any given node is to use the web interface.
  16 From the main "Welcome Page", follow the "Operational Statistics" link inside
  17 the small "This Client" box. If the welcome page lives at
  18 http://localhost:3456/, then the statistics page will live at
  19 http://localhost:3456/statistics . This presents a summary of the stats
  20 block, along with a copy of the raw counters. To obtain just the raw counters
  21 (in JSON format), use /statistics?t=json instead.
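
As a rough sketch, the raw counters could be fetched from a script along
these lines. The helper names (stats_url, fetch_stats) are illustrative and
are not part of Tahoe; the address is the example one used above.

```python
import json
from urllib.request import urlopen


def stats_url(welcome_url):
    """Build the JSON statistics URL from the Welcome Page URL."""
    return welcome_url.rstrip("/") + "/statistics?t=json"


def fetch_stats(welcome_url, timeout=10):
    """Fetch and decode the raw stats dictionary from a running node."""
    with urlopen(stats_url(welcome_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_stats("http://localhost:3456/") requires a node to be running
at that address.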
  22
  23 == Statistics Categories ==
  24
  25 The stats dictionary contains two keys: 'counters' and 'stats'. 'counters'
  26 are strictly counters: they are reset to zero when the node is started, and
  27 grow upwards. 'stats' are non-incrementing values, used to measure the
  28 current state of various systems. Some stats are actually booleans, expressed
  29 as '1' for true and '0' for false (internal restrictions require all stats
  30 values to be numbers).
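
Because counters restart from zero whenever the node restarts, a tool that
samples them periodically has to cope with values that go backwards. The
following is a minimal sketch of one way to do that; both helper names are
hypothetical.

```python
def counter_delta(previous, current):
    """Return how much a counter grew between two samples.

    A current value below the previous one is treated as a node restart,
    so the growth is counted from zero.
    """
    if current < previous:
        return current
    return current - previous


def as_bool(stat_value):
    """Interpret a boolean-style stat (1 for true, 0 for false)."""
    return stat_value == 1
```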
  31
  32 Under both the 'counters' and 'stats' dictionaries, each individual stat has
  33 a key with a dot-separated name, breaking them up into groups like
  34 'cpu_monitor' and 'storage_server'.
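
To illustrate the naming scheme, this sketch splits the dot-separated keys
into their groups. The sample dictionary is made up for the example and
only mimics the shape described above; group_by_prefix is a hypothetical
helper.

```python
from collections import defaultdict


def group_by_prefix(flat):
    """Group dot-separated stat names by their first component."""
    groups = defaultdict(dict)
    for name, value in flat.items():
        group, _, rest = name.partition(".")
        groups[group][rest] = value
    return dict(groups)


# Made-up sample in the shape described above.
sample = {
    "counters": {"storage_server.allocate": 3, "uploader.files_uploaded": 1},
    "stats": {"cpu_monitor.1min_avg": 0.02, "storage_server.disk_avail": 5000},
}
grouped_stats = group_by_prefix(sample["stats"])
```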
  35
  36 The currently available stats (as of release 1.6.0 or so) are described here:
  37
  38 counters.storage_server.*: this group counts inbound storage-server
  39                            operations. They are not provided by client-only
  40                            nodes which have been configured to not run a
  41                            storage server (with [storage]enabled=false in
  42                            tahoe.cfg)
  43   allocate, write, close, abort: these are for immutable file uploads.
  44                                  'allocate' is incremented when a client asks
  45                                  if it can upload a share to the server.
  46                                  'write' is incremented for each chunk of
  47                                  data written. 'close' is incremented when
  48                                  the share is finished. 'abort' is
  49                                  incremented if the client abandons the
  50                                  upload.
  51   get, read: these are for immutable file downloads. 'get' is incremented
  52              when a client asks if the server has a specific share. 'read' is
  53              incremented for each chunk of data read.
  54   readv, writev: these are for mutable file creation, publish, and
  55                  retrieve. 'readv' is incremented each time a client reads
  56                  part of a mutable share. 'writev' is incremented each time a
  57                  client sends a modification request.
  58   add-lease, renew, cancel: these are for share lease modifications.
  59                             'add-lease' is incremented when an 'add-lease'
  60                             operation is performed (which either adds a new
  61                             lease or renews an existing lease). 'renew' is
  62                             for the 'renew-lease' operation (which can only
  63                             be used to renew an existing one). 'cancel' is
  64                             used for the 'cancel-lease' operation.
  65   bytes_freed: this counts how many bytes were freed when a 'cancel-lease'
  66                operation removed the last lease from a share and the share
  67                was thus deleted.
  68   bytes_added: this counts how many bytes were consumed by immutable share
  69                uploads. It is incremented at the same time as the 'close'
  70                counter.
  71
  72 stats.storage_server.*:
  73  allocated: this counts how many bytes are currently 'allocated', which
  74             tracks the space that will eventually be consumed by immutable
  75             share upload operations. The stat is increased as soon as the
  76             upload begins (at the same time the 'allocate' counter is
  77             incremented), and goes back to zero when the 'close' or 'abort'
  78             message is received (at which point the 'disk_used' stat should
  79             be incremented by the same amount).
  80  disk_total
  81  disk_used
  82  disk_free_for_root
  83  disk_free_for_nonroot
  84  disk_avail
  85  reserved_space: these all reflect disk-space usage policies and status.
  86                  'disk_total' is the total size of disk where the storage
  87                  server's BASEDIR/storage/shares directory lives, as reported
  88                  by /bin/df or equivalent. 'disk_used', 'disk_free_for_root',
  89                  and 'disk_free_for_nonroot' show related information.
  90                  'reserved_space' reports the reservation configured by the
  91                  tahoe.cfg [storage]reserved_space value. 'disk_avail'
  92                  reports the remaining disk space available for the Tahoe
  93                  server after subtracting reserved_space from
  94                  disk_free_for_nonroot. All values are in bytes.
  95  accepting_immutable_shares: this is '1' if the storage server is currently
  96                              accepting uploads of immutable shares. It may be
  97                              '0' if a server is disabled by configuration, or
  98                              if the disk is full (i.e. disk_avail is less
  99                              than reserved_space).
 100  total_bucket_count: this counts the number of 'buckets' (i.e. unique
 101                      storage-index values) currently managed by the storage
 102                      server. It indicates roughly how many files are managed
 103                      by the server.
 104  latencies.*.*: these stats keep track of local disk latencies for
 105                 storage-server operations. A number of percentile values are
 106                 tracked for many operations. For example,
 107                 'storage_server.latencies.readv.50_0_percentile' records the
 108                 median response time for a 'readv' request. All values are in
 109                 seconds. These are recorded by the storage server, starting
 110                 from the time the request arrives (post-deserialization) and
 111                 ending when the response begins serialization. As such, they
 112                 are mostly useful for measuring disk speeds. The operations
 113                 tracked are the same as the counters.storage_server.* counter
 114                 values (allocate, write, close, get, read, add-lease, renew,
 115                 cancel, readv, writev). The percentile values tracked are:
 116                 mean, 01_0_percentile, 10_0_percentile, 50_0_percentile,
 117                 90_0_percentile, 95_0_percentile, 99_0_percentile,
 118                 99_9_percentile. (the last value, 99.9 percentile, means that
 119                 999 out of the last 1000 operations were faster than the
 120                 given number, and is the same threshold used by Amazon's
 121                 internal SLA, according to the Dynamo paper).
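
The key names follow a regular pattern, so they can be generated rather
than typed out. This is a sketch only: the formatting rule is inferred from
the names listed above, and latency_key and latencies_for are hypothetical
helpers.

```python
OPERATIONS = ["allocate", "write", "close", "get", "read",
              "add-lease", "renew", "cancel", "readv", "writev"]
PERCENTILES = [1.0, 10.0, 50.0, 90.0, 95.0, 99.0, 99.9]


def latency_key(operation, percentile=None):
    """Build a latency stat name; percentile=None selects the mean."""
    prefix = "storage_server.latencies.%s." % operation
    if percentile is None:
        return prefix + "mean"
    # 1.0 becomes "01_0", 99.9 becomes "99_9"
    return prefix + ("%04.1f" % percentile).replace(".", "_") + "_percentile"


def latencies_for(stats, operation):
    """Collect whichever latency values are present for one operation."""
    keys = [latency_key(operation)]
    keys += [latency_key(operation, p) for p in PERCENTILES]
    return {key: stats[key] for key in keys if key in stats}
```

The lookup skips missing keys, on the assumption that a server may not
report every value for every operation.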
 122
 123 counters.uploader.files_uploaded
 124 counters.uploader.bytes_uploaded
 125 counters.downloader.files_downloaded
 126 counters.downloader.bytes_downloaded
 127
 128  These count client activity: a Tahoe client will increment these when it
 129  uploads or downloads an immutable file. 'files_uploaded' is incremented by
 130  one for each operation, while 'bytes_uploaded' is incremented by the size of
 131  the file.
 132
 133 counters.mutable.files_published
 134 counters.mutable.bytes_published
 135 counters.mutable.files_retrieved
 136 counters.mutable.bytes_retrieved
 137
 138  These count client activity for mutable files. 'published' is the act of
 139  changing an existing mutable file (or creating a brand-new mutable file).
 140  'retrieved' is the act of reading its current contents.
 141
 142 counters.chk_upload_helper.*
 143
 144  These count activity of the "Helper", which receives ciphertext from clients
 145  and performs erasure-coding and share upload for files that are not already
 146  in the grid. The code which implements these counters is in
 147  src/allmydata/immutable/offloaded.py .
 148
 149   upload_requests: incremented each time a client asks to upload a file
 150   upload_already_present: incremented when the file is already in the grid
 151   upload_need_upload: incremented when the file is not already in the grid
 152   resumes: incremented when the helper already has partial ciphertext for
 153            the requested upload, indicating that the client is resuming an
 154            earlier upload
 155   fetched_bytes: this counts how many bytes of ciphertext have been fetched
 156                  from uploading clients
 157   encoded_bytes: this counts how many bytes of ciphertext have been
 158                  encoded and turned into successfully-uploaded shares. If no
 159                  uploads have failed or been abandoned, encoded_bytes should
 160                  eventually equal fetched_bytes.
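
The relationship between fetched_bytes and encoded_bytes can be checked
with a small helper like the one below. It is a sketch: helper_backlog is a
hypothetical name, and it expects the 'counters' dictionary from the node's
JSON stats.

```python
def helper_backlog(counters):
    """Bytes fetched from clients but not yet turned into uploaded shares."""
    fetched = counters.get("chk_upload_helper.fetched_bytes", 0)
    encoded = counters.get("chk_upload_helper.encoded_bytes", 0)
    return fetched - encoded
```

A gap that never closes would be consistent with uploads that failed or
were abandoned.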
 161
 162 stats.chk_upload_helper.*
 163
 164  These also track Helper activity:
 165
 166   active_uploads: how many files are currently being uploaded. 0 when idle.
 167   incoming_count: how many cache files are present in the incoming/ directory,
 168                   which holds ciphertext files that are still being fetched
 169                   from the client
 170   incoming_size: total size of cache files in the incoming/ directory
 171   incoming_size_old: total size of 'old' cache files (more than 48 hours)
 172   encoding_count: how many cache files are present in the encoding/ directory,
 173                   which holds ciphertext files that are being encoded and
 174                   uploaded
 175   encoding_size: total size of cache files in the encoding/ directory
 176   encoding_size_old: total size of 'old' cache files (more than 48 hours)
 177
 178 stats.node.uptime: how many seconds since the node process was started
 179
 180 stats.cpu_monitor.*:
 181   .1min_avg, .5min_avg, .15min_avg: estimate of what percentage of system CPU
 182                                     time was consumed by the node process, over
 183                                     the given time interval. Expressed as a
 184                                     float, 0.0 for 0%, 1.0 for 100%
 185   .total: estimate of total number of CPU seconds consumed by node since
 186           the process was started. Ticket #472 indicates that .total may
 187           sometimes be negative due to wraparound of the kernel's counter.
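
A consumer of these values might convert the averages for display and
guard against the negative .total mentioned above. This is a minimal
sketch; both function names are hypothetical, and dropping a negative total
is just one possible policy.

```python
def cpu_percent(fraction):
    """Convert a cpu_monitor average (0.0 to 1.0) to a percentage string."""
    return "%.1f%%" % (fraction * 100.0)


def cpu_total_or_none(stats):
    """Return cpu_monitor.total, or None if it is missing or negative."""
    total = stats.get("cpu_monitor.total")
    if total is None or total < 0:
        return None
    return total
```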
 188
 189 stats.load_monitor.*:
 190  When enabled, the "load monitor" continually schedules a one-second
 191  callback, and measures how late the response is. This estimates system load
 192  (if the system is idle, the response should be on time). This is only
 193  enabled if a stats-gatherer is configured.
 194
 195  .avg_load: average "load" value (seconds late) over the last minute
 196  .max_load: maximum "load" value over the last minute
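
The measurement idea can be sketched as follows. This is not Tahoe's
implementation, which runs inside the node's own event loop; the class
name, the 60-sample window (standing in for "the last minute" at one sample
per second), and the clamping of early callbacks to zero are all
assumptions made for the illustration.

```python
from collections import deque


class LoadSampler:
    """Track how late a periodic callback runs, over a sliding window."""

    def __init__(self, interval=1.0, window=60):
        self.interval = interval
        self.samples = deque(maxlen=window)  # most recent lateness values

    def record(self, scheduled_at, ran_at):
        """Store how many seconds late the callback ran."""
        expected = scheduled_at + self.interval
        self.samples.append(max(ran_at - expected, 0.0))

    def stats(self):
        """Return values shaped like the load_monitor stats."""
        if not self.samples:
            return {"load_monitor.avg_load": 0.0, "load_monitor.max_load": 0.0}
        return {
            "load_monitor.avg_load": sum(self.samples) / len(self.samples),
            "load_monitor.max_load": max(self.samples),
        }
```

In a real loop, scheduled_at would be the time at which the one-second
callback was requested and ran_at the time at which it actually fired.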
 197
 198
 199 == Running a Tahoe Stats-Gatherer Service ==
 200
 201 The "stats-gatherer" is a simple daemon that periodically collects stats from
 202 several tahoe nodes. It could be useful, e.g., in a production environment,
 203 where you want to monitor dozens of storage servers from a central management
 204 host. It merely gathers statistics from many nodes into a single place: it
 205 does not do any actual analysis.
 206
 207 The stats gatherer listens on a network port using the same Foolscap
 208 connection library that Tahoe clients use to connect to storage servers.
 209 Tahoe nodes can be configured to connect to the stats gatherer and publish
 210 their stats on a periodic basis. (In fact, what happens is that nodes connect
 211 to the gatherer and offer it a second FURL which points back to the node's
 212 "stats port", which the gatherer then uses to pull stats on a periodic basis.
 213 The initial connection is flipped to allow the nodes to live behind NAT
 214 boxes, as long as the stats-gatherer has a reachable IP address.)
 215
 216 The stats-gatherer is created in the same fashion as regular tahoe client
 217 nodes and introducer nodes. Choose a base directory for the gatherer to live
 218 in (but do not create the directory). Then run:
 219
 220  tahoe create-stats-gatherer $BASEDIR
 221
 222 and start it with "tahoe start $BASEDIR". Once running, the gatherer will
 223 write a FURL into $BASEDIR/stats_gatherer.furl .
 224
 225 To configure a Tahoe client/server node to contact the stats gatherer, copy
 226 this FURL into the node's tahoe.cfg file, in a section named "[client]",
 227 under a key named "stats_gatherer.furl", like so:
 228
 229  [client]
 230  stats_gatherer.furl = pb://qbo4ktl667zmtiuou6lwbjryli2brv6t@192.168.0.8:49997/wxycb4kaexzskubjnauxeoptympyf45y
 231
 232 or simply copy the stats_gatherer.furl file into the node's base directory
 233 (next to the tahoe.cfg file): it will be interpreted in the same way.
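
To check which gatherer a node is configured to use, the setting can be
read back with Python's standard configparser module. This is a sketch
under the assumption that tahoe.cfg parses as a plain INI file;
gatherer_furl is a hypothetical helper and EXAMPLE_CFG repeats the example
above.

```python
import configparser

EXAMPLE_CFG = """\
[client]
stats_gatherer.furl = pb://qbo4ktl667zmtiuou6lwbjryli2brv6t@192.168.0.8:49997/wxycb4kaexzskubjnauxeoptympyf45y
"""


def gatherer_furl(cfg_text):
    """Return the configured stats gatherer FURL, or None if unset."""
    # Interpolation is disabled so that FURL text is never treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return parser.get("client", "stats_gatherer.furl", fallback=None)
```

This only covers the tahoe.cfg route; a stats_gatherer.furl file copied
into the base directory would have to be read separately.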
 234
 235 The first time it is started, the gatherer will listen on a random unused TCP
 236 port, so it should not conflict with anything else that you have running on
 237 that host at that time. On subsequent runs, it will re-use the same port (to
 238 keep its FURL consistent). To explicitly control which port it uses, write
 239 the desired portnumber into a file named "portnum" (i.e. $BASEDIR/portnum),
 240 and the next time the gatherer is started, it will start listening on the
 241 given port. The portnum file is actually a "strports specification string",
 242 as described in docs/configuration.txt .
 243
 244 Once running, the stats gatherer will create a standard python "pickle" file
 245 in $BASEDIR/stats.pickle . Once a minute, the gatherer will pull stats
 246 information from every connected node and write them into the pickle. The
 247 pickle will contain a dictionary, in which node identifiers (known as "tubid"
 248 strings) are the keys, and the values are a dict with 'timestamp',
 249 'nickname', and 'stats' keys. d[tubid]['stats'] will contain the stats
 250 dictionary as made available at http://localhost:3456/statistics?t=json . The
 251 pickle file will only contain the most recent update from each node.
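
Reading the file back might look like the sketch below. The function names
are hypothetical, and a pickle written by an older Python may need extra
care to load under a newer one. As with any pickle, only load files from a
gatherer you trust, since unpickling can execute arbitrary code.

```python
import pickle


def load_gatherer_pickle(path):
    """Load the gatherer's pickle: a dict keyed by tubid string."""
    with open(path, "rb") as f:
        return pickle.load(f)


def describe_nodes(gathered):
    """Return one (tubid, nickname, timestamp) tuple per node, sorted by tubid."""
    return [(tubid, entry["nickname"], entry["timestamp"])
            for tubid, entry in sorted(gathered.items())]
```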
 252
 253 Other tools can be built to examine these stats and render them into
 254 something useful. For example, a tool could sum the
 255 'storage_server.disk_avail' values from all servers to compute a
 256 total-disk-available number for the entire grid (however, the "disk watcher"
 257 daemon, in misc/operations_helpers/spacetime/, is better suited for this specific task).
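
A minimal sketch of that summing tool follows, assuming the dictionary
layout described earlier (each node's 'stats' entry holds the same
'counters'/'stats' dictionary that ?t=json returns). total_disk_avail is a
hypothetical name; nodes that do not report the value, such as client-only
nodes, contribute nothing.

```python
def total_disk_avail(gathered):
    """Sum storage_server.disk_avail over every node that reports it."""
    total = 0
    for entry in gathered.values():
        node_stats = entry["stats"].get("stats", {})
        total += node_stats.get("storage_server.disk_avail", 0)
    return total
```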
 258
 259 == Using Munin To Graph Stats Values ==
 260
 261 The misc/munin/ directory contains various plugins to graph stats for Tahoe
 262 nodes. They are intended for use with the Munin system-management tool, which
 263 typically polls target systems every 5 minutes and produces a web page with
 264 graphs of various things over multiple time scales (last hour, last month,
 265 last year).
 266
 267 Most of the plugins are designed to pull stats from a single Tahoe node, and
 268 are configured with a URL like http://localhost:3456/statistics?t=json . The
 269 "tahoe_stats" plugin is designed to read from the pickle file created by the
 270 stats-gatherer. Some plugins are to be used with the disk watcher, and a few
 271 (like tahoe_nodememory) are designed to watch the node processes directly
 272 (and must therefore run on the same host as the target node).
 273
 274 Please see the docstrings at the beginning of each plugin for details, and
 275 the "tahoe-conf" file for notes about configuration and installing these
 276 plugins into a Munin environment.
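
For orientation, the general shape of a Munin plugin's output can be
sketched as below. This is not one of the plugins in misc/munin/; the
function names and the field names in the usage note are made up, and only
the basic "graph_title", "graph_vlabel", "<field>.label" and
"<field>.value" lines of the Munin plugin protocol are shown.

```python
def munin_config(title, vlabel, fields):
    """Lines a plugin prints when Munin runs it with the 'config' argument."""
    lines = ["graph_title %s" % title, "graph_vlabel %s" % vlabel]
    lines += ["%s.label %s" % (name, label) for name, label in fields]
    return lines


def munin_values(stats, fields_to_keys):
    """Lines a plugin prints on a normal run: one '<field>.value N' each."""
    return ["%s.value %s" % (field, stats[key])
            for field, key in fields_to_keys if key in stats]
```

For example, munin_values(stats, [("avail", "storage_server.disk_avail")])
yields a single "avail.value ..." line when the node reports that stat.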