Filename: 166-statistics-extra-info-docs.txt
Title: Including Network Statistics in Extra-Info Documents
Author: Karsten Loesing
Created: 21-Jul-2009
Target: 0.2.2
Status: Closed

Change history:

  21-Jul-2009  Initial proposal for or-dev


Overview:

  The Tor network has grown to almost two thousand relays and millions
  of casual users over the past few years. With growth has come
  increasing performance problems and attempts by some countries to
  block access to the Tor network. In order to address these problems,
  we need to learn more about the Tor network. This proposal suggests to
  measure additional statistics and include them in extra-info documents
  to help us understand the Tor network better.


Introduction:

  As of May 2009, relays, bridges, and directories gather the following
  data for statistical purposes:

  - Relays and bridges count the number of bytes that they have pushed
    in 15-minute intervals over the past 24 hours. Relays and bridges
    include these data in extra-info documents that they send to the
    directory authorities whenever they publish their server descriptor.

  - Bridges further include a rough number of clients per country that
    they have seen in the past 48 hours in their extra-info documents.

  - Directories can be configured to count the number of clients they
    see per country in the past 24 hours and to write them to a local
    file.

  Since then we extended the network statistics in Tor. These statistics
  include:

  - Directories now gather more precise statistics about connecting
    clients. Fixes include measuring in intervals of exactly 24 hours,
    counting unsuccessful requests, measuring download times, etc. The
    directories append their statistics to a local file every 24 hours.

  - Entry guards count the number of clients per country per day like
    bridges do and write them to a local file every 24 hours.

  - Relays measure statistics of the number of cells in their circuit
    queues and how much time these cells spend waiting there. Relays
    write these statistics to a local file every 24 hours.

  - Exit nodes count the number of read and written bytes on exit
    connections per port as well as the number of opened exit streams
    per port in 24-hour intervals. Exit nodes write their statistics to
    a local file.

  The following four sections contain descriptions for adding these
  statistics to the relays' extra-info documents.


Directory request statistics:

  The first type of statistics aims at measuring directory requests sent
  by clients to a directory mirror or directory authority. More
  precisely, these statistics aim at requests for v2 and v3 network
  statuses only. These directory requests are sent non-anonymously,
  either via HTTP-like requests to a directory's Dir port or tunneled
  over a 1-hop circuit.

  Measuring directory request statistics is useful for several reasons:
  First, the number of locally seen directory requests can be used to
  estimate the total number of clients in the Tor network. Second, the
  country-wise classification of requests using a GeoIP database can
  help counting the relative and absolute number of users per country.
  Third, the download times can give hints on the available bandwidth
  capacity at clients.

  Directory requests do not give any hints on the contents that clients
  send or receive over the Tor network. Every client requests network
  statuses from the directories, so that there are no anonymity-related
  concerns to gather these statistics. It might be, though, that clients
  wish to hide the fact that they are connecting to the Tor network.
  Therefore, IP addresses are resolved to country codes in memory,
  events are accumulated over 24 hours, and numbers are rounded up to
  multiples of 4 or 8.

   "dirreq-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL
      [At most once.]

      YYYY-MM-DD HH:MM:SS defines the end of the included measurement
      interval of length NSEC seconds (86400 seconds by default).

      A "dirreq-stats-end" line, as well as any other "dirreq-*" line,
      is only added when the relay has opened its Dir port and after 24
      hours of measuring directory requests.

   "dirreq-v2-ips" CC=N,CC=N,... NL
      [At most once.]
   "dirreq-v3-ips" CC=N,CC=N,... NL
      [At most once.]

      List of mappings from two-letter country codes to the number of
      unique IP addresses that have connected from that country to
      request a v2/v3 network status, rounded up to the nearest multiple
      of 8. Only those IP addresses are counted that the directory can
      answer with a 200 OK status code.

   "dirreq-v2-reqs" CC=N,CC=N,... NL
      [At most once.]
   "dirreq-v3-reqs" CC=N,CC=N,... NL
      [At most once.]

      List of mappings from two-letter country codes to the number of
      requests for v2/v3 network statuses from that country, rounded up
      to the nearest multiple of 8. Only those requests are counted that
      the directory can answer with a 200 OK status code.

   "dirreq-v2-share" num% NL
      [At most once.]
   "dirreq-v3-share" num% NL
      [At most once.]

      The share of v2/v3 network status requests that the directory
      expects to receive from clients based on its advertised bandwidth
      compared to the overall network bandwidth capacity. Shares are
      formatted in percent with two decimal places. Shares are
      calculated as means over the whole 24-hour interval.

   "dirreq-v2-resp" status=num,... NL
      [At most once.]
   "dirreq-v3-resp" status=nul,... NL
      [At most once.]

      List of mappings from response statuses to the number of requests
      for v2/v3 network statuses that were answered with that response
      status, rounded up to the nearest multiple of 4. Only response
      statuses with at least 1 response are reported. New response
      statuses can be added at any time. The current list of response
      statuses is as follows:

      "ok": a network status request is answered; this number
         corresponds to the sum of all requests as reported in
         "dirreq-v2-reqs" or "dirreq-v3-reqs", respectively, before
         rounding up.
      "not-enough-sigs: a version 3 network status is not signed by a
         sufficient number of requested authorities.
      "unavailable": a requested network status object is unavailable.
      "not-found": a requested network status is not found.
      "not-modified": a network status has not been modified since the
         If-Modified-Since time that is included in the request.
      "busy": the directory is busy.

   "dirreq-v2-direct-dl" key=val,... NL
      [At most once.]
   "dirreq-v3-direct-dl" key=val,... NL
      [At most once.]
   "dirreq-v2-tunneled-dl" key=val,... NL
      [At most once.]
   "dirreq-v3-tunneled-dl" key=val,... NL
      [At most once.]

      List of statistics about possible failures in the download process
      of v2/v3 network statuses. Requests are either "direct"
      HTTP-encoded requests over the relay's directory port, or
      "tunneled" requests using a BEGIN_DIR cell over the relay's OR
      port. The list of possible statistics can change, and statistics
      can be left out from reporting. The current list of statistics is
      as follows:

      Successful downloads and failures:

      "complete": a client has finished the download successfully.
      "timeout": a download did not finish within 10 minutes after
         starting to send the response.
      "running": a download is still running at the end of the
         measurement period for less than 10 minutes after starting to
         send the response.

      Download times:

      "min", "max": smallest and largest measured bandwidth in B/s.
      "d[1-4,6-9]": 1st to 4th and 6th to 9th decile of measured
         bandwidth in B/s. For a given decile i, i/10 of all downloads
         had a smaller bandwidth than di, and (10-i)/10 of all downloads
         had a larger bandwidth than di.
      "q[1,3]": 1st and 3rd quartile of measured bandwidth in B/s. One
         fourth of all downloads had a smaller bandwidth than q1, one
         fourth of all downloads had a larger bandwidth than q3, and the
         remaining half of all downloads had a bandwidth between q1 and
         q3.
      "md": median of measured bandwidth in B/s. Half of the downloads
         had a smaller bandwidth than md, the other half had a larger
         bandwidth than md.


Entry guard statistics:

  Entry guard statistics include the number of clients per country and
  per day that are connecting directly to an entry guard.

  Entry guard statistics are important to learn more about the
  distribution of clients to countries. In the future, this knowledge
  can be useful to detect if there are or start to be any restrictions
  for clients connecting from specific countries.

  The information which client connects to a given entry guard is very
  sensitive. This information must not be combined with the information
  what contents are leaving the network at the exit nodes. Therefore,
  entry guard statistics need to be aggregated to prevent them from
  becoming useful for de-anonymization. Aggregation includes resolving
  IP addresses to country codes, counting events over 24-hour intervals,
  and rounding up numbers to the next multiple of 8.

   "entry-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL
      [At most once.]

      YYYY-MM-DD HH:MM:SS defines the end of the included measurement
      interval of length NSEC seconds (86400 seconds by default).

      An "entry-stats-end" line, as well as any other "entry-*"
      line, is first added after the relay has been running for at least
      24 hours.

   "entry-ips" CC=N,CC=N,... NL
      [At most once.]

      List of mappings from two-letter country codes to the number of
      unique IP addresses that have connected from that country to the
      relay and which are no known other relays, rounded up to the
      nearest multiple of 8.


Cell statistics:

  The third type of statistics have to do with the time that cells spend
  in circuit queues. In order to gather these statistics, the relay
  memorizes when it puts a given cell in a circuit queue and when this
  cell is flushed. The relay further notes the life time of the circuit.
  These data are sufficient to determine the mean number of cells in a
  queue over time and the mean time that cells spend in a queue.

  Cell statistics are necessary to learn more about possible reasons for
  the poor network performance of the Tor network, especially high
  latencies. The same statistics are also useful to determine the
  effects of design changes by comparing today's data with future data.

  There are basically no privacy concerns from measuring cell
  statistics, regardless of a node being an entry, middle, or exit node.

   "cell-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL
      [At most once.]

      YYYY-MM-DD HH:MM:SS defines the end of the included measurement
      interval of length NSEC seconds (86400 seconds by default).

      A "cell-stats-end" line, as well as any other "cell-*" line,
      is first added after the relay has been running for at least 24
      hours.

   "cell-processed-cells" num,...,num NL
      [At most once.]

      Mean number of processed cells per circuit, subdivided into
      deciles of circuits by the number of cells they have processed in
      descending order from loudest to quietest circuits.

   "cell-queued-cells" num,...,num NL
      [At most once.]

      Mean number of cells contained in queues by circuit decile. These
      means are calculated by 1) determining the mean number of cells in
      a single circuit between its creation and its termination and 2)
      calculating the mean for all circuits in a given decile as
      determined in "cell-processed-cells". Numbers have a precision of
      two decimal places.

   "cell-time-in-queue" num,...,num NL
      [At most once.]

      Mean time cells spend in circuit queues in milliseconds. Times are
      calculated by 1) determining the mean time cells spend in the
      queue of a single circuit and 2) calculating the mean for all
      circuits in a given decile as determined in
      "cell-processed-cells".

   "cell-circuits-per-decile" num NL
      [At most once.]

      Mean number of circuits that are included in any of the deciles,
      rounded up to the next integer.


Exit statistics:

  The last type of statistics affects exit nodes counting the number of
  bytes written and read and the number of streams opened per port and
  per 24 hours. Exit port statistics can be measured from looking at
  headers of BEGIN and DATA cells. A BEGIN cell contains the exit port
  that is required for the exit node to open a new exit stream.
  Subsequent DATA cells coming from the client or being sent back to the
  client contain a length field stating how many bytes of application
  data are contained in the cell.

  Exit port statistics are important to measure in order to identify
  possible load-balancing problems with respect to exit policies. Exit
  nodes that permit more ports than others are very likely overloaded
  with traffic for those ports plus traffic for other ports. Improving
  load balancing in the Tor network improves the overall utilization of
  bandwidth capacity.

  Exit traffic is one of the most sensitive parts of network data in the
  Tor network. Even though these statistics do not require looking at
  traffic contents, statistics are aggregated so that they are not
  useful for de-anonymizing users. Only those ports are reported that
  have seen at least 0.1% of exiting or incoming bytes, numbers of bytes
  are rounded up to full kibibytes (KiB), and stream numbers are rounded
  up to the next multiple of 4.

   "exit-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL
      [At most once.]

      YYYY-MM-DD HH:MM:SS defines the end of the included measurement
      interval of length NSEC seconds (86400 seconds by default).

      An "exit-stats-end" line, as well as any other "exit-*" line, is
      first added after the relay has been running for at least 24 hours
      and only if the relay permits exiting (where exiting to a single
      port and IP address is sufficient).

   "exit-kibibytes-written" port=N,port=N,... NL
      [At most once.]
   "exit-kibibytes-read" port=N,port=N,... NL
      [At most once.]

      List of mappings from ports to the number of kibibytes that the
      relay has written to or read from exit connections to that port,
      rounded up to the next full kibibyte.

   "exit-streams-opened" port=N,port=N,... NL
      [At most once.]

      List of mappings from ports to the number of opened exit streams
      to that port, rounded up to the nearest multiple of 4.


Implementation notes:

  Right now, relays that are configured accordingly write similar
  statistics to those described in this proposal to disk every 24 hours.
  With this proposal being implemented, relays include the contents of
  these files in extra-info documents.

  The following steps are necessary to implement this proposal:

  1. The current format of [dirreq|entry|buffer|exit]-stats files needs
     to be adapted to the description in this proposal. This step
     basically means renaming keywords.

  2. The timing of writing the four *-stats files should be unified, so
     that they are written exactly 24 hours after starting the
     relay. Right now, the measurement intervals for dirreq, entry, and
     exit stats starts with the first observed request, and files are
     written when observing the first request that occurs more than 24
     hours after the beginning of the measurement interval. With this
     proposal, the measurement intervals should all start at the same
     time, and files should be written exactly 24 hours later.

  3. It is advantageous to cache statistics in local files in the data
     directory until they are included in extra-info documents. The
     reason is that the 24-hour measurement interval can be very
     different from the 18-hour publication interval of extra-info
     documents. When a relay crashes after finishing a measurement
     interval, but before publishing the next extra-info document,
     statistics would get lost. Therefore, statistics are written to
     disk when finishing a measurement interval and read from disk when
     generating an extra-info document. Only the statistics that were
     appended to the *-stats files within the past 24 hours are included
     in extra-info documents. Further, the contents of the *-stats files
     need to be checked in the process of generating extra-info documents.

  4. With the statistics patches being tested, the ./configure options
     should be removed and the statistics code be compiled by default.
     It is still required for relay operators to add configuration
     options (DirReqStatistics, ExitPortStatistics, etc.) to enable
     gathering statistics. However, in the near future, statistics shall
     be enabled gathered by all relays by default, where requiring a
     ./configure option would be a barrier for many relay operators.