Filename: 140-consensus-diffs.txt
Title: Provide diffs between consensuses
Author: Peter Palfrader
Created: 13-Jun-2008
Implemented-In: 0.3.1.1-alpha
Status: Closed
Ticket: https://bugs.torproject.org/13339

0. History

  22-May-2009: Restricted the ed format even more strictly for ease of
  implementation. -nickm

  25-May-2014: Adapted to the new dir-spec version 3 and made the diff urls
  backwards-compatible. -mvdan

  1-Mar-2017: Update to new stats, note newer proposals, note flavors,
  diffs, add parameters, restore diff-only URLs, say what "Digest"
  means. -nickm

  3-May-2017: Add a notion of "digest-as-signed" vs "full digest", since
  otherwise the fact that there are multiple encodings of the same valid
  consensus signatures would make clients identify which encodings they
  had been given as they asked for diffs.

  4-May-2017: Remove support for truncated digest prefixes.

1. Overview.

  Tor clients and servers need a list of which relays are on the
  network.  This list, the consensus, is created by authorities
  hourly and clients fetch a copy of it, with some delay, hourly.

  This proposal suggests that clients download diffs of consensuses
  once they have a consensus instead of hourly downloading a full
  consensus.

  This does not only apply to ordinary directory consensuses, but to the
  newer microdescriptor consensuses added in the third version of the
  directory specification.

2. Numbers

  After implementing proposal 138, which removed nodes that are not
  running from the list, a consensus document was about 92 kilobytes
  in size after compression... back in 2008 when this proposal was first
  written.

  But now in March 2017, that figure is more like 625 kilobytes.

  The diff between two consecutive consensuses, in ed format, is on
  average 37 kilobytes compressed.  So by making this change, we could
  save something like 94% of our consensus download bandwidth.

3. Proposal

3.0. Preliminaries.

  Unless otherwise specified, all digests in this document are SHA3-256
  digests, encoded in base64.  This document also uses "hash" as
  synonymous with "digest".

  A "full digest" of a consensus document covers the entire document,
  from the "network-status-version" through the newline after the final
  "-----END SIGNATURE-----".

  A "digest as signed" of a consensus document covers the same part that
  the signatures cover: the "network-status-version" through the space
  immediately after the "directory-signature" keyword on the first
  "directory-signature" line.

3.1 Clients

  If a client has a consensus that is recent enough it SHOULD
  try to download a diff to get the latest consensus rather than
  fetching a full one.

  [XXX: what is recent enough?
	time delta in hours / size of compressed diff

1:	38177
2:      66955
3:	93502
4:	118959
5:	143450
6:	167136
12:	291354
18:	404008
24:	416663
30:	431240
36:	443858
42:	454849
48:	464677
54:	476716
60:	487755
66:	497502
72:	506421

   Data suggests that for the first few hours' diffs are very useful,
   saving at least 50% for the first 12 hours.  After that, returns seem to
   be more marginal.  But note the savings from proposals like 274-276, which
   make diffs smaller over a much longer timeframe. ]


3.2 Servers

  Directory authorities and servers need to keep a number of old consensus
  documents so they can build diffs.  (See section 5 below ).  They should
  offer a diff to the most recent consensus at the following request:

  HTTP/1.0 GET /tor/status-vote/current/consensus{-Flavor}/<FPRLIST>.z
  X-Or-Diff-From-Consensus: HASH1 HASH2...

  where the hashes are the digests-as-signed of the consensuses the client
  currently has, and FPRLIST is a list of (abbreviated) fingerprints of
  authorities the client trusts.

  Servers will only return a consensus if more than half of the requested
  authorities have signed the document. Otherwise, a 404 error will be sent
  back.

  The advantage of using the same URL that is currently used for
  consensuses is that the client doesn't need to know whether a server
  supports consensus diffs.  If it doesn't, it will simply ignore the
  extra header and return the full consensus.

  If a server cannot offer a diff from one of the consensuses identified
  by one of the hashes but has a current consensus it MUST return the
  full consensus.

  [XXX: what should we do when the client already has the latest
  consensus?  I can think of the following options:
    - send back 3xx not modified
    - send back 200 ok and an empty diff
    - send back 404 nothing newer here.

    I currently lean towards the empty diff.]

  Additionally, specific diff for a given consensus digest-as-signed
  should be available a URL of the form:

    /tor/status-vote/current/consensus{-Flavor}/diff/<HASH>/<FPRLIST>.z

  This differs from the previous request type in that it should never
  return a whole consensus: if a diff is not available, it should return
  404.

4. Diff Format

  Diffs start with the token "network-status-diff-version" followed by a
  space and the version number, currently "1".

  If a document does not start with network-status-diff it is assumed
  to be a full consensus download and would therefore currently start
  with "network-status-version 3".

  Following the network-status-diff line is another header line,
  starting with the token "hash" followed by the digest-as-signed of the
  consensus that this diff applies to, and the full digest that the
  resulting consensus should have.

  Following the network-status-diff header lines is a diff, or patch, in
  limited ed format.  We choose this format because it is easy to create
  and process with standard tools (patch, diff -e, ed).  This will help
  us in developing and testing this proposal and it should make future
  debugging easier.

  [ If at one point in the future we decide that the space benefits from
    a custom diff format outweighs these benefits we can always
    introduce a new diff format and offer it at for instance
    ../diff2/... ]

  We support the following ed commands, each on a line by itself:
   - "<n1>d"          Delete line n1
   - "<n1>,<n2>d"     Delete lines n1 through n2, inclusive
   - "<n1>,$d"        Delete line n1 through the end of the file, inclusive.
   - "<n1>c"          Replace line n1 with the following block
   - "<n1>,<n2>c"     Replace lines n1 through n2, inclusive, with the
                      following block.
   - "<n1>a"          Append the following block after line n1.
   - "a"              Append the following block after the current line.

  Note that line numbers always apply to the file after all previous
  commands have already been applied.  Note also that line numbers
  are 1-indexed.

  The commands MUST apply to the file from back to front, such that
  lines are only ever referred to by their position in the original
  file.

  If there are any directory signatures on the original document, the
  first command MUST be a "<n1>,$d" form to remove all of the directory
  signatures.  Using this format ensures that the client will
  successfully apply the diff even if they have an unusual encoding for
  the signatures.

  The "current line" is either the first line of the file, if this is
  the first command, the last line of a block we added in an append or
  change command, or the line immediate following a set of lines we just
  deleted (or the last line of the file if there are no lines after
  that).

  The replace and append command take blocks.  These blocks are simply
  appended to the diff after the line with the command.  A line with
  just a period (".") ends the block (and is not part of the lines
  to add).  Note that it is impossible to insert a line with just
  a single dot.

4.1. Concatenating multiple diffs

  Directory caches may, at their discretion, return the concatenation of
  multiple diffs using the format above.  Such diffs are to be applied from
  first to last.  This allows the caches to cache a smaller number of
  compressed diffs, at the expense of some loss in bandwidth efficiency.


5. Networkstatus parameters

  The following parameters govern how relays and clients use this protocol.

     min-consensuses-age-to-cache-for-diff
       (min 0, max 744, default 6)
     max-consensuses-age-to-cache-for-diff
       (min 0, max 8192, default 72)

       These two parameters determine how much consensus history (in
       hours) relays should try to cache in order to serve diffs.

     try-diff-for-consensus-newer-than
       (min 0, max 8192, default 72)

       This parameter determines how old a consensus can be (in hours)
       before a client should no longer try to find a diff for it.