Filename: 337-simpler-guard-usability.md
Title: A simpler way to decide, "Is this guard usable?"
Author: Nick Mathewson
Created: 2021-10-22
Status: Closed

Introduction

The current guard-spec describes a mechanism for how to behave when our primary guards are unreachable, and we don't know which other guards are reachable. This proposal describes a simpler method, currently implemented in Arti.

(Note that this method might not actually give different results: its only advantage is that it is much simpler to implement.)

The task at hand

For illustration, we'll assume that our primary guards are P1, P2, and P3, and our subsequent guards (in preference order) are G1, G2, G3, and so on. The status of each guard is Reachable (we think we can connect to it), Unreachable (we think it's down), or Unknown (we haven't tried it recently).

The question becomes, "What should we do when P1, P2, and P3 are Unreachable, and G1, G2, ... are all Unknown"?

In this circumstance, we could say that we only build circuits to G1, wait for them to succeed or fail, and only try G2 if we see that the circuits to G1 have failed completely. But that delays in the case that G1 is down.

Instead, the first time we get a circuit request, we try to build one circuit to G1. On the next circuit request, if the circuit to G1 isn't done yet, we launch a circuit to G2 instead. The next request (if the G1 and G2 circuits are still pending) goes to G3, and so on. But (here's the critical part!) we don't actually use the circuit to G2 unless the circuit to G1 fails, and we don't actually use the circuit to G3 unless the circuits to G1 and G2 both fail.

This approach causes Tor clients to check the status of multiple possible guards in parallel, while not actually using any guard until we're sure that all the guards we'd rather use are down.

The current algorithm and its drawbacks

For the current algorithm, see guard-spec section 4.9: circuits are exploratory if they are not using a primary guard. If such an exploratory circuit is waiting_for_better_guard, then we advance it (or not) depending on the status of all other circuits using guards that we'd rather be using.

In other words, the current algorithm is described in terms of actions to take with given circuits.

For Arti (and for other modular Tor implementations), however, this algorithm is a bit of a pain: it introduces dependencies between the guard code and the circuit handling code, requiring each one to mess with the other.

Proposal

I suggest that we describe an alternative algorithm for handing circuits to non-primary guards, to be used in preference to the current algorithm. Unlike the existing approach, it isolates the guard logic a bit better from the circuit logic.

Handling exploratory circuits

When all primary guards are Unreachable, we need to try non-primary guards. We select the first such guard (in preference order) that is neither Unreachable nor Pending. Whenever we give out such a guard, if the guard's status is Unknown, then we call that guard "Pending" until the attempt to use it succeeds or fails. We remember when the guard became Pending.

Aside: None of the above is a change from our existing specification.

After completing a circuit, the implementation must check whether its guard is usable. A guard is usable according to these rules:

Primary guards are always usable.

Non-primary guards are usable for a given circuit if every guard earlier in the preference list is either unsuitable for that circuit (e.g. because of family restrictions), or marked as Unreachable, or has been pending for at least {NONPRIMARY_GUARD_CONNECT_TIMEOUT}.

Non-primary guards are unusable for a given circuit if some guard earlier in the preference list is suitable for the circuit and Reachable.

Non-primary guards are unusable if they have not become usable after {NONPRIMARY_GUARD_IDLE_TIMEOUT} seconds.

If a circuit's guard is neither usable nor unusable immediately, the circuit is not discarded; instead, it is kept (but not used) until it becomes usable or unusable.

I am not 100% sure whether this description produces the same behavior as the current guard-spec, but it is simpler to describe, and has proven to be simpler to implement.

Implications for program design.

(This entire section is implementation detail to explain why this is a simplification from the previous algorithm. It is for explanatory purposes only and is not part of the spec.)

With this algorithm, we cut down the interaction between the guard code and the circuit code considerably, but we do not remove it entirely. Instead, there remains (in Arti terms) a pair of communication channels between the circuit manager and the guard manager:

  • Whenever a guard is given to the circuit manager, the circuit manager receives the write end of a single-use channel to report whether the guard has succeeded or failed.

  • Whenever a non-primary guard is given to the circuit manager, the circuit receives the read end of a single-use channel that will tell it whether the guard is usable or unusable. This channel doesn't report anything until the guard has one status or the other.

With this design, the circuit manager never needs to look at the list of guards, and the guard manager never needs to look at the list of circuits.

Subtleties concerning "guard success"

Note that the above definitions of a Reachable guard depend on reporting when the guard is successful or failed. This is not necessarily the same as reporting whether the circuit is successful or failed. For example, a circuit that fails after the first hop does not necessarily indicate that there's anything wrong with the guard. Similarly, we can reasonably conclude that the guard is working (at least somewhat) as long as we have an open channel to it.