Filename: 194-mnemonic-urls.txt
Title: Mnemonic .onion URLs
Author: Sai, Alex Fink
Created: 29-Feb-2012
Status: Superseded
1. Overview
Currently, canonical Tor .onion URLs consist of a naked 80-bit hash[1]. This
is not something that users can even recognize for validity, let alone produce
directly. It is vulnerable to partial-match fuzzing attacks[2], where a
would-be MITM attacker generates a very similar hash and uses various social
engineering, wiki poisoning, or other methods to trick the user into visiting
the spoof site.
This proposal gives an alternative method for displaying and entering .onion
and other URLs, such that they will be easily remembered and generated by end
users, and easily published by hidden service websites, without any dependency
on a full domain name type system like e.g. namecoin[3]. This makes it easier
to implement (requiring only a change in the proxy).
This proposal could equally be used for IPv4, IPv6, etc, if normal DNS is for
some reason untrusted.
This is not a petname system[4], in that it does not allow service providers
or users[5] to associate a name of their choosing to an address[6]. Rather, it
is a mnemonic system that encodes the 80 bit .onion address into a
meaningful[7] and memorable sentence. A full petname system (based on
registration of some kind, and allowing for shorter, service-chosen URLs) can
be implemented in parallel[8].
This system has the three properties of being secure, distributed, and
human-meaningful — it just doesn't also have choice of name (except of course
by brute force creation of multiple keys to see if one has an encoding the
operator likes).
This is inspired by Jonathan Ackerman's "Four Little Words" proposal[9] for
doing the same thing with IPv4 addresses. We just need to handle 80+ bits, not
just 32 bits.
It is similar to Markus Jakobsson & Ruj Akavipat's FastWord system[10], except
that it does not permit user choice of passphrase, does not know what URL a
user will enter (vs verifying against a single stored password), and again has
to encode significantly more data.
This is also similar to RFC1751[11], RFC2289[12], and multiple other
fingerprint encoding systems[13] (e.g. PGPfone[14] using the PGP
wordlist[15], and Arturo Filatsò's OnionURL[16]), but we aim to make something
that's as easy as possible for users to remember — and significantly easier
than just a list of words or pseudowords, which we consider only useful as an
active confirmation tool, not as something that can be fully memorized and
recalled, like a normal domain name.
2. Requirements
2.1. encodes at least 80 bits of random data (preferably more, eg for a
checksum)
2.2. valid, visualizable English sentence — not just a series of words[17]
2.3. words are common enough that non-native speakers and bad spellers will have
minimum difficulty remembering and producing (perhaps with some spellcheck help)
2.4. not syntactically confusable (e.g. order should not matter)
2.5. short enough to be easily memorized and fully recalled at will, not just
recognized
2.6. no dependency on an external service
2.7. dictionary size small enough to be reasonable for end users to download as
part of the onion package
2.8. consistent across users (so that websites can e.g. reinforce their random
hash's phrase with a clever drawing)
2.9. not create offensive sentences that service providers will reject
2.10. resistant against semantic fuzzing (e.g. by having uniqueness against
WordNet synsets[18])
3. Possible implementations
This section is intentionally left unfinished; full listing of template
sentences and the details of their parser and generating implementation is
co-dependent on the creation of word class dictionaries fulfilling the above
criteria. Since that's fairly labor-intensive, we're pausing at this stage for
input first, to avoid wasting work.
3.1. Have a fixed number of template sentences, such as:
1. Adj subj adv vtrans adj obj
2. Subj and subj vtrans adj obj
3. … etc
For a 6 word sentence, with 8 (3b) templates, we need ~12b (4k word)
dictionaries for each word category.
If multiple words of the same category are used, they must either play
different grammatical roles (eg subj vs obj, or adj on a different item), be
chosen from different dictionaries, or there needs to be an order-agnostic way
to join them at the bit level. Preferably this should be avoided, just to
prevent users forgetting the order.
3.2. As 3.1, but treat sentence generation as decoding a prefix code, and have
a Huffman code for each word class.
We suppose it’s okay if the generated sentence has a few more words than it
might, as long as they’re common lean words. E.g., for adjectives, "good"
might cost only six bits while "unfortunate" costs twelve.
Choice between different sentence syntaxes could be worked into the prefix
code as well, and potentially done separately for each syntactic constituent.
4. Usage
To form mnemonic .onion URL, just join the words with dashes or underscores,
stripping minimal words like 'a', 'the', 'and' etc., and append '.onion'. This
can be readily distinguished from standard hash-style .onion URLs by form.
Translation should take place at the client — though hidden service servers
should also be able to output the mnemonic form of hashes too, to assist
website operators in publishing them (e.g. by posting an amusing drawing of
the described situation on their website to reinforce the mnemonic).
After the translation stage of name resolution, everything proceeds as normal
for an 80-bit hash onion URL.
The user should be notified of the mnemonic form of hash URL in some way, and
have an easy way in the client UI to translate mnemonics to hashes and vice
versa. For the purposes of browser URLs and the like though, the mnemonic
should be treated on par with the hash; if the user enters a mnemonic URL they
should not become redirected to the hash version. (If anything, the opposite
may be true, so that users become used to seeing and verifying the mnemonic
version of hash URLs, and gain the security benefits against partial-match
fuzzing.)
Ideally, inputs that don't validly resolve should have a response page served
by the proxy that uses a simple spell-check system to suggest alternate domain
names that are valid hash encodings. This could hypothetically be done inline
in URL input, but would require changes on the browser (normally domain names
aren't subject so spellcheck), and this avoids that implementation problem.
5. International support
It is not possible for this scheme to support non-English languages without
a) (usually) Unicode in domains (which is not yet well supported by browsers),
and
b) fully customized dictionaries and phrase patterns per language
The scheme must not be used in an attempted 'translation' by simply replacing
English words with glosses in the target language. Several of the necessary
features would be completely mangled by this (e.g. other languages have
different synonym, homonym, etc groupings, not to mention completely different
grammar).
It is unlikely a priori that URLs constructed using a non-English
dictionary/pattern setup would in any sense 'translate' semantically to
English; more likely is that each language would have completely unrelated
encodings for a given hash.
We intend to only make an English version at first, to avoid these issues
during testing.
________________
[1] https://trac.torproject.org/projects/tor/wiki/doc/HiddenServiceNames
https://gitweb.torproject.org/torspec.git/blob/HEAD:/address-spec.txt
[2] http://www.thc.org/papers/ffp.html
[3] http://dot-bit.org/Namecoin
[4] https://en.wikipedia.org/wiki/Zooko's_triangle
[5] https://addons.mozilla.org/en-US/firefox/addon/petname-tool/
[6] However, service operators can generate a large number of hidden service
descriptors and check whether their hashes result in a desirable phrasal
encoding (much like certain hidden services currently use brute force generated
hashes to ensure their name is the prefix of their raw hash). This won't get you
whatever phrase you want, but will at least improve the likelihood that it's
something amusing and acceptable.
[7] "Meaningful" here inasmuch as e.g. "Barnaby thoughtfully mangles simplistic
yellow camels" is an absurdist but meaningful sentence. Absurdness is a feature,
not a bug; it decreases the probability of mistakes if the scenario described is
not one that the user would try to fit into a template of things they have
previously encountered IRL. See research into linguistic schema for further
details.
[8] https://gitweb.torproject.org/torspec.git/blob/HEAD:/proposals/ideas/xxx-oni
on-nyms.txt
[9] http://blog.rabidgremlin.com/2010/11/28/4-little-words/
[10] http://fastword.me/
[11] https://tools.ietf.org/html/rfc1751
[12] http://tools.ietf.org/html/rfc2289
[13] https://github.com/singpolyma/mnemonicode
http://mysteryrobot.com
https://github.com/zacharyvoase/humanhash
[14] http://www.mathcs.duq.edu/~juola/papers.d/icslp96.pdf
[15] http://en.wikipedia.org/wiki/PGP_word_list
[16] https://github.com/hellais/Onion-url
https://github.com/hellais/Onion-url/blob/master/dev/mnemonic.py
[17] http://www.reddit.com/r/technology/comments/ecllk
[18] http://wordnet.princeton.edu/wordnet/man2.1/wnstats.7WN.html