Trie is a tree data structure that stores a set or an associative array (i.e., a map). Since a picture is worth a thousand words, at 24fps for 89 seconds, this short video that explains tries is worth 2.1 million words:
Tries are a type of associative arrays, like hash tables. However, compared to hash tables, tries have several advantages:
- looking up a word takes O(word length) in the worst case for tries, whereas an imperfect hash table can take up to O(number of words);
- tries have no collisions;
- it is possible to walk through all keys in order;
- it is not necessary to design an hash function.
Tries do have disadvantages:
- the naïve implementation of tries uses pointers between nodes, which reduces their cache efficiency;
- if tries are naïvely stored on a storage device they perform much worse than hash tables;
- a trie may require more memory than an hash table.
So, given these pros and cons, here is a simple trie implementation:
This code assumes that the file
wlist.txt is accessible and that it contains a dictionary of words, one per line.
For example, this is how the trie that stores all the English words that start with “archi” looks like (click for a large version):
As a side node, it possible to “collapse” long chains of prefixes. For the same set of words we would get something like this:
Let’s now use a trie to solve a common
interview question problem.
T9 is a predictive text technology for mobile phones having a 3×4 keypad. Its core idea is simple: users only press each key once for each letter of the word they want to type. For example, to write “arching” a user would tap “2724464”. That’s exactly what a trie does! Compared with the previous implementation, we need to change two things:
- nodes are numbers rather than letters;
- a node is associated to a list of values because the same sequence of numbers can generate different words.
Here’s the code:
The most prominent changes are that we are using
str.translate to map a-z characters to phone digits and that each node has a list of associated values (
Here is how this “trie” looks like:
The algorithm to implement T9 is now straightforward: given a sequence of numbers, iterate over them and use each of them to select next trie node. When we run out of numbers it means that we found the longest valid prefix in the trie. We then start a depth first search starting from the last node we explored, and collect words from each node that we explore.
Here is the code that does it:
This code assumes that there is a
wlist.txt file available. Several English wordlists can be downloaded here: http://www.keithv.com/software/wlist/.
A sample output of this simple application is:
Tries are ubiquitous in all problems where prefix matching is useful. A few examples are:
- Google autocomplete: tries are augmented with words popularity;
- spell checkers: each reasonable misspelling of a word (due to insertion, deletion, or substitution of one or more character) is linked to the correct spelling, and each misspelling is linked to the correct spelling (see http://norvig.com/spell-correct.html for an overview of how simple spell correction works);
- firewalls often store IP ranges associated to a policy (e.g., drop packet, forward packet, accept packet, etc.): IP ranges can be efficiently stored and searched using a trie (for example, see this paper);
- tries are also used in bioinformatics for overlap detection in fragment assembly (ref.);
- the fastest algorithm for large data sets sorting, burstsort, is based on tries (ref.)
Whenever you are dealing with a problem where prefixes are important, tries might be the right tool.