Trie is a tree data structure that stores a set or an associative array (i.e., a map). Since a picture is worth a thousand words, at 24fps for 89 seconds, this short video that explains tries is worth 2.1 million words:

Tries are a type of associative arrays, like hash tables. However, compared to hash tables, tries have several advantages:

  • looking up a word takes O(word length) in the worst case for tries, whereas an imperfect hash table can take up to O(number of words);
  • tries have no collisions;
  • it is possible to walk through all keys in order;
  • it is not necessary to design an hash function.

Tries do have disadvantages:

  • the naïve implementation of tries uses pointers between nodes, which reduces their cache efficiency;
  • if tries are naïvely stored on a storage device they perform much worse than hash tables;
  • a trie may require more memory than an hash table.

So, given these pros and cons, here is a simple trie implementation:

class Node:

    def __init__(self, key, value):
        self.children = {}
        self.key = key
        self.value = value


def append_word(node, word, completeword):
    if not word:
        return
    key = word[0]
    try:
        child = node.children[key]
    except KeyError:
        child = Node(key, None)
        node.children[key] = child
    if len(word) == 1:
        child.value = completeword
    else:
        append_word(child, word[1:], completeword)


def main():
    root = Node(None, None)
    with open('wlist.txt') as wlist:
        map(lambda l: append_word(root, l.strip(), l.strip()), wlist)


if __name__ == '__main__':
    main()
trie.py

This code assumes that the file wlist.txt is accessible and that it contains a dictionary of words, one per line.

For example, this is how the trie that stores all the English words that start with “archi” looks like (click for a large version):

Trie of English words starting with "archi".

As a side node, it possible to “collapse” long chains of prefixes. For the same set of words we would get something like this:

Trie of English words starting with "archi". Commn prefixes have been merged.

Let’s now use a trie to solve a common interview question problem.

T9

T9 is a predictive text technology for mobile phones having a 3×4 keypad. Its core idea is simple: users only press each key once for each letter of the word they want to type. For example, to write “arching” a user would tap “2724464”. That’s exactly what a trie does! Compared with the previous implementation, we need to change two things:

  • nodes are numbers rather than letters;
  • a node is associated to a list of values because the same sequence of numbers can generate different words.

Here’s the code:

#!/usr/bin/env python

from string import maketrans

PHONE_LETTERS = 'abcdefghijklmnopqrstuvwxyz'
PHONE_NUMBERS = '22233344455566677778889999'
# mapping to translate a-z characters to phone digits
PHONE_TRANS = maketrans(PHONE_LETTERS, PHONE_NUMBERS)

class Node:

    def __init__(self, key):
        self.children = {}
        self.key = key
        self.values = []


def append_word(node, sequence, completeword):
    if not sequence:
        return
    key = sequence[0]
    try:
        child = node.children[key]
    except KeyError:
        child = Node(key)
        node.children[key] = child
    if len(sequence) == 1:
        child.values.append(completeword)
    else:
        append_word(child, sequence[1:], completeword)


def main():
    root = Node(None)
    with open('wlist.txt') as wlist:
        # magic! str.translate uses PHONE_TRANS to translate mappings
        map(lambda l: append_word(root, l.strip().translate(PHONE_TRANS), l.strip()), wlist)


if __name__ == '__main__':
    main()

The most prominent changes are that we are using str.translate to map a-z characters to phone digits and that each node has a list of associated values (node.values).

Here is how this “trie” looks like:

Trie of words starting with "archi", with phone keypad numbers in place of letters.

The algorithm to implement T9 is now straightforward: given a sequence of numbers, iterate over them and use each of them to select next trie node. When we run out of numbers it means that we found the longest valid prefix in the trie. We then start a depth first search starting from the last node we explored, and collect words from each node that we explore.

Here is the code that does it:

#!/usr/bin/env python

import fileinput
import string

PHONE_LETTERS = 'abcdefghijklmnopqrstuvwxyz'
PHONE_NUMBERS = '22233344455566677778889999'
PHONE_TRANS = string.maketrans(PHONE_LETTERS, PHONE_NUMBERS)

class Node:

    def __init__(self, key):
        self.children = {}
        self.key = key
        self.values = []


def append_word(node, sequence, completeword):
    if not sequence:
        return
    # discard words that do not match [a-z]+
    if not all(map(lambda x: x in string.ascii_lowercase, completeword)):
        print "Discarding %s: only a-z characters allowed" % completeword
        return
    key = sequence[0]
    try:
        child = node.children[key]
    except KeyError:
        child = Node(key)
        node.children[key] = child
    if len(sequence) == 1:
        child.values.append(completeword)
    else:
        append_word(child, sequence[1:], completeword)


def lookup(node, sequence=None):
    if sequence:
        # there are still numbers in the sequence: follow them in the trie
        try:
            child = node.children[sequence[0]]
            return lookup(child, sequence[1:])
        except KeyError:
            return []
    else:
        # the sequence is empty: explore the trie using a DFS
        result = node.values[:]
        for child in node.children.values():
            result.extend(lookup(child))
        return result


def main():
    root = Node(None)
    with open('wlist.txt') as wlist:
        map(lambda l: append_word(root, l.strip().translate(PHONE_TRANS), l.strip()), wlist)
    sequence = ""
    while True:
        print "Phone keys:"
        try:
            sequence = raw_input()
        except EOFError:
            break
        if not sequence:
            break
        words = sorted(lookup(root, sequence))
        print "Words: %s" % words


if __name__ == '__main__':
    main()

This code assumes that there is a wlist.txt file available. Several English wordlists can be downloaded here: http://www.keithv.com/software/wlist/.

A sample output of this simple application is:

Phone keys:
53763
Words: ['jerod', 'jeroen', 'kernel', 'kernels', 'kernen', 'kerner', 'kernersville', 'kernes', 'kerney', 'kesner', 'lerner']

Other applications

Tries are ubiquitous in all problems where prefix matching is useful. A few examples are:

  • Google autocomplete: tries are augmented with words popularity;
  • spell checkers: each reasonable misspelling of a word (due to insertion, deletion, or substitution of one or more character) is linked to the correct spelling, and each misspelling is linked to the correct spelling (see http://norvig.com/spell-correct.html for an overview of how simple spell correction works);
  • firewalls often store IP ranges associated to a policy (e.g., drop packet, forward packet, accept packet, etc.): IP ranges can be efficiently stored and searched using a trie (for example, see this paper);
  • tries are also used in bioinformatics for overlap detection in fragment assembly (ref.);
  • the fastest algorithm for large data sets sorting, burstsort, is based on tries (ref.)

Whenever you are dealing with a problem where prefixes are important, tries might be the right tool.