How to Create a Trie in Python

How to create a trie in Python

Unwind is essentially correct that there are many different ways to implement a trie; and for a large, scalable trie, nested dictionaries might become cumbersome -- or at least space inefficient. But since you're just getting started, I think that's the easiest approach; you could code up a simple trie in just a few lines. First, a function to construct the trie:

>>> _end = '_end_'
>>>
>>> def make_trie(*words):
... root = dict()
... for word in words:
... current_dict = root
... for letter in word:
... current_dict = current_dict.setdefault(letter, {})
... current_dict[_end] = _end
... return root
...
>>> make_trie('foo', 'bar', 'baz', 'barz')
{'b': {'a': {'r': {'_end_': '_end_', 'z': {'_end_': '_end_'}},
'z': {'_end_': '_end_'}}},
'f': {'o': {'o': {'_end_': '_end_'}}}}

If you're not familiar with setdefault, it simply looks up a key in the dictionary (here, letter or _end). If the key is present, it returns the associated value; if not, it assigns a default value to that key and returns the value ({} or _end). (It's like a version of get that also updates the dictionary.)

Next, a function to test whether the word is in the trie:

>>> def in_trie(trie, word):
... current_dict = trie
... for letter in word:
... if letter not in current_dict:
... return False
... current_dict = current_dict[letter]
... return _end in current_dict
...
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'baz')
True
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barz')
True
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barzz')
False
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'bart')
False
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'ba')
False

I'll leave insertion and removal to you as an exercise.

Of course, Unwind's suggestion wouldn't be much harder. There might be a slight speed disadvantage in that finding the correct sub-node would require a linear search. But the search would be limited to the number of possible characters -- 27 if we include _end. Also, there's nothing to be gained by creating a massive list of nodes and accessing them by index as he suggests; you might as well just nest the lists.

Finally, I'll add that creating a directed acyclic word graph (DAWG) would be a bit more complex, because you have to detect situations in which your current word shares a suffix with another word in the structure. In fact, this can get rather complex, depending on how you want to structure the DAWG! You may have to learn some stuff about Levenshtein distance to get it right.

Implementation of a Trie in Python

There are several issues you're running into. The first is that if you have several children at the same level, you'll only be prefixing one of them with the initial part of the string, and just showing the suffix of the others. Another issue is that you're only showing leaf nodes, even though you can have terminal values that are not at a leaf (consider what happens when you use both "foo" and "foobar" as keys into a Trie). Finally, you're not outputting the values at all.

To solve the first issue, I suggest using a recursive generator that does the traversal of the Trie. Separating the traversal from __str__ makes things easier since the generator can simply yield each value we come across, rather than needing to build up a string as we go. The __str__ method can assemble the final result easily using str.join.

For the second issue, you should yield the current node's key and value whenever self.val is not None, rather than only at leaf nodes. As long as you don't have any way to remove values, all leaf nodes will have a value, but we don't actually need any special casing to detect that.

And for the final issue, I suggest using string formatting to make a key:value pair. (I suppose you can skip this if you really don't need the values.)

Here's some code:

def traverse(self, prefix=""):
if self.val is not None:
yield "{}:{}".format(prefix, self.val)
for letter, child in self.children.items():
yield from child.traverse(prefix + letter)

def __str__(self):
return " | ".join(self.traverse())

If you're using a version of Python before 3.3, you'll need to replace the yield from statement with an explicit loop to yield the items from the recursive calls:

for item in child.traverse(prefix + letter)
yield item

Example output:

>>> t = Trie()
>>> t.insert("foo", 5)
>>> t.insert("bar", 10)
>>> t.insert("foobar", 100)
>>> str(t)
'bar:10 | foo:5 | foobar:100'

Passing a list of strings to be put into trie

You aren't resetting curr back to the root after each insert, so you're inserting the next word where the last one left off. You'd want something like:

def insert(self, words):
curr = self.root
for word in words:
for letter in word:
node = curr.children.get(letter)
if not node:
node = TrieNode()
curr.children[letter] = node
curr = node
curr.end = True
curr = self.root # Reset back to the root

I'd break this up though. I think your insert function is doing too much, and shouldn't be dealing with multiple strings. I'd change it to something like:

def insert(self, word):
curr = self.root
for letter in word:
node = curr.children.get(letter)
if not node:
node = TrieNode()
curr.children[letter] = node
curr = node
curr.end = True

def insert_many(self, words):
for word in words:
self.insert(word) # Just loop over self.insert

Now that's a non-problem since each insert is an independent call, and you can't forget to reset curr.

How to implement the startsWith function of a trie in python

As you have logic in search and startsWith that is common, I would put that common logic in a separate method (I'll call it locate). Then on the TreeNode class you could define a recursive method that finds (yields) all words rooted at that node.

Code:

class TrieNode:
def __init__(self):
self.children = {}
self.end_of_word = False

def allWords(self):
if self.end_of_word:
yield ""
for char, child in self.children.items():
for word in child.allWords():
yield char + word

class Trie(object):
def __init__(self):
self.root = TrieNode()

def insert(self, word):
current = self.root
for char in word:
if char not in current.children:
current.children[char] = TrieNode()
current = current.children[char]
current.end_of_word = True

def locate(self, word):
current = self.root
for char in word:
if char not in current.children:
return
current = current.children[char]
return current

def search(self, word):
current = self.locate(word)
return current and current.end_of_word

def startsWith(self, prefix):
current = self.locate(prefix)
if not current:
return []
return [prefix + word for word in current.allWords()]

def remove(self, word):
current = self.root
if self.search(word):
for char in word:
current = current.children[char]
current.end_of_word = False

Implementing a Trie to support autocomplete in Python

You could just implement a generator that iterates over the Trie according to prefix the same way as other methods do. Once you've found the node at the end of the prefix you can use recursive generator with yield from to iterate over the sub trie while keeping track of the prefix and yielding it when terminal node is found:

class TrieNode:
def __init__(self):
self.end = False
self.children = {}

def all_words(self, prefix):
if self.end:
yield prefix

for letter, child in self.children.items():
yield from child.all_words(prefix + letter)

class Trie:
# existing methods here
def all_words_beginning_with_prefix(self, prefix):
cur = self.root
for c in prefix:
cur = cur.children.get(c)
if cur is None:
return # No words with given prefix

yield from cur.all_words(prefix)

trie = Trie()
trie.insert('foobar')
trie.insert('foo')
trie.insert('bar')
trie.insert('foob')
trie.insert('foof')

print(list(trie.all_words_beginning_with_prefix('foo')))

Output:

['foo', 'foob', 'foobar', 'foof']


Related Topics



Leave a reply



Submit