How to create a trie in Python
Unwind is essentially correct that there are many different ways to implement a trie; and for a large, scalable trie, nested dictionaries might become cumbersome -- or at least space inefficient. But since you're just getting started, I think that's the easiest approach; you could code up a simple trie
in just a few lines. First, a function to construct the trie:
>>> _end = '_end_'
>>>
>>> def make_trie(*words):
... root = dict()
... for word in words:
... current_dict = root
... for letter in word:
... current_dict = current_dict.setdefault(letter, {})
... current_dict[_end] = _end
... return root
...
>>> make_trie('foo', 'bar', 'baz', 'barz')
{'b': {'a': {'r': {'_end_': '_end_', 'z': {'_end_': '_end_'}},
'z': {'_end_': '_end_'}}},
'f': {'o': {'o': {'_end_': '_end_'}}}}
If you're not familiar with setdefault
, it simply looks up a key in the dictionary (here, letter
or _end
). If the key is present, it returns the associated value; if not, it assigns a default value to that key and returns the value ({}
or _end
). (It's like a version of get
that also updates the dictionary.)
Next, a function to test whether the word is in the trie:
>>> def in_trie(trie, word):
... current_dict = trie
... for letter in word:
... if letter not in current_dict:
... return False
... current_dict = current_dict[letter]
... return _end in current_dict
...
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'baz')
True
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barz')
True
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barzz')
False
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'bart')
False
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'ba')
False
I'll leave insertion and removal to you as an exercise.
Of course, Unwind's suggestion wouldn't be much harder. There might be a slight speed disadvantage in that finding the correct sub-node would require a linear search. But the search would be limited to the number of possible characters -- 27 if we include _end
. Also, there's nothing to be gained by creating a massive list of nodes and accessing them by index as he suggests; you might as well just nest the lists.
Finally, I'll add that creating a directed acyclic word graph (DAWG) would be a bit more complex, because you have to detect situations in which your current word shares a suffix with another word in the structure. In fact, this can get rather complex, depending on how you want to structure the DAWG! You may have to learn some stuff about Levenshtein distance to get it right.
Implementation of a Trie in Python
There are several issues you're running into. The first is that if you have several children at the same level, you'll only be prefixing one of them with the initial part of the string, and just showing the suffix of the others. Another issue is that you're only showing leaf nodes, even though you can have terminal values that are not at a leaf (consider what happens when you use both "foo"
and "foobar"
as keys into a Trie). Finally, you're not outputting the values at all.
To solve the first issue, I suggest using a recursive generator that does the traversal of the Trie. Separating the traversal from __str__
makes things easier since the generator can simply yield
each value we come across, rather than needing to build up a string as we go. The __str__
method can assemble the final result easily using str.join
.
For the second issue, you should yield the current node's key and value whenever self.val
is not None
, rather than only at leaf nodes. As long as you don't have any way to remove values, all leaf nodes will have a value, but we don't actually need any special casing to detect that.
And for the final issue, I suggest using string formatting to make a key:value
pair. (I suppose you can skip this if you really don't need the values.)
Here's some code:
def traverse(self, prefix=""):
if self.val is not None:
yield "{}:{}".format(prefix, self.val)
for letter, child in self.children.items():
yield from child.traverse(prefix + letter)
def __str__(self):
return " | ".join(self.traverse())
If you're using a version of Python before 3.3, you'll need to replace the yield from
statement with an explicit loop to yield the items from the recursive calls:
for item in child.traverse(prefix + letter)
yield item
Example output:
>>> t = Trie()
>>> t.insert("foo", 5)
>>> t.insert("bar", 10)
>>> t.insert("foobar", 100)
>>> str(t)
'bar:10 | foo:5 | foobar:100'
Passing a list of strings to be put into trie
You aren't resetting curr
back to the root after each insert, so you're inserting the next word where the last one left off. You'd want something like:
def insert(self, words):
curr = self.root
for word in words:
for letter in word:
node = curr.children.get(letter)
if not node:
node = TrieNode()
curr.children[letter] = node
curr = node
curr.end = True
curr = self.root # Reset back to the root
I'd break this up though. I think your insert
function is doing too much, and shouldn't be dealing with multiple strings. I'd change it to something like:
def insert(self, word):
curr = self.root
for letter in word:
node = curr.children.get(letter)
if not node:
node = TrieNode()
curr.children[letter] = node
curr = node
curr.end = True
def insert_many(self, words):
for word in words:
self.insert(word) # Just loop over self.insert
Now that's a non-problem since each insert
is an independent call, and you can't forget to reset curr
.
How to implement the startsWith function of a trie in python
As you have logic in search
and startsWith
that is common, I would put that common logic in a separate method (I'll call it locate
). Then on the TreeNode
class you could define a recursive method that finds (yields) all words rooted at that node.
Code:
class TrieNode:
def __init__(self):
self.children = {}
self.end_of_word = False
def allWords(self):
if self.end_of_word:
yield ""
for char, child in self.children.items():
for word in child.allWords():
yield char + word
class Trie(object):
def __init__(self):
self.root = TrieNode()
def insert(self, word):
current = self.root
for char in word:
if char not in current.children:
current.children[char] = TrieNode()
current = current.children[char]
current.end_of_word = True
def locate(self, word):
current = self.root
for char in word:
if char not in current.children:
return
current = current.children[char]
return current
def search(self, word):
current = self.locate(word)
return current and current.end_of_word
def startsWith(self, prefix):
current = self.locate(prefix)
if not current:
return []
return [prefix + word for word in current.allWords()]
def remove(self, word):
current = self.root
if self.search(word):
for char in word:
current = current.children[char]
current.end_of_word = False
Implementing a Trie to support autocomplete in Python
You could just implement a generator that iterates over the Trie according to prefix the same way as other methods do. Once you've found the node at the end of the prefix you can use recursive generator with yield from
to iterate over the sub trie while keeping track of the prefix and yielding it when terminal node is found:
class TrieNode:
def __init__(self):
self.end = False
self.children = {}
def all_words(self, prefix):
if self.end:
yield prefix
for letter, child in self.children.items():
yield from child.all_words(prefix + letter)
class Trie:
# existing methods here
def all_words_beginning_with_prefix(self, prefix):
cur = self.root
for c in prefix:
cur = cur.children.get(c)
if cur is None:
return # No words with given prefix
yield from cur.all_words(prefix)
trie = Trie()
trie.insert('foobar')
trie.insert('foo')
trie.insert('bar')
trie.insert('foob')
trie.insert('foof')
print(list(trie.all_words_beginning_with_prefix('foo')))
Output:
['foo', 'foob', 'foobar', 'foof']
Related Topics
How to Emulate a Do-While Loop
Calling Parent Class _Init_ with Multiple Inheritance, What's the Right Way
How to Redirect Output with Subprocess in Python
"Ssl Module in Python Is Not Available" When Installing Package with Pip3
Custom Sorting in Pandas Dataframe
Are Python Variables Pointers? or Else, What Are They
Security of Python's Eval() on Untrusted Strings
Extracting Just Month and Year Separately from Pandas Datetime Column
Using Pandas to Pd.Read_Excel() for Multiple Worksheets of the Same Workbook
How to Keep Python Print from Adding Newlines or Spaces
How to Find the Time Difference Between Two Datetime Objects in Python
Should Wildcard Import Be Avoided
Is _Init_.Py Not Required for Packages in Python 3.3+
How to Specify Multiple Return Types Using Type-Hints
How to Ignore the First Line of Data When Processing CSV Data