> [!META]- Inline Metadata
> [tags:: #career]
> [project group::]
> [difficulty:: medium]
> [status:: done]
> [link::]
> [up:: [[Discord Interview MOC]] [[Leetcode MOC]]]
> [!NOTE]
> This is not an exact Leetcode problem, but an interesting way to implement and use a trie.
## Problem
Implement a swear filter: given a string, replace any words on the banlist with "shazzbot."
## Solution
A trie or prefix tree is a data structure that allows for extremely efficient search, which makes it ideal for string filtering.
![[Pasted image 20220930214439.png]]
### TrieNode
This is the base structure of a trie. The Node doesn't need to have any behaviors, just a few attributes:
- children
- value
- end of word marker
How do we hold child nodes? Because we want fast string lookup, a dictionary works best here, with the key being the character and the value being a TrieNode object representing the next character.
```python
class TrieNode:
def __init__(self, value: str):
self.children = {}
self.value = value
self.end_of_word = False
```
### Trie
A trie is relatively simple to implement in Python. The trie itself is instantiated with an empty root node. The trie has two main behaviors we care about: search and insert.
```python
class Trie:
def __init__(self):
self.root = TrieNode(None)
```
#### Insert
In this implementation, the `insert()` mention accepts a single word. First, set a local variable `current_node` to the Trie's root node. Then, lowercase the word. Next, iterate through the word, lower-casing each character and then using `setdefault` on the `children` dictionary of the current node to add a new node for that character if it doesn't already exist. In the same line, we update the current node, then continue the iteration.
At the end of the iteration, we set the current node's end of word value to True.
>[!NOTE]
>This can cause some nodes to be an end-of-word node (like the k in "fuck" and "fucking") but also have children. As you'll see in the next section, that's okay because we will only check for the end of word marker when we're at the end of the word we're searching.
To use this method, we will assume that we have a list of words to be censored and will add them one by one with `insert()`.
```python
def insert(self, word: str):
current_node = self.root
word = word.lower()
for char in word:
current_node = current_node.children.setdefault(char, TrieNode(value=char))
current_node.end_of_word = True
```
### Search
Next, we need to implement search. This method will also take an individual word, but will return a boolean value indicating whether the word was found. Like in `insert()`, start with setting the current node to the root node and lowercasing the word. Then iterate through each character in the string, and set the current node to the character's node in the children dictionary.
If the next node isn't present in the children dictionary, accessing will raise a `KeyError`, so use a try-except block to catch the `KeyError` and return `False` when it happens.
If you get to the end of the word you're checking, you'll also want to check if the node has its `end_of_word` attribute set to `True`. If so, you found the word in your list and can return `True`, otherwise return `False`.
Doing this check will prevent you from finding a constituent word of what you're trying to filter that may be innocent and filtering it regardless. Without such a check, you'd be censoring the word "mother" because "motherfucker" is on your block list.
```python
def search(self, word: str) -> bool:
current_node = self.root
word = word.lower()
for char in word:
try:
current_node = current_node.children[char]
except KeyError:
return False
if current_node.end_of_word:
return True
return False
```
### Putting it all together
```python
class TrieNode:
def __init__(self, value: str):
self.children = {}
self.value = value
self.end_of_word = False
class Trie:
def __init__(self):
self.root = TrieNode(None)
def insert(self, word: str):
current_node = self.root
word = word.lower()
for char in word:
current_node = current_node.children.setdefault(char, TrieNode(value=char))
current_node.end_of_word = True
def search(self, word: str) -> bool:
current_node = self.root
word = word.lower()
for char in word:
try:
current_node = current_node.children[char]
except KeyError:
return False
# Last, ensure that we've also reached the end of the word
if current_node.end_of_word:
return True
return False
```
### Use
To use this, set up your list of words to be blocked, insert them into an instantiated `Trie` object, then for each message, use `.translate()` to remove punctuation and `search()` on each word in each message to see if it's in the blocklist. This is also written so that you can make use of Python's `filter()` function.
```python
def trie_implementation(message: str) -> str:
trie = Trie()
new_message = []
for swear in PROFANITY_LIST:
trie.insert(swear)
translated_message = message.translate(str.maketrans("", "", string.punctuation))
for word in translated_message.split():
if trie.search(word):
new_message.append("shazzbot")
else:
new_message.append(word)
return " ".join(new_message)
```
This could be improved in several ways:
- censor words self-censored with things like 1337speak
- reconstruct the message with the stripped punctuation
- any others?
### Adding 1337speak
To do this, we will assume a non-cyclical dictionary of potential substitutions like so:
```python
substitution_mappings = {
"a": ["4"],
"e": ["3"],
"i": ["1"], # Will go infinite depth if you set l as a sub to i and vice versa
"l": ["1"],
"o": ["0"],
"s": ["5"],
"t": ["7"],
}
```
All that needs to be updated here is `Trie.insert()` needs to recursively call itself with each substition for that character plus the rest of the word. This should be wrapped in a try-except because it's possible to get an index error when adding the rest of the word (however this can be handled in a number of ways).
```python
def insert(self, word: str, current_node: TrieNode = None):
if not current_node:
current_node = self.root
word = word.lower()
for idx, char in enumerate(word):
if substitution_mappings.get(char):
for sub in substitution_mappings[char]:
try:
self.insert(
word=f"{sub}{word[idx+1:]}", current_node=current_node
)
except IndexError:
self.insert(word=sub, current_node=current_node)
current_node = current_node.children.setdefault(char, TrieNode(value=char))
current_node.end_of_word = True
```
## Things I Learned
Set search in Python averages O(1) because it's a hashmap, but approaches O(n) as the set grows because of increased collision risk necessitating different hash functions.
## Scratchpad