Swear Filter - The Codex

> [!META]- Inline Metadata > [tags:: #career] > [project group::] > [difficulty:: medium] > [status:: done] > [link::] > [up:: [[Discord Interview MOC]] [[Leetcode MOC]]] > [!NOTE] > This is not an exact Leetcode problem, but an interesting way to implement and use a trie. ## Problem Implement a swear filter: given a string, replace any words on the banlist with "shazzbot." ## Solution A trie or prefix tree is a data structure that allows for extremely efficient search, which makes it ideal for string filtering. ![[Pasted image 20220930214439.png]] ### TrieNode This is the base structure of a trie. The Node doesn't need to have any behaviors, just a few attributes: - children - value - end of word marker How do we hold child nodes? Because we want fast string lookup, a dictionary works best here, with the key being the character and the value being a TrieNode object representing the next character. ```python class TrieNode: def __init__(self, value: str): self.children = {} self.value = value self.end_of_word = False ``` ### Trie A trie is relatively simple to implement in Python. The trie itself is instantiated with an empty root node. The trie has two main behaviors we care about: search and insert. ```python class Trie: def __init__(self): self.root = TrieNode(None) ``` #### Insert In this implementation, the `insert()` mention accepts a single word. First, set a local variable `current_node` to the Trie's root node. Then, lowercase the word. Next, iterate through the word, lower-casing each character and then using `setdefault` on the `children` dictionary of the current node to add a new node for that character if it doesn't already exist. In the same line, we update the current node, then continue the iteration. At the end of the iteration, we set the current node's end of word value to True. >[!NOTE] >This can cause some nodes to be an end-of-word node (like the k in "fuck" and "fucking") but also have children. As you'll see in the next section, that's okay because we will only check for the end of word marker when we're at the end of the word we're searching. To use this method, we will assume that we have a list of words to be censored and will add them one by one with `insert()`. ```python def insert(self, word: str): current_node = self.root word = word.lower() for char in word: current_node = current_node.children.setdefault(char, TrieNode(value=char)) current_node.end_of_word = True ``` ### Search Next, we need to implement search. This method will also take an individual word, but will return a boolean value indicating whether the word was found. Like in `insert()`, start with setting the current node to the root node and lowercasing the word. Then iterate through each character in the string, and set the current node to the character's node in the children dictionary. If the next node isn't present in the children dictionary, accessing will raise a `KeyError`, so use a try-except block to catch the `KeyError` and return `False` when it happens. If you get to the end of the word you're checking, you'll also want to check if the node has its `end_of_word` attribute set to `True`. If so, you found the word in your list and can return `True`, otherwise return `False`. Doing this check will prevent you from finding a constituent word of what you're trying to filter that may be innocent and filtering it regardless. Without such a check, you'd be censoring the word "mother" because "motherfucker" is on your block list. ```python def search(self, word: str) -> bool: current_node = self.root word = word.lower() for char in word: try: current_node = current_node.children[char] except KeyError: return False if current_node.end_of_word: return True return False ``` ### Putting it all together ```python class TrieNode: def __init__(self, value: str): self.children = {} self.value = value self.end_of_word = False class Trie: def __init__(self): self.root = TrieNode(None) def insert(self, word: str): current_node = self.root word = word.lower() for char in word: current_node = current_node.children.setdefault(char, TrieNode(value=char)) current_node.end_of_word = True def search(self, word: str) -> bool: current_node = self.root word = word.lower() for char in word: try: current_node = current_node.children[char] except KeyError: return False # Last, ensure that we've also reached the end of the word if current_node.end_of_word: return True return False ``` ### Use To use this, set up your list of words to be blocked, insert them into an instantiated `Trie` object, then for each message, use `.translate()` to remove punctuation and `search()` on each word in each message to see if it's in the blocklist. This is also written so that you can make use of Python's `filter()` function. ```python def trie_implementation(message: str) -> str: trie = Trie() new_message = [] for swear in PROFANITY_LIST: trie.insert(swear) translated_message = message.translate(str.maketrans("", "", string.punctuation)) for word in translated_message.split(): if trie.search(word): new_message.append("shazzbot") else: new_message.append(word) return " ".join(new_message) ``` This could be improved in several ways: - censor words self-censored with things like 1337speak - reconstruct the message with the stripped punctuation - any others? ### Adding 1337speak To do this, we will assume a non-cyclical dictionary of potential substitutions like so: ```python substitution_mappings = { "a": ["4"], "e": ["3"], "i": ["1"], # Will go infinite depth if you set l as a sub to i and vice versa "l": ["1"], "o": ["0"], "s": ["5"], "t": ["7"], } ``` All that needs to be updated here is `Trie.insert()` needs to recursively call itself with each substition for that character plus the rest of the word. This should be wrapped in a try-except because it's possible to get an index error when adding the rest of the word (however this can be handled in a number of ways). ```python def insert(self, word: str, current_node: TrieNode = None): if not current_node: current_node = self.root word = word.lower() for idx, char in enumerate(word): if substitution_mappings.get(char): for sub in substitution_mappings[char]: try: self.insert( word=f"{sub}{word[idx+1:]}", current_node=current_node ) except IndexError: self.insert(word=sub, current_node=current_node) current_node = current_node.children.setdefault(char, TrieNode(value=char)) current_node.end_of_word = True ``` ## Things I Learned Set search in Python averages O(1) because it's a hashmap, but approaches O(n) as the set grows because of increased collision risk necessitating different hash functions. ## Scratchpad