Intuition
Substring search (e.g. is “log” in “catalog”?) is O(t*p):
def is_substr(text: str, pattern: str) -> bool:
return any([text[i:i+len(pattern)] == pattern for i in range(len(text)-len(pattern)+1)])
Can we do it faster?
Algorithm idea
- Start by doing the same, but instead of
text[] == pattern, dohash(text[]) == hash(pattern). Whichhash()? 🤔 - Note that
hash(pattern)never changes, so only compute it once:O(p)so far. - First time, compare with
hash(text[:len(pattern)]), againO(p). - 🤯: instead of a new
O(p)hash as we slide, update the hash inO(1)by removing the evicting character and adding the incoming character to the hash. - Hash collisions exist, so one still has to check equality to confirm (another
O(p)). - 💥 BOOM 💥! Now it’s
O(p + t)on average!
Algorithm
def compute_hash(text, pattern):
hash = 0
for i in range(len(pattern)):
hash = (hash * 256 + ord(text[i])) % 101
return hash
def update_hash(hash, old_char, new_char, pattern):
h = pow(256, len(pattern)-1) # highest power of base
hash -= (ord(old_char) * h) % 101 # remove char
hash = (hash * 256 + ord(new_char)) % 101 # 👀 same as compute_hash
return hash
def rabin_karp_search(text, pattern):
p_hash = compute_hash(pattern, pattern)
t_hash = compute_hash(text, pattern)
for i in range(len(text)-len(pattern)+1):
if p_hash == t_hash and pattern == text[i:i+len(pattern)]:
return i
if i < len(text)-len(pattern): # As in "Every time except last"
t_hash = update_hash(t_hash, text[i], text[i+len(pattern)], pattern)
return -1
🤔 Understanding the hash functions
compute_hash
It’s tricky to see in reverse, but the hash is just a weighted sum (the % is just for convenience: positive & bounded):
hash("HELLO") = ( H *256^4 + E *256^3 + L *256^2 + L *256^1 + O *256^0 ) % 101
update_hash
- Adding the new char works the same as in
compute_hash - To remove
Hin the example above we’d need to-= H*256^4, but how do we figure out the4inupdate_hash? - It will always be
len(pattern-1)! Because it goes from0tonright to left, and the size islen(pattern).
🧠 Pro tips
- 👀 worst case still
O(t*p)if all hashes collide and have to check equality every time. - 256 is the “base” and 101 is the “mod”. Chosen because there are 256 characters and 101 is a large prime, which reduces chance of collision, but other numbers can be used.
%is distributive over sum, so you can% 101after every operation or at the end and it should return the same.
Done 🎉🎉🎉
Appendix: Other tutorials online
DON’T LOOK 🙈. They ALL SUCK 😱.
Appendix: Related Leetcodes
28. Implement strStr(): This problem asks you to return the index of the first occurrence of a given needle in a given haystack, or -1 if the needle is not part of the haystack.
49. Group Anagrams: This problem asks you to group all the strings in a given array that are anagrams of each other, using a hash function to encode each string.
1044. Longest Duplicate Substring: This problem asks you to find the longest substring of a given string that occurs at least twice, using binary search and Rabin Karp to check for duplicates.
187. Repeated DNA Sequences: This problem asks you to find all the 10-letter-long sequences that occur more than once in a given DNA molecule, using Rabin Karp to hash each sequence.
686. Repeated String Match: This problem asks you to find the minimum number of times you need to repeat a given string A such that another string B is a substring of it, using Rabin Karp to check for substring match.