🏠

Rabin-Karp


Intuition

Substring search (e.g. is “log” in “catalog”?) is O(t*p):

def is_substr(text: str, pattern: str) -> bool:
    return any([text[i:i+len(pattern)] == pattern for i in range(len(text)-len(pattern)+1)])

Can we do it faster?

Algorithm idea

  • Start by doing the same, but instead of text[] == pattern, do hash(text[]) == hash(pattern). Which hash()? 🤔
  • Note that hash(pattern) never changes, so only compute it once: O(p) so far.
  • First time, compare with hash(text[:len(pattern)]), again O(p).
  • 🤯: instead of a new O(p) hash as we slide, update the hash in O(1) by removing the evicting character and adding the incoming character to the hash.
  • Hash collisions exist, so one still has to check equality to confirm (another O(p)).
  • 💥 BOOM 💥! Now it’s O(p + t) on average!

Algorithm


def compute_hash(text, pattern):
    hash = 0
    for i in range(len(pattern)):
        hash = (hash * 256 + ord(text[i])) % 101
    return hash

def update_hash(hash, old_char, new_char, pattern):
    h = pow(256, len(pattern)-1) # highest power of base
    hash -= (ord(old_char) * h) % 101 # remove char
    hash = (hash * 256 + ord(new_char)) % 101 # 👀 same as compute_hash
    return hash

def rabin_karp_search(text, pattern):
    p_hash = compute_hash(pattern, pattern)
    t_hash = compute_hash(text, pattern)

    for i in range(len(text)-len(pattern)+1):
        if p_hash == t_hash and pattern == text[i:i+len(pattern)]:
            return i

        if i < len(text)-len(pattern): # As in "Every time except last"
            t_hash = update_hash(t_hash, text[i], text[i+len(pattern)], pattern)

    return -1

🤔 Understanding the hash functions

compute_hash

It’s tricky to see in reverse, but the hash is just a weighted sum (the % is just for convenience: positive & bounded):

hash("HELLO") = ( H *256^4 + E *256^3 + L *256^2 + L *256^1 + O *256^0 ) % 101

update_hash

  • Adding the new char works the same as in compute_hash
  • To remove H in the example above we’d need to -= H*256^4, but how do we figure out the 4 in update_hash?
  • It will always be len(pattern-1)! Because it goes from 0 to n right to left, and the size is len(pattern).

🧠 Pro tips

  • 👀 worst case still O(t*p) if all hashes collide and have to check equality every time.
  • 256 is the “base” and 101 is the “mod”. Chosen because there are 256 characters and 101 is a large prime, which reduces chance of collision, but other numbers can be used.
  • % is distributive over sum, so you can % 101 after every operation or at the end and it should return the same.

Done 🎉🎉🎉

Appendix: Other tutorials online

DON’T LOOK 🙈. They ALL SUCK 😱.

28. Implement strStr(): This problem asks you to return the index of the first occurrence of a given needle in a given haystack, or -1 if the needle is not part of the haystack.

49. Group Anagrams: This problem asks you to group all the strings in a given array that are anagrams of each other, using a hash function to encode each string.

1044. Longest Duplicate Substring: This problem asks you to find the longest substring of a given string that occurs at least twice, using binary search and Rabin Karp to check for duplicates.

187. Repeated DNA Sequences: This problem asks you to find all the 10-letter-long sequences that occur more than once in a given DNA molecule, using Rabin Karp to hash each sequence.

686. Repeated String Match: This problem asks you to find the minimum number of times you need to repeat a given string A such that another string B is a substring of it, using Rabin Karp to check for substring match.

 

Issues & PRs welcome ♥️
Powered by Hugo - Theme beautifulhugo