String Matching Algorithms
problem: given two strings, s and t. does s occurs as a substring of t?
naive approach:
def substringSearch(t, s):
for i in range(0..len(t)):
matched = true
for j in range(0..len(s)):
if t[i+j] != s[j]:
matched = false
break
if matched:
return "String occurs"
return "does not occur"
Runtime: O(m * n)
Rabin-Karp Algorithm:
def subStringSearch(t, s):
len_s = len(s)
pattern_hash = get_hash_val(s, 0, len_s)
substr_hash = -1
for i in range(0..len(t)-len(s)):
substr_hash = get_hash(t, i, i+len_s, substr_hash)
if substr_hash == pattern_hash:
if t[i:i+len_s].matches(s):
return true
return false
def get_hash_val(s, start, end, prev_hash=-1):
size_of_alphabet = SIZE_OF_ASCII_CHARS
lowest_ascii_char = 'a'
hash_val = 0
if prev_hash == -1:
for i in range(start..end):
hash_val *= size_of_alphabet
hash_val += (s[i] - lowest_ascii_char)
else:
start_char = s[start]
hash_val = prev_hash * size_of_alphabet + (s[end] - lowest_ascii_char) - start_char * (size_of_alphabet ^ (end-start+1))
Analysis:
if prev_hash = -1, hashing takes O(s)
inside the for loop, hashing happens in O(1) time
the hash function is designed in such a way that if two hash value equals then they are the same string.
So for good hash function, O(n)
for bad hash function O(m * n)
problem: given two strings, s and t. does s occurs as a substring of t?
naive approach:
def substringSearch(t, s):
for i in range(0..len(t)):
matched = true
for j in range(0..len(s)):
if t[i+j] != s[j]:
matched = false
break
if matched:
return "String occurs"
return "does not occur"
Runtime: O(m * n)
Rabin-Karp Algorithm:
def subStringSearch(t, s):
len_s = len(s)
pattern_hash = get_hash_val(s, 0, len_s)
substr_hash = -1
for i in range(0..len(t)-len(s)):
substr_hash = get_hash(t, i, i+len_s, substr_hash)
if substr_hash == pattern_hash:
if t[i:i+len_s].matches(s):
return true
return false
def get_hash_val(s, start, end, prev_hash=-1):
size_of_alphabet = SIZE_OF_ASCII_CHARS
lowest_ascii_char = 'a'
hash_val = 0
if prev_hash == -1:
for i in range(start..end):
hash_val *= size_of_alphabet
hash_val += (s[i] - lowest_ascii_char)
else:
start_char = s[start]
hash_val = prev_hash * size_of_alphabet + (s[end] - lowest_ascii_char) - start_char * (size_of_alphabet ^ (end-start+1))
Analysis:
if prev_hash = -1, hashing takes O(s)
inside the for loop, hashing happens in O(1) time
the hash function is designed in such a way that if two hash value equals then they are the same string.
So for good hash function, O(n)
for bad hash function O(m * n)
No comments:
Post a Comment