String Processing
Fundamentals
String Hashing
Rabin-Karp for String Matching
Prefix function - Knuth-Morris-Pratt
Definition You are given a string of length . The prefix function for this string is defined as an array of length , where is the length of the longest proper prefix of the substring which is also a suffix of this substring.
For example, prefix function of string "abcabcd" is , and prefix function of string "aabaaab" is By definition
Brute force algo to compute is
# O(n^3)
def kmp(s):
pi = [0]*len(s)
for i in range(len(s)):
for k in range(i):
if s[:k] == s[i-k+1:i+1]:
pi[i] = k
return pi
First optimization
Observe is at most 1 larger than . Thus when moving to the next position, the value of the prefix function can either increase by one, stay the same, or decrease by some amount.
Algo: In total the function can grow at most steps (after all interations), and therefore also only can decrease a total of steps. This means we only have to perform string comparisons, and reach the complexity .
# O(n^2)
def kmp(s):
pi = [0]*len(s)
for i in range(1,len(s)): # on each iteration, pi[i] increases at most with 1
for j in range(pi[i-1]+1,0,-1):
if s[:j] == s[i-j+1:i+1]: # hit this line at most n times during all iterations from the two loops above
pi[i] = j # here pi decreases or increases at most by 1
break
return pi
Second optimization
no string comparisons
Let us compute . Notice that iff
goal is to find largest such that
def KMP(s):
pi = [0]*len(s)
for i in range(1,len(s)):
j = pi[i-1]
while j > 0 and s[i] != s[j]:
j = pi[j-1]
if j == 0:
pi[i] += s[i] == s[0]
else:
pi[i] = j+1
# if s[i] == s[j]:
# j += 1
# pi[i] = j
return pi
complexity as we increase at most times and decrease it at most times. No string comparisons.
is called prefix function = longest preffix suffix (LPS)
Problems
Need creativity to come up with why you need KMP
Search for a substring in a string
Idea
Search substring t in string s:Apply KMP(t+'#'+s) # use the hashtag for cases such as s = 'aaa', t = 'aa'
Counting the number of occurrences of each prefix
Given a string of length . In the first variation of the problem we want to count the number of appearances of each prefix in the same string.
def solve(s):
pi = KMP(s)
res = [0]*(len(s)+1)
for i in range(len(s)):
res[pi[i]] += 1
for i in range(len(s)-1,0,-1):
res[pi[i-1]] += res[i]
for i in range(len(s)+1):
res[i] += 1
return res[1:]
The number of different substring in a string
Idea
We will solve this problem iteratively. Namely we will learn, knowing the current number of different substrings, how to recompute this count by adding a character to the end.We take the string and reverse it. Now the task is transformed into computing how many prefixes there are that don't appear anywhere else.
Therefore the number of new substrings appearing when we add a new character is .
Z-function
Suffix Array
Aho-Corasick algorithm
Suffix Tree
Suffix Automaton
Lyndon factorization
Tasks
Expression parsing
Manacher's Algorithm - Finding all sub-palindromes in O(N)
- Manacher, p1, p2
- maximum product of two palins
CODE
# O(n**2)
class Solution:
def longestPalindrome(self, s: str) -> str:
def palin(i,j):
while i >= 0 and j < len(s) and s[i] == s[j]:
i -= 1
j += 1
return s[i+1:j]
res = ''
for i in range(len(s)):
odd = palin(i,i)
even = palin(i,i+1)
if len(res)<len(odd): res = odd
if len(res)<len(even): res = even
return res
# Manacher Algorithm. Find Longest Palindrome in O(n) time.
class Solution:
def longestPalindrome(self, s: str) -> str:
T = '$#'+'#'.join(s)+'#&'
P = [0]*len(T)
C,R = 0,0
for i in range(len(T)-1):
if R > i: # P[i] = (R>i) and min(R-i,P[2*C-i]) slows down
P[i] = min(R-i,P[2*C-i])
while T[i+P[i]+1] == T[i-P[i]-1]:
P[i] += 1
if R < i+P[i]:
C,R = i,i+P[i]
l = max(P)
i = P.index(l)
# P[2:-2:2] returns the sizes of largest palindrom for each i being the center (only odd length palindromes!)
#P[2:-2] returns sized of largest palindromes (odd and even)
return s[(i-l)//2:(i+l)//2]
- using Manacher to query if substring is palindome in
O(1)
per query
def Manacher(s):
T = '$#'+'#'.join(s)+'#&'
P = [0]*len(T)
C,R = 0,0
for i in range(len(T)-1):
if R > i:
P[i] = min(R-i,P[2*C-i])
while T[i+P[i]+1] == T[i-P[i]-1]:
P[i] += 1
if R < i+P[i]:
C,R = i,i+P[i]
return P[2:-2]
def palin(i,j):
if j > len(s): return False
l = j-i
c = 2*((i+j)//2)-(l%2==0)
return P[c] >= l
P = Manacher(s)
palin(i,j) # returns if s[i:j] is a palindrome