String Processing

Fundamentals

String Hashing

Rabin-Karp for String Matching

Prefix function - Knuth-Morris-Pratt

Definition You are given a string  ss  of length  nn . The prefix function for this string is defined as an array  π\pi  of length  nn , where  π[i]\pi[i]  is the length of the longest proper prefix of the substring  s[0i]s[0 \dots i]  which is also a suffix of this substring.

π[i]=maxk=0i{k:s[0:k]=s[ik+1:i+1]}\pi[i] = \max_ {k = 0 \dots i} \{k : s[0:k] = s[i-k+1:i+1] \}

For example, prefix function of string "abcabcd" is  [0,0,0,1,2,3,0][0, 0, 0, 1, 2, 3, 0] , and prefix function of string "aabaaab" is  [0,1,0,1,2,2,3][0, 1, 0, 1, 2, 2, 3]  By definition π[0]=0\pi[0] = 0

Brute force algo to compute π\pi is O(n3)O(n^3)

# O(n^3)
def kmp(s):
    pi = [0]*len(s)
    for i in range(len(s)):
        for k in range(i):
            if s[:k] == s[i-k+1:i+1]:
                pi[i] = k
    return pi

First optimization

Observe π(i+1)\pi(i+1) is at most 1 larger than π(i)\pi(i). Thus when moving to the next position, the value of the prefix function can either increase by one, stay the same, or decrease by some amount.

Algo: In total the function can grow at most  nn  steps (after all interations), and therefore also only can decrease a total of  nn  steps. This means we only have to perform  O(n)O(n)  string comparisons, and reach the complexity  O(n2)O(n^2) .

# O(n^2)
def kmp(s):
    pi = [0]*len(s)
    for i in range(1,len(s)): # on each iteration, pi[i] increases at most with 1
        for j in range(pi[i-1]+1,0,-1):
            if s[:j] == s[i-j+1:i+1]: # hit this line at most n times during all iterations from the two loops above
                pi[i] = j # here pi decreases or increases at most by 1
                break
    return pi

Second optimization

no string comparisons

Let us compute π[i]\pi[i]. Notice that π[i]=π[i1]+1\pi[i] = \pi[i-1] + 1 iff s[i]==s[π[i1]]s[i] == s[\pi[i-1]]

s0 s1 s2π[i] s3s3=si+1π[i+1]=π[i]+1  si2 si1 siπ[i] si+1s3=si+1π[i+1]=π[i]+1\underbrace{\overbrace{s_0 ~ s_1 ~ s_2}^{\pi[i]} ~ \overbrace{s_3}^{s_3 = s_{i+1}}}_{\pi[i+1] = \pi[i] + 1} ~ \dots ~ \underbrace{\overbrace{s_{i-2} ~ s_{i-1} ~ s_{i}}^{\pi[i]} ~ \overbrace{s_{i+1}}^{s_3 = s_{i + 1}}}_{\pi[i+1] = \pi[i] + 1}

goal is to find largest jπ[i1]j \leq \pi[i-1] such that s[0j1]=s[ij+1i]s[0 \dots j-1] = s[i-j+1 \dots i]

s0 s1j s2 s3π[i]  si3 si2 si1 sijπ[i] si+1\overbrace{\underbrace{s_0 ~ s_1}_j ~ s_2 ~ s_3}^{\pi[i]} ~ \dots ~ \overbrace{s_{i-3} ~ s_{i-2} ~ \underbrace{s_{i-1} ~ s_{i}}_j}^{\pi[i]} ~ s_{i+1}

def KMP(s):
    pi = [0]*len(s)
    for i in range(1,len(s)):
        j = pi[i-1]
        while j > 0 and s[i] != s[j]:
            j = pi[j-1]
        if j == 0: 
            pi[i] += s[i] == s[0]
        else: 
            pi[i] = j+1
        # if s[i] == s[j]:
        #     j += 1
        # pi[i] = j
    return pi

O(n)O(n) complexity as we increase pi[i]pi[i] at most nn times and decrease it at most nn times. No string comparisons.

pipi is called prefix function = longest preffix suffix (LPS)

Problems

Need creativity to come up with why you need KMP

Search for a substring in a string

Idea Search substring t in string s:

Apply KMP(t+'#'+s) # use the hashtag for cases such as s = 'aaa', t = 'aa'

Counting the number of occurrences of each prefix

Given a string  ss  of length  nn . In the first variation of the problem we want to count the number of appearances of each prefix  s[0i]s[0 \dots i]  in the same string.

def solve(s):
    pi = KMP(s)
    res = [0]*(len(s)+1)
    for i in range(len(s)):
        res[pi[i]] += 1
    for i in range(len(s)-1,0,-1):
        res[pi[i-1]] += res[i]
    for i in range(len(s)+1):
        res[i] += 1
    return res[1:]

The number of different substring in a string

Idea We will solve this problem iteratively. Namely we will learn, knowing the current number of different substrings, how to recompute this count by adding a character to the end.

We take the string  t=s+ct = s + c  and reverse it. Now the task is transformed into computing how many prefixes there are that don't appear anywhere else.

Therefore the number of new substrings appearing when we add a new character  cc  is  s+1πmax|s| + 1 - \pi_{\text{max}} .

Z-function

Suffix Array

Aho-Corasick algorithm

Suffix Tree

Suffix Automaton

Lyndon factorization

Tasks

Expression parsing

Manacher's Algorithm - Finding all sub-palindromes in O(N)

# O(n**2)
class Solution:
    def longestPalindrome(self, s: str) -> str:
        def palin(i,j):
            while i >= 0 and j < len(s) and s[i] == s[j]:
                i -= 1
                j += 1
            return s[i+1:j]
        res = ''
        for i in range(len(s)):
            odd = palin(i,i)
            even = palin(i,i+1)
            if len(res)<len(odd): res = odd
            if len(res)<len(even): res = even
        return res

# Manacher Algorithm. Find Longest Palindrome in O(n) time.
class Solution:
    def longestPalindrome(self, s: str) -> str:  
        T = '$#'+'#'.join(s)+'#&'
        P = [0]*len(T)
        C,R = 0,0
        for i in range(len(T)-1):
            if R > i: # P[i] = (R>i) and min(R-i,P[2*C-i]) slows down
                P[i] = min(R-i,P[2*C-i])
            while T[i+P[i]+1] == T[i-P[i]-1]:
                P[i] += 1
            if R < i+P[i]:
                C,R = i,i+P[i]
        l = max(P)
        i = P.index(l)
        # P[2:-2:2] returns the sizes of largest palindrom for each i being the center (only odd length palindromes!)
        #P[2:-2] returns sized of largest palindromes (odd and even)
        return s[(i-l)//2:(i+l)//2]

  • using Manacher to query if substring is palindome in O(1) per query
def Manacher(s):
    T = '$#'+'#'.join(s)+'#&'
    P = [0]*len(T)
    C,R = 0,0
    for i in range(len(T)-1):
        if R > i: 
            P[i] = min(R-i,P[2*C-i])
        while T[i+P[i]+1] == T[i-P[i]-1]:
            P[i] += 1
        if R < i+P[i]:
            C,R = i,i+P[i]
    return P[2:-2]

def palin(i,j):
    if j > len(s): return False
    l = j-i
    c = 2*((i+j)//2)-(l%2==0)
    return P[c] >= l
P = Manacher(s)
palin(i,j) # returns if s[i:j] is a palindrome

Finding repetitions