String Matching, Karp-Rabin

String Matching Problem

给定两个字符串 s 和 t，判断 s 是否是 t 的子串？如果是，出现在哪？出现了多少次？

Naive/Brute Force Algorithm

def match(s, t):
    return any(s == t[i:i+len(s)] for i in range(len(t)-len(s))

子串比较的复杂度： $O(|s|)$ ，因此算法时间复杂度为 $O(|s|*(|t|-|s|))$ ，近似为 $O(|s|*|t|)$

Karp-Rabin Algorithm

Rolling Hash ADT

在 Naive Algorithm 中，我们每次比较子串的复杂度都是 $O(|s|)$ ，但实际上，前后两次比较中，即 $s == t[i:i+len(s)]$ 与 $s == t[i+1:i+len(s)+1]$ ，两个 t 的子串有 $len(s)-1$ 个字母是重复的，如果能够不用重复比较这些比较过的字母，就有可能将比较子串的复杂度将到 $O(1)$ 。

于是 Rolling Hash ADT 诞生了：它维护一个字符串 x，同时提供以下三个方法

方法

功能

r()

r 就是 hash function，r() 返回当前子串的哈希值 h(x)

r.append(c)

将字母 c 放到 x 的末尾

r.skip(c)

将 x 的第一个字母移除

Karp-Rabin

def karp_r(s, t):
    for c in s:
        rs.append(c)
    for c in t[:len(s)]:
        rt.append(c)
    if rs() == rt():
        # compare char by char
        pass
    
    for i in range(len(s), len(t)):
        rt.skip(t[i-len(s)])
        rt.append(t[i])
        if rs() == rt():
            # compare char by char
            pass

分析：

如果 Rolling Hash ADT 能够做到：

两个不同的子串的哈希值相等的概率 < $\frac{1}{|s|}$

Karp-Rabin Algorithm 的算法复杂度就是：

O(|s| + |t|*|s|*\frac{1}{|s|}) = O(|s| + |t|)

An Rolling Hash Data Structure

想象 string x 是一个基数（base）为 a 的多位数（multi-digit number）u ：

方法

基本实现

优化实现

r()

$u \bmod p$ ，其中 p 是接近 |s| 的随机质数，r 中保存着 u 和 x

$u \bmod p$ ，其中 p 是接近 |s| 的随机质数， r 中保存着 $u \bmod p$ 和 $|x|$ ，而并非 u

r.append(c)

$u = u·a + ord(c)$

$(u · a + ord(c))\bmod p$ 即 $[(u \bmod p)·a + ord(c)] \bmod p$

r.skip(c)

$u = u - c·a^{|x|-1}$

$[u - ord(c)·(a^{|x|-1} \bmod p)] \bmod p$ 即

$[(u \bmod p) - ord(c)·(a^{|x|-1} \bmod p)] \bmod p$

在实际问题中 a 就是所有可能的字符数量。

PreviousWeighted Shortest Paths NextPriority Queue Interface & Implementation

Last updated 6 years ago