Introduction
The Rabin-Karp algorithm, developed by Michael O. Rabin and Richard M. Karp and published in 1987, uses hashing to find pattern occurrences in a text. Instead of comparing the pattern against every substring character by character, it computes a hash value for the pattern and for each window of the text of length m (the pattern length). Only when the hash values match does it perform a full character comparison to confirm the match.
The algorithm's efficiency depends on the rolling hash function, which updates the hash of the current window to the hash of the next window in O(1) time by removing the contribution of the outgoing character and adding that of the incoming character. This makes the overall expected running time O(n + m).
Rolling Hash Function
A rolling hash treats each string as a number in some base d (typically the alphabet size) modulo a prime q:
hash(S[0..m-1]) = (S[0] * d^(m-1) + S[1] * d^(m-2) + ... + S[m-1]) mod q

To slide the window one position to the right (removing S[i] and adding S[i+m]):

hash(S[i+1..i+m]) = (d * (hash(S[i..i+m-1]) - S[i] * d^(m-1)) + S[i+m]) mod q

| Parameter | Purpose | Typical Value |
|---|---|---|
| d (base) | Number of characters in alphabet | 256 (extended ASCII) |
| q (modulus) | Reduce hash to manageable size | Large prime (e.g., 10^9 + 7) |
| d^(m-1) mod q | Precomputed for removing leading digit | Computed once |
The choice of a large prime q minimizes the probability of hash collisions (spurious hits). With a random prime q, the probability of a false match at any given position is approximately 1/q, making collisions rare for large q.
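The rolling update can be sketched in a few lines of Python. This is a minimal illustration using the typical values from the table above (d = 256, q = 10^9 + 7); the helper names are illustrative, not from any library:

```python
D, Q = 256, 10**9 + 7  # typical base and large prime modulus

def poly_hash(s, d=D, q=Q):
    """Direct polynomial hash of s via Horner's rule, O(len(s))."""
    h = 0
    for ch in s:
        h = (d * h + ord(ch)) % q
    return h

def roll(prev, out_ch, in_ch, h, d=D, q=Q):
    """O(1) slide from hash(S[i..i+m-1]) to hash(S[i+1..i+m]).

    h is d^(m-1) mod q, the weight of the outgoing character.
    Python's % always yields a non-negative result, so no extra
    correction step is needed here.
    """
    return (d * (prev - ord(out_ch) * h) + ord(in_ch)) % q

# Sliding over "abcdef" with window length 3: each rolled hash
# agrees with the hash computed from scratch.
s, m = "abcdef", 3
h = pow(D, m - 1, Q)            # d^(m-1) mod q, computed once
win = poly_hash(s[:m])
for i in range(len(s) - m):
    win = roll(win, s[i], s[i + m], h)
    assert win == poly_hash(s[i + 1:i + m + 1])
```

Note that `pow(D, m - 1, Q)` computes the precomputed weight by fast modular exponentiation, so even the one-time setup stays cheap.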
The Algorithm
```
procedure RabinKarp(T, P, d, q):
    n = length(T), m = length(P)
    h = d^(m-1) mod q          // precompute
    pHash = 0                  // pattern hash
    tHash = 0                  // text window hash

    // Compute initial hashes
    for i from 0 to m-1:
        pHash = (d * pHash + P[i]) mod q
        tHash = (d * tHash + T[i]) mod q

    // Slide the window
    for i from 0 to n - m:
        if pHash == tHash:
            // Verify character by character
            if T[i..i+m-1] == P:
                report match at position i
        // Compute hash for next window
        if i < n - m:
            tHash = (d * (tHash - T[i] * h) + T[i + m]) mod q
            if tHash < 0:
                tHash = tHash + q
```

Spurious Hits
A spurious hit occurs when the hash of a text window matches the pattern hash, but the actual strings differ. This happens because the hash function maps many strings to the same value (pigeonhole principle). Spurious hits are handled by a full character-by-character verification when hashes match.
The expected number of spurious hits over the entire search is (n - m + 1) / q. With a prime q of size O(nm), the expected number of false matches is O(1), and the algorithm runs in O(n + m) expected time.
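A spurious hit is easy to provoke with a deliberately tiny modulus. In this hypothetical setup with d = 10 and q = 13, the distinct strings "31" and "18" hash to the same value, since 31 mod 13 = 18 mod 13 = 5, so a window "18" in the text would trigger a full verification against the pattern "31" that then fails:

```python
def digit_hash(s, d=10, q=13):
    """Polynomial hash of a digit string, deliberately tiny q = 13."""
    h = 0
    for ch in s:
        h = (d * h + int(ch)) % q
    return h

# Same hash, different strings: a spurious hit waiting to happen.
print(digit_hash("31"), digit_hash("18"))  # prints: 5 5
```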
Reducing Collisions
- Use a larger prime q to reduce collision probability
- Use double hashing (two independent hash functions) -- a match requires both hashes to agree
- Choose q to be a prime not close to a power of 2 to avoid patterns in the hash values
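The double-hashing idea can be sketched as follows: hash each string under two independent moduli and compare the pair. The two large primes below are common illustrative choices, not prescribed by the algorithm:

```python
Q1, Q2 = 10**9 + 7, 998244353  # two large primes (illustrative choices)

def double_hash(s, d=256):
    """Hash s under two independent moduli and return the pair."""
    h1 = h2 = 0
    for ch in s:
        h1 = (d * h1 + ord(ch)) % Q1
        h2 = (d * h2 + ord(ch)) % Q2
    return (h1, h2)

# A match is reported for verification only when BOTH components agree,
# pushing the per-position collision probability down to about 1/(Q1*Q2).
```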
Worked Example
Search for P = "31" in T = "2359023141", with d = 10, q = 13.
Pattern hash: (3*10 + 1) mod 13 = 31 mod 13 = 5
h = 10^(2-1) mod 13 = 10
- Window "23": hash = (2*10 + 3) mod 13 = 23 mod 13 = 10. No match.
- Window "35": hash = (10*(10 - 2*10) + 5) mod 13 = -95 mod 13 = 9. No match.
- Window "59": hash = 7. No match.
- Window "90": hash = 12. No match.
- Window "02": hash = 2. No match.
- Window "23": hash = 10. No match.
- Window "31": hash = 5. Hash match! Verify: "31" == "31". Match at position 6.
- Window "14": hash = 1. No match.
- Window "41": hash = 2. No match.

Only one hash comparison led to a verification, and it was a true match (no spurious hits in this example).
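The worked example above can be reproduced with a short script using the same parameters (d = 10, q = 13):

```python
T, P, d, q = "2359023141", "31", 10, 13
n, m = len(T), len(P)
h = pow(d, m - 1, q)                 # 10^(2-1) mod 13 = 10

# Initial hashes of the pattern and the first window T[0..1].
p_hash = t_hash = 0
for i in range(m):
    p_hash = (d * p_hash + int(P[i])) % q
    t_hash = (d * t_hash + int(T[i])) % q

window_hashes, matches = [], []
for i in range(n - m + 1):
    window_hashes.append(t_hash)
    if p_hash == t_hash and T[i:i + m] == P:
        matches.append(i)
    if i < n - m:                    # roll to the next window
        t_hash = (d * (t_hash - int(T[i]) * h) + int(T[i + m])) % q

print(window_hashes)  # [10, 9, 7, 12, 2, 10, 5, 1, 2]
print(matches)        # [6]
```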
Complexity Analysis
| Case | Time | Condition |
|---|---|---|
| Best / Average | O(n + m) | Few or no spurious hits |
| Worst case | O(nm) | Every window's hash matches, forcing full verification (e.g., all spurious hits) |
Space: O(1) extra beyond the input (just a few hash variables).
The worst case is extremely unlikely with a good hash function and large prime. Using double hashing reduces the collision probability to approximately 1/q^2 per position.
Implementation
```python
def rabin_karp(text, pattern, d=256, q=101):
    n, m = len(text), len(pattern)
    h = pow(d, m - 1, q)             # d^(m-1) mod q, precomputed once
    p_hash = 0
    t_hash = 0
    matches = []

    # Initial hashes of the pattern and the first window.
    for i in range(m):
        p_hash = (d * p_hash + ord(pattern[i])) % q
        t_hash = (d * t_hash + ord(text[i])) % q

    for i in range(n - m + 1):
        if p_hash == t_hash:
            # Verify to rule out a spurious hit.
            if text[i:i + m] == pattern:
                matches.append(i)
        if i < n - m:
            # Python's % always returns a non-negative result,
            # so no correction for negative values is needed.
            t_hash = (d * (t_hash - ord(text[i]) * h) + ord(text[i + m])) % q

    return matches
```

Multi-Pattern Search
Rabin-Karp is particularly well-suited for searching for multiple patterns simultaneously. Instead of running a separate search for each pattern, compute the hash of each pattern and store them in a hash set. Then, for each text window, check whether its hash matches any pattern's hash.
```
procedure MultiPatternRabinKarp(T, patterns, d, q):
    n = length(T)
    m = length of each pattern (assuming equal length)
    patternHashes = set of hash(P) for each P in patterns
    tHash = hash(T[0..m-1])
    for i from 0 to n - m:
        if tHash in patternHashes:
            verify T[i..i+m-1] against matching pattern(s)
        update tHash using rolling hash
```

For k patterns of the same length, this runs in O(n + km) expected time, compared to O(knm) for running the naive algorithm k times. For patterns of different lengths, multiple hash computations are needed.
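A Python sketch of the multi-pattern variant for equal-length patterns, following the style of the implementation above (the function name and bucketing scheme are illustrative):

```python
def multi_rabin_karp(text, patterns, d=256, q=10**9 + 7):
    """Search for several equal-length patterns in a single pass (sketch)."""
    m = len(patterns[0])
    if any(len(p) != m for p in patterns):
        raise ValueError("all patterns must have the same length")
    n = len(text)
    if n < m:
        return []

    def phash(s):
        v = 0
        for ch in s:
            v = (d * v + ord(ch)) % q
        return v

    # Bucket patterns by hash so collisions between patterns are handled too.
    buckets = {}
    for p in patterns:
        buckets.setdefault(phash(p), []).append(p)

    h = pow(d, m - 1, q)
    t_hash = phash(text[:m])
    matches = []
    for i in range(n - m + 1):
        for p in buckets.get(t_hash, ()):
            if text[i:i + m] == p:       # verify to rule out spurious hits
                matches.append((i, p))
        if i < n - m:
            t_hash = (d * (t_hash - ord(text[i]) * h) + ord(text[i + m])) % q
    return matches
```

For example, `multi_rabin_karp("abracadabra", ["abr", "cad"])` finds all three occurrences in one scan of the text.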
Rabin-Karp's multi-pattern capability makes it the algorithm of choice for plagiarism detection systems that compare a document against a large corpus of known texts using rolling hashes of fixed-length chunks.
Applications
- Plagiarism Detection: Systems like MOSS use Rabin-Karp fingerprinting to find matching text segments across documents.
- Spam Filtering: Matching known spam patterns against incoming messages.
- File Deduplication: Content-defined chunking uses Rabin fingerprints to identify duplicate blocks across files (used in rsync and backup systems).
- 2D Pattern Matching: Rabin-Karp extends naturally to finding a 2D pattern within a 2D text by hashing rows and then hashing columns of row-hashes.
- Bioinformatics: Finding repeated motifs in DNA or protein sequences.
References
- Karp, R. M. and Rabin, M. O. "Efficient Randomized Pattern-Matching Algorithms." IBM Journal of Research and Development, vol. 31, no. 2, 1987, pp. 249-260.
- Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. "Introduction to Algorithms." MIT Press, Chapter 32, 2009.
- Rabin, M. O. "Fingerprinting by Random Polynomials." Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
- Schleimer, S., Wilkerson, D. S., and Aiken, A. "Winnowing: Local Algorithms for Document Fingerprinting." Proceedings of the ACM SIGMOD Conference, 2003.