Pattern Matching | What's Your IQ

Introduction

Pattern matching -- finding all occurrences of a pattern string P within a longer text string T -- is one of the most fundamental problems in computer science. It underpins text editors, search engines, bioinformatics tools, network security systems, and countless other applications.

Several algorithms have been developed, each with different strengths. The naive algorithm is simple but slow in the worst case. KMP guarantees linear time through preprocessing. Rabin-Karp uses hashing for expected linear time and excels at multi-pattern matching. Boyer-Moore skips large portions of the text and is the fastest in practice for typical inputs.

Problem Statement

Given a text T of length n and a pattern P of length m, find all positions i where P occurs in T. That is, find all i such that T[i..i+m-1] = P. We assume m <= n.

The problem has a trivial information-theoretic lower bound of O(n): every character of the text must be examined at least once in the worst case, since any unexamined character could be the start of a match.

Naive Algorithm

procedure NaiveSearch(T, P): n = length(T), m = length(P) for i from 0 to n - m: match = true for j from 0 to m - 1: if T[i + j] != P[j]: match = false break if match: report match at position i

Time complexity: O((n - m + 1) * m) worst case, O(n) best case (when first character of P rarely matches). The worst case occurs with inputs like T = "AAAAAA" and P = "AAB".

Advantages: No preprocessing, minimal code, works well when the alphabet is large (mismatches tend to occur quickly).

Knuth-Morris-Pratt (KMP)

KMP preprocesses the pattern to build a failure function that records the longest proper prefix of P[0..j] that is also a suffix. On a mismatch, the pattern shifts to align this prefix-suffix, avoiding re-examination of text characters.

procedure KMP(T, P): pi = buildFailureFunction(P) j = 0 for i from 0 to n - 1: while j > 0 and T[i] != P[j]: j = pi[j - 1] if T[i] == P[j]: j = j + 1 if j == m: report match at i - m + 1 j = pi[j - 1]

Time: O(n + m). Space: O(m). The text pointer never backtracks, making KMP ideal for streaming input.

Rabin-Karp

Rabin-Karp computes a rolling hash of each m-length window in the text and compares it to the hash of the pattern. Only when hashes match does it verify character by character (to handle hash collisions).

procedure RabinKarp(T, P): patternHash = hash(P) textHash = hash(T[0..m-1]) for i from 0 to n - m: if textHash == patternHash: if T[i..i+m-1] == P: // verify report match at i if i < n - m: textHash = rollingHash(textHash, T[i], T[i+m])

Time: O(n + m) expected, O(nm) worst case (many collisions). Multi-pattern: Rabin-Karp easily extends to searching for multiple patterns simultaneously by maintaining a set of pattern hashes.

Boyer-Moore

Boyer-Moore compares the pattern from right to left and uses two heuristics to skip alignments:

Bad character rule: On mismatch at text character c, shift the pattern to align with the rightmost occurrence of c in the pattern (or past the pattern if c does not appear).
Good suffix rule: On mismatch, shift the pattern to align with the next occurrence of the matched suffix within the pattern.

procedure BoyerMoore(T, P): preprocess bad character table preprocess good suffix table i = m - 1 // position in text while i < n: j = m - 1 while j >= 0 and P[j] == T[i - m + 1 + j]: j = j - 1 if j < 0: report match at i - m + 1 shift by good suffix rule else: shift by max(bad character shift, good suffix shift) i = i + shift

Time: O(n/m) best case (sublinear!), O(nm) worst case without the good suffix rule. With both heuristics: O(n + m). Boyer-Moore is typically the fastest algorithm in practice for single-pattern search on natural language text.

Algorithm Comparison

Algorithm	Preprocessing	Search Time	Space	Best For
Naive	None	O(nm)	O(1)	Short patterns, large alphabets
KMP	O(m)	O(n)	O(m)	Streaming data, worst-case guarantees
Rabin-Karp	O(m)	O(n) expected	O(1)	Multiple patterns, plagiarism detection
Boyer-Moore	O(m + \|alphabet\|)	O(n/m) to O(n)	O(m + \|alphabet\|)	Long patterns, natural language

Boyer-Moore is the only algorithm among these four that can achieve sublinear search time -- it can skip entire blocks of text without examining them. This makes it especially fast when the pattern is long and the alphabet is large.

Worked Example

Search for P = "ABCABD" in T = "ABCABCABDABCABD":

Naive: Tries positions 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. At position 0: matches "ABCAB" then fails at D vs C. At position 9: full match. Total: many character comparisons.

KMP: Failure function for "ABCABD": [0,0,0,1,2,0]. After the mismatch at position 5 (T[5]='C', P[5]='D'), KMP shifts pattern by j - pi[j-1] = 5 - 2 = 3 positions. The text pointer never goes back.

Boyer-Moore: Comparing right-to-left. At position 0: P[5]='D', T[5]='C'. Bad character 'C' last appears at P[2], shift by 5-2=3. At position 3: P[5]='D', T[8]='D', then compare leftward -- partial match leads to good suffix shift. Eventually finds match at position 9.

Algorithm	Character Comparisons
Naive	~35
KMP	~20
Boyer-Moore	~12

Choosing the Right Algorithm

Use Naive when the pattern is very short (1-3 characters) or when implementation simplicity matters more than performance.
Use KMP when worst-case guarantees are needed, the text is a stream that cannot be re-read, or the pattern has many repeated prefixes (e.g., DNA sequences over a small alphabet).
Use Rabin-Karp when searching for multiple patterns simultaneously, for 2D pattern matching, or when an expected linear-time algorithm with simple implementation is sufficient.
Use Boyer-Moore for single-pattern search in large texts with moderate-to-large alphabets. It is the default choice in most practical string search implementations (e.g., grep).
Use Aho-Corasick (not covered in detail here) when searching for a large dictionary of patterns simultaneously in O(n + m + z) time, where z is the number of matches.

References

Knuth, D. E., Morris, J. H., and Pratt, V. R. "Fast Pattern Matching in Strings." SIAM Journal on Computing, vol. 6, 1977, pp. 323-350.
Karp, R. M. and Rabin, M. O. "Efficient Randomized Pattern-Matching Algorithms." IBM Journal of Research and Development, vol. 31, 1987, pp. 249-260.
Boyer, R. S. and Moore, J. S. "A Fast String Searching Algorithm." Communications of the ACM, vol. 20, 1977, pp. 762-772.
Aho, A. V. and Corasick, M. J. "Efficient String Matching: An Aid to Bibliographic Search." Communications of the ACM, vol. 18, 1975, pp. 333-340.
Charras, C. and Lecroq, T. "Handbook of Exact String-Matching Algorithms." King's College London Publications, 2004.