Introduction

The Knuth-Morris-Pratt (KMP) algorithm, published by Donald Knuth, James Morris, and Vaughan Pratt in 1977, is a string-searching algorithm that finds occurrences of a pattern P of length m within a text T of length n in O(n + m) time. This is a significant improvement over the naive approach, which can take O(n * m) time in the worst case.

The key innovation of KMP is the failure function (also called the prefix function or partial match table), which encodes information about how the pattern matches against shifts of itself. When a mismatch occurs, the failure function tells the algorithm exactly how many characters it can skip, avoiding redundant comparisons. The text pointer never moves backward, making KMP particularly useful for streaming data where characters cannot be re-read.

Why Naive Matching Is Slow

The naive approach tries every starting position in the text:

    for i from 0 to n-m:
        j = 0
        while j < m and T[i+j] == P[j]:
            j = j + 1
        if j == m:
            report match at position i

This can perform poorly when the pattern has repeated prefixes. For example, searching for "AAAAAB" in "AAAAAAAAAB" causes many partial matches that fail at the last character, each time starting over from the next position. KMP avoids this by recognizing that after a partial match, some prefix of the pattern has already been matched and need not be re-examined.
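To make the cost concrete, here is a sketch of the naive scan in Python, with a comparison counter added for illustration (the name naive_search and the counter are illustrative, not part of any library):

```python
def naive_search(text, pattern):
    """Try every starting position; worst case O(n * m) comparisons."""
    n, m = len(text), len(pattern)
    matches = []
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
        else:
            # Inner loop completed without a mismatch: full match at i.
            matches.append(i)
    return matches, comparisons

# The repeated-prefix case from the text: every shift re-scans almost
# the whole pattern before failing at the final character.
matches, comparisons = naive_search("AAAAAAAAAB", "AAAAAB")
```

On this ten-character text the naive scan performs 30 character comparisons, re-examining the same text characters again and again; KMP would touch each text character a bounded number of times.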

    Algorithm   Worst Case   Text Pointer Backtrack?
    Naive       O(n * m)     Yes
    KMP         O(n + m)     No

The Failure Function (Prefix Table)

The failure function pi[j] for pattern P is defined as the length of the longest proper prefix of P[0..j] that is also a suffix of P[0..j]. A "proper" prefix excludes the entire string itself.

Construction

    procedure ComputeFailureFunction(P):
        m = length(P)
        pi = array of size m, all zeros
        len = 0            // length of previous longest prefix-suffix
        i = 1
        while i < m:
            if P[i] == P[len]:
                len = len + 1
                pi[i] = len
                i = i + 1
            else:
                if len != 0:
                    len = pi[len - 1]    // fall back
                else:
                    pi[i] = 0
                    i = i + 1
        return pi
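The pseudocode transcribes directly to Python. This is a minimal sketch (the name compute_failure_function is illustrative):

```python
def compute_failure_function(pattern):
    """pi[i] = length of the longest proper prefix of pattern[0..i]
    that is also a suffix of pattern[0..i]."""
    m = len(pattern)
    pi = [0] * m
    length = 0  # length of the previous longest prefix-suffix
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            pi[i] = length
            i += 1
        elif length != 0:
            length = pi[length - 1]  # fall back; do not advance i
        else:
            pi[i] = 0
            i += 1
    return pi
```

Note that in the fallback branch i does not advance: the loop retries the same pattern character against a shorter candidate prefix, which is exactly what makes the construction amortized O(m).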

Example: Failure Function for "ABABCABAB"

    Index j   0  1  2  3  4  5  6  7  8
    P[j]      A  B  A  B  C  A  B  A  B
    pi[j]     0  0  1  2  0  1  2  3  4

The failure function encodes the structure of the pattern's internal repetitions. When a mismatch occurs after matching j characters, the algorithm shifts the pattern so that the first pi[j-1] characters are already aligned, avoiding redundant comparisons.

The KMP Algorithm

    procedure KMPSearch(T, P):
        n = length(T)
        m = length(P)
        pi = ComputeFailureFunction(P)
        j = 0              // index in pattern
        for i from 0 to n-1:
            while j > 0 and T[i] != P[j]:
                j = pi[j - 1]
            if T[i] == P[j]:
                j = j + 1
            if j == m:
                report match at position i - m + 1
                j = pi[j - 1]    // continue searching

The text index i always advances forward. When a mismatch occurs, only the pattern index j retreats (using the failure function), and it never retreats past work that has already been validated. This ensures O(n) time for the search phase.

Worked Example

Text: "ABABDABABCABABCABAB", Pattern: "ABABCABAB"

Failure function: [0, 0, 1, 2, 0, 1, 2, 3, 4]

    i=0:  T[0]=A,  P[0]=A. Match. j=1.
    i=1:  T[1]=B,  P[1]=B. Match. j=2.
    i=2:  T[2]=A,  P[2]=A. Match. j=3.
    i=3:  T[3]=B,  P[3]=B. Match. j=4.
    i=4:  T[4]=D,  P[4]=C. Mismatch. j=pi[3]=2.
          T[4]=D,  P[2]=A. Mismatch. j=pi[1]=0.
          T[4]=D,  P[0]=A. Mismatch. j stays 0.
    i=5:  T[5]=A,  P[0]=A. Match. j=1.
    i=6:  T[6]=B,  P[1]=B. Match. j=2.
    i=7:  T[7]=A,  P[2]=A. Match. j=3.
    i=8:  T[8]=B,  P[3]=B. Match. j=4.
    i=9:  T[9]=C,  P[4]=C. Match. j=5.
    i=10: T[10]=A, P[5]=A. Match. j=6.
    i=11: T[11]=B, P[6]=B. Match. j=7.
    i=12: T[12]=A, P[7]=A. Match. j=8.
    i=13: T[13]=B, P[8]=B. Match. j=9 = m. Match at position 5! j=pi[8]=4.
    i=14: T[14]=C, P[4]=C. Match. j=5.
    ...continuing in the same way finds the second match at position 10.
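This trace can be replayed mechanically. The sketch below hard-codes the failure function given above and runs the search loop over the example text:

```python
# Replay the worked example; pi for "ABABCABAB" as computed earlier.
text = "ABABDABABCABABCABAB"
pattern = "ABABCABAB"
pi = [0, 0, 1, 2, 0, 1, 2, 3, 4]

j = 0
matches = []
for i, c in enumerate(text):
    while j > 0 and c != pattern[j]:
        j = pi[j - 1]          # mismatch: fall back within the pattern
    if c == pattern[j]:
        j += 1
    if j == len(pattern):
        matches.append(i - len(pattern) + 1)
        j = pi[j - 1]          # keep searching for further matches

print(matches)                 # expect [5, 10]
```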

Complexity Analysis

    Phase                           Time       Space
    Failure function construction   O(m)       O(m)
    Pattern search                  O(n)       O(1) extra
    Total                           O(n + m)   O(m)

The O(n) bound for the search phase is proved by an amortized argument: the text pointer i advances n times. The pattern pointer j can decrease (via pi) at most as many times as it has increased, and it increases at most n times. So the total number of operations involving j is at most 2n.
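This accounting can be checked empirically by instrumenting the search loop. In the sketch below (kmp_search_counted is an illustrative name), each counted step either advances i, charged once per text character, or strictly decreases j via a fallback, so the total stays at most 2n:

```python
def kmp_search_counted(text, pattern):
    """KMP search instrumented with the amortized accounting: n steps
    that advance i, plus one step per fallback that decreases j."""
    n, m = len(text), len(pattern)
    # Failure function, in the compact for-loop form.
    pi = [0] * m
    length = 0
    for k in range(1, m):
        while length > 0 and pattern[k] != pattern[length]:
            length = pi[length - 1]
        if pattern[k] == pattern[length]:
            length += 1
        pi[k] = length
    # Search phase with a fallback counter.
    matches, j, fallbacks = [], 0, 0
    for i in range(n):
        while j > 0 and text[i] != pattern[j]:
            fallbacks += 1       # j strictly decreases here
            j = pi[j - 1]
        if text[i] == pattern[j]:
            j += 1
        if j == m:
            matches.append(i - m + 1)
            j = pi[j - 1]
    # Fallbacks cannot outnumber increases of j, and j increases at
    # most once per text character, so the total charge is <= 2n.
    return matches, n + fallbacks
```

Running it on the worked example confirms both the matches and the bound: the charged steps never exceed twice the text length.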

Implementation

    def kmp_search(text, pattern):
        n, m = len(text), len(pattern)

        # Build failure function
        pi = [0] * m
        length = 0
        i = 1
        while i < m:
            if pattern[i] == pattern[length]:
                length += 1
                pi[i] = length
                i += 1
            elif length != 0:
                length = pi[length - 1]
            else:
                pi[i] = 0
                i += 1

        # Search
        matches = []
        j = 0
        for i in range(n):
            while j > 0 and text[i] != pattern[j]:
                j = pi[j - 1]
            if text[i] == pattern[j]:
                j += 1
            if j == m:
                matches.append(i - m + 1)
                j = pi[j - 1]
        return matches

Why KMP Works

The correctness of KMP rests on the observation that when a mismatch occurs at position j in the pattern, the characters P[0..j-1] have already been matched against the text. The longest proper prefix of P[0..j-1] that is also a suffix tells us how much of the pattern is already aligned with the text after shifting. We can resume matching from position pi[j-1] in the pattern without missing any potential match.

This works because any shift smaller than the one suggested by the failure function would require the text to contain a prefix-suffix alignment that does not exist (since pi gives the longest such alignment). Therefore, all skipped positions can be safely excluded.

KMP can be viewed as a finite automaton that processes the text one character at a time. The failure function defines the transition for mismatch states, and the automaton accepts when it reaches state m (all pattern characters matched).
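The automaton view can be made explicit by precomputing a full transition table, assuming a small known alphabet. The sketch below (build_dfa and its alphabet parameter are illustrative, not a standard API) derives each transition from the failure function, so the search phase becomes one table lookup per character with no fallback loop at all:

```python
def build_dfa(pattern, alphabet):
    """KMP as a DFA: dfa[state][ch] is the next state (= number of
    pattern characters matched) after reading ch in `state`."""
    m = len(pattern)
    pi = [0] * m
    length = 0
    for k in range(1, m):
        while length > 0 and pattern[k] != pattern[length]:
            length = pi[length - 1]
        if pattern[k] == pattern[length]:
            length += 1
        pi[k] = length
    dfa = [{} for _ in range(m + 1)]
    for state in range(m + 1):
        for ch in alphabet:
            if state < m and ch == pattern[state]:
                dfa[state][ch] = state + 1   # expected character: advance
            elif state == 0:
                dfa[state][ch] = 0           # nothing matched yet
            else:
                # Mismatch (or a completed match): behave exactly as
                # the fallback state pi[state-1] would on this character.
                dfa[state][ch] = dfa[pi[state - 1]][ch]
    return dfa

# Search is now a pure state walk; state m is the accepting state.
dfa = build_dfa("ABABCABAB", "ABCD")
state, matches = 0, []
for i, c in enumerate("ABABDABABCABABCABAB"):
    state = dfa[state][c]
    if state == len("ABABCABAB"):
        matches.append(i - len("ABABCABAB") + 1)
```

The trade-off is O(m * |alphabet|) preprocessing time and table space in exchange for a branch-free inner loop, which is why the explicit-DFA form is favored when the alphabet is small and the text is very long.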

Applications

  • Text Editors: Find and replace operations in editors and word processors use pattern matching as a core operation.
  • DNA Sequence Matching: Searching for specific gene sequences within genomes. KMP's linear time is essential for long genomic strings.
  • Network Intrusion Detection: Scanning network packets for known attack signatures in real time.
  • Plagiarism Detection: Finding matching text passages between documents.
  • Data Validation: Checking whether input strings conform to specific patterns or contain forbidden substrings.

References

  • Knuth, D. E., Morris, J. H., and Pratt, V. R. "Fast Pattern Matching in Strings." SIAM Journal on Computing, vol. 6, no. 2, 1977, pp. 323-350.
  • Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. "Introduction to Algorithms." MIT Press, Chapter 32, 2009.
  • Gusfield, D. "Algorithms on Strings, Trees, and Sequences." Cambridge University Press, 1997.
  • Charras, C. and Lecroq, T. "Handbook of Exact String-Matching Algorithms." King's College London Publications, 2004.