Introduction

The Z-algorithm is a linear-time string processing algorithm that computes the Z-array for a given string. The Z-array encodes, for each position in the string, the length of the longest substring starting at that position which matches a prefix of the string. This information is sufficient for solving pattern matching and many other string problems in O(n) time.

The Z-algorithm was introduced by Dan Gusfield in his 1997 textbook on string algorithms. While it solves the same problem as the KMP algorithm (and with the same time complexity), many practitioners find the Z-algorithm conceptually simpler and easier to implement correctly. It also provides a more direct representation of the string's structure.

The Z-Array

For a string S of length n, the Z-array Z[0..n-1] is defined as:

  • Z[0] is defined as 0 (or n, by convention -- the entire string matches its own prefix, but this value is not useful).
  • For i > 0, Z[i] is the length of the longest substring starting at position i that matches a prefix of S. Equivalently, S[0..Z[i]-1] = S[i..i+Z[i]-1].

Example

Index01234567
Saabxaaby
Z--1003100

Z[4] = 3 because S[4..6] = "aab" matches S[0..2] = "aab".

The Z-array captures all the prefix-matching information of a string in a single linear-time pass. Every Z-value corresponds to a "Z-box" -- the interval [i, i+Z[i]-1] where the substring matches the prefix.

Z-Array Construction

The linear-time construction maintains a window [L, R] representing the rightmost Z-box found so far (the interval of the form [i, i+Z[i]-1] with the largest right endpoint R).

procedure BuildZArray(S): n = length(S) Z = array of size n, all zeros L = 0, R = 0 for i from 1 to n-1: if i < R: // i is within the current Z-box Z[i] = min(R - i, Z[i - L]) // Try to extend Z[i] while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]: Z[i] = Z[i] + 1 // Update Z-box if extended past R if i + Z[i] > R: L = i R = i + Z[i] return Z

Why This Is Linear

The key observation is that the right boundary R only moves to the right. Each comparison that extends a Z-box advances R by one. Since R can advance at most n times, the total number of character comparisons across all iterations is O(n). The comparisons inside the Z-box (the min operation) take O(1) each.

Pattern Matching with Z-Algorithm

To find all occurrences of pattern P (length m) in text T (length n), concatenate them as S = P + "$" + T, where "$" is a character not in P or T. Then compute the Z-array of S. Any position i where Z[i] = m indicates that T contains P starting at position i - m - 1.

procedure ZPatternMatch(T, P): S = P + "$" + T Z = BuildZArray(S) m = length(P) for i from m + 1 to length(S) - 1: if Z[i] == m: report match at position i - m - 1 in T

The separator "$" ensures that no Z-value in the text portion can exceed m, preventing the prefix from "wrapping around" into the pattern.

Worked Example

Find P = "aab" in T = "aabxaab".

S = "aab$aabxaab"

Index: 0 1 2 3 4 5 6 7 8 9 10S: a a b $ a a b x a a bZ: - 1 0 0 3 1 0 0 3 1 0

Z[4] = 3 = m: match at position 4 - 3 - 1 = 0 in T.

Z[8] = 3 = m: match at position 8 - 3 - 1 = 4 in T.

Result: Pattern found at positions 0 and 4 in the text.

Complexity Analysis

OperationTimeSpace
Z-array constructionO(n)O(n)
Pattern matching scanO(n + m)O(n + m)
TotalO(n + m)O(n + m)

The space can be reduced to O(m) if we process the text portion on-the-fly and only store the Z-array for the pattern and a sliding window. However, the standard implementation uses O(n + m) space for the concatenated string.

Implementation

def z_function(s): n = len(s) z = [0] * n l, r = 0, 0 for i in range(1, n): if i < r: z[i] = min(r - i, z[i - l]) while i + z[i] < n and s[z[i]] == s[i + z[i]]: z[i] += 1 if i + z[i] > r: l, r = i, i + z[i] return zdef z_search(text, pattern): s = pattern + "$" + text z = z_function(s) m = len(pattern) return [i - m - 1 for i in range(m + 1, len(s)) if z[i] == m]

Comparison with KMP

AspectZ-AlgorithmKMP
Time complexityO(n + m)O(n + m)
SpaceO(n + m)O(m)
PreprocessingZ-array on concatenationFailure function on pattern
Conceptual clarityMore intuitiveMore abstract
Streaming supportRequires concatenationNatural streaming

The Z-array and KMP failure function are mathematically related: the failure function can be derived from the Z-array and vice versa. They encode the same information about the string's prefix structure in different forms.

While KMP is better suited for streaming applications (it processes the text without needing to concatenate it with the pattern), the Z-algorithm is often preferred in competitive programming for its simpler implementation and broader applicability to other string problems.

Applications

  • Pattern matching: The primary application, equivalent to KMP but with a different approach.
  • String period detection: A string S has period p if S[i] = S[i+p] for all valid i. The smallest period can be found using the Z-array.
  • Longest repeated prefix: Find the longest prefix of S that appears elsewhere in S. This is simply the maximum value in the Z-array.
  • String compression: Determine if a string is a concatenation of copies of a shorter string. If Z[p] = n - p for some p dividing n, then S consists of n/p copies of S[0..p-1].
  • Distinct substrings: Combined with suffix arrays, the Z-algorithm helps count the number of distinct substrings of a string.

References

  • Gusfield, D. "Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology." Cambridge University Press, Chapter 1, 1997.
  • Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. "Introduction to Algorithms." MIT Press, Chapter 32, 2009.
  • Crochemore, M. and Rytter, W. "Jewels of Stringology." World Scientific, 2002.
  • Halim, S. and Halim, F. "Competitive Programming 3." Lulu Press, Chapter 6, 2013.