Definition and Fundamentals

Suffix Array Concept

Suffix array: array of integers representing starting indices of all suffixes of a string sorted lexicographically. Purpose: enable efficient substring search and analysis. Introduced: Manber & Myers (1993).

Basic Terms

Suffix: substring starting at position i to end. Lexicographic order: dictionary order comparison of suffixes. Text length: n. Suffix array size: n elements.

Notation

Text T = t0t1...tn-1. Suffix starting at i: T[i..n-1]. Suffix array SA[0..n-1]: SA[i] = index of i-th smallest suffix.

Construction Algorithms

Naive Construction

Method: generate all suffixes, sort lexicographically. Time complexity: O(n² log n) due to string comparisons. Practical only for small strings.

Prefix-Doubling Algorithms

Technique: iteratively sort suffixes by doubling prefix length each step. Time complexity: O(n log n). Example: Manber & Myers algorithm. Uses radix sort for efficiency.

Linear-Time Construction

Algorithms: SA-IS (Nong et al.), Kärkkäinen-Sanders-Burkhardt (2006). Time complexity: O(n). Approach: induced sorting of suffix types, efficient classification and recursion.

Algorithm Summary Table

AlgorithmTime ComplexitySpace ComplexityRemarks
NaiveO(n² log n)O(n²)Simple, inefficient
Prefix DoublingO(n log n)O(n)Widely used, practical
SA-ISO(n)O(n)State-of-the-art, complex

SA-IS Algorithm Outline

1. Classify suffixes as S-type or L-type.2. Identify LMS (leftmost S-type) substrings.3. Sort LMS substrings recursively.4. Induce sorting of L- and S-type suffixes.5. Construct final suffix array.

Key Properties

Sorted Order

SA stores suffix indices in lexicographical order. Property: SA[i] < SA[j] implies T[SA[i]..] < T[SA[j]..]. Enables binary search.

Uniqueness

Each suffix corresponds to unique position in SA. No duplicates. Allows unambiguous substring identification.

Relation to LCP Array

LCP (Longest Common Prefix) array stores lengths of longest common prefixes between adjacent suffixes in SA. Facilitates faster pattern queries.

Space Efficiency

Space: O(n) integers, typically 4 or 8 bytes each. Suitable for large text indexing.

Applications

Pattern Matching

Use SA for fast pattern search: binary search over SA with pattern comparisons. Time: O(m log n), m=pattern length.

Data Compression

Basis for Burrows-Wheeler Transform (BWT). SA enables efficient reversible transformations.

Bioinformatics

Genome indexing and alignment. Handle large DNA sequences. Supports approximate and exact matching.

Text Indexing and Retrieval

Full-text search engines, substring frequency queries. Enables fast retrieval of substring occurrences.

Computational Linguistics

Suffix arrays assist in suffix-based analysis, phrase extraction, and language modeling.

Computational Complexity

Construction Time

Best algorithms achieve O(n) time. Prefix doubling: O(n log n). Naive: O(n² log n).

Query Time

Pattern search: O(m log n) via binary search on SA. With LCP: O(m + log n).

Space Complexity

SA stores n integers; LCP adds another n. Total: O(n) space.

Memory Usage and Optimization

Standard Memory Footprint

SA requires 4n bytes (32-bit) or 8n bytes (64-bit) for integer arrays. LCP doubles usage.

Compressed Suffix Arrays

Techniques: FM-index, CSA. Store SA implicitly, reduce memory to near entropy bounds. Tradeoff: slower queries.

Cache Efficiency

Access patterns: sequential in construction, random in queries. Blocking and cache-aware algorithms improve performance.

Practical Considerations

Large texts require external memory or succinct representations. Parallel construction improves speed.

Pattern Matching with Suffix Arrays

Binary Search Technique

Search pattern P in T by binary searching SA indices. Compare P with suffix at SA[mid]. Time: O(m log n).

Use of LCP Array

LCP array reduces redundant comparisons. Improves search to O(m + log n).

Counting Occurrences

Find lower and upper bounds of pattern in SA. Number of occurrences = upper bound - lower bound + 1.

Locating Occurrences

SA values give starting indices of matches. Enables direct text access or further processing.

Enhancements and Augmentations

Enhanced Suffix Arrays

Combine SA with LCP, child tables, and range minimum queries (RMQ) for suffix tree functionality without large space.

Bidirectional Suffix Arrays

Support searching in both directions. Useful in approximate matching and certain bioinformatics tasks.

Parallel and External Memory Construction

Algorithms designed for multi-core and disk-based data. Handle very large texts efficiently.

Comparison with Other Indexing Structures

Suffix Array vs Suffix Tree

SA: less space, simpler implementation. Suffix tree: faster queries, more complex, higher memory.

Suffix Array vs FM-index

FM-index: compressed, supports approximate matching, slower queries. SA: faster but more memory.

Suffix Array vs Trie

Trie stores prefixes explicitly; SA sorts suffixes. Trie larger, SA more compact for suffix-based queries.

Summary Table

Data StructureSpaceQuery SpeedUse Case
Suffix ArrayO(n)Fast (binary search)General substring queries
Suffix TreeO(n log n)Very fast (direct traversal)Complex pattern queries
FM-indexCompressedModerateCompressed storage, approximate search

Implementation Details

Data Structures Used

Arrays for SA, rank, and temporary storage. Auxiliary arrays for classifying suffixes.

Algorithmic Steps

Initialize ranks from characters. Iteratively sort using radix or counting sort. Update ranks after each step.

Practical Tips

Use integer alphabets for efficiency. Optimize memory allocation. Validate with small test cases.

Sample Pseudocode for Prefix Doubling

1. Assign rank based on characters.2. k = 13. While k < n:4. Sort suffixes by (rank[i], rank[i+k])5. Update ranks for next iteration6. k *= 27. Return SA

Limitations and Challenges

High Memory Usage

Storing SA and auxiliary arrays requires significant memory for large texts.

Construction Complexity

Linear-time algorithms complex to implement correctly. Debugging difficult.

Not Ideal for Dynamic Texts

SA is static index; costly to update after text modifications.

Query Speed vs Compression Tradeoff

Compressed variants reduce memory but slow queries.

References

  • Manber, U., Myers, G. "Suffix arrays: a new method for on-line string searches." SIAM Journal on Computing, Vol. 22, 1993, pp. 935–948.
  • Kärkkäinen, J., Sanders, P., Burkhardt, S. "Linear work suffix array construction." Journal of the ACM, Vol. 53, 2006, pp. 918–936.
  • Nong, G., Zhang, S., Chan, W.H. "Two efficient algorithms for linear suffix array construction." IEEE Transactions on Computers, Vol. 60, 2011, pp. 652–664.