Definition and Fundamentals
Suffix Array Concept
Suffix array: array of integers representing starting indices of all suffixes of a string sorted lexicographically. Purpose: enable efficient substring search and analysis. Introduced: Manber & Myers (1993).
Basic Terms
Suffix: substring starting at position i to end. Lexicographic order: dictionary order comparison of suffixes. Text length: n. Suffix array size: n elements.
Notation
Text T = t0t1...tn-1. Suffix starting at i: T[i..n-1]. Suffix array SA[0..n-1]: SA[i] = index of i-th smallest suffix.
Construction Algorithms
Naive Construction
Method: generate all suffixes, sort lexicographically. Time complexity: O(n² log n) due to string comparisons. Practical only for small strings.
Prefix-Doubling Algorithms
Technique: iteratively sort suffixes by doubling prefix length each step. Time complexity: O(n log n). Example: Manber & Myers algorithm. Uses radix sort for efficiency.
Linear-Time Construction
Algorithms: SA-IS (Nong et al.), Kärkkäinen-Sanders-Burkhardt (2006). Time complexity: O(n). Approach: induced sorting of suffix types, efficient classification and recursion.
Algorithm Summary Table
| Algorithm | Time Complexity | Space Complexity | Remarks |
|---|---|---|---|
| Naive | O(n² log n) | O(n²) | Simple, inefficient |
| Prefix Doubling | O(n log n) | O(n) | Widely used, practical |
| SA-IS | O(n) | O(n) | State-of-the-art, complex |
SA-IS Algorithm Outline
1. Classify suffixes as S-type or L-type.2. Identify LMS (leftmost S-type) substrings.3. Sort LMS substrings recursively.4. Induce sorting of L- and S-type suffixes.5. Construct final suffix array.Key Properties
Sorted Order
SA stores suffix indices in lexicographical order. Property: SA[i] < SA[j] implies T[SA[i]..] < T[SA[j]..]. Enables binary search.
Uniqueness
Each suffix corresponds to unique position in SA. No duplicates. Allows unambiguous substring identification.
Relation to LCP Array
LCP (Longest Common Prefix) array stores lengths of longest common prefixes between adjacent suffixes in SA. Facilitates faster pattern queries.
Space Efficiency
Space: O(n) integers, typically 4 or 8 bytes each. Suitable for large text indexing.
Applications
Pattern Matching
Use SA for fast pattern search: binary search over SA with pattern comparisons. Time: O(m log n), m=pattern length.
Data Compression
Basis for Burrows-Wheeler Transform (BWT). SA enables efficient reversible transformations.
Bioinformatics
Genome indexing and alignment. Handle large DNA sequences. Supports approximate and exact matching.
Text Indexing and Retrieval
Full-text search engines, substring frequency queries. Enables fast retrieval of substring occurrences.
Computational Linguistics
Suffix arrays assist in suffix-based analysis, phrase extraction, and language modeling.
Computational Complexity
Construction Time
Best algorithms achieve O(n) time. Prefix doubling: O(n log n). Naive: O(n² log n).
Query Time
Pattern search: O(m log n) via binary search on SA. With LCP: O(m + log n).
Space Complexity
SA stores n integers; LCP adds another n. Total: O(n) space.
Memory Usage and Optimization
Standard Memory Footprint
SA requires 4n bytes (32-bit) or 8n bytes (64-bit) for integer arrays. LCP doubles usage.
Compressed Suffix Arrays
Techniques: FM-index, CSA. Store SA implicitly, reduce memory to near entropy bounds. Tradeoff: slower queries.
Cache Efficiency
Access patterns: sequential in construction, random in queries. Blocking and cache-aware algorithms improve performance.
Practical Considerations
Large texts require external memory or succinct representations. Parallel construction improves speed.
Pattern Matching with Suffix Arrays
Binary Search Technique
Search pattern P in T by binary searching SA indices. Compare P with suffix at SA[mid]. Time: O(m log n).
Use of LCP Array
LCP array reduces redundant comparisons. Improves search to O(m + log n).
Counting Occurrences
Find lower and upper bounds of pattern in SA. Number of occurrences = upper bound - lower bound + 1.
Locating Occurrences
SA values give starting indices of matches. Enables direct text access or further processing.
Enhancements and Augmentations
Enhanced Suffix Arrays
Combine SA with LCP, child tables, and range minimum queries (RMQ) for suffix tree functionality without large space.
Bidirectional Suffix Arrays
Support searching in both directions. Useful in approximate matching and certain bioinformatics tasks.
Parallel and External Memory Construction
Algorithms designed for multi-core and disk-based data. Handle very large texts efficiently.
Comparison with Other Indexing Structures
Suffix Array vs Suffix Tree
SA: less space, simpler implementation. Suffix tree: faster queries, more complex, higher memory.
Suffix Array vs FM-index
FM-index: compressed, supports approximate matching, slower queries. SA: faster but more memory.
Suffix Array vs Trie
Trie stores prefixes explicitly; SA sorts suffixes. Trie larger, SA more compact for suffix-based queries.
Summary Table
| Data Structure | Space | Query Speed | Use Case |
|---|---|---|---|
| Suffix Array | O(n) | Fast (binary search) | General substring queries |
| Suffix Tree | O(n log n) | Very fast (direct traversal) | Complex pattern queries |
| FM-index | Compressed | Moderate | Compressed storage, approximate search |
Implementation Details
Data Structures Used
Arrays for SA, rank, and temporary storage. Auxiliary arrays for classifying suffixes.
Algorithmic Steps
Initialize ranks from characters. Iteratively sort using radix or counting sort. Update ranks after each step.
Practical Tips
Use integer alphabets for efficiency. Optimize memory allocation. Validate with small test cases.
Sample Pseudocode for Prefix Doubling
1. Assign rank based on characters.2. k = 13. While k < n:4. Sort suffixes by (rank[i], rank[i+k])5. Update ranks for next iteration6. k *= 27. Return SALimitations and Challenges
High Memory Usage
Storing SA and auxiliary arrays requires significant memory for large texts.
Construction Complexity
Linear-time algorithms complex to implement correctly. Debugging difficult.
Not Ideal for Dynamic Texts
SA is static index; costly to update after text modifications.
Query Speed vs Compression Tradeoff
Compressed variants reduce memory but slow queries.
References
- Manber, U., Myers, G. "Suffix arrays: a new method for on-line string searches." SIAM Journal on Computing, Vol. 22, 1993, pp. 935–948.
- Kärkkäinen, J., Sanders, P., Burkhardt, S. "Linear work suffix array construction." Journal of the ACM, Vol. 53, 2006, pp. 918–936.
- Nong, G., Zhang, S., Chan, W.H. "Two efficient algorithms for linear suffix array construction." IEEE Transactions on Computers, Vol. 60, 2011, pp. 652–664.