Definition and Concept
Overview
Suffix tree: compressed trie representing all suffixes of a given string. Enables fast substring queries, pattern matching, and repetitive substring detection. Fundamental in string algorithms, bioinformatics, and text indexing.
Structure
Nodes: represent substrings. Edges: labeled by substrings. Leaves: correspond to suffixes, indexed by start position. Path from root to leaf spells suffix. Compression: edges combine chains of single-child nodes for space efficiency.
Terminology
Root: unique start node. Internal nodes: branching points. Edge-label: substring on edge. Suffix link: pointer from one internal node to another related node, accelerates construction.
Construction Algorithms
Naive Approach
Inserts all suffixes one by one into a trie. Time complexity: O(n²) for string length n. Inefficient for large data sets.
Weiner's Algorithm
First linear-time method (1973). Constructs suffix tree from left to right. Complex suffix links and bookkeeping.
Ukkonen's Algorithm
Online, linear-time construction. Processes suffixes incrementally. Uses suffix links to avoid recomputation. Widely used due to simplicity and efficiency.
Properties and Characteristics
Size
At most 2n-1 nodes for string length n. Edges labeled with substrings, stored as indices to original string, not copies.
Uniqueness
Unique suffix tree per string assuming unique terminal symbol appended. Guarantees deterministic structure.
Suffix Links
Connect internal nodes representing substrings differing by first character. Enables efficient tree traversal and construction.
Applications
Pattern Matching
Queries: check if a pattern appears in text. Time complexity: O(m) for pattern length m after tree built.
Longest Repeated Substring
Internal nodes with multiple leaf descendants represent repeated substrings. Max-depth node yields longest repeated substring.
Substring Statistics
Count distinct substrings, frequency of substrings, and find lexicographic order of suffixes.
Comparison with Other Structures
Suffix Array
Suffix array: array of suffix start indexes in lex order. More space-efficient but slower for some queries. Often paired with LCP array.
Trie
Trie stores all suffixes explicitly. Larger space. Suffix tree compresses trie for linear space.
Enhanced Suffix Arrays
Combine suffix arrays with extra data to simulate suffix tree queries efficiently.
| Data Structure | Space | Query Time | Construction Time |
|---|---|---|---|
| Suffix Tree | O(n) | O(m) | O(n) |
| Suffix Array | O(n) | O(m log n) | O(n log n) |
Space Complexity
Linear Space
Suffix tree requires O(n) space for string length n. Nodes and edges stored efficiently using pointers and indices.
Edge Label Storage
Edges store substring positions as pairs (start, end) referencing original string. Avoids copying substrings.
Memory Overhead
Auxiliary data (suffix links, parent pointers) increase constant factors but remain linear overall.
Time Complexity
Construction
Ukkonen's algorithm: O(n) time. Naive method: O(n²). Weiner's method: O(n).
Querying
Substring search: O(m), m = pattern length. Longest repeated substring: O(n).
Updates
Suffix trees generally static. Dynamic versions exist but complex and less practical.
Ukkonen's Algorithm
Key Concepts
Incremental suffix addition. Implicit suffix trees during construction. Active point: current position in tree for extension.
Suffix Links Usage
Accelerate traversal from one suffix to next. Avoid recomputing common prefixes repeatedly.
Phases and Extensions
Algorithm proceeds in phases, each adding one character. Extensions handle suffix insertions and tree update.
for i in 1 to n: extend suffix tree with character s[i] update active point and suffix links Generalized Suffix Trees
Definition
Extension of suffix tree to multiple strings. Represents suffixes of all input strings simultaneously.
Applications
Common substring detection, multiple pattern matching, comparative genomics.
Construction
Concatenate strings with unique delimiters. Build suffix tree on combined input. Identify string membership in leaves.
Implementation Details
Edge Representation
Edges stored as (start index, end index) pairs into input string array. Saves memory and lookup time.
Node Structure
Contains child edge map, suffix link pointer, leaf index (for leaves), and optional parent pointer.
Data Structures Used
Hash maps, balanced trees, or arrays for children. Trade-off between speed and memory.
| Component | Purpose | Details |
|---|---|---|
| Edges | Connect nodes | Store substring indices |
| Nodes | Represent substrings | Child pointers, suffix links |
| Suffix Links | Fast traversal | Connect related nodes |
Optimizations and Variants
Edge Compression
Store edge labels as indices, not strings, reducing memory. Combine unary paths.
Lazy Evaluation
Delay some computations until necessary. E.g., postpone leaf edge updates.
Sparse Suffix Trees
Store suffixes only at sampled positions. Trade-off between space and query speed.
Limitations and Challenges
Memory Usage
High constant factor in space compared to suffix arrays. May be impractical for very large texts.
Dynamic Updates
Real-time insertions/deletions challenging. Mostly static structures.
Implementation Complexity
Algorithms (especially Ukkonen's) intricate. Correctness and debugging nontrivial.
References
- E. Ukkonen, "On-line construction of suffix trees," Algorithmica, vol. 14, no. 3, pp. 249-260, 1995.
- D. Gusfield, "Algorithms on Strings, Trees and Sequences," Cambridge University Press, 1997.
- J. Kärkkäinen and P. Sanders, "Simple linear work suffix array construction," Proceedings of the 30th International Colloquium on Automata, Languages and Programming, pp. 943-955, 2003.
- M. Crochemore and W. Rytter, "Jewels of Stringology," World Scientific, 2002.