1. Introduction
The Longest Common Subsequence (LCS) problem asks us to find the longest sequence of characters that appears in both of two given strings, where the characters must appear in the same relative order but need not be contiguous. Unlike the longest common substring problem, the characters in the subsequence do not need to be adjacent.
"The LCS problem is one of the classic problems in computer science, and its solution by dynamic programming is one of the most elegant demonstrations of the technique." -- Thomas H. Cormen, Introduction to Algorithms
For example, the LCS of "ABCBDAB" and "BDCAB" is "BCAB" (length 4). Note that "BDAB" is also a valid LCS of the same length -- the LCS is not necessarily unique.
2. Problem Definition
Input: Two strings X of length m and Y of length n.
Output: The length of the longest common subsequence of X and Y, and optionally the subsequence itself.
Subsequence vs. Substring: A subsequence maintains relative order but allows gaps. "ACE" is a subsequence of "ABCDE" but not a substring. "BCD" is both a subsequence and a substring of "ABCDE".
| String X | String Y | LCS | Length |
|---|---|---|---|
| ABCBDAB | BDCAB | BCAB | 4 |
| AGGTAB | GXTXAYB | GTAB | 4 |
| ABC | DEF | (empty) | 0 |
3. Brute Force Approach
The brute force approach generates all 2^m subsequences of X, checks each against Y, and returns the longest match. This is prohibitively expensive.
functionlcsBrute(X, Y, m, n): if m == 0or n == 0: return0if X[m-1] == Y[n-1]: return1 + lcsBrute(X, Y, m-1, n-1) else: returnmax( lcsBrute(X, Y, m-1, n), lcsBrute(X, Y, m, n-1) )The recursive solution has time complexity O(2^(m+n)) in the worst case, as each call may branch into two subcalls.
4. Optimal Substructure
The LCS problem has optimal substructure. Consider the last characters of X and Y:
- If X[m] == Y[n]: This character must be part of the LCS. The LCS length is 1 plus the LCS of X[1..m-1] and Y[1..n-1].
- If X[m] != Y[n]: At least one of the last characters is not in the LCS. We take the maximum of LCS(X[1..m-1], Y[1..n]) and LCS(X[1..m], Y[1..n-1]).
The overlapping subproblems arise because both branches may eventually need the LCS of the same pair of prefixes.
5. Recurrence Relation
Let dp[i][j] represent the length of the LCS of X[1..i] and Y[1..j].
Recurrence:
If X[i] == Y[j]: dp[i][j] = dp[i-1][j-1] + 1
If X[i] != Y[j]: dp[i][j] = max(dp[i-1][j], dp[i][j-1])
Base case:dp[i][0] = 0 and dp[0][j] = 0 for all i, j
6. Top-Down (Memoization)
functionlcsMemo(X, Y, m, n, memo): if m == 0or n == 0: return0if memo[m][n] != -1: return memo[m][n] if X[m-1] == Y[n-1]: memo[m][n] = 1 + lcsMemo(X, Y, m-1, n-1, memo) else: memo[m][n] = max( lcsMemo(X, Y, m-1, n, memo), lcsMemo(X, Y, m, n-1, memo) ) return memo[m][n]7. Bottom-Up (Tabulation)
functionlcsDP(X, Y): m = length(X) n = length(Y) dp = array[m+1][n+1], initialized to 0for i = 1to m: for j = 1to n: if X[i-1] == Y[j-1]: dp[i][j] = dp[i-1][j-1] + 1else: dp[i][j] = max(dp[i-1][j], dp[i][j-1]) return dp[m][n]8. Worked Example
Let us compute the LCS of X = "ABCB" and Y = "BDCAB".
DP Table
| "" | B | D | C | A | B | |
|---|---|---|---|---|---|---|
| "" | 0 | 0 | 0 | 0 | 0 | 0 |
| A | 0 | 0 | 0 | 0 | 1 | 1 |
| B | 0 | 1 | 1 | 1 | 1 | 2 |
| C | 0 | 1 | 1 | 2 | 2 | 2 |
| B | 0 | 1 | 1 | 2 | 2 | 3 |
The LCS length is dp[4][5] = 3.
Key Cell Explanations
dp[1][4]: X[1]='A' matches Y[4]='A', sodp[0][3] + 1 = 1.dp[2][1]: X[2]='B' matches Y[1]='B', sodp[1][0] + 1 = 1.dp[3][3]: X[3]='C' matches Y[3]='C', sodp[2][2] + 1 = 2.dp[4][5]: X[4]='B' matches Y[5]='B', sodp[3][4] + 1 = 3.
9. Backtracking the LCS
To reconstruct the actual LCS string, we trace back through the DP table starting from dp[m][n].
functionfindLCS(dp, X, Y, m, n): lcs = "" i = m, j = n while i > 0and j > 0: if X[i-1] == Y[j-1]: lcs = X[i-1] + lcs // Prepend character i = i - 1 j = j - 1else if dp[i-1][j] > dp[i][j-1]: i = i - 1else: j = j - 1return lcsBacktracking Trace for X="ABCB", Y="BDCAB"
| Step | i | j | X[i-1] | Y[j-1] | Action | LCS so far |
|---|---|---|---|---|---|---|
| 1 | 4 | 5 | B | B | Match, move diagonal | B |
| 2 | 3 | 4 | C | A | dp[2][4]=1 < dp[3][3]=2, move left | B |
| 3 | 3 | 3 | C | C | Match, move diagonal | CB |
| 4 | 2 | 2 | B | D | dp[1][2]=0 < dp[2][1]=1, move up | CB |
| 5 | 1 | 2 | A | D | dp[0][2]=0 = dp[1][1]=0, move left | CB |
| 6 | 1 | 1 | A | B | dp[0][1]=0 = dp[1][0]=0, move up | CB |
This particular backtracking path yields "CB" with length 2, but we know the LCS has length 3. This happens because the tie-breaking rule matters. With different tie-breaking at steps 4-6, we could find "BCB" as a valid LCS. In practice, both paths through the table are valid; the LCS is not unique when there are ties in the DP table. A different backtracking strategy (always preferring up over left, or vice versa) yields different valid subsequences of the same maximum length.
10. Complexity Analysis
| Approach | Time Complexity | Space Complexity |
|---|---|---|
| Brute Force | O(2^(m+n)) | O(m + n) |
| Top-Down (Memoization) | O(m * n) | O(m * n) |
| Bottom-Up (Tabulation) | O(m * n) | O(m * n) |
| Space-Optimized | O(m * n) | O(min(m, n)) |
Space can be reduced to O(min(m, n)) by keeping only two rows of the table at a time. However, backtracking the LCS requires the full table. Hirschberg's algorithm achieves O(min(m, n)) space while still recovering the LCS by using a divide-and-conquer technique on top of the DP solution.
11. Variations
Longest Common Substring
Unlike LCS, the longest common substring requires contiguous characters. The recurrence changes: dp[i][j] = dp[i-1][j-1] + 1 if characters match, and dp[i][j] = 0 otherwise. The answer is the maximum value in the table.
Shortest Common Supersequence
The shortest string that has both X and Y as subsequences has length m + n - LCS(X, Y). The LCS tells us how much the two strings can "share."
LCS of Multiple Strings
The LCS can be extended to k strings. The DP table becomes k-dimensional with complexity O(n1 * n2 * ... * nk), which is practical only for small k.
12. Applications
- Diff tools: The Unix
diffcommand and Git diff use LCS to identify unchanged lines between file versions, displaying only the insertions and deletions. - Bioinformatics: DNA and protein sequence alignment uses LCS variants to find conserved regions across species.
- Plagiarism detection: Comparing documents by finding long common subsequences between their contents.
- Version control: Merge algorithms in systems like Git use LCS to resolve conflicts between concurrent edits.
- Data compression: Delta encoding uses LCS-like algorithms to store only differences between successive versions.
13. References
- Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms (4th ed.). MIT Press, 2022. Chapter 14: Dynamic Programming.
- Hirschberg, D. S. "A linear space algorithm for computing maximal common subsequences." Communications of the ACM, 18(6):341-343, 1975.
- Hunt, J. W. and McIlroy, M. D. "An Algorithm for Differential File Comparison." Bell Laboratories, 1976.
- Bergroth, L., Hakonen, H., and Raita, T. "A survey of longest common subsequence algorithms." SPIRE 2000.