Longest Common Subsequence

1. Introduction

The Longest Common Subsequence (LCS) problem asks us to find the longest sequence of characters that appears in both of two given strings, where the characters must appear in the same relative order but need not be contiguous. Unlike the longest common substring problem, the characters in the subsequence do not need to be adjacent.

"The LCS problem is one of the classic problems in computer science, and its solution by dynamic programming is one of the most elegant demonstrations of the technique." -- Thomas H. Cormen, Introduction to Algorithms

For example, the LCS of "ABCBDAB" and "BDCAB" is "BCAB" (length 4). Note that "BDAB" is also a valid LCS of the same length -- the LCS is not necessarily unique.

2. Problem Definition

Input: Two strings X of length m and Y of length n.

Output: The length of the longest common subsequence of X and Y, and optionally the subsequence itself.

Subsequence vs. Substring: A subsequence maintains relative order but allows gaps. "ACE" is a subsequence of "ABCDE" but not a substring. "BCD" is both a subsequence and a substring of "ABCDE".

String X	String Y	LCS	Length
ABCBDAB	BDCAB	BCAB	4
AGGTAB	GXTXAYB	GTAB	4
ABC	DEF	(empty)	0

3. Brute Force Approach

The brute force approach generates all 2^m subsequences of X, checks each against Y, and returns the longest match. This is prohibitively expensive.

functionlcsBrute(X, Y, m, n): if m == 0or n == 0: return0if X[m-1] == Y[n-1]: return1 + lcsBrute(X, Y, m-1, n-1) else: returnmax( lcsBrute(X, Y, m-1, n), lcsBrute(X, Y, m, n-1) )

The recursive solution has time complexity O(2^(m+n)) in the worst case, as each call may branch into two subcalls.

4. Optimal Substructure

The LCS problem has optimal substructure. Consider the last characters of X and Y:

If X[m] == Y[n]: This character must be part of the LCS. The LCS length is 1 plus the LCS of X[1..m-1] and Y[1..n-1].
If X[m] != Y[n]: At least one of the last characters is not in the LCS. We take the maximum of LCS(X[1..m-1], Y[1..n]) and LCS(X[1..m], Y[1..n-1]).

The overlapping subproblems arise because both branches may eventually need the LCS of the same pair of prefixes.

5. Recurrence Relation

Let dp[i][j] represent the length of the LCS of X[1..i] and Y[1..j].

Recurrence:

If X[i] == Y[j]: dp[i][j] = dp[i-1][j-1] + 1

If X[i] != Y[j]: dp[i][j] = max(dp[i-1][j], dp[i][j-1])

Base case:dp[i][0] = 0 and dp[0][j] = 0 for all i, j

6. Top-Down (Memoization)

functionlcsMemo(X, Y, m, n, memo): if m == 0or n == 0: return0if memo[m][n] != -1: return memo[m][n] if X[m-1] == Y[n-1]: memo[m][n] = 1 + lcsMemo(X, Y, m-1, n-1, memo) else: memo[m][n] = max( lcsMemo(X, Y, m-1, n, memo), lcsMemo(X, Y, m, n-1, memo) ) return memo[m][n]

7. Bottom-Up (Tabulation)

functionlcsDP(X, Y): m = length(X) n = length(Y) dp = array[m+1][n+1], initialized to 0for i = 1to m: for j = 1to n: if X[i-1] == Y[j-1]: dp[i][j] = dp[i-1][j-1] + 1else: dp[i][j] = max(dp[i-1][j], dp[i][j-1]) return dp[m][n]

8. Worked Example

Let us compute the LCS of X = "ABCB" and Y = "BDCAB".

DP Table

	B	D	C	A	B
""	0	0	0	0	0
A	0	0	0	1	1
B	1	1	1	1	2
C	1	1	2	2	2
B	1	1	2	2	3

The LCS length is dp[4][5] = 3.

Key Cell Explanations

dp[1][4]: X[1]='A' matches Y[4]='A', so dp[0][3] + 1 = 1.
dp[2][1]: X[2]='B' matches Y[1]='B', so dp[1][0] + 1 = 1.
dp[3][3]: X[3]='C' matches Y[3]='C', so dp[2][2] + 1 = 2.
dp[4][5]: X[4]='B' matches Y[5]='B', so dp[3][4] + 1 = 3.

9. Backtracking the LCS

To reconstruct the actual LCS string, we trace back through the DP table starting from dp[m][n].

functionfindLCS(dp, X, Y, m, n): lcs = "" i = m, j = n while i > 0and j > 0: if X[i-1] == Y[j-1]: lcs = X[i-1] + lcs // Prepend character i = i - 1 j = j - 1else if dp[i-1][j] > dp[i][j-1]: i = i - 1else: j = j - 1return lcs

Backtracking Trace for X="ABCB", Y="BDCAB"

Step	i	j	X[i-1]	Y[j-1]	Action	LCS so far
1	4	5	B	B	Match, move diagonal	B
2	3	4	C	A	dp[2][4]=1 < dp[3][3]=2, move left	B
3	3	3	C	C	Match, move diagonal	CB
4	2	2	B	D	dp[1][2]=0 < dp[2][1]=1, move up	CB
5	1	2	A	D	dp[0][2]=0 = dp[1][1]=0, move left	CB
6	1	1	A	B	dp[0][1]=0 = dp[1][0]=0, move up	CB

This particular backtracking path yields "CB" with length 2, but we know the LCS has length 3. This happens because the tie-breaking rule matters. With different tie-breaking at steps 4-6, we could find "BCB" as a valid LCS. In practice, both paths through the table are valid; the LCS is not unique when there are ties in the DP table. A different backtracking strategy (always preferring up over left, or vice versa) yields different valid subsequences of the same maximum length.

10. Complexity Analysis

Approach	Time Complexity	Space Complexity
Brute Force	O(2^(m+n))	O(m + n)
Top-Down (Memoization)	O(m * n)	O(m * n)
Bottom-Up (Tabulation)	O(m * n)	O(m * n)
Space-Optimized	O(m * n)	O(min(m, n))

Space can be reduced to O(min(m, n)) by keeping only two rows of the table at a time. However, backtracking the LCS requires the full table. Hirschberg's algorithm achieves O(min(m, n)) space while still recovering the LCS by using a divide-and-conquer technique on top of the DP solution.

11. Variations

Longest Common Substring

Unlike LCS, the longest common substring requires contiguous characters. The recurrence changes: dp[i][j] = dp[i-1][j-1] + 1 if characters match, and dp[i][j] = 0 otherwise. The answer is the maximum value in the table.

Shortest Common Supersequence

The shortest string that has both X and Y as subsequences has length m + n - LCS(X, Y). The LCS tells us how much the two strings can "share."

LCS of Multiple Strings

The LCS can be extended to k strings. The DP table becomes k-dimensional with complexity O(n1 * n2 * ... * nk), which is practical only for small k.

12. Applications

Diff tools: The Unix diff command and Git diff use LCS to identify unchanged lines between file versions, displaying only the insertions and deletions.
Bioinformatics: DNA and protein sequence alignment uses LCS variants to find conserved regions across species.
Plagiarism detection: Comparing documents by finding long common subsequences between their contents.
Version control: Merge algorithms in systems like Git use LCS to resolve conflicts between concurrent edits.
Data compression: Delta encoding uses LCS-like algorithms to store only differences between successive versions.

13. References

Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms (4th ed.). MIT Press, 2022. Chapter 14: Dynamic Programming.
Hirschberg, D. S. "A linear space algorithm for computing maximal common subsequences." Communications of the ACM, 18(6):341-343, 1975.
Hunt, J. W. and McIlroy, M. D. "An Algorithm for Differential File Comparison." Bell Laboratories, 1976.
Bergroth, L., Hakonen, H., and Raita, T. "A survey of longest common subsequence algorithms." SPIRE 2000.