1. Introduction

The Edit Distance problem, also known as Levenshtein distance (named after Soviet mathematician Vladimir Levenshtein who defined it in 1965), measures the minimum number of single-character operations required to transform one string into another. It is a fundamental algorithm in computer science with far-reaching applications in spell checking, DNA sequence alignment, natural language processing, and version control systems.

"The notion of distance between strings provides a mathematical foundation for approximate string matching, one of the most practical problems in computer science." -- Dan Gusfield, Algorithms on Strings, Trees, and Sequences

Edit distance captures an intuitive idea: how "similar" are two strings? The fewer operations needed to transform one into the other, the more similar they are. This concept extends naturally to many domains beyond text processing.

2. Problem Definition

Input: Two strings s1 of length m and s2 of length n.

Output: The minimum number of edit operations to convert s1 into s2.

ParameterDescriptionExample
s1Source string"kitten"
s2Target string"sitting"
Return valueMinimum edit distance3

3. Edit Operations

The three permitted operations, each with a cost of 1, are:

  • Insert: Add a character at any position. Example: "abc" to "abdc" (insert 'd').
  • Delete: Remove a character from any position. Example: "abc" to "ac" (delete 'b').
  • Replace: Change one character to another. Example: "abc" to "adc" (replace 'b' with 'd').

Note: If only insertions and deletions are allowed (no replacements), the resulting metric is called the Longest Common Subsequence (LCS) distance. If transpositions are also allowed, the metric is called the Damerau-Levenshtein distance.

4. Brute Force Approach

The naive recursive approach compares characters from the end of both strings. If the last characters match, no operation is needed for that position. Otherwise, we try all three operations and take the minimum.

functioneditDistance(s1, s2, m, n): // Base casesif m == 0: return n // Insert all remaining characters of s2if n == 0: return m // Delete all remaining characters of s1// If last characters match, no operation neededif s1[m-1] == s2[n-1]: returneditDistance(s1, s2, m-1, n-1) // Try all three operations and take minimumreturn1 + min( editDistance(s1, s2, m, n-1), // InserteditDistance(s1, s2, m-1, n), // DeleteeditDistance(s1, s2, m-1, n-1) // Replace )

This recursive solution has exponential time complexity O(3^(m+n)) because each call branches into up to three subcalls, and many subproblems are recomputed.

5. Optimal Substructure

The problem has optimal substructure because the edit distance between two strings can be computed from the edit distances of their prefixes. Specifically, the optimal alignment of s1[1..m] and s2[1..n] must end with one of:

  • Aligning s1[m] with s2[n] (match or replace), plus the optimal alignment of s1[1..m-1] and s2[1..n-1].
  • Deleting s1[m], plus the optimal alignment of s1[1..m-1] and s2[1..n].
  • Inserting s2[n], plus the optimal alignment of s1[1..m] and s2[1..n-1].

The overlapping subproblems arise because different sequences of operations can lead to the same pair of prefixes being compared, making dynamic programming the right technique.

6. Recurrence Relation

Let dp[i][j] represent the edit distance between s1[1..i] and s2[1..j].

Recurrence:

If s1[i] == s2[j]: dp[i][j] = dp[i-1][j-1]

If s1[i] != s2[j]: dp[i][j] = 1 + min(dp[i][j-1], dp[i-1][j], dp[i-1][j-1])

Base cases:dp[i][0] = i and dp[0][j] = j

The three terms in the minimum correspond to insert (extend s2 by one), delete (extend s1 by one), and replace (change one character), respectively.

7. Top-Down (Memoization)

functioneditDistMemo(s1, s2, m, n, memo): if m == 0: return n if n == 0: return m if memo[m][n] != -1: return memo[m][n] if s1[m-1] == s2[n-1]: memo[m][n] = editDistMemo(s1, s2, m-1, n-1, memo) else: memo[m][n] = 1 + min( editDistMemo(s1, s2, m, n-1, memo), editDistMemo(s1, s2, m-1, n, memo), editDistMemo(s1, s2, m-1, n-1, memo) ) return memo[m][n]

8. Bottom-Up (Tabulation)

functioneditDistDP(s1, s2): m = length(s1) n = length(s2) dp = array[m+1][n+1] // Base casesfor i = 0to m: dp[i][0] = i for j = 0to n: dp[0][j] = j // Fill tablefor i = 1to m: for j = 1to n: if s1[i-1] == s2[j-1]: dp[i][j] = dp[i-1][j-1] else: dp[i][j] = 1 + min(dp[i][j-1], dp[i-1][j], dp[i-1][j-1]) return dp[m][n]

9. Worked Example

Let us compute the edit distance between s1 = "kitten" and s2 = "sitting".

DP Table

""sitting
""01234567
k11234567
i22123456
t33212345
t44321234
e55432234
n66543323

The edit distance is dp[6][7] = 3. The three operations are:

  1. Replace 'k' with 's': "kitten" becomes "sitten"
  2. Replace 'e' with 'i': "sitten" becomes "sittin"
  3. Insert 'g' at the end: "sittin" becomes "sitting"

Reading the Table

Each cell dp[i][j] represents the minimum edits to convert the first i characters of "kitten" into the first j characters of "sitting". The value comes from comparing the diagonal (match/replace), left (insert), and above (delete) neighbors.

10. Complexity Analysis

ApproachTime ComplexitySpace Complexity
Brute ForceO(3^(m+n))O(m + n) stack
Top-Down (Memoization)O(m * n)O(m * n)
Bottom-Up (Tabulation)O(m * n)O(m * n)
Space-OptimizedO(m * n)O(min(m, n))

Space can be optimized to O(min(m, n)) since each row of the DP table depends only on the previous row. However, this optimization prevents backtracking to find the actual edit operations.

11. Variations

Weighted Edit Distance

Different operations can have different costs. For example, on a keyboard, replacing 'a' with 's' (adjacent keys) might cost less than replacing 'a' with 'p' (distant keys). The recurrence remains the same but uses variable costs instead of 1.

Damerau-Levenshtein Distance

Adds a fourth operation: transposition (swapping two adjacent characters). This is particularly useful for spell checking since transposition errors ("teh" for "the") are common typing mistakes.

Longest Common Subsequence Distance

If only insertions and deletions are allowed (no replacements), the edit distance equals m + n - 2 * LCS(s1, s2), connecting edit distance to the LCS problem.

12. Applications

  • Spell checking: Suggesting corrections by finding dictionary words with the smallest edit distance to a misspelled word.
  • DNA sequence alignment: Measuring similarity between genetic sequences where insertions, deletions, and mutations correspond to the three edit operations.
  • Plagiarism detection: Comparing documents to find passages with small edit distances.
  • Diff tools: Version control systems like Git use edit distance variants to compute the difference between file versions.
  • Optical character recognition (OCR): Matching recognized text against dictionary entries to correct recognition errors.
  • Natural language processing: Fuzzy string matching in search engines and autocomplete systems.

13. References

  1. Levenshtein, V. I. "Binary codes capable of correcting deletions, insertions, and reversals." Soviet Physics Doklady, 10(8):707-710, 1966.
  2. Wagner, R. A. and Fischer, M. J. "The string-to-string correction problem." Journal of the ACM, 21(1):168-173, 1974.
  3. Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms (4th ed.). MIT Press, 2022.
  4. Gusfield, D. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.