Introduction
Interpolation search is an improved variant of binary search designed for sorted arrays where the values are approximately uniformly distributed. While binary search always examines the middle element of the current search interval, interpolation search estimates the position of the target value based on the values at the boundaries of the interval. This is analogous to how a person might search a telephone book: rather than opening to the middle, one would open near the beginning for names starting with "A" and near the end for names starting with "Z."
The algorithm was first described by W. W. Peterson in 1957 and later analyzed in detail by Yehoshua Perl, Alon Itai, and Haim Avni in 1978. When the data is uniformly distributed, interpolation search achieves an average time complexity of O(log log n), which is significantly faster than binary search's O(log n). For one billion elements, binary search requires about 30 comparisons while interpolation search requires only about 5.
"Interpolation search improves upon binary search by using the distribution of keys to estimate the position of the target, much as we would look up a word in a dictionary." -- Thomas H. Cormen et al., Introduction to Algorithms
However, this improved performance comes with a caveat: if the data is not uniformly distributed, interpolation search can degrade to O(n) in the worst case, making it slower than binary search. Understanding when to apply interpolation search versus binary search is crucial for algorithm selection.
How It Works
The Key Insight
Binary search always probes the midpoint of the current interval, regardless of the target value or the values at the interval boundaries. Interpolation search improves on this by estimating where the target is likely to be located based on linear interpolation between the boundary values.
Step-by-Step Procedure
- Set low = 0 and high = n - 1.
- While low <= high and the target lies within the range [A[low], A[high]]:
  - Calculate the probe position using the interpolation formula:
    pos = low + ((target - A[low]) * (high - low)) / (A[high] - A[low])
  - If A[pos] equals the target, return pos.
  - If A[pos] is less than the target, set low = pos + 1.
  - If A[pos] is greater than the target, set high = pos - 1.
- If the loop ends, return -1 (not found).
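The steps above can be written as a short runnable function. This is a minimal Python sketch (the function name and the integer-key assumption are ours, not from the original description):

```python
def interpolation_search(a, target):
    """Return the index of target in the sorted list a, or -1 if absent."""
    low, high = 0, len(a) - 1
    while low <= high and a[low] <= target <= a[high]:
        if a[low] == a[high]:  # boundary keys equal: avoid division by zero
            return low if a[low] == target else -1
        # Linear interpolation estimate of the target's index
        pos = low + (target - a[low]) * (high - low) // (a[high] - a[low])
        if a[pos] == target:
            return pos
        if a[pos] < target:
            low = pos + 1
        else:
            high = pos - 1
    return -1
```

Integer floor division (`//`) keeps the probe inside [low, high], and Python's arbitrary-precision integers sidestep overflow concerns.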
Intuition Behind the Formula
The interpolation formula assumes that values are evenly spaced between A[low] and A[high]. Under this assumption, the position of the target value should be proportional to its numerical position within the range. If the target is 75% of the way from A[low] to A[high] in value, the formula estimates it to be 75% of the way from low to high in position.
Algorithm and Pseudocode
Iterative Interpolation Search
procedure InterpolationSearch(A, n, target)
    Input: A = sorted array of n elements, target = value to find
    Output: index of target in A, or -1 if not found
    low := 0
    high := n - 1
    while low <= high and target >= A[low] and target <= A[high] do
        if low = high then
            if A[low] = target then
                return low
            else
                return -1
            end if
        end if
        // Estimate probe position using linear interpolation
        pos := low + ((target - A[low]) * (high - low)) / (A[high] - A[low])
        if A[pos] = target then
            return pos
        else if A[pos] < target then
            low := pos + 1
        else
            high := pos - 1
        end if
    end while
    return -1
end procedure

Recursive Interpolation Search
procedure InterpolationSearchRecursive(A, low, high, target)
    if low > high or target < A[low] or target > A[high] then
        return -1
    end if
    if low = high then
        if A[low] = target then
            return low
        else
            return -1
        end if
    end if
    pos := low + ((target - A[low]) * (high - low)) / (A[high] - A[low])
    if A[pos] = target then
        return pos
    else if A[pos] < target then
        return InterpolationSearchRecursive(A, pos + 1, high, target)
    else
        return InterpolationSearchRecursive(A, low, pos - 1, target)
    end if
end procedure

Complexity Analysis
Time Complexity
| Case | Complexity | Condition |
|---|---|---|
| Best Case | O(1) | Target is found on the first probe. |
| Average Case (uniform data) | O(log log n) | Data is uniformly distributed. |
| Worst Case | O(n) | Data is highly skewed (e.g., exponentially distributed). |
Why O(log log n)?
When data is uniformly distributed, each interpolation step reduces the search space not by half (as in binary search) but by a factor proportional to the square root of the current size. After one probe, the expected subproblem size drops from n to approximately sqrt(n). After two probes, it drops to approximately sqrt(sqrt(n)), and so on. Solving the recurrence T(n) = T(sqrt(n)) + O(1) yields T(n) = O(log log n).
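The recurrence can be checked numerically: repeatedly replacing n with sqrt(n) until the interval becomes trivial takes about log2 log2 n steps. A small Python check (the helper name `sqrt_depth` is ours):

```python
import math

def sqrt_depth(n):
    """Count how many times sqrt must be applied before n drops below 2."""
    count = 0
    while n >= 2:
        n = math.sqrt(n)
        count += 1
    return count

for n in (10**3, 10**6, 10**9):
    # sqrt_depth(n) tracks log2(log2(n)) to within a small constant
    print(n, sqrt_depth(n), round(math.log2(math.log2(n)), 1))
```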
Space Complexity
Iterative: O(1). Only a constant number of variables (low, high, pos) are used.
Recursive: O(log log n) on average for uniform data, but O(n) in the worst case due to recursive call depth.
Practical Comparison
| Array Size (n) | Binary Search (log2 n) | Interpolation Search (log2 log2 n) |
|---|---|---|
| 1,000 | ~10 comparisons | ~3-4 comparisons |
| 1,000,000 | ~20 comparisons | ~4-5 comparisons |
| 1,000,000,000 | ~30 comparisons | ~5 comparisons |
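The interpolation column of the table can be checked empirically by counting probes on uniformly distributed random keys. The sketch below is illustrative only; the array size, key range, seed, and sample count are arbitrary choices:

```python
import random

def count_probes(a, target):
    """Run interpolation search on sorted list a; return the number of probes."""
    low, high, probes = 0, len(a) - 1, 0
    while low <= high and a[low] <= target <= a[high]:
        if a[low] == a[high]:
            return probes + 1
        pos = low + (target - a[low]) * (high - low) // (a[high] - a[low])
        probes += 1
        if a[pos] == target:
            return probes
        if a[pos] < target:
            low = pos + 1
        else:
            high = pos - 1
    return probes

random.seed(1)
a = sorted(random.sample(range(10**8), 10**5))
avg = sum(count_probes(a, random.choice(a)) for _ in range(1000)) / 1000
# For n = 100,000, binary search needs ~17 comparisons; on uniform keys the
# average probe count here stays near log2(log2(n)), i.e. around 4.
```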
Worked Example
Search for target = 67 in the sorted, uniformly distributed array:
A = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100] (n = 10)
Step-by-Step Trace
| Step | low | high | A[low] | A[high] | pos (formula) | A[pos] | Action |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 9 | 10 | 100 | 0 + ((67-10)*(9-0))/(100-10) = 0 + (57*9)/90 = 5.7 -> 5 | 60 | 60 < 67, set low = 6 |
| 2 | 6 | 9 | 70 | 100 | -- (loop condition fails, no probe) | -- | 67 < A[low] = 70, exit loop |
Result: In step 2 the loop condition target >= A[low] fails (67 < 70), so no second probe is made and the loop exits. Return -1: the value 67 is not in the array. The search needed only one probe to determine that the target is absent.
Successful Search Example
Search for target = 70 in the same array A = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100].
- Step 1: low=0, high=9. pos = 0 + ((70-10)*9)/90 = 6. A[6] = 70 = target. Return 6.
The target is found in a single probe, demonstrating interpolation search's ability to jump directly to the correct position when data is uniformly distributed.
The Probe Position Formula
Mathematical Derivation
The interpolation formula is derived from linear interpolation. Given two points (low, A[low]) and (high, A[high]), the formula estimates the position x where a value y (the target) would fall on a straight line connecting these two points:
pos = low + ((target - A[low]) / (A[high] - A[low])) * (high - low)

The term (target - A[low]) / (A[high] - A[low]) computes the fractional position of the target within the value range. Multiplying by (high - low) maps this fraction to the index range.
Edge Cases
- Division by zero: If A[low] = A[high], the denominator is zero. This case must be handled separately: if A[low] equals the target, return low; otherwise, the target is not present.
- Integer overflow: The product (target - A[low]) * (high - low) can overflow for large values. Using 64-bit integers or floating-point arithmetic for the calculation avoids this issue.
- Negative probe position: If the target is outside the range [A[low], A[high]], the formula can produce positions outside the valid index range. The loop condition target >= A[low] and target <= A[high] prevents this.
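The guards above can be combined into a single probe helper. This Python sketch (the name `probe` is ours) returns -1 for an out-of-range target and handles the equal-keys case explicitly:

```python
def probe(a, low, high, target):
    """Interpolation probe position within a[low..high], with edge-case guards."""
    if not (a[low] <= target <= a[high]):  # out-of-range target: no valid probe
        return -1
    if a[low] == a[high]:                  # equal boundary keys: avoid division by zero
        return low
    # In languages with fixed-width integers, compute the product below in a
    # wider type (e.g. 64-bit) or in floating point to prevent overflow;
    # Python's arbitrary-precision integers make this a non-issue here.
    return low + (target - a[low]) * (high - low) // (a[high] - a[low])

a = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
probe(a, 0, 9, 70)   # index estimate 6
probe(a, 0, 9, 5)    # -1: below the range
```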
Variants
Quadratic interpolation search uses a quadratic (rather than linear) fit through three points to estimate the probe position. This can provide better estimates for data with known quadratic distribution patterns, but the additional computation per step often outweighs the benefit of fewer probes.
Advantages and Disadvantages
Advantages
- Extremely Fast on Uniform Data: O(log log n) average case is dramatically faster than binary search for large datasets with uniform distribution.
- Adaptive: The probe position adapts to the actual data values, not just the indices, making better use of available information.
- Simple Modification of Binary Search: The algorithm is structurally identical to binary search with only the probe position calculation changed.
- Efficient for Large Sorted Files: When searching large sorted files on disk, each probe requires a disk access. Minimizing the number of probes (as interpolation search does) directly reduces I/O operations.
Disadvantages
- Worst Case O(n): On highly non-uniform data (e.g., exponential distribution), the probe position estimates can be poor, causing the algorithm to degenerate to linear scan.
- Requires Numeric Keys: The interpolation formula requires arithmetic operations on the keys, so it cannot be applied to non-numeric data (e.g., strings) without numeric encoding.
- More Complex Than Binary Search: The additional computation per step (multiplication, division) adds overhead that may not be worthwhile for small arrays.
- Requires Sorted Data: Like binary search, the array must be sorted.
- Susceptible to Integer Overflow: Care must be taken with the arithmetic to avoid overflow errors.
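The O(n) degeneration is easy to reproduce: a single extreme outlier drags every probe estimate to the low end of the interval. An illustrative Python measurement (the helper name is ours):

```python
def probes_used(a, target):
    """Interpolation search on sorted list a; return the number of probes."""
    low, high, probes = 0, len(a) - 1, 0
    while low <= high and a[low] <= target <= a[high]:
        if a[low] == a[high]:
            return probes + 1
        pos = low + (target - a[low]) * (high - low) // (a[high] - a[low])
        probes += 1
        if a[pos] == target:
            return probes
        if a[pos] < target:
            low = pos + 1
        else:
            high = pos - 1
    return probes

uniform = list(range(1000))
skewed = list(range(999)) + [10**12]  # one huge outlier at the end
probes_used(uniform, 998)  # 1 probe: interpolation jumps straight to it
probes_used(skewed, 998)   # one probe per element: linear-time behavior
```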
Comparison with Other Search Algorithms
| Algorithm | Average Case | Worst Case | Best For |
|---|---|---|---|
| Linear Search | O(n) | O(n) | Unsorted or very small data |
| Binary Search | O(log n) | O(log n) | Any sorted data |
| Interpolation Search | O(log log n) | O(n) | Uniformly distributed sorted data |
| Jump Search | O(sqrt(n)) | O(sqrt(n)) | Sorted data with costly backward jumps |
Interpolation search is the fastest of these algorithms when the data is uniformly distributed but the least reliable when the distribution is unknown. Binary search remains the safest general-purpose choice for sorted arrays because its O(log n) worst case is guaranteed regardless of data distribution. A hybrid approach -- starting with interpolation and falling back to binary search after a few poor probes -- can combine the best properties of both.
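The hybrid idea can be sketched in Python as follows; the cutoff of three interpolation probes is an arbitrary illustrative choice, and the function name is ours:

```python
def hybrid_search(a, target, max_interp_probes=3):
    """Interpolation probes first, then plain binary search as a fallback."""
    low, high, probes = 0, len(a) - 1, 0
    while low <= high:
        in_range = a[low] <= target <= a[high]
        if probes < max_interp_probes and a[low] < a[high] and in_range:
            # Interpolation probe while the budget lasts
            pos = low + (target - a[low]) * (high - low) // (a[high] - a[low])
        else:
            pos = (low + high) // 2  # binary-search midpoint: guaranteed O(log n)
        probes += 1
        if a[pos] == target:
            return pos
        if a[pos] < target:
            low = pos + 1
        else:
            high = pos - 1
    return -1
```

This keeps the fast path on uniform data while capping the total work at roughly max_interp_probes plus an O(log n) binary search.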
Applications
Telephone Directory and Dictionary Lookup
The original motivation for interpolation search comes from how humans search through alphabetically sorted books. We estimate where to open based on the letter we are looking for, which is exactly what the algorithm formalizes.
Database Indexing
Database systems can use interpolation search for lookups in dense, uniformly distributed integer indexes (such as auto-incrementing primary keys). The reduced number of disk accesses compared to binary search can significantly improve query performance.
Numerical Computation
In scientific computing, interpolation search techniques are used to locate values in lookup tables with regularly spaced entries, such as precomputed mathematical function values.
Log File Analysis
When searching through sorted log files by timestamp (which is approximately uniformly distributed over time), interpolation search can quickly locate entries near a target timestamp.
Data Compression
Compression algorithms that use sorted symbol frequency tables can employ interpolation search to quickly look up encoding information for a given symbol.
References
- Peterson, W. W. "Addressing for Random-Access Storage." IBM Journal of Research and Development, 1(2):130-146, 1957.
- Perl, Y., Itai, A., and Avni, H. "Interpolation Search -- A log log n Search." Communications of the ACM, 21(7):550-553, 1978.
- Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms (3rd ed.). MIT Press, 2009.
- Sedgewick, R. and Wayne, K. Algorithms (4th ed.). Addison-Wesley, 2011.
- Knuth, D. E. The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.). Addison-Wesley, 1998. Section 6.2.1.