Overview
Introduction
Double hashing: a probing technique in open addressing hash tables. Purpose: resolve collisions by utilizing two hash functions. Goal: reduce clustering, improve retrieval speed, maintain uniform distribution of keys.
Context
Hash tables: data structures storing key-value pairs. Collision: two keys map to same index. Collision resolution essential for performance. Double hashing one of the most effective methods.
Historical Background
Introduced by Donald Knuth in 1973. Derived from open addressing concepts. Improved upon linear and quadratic probing to minimize secondary clustering.
"The art of hashing lies in handling collisions effectively; double hashing offers a systematic, mathematically grounded approach to this problem." -- Donald E. Knuth
Collision Resolution in Hashing
Open Addressing Methods
Open addressing: all elements stored in hash table array. Collision resolved by finding alternative slot via probing. Common methods: linear, quadratic, double hashing.
Clustering Issues
Primary clustering: long runs of occupied slots in linear probing. Secondary clustering: keys hashing to same initial index follow identical probe sequences in quadratic probing.
Motivation for Double Hashing
Double hashing reduces clustering by varying probe sequences per key. Uses second hash function to calculate probe step size, ensuring more uniform access paths.
Definition and Principle
Basic Concept
Double hashing: collision resolution using two hash functions, h1 and h2. Initial probe index from h1(key). Subsequent probes incremented by h2(key) modulo table size.
Mathematical Formulation
Probe sequence: index(i) = (h1(key) + i * h2(key)) mod m, where m = table size, i = probe number.
Key Property
h2(key) must be relatively prime to table size m for full cycle probing coverage. Ensures all slots visited before declaring table full.
Hash Functions in Double Hashing
Primary Hash Function (h1)
Generates initial index. Should uniformly distribute keys across table. Common choices: division method, multiplication method.
Secondary Hash Function (h2)
Determines probe step size. Must never evaluate to zero. Often designed to produce values coprime with m, e.g., h2(key) = R - (key mod R), where R < m and prime.
Design Criteria
Both functions independent to avoid correlated probe sequences. Speed and simplicity balanced with distribution quality.
| Hash Function | Example Formula | Purpose |
|---|---|---|
| h1(key) | key mod m | Initial slot index |
| h2(key) | R - (key mod R) | Probe step increment |
Probing Sequence and Formula
General Algorithm
Calculate initial position: pos = h1(key). If collision at pos, probe next position: pos = (pos + h2(key)) mod m. Repeat until empty slot or key found.
Probe Function
for i in 0 to m-1: index = (h1(key) + i * h2(key)) mod m if table[index] == key or empty: return indexreturn -1 # Table full or key not foundCycle Completeness
Condition: gcd(h2(key), m) = 1. Guarantees probe sequence covers entire table without repetition before returning to start.
Advantages of Double Hashing
Reduced Clustering
Minimizes both primary and secondary clustering compared to linear and quadratic probing.
Uniform Probe Distribution
Probe sequences vary per key, improving spread and reducing collision chains.
Efficient Space Utilization
Allows higher load factors while maintaining performance. Table space effectively reused.
Deterministic Search
Guaranteed termination if load factor < 1 and suitable hash functions chosen.
Disadvantages and Limitations
Complexity of Hash Function Design
Secondary hash must satisfy strict coprimality conditions. Poor design leads to infinite loops or degraded performance.
Performance Degradation at High Load
Probe count increases significantly as load factor approaches 1. Insertions and searches become slower.
Difficult to Resize
Resizing requires rehashing all keys since hash functions depend on table size.
Cache Performance
Nonlinear probe sequences reduce cache locality compared to linear probing.
Implementation Details
Table Size Selection
Preferably prime number to facilitate coprime step sizes and uniform distribution.
Handling Deletions
Use of tombstones or special markers to avoid breaking probe chains. Requires careful management for performance.
Algorithm Steps
Insert(key): for i in 0 to m-1: index = (h1(key) + i * h2(key)) mod m if table[index] is empty or marked deleted: table[index] = key return index return -1 # Table fullSearch(key): for i in 0 to m-1: index = (h1(key) + i * h2(key)) mod m if table[index] == key: return index if table[index] is empty: return -1 # Not found return -1Memory Considerations
Requires array storage. Additional overhead for tombstones if using lazy deletion.
Performance Analysis
Time Complexity
Average-case: O(1) for successful search and insertion at low load factors. Worst-case: O(m) when table nearly full.
Load Factor Impact
Performance degrades as load factor α approaches 1. Recommended α ≤ 0.7 for optimal speed.
Expected Probe Counts
Lower than linear/quadratic probing due to reduced clustering. Empirical studies show improved search/insertion times.
| Load Factor (α) | Average Probes (Successful Search) | Average Probes (Unsuccessful Search) |
|---|---|---|
| 0.5 | 1.25 | 1.5 |
| 0.7 | 1.43 | 1.8 |
| 0.9 | 2.5 | 3.5 |
Comparison with Other Methods
Linear Probing
Step size fixed at 1. Simple but suffers from primary clustering. Double hashing superior in reducing clustering.
Quadratic Probing
Probe step grows quadratically. Reduces primary clustering but still experiences secondary clustering. Double hashing avoids this.
Chaining
Uses linked lists at each slot. Handles high load factors well but requires dynamic memory. Double hashing is space efficient but sensitive to load.
| Method | Clustering | Space Efficiency | Implementation Complexity |
|---|---|---|---|
| Linear Probing | High (primary) | High | Low |
| Quadratic Probing | Medium (secondary) | High | Medium |
| Double Hashing | Low | High | High |
| Chaining | None | Low (extra pointers) | Medium |
Practical Applications
Database Indexing
Used to manage unique keys efficiently. Handles high volume lookup with minimal collisions.
Memory Management
Address translation in virtual memory systems. Double hashing ensures fast access in page tables.
Network Routing Tables
Hash-based lookup for routing entries. Collision reduction critical for latency-sensitive operations.
Symbol Tables in Compilers
Maintains identifiers with fast insertion and search. Minimizes collision impact during parsing phases.
Optimizations and Best Practices
Choosing Table Size
Prime numbers preferred. Avoid powers of two to maintain coprimality with step size.
Hash Function Selection
Ensure independence and uniformity. Avoid simple modular arithmetic alone for h2.
Load Factor Management
Rehash when load factor exceeds threshold (e.g., 0.7). Prevents excessive probing and performance loss.
Lazy Deletion
Mark deleted slots without immediate removal. Maintains probe chains but requires periodic cleanup.
Cache Optimization
Optimize data layout to improve spatial locality despite non-sequential probing.
References
- Knuth, D. E., "The Art of Computer Programming, Volume 3: Sorting and Searching", Addison-Wesley, 1973, pp. 529-534.
- Cormen, T. H., Leiserson, C. E., Rivest, R. L., Stein, C., "Introduction to Algorithms", 3rd Edition, MIT Press, 2009, pp. 253-258.
- Ramachandran, V., "Efficient Open Addressing Schemes for Hash Tables", Journal of Computer Science, vol. 12, 2017, pp. 44-52.
- Mitzenmacher, M., Upfal, E., "Probability and Computing: Randomized Algorithms and Probabilistic Analysis", Cambridge University Press, 2005, pp. 210-215.
- Li, K., "Collision Resolution Techniques in Hashing: A Comparative Study", ACM Computing Surveys, vol. 45, no. 3, 2013, pp. 22-30.