Introduction
Hash tables enable O(1) average-case lookup (vs. O(log n) for balanced trees, O(n) for linear search). Trade-off: O(1) expected performance against an O(n) worst case. Ubiquitous: dict in Python, HashMap in Java, unordered_map in C++.
Core idea: map keys to array indices via a hash function. Key observation: array access is O(1), so if the hash function distributes keys to distinct indices, lookup becomes O(1). Collisions (multiple keys hashing to the same index) require handling: chaining or open addressing.
Fundamental data structure for caching, databases, compilers (symbol tables), memory management (page tables). Performance depends on hash function quality and load factor management.
"Hash tables are the workhorses of modern computing. They enable constant-time operations that would otherwise require logarithmic or linear time." -- Data structure fundamentals
Hash Functions and Properties
Definition
Hash function h maps a key to a table index in the range 0 to m-1. h(key) is deterministic: the same key always maps to the same index. A good hash function distributes keys uniformly across indices (minimizes collisions).
Simple example (integers):
h(key) = key % m
For strings:
h(key) = (c1 * 31^(n-1) + c2 * 31^(n-2) + ... + cn) % m
(polynomial rolling hash)
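The polynomial formula above can be evaluated with Horner's rule, one multiply-add per character. A minimal sketch; p = 31 and the modulus 10^9 + 7 are common choices, not requirements:

```python
def poly_hash(s: str, p: int = 31, m: int = 10**9 + 7) -> int:
    """Polynomial rolling hash: (s[0]*p^(n-1) + ... + s[n-1]) mod m."""
    h = 0
    for ch in s:
        h = (h * p + ord(ch)) % m  # Horner's rule: no explicit powers needed
    return h
```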
Desired Properties
- Uniformity: Keys uniformly distributed across table
- Efficiency: Quick to compute O(1)
- Determinism: Same key always same hash
- Avalanche effect: Small change in key causes large change in hash
Hash Function Quality
Poor function: many collisions. Example: h(key) = key % 10 for keys 10, 20, 30 (all map to 0). Good function: leverage all bits of key, mix bits thoroughly.
Division Method
h(key) = key mod m
Choice of m matters: avoid m = 2^k (uses only the k low-order bits of the key, ignoring the rest).
Prefer a prime m (better distribution for many key patterns).
Multiplication Method
h(key) = floor(m * (key * A mod 1))
A = (sqrt(5) - 1) / 2 ≈ 0.618 (golden ratio)
Works well for any m. Less dependent on m's properties.
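A floating-point sketch of the multiplication method (production implementations typically use fixed-point integer arithmetic instead):

```python
def mult_hash(key: int, m: int) -> int:
    """Multiplication method: floor(m * frac(key * A))."""
    A = (5 ** 0.5 - 1) / 2   # golden-ratio conjugate, approx. 0.6180339887
    frac = (key * A) % 1.0   # fractional part of key * A
    return int(m * frac)     # floor, since both factors are non-negative
```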
Collision Handling: Chaining vs. Open Addressing
Collisions Inevitable
By the pigeonhole principle: if n keys map to m indices and n > m, at least one collision is guaranteed. Even for n ≈ m, expect collisions (birthday paradox: among 23 people, ~50% chance two share a birthday).
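The birthday-paradox figure can be checked directly by multiplying the per-key probabilities of avoiding a collision (a small standalone sketch, not part of any hash table):

```python
def collision_prob(n: int, m: int) -> float:
    """P(at least one collision) when n keys are placed uniformly into m slots."""
    p_none = 1.0
    for i in range(n):
        p_none *= (m - i) / m  # i-th key must avoid the i slots already taken
    return 1.0 - p_none
```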
Chaining Approach
Each table index stores linked list of all keys hashing to that index. Insert: compute h(key), add to list. Search: compute h(key), search list. Delete: find and remove from list.
Open Addressing Approach
No separate storage (linked lists). On collision, find empty slot using probing sequence. Insert: compute h(key), if occupied, try h(key)+1, h(key)+2, etc. until empty slot found. More complex but cache-friendly (contiguous memory).
Comparison
| Aspect | Chaining | Open Addressing |
|---|---|---|
| Space | O(n) + list overhead | O(m), no extra pointers |
| Worst case (collisions) | O(n) for list traversal | O(n) for probe sequence |
| Cache performance | Poor (pointer chasing) | Good (contiguous array) |
| Simplicity | Simple | Complex (probe sequences) |
Separate Chaining Implementation
Structure
class HashTable:
    def __init__(self, size):
        self.table = [[] for _ in range(size)]  # list of buckets (lists)

    def hash(self, key):
        return hash(key) % len(self.table)

    def insert(self, key, value):
        index = self.hash(key)
        for i, (k, v) in enumerate(self.table[index]):
            if k == key:
                self.table[index][i] = (key, value)  # update existing key
                return
        self.table[index].append((key, value))  # insert new key

    def search(self, key):
        index = self.hash(key)
        for k, v in self.table[index]:
            if k == key:
                return v
        return None

    def delete(self, key):
        index = self.hash(key)
        self.table[index] = [(k, v) for k, v in self.table[index] if k != key]
Analysis
Insert, search, delete: O(1 + α), where α = n/m is the load factor. If α is kept bounded by a constant (via resizing), average O(1). For α >> 1, chains grow and operations degrade toward O(n).
Deletion
Simple: remove from the linked list. Unlike open addressing, no tombstones are needed to preserve probe sequences.
Open Addressing Methods
Linear Probing
On collision at h(key), try h(key)+1, h(key)+2, ..., wrapping around.
h_i(key) = (h(key) + i) mod m for i = 0, 1, 2, ...
Problem: clustering. Consecutive filled slots accumulate, increasing probe length for future insertions.
Quadratic Probing
h_i(key) = (h(key) + c1*i + c2*i^2) mod m
Reduces primary clustering, but keys with the same initial hash still follow identical probe sequences (secondary clustering), and the sequence may not visit every slot unless m and c1, c2 are chosen carefully.
Double Hashing
h_i(key) = (h1(key) + i*h2(key)) mod m
Two hash functions give better distribution: the probe step depends on the key, avoiding both primary and secondary clustering. Requires h2(key) != 0 and, ideally, coprime with m.
Deletion Complication
Can't simply remove (breaks probe sequences). Solution: tombstones (mark deleted, skip in search, reuse in insert). Degrades performance over time.
Load Factor Limits
Open addressing requires load factor < 0.7 (30% empty slots). Otherwise, probe lengths grow rapidly. Requires frequent resizing.
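The pieces above (linear probing, tombstone deletion, tombstone reuse on insert) can be combined into a minimal open-addressing table. A sketch only, assuming Python's built-in hash and no automatic resizing:

```python
_EMPTY, _TOMBSTONE = object(), object()  # sentinel slot markers

class OpenAddressTable:
    def __init__(self, size=8):
        self.slots = [_EMPTY] * size

    def _probe(self, key):
        m = len(self.slots)
        start = hash(key) % m
        for step in range(m):
            yield (start + step) % m  # linear probing: h(key)+0, +1, +2, ...

    def insert(self, key, value):
        first_free = None  # first tombstone seen, reusable for insertion
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is _EMPTY:
                target = first_free if first_free is not None else idx
                self.slots[target] = (key, value)
                return
            if slot is _TOMBSTONE:
                if first_free is None:
                    first_free = idx
            elif slot[0] == key:
                self.slots[idx] = (key, value)  # update existing key
                return
        if first_free is not None:
            self.slots[first_free] = (key, value)
            return
        raise RuntimeError("table full; a real table would resize here")

    def search(self, key):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is _EMPTY:
                return None  # an empty slot terminates the probe sequence
            if slot is not _TOMBSTONE and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is _EMPTY:
                return
            if slot is not _TOMBSTONE and slot[0] == key:
                self.slots[idx] = _TOMBSTONE  # keep probe sequences intact
                return
```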
Load Factor and Resizing
Load Factor Definition
α = n / m (elements / table size). Affects performance: higher α means more collisions, longer probe sequences.
Resizing Strategy
When α exceeds threshold (0.75 for chaining, 0.5 for open addressing), resize: allocate new table (typically 2x size), rehash all elements into new table.
def resize(self):
    old_table = self.table
    self.table = [[] for _ in range(len(old_table) * 2)]  # double the capacity
    for bucket in old_table:
        for key, value in bucket:
            self.insert(key, value)  # rehash into the new table
Amortized Analysis
Resizing cost O(n) (rehash all elements). But infrequent (doubles size each resize). Amortized: O(1) per operation including resizing. n operations: total O(n) time.
New Hash Function
Resizing changes m, so the index computation changes with it: a key that mapped to index i under the old m may map elsewhere under the new m. This is why resize must rehash every element rather than copy slots.
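The effect is easy to see with the division method (a toy illustration with arbitrary keys):

```python
old_m, new_m = 8, 16
keys = [15, 27, 8]
old_indices = [k % old_m for k in keys]  # indices in the old table
new_indices = [k % new_m for k in keys]  # indices after doubling
# 15 lands at index 7 in the old table but at 15 in the new one,
# so every key must be rehashed during resize.
```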
Time Complexity Analysis
| Operation | Average | Worst Case |
|---|---|---|
| Search | O(1) | O(n) |
| Insert | O(1) | O(n) |
| Delete | O(1) | O(n) |
| Resize | O(1) amortized per operation | O(n) for a single resize |
Why Worst Case O(n)?
All keys hash to same index: one giant collision chain. Then search traverses entire chain O(n). Extremely rare with good hash function and random data. Possible with adversarial input designed to collide.
Average Case O(1) Assumption
Assumes good hash function (uniform distribution) and load factor controlled (constant). Under these conditions, probability of long chains low, expected chain length O(1).
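This can be sanity-checked by simulation: drop n keys into m buckets and measure the longest chain. A sketch assuming an ideal uniform hash, modeled here with random.randrange:

```python
import random

def max_chain_length(n: int, m: int, seed: int = 0) -> int:
    """Longest bucket after inserting n uniformly hashed keys into m buckets."""
    rng = random.Random(seed)
    buckets = [0] * m
    for _ in range(n):
        buckets[rng.randrange(m)] += 1  # ideal uniform-hashing assumption
    return max(buckets)
```

Even at load factor α = 1, the longest chain stays tiny compared to n, which is why the average case is O(1) under these assumptions.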
Designing Good Hash Functions
String Hashing
Polynomial rolling hash:
h(s) = (s[0]*p^(n-1) + s[1]*p^(n-2) + ... + s[n-1]) mod m
p = small prime (31 for strings)
Efficient: precompute powers, compute incrementally
Universality
Universal hash family: for any two distinct keys x, y, P(h(x) = h(y)) <= 1/m when h is drawn at random from the family. Gives a probabilistic guarantee of few collisions regardless of the input. Example: the MAD (multiply-add-divide) family h(x) = ((ax + b) mod p) mod m, with prime p larger than any key and random a, b.
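A sketch of a MAD-style universal family; the prime p = 2^31 - 1 is an arbitrary choice, assumed larger than every key:

```python
import random

def make_mad_hash(m: int, p: int = 2**31 - 1, seed: int = 42):
    """Draw h(x) = ((a*x + b) mod p) mod m at random from the MAD family."""
    rng = random.Random(seed)
    a = rng.randrange(1, p)  # a must be nonzero
    b = rng.randrange(0, p)
    return lambda x: ((a * x + b) % p) % m
```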
Avoiding Patterns
Some hash functions fail on structured inputs. Example: division by m performs poorly when keys share common factors with m (all keys that are multiples of m hash to 0). Use the multiplication method or universal families instead.
Implementation in Languages
Python: built-in hash() is randomized for strings (differs across runs). Java: hashCode() is overridable; String.hashCode() is computed from the characters. C++: the std::hash template.
Cryptographic Hashing
Difference from Hash Tables
Cryptographic hash: a one-way function (infeasible to invert or to engineer collisions for). Used for security. Hash-table hash: fast, but offers no resistance to inversion or crafted collisions.
Properties
- Deterministic: Same input always same hash
- Quick: Fast to compute
- Avalanche: Small change large output change
- Pre-image resistant: Hard to find input given hash
- Collision resistant: Hard to find two inputs with same hash
Examples
MD5 (broken, don't use), SHA-1 (deprecated), SHA-256, SHA-3. Used for password hashing, data integrity checking, blockchain, digital signatures.
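The avalanche property is easy to observe with Python's standard-library hashlib (illustrative only; SHA-256 here, with a one-character input change):

```python
import hashlib

d1 = hashlib.sha256(b"hello").hexdigest()
d2 = hashlib.sha256(b"hellp").hexdigest()  # single character changed
changed = sum(a != b for a, b in zip(d1, d2))  # hex positions that differ
# on average roughly 15/16 of the 64 hex digits differ
```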
Not for Hash Tables
Cryptographic hashes slower (designed for security not speed). Hash table hashes faster (designed for speed not security). Different use cases.
Practical Considerations
Language Implementations
Python dict, Java HashMap, C++ unordered_map: all use hash tables. Python's dict uses open addressing. Java's HashMap uses chaining (since Java 8, long buckets are converted to red-black trees). C++ unordered_map implementations use chaining (the standard's bucket interface effectively requires it).
Hash Collisions Attack
Adversarial input crafted so all keys hash to the same index. DoS attack: the table degenerates to O(n) per operation. Mitigation: keyed/randomized hash functions (e.g., SipHash in Python 3), limits on untrusted input size.
Custom Classes
A custom class used as a key must override hashing and equality (Python: __hash__ and __eq__; Java: hashCode() and equals()). Ensure consistency: equal objects must have the same hash. Unequal objects may share a hash (that is merely a collision).
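A minimal Python example of the contract, using a hypothetical Point class:

```python
class Point:
    """Usable as a dict key: equal points hash equally."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash((self.x, self.y))  # must agree with __eq__
```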
Memory Overhead
Hash table uses extra memory (load factor < 1 means empty slots). Trade-off: space for speed. For dense data, less overhead. For sparse, more wasted space.
Ordered Traversal
Hash tables are unordered (iteration order depends on hash values). If sorted order is needed: use a tree map (O(log n) operations) or sort the keys. Python: dict preserves insertion order (an implementation detail in CPython 3.6, a language guarantee since 3.7) — insertion order, not sorted order.
Applications
Symbol Tables (Compilers)
Map variable names to attributes (type, scope, memory location). Fast lookup during compilation.
Caching
Memoization: cache function results by input. LRU cache: hash table for fast lookup + doubly-linked list for LRU eviction.
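Memoization in one line with the standard library's functools.lru_cache, which combines exactly these two pieces (hash table keyed by arguments plus LRU eviction):

```python
from functools import lru_cache

@lru_cache(maxsize=128)  # results cached in a hash table, evicted LRU-first
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```

Real LRU caches pair the hash table with a doubly linked list so eviction is also O(1); lru_cache does this internally.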
Database Indexing
Hash indexes enable O(1) point queries (WHERE id = 123). Less effective for range queries (WHERE id > 100).
Spell Checking
Dictionary: hash table of valid words. Check if word in dictionary O(1).
Deduplication
Remove duplicates: hash each element, store in set. Duplicates detected (hash already exists).
Password Verification
Store hash of password, not plaintext. On login, hash input, compare hashes. Cryptographic hash crucial for security.