Introduction

Hash tables enable O(1) average-case lookup (vs. O(log n) for balanced trees, O(n) for linear search). Trade-off: O(1) expected performance for a worse O(n) worst case. Ubiquitous: dictionaries in Python, maps in Java/C++, hash maps in most languages.

Core idea: map keys to array indices via a hash function. Key observation: array access is O(1), so if the hash function distributes keys to distinct indices, lookup becomes O(1). Collisions (multiple keys hashing to the same index) require handling: chaining or open addressing.

Fundamental data structure for caching, databases, compilers (symbol tables), memory management (page tables). Performance depends on hash function quality and load factor management.

"Hash tables are the workhorses of modern computing. They enable constant-time operations that would otherwise require logarithmic or linear time." -- Data structure fundamentals

Hash Functions and Properties

Definition

Hash function h: maps key to index in hash table (0 to m-1). h(key) deterministic: same key always maps to same index. Good hash function distributes keys uniformly across indices (minimizes collisions).

Simple example (integers):
h(key) = key % m

For strings:
h(key) = (c1 * 31^(n-1) + c2 * 31^(n-2) + ... + cn) % m
 (polynomial rolling hash)

Desired Properties

  • Uniformity: Keys uniformly distributed across table
  • Efficiency: Quick to compute O(1)
  • Determinism: Same key always same hash
  • Avalanche effect: Small change in key causes large change in hash

Hash Function Quality

Poor function: many collisions. Example: h(key) = key % 10 for keys 10, 20, 30 (all map to 0). Good function: leverage all bits of key, mix bits thoroughly.

Division Method

h(key) = key mod m

Choice of m matters: avoid m = 2^k (the hash then depends only on the low k bits of the key).
Prefer a prime m (better distribution for many key patterns).
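
A quick sketch of why m = 2^k is a poor modulus (the helper `h` is just illustrative):

```python
# Division method: h(key) = key % m.
# With m = 2**k, only the low k bits of the key survive,
# so keys differing only in higher bits all collide.
def h(key, m):
    return key % m

keys = [0x10, 0x20, 0x30, 0x40]   # differ only above bit 4

m_pow2 = 16                        # 2^4: keeps only the low 4 bits
print([h(k, m_pow2) for k in keys])   # [0, 0, 0, 0] -- all collide

m_prime = 17                       # a prime spreads the same keys out
print([h(k, m_prime) for k in keys])  # [16, 15, 14, 13]
```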

Multiplication Method

h(key) = floor(m * (key * A mod 1))

A = (sqrt(5) - 1) / 2 ≈ 0.618 (golden ratio)
Works well for any m. Less dependent on m's properties.
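
The formula translates directly to code; a minimal sketch using the golden-ratio constant above:

```python
import math

A = (math.sqrt(5) - 1) / 2   # golden ratio conjugate, ~0.618

def mult_hash(key, m):
    # h(key) = floor(m * frac(key * A))
    frac = (key * A) % 1.0       # fractional part of key * A
    return math.floor(m * frac)

# Works even when m is a power of two
for key in [10, 20, 30, 40]:
    print(key, mult_hash(key, 16))
```

Because the fractional part of key * A mixes the key's bits, the result does not hinge on m avoiding powers of two.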

Collision Handling: Chaining vs. Open Addressing

Collisions Inevitable

By the pigeonhole principle: if n keys map to m indices and n > m, at least one collision. Even for n ≈ m, expect collisions (birthday paradox: among 23 people, ~50% chance two share a birthday).
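
The birthday figure can be checked directly from the exact no-collision product P(all distinct) = ∏ (1 - i/m):

```python
def p_collision(k, m):
    # Probability that at least two of k uniformly random keys
    # land in the same one of m slots.
    p_distinct = 1.0
    for i in range(k):
        p_distinct *= (m - i) / m
    return 1 - p_distinct

print(round(p_collision(23, 365), 3))   # ~0.507: the birthday paradox
```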

Chaining Approach

Each table index stores linked list of all keys hashing to that index. Insert: compute h(key), add to list. Search: compute h(key), search list. Delete: find and remove from list.

Open Addressing Approach

No separate storage (no linked lists); all entries live in the table itself. On collision, find an empty slot using a probing sequence. Insert: compute h(key); if occupied, try h(key)+1, h(key)+2, etc. until an empty slot is found. More complex but cache-friendly (contiguous memory).

Comparison

Aspect                   Chaining                  Open Addressing
Space                    O(n) + list overhead      O(m), no extra pointers
Worst case (collisions)  O(n) list traversal       O(n) probe sequence
Cache performance        Poor (pointer chasing)    Good (contiguous array)
Simplicity               Simple                    Complex (probe sequences)

Separate Chaining Implementation

Structure

class HashTable:
    def __init__(self, size):
        self.table = [[] for _ in range(size)]  # list of buckets

    def hash(self, key):
        return hash(key) % len(self.table)

    def insert(self, key, value):
        index = self.hash(key)
        for i, (k, v) in enumerate(self.table[index]):
            if k == key:
                self.table[index][i] = (key, value)  # update existing key
                return
        self.table[index].append((key, value))  # insert new key

    def search(self, key):
        index = self.hash(key)
        for k, v in self.table[index]:
            if k == key:
                return v
        return None

    def delete(self, key):
        index = self.hash(key)
        self.table[index] = [(k, v) for k, v in self.table[index] if k != key]

Analysis

Insert, search, delete: O(1 + α) where α = n/m (load factor). For α < 1 (table not full), average O(1). For α >> 1, many collisions, degrades to O(n).

Deletion

Simple: remove from the linked list. No complications, unlike open addressing (which needs tombstones to preserve probe sequences).

Open Addressing Methods

Linear Probing

On collision at h(key), try h(key)+1, h(key)+2, ..., wrapping around.

h_i(key) = (h(key) + i) mod m for i = 0, 1, 2, ...

Problem: primary clustering. Runs of consecutive filled slots grow, lengthening probe sequences for future insertions.
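
A minimal sketch of linear-probing insertion, showing how colliding keys pile into consecutive slots (`insert_linear` is an illustrative helper, not a library function):

```python
def insert_linear(table, key):
    # table: fixed-size list, None marks an empty slot
    m = len(table)
    start = key % m
    for step in range(m):
        j = (start + step) % m       # probe sequence: start, start+1, ...
        if table[j] is None:
            table[j] = key
            return j
    raise RuntimeError("table full")

t = [None] * 7
for k in [10, 17, 24]:               # all hash to 10 % 7 == 3
    insert_linear(t, k)
print(t)   # keys cluster in consecutive slots 3, 4, 5
```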

Quadratic Probing

h_i(key) = (h(key) + c1*i + c2*i^2) mod m

Reduces primary clustering, but keys with the same initial index still follow the same probe sequence (secondary clustering), and for arbitrary m and constants the sequence may not visit every slot.

Double Hashing

h_i(key) = (h1(key) + i*h2(key)) mod m

Two hash functions give each key its own probe sequence, largely eliminating clustering; h2(key) must be nonzero and relatively prime to m so the sequence can reach every slot.

Deletion Complication

Can't simply remove (breaks probe sequences). Solution: tombstones (mark deleted, skip in search, reuse in insert). Degrades performance over time.
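
A sketch of tombstone deletion in a small linear-probing table. The sentinels EMPTY and TOMBSTONE and the class name are illustrative, not a standard API:

```python
EMPTY, TOMBSTONE = object(), object()   # sentinel markers

class OpenTable:
    # Minimal linear-probing table with tombstone deletion (a sketch).
    def __init__(self, m):
        self.slots = [EMPTY] * m

    def _probe(self, key):
        m = len(self.slots)
        for step in range(m):
            yield (hash(key) + step) % m

    def insert(self, key):
        for j in self._probe(key):
            if self.slots[j] in (EMPTY, TOMBSTONE):   # tombstones are reusable
                self.slots[j] = key
                return

    def search(self, key):
        for j in self._probe(key):
            s = self.slots[j]
            if s is EMPTY:                # a true empty slot ends the probe
                return False
            if s is not TOMBSTONE and s == key:
                return True
        return False

    def delete(self, key):
        for j in self._probe(key):
            if self.slots[j] is EMPTY:
                return
            if self.slots[j] == key:
                self.slots[j] = TOMBSTONE  # keep the probe chain intact
                return

t = OpenTable(8)
for k in [8, 16, 24]:        # all collide under hash(k) % 8 == 0
    t.insert(k)
t.delete(16)
print(t.search(24))   # True: the tombstone lets the probe continue past slot 1
```

Simply setting the slot back to EMPTY would make search(24) stop early and wrongly report the key missing.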

Load Factor Limits

Open addressing requires a bounded load factor (commonly < 0.7, i.e., at least 30% of slots empty). Otherwise probe lengths grow rapidly, forcing frequent resizing.

Load Factor and Resizing

Load Factor Definition

α = n / m (elements / table size). Affects performance: higher α means more collisions, longer probe sequences.

Resizing Strategy

When α exceeds threshold (0.75 for chaining, 0.5 for open addressing), resize: allocate new table (typically 2x size), rehash all elements into new table.

def resize(self):
    old_table = self.table
    self.table = [[] for _ in range(len(old_table) * 2)]  # double capacity
    for bucket in old_table:
        for key, value in bucket:
            self.insert(key, value)  # rehash into the larger table

Amortized Analysis

Resizing cost O(n) (rehash all elements). But infrequent (doubles size each resize). Amortized: O(1) per operation including resizing. n operations: total O(n) time.

New Hash Function

Resizing changes m, so the index computation h(key) mod m changes: keys that hashed to index i may land elsewhere in the new table. Every element must be rehashed.

Time Complexity Analysis

Operation   Average                      Worst Case
Search      O(1)                         O(n)
Insert      O(1) amortized               O(n)
Delete      O(1)                         O(n)
Resize      O(1) amortized per insert    O(n) per resize

Why Worst Case O(n)?

All keys hash to same index: one giant collision chain. Then search traverses entire chain O(n). Extremely rare with good hash function and random data. Possible with adversarial input designed to collide.

Average Case O(1) Assumption

Assumes good hash function (uniform distribution) and load factor controlled (constant). Under these conditions, probability of long chains low, expected chain length O(1).

Designing Good Hash Functions

String Hashing

Polynomial rolling hash:
h(s) = (s[0]*p^(n-1) + s[1]*p^(n-2) + ... + s[n-1]) mod m

p = small prime (31 for strings)
Efficient: precompute powers, compute incrementally
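
The incremental computation is Horner's rule; a sketch with assumed parameters p = 31 and a large prime modulus:

```python
def poly_hash(s, p=31, m=10**9 + 9):
    # h(s) = (s[0]*p^(n-1) + ... + s[n-1]) mod m, evaluated with
    # Horner's rule so each character costs O(1) (no explicit powers).
    h = 0
    for ch in s:
        h = (h * p + ord(ch)) % m
    return h

print(poly_hash("cat"))                       # 98262
print(poly_hash("cat") == poly_hash("cat"))   # deterministic: True
print(poly_hash("cat") == poly_hash("act"))   # order matters: False
```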

Universality

Universal hash family: for any two distinct keys x, y, P(h(x) = h(y)) <= 1/m over the random choice of h. Probabilistic guarantee of few collisions regardless of input. Example: the MAD (multiply-add-divide) family h(x) = ((a*x + b) mod p) mod m, with prime p and random a, b.
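
A sketch of drawing one function from the MAD family (the prime and the helper name are illustrative choices):

```python
import random

def make_universal_hash(m, p=2_147_483_647):   # p: a Mersenne prime > max key
    # MAD family: h_{a,b}(x) = ((a*x + b) mod p) mod m,
    # with a in [1, p-1] and b in [0, p-1] chosen at random.
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % m

h = make_universal_hash(m=100)
print(h(42) == h(42))    # deterministic once a, b are fixed: True
print(0 <= h(42) < 100)  # always lands in the table: True
```

Picking a fresh (a, b) per table instance is what defeats adversarial inputs crafted against any single fixed hash function.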

Avoiding Patterns

Hash functions can be vulnerable to patterned input. Example: even a prime modulus fails if all keys are multiples of that prime (everything hashes to 0). Use the multiplication method or a universal family instead.

Implementation in Languages

Python: built-in hash() is randomized for strings (differs across runs). Java: hashCode() is overridable (String.hashCode() is content-based). C++: std::hash function-object template.

Cryptographic Hashing

Difference from Hash Tables

Cryptographic hash: one-way function (infeasible to invert). Used for security. Hash table hash: fast but makes no security guarantees (given a hash value, finding a matching input is easy).

Properties

  • Deterministic: Same input always same hash
  • Quick: Fast to compute
  • Avalanche: Small change large output change
  • Pre-image resistant: Hard to find input given hash
  • Collision resistant: Hard to find two inputs with same hash

Examples

MD5 (broken, don't use), SHA-1 (deprecated), SHA-256, SHA-3. Used for password hashing, data integrity checking, blockchain, digital signatures.

Not for Hash Tables

Cryptographic hashes slower (designed for security not speed). Hash table hashes faster (designed for speed not security). Different use cases.

Practical Considerations

Language Implementations

Python dict, Java HashMap, C++ unordered_map: all hash tables. CPython's dict uses open addressing; Java's HashMap and C++'s unordered_map use chaining (Java converts long chains to balanced trees).

Hash Collisions Attack

Adversarial input designed to cause collisions (all keys hash to the same index). DoS attack: the hash table degenerates to O(n) per operation. Mitigation: randomized/keyed hash functions (e.g., SipHash, used by Python for strings) and input-size limits.

Custom Classes

If using a custom class as a key: override __hash__() and __eq__(). Ensure consistency: equal objects must have equal hashes. Unequal objects may share a hash.
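
A minimal sketch of a hashable custom class in Python (the class is illustrative):

```python
class Point:
    # Custom key class: __eq__ and __hash__ must agree so that
    # equal points always land in the same bucket.
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash((self.x, self.y))   # delegate to tuple hashing

d = {Point(1, 2): "home"}
print(d[Point(1, 2)])   # a distinct but equal object finds the entry: "home"
```

Defining __eq__ without __hash__ makes a class unhashable in Python 3, so both must be written together.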

Memory Overhead

Hash table uses extra memory (load factor < 1 means empty slots). Trade-off: space for speed. For dense data, less overhead. For sparse, more wasted space.

Ordered Traversal

Hash tables are unordered (order depends on hash values). If sorted order is needed: use a tree map (O(log n) operations) or sort the keys. Python dicts preserve insertion order (a language guarantee since 3.7; an implementation detail in CPython 3.6).

Applications

Symbol Tables (Compilers)

Map variable names to attributes (type, scope, memory location). Fast lookup during compilation.

Caching

Memoization: cache function results by input. LRU cache: hash table for fast lookup + doubly-linked list for LRU eviction.
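
Memoization backed by a hash table (plus LRU eviction) is built into Python as functools.lru_cache:

```python
from functools import lru_cache

@lru_cache(maxsize=128)   # hash table keyed by arguments + LRU eviction
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))                    # 832040, computed in O(n) calls
print(fib.cache_info().hits > 0)  # True: repeated subproblems were cached
```

Without the cache this naive recursion takes exponential time; with it, each distinct argument is computed once.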

Database Indexing

Hash indexes enable O(1) point queries (WHERE id = 123). Less effective for range queries (WHERE id > 100).

Spell Checking

Dictionary: hash table of valid words. Check if word in dictionary O(1).

Deduplication

Remove duplicates: hash each element, store in set. Duplicates detected (hash already exists).
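
An order-preserving deduplication pass, sketched with a hash set (`dedupe` is an illustrative name):

```python
def dedupe(items):
    # Hash-set membership tests are O(1) on average,
    # so the whole pass is O(n); first occurrences are kept in order.
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

print(dedupe([3, 1, 3, 2, 1]))   # [3, 1, 2]
```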

Password Verification

Store a hash of the password, not plaintext. On login, hash the input and compare hashes. A cryptographic hash is essential; in practice use a slow, salted password-hashing function (bcrypt, scrypt, Argon2) rather than a bare fast hash.
