Introduction

Hash index: indexes data using hash functions for ultra-fast exact-match lookups. Core principle: compute hash(key) -> bucket location, retrieve value directly. O(1) average-case performance vs. O(log n) for B-Trees. Trade-off: cannot support range queries or sorting.

Foundation: hash tables extended for persistent storage. Single-node: straightforward hash table. Disk-based: organize buckets on disk blocks. Distributed: hash partitioning across servers.

Usage: primary indexes in databases (PostgreSQL HASH index), cache implementations, in-memory stores. Preferred when exact-match queries dominant, range queries rare. Fastest access method when applicable.

"Hash indexes achieve constant-time lookup through direct addressing. Ultimate speed for exact matches, but sacrifices ordering and range query capability. Perfect for equality-focused workloads." -- Index structures

Hash Function Fundamentals

Hash Function Properties

Good hash function: (1) deterministic (same key always hashes to same value), (2) uniform distribution (all buckets equally likely), (3) fast computation (minimal overhead), (4) minimal collisions (different keys rarely hash to same bucket).

Common Hash Functions

Division method: h(key) = key mod m (m = bucket count). Fast, simple. Weakness: poor distribution if m not chosen carefully (avoid powers of two; a prime away from a power of two works well). Multiplication method: h(key) = floor(m * (key*A mod 1)), with A = (sqrt(5) - 1)/2 ≈ 0.618 (Knuth's suggestion). Better distribution, slightly slower, choice of m less critical. Cryptographic: SHA-256 and similar (overkill for indexing: slow; MD5 and SHA-1 are additionally broken for security use).
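Both methods can be sketched in a few lines of Python (illustrative only; the key and bucket count below are arbitrary example values):

```python
import math

def h_division(key: int, m: int) -> int:
    """Division method: fast; m should be prime, away from powers of two."""
    return key % m

def h_multiplication(key: int, m: int) -> int:
    """Multiplication method with Knuth's constant A = (sqrt(5) - 1) / 2."""
    A = (math.sqrt(5) - 1) / 2          # ~0.6180339887
    frac = (key * A) % 1.0              # fractional part of key * A
    return int(m * frac)                # scale into [0, m)

print(h_division(1234, 101))            # -> 22
print(h_multiplication(1234, 101))      # some bucket in [0, 101)
```

Note that with m a power of two, the division method looks only at the key's low bits; the multiplication method mixes all bits, which is why m matters less there.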

Universal Hashing

Family of hash functions: randomly choose function from family. Worst-case collision probability bounded. Protection against adversarial key sequences. Cost: need multiple functions or randomization.
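A minimal sketch of the classic Carter-Wegman family h(k) = ((a*k + b) mod p) mod m, drawing one random member per table (p here is the Mersenne prime 2^31 - 1, an arbitrary but common choice):

```python
import random

def make_universal_hash(m: int, p: int = 2_147_483_647):
    """Draw one member of the Carter-Wegman family
    h(k) = ((a*k + b) mod p) mod m, with prime p larger than any key."""
    a = random.randrange(1, p)          # a in [1, p-1]
    b = random.randrange(0, p)          # b in [0, p-1]
    return lambda k: ((a * k + b) % p) % m

h = make_universal_hash(64)
print(h(42))   # some bucket in [0, 64); varies with the random draw
```

Because a and b are chosen at table-creation time, an adversary cannot precompute a colliding key set.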

Quality Metrics

Measure distribution: chi-square test (statistical uniformity). Actual usage: collision rate, load factor, clustering (adjacent collisions). Monitor: detect poor distribution early.
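The chi-square statistic over observed bucket counts is straightforward to compute (compare the result against a chi-square critical value for m - 1 degrees of freedom; the acceptance threshold is an application choice):

```python
def chi_square_uniformity(counts):
    """Chi-square statistic of observed bucket counts vs. a uniform
    expectation; compare to the critical value for m - 1 degrees of freedom."""
    n, m = sum(counts), len(counts)
    expected = n / m
    return sum((c - expected) ** 2 / expected for c in counts)

print(chi_square_uniformity([25, 24, 26, 25]))  # small: near-uniform
print(chi_square_uniformity([97, 1, 1, 1]))     # large: badly skewed
```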

Key Representation

Integer keys: direct hashing. String keys: convert to integer (sum bytes, polynomial rolling hash). Composite keys: combine component hashes. Goal: consistent, uniform numeric representation.
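A polynomial rolling hash for strings, sketched in Python (base 31 is a conventional but arbitrary choice):

```python
def string_hash(s: str, m: int, base: int = 31) -> int:
    """Polynomial rolling hash: treat the characters as base-31 digits."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % m    # fold each character in
    return h

print(string_hash("hash index", 1009))  # deterministic bucket in [0, 1009)
```

Unlike byte-summing, the polynomial form distinguishes anagrams ("abc" vs. "acb"), giving a much more uniform distribution on real key sets.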

Bucket Organization and Storage

Static Hash Organization

Fixed bucket count at creation. Hash table: array of m buckets. Each bucket: storage for entries (chaining) or attempt to store in-place (open addressing). Allocate all buckets upfront.

Chaining Buckets

Each bucket: linked list of entries. Collision: append to list. Lookup: compute hash, search list. Simple, supports deletion easily. Weakness: extra memory for links, cache misses traversing chains.
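A minimal chaining table in Python (a sketch only: fixed bucket count, no resizing, Python lists standing in for linked lists):

```python
class ChainedHash:
    """Minimal separate-chaining table: fixed bucket count, no resizing."""
    def __init__(self, m: int = 8):
        self.m = m
        self.buckets = [[] for _ in range(m)]   # one chain per bucket

    def _chain(self, key):
        return self.buckets[hash(key) % self.m]

    def put(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:                        # key present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))              # collision: append to chain

    def get(self, key):
        for k, v in self._chain(key):           # walk the chain
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self._chain(key)
        chain[:] = [(k, v) for k, v in chain if k != key]

t = ChainedHash(4)
t.put("a", 1); t.put("b", 2)
t.delete("a")
print(t.get("a"), t.get("b"))                   # None 2
```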

Open Addressing Buckets

All entries in array. Collision: find empty bucket using probe sequence. Linear probing: check next slot, then next+1, etc. Quadratic probing: probe offsets quadratic. Double hashing: use second hash. No extra memory, but clustering possible.

Extendible Hashing

Dynamic directory: array of bucket pointers. Buckets: fixed size. Bucket overflow: split the bucket; the directory doubles only when the splitting bucket's local depth already equals the global depth (several directory entries may point at the same bucket). Structure is flat, not tree-like: every lookup costs one directory access plus one bucket access.

Bucket Size and Load Factor

Bucket size B: entries per bucket. Load factor alpha: entries / buckets. Keep alpha low (less than 0.75): minimize collisions. High load: many collisions, slower access. Balance: memory vs. speed.

Collision Handling Mechanisms

Separate Chaining

Collision: append to bucket's linked list. Multiple entries per bucket. Lookup: search chain until key found. Insertion: check uniqueness, add if absent. Deletion: remove from chain. Simple, general-purpose.

Complexity with Chaining

Average case: O(1 + alpha) where alpha load factor. If alpha small (well-sized table), near O(1). Worst case: O(n) if all keys hash to same bucket (poor function or adversarial). Practice: usually O(1) with good function.

Linear Probing

Collision at bucket i: try i+1, then i+2, etc. (mod m). Simple, cache-friendly (contiguous memory). Problem: primary clustering (adjacent occupied slots slow searches). Deletion: mark as deleted (keep probes working), later rebuild.
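Linear probing with tombstone deletion, sketched below (toy code: fixed size, no resizing, and the simplified insert may reuse a tombstone even when the same key sits deeper in the chain):

```python
EMPTY, DELETED = object(), object()             # slot sentinels

class LinearProbeHash:
    """Open addressing, linear probing, tombstone deletion."""
    def __init__(self, m: int = 8):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key):
        i = hash(key) % self.m
        for _ in range(self.m):
            yield i
            i = (i + 1) % self.m                # linear step to next slot

    def put(self, key, value):
        for i in self._probe(key):
            s = self.slots[i]
            if s is EMPTY or s is DELETED or s[0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table full")

    def get(self, key):
        for i in self._probe(key):
            s = self.slots[i]
            if s is EMPTY:                      # true empty ends the search
                return None
            if s is not DELETED and s[0] == key:
                return s[1]
        return None

    def delete(self, key):
        for i in self._probe(key):
            s = self.slots[i]
            if s is EMPTY:
                return
            if s is not DELETED and s[0] == key:
                self.slots[i] = DELETED         # tombstone keeps probes valid
                return

t = LinearProbeHash()
t.put("x", 1); t.put("y", 2)
t.delete("x")
print(t.get("x"), t.get("y"))                   # None 2
```

The DELETED sentinel is the crucial detail: writing EMPTY instead would cut probe chains and make later keys unreachable.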

Quadratic Probing

Probe h(key) + 1^2, h(key) + 2^2, h(key) + 3^2, ... (mod m). Reduces primary clustering vs. linear. May not probe all buckets (probe sequence can cycle). Implementation: requires careful parameter selection (e.g., with m prime and load factor below 0.5, an empty slot is guaranteed to be found).

Double Hashing

Primary hash: h1(key) = bucket. Collision: use h2(key) as probe step. Try h1 + h2, h1 + 2*h2, ... (mod m). Best distribution among the probing schemes; requirement: h2(key) nonzero and coprime to m (simplest: m prime, h2 in [1, m-1]). Cost: two hash functions per key.
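With m prime and h2 in [1, m-1], the probe sequence is guaranteed to visit every bucket; a small sketch verifies this (h2 is derived from Python's built-in hash here, purely for illustration):

```python
def double_hash_probe(key, m):
    """Yield the probe sequence h1 + i*h2 (mod m); with m prime,
    h2 in [1, m-1] is always coprime to m, so all buckets get visited."""
    h1 = hash(key) % m
    h2 = 1 + (hash(key) % (m - 1))      # step size, never zero
    for i in range(m):
        yield (h1 + i * h2) % m

seq = list(double_hash_probe("user:42", 13))
print(sorted(seq) == list(range(13)))   # True: every bucket visited once
```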

Deletion Challenges

Chaining: straightforward (remove from list). Open addressing: mark as deleted (not empty), maintain probe chains. Rebuild hash table periodically: compact, remove deleted entries. Complexity: deletion not simple in open addressing.

In-Memory Hash Indexes

Speed Advantage

RAM-based: microsecond latencies. Direct memory access: no disk I/O. Cache-friendly: entry pointers to records. Ultra-responsive. Ideal: working set fits in RAM.

Exact-Match Queries

Perfect use case: single key lookup. Compute hash, access bucket, retrieve row ID. O(1) constant time. Beats B-Trees (O(log n)) significantly. Preferred for equality filters.

Memory Constraints

Limited by RAM size. Large datasets: cannot index entire table. Solution: cache frequently accessed keys, use disk-based index for cold data. Tiered approach: hot/cold data split.

Persistence Issues

Pure in-memory: volatile. Crash loses index. Options: write-ahead logging (log changes), periodic snapshots, dual index (in-memory + disk backup). Cost: reduced performance vs. pure memory.

Scalability Limits

Single server: limited by RAM (256GB-1TB typical). Distributed hashing: partition across servers (consistent hashing). Each server holds subset. Scalable but complex (network latency dominates).

Disk-Based Hash Indexes

Disk Organization

Hash table on disk: buckets stored in blocks (4KB typical). Primary bucket array: small, fits in memory. Bucket blocks: on disk, accessed on collision. Two-level organization: in-memory array + disk blocks.

Bucket Blocks

Disk block: multiple entries (key-value or row IDs). Hash points to block. Read block: get all entries. Search block: linear scan (small, acceptable). Collision: overflow block (linked blocks).

I/O Cost

Best case: one disk read (primary bucket fits). Collision: additional reads (overflow blocks). Worst case: many collisions -> many reads. Average: 1-2 reads typical (well-designed hash).

Overflow Handling

Bucket full: allocate overflow block. Chain: linked blocks. Problem: many overflows -> many reads. Solution: dynamic hashing (resize), increase bucket size.

Dynamic Resizing

When load factor high: double bucket count. Rehash all entries: distribute to new buckets. Expensive operation (O(n)). Amortized: O(1) per insertion. Happens rarely, impact minimal.
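The resize step can be sketched as: double the bucket array, then rehash every existing entry (a toy chaining table; the 0.75 threshold matches the guidance above):

```python
class ResizableHash:
    """Chaining table that doubles and rehashes when load factor > 0.75."""
    def __init__(self, m: int = 4):
        self.buckets = [[] for _ in range(m)]
        self.n = 0                              # entry count

    def put(self, key, value):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)
                return
        chain.append((key, value))
        self.n += 1
        if self.n / len(self.buckets) > 0.75:   # load factor check
            self._resize()

    def _resize(self):
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]   # double
        for chain in old:                       # O(n): rehash every entry
            for k, v in chain:
                self.buckets[hash(k) % len(self.buckets)].append((k, v))

    def get(self, key):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return None

t = ResizableHash(2)
for i in range(20):
    t.put(i, i * i)
print(len(t.buckets), t.get(7))     # bucket count grew past 2; value is 49
```

Each resize touches every entry, but because capacity doubles, a key is rehashed only O(log n) times over n insertions, giving the amortized O(1) cost.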

Dynamic Hashing

Extendible Hashing

Directory: array of bucket pointers (size 2^d, d depth). Hash: take d bits of hash value as index. Bucket: store entries. Overflow: split bucket, double directory if needed. Adaptive: directory grows as needed.
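A toy sketch of the directory/local-depth mechanics (bucket size 2 to force splits quickly; omits deletion, directory shrinking, and the edge case where a split fails to separate keys):

```python
class ExtendibleHash:
    """Toy extendible hashing: global depth d, directory of 2**d pointers."""
    BUCKET_SIZE = 2                     # tiny, to force splits quickly

    def __init__(self):
        self.d = 1                      # global depth
        self.dir = [{"depth": 1, "items": {}}, {"depth": 1, "items": {}}]

    def _bucket(self, key):
        return self.dir[hash(key) & ((1 << self.d) - 1)]   # low d bits

    def get(self, key):
        return self._bucket(key)["items"].get(key)

    def put(self, key, value):
        b = self._bucket(key)
        b["items"][key] = value
        while len(b["items"]) > self.BUCKET_SIZE:
            if b["depth"] == self.d:    # no spare bit: double the directory
                self.dir = self.dir + self.dir
                self.d += 1
            self._split(b)
            b = self._bucket(key)       # re-locate after the split

    def _split(self, b):
        b["depth"] += 1
        sibling = {"depth": b["depth"], "items": {}}
        mask = 1 << (b["depth"] - 1)    # the newly significant hash bit
        for k in list(b["items"]):      # redistribute by that bit
            if hash(k) & mask:
                sibling["items"][k] = b["items"].pop(k)
        for i in range(len(self.dir)):  # repoint half of b's dir entries
            if self.dir[i] is b and (i & mask):
                self.dir[i] = sibling

e = ExtendibleHash()
for i in range(10):
    e.put(i, str(i))
print(e.d, e.get(7))                    # depth grew; value "7" retrievable
```

Note how a split touches only the overflowing bucket and its directory entries; all other buckets are untouched, which is the point of the scheme.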

Advantages

No full-table rehashing required (buckets split selectively). Directory stays small even with millions of entries. Performance: 1-2 disk accesses per lookup. Locality: related keys may share a bucket.

Directory Growth

Bucket split: create second bucket, redistribute entries by (d+1)th bit. Directory double: add level. Gradual: only affected buckets change. Efficient: most buckets stable.

Linear Hashing

Alternative: grow table incrementally (one bucket at time). No directory: use hash family. Next pointer: identifies split boundary. Splits triggered by load factor. More complex but also dynamic.

Practical Considerations

PostgreSQL hash index: a linear-hashing variant, grows automatically (WAL-logged and crash-safe since PostgreSQL 10). Other systems: extendible or linear hashing, or static tables requiring a manual rebuild to resize. Choice: trade complexity for flexibility. Static simpler, dynamic better scalability.

Performance Characteristics

Equality Lookup

Average: O(1) with good hash function. Disk accesses: 1-2 (primary bucket + possibly one overflow). Microseconds (in-memory) or milliseconds (disk). Beats B-Trees significantly.

Range Queries

Not supported efficiently. Hash doesn't preserve order. Range [a, b]: no guidance from hash. Solution: full table scan (no index benefit) or use B-Tree. Hash unsuitable for range queries.

Sorting and Ordering

Hash: no natural order. Retrieve rows, sort in memory. Expensive for large results. B-Trees support: leaves linked (sequential scan). Hash: sequential access inefficient.

Insertion/Deletion

Find location: O(1) hash lookup. Insert/delete: O(1) operation. Resize (dynamic): O(n) amortized to O(1). Overall: insertion/deletion very fast.

Benchmark Comparison

Operation         Hash Index    B-Tree Index    Notes
Equality lookup   O(1) avg      O(log n)        Hash much faster
Range query       O(n)          O(log n + k)    B-Tree efficient
Insert/delete     O(1) avg      O(log n)        Hash faster
Sorted output     O(n log n)    O(n)            B-Tree leaves already ordered

Hash vs. B-Tree Indexes

Equality Performance

Hash wins decisively: O(1) vs. O(log n). Actual latency: hash typically faster (fewer CPU instructions, cache-friendly). When equality primary query: hash preferred.

Range Query Support

B-Trees support naturally: ordered, contiguous scans. Hash: no guidance, full scan required. Application: if range queries needed, B-Tree mandatory. Mixed workload: B-Tree safer choice.

Ordering and Sorting

B-Trees: leaf nodes linked, sequential access efficient. Hash: random scattered entries, sorting required. Application: ORDER BY queries favor B-Trees. Hash inefficient for sorted results.

Scalability

Hash: grows gracefully (dynamic hashing handles growth). B-Tree: also scalable (auto-balancing). Both suitable for large datasets. Choice: query patterns.

Space Overhead

Hash: exact allocation (no over-provisioning needed). B-Trees: node overhead, some wasted space. Hash typically smaller. Compression: both benefit.

Concurrent Access

Hash: bucket-level locking possible. B-Trees: node-level locking. Both support concurrency. Implementation-dependent, no clear winner.

Implementation Considerations

Hash Function Selection

Speed: minimize computation (division method fast). Distribution: test with workload keys. Robustness: avoid adversarial degradation. Test: benchmark actual performance.

Bucket Size Configuration

Disk-based: match block size (4KB). In-memory: balance memory/speed. Load factor: keep less than 0.75 (good collision distribution). Monitor: adjust if performance degrades.

Collision Strategy Choice

Chaining: simpler, cleaner deletion. Open addressing: memory-efficient, cache-friendly. Hybrid: combine (chaining buckets with open addressing within). Profile: choose based on workload.

Dynamic Sizing

Static: simpler, predictable. Dynamic (extendible): better for unpredictable growth. Trade-off: complexity vs. flexibility. Most modern systems: dynamic (transparent to users).

Persistence Strategy

Pure in-memory: volatile, rebuild on restart. Write-ahead logging: durable but slower. Snapshots: periodic save. Choose: recovery requirement vs. performance.

Concurrency Control

Lock-based: simple, potential contention. Optimistic: detect conflicts, retry. Lock-free: complex but contention-free. Workload-dependent: choose accordingly.

Applications and Use Cases

Cache Implementations

Perfect use case: hash ideal for fast lookups. Memcached, Redis: hash tables. User sessions: hash(session_id) -> data. Reduces database load, speeds response.

Primary Key Indexes

Database primary keys: hash index excellent. Equality queries common. CREATE INDEX idx ON table USING HASH (id). Fastest lookups for direct row access.

Join Operations

Hash join: build hash table from inner table, probe with outer. Efficient for equality joins. Both tables small: both in-memory hashes. One large: disk-based hash.
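The build/probe phases in miniature (table and column names here are invented for the example):

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equality hash join: hash the (smaller) build input once,
    then probe it with each row of the other input -- O(1) per probe."""
    table = {}
    for row in build_rows:                      # build phase
        table.setdefault(row[build_key], []).append(row)
    for row in probe_rows:                      # probe phase
        for match in table.get(row[probe_key], []):
            yield {**match, **row}

users = [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]
orders = [{"user_id": 1, "total": 30}, {"user_id": 1, "total": 5}]
print(list(hash_join(users, orders, "id", "user_id")))
```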

Aggregation

GROUP BY: hash table tracks aggregates. Key: group column, value: aggregate (sum, count, avg). Hash lookup: O(1) per row. Final: single scan through buckets.
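Hash aggregation for a GROUP BY ... SUM, sketched (column names invented):

```python
def hash_group_by(rows, key_col, val_col):
    """Hash aggregation: one O(1) bucket lookup per input row,
    then a single final pass over the buckets emits one row per group."""
    groups = {}
    for row in rows:
        k = row[key_col]
        groups[k] = groups.get(k, 0) + row[val_col]   # update aggregate
    return groups

sales = [{"region": "EU", "amt": 10}, {"region": "US", "amt": 7},
         {"region": "EU", "amt": 5}]
print(hash_group_by(sales, "region", "amt"))   # {'EU': 15, 'US': 7}
```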

Deduplication

Remove duplicates: hash table tracks seen keys. O(1) check per row. Efficient: single pass through data. Applications: data cleaning, streaming aggregation.
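Single-pass deduplication with a hash set of seen keys (the key column is illustrative):

```python
def dedupe(rows, key):
    """Single-pass dedup: hash set gives O(1) seen-before checks."""
    seen = set()
    for row in rows:
        k = row[key]
        if k not in seen:
            seen.add(k)                 # first occurrence: keep it
            yield row

events = [{"id": 1}, {"id": 2}, {"id": 1}]
print(list(dedupe(events, "id")))       # [{'id': 1}, {'id': 2}]
```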

In-Memory Databases

In-memory engines (e.g., H2's in-memory mode) and stores such as Redis: hash indexes for in-RAM data. Ultra-fast: microsecond latencies. Suitable: high-frequency trading, real-time analytics, session stores.

Limitations and Alternatives

No Range Query Support

Fundamental limitation: hash doesn't preserve order. Range [a, b]: no index guidance. Workaround: B-Tree for range-heavy queries. Hybrid: multiple indexes (hash + B-Tree).

Worst-Case Degradation

Poor hash function: all keys same bucket, O(n) lookup. Adversarial keys: trigger collisions. Mitigation: universal hashing, randomization. Risk: denial-of-service attacks.

Resizing Cost

Static hash: manual REINDEX required. Dynamic: automatic but not free (background work). Resizing: expensive O(n) operation. Plan: avoid frequent resizing.

Memory Overhead

Hash table: extra space for pointers, deleted entries. Not as compact as B-Trees. Concern: memory-constrained environments (embedded systems). Solution: careful sizing.

Ordering Inability

Cannot provide sorted output. Applications: ORDER BY queries need separate sorting. Expensive: memory sort for large results. B-Trees naturally sorted (more efficient).

When to Avoid

Range queries important: use B-Tree. Sorted access common: B-Tree better. Unknown query patterns: B-Tree safer. Only equality queries: hash preferred. Hybrid indexes: multiple index types.
