Introduction

Hash function: maps arbitrary data to fixed-size index. Input: any size. Output: array index (0 to m-1). Goal: uniform distribution (minimize collisions). Challenge: deterministic (same input always same output), unpredictable (no pattern).

"Hash functions are the foundation of hash tables: convert keys into array indices with remarkable efficiency. Good functions balance simplicity with distribution quality, enabling O(1) average-case performance." -- Data structures

Hash Function Definition

Formal Definition

Hash function h: U -> {0, 1, ..., m-1}. Input domain U: keys (strings, numbers). Output range: array size m. Property: deterministic (h(x) always same). Desirable: distribute uniformly across range.

Key Properties

Deterministic: h(k) returns same value always. Efficient: O(1) computation. Distribution: uniform (minimize clumping). Fast computation: outweighs perfect distribution.

Collisions Are Inevitable

Pigeonhole principle: keys > indices, collisions must occur. Goal: minimize collisions probabilistically. Perfect hash: no collisions (requires knowing all keys upfront).

Collision Example

h("Alice") = 5  (slot 5)
h("Bob")   = 5  (slot 5)
COLLISION: both hash to same slot
Solution: collision resolution (chaining, open addressing)
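A minimal chaining sketch in Python, with a contrived `toy_hash` (hypothetical, for illustration only) that forces "Alice" and "Bob" into slot 5 as in the example:

```python
# Separate chaining: each slot holds a list of (key, value) pairs.
# toy_hash is contrived so "Alice" and "Bob" both land in slot 5,
# reproducing the collision above.
def toy_hash(key, m=10):
    return 5 if key in ("Alice", "Bob") else sum(map(ord, key)) % m

table = [[] for _ in range(10)]

def put(key, value):
    bucket = table[toy_hash(key)]
    for i, (k, _) in enumerate(bucket):
        if k == key:                    # overwrite an existing key
            bucket[i] = (key, value)
            return
    bucket.append((key, value))         # append on collision

def get(key):
    for k, v in table[toy_hash(key)]:
        if k == key:
            return v
    return None

put("Alice", 1)
put("Bob", 2)                           # collides with "Alice" in slot 5
```

Open addressing would instead probe for the next free slot within the array itself.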

Design Criteria

Uniformity

Keys distributed evenly across array. Prob(h(k) = i) = 1/m for all i. Mathematical: minimize variance of distribution. Practical: avoid clustering.

Determinism

Same key always hashes to same slot. Essential: lookup must find inserted values. Consistency: critical for correctness.

Efficiency

O(1) computation time. Complex functions: negate hash table advantage. Trade-off: simplicity vs distribution quality (usually prefer simple).

Avalanche Effect

Tiny input change: completely different output. Bit flip in key: ideally changes hash completely. Prevents clustering (similar keys near each other).

Table Size Independence

Function: independent of m (array size). Scaling: the same hash function works for any table size. Growing the table: existing keys must be rehashed into the new range, but the underlying hash function stays the same.

Division Method

Formula

h(k) = k mod m
Simple: remainder of key divided by table size
Example: k = 17, m = 10: h(17) = 17 mod 10 = 7
Range: 0 to m-1 (exact fit)
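The division method is a one-liner; a sketch checked against the example values:

```python
def division_hash(k, m):
    # h(k) = k mod m: the remainder picks a slot in 0..m-1
    return k % m

print(division_hash(17, 10))  # slot 7, matching the example
```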

Advantages

Simplicity: trivial implementation. Speed: single operation (modulo). Uniform: works well with good m.

Disadvantages

Clustering: keys cluster when they share common factors with m. Bad m values: m = 2^k (h(k) uses only the low-order bits, ignoring the high bits), m divisible by small primes.

Optimal m

Prime number: best practice. Example: m = 97, 101, 107. Avoids factor-related clustering. Choosing m slightly larger than needed also lowers the load factor, reducing collisions.

Example Good vs Bad

Good: m = 7 (prime)
Bad: m = 8 (power of 2: h(k) depends only on the low bits of k)
Bad: m = 6 (composite, 2*3: keys sharing those factors cluster)

Multiplication Method

Formula

h(k) = floor(m * (k*A - floor(k*A)))
Where A is a constant, 0 < A < 1, typically (sqrt(5)-1)/2 ≈ 0.618
Fractional part: k*A - floor(k*A)

How It Works

Example: k = 16, m = 10, A = 0.618
k*A = 16 * 0.618 = 9.888
Fractional part: 9.888 - 9 = 0.888
h(16) = floor(10 * 0.888) = floor(8.88) = 8
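The method fits in a few lines; here A defaults to the full-precision golden-ratio constant rather than the truncated 0.618 above, which still gives the same slot for this example:

```python
import math

def multiplication_hash(k, m, A=(math.sqrt(5) - 1) / 2):
    # h(k) = floor(m * frac(k*A)), with A the golden-ratio constant
    frac = (k * A) % 1.0          # fractional part of k*A
    return math.floor(m * frac)

print(multiplication_hash(16, 10))  # 8, matching the worked example
```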

Advantages

Distribution: good regardless of m. Works: with any m (not limited to primes). Avalanche: bit changes propagate well.

Disadvantages

Complexity: more operations than division. Floating-point: precision issues. Implementation: trickier than modulo.

Why Golden Ratio

A = (sqrt(5)-1)/2 ≈ 0.618 (golden ratio). Mathematical: provably good distribution. Historical: empirically best-known constant.

Mid-Square Method

Algorithm

1. Square the key: k^2
2. Extract middle digits
3. Modulo to range
Example: k = 3121, m = 100
k^2 = 9,740,641
Middle digits: 740
h(3121) = 740 mod 100 = 40
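A sketch of mid-square hashing. "Middle digits" is a convention choice: this version centers the extraction window, so for k = 3121 it picks 406 rather than the 740 used in the worked example, and the resulting slot differs accordingly.

```python
def mid_square_hash(k, m, width=3):
    # square the key, pull `width` digits from the center, reduce mod m
    sq = str(k * k)
    start = max((len(sq) - width) // 2, 0)
    middle = int(sq[start:start + width])
    return middle % m

print(mid_square_hash(3121, 100))  # 6 with this centering (406 mod 100)
```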

Advantages

Avalanche: squaring mixes bits well. Simple: easy to understand. Good distribution: empirically works well.

Disadvantages

Limited range: overflow possible (k^2 very large). Implementation: requires multiplication. Not standard: less portable.

When to Use

Small keys: avoids overflow. Integer keys: natural fit. Simple implementations: embedded systems.

Folding Method

Algorithm

1. Divide key into parts
2. Add parts (shift and add)
3. Modulo to range
Example: k = 123456789, m = 100
Divide: 123, 456, 789
Add: 123 + 456 + 789 = 1368
h(k) = 1368 mod 100 = 68
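The three steps above can be sketched as:

```python
def folding_hash(k, m, chunk=3):
    # split the decimal digits into fixed-size parts, sum them, reduce mod m
    s = str(k)
    total = sum(int(s[i:i + chunk]) for i in range(0, len(s), chunk))
    return total % m

print(folding_hash(123456789, 100))  # 123 + 456 + 789 = 1368 -> 68
```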

Variants

Shift and add: shift alternate parts, then add. Folding at boundaries: reverse alternate parts (different distribution).

Advantages

Simple: easy to code. Handles long keys: process chunks. Good: reasonable distribution for strings.

Disadvantages

Weaker: less avalanche than multiplication. Order-insensitive: rearranging the parts gives the same sum, so permuted keys collide. Clustering: some patterns cluster.

Use Case

String hashing: break string into chunks. Legacy systems: historical method (simpler than modular).

Universal Hashing

Concept

Family of hash functions: multiple functions, random selection. Guarantee: for any two keys, small probability same hash (collision). Proof: mathematical bound on collision rate.

Formula

h(k) = ((a*k + b) mod p) mod m
Where:
p = prime > |U| (universe size)
a, b = random, 1 <= a < p, 0 <= b < p
m = hash table size
Different (a, b) pairs = different hash functions
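A sketch of drawing one function from this family. The Mersenne prime 2^31 - 1 for p is an assumption here; any prime larger than the key universe works:

```python
import random

P = 2_147_483_647  # Mersenne prime 2^31 - 1; must exceed the key universe

def make_universal_hash(m, p=P):
    # drawing (a, b) at random selects one function from the family
    a = random.randrange(1, p)   # 1 <= a < p
    b = random.randrange(0, p)   # 0 <= b < p
    return lambda k: ((a * k + b) % p) % m

h = make_universal_hash(100)
```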

Collision Probability

Theorem: Prob(h(k1) = h(k2)) <= 1/m over the random choice of (a, b). Guarantee: any two distinct keys collide with probability at most 1/m (essentially optimal).

Advantages

Provable: mathematical guarantees. Adversary-resistant: no bad input distribution. Flexible: choose function for worst case.

Disadvantages

Overhead: modular arithmetic expensive. Randomization: must store (a,b). Complexity: not simple implementation.

When to Use

Security-critical: adversarial input. Theoretical: need guarantees. Performance-critical: collision handling expensive.

Cryptographic Hash Functions

Properties

Deterministic: same input always same output. Quick: fast computation. Avalanche: tiny change, completely different output. One-way: cannot reverse (find input from output). Collision-resistant: hard to find two inputs with same hash.

Examples

MD5: 128-bit (broken, don't use). SHA-1: 160-bit (deprecated). SHA-2: 256/512-bit (standard). SHA-3: latest standard (Keccak).

Use Cases

Passwords: hash for storage (one-way). Integrity: checksums. Digital signatures: message authentication. Blockchain: proof-of-work (collision-hard).

Not For Hash Tables

Overkill: expensive for hash tables (need speed). Unnecessary: collision-resistance not needed (tables handle collisions). Time: SHA-256 >> division method (by orders of magnitude).

Trade-off

Hash tables: need speed (simple functions). Cryptography: need security (complex functions). Different goals, different methods.

Collision Probability

Birthday Paradox

Probability of collision: surprisingly high. For m = 10^6: ~1,180 keys suffice for a 50% collision chance. Exact: P(collision) = 1 - m! / (m^n * (m-n)!). Approximation: 1 - e^(-n^2/(2m)).
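The approximation can be checked numerically; with m = 10^6, about 1,180 keys already push the collision probability past 50%:

```python
import math

def collision_probability(n, m):
    # birthday-paradox approximation: P ~ 1 - e^(-n^2 / (2m))
    return 1 - math.exp(-n * n / (2 * m))

print(collision_probability(1178, 10**6))  # just over 0.5
```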

Load Factor

Load factor α = n/m (keys / table size). α < 0.5: low collisions. α = 1: moderate collisions. α > 2: very high collisions. Resize table: keep α < 1.

Expected Collisions

Load Factor    Collision Prob.    Avg. Probe Length
0.5            ~1.3%              ~1.25
0.75           ~4%                ~1.5
1.0            ~7.5%              ~2
2.0            ~30%               ~3

Rehashing

When α exceeds threshold (usually 0.75): create larger table. Rehash all keys: recompute hash with new m. Cost: O(n). Infrequent: amortized to O(1) per insertion.
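A rehashing sketch for a table using separate chaining; Python's built-in `hash` stands in for the table's hash function:

```python
def rehash(buckets, new_m):
    # every key must be re-placed: its slot hash(k) % m changes with m
    new_buckets = [[] for _ in range(new_m)]
    for bucket in buckets:
        for key, value in bucket:
            new_buckets[hash(key) % new_m].append((key, value))
    return new_buckets

def maybe_grow(buckets, n, threshold=0.75):
    # grow (double) when the load factor n/m crosses the threshold
    if n / len(buckets) > threshold:
        return rehash(buckets, 2 * len(buckets))
    return buckets
```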

Bloom Filters

Concept

Probabilistic: uses k hash functions. Bit array: initially all 0. Insert: set k bits to 1. Query: check if all k bits are 1.

Operation

Insert x:
  for i = 1 to k: set bit[h_i(x)] = 1
Query x:
  for i = 1 to k:
    if bit[h_i(x)] == 0: return "definitely not present"
  return "probably present" (may be a false positive)
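A compact Bloom filter sketch. Rather than k independent functions, it derives the k indexes from a single SHA-256 digest via double hashing (h_i(x) = h1 + i*h2 mod m), a common implementation shortcut:

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)          # bit array, initially all 0

    def _indexes(self, item):
        # derive k indexes from one digest via double hashing
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i] = 1              # insert: set k bits

    def might_contain(self, item):
        # query: "present" only if all k bits are set
        return all(self.bits[i] for i in self._indexes(item))
```

Inserted items are always reported present; only absent items can be misreported.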

False Positives

False-positive probability: approximately (1 - e^(-k*n/m))^k. Trade-off: more bits (larger m) and a well-chosen k reduce false positives. No false negatives: an inserted element is always reported present, so a "not present" answer is always correct.

Use Case

Cache lookups: avoid expensive operations. Network: spam detection (multiple hash functions). Space-efficient: single bit per element vs full hash.

Advantage over Hash Table

Space: bits vs full entries (100x smaller). Speed: O(k) vs O(1) (fixed k, predictable). Trade-off: false positives acceptable.

Practical Examples

Java's Integer Hash

public static int hashCode(int value) {
    return value;
}

Java's String Hash

int h = 0;
for (int i = 0; i < s.length(); i++)
    h = 31 * h + s.charAt(i);
return h;

FNV Hash (32-bit)

h = 2166136261
for each byte b:
    h = (h * 16777619) XOR b
return h
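The pseudocode maps directly to Python once the multiplication is wrapped to 32 bits (standing in for C's unsigned overflow); this is the FNV-1 variant, which multiplies before XORing:

```python
def fnv1_32(data: bytes) -> int:
    h = 2166136261                        # 32-bit FNV offset basis
    for b in data:
        h = (h * 16777619) & 0xFFFFFFFF   # FNV prime, wrapped to 32 bits
        h ^= b
    return h
```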

MurmurHash (Fast, Good Distribution)

Complex: multiple bit operations (shifts, XOR, multiplication). Distribution: excellent. Speed: 3-4x faster than universal hashing.

Practical Selection

Integer keys: identity or a simple multiplicative hash. String keys: polynomial rolling hash (like Java's). General: FNV or MurmurHash. Security: SHA-2.

References

  • Knuth, D. E. "The Art of Computer Programming: Sorting and Searching." Vol. 3, Addison-Wesley, 2nd edition, 1998.
  • Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. "Introduction to Algorithms." MIT Press, 3rd edition, 2009.
  • Carter, J. L., and Wegman, M. N. "Universal Classes of Hash Functions." Journal of Computer and System Sciences, vol. 18, no. 2, 1979, pp. 143-154.
  • Bloom, B. H. "Space/Time Trade-offs in Hash Coding with Allowable Errors." Communications of the ACM, vol. 13, no. 7, 1970, pp. 422-426.
  • Sedgewick, R., and Wayne, K. "Algorithms." Addison-Wesley, 4th edition, 2011.