Introduction

Hash function: maps arbitrary data to fixed-size index. Input: any size. Output: array index (0 to m-1). Goal: uniform distribution (minimize collisions). Challenge: deterministic (same input always same output), unpredictable (no pattern).

"Hash functions are the foundation of hash tables: convert keys into array indices with remarkable efficiency. Good functions balance simplicity with distribution quality, enabling O(1) average-case performance." -- Data structures

Hash Function Definition

Formal Definition

Hash function h: U -> {0, 1, ..., m-1}. Input domain U: keys (strings, numbers). Output range: array size m. Property: deterministic (h(x) always same). Desirable: distribute uniformly across range.

Key Properties

Deterministic: h(k) returns same value always. Efficient: O(1) computation. Distribution: uniform (minimize clumping). Fast computation: outweighs perfect distribution.

Collisions Are Inevitable

Pigeonhole principle: keys > indices, collisions must occur. Goal: minimize collisions probabilistically. Perfect hash: no collisions (requires knowing all keys upfront).

Collision Example

h("Alice") = 5  (slot 5)
h("Bob")   = 5  (slot 5)
COLLISION: both hash to same slot
Solution: collision resolution (chaining, open addressing)
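A minimal chaining sketch in Python, with a contrived `toy_hash` (hypothetical, for illustration only) that forces "Alice" and "Bob" into slot 5 as in the example:

```python
# Separate chaining: each slot holds a list of (key, value) pairs.
# toy_hash is contrived so "Alice" and "Bob" both land in slot 5,
# reproducing the collision above.
def toy_hash(key, m=10):
    return 5 if key in ("Alice", "Bob") else sum(map(ord, key)) % m

table = [[] for _ in range(10)]

def put(key, value):
    bucket = table[toy_hash(key)]
    for i, (k, _) in enumerate(bucket):
        if k == key:                    # overwrite an existing key
            bucket[i] = (key, value)
            return
    bucket.append((key, value))         # append on collision

def get(key):
    for k, v in table[toy_hash(key)]:
        if k == key:
            return v
    return None

put("Alice", 1)
put("Bob", 2)                           # collides with "Alice" in slot 5
```

Open addressing would instead probe for the next free slot within the array itself.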

Design Criteria

Uniformity

Keys distributed evenly across array. Prob(h(k) = i) = 1/m for all i. Mathematical: minimize variance of distribution. Practical: avoid clustering.

Determinism

Same key always hashes to same slot. Essential: lookup must find inserted values. Consistency: critical for correctness.

Efficiency

O(1) computation time. Complex functions: negate hash table advantage. Trade-off: simplicity vs distribution quality (usually prefer simple).

Avalanche Effect

Tiny input change: completely different output. Bit flip in key: ideally changes hash completely. Prevents clustering (similar keys near each other).

Table Size Independence

Function: independent of m (array size). Scaling: the same hash function works for any table size. Growing the table: existing keys must be rehashed into the new range, but the underlying hash function stays the same.

Division Method

Formula

h(k) = k mod m
Simple: remainder of key divided by table size
Example: k = 17, m = 10: h(17) = 17 mod 10 = 7
Range: 0 to m-1 (exact fit)
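The division method is a one-liner; a sketch checked against the example values:

```python
def division_hash(k, m):
    # h(k) = k mod m: the remainder picks a slot in 0..m-1
    return k % m

print(division_hash(17, 10))  # slot 7, matching the example
```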

Advantages

Simplicity: trivial implementation. Speed: single operation (modulo). Uniform: works well with good m.

Disadvantages

Clustering: keys cluster when they share common factors with m. Bad m values: m = 2^k (h(k) uses only the low-order bits, ignoring the high bits), m divisible by small primes.

Optimal m

Prime number: best practice. Example: m = 97, 101, 107. Avoids factor-related clustering. Choosing m slightly larger than needed also lowers the load factor, reducing collisions.

Example Good vs Bad

Good: m = 7 (prime)
Bad: m = 8 (power of 2: h(k) depends only on the low bits of k)
Bad: m = 6 (composite, 2*3: keys sharing those factors cluster)

Multiplication Method

Formula

h(k) = floor(m * (k*A - floor(k*A)))
Where A is a constant, 0 < A < 1, typically (sqrt(5)-1)/2 ≈ 0.618
Fractional part: k*A - floor(k*A)

How It Works

Example: k = 16, m = 10, A = 0.618
k*A = 16 * 0.618 = 9.888
Fractional part: 9.888 - 9 = 0.888
h(16) = floor(10 * 0.888) = floor(8.88) = 8
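The method fits in a few lines; here A defaults to the full-precision golden-ratio constant rather than the truncated 0.618 above, which still gives the same slot for this example:

```python
import math

def multiplication_hash(k, m, A=(math.sqrt(5) - 1) / 2):
    # h(k) = floor(m * frac(k*A)), with A the golden-ratio constant
    frac = (k * A) % 1.0          # fractional part of k*A
    return math.floor(m * frac)

print(multiplication_hash(16, 10))  # 8, matching the worked example
```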

Advantages

Distribution: good regardless of m. Works: with any m (not limited to primes). Avalanche: bit changes propagate well.

Disadvantages

Complexity: more operations than division. Floating-point: precision issues. Implementation: trickier than modulo.

Why Golden Ratio

A = (sqrt(5)-1)/2 ≈ 0.618 (golden ratio). Mathematical: provably good distribution. Historical: empirically best-known constant.

Mid-Square Method

Algorithm

1. Square the key: k^2
2. Extract middle digits
3. Modulo to range
Example: k = 3121, m = 100
k^2 = 9,740,641
Middle digits: 740
h(3121) = 740 mod 100 = 40
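A sketch of mid-square hashing. "Middle digits" is a convention choice: this version centers the extraction window, so for k = 3121 it picks 406 rather than the 740 used in the worked example, and the resulting slot differs accordingly.

```python
def mid_square_hash(k, m, width=3):
    # square the key, pull `width` digits from the center, reduce mod m
    sq = str(k * k)
    start = max((len(sq) - width) // 2, 0)
    middle = int(sq[start:start + width])
    return middle % m

print(mid_square_hash(3121, 100))  # 6 with this centering (406 mod 100)
```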

Advantages

Avalanche: squaring mixes bits well. Simple: easy to understand. Good distribution: empirically works well.

Disadvantages

Limited range: overflow possible (k^2 very large). Implementation: requires multiplication. Not standard: less portable.

When to Use

Small keys: avoids overflow. Integer keys: natural fit. Simple implementations: embedded systems.

Folding Method

Algorithm

1. Divide key into parts
2. Add parts (shift and add)
3. Modulo to range
Example: k = 123456789, m = 100
Divide: 123, 456, 789
Add: 123 + 456 + 789 = 1368
h(k) = 1368 mod 100 = 68
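The three steps above can be sketched as:

```python
def folding_hash(k, m, chunk=3):
    # split the decimal digits into fixed-size parts, sum them, reduce mod m
    s = str(k)
    total = sum(int(s[i:i + chunk]) for i in range(0, len(s), chunk))
    return total % m

print(folding_hash(123456789, 100))  # 123 + 456 + 789 = 1368 -> 68
```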

Variants

Shift and add: shift alternate parts, then add. Folding at boundaries: reverse alternate parts (different distribution).

Advantages

Simple: easy to code. Handles long keys: process chunks. Good: reasonable distribution for strings.

Disadvantages

Weaker: less avalanche than multiplication. Order-insensitive: rearranging the parts gives the same sum, so permuted keys collide. Clustering: some patterns cluster.

Use Case

String hashing: break string into chunks. Legacy systems: historical method (simpler than modular).

Universal Hashing

Concept

Family of hash functions: multiple functions, random selection. Guarantee: for any two keys, small probability same hash (collision). Proof: mathematical bound on collision rate.

Formula

h(k) = ((a*k + b) mod p) mod m
Where:
p = prime > |U| (universe size)
a, b = random, 1 <= a < p, 0 <= b < p
m = hash table size
Different (a, b) pairs = different hash functions
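A sketch of drawing one function from this family. The Mersenne prime 2^31 - 1 for p is an assumption here; any prime larger than the key universe works:

```python
import random

P = 2_147_483_647  # Mersenne prime 2^31 - 1; must exceed the key universe

def make_universal_hash(m, p=P):
    # drawing (a, b) at random selects one function from the family
    a = random.randrange(1, p)   # 1 <= a < p
    b = random.randrange(0, p)   # 0 <= b < p
    return lambda k: ((a * k + b) % p) % m

h = make_universal_hash(100)
```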

Collision Probability

Theorem: Prob(h(k1) = h(k2)) <= 1/m over the random choice of (a, b). Guarantee: any two distinct keys collide with probability at most 1/m (essentially optimal).

Advantages

Provable: mathematical guarantees. Adversary-resistant: no bad input distribution. Flexible: choose function for worst case.

Disadvantages

Overhead: modular arithmetic expensive. Randomization: must store (a,b). Complexity: not simple implementation.

When to Use

Security-critical: adversarial input. Theoretical: need guarantees. Performance-critical: collision handling expensive.

Cryptographic Hash Functions

Properties

Deterministic: same input always same output. Quick: fast computation. Avalanche: tiny change, completely different output. One-way: cannot reverse (find input from output). Collision-resistant: hard to find two inputs with same hash.

Examples

MD5: 128-bit (broken, don't use). SHA-1: 160-bit (deprecated). SHA-2: 256/512-bit (standard). SHA-3: latest standard (Keccak).

Use Cases

Passwords: hash for storage (one-way). Integrity: checksums. Digital signatures: message authentication. Blockchain: proof-of-work (collision-hard).

Not For Hash Tables

Overkill: expensive for hash tables (need speed). Unnecessary: collision-resistance not needed (tables handle collisions). Time: SHA-256 >> division method (by orders of magnitude).

Trade-off

Hash tables: need speed (simple functions). Cryptography: need security (complex functions). Different goals, different methods.

Collision Probability

Birthday Paradox

Probability of collision: surprisingly high. For m = 10^6: ~1,180 keys suffice for a 50% collision chance. Exact: P(collision) = 1 - m! / (m^n * (m-n)!). Approximation: 1 - e^(-n^2/(2m)).
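The approximation can be checked numerically; with m = 10^6, about 1,180 keys already push the collision probability past 50%:

```python
import math

def collision_probability(n, m):
    # birthday-paradox approximation: P ~ 1 - e^(-n^2 / (2m))
    return 1 - math.exp(-n * n / (2 * m))

print(collision_probability(1178, 10**6))  # just over 0.5
```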

Load Factor

Load factor α = n/m (keys / table size). α < 0.5: low collisions. α = 1: moderate collisions. α > 2: very high collisions. Resize table: keep α < 1.

Expected Collisions

Load Factor    Collision Prob.    Avg. Probe Length
0.5            ~1.3%              ~1.25
0.75           ~4%                ~1.5
1.0            ~7.5%              ~2
2.0            ~30%               ~3

Rehashing

When α exceeds threshold (usually 0.75): create larger table. Rehash all keys: recompute hash with new m. Cost: O(n). Infrequent: amortized to O(1) per insertion.
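A rehashing sketch for a table using separate chaining; Python's built-in `hash` stands in for the table's hash function:

```python
def rehash(buckets, new_m):
    # every key must be re-placed: its slot hash(k) % m changes with m
    new_buckets = [[] for _ in range(new_m)]
    for bucket in buckets:
        for key, value in bucket:
            new_buckets[hash(key) % new_m].append((key, value))
    return new_buckets

def maybe_grow(buckets, n, threshold=0.75):
    # grow (double) when the load factor n/m crosses the threshold
    if n / len(buckets) > threshold:
        return rehash(buckets, 2 * len(buckets))
    return buckets
```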

Bloom Filters

Concept

Probabilistic: uses k hash functions. Bit array: initially all 0. Insert: set k bits to 1. Query: check if all k bits are 1.

Operation

Insert x:
  for i = 1 to k: set bit[h_i(x)] = 1
Query x:
  for i = 1 to k:
    if bit[h_i(x)] == 0: return "definitely not present"
  return "probably present" (may be a false positive)
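A compact Bloom filter sketch. Rather than k independent functions, it derives the k indexes from a single SHA-256 digest via double hashing (h_i(x) = h1 + i*h2 mod m), a common implementation shortcut:

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)          # bit array, initially all 0

    def _indexes(self, item):
        # derive k indexes from one digest via double hashing
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i] = 1              # insert: set k bits

    def might_contain(self, item):
        # query: "present" only if all k bits are set
        return all(self.bits[i] for i in self._indexes(item))
```

Inserted items are always reported present; only absent items can be misreported.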

False Positives

False-positive probability: approximately (1 - e^(-k*n/m))^k. Trade-off: more bits (larger m) and a well-chosen k reduce false positives. No false negatives: an inserted element is always reported present, so a "not present" answer is always correct.

Use Case

Cache lookups: avoid expensive operations. Network: spam detection (multiple hash functions). Space-efficient: single bit per element vs full hash.

Advantage over Hash Table

Space: bits vs full entries (100x smaller). Speed: O(k) vs O(1) (fixed k, predictable). Trade-off: false positives acceptable.

Practical Examples

Java's Integer Hash

public static int hashCode(int value) {
    return value;
}

Java's String Hash

int h = 0;
for (int i = 0; i < s.length(); i++)
    h = 31 * h + s.charAt(i);
return h;

FNV Hash (32-bit)

h = 2166136261
for each byte b:
    h = (h * 16777619) XOR b
return h
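The pseudocode maps directly to Python once the multiplication is wrapped to 32 bits (standing in for C's unsigned overflow); this is the FNV-1 variant, which multiplies before XORing:

```python
def fnv1_32(data: bytes) -> int:
    h = 2166136261                        # 32-bit FNV offset basis
    for b in data:
        h = (h * 16777619) & 0xFFFFFFFF   # FNV prime, wrapped to 32 bits
        h ^= b
    return h
```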

MurmurHash (Fast, Good Distribution)

Complex: multiple bit operations (shifts, XOR, multiplication). Distribution: excellent. Speed: 3-4x faster than universal hashing.

Practical Selection

Integer keys: identity or a simple multiplicative hash. String keys: polynomial rolling hash (like Java's). General: FNV or MurmurHash. Security: SHA-2.

References

  • Knuth, D. E. "The Art of Computer Programming: Sorting and Searching." Vol. 3, Addison-Wesley, 2nd edition, 1998.
  • Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. "Introduction to Algorithms." MIT Press, 3rd edition, 2009.
  • Carter, J. L., and Wegman, M. N. "Universal Classes of Hash Functions." Journal of Computer and System Sciences, vol. 18, no. 2, 1979, pp. 143-154.
  • Bloom, B. H. "Space/Time Trade-offs in Hash Coding with Allowable Errors." Communications of the ACM, vol. 13, no. 7, 1970, pp. 422-426.
  • Sedgewick, R., and Wayne, K. "Algorithms." Addison-Wesley, 4th edition, 2011.