Hash Functions (Cryptography)

Introduction

A cryptographic hash function is a mathematical algorithm that takes an input of arbitrary size and produces a fixed-size output (the hash, digest, or fingerprint). This transformation is deterministic (the same input always yields the same output) and one-way (it is computationally infeasible to reverse the process and recover the input from the output).

Hash functions are among the most fundamental building blocks in cryptography. They are used in digital signatures, message authentication codes (HMACs), password storage, data integrity verification, blockchain proof-of-work, certificate fingerprints, and countless other security mechanisms.

A good cryptographic hash function should behave like a random oracle -- given any input, the output should appear completely random and unpredictable, with no discernible relationship between inputs and their hashes. In practice, no function can truly be a random oracle, but well-designed hash functions come remarkably close.

"Hash functions are the workhorse of modern cryptography. Almost every cryptographic protocol uses a hash function somewhere." -- Phillip Rogaway, University of California, Davis

Essential Properties

Preimage Resistance

A hash function must have three levels of resistance:

Preimage resistance (one-wayness): Given a hash value h, it should be computationally infeasible to find any message m such that H(m) = h. This prevents an attacker from recovering the original data from its hash.
Second preimage resistance: Given a message m1, it should be infeasible to find a different message m2 such that H(m1) = H(m2). This prevents an attacker from substituting a different document that has the same hash as the original.

Collision Resistance

Collision resistance means it should be infeasible to find any two distinct messages m1 and m2 such that H(m1) = H(m2). Note the distinction from second preimage resistance: here the attacker has freedom to choose both messages, which makes collisions easier to find (see the Birthday Attack section).

Collision resistance is critical for digital signatures. If an attacker can create two documents with the same hash, they could have one document signed (say, a benign contract) and then substitute the other (a malicious contract) -- the signature remains valid for both.

Avalanche Effect

The avalanche effect means that a small change in the input (even flipping a single bit) produces a drastically different hash output. Approximately 50% of the output bits should change. This property ensures that similar inputs produce completely dissimilar hashes, preventing any useful information from leaking through hash comparison.

// Demonstrating the avalanche effect with SHA-256SHA256("hello") = 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824SHA256("hellp") = 739d1e53e45e1b7eb042aee40e3dbbbb1e6a1eab43e18e5c8b2706ae0bda10f5SHA256("Hello") = 185f8db32271fe25f561a6fc938b2e264306ec304eda518007d1764826381969// Changing a single character completely changes the hash// No pattern can be discerned between inputs and outputs

Major Hash Algorithms

MD5 (Message Digest 5)

MD5 was designed by Ronald Rivest in 1991 as a successor to MD4. It produces a 128-bit (16-byte) hash value, typically rendered as a 32-character hexadecimal string. MD5 was once the standard hash function for file integrity checks, digital signatures, and password storage.

However, MD5 has been cryptographically broken since the mid-2000s:

In 2004, Xiaoyun Wang demonstrated practical collision attacks
In 2008, researchers used MD5 collisions to create a rogue CA certificate that browsers trusted
In 2012, the Flame malware exploited MD5 collisions in Windows Update's code signing

MD5 should never be used for security purposes. It remains acceptable only for non-cryptographic checksums (e.g., verifying file downloads when the hash itself is transmitted securely).

SHA-1 (Secure Hash Algorithm 1)

SHA-1, designed by the NSA and published by NIST in 1995, produces a 160-bit hash. It was the dominant hash function for over a decade, used in SSL/TLS certificates, Git commits, and digital signatures.

SHA-1 was theoretically broken in 2005 when Wang et al. demonstrated attacks faster than brute force. In February 2017, Google and CWI Amsterdam published SHAttered -- the first practical SHA-1 collision, producing two different PDF files with the same SHA-1 hash. The attack required approximately 2^63.1 SHA-1 computations (about 6,500 years of single-CPU time, completed in days using Google's infrastructure).

Major browsers stopped accepting SHA-1 certificates in 2017, and NIST officially deprecated SHA-1 for digital signatures. Git has migrated to SHA-256 as its hash algorithm. SHA-1 should not be used in any new security application.

SHA-2 Family

The SHA-2 family, designed by the NSA and published by NIST in 2001, includes six hash functions with digest sizes of 224, 256, 384, and 512 bits. SHA-256 and SHA-512 are the most widely used variants.

SHA-256 is the standard hash function of modern cryptography. It secures TLS certificates, Bitcoin's proof-of-work, HMAC constructions, and virtually every digital signature scheme in current use. No practical or theoretical attacks against SHA-2 have been published -- the best attacks reduce the collision resistance of SHA-256 from 2^128 to approximately 2^125, which remains far beyond feasibility.

Variant	Digest Size	Block Size	Rounds	Word Size
SHA-224	224 bits	512 bits	64	32 bits
SHA-256	256 bits	512 bits	64	32 bits
SHA-384	384 bits	1024 bits	80	64 bits
SHA-512	512 bits	1024 bits	80	64 bits
SHA-512/224	224 bits	1024 bits	80	64 bits
SHA-512/256	256 bits	1024 bits	80	64 bits

SHA-512 is actually faster than SHA-256 on 64-bit processors because it uses 64-bit word operations natively. SHA-512/256 provides a 256-bit digest using the SHA-512 core, combining the speed advantage of SHA-512 on 64-bit platforms with the 256-bit output size.

SHA-3 (Keccak)

SHA-3 was standardized by NIST in 2015 (FIPS 202) after a public competition that ran from 2007 to 2012. The winner was Keccak, designed by Guido Bertoni, Joan Daemen, Michael Peeters, and Gilles Van Assche. SHA-3 is not a replacement for SHA-2 (which remains secure) but a complementary standard based on entirely different mathematics.

Unlike SHA-2, which uses the Merkle-Damgard construction, SHA-3 uses a sponge construction. The sponge absorbs input data in blocks, interleaving with a permutation function, then squeezes out the hash. This design eliminates the length extension vulnerability that affects all Merkle-Damgard hash functions (including SHA-256) -- with SHA-256, knowing H(m) and the length of m allows computing H(m || padding || m') without knowing m.

SHA-3 also provides SHAKE128 and SHAKE256, extendable-output functions (XOFs) that can produce hash outputs of any desired length, making them versatile for key derivation and other applications.

"SHA-3 is an insurance policy. If SHA-2 ever falls, SHA-3 is ready -- and because it is built on completely different mathematical foundations, whatever breaks SHA-2 is unlikely to break SHA-3." -- NIST, SHA-3 standardization announcement

The Birthday Attack

The birthday attack exploits the mathematics of the birthday problem in probability theory. In a room of just 23 people, there is a greater than 50% probability that two people share the same birthday. Analogously, for a hash function with an n-bit output, a collision can be expected after approximately 2^(n/2) random hashing operations -- not 2^n.

This means a 128-bit hash (like MD5) provides only 64-bit collision resistance, and a 160-bit hash (SHA-1) provides only 80-bit collision resistance. For SHA-256, the birthday bound is 2^128, which remains well beyond practical attack capability.

Hash Function	Output Size	Preimage Resistance	Collision Resistance (Birthday Bound)	Status
MD5	128 bits	2^128 (theoretical)	2^64 (practical: broken)	Broken
SHA-1	160 bits	2^160	2^80 (practical: broken at 2^63)	Deprecated
SHA-256	256 bits	2^256	2^128	Secure
SHA-512	512 bits	2^512	2^256	Secure
SHA-3-256	256 bits	2^256	2^128	Secure

Password Hashing

Cryptographic hash functions like SHA-256 are designed to be fast -- a modern GPU can compute billions of SHA-256 hashes per second. This is exactly what you want for file integrity or digital signatures, but it is catastrophic for password storage. If passwords are stored as plain SHA-256 hashes, an attacker with a stolen database can try billions of password guesses per second.

Purpose-built password hashing functions solve this problem by being deliberately slow and resource-intensive. They use configurable parameters to control computation time, memory usage, and parallelism, making brute-force attacks impractical.

bcrypt

bcrypt, designed by Niels Provos and David Mazieres in 1999, is based on the Blowfish cipher. Its key innovation is the cost factor (work factor), an integer that determines the number of key expansion iterations. Each increment of the cost factor doubles the computation time, allowing bcrypt to keep pace with Moore's Law.

// bcrypt hash example// Cost factor 12 = 2^12 = 4096 iterations of key expansionpassword = "correct horse battery staple"salt = generate_random_salt(16 bytes)hash = bcrypt(password, salt, cost=12)// Result: $2b$12$LJ3m4ys3Lk0TSwMvnEyBxe8GCNZ5YoKPjQkOVLrYMPFhO9JHBa// Verificationis_correct = bcrypt_verify("correct horse battery staple", stored_hash)// Returns: true

bcrypt limits passwords to 72 bytes and always produces a 184-bit hash. It remains widely used and is the default password hashing function in many frameworks. However, it uses only CPU time, not memory, making it vulnerable to GPU and ASIC attacks.

Argon2

Argon2 won the Password Hashing Competition (PHC) in 2015 and is the current recommended standard. Unlike bcrypt, Argon2 is memory-hard -- it requires a configurable amount of RAM, making it resistant to GPU and ASIC attacks (which have fast computation but limited memory bandwidth).

Argon2 has three variants:

Argon2d: Data-dependent memory access. Maximum resistance to GPU cracking but vulnerable to side-channel attacks. Use for cryptocurrency mining and non-interactive contexts.
Argon2i: Data-independent memory access. Resistant to side-channel attacks. Use for password hashing in server environments.
Argon2id: Hybrid of d and i modes. Recommended for most password hashing use cases. Uses data-independent access in the first pass and data-dependent access in subsequent passes.

OWASP recommends Argon2id with a minimum configuration of: memory 19 MiB, iterations 2, parallelism 1. These parameters should be tuned to take approximately 0.5-1 second per hash on the production hardware.

Applications of Hash Functions

Cryptographic hash functions serve as building blocks across virtually all areas of information security:

Digital Signatures: The message is hashed before signing, so the signature covers a fixed-size digest rather than the entire (potentially very large) message. See Digital Signatures.

HMAC (Hash-based Message Authentication Code): HMAC combines a hash function with a secret key to provide both integrity and authentication. HMAC-SHA256(key, message) is widely used in API authentication, JWT tokens, and session management.

Data Integrity: File hashes (checksums) verify that downloaded files have not been corrupted or tampered with. Software repositories publish SHA-256 hashes alongside their packages.

Blockchain and Proof-of-Work: Bitcoin's mining process requires finding a nonce such that SHA256(SHA256(block_header)) produces a hash below a target value. The one-wayness of SHA-256 ensures that the only way to find such a nonce is exhaustive trial-and-error.

Key Derivation: Key derivation functions like HKDF (HMAC-based Key Derivation Function) use hash functions to derive multiple cryptographic keys from a single shared secret, as in the TLS 1.3 key schedule.

Content-Addressable Storage: Git uses SHA-1 (migrating to SHA-256) to identify every object (commit, tree, blob) by its content hash. IPFS uses multihash for content addressing in distributed file systems.

Algorithm Comparison

Algorithm	Output Size	Speed (cycles/byte)	Year	Security Status	Use Case
MD5	128 bits	~5	1991	Broken	Non-security checksums only
SHA-1	160 bits	~7	1995	Deprecated	Legacy systems only
SHA-256	256 bits	~12	2001	Secure	General purpose (recommended)
SHA-512	512 bits	~8 (64-bit)	2001	Secure	64-bit platforms, higher security margin
SHA-3-256	256 bits	~10	2015	Secure	Diversity, length extension immunity
BLAKE2b	Up to 512 bits	~3	2012	Secure	High-performance hashing
BLAKE3	256 bits	~1	2020	Secure	Extreme performance, parallelism

For new applications requiring a general-purpose hash, SHA-256 is the standard choice. For performance-critical applications, BLAKE3 offers dramatically higher throughput while maintaining security. For password hashing specifically, always use Argon2id or bcrypt -- never a general-purpose hash function.

For related topics, see Symmetric Encryption (which uses hash functions internally) and Public Key Infrastructure (which uses hash functions for certificate fingerprints).

References

Rivest, R. (1992). RFC 1321: The MD5 Message-Digest Algorithm.
NIST (2015). FIPS 180-4: Secure Hash Standard (SHS).
NIST (2015). FIPS 202: SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions.
Wang, X., & Yu, H. (2005). "How to Break MD5 and Other Hash Functions." EUROCRYPT 2005, Springer.
Stevens, M., et al. (2017). "The First Collision for Full SHA-1." CRYPTO 2017, Springer.
Provos, N., & Mazieres, D. (1999). "A Future-Adaptable Password Scheme." USENIX Annual Technical Conference.
Biryukov, A., Dinu, D., & Khovratovich, D. (2016). "Argon2: New Generation of Memory-Hard Functions for Password Hashing and Other Applications." IEEE European Symposium on Security and Privacy.
Krawczyk, H. (2010). RFC 5869: HMAC-based Extract-and-Expand Key Derivation Function (HKDF).
Aumasson, J.-P., et al. (2013). "BLAKE2: Simpler, Smaller, Fast as MD5." ACNS 2013, Springer.
O'Connor, S., et al. (2020). "BLAKE3: One function, fast everywhere." Specification document.