Definition and Purpose

What Is a Cache?

Cache: small, fast memory located close to CPU cores. Purpose: reduce average memory access time. Function: store frequently accessed instructions/data. Benefit: improves CPU throughput and reduces latency.

Role in Memory Hierarchy

Position: between CPU registers and main memory. Level: intermediary buffer. Effect: bridges speed gap between fast CPU and slower DRAM. Reduces costly main memory accesses.

Historical Context

Origin: introduced in 1960s mainframes. Evolution: from single-level caches to multi-level hierarchies. Importance: fundamental for modern processors’ performance scaling.

"A cache is a small, fast memory that stores copies of data from frequently used main memory locations." -- Andrew S. Tanenbaum

Cache Hierarchy

Levels of Cache

L1 Cache: smallest and fastest, split into instruction and data caches; size 16-64 KB, latency ~1-3 cycles. L2 Cache: larger, typically unified; size 256 KB-1 MB, latency ~5-12 cycles. L3 Cache: shared among cores, multi-MB; latency ~20-40 cycles.

Trade-offs

Size vs Speed: larger caches hold more data but are slower to access. Levels: additional levels raise overall hit rate but add lookup latency. Inclusion: the inclusion property guarantees that L1 contents are also present in L2. Exclusivity: some architectures use exclusive caches to increase effective capacity.

Example Architectures

Intel Core i7: three-level cache hierarchy with a shared L3 (inclusive in many generations). AMD Ryzen: shared L3 organized as a victim cache of the per-core L2s (mostly exclusive). ARM Cortex: typically two cache levels, with L2 shared. High-performance servers: additional levels, sometimes including an L4.

Cache Level | Size          | Latency (cycles) | Typical Use
L1          | 16-64 KB      | 1-3              | Per-core, split I/D
L2          | 256 KB - 1 MB | 5-12             | Per-core, unified
L3          | 2-32 MB       | 20-40            | Shared, multi-core

Cache Mapping Techniques

Direct-Mapped Cache

Mapping: each block maps to exactly one cache line. Indexing: uses the lower address bits. Advantage: simple, fast lookup. Disadvantage: high conflict-miss rate due to the fixed mapping.

Fully Associative Cache

Mapping: any block can be placed anywhere in cache. Lookup: parallel search of all tags. Advantage: minimizes conflict misses. Disadvantage: complex hardware, slower access time, higher power.

Set-Associative Cache

Mapping: cache divided into sets, each with multiple lines (ways). Block maps to one set but any line within. Trade-off: balances speed and flexibility. Common: 2-way, 4-way, 8-way associativity.

Cache Type        | Mapping    | Advantages                | Disadvantages
Direct-Mapped     | One-to-one | Simple, fast              | Conflict misses
Fully Associative | Any line   | Low conflict misses       | Complex, slow
Set-Associative   | Set-based  | Balance of speed & misses | More complex than direct

Address Breakdown

Address components: Tag, Index, Block Offset. Index selects set/line. Tag verifies block identity. Offset selects byte within block.

Address        = [ Tag | Index | Block Offset ]
Block Size     = 2^b bytes
Number of Sets = 2^s
Tag Size       = Address bits - (s + b)
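
As a concrete illustration, the C sketch below decodes a 32-bit address into these fields. The geometry (64-byte blocks, b = 6; 128 sets, s = 7) and the decode helper are illustrative assumptions, not fixed by the notes above.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative geometry: 64-byte blocks (b = 6), 128 sets (s = 7). */
    #define OFFSET_BITS 6
    #define INDEX_BITS  7

    static void decode(uint32_t addr)
    {
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        printf("addr=0x%08x  tag=0x%x  index=%u  offset=%u\n",
               (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
    }

    int main(void)
    {
        decode(0x12345678);   /* prints tag=0x91a2, index=89, offset=56 */
        return 0;
    }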

Cache Replacement Policies

Least Recently Used (LRU)

Policy: evict the block unused for the longest time. Implementation: age counters or recency stacks. Advantage: exploits temporal locality. Disadvantage: complexity grows with associativity.

First In First Out (FIFO)

Policy: evict oldest block. Simple queue structure. Advantage: low overhead. Disadvantage: ignores usage patterns, less efficient than LRU.

Random Replacement

Policy: randomly select block to evict. Advantage: simplest, avoids pathological cases. Disadvantage: unpredictable performance.

Other Policies

Least Frequently Used (LFU), Pseudo-LRU approximations, Adaptive policies combining heuristics.

Example: 4-way set-associative cache with LRU:
  On miss: identify the victim block via its LRU state
  Replace the victim with the new block
  Update LRU state on every access
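
A minimal C sketch of this procedure for a single 4-way set, using per-line timestamps as a software stand-in for hardware LRU counters (all names here are illustrative):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 4

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint64_t last_used;          /* timestamp of most recent access */
    } Line;

    typedef struct {
        Line     way[WAYS];
        uint64_t clock;              /* per-set access counter */
    } Set;

    /* Returns true on a hit; on a miss, refills an invalid or the LRU way. */
    static bool access_set(Set *s, uint32_t tag)
    {
        s->clock++;
        for (int w = 0; w < WAYS; w++) {          /* probe all ways */
            if (s->way[w].valid && s->way[w].tag == tag) {
                s->way[w].last_used = s->clock;   /* update LRU state on hit */
                return true;
            }
        }
        int victim = 0;                           /* miss: choose a victim */
        for (int w = 0; w < WAYS; w++) {
            if (!s->way[w].valid) { victim = w; break; }
            if (s->way[w].last_used < s->way[victim].last_used)
                victim = w;                       /* least recently used so far */
        }
        s->way[victim] = (Line){ true, tag, s->clock };
        return false;
    }

    int main(void)
    {
        Set s = {0};
        for (uint32_t t = 0; t < 6; t++)          /* 6 tags -> two evictions */
            printf("tag %u: %s\n", (unsigned)t,
                   access_set(&s, t) ? "hit" : "miss");
        printf("tag 0 again: %s\n", access_set(&s, 0) ? "hit" : "miss");
        return 0;
    }

Real hardware avoids full timestamps; at higher associativity it typically falls back on the pseudo-LRU approximations noted below.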

Cache Coherence and Consistency

Coherence Problem

Issue: multiple caches may hold copies of the same memory block. Problem: an update in one cache must be reflected in the others. Without coherence: stale data and inconsistency.

Coherence Protocols

MESI: states Modified, Exclusive, Shared, Invalid. Protocol: manages block states and transitions on read/write. Maintains coherence via bus snooping or directory-based protocols.
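
The sketch below is a simplified MESI next-state function for one cache line; data transfers, write-backs, and bus signalling are reduced to comments, and the event names are assumptions rather than any specific protocol specification.

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } MesiState;
    typedef enum { PR_READ, PR_WRITE, BUS_READ, BUS_READ_X } Event;

    /* Next state of one line for local and snooped events. */
    static MesiState mesi_next(MesiState s, Event e, bool others_have_copy)
    {
        switch (e) {
        case PR_READ:                    /* local load */
            if (s == INVALID)            /* fetch block via a bus read */
                return others_have_copy ? SHARED : EXCLUSIVE;
            return s;                    /* M/E/S already satisfy the read */
        case PR_WRITE:                   /* local store: gain ownership */
            return MODIFIED;             /* I/S invalidate others first; E upgrades silently */
        case BUS_READ:                   /* another core reads the block */
            if (s == MODIFIED || s == EXCLUSIVE)
                return SHARED;           /* M also writes the block back */
            return s;
        case BUS_READ_X:                 /* another core writes the block */
            return INVALID;              /* M writes back before invalidating */
        }
        return s;
    }

    int main(void)
    {
        MesiState s = mesi_next(INVALID, PR_READ, false);   /* -> EXCLUSIVE */
        s = mesi_next(s, PR_WRITE, false);                  /* -> MODIFIED  */
        s = mesi_next(s, BUS_READ, true);                   /* -> SHARED    */
        printf("final state: %d (SHARED == %d)\n", s, SHARED);
        return 0;
    }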

Memory Consistency Models

Defines the order in which memory operations become visible to processors. Examples: Sequential Consistency, Relaxed Consistency. Affects cache behavior and synchronization.
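
A standard litmus test makes the difference concrete:

  Initially: x = 0, y = 0
  Thread 1: x = 1;  r1 = y
  Thread 2: y = 1;  r2 = x

Sequential consistency forbids the outcome r1 = r2 = 0, because the first store in any global order must be visible to the other thread's load. Relaxed models such as x86-TSO permit it: each store can wait in a store buffer while the other thread's load still reads the stale value from its cache.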

Implementation Approaches

Bus Snooping: monitors bus transactions. Directory-Based: centralized tracking of block states. Hybrid methods optimize scalability and performance.

Cache Performance Metrics

Hit Rate and Miss Rate

Hit Rate: fraction of accesses found in cache. Miss Rate = 1 - Hit Rate. Aim: maximize hit rate to reduce latency.

Types of Misses

Compulsory Misses: first-time access misses. Capacity Misses: cache too small. Conflict Misses: mapping collisions in cache.

Average Memory Access Time (AMAT)

Formula: AMAT = Hit Time + Miss Rate × Miss Penalty. Lower AMAT improves overall processor speed.

Bandwidth and Power

Cache reduces memory bandwidth demand. Power: smaller, faster caches consume less power. Trade-offs between size, speed, and power consumption.

AMAT = Hit Time + (Miss Rate × Miss Penalty)

Example:
  Hit Time     = 1 cycle
  Miss Rate    = 5%
  Miss Penalty = 100 cycles
  AMAT         = 1 + 0.05 × 100 = 6 cycles
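
The formula composes across levels: an L1 miss pays the L2's AMAT as its penalty, which in turn includes the main-memory penalty. A small C sketch with illustrative (assumed) numbers:

    #include <stdio.h>

    /* Two-level AMAT: an L1 miss costs the L2's own AMAT. */
    int main(void)
    {
        double l1_hit  = 1.0,  l1_miss_rate = 0.05;   /* illustrative numbers */
        double l2_hit  = 10.0, l2_miss_rate = 0.20;
        double mem_lat = 100.0;

        double l2_amat = l2_hit + l2_miss_rate * mem_lat;   /* 10 + 0.2*100 = 30 */
        double amat    = l1_hit + l1_miss_rate * l2_amat;   /* 1 + 0.05*30 = 2.5 */
        printf("two-level AMAT = %.2f cycles\n", amat);
        return 0;
    }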

Write Policies

Write-Through

Writes update cache and main memory simultaneously. Advantage: simple coherence. Disadvantage: higher memory traffic.

Write-Back

Writes update cache only; memory updated on block eviction. Advantage: reduces memory writes. Disadvantage: complexity in maintaining coherence.

Write Allocate vs No-Write Allocate

Write Allocate: loads the block into the cache on a write miss before updating it. No-Write Allocate: writes directly to memory on a miss without loading the block. Write allocate typically pairs with write-back, no-write allocate with write-through.
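
The toy C model below makes the pairings concrete, using a single cache line over a small array standing in for memory; it is a sketch under those assumptions, not a real simulator.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef enum { WRITE_THROUGH, WRITE_BACK } WritePolicy;
    typedef struct { bool valid, dirty; uint32_t addr, data; } CacheLine;

    static uint32_t  memory[16];   /* stand-in for main memory */
    static CacheLine line;         /* a single cache line */

    static void handle_store(uint32_t addr, uint32_t data, WritePolicy policy)
    {
        bool hit = line.valid && line.addr == addr;
        if (!hit && policy == WRITE_BACK) {               /* write-allocate */
            if (line.valid && line.dirty)
                memory[line.addr] = line.data;            /* write back victim */
            line = (CacheLine){ true, false, addr, memory[addr] };
            hit = true;
        }
        if (hit) {
            line.data = data;
            if (policy == WRITE_BACK)
                line.dirty = true;                        /* memory updated at eviction */
            else
                memory[addr] = data;                      /* write-through to memory */
        } else {
            memory[addr] = data;                          /* no-write-allocate miss */
        }
    }

    int main(void)
    {
        handle_store(3, 42, WRITE_BACK);     /* memory[3] stays 0; line dirty */
        handle_store(5,  7, WRITE_THROUGH);  /* memory[5] = 7 immediately     */
        printf("memory[3]=%u (cache %u, dirty=%d), memory[5]=%u\n",
               (unsigned)memory[3], (unsigned)line.data, line.dirty,
               (unsigned)memory[5]);
        return 0;
    }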

Cache Design Considerations

Block Size

Typical sizes: 16-128 bytes. Larger blocks exploit spatial locality but increase miss penalty and memory traffic.

Associativity

Higher associativity reduces conflict misses but increases hit time and hardware complexity.

Latency vs Bandwidth

Trade-off: smaller caches are faster but hold less data; larger caches are slower but satisfy more requests locally, raising effective bandwidth.

Tag Storage

Tags store metadata for validation. Tag size affects cache overhead and access time.
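
A worked example, with parameters assumed purely for illustration (32 KB, 4-way, 64-byte blocks, 32-bit physical addresses):

  Blocks  = 32 KB / 64 B = 512;  Sets = 512 / 4 = 128  →  s = 7 index bits
  Offset  = log2(64) = 6 bits (b = 6)
  Tag     = 32 - (7 + 6) = 19 bits per block
  Tag RAM ≈ 512 × 19 bits ≈ 1.2 KB, plus valid/dirty/coherence bits per line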

Inclusion Property

Inclusive caches guarantee that blocks in inner levels (e.g., L1) are also present in outer levels (e.g., L2/L3). Simplifies coherence but reduces effective capacity.

Virtual vs Physical Caches

Virtual Cache

Indexed/tagged using virtual addresses. Advantage: early access before address translation. Challenge: synonym problem (multiple virtual addresses mapping to same physical address).

Physical Cache

Uses physical addresses for indexing/tagging. Avoids synonym issues. Slightly higher latency because address translation must complete first.

Virtual-Physical Hybrid

Virtually indexed, physically tagged (VIPT) caches combine the benefits: indexing proceeds in parallel with address translation, while physical tags avoid synonym conflicts.
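
The synonym-free limit follows from the page size. A worked example with typical (assumed) parameters:

  With 4 KB pages, the low 12 address bits are untranslated.
  If index bits + offset bits ≤ 12, virtual and physical indexes coincide.
  64 B lines (b = 6) leave at most 6 index bits → 64 sets.
  Capacity = 64 sets × ways × 64 B; e.g., 8 ways → 32 KB synonym-free.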

Multi-core and Shared Cache Architectures

Private vs Shared Caches

Private: each core has own L1/L2 caches. Shared: usually L3 cache shared among cores. Shared caches improve data sharing but increase contention.

Cache Coherence in Multi-core

Protocols ensure consistency across cores. Scalability challenges with increasing cores.

Non-Uniform Cache Access (NUCA)

Caches divided into banks with variable latency. Optimizes access time based on physical location.

Cache and Security Implications

Side-Channel Attacks

Cache timing leaks sensitive information. Examples: Spectre, Meltdown. Attackers infer data by measuring cache hits/misses.
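
The signal such attacks exploit is simply the hit/miss latency gap. The x86-specific C sketch below, using the __rdtsc, _mm_lfence, and _mm_clflush intrinsics, times a cached load against a flushed one; it demonstrates only the measurement primitive, and absolute cycle counts vary by machine.

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc, _mm_clflush, _mm_lfence (x86 only) */

    /* Time a single load of *p in TSC cycles, fenced to order the reads. */
    static uint64_t time_load(volatile int *p)
    {
        _mm_lfence();
        uint64_t t0 = __rdtsc();
        (void)*p;                            /* the load being timed */
        _mm_lfence();
        return __rdtsc() - t0;
    }

    int main(void)
    {
        static int x;
        (void)time_load(&x);                 /* warm up: bring x into the cache */
        printf("hit : ~%llu cycles\n", (unsigned long long)time_load(&x));

        _mm_clflush((const void *)&x);       /* evict x from the whole hierarchy */
        _mm_lfence();
        printf("miss: ~%llu cycles\n", (unsigned long long)time_load(&x));
        return 0;
    }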

Mitigation Techniques

Cache partitioning, randomization, flushing on context switch. Hardware and software defenses to reduce attack surface.

Trusted Execution Environments

Isolate sensitive computations from cache interference. Use secure cache designs to protect confidentiality.
