CPU Cache

The CPU cache stands out as one of the components with the greatest impact on a processor's performance. This post explores the role of the CPU cache in modern processors: its levels, associativity, coherence, replacement policies, and advanced features.

What is CPU Cache?

CPU cache is a high-speed memory subsystem integrated directly into the processor. Its primary function is to bridge the speed gap between the fast CPU and the relatively slower main memory (RAM). By storing frequently accessed data and instructions, cache reduces the average time it takes for the CPU to access memory, a concept known as memory latency.

The Memory Hierarchy

To understand cache better, we need to look at the entire memory hierarchy:

  1. CPU Registers (fastest, smallest)
  2. L1 Cache
  3. L2 Cache
  4. L3 Cache
  5. Main Memory (RAM)
  6. Storage (HDD/SSD) (slowest, largest)

As we move down this list, capacity increases but speed decreases. CPU cache occupies the crucial middle ground in this hierarchy.

The Three Levels of Cache: In-Depth

L1 Cache

  • Size: Typically 32 to 64 KB per core
  • Speed: Can be accessed in as little as 2-4 CPU cycles
  • Structure: Often split into separate instruction (L1i) and data (L1d) caches
  • Associativity: Usually 4 or 8-way set associative
  • Latency: Around 0.9 nanoseconds

L1 cache is further divided into L1i (instruction) and L1d (data) caches. This separation allows the CPU to fetch instructions and data simultaneously, improving efficiency.

L2 Cache

  • Size: Usually 256 KB to 1 MB per core
  • Speed: Access time of about 7-20 CPU cycles
  • Structure: Unified (contains both instructions and data)
  • Associativity: Often 8 or 16-way set associative
  • Latency: Around 2.5 nanoseconds

L2 cache is larger but slower than L1. It acts as a backup, holding data that doesn’t fit in L1 but is still likely to be needed soon.

L3 Cache

  • Size: Often 4 MB to 50 MB (or even more in high-end processors)
  • Speed: Access time of about 20-60 CPU cycles
  • Structure: Unified and shared among all cores
  • Associativity: Can be 16-way or higher set associative
  • Latency: Around 12 nanoseconds

L3 cache is the last line of defense before the CPU has to access the much slower main memory. Because it is shared among all cores, it also serves as the natural point for sharing data between cores.

Cache Associativity Explained

Cache associativity refers to how memory blocks are mapped to locations in the cache. There are three main schemes:

  1. Direct-mapped: Each memory block maps to exactly one cache line.
  2. Fully associative: A memory block can be placed in any cache line.
  3. Set associative: A compromise between the two, where each memory block maps to exactly one set but can occupy any line (way) within that set.

Most modern caches are set associative, balancing performance and complexity.
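
As a rough illustration, the sketch below shows how an address is decomposed for a set-associative cache, assuming a hypothetical 32 KB, 8-way cache with 64-byte lines: the address selects exactly one set, and the block may then occupy any of that set's 8 ways.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical L1-like geometry: 32 KB, 8-way, 64-byte lines.
    const std::uint64_t cache_size = 32 * 1024;
    const std::uint64_t line_size  = 64;
    const std::uint64_t ways       = 8;
    const std::uint64_t num_sets   = cache_size / (line_size * ways); // 64 sets

    std::uint64_t address = 0x7ffc1234;          // example address
    std::uint64_t block   = address / line_size; // which cache-line-sized block
    std::uint64_t set     = block % num_sets;    // set index: any of 8 ways within this set
    std::uint64_t tag     = block / num_sets;    // tag stored to identify the block later

    std::printf("address 0x%llx -> set %llu, tag 0x%llx\n",
                (unsigned long long)address, (unsigned long long)set,
                (unsigned long long)tag);
}
```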

Cache Coherence

In multi-core processors, maintaining cache coherence is crucial. This ensures that all cores have a consistent view of memory. Protocols like MESI (Modified, Exclusive, Shared, Invalid) are used to maintain coherence across different cache levels and cores.
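
The sketch below models a MESI-style state machine for a single cache line in one core's cache, under heavy simplifications (no bus transactions, no snoop responses); it is meant only to show how local and remote accesses drive the state changes.

```cpp
#include <cstdio>

// Simplified MESI states for one cache line in one core's cache.
enum class State { Modified, Exclusive, Shared, Invalid };

// Local core reads the line. A miss (Invalid) loads the line; whether it
// arrives Exclusive or Shared depends on other caches -- assume Shared here.
State on_local_read(State s) {
    return (s == State::Invalid) ? State::Shared : s;
}

// Local core writes the line: it ends up Modified after copies in other
// caches have been invalidated.
State on_local_write(State /*s*/) {
    return State::Modified;
}

// Another core writes the same line: our copy becomes Invalid.
State on_remote_write(State /*s*/) {
    return State::Invalid;
}

// Another core reads the line: a Modified/Exclusive copy is downgraded to Shared.
State on_remote_read(State s) {
    if (s == State::Modified || s == State::Exclusive) return State::Shared;
    return s;
}

int main() {
    State s = State::Invalid;
    s = on_local_read(s);   // Invalid  -> Shared
    s = on_local_write(s);  // Shared   -> Modified (other copies invalidated)
    s = on_remote_read(s);  // Modified -> Shared (data written back/forwarded)
    std::printf("final state: %d\n", static_cast<int>(s));
}
```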

Cache Replacement Policies

When a cache is full and a new block needs to be loaded, the processor must decide which existing block to evict. Common replacement policies include:

  • Least Recently Used (LRU)
  • Random Replacement
  • First-In, First-Out (FIFO)
  • Pseudo-LRU

The choice of policy can significantly impact cache performance.
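
As a minimal sketch of one of these policies, the toy set below implements exact LRU for a hypothetical 4-way set; real hardware usually approximates this with pseudo-LRU bits rather than maintaining a full recency ordering.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <list>

// One 4-way set with exact LRU: most recently used tag at the front of the list.
struct LruSet {
    static constexpr std::size_t kWays = 4;
    std::list<std::uint64_t> tags; // front = MRU, back = LRU

    // Returns true on a hit; on a miss, inserts the tag, evicting the LRU way if full.
    bool access(std::uint64_t tag) {
        for (auto it = tags.begin(); it != tags.end(); ++it) {
            if (*it == tag) {              // hit: move to MRU position
                tags.erase(it);
                tags.push_front(tag);
                return true;
            }
        }
        if (tags.size() == kWays) tags.pop_back(); // evict least recently used
        tags.push_front(tag);
        return false;
    }
};

int main() {
    LruSet set;
    // Tag 5 evicts tag 2 (the least recently used), so the final access to 2 misses.
    std::uint64_t pattern[] = {1, 2, 3, 4, 1, 5, 2};
    for (std::uint64_t t : pattern)
        std::printf("tag %llu: %s\n", (unsigned long long)t,
                    set.access(t) ? "hit" : "miss");
}
```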

Cache Performance Metrics

Several metrics are used to evaluate cache performance:

  • Hit Rate: The percentage of memory accesses found in the cache.
  • Miss Rate: The percentage of memory accesses not found in the cache.
  • Hit Time: The time it takes to access data in the cache.
  • Miss Penalty: The additional time required to fetch data from a lower level of memory.
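
These metrics combine into the commonly used average memory access time formula, AMAT = hit time + miss rate × miss penalty. A small sketch with hypothetical numbers:

```cpp
#include <cstdio>

int main() {
    // Hypothetical single-level numbers: 1 ns hit time, 5% miss rate, 80 ns miss penalty.
    double hit_time_ns     = 1.0;
    double miss_rate       = 0.05;
    double miss_penalty_ns = 80.0;

    double amat = hit_time_ns + miss_rate * miss_penalty_ns; // 1 + 0.05 * 80 = 5 ns
    std::printf("AMAT = %.1f ns\n", amat);
}
```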

Advanced Cache Features

Modern processors implement various advanced caching techniques:

  • Prefetching: Proactively fetching data into the cache before it’s requested.
  • Inclusive vs. Exclusive Caching: Determining whether lower-level caches contain copies of data in higher-level caches.
  • Write-back vs. Write-through: Strategies for updating main memory when cache data is modified.
  • Cache Partitioning: Reserving portions of cache for specific applications or cores.

Prefetching

Prefetching is a technique used to reduce cache misses by proactively loading data into the cache before it’s explicitly requested by the CPU.

How Prefetching Works

  1. The processor predicts which data will be needed in the near future.
  2. It fetches this data from main memory into the cache ahead of time.
  3. When the CPU actually needs the data, it’s already available in the cache.

There are two main types of prefetching:

  1. Hardware Prefetching: Implemented in the CPU’s circuitry.

    • Analyzes memory access patterns in real-time.
    • Automatically initiates prefetch requests.
    • Common in modern processors.
  2. Software Prefetching: Implemented through code.

    • Programmers or compilers insert prefetch instructions.
    • Gives fine-grained control over what data to prefetch.

Pros of Prefetching:

  • Reduces cache misses and memory latency.
  • Can significantly improve performance for predictable access patterns.

Cons of Prefetching:

  • If predictions are incorrect, it can waste bandwidth and potentially evict useful data from the cache.
  • Overly aggressive prefetching can increase power consumption.

Prefetch Algorithms

Common prefetching algorithms include:

  • Sequential Prefetching: Fetches the next N cache lines after a cache miss.
  • Stride Prefetching: Detects and prefetches based on regular access patterns.
  • Correlation-based Prefetching: Uses historical data to predict future accesses.

Prefetching is particularly effective for applications with predictable memory access patterns, such as sequential iteration over arrays or regular strides through matrices. Pointer-chasing structures like linked lists are much harder to prefetch, because the next address is not known until the current node has been loaded.
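
As a rough sketch of software prefetching, the loop below uses the GCC/Clang __builtin_prefetch builtin to request data a fixed number of elements ahead of the current index; the distance of 64 elements is a hypothetical tuning parameter, and on hardware with a capable sequential prefetcher the explicit hints may add little.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Sum an array, hinting the cache to load data kPrefetchDistance elements ahead.
// __builtin_prefetch is a GCC/Clang builtin; the hint may be ignored by the CPU.
long long sum_with_prefetch(const std::vector<int>& v) {
    const std::size_t kPrefetchDistance = 64; // hypothetical tuning parameter
    long long total = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (i + kPrefetchDistance < v.size())
            __builtin_prefetch(&v[i + kPrefetchDistance], /*rw=*/0, /*locality=*/3);
        total += v[i];
    }
    return total;
}

int main() {
    std::vector<int> v(1 << 20, 1);
    std::printf("sum = %lld\n", sum_with_prefetch(v));
}
```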

Inclusive vs. Exclusive Caching

Inclusive and exclusive caching policies determine how data is stored across different cache levels in the memory hierarchy.

Inclusive Caching

In an inclusive caching system:

  1. Lower-level caches (e.g., L3) contain all the data that is in the higher-level caches (e.g., L2 and L1).
  2. When data is brought into a higher-level cache, it’s also stored in the lower-level caches.

Pros:

  • Simplifies cache coherence in multi-core systems.
  • Easier to determine if data is in the cache without checking all levels.

Cons:

  • Reduces the effective cache size, as data is duplicated across levels.
  • Can lead to more frequent evictions in lower-level caches.

Exclusive Caching

In an exclusive caching system:

  1. Each piece of data is stored in only one cache level at a time.
  2. When data is moved to a higher-level cache, it’s removed from the lower-level cache.

Pros:

  • Maximizes the effective cache size by avoiding data duplication.
  • Can potentially store more unique data items across all cache levels.

Cons:

  • More complex to implement, especially for maintaining cache coherence.
  • May require additional data transfers between cache levels.

Hybrid Approaches

Some systems use a hybrid approach:

  • Mostly-Inclusive: Similar to inclusive, but allows some exclusivity for certain data.
  • Non-Inclusive Non-Exclusive (NINE): Doesn’t guarantee inclusion or exclusion, offering flexibility based on specific situations.

The choice between inclusive and exclusive caching depends on various factors, including the processor architecture, the expected workload, and the importance of cache coherence vs. maximizing cache capacity.
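
As a rough worked example with hypothetical sizes, consider a core with 64 KB of L1, 512 KB of L2, and an 8 MB slice of L3. An exclusive hierarchy can hold roughly 64 KB + 512 KB + 8 MB ≈ 8.6 MB of distinct data, whereas an inclusive one tops out at about 8 MB, because everything in L1 and L2 is duplicated in L3.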

Write-back vs. Write-through Caching

When it comes to updating data in the cache and main memory, there are two primary strategies: write-back and write-through.

Write-back Caching

In write-back caching:

  1. When data is modified, it’s only written to the cache, not to main memory.
  2. The modified data is marked as “dirty.”
  3. The data is written to main memory only when it’s evicted from the cache.

Pros:

  • Reduces write traffic to main memory
  • Can improve performance for write-heavy workloads

Cons:

  • Risk of data loss if system crashes before dirty data is written to main memory
  • More complex to implement

Write-through Caching

In write-through caching:

  1. When data is modified, it’s written to both the cache and main memory immediately.
  2. Ensures that cache and main memory are always in sync.

Pros:

  • Simpler to implement
  • Lower risk of data loss in case of system failure

Cons:

  • Increased write traffic to main memory
  • Can be slower for write-intensive operations

Most modern processors use write-back caching for L1 and L2 caches due to its performance benefits, while implementing safeguards against data loss. Some systems use a hybrid approach, employing write-through for certain critical data and write-back for the rest.
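
The toy model below contrasts the two policies for a single cached value, assuming one cache line and a plain variable standing in for main memory; real hardware also has to deal with write misses, write allocation, and store buffers.

```cpp
#include <cstdio>

// Toy model: one cached word backed by "main memory".
struct ToyCache {
    int memory_value = 0;   // stands in for main memory
    int cached_value = 0;
    bool dirty = false;

    // Write-through: update the cache and memory together.
    void write_through(int v) {
        cached_value = v;
        memory_value = v;           // memory is always in sync
    }

    // Write-back: update only the cache and mark the line dirty.
    void write_back(int v) {
        cached_value = v;
        dirty = true;               // memory is now stale
    }

    // On eviction, a dirty line must be flushed to memory.
    void evict() {
        if (dirty) {
            memory_value = cached_value;
            dirty = false;
        }
    }
};

int main() {
    ToyCache c;
    c.write_back(42);
    std::printf("before eviction: memory=%d (stale)\n", c.memory_value); // 0
    c.evict();
    std::printf("after eviction:  memory=%d\n", c.memory_value);         // 42
}
```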

Cache Partitioning

Cache partitioning is a technique used to divide the cache into separate regions, each dedicated to specific tasks, cores, or applications. This approach aims to improve performance and provide more predictable behavior in multi-core and multi-tasking environments.

How Cache Partitioning Works

  1. The cache (typically L3 or Last Level Cache) is divided into multiple partitions.
  2. Each partition is assigned to a specific core, application, or set of tasks.
  3. The processor ensures that each entity only uses its assigned partition.

Types of Cache Partitioning

  1. Way Partitioning: Divides the cache by ways in a set-associative cache (a sketch of this approach follows the list).
  2. Set Partitioning: Allocates different sets of the cache to different partitions.
  3. Dynamic Partitioning: Adjusts partition sizes on-the-fly based on workload demands.
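
A minimal sketch of way partitioning, assuming a hypothetical 8-way set in which each core is handed a bitmask of ways it may allocate into (similar in spirit to the way masks used by hardware cache-allocation features); victim selection is restricted to the allowed ways.

```cpp
#include <cstdint>
#include <cstdio>

// Pick a victim way for a fill, restricted to the ways allowed by the core's mask.
// Assumes an 8-way set; returns -1 if the mask allows no ways.
int pick_victim_way(std::uint8_t allowed_ways_mask, const std::uint64_t last_used[8]) {
    int victim = -1;
    std::uint64_t oldest = UINT64_MAX;
    for (int way = 0; way < 8; ++way) {
        if (!(allowed_ways_mask & (1u << way))) continue; // way reserved for another core
        if (last_used[way] < oldest) {                    // LRU among allowed ways only
            oldest = last_used[way];
            victim = way;
        }
    }
    return victim;
}

int main() {
    std::uint64_t last_used[8] = {9, 2, 7, 1, 8, 3, 6, 5};
    // Core A may only use ways 0-3 (mask 0x0F); core B gets ways 4-7 (mask 0xF0).
    std::printf("core A evicts way %d\n", pick_victim_way(0x0F, last_used)); // way 3
    std::printf("core B evicts way %d\n", pick_victim_way(0xF0, last_used)); // way 5
}
```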

Benefits of Cache Partitioning

  • Performance Isolation: Prevents one application from flooding the cache and evicting another application’s data.
  • Quality of Service (QoS): Ensures critical tasks always have a minimum amount of cache available.
  • Security: Can help mitigate certain types of cache-based side-channel attacks.
  • Predictability: Improves the predictability of application performance, which is crucial in real-time systems.

Challenges and Considerations

  • Reduced Flexibility: Strict partitioning may lead to underutilization if a partition’s owner doesn’t need all its allocated space.
  • Complexity: Implementing and managing cache partitioning adds complexity to the system.
  • Overhead: There may be a small performance overhead in enforcing partitioning.

Applications

Cache partitioning is particularly useful in:

  • Multi-tenant Cloud Environments: Ensuring fair resource allocation among different virtual machines or containers.
  • Real-time Systems: Providing predictable performance for time-critical tasks.
  • Mixed-criticality Systems: Isolating high-priority tasks from less critical ones.

Impact on Software Development

Understanding CPU cache can influence software optimization:

  • Loop Tiling: Restructuring loops to improve cache utilization.
  • Data Alignment: Aligning data structures to cache line boundaries.
  • Cache-Oblivious Algorithms: Designing algorithms that perform well regardless of cache size.
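
As an illustration of loop tiling, the sketch below transposes a matrix in blocks so that both the source rows and the destination columns of a block stay resident in cache while it is processed; the 64×64 tile size is a hypothetical choice that would normally be tuned to the target cache.

```cpp
#include <cstddef>
#include <vector>

// Naive transpose: the column-wise writes to dst touch a new cache line on
// nearly every iteration once n is large, causing many misses.
void transpose_naive(const std::vector<double>& src, std::vector<double>& dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            dst[j * n + i] = src[i * n + j];
}

// Tiled transpose: work on B x B blocks so both the src and dst blocks fit in cache.
void transpose_tiled(const std::vector<double>& src, std::vector<double>& dst, std::size_t n) {
    const std::size_t B = 64; // hypothetical tile size, tuned to the target cache
    for (std::size_t ii = 0; ii < n; ii += B)
        for (std::size_t jj = 0; jj < n; jj += B)
            for (std::size_t i = ii; i < ii + B && i < n; ++i)
                for (std::size_t j = jj; j < jj + B && j < n; ++j)
                    dst[j * n + i] = src[i * n + j];
}

int main() {
    const std::size_t n = 1024;
    std::vector<double> src(n * n, 1.0), dst(n * n);
    transpose_tiled(src, dst, n);
}
```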


