Hardware mechanism transferring cache lines directly between processor caches without accessing main memory.
Cache-to-cache (C2C) transfer is a hardware capability in coherent shared-memory systems that allows one processor's cache controller to supply a cache line directly to another processor's cache in response to a read or write request, bypassing main memory entirely. Rather than routing the request down to DRAM and back, the owning cache intercepts the miss and forwards the data laterally across the interconnect fabric. This behavior is governed by cache-coherence protocols such as MESI, MOESI, and their derivatives, and can be implemented in both snoop-based and directory-based coherence architectures.
The mechanism works by detecting, during a cache miss, that another cache already holds the requested line in a state that permits it to supply the data, such as Modified, Exclusive, or Owned, depending on the protocol. The coherence protocol coordinates the transfer: the supplying cache sends the line directly to the requesting cache, updates its own state, and optionally writes the line back to memory, depending on the protocol variant (plain MESI typically writes a Modified line back when a remote read occurs, while MOESI's Owned state defers that writeback). This reduces round-trip latency compared with a full DRAM fetch and conserves memory bandwidth, though it shifts traffic onto the on-chip or off-chip interconnect and can introduce contention on the coherence fabric under high-concurrency workloads.
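The decision logic above can be sketched as a small simulation. The function and its return convention are illustrative assumptions, not any vendor's actual controller logic; real protocols add many more states and transient cases:

```python
def serve_read_miss(peer_state: str, protocol: str = "MESI"):
    """Decide who supplies a line on a read miss and what states result.

    peer_state is the requested line's state in the peer cache
    ("M", "O", "E", "S", or "I"). Returns a tuple:
    (data source, writeback to memory?, new peer state, requester state).
    This is a simplified sketch; real controllers handle more cases.
    """
    if peer_state == "M":
        if protocol == "MOESI":
            # The Owned state lets the dirty line stay dirty in the peer
            # cache, so no memory writeback is needed yet.
            return ("cache", False, "O", "S")
        # Plain MESI: forward the line and also write it back so that
        # memory holds a clean copy.
        return ("cache", True, "S", "S")
    if peer_state in ("E", "O"):
        # An Exclusive or Owned copy can be forwarded cache-to-cache
        # without touching DRAM.
        return ("cache", False, "S", "S")
    # Shared or Invalid: in this simplified model, memory supplies it.
    return ("memory", False, peer_state, "S")
```

For example, `serve_read_miss("M", "MOESI")` returns `("cache", False, "O", "S")`: the dirty line is forwarded laterally and the writeback to DRAM is deferred, which is exactly the bandwidth saving described above.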
For AI and machine learning workloads, C2C efficiency has become increasingly important as models grow larger and parallel training across many cores or accelerator tiles becomes standard. Frequent parameter reads, gradient accumulations, and shared activation buffers generate dense patterns of remote cache accesses. Effective C2C transfers reduce stalls during these operations, improve effective bandwidth for shared data structures, and help mitigate the performance cost of false sharing. Conversely, poorly managed data placement or coherence policy mismatches can saturate interconnects and negate the benefits.
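False sharing, mentioned above, arises when two independently updated variables land on the same cache line, so every write by one core triggers coherence traffic that invalidates the other core's copy. A minimal sketch of detecting and fixing this with padding, assuming a 64-byte cache line (the common size on x86, but an assumption here):

```python
import ctypes

LINE = 64  # assumed cache-line size in bytes

def same_cache_line(off_a: int, off_b: int, line: int = LINE) -> bool:
    """True if two byte offsets fall within the same cache line."""
    return off_a // line == off_b // line

class Packed(ctypes.Structure):
    # Two per-thread counters adjacent in memory: the classic false-
    # sharing layout, since both fields occupy one 64-byte line.
    _fields_ = [("a", ctypes.c_uint64), ("b", ctypes.c_uint64)]

class Padded(ctypes.Structure):
    # Padding pushes the second counter onto its own line, so writes by
    # one thread no longer invalidate the line holding the other's data.
    _fields_ = [("a", ctypes.c_uint64),
                ("pad", ctypes.c_ubyte * (LINE - 8)),
                ("b", ctypes.c_uint64)]
```

Here `same_cache_line(Packed.a.offset, Packed.b.offset)` is true while the `Padded` layout separates the counters, trading a little memory for far fewer invalidation-driven transfers.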
As many-core accelerators, chiplet-based designs, and large-scale distributed training systems have proliferated through the 2010s and 2020s, hardware architects and ML system designers have paid growing attention to C2C behavior. Optimizations such as careful tensor partitioning, NUMA-aware memory allocation, and coherence domain tuning are now standard considerations when deploying large models on modern multi-socket or multi-tile hardware.
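One common form of the tensor-partitioning and NUMA-aware placement mentioned above is to give each domain a contiguous shard of a parameter array, so most accesses stay local rather than crossing domains. The helper below is a hypothetical sketch of that partitioning step only; real allocators must also bind each shard's pages to its domain:

```python
def shard_ranges(n_elems: int, n_domains: int):
    """Split [0, n_elems) into one contiguous shard per NUMA domain.

    Keeping each worker's hot parameters in its local domain turns most
    accesses into local hits instead of cross-domain coherence traffic.
    Earlier domains absorb the remainder when sizes don't divide evenly.
    """
    base, extra = divmod(n_elems, n_domains)
    ranges, start = [], 0
    for d in range(n_domains):
        size = base + (1 if d < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

For example, `shard_ranges(10, 4)` yields `[(0, 3), (3, 6), (6, 8), (8, 10)]`; a training runtime would then allocate and pin each half-open range on its corresponding domain.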