A decoder-only Transformer with addressable auxiliary memory enabling reasoning far beyond its attention window.
A Large Memory Model (L2M) is a decoder-only Transformer architecture augmented with an explicit auxiliary memory module that provides persistent, content-addressable storage for intermediate representations, retrieved states, and relational facts. Unlike standard Transformers, which are constrained by a fixed context window, L2M systems offload long-lived information to an external memory bank, allowing the model to maintain world state, track entities across vast spans of text, and synthesize evidence distributed far beyond what local attention can reach. The memory module is typically differentiable and trained end-to-end alongside the base model.
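The core idea of content-addressable storage can be illustrated with a minimal sketch: a read is an attention operation over a bank of key-value slots, so a query that matches a stored key recovers the associated value. The function name, shapes, and the orthonormal toy keys below are illustrative assumptions, not part of any specific L2M implementation.

```python
import numpy as np

def content_read(query, mem_keys, mem_values):
    """Content-based addressing: attend over an external memory bank.

    query:      (d,)   read query derived from the decoder's hidden state
    mem_keys:   (N, d) keys for the N stored slots
    mem_values: (N, d) stored value vectors
    Returns a (d,) read vector: a softmax-weighted mix of memory values.
    """
    scores = mem_keys @ query / np.sqrt(query.shape[0])  # scaled dot-product similarity
    weights = np.exp(scores - scores.max())              # stable softmax over slots
    weights /= weights.sum()
    return weights @ mem_values

# Toy memory: 8 slots with orthonormal keys and fixed random values.
rng = np.random.default_rng(0)
d, n_slots = 16, 8
keys = np.eye(n_slots, d)                # one-hot keys, purely for illustration
values = rng.standard_normal((n_slots, d))

# A query strongly aligned with key 3 retrieves (approximately) slot 3's value.
read = content_read(40.0 * keys[3], keys, values)
```

Because addressing is by content rather than by position, the same mechanism works no matter how far back in the input the slot was written, which is what decouples the memory's reach from the attention window.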
In practice, L2M architectures interface with memory through learned read and write primitives: content-based or sparse addressing, gating mechanisms, and learned key-value representations. Training may incorporate auxiliary objectives such as retrieval supervision, contrastive losses, or reconstruction targets to encourage the model to store and retrieve information faithfully. Some designs integrate retrieval-augmented generation pipelines, while others use hierarchical memory tiers with learned compression or eviction strategies to manage capacity and latency. Together, these mechanisms support multi-step reasoning: the model stores intermediate latent states and chains inference across successive retrieval operations.
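How the read, write, gating, and eviction primitives fit together can be sketched as a small memory-bank class. Everything here is a simplified stand-in: the class name is invented, the gate is a scalar argument rather than a learned network, and eviction is plain least-recently-used rather than a learned policy.

```python
import numpy as np

class ExternalMemory:
    """Illustrative L2M-style memory bank: content-based reads, gated
    writes, and LRU eviction once capacity is reached (a sketch, not
    any particular published design)."""

    def __init__(self, capacity, dim):
        self.capacity, self.dim = capacity, dim
        self.keys = np.zeros((0, dim))
        self.values = np.zeros((0, dim))
        self.last_read = []          # per-slot read timestamps for LRU eviction
        self.clock = 0

    def write(self, key, value, gate=1.0):
        """Gated write: gate=1 stores a fresh slot; gate<1 blends the new
        content into the most similar existing slot (stand-in for a
        learned write gate)."""
        if len(self.keys) and gate < 1.0:
            idx = int(np.argmax(self.keys @ key))          # most similar slot
            self.values[idx] = gate * value + (1 - gate) * self.values[idx]
            self.keys[idx] = gate * key + (1 - gate) * self.keys[idx]
            return
        if len(self.keys) >= self.capacity:                # evict LRU slot
            victim = int(np.argmin(self.last_read))
            self.keys = np.delete(self.keys, victim, axis=0)
            self.values = np.delete(self.values, victim, axis=0)
            self.last_read.pop(victim)
        self.keys = np.vstack([self.keys, key])
        self.values = np.vstack([self.values, value])
        self.last_read.append(self.clock)

    def read(self, query):
        """Content-based read: softmax attention over stored keys; the
        best-matching slot's timestamp is refreshed so it survives eviction."""
        self.clock += 1
        scores = self.keys @ query / np.sqrt(self.dim)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        self.last_read[int(np.argmax(weights))] = self.clock
        return weights @ self.values
```

In a trained system the gate and the eviction decision would be produced by the model and optimized end-to-end; this sketch only shows the data flow those learned components control.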
L2M draws conceptually from a lineage of memory-augmented neural networks — including Neural Turing Machines, Memory Networks, and Compressive Transformers — while addressing the practical demands of modern large-scale language modeling. The key advance over earlier memory architectures is scale: L2M designs are built to operate with the parameter counts, training data volumes, and inference throughput requirements of contemporary foundation models, making memory augmentation viable for real deployment rather than toy tasks.
The approach shows particular value in long-document question answering, multi-turn dialogue requiring persistent state, complex program synthesis over large codebases, and multi-hop scientific or legal reasoning where evidence is scattered across lengthy inputs. As context-window scaling alone proves costly and sometimes insufficient for deep compositional reasoning, L2M represents a structurally distinct path toward extending the effective reasoning horizon of large language models without proportionally increasing attention complexity.