Compressing or summarizing context to fit within a model's limited context window.
Context compaction is a technique used in large language model (LLM) systems to reduce the size of an accumulated conversation or document context so that it fits within the model's fixed context window while preserving as much relevant information as possible. Because transformer-based models can only attend to a finite number of tokens at once — ranging from a few thousand to hundreds of thousands depending on the architecture — long-running interactions, agentic workflows, or large document processing tasks inevitably risk exceeding this limit. Context compaction addresses this constraint by intelligently condensing prior content before it is truncated or lost.
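The trigger condition described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the window size and threshold are hypothetical values, and the whitespace-based token count is a crude stand-in for the model's actual tokenizer.

```python
# Minimal sketch: detect when accumulated context approaches the window limit.
# CONTEXT_WINDOW and COMPACTION_THRESHOLD are hypothetical values; a real
# system would use the model's documented limit and its own tokenizer.

CONTEXT_WINDOW = 8192        # assumed model limit, in tokens
COMPACTION_THRESHOLD = 0.8   # compact once 80% of the window is used

def estimate_tokens(text: str) -> int:
    """Crude token estimate: one token per whitespace-separated word."""
    return len(text.split())

def needs_compaction(messages: list[str]) -> bool:
    """Return True when the running context nears the window limit."""
    used = sum(estimate_tokens(m) for m in messages)
    return used >= CONTEXT_WINDOW * COMPACTION_THRESHOLD
```

In practice the check runs after every turn or tool call, so compaction happens before the hard limit is reached rather than as an error-recovery step.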
The most common approach to context compaction involves using the model itself to generate a summary of earlier portions of the conversation or task history, then replacing those raw tokens with the compressed summary. This recursive summarization can be triggered automatically when the context approaches a threshold, or it can be applied in structured layers — for example, summarizing individual tool call results before aggregating them into a higher-level task summary. Other strategies include selective retention, where only the most semantically relevant or recently accessed information is kept verbatim, and retrieval-augmented approaches that offload older context to an external memory store and retrieve it on demand.
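The threshold-triggered summarization strategy can be sketched as follows. The `summarize` function here is a placeholder for an actual LLM summarization call; the token budget, the number of verbatim messages retained, and the first-sentence stand-in summary are all illustrative assumptions.

```python
def summarize(messages: list[str]) -> str:
    # Placeholder for an LLM summarization call (hypothetical).
    # As a stand-in, keep only the first sentence of each message.
    return " ".join(m.split(".")[0] + "." for m in messages)

def compact(messages: list[str], keep_recent: int = 4,
            max_tokens: int = 6000) -> list[str]:
    """Replace older messages with a single summary once the budget is hit,
    keeping the most recent messages verbatim."""
    used = sum(len(m.split()) for m in messages)
    if used < max_tokens or len(messages) <= keep_recent:
        return messages  # still within budget; nothing to do
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = "[Summary of earlier context] " + summarize(older)
    return [summary] + recent
```

Applied recursively, the summary message itself becomes part of the history and can be folded into a later, higher-level summary, which is the layered structure described above.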
Context compaction is particularly critical in agentic and multi-step reasoning systems, where an AI agent may execute dozens or hundreds of tool calls, accumulate observations, and maintain a running plan over an extended session. Without compaction, these systems either fail when the context limit is hit or must discard valuable history arbitrarily. Effective compaction strategies allow agents to operate over much longer horizons, maintain coherent task state, and avoid repeating work already completed. The quality of compaction directly affects downstream performance — overly aggressive compression can cause the model to lose critical details, while insufficient compression fails to solve the underlying problem.
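In the agentic setting, one common pattern is selective retention over the tool-call trace: recent observations stay verbatim while older results are reduced to short stubs. The sketch below assumes a simple `ToolCall` record and illustrative values for how many results to keep in full and how aggressively to truncate the rest.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    result: str

def compact_trace(trace: list[ToolCall], keep_full: int = 3,
                  stub_len: int = 40) -> list[ToolCall]:
    """Keep the most recent tool results verbatim; truncate older ones
    to a short stub so the agent retains a record of what it already did."""
    compacted = []
    for i, call in enumerate(trace):
        if i >= len(trace) - keep_full:
            compacted.append(call)  # recent: keep verbatim
        else:
            stub = call.result[:stub_len]
            if len(call.result) > stub_len:
                stub += "…"
            compacted.append(ToolCall(call.name, stub))
    return compacted
```

Keeping even a truncated stub of each earlier call, rather than dropping it, is what lets the agent avoid repeating work it has already completed.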
As context windows have grown larger, the need for compaction has shifted rather than disappeared. Because self-attention compares every token with every other, its cost grows quadratically with sequence length, making efficient context management an economic and latency concern even when hard limits are not reached. Context compaction thus remains an active area of research and engineering, intersecting with topics like memory-augmented neural networks, retrieval-augmented generation, and efficient attention mechanisms.
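The quadratic scaling has a concrete consequence: doubling the context roughly quadruples the attention compute. The sketch below uses a deliberately simplified cost model that ignores constant factors, heads, and hidden dimensions.

```python
def attention_cost(n_tokens: int) -> int:
    # Simplified model: self-attention scores every token against every
    # other token, so the pairwise work scales as n^2.
    return n_tokens * n_tokens

# Doubling the context from 4096 to 8192 tokens quadruples the pairwise work.
ratio = attention_cost(8192) / attention_cost(4096)
```

This is why a context compacted from 8,192 to 4,096 tokens can cut attention compute by roughly a factor of four, not merely a factor of two.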