Each token has a fixed budget of attention to distribute, so the share any single relationship receives shrinks as the context grows.
An attention budget is the finite amount of influence each token in a context window can distribute across the other tokens it attends to. When a model processes a token, it allocates portions of its representational capacity to related tokens: prior context, instructions, referenced files, or conversation history. That total allocation is fixed per token. It does not grow with the length of the overall context, so a longer context only spreads the same budget more thinly.
The mechanism stems from how transformer attention works. Each token attends to every token before it (in a causal decoder), and its attention weights are normalized by a softmax to sum to one, so the total attention any single token can pay out is bounded. Heavy weight drawn by one relationship, say a long thread of tool results, leaves proportionally less for other relationships, such as a schema reference pasted at the top of the prompt.
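A minimal sketch makes the zero-sum arithmetic concrete. The scores below are made-up stand-ins for query-key similarity logits, not values from any real model; the only load-bearing piece is the softmax normalization, which is the same operation transformers apply.

```python
import numpy as np

def attention_weights(scores):
    # Softmax: exponentiate, then normalize so the weights sum to exactly 1.
    e = np.exp(scores - scores.max())
    return e / e.sum()

labels = ["schema reference", "user instruction", "tool-result thread"]

# Hold the first two scores fixed while the tool-result thread draws
# more and more attention.
for tool_score in (1.0, 3.0, 5.0):
    scores = np.array([1.0, 1.0, tool_score])
    w = attention_weights(scores)
    print({label: round(float(weight), 3) for label, weight in zip(labels, w)})
```

The schema and instruction scores never change, yet their weights fall as the tool-result score rises, because all three relationships draw on the same normalized budget.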
This creates a practical tradeoff as sessions grow. Early in a conversation there is little competition, so the model's attention concentrates on the most recent instructions. Later, thousands of older tokens compete for the same fixed budget, the share any one of them receives shrinks, and the effective signal-to-noise ratio drops. This is why long agent sessions often exhibit degraded performance on tasks that require recalling early context; the sketch below simulates the effect.
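To see how that plays out over a session, this sketch pits one early instruction against an ever-longer tail of moderately relevant transcript tokens. The specific numbers (a score of 2.5 for the instruction, normally distributed scores for the tail) are assumptions chosen only to make the trend visible.

```python
import numpy as np

rng = np.random.default_rng(0)

INSTRUCTION_SCORE = 2.5  # assumed: a strong, still-relevant early instruction

for session_tokens in (100, 1_000, 10_000):
    # Assumed score distribution for the rest of the transcript:
    # moderately relevant tool results and conversation.
    tail_scores = rng.normal(1.0, 1.0, session_tokens)
    scores = np.concatenate([[INSTRUCTION_SCORE], tail_scores])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    print(f"{session_tokens:>6,}-token session -> "
          f"attention on the early instruction: {weights[0]:.5f}")
```

The instruction's own score is constant throughout; its share of the budget falls roughly tenfold for every tenfold increase in session length.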
Open questions include whether sparse attention mechanisms can selectively budget attention toward higher-value relationships, and whether session-level memory systems can offload low-value tokens to preserve per-token budget efficiency. The field has no established standard for measuring or communicating attention budget constraints to users.
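To make the first of those questions concrete, here is one simple instantiation of sparse attention: top-k selection, where each query keeps only its k strongest relationships and renormalizes over those. This is a sketch of the general idea, not any particular published mechanism; k=16 and the synthetic scores are assumptions.

```python
import numpy as np

def dense_attention(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def topk_sparse_attention(scores, k):
    # Keep only the k strongest relationships; the fixed per-token budget
    # is redistributed over those instead of the whole context.
    weights = np.zeros_like(scores)
    keep = np.argsort(scores)[-k:]
    e = np.exp(scores[keep] - scores[keep].max())
    weights[keep] = e / e.sum()
    return weights

rng = np.random.default_rng(0)
scores = rng.normal(0.0, 1.0, 1024)  # 1,024 mostly low-value relationships
scores[7] = 4.0                      # one high-value relationship

dense = dense_attention(scores)
sparse = topk_sparse_attention(scores, k=16)
print(f"dense weight on the high-value key:  {dense[7]:.3f}")
print(f"top-16 weight on the high-value key: {sparse[7]:.3f}")
```

Dropping the low-value keys does not enlarge the budget; it concentrates the same budget on fewer relationships, which is the sense in which sparse attention budgets toward higher-value relationships.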