Streaming Sessions

Streaming sessions are an inference optimization that keeps a persistent sequence in GPU memory across a series of sequential micro-turn requests, avoiding the repeated memory allocation and key-value (KV) cache initialization that would otherwise occur with every new 200ms chunk. In a naive implementation, each micro-turn is processed as an independent request, which requires allocating GPU memory, initializing the transformer key-value cache, and computing position embeddings for every chunk. At 200ms granularity, the overhead of these operations can exceed the actual computation of the chunk itself.

The streaming sessions approach takes advantage of the fact that micro-turn chunks in an interaction model arrive in a predictable, sequential order with temporal continuity. Rather than treating each chunk as a fresh sequence, the server appends each chunk to a persistent sequence stored in GPU memory, maintaining the full context of the session across all chunks. Each new chunk requires a forward pass only over its own tokens: their attention queries read the cached keys and values of the entire accumulated sequence, so prior chunks never need to be reprocessed. This removes both the per-chunk setup cost and the recomputation of earlier context, which is essential for meeting the strict latency constraints of real-time interaction at 200ms granularity.
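The append-and-attend step described above can be illustrated with a minimal sketch. It assumes a single attention head, uses NumPy arrays on the CPU in place of GPU memory, and the names StreamingSession, append_chunk, and causal_attention are hypothetical, not taken from any serving framework. The final check confirms that streaming chunk by chunk gives the same output as processing the whole sequence at once.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v, offset):
    """q holds the new tokens at positions offset..offset+n-1; k, v hold all tokens so far."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    q_pos = offset + np.arange(n)[:, None]
    k_pos = np.arange(k.shape[0])[None, :]
    # Each query may only see keys at its own position or earlier.
    scores = np.where(k_pos <= q_pos, scores, -np.inf)
    return softmax(scores) @ v

class StreamingSession:
    """Hypothetical single-head session that keeps keys/values across chunks."""

    def __init__(self, wq, wk, wv):
        self.wq, self.wk, self.wv = wq, wk, wv
        d = wk.shape[1]
        self.k = np.empty((0, d))
        self.v = np.empty((0, d))

    def append_chunk(self, x):
        offset = self.k.shape[0]
        # Only the new chunk is projected; earlier keys/values are reused as-is.
        # (np.concatenate copies; a real server would write into preallocated GPU memory.)
        self.k = np.concatenate([self.k, x @ self.wk])
        self.v = np.concatenate([self.v, x @ self.wv])
        return causal_attention(x @ self.wq, self.k, self.v, offset)

rng = np.random.default_rng(0)
d = 8
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
chunks = [rng.normal(size=(4, d)) for _ in range(3)]  # three micro-turn chunks

session = StreamingSession(wq, wk, wv)
streamed = np.concatenate([session.append_chunk(c) for c in chunks])

# Reference: process the full 12-token sequence in one pass.
x_all = np.concatenate(chunks)
full = causal_attention(x_all @ wq, x_all @ wk, x_all @ wv, offset=0)
print(np.allclose(streamed, full))
```

The equivalence holds because causal attention for a token depends only on keys and values at earlier positions, which are exactly what the cache already contains.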

The technique was developed to address the mismatch between how existing LLM inference systems are optimized and what continuous interaction requires. Most LLM serving systems are optimized for long continuous sequences (processing a large document) or short independent requests (single-turn inference). Neither pattern matches the streaming micro-turn scenario: many very short requests with strict latency budgets arriving in rapid succession, where the total session length can grow very long even as individual request sizes remain tiny. Streaming sessions address this by separating the logical request boundary (each 200ms chunk) from the physical GPU memory management (a single growing sequence).
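As a rough illustration of that separation, the sketch below routes many small logical requests to one physical buffer per session. Everything here is an assumption made for illustration: SessionManager, SessionBuffer, and handle_request are hypothetical names, a NumPy array stands in for GPU memory, and the capacity-doubling policy is only one simple way to keep allocations rare. Production servers typically manage KV memory differently, for example in fixed-size pages.

```python
import numpy as np

class SessionBuffer:
    """One growing sequence backed by a buffer that doubles when full."""

    def __init__(self, dim, capacity=16):
        self.data = np.zeros((capacity, dim), dtype=np.float32)
        self.length = 0
        self.allocations = 1  # counts physical allocations, including the first

    def append(self, chunk):
        need = self.length + len(chunk)
        if need > len(self.data):
            # Grow geometrically so allocations stay rare as the session lengthens.
            new_cap = max(need, 2 * len(self.data))
            grown = np.zeros((new_cap, self.data.shape[1]), dtype=np.float32)
            grown[:self.length] = self.data[:self.length]
            self.data = grown
            self.allocations += 1
        self.data[self.length:need] = chunk
        self.length = need

class SessionManager:
    """Maps each logical request onto the persistent buffer of its session."""

    def __init__(self, dim):
        self.dim = dim
        self.sessions = {}

    def handle_request(self, session_id, chunk):
        buf = self.sessions.get(session_id)
        if buf is None:
            buf = self.sessions[session_id] = SessionBuffer(self.dim)
        buf.append(chunk)
        return buf.length  # total session length after this request

    def close(self, session_id):
        self.sessions.pop(session_id, None)

manager = SessionManager(dim=4)
for _ in range(50):  # 50 logical requests of 4 tokens each
    total = manager.handle_request("s1", np.ones((4, 4), dtype=np.float32))

buf = manager.sessions["s1"]
print(total, buf.allocations)  # 200 tokens stored, 5 physical allocations
```

In this toy run, 50 requests trigger only 5 buffer allocations, whereas treating each request as independent would allocate on every one.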

The concept has been upstreamed to SGLang, a popular open-source LLM serving framework, which suggests applicability beyond the specific interaction model implementation. The key engineering challenge is managing the growth of the persistent sequence over very long sessions without running out of GPU memory. This remains an active area of work involving context management, eviction strategies, and architectural changes to the attention mechanism.
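To make the idea of an eviction strategy concrete, here is a minimal sketch of one of the simplest options, a sliding window that drops the oldest cached entries once a token budget is exceeded. This is an illustrative assumption, not the strategy used by the system described here or by SGLang; BoundedKVCache and max_tokens are hypothetical names, and NumPy arrays again stand in for GPU memory.

```python
import numpy as np

class BoundedKVCache:
    """Hypothetical KV cache that keeps only the most recent max_tokens entries."""

    def __init__(self, dim, max_tokens):
        self.max_tokens = max_tokens
        self.k = np.empty((0, dim))
        self.v = np.empty((0, dim))
        self.evicted = 0  # running count of dropped tokens

    def append(self, k_new, v_new):
        self.k = np.concatenate([self.k, k_new])
        self.v = np.concatenate([self.v, v_new])
        overflow = self.k.shape[0] - self.max_tokens
        if overflow > 0:
            # Drop the oldest entries; the model can no longer attend to them.
            self.k = self.k[overflow:]
            self.v = self.v[overflow:]
            self.evicted += overflow

cache = BoundedKVCache(dim=2, max_tokens=10)
for i in range(4):
    # Fill each 4-token chunk with its index so retained chunks are easy to identify.
    chunk = np.full((4, 2), float(i))
    cache.append(chunk, chunk)

print(cache.k.shape[0], cache.evicted, cache.k[0, 0])  # 10 6 1.0
```

The trade-off is that memory stays bounded at the cost of losing early context, which is why the choice of what to evict is a design question in its own right.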