Envisioning is an emerging technology research institute and advisory.

Streaming Sessions

GPU memory technique that maintains a persistent sequence across sequential micro-turn requests.

Year: 2025 | Generality: 500 | Added: May 12, 2026

Streaming sessions are an inference optimization that keeps a single persistent sequence in GPU memory across a series of sequential micro-turn requests. This avoids the repeated memory allocation and key-value cache initialization that would otherwise occur with every new 200ms chunk. In a naive implementation, each micro-turn is processed as an independent request, which requires allocating GPU memory, initializing the transformer key-value cache, and computing position embeddings for every chunk. At 200ms granularity, the overhead of these operations can exceed the actual computation of the chunk itself.

The streaming sessions approach takes advantage of the fact that micro-turn chunks in an interaction model arrive in a predictable, sequential order with temporal continuity. Rather than treating each chunk as a fresh sequence, the server appends each chunk to a persistent sequence stored in GPU memory, maintaining the full context of the session across all chunks. Each new chunk only needs to compute its own forward pass. Because the keys and values computed for earlier chunks remain in the cache, the attention mechanism can attend over the entire accumulated sequence without reprocessing the prior chunks. This substantially reduces per-chunk compute overhead and is essential for meeting the strict latency constraints of real-time interaction at 200ms granularity.
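The append-and-reuse behaviour described above can be illustrated with a toy sketch. It is not the SGLang implementation: `ToySession`, `attend`, and `naive_requests` are hypothetical names, the "model" is a single attention step with identity projections, and plain Python lists stand in for GPU memory. The sketch only shows that appending chunks to a persistent cache gives the same outputs as reprocessing the whole history on every request, while touching far fewer tokens.

```python
import math
import random


def attend(query, keys, values):
    """Softmax attention of one query vector over all cached keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * value[d] for w, value in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]


class ToySession:
    """Hypothetical persistent sequence: keys and values survive across chunks."""

    def __init__(self):
        self.keys = []
        self.values = []
        self.tokens_processed = 0  # forward-pass work, counted in tokens

    def append_chunk(self, chunk):
        """Run the forward pass for the new tokens only, reusing the cache."""
        outputs = []
        for token in chunk:
            # Toy assumption: identity projections, so key = value = query = token.
            self.keys.append(token)
            self.values.append(token)
            self.tokens_processed += 1
            outputs.append(attend(token, self.keys, self.values))
        return outputs


def naive_requests(chunks):
    """Baseline: every chunk is a fresh request that reprocesses all history."""
    history, outputs, tokens_processed = [], [], 0
    for chunk in chunks:
        history.extend(chunk)
        fresh = ToySession()  # a new cache is built for each request
        result = fresh.append_chunk(history)
        tokens_processed += fresh.tokens_processed
        outputs.extend(result[-len(chunk):])  # keep outputs for the new tokens only
    return outputs, tokens_processed


random.seed(0)
# Five chunks of three 4-dimensional "tokens" each.
chunks = [[[random.gauss(0, 1) for _ in range(4)] for _ in range(3)] for _ in range(5)]

session = ToySession()
streamed = [out for chunk in chunks for out in session.append_chunk(chunk)]
baseline, naive_cost = naive_requests(chunks)

print(session.tokens_processed, naive_cost)
```

With five chunks of three tokens, the persistent session processes 15 tokens in total, while the per-request baseline processes 3 + 6 + 9 + 12 + 15 = 45, and the gap widens as the session grows.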

The technique was developed to address the mismatch between how existing LLM inference systems are optimized and what continuous interaction requires. Most LLM serving systems are optimized for long continuous sequences (processing a large document) or short independent requests (single-turn inference). Neither pattern matches the streaming micro-turn scenario: many very short requests with strict latency budgets arriving in rapid succession, where the total session length can grow very long even as individual request sizes remain tiny. Streaming sessions address this by separating the logical request boundary (each 200ms chunk) from the physical GPU memory management (a single growing sequence).
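One way to picture the split between logical requests and physical memory is a registry that maps a session identifier to one growing buffer. The sketch below is an assumption-laden illustration, not a description of any real serving framework: `SequenceBuffer`, `SessionRegistry`, the capacity-doubling growth policy, and the initial capacity of 8 are all hypothetical, and a Python list stands in for GPU memory.

```python
class SequenceBuffer:
    """Hypothetical stand-in for one growing GPU-resident sequence."""

    def __init__(self, initial_capacity=8):
        self.slots = [None] * initial_capacity
        self.length = 0
        self.allocations = 1  # the initial reservation counts as one allocation

    def append(self, tokens):
        """Append tokens, growing capacity only when the buffer is full."""
        start = self.length
        needed = self.length + len(tokens)
        if needed > len(self.slots):
            new_capacity = len(self.slots)
            while new_capacity < needed:
                new_capacity *= 2  # assumed policy: double until it fits
            self.slots = self.slots + [None] * (new_capacity - len(self.slots))
            self.allocations += 1
        self.slots[start:needed] = tokens
        self.length = needed
        return start  # position of the chunk's first token in the session


class SessionRegistry:
    """Maps each logical session to its single physical sequence."""

    def __init__(self):
        self.buffers = {}

    def handle_request(self, session_id, tokens):
        """One logical request: append a chunk to the session's sequence."""
        if session_id not in self.buffers:
            self.buffers[session_id] = SequenceBuffer()
        return self.buffers[session_id].append(tokens)

    def close(self, session_id):
        """Release the sequence when the session ends."""
        self.buffers.pop(session_id, None)


registry = SessionRegistry()
# 50 micro-turn requests of 4 tokens each, all in the same session.
offsets = [registry.handle_request("session-a", [i] * 4) for i in range(50)]
buffer = registry.buffers["session-a"]
print(buffer.length, buffer.allocations, offsets[:3])
```

In this toy run, 50 requests produce one 200-token sequence with only 6 allocations, and each chunk's returned offset continues from the previous one, so positions stay consistent across request boundaries.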

The concept has been upstreamed to SGLang, a popular open-source LLM serving framework, which suggests applicability beyond the specific interaction model implementation. The key engineering challenge is managing the growth of the persistent sequence over very long sessions without running out of GPU memory. This is an active area of work involving context management, eviction strategies, and architectural changes to the attention mechanism.
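As a rough illustration of what an eviction strategy might look like, the sketch below caps a session's cache at a fixed number of tokens and drops the oldest entries first. This is only one simple possibility, chosen for brevity. The source does not say which strategies are used in practice, and `BoundedCache` and `max_tokens` are hypothetical names.

```python
from collections import deque


class BoundedCache:
    """Hypothetical per-session cache that keeps at most max_tokens entries."""

    def __init__(self, max_tokens):
        # A deque with maxlen silently drops the oldest item when full.
        self.entries = deque(maxlen=max_tokens)
        self.evicted = 0

    def append_chunk(self, tokens):
        """Append a chunk, counting how many old entries are pushed out."""
        for token in tokens:
            if len(self.entries) == self.entries.maxlen:
                self.evicted += 1
            self.entries.append(token)


cache = BoundedCache(max_tokens=10)
for start in range(0, 24, 4):  # six chunks of four tokens: 0..23
    cache.append_chunk(list(range(start, start + 4)))

print(list(cache.entries), cache.evicted)
```

Memory stays bounded at 10 entries no matter how long the session runs, but the evicted tokens can no longer be attended to, so a policy like this trades context for memory.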