Envisioning is an emerging technology research institute and advisory.

2011 — 2026


Streaming

Real-time, token-by-token delivery of model outputs as they are generated.

Year: 2020 · Generality: 450

Streaming in the context of large language models refers to the practice of transmitting generated tokens to the end user incrementally as they are produced, rather than waiting for the model to complete its full response before sending anything. This stands in contrast to batch-style inference, where the entire output is computed and buffered server-side before delivery. By surfacing partial results immediately, streaming dramatically reduces the perceived latency of a response — a user sees the first words within a fraction of a second of the model beginning generation, even if the complete answer takes several seconds to finish.

Under the hood, streaming is implemented over a persistent HTTP connection, typically via chunked transfer encoding or server-sent events (SSE), which allow a server to push data chunks to a client continuously over a single open connection. Each chunk typically contains one or a few tokens along with metadata. The client application renders these tokens progressively, producing the familiar typewriter effect seen in chat interfaces. On the model side, nothing about the underlying autoregressive generation changes — the model still predicts one token at a time conditioned on all prior context — but the infrastructure is designed to forward each token downstream without waiting for an end-of-sequence signal.
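The mechanics above can be sketched in a few lines of Python. This is an illustrative toy, not any particular provider's API: the "model" is a stand-in generator, and the SSE-style `data:` framing mirrors the wire format commonly used by LLM streaming endpoints.

```python
# Minimal sketch of token streaming with SSE-style chunks.
# The "model" is a hard-coded token generator standing in for
# autoregressive decoding; real endpoints differ in detail.

import json
from typing import Iterator

def generate_tokens(prompt: str) -> Iterator[str]:
    """Stand-in for autoregressive decoding: yields one token at a time."""
    for token in ["Streaming", " reduces", " perceived", " latency", "."]:
        yield token

def sse_stream(prompt: str) -> Iterator[str]:
    """Server side: wrap each token in an SSE 'data:' chunk and forward it
    immediately, without buffering the full response."""
    for token in generate_tokens(prompt):
        yield f"data: {json.dumps({'token': token})}\n\n"
    yield "data: [DONE]\n\n"

def render_stream(chunks: Iterator[str]) -> str:
    """Client side: parse each chunk as it arrives and append its token
    progressively (the 'typewriter' effect)."""
    text = ""
    for chunk in chunks:
        payload = chunk.removeprefix("data: ").strip()
        if payload == "[DONE]":
            break
        text += json.loads(payload)["token"]
    return text

print(render_stream(sse_stream("explain streaming")))
# prints "Streaming reduces perceived latency."
```

Note that the client never waits for the whole response: each `data:` chunk is parsed and rendered the moment it crosses the connection, which is exactly what keeps time-to-first-token low.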

Streaming matters enormously for user experience in interactive applications. Cognitive research consistently shows that users tolerate longer total response times when they receive immediate feedback; a response that begins appearing in under a second feels far more responsive than one that arrives all at once after three seconds, even if the total content is identical. This makes streaming nearly mandatory for conversational assistants, live coding tools, and document drafting applications where engagement depends on a sense of fluid dialogue.

Beyond UX, streaming also enables early stopping and interruption — users can read the beginning of a response and cancel generation if the model has clearly gone off track, saving compute. It also opens the door to streaming-aware pipelines where downstream components (such as text-to-speech engines or UI renderers) can begin processing output before generation is complete, further compressing end-to-end latency in complex AI-powered systems.
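The early-stopping idea can be sketched with plain Python generators, where closing the stream stands in for cancelling the server-side request; the tokens and stop condition here are invented for illustration.

```python
# Sketch of early stopping on a token stream: the client abandons
# the response mid-generation, and closing the generator mimics
# cancelling the request so the server stops spending compute.

from typing import Iterator

def generate_tokens() -> Iterator[str]:
    """Stand-in model that has gone off track."""
    for token in ["Sure", ",", " here", " is", " an", " unrelated", " poem", "..."]:
        yield token

def read_until(stream: Iterator[str], stop_word: str) -> str:
    """Consume tokens until a stop condition is met, then close the
    stream; GeneratorExit propagates upstream, halting generation."""
    text = ""
    for token in stream:
        text += token
        if stop_word in token:
            break
    stream.close()  # cancel remaining generation
    return text

print(read_until(generate_tokens(), "unrelated"))
# prints "Sure, here is an unrelated"
```

The same consume-as-you-go structure is what lets downstream components such as text-to-speech engines start work on early tokens before generation finishes.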

Related

  • Continuous Batching: a technique that dynamically groups incoming requests into batches for efficient ML inference. (Generality: 339)
  • Token Speculation Techniques: methods that predict multiple candidate tokens in parallel to accelerate text generation. (Generality: 450)
  • Speculative Decoding: a technique that accelerates LLM inference by drafting and verifying token sequences in parallel. (Generality: 520)
  • Sequential Models: AI models that process ordered data by capturing dependencies across time or position. (Generality: 795)
  • Inference Scaling: improving model outputs by allocating more compute during inference rather than during training. (Generality: 812)
  • Disaggregated Inference: a model serving technique that separates LLM inference into specialized hardware for prompt processing (prefill) and token generation (decode), enabling higher throughput and lower power at the cost of deployment flexibility. (Generality: 710)