Real-time, token-by-token delivery of model outputs as they are generated.
Streaming in the context of large language models refers to the practice of transmitting generated tokens to the end user incrementally as they are produced, rather than waiting for the model to complete its full response before sending anything. This stands in contrast to batch-style inference, where the entire output is computed and buffered server-side before delivery. By surfacing partial results immediately, streaming dramatically reduces the perceived latency of a response: the user sees the first words as soon as the first tokens are generated, even if the complete answer takes several seconds to finish.
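To make the difference concrete, here is a minimal Python sketch contrasting buffered and streamed consumption of the same output. The `generate_tokens` function is a stand-in for a model that emits one token per step, not a real API.

```python
import time

def generate_tokens(prompt):
    """Stand-in for an autoregressive model: yields one token at a time."""
    for token in ["Streaming", " sends", " each", " token", " as", " it", " is", " produced", "."]:
        time.sleep(0.3)  # simulate per-token generation latency
        yield token

# Batch-style: nothing is shown until every token has been generated.
start = time.time()
full_response = "".join(generate_tokens("explain streaming"))
print(f"[batch]  first output after {time.time() - start:.1f}s: {full_response}")

# Streaming: the first token is visible after a single generation step.
start = time.time()
for i, token in enumerate(generate_tokens("explain streaming")):
    if i == 0:
        print(f"[stream] first token after {time.time() - start:.1f}s")
    print(token, end="", flush=True)
print()
```

The total generation time is identical in both loops; only the moment at which output first becomes visible changes.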
Under the hood, streaming is typically implemented with chunked HTTP responses or server-sent events (SSE), which let a server push data to the client continuously over a single open connection. Each chunk usually carries one or a few tokens along with metadata. The client application renders these tokens progressively, producing the familiar typewriter effect seen in chat interfaces. On the model side, nothing about the underlying autoregressive generation changes; the model still predicts one token at a time conditioned on all prior context, but the serving infrastructure forwards each token downstream instead of waiting for an end-of-sequence signal.
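Below is a rough Python sketch of the client side of such a stream, using the `requests` library. The endpoint URL, the `stream` request field, the `token` field in each chunk, and the `[DONE]` sentinel are illustrative assumptions; real providers differ in URLs, authentication, and chunk schema, but the SSE parsing loop looks broadly like this.

```python
import json
import requests

# Hypothetical streaming endpoint and chunk schema.
resp = requests.post(
    "https://api.example.com/v1/generate",
    json={"prompt": "Explain streaming", "stream": True},
    stream=True,   # keep the connection open and read chunks as they arrive
    timeout=60,
)

for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data:"):
        continue                       # SSE frames are separated by blank lines
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":            # assumed sentinel marking end of stream
        break
    chunk = json.loads(payload)
    print(chunk.get("token", ""), end="", flush=True)  # typewriter-style rendering
print()
```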
Streaming matters enormously for user experience in interactive applications. Cognitive research consistently shows that users tolerate longer total response times when they receive immediate feedback; a response that begins appearing in under a second feels far more responsive than one that arrives all at once after three seconds, even if the total content is identical. This makes streaming nearly mandatory for conversational assistants, live coding tools, and document drafting applications where engagement depends on a sense of fluid dialogue.
Beyond UX, streaming also enables early stopping and interruption — users can read the beginning of a response and cancel generation if the model has clearly gone off track, saving compute. It also opens the door to streaming-aware pipelines where downstream components (such as text-to-speech engines or UI renderers) can begin processing output before generation is complete, further compressing end-to-end latency in complex AI-powered systems.
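As an illustration, the sketch below combines both ideas under the same assumed endpoint and chunk schema as above: tokens are buffered into sentences and handed to a hypothetical downstream consumer (`speak`, standing in for a text-to-speech engine) while generation is still running, and closing the connection abandons the remainder of the response if the user cancels.

```python
import json
import requests

def stream_tokens(resp):
    """Yield tokens from the hypothetical SSE stream shown earlier."""
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data:") and line[5:].strip() != "[DONE]":
            yield json.loads(line[5:]).get("token", "")

def speak(sentence):
    """Stand-in for a downstream consumer such as a text-to-speech engine."""
    print(f"[tts] {sentence!r}")

cancelled = False  # would be flipped by a UI event in a real application
resp = requests.post(
    "https://api.example.com/v1/generate",
    json={"prompt": "Read me a summary", "stream": True},
    stream=True,
    timeout=60,
)

buffer = ""
with resp:  # closing the connection signals the server to stop generating
    for token in stream_tokens(resp):
        if cancelled:
            break                       # early stop: skip the rest of the answer
        buffer += token
        if buffer.rstrip().endswith((".", "!", "?")):
            speak(buffer.strip())       # hand completed sentences downstream immediately
            buffer = ""
if buffer.strip():
    speak(buffer.strip())               # flush any trailing partial sentence
```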