Micro-Turn

A micro-turn is the fundamental unit of processing in an interaction model: a small, time-bounded chunk of input or output (typically around 200 milliseconds of audio or video) that is handled as one step in a continuous stream rather than as a complete user turn or model response. In conventional turn-based systems, the model waits for the user to finish composing a complete message before it begins processing. In a micro-turn architecture, input and output tokens are both processed as ongoing streams, and the model interleaves perception and generation continuously instead of waiting for a complete turn on either side.
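
The interleaving described above can be illustrated with a minimal sketch. Everything in it is hypothetical: `ToyInteractionModel`, its `step` method, and the rule it uses to decide when to emit output are stand-ins invented for illustration, not a real model or API. The point is only the shape of the loop, in which every micro-turn performs one perception step and one generation step.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyInteractionModel:
    """Hypothetical stand-in for a model; it only records what it has seen."""
    context: List[str] = field(default_factory=list)

    def step(self, input_chunk: str) -> Optional[str]:
        # Perception: append this micro-turn's input to the running context.
        self.context.append(f"in:{input_chunk}")
        # Generation: emit an output chunk, or None to stay silent.
        # The "ends with ?" rule is an arbitrary toy policy.
        output = f"ack({input_chunk})" if input_chunk.endswith("?") else None
        if output is not None:
            self.context.append(f"out:{output}")
        return output


def run_stream(model: ToyInteractionModel,
               input_chunks: List[str]) -> List[Optional[str]]:
    """Run one perception step and one generation step per micro-turn."""
    return [model.step(chunk) for chunk in input_chunks]


model = ToyInteractionModel()
outputs = run_stream(model, ["hel", "lo", "ok?", "and"])
print(outputs)        # one entry per micro-turn; None means silence
print(model.context)  # input and output chunks interleaved in one context
```

Input keeps arriving after the model has produced output, and both kinds of chunk end up in the same running context; no step in the loop waits for a complete turn.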

The 200-millisecond granularity matters because it approximates the timescale of natural conversational overlap and responsiveness. Human conversational turn-taking occurs on the order of hundreds of milliseconds; delays much longer than that feel unnatural and disrupt the sense of presence. By working with 200 ms chunks of input and generation, the model can operate in near real time across several input and output modalities at once (audio in, audio out, video in), without artificial boundaries dictating when input ends and output begins.
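
To make the chunk size concrete, the sketch below works out how many audio samples fall into one micro-turn and splits a buffer accordingly. The 200 ms duration comes from the text; the 16 kHz sample rate and the function names are assumptions chosen for illustration.

```python
from typing import Iterator, Sequence

MICRO_TURN_MS = 200  # chunk duration from the text


def samples_per_micro_turn(sample_rate_hz: int,
                           chunk_ms: int = MICRO_TURN_MS) -> int:
    """Number of audio samples covered by one micro-turn."""
    return sample_rate_hz * chunk_ms // 1000


def split_into_micro_turns(samples: Sequence[float],
                           sample_rate_hz: int,
                           chunk_ms: int = MICRO_TURN_MS
                           ) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-duration chunks; the last may be shorter."""
    step = samples_per_micro_turn(sample_rate_hz, chunk_ms)
    for start in range(0, len(samples), step):
        yield samples[start:start + step]


SAMPLE_RATE_HZ = 16_000                 # assumed sample rate
one_second = [0.0] * SAMPLE_RATE_HZ     # one second of silence
chunks = list(split_into_micro_turns(one_second, SAMPLE_RATE_HZ))
print(samples_per_micro_turn(SAMPLE_RATE_HZ))  # 3200
print(len(chunks))                             # 5
```

At the assumed rate, one second of audio becomes five micro-turns of 3,200 samples each, so the model gets five opportunities per second to perceive and respond.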

Micro-turns eliminate the "turn boundary" as a meaningful architectural unit. In most existing real-time systems, a separate harness component, typically a voice-activity detection (VAD) module, predicts when a turn has ended so that a turn-based model can be invoked. The VAD is usually much less intelligent than the model itself, which rules out interaction modes that require proactive interjections, reactions to visual cues, or simultaneous speech. With micro-turns, these capabilities emerge naturally from continuous processing: the model can choose to interject whenever the accumulated context warrants it, without waiting for a turn-boundary signal from a separate component.
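
The difference can be sketched with a toy simulation. The `Chunk` type, the silence threshold, and the `wants_to_speak` policy are all hypothetical; the policy is a hand-written rule standing in for the model's own judgment. The sketch compares the points at which a VAD-gated harness would invoke a turn-based model with the points at which a per-micro-turn policy could choose to respond.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    user_speaking: bool
    visual_cue: bool = False  # e.g. the user holds something up to the camera


def vad_gated_response_points(chunks: List[Chunk],
                              silence_chunks: int = 2) -> List[int]:
    """Indices where a VAD harness would invoke a turn-based model: only
    after `silence_chunks` consecutive silent chunks following speech."""
    points, silent_run, heard_speech = [], 0, False
    for i, chunk in enumerate(chunks):
        if chunk.user_speaking:
            heard_speech, silent_run = True, 0
        else:
            silent_run += 1
            if heard_speech and silent_run == silence_chunks:
                points.append(i)
                heard_speech = False
    return points


def wants_to_speak(history: List[Chunk]) -> bool:
    """Toy policy over the accumulated context (a stand-in for the model)."""
    latest = history[-1]
    if latest.visual_cue:
        return True  # react to a visual cue, even while the user is speaking
    # Otherwise respond as soon as the user pauses after speaking.
    return (len(history) >= 2
            and history[-2].user_speaking
            and not latest.user_speaking)


def micro_turn_response_points(chunks: List[Chunk]) -> List[int]:
    """Indices where the per-chunk policy chooses to respond."""
    return [i for i in range(len(chunks)) if wants_to_speak(chunks[:i + 1])]


stream = [Chunk(True), Chunk(True, visual_cue=True), Chunk(True),
          Chunk(False), Chunk(False)]
print(vad_gated_response_points(stream))   # [4]
print(micro_turn_response_points(stream))  # [1, 3]
```

In this toy stream the VAD-gated harness can act only at index 4, after two silent chunks, whereas the per-chunk policy responds to the visual cue at index 1, while the user is still speaking, and again at the first pause.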

The tradeoff is inference efficiency. Processing very small chunks under strict latency constraints means frequent prefills and decodes of short sequences, each of which incurs GPU memory allocation and metadata computation. Existing LLM inference libraries are often not optimized for this pattern. Mitigations include streaming sessions (which keep a persistent GPU sequence alive across chunks) and kernel-level optimizations for the specific shapes encountered in bidirectional serving. The computational overhead of micro-turn processing is a real cost that must be weighed against the gains in interaction quality.
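
The idea behind a streaming session can be sketched as a toy accounting exercise. Both classes below are hypothetical and touch no GPU; they only count how often per-sequence state would be set up. The sketch assumes that a per-request server allocates fresh sequence state on every micro-turn, while a streaming session allocates once and appends to the same sequence.

```python
from typing import List


class PerRequestServer:
    """Toy model of serving each micro-turn as an independent request."""

    def __init__(self) -> None:
        self.allocations = 0

    def process(self, history: List[int], chunk: List[int]) -> int:
        self.allocations += 1                 # fresh sequence state per call
        sequence = list(history) + list(chunk)
        return len(sequence)


class StreamingSession:
    """Toy model of a session that keeps one sequence alive across chunks."""

    def __init__(self) -> None:
        self.allocations = 1                  # allocated once at session start
        self.sequence: List[int] = []

    def append(self, chunk: List[int]) -> int:
        self.sequence.extend(chunk)           # reuse the existing sequence
        return len(self.sequence)


token_chunks = [[1, 2, 3], [4, 5], [6]]       # made-up token ids per micro-turn

per_request = PerRequestServer()
history: List[int] = []
for chunk in token_chunks:
    per_request.process(history, chunk)
    history.extend(chunk)

session = StreamingSession()
for chunk in token_chunks:
    session.append(chunk)

print(per_request.allocations, session.allocations)  # 3 1
```

With five micro-turns per second, the per-request count grows by five every second of conversation, while the session's count stays at one. Real allocation and metadata costs depend on the inference library and are not modeled here.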