Simultaneous Speech

Simultaneous speech is the capability for an interaction model and a human user to speak at the same time, not as chaotic, overlapping conversation but as a controlled, productive mode of interaction. The paradigmatic use case is live translation: a user speaks in language A while the model generates the translation in language B concurrently, rather than waiting for the user to finish a complete sentence before beginning to translate. This requires the model to process input while generating output, keep both streams linguistically coherent, and manage the temporal alignment between what is being said and what is being generated.

Simultaneous speech is qualitatively different from the overlapping interruptions that occur in natural human conversation (which are handled under verbal interjections). In simultaneous speech, both streams are intentional and productive: the model is not interrupting the user but generating useful output in parallel. The capability depends on the full-duplex, time-aligned micro-turn architecture, because the model must generate output tokens continuously while receiving input tokens, rather than finishing one stream before starting the other.
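
The interleaving described above can be sketched as a loop in which every micro-turn consumes one input chunk and then makes one output decision. This is a minimal illustration, not the architecture itself: the names run_full_duplex and lagged_echo, the use of None for "stay silent", and the toy policy (which only upper-cases what it heard, standing in for translation) are all assumptions made for the example.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical policy signature: (heard so far, spoken so far, input finished?) -> next output or None
Policy = Callable[[List[str], List[str], bool], Optional[str]]

def run_full_duplex(input_chunks: Iterable[str], step: Policy) -> List[Optional[str]]:
    """Interleave listening and speaking: one input chunk in, one output decision out, per micro-turn."""
    heard: List[str] = []
    spoken: List[str] = []
    timeline: List[Optional[str]] = []
    for chunk in input_chunks:
        heard.append(chunk)                  # receive input for this micro-turn
        out = step(heard, spoken, False)     # decide what to emit in the same micro-turn
        if out is not None:
            spoken.append(out)
        timeline.append(out)                 # None marks a silent micro-turn
    # Once the input ends, keep stepping until the policy has nothing left to say.
    while (out := step(heard, spoken, True)) is not None:
        spoken.append(out)
        timeline.append(out)
    return timeline

def lagged_echo(heard: List[str], spoken: List[str], input_done: bool) -> Optional[str]:
    """Toy stand-in for translation: repeat the input in upper case, one chunk behind."""
    lag = 0 if input_done else 1
    if len(heard) - len(spoken) > lag:
        return heard[len(spoken)].upper()
    return None

timeline = run_full_duplex(["hello", "how", "are", "you"], lagged_echo)
```

Here the output starts while the input is still arriving and trails it by one chunk. A turn-based design would instead run the whole input loop first and only then begin to emit.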

The technical challenge is maintaining quality while processing both streams concurrently. A model that translates while listening must allocate attention across both tasks without degrading either. Early results from interaction model research suggest that simultaneous speech quality remains competitive with turn-based translation, at the cost of higher latency variance: the translation may trail the input by a variable amount depending on sentence structure. The problem is harder for languages with significantly different word orders (English and Japanese, for example) than for closely related languages, because a word needed early in the output may arrive late in the input.
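
To make the notion of variable lag concrete, the sketch below measures how far the input runs ahead of the output at each emitted token. It assumes a fixed-lag reading schedule (often called wait-k in the simultaneous translation literature), which the text above does not itself specify. The function names, the six-token sentence, and the reordered schedule are hypothetical illustrations, not measured results.

```python
from statistics import pvariance
from typing import List

def fixed_lag_schedule(src_len: int, tgt_len: int, k: int) -> List[int]:
    """Number of source tokens read before emitting each target token, under a fixed lag of k."""
    return [min(i + k, src_len) for i in range(tgt_len)]

def lags(read_counts: List[int]) -> List[int]:
    """How many tokens the input is ahead of the output at each emission (0-indexed output)."""
    return [read - i for i, read in enumerate(read_counts)]

# Hypothetical sentence pair with similar word order: a lag of 2 suffices throughout.
monotone = fixed_lag_schedule(src_len=6, tgt_len=6, k=2)

# Hypothetical pair with different word order: the second output token depends on the
# last input token, so the model has to read to the end before it can continue.
reordered = [2, 6, 6, 6, 6, 6]

monotone_lags = lags(monotone)
reordered_lags = lags(reordered)

# Population variance of the lag is one simple proxy for latency variance.
lag_variance = {
    "monotone": pvariance(monotone_lags),
    "reordered": pvariance(reordered_lags),
}
```

In this toy case the monotone schedule keeps an almost constant lag, while the reordered one jumps to a large lag and then catches up, which is the kind of variability the paragraph above describes.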

The broader implication is that simultaneous speech makes AI assistance viable in contexts where turn-based interaction is too slow or disruptive. Live events, lectures, negotiations, and clinical settings all involve a continuous flow of information that cannot be paused for turn-based exchanges. If AI is to function as a real-time collaborator in these settings, simultaneous speech is a prerequisite.