Verbal Interjections

Verbal interjections are model-initiated spoken contributions that occur while the user is still speaking. They are not responses to a completed turn; they are triggered by the model's real-time assessment of the ongoing context. In a conventional half-duplex system, the model waits for the user to finish speaking before generating any response. A verbal interjection system lets the model speak up when the accumulated context warrants it and stay silent when it does not. For example, if the user says something factually incorrect, the model might interject with a correction; if the user seems confused, the model might offer clarification unprompted; if the user asks a question and then immediately answers it themselves, the model might stay quiet.

The capability depends on the same architectural foundation as full-duplex interaction and micro-turn processing: the model must be able to perceive and reason about the ongoing input stream while simultaneously generating output, with neither activity blocking the other. Turn-based architectures cannot do this, because the model cannot begin generating output until the entire input turn has been received and processed. Verbal interjections therefore emerge naturally from the time-aligned micro-turn design rather than requiring a separate interjection-detection module.
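As a rough illustration of the time-aligned idea, the sketch below steps through user input one chunk at a time and lets a policy decide, at every step, whether to speak or stay silent. Everything here is hypothetical: `run_micro_turns`, `toy_policy`, the text chunks, and the hard-coded correction are stand-ins for illustration only, and a real system would operate on audio frames with a learned model rather than string matching.

```python
from typing import Callable, List, Optional

# Hypothetical policy signature: given everything heard and everything the
# model has already said, return an utterance or None for silence.
Policy = Callable[[List[str], List[str]], Optional[str]]


def run_micro_turns(chunks: List[str], policy: Policy) -> List[Optional[str]]:
    """Process input chunk by chunk; the policy may speak at any step."""
    heard: List[str] = []
    said: List[str] = []
    outputs: List[Optional[str]] = []
    for chunk in chunks:
        # Perceive: the chunk joins the context even though the user's
        # turn is not finished.
        heard.append(chunk)
        # Act: one output slot per input slot, which may be silence (None).
        utterance = policy(heard, said)
        if utterance is not None:
            said.append(utterance)
        outputs.append(utterance)
    return outputs


def toy_policy(heard: List[str], said: List[str]) -> Optional[str]:
    """Stand-in for the model: flags one hard-coded factual slip, once."""
    text = " ".join(heard).lower()
    if "capital of australia" in text and "sydney" in text and not said:
        return "Small correction: the capital is Canberra."
    return None


chunks = ["the capital of", "australia is sydney", "and we fly there", "next week"]
outputs = run_micro_turns(chunks, toy_policy)
```

In this toy run the correction is emitted at the second step, while two further chunks of user speech are still to come. There is no separate detection stage: the interjection is simply a non-silent output at a step where the input turn has not ended.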

The practical applications are significant. A model that can interject on mispronunciations in real time (the CueSpeak benchmark) differs qualitatively from one that can only correct errors after the fact. A model that can say "hold on, I see what you mean" in response to a visual cue (visual interjections) while the user is still speaking enables a collaborative style that mirrors how humans naturally work together. Live translation, in which the model generates speech in language B while the user is still speaking in language A, is another core use case: it requires simultaneous speech, and thus the ability to interject proactively rather than reactively.

The risks of verbal interjection are also significant: a model that interjects incorrectly or excessively would be experienced as rude, distracting, or unhelpful. Calibrating interjection behavior — deciding when the model's confidence that an interjection is warranted is high enough to override the user's current turn — is an open problem. The model's decisions here have real social consequences and represent a safety-critical dimension of interaction model behavior that does not exist in turn-based systems.
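One way to make the calibration question concrete is a textbook expected-value rule. It is offered here only as a framing, not as a solution to the open problem and not as a method the text describes. Assume a warranted interjection yields some benefit, an unwarranted one incurs some social cost, and the model has a confidence estimate `p` that the interjection is warranted. Interjecting then has positive expected value when `p * benefit > (1 - p) * cost`, that is, when `p > cost / (benefit + cost)`. The function names and the numeric values below are hypothetical.

```python
def interjection_threshold(benefit: float, cost: float) -> float:
    """Confidence above which interjecting has positive expected value.

    Derived from p * benefit - (1 - p) * cost > 0.
    Both quantities are assumed inputs; estimating them is the hard part.
    """
    if benefit <= 0 or cost < 0:
        raise ValueError("benefit must be positive and cost non-negative")
    return cost / (benefit + cost)


def should_interject(confidence: float, benefit: float, cost: float) -> bool:
    """Return True if the confidence clears the expected-value threshold."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return confidence > interjection_threshold(benefit, cost)


# Illustrative numbers only: a wrong interruption is assumed to be three
# times as costly as a correct one is beneficial, so the bar is 0.75.
threshold = interjection_threshold(benefit=1.0, cost=3.0)
quiet = should_interject(0.70, benefit=1.0, cost=3.0)   # below the bar
speak = should_interject(0.90, benefit=1.0, cost=3.0)   # above the bar
```

The rule shows why the threshold cannot be a fixed constant: as the assumed cost of a wrong interruption grows relative to the benefit of a right one, the required confidence approaches 1. It says nothing about how to obtain well-calibrated confidence or realistic cost estimates, which is where the difficulty lies.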