Component detecting speech presence in audio to identify turn boundaries in half-duplex systems.
Voice activity detection (VAD) is a signal processing component that determines whether speech is present in an audio stream, typically by analyzing energy levels, spectral characteristics, and temporal patterns to distinguish speech from silence or noise. In half-duplex conversational AI systems, VAD is used to identify turn boundaries: it detects when a user has stopped speaking so that the system can begin its response. It is a foundational component of virtually all production voice AI systems, including Siri, Alexa, Google Assistant, and most real-time speech-to-text APIs.
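As a rough illustration of the energy-based end of this spectrum, the sketch below labels fixed-length frames as speech or non-speech by comparing their root-mean-square energy to a threshold. The function names, the 160-sample frame length (10 ms at an assumed 16 kHz sample rate), and the threshold value are illustrative assumptions, not taken from any particular product; production VADs usually add spectral features or a small learned model on top.

```python
import math

def frame_rms(frame):
    """Root-mean-square energy of one frame of float samples in [-1, 1]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def energy_vad(samples, frame_len=160, rms_threshold=0.02):
    """Return one boolean per full frame: True if the frame's RMS energy
    exceeds rms_threshold (treated as speech), else False.

    frame_len and rms_threshold are hypothetical defaults chosen for
    illustration; real systems tune or adapt them to the noise floor.
    """
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        decisions.append(frame_rms(frame) > rms_threshold)
    return decisions

# Two frames of digital silence followed by two frames of a 200 Hz tone,
# used here as a stand-in for voiced audio.
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(320)]
decisions = energy_vad(silence + tone)
```

A pure energy threshold cannot tell speech from any other loud sound, which is why the spectral and temporal cues mentioned above matter in practice.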
VAD works by continuously analyzing the incoming audio stream and producing a binary or probabilistic decision about whether the current audio contains speech. When the speech probability stays low for longer than a configured silence duration after the user has been speaking, the system concludes that the user's turn has ended and invokes the language model to generate a response. The threshold for this decision involves a tradeoff: lowering the required silence duration makes the system more responsive (it will begin responding sooner after the user finishes) but increases false positives (treating hesitations or mid-sentence pauses as if they were complete turns); raising it reduces false positives but increases latency.
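The end-of-turn decision can be sketched as a small state machine over per-frame speech probabilities. This is a minimal sketch under stated assumptions: the function name, the 10 ms frame duration, the 0.5 speech threshold, and the 500 ms silence requirement are all hypothetical values chosen for illustration.

```python
def detect_end_of_turn(speech_probs, frame_ms=10,
                       speech_threshold=0.5, min_silence_ms=500):
    """Return the index of the frame at which end-of-turn is declared,
    or None if no end-of-turn is found.

    A turn ends once speech has been heard and the speech probability
    then stays below speech_threshold for min_silence_ms in a row.
    All parameter defaults are illustrative assumptions.
    """
    heard_speech = False
    silence_ms = 0
    for i, p in enumerate(speech_probs):
        if p >= speech_threshold:
            heard_speech = True
            silence_ms = 0          # any speech frame resets the silence timer
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= min_silence_ms:
                return i
    return None

# Leading noise, speech, a 100 ms hesitation, more speech, then real silence.
probs = [0.1] * 5 + [0.9] * 20 + [0.1] * 10 + [0.9] * 10 + [0.1] * 60

patient = detect_end_of_turn(probs, min_silence_ms=500)  # waits out the hesitation
eager = detect_end_of_turn(probs, min_silence_ms=100)    # fires during the hesitation
```

With the longer silence requirement the detector ignores the 100 ms hesitation and fires only after the final pause, at the cost of half a second of added latency; with the shorter one it responds quickly but cuts the speaker off mid-utterance.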
The fundamental limitation of VAD-based turn management is that it is a hand-crafted component (typically a relatively simple model) that is architecturally separate from and less capable than the core language model. This creates an information bottleneck: the VAD decides what to pass to the language model and when, without knowing what the language model needs. A language model that could benefit from partial or interrupted user input, or that could interject during a long speech with a clarification question, cannot do so, because it receives input only after the VAD has declared the turn complete and has no channel for requesting anything earlier. The bitter lesson (Sutton 2019) predicts that as scale increases, such hand-crafted components will increasingly be the limiting factor in system capability.
In the interaction model paradigm, VAD is replaced by the model's own continuous perception of the audio stream. Rather than a separate component deciding when the user has finished speaking, the interaction model observes the audio directly and makes its own judgments about turn-taking, interjection, and backchanneling. This eliminates the VAD bottleneck but shifts the complexity to the interaction model's architecture and training, which must learn appropriate turn-taking behavior from data rather than relying on a rule-based signal.