Envisioning is an emerging technology research institute and advisory.

Voice Activity Detection

Component detecting speech presence in audio to identify turn boundaries in half-duplex systems.

Year: 2020 | Generality: 580 | Added: May 12, 2026

Voice activity detection (VAD) is a signal processing component that determines whether speech is present in an audio stream, typically by analyzing energy levels, spectral characteristics, and temporal patterns to distinguish speech from silence or noise. In half-duplex conversational AI systems, VAD is used to identify turn boundaries — detecting when a user has stopped speaking so that the system can begin its response. It is a foundational component of virtually all production voice AI systems, including Siri, Alexa, Google Assistant, and most real-time speech-to-text APIs.
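As a rough illustration of the energy-based analysis mentioned above, the following is a minimal sketch of a frame-level detector, not the method used by any particular product. The function names, the 16 kHz sample rate, the 10 ms frame length, and the -35 dB threshold are all assumptions chosen for the example; production detectors usually also use spectral and temporal features or a small learned model.

```python
import math

def frame_rms_db(frame):
    """Return the RMS level of one frame in dB, for samples scaled to [-1, 1]."""
    if not frame:
        return -120.0
    mean_square = sum(s * s for s in frame) / len(frame)
    # Floor the value so that pure silence does not produce log10(0).
    return 10.0 * math.log10(max(mean_square, 1e-12))

def energy_vad(samples, frame_len=160, threshold_db=-35.0):
    """Label each non-overlapping frame True (speech-like) or False (quiet).

    frame_len=160 is 10 ms at an assumed 16 kHz sample rate. threshold_db is
    an illustrative value, not a recommended setting.
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_rms_db(frame) > threshold_db)
    return flags

# Synthetic input: 20 ms of silence followed by 20 ms of a 200 Hz tone.
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(320)]
print(energy_vad(silence + tone))  # [False, False, True, True]
```

An energy threshold alone cannot tell speech from loud non-speech sounds, which is one reason real detectors combine several cues.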

VAD works by continuously analyzing the incoming audio stream and producing a binary or probabilistic decision about whether the current audio contains speech. When the speech probability stays low for long enough after the user has been speaking, the system concludes that the user's turn has ended and invokes the language model to generate a response. The end-of-turn threshold, typically the length of silence required, involves a tradeoff: lowering it makes the system more responsive (it begins responding sooner after the user finishes) but increases false positives (treating hesitations or mid-sentence pauses as complete turns); raising it reduces false positives but increases latency.
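The tradeoff can be made concrete with a small sketch. One common way to turn per-frame speech decisions into a turn boundary is to wait for a sustained run of non-speech frames after speech has been heard. The function name `detect_end_of_turn`, its parameters, and the 10 ms frame hop are hypothetical choices for this example rather than any specific system's API.

```python
def detect_end_of_turn(speech_probs, speech_threshold=0.5, min_silence_frames=30):
    """Return the frame index at which the turn is declared over, or None.

    speech_probs: per-frame speech probabilities in [0, 1] from any VAD.
    min_silence_frames: consecutive non-speech frames required after speech
    (30 frames is 300 ms at an assumed 10 ms hop).
    """
    heard_speech = False
    silent_run = 0
    for i, p in enumerate(speech_probs):
        if p >= speech_threshold:
            heard_speech = True
            silent_run = 0  # any speech frame resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= min_silence_frames:
                return i
    return None

# A user speaks, hesitates for 100 ms, speaks again, then stops.
probs = [0.9] * 20 + [0.1] * 10 + [0.9] * 20 + [0.1] * 50

print(detect_end_of_turn(probs, min_silence_frames=5))   # 24: cuts in during the hesitation
print(detect_end_of_turn(probs, min_silence_frames=30))  # 79: waits 300 ms after the real end
```

With the short setting the system responds quickly but interrupts the user mid-thought. With the long setting it waits through the hesitation but adds 300 ms of delay to every response.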

The fundamental limitation of VAD-based turn management is that the VAD is a hand-crafted component, typically a relatively simple model, that is architecturally separate from and less capable than the core language model. This creates an information bottleneck: the VAD decides what to pass to the language model and when, without knowing what the language model needs. A language model that could benefit from partial or interrupted user input, or that could interject during a long utterance with a clarifying question, cannot do so because it receives input only after the VAD has declared the turn over. The bitter lesson (Sutton 2019) suggests that as scale increases, such hand-crafted components will increasingly become the limiting factor in system capability.

In the interaction model paradigm, VAD is replaced by the model's own continuous perception of the audio stream. Rather than a separate component deciding when the user has finished speaking, the interaction model observes the audio directly and makes its own judgments about turn-taking, interjection, and backchanneling. This eliminates the VAD bottleneck but shifts the complexity to the interaction model's architecture and training, which must learn appropriate turn-taking behavior from data rather than relying on a rule-based signal.