Envisioning is an emerging technology research institute and advisory.

AudioHijack

Hidden audio clips hijack voice AI systems, forcing unauthorized actions with a 79–96% success rate.

Year: 2026 | Generality: 500 | Added: May 26, 2026

AudioHijack is a technique that embeds imperceptible malicious instructions in audio signals, allowing an attacker to seize control of a large audio-language model (LALM) and force it to execute unauthorized commands without the user's knowledge. The attack manipulates the numerical waveform of an audio clip so that an LALM, which processes both speech and other audio data, interprets the hidden instructions as legitimate user input. Unlike traditional prompt injection, which requires the attacker to control the full input stream, AudioHijack manipulates only the audio data within a clip, so the attack can be planted in online videos, music, or voice notes that a victim later asks an AI about. The adversarial clip takes roughly 30 minutes to optimize against a target model and is then context-agnostic: it takes effect regardless of the instructions the legitimate user provides alongside the audio, which makes it reusable across multiple interactions with the same model. Researchers from Zhejiang University demonstrated success rates of 79–96% across 13 leading open models and commercial voice services from Microsoft and Mistral.

The mechanism exploits a critical architectural vulnerability in how LALMs process audio input. Because these models accept instructions encoded directly in audio, not just in transcribed text, an adversarial waveform can pass through the model's audio encoder and be interpreted as an explicit user command. The attacker does not need to control what the user says or how the system prompt is configured; the adversarial signal is designed to override or bypass the legitimate instructions that accompany the audio. The clip is produced by gradient-based optimization of the waveform to maximize the likelihood that the target model follows the injected instructions, and the result is robust enough to survive the typical audio compression and transcoding that would destroy naive modifications.
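The gradient-based optimization described above can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration and not the researchers' method: a linear scorer (`follow_probability`) stands in for a real LALM, the `context_biases` stand in for different user prompts, and a small L-infinity bound `eps` stands in for imperceptibility. It does not model robustness to compression or transcoding. It only shows the general shape of the idea: nudge a bounded perturbation along the gradient so that the probability of following the injected instruction rises across several contexts at once.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def follow_probability(weights, waveform, context_bias):
    """Hypothetical toy model: P(model follows the injected instruction)."""
    score = sum(w * x for w, x in zip(weights, waveform)) + context_bias
    return sigmoid(score)


def optimize_perturbation(weights, clean, context_biases,
                          eps=0.01, step=0.002, iters=50):
    """Projected gradient ascent on the mean log-probability over contexts.

    The perturbation stays inside an L-infinity ball of radius eps, used here
    as a crude proxy for "too small to notice" (an assumption of this sketch).
    """
    delta = [0.0] * len(clean)
    for _ in range(iters):
        perturbed = [x + d for x, d in zip(clean, delta)]
        # d/d(score) of log(sigmoid(score)) is 1 - sigmoid(score);
        # average that factor over all contexts.
        scale = sum(
            1.0 - follow_probability(weights, perturbed, b)
            for b in context_biases
        ) / len(context_biases)
        for i, w in enumerate(weights):
            grad_i = scale * w
            sign = (grad_i > 0) - (grad_i < 0)
            # Signed step, then project back into the eps-ball.
            delta[i] = max(-eps, min(eps, delta[i] + step * sign))
    return delta


rng = random.Random(0)
n = 2000
weights = [rng.uniform(-1.0, 1.0) for _ in range(n)]           # toy "model"
clean = [0.1 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(n)]
context_biases = [-6.0, -5.0, -4.0]                             # toy "user prompts"

delta = optimize_perturbation(weights, clean, context_biases)
adversarial = [x + d for x, d in zip(clean, delta)]
before = [follow_probability(weights, clean, b) for b in context_biases]
after = [follow_probability(weights, adversarial, b) for b in context_biases]

for b, p0, p1 in zip(context_biases, before, after):
    print(f"context bias {b:+.1f}: {p0:.4f} -> {p1:.4f}")
```

Averaging the gradient over several contexts is what makes the toy perturbation context-agnostic in the loose sense used above: a single `delta` raises the probability for every bias value, not just one. A real attack would differentiate through an actual audio encoder and language model, which this sketch does not attempt.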

The practical implications are significant. Any voice AI system that processes audio from untrusted sources is potentially vulnerable, including transcription services, smart assistants, customer service bots, and video conferencing tools that upload audio to cloud models. The attack surface is broad because audio is routinely shared and processed at scale: a malicious clip embedded in a podcast, a song, or a voice message can attack every model that subsequently processes that audio. Defenses are challenging because the adversarial signal is imperceptible to human listeners, and current audio classification or filtering approaches do not reliably detect it without degrading legitimate audio quality.

Open questions remain about the full scope of the vulnerability. The researchers demonstrated real-time injection into live voice conversations, suggesting the attack can extend beyond stored media to synchronous voice interactions. It is unknown whether LALM providers can patch this vulnerability without fundamentally changing how their models process audio input. The broader question is whether any model that accepts input across multiple modalities can be protected against instruction injection attacks that exploit its audio or image processing pathways rather than its text interface.