Instruction Injection

Instruction injection is a class of attacks where an adversary embeds malicious instructions within the input data processed by a generative AI model, causing it to execute unauthorized commands as though they were legitimate user requests. The attack exploits the model's tendency to treat all input tokens—including those embedded in audio, images, or text that a user did not consciously provide—as valid instructions to follow. In the context of large audio-language models, instruction injection occurs when adversarial audio clips are crafted so that the model interprets hidden sound patterns as explicit user commands, overriding its normal behavior without the user being aware. Unlike traditional prompt injection in text models, where the injected instructions typically appear as part of a visible prompt, audio instruction injection operates below human perception and can be embedded in any audio file the model processes.

The attack is structurally similar across modalities. In text-based systems, an attacker might include hidden instructions in a document or email that a user pastes into an AI assistant. In image-based models, researchers have shown that adversarial pixels can be crafted to cause models to ignore their original visual input and follow injected commands. In the audio domain, the same principle applies: any audio waveform the model processes becomes a potential carrier for malicious instructions, regardless of whether the audio contains legitimate user speech. The AudioHijack research demonstrated that by hiding instructions in online videos, music clips, or voice notes, attackers could force models to conduct sensitive web searches, download files from attacker-controlled sources, and send emails containing user data—all without the victim being aware that any extra instructions were present in the audio they shared.

What makes audio instruction injection particularly dangerous is the combination of imperceptibility and reach. Unlike a malicious document that a careful user might inspect, or a prompt that appears in a chat interface, audio is routinely shared and processed passively: a user listening to a podcast, watching a video, or recording a voice memo has no reason to suspect the audio contains hidden commands. Audio files are also processed automatically by transcription services, voice assistants, and meeting summary tools, so an attacker can target systems the victim never directly interacts with. A single audio clip can also attack several models at once if they share a common audio processing architecture.

Defending against instruction injection requires treating all input as potentially untrusted—a principle from computer security that has been difficult to apply to generative AI systems. Input filtering and content classification can catch some adversarial inputs but are ineffective against carefully crafted signals that are imperceptible to human listeners and designed to look like legitimate audio to automated detectors. As AI systems become more integrated into daily tools and gain access to sensitive services, the consequences of successful instruction injection grow more severe. The open question is whether new model architectures can be designed that distinguish between legitimate user intent and injected instructions embedded in data, without requiring users to become security experts.
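The "treat all input as untrusted" principle can be illustrated with a minimal sketch. It does not try to detect injected instructions; it only limits what a model may do once untrusted content is in its context. The names here (Trust, Segment, SENSITIVE_TOOLS, authorize_tool_call) and the tool names are hypothetical, chosen for illustration, and are not drawn from the AudioHijack research or from any particular framework.

```python
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    USER = "user"            # typed or spoken directly by the user
    UNTRUSTED = "untrusted"  # shared audio, documents, web content, etc.


@dataclass(frozen=True)
class Segment:
    content: str
    trust: Trust


# Hypothetical tool names: actions with side effects outside the conversation.
SENSITIVE_TOOLS = {"send_email", "download_file", "web_search"}


def authorize_tool_call(tool: str, context: list[Segment],
                        user_confirmed: bool = False) -> bool:
    """Return True if the requested tool call may proceed.

    Assumed policy: when any untrusted segment is in the context, a
    sensitive tool call needs explicit user confirmation, because the
    request may have originated from injected instructions.
    """
    if tool not in SENSITIVE_TOOLS:
        return True
    tainted = any(seg.trust is Trust.UNTRUSTED for seg in context)
    return user_confirmed or not tainted


if __name__ == "__main__":
    context = [
        Segment("Summarize this voice note", Trust.USER),
        Segment("<audio shared from an unknown source>", Trust.UNTRUSTED),
    ]
    print(authorize_tool_call("send_email", context))                       # blocked
    print(authorize_tool_call("send_email", context, user_confirmed=True))  # allowed
```

A gate like this trades convenience for containment: it cannot tell whether a request was injected, so it asks the user whenever untrusted content could have influenced a sensitive action. It leaves open the question of whether a model itself can separate user intent from instructions embedded in data.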
