AI model that processes and generates audio, enabling voice-based task execution and tool use.
A large audio-language model (LALM) is a generative AI model that takes audio as input and can produce audio as output, combining the comprehension capabilities of a large language model with the ability to analyze and produce speech, music, and other sound. Unlike earlier pipelines, in which a speech-recognition system only transcribed audio to text for a separate LLM to process, an LALM processes raw audio data directly, allowing it to understand paralinguistic cues such as tone and cadence, generate audio responses, and interact with external services using audio as the primary medium. This architectural shift to native audio processing is what creates the attack surface exploited by AudioHijack: because LALMs accept instructions encoded in audio, an attacker can hide malicious audio in a clip to inject commands that the model interprets as legitimate user input.
LALMs are built on large language model architectures, typically transformer-based, with additional audio encoding and decoding components that convert between raw waveforms and token-like representations the core model can reason over. The audio encoder processes the full spectral information of an input signal, producing embeddings that are interleaved with text tokens in the model's context window. The model therefore sees audio not as an opaque transcript but as a rich representation of everything in the recording, including sounds other than human speech. When an adversarial audio signal is embedded in this stream, the encoder processes it alongside the legitimate audio, and the model's instruction-following behavior can treat it as a valid instruction to act on.
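The interleaving described above can be illustrated with a toy sketch. Everything here is a simplifying assumption made for illustration: `embed_text` and `encode_audio` are hypothetical stand-ins for a learned embedding table and a learned audio encoder, the 4-dimensional vectors are far smaller than anything used in practice, and no real model's API is being reproduced.

```python
from typing import List, Sequence, Tuple

EMBED_DIM = 4  # toy width; real models use far larger embeddings

Vector = List[float]


def embed_text(tokens: Sequence[str]) -> List[Vector]:
    """Hypothetical stand-in for a learned text-embedding table."""
    return [[float(len(t)), float(sum(map(ord, t)) % 7), 0.0, 0.0] for t in tokens]


def encode_audio(samples: Sequence[float], frame_size: int = 4) -> List[Vector]:
    """Hypothetical stand-in for an audio encoder: one vector per frame.

    Every frame is encoded, whether or not it contains speech.
    """
    vectors = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        mean = sum(frame) / frame_size
        energy = sum(x * x for x in frame) / frame_size
        vectors.append([mean, max(frame), min(frame), energy])
    return vectors


def build_context(
    prompt_tokens: Sequence[str],
    samples: Sequence[float],
    suffix_tokens: Sequence[str],
) -> List[Tuple[str, Vector]]:
    """Interleave text and audio embeddings into one sequence.

    The "text"/"audio" labels are kept only so the result can be inspected;
    the vectors themselves all have the same shape.
    """
    context: List[Tuple[str, Vector]] = []
    context += [("text", v) for v in embed_text(prompt_tokens)]
    context += [("audio", v) for v in encode_audio(samples)]
    context += [("text", v) for v in embed_text(suffix_tokens)]
    return context


context = build_context(
    ["Summarize", "this", "clip:"],
    [0.0, 0.5, -0.5, 0.25, 0.1, 0.2, 0.3, 0.4],
    ["Answer:"],
)
for kind, vector in context:
    print(kind, vector)
```

In this sketch the audio positions sit in the same sequence as the text positions and have the same shape, which is the property the paragraph above relies on: whatever the encoder emits for a frame becomes part of the context the core model reasons over.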
The key difference from text-only LLMs is that LALMs open an additional attack surface: the audio processing pathway. With text models, injecting malicious instructions requires getting text past any content filtering or input validation. With audio models, the entire waveform is fed to the model, which will process any acoustic pattern it can parse, including patterns crafted so that its audio comprehension system reads them as user commands. The model can therefore be attacked through audio files that a user shares without malicious intent, such as a podcast episode or a voice memo, which makes the attack invisible to the person most affected.
LALMs are increasingly embedded in digital assistants, smart speakers, customer service bots, and transcription services, where they may process untrusted audio from the internet. As these systems gain the ability to interact with external tools and services (conducting searches, sending messages, making purchases), the consequences of a successful hijacking grow more severe. Open questions include whether the audio processing pathway can be hardened against adversarial audio without degrading model performance on legitimate audio tasks, and whether model providers will need to treat all incoming audio as potentially untrusted, in the same way that secure software development treats user input.
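One way to picture "treating incoming audio as untrusted" is a gate on tool calls that tracks where the context came from. The sketch below is an assumption-laden illustration, not a known or recommended defense: the tool names, the `ToolRequest` type, and the `authorize` policy are all hypothetical, and the policy does nothing to settle the open question of hardening the audio pathway itself.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Hypothetical tool names; tools with side effects get the stricter check.
SIDE_EFFECT_TOOLS: FrozenSet[str] = frozenset({"send_message", "make_purchase"})


@dataclass(frozen=True)
class ToolRequest:
    tool: str
    context_has_untrusted_audio: bool  # e.g. a clip fetched from the internet
    user_confirmed: bool = False       # explicit confirmation outside the audio channel


def authorize(request: ToolRequest) -> bool:
    """Allow a side-effecting tool call only if no untrusted audio was in
    context, or the user confirmed the action separately."""
    if request.tool not in SIDE_EFFECT_TOOLS:
        return True
    if not request.context_has_untrusted_audio:
        return True
    return request.user_confirmed


print(authorize(ToolRequest("web_search", context_has_untrusted_audio=True)))
print(authorize(ToolRequest("make_purchase", context_has_untrusted_audio=True)))
print(authorize(ToolRequest("make_purchase", True, user_confirmed=True)))
```

The design choice mirrors input handling in conventional software: the check does not try to decide whether a given clip is malicious, it only limits what the model may do once untrusted audio has entered its context.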