Envisioning is an emerging technology research institute and advisory.

Large Audio-Language Model (LALM)

AI model that processes and generates audio, enabling voice-based task execution and tool use.

Year: 2024 | Generality: 750 | Added: May 26, 2026

A large audio-language model (LALM) is a generative AI model that accepts audio as input and can produce audio as output, combining the comprehension capabilities of a large language model with the ability to analyze and produce speech, music, and other sound. Unlike earlier pipelines, in which a speech-recognition system transcribed audio to text for a separate LLM to process, an LALM processes raw audio directly. This lets it pick up paralinguistic cues such as tone and cadence, generate audio responses, and interact with external services using audio as the primary medium. This architectural shift to native audio processing is also what creates the attack surface exploited by AudioHijack: because LALMs accept instructions encoded in audio, malicious instructions can be hidden in audio clips and interpreted by the model as legitimate user input.

LALMs are built on large language model architectures, typically transformer-based, with additional audio encoding and decoding components that convert raw waveforms into token-like representations the core model can reason over. The audio encoder processes the full spectral information of an input signal, producing embeddings that are interleaved with text tokens in the model's context window. The model therefore sees audio not as an opaque transcript but as a rich representation of everything in the recording, including sounds other than human speech. When an adversarial signal is embedded in this stream, the encoder processes it alongside the legitimate audio, and the model's instruction-following behavior can treat it as a valid instruction to act on.
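The interleaving described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the names `Slot`, `embed_text`, `encode_audio`, and `build_context` are hypothetical, the embedding width and frame size are arbitrary, and the "encoder" is a simple per-frame energy summary rather than a learned model. The point it tries to show is structural: every audio frame becomes an embedding in the same context as the text, with no step that separates speech from other sound.

```python
import math
import random
from dataclasses import dataclass

EMBED_DIM = 8      # hypothetical embedding width
FRAME_SIZE = 160   # hypothetical frame length (10 ms at 16 kHz)


@dataclass
class Slot:
    modality: str        # "text" or "audio"
    vector: list[float]  # embedding the core model would attend over


def embed_text(token: str) -> list[float]:
    """Toy stand-in for a learned text embedding table (deterministic per token)."""
    rng = random.Random(token)
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def encode_audio(waveform: list[float]) -> list[list[float]]:
    """Toy stand-in for an audio encoder: one embedding per full frame.

    Each embedding holds the RMS energy of EMBED_DIM sub-chunks of the frame.
    No frame is filtered out, whatever it contains.
    """
    chunk = FRAME_SIZE // EMBED_DIM
    embeddings = []
    for start in range(0, len(waveform) - FRAME_SIZE + 1, FRAME_SIZE):
        frame = waveform[start:start + FRAME_SIZE]
        embeddings.append([
            math.sqrt(sum(s * s for s in frame[j * chunk:(j + 1) * chunk]) / chunk)
            for j in range(EMBED_DIM)
        ])
    return embeddings


def build_context(prefix: list[str], waveform: list[float], suffix: list[str]) -> list[Slot]:
    """Place audio embeddings between text embeddings in a single sequence."""
    context = [Slot("text", embed_text(t)) for t in prefix]
    context += [Slot("audio", v) for v in encode_audio(waveform)]
    context += [Slot("text", embed_text(t)) for t in suffix]
    return context


if __name__ == "__main__":
    # 480 samples of a 440 Hz tone at 16 kHz: three full frames.
    tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(480)]
    ctx = build_context(["Summarize", "this"], tone, ["please"])
    print([slot.modality for slot in ctx])
```

In this sketch the core model would receive one flat sequence of slots, so an adversarial pattern inside the waveform ends up in the same context as the user's request.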

The key difference from text-only LLMs is that LALMs open an additional attack surface: the audio processing pathway. With text models, injecting malicious instructions requires getting text past any content filtering or input validation. With audio models, the entire waveform is fed to the model, which will process any acoustic pattern it can parse, including patterns crafted so that its own audio comprehension system reads them as user commands. As a result, the model can be attacked through audio files that a user shares without malicious intent, such as a podcast episode or a voice memo, making the attack invisible to the person most affected.

LALMs are increasingly embedded in digital assistants, smart speakers, customer service bots, and transcription services, where they may process untrusted audio from the internet. As these systems gain the ability to interact with external tools and services (conducting searches, sending messages, making purchases), the consequences of a successful hijacking grow more severe. Open questions include whether the audio processing pathway can be hardened against adversarial audio without degrading model performance on legitimate audio tasks, and whether model providers will need to treat all incoming audio as potentially untrusted, the same way secure software development treats user input.
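One way to picture the "treat audio like untrusted user input" posture is a gate in front of tool calls. The sketch below is an assumption, not a mitigation described in the source: the names `Trust`, `AudioInput`, `ToolCall`, and `authorize` are hypothetical, and so are the labeling of microphone input as trusted and the `side_effect` flag on each tool. It blocks side-effecting actions whenever any untrusted audio contributed to the request, unless the user explicitly confirms.

```python
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    TRUSTED = "trusted"      # assumption: e.g. live input from the authenticated user
    UNTRUSTED = "untrusted"  # assumption: e.g. a podcast or voice memo from the internet


@dataclass(frozen=True)
class AudioInput:
    source: str
    trust: Trust


@dataclass(frozen=True)
class ToolCall:
    name: str
    side_effect: bool  # True for actions such as sending messages or making purchases


def authorize(call: ToolCall, inputs: list[AudioInput], user_confirmed: bool = False) -> bool:
    """Allow a tool call unless it has side effects and untrusted audio was in context."""
    tainted = any(audio.trust is Trust.UNTRUSTED for audio in inputs)
    if not call.side_effect or not tainted:
        return True
    # Side-effecting call influenced by untrusted audio: require explicit confirmation.
    return user_confirmed


if __name__ == "__main__":
    mic = AudioInput("microphone", Trust.TRUSTED)
    podcast = AudioInput("podcast.mp3", Trust.UNTRUSTED)
    send = ToolCall("send_message", side_effect=True)
    print(authorize(send, [mic, podcast]))                       # False
    print(authorize(send, [mic, podcast], user_confirmed=True))  # True
```

A gate like this does not harden the audio pathway itself. It only limits what a hijacked model can do, and it leaves open the question raised above of how to do so without degrading legitimate audio tasks.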