Envisioning is an emerging technology research institute and advisory.

ASR (Automatic Speech Recognition)

Technology that converts spoken human language into written text automatically.

Year: 1988
Generality: 771

Automatic Speech Recognition (ASR) is the technology that enables computers to identify and transcribe spoken language into text. Conventional ASR systems are built on two core components: acoustic models, which map features extracted from the audio signal to phonetic units, and language models, which use statistical or neural methods to predict likely word sequences given context. Together, these components allow a system to move from waveform to words, handling the enormous variability in human speech caused by accents, speaking rates, background noise, and individual vocal characteristics.
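As a rough illustration of how an acoustic score and a language model score can be combined, the sketch below rescores candidate transcripts by adding the two log scores. Everything in it is hypothetical: the bigram table, the acoustic log-likelihoods, and the names `lm_score` and `rescore` are toy assumptions for illustration, not taken from any real system.

```python
import math

# Hypothetical bigram log-probabilities (toy numbers, not from a trained model).
BIGRAM_LOGP = {
    ("<s>", "recognize"): math.log(0.4),
    ("recognize", "speech"): math.log(0.5),
    ("<s>", "wreck"): math.log(0.1),
    ("wreck", "a"): math.log(0.3),
    ("a", "nice"): math.log(0.2),
    ("nice", "beach"): math.log(0.1),
}
# Assumed fallback score for word pairs missing from the toy table.
UNSEEN_LOGP = math.log(1e-4)


def lm_score(words):
    """Sum bigram log-probabilities over a word sequence, starting from <s>."""
    total = 0.0
    prev = "<s>"
    for word in words:
        total += BIGRAM_LOGP.get((prev, word), UNSEEN_LOGP)
        prev = word
    return total


def rescore(hypotheses, lm_weight=1.0):
    """Pick the hypothesis with the best combined score.

    hypotheses: list of (words, acoustic_logp) pairs, where acoustic_logp
    is an assumed log-likelihood produced by some acoustic model.
    """
    return max(hypotheses, key=lambda h: h[1] + lm_weight * lm_score(h[0]))


# Two acoustically similar candidates; the acoustic scores are made up.
candidates = [
    (["wreck", "a", "nice", "beach"], -10.0),
    (["recognize", "speech"], -10.5),
]
best_words, _ = rescore(candidates)
print(" ".join(best_words))
```

With the language model switched off (`lm_weight=0.0`), the acoustically better but less plausible candidate wins in this toy setup, which is the kind of ambiguity the language model is there to resolve.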

Historically, ASR relied on Hidden Markov Models (HMMs) to represent the sequential, probabilistic nature of speech, often combined with Gaussian Mixture Models (GMMs) for acoustic scoring. The deep learning era transformed the field: deep neural networks first replaced GMMs for acoustic scoring and reduced the reliance on hand-crafted acoustic features, and end-to-end approaches such as Connectionist Temporal Classification (CTC) and sequence-to-sequence models with attention mechanisms allowed systems to learn directly from audio-transcript pairs without explicit phoneme alignment. Transformer-based models like OpenAI's Whisper have further pushed accuracy across diverse languages and acoustic conditions.
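To make the alignment-free idea behind CTC more concrete: a CTC-trained model emits one label distribution per audio frame, including a special blank label, and the transcript is recovered by collapsing repeated labels and then dropping blanks. The sketch below shows the simplest (greedy) version of that decoding step. The alphabet, the per-frame probabilities, and the function names are illustrative assumptions; real systems often use beam search rather than a per-frame argmax.

```python
BLANK = "_"  # assumed symbol for the CTC blank label


def ctc_collapse(labels):
    """Merge consecutive repeated labels, then drop blanks."""
    out = []
    prev = None
    for sym in labels:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)


def ctc_greedy_decode(frame_probs, alphabet):
    """Take the most likely label in each frame, then collapse the path."""
    best_path = []
    for probs in frame_probs:
        best_index = max(range(len(probs)), key=probs.__getitem__)
        best_path.append(alphabet[best_index])
    return ctc_collapse(best_path)


# Toy per-frame distributions over a hypothetical alphabet (blank, c, a, t).
alphabet = [BLANK, "c", "a", "t"]
frames = [
    [0.10, 0.70, 0.10, 0.10],  # c
    [0.10, 0.60, 0.20, 0.10],  # c (repeat, merged)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.60, 0.10, 0.20, 0.10],  # blank
    [0.10, 0.10, 0.10, 0.70],  # t
]
print(ctc_greedy_decode(frames, alphabet))
```

The blank label is what lets the output contain genuine double letters: the path `a a _ a` collapses to `aa`, whereas `a a a` collapses to a single `a`.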

ASR underpins a vast range of real-world applications, including voice assistants, real-time captioning, medical dictation, call center automation, and accessibility tools for people with disabilities. Its integration with natural language processing pipelines enables downstream tasks such as intent detection, sentiment analysis, and machine translation, making ASR a foundational layer in conversational AI systems.

Significant practical challenges remain: overlapping speakers, domain-specific vocabulary, low-resource languages, and noisy environments continue to drive active research. Performance is typically evaluated with Word Error Rate (WER): the word-level edit distance between a system's output and a reference transcript (counting substitutions, deletions, and insertions), divided by the number of words in the reference. As models grow larger and training datasets more diverse, ASR performance has approached or matched human-level accuracy on some benchmark tasks, though robustness in real-world, unconstrained conditions remains an open problem.
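As a worked example of the metric, WER is commonly computed as (substitutions + deletions + insertions) divided by the number of words in the reference. The minimal sketch below uses a word-level Levenshtein distance. The function name and the whitespace tokenization are simplifying assumptions; evaluation toolkits usually also normalize casing and punctuation before scoring.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted against the reference length, WER can exceed 1.0 when the hypothesis contains many extra words.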

Related

Speech-to-Text Model

An AI model that converts spoken audio into written text automatically.

Generality: 550
Speech Processing

AI techniques enabling computers to recognize, interpret, and synthesize human speech.

Generality: 720
S2R (Speech-to-Retrieval)

Maps spoken audio directly to retrieval-ready representations, bypassing error-prone transcription pipelines.

Generality: 174
Speech-to-Speech Model

AI systems that directly translate spoken language into another spoken language.

Generality: 520
TTS (Text-to-Speech)

AI system that converts written text into natural-sounding spoken audio.

Generality: 550
AI Assistant

An AI system that understands natural language and autonomously completes tasks for users.

Generality: 792