Envisioning is an emerging technology research institute and advisory.

2011 — 2026

Speech Processing

AI techniques enabling computers to recognize, interpret, and synthesize human speech.

Year: 1990
Generality: 720

Speech processing is a broad discipline within artificial intelligence concerned with enabling machines to handle human speech in all its forms: recognizing spoken words, understanding their meaning, verifying speaker identity, and generating natural-sounding audio from text. It sits at the intersection of linguistics, signal processing, and machine learning, drawing on each to convert the continuous, noisy, highly variable acoustic signal of human speech into structured representations that computers can work with. Core tasks include automatic speech recognition (ASR), text-to-speech synthesis (TTS), speaker verification, speaker diarization (determining who spoke when), and spoken language understanding.

Modern speech processing systems are built almost entirely on deep learning. ASR pipelines typically use neural architectures (convolutional networks for feature extraction from raw audio or spectrograms, recurrent networks or transformers for sequence modeling) trained end-to-end on thousands of hours of labeled speech. The introduction of the connectionist temporal classification (CTC) loss function and, later, attention-based encoder-decoder models dramatically simplified training by removing the need for an explicit frame-level alignment between audio and transcript: the model is trained on audio paired only with its text. Large pretrained models have pushed accuracy further. wav2vec 2.0 learns speech representations from unlabeled audio through self-supervised pretraining and is then fine-tuned on labeled data, while Whisper is trained directly on a very large, weakly labeled corpus of transcribed audio and is often used without task-specific fine-tuning.
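To make the alignment-free idea behind CTC concrete, here is a minimal pure-Python sketch. It assumes a toy three-symbol alphabet and hand-written per-frame probabilities (both hypothetical, standing in for a network's softmax output). It shows the CTC collapse rule, greedy decoding, and the forward recursion that sums the probability of every frame-level alignment of a transcript, checked against brute-force enumeration.

```python
from itertools import product

BLANK = "_"  # hypothetical blank symbol chosen for this sketch


def ctc_collapse(path):
    """Apply the CTC rule: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)


def greedy_decode(frame_probs, alphabet):
    """Pick the most likely symbol in each frame, then collapse."""
    best = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    return ctc_collapse(best)


def ctc_probability(frame_probs, alphabet, target):
    """Sum the probability of all alignments of `target` (forward recursion)."""
    idx = {s: i for i, s in enumerate(alphabet)}
    ext = [BLANK]                      # target interleaved with blanks
    for ch in target:
        ext += [ch, BLANK]
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = frame_probs[0][idx[BLANK]]
    if size > 1:
        alpha[1] = frame_probs[0][idx[ext[1]]]
    for p in frame_probs[1:]:
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]                       # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]              # advance by one position
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]              # skip the blank in between
            new[s] = total * p[idx[ext[s]]]
        alpha = new
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def brute_force_probability(frame_probs, alphabet, target):
    """Reference check: enumerate every path and keep those that collapse to target."""
    total = 0.0
    for path in product(range(len(alphabet)), repeat=len(frame_probs)):
        if ctc_collapse(alphabet[i] for i in path) == target:
            prob = 1.0
            for t, i in enumerate(path):
                prob *= frame_probs[t][i]
            total += prob
    return total


# Toy example: 4 frames over the alphabet (blank, "a", "b").
alphabet = [BLANK, "a", "b"]
frame_probs = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.2, 0.2],
    [0.1, 0.1, 0.8],
]
decoded = greedy_decode(frame_probs, alphabet)
p_ab = ctc_probability(frame_probs, alphabet, decoded)
```

The CTC training loss for one utterance is the negative logarithm of this summed probability, so the transcript alone supervises the model and no frame-by-frame labels are required. Production systems compute the same recursion in log space for numerical stability.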

Speech synthesis has undergone a parallel revolution. Early concatenative and parametric systems produced recognizably robotic output. Neural systems changed that: sequence-to-sequence acoustic models such as Tacotron predict spectrograms from text, neural vocoders such as WaveNet turn those acoustic features into waveforms, and end-to-end models such as VITS map text directly to audio. Together they produce speech that listeners often find hard to distinguish from a human recording, capturing prosody, rhythm, and speaker timbre with high fidelity. Controllable synthesis, meaning the ability to adjust emotion, accent, or speaking style, has become an active research frontier.
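One small, self-contained piece of the waveform-generation step can be sketched in code. WaveNet, as originally described, predicts each audio sample as one of 256 classes obtained by mu-law companding rather than as a raw amplitude. The sketch below shows only that encode and decode step, under the assumption that samples are floats in [-1, 1]; the function names are illustrative and not taken from any library.

```python
import math

MU = 255  # 8-bit companding: 256 quantization classes


def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range samples
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((compressed + 1) / 2 * mu))


def mu_law_decode(q, mu=MU):
    """Map an integer class in [0, mu] back to a sample in [-1, 1]."""
    compressed = 2 * q / mu - 1
    magnitude = math.expm1(abs(compressed) * math.log1p(mu)) / mu
    return math.copysign(magnitude, compressed)


# Round-trip a short 440 Hz sine wave sampled at 16 kHz.
samples = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
classes = [mu_law_encode(s) for s in samples]
restored = [mu_law_decode(q) for q in classes]
max_error = max(abs(a - b) for a, b in zip(samples, restored))
```

The logarithmic curve spends more of the 256 classes on quiet samples, where small errors are most audible, which is why a coarse 8-bit target can still sound acceptable.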

The practical impact of speech processing is enormous. Voice assistants, real-time transcription services, call-center automation, accessibility tools for the deaf and hard of hearing, and multilingual translation all depend on it. At the same time, the technology raises significant concerns around privacy, consent, and the potential for voice deepfakes, making robust speaker verification and audio forensics increasingly important counterparts to generative advances.
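As a rough illustration of the speaker verification step mentioned above, a common pattern is to compare fixed-length speaker embeddings with cosine similarity and accept the claim when the score clears a threshold. The sketch below assumes the embeddings already exist. The three-dimensional vectors and the 0.7 threshold are hypothetical placeholders; real systems use much higher-dimensional embeddings from a trained model and tune the threshold on held-out data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(enrolled, probe, threshold=0.7):
    """Accept the identity claim if the similarity reaches the threshold."""
    return cosine_similarity(enrolled, probe) >= threshold


# Hypothetical embeddings, made up for this example.
enrolled = [0.9, 0.1, 0.3]
probe_same = [0.8, 0.2, 0.35]
probe_other = [-0.2, 0.9, 0.1]

accepted = same_speaker(enrolled, probe_same)
rejected = not same_speaker(enrolled, probe_other)
```

Raising the threshold lowers the rate of false acceptances (for example, a cloned voice passing as the enrolled speaker) at the cost of more false rejections.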

Related

Speech-to-Text Model
An AI model that converts spoken audio into written text automatically.
Generality: 550

TTS (Text-to-Speech)
AI system that converts written text into natural-sounding spoken audio.
Generality: 550

Speech-to-Speech Model
AI systems that directly translate spoken language into another spoken language.
Generality: 520

ASR (Automatic Speech Recognition)
Technology that converts spoken human language into written text automatically.
Generality: 771

NLP (Natural Language Processing)
The subfield of AI enabling computers to understand, process, and generate human language.
Generality: 928

Speech-to-Image Model
An AI system that generates visual images directly from spoken language input.
Generality: 420