Speech recognition, also called automatic speech recognition (ASR) or speech-to-text, converts spoken audio into written text using acoustic models, language models, and deep learning. The technology is widely commercialized in voice assistants (Siri, Alexa, Google Assistant), transcription services, call center automation, accessibility tools, and real-time captioning. State-of-the-art systems use end-to-end neural networks trained on large datasets and can approach human transcription accuracy for many languages, although accuracy still varies with accent and recording conditions. Research continues into multilingual and code-switching support, robustness to noise and accents, and coverage of low-resource languages. Speech recognition is increasingly coupled with natural language understanding to power voice interfaces.
Demand for hands-free and accessible interfaces drives the adoption of speech recognition, and commercial deployment spans consumer, enterprise, and medical applications. Open challenges include reduced accuracy on accented speech and in noisy environments, privacy concerns around always-on devices, and latency in real-time applications. The field continues to advance through larger models, self-supervised pretraining, and efficient on-device inference. Speech recognition is now foundational to voice-first computing.
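Accuracy in speech recognition is commonly reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The text above does not name a metric, so the following is only a minimal sketch of how such a figure could be computed. The function name `word_error_rate`, the sample sentences, and the simple lowercase-and-split normalization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: text is normalized only by lowercasing and whitespace
    splitting; real evaluations usually also normalize punctuation,
    numerals, and so on.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words (starting with none).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one dropped word and one substituted word
# against a five-word reference.
print(word_error_rate("turn on the kitchen lights",
                      "turn on kitchen light"))  # 0.4
```

Because insertions count against a fixed reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference. Comparing WER across accents or noise conditions is one way to quantify the robustness gaps described above.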