An AI process that segments audio recordings by speaker identity, answering 'who spoke when.'
Speaker diarization is the task of partitioning an audio stream into homogeneous temporal segments according to speaker identity — essentially answering the question "who spoke when?" without necessarily knowing who the speakers are in advance. The process typically unfolds in several stages: voice activity detection filters out non-speech regions, speaker change detection identifies boundaries between turns, and clustering algorithms group acoustically similar segments together under a common speaker label. Modern systems replace many of these hand-engineered stages with end-to-end neural architectures, using speaker embeddings such as x-vectors or d-vectors to represent short audio segments in a compact latent space where distance correlates with speaker dissimilarity.
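The clustering stage described above can be sketched in a few lines. The following is a simplified illustration, not a production implementation: the embedding vectors, the similarity threshold, and the greedy centroid-update scheme are all assumptions chosen for clarity, whereas real systems typically use x-vector or d-vector extractors and agglomerative or spectral clustering.

```python
import numpy as np

def cluster_segments(embeddings, threshold=0.75):
    """Greedy clustering of segment embeddings by cosine similarity.

    Each segment joins the most similar existing cluster if the
    similarity to its centroid exceeds `threshold`; otherwise it
    starts a new cluster (a new speaker label).
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    centroids, labels = [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, c in enumerate(centroids):
            sim = cosine(emb, c)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(np.array(emb, dtype=float))
            labels.append(len(centroids) - 1)
        else:
            # Simplified running-mean centroid update.
            centroids[best] = (centroids[best] + emb) / 2.0
            labels.append(best)
    return labels

# Toy 2-D "embeddings" standing in for x-vectors of five segments.
segs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.95, 0.05]])
print(cluster_segments(segs))  # → [0, 0, 1, 1, 0]
```

Segments 1, 2, and 5 land in one cluster and segments 3 and 4 in another, reflecting the idea that distance in the embedding space correlates with speaker dissimilarity.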
The core challenge in diarization is that it must operate without prior enrollment of the speakers being identified, distinguishing it from speaker verification or identification tasks. Overlapping speech — where two or more people talk simultaneously — is a particularly difficult problem, since most classical clustering-based pipelines assume a single active speaker at any moment. Recent end-to-end neural diarization (EEND) approaches model the problem as a multi-label classification task over time, allowing the system to assign multiple speaker labels to overlapping frames and significantly improving performance in naturalistic conversational settings.
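The multi-label formulation can be made concrete with a small sketch. The posterior matrix below is invented for illustration; an actual EEND model would produce per-frame, per-speaker sigmoid activations from the audio. Because the outputs are independent sigmoids rather than a softmax, several speakers can exceed the decision threshold in the same frame, which is exactly how overlap is represented.

```python
import numpy as np

# Hypothetical frame-wise speaker posteriors from an EEND-style model:
# rows = time frames, columns = speakers. Each value is an independent
# sigmoid output, so multiple speakers can be active per frame.
posteriors = np.array([
    [0.9, 0.1],   # speaker 0 alone
    [0.8, 0.7],   # overlap: both speakers active
    [0.2, 0.95],  # speaker 1 alone
    [0.1, 0.05],  # silence: no speaker active
])

active = posteriors > 0.5                 # multi-label decision per frame
overlap = active.sum(axis=1) >= 2         # frames with 2+ active speakers
print(active.astype(int))
print(int(overlap.sum()))  # → 1 overlapping frame
```

A clustering-based pipeline would be forced to pick a single label for the second frame; the multi-label decision keeps both speakers, which is why EEND improves accuracy on conversational audio with frequent overlap.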
Diarization is foundational to a wide range of downstream applications. In automated meeting transcription, it enables attribution of speech turns to individual participants, making transcripts far more readable and actionable. In broadcast media, legal proceedings, and clinical documentation, accurate speaker segmentation is a prerequisite for meaningful search and analysis. As conversational AI systems grow more sophisticated, diarization increasingly serves as a front-end component that feeds structured, speaker-attributed input into downstream speech recognition and natural language understanding pipelines, making it a critical building block for human-computer interaction at scale.