A statistical measure of consistency among multiple human annotators labeling the same data.
Inter-rater agreement quantifies how consistently multiple human evaluators assign the same labels or judgments to the same data points. In machine learning, this metric is foundational to supervised learning pipelines, where the quality of training data depends heavily on the reliability of human annotations. When annotators disagree frequently, the resulting labels introduce noise that can degrade model performance, inflate error rates, and obscure the true signal in the data. Measuring agreement before finalizing a dataset helps teams identify ambiguous labeling guidelines, poorly defined categories, or tasks that require domain expertise.
The most widely used tools for quantifying inter-rater agreement include Cohen's kappa for two annotators and Fleiss' kappa for three or more. Unlike raw percent agreement, these statistics correct for the level of agreement that would be expected by chance alone, providing a more honest picture of annotator consistency. Intraclass correlation coefficients (ICC) are used when ratings are continuous rather than categorical. Scores are typically interpreted on a scale where values above 0.8 indicate strong (near-perfect) agreement, values between 0.6 and 0.8 indicate substantial agreement, and lower values signal that the labeling task or guidelines need revision.
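To make the chance correction concrete, here is a minimal sketch of Cohen's kappa for two annotators: observed agreement p_o minus expected chance agreement p_e, scaled by 1 − p_e. The labels and annotator data below are illustrative, not drawn from any real dataset.

```python
import numpy as np

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    categories = np.union1d(labels_a, labels_b)

    # Observed agreement: fraction of items where the two annotators match.
    p_o = np.mean(labels_a == labels_b)

    # Chance agreement: product of each annotator's marginal label frequencies,
    # summed over categories.
    p_e = sum(np.mean(labels_a == c) * np.mean(labels_b == c) for c in categories)

    return (p_o - p_e) / (1 - p_e)

# Toy example: two annotators labeling the same ten items.
ann_a = ["pos", "pos", "neg", "neg", "pos", "neu", "neg", "pos", "neu", "neg"]
ann_b = ["pos", "neg", "neg", "neg", "pos", "neu", "pos", "pos", "neu", "neg"]
print(f"Cohen's kappa: {cohen_kappa(ann_a, ann_b):.2f}")
```

scikit-learn's `cohen_kappa_score` computes the same statistic if a library implementation is preferred.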
In practice, inter-rater agreement is measured during the annotation design phase, often on a pilot batch of examples, before full-scale labeling begins. Low agreement scores prompt teams to refine label definitions, provide annotator training, or restructure ambiguous categories. Many benchmark datasets in NLP, computer vision, and medical imaging report inter-rater agreement scores as a transparency measure, signaling to researchers how much inherent subjectivity exists in the task itself. Tasks like sentiment analysis, medical image diagnosis, and content moderation are known to produce lower agreement scores due to genuine interpretive ambiguity.
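A pilot-batch check with more than two annotators might look like the following Fleiss' kappa sketch. The count matrix and the 0.6 revision threshold are illustrative assumptions, not values prescribed here.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (items x categories) matrix of rating counts.

    counts[i, j] = number of annotators who placed item i in category j;
    every item is assumed to be rated by the same number of annotators.
    """
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()

    # Per-item agreement: fraction of annotator pairs that agree on the item.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))

    # Chance agreement from the overall category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.square(p_j).sum()

    return (p_i.mean() - p_e) / (1 - p_e)

# Hypothetical pilot batch: 6 items, 3 categories, 4 annotators per item.
pilot_counts = [
    [4, 0, 0],
    [0, 4, 0],
    [3, 1, 0],
    [1, 3, 0],
    [2, 1, 1],
    [0, 0, 4],
]
kappa = fleiss_kappa(pilot_counts)
print(f"pilot Fleiss' kappa: {kappa:.2f}")
if kappa < 0.6:  # illustrative cutoff, per the interpretation scale above
    print("agreement is low: revise guidelines or retrain annotators before scaling up")
```

If hand-rolling the statistic is undesirable, statsmodels ships an equivalent `fleiss_kappa` in `statsmodels.stats.inter_rater`.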
Beyond dataset construction, inter-rater agreement has become increasingly relevant to AI evaluation, where human judges assess the quality of model outputs — such as generated text or image captions. As large language models are evaluated through human preference studies and red-teaming exercises, ensuring that evaluators themselves are consistent is essential for drawing valid conclusions about model behavior. High inter-rater agreement in evaluation protocols strengthens the credibility of benchmarks and safety assessments alike.
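One simple way to check evaluator consistency in such protocols is to compute pairwise agreement among judges and summarize it, for example as a mean pairwise Cohen's kappa. The judge names and ratings below are hypothetical.

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings: three human judges grade the same eight model outputs.
judge_ratings = {
    "judge_1": ["good", "ok", "bad", "good", "ok", "good", "bad", "ok"],
    "judge_2": ["good", "ok", "bad", "ok", "ok", "good", "bad", "bad"],
    "judge_3": ["good", "bad", "bad", "good", "ok", "ok", "bad", "ok"],
}

# Mean pairwise Cohen's kappa: a quick summary of judge consistency to report
# alongside the evaluation results themselves.
pairwise = [
    cohen_kappa_score(judge_ratings[a], judge_ratings[b])
    for a, b in combinations(judge_ratings, 2)
]
print(f"mean pairwise kappa: {np.mean(pairwise):.2f}")
```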