When the evaluation metrics for an ML model shift over time, degrading its measured performance even though the model itself is unchanged.
Criteria drift refers to the phenomenon in machine learning where the metrics, standards, or benchmarks used to evaluate a model's performance change over time, causing a model that once appeared effective to seem degraded — or vice versa — without any change to the model itself. This is distinct from concept drift, which involves changes in the underlying data distribution, or model decay, which reflects genuine performance degradation. Criteria drift is specifically about the goalposts moving: what counts as "good" performance is redefined, often in response to evolving business requirements, regulatory changes, or a deeper understanding of what the model should actually optimize for.
In practice, criteria drift can manifest in several ways. A fraud detection model might initially be evaluated on raw accuracy, but as stakeholders grow more sophisticated, they shift to precision-recall tradeoffs or F1 scores — suddenly the model looks worse on paper despite unchanged behavior. Similarly, a recommendation system once judged by click-through rate might later be assessed on user retention or satisfaction scores, reflecting a broader organizational shift in values. These changes in evaluation criteria can create confusion about whether a model needs retraining, replacement, or simply re-contextualization.
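To make the accuracy-to-F1 example concrete, here is a minimal sketch using scikit-learn. The labels and predictions are made-up, illustrative values for an imbalanced fraud dataset; the point is that the same frozen predictions score well under the original criterion and poorly under the new one.

```python
# Minimal sketch: the same frozen predictions scored under two successive
# evaluation criteria. All numbers are illustrative, not from a real system.
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced fraud labels: 1 = fraud (rare), 0 = legitimate.
y_true = [0] * 95 + [1] * 5
# A model that flags only one of the five fraud cases and nothing else.
y_pred = [0] * 95 + [1, 0, 0, 0, 0]

print(f"accuracy: {accuracy_score(y_true, y_pred):.2f}")  # 0.96 -- looks strong
print(f"F1:       {f1_score(y_true, y_pred):.2f}")        # 0.33 -- looks weak
```

Nothing about the model changed between the two lines; only the criterion did.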
Criteria drift matters because it complicates model governance and MLOps pipelines. Automated monitoring systems that trigger retraining alerts based on metric thresholds can fire spuriously when the metrics themselves are redefined. It also creates challenges in longitudinal model comparison: benchmarking a new model against a predecessor becomes unreliable if the evaluation framework has shifted between deployments. Organizations practicing responsible AI development must maintain versioned records of evaluation criteria alongside model versions to ensure apples-to-apples comparisons.
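One way to keep comparisons apples-to-apples is to pin a versioned criteria record to each model record and refuse naive score comparisons when the criteria differ. The sketch below is hypothetical: the EvalCriteria and ModelRecord names and fields are assumptions for illustration, not part of any particular MLOps framework.

```python
# Hypothetical sketch: version evaluation criteria alongside model versions,
# so longitudinal benchmarks can be flagged when the criteria have drifted.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCriteria:
    version: str
    metric: str        # e.g. "accuracy", "f1"
    threshold: float   # alert if the metric falls below this value

@dataclass
class ModelRecord:
    model_version: str
    criteria: EvalCriteria  # pinned at deployment time
    score: float

def comparable(a: ModelRecord, b: ModelRecord) -> bool:
    """Scores are apples-to-apples only if both models were judged
    under the same criteria version."""
    return a.criteria == b.criteria

old = ModelRecord("m-1", EvalCriteria("c-1", "accuracy", 0.90), score=0.96)
new = ModelRecord("m-2", EvalCriteria("c-2", "f1", 0.50), score=0.33)

if not comparable(old, new):
    print("criteria drift: benchmark comparison is not apples-to-apples")
```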
Addressing criteria drift requires deliberate documentation practices, stakeholder alignment on evaluation frameworks before deployment, and periodic audits that distinguish between genuine model degradation and shifting standards. As machine learning systems are increasingly embedded in high-stakes domains like healthcare, finance, and criminal justice, the ability to detect and account for criteria drift becomes an important component of model accountability and long-term reliability.
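As one illustration of such an audit, a frozen model's archived predictions can be re-scored under the original criterion: if that score is stable while the currently adopted criterion reports a drop, the change is criteria drift rather than genuine decay. The diagnose helper and its tolerance below are illustrative assumptions, not an established procedure.

```python
# Hypothetical audit sketch: distinguish criteria drift from genuine decay by
# re-scoring under the ORIGINAL criterion. The diagnose() helper and the
# tolerance value are illustrative assumptions.
def diagnose(score_at_deploy: float, score_now: float, tol: float = 0.02) -> str:
    """Both scores are computed with the original metric, on comparable data."""
    if abs(score_now - score_at_deploy) <= tol:
        return "stable under original criterion: likely criteria drift"
    return "changed under original criterion: possible genuine model decay"

# Example: accuracy (the original criterion) is unchanged at 0.96, even though
# the newly adopted F1 criterion reports 0.33, so the model itself is fine.
print(diagnose(score_at_deploy=0.96, score_now=0.96))
```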