A probabilistic ranking metric that accounts for varying document relevance levels across positions.
Expected Reciprocal Rank (ERR) is an evaluation metric used in information retrieval and search system assessment. It measures the quality of a ranked list of documents as the expected reciprocal of the rank at which a user finds a satisfying result and stops. Unlike metrics that discount each position independently, ERR models user behavior probabilistically: it assumes a user scans results from top to bottom and may stop at any point upon finding a sufficiently relevant document. The probability of stopping at a given rank depends on the relevance grade of the document at that rank and on the grades of all documents ranked above it, making ERR sensitive to both the position and the degree of relevance of each result.
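The user model described above is usually written as a single formula. The symbols below are notation introduced here for illustration: g_i is the relevance grade of the document at rank i, g_max is the highest grade on the scale, R_i is the probability that the document at rank i satisfies the user, and n is the length of the ranked list. The exponential mapping from grades to probabilities is the commonly used choice; other mappings are possible.

```latex
R_i = \frac{2^{g_i} - 1}{2^{g_{\max}}},
\qquad
\mathrm{ERR} = \sum_{r=1}^{n} \frac{1}{r} \, R_r \prod_{i=1}^{r-1} \left(1 - R_i\right)
```

Each term is the reciprocal rank 1/r weighted by the probability that the user reaches rank r without being satisfied earlier and is then satisfied at rank r.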
The metric was introduced by Chapelle, Metzler, Zhang, and Grinspan in 2009 and quickly became influential in the information retrieval community. Its key innovation over predecessors like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) is its explicit cascade model of user behavior. In this cascade model, the probability that a user examines the document at rank r is the product, over every document ranked above it, of the probability that the user was not satisfied by that document. A highly relevant result near the top therefore reduces the contribution of everything below it. This makes ERR particularly well-suited for graded relevance judgments, where documents are not simply relevant or irrelevant but exist on a spectrum of usefulness.
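The cascade computation can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `stop_probability` and `expected_reciprocal_rank` are hypothetical, and the mapping from a grade g to a satisfaction probability, (2^g - 1) / 2^g_max, is assumed here as the commonly used choice.

```python
from typing import Sequence


def stop_probability(grade: int, max_grade: int) -> float:
    """Map a relevance grade to the probability that the user is satisfied.

    Assumes the exponential mapping (2^g - 1) / 2^g_max.
    """
    return (2 ** grade - 1) / (2 ** max_grade)


def expected_reciprocal_rank(grades: Sequence[int], max_grade: int) -> float:
    """ERR of a ranked list, given relevance grades in rank order (top first)."""
    err = 0.0
    p_reach = 1.0  # probability that the user gets as far as the current rank
    for rank, grade in enumerate(grades, start=1):
        p_stop = stop_probability(grade, max_grade)
        # The user reaches this rank, is satisfied here, and stops.
        err += p_reach * p_stop / rank
        # Otherwise the user continues to the next rank.
        p_reach *= 1.0 - p_stop
    return err


# The same highly relevant document scores very differently by position.
good_first = expected_reciprocal_rank([4, 0, 0], max_grade=4)
good_last = expected_reciprocal_rank([0, 0, 4], max_grade=4)

# The same grades scored against two different scales.
narrow_scale = expected_reciprocal_rank([2, 1], max_grade=2)
wide_scale = expected_reciprocal_rank([2, 1], max_grade=4)
```

Under these assumptions, placing the grade-4 document first gives a higher score than placing it third, and the grades [2, 1] score higher on a 0 to 2 scale than on a 0 to 4 scale, because the maximum grade enters the probability mapping directly.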
ERR matters in machine learning contexts primarily because modern search engines and recommendation systems are trained and evaluated using offline metrics before deployment. Choosing the right metric directly shapes what a learned ranking model optimizes for. ERR's cascade assumption aligns more closely with observed user behavior in web search than metrics whose discount depends only on position, such as NDCG, making it a more faithful proxy for real-world user satisfaction. It is commonly used in learning-to-rank research and in competitions such as the Yahoo! Learning to Rank Challenge.
Despite its strengths, ERR has limitations: it is less interpretable than simpler metrics, and it can be sensitive to the specific relevance scale used, because the mapping from grades to stopping probabilities typically depends on the maximum grade. Nonetheless, it remains a standard tool in the evaluation toolkit for ranking systems, particularly when fine-grained relevance distinctions and realistic user models are priorities.