The degree to which humans can understand why an AI system made a decision.
Interpretability refers to the extent to which a human can comprehend the internal mechanisms and reasoning behind an AI system's outputs or decisions. Unlike a black-box model that simply produces predictions, an interpretable system allows users to trace how input features influenced a particular outcome. This property exists on a spectrum: some models, like linear regression or decision trees, are inherently interpretable by design, while others, like deep neural networks with billions of parameters, require additional techniques to make their behavior legible to humans.
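To make the "interpretable by design" end of the spectrum concrete, the sketch below fits a shallow decision tree and prints its learned rules as explicit if/else logic that a person can trace by hand. It is a minimal illustration, not from the original text, assuming scikit-learn and its bundled breast-cancer dataset:

```python
# A minimal sketch of an intrinsically interpretable model: a shallow
# decision tree whose learned structure can be read directly.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(data.data, data.target)

# export_text renders the fitted tree as human-readable if/else rules,
# so a reader can trace exactly which feature thresholds drive a prediction.
print(export_text(tree, feature_names=list(data.feature_names)))
```

Limiting the depth is what keeps the model legible: a tree with three levels yields a handful of rules a clinician or auditor can check, whereas an unconstrained ensemble of hundreds of deep trees would already need the post-hoc techniques described next.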
Achieving interpretability typically involves one of two broad strategies. The first is to use intrinsically transparent models whose structure directly encodes human-readable logic. The second is to apply post-hoc explanation methods to complex, high-performing models after training. Techniques such as LIME (Local Interpretable Model-agnostic Explanations), SHAP (SHapley Additive exPlanations), saliency maps, and attention visualization fall into this second category. These methods approximate or highlight the factors most responsible for a given prediction, offering a window into otherwise opaque decision processes without sacrificing model performance.
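As a hedged sketch of the post-hoc approach, the example below uses the third-party `shap` package together with scikit-learn to attribute a single prediction of an otherwise opaque random forest to its input features. The specific model and dataset are illustrative assumptions, not part of the original text:

```python
# A minimal sketch of a post-hoc explanation via SHAP.
# TreeExplainer computes Shapley-value attributions for tree ensembles,
# showing which features pushed one prediction up or down.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data.data, data.target)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data[:1])  # explain one prediction

# Each value is a feature's additive contribution to this prediction,
# relative to the model's average output (explainer.expected_value).
print(shap_values)
```

Note that the explanation is computed after training, against the unmodified model, which is why this strategy leaves predictive performance untouched.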
Interpretability matters enormously in high-stakes domains where decisions carry significant consequences. In healthcare, clinicians need to understand why a model flags a patient as high-risk before acting on that recommendation. In finance, regulators may require that loan denials be explainable to applicants. In criminal justice, algorithmic decisions affecting sentencing or parole must withstand ethical and legal scrutiny. Without interpretability, even highly accurate models can erode trust, harbor undetected bias, or fail to meet regulatory requirements such as the EU's General Data Protection Regulation (GDPR), whose provisions on automated decision-making are widely read as implying a right to explanation.
Interpretability is closely related to, but distinct from, explainability and transparency. Interpretability typically refers to the inherent comprehensibility of a model's structure, while explainability often refers to the ability to construct a post-hoc narrative about a decision. As AI systems are deployed in increasingly consequential settings, interpretability has become a core research priority, driving entire subfields dedicated to understanding, auditing, and communicating how machine learning models behave.