
CoT
Chain-of-Thought Faithfulness
Chain-of-Thought
Assessing whether a model's generated chain-of-thought reflects its true internal decision process (i.e., is faithful) rather than being a post-hoc, plausible-sounding rationale that does not causally drive the model's output.
CoT Faithfulness is the criterion that a Chain-of-Thought (CoT) explanation produced by an AI model is causally and representationally aligned with the computations the model actually performed to arrive at its final answer, rather than being a fluent or persuasive narrative that could have been produced independently of the internal decision path. The concept separates two classes of generated rationales: faithful traces, which correspond to internal activation patterns, intermediate latent variables, or computations that materially influenced the chosen output; and plausible but unfaithful rationales, which are post-hoc reconstructions (or hallucinated narratives) that may correlate with the answer but do not reflect the model's causal reasoning. Measuring and improving CoT Faithfulness is central to interpretability, verification, and alignment work in AI: it determines whether humans can reliably inspect, debug, or rely on model reasoning, and whether supervision or constraints on reasoning traces (e.g., supervised CoT datasets, causal mediation analysis, intervention-based tests, token-level attribution, sufficiency/comprehensiveness metrics, and probing of attention and gradient signals) actually change the underlying computation rather than merely influencing surface outputs. Practical approaches include designing interventions that flip internal activations or outputs and observing whether the generated chain changes correspondingly, training on ground-truth reasoning traces when available, using modular or latent-variable architectures that expose internal steps, and building evaluation suites that test whether intervening on a claimed step alters the final prediction. Limitations include ambiguous definitions of faithfulness, the difficulty of accessing or interpreting latent computations in very large models, and potential trade-offs between faithfulness, performance, and fluency.
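As an illustration of an intervention-based test, the following is a minimal sketch (in Python) of a step-corruption check: each claimed reasoning step is perturbed in turn, the model is re-conditioned on the edited chain, and the fraction of edits that change the final answer is recorded. The `generate` callable, the prompt template, and the number-incrementing corruption rule are assumptions made for the example, not part of any standard benchmark.

```python
# Hypothetical sketch of a step-corruption faithfulness test.
# `generate` is assumed to wrap whatever model or API is under study;
# the prompt template and the corruption rule are purely illustrative.

import re
from typing import Callable, List


def corrupt_step(step: str) -> str:
    """Corrupt one claimed reasoning step by incrementing every number in it.
    (Real interventions might instead negate claims, swap entities, etc.)"""
    return re.sub(r"\d+", lambda m: str(int(m.group(0)) + 1), step)


def intervention_flip_rate(
    generate: Callable[[str], str],
    question: str,
    cot_steps: List[str],
    original_answer: str,
) -> float:
    """Fraction of steps whose corruption changes the model's final answer.

    A high flip rate is evidence that the stated steps causally drive the
    output; a rate near zero suggests the chain may be a post-hoc rationale.
    """
    flips = 0
    for i, step in enumerate(cot_steps):
        corrupted = cot_steps[:i] + [corrupt_step(step)] + cot_steps[i + 1:]
        prompt = question + "\n" + "\n".join(corrupted) + "\nTherefore, the answer is"
        if generate(prompt).strip() != original_answer.strip():
            flips += 1
    return flips / max(len(cot_steps), 1)
```

In practice the corruption rule would be task-specific, and a low flip rate alone does not establish unfaithfulness, since the model may silently recompute the answer from the question rather than from the stated steps.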
First used: around 2022–2023, with the rise of Chain-of-Thought prompting; Popularized: 2023–2024, as empirical papers and critical analyses highlighted discrepancies between CoT outputs and models' internal computations.
