An AI system's capacity to identify and fix its own errors autonomously.
Self-correction in AI refers to the ability of a model or system to detect errors in its own outputs and revise them without requiring external human intervention. This capability has become especially prominent with large language models (LLMs), where the model is prompted—or prompts itself—to review a generated response, identify flaws in reasoning or factual accuracy, and produce an improved version. Unlike classical training-time error correction through gradient descent, self-correction in modern AI typically operates at inference time, making it a distinct and practically significant behavior.
The mechanisms underlying self-correction vary by context. In reinforcement learning from human feedback (RLHF), models learn to associate certain output patterns with negative reward signals, effectively internalizing a preference for more accurate or coherent responses. In chain-of-thought and self-refinement frameworks, a model is explicitly instructed to critique its own answer and iterate toward a better one, sometimes using a separate "critic" model or a second pass of the same model. Techniques like Constitutional AI leverage self-critique loops where the model evaluates its outputs against a set of principles before finalizing a response.
The practical importance of self-correction lies in its potential to improve reliability without expensive retraining or constant human oversight. If a model can catch its own logical errors, hallucinations, or unsafe outputs, it becomes more trustworthy in high-stakes deployments such as medical question answering, legal reasoning, or code generation. However, research has shown that self-correction is far from guaranteed—models often fail to identify their own errors, or introduce new ones during revision, particularly when no external ground-truth signal is available. This has led to active debate about whether LLMs can genuinely self-correct or merely appear to do so under favorable prompting conditions.
The concept gained significant traction in the ML community around 2022–2023 with the proliferation of instruction-tuned LLMs and the publication of frameworks like Self-Refine, Reflexion, and Constitutional AI. It sits at the intersection of reasoning, alignment, and reliability research, making it one of the more actively studied capabilities in contemporary AI development.