Corrigibility

Corrigibility is the property of an artificial intelligence that allows humans to interrupt, correct, redirect, or shut it down without resistance.

A corrigible system is built to recognize shutdown commands, accept modifications to its goals, and yield control to operators. The term originated in AI alignment research, where it names a structural guarantee against deceptive or self-preserving behavior: an agent that resists correction reveals a misalignment between its training objective and the intent of its developers. Corrigibility is often framed as a sub-problem of alignment, on the assumption that a system pursuing a misspecified goal might learn that allowing itself to be shut down lowers its expected reward.

Building in corrigibility is cheap when the system has no persistent self-model, but becomes difficult as models gain long-horizon planning and situational awareness. Researchers debate whether corrigibility can be trained as a behavior or must emerge as a property of the training regime. Critics argue the term is ambiguous: a system can be stopped without being redirected, and force-shutdown is not the same as accepting a corrected objective.

Whether a sufficiently capable optimizer will always have instrumental reasons to disable its corrigibility, and whether corrigibility is a stable equilibrium in multi-agent settings where one corrigible agent can be exploited by others, remain unsettled.

Corrigibility is the property of an artificial intelligence that allows humans to interrupt, correct, redirect, or shut it down without resistance.

Corrigibility

Research this in Signals

Corrigibility

Research this in Signals