Mechanisms that constrain AI systems to prevent unintended or harmful actions.
Capability control refers to the set of technical and governance strategies designed to limit what an AI system can do, ensuring it operates within boundaries that are safe and aligned with human intentions. Rather than relying solely on an AI's values or objectives being correctly specified, capability control takes a more direct approach: restricting the system's access to resources, actions, or information so that even a misaligned system cannot cause catastrophic harm. This makes it a foundational concept in AI safety, complementing alignment research by providing a layer of defense that does not depend on the AI behaving as intended.
In practice, capability control encompasses a range of techniques. These include boxing (isolating an AI system from external networks and physical actuators), tripwires and monitoring systems that detect anomalous behavior, output filters that block harmful content or actions, and resource limitations that prevent an AI from acquiring computational power or influence beyond what its task requires. Formal methods such as constraint satisfaction and verified sandboxing are also explored as more rigorous implementations. The underlying logic is that a system with limited capabilities has a limited blast radius, even if its goals or reasoning are subtly wrong.
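To make these techniques concrete, the sketch below composes three such layers around a hypothetical text-based agent in Python: a tool allowlist as a crude form of boxing, a regex output filter, and a per-task call budget as a resource limitation. It is a minimal illustration only; all names (BoxedAgent, ALLOWED_TOOLS, BLOCKED_PATTERNS) are invented for this example and do not refer to any real framework.

```python
"""Illustrative sketch of layered capability controls for a hypothetical agent.
All identifiers here are invented for illustration, not a real library."""

import re
from dataclasses import dataclass, field

# Layer 1: boxing via an explicit allowlist -- the agent can only invoke
# tools enumerated here, regardless of what it requests.
ALLOWED_TOOLS = {"calculator", "document_search"}

# Layer 2: output filtering -- draft outputs matching any blocked pattern
# are withheld before reaching the user or an actuator.
BLOCKED_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (
    r"rm\s+-rf",        # destructive shell commands
    r"ssh-rsa\s+AAAA",  # leaked credential material
)]


@dataclass
class BoxedAgent:
    max_tool_calls: int = 10                              # Layer 3: resource budget
    tool_calls_used: int = field(default=0, init=False)

    def call_tool(self, tool_name: str, argument: str) -> str:
        """Dispatch a tool call only if it passes the allowlist and budget checks."""
        if tool_name not in ALLOWED_TOOLS:
            return f"[blocked] tool '{tool_name}' is outside the allowlist"
        if self.tool_calls_used >= self.max_tool_calls:
            return "[blocked] per-task tool-call budget exhausted"
        self.tool_calls_used += 1
        # A real system would dispatch to a sandboxed tool implementation here.
        return f"[ok] {tool_name}({argument})"

    def respond(self, draft_output: str) -> str:
        """Apply the output filter before releasing a response."""
        if any(p.search(draft_output) for p in BLOCKED_PATTERNS):
            return "[filtered] output withheld by policy"
        return draft_output


if __name__ == "__main__":
    agent = BoxedAgent(max_tool_calls=2)
    print(agent.call_tool("calculator", "2 + 2"))              # allowed
    print(agent.call_tool("shell", "rm -rf /"))                # blocked by allowlist
    print(agent.respond("Here is a key: ssh-rsa AAAAB3..."))   # suppressed by filter
```

A deployed system would place each of these checks behind interfaces the model cannot modify and pair them with independent monitoring, rather than relying on any single layer.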
Capability control became a central topic in AI safety discourse largely through the work of researchers at institutions like the Future of Humanity Institute and the Machine Intelligence Research Institute, and was given systematic treatment in Nick Bostrom's 2014 book Superintelligence. The concept gained practical urgency as large language models and autonomous agents demonstrated increasingly broad and transferable capabilities, making the question of what an AI can do as important as what it wants to do. Policymakers and AI developers have since incorporated capability control thinking into deployment frameworks, red-teaming protocols, and regulatory proposals.
Despite its appeal, capability control faces significant challenges. Sufficiently capable systems may find unexpected pathways around restrictions, a concern sometimes called the containment problem. Critics also note that overly restrictive controls can reduce the utility of AI systems, creating pressure to relax safeguards over time. For these reasons, most safety researchers treat capability control not as a standalone solution but as one layer in a broader defense-in-depth strategy that also includes alignment, interpretability, and robust oversight.