Envisioning is an emerging technology research institute and advisory.

2011 — 2026

Alignment Tax

Performance cost of making AI models safer and aligned with human values

Year: 2022
Generality: 693

Alignment tax is the reduction in performance or capability that can result from applying safety-focused training techniques to AI models. When developers apply measures to make models more aligned with human values, such as Constitutional AI, reinforcement learning from human feedback (RLHF), refusal training, or adversarial robustness training, those measures can lower the model's benchmark scores or narrow its general capability. A model optimized purely for capability, without safety constraints, often outperforms the same model after alignment training. This tradeoff is the alignment tax: the price paid for making models safer and more trustworthy.
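One simple way to put a number on the definition above is to compare the same benchmarks before and after alignment training. The sketch below is illustrative only: the function name `alignment_tax`, the benchmark names, and every score are hypothetical and do not come from any real evaluation.

```python
def alignment_tax(base_scores, aligned_scores):
    """Return the per-benchmark tax in score points (base minus aligned).

    A positive value means the aligned model scored lower than the base model.
    Only benchmarks present in both dictionaries are compared.
    """
    shared = base_scores.keys() & aligned_scores.keys()
    return {name: base_scores[name] - aligned_scores[name] for name in sorted(shared)}


# Hypothetical scores, invented purely for illustration.
base = {"reasoning": 71.0, "coding": 64.5, "trivia": 80.0}
aligned = {"reasoning": 70.0, "coding": 61.5, "trivia": 80.5}

tax = alignment_tax(base, aligned)
mean_tax = sum(tax.values()) / len(tax)

for name, points in tax.items():
    print(f"{name}: {points:+.1f} points")
print(f"mean tax: {mean_tax:.2f} points")
```

A negative entry, as with the made-up "trivia" score here, would indicate a benchmark where alignment training did not cost anything. The tax need not be uniform across tasks.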

The alignment tax arises from how training objectives interact. During pretraining, a language model optimizes for predictive accuracy: given a sequence, predict the next token. This objective rewards breadth of knowledge and capability. Safety training introduces additional objectives: avoid harmful outputs, refuse certain requests, follow specific behavioral guidelines. These can conflict with pure capability. For example, a model that refuses to help with certain technical topics might be more aligned but less useful for legitimate queries about those topics. RLHF attempts to balance this through human preferences, but no training process perfectly preserves all capabilities while adding constraints. The result can be measurable degradation on some benchmarks, reduced versatility, or slower inference. Even well-designed safety training creates friction between "doing what the user asks" and "doing what's safe and appropriate." This friction is the tax.
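The conflict between objectives can be illustrated with a deliberately tiny toy model. The sketch assumes a single scalar parameter, a quadratic "capability" loss, and a quadratic "safety" penalty weighted by `lam`. None of this reflects how real models are trained, and all names and constants are hypothetical. It only shows that when the two optima differ, weighting the second objective more heavily moves the solution away from the capability optimum.

```python
def combined_minimizer(cap_opt, safe_opt, lam):
    """Closed-form minimizer of (t - cap_opt)**2 + lam * (t - safe_opt)**2."""
    return (cap_opt + lam * safe_opt) / (1.0 + lam)


def capability_loss(theta, cap_opt):
    """Toy capability loss: squared distance from the capability optimum."""
    return (theta - cap_opt) ** 2


# Hypothetical optima: the two objectives prefer different parameter values.
CAP_OPT, SAFE_OPT = 3.0, 1.0

taxes = []
for lam in [0.0, 0.5, 1.0, 4.0]:
    theta = combined_minimizer(CAP_OPT, SAFE_OPT, lam)
    tax = capability_loss(theta, CAP_OPT)
    taxes.append(tax)
    print(f"lam={lam}: theta={theta:.3f}, capability loss={tax:.3f}")
```

With `lam = 0` the safety term is ignored and the capability loss is zero. As `lam` grows, the minimizer drifts toward the safety optimum and the capability loss rises, which is the toy analogue of the tax.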

The alignment tax matters for AI deployment and its societal impact because it forces a genuine tradeoff: how much capability should we sacrifice for safety? Some argue that accepting the tax is necessary, that a slightly less capable but safer model is better than a more capable but risky one. Others push back, noting that capability loss can also harm utility and trust. The existence of the tax also motivates research into making safety measures more efficient, shrinking the tax without compromising protection. Understanding the alignment tax is essential for responsible AI development: it acknowledges that safety is not free, that values have costs, and that wise deployment requires explicit choices about where to land on the capability-safety frontier.
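The idea of a capability-safety frontier can be made concrete with a small Pareto filter. In this sketch the model names and scores are invented for illustration, and both axes are assumed to be "higher is better" numbers on a shared scale. A model is on the frontier if no other candidate is at least as good on both axes and strictly better on one.

```python
def pareto_frontier(models):
    """Return names of models not dominated on (capability, safety)."""
    frontier = []
    for name, cap, safe in models:
        dominated = any(
            (c >= cap and s >= safe) and (c > cap or s > safe)
            for _, c, s in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical candidates: (name, capability score, safety score).
candidates = [
    ("base", 0.90, 0.40),
    ("light-tuned", 0.88, 0.70),
    ("heavy-tuned", 0.80, 0.90),
    ("over-refusing", 0.70, 0.85),
]

frontier = pareto_frontier(candidates)
print(frontier)
```

Every model on the frontier represents a different position on the tradeoff. Choosing among them is the explicit deployment decision described above, while a dominated candidate pays a tax without buying any additional safety.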

Related

Alignment

Ensuring an AI system's goals and behaviors reliably match human values and intentions.

Generality: 865
Alignment Platform

An integrated framework ensuring AI systems behave consistently with human values and goals.

Generality: 680
Super Alignment

Ensuring superintelligent AI systems reliably align with human values at scale.

Generality: 550
Distillation Tax

Performance ceiling when training smaller models from larger model outputs

Generality: 519
AI Safety

Research field ensuring AI systems remain beneficial, aligned, and free from catastrophic risk.

Generality: 871
Group-Based Alignment

Coordinating multiple AI agents to share goals, values, and behaviors without conflict.

Generality: 395