
Envisioning is an emerging technology research institute and advisory.


2011 — 2026


Concept Erasure

Removing specific concepts from a model's internal representations to reduce bias or improve interpretability.

Year: 2020 · Generality: 339

Concept erasure is a technique in machine learning that selectively removes or suppresses specific concepts, features, or attributes from a model's internal representations—typically the learned embeddings or activations of a neural network. The goal is to prevent a model from encoding or using particular information, whether to eliminate demographic biases, protect sensitive attributes like race or gender, or make the model's decision-making more transparent and auditable. Unlike simply removing input features, concept erasure operates on the model's latent space, targeting the directions or subspaces in which a concept is encoded after the model has already processed the data.

The core mechanism typically involves identifying a linear or nonlinear subspace within the model's representation space that corresponds to the target concept, then projecting representations away from that subspace. Methods such as Iterative Nullspace Projection (INLP) and more recent approaches like LEACE (Least-squares Concept Erasure) formalize this as a geometric operation: finding the directions most predictive of a concept and orthogonalizing the representations against them. The challenge is performing this erasure precisely enough to suppress the unwanted concept while preserving the model's utility on unrelated tasks—a balance that is difficult to achieve when concepts are entangled in high-dimensional spaces.
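The projection step described above can be sketched in a few lines. This is a minimal illustration, not INLP or LEACE themselves: it assumes a binary concept label and uses the class-mean-difference vector as a stand-in for the trained linear classifiers those methods fit at each iteration.

```python
import numpy as np

def erase_direction(X, w):
    """Project each row of X onto the hyperplane orthogonal to w."""
    w = w / np.linalg.norm(w)
    return X - np.outer(X @ w, w)

def inlp_erase(X, y, n_iters=5):
    """INLP-style sketch: repeatedly find a direction predictive of the
    binary concept y (here the class-mean difference, standing in for a
    trained linear probe) and project it out of the representations."""
    X = X.astype(float).copy()
    for _ in range(n_iters):
        w = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
        if np.linalg.norm(w) < 1e-8:  # concept no longer linearly detectable
            break
        X = erase_direction(X, w)
    return X

# Toy representations: a 5-d embedding whose first coordinate encodes the concept.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 5))
X[:, 0] += 3.0 * y

X_erased = inlp_erase(X, y)
# After erasure, the class means coincide along every remaining direction.
gap = np.abs(X_erased[y == 1].mean(axis=0) - X_erased[y == 0].mean(axis=0)).max()
```

Note the trade-off the text mentions: the projection also discards whatever task-relevant variance happened to lie along the erased directions, which is why precise erasure is hard when concepts are entangled.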

Concept erasure matters for several practical and ethical reasons. In fairness-aware machine learning, it provides a post-hoc or in-training mechanism to prevent protected attributes from influencing downstream predictions, even when those attributes are not explicitly provided as inputs but are recoverable from other features. In interpretability research, it serves as a probing tool: erasing a concept and measuring the resulting performance degradation helps researchers understand how much a model relies on that concept. It is also relevant to privacy, where the goal is to ensure that sensitive information cannot be reconstructed from a model's internal states.
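The probing use can be made concrete with a toy before/after check: fit a simple probe for the concept on the raw representations, erase, and probe again. The sketch below uses a single mean-difference projection as the erasure step and a nearest-centroid classifier as the probe; both are illustrative simplifications of what published evaluations actually use.

```python
import numpy as np

def probe_acc(X, y):
    """Accuracy of a nearest-class-centroid probe for the binary concept y."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    pred = np.linalg.norm(X - mu1, axis=1) < np.linalg.norm(X - mu0, axis=1)
    return float((pred == (y == 1)).mean())

# Synthetic representations with the concept encoded in the first coordinate.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
X = rng.normal(size=(400, 5))
X[:, 0] += 3.0 * y

# Erasure: project away the class-mean-difference direction.
w = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
w /= np.linalg.norm(w)
X_erased = X - np.outer(X @ w, w)

before, after = probe_acc(X, y), probe_acc(X_erased, y)
```

On this synthetic data the probe falls from well above chance to roughly 50%, indicating the concept is no longer linearly recoverable; a real evaluation would also verify that performance on unrelated downstream tasks is preserved.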

As language models and other large neural networks have grown more capable of implicitly encoding sensitive information, concept erasure has become an increasingly active research area. Its limitations—including incomplete erasure, unintended side effects on model performance, and the difficulty of defining concepts precisely—remain open problems that connect to broader questions about representation learning, fairness, and the geometry of neural network embeddings.

Related

Mechanistic Unlearning
Selectively removing specific learned knowledge from trained models without full retraining.
Generality: 293

Machine Unlearning
Removing specific data's influence from a trained model without full retraining.
Generality: 463

Abliteration
Removes alignment restrictions from language models by targeting refusal directions in activations.
Generality: 79

Capability Elucidation
Systematic methods to reveal what tasks and latent abilities an AI system possesses.
Generality: 493

Representation Engineering
Designing and optimizing internal data representations to improve AI model performance.
Generality: 654

Ablation
Systematically removing model components to measure their individual contribution to performance.
Generality: 700