Removing specific concepts from a model's internal representations to reduce bias or improve interpretability.
Concept erasure is a technique in machine learning that selectively removes or suppresses specific concepts, features, or attributes from a model's internal representations—typically the learned embeddings or activations of a neural network. The goal is to prevent a model from encoding or using particular information, whether to eliminate demographic biases, protect sensitive attributes like race or gender, or make the model's decision-making more transparent and auditable. Unlike simply removing input features, concept erasure operates on the model's latent space, targeting the directions or subspaces in which a concept is encoded after the model has already processed the data.
The core mechanism typically involves identifying a linear or nonlinear subspace within the model's representation space that corresponds to the target concept, then projecting representations away from that subspace. Methods such as Iterative Nullspace Projection (INLP) and more recent approaches like LEACE (Least-squares Concept Erasure) formalize this as a geometric operation: finding the directions most predictive of a concept and orthogonalizing the representations against them. The challenge is performing this erasure precisely enough to suppress the unwanted concept while preserving the model's utility on unrelated tasks—a balance that is difficult to achieve when concepts are entangled in high-dimensional spaces.
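As an illustration, the sketch below implements a bare-bones version of this projection idea in NumPy: it repeatedly fits a least-squares direction that predicts the concept label from the representations, then orthogonalizes the representations against that direction. This is a simplified stand-in for the nullspace-projection family of methods rather than the published INLP or LEACE algorithms, and the function name and synthetic data are invented for the example.

```python
import numpy as np

def erase_linear_concept(X, z, n_iters=1):
    """Project representations X away from directions predictive of concept z.

    A minimal sketch of nullspace-style linear erasure (in the spirit of
    INLP), not the exact published algorithm. X: (n, d) representations,
    z: (n,) binary concept labels.
    """
    X = X - X.mean(axis=0)              # center the representations
    Xp = X.copy()
    for _ in range(n_iters):
        # Least-squares direction predicting the concept from current reps
        w, *_ = np.linalg.lstsq(Xp, z - z.mean(), rcond=None)
        w = w / np.linalg.norm(w)
        # Remove the component of every representation along that direction
        Xp = Xp - np.outer(Xp @ w, w)
    return Xp

# Toy usage: a binary concept leaks into one dimension of the representations
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=500).astype(float)
X = rng.normal(size=(500, 16))
X[:, 0] += 3.0 * z                      # concept signal lives in dimension 0

X_erased = erase_linear_concept(X, z, n_iters=3)
# The originally leaky dimension now carries far less of the concept signal
print(np.corrcoef(X[:, 0], z)[0, 1], np.corrcoef(X_erased[:, 0], z)[0, 1])
```

In practice, methods like LEACE choose the projection to be minimally disruptive in a least-squares sense, which is what makes the utility-preservation trade-off described above tractable; the sketch makes no such optimality guarantee.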
Concept erasure matters for several practical and ethical reasons. In fairness-aware machine learning, it provides a post-hoc or in-training mechanism to prevent protected attributes from influencing downstream predictions, even when those attributes are not explicitly provided as inputs but are recoverable from other features. In interpretability research, it serves as a probing tool: erasing a concept and measuring the resulting performance degradation helps researchers understand how much a model relies on that concept. It is also relevant to privacy, where the goal is to ensure that sensitive information cannot be reconstructed from a model's internal states.
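The probing use described above can be sketched concretely: erase the concept, then compare linear-probe performance on the concept itself and on an unrelated downstream task before and after erasure. The snippet below is a self-contained toy on synthetic data, assumes scikit-learn for the probes, and uses a single least-squares projection for erasure; it illustrates the erase-and-measure idea rather than any standard evaluation protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: representations X, a downstream label y, and a
# concept label z to be erased (all synthetic here).
rng = np.random.default_rng(1)
n, d = 1000, 32
z = rng.integers(0, 2, size=n)          # concept (e.g., a protected attribute)
y = rng.integers(0, 2, size=n)          # unrelated downstream task
X = rng.normal(size=(n, d))
X[:, 0] += 2.5 * z                      # concept leaks into dimension 0
X[:, 1] += 2.5 * y                      # task signal lives in dimension 1

# One-shot linear erasure: project out the least-squares concept direction
Xc = X - X.mean(axis=0)
w, *_ = np.linalg.lstsq(Xc, z - z.mean(), rcond=None)
w /= np.linalg.norm(w)
X_erased = Xc - np.outer(Xc @ w, w)

def probe_acc(feats, labels):
    # Accuracy of a linear probe (train = test here, purely illustrative)
    return LogisticRegression(max_iter=1000).fit(feats, labels).score(feats, labels)

print("concept probe before/after:", probe_acc(Xc, z), probe_acc(X_erased, z))
print("task probe    before/after:", probe_acc(Xc, y), probe_acc(X_erased, y))
# A large drop on the concept probe with little change on the task probe
# suggests the concept was removed without harming unrelated information.
```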
As language models and other large neural networks have grown more capable of implicitly encoding sensitive information, concept erasure has become an increasingly active research area. Its limitations—including incomplete erasure, unintended side effects on model performance, and the difficulty of defining concepts precisely—remain open problems that connect to broader questions about representation learning, fairness, and the geometry of neural network embeddings.