A steering mechanism that shapes language model outputs by modifying internal activations.
A control vector is a learned directional vector applied to the internal activations of a neural network—typically a large language model—to systematically shift its behavior along a specific axis, such as tone, formality, sentiment, or topic focus. Unlike prompting, which operates at the input level, control vectors intervene directly within the model's hidden states during inference, offering a more precise and consistent form of behavioral steering without altering the model's weights.
The technique works by identifying a direction in the model's activation space that corresponds to a target concept or behavioral trait. This direction is typically derived by contrasting the activations produced by paired inputs that differ in the target trait—for example, prompts that are formal versus informal, or honest versus deceptive. The resulting vector captures the latent dimension along which the model internally represents that concept. At inference time, this vector is added to (or subtracted from) the residual stream at one or more layers, nudging the model's representations toward or away from the target behavior.
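The sketch below illustrates one common recipe under simplifying assumptions: extract a direction as the difference of mean hidden states between formal and informal prompts, then add it back into the residual stream through a forward hook. The model name ("gpt2"), layer index, steering scale, and example prompts are illustrative choices, not prescribed values, and the module path for the hook differs across architectures.

```python
# Minimal sketch: derive a "formality" control vector from contrastive prompts
# and apply it during generation. Assumes PyTorch and Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM works; a small model keeps the demo light
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6   # assumption: a mid-depth block; the most effective layer is found empirically
SCALE = 4.0 # assumption: steering strength, typically tuned by inspecting outputs

formal = ["Dear Dr. Smith, I am writing to request an extension.",
          "We regret to inform you that the meeting has been postponed."]
informal = ["hey can we push the deadline a bit?",
            "heads up, the meeting's moved, sorry!"]

def mean_activation(prompts, layer):
    """Mean hidden state at the output of block `layer`, averaged over tokens and prompts."""
    acts = []
    with torch.no_grad():
        for p in prompts:
            ids = tokenizer(p, return_tensors="pt")
            out = model(**ids, output_hidden_states=True)
            # hidden_states[layer + 1] is the output of block `layer`
            # (index 0 holds the embedding-layer output).
            acts.append(out.hidden_states[layer + 1].mean(dim=1).squeeze(0))
    return torch.stack(acts).mean(dim=0)

# Control vector = difference of the class means ("formal minus informal"), normalized.
direction = mean_activation(formal, LAYER) - mean_activation(informal, LAYER)
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the residual-stream
    # hidden states of shape (batch, seq_len, hidden_dim); shift it and pass the rest through.
    hidden = output[0] + SCALE * direction
    return (hidden,) + output[1:]

# gpt2 exposes its blocks as model.transformer.h; Llama-style models use model.model.layers.
handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)

prompt = "Write a short reply about rescheduling:"
ids = tokenizer(prompt, return_tensors="pt")
steered = model.generate(**ids, max_new_tokens=40, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
handle.remove()
print(tokenizer.decode(steered[0], skip_special_tokens=True))
```

Negating SCALE steers away from the trait instead of toward it, which is often a useful first sanity check that the extracted direction is meaningful.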
Control vectors occupy a practical middle ground between prompt engineering and full fine-tuning. Prompt engineering is flexible but can be inconsistent and is limited by what the model's context window can express. Fine-tuning offers deep behavioral change but requires significant compute and risks degrading general capabilities. Control vectors are lightweight, reusable, and composable—multiple vectors can be applied simultaneously with different scaling factors, enabling nuanced, multi-dimensional control over model outputs in real time.
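As a hedged sketch of that composability, two previously extracted vectors can be combined with independent coefficients before being added at a layer. The names `formality_vec` and `optimism_vec` are hypothetical placeholders standing in for vectors produced by contrastive extraction as in the earlier example.

```python
import torch

hidden_dim = 768  # assumption: GPT-2-sized hidden dimension
formality_vec = torch.randn(hidden_dim)  # placeholder: would come from contrastive extraction
optimism_vec = torch.randn(hidden_dim)   # placeholder

# Each vector gets its own coefficient; a negative coefficient steers away from that trait.
combined = 2.5 * formality_vec - 1.0 * optimism_vec

def composed_hook(module, inputs, output):
    # Apply the composed shift to the residual stream, as in the single-vector hook above.
    hidden = output[0] + combined
    return (hidden,) + output[1:]
```

Because the vectors are just tensors, they can be rescaled, swapped, or removed at request time without reloading or retraining the model.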
The approach gained significant attention in the ML community around 2023, particularly following work on representation engineering and activation steering in transformer-based models. It has practical applications in AI safety (suppressing harmful outputs), personalization (adapting tone or style), and interpretability research (probing what concepts models internally represent). As language models grow larger and more capable, control vectors offer an efficient and interpretable mechanism for aligning model behavior with specific human preferences or deployment constraints.