
Targeted Adversarial Examples

Crafted inputs that fool a model into predicting one specific wrong class.

Year: 2014
Generality: 550

Targeted adversarial examples are carefully constructed inputs designed to cause a machine learning model to produce a specific, attacker-chosen incorrect output. Unlike untargeted adversarial attacks, which simply aim to induce any misclassification, targeted attacks have a precise goal — for example, manipulating an image of a stop sign so that a vision model confidently classifies it as a speed limit sign. The perturbations introduced are typically imperceptible to human observers, making these attacks particularly insidious in real-world deployments.
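In symbols (our notation, not from the article): for a classifier f, a loss L, and a perturbation budget ε, the two threat models differ only in the objective being optimized over the perturbation δ.

$$
\text{untargeted: } \max_{\|\delta\|_p \le \epsilon} \mathcal{L}\big(f(x+\delta),\, y_{\text{true}}\big)
\qquad
\text{targeted: } \min_{\|\delta\|_p \le \epsilon} \mathcal{L}\big(f(x+\delta),\, y_{\text{target}}\big)
$$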

Generating targeted adversarial examples usually involves an optimization process guided by the model's gradients. The attacker minimizes a loss function that rewards the model assigning high confidence to the target class while keeping the perturbation small, often measured by an Lp norm. Common methods include the Carlini & Wagner (C&W) attack, the Basic Iterative Method (BIM), and Projected Gradient Descent (PGD), all of which iteratively adjust the input toward the adversarial target. These techniques can operate in white-box settings, where the attacker has full access to model weights, or in black-box settings, where only output probabilities or hard labels are available.
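As an illustration, below is a minimal sketch of a targeted PGD-style attack against a PyTorch image classifier. The function name, step sizes, and the assumption that inputs are images scaled to [0, 1] are ours for the example, not taken from any particular paper or library.

```python
import torch
import torch.nn.functional as F

def targeted_pgd(model, x, target_class, eps=8/255, alpha=2/255, steps=40):
    """Targeted PGD sketch: nudge x toward being classified as target_class,
    keeping the perturbation inside an L-infinity ball of radius eps."""
    x_adv = x.clone().detach()
    target = torch.full((x.size(0),), target_class, dtype=torch.long, device=x.device)

    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = model(x_adv)
        # Minimize cross-entropy with respect to the attacker-chosen target class
        # (descend, rather than ascend as in an untargeted attack).
        loss = F.cross_entropy(logits, target)
        grad, = torch.autograd.grad(loss, x_adv)

        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()           # step toward the target class
            x_adv = x + (x_adv - x).clamp(-eps, eps)      # project back into the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                 # keep a valid image

    return x_adv.detach()
```

The same loop with the sign of the update flipped, and the true label in place of the target, recovers the untargeted PGD attack; C&W differs mainly in using a different surrogate loss and an explicit penalty on the perturbation norm.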

The significance of targeted adversarial examples extends well beyond academic curiosity. In safety-critical applications — autonomous vehicles, medical imaging, facial recognition, and malware detection — a targeted attack could redirect a model's decision toward a specific harmful outcome rather than merely causing random errors. This makes them a more dangerous threat model than untargeted attacks and a more demanding benchmark for evaluating model robustness. Research into targeted attacks has directly motivated defenses such as adversarial training, certified robustness, and input preprocessing techniques.
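One of those defenses, adversarial training, amounts to a small change in the usual training loop: each batch is perturbed by an attack before the loss is computed. The sketch below uses an untargeted inner step for simplicity; names and hyperparameters are illustrative, not from a specific implementation.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps=8/255, alpha=2/255, steps=10):
    """One adversarial-training step: perturb the batch with untargeted PGD,
    then update the model on the perturbed examples."""
    # Inner maximization: ascend the loss on the true labels.
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = (x_adv + alpha * grad.sign()).clamp(0.0, 1.0)
            x_adv = x + (x_adv - x).clamp(-eps, eps)

    # Outer minimization: a standard gradient step on the adversarial batch.
    optimizer.zero_grad()
    F.cross_entropy(model(x_adv.detach()), y).backward()
    optimizer.step()
```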

Targeted adversarial examples also serve as diagnostic tools, revealing how models represent and distinguish between classes. The ease with which high-confidence targeted misclassifications can be achieved exposes the degree to which deep neural networks rely on non-human-aligned features. Understanding and defending against these attacks remains an active and open research challenge in the broader field of trustworthy machine learning.

Related

Adversarial Attacks

Carefully crafted input perturbations designed to fool machine learning models into errors.

Generality: 773
Adversarial Examples

Carefully crafted inputs that fool machine learning models into making wrong predictions.

Generality: 781
Adversarial Evaluation

Testing AI systems by deliberately crafting inputs designed to expose failures.

Generality: 694
Target

The correct output a model is trained to predict, serving as the learning signal.

Generality: 720
Adversarial Debiasing

A technique that uses adversarial training to reduce bias toward sensitive attributes.

Generality: 340
Prompt Injection

Manipulating AI language models by embedding malicious instructions within input prompts.

Generality: 499