
Envisioning is an emerging technology research institute and advisory.




Adversarial Attacks

Carefully crafted input perturbations designed to fool machine learning models into making errors.

Year: 2014 · Generality: 773

Adversarial attacks are deliberate manipulations of input data designed to cause machine learning models to produce incorrect outputs. By introducing carefully computed perturbations — often imperceptible to human observers — an attacker can cause a model to misclassify an image, mistranscribe speech, or make dangerous decisions. These perturbations exploit the high-dimensional geometry of learned decision boundaries, where small but strategically directed changes in input space can cross classification boundaries that are far from where a human would expect them to be. The phenomenon reveals a fundamental gap between how neural networks represent the world and how humans do.

Attacks are typically categorized along several axes. White-box attacks assume full knowledge of the model's architecture and weights, enabling gradient-based methods like the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) to compute maximally damaging perturbations directly. Black-box attacks operate with limited or no access to model internals, relying on transferability — the empirical finding that adversarial examples crafted for one model often fool others — or on repeated queries to estimate gradients. Attacks can also be targeted, steering the model toward a specific wrong class, or untargeted, simply maximizing prediction error. Beyond images, adversarial attacks have been demonstrated against text classifiers, speech recognition systems, reinforcement learning agents, and malware detectors.
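The white-box case described above can be sketched on a toy model. The snippet below is a minimal, illustrative FGSM implementation against a hypothetical logistic-regression classifier (all weights, data, and function names are invented for this example): the attack takes one step of size eps in the direction of the sign of the loss gradient with respect to the input, which is guaranteed to increase the loss of a linear model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """One-step Fast Gradient Sign Method for a logistic model.

    Perturbs input x by eps in the sign direction of the gradient of the
    binary cross-entropy loss with respect to x.
    """
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w  # d(loss)/dx for sigmoid + cross-entropy
    return x + eps * np.sign(grad_x)

# Hypothetical fixed classifier and a clean input.
rng = np.random.default_rng(0)
w = rng.normal(size=20)
b = 0.0
x = rng.normal(size=20)
y = 1.0 if sigmoid(w @ x + b) > 0.5 else 0.0  # label = clean prediction

x_adv = fgsm(x, y, w, b, eps=0.3)

# The attack pushes the predicted probability away from the true label.
print("clean:", sigmoid(w @ x + b), "adversarial:", sigmoid(w @ x_adv + b))
```

Note that for a linear model the worst-case perturbation under an L∞ budget is exactly this sign step; for deep networks FGSM is only a first-order approximation, which is why iterated variants like PGD are stronger.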

The practical stakes are high. In autonomous vehicles, adversarial patches on stop signs can cause misclassification at dangerous rates. In medical imaging, subtle pixel-level changes can alter diagnostic predictions. In NLP, imperceptible character substitutions can bypass content moderation systems. These vulnerabilities have driven an entire subfield of adversarial robustness, which seeks to train models that resist such attacks through techniques like adversarial training, certified defenses, and input preprocessing.
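Of the defenses mentioned above, adversarial training is the most widely used: at each training step the model is fit to adversarially perturbed inputs rather than clean ones. Below is a minimal sketch on a toy linear classifier with an FGSM inner step (all data, dimensions, and hyperparameters are illustrative assumptions, not a production recipe).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic linearly separable binary classification data.
rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = (X @ true_w > 0).astype(float)

# Adversarial training: each step, perturb the batch with FGSM against the
# current model, then take a gradient step on the perturbed batch.
w = np.zeros(d)
eps, lr = 0.1, 0.5
for _ in range(300):
    p = sigmoid(X @ w)
    grad_X = (p - y)[:, None] * w        # per-example input gradient
    X_adv = X + eps * np.sign(grad_X)    # FGSM inner maximization
    p_adv = sigmoid(X_adv @ w)
    grad_w = X_adv.T @ (p_adv - y) / n   # outer minimization on adv. batch
    w -= lr * grad_w

clean_acc = ((sigmoid(X @ w) > 0.5) == (y > 0.5)).mean()
print("clean accuracy after adversarial training:", clean_acc)
```

This is the min-max structure of adversarial training in miniature: an inner maximization crafts the worst-case input within the perturbation budget, and the outer minimization updates the weights on those inputs, trading some clean accuracy for robustness.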

Understanding adversarial attacks is essential not just for security but for interpretability and generalization research. The existence of adversarial examples suggests that models may be relying on brittle, non-semantic features rather than the robust concepts humans use, making the study of these attacks a window into the deeper question of what neural networks actually learn.

Related

  • Adversarial Examples: Carefully crafted inputs that fool machine learning models into making wrong predictions. (Generality: 781)
  • Targeted Adversarial Examples: Crafted inputs that fool a model into predicting one specific wrong class. (Generality: 550)
  • Adversarial Evaluation: Testing AI systems by deliberately crafting inputs designed to expose failures. (Generality: 694)
  • Robustness: A model's ability to maintain reliable performance under varied or adversarial conditions. (Generality: 838)
  • Prompt Injection: Manipulating AI language models by embedding malicious instructions within input prompts. (Generality: 499)
  • Adversarial Debiasing: A technique that uses adversarial training to reduce bias toward sensitive attributes. (Generality: 340)