Carefully crafted input perturbations designed to cause machine learning models to make errors.
Adversarial attacks are deliberate manipulations of input data designed to cause machine learning models to produce incorrect outputs. By introducing carefully computed perturbations — often imperceptible to human observers — an attacker can cause a model to misclassify an image, mistranscribe speech, or make dangerous decisions. These perturbations exploit the high-dimensional geometry of learned decision boundaries, where small but strategically directed changes in input space can cross classification boundaries that are far from where a human would expect them to be. The phenomenon reveals a fundamental gap between how neural networks represent the world and how humans do.
Attacks are typically categorized along several axes. White-box attacks assume full knowledge of the model's architecture and weights, enabling gradient-based methods like the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) to compute maximally damaging perturbations directly. Black-box attacks operate with limited or no access to model internals, relying on transferability — the empirical finding that adversarial examples crafted for one model often fool others — or on repeated queries to estimate gradients. Attacks can also be targeted, steering the model toward a specific wrong class, or untargeted, simply maximizing prediction error. Beyond images, adversarial attacks have been demonstrated against text classifiers, speech recognition systems, reinforcement learning agents, and malware detectors.
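To make the white-box setting concrete, the sketch below shows a minimal untargeted FGSM perturbation. It is a hedged illustration, assuming a PyTorch image classifier with inputs scaled to [0, 1]; the function name and epsilon value are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """One-step Fast Gradient Sign Method (untargeted).

    Perturbs `x` in the direction that increases the loss, with the
    perturbation bounded by `epsilon` in the L-infinity norm.
    """
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Move each pixel by +/- epsilon according to the gradient sign,
    # then clamp back to the valid input range.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```

A targeted variant would instead step so as to decrease the loss toward a chosen class; PGD repeats this signed-gradient step several times, projecting back into the epsilon-ball after each iteration.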
The practical stakes are high. In autonomous vehicle perception, physical adversarial patches on stop signs have been shown to cause consistent misclassification. In medical imaging, subtle pixel-level changes can flip diagnostic predictions. In NLP, inconspicuous character substitutions can slip past content moderation systems. These vulnerabilities have driven the subfield of adversarial robustness, which seeks to train models that resist such attacks through techniques like adversarial training, certified defenses, and input preprocessing.
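The following is a hedged sketch of one such defense, PGD-based adversarial training, again assuming a PyTorch classifier on inputs in [0, 1]; the step size, iteration count, and helper names are illustrative assumptions rather than a canonical recipe.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=0.03, alpha=0.007, steps=10):
    """Projected Gradient Descent: repeated signed-gradient steps,
    projected back into the epsilon-ball around the clean input."""
    x_adv = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        x_adv = x_adv + alpha * x_adv.grad.sign()
        # Project onto the epsilon-ball, then back into the valid pixel range.
        x_adv = x + (x_adv - x).clamp(-epsilon, epsilon)
        x_adv = x_adv.clamp(0.0, 1.0).detach()
    return x_adv

def adversarial_training_step(model, optimizer, x, y):
    """One optimizer step on adversarial examples instead of clean inputs."""
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Training on worst-case perturbations in this way typically trades some clean accuracy for robustness within the chosen epsilon-ball; certified defenses and input preprocessing take different routes to the same goal.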
Understanding adversarial attacks is essential not just for security but for interpretability and generalization research. The existence of adversarial examples suggests that models may be relying on brittle, non-semantic features rather than the robust concepts humans use, making the study of these attacks a window into the deeper question of what neural networks actually learn.