Identifying data points that deviate significantly from expected or normal behavior.
Anomaly detection is the task of identifying observations, events, or data points that differ substantially from the majority of a dataset or from an established model of normal behavior. Such observations, variously called anomalies, outliers, or novelties, may represent errors, rare events, or genuinely interesting phenomena depending on the application domain. The challenge lies in defining what "normal" means in a given context, since real-world data distributions are often complex, high-dimensional, and subject to drift over time.
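The simplest concrete instance of this idea is a z-score rule: treat "normal" as the bulk of a numeric sample and flag points that sit many standard deviations from the mean. The function below is an illustrative sketch, not a production method (the mean and standard deviation are themselves inflated by the outlier, which is why a fairly loose threshold is used here):

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Flag points more than `threshold` standard deviations from the mean.

    A minimal statistical baseline: both the mean and the standard
    deviation are computed over the full sample, so extreme outliers
    inflate the estimates; robust variants use the median and MAD.
    """
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return [x for x in values if abs(x - mean) > threshold * stdev]

data = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 42.0]
print(zscore_outliers(data))  # → [42.0]
```

Even this toy rule exposes the central difficulty named above: the definition of "normal" (here, mean plus/minus a few standard deviations) is an assumption that only holds for roughly unimodal, stationary data.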
The core approaches to anomaly detection fall into three broad categories: statistical methods, which model the data distribution and flag low-probability observations; proximity-based methods, which identify points that are far from their neighbors in feature space; and model-based methods, which train a representation of normal behavior and measure reconstruction or prediction error. Machine learning has dramatically expanded the toolkit available, with techniques ranging from one-class SVMs and isolation forests to autoencoders and variational generative models. Unsupervised approaches are especially common because labeled anomaly data is typically scarce — by definition, anomalies are rare.
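The proximity-based category can be sketched in a few lines: score each point by its average distance to its k nearest neighbors, so points far from any cluster receive high scores. This is a toy stdlib implementation (the names `knn_scores` and the O(n²) distance loop are illustrative, not a reference to any particular library):

```python
import math

def knn_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbors.

    Higher scores indicate points that are far from their neighbors in
    feature space; real implementations use spatial indexes rather than
    this quadratic brute-force loop.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_scores(points)
print(points[scores.index(max(scores))])  # → (10, 10)
```

Isolation forests and one-class SVMs replace this explicit distance computation with learned structure, but the decision principle is the same: points that are hard to place among their neighbors are scored as anomalous.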
Deep learning has pushed anomaly detection into high-dimensional domains that were previously intractable. Autoencoders learn compact representations of normal data and flag inputs with high reconstruction error; generative adversarial networks can model complex data manifolds; and transformer-based architectures have shown strong performance on sequential and time-series anomaly detection. Self-supervised contrastive methods have also emerged as a powerful paradigm, learning representations where anomalies cluster away from normal examples without requiring explicit labels.
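The reconstruction-error principle behind autoencoders can be shown without a neural network at all. Below, a fixed 1-D linear projection stands in for a trained encoder-decoder (the direction vector plays the role of a learned bottleneck; in a real autoencoder it would be learned from normal data), and the detection threshold is calibrated on reconstruction errors of normal examples:

```python
import math

def project_reconstruct(x, direction):
    """Stand-in for a trained autoencoder: project onto a 1-D subspace
    (the 'bottleneck') and map back to input space."""
    norm_sq = sum(d * d for d in direction)
    coeff = sum(a * d for a, d in zip(x, direction)) / norm_sq
    return [coeff * d for d in direction]

def reconstruction_error(x, direction):
    """Distance between an input and its low-dimensional reconstruction."""
    return math.dist(x, project_reconstruct(x, direction))

# Normal data lies near the line y = x, so the direction (1, 1) acts as
# the learned representation of normal behavior.
direction = (1.0, 1.0)
normal = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.05)]
threshold = max(reconstruction_error(x, direction) for x in normal)

candidate = (2.0, -2.0)  # far from the normal manifold
print(reconstruction_error(candidate, direction) > threshold)  # → True
```

A deep autoencoder generalizes exactly this recipe: the projection becomes a learned nonlinear map, but the anomaly score is still the gap between an input and its reconstruction.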
Anomaly detection is foundational across a wide range of high-stakes applications: fraud detection in financial transactions, intrusion detection in cybersecurity, predictive maintenance in industrial systems, quality control in manufacturing, and medical diagnosis. Its importance has grown alongside the explosion of sensor data, network telemetry, and real-time monitoring systems. The field remains active because no single method generalizes well across all domains — the right approach depends heavily on data type, anomaly prevalence, latency requirements, and the cost of false positives versus false negatives.