A measure of how two random variables vary together in direction and magnitude.
Covariance is a statistical measure that quantifies the degree to which two random variables change together. When two variables tend to increase and decrease simultaneously, their covariance is positive; when one tends to increase as the other decreases, the covariance is negative; and when the variables are statistically independent, their covariance is zero (though the converse does not hold: zero covariance rules out only a linear relationship, not dependence in general). Mathematically, the covariance of variables X and Y is defined as the expected value of the product of their deviations from their respective means: Cov(X, Y) = E[(X − μₓ)(Y − μᵧ)]. This signed quantity captures both the direction and a sense of the strength of the linear relationship between two variables.
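The definition above can be sketched directly in a few lines of NumPy; the toy arrays here are illustrative, and note that np.cov defaults to the unbiased (n − 1) estimator rather than the plain mean of deviation products:

```python
import numpy as np

# Toy data: x and y tend to rise together, so covariance should be positive.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Cov(X, Y) = E[(X - mu_x)(Y - mu_y)], computed from the definition.
cov_manual = np.mean((x - x.mean()) * (y - y.mean()))

# np.cov divides by (n - 1) by default; bias=True matches the mean above.
cov_np = np.cov(x, y, bias=True)[0, 1]
```

Both computations agree; the positive sign reflects that x and y move in the same direction.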
In machine learning, covariance plays a central role in understanding data structure and feature relationships. The covariance matrix — a square matrix containing the pairwise covariances of all features in a dataset — is foundational to techniques like Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Gaussian mixture models. PCA, for instance, diagonalizes the covariance matrix to find orthogonal directions of maximum variance, enabling dimensionality reduction while preserving as much information as possible. Covariance also underpins Gaussian processes, where the covariance (or kernel) function encodes prior assumptions about the smoothness and structure of functions being modeled.
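The PCA step described above — diagonalizing the covariance matrix to find orthogonal directions of maximum variance — can be sketched as follows. The synthetic correlated data here is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated 2-D data: feature 2 is a noisy copy of feature 1.
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.1 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)               # center each feature
C = Xc.T @ Xc / (len(X) - 1)          # sample covariance matrix (2 x 2)

# Diagonalize: eigenvectors are the principal directions,
# eigenvalues the variance captured along each.
eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns eigenvalues in ascending order
top = eigvecs[:, -1]                  # direction of maximum variance
projected = Xc @ top                  # 1-D reduced representation
```

Because the second feature nearly duplicates the first, almost all variance lies along a single direction, so the 1-D projection preserves most of the information; the variance of the projection equals the leading eigenvalue.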
One important limitation of covariance is that its magnitude depends on the scale of the variables, making direct comparisons across different datasets or feature pairs difficult. Normalizing covariance by the product of the standard deviations of the two variables yields the Pearson correlation coefficient, which is bounded between −1 and 1 and is scale-invariant. Despite this limitation, raw covariance remains essential in many algorithms where the absolute scale of variation matters, such as in Kalman filters and multivariate normal distributions.
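The normalization described above is a one-line computation; in this sketch, y is deliberately rescaled by a large factor to show that the raw covariance changes with the units while the correlation does not:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 100.0 * x + np.array([0.5, -0.3, 0.2, -0.4, 0.1])  # rescaled, nearly linear in x

cov = np.cov(x, y)[0, 1]                              # magnitude depends on scale
corr = cov / (np.std(x, ddof=1) * np.std(y, ddof=1))  # Pearson correlation
```

The covariance is large simply because y spans hundreds of units, while the correlation stays bounded in [−1, 1] and matches np.corrcoef regardless of scale.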
Covariance estimation from finite samples is itself a significant challenge in high-dimensional machine learning settings. When the number of features exceeds the number of observations, the sample covariance matrix becomes singular and unreliable. Techniques such as shrinkage estimation (e.g., the Ledoit-Wolf estimator) and sparse covariance estimation have been developed to produce well-conditioned covariance matrices in these regimes, making robust covariance estimation an active area of research in modern machine learning.
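The high-dimensional failure mode and the shrinkage remedy can be sketched with NumPy alone. The shrinkage below uses the same scaled-identity target as Ledoit-Wolf, but with a fixed illustrative intensity rather than the data-driven optimum that the actual Ledoit-Wolf estimator computes:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 50                      # more features than observations
X = rng.normal(size=(n, p))

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (n - 1)            # sample covariance: rank <= n - 1 < p, so singular

# Shrink toward a scaled identity (the Ledoit-Wolf target).
# alpha = 0.3 is a fixed illustrative intensity, not the estimated optimum.
alpha = 0.3
mu = np.trace(S) / p               # average variance across features
S_shrunk = (1 - alpha) * S + alpha * mu * np.eye(p)
```

The raw sample covariance is rank-deficient and cannot be inverted, whereas the shrunk estimate is full-rank and positive definite, which is what downstream algorithms (e.g., those requiring a matrix inverse) need.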