An efficient algorithm for approximating log-likelihood gradients when training energy-based models.
Contrastive Divergence (CD) is a training algorithm designed to make learning in energy-based probabilistic models computationally tractable. Models like Restricted Boltzmann Machines (RBMs) are trained by maximizing the likelihood of observed data, but this requires computing a partition function (a normalizing sum over every possible configuration of the model) that is exponentially expensive to evaluate exactly. CD sidesteps this problem by approximating the required gradient with a short chain of Gibbs sampling steps rather than running the Markov chain to full convergence.
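In standard RBM notation (visible units v_i, hidden units h_j, weight w_ij, learning rate η; the symbols here are the conventional ones, not taken from this article), the intractable gradient and the CD-k surrogate that replaces it can be sketched as:

```latex
% Exact log-likelihood gradient for an RBM weight:
% a data-driven term minus a model expectation that
% requires the intractable partition function.
\frac{\partial \log p(v)}{\partial w_{ij}}
  = \langle v_i h_j \rangle_{\text{data}}
  - \langle v_i h_j \rangle_{\text{model}}

% CD-k replaces the model expectation with statistics
% gathered after only k steps of Gibbs sampling:
\Delta w_{ij} \approx
  \eta \left( \langle v_i h_j \rangle_{0}
            - \langle v_i h_j \rangle_{k} \right)
```

The first expectation is cheap (clamp the visibles to data), while the second would require samples from the fully converged model distribution; truncating the chain at k steps is exactly the approximation that makes CD fast but biased.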
The mechanics of CD are straightforward in the context of RBMs. Training begins with a real data sample, which is used to compute the hidden unit activations in a forward pass. Those hidden activations are then used to reconstruct the visible layer, and the hidden units are sampled again from that reconstruction. The weight update is proportional to the difference between two outer products: the original data with its hidden activations, and the reconstruction with its hidden activations. This difference approximates the true log-likelihood gradient. Running only k steps of this process (commonly just one, yielding CD-1) makes training fast enough to be practical, even though it introduces some bias relative to exact maximum likelihood.
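The steps above can be sketched as a single CD-1 update for a binary RBM. This is a minimal illustration in NumPy, not a reference implementation; the function and variable names (`cd1_update`, `W`, `b`, `c`) are chosen here for clarity and are not from the original description.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.1):
    """One CD-1 update for a binary RBM.

    v0 : batch of visible vectors, shape (batch, n_visible)
    W  : weights, shape (n_visible, n_hidden)
    b  : visible biases; c : hidden biases
    Returns updated (W, b, c).
    """
    # Positive phase: hidden probabilities given the data,
    # then a binary sample of the hidden units.
    h0_prob = sigmoid(v0 @ W + c)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)

    # Negative phase: reconstruct the visible layer from the
    # sampled hiddens, then recompute hidden probabilities.
    v1_prob = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(v1_prob.shape) < v1_prob).astype(float)
    h1_prob = sigmoid(v1 @ W + c)

    # Update = (positive statistics) - (negative statistics),
    # i.e. the difference of the two outer products, averaged
    # over the batch.
    batch = v0.shape[0]
    dW = (v0.T @ h0_prob - v1.T @ h1_prob) / batch
    db = (v0 - v1).mean(axis=0)
    dc = (h0_prob - h1_prob).mean(axis=0)
    return W + lr * dW, b + lr * db, c + lr * dc
```

Extending this to CD-k would simply repeat the reconstruct-and-resample loop k times before collecting the negative statistics; k = 1 is the common choice precisely because the extra steps buy little accuracy for their cost.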
Introduced by Geoffrey Hinton in 2002, CD became a cornerstone technique during the mid-2000s deep learning renaissance. It enabled layer-wise pretraining of deep belief networks, allowing researchers to initialize deep architectures in a meaningful way before fine-tuning with backpropagation — a strategy that temporarily unlocked deeper networks before techniques like dropout, ReLU activations, and large datasets made end-to-end supervised training the dominant paradigm.
Although CD has been largely supplanted in modern practice by fully supervised deep learning pipelines and newer generative frameworks like VAEs and GANs, it remains historically significant and conceptually important. It demonstrated that approximate inference could be both principled and practical, and it helped revive interest in unsupervised representation learning — themes that continue to resonate in contemporary self-supervised and generative modeling research.