Converts a vector of real numbers into a normalized probability distribution over classes.
The softmax function is a mathematical operation that transforms a vector of arbitrary real-valued scores — often called logits — into a probability distribution. For each element in the input vector, softmax computes the exponential of that value and divides it by the sum of exponentials across all elements. The result is a vector of the same length where every value lies strictly between 0 and 1 and all values sum to exactly 1. This normalization makes the output directly interpretable as class probabilities, which is why softmax is the standard choice for the final layer of multi-class classification networks.
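As a concrete illustration of this computation, here is a minimal NumPy sketch (the function name and example logits are illustrative, not from any particular library). The max-subtraction step is a common numerical-stability trick: it leaves the output unchanged but keeps the exponentials from overflowing on large logits.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Map a vector of real-valued logits to a probability distribution."""
    # Subtracting the max shifts the logits without changing the result,
    # since the shift cancels in the ratio; it prevents exp() overflow.
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)        # ~[0.659, 0.242, 0.099] -- all strictly in (0, 1)
print(probs.sum())  # 1.0 (up to floating-point rounding)
```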
The mechanics of softmax have an important practical consequence: because it uses exponentials, larger input values are amplified disproportionately relative to smaller ones. This means the function tends to push the highest-scoring class toward a probability near 1 while suppressing others toward 0 — a behavior sometimes described as "winner-takes-most." The sharpness of this effect can be controlled by a temperature parameter that divides the logits before softmax is applied. High temperatures produce softer, more uniform distributions, while low temperatures produce sharper, more confident ones. Temperature scaling is widely used in knowledge distillation, language model sampling, and reinforcement learning.
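A small sketch of the temperature variant just described (the parameter name and example values are illustrative assumptions): dividing the logits by a temperature T before the softmax flattens the distribution for T > 1 and sharpens it for T < 1.

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Softmax over logits divided by a temperature parameter."""
    scaled = logits / temperature           # T > 1 flattens, T < 1 sharpens
    exps = np.exp(scaled - np.max(scaled))  # same stability trick as before
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])
print(softmax_with_temperature(logits, 0.5))  # sharper: ~[0.86, 0.12, 0.02]
print(softmax_with_temperature(logits, 5.0))  # softer:  ~[0.40, 0.33, 0.27]
```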
In neural network training, softmax is almost always paired with the cross-entropy loss function. Together they measure how far the predicted probability distribution is from the true one-hot label distribution, and their combination yields clean, well-behaved gradients that make optimization efficient. During backpropagation, the gradient of cross-entropy loss through softmax simplifies to the difference between the predicted probabilities and the true labels, which is computationally convenient and numerically stable when implemented as a fused operation.
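The gradient simplification mentioned above can be checked directly. The sketch below computes the analytic gradient (predicted probabilities minus the one-hot label) and compares it against a finite-difference estimate; the helper names are ad hoc, not from any framework.

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

def cross_entropy_grad(logits: np.ndarray, label: int) -> np.ndarray:
    """Gradient of cross-entropy loss w.r.t. the logits, fused through softmax."""
    probs = softmax(logits)
    one_hot = np.zeros_like(probs)
    one_hot[label] = 1.0
    return probs - one_hot  # the Jacobian algebra collapses to this difference

logits = np.array([2.0, 1.0, 0.1])
label = 0
analytic = cross_entropy_grad(logits, label)

# Finite-difference check of the same gradient.
def loss(l):
    return -np.log(softmax(l)[label])

eps = 1e-6
numeric = np.array([(loss(logits + eps * e) - loss(logits - eps * e)) / (2 * eps)
                    for e in np.eye(len(logits))])
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```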
Beyond classification output layers, softmax appears throughout modern deep learning architectures. The attention mechanism in Transformer models applies softmax row-wise to a matrix of scaled query-key dot products to produce attention weights, effectively letting the model decide how much focus to place on each position in a sequence. This usage has made softmax a foundational operation in natural language processing, computer vision, and multimodal learning, cementing its status as one of the most widely deployed nonlinearities in the field.
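To make the attention usage concrete, here is a minimal single-head sketch of scaled dot-product attention in NumPy. The shapes, the random inputs, and the 1/sqrt(d_k) scaling convention are illustrative assumptions; real implementations add batching, masking, and multiple heads.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Row-wise softmax over scaled query-key scores, then a weighted sum of values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (num_queries, num_keys)
    scores -= scores.max(axis=-1, keepdims=True)    # per-row stability trick
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row is a distribution
    return weights @ V                              # attend over the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 query positions, model dimension 8
K = rng.normal(size=(6, 8))  # 6 key/value positions
V = rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```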