
Softmax Function

Converts a vector of real numbers into a normalized probability distribution over classes.

Year: 1989
Generality: 796

The softmax function is a mathematical operation that transforms a vector of arbitrary real-valued scores — often called logits — into a probability distribution. For each element in the input vector, softmax computes the exponential of that value and divides it by the sum of exponentials across all elements. The result is a vector of the same length where every value lies strictly between 0 and 1 and all values sum to exactly 1. This normalization makes the output directly interpretable as class probabilities, which is why softmax is the standard choice for the final layer of multi-class classification networks.
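As a concrete illustration, here is a minimal NumPy sketch of the operation described above; the max-subtraction step is a standard numerical-stability trick rather than part of the mathematical definition, and the helper name is ours, not from any particular library.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert a vector of real-valued logits into a probability distribution."""
    # Subtracting the max leaves the result unchanged (softmax is invariant to
    # adding a constant to every logit) but prevents overflow in the exponentials.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # approximately [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```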

The mechanics of softmax have an important practical consequence: because it uses exponentials, larger input values are amplified disproportionately relative to smaller ones. This means the function tends to push the highest-scoring class toward a probability near 1 while suppressing others toward 0 — a behavior sometimes described as "winner-takes-most." The sharpness of this effect can be controlled by a temperature parameter that scales the logits before applying softmax. High temperatures produce softer, more uniform distributions, while low temperatures produce sharper, more confident ones. Temperature scaling is widely used in knowledge distillation, language model sampling, and reinforcement learning.
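A rough sketch of temperature scaling, reusing the same stabilization trick: dividing the logits by a temperature T before exponentiating softens the distribution when T > 1 and sharpens it when T < 1. The function name and example values are illustrative assumptions.

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Apply softmax to logits scaled by a temperature parameter."""
    scaled = logits / temperature
    scaled = scaled - np.max(scaled)  # numerical stability
    exps = np.exp(scaled)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
print(softmax_with_temperature(scores, temperature=0.5))  # sharper, approx. [0.864 0.117 0.019]
print(softmax_with_temperature(scores, temperature=5.0))  # softer,  approx. [0.400 0.327 0.273]
```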

In neural network training, softmax is almost always paired with the cross-entropy loss function. Together they measure how far the predicted probability distribution is from the true one-hot label distribution, and their combination yields clean, well-behaved gradients that make optimization efficient. During backpropagation, the gradient of cross-entropy loss through softmax simplifies to the difference between the predicted probabilities and the true labels, which is computationally convenient and numerically stable when implemented as a fused operation.
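A minimal sketch of that simplification for a single example with an integer class label: the gradient of the cross-entropy loss with respect to the logits reduces to the predicted probabilities minus the one-hot label vector. The helper names here are illustrative.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

def cross_entropy_grad(logits: np.ndarray, true_class: int) -> np.ndarray:
    """Gradient of cross-entropy loss w.r.t. the logits for one example."""
    probs = softmax(logits)
    one_hot = np.zeros_like(probs)
    one_hot[true_class] = 1.0
    # The fused softmax + cross-entropy gradient is simply (probabilities - labels).
    return probs - one_hot

logits = np.array([2.0, 1.0, 0.1])
print(cross_entropy_grad(logits, true_class=0))  # approximately [-0.341  0.242  0.099]
```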

Beyond classification output layers, softmax appears throughout modern deep learning architectures. The attention mechanism in Transformer models applies softmax to a matrix of query-key dot products to produce attention weights, effectively letting the model decide how much focus to place on each position in a sequence. This usage has made softmax a foundational operation in natural language processing, computer vision, and multimodal learning, cementing its status as one of the most widely deployed nonlinearities in the field.
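As an illustrative sketch, not tied to any specific framework, scaled dot-product attention applies softmax row-wise to the matrix of query-key scores so that each query's weights over the keys sum to 1:

```python
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Toy single-head attention: softmax over query-key scores weights the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # shape: (num_queries, num_keys)
    # Row-wise softmax turns each query's scores into attention weights.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V  # weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))  # 3 query positions, dimension 4
K = rng.normal(size=(5, 4))  # 5 key positions
V = rng.normal(size=(5, 4))  # 5 value vectors
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```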

Related

Multi-Class Activation

An output activation strategy enabling neural networks to classify inputs into three or more categories.

Generality: 694
Logits

Raw, unnormalized scores output by a neural network before probability conversion.

Generality: 700
Temperature

A scaling parameter that controls the randomness of a model's output distribution.

Generality: 720
Cross-Entropy Loss

A loss function measuring divergence between predicted probability distributions and true labels.

Generality: 838
Logistic Regression

A classification algorithm that models the probability of a binary outcome.

Generality: 838
Saturating Non-Linearities

Activation functions whose outputs plateau and stop responding to large input values.

Generality: 581