Raw, unnormalized scores output by a neural network before probability conversion.
Logits are the raw, unnormalized numerical outputs produced by the final linear layer of a neural network before any activation function is applied. In a classification setting, the network outputs one logit per class, and these values represent the model's unconstrained "scores" for each possible label. Because they are unbounded — potentially ranging from large negative to large positive values — logits are not directly interpretable as probabilities, but they encode the relative confidence the model assigns to each class.
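As a minimal sketch of where logits come from (the layer sizes and weights below are made up), the classification head is just a linear map from the final hidden representation to one unbounded score per class:

```python
import numpy as np

rng = np.random.default_rng(0)

num_features, num_classes = 8, 3          # hypothetical sizes for illustration
hidden = rng.normal(size=num_features)    # features from earlier layers
W = rng.normal(size=(num_features, num_classes))
b = np.zeros(num_classes)

logits = hidden @ W + b                   # one unbounded score per class
print(logits)        # entries may be negative or large positive
print(logits.sum())  # does not sum to 1: these are scores, not probabilities
```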
To convert logits into probabilities, a softmax function is typically applied, which exponentiates each logit and normalizes the results so they sum to one. This transformation preserves the relative ordering of scores while producing a valid probability distribution. In binary classification, a sigmoid function serves the same role for a single logit. Crucially, many modern training frameworks compute loss functions — such as cross-entropy — directly from logits rather than from softmax outputs, because working in log-space improves numerical stability and avoids floating-point underflow that can occur when probabilities become very small.
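A sketch of both steps, using a small hand-picked logit vector: softmax with the usual max-subtraction, and a cross-entropy computed directly from logits via the log-sum-exp trick so that tiny probabilities are never materialized.

```python
import numpy as np

def softmax(logits):
    # Subtracting the max logit leaves the result unchanged but keeps
    # the exponentials in a safe numerical range.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy_from_logits(logits, target_index):
    # Log-softmax via the log-sum-exp trick: works in log-space, so very
    # negative logits never underflow to zero probability.
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[target_index]

logits = np.array([2.0, -1.0, 0.5])
print(softmax(logits))                       # sums to 1, same ordering as the logits
print(cross_entropy_from_logits(logits, 0))  # loss if the true class is index 0
```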
The term itself originates from logistic regression, where "logit" referred to the log-odds of a probability: log(p / (1 − p)). This connection is more than etymological — logistic regression can be viewed as a single-layer neural network, and its output before the sigmoid is precisely a logit in both the classical and modern senses. As deep learning scaled up through the 2010s, the term migrated naturally into the neural network vocabulary to describe the pre-activation outputs of any classification head.
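A quick numerical check of that relationship (the value 1.25 below is arbitrary): the classical logit function, log(p / (1 − p)), is exactly the inverse of the sigmoid, so applying it to a sigmoid output recovers the pre-activation score.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # Classical definition: the log-odds of a probability.
    return np.log(p / (1.0 - p))

z = 1.25                 # a single pre-activation score
p = sigmoid(z)           # probability implied by that score, about 0.777
print(logit(p))          # recovers 1.25: logit and sigmoid are inverses
```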
Logits matter beyond training mechanics. In knowledge distillation, a student model is trained to match the teacher's logits, usually through the softened probabilities they induce, which carry richer information than hard class labels alone. In temperature scaling and calibration, logits are divided by a temperature parameter to sharpen or soften the resulting probability distribution. In language models, the logit vector over the entire vocabulary is the direct output at each token position, making logits central to decoding strategies such as beam search, top-k sampling, and nucleus sampling.
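The decoding-side uses can be sketched in a few lines; the toy vocabulary, temperature, and k below are arbitrary, and real decoders add considerably more machinery around this core.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def sample_top_k(logits, k, temperature=1.0):
    # Divide by the temperature first: T > 1 softens the distribution,
    # T < 1 sharpens it toward the argmax.
    scaled = logits / temperature
    # Keep only the k highest-scoring entries; mask the rest out so they
    # receive zero probability after the softmax.
    top = np.argpartition(scaled, -k)[-k:]
    masked = np.full_like(scaled, -np.inf)
    masked[top] = scaled[top]
    probs = softmax(masked)
    return rng.choice(len(logits), p=probs)

vocab_logits = np.array([4.0, 2.5, 2.4, 0.1, -3.0])  # toy "vocabulary" of 5 tokens
print(sample_top_k(vocab_logits, k=3, temperature=0.8))
```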