A hyperparameter controlling how large each parameter update is during optimization.
Step size is a fundamental hyperparameter in iterative optimization algorithms that determines the magnitude of each update applied to a model's parameters during training. In gradient descent and its variants, the step size—commonly called the learning rate—scales the gradient before it is subtracted from the current parameter values. At each iteration, the algorithm computes the gradient of the loss function with respect to the parameters and moves in the direction of steepest descent, with the step size governing how far to move. This single scalar (or per-parameter vector in adaptive methods) has an outsized influence on whether training converges at all, and on how quickly.
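The update rule described above can be sketched in a few lines; this is a minimal illustration, not any particular library's implementation, and the function and variable names are chosen for clarity:

```python
import numpy as np

def gradient_descent(grad, theta0, step_size, n_steps):
    """Repeatedly apply theta <- theta - step_size * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        # The step size scales the gradient before it is subtracted.
        theta = theta - step_size * grad(theta)
    return theta

# Example: minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
grad = lambda theta: 2.0 * (theta - 3.0)
theta = gradient_descent(grad, theta0=0.0, step_size=0.1, n_steps=100)
# theta approaches the minimizer at 3.
```

With step_size=0.1 each update multiplies the distance to the minimum by 0.8, so the iterate converges geometrically; a step size above 1.0 in this example would instead cause the iterates to diverge.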
Choosing an appropriate step size involves a fundamental trade-off. A value that is too large causes the optimizer to overshoot minima, producing oscillations or outright divergence; a value that is too small leads to painfully slow convergence and increases the risk of becoming trapped in suboptimal local minima or saddle points. In practice, step size is rarely held constant: techniques such as learning rate schedules (step decay, cosine annealing) and adaptive optimizers like AdaGrad, RMSProp, and Adam automatically adjust effective step sizes per parameter based on historical gradient information, substantially reducing the burden of manual tuning.
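The two schedules named above can be written as simple functions of the epoch index. This is a minimal sketch under assumed hyperparameter names (`drop_every`, `factor`, `min_lr`), not the API of any specific framework:

```python
import math

def step_decay(base_lr, epoch, drop_every=30, factor=0.1):
    # Multiply the base rate by `factor` every `drop_every` epochs.
    return base_lr * (factor ** (epoch // drop_every))

def cosine_annealing(base_lr, epoch, total_epochs, min_lr=0.0):
    # Smoothly decay from base_lr to min_lr following a half cosine wave.
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Step decay drops the rate abruptly at fixed intervals, while cosine annealing lowers it continuously, which tends to avoid the sudden loss spikes that can follow a large discrete drop.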
Step size became a central concern in machine learning as deep neural networks grew deeper and datasets larger through the 2000s and 2010s, because the sensitivity of training dynamics to this parameter scales with model complexity. Modern best practices include warm-up phases that start with a small step size before increasing it, cyclical learning rate schedules that periodically vary the rate to escape local minima, and learning rate finders that sweep values to identify a good operating range. Understanding step size is essential for anyone training neural networks, as it sits at the intersection of optimization theory and practical model performance.
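A warm-up phase is often combined with a decaying schedule in practice; the sketch below pairs a linear warm-up with cosine decay. The function name and argument names are illustrative assumptions, not a standard API:

```python
import math

def warmup_cosine_lr(step, total_steps, base_lr, warmup_steps):
    """Linear warm-up from near zero to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        # Ramp the rate up linearly during the warm-up phase.
        return base_lr * (step + 1) / warmup_steps
    # After warm-up, decay along a half cosine wave over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

The small initial steps keep early, noisy gradients from destabilizing training before the optimizer's statistics (or batch-norm estimates) have settled, after which the rate can safely peak and then anneal.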