A kernel capturing the linearized training dynamics of infinitely wide neural networks.
The Neural Tangent Kernel (NTK) is a mathematical object that characterizes how a neural network evolves during gradient-based training by linearizing the network's parameter-to-function map around its initialization. It is constructed by computing the Jacobian of the network's outputs with respect to its parameters and forming the inner product of these Jacobians across data points. In the infinite-width limit, this quantity converges to a deterministic, fixed kernel — meaning the network's training dynamics become equivalent to kernel regression with that kernel. This connection transforms the otherwise intractable nonlinear optimization of neural networks into a tractable linear system, enabling closed-form predictions about convergence rates, loss trajectories, and generalization behavior.
The NTK framework illuminates what researchers call the "lazy training" or linearized regime, where network weights move only slightly from their initialization and the learned features remain essentially fixed throughout training. In this regime, the network behaves like a kernel machine rather than a feature-learning system, and its generalization properties can be analyzed using classical tools from kernel theory and Gaussian processes. This connects directly to the Neural Network Gaussian Process (NNGP) perspective, which describes the distribution over functions induced by random initialization of wide networks. Together, these frameworks provide a coherent theoretical lens for understanding overparameterized networks, including why they can interpolate training data while still generalizing well.
Beyond the basic formulation, the NTK has been extended to convolutional architectures (CNTK), recurrent networks, and attention-based models, and has been used to study scaling laws, optimization stability, and the effects of architectural choices on trainability. Critically, the NTK also defines the boundary of its own applicability: when networks operate outside the infinite-width or lazy-training regime — as many practical deep networks do — feature learning becomes significant and NTK predictions break down. This contrast has sharpened the field's understanding of when depth and nonlinearity provide genuine representational advantages over fixed kernel methods, making the NTK a foundational reference point for modern deep learning theory.