Differential Transformer

A transformer-based architecture that incorporates differential structure (derivatives, continuous-time dynamics, or differential operators) into attention and latent representations, in order to better model continuous processes, sensitivities, and physics-constrained problems.

A Differential Transformer is a class of transformer variants that explicitly encodes differential structure into the model: parameterizing continuous-depth dynamics (Neural ODE-style layers) inside attention blocks, learning or applying differential operators in latent space, or predicting and using derivatives (Jacobians, temporal derivatives) as primary modeling targets. The design intent is to provide inductive biases that respect continuity, conservation laws, and sensitivity information, properties common to physical systems, control, and high-fidelity time series, thereby improving sample efficiency and stability when the underlying data-generating process is governed by differential equations or smooth dynamics. Architectures in this family range from transformers with continuous-time attention updates and adjoint-based backpropagation, to transformers used as learnable neural operators for partial differential equations (PDEs), to hybrids that augment token representations with derivative features or Jacobian-aware components that capture first- and higher-order behavior.
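
To make the continuous-depth idea concrete, the minimal sketch below (in JAX, assuming a single attention head, toy dimensions, and no masking or normalization; all names are illustrative rather than taken from any specific paper) treats a self-attention update as the vector field of an ODE over token states and integrates it with an adaptive solver, so gradients flow through the continuous dynamics via the solver's adjoint:

```python
# Minimal sketch of a continuous-depth attention block: token states evolve
# under an ODE whose vector field is a self-attention update. Assumptions:
# single head, no masking or layer norm, toy random weights.
import jax
import jax.numpy as jnp
from jax.experimental.ode import odeint

def attention_field(h, t, Wq, Wk, Wv):
    # Vector field dh/dt = Attention(h); t is unused here but required by odeint.
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / jnp.sqrt(q.shape[-1])
    return jax.nn.softmax(scores, axis=-1) @ v

def continuous_attention_block(h0, Wq, Wk, Wv, t1=1.0):
    # Integrate token states from t=0 to t=t1; depth is continuous,
    # and odeint backpropagates through the dynamics via the adjoint method.
    ts = jnp.array([0.0, t1])
    hs = odeint(attention_field, h0, ts, Wq, Wk, Wv)
    return hs[-1]  # token states at the final time

key = jax.random.PRNGKey(0)
d = 16                                    # toy model dimension
kq, kk, kv, kh = jax.random.split(key, 4)
Wq = jax.random.normal(kq, (d, d)) / jnp.sqrt(d)
Wk = jax.random.normal(kk, (d, d)) / jnp.sqrt(d)
Wv = jax.random.normal(kv, (d, d)) / jnp.sqrt(d)
h0 = jax.random.normal(kh, (8, d))        # 8 tokens
h1 = continuous_attention_block(h0, Wq, Wk, Wv)
```

Here the integration interval plays the role of depth: solving to t1 = 1.0 with an adaptive integrator replaces a fixed stack of discrete residual attention layers.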

Technically, Differential Transformers tie together ideas from the original transformer (self-attention and tokenization), continuous-depth models (Neural ODEs and adjoint methods), and neural operator frameworks (e.g., the Fourier Neural Operator and DeepONet) to yield models that are discretization-agnostic, can propagate gradients through continuous dynamics, and can enforce physics-informed constraints during training. In practice this requires careful numerical design: stable integrators or continuous formulations for residual/attention updates, efficient Jacobian-vector product (JVP) implementations to avoid prohibitive cost when training with derivative losses, and regularization that preserves physical invariants. Applications where this pays off include scientific ML for PDE solving and PDE-constrained inverse problems, system identification and forecasting for irregularly sampled time series, model-based control and planning where sensitivities matter, and explainable modeling where explicit derivative estimates improve interpretability. Key engineering trade-offs are computational overhead (especially for high-order derivatives), sensitivity to discretization and stiffness, and the need for appropriate datasets or physics priors to realize the theoretical advantages.
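
The Jacobian-vector-product point can be made concrete: a forward-mode JVP gives the model's sensitivity along one probe direction at roughly the cost of a single forward pass, without materializing the full Jacobian. The hedged sketch below (again in JAX; the two-layer stand-in model and the zero sensitivity target are placeholders, not drawn from the literature) builds a derivative loss this way:

```python
# Hedged sketch of a JVP-based derivative loss: supervise the sensitivity
# dy/dx along a probe direction v without forming the full Jacobian.
# The model and target here are illustrative placeholders.
import jax
import jax.numpy as jnp

def model(params, x):
    # Stand-in for a transformer forward pass on a single input vector.
    W1, W2 = params
    return jnp.tanh(x @ W1) @ W2

def deriv_loss(params, x, v, target_jvp):
    # jax.jvp returns (f(x), J_f(x) @ v) in one forward-mode pass,
    # avoiding the O(d^2) cost of materializing the Jacobian.
    _, jvp_out = jax.jvp(lambda xi: model(params, xi), (x,), (v,))
    return jnp.mean((jvp_out - target_jvp) ** 2)

key = jax.random.PRNGKey(0)
k1, k2, kx, kv = jax.random.split(key, 4)
d_in, d_hid, d_out = 8, 32, 4
params = (jax.random.normal(k1, (d_in, d_hid)) / jnp.sqrt(d_in),
          jax.random.normal(k2, (d_hid, d_out)) / jnp.sqrt(d_hid))
x = jax.random.normal(kx, (d_in,))
v = jax.random.normal(kv, (d_in,))        # probe direction
target = jnp.zeros(d_out)                 # e.g., a known physical sensitivity
loss, grads = jax.value_and_grad(deriv_loss)(params, x, v, target)
```

Because the JVP is itself differentiable, this loss can be minimized with ordinary reverse-mode training, which is what keeps derivative supervision tractable as model dimensions grow.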

First uses of the idea emerged after the transformer revolution (Vaswani et al., 2017) and the introduction of continuous-depth models (Neural ODEs, 2018); early experimental hybrids and workshop papers appeared circa 2019–2021. The term and broader interest gained clearer traction in the research community from about 2020–2024, as work on neural operators, physics-informed ML, and continuous-time sequence models converged and as demand for physics-consistent architectures in machine learning increased.

Key contributors and influences include Vaswani et al. (the Transformer, 2017) for the attention backbone; Chen et al. (Neural ODEs, 2018) for continuous-depth modeling and adjoint training; Li, Kovachki, and collaborators on neural operators (e.g., the Fourier Neural Operator), which motivated the operator-learning perspective; Raissi, Perdikaris, Karniadakis, and other proponents of physics-informed neural networks (PINNs), who established derivative-based losses and constraints; and research groups across academia and industry (notably teams at DeepMind, major university ML labs, and the scientific-ML community) that have developed continuous-time attention variants, Jacobian-aware mechanisms, and transformer-based PDE solvers.
