A framework for modeling systems through hidden states that evolve over time.
A state space model (SSM) is a mathematical framework for representing dynamic systems in which an underlying sequence of hidden (latent) states evolves over time according to a transition function, and observations are generated from those states through a separate emission function. The core structure consists of two coupled equations: a state equation that defines how the latent state evolves from one time step to the next, typically incorporating stochastic noise, and an observation equation that maps each latent state to a measurable output, also with noise. This separation between the hidden dynamics and the observed signal makes SSMs especially powerful for modeling systems where the true underlying process cannot be directly measured.
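The two coupled equations can be written compactly as follows; the symbols $f$, $g$, $w_t$, and $v_t$ are conventional notation, not fixed by the text above:

```latex
x_t = f(x_{t-1}) + w_t \quad \text{(state equation: hidden dynamics)}
y_t = g(x_t) + v_t     \quad \text{(observation equation: noisy emission)}
```

When $f$ and $g$ are linear maps and $w_t$, $v_t$ are Gaussian, this reduces to the linear-Gaussian SSM for which exact inference is tractable.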
Inference in state space models involves estimating the hidden states given a sequence of observations — a problem solved exactly for linear Gaussian systems by the Kalman filter, which provides optimal recursive state estimates in closed form. For nonlinear or non-Gaussian systems, approximate methods such as the extended Kalman filter, the unscented Kalman filter, or particle filters are used. Learning the model parameters themselves is typically handled via the Expectation-Maximization (EM) algorithm or gradient-based optimization, depending on whether the model is classical or learned end-to-end.
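As a concrete illustration, here is a minimal sketch of the Kalman filter's predict–update recursion for a linear-Gaussian SSM. The function name and argument conventions (`A`, `C`, `Q`, `R` for the transition, emission, and noise covariance matrices) are illustrative choices, not mandated by any particular library:

```python
import numpy as np

def kalman_filter(ys, A, C, Q, R, x0, P0):
    """Recursive state estimation for a linear-Gaussian SSM:
        x_t = A x_{t-1} + w_t,  w_t ~ N(0, Q)   (state equation)
        y_t = C x_t + v_t,      v_t ~ N(0, R)   (observation equation)
    Returns the filtered means E[x_t | y_1..y_t]."""
    x, P = x0, P0
    means = []
    for y in ys:
        # Predict: propagate the estimate and its covariance through the dynamics.
        x = A @ x
        P = A @ P @ A.T + Q
        # Update: correct the prediction using the new observation.
        S = C @ P @ C.T + R             # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)  # Kalman gain
        x = x + K @ (y - C @ x)
        P = P - K @ C @ P
        means.append(x.copy())
    return np.array(means)
```

For example, filtering a noisily observed random walk (`A = C = [[1.0]]`) recovers estimates whose error is substantially below the raw observation noise, which is exactly the closed-form optimality the Kalman filter guarantees in the linear-Gaussian case.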
In machine learning, state space models have gained renewed prominence as sequence modeling architectures. Structured SSMs such as S4 (Structured State Space sequence model) and its successors treat the SSM as a learnable layer that can efficiently capture long-range dependencies in sequences, offering a compelling alternative to Transformers for tasks involving long time series, audio, and genomics. These models leverage the convolutional form of linear SSMs for parallel training while retaining the recurrent form for efficient autoregressive inference.
SSMs matter because they provide a principled, interpretable structure for modeling temporal data, with strong theoretical foundations in control theory and signal processing. Their resurgence in deep learning reflects a broader interest in architectures that combine computational efficiency with the ability to model extended temporal context — properties that are difficult to achieve simultaneously with attention-based models.