A method for injecting token order information into sequence models lacking recurrence.
Positional encoding is a technique used in transformer-based neural networks to supply the model with information about where each token appears within an input sequence. Unlike recurrent architectures such as LSTMs, transformers process all tokens in parallel, meaning the model has no built-in sense of order. Without some form of positional signal, a transformer would treat "the cat sat" identically to "sat cat the" — the same tokens, just rearranged. Positional encodings solve this by augmenting each token's embedding with a vector that encodes its position, allowing the model to distinguish tokens based on where they appear.
The most common approach, introduced alongside the original transformer architecture in 2017, uses sinusoidal functions of varying frequencies to generate fixed positional vectors. Each dimension of the positional encoding corresponds to a sine or cosine wave at a different frequency, producing a unique fingerprint for each position that the model can learn to interpret. An alternative approach treats the positional encodings as learnable parameters, optimized during training just like the weights in any other layer. In both cases, the positional vectors are added directly to the input embeddings before they enter the attention mechanism, preserving dimensional compatibility while enriching the representation with sequential context.
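To make the sinusoidal scheme concrete, here is a minimal NumPy sketch following the standard 2017 formulation, PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)). The function name and the toy shapes below are illustrative choices, not taken from any particular library.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Build the fixed (seq_len, d_model) sinusoidal encoding matrix.

    PE[pos, 2i]   = sin(pos / 10000 ** (2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000 ** (2i / d_model))
    """
    positions = np.arange(seq_len)[:, np.newaxis]                    # (seq_len, 1)
    inv_freq = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)   # one frequency per dimension pair
    angles = positions * inv_freq[np.newaxis, :]                     # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# The encodings are simply added to the token embeddings before the first
# attention layer (shapes here are illustrative: 10 tokens, d_model = 64).
token_embeddings = np.random.randn(10, 64)
inputs = token_embeddings + sinusoidal_positional_encoding(10, 64)
```

Because each position's vector mixes many frequencies, nearby positions get similar encodings while distant ones diverge, which is part of what lets the model relate positions by relative offset even though the encoding itself is absolute.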
More recent work has explored relative positional encodings, which encode the distance between tokens rather than their absolute positions, and rotary positional embeddings (RoPE), which integrate positional information directly into the attention computation. These advances have proven especially valuable for handling long sequences and for improving generalization to sequence lengths not seen during training. Models like GPT, BERT, and their successors each make distinct choices about positional encoding strategy, and these choices meaningfully affect downstream performance.
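To illustrate the rotary idea, the sketch below is a simplified single-head NumPy version (names and dimensions are illustrative, and production implementations also vectorize over batches and heads). It rotates each consecutive pair of query/key dimensions by an angle proportional to the token's position, so the dot product between a query at position m and a key at position n depends only on the offset m - n.

```python
import numpy as np

def apply_rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary positional embeddings to a (seq_len, d_head) array.

    Each pair of dimensions (2i, 2i+1) is treated as a 2-D point and rotated
    by the angle pos * theta_i, where theta_i = base ** (-2i / d_head).
    """
    seq_len, d_head = x.shape
    positions = np.arange(seq_len)[:, np.newaxis]               # (seq_len, 1)
    inv_freq = base ** (-np.arange(0, d_head, 2) / d_head)      # one frequency per dimension pair
    angles = positions * inv_freq[np.newaxis, :]                # (seq_len, d_head/2)

    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin                   # standard 2-D rotation
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Rotating queries and keys (but not values) before the dot product makes the
# attention score between positions m and n a function of the offset m - n.
q = apply_rope(np.random.randn(16, 64))
k = apply_rope(np.random.randn(16, 64))
scores = q @ k.T / np.sqrt(64)   # attention logits now carry relative-position information
```

Unlike the additive schemes above, nothing is added to the embeddings here; the positional signal lives entirely inside the attention computation, which is one reason rotary and other relative schemes tend to extrapolate better to longer sequences.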
Positional encoding matters because sequence order is semantically critical in language, music, time-series data, and many other domains. Getting this signal right is foundational to a transformer's ability to model syntax, causality, and temporal structure. As transformer architectures continue to expand into vision, biology, and multimodal tasks, positional encoding strategies are evolving alongside them, remaining an active area of research.