
Envisioning is an emerging technology research institute and advisory.


2011 — 2026


Sequence Masking

Technique that selectively hides input tokens to control what a model attends to.

Year: 2017
Generality: 628

Sequence masking is a technique used in machine learning to selectively prevent certain positions in an input sequence from influencing a model's computations. It works by constructing a binary or boolean mask — a tensor of the same shape as the input — where designated positions are marked to be ignored or zeroed out during forward passes. This allows models to distinguish between meaningful tokens and irrelevant ones, such as padding added to make variable-length sequences uniform in batch processing.
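As a minimal sketch of this idea, a padding mask can be built by comparing a batch of token IDs against the padding value. The batch, the pad ID of 0, and the token values here are all illustrative assumptions, not from any particular tokenizer:

```python
import numpy as np

PAD_ID = 0  # assumed padding token id (illustrative)

# Hypothetical batch: two sequences padded to a common length of 5.
batch = np.array([
    [5, 7, 9, 0, 0],   # real length 3
    [3, 2, 8, 6, 1],   # real length 5
])

# Boolean mask with the same shape as the input:
# True where the token is real, False where it is padding.
mask = batch != PAD_ID
print(mask)
```

Downstream computations can then multiply by this mask (or use it to select positions) so padding contributes nothing.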

The most common application is padding masking, where shorter sequences in a batch are padded to match the longest sequence, and a mask is applied so the model does not treat padding tokens as real input. Without this, attention mechanisms and loss functions would incorrectly incorporate padding into their calculations, degrading model performance. A second critical form is causal masking (also called look-ahead masking), used in autoregressive models like GPT. Here, the mask is a lower-triangular matrix that prevents each position from attending to future tokens, enforcing the constraint that predictions at step t depend only on tokens at positions 1 through t.
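The lower-triangular causal mask described above can be sketched in a few lines; the sequence length of 4 is an arbitrary choice for illustration:

```python
import numpy as np

T = 4  # illustrative sequence length

# Lower-triangular boolean matrix: row i is True for columns 0..i,
# so each position may attend only to itself and earlier positions.
causal = np.tril(np.ones((T, T), dtype=bool))
print(causal.astype(int))
```

Row i of this matrix is the set of positions that step i is allowed to see, which is exactly the "positions 1 through t" constraint.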

Sequence masking became especially prominent with the rise of the Transformer architecture, introduced in the 2017 paper Attention Is All You Need by Vaswani et al. In self-attention, the mask is added to the raw attention logits before the softmax operation — masked positions receive a very large negative value (e.g., −∞), driving their post-softmax weight to zero. This elegant integration made masking a first-class citizen in modern NLP pipelines, enabling both efficient batched training and correct autoregressive decoding.
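A hedged sketch of this mechanism, using a large negative constant in place of −∞ (a common numerical stand-in; the function name and values are illustrative, not from any library):

```python
import numpy as np

def masked_softmax(logits, mask):
    """Softmax over the last axis; masked (False) positions get ~0 weight."""
    # Replace masked logits with a very large negative value so that
    # exp() underflows to zero after the softmax.
    masked_logits = np.where(mask, logits, -1e9)
    shifted = masked_logits - masked_logits.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 3.0])
mask = np.array([True, True, False])  # hide the last position
w = masked_softmax(logits, mask)
print(w)  # weight of the masked position is driven to ~0
```

Because the mask is applied to the raw logits before normalization, the surviving positions still sum to 1, so the attention distribution stays valid.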

Beyond padding and causal masking, the concept extends to span masking used in pretraining objectives like those in BERT and T5, where contiguous token spans are hidden to force the model to learn contextual representations. Sequence masking is therefore not a single technique but a family of related strategies that shape what information flows through a model at training and inference time, making it foundational to virtually all modern sequence modeling architectures.
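A toy illustration of hiding a contiguous span is sketched below. This is a simplification, not the actual BERT or T5 corruption algorithm; the sentinel `MASK_ID`, the span length, and the token values are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = -1  # hypothetical sentinel id marking a hidden token

def mask_random_span(tokens, span_len):
    """Hide one contiguous span of tokens behind MASK_ID (toy span masking)."""
    out = tokens.copy()
    # Pick a start index so the span fits inside the sequence.
    start = rng.integers(0, len(tokens) - span_len + 1)
    out[start:start + span_len] = MASK_ID
    return out

tokens = np.array([11, 12, 13, 14, 15, 16])
corrupted = mask_random_span(tokens, span_len=2)
print(corrupted)
```

A pretraining objective would then train the model to reconstruct the hidden span from the surrounding context.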

Related

Masking

Blocking certain input positions from attention to enforce valid information flow.

Generality: 694
Attention Masking

A technique that controls which positions a transformer's attention mechanism can access.

Generality: 694
Similarity Masking

Suppressing redundant or overly similar features to sharpen model focus on distinct information.

Generality: 293
MLM (Masked Language Modeling)

A pre-training objective where models learn to predict randomly hidden tokens using bidirectional context.

Generality: 694
Sequence Model

A model that learns patterns and dependencies within ordered data sequences.

Generality: 840
Self-Attention

A mechanism that lets neural networks weigh relationships between all parts of an input simultaneously.

Generality: 794