
Masking

Blocking certain input positions from attention to enforce valid information flow.

Year: 2017
Generality: 694

Masking is a technique used in neural network models, particularly transformers, to control which positions in a sequence are allowed to influence the computation at any given step. By setting the attention scores (logits) for blocked positions to negative infinity before the softmax, so that the corresponding attention weights become exactly zero, masking effectively prevents a model from "seeing" specific tokens during training or inference. This is essential for preserving the causal or structural constraints that govern how information should flow through a model.
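To make the mechanism concrete, here is a minimal NumPy sketch of scaled dot-product attention with a boolean mask; the function name masked_attention and the array shapes are illustrative assumptions rather than any particular library's API.

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with masking (illustrative sketch).

    q, k, v: (seq_len, d) arrays; mask: (seq_len, seq_len) boolean,
    True where attending is allowed, False where it is blocked.
    Assumes every row allows at least one position (true for causal
    and padding masks in practice), so the softmax is well defined.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)              # raw attention logits
    scores = np.where(mask, scores, -np.inf)   # blocked positions -> -inf
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)  # softmax: masked weights become 0
    return weights @ v
```

Because exp(-inf) is zero, the blocked positions contribute nothing to the weighted sum, which is exactly the "cannot see" guarantee described above.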

The most common form is causal masking (also called autoregressive masking), used in decoder-style models like GPT. Here, a triangular mask ensures that the output at position t can attend only to positions 1 through t, never to future tokens, so the prediction of token t+1 depends solely on what precedes it. This mirrors the real-world constraint that a language model generating text cannot know what comes next before it has generated it. Without this mask, a model trained on next-token prediction could trivially cheat by attending to the answer directly.
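A lower-triangular boolean matrix is all it takes to express this constraint; a short sketch (the sequence length is arbitrary):

```python
import numpy as np

seq_len = 5
# True where attention is allowed: row t may look at columns 0..t only.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(causal_mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```

Passed to a routine like the masked_attention sketch above, this matrix zeroes every weight above the diagonal, so no position receives information from its future.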

A second major form is padding masking, which prevents the model from attending to padding tokens added to make variable-length sequences uniform within a batch. A third, more distinctive form is masked language modeling (MLM), introduced with BERT in 2018. In MLM, random tokens in the input are replaced with a special [MASK] token, and the model is trained to predict the original values. Unlike causal masking, MLM allows bidirectional attention over the unmasked context, enabling richer representations suited for understanding tasks rather than generation.
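The sketch below illustrates both forms under stated assumptions: a hypothetical pad id of 0 for the padding mask, and a BERT-style 80/10/10 corruption recipe in a hypothetical mlm_corrupt helper. The label value -100 follows the common convention of loss functions that skip that index, and the mask_id and vocab_size values in the usage line echo BERT-base's tokenizer purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Padding mask: hypothetical pad id 0. False marks positions that
# attention (and the loss) should ignore in a padded batch.
pad_id = 0
batch = np.array([[17, 42, 9, pad_id, pad_id],
                  [5, 31, 88, 64, 12]])
padding_mask = batch != pad_id

def mlm_corrupt(token_ids, mask_id, vocab_size, p=0.15):
    """BERT-style corruption: select ~p of positions as prediction targets;
    of those, 80% become [MASK], 10% a random token, 10% stay unchanged."""
    corrupted = token_ids.copy()
    labels = np.full_like(token_ids, -100)    # -100: position ignored by the loss
    chosen = rng.random(token_ids.shape) < p  # real code also excludes pads/special tokens
    labels[chosen] = token_ids[chosen]
    roll = rng.random(token_ids.shape)
    corrupted[chosen & (roll < 0.8)] = mask_id
    rand_pos = chosen & (roll >= 0.8) & (roll < 0.9)
    corrupted[rand_pos] = rng.integers(1, vocab_size, rand_pos.sum())
    return corrupted, labels

corrupted, labels = mlm_corrupt(batch, mask_id=103, vocab_size=30522)
```

The model then attends bidirectionally over the corrupted sequence and is trained to recover the original tokens at the selected positions only.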

Masking matters because it is the mechanism that enforces the learning objective itself. Without causal masks, autoregressive models cannot learn proper sequential dependencies; without MLM masks, BERT-style pretraining loses its self-supervised signal. As transformers have become the dominant architecture across NLP, vision, and multimodal AI, masking strategies have grown more sophisticated—spanning sparse attention masks, structured dropout masks, and task-specific attention patterns—making masking one of the most practically consequential design choices in modern deep learning.

Related

Attention Masking

A technique that controls which positions a transformer's attention mechanism can access.

Generality: 694
Sequence Masking

Technique that selectively hides input tokens to control what a model attends to.

Generality: 628
Similarity Masking

Suppressing redundant or overly similar features to sharpen model focus on distinct information.

Generality: 293
MLM (Masked Language Modeling)

A pre-training objective where models learn to predict randomly hidden tokens using bidirectional context.

Generality: 694
Causal Transformer

A transformer variant that enforces temporal causality by masking future information during training.

Generality: 453
Attention Block

A neural network module that selectively weighs input elements by their contextual relevance.

Generality: 752