
MLM (Masked Language Modeling)

A pre-training objective where models learn to predict randomly hidden tokens using bidirectional context.

Year: 2018 · Generality: 694

Masked Language Modeling is a self-supervised pre-training technique in which a portion of tokens in an input sequence — typically around 15% — are randomly replaced with a special [MASK] token, and the model is trained to reconstruct the original tokens from the surrounding context. Unlike traditional autoregressive language modeling, which predicts tokens strictly left-to-right, MLM allows the model to attend to context on both sides of each masked position simultaneously. This bidirectional attention mechanism enables the model to build richer, more nuanced representations of meaning, capturing how words relate to their full linguistic environment rather than just their preceding history.
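For illustration, a pre-trained masked language model can be queried to fill a hidden position directly from context on both sides. The following is a minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available (both are assumptions, not part of this entry):

# Minimal sketch: asking a pre-trained MLM to fill a masked position.
# Assumes the Hugging Face `transformers` package and that the
# `bert-base-uncased` checkpoint can be downloaded.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model uses context on both sides of [MASK] to rank candidate tokens.
for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))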

The mechanics of MLM are straightforward but powerful. During training, the model receives a partially obscured sequence and must predict the identity of each masked token using the unmasked tokens as evidence. The loss is computed only over the masked positions, which forces the model to develop deep contextual understanding rather than simply memorizing surface-level patterns. In practice, the selected tokens are not all masked: in BERT's original recipe, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged, which prevents the model from learning to rely too heavily on the mask signal itself.
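The selection-and-replacement step can be sketched in a few lines. The example below is a simplified version in PyTorch (an assumption; it also ignores details such as excluding special tokens like [CLS] and [SEP]), where the value -100 marks positions that PyTorch's standard cross-entropy loss ignores:

# Simplified BERT-style masking: select ~15% of positions, then apply
# the 80% [MASK] / 10% random / 10% unchanged replacement scheme.
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Choose which positions the model must predict.
    selected = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~selected] = -100  # loss is computed only over selected positions

    # 80% of selected positions are replaced with the [MASK] token.
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id

    # Half of the remaining 20% (10% overall) become a random token.
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & selected & ~masked
    input_ids[randomized] = torch.randint(vocab_size, labels.shape)[randomized]

    # The final 10% keep their original token, so the model cannot rely
    # solely on the presence of [MASK] as a signal.
    return input_ids, labels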

MLM became a cornerstone of modern NLP with the introduction of BERT (Bidirectional Encoder Representations from Transformers) by Google researchers in 2018. BERT demonstrated that a model pre-trained with MLM on large text corpora could be fine-tuned with minimal task-specific data to achieve state-of-the-art results across a wide range of benchmarks, including question answering, sentiment analysis, and named entity recognition. This transfer learning paradigm — pre-train on raw text, fine-tune on labeled data — reshaped how practitioners approach NLP problems.
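As a rough sketch of that paradigm (again assuming the Hugging Face transformers library; the checkpoint name, label count, and toy examples are placeholders), a pre-trained encoder receives a fresh classification head and is then trained on a small labeled set:

# Sketch of the pre-train / fine-tune paradigm: reuse an MLM-pre-trained
# BERT encoder and attach a new classification head for a labeled task.
# Assumes the Hugging Face `transformers` package; training loop omitted.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g. positive / negative sentiment
)

# A toy labeled batch; in practice this would come from a task dataset.
batch = tokenizer(["great movie", "terrible plot"], padding=True, return_tensors="pt")
outputs = model(**batch, labels=torch.tensor([1, 0]))
outputs.loss.backward()  # fine-tuning then proceeds with any standard optimizer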

Since BERT, MLM has inspired numerous variants and extensions. RoBERTa refined the training procedure with dynamic masking and larger batches. SpanBERT extended masking to contiguous spans of tokens to better capture phrase-level semantics. Other models adapted MLM for multilingual, domain-specific, and multimodal settings. Despite the rise of decoder-only architectures trained with causal language modeling, MLM remains a widely used and well-understood pre-training strategy, particularly for tasks requiring deep bidirectional text understanding.

Related

Masking
Blocking certain input positions from attention to enforce valid information flow.
Generality: 694

Attention Masking
A technique that controls which positions a transformer's attention mechanism can access.
Generality: 694

Sequence Masking
Technique that selectively hides input tokens to control what a model attends to.
Generality: 628

MLLMs (Multimodal Large Language Models)
AI systems that understand and generate content across text, images, audio, and more.
Generality: 794

Self-Supervised Pretraining
A technique where models learn rich representations from unlabeled data before fine-tuning on specific tasks.
Generality: 794

Replaced Token Detection
A self-supervised task where models learn to identify intentionally substituted tokens in sequences.
Generality: 339