
MLM (Masked Language Modeling)

A pre-training objective where models learn to predict randomly hidden tokens using bidirectional context.

Year: 2018 · Generality: 694

Masked Language Modeling is a self-supervised pre-training technique in which a portion of tokens in an input sequence — typically around 15% — are randomly replaced with a special [MASK] token, and the model is trained to reconstruct the original tokens from the surrounding context. Unlike traditional autoregressive language modeling, which predicts tokens strictly left-to-right, MLM allows the model to attend to context on both sides of each masked position simultaneously. This bidirectional attention mechanism enables the model to build richer, more nuanced representations of meaning, capturing how words relate to their full linguistic environment rather than just their preceding history.
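For illustration, a pre-trained masked language model can be queried to fill a hidden position directly from context on both sides. The following is a minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available (both are assumptions, not part of this entry):

# Minimal sketch: asking a pre-trained MLM to fill a masked position.
# Assumes the Hugging Face `transformers` package and that the
# `bert-base-uncased` checkpoint can be downloaded.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model uses context on both sides of [MASK] to rank candidate tokens.
for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))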

The mechanics of MLM are straightforward but powerful. During training, the model receives a partially obscured sequence and must predict the identity of each masked token using the unmasked tokens as evidence. The loss is computed only over the masked positions, which forces the model to develop deep contextual understanding rather than simply memorizing surface-level patterns. In practice, the selected tokens are not all masked: in BERT's original recipe, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged, which prevents the model from learning to rely too heavily on the mask signal itself.
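The selection-and-replacement step can be sketched in a few lines. The example below is a simplified version in PyTorch (an assumption; it also ignores details such as excluding special tokens like [CLS] and [SEP]), where the value -100 marks positions that PyTorch's standard cross-entropy loss ignores:

# Simplified BERT-style masking: select ~15% of positions, then apply
# the 80% [MASK] / 10% random / 10% unchanged replacement scheme.
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Choose which positions the model must predict.
    selected = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~selected] = -100  # loss is computed only over selected positions

    # 80% of selected positions are replaced with the [MASK] token.
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id

    # Half of the remaining 20% (10% overall) become a random token.
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & selected & ~masked
    input_ids[randomized] = torch.randint(vocab_size, labels.shape)[randomized]

    # The final 10% keep their original token, so the model cannot rely
    # solely on the presence of [MASK] as a signal.
    return input_ids, labels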

MLM became a cornerstone of modern NLP with the introduction of BERT (Bidirectional Encoder Representations from Transformers) by Google researchers in 2018. BERT demonstrated that a model pre-trained with MLM on large text corpora could be fine-tuned with minimal task-specific data to achieve state-of-the-art results across a wide range of benchmarks, including question answering, sentiment analysis, and named entity recognition. This transfer learning paradigm — pre-train on raw text, fine-tune on labeled data — reshaped how practitioners approach NLP problems.
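As a rough sketch of that paradigm (again assuming the Hugging Face transformers library; the checkpoint name, label count, and toy examples are placeholders), a pre-trained encoder receives a fresh classification head and is then trained on a small labeled set:

# Sketch of the pre-train / fine-tune paradigm: reuse an MLM-pre-trained
# BERT encoder and attach a new classification head for a labeled task.
# Assumes the Hugging Face `transformers` package; training loop omitted.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g. positive / negative sentiment
)

# A toy labeled batch; in practice this would come from a task dataset.
batch = tokenizer(["great movie", "terrible plot"], padding=True, return_tensors="pt")
outputs = model(**batch, labels=torch.tensor([1, 0]))
outputs.loss.backward()  # fine-tuning then proceeds with any standard optimizer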

Since BERT, MLM has inspired numerous variants and extensions. RoBERTa refined the training procedure with dynamic masking and larger batches. SpanBERT extended masking to contiguous spans of tokens to better capture phrase-level semantics. Other models adapted MLM for multilingual, domain-specific, and multimodal settings. Despite the rise of decoder-only architectures trained with causal language modeling, MLM remains a widely used and well-understood pre-training strategy, particularly for tasks requiring deep bidirectional text understanding.

Related

Masking
Blocking certain input positions from attention to enforce valid information flow.
Generality: 694

Attention Masking
A technique that controls which positions a transformer's attention mechanism can access.
Generality: 694

Sequence Masking
Technique that selectively hides input tokens to control what a model attends to.
Generality: 628

MLLMs (Multimodal Large Language Models)
AI systems that understand and generate content across text, images, audio, and more.
Generality: 794

Self-Supervised Pretraining
A technique where models learn rich representations from unlabeled data before fine-tuning on specific tasks.
Generality: 794

Replaced Token Detection
A self-supervised task where models learn to identify intentionally substituted tokens in sequences.
Generality: 339