
Replaced Token Detection

A self-supervised task where models learn to identify intentionally substituted tokens in sequences.

Year: 2020 · Generality: 339

Replaced token detection is a self-supervised pre-training objective in which a model is trained to identify which tokens in an input sequence have been swapped out for plausible but incorrect alternatives. Unlike masked language modeling, where tokens are hidden and the model must predict what belongs in each gap, replaced token detection presents a complete sequence — some tokens genuine, others substituted — and asks the model to classify each token as real or fake. This binary discrimination task is computationally efficient because the model receives a learning signal from every token in the sequence, not just the small fraction that would typically be masked.
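To make the task concrete, here is a minimal PyTorch sketch of what a single training example looks like; the token IDs and the replaced position are invented for illustration:

```python
import torch

# Toy token IDs for "[CLS] this movie was great [SEP]" (IDs are made up).
original  = torch.tensor([101, 2023, 3185, 2001, 2307, 102])
corrupted = original.clone()
corrupted[4] = 2919  # swap "great" for a plausible but incorrect alternative

# The training target is one binary label per token: 1 = replaced, 0 = original.
labels = (corrupted != original).long()
print(labels)  # tensor([0, 0, 0, 0, 1, 0])
```

Every position carries a label, so every position contributes to the loss.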

The mechanism relies on a two-component architecture. A smaller generator network, often trained with a masked language modeling objective, proposes replacement tokens that are contextually plausible but incorrect. A larger discriminator network then processes the resulting sequence and learns to detect which tokens were swapped. Because the generator produces believable substitutions rather than random noise, the discriminator must develop a nuanced understanding of syntax, semantics, and contextual coherence to succeed. This setup was central to the ELECTRA model introduced by Google Research in 2020, which demonstrated that the approach could match or outperform BERT-scale models using significantly less compute.
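The following is a minimal, self-contained PyTorch sketch of this generator-discriminator setup, loosely modeled on the ELECTRA recipe; the tiny model sizes, vocabulary, masking rate, and the discriminator loss weight of 50 (the value reported in the ELECTRA paper) are illustrative assumptions rather than a faithful reproduction:

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Minimal transformer encoder standing in for a BERT-style model
    (positional embeddings and other details omitted for brevity)."""
    def __init__(self, vocab_size, d_model, n_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, ids):
        return self.encoder(self.embed(ids))

vocab_size, mask_id = 1000, 0
generator     = TinyEncoder(vocab_size, d_model=64,  n_layers=2)   # small generator
gen_head      = nn.Linear(64, vocab_size)                          # MLM head over the vocab
discriminator = TinyEncoder(vocab_size, d_model=128, n_layers=4)   # larger discriminator
disc_head     = nn.Linear(128, 1)                                  # per-token real/fake head

def rtd_step(ids, mask_prob=0.15):
    # 1. Mask a random subset of positions; the generator fills them in.
    mask = torch.rand(ids.shape) < mask_prob
    gen_logits = gen_head(generator(ids.masked_fill(mask, mask_id)))
    samples = torch.distributions.Categorical(logits=gen_logits).sample()

    # 2. Corrupted input: generator samples at masked positions, originals elsewhere.
    #    A sample that happens to match the original token counts as "real".
    corrupted = torch.where(mask, samples, ids)
    labels = (corrupted != ids).float()

    # 3. The discriminator classifies EVERY token as original (0) or replaced (1).
    disc_logits = disc_head(discriminator(corrupted)).squeeze(-1)
    disc_loss = nn.functional.binary_cross_entropy_with_logits(disc_logits, labels)

    # The generator trains with an ordinary MLM loss on the masked positions;
    # no gradient reaches it through the (non-differentiable) sampling step.
    gen_loss = nn.functional.cross_entropy(gen_logits[mask], ids[mask])
    return gen_loss + 50.0 * disc_loss

ids = torch.randint(1, vocab_size, (2, 16))  # a fake batch of token IDs
rtd_step(ids).backward()
```

After pre-training, the generator is discarded and only the discriminator is fine-tuned on downstream tasks, which is the deployment pattern the ELECTRA paper describes.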

The appeal of replaced token detection lies in its sample efficiency. Standard masked language modeling computes loss only on the roughly 15% of tokens that are masked, leaving the majority of the sequence unused for learning. Replaced token detection turns every token into a training signal, making each forward pass substantially more informative. This translates into faster convergence and strong performance on downstream NLP benchmarks even when pre-training budgets are constrained.
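A back-of-envelope calculation makes the difference in signal density concrete; the sequence length of 512 is an assumption, chosen as a typical BERT-scale value:

```python
seq_len = 512                      # assumed pre-training sequence length
mlm_signal = int(0.15 * seq_len)   # MLM: loss from 76 masked positions
rtd_signal = seq_len               # RTD: loss from all 512 positions
print(rtd_signal / mlm_signal)     # ~6.7x more positions contribute per pass
```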

Beyond efficiency, the task encourages models to build richer contextual representations because distinguishing a plausible replacement from the original token demands fine-grained language understanding. The approach has influenced subsequent work in self-supervised learning across both text and other modalities, establishing replaced token detection as a meaningful alternative to masking-based objectives in the design of foundation models.

Related

Next Token Prediction

A training objective where models learn to predict the next token in a sequence.

Generality: 794
MLM (Masked Language Modeling)

A pre-training objective where models learn to predict randomly hidden tokens using bidirectional context.

Generality: 694
NTP (Next Token Prediction)

A training objective where language models learn to predict the next token in a sequence.

Generality: 795
Token Speculation Techniques

Methods that predict multiple candidate tokens in parallel to accelerate text generation.

Generality: 450
Self-Reasoning Token

Specialized tokens that train language models to anticipate and plan for future outputs.

Generality: 104
Multi-Token Prediction

A generation strategy where language models predict multiple output tokens simultaneously.

Generality: 380