BERT (Bidirectional Encoder Representations from Transformers)

A transformer-based model that understands language by reading text in both directions simultaneously.

Year: 2018
Generality: 834

BERT is a large-scale language representation model developed by Google AI in 2018 that fundamentally changed how neural networks process and understand natural language. Unlike earlier sequential models such as LSTMs or unidirectional transformers, BERT reads entire sequences of text simultaneously, attending to both left and right context for every token at once. This bidirectional approach allows the model to build richer, context-sensitive representations — the word "bank" in "river bank" and "bank account" will produce meaningfully different embeddings depending on surrounding words, something unidirectional models struggled to achieve.
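To make the "bank" contrast concrete, here is a minimal sketch (assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint, neither of which this entry specifies) that extracts BERT's vector for "bank" in two different sentences and compares them with cosine similarity.

```python
# Sketch: compare BERT's contextual embeddings of "bank" in two contexts.
# Assumes: pip install torch transformers (bert-base-uncased is a public checkpoint).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_embedding(sentence: str) -> torch.Tensor:
    """Return the contextual vector BERT assigns to the token 'bank'."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = (inputs["input_ids"][0] == bank_id).nonzero()[0].item()
    return hidden[position]

river = bank_embedding("She sat on the river bank and watched the water.")
money = bank_embedding("He opened a new bank account to save money.")

print(f"cosine similarity: {torch.cosine_similarity(river, money, dim=0).item():.3f}")
```

A similarity well below 1.0 is the expected outcome here, since the encoder mixes each token's representation with its surrounding context rather than assigning one static vector per word.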

BERT is built on the Transformer encoder architecture and trained using two self-supervised objectives: Masked Language Modeling (MLM), where random tokens are hidden and the model must predict them from context, and Next Sentence Prediction (NSP), where the model learns to determine whether two sentences naturally follow each other. These pretraining tasks require no labeled data and allow BERT to absorb broad linguistic knowledge from massive text corpora. The resulting pretrained model can then be fine-tuned on specific downstream tasks — question answering, named entity recognition, sentiment analysis, textual entailment — with relatively small labeled datasets and minimal architectural changes.
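As an illustrative sketch of the MLM objective (again assuming Hugging Face transformers and the bert-base-uncased weights, which are not part of this entry), the snippet below masks a single token and lets the pretrained model predict it from the surrounding bidirectional context.

```python
# Sketch: Masked Language Modeling with a pretrained BERT checkpoint.
# Assumes: pip install transformers (plus a backend such as torch).
from transformers import pipeline

# BERT's uncased tokenizer uses the literal string "[MASK]" as its mask token.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

predictions = unmasker("The doctor prescribed a course of [MASK] for the infection.")
for p in predictions[:3]:
    # Each prediction carries the filled-in token and the model's score.
    print(f"{p['token_str']:>15}  score={p['score']:.3f}")
```

The top-ranked completions are plausible precisely because the model conditions on the words both before and after the mask, which is what the bidirectional pretraining objective rewards.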

BERT's impact on NLP benchmarks was immediate and dramatic. Upon release, it achieved state-of-the-art results on eleven NLP tasks, including the GLUE and SQuAD benchmarks, often by significant margins. This demonstrated that deep bidirectional pretraining was far more powerful than task-specific architectures trained from scratch, validating the "pretrain then fine-tune" paradigm that now dominates the field. Google also integrated BERT into its search engine, marking one of the most visible real-world deployments of a language model at scale.

BERT catalyzed an explosion of follow-on research. Models like RoBERTa, ALBERT, DistilBERT, and domain-specific variants such as BioBERT and SciBERT refined its training procedures, efficiency, and applicability. More broadly, BERT established the blueprint that later large language models — including GPT-3 and beyond — built upon, cementing the transformer-based pretrained model as the dominant paradigm in modern NLP.

Related

Transformer

A neural network architecture using self-attention to process sequential data in parallel.

Generality: 900
GPT (Generative Pre-Trained Transformer)

A transformer-based language model pre-trained to generate coherent, human-like text.

Generality: 865
Contextual Embedding

Word representations that dynamically shift meaning based on surrounding context.

Generality: 752
bGPT (Byte-Level Transformer)

A GPT variant that processes raw bytes instead of tokenized text or subwords.

Generality: 101
Encoder-Decoder Transformer

A transformer architecture that encodes input sequences and decodes them into outputs.

Generality: 722
MLM (Masked Language Modeling)

A pre-training objective where models learn to predict randomly hidden tokens using bidirectional context.

Generality: 694