Envisioning is an emerging technology research institute and advisory.

PML (Protein Language Model)

Transformer-based models that learn biological meaning from protein sequence data.

Year: 2019
Generality: 339
Protein language models (PMLs) are self-supervised deep learning models—most commonly transformer architectures—trained on hundreds of millions of protein sequences drawn from databases like UniProt. Rather than learning from labeled examples, they are trained using objectives borrowed from natural language processing, such as masked token prediction, where the model learns to reconstruct randomly hidden amino acids from their surrounding context. Through this process, the model implicitly learns the statistical grammar of protein sequences, capturing evolutionary constraints, structural tendencies, and functional signatures encoded in the patterns of amino acid co-occurrence.
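As a minimal sketch of the masked token prediction objective described above, the following prepares one training example by hiding random residues and recording what the model would be asked to reconstruct. The `<mask>` token name, the 15% masking rate (a convention borrowed from BERT-style text models), and the example sequence are illustrative assumptions, not details from any specific protein model.

```python
import random

MASK = "<mask>"  # hypothetical mask token name; real tokenizers define their own


def mask_sequence(seq, mask_prob=0.15, rng=None):
    """Hide random residues and return (masked_tokens, targets).

    targets maps each masked position to the original amino acid,
    which is what the model is trained to predict from context.
    """
    rng = rng or random.Random()
    tokens = list(seq)
    targets = {}
    for i, residue in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = residue
            tokens[i] = MASK
    return tokens, targets


# Illustrative sequence, one letter per amino acid.
sequence = "MKTAYIAKQRQISFVKSHFSRQ"
tokens, targets = mask_sequence(sequence, rng=random.Random(0))
```

The training loss would then be computed only at the positions stored in `targets`. Production pipelines typically add refinements this sketch omits, such as occasionally substituting a random residue instead of the mask token.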

The representations produced by PMLs are dense, contextual embeddings that encode far more than simple sequence composition. Because proteins that share a common ancestor tend to preserve functionally important residues across billions of years of evolution, a model trained on enough sequences learns which positions are conserved, which are variable, and how changes at one site relate to changes elsewhere. These embeddings transfer remarkably well to downstream tasks: fine-tuned or zero-shot PMLs achieve strong performance on remote homology detection, secondary and tertiary structure prediction, functional annotation, and variant effect scoring—predicting, for instance, whether a mutation is likely to be tolerated or deleterious.
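One common way to turn a model's predictions into a variant effect score is a log-odds ratio: compare the probability assigned to the mutant residue with the probability assigned to the wild-type residue at the same position. The sketch below illustrates only that arithmetic. The function `toy_position_probs` is a hypothetical stand-in that uses smoothed residue frequencies from a handful of made-up related sequences; a real protein language model would instead supply these probabilities from its prediction at the masked position.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_position_probs(family, pos, pseudocount=1.0):
    """Stand-in for a model's prediction at `pos` (assumption, not a real PML).

    Uses smoothed residue frequencies from equal-length related sequences.
    """
    counts = Counter(seq[pos] for seq in family)
    total = len(family) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def variant_score(probs, wild_type, mutant):
    """Log-odds of mutant vs. wild type; more negative suggests less tolerated."""
    return math.log(probs[mutant]) - math.log(probs[wild_type])


# Made-up sequences: position 0 is fully conserved, position 2 varies.
family = ["MKTAY", "MKSAY", "MKTAF", "MRTAY"]
conserved_score = variant_score(toy_position_probs(family, 0), "M", "W")
variable_score = variant_score(toy_position_probs(family, 2), "T", "S")
```

In this toy setup, the substitution at the conserved position receives a lower score than the substitution at the variable position, mirroring the intuition that conserved sites tolerate change less well.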

PMLs also support generative protein design. By treating sequence generation as conditional language modeling, researchers can sample novel sequences likely to fold into desired structures or exhibit target functions, narrowing the set of candidates that must be screened experimentally. Landmark models in this space include ESM-1b, ESM-2, the ProtTrans family (which includes ProtBERT), and ESM3, which together indicate that scaling model size and training data tends to improve performance, analogous to the scaling laws observed in large language models for text.
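The sampling loop behind language-model-style sequence generation can be sketched as follows. Everything model-specific here is an assumption for illustration: `toy_next_logits` is a hypothetical placeholder for a trained autoregressive model, the `*` end-of-sequence token is invented, and conditioning on a target structure or function is omitted for brevity.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
EOS = "*"  # hypothetical end-of-sequence token


def toy_next_logits(prefix):
    """Placeholder for a trained model: one logit per token given the prefix."""
    logits = {aa: 0.0 for aa in AMINO_ACIDS}
    logits[EOS] = -2.0 + 0.2 * len(prefix)  # stopping grows likelier with length
    if prefix:
        logits[prefix[-1]] -= 1.0  # mildly discourage immediate repeats
    return logits


def sample_sequence(next_logits, max_len=50, temperature=1.0, rng=None):
    """Sample residues one at a time until EOS or max_len is reached."""
    rng = rng or random.Random()
    seq = ""
    while len(seq) < max_len:
        logits = next_logits(seq)
        tokens = list(logits)
        top = max(logits.values())  # subtract the max for numerical stability
        weights = [math.exp((logits[t] - top) / temperature) for t in tokens]
        token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == EOS:
            break
        seq += token
    return seq


designed = sample_sequence(toy_next_logits, max_len=30, temperature=0.8,
                           rng=random.Random(1))
```

Lower temperatures concentrate sampling on the highest-scoring residues, while higher temperatures produce more diverse sequences. In a real design workflow, sampled candidates would still need filtering and experimental validation.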

Despite their power, PMLs have important limitations. Most operate purely on sequence and do not explicitly model three-dimensional physics, thermodynamic stability, or the geometric constraints of protein folding. They can also reflect biases in training databases, which oversample certain organism types and protein families. For this reason, PMLs are often combined with multiple sequence alignments, structural encoders, or physics-based scoring to form hybrid systems that leverage the complementary strengths of each approach.

Related

LLM (Large Language Model)

Massive neural networks trained on text to understand and generate human language.

Generality: 905
DLMs (Deep Language Models)

Deep neural networks trained to understand, generate, and translate human language.

Generality: 796
MLLMs (Multimodal Large Language Models)

AI systems that understand and generate content across text, images, audio, and more.

Generality: 794
MLM (Masked Language Modeling)

A pre-training objective where models learn to predict randomly hidden tokens using bidirectional context.

Generality: 694
Sequence Model

A model that learns patterns and dependencies within ordered data sequences.

Generality: 840
L2M (Large Memory Model)

A decoder-only Transformer with addressable auxiliary memory enabling reasoning far beyond its attention window.

Generality: 189