Transformer-based models that learn biological meaning from protein sequence data.
Protein language models (PLMs) are self-supervised deep learning models, most commonly transformer architectures, trained on hundreds of millions of protein sequences drawn from databases such as UniProt. Rather than learning from labeled examples, they are trained with objectives borrowed from natural language processing, such as masked token prediction, in which the model learns to reconstruct randomly hidden amino acids from their surrounding context. Through this process, the model implicitly learns the statistical grammar of protein sequences, capturing evolutionary constraints, structural tendencies, and functional signatures encoded in patterns of amino acid co-occurrence.
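The masking step of this objective can be illustrated with a minimal sketch. It is not the training code of any particular model: the `mask_sequence` helper, the `<mask>` token string, and the 15% masking rate (a common default in BERT-style training) are assumptions chosen for illustration, and production recipes typically add further corruption rules that are omitted here.

```python
import random

MASK = "<mask>"  # hypothetical mask token; real tokenizers define their own


def mask_sequence(sequence, mask_rate=0.15, seed=0):
    """Hide a random subset of residues and remember what was hidden.

    Returns (tokens, targets): tokens is the corrupted input the model would
    see, targets maps each masked position to the original amino acid the
    model would be trained to predict.
    """
    rng = random.Random(seed)
    tokens = list(sequence)
    # Mask at least one residue so that short sequences still yield a target.
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {}
    for i in positions:
        targets[i] = tokens[i]
        tokens[i] = MASK
    return tokens, targets


if __name__ == "__main__":
    tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
    print(" ".join(tokens))
    print(targets)
```

During training, the loss would be computed only at the positions listed in `targets`, so the model is rewarded for inferring hidden residues from the visible context on both sides.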
The representations produced by PLMs are dense, contextual embeddings that encode far more than simple sequence composition. Because proteins that share a common ancestor tend to preserve functionally important residues across billions of years of evolution, a model trained on enough sequences learns which positions are conserved, which are variable, and how changes at one site relate to changes elsewhere. These embeddings transfer well to downstream tasks: fine-tuned or zero-shot PLMs achieve strong performance on remote homology detection, secondary and tertiary structure prediction, functional annotation, and variant effect scoring, for instance predicting whether a mutation is likely to be tolerated or deleterious.
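One widely used zero-shot heuristic for variant effect scoring compares the probability the model assigns to the mutant residue with the probability it assigns to the wild-type residue at a masked position. The sketch below shows only that arithmetic. The `log_odds_score` helper is hypothetical, and the probability tables are made-up numbers standing in for real model output.

```python
import math


def log_odds_score(position_probs, wild_type, mutant):
    """Log-odds of the mutant versus the wild-type residue at one position.

    position_probs maps amino acids to the probability predicted for the
    masked position. Negative scores mean the model finds the mutant less
    plausible than the wild type.
    """
    return math.log(position_probs[mutant]) - math.log(position_probs[wild_type])


# Made-up prediction at a strongly conserved glycine (illustrative values only).
conserved_site = {"G": 0.90, "A": 0.06, "S": 0.03, "P": 0.01}
# Made-up prediction at a variable position where several residues are plausible.
variable_site = {"K": 0.30, "R": 0.28, "E": 0.22, "Q": 0.20}

conserved_score = log_odds_score(conserved_site, "G", "P")  # strongly negative
variable_score = log_odds_score(variable_site, "K", "R")    # close to zero
```

Under this heuristic, a substitution at the conserved site receives a much lower score than a substitution at the variable site, which matches the intuition that conserved positions tolerate fewer changes. How well such scores track experimental measurements depends on the model and the protein.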
PLMs also support generative protein design. By treating sequence generation as conditional language modeling, researchers can sample novel sequences that are likely to fold into desired structures or exhibit target functions, reducing the amount of experimental screening required. Landmark PLMs include ESM-1b, ESM-2, ESM3, and the ProtTrans family (which includes ProtBERT); together they suggest that scaling model size and training data tends to improve performance, analogous to the scaling laws observed in large language models for text.
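The sampling loop behind this kind of generation can be sketched as follows. Everything here is illustrative: `toy_next_residue_logits` is a hand-written stand-in for a trained model, not a real one, and conditioning is reduced to a sequence prompt, whereas practical systems may also condition on structural or functional information.

```python
import math
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
HYDROPHOBIC = set("AILMFVW")


def toy_next_residue_logits(prefix):
    """Hypothetical stand-in for a trained model: one logit per amino acid.

    It mildly favours residues from the same class (hydrophobic or not) as
    the previous residue. A real model would compute logits from the prefix.
    """
    prev_hydrophobic = bool(prefix) and prefix[-1] in HYDROPHOBIC
    return [1.0 if (aa in HYDROPHOBIC) == prev_hydrophobic else 0.0
            for aa in AMINO_ACIDS]


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(length, prompt="M", temperature=1.0, seed=0):
    """Extend the prompt one residue at a time until it reaches `length`."""
    rng = random.Random(seed)
    sequence = prompt
    while len(sequence) < length:
        probs = softmax(toy_next_residue_logits(sequence), temperature)
        sequence += rng.choices(AMINO_ACIDS, weights=probs, k=1)[0]
    return sequence


if __name__ == "__main__":
    print(sample_sequence(30))
```

The temperature parameter trades diversity against fidelity to the model's preferences: values below 1 concentrate sampling on the most probable residues, while values above 1 spread it out.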
Despite their power, PLMs have important limitations. Most operate purely on sequence and do not explicitly model three-dimensional physics, thermodynamic stability, or the geometric constraints of protein folding. They can also reflect biases in their training databases, which oversample certain organisms and protein families. For this reason, PLMs are often combined with multiple sequence alignments, structural encoders, or physics-based scoring to form hybrid systems that draw on the complementary strengths of each approach.