Word representations that shift dynamically to reflect the surrounding context.
Contextual embeddings are vector representations of words or tokens whose values change depending on the surrounding text, rather than remaining fixed regardless of usage. Traditional static embeddings like Word2Vec or GloVe assign each word a single vector learned from aggregate co-occurrence statistics, which means the word "bank" carries the same representation whether it appears in a sentence about rivers or finance. Contextual embeddings solve this by passing the entire input sequence through a deep neural network — typically a transformer or LSTM-based architecture — and producing token-level representations that encode both the word's identity and its role within that specific context.
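To make the "bank" example concrete, here is a minimal sketch using the Hugging Face transformers library and the bert-base-uncased checkpoint (both chosen purely for illustration; the sentences are also illustrative). It extracts the final-layer vector for "bank" in two different sentences and compares them; a static embedding would yield identical vectors, while a contextual model produces noticeably different ones.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative choice of model; any BERT-style encoder would behave similarly
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    # Run the whole sentence through the encoder so "bank" sees its context
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Locate the token "bank" and return its final-layer hidden state
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    idx = tokens.index("bank")
    return outputs.last_hidden_state[0, idx]

river = bank_vector("She sat on the bank of the river.")
money = bank_vector("She deposited the check at the bank.")

# The same surface form yields different vectors in different contexts
similarity = torch.cosine_similarity(river, money, dim=0)
print(f"cosine similarity between the two 'bank' vectors: {similarity.item():.3f}")
```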
The mechanics rely on attention or recurrent mechanisms that allow each token's representation to be influenced by the other tokens in the sequence. In transformer-based models like BERT, bidirectional self-attention means that each word simultaneously attends to all words before and after it, producing a rich, context-saturated vector at every layer. The deeper the layer, the more abstract and semantically nuanced the representation tends to be. These vectors can then be extracted and used as features for downstream tasks, a technique called feature extraction, or the entire model can be fine-tuned end-to-end on a specific task.
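As an illustration of the feature-extraction route, the sketch below requests all hidden states from the encoder and pools them into per-token features that a downstream model could consume without updating the encoder's weights. It again assumes transformers and bert-base-uncased; the example sentence and the choice to average the last four layers are illustrative, not prescribed.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Ask the model to return the hidden states from every layer, not just the last
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

sentence = "The treaty was signed after lengthy negotiations."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple: the embedding-layer output plus one tensor per
# transformer layer, each of shape (batch, sequence_length, hidden_size)
hidden_states = outputs.hidden_states
print(f"{len(hidden_states)} layers, final shape {tuple(hidden_states[-1].shape)}")

# One common feature-extraction choice: average the last four layers per token
features = torch.stack(hidden_states[-4:]).mean(dim=0)  # (batch, seq_len, hidden)

# These token vectors can now feed a lightweight downstream classifier
# (e.g., a linear layer for named entity recognition) while the encoder stays frozen
```

Which layers to pool is an empirical choice: lower layers tend to capture surface and syntactic information, while upper layers carry more semantic signal, so averaging several upper layers is a common middle ground when the encoder is kept frozen.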
Contextual embeddings became a cornerstone of modern NLP after ELMo (2018) demonstrated that representations drawn from language model hidden states dramatically improved performance across diverse benchmarks, followed shortly by BERT's even more impactful bidirectional approach. Their importance lies in enabling a single pretrained model to generalize across tasks like named entity recognition, question answering, sentiment analysis, and machine translation with minimal task-specific architecture changes. Virtually every state-of-the-art language model today — GPT, T5, LLaMA — produces contextual embeddings as a byproduct of its forward pass, making this concept foundational to the transformer era of NLP.