Causal Transformer

A transformer variant that enforces temporal causality by masking future information during training.

Year: 2018
Generality: 453

A Causal Transformer is a variant of the standard Transformer architecture designed to respect the directional flow of time in sequential data. Unlike bidirectional transformers, which can attend to both past and future tokens simultaneously, a causal transformer restricts each position in a sequence to attending only to itself and earlier positions. This is typically achieved through causal masking: a triangular attention mask sets the attention scores for future positions to negative infinity before the softmax operation, so their resulting attention weights are zero. The result is a model that processes sequences in a strictly left-to-right (or autoregressive) fashion, making it well-suited to generative tasks where future context is unavailable at inference time.
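
The masking step is easiest to see in code. The following is a minimal sketch of masked self-attention, assuming PyTorch; the function and variable names (causal_self_attention, scores, causal_mask) are illustrative rather than drawn from any particular implementation.

import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k) query, key, and value tensors
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5        # (batch, seq_len, seq_len)
    seq_len = scores.size(-1)
    # Strictly upper-triangular mask: True wherever a query would look at a future key.
    causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    # Scores for future positions become -inf, so softmax assigns them zero weight.
    scores = scores.masked_fill(causal_mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v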

The mechanism works by modifying the self-attention layers within the transformer block. During training, even though the full sequence is available, the causal mask prevents information leakage from future tokens, forcing the model to learn representations that depend only on prior context. This property is essential for language modeling objectives where the goal is to predict the next token given all preceding tokens. GPT-style models are the most prominent examples of causal transformers, trained on large corpora using this autoregressive objective to develop powerful generative capabilities.
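
As a sketch of that objective (assuming PyTorch and a hypothetical model that maps token ids to per-position vocabulary logits), the targets are simply the input sequence shifted by one position:

import torch.nn.functional as F

def next_token_loss(model, tokens):
    # tokens: (batch, seq_len) integer token ids
    logits = model(tokens[:, :-1])            # predictions for every prefix
    targets = tokens[:, 1:]                   # each position's label is the next token
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (batch * (seq_len - 1), vocab)
        targets.reshape(-1),
    )

Because the causal mask guarantees that position t only sees tokens up to t, every position can be trained in parallel from a single forward pass without leaking the answer.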

Causal transformers have become foundational to modern large language models (LLMs) and generative AI systems. Their autoregressive nature enables coherent, sequential text generation — each predicted token is fed back as input to generate the next — which underlies applications like conversational AI, code generation, and creative writing assistance. Beyond NLP, causal transformers are applied in time series forecasting, music generation, and reinforcement learning, wherever predicting future states from past observations is the core challenge.
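
A greedy decoding loop illustrates this feedback: the model's latest prediction is appended to the input and the process repeats. This is a simplified sketch assuming the same hypothetical model interface as above; real systems typically add sampling strategies and key-value caching.

import torch

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens):
    # prompt_ids: (1, prompt_len) token ids for the initial context
    tokens = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(tokens)                                    # (1, seq_len, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
        tokens = torch.cat([tokens, next_id], dim=1)              # feed it back as input
    return tokens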

The distinction between causal and non-causal (bidirectional) transformers represents a fundamental design choice with significant downstream implications. Bidirectional models like BERT excel at understanding and classification tasks where full context is available, while causal models excel at generation. Hybrid architectures have also emerged, combining causal decoding with bidirectional encoding, reflecting the ongoing effort to balance comprehension and generation within a single framework.

Related

Attention Masking
A technique that controls which positions a transformer's attention mechanism can access.
Generality: 694

Masking
Blocking certain input positions from attention to enforce valid information flow.
Generality: 694

Transformer
A neural network architecture using self-attention to process sequential data in parallel.
Generality: 900

Causal AI
AI that reasons about cause-and-effect relationships rather than mere statistical correlations.
Generality: 694

TCN (Temporal Convolutional Networks)
Convolutional neural networks that model sequential data using dilated, causal convolutions.
Generality: 550

Encoder-Decoder Transformer
A transformer architecture that encodes input sequences and decodes them into outputs.
Generality: 722