Envisioning is an emerging technology research institute and advisory.

2011 — 2026


Vision Transformers (ViTs)

Transformer-based models that process images as sequences of patches for visual tasks.

Year: 2020
Generality: 720

Vision Transformers (ViTs) are a class of deep learning models that apply the transformer architecture — originally developed for natural language processing — directly to image data. Rather than using convolutional layers to extract spatial features, ViTs divide an input image into a grid of fixed-size patches, flatten each patch into a vector, and feed the resulting sequence into a standard transformer encoder. Positional embeddings are added to retain spatial information, and a special classification token aggregates global context for downstream prediction tasks such as image classification, object detection, and segmentation.
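The patch-embedding pipeline described above can be sketched in a few lines of numpy. This is a minimal illustration, not a production implementation: the patch size (16) and embedding dimension (64) are illustrative choices, and random matrices stand in for the projection, [CLS] token, and positional embeddings that a real ViT learns during training.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    patches = (
        image.reshape(H // patch_size, patch_size, W // patch_size, patch_size, C)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, patch_size * patch_size * C)
    )
    return patches  # shape: (num_patches, patch_size**2 * C)

def embed(patches, d_model=64, seed=0):
    """Project patches, prepend a [CLS] token, add positional embeddings."""
    rng = np.random.default_rng(seed)
    n, d_patch = patches.shape
    W_proj = rng.normal(scale=0.02, size=(d_patch, d_model))  # learned in practice
    tokens = patches @ W_proj                                  # (n, d_model)
    cls = rng.normal(scale=0.02, size=(1, d_model))            # classification token
    seq = np.concatenate([cls, tokens], axis=0)                # (n + 1, d_model)
    pos = rng.normal(scale=0.02, size=seq.shape)               # positional embeddings
    return seq + pos

image = np.random.rand(224, 224, 3)
patches = patchify(image)  # 14 * 14 = 196 patches, each of length 16 * 16 * 3 = 768
seq = embed(patches)       # (197, 64) token sequence fed to the transformer encoder
```

Note that a 224×224 image becomes a sequence of only 196 patch tokens plus one [CLS] token, which is what makes a standard transformer encoder tractable on image data.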

The core mechanism driving ViTs is multi-head self-attention, which allows every patch to attend to every other patch simultaneously. This global receptive field is a fundamental departure from convolutional neural networks, which build up spatial context incrementally through local filters. The trade-off is that self-attention scales quadratically with sequence length, making high-resolution images computationally expensive without modifications such as the hierarchical patch merging and windowed attention introduced by the Swin Transformer; other variants, such as DeiT, instead target the architecture's data hunger through distillation.
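The global receptive field and its quadratic cost are both visible in a bare-bones single-head self-attention sketch (again with random weights standing in for learned ones; dimensions are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, seed=0):
    """Single-head self-attention: every token attends to every other token."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Wq, Wk, Wv = (rng.normal(scale=0.02, size=(d, d)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)  # (n, n): the quadratically sized attention matrix
    return softmax(scores) @ V     # each output mixes information from all n tokens

# 196 patch tokens + 1 [CLS] token for a 224x224 image with 16x16 patches.
# Doubling the image resolution quadruples n, so the (n, n) score
# matrix -- and the attention compute -- grows roughly 16-fold.
X = np.random.rand(197, 64)
out = self_attention(X)  # (197, 64)
```

The `(n, n)` score matrix is exactly why windowed or hierarchical variants exist: restricting attention to local windows keeps each score matrix small while deeper stages recover global context.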

ViTs demonstrated that, given sufficient training data, pure transformer models can match or surpass convolutional architectures on standard vision benchmarks. Early experiments showed that ViTs underperformed CNNs on smaller datasets due to their lack of built-in inductive biases like translation equivariance, but this gap closed substantially when training on large-scale datasets such as JFT-300M or with data-efficient distillation techniques. This finding reinforced a broader trend in deep learning: scale and data can compensate for architectural priors.

The impact of ViTs extends well beyond image classification. They have become foundational components in multimodal models — such as CLIP and LLaVA — that jointly process vision and language, and in generative systems like diffusion models that use transformer-based backbones for image synthesis. By unifying the architectural language of vision and NLP under a single framework, ViTs have accelerated cross-domain transfer learning and reshaped how practitioners design and scale visual systems.

Related

Image-to-Text Model

An AI system that generates natural language descriptions from visual image content.

Generality: 694
LVLMs (Large Vision Language Models)

Large AI models that jointly understand and reason over images and text.

Generality: 694
Transformer

A neural network architecture using self-attention to process sequential data in parallel.

Generality: 900
VLM (Visual Language Model)

AI models that jointly understand and generate both visual and textual information.

Generality: 720
VQA (Visual Question Answering)

AI systems that answer natural language questions about images or videos.

Generality: 620
ITM (Image-Text Matching)

A technique for learning correspondences between images and natural language descriptions.

Generality: 547