Transformer-based models that process images as sequences of patches for visual tasks.
Vision Transformers (ViTs) are a class of deep learning models that apply the transformer architecture, originally developed for natural language processing, directly to image data. Rather than using convolutional layers to extract spatial features, ViTs divide an input image into a grid of fixed-size patches, flatten each patch into a vector, and feed the resulting sequence into a standard transformer encoder. Positional embeddings are added to retain spatial information, and a learnable classification token aggregates global context for image-level prediction, while the patch tokens themselves provide the dense features used in tasks such as object detection and segmentation.
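To make the patch-to-sequence pipeline concrete, here is a minimal PyTorch sketch of the embedding stage. The specific sizes (224-pixel images, 16x16 patches, a 768-dimensional embedding) are illustrative choices modeled on a ViT-Base-style configuration, not details taken from the text above.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Minimal, illustrative sketch of the ViT input stage (assumed sizes)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution performs "flatten each patch and apply a
        # linear projection" in a single step.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                     # x: (B, 3, 224, 224)
        x = self.proj(x)                      # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (B, 196, 768) patch sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)        # prepend the classification token
        return x + self.pos_embed             # add learned positional embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768]) -> ready for a transformer encoder
```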
The core mechanism driving ViTs is multi-head self-attention, which allows every patch to attend to every other patch simultaneously. This global receptive field is a fundamental departure from convolutional neural networks, which build up spatial context incrementally through local filters. The trade-off is that self-attention scales quadratically with sequence length, making high-resolution images computationally expensive without modifications such as hierarchical patch merging or windowed attention, strategies employed by later variants such as the Swin Transformer.
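A quick back-of-the-envelope calculation, not taken from the text, shows why the cost is quadratic: the number of patches grows with the square of the image side length, and the per-head attention matrix grows with the square of the number of patches.

```python
# Illustrative only: attention-matrix growth for a hypothetical ViT with 16x16 patches.
patch_size = 16
for side in (224, 448, 896):                 # assumed square image resolutions
    num_patches = (side // patch_size) ** 2  # sequence length seen by self-attention
    print(f"{side}x{side}: {num_patches:4d} patches -> "
          f"{num_patches ** 2:,} attention entries per head")
# 224x224:  196 patches -> 38,416 attention entries per head
# 448x448:  784 patches -> 614,656 attention entries per head
# 896x896: 3136 patches -> 9,834,496 attention entries per head
```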
ViTs demonstrated that, given sufficient training data, pure transformer models can match or surpass convolutional architectures on standard vision benchmarks. Early experiments showed that ViTs underperformed CNNs on smaller datasets due to their lack of built-in inductive biases like translation equivariance, but this gap closed substantially when training on large-scale datasets such as JFT-300M or with data-efficient distillation techniques. This finding reinforced a broader trend in deep learning: scale and data can compensate for architectural priors.
The impact of ViTs extends well beyond image classification. They have become foundational components in multimodal models such as CLIP and BLIP-2 that jointly process vision and language, and in generative systems like diffusion models that use transformer-based backbones for image synthesis. By unifying the architectural language of vision and NLP under a single framework, ViTs have accelerated cross-domain transfer learning and reshaped how practitioners design and scale visual systems.