Vision Transformer: Treating Image Patches Like Tokens
ViT showed that a standard Transformer can compete with convolutional networks on image recognition when each image is split into patches and the model is pretrained at sufficient scale.
What problem it solves
Before ViT, computer vision was dominated by convolutional neural networks. ViT asks whether the Transformer architecture, already central in language, can work for images without building in convolution as the main inductive bias.
The core method
The model splits an image into fixed-size patches, linearly embeds each patch, adds position information, and feeds the sequence into a standard Transformer encoder. Classification is performed from the resulting representation, much like sequence classification in NLP.
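The patch-to-token step can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the tiny sizes, the random weights, and the helper names (patchify, linear, embed_patches) are assumptions made for the example. The prepended class token follows the original ViT design, which the summary above does not spell out, and the Transformer encoder itself is omitted.

```python
import random

def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append every channel of this pixel
            patches.append(flat)
    return patches

def linear(vec, weight, bias):
    """Compute weight @ vec + bias for a single vector."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def embed_patches(image, patch, weight, bias, cls_token, pos_embed):
    """Turn an image into the token sequence a Transformer encoder would consume."""
    tokens = [linear(p, weight, bias) for p in patchify(image, patch)]
    tokens = [cls_token] + tokens  # class token goes first (assumed, as in the original ViT)
    if len(pos_embed) != len(tokens):
        raise ValueError("need exactly one position embedding per token")
    # Position information is added element-wise to each token.
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pos_embed)]

# Illustrative sizes only: an 8x8 RGB image, 4x4 patches, embedding width 6.
rng = random.Random(0)
H = W = 8
C, P, D = 3, 4, 6
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
patch_dim = P * P * C
weight = [[rng.gauss(0, 0.02) for _ in range(patch_dim)] for _ in range(D)]
bias = [0.0] * D
cls_token = [0.0] * D
num_tokens = (H // P) * (W // P) + 1
pos_embed = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(num_tokens)]

sequence = embed_patches(image, P, weight, bias, cls_token, pos_embed)
print(len(sequence), len(sequence[0]))  # prints "5 6": 4 patch tokens + 1 class token, width 6
```

In a trained model the projection, class token, and position embeddings would be learned parameters. Here they are random or zero so that the shapes are the only thing on display.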
Key results
ViT matches or exceeds strong convolutional baselines when pretrained on large image datasets (the paper uses ImageNet-21k and JFT-300M) and transferred to downstream recognition benchmarks. The paper also shows a tradeoff: without enough data, the weaker image-specific inductive bias can hurt accuracy; with enough scale, the architecture becomes highly competitive.
Why it matters
ViT opened the path for foundation-model thinking in computer vision. Once images can be represented as token sequences, many ideas from language modeling become easier to transfer: scaling, pretraining, masked prediction, multimodal alignment, and unified architectures.
Limits and open questions
The original ViT depends heavily on large-scale pretraining and is less data-efficient than CNNs in small-data settings. Patch tokenization can also miss fine local structure unless later designs add hierarchy, better augmentation, or hybrid components.
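The granularity tradeoff behind patch tokenization comes down to simple arithmetic, sketched below. The 224-pixel input side and the three patch sizes are illustrative assumptions, and token_stats is a hypothetical helper. The sketch relies on two facts: each patch becomes a single token, and full self-attention compares every token with every other token.

```python
def token_stats(image_size, patch_size):
    """Token count and related figures for a square image cut into square patches."""
    if image_size % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    per_side = image_size // patch_size
    tokens = per_side * per_side
    return {
        "tokens": tokens,
        "pixels_per_token": patch_size * patch_size,  # pixels folded into one token
        "attention_pairs": tokens * tokens,           # entries in one attention matrix
    }

for patch_size in (8, 16, 32):
    print(patch_size, token_stats(224, patch_size))
```

Larger patches shorten the sequence and cut attention cost, but each token then summarizes more pixels through a single linear projection, which is one way fine local structure can be lost.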
One line: ViT made images look like sequences to the Transformer ecosystem.