CLIP: Computer Vision Learns to Read Natural Language
CLIP trains image and text encoders on 400 million internet image-text pairs, making natural language a flexible interface for zero-shot visual recognition.
What problem it solves
Traditional computer vision classifiers learn a fixed label set. If the task changes, the system usually needs new labeled examples and new training. CLIP attacks that brittleness by using natural language as supervision: instead of predicting only predefined categories, the model learns which text goes with which image.
The core method
CLIP trains two encoders, one for images and one for text, with a contrastive objective. Given a batch of image-caption pairs, it pulls the embeddings of matching pairs together and pushes the embeddings of mismatched pairs apart. The training data is web-scale: 400 million image-text pairs. At inference time, a user describes candidate labels in natural language, and the model scores which text best matches the image.
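The contrastive step and the zero-shot scoring step can be sketched in a few lines. This is a minimal illustration, not CLIP's actual implementation: it assumes the encoders have already produced embedding vectors, uses one common formulation of the objective (a symmetric cross-entropy over temperature-scaled cosine similarities), and picks an illustrative temperature of 0.07. The function names and the toy two-dimensional embeddings are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs, temperature=0.07):
    """Entry [i][j] is cosine(image i, text j) divided by the temperature."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy: pair i (image i, text i) is the positive,
    every other pairing in the batch is a negative."""
    logits = similarity_matrix(image_embs, text_embs, temperature)
    n = len(logits)
    # Image -> text: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: same idea over the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2

def zero_shot_scores(image_emb, label_embs, temperature=0.07):
    """Probability that each candidate label text matches the image."""
    logits = similarity_matrix([image_emb], label_embs, temperature)[0]
    return [math.exp(x) for x in log_softmax(logits)]

# Toy batch: hypothetical 2-D embeddings standing in for encoder outputs.
aligned = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])

# Zero-shot: score one image against two label-text embeddings.
scores = zero_shot_scores([1.0, 0.1], [[1, 0], [0, 1]])
```

In this toy batch the loss is near zero when each image embedding lines up with its own caption and large when the captions are swapped, which is the pull-together, push-apart behavior described above. Zero-shot classification reuses the same similarity computation, with the candidate label texts taking the place of captions.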
Key results
CLIP transfers to more than 30 vision datasets without task-specific training. Used zero-shot, it matches the ImageNet accuracy of the original ResNet-50 without training on any of ImageNet’s 1.28 million labeled examples. It also shows nontrivial transfer to OCR, video action recognition, geolocation, and fine-grained classification.
Why it matters
CLIP turned language into a control surface for vision. That idea became central to text-to-image generation, image retrieval, safety filtering, multimodal agents, and promptable perception. It also changed what a vision dataset could mean: supervision did not have to be a clean label taxonomy if messy internet text could be scaled.
Limits and open questions
CLIP inherits biases and noise from web data, and its zero-shot predictions depend heavily on how the label prompts are worded. It recognizes associations better than it understands grounded physical causality. The model is powerful because it is broad, but broad supervision also makes its failure modes less predictable than those of a narrow supervised classifier.
One line: CLIP made vision queryable by language.