Flamingo: Few-Shot Learning for Images, Video, and Text

TL;DR

Flamingo connects pretrained vision encoders with large language models so multimodal tasks can be handled with a few interleaved examples.

What problem it solves

Vision-language systems often required task-specific fine-tuning. Flamingo asks whether a model can instead learn new multimodal tasks from a few examples in the prompt, closer to how large language models do in-context learning.
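To make "a few examples in the prompt" concrete, here is a minimal sketch of how an interleaved few-shot prompt could be assembled. The `Example` type, the `build_prompt` helper, the `<image>` and `<EOC>` marker strings, and the file names are all illustrative assumptions, not the interface of any released model.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed marker strings, chosen for illustration only.
IMAGE_TAG = "<image>"    # marks where an image sits in the text stream
END_OF_CHUNK = "<EOC>"   # closes one (image, text) example


@dataclass
class Example:
    image_path: str  # hypothetical path to the support image
    text: str        # the text paired with that image


def build_prompt(
    shots: List[Example], query_image: str, query_prefix: str = ""
) -> Tuple[str, List[str]]:
    """Return the interleaved text and the ordered list of images it refers to."""
    parts = [f"{IMAGE_TAG}{ex.text}{END_OF_CHUNK}" for ex in shots]
    # The query repeats the pattern but leaves the text open for the model.
    parts.append(f"{IMAGE_TAG}{query_prefix}")
    images = [ex.image_path for ex in shots] + [query_image]
    return "".join(parts), images


shots = [
    Example("cat.jpg", "Output: A cat on a sofa."),
    Example("dog.jpg", "Output: A dog in a park."),
]
prompt, images = build_prompt(shots, "bird.jpg", "Output:")
print(prompt)
print(images)
```

The point of the sketch is that the task is specified entirely by the support examples: swapping captions for question-answer pairs changes the task without touching any weights.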

The core method

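The model keeps a pretrained vision encoder and a pretrained language model frozen, and trains two new pieces between them: a Perceiver Resampler that turns a variable number of image or video features into a fixed number of visual tokens, and gated cross-attention layers, inserted between the frozen language-model blocks, that let text tokens attend to those visual tokens. It supports interleaved sequences of images, video frames, and text, so examples and queries can be placed in one prompt.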
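The cross-attention bridge can be sketched in a few lines. The version below is a simplified, single-head illustration in plain Python, assuming a scalar gate passed through tanh and added as a residual. It omits multi-head attention, layer normalization, and the feed-forward sublayer, and the function names and toy values are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def gated_cross_attention(text, visual, w_q, w_k, w_v, gate):
    """Single-head cross-attention with a tanh-gated residual connection.

    text:   T x d text-token states (used as queries)
    visual: V x d visual-token states (used as keys and values)
    gate:   scalar; tanh(gate) scales how much visual signal is added
    """
    q = matmul(text, w_q)
    k = matmul(visual, w_k)
    v = matmul(visual, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]           # transpose keys: d x V
    scores = matmul(q, k_t)                         # T x V
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, v)                   # T x d
    g = math.tanh(gate)
    # Residual: with gate == 0 the text states pass through unchanged.
    return [[t + g * a for t, a in zip(t_row, a_row)]
            for t_row, a_row in zip(text, attended)]


identity = [[1.0, 0.0], [0.0, 1.0]]                 # toy projection weights
text = [[1.0, 0.0], [0.0, 1.0]]                     # two text tokens
visual = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]      # three visual tokens

closed = gated_cross_attention(text, visual, identity, identity, identity, 0.0)
opened = gated_cross_attention(text, visual, identity, identity, identity, 1.0)
print(closed == text)   # the closed gate leaves the text states untouched
print(opened != text)   # an open gate mixes in visual information
```

The design choice worth noticing is the gate. If it starts at zero, the new layers initially contribute nothing, so the frozen language model behaves exactly as it did before, and visual information is blended in only as training moves the gate away from zero.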

Key results

In few-shot settings, Flamingo performs strongly across the 16 image and video benchmarks evaluated in the paper, and with 32 examples it surpasses fine-tuned state-of-the-art models on 6 of them. The key result is not just one score, but the demonstration that a single model can adapt to captioning, question answering, and other multimodal tasks through examples rather than dedicated retraining.

Why it matters

Flamingo helped define the modern visual language model pattern: preserve strong pretrained backbones, add a bridge between modalities, and use prompts as the task interface. That idea shaped later assistants that can reason over screenshots, photos, documents, and video.

Limits and open questions

Few-shot multimodal prompting is powerful but still sensitive to example choice and data coverage. The architecture also inherits biases and blind spots from both visual and language pretraining. For real products, grounding, hallucination control, privacy, and latency remain hard problems.

One line: Flamingo made multimodal prompting feel like language-model prompting.