Latest paper explainers
The latest AI research paper explainers across language models, embodied intelligence, and frontier science.
Code Generation · Google DeepMind
AlphaCode attacked programming contests by generating many candidate programs, filtering them against the example tests, and clustering the survivors to submit a small, diverse set of solutions likely to pass hidden tests.
Language Models · Google Research
BERT made deep bidirectional Transformer pretraining practical, letting one pretrained encoder be fine-tuned into strong task-specific NLP systems with minimal architecture changes.
Language Models · Google DeepMind
Chinchilla showed that many large language models were undertrained, and that for a fixed compute budget, training a smaller model on more tokens can beat simply adding parameters.
Code Generation · Meta AI
Code Llama adapts Llama-family models for code generation, infilling, and instruction-following, giving the open ecosystem stronger coding baselines.
Self-Supervised Learning · Meta AI
DINOv2 trains self-supervised vision models on curated large-scale data to produce robust features usable across many downstream tasks.
Multimodal Models · Google DeepMind
Flamingo connects pretrained vision encoders with large language models so new multimodal tasks can be handled from a few interleaved image-text examples in the prompt.
Language Models · OpenAI
GPT-3 showed that a 175B autoregressive language model could perform many tasks from examples in the prompt, without gradient updates or task-specific fine-tuning.
Alignment · OpenAI
InstructGPT showed that human preference data and RLHF could make smaller models more helpful and aligned than much larger raw language models.
Text-to-Image · Google Research
Imagen showed that stronger language encoders can materially improve text-to-image diffusion models, especially for prompt alignment and photorealism.
Language Models · Google Research
PaLM used the Pathways system to train a 540B dense Transformer and showed how scale improves few-shot language, reasoning, and code performance.
Segmentation · Meta AI
SAM reframed image segmentation as a promptable foundation-model task, backed by the Segment Anything Model and the SA-1B dataset of over one billion masks.
Language Models · Google Research
T5 unified NLP transfer learning by casting every task as text input to text output, then systematically studying objectives, data, scale, and fine-tuning choices.
Vision Foundation Models · Google Research
ViT showed that a standard Transformer can compete in image recognition when images are split into patches and trained at sufficient scale.
Speech Recognition · OpenAI
Whisper showed that large, diverse, weakly supervised audio data can produce robust multilingual speech recognition and translation models.
Biomolecular Modeling · Google DeepMind
AlphaFold 3 moves beyond protein structure prediction by using a diffusion-based architecture to model complexes involving proteins, nucleic acids, ligands, ions, and modified residues.
Transformers · Google Research
The Transformer removed recurrence and convolution from sequence transduction, replacing them with attention and parallel training; almost every modern LLM stands on that move.
Theorem Proving · Google DeepMind
AlphaGeometry combines a neural language model with symbolic deduction, using synthetic theorems and proofs to reach near gold-medal performance on olympiad geometry.
Text-to-Image · OpenAI
DALL·E 2 splits text-to-image generation into a prior that predicts a CLIP image embedding and a decoder that turns that embedding into an image.
Multimodal Models · OpenAI
CLIP trains image and text encoders on 400 million internet image-text pairs, making natural language a flexible interface for zero-shot visual recognition.
Alignment · Stanford University
Direct Preference Optimization turns preference tuning into a simple classification-style objective, avoiding an explicit reward model and reinforcement learning loop.
Long Context · Google DeepMind
Gemini 1.5 made million-token multimodal context feel less like a demo trick and more like a practical interface for long documents, video, audio, and code.
Efficient AI · Stanford University
FlashAttention keeps attention exact but makes it IO-aware, using tiling to cut reads and writes between GPU high-bandwidth memory and on-chip SRAM, which makes long-sequence Transformers faster and cheaper.
Multimodal Models · OpenAI
GPT-4 was less a full recipe than a measurement document: a multimodal Transformer whose benchmark performance, scaling predictability, and post-training alignment reset expectations for frontier AI.
Diffusion Models · CompVis
Latent diffusion moves denoising from pixel space into a compressed autoencoder latent space, making high-resolution image generation far cheaper while preserving flexibility.
Open Models · Meta AI
Llama 3 is not just a bigger open-weight model; it is Meta's attempt to package multilingual, coding, reasoning, tool use, and safety into a coherent public model family.
Sequence Modeling · Carnegie Mellon University
Mamba makes state space models selective, letting them decide what to remember or forget from the input while scaling linearly with sequence length.
Vision-Language-Action · Physical Intelligence
A single vision-language-action model, trained on data from seven robot platforms, performs dexterous everyday tasks like folding laundry from plain language prompts.
Vision-Language-Action · Google DeepMind
RT-2 shows that robot actions can be represented as language-like tokens, letting web-trained vision-language models transfer semantic knowledge into physical control.
Segmentation · Meta AI
SAM 2 extends promptable segmentation from still images to real-time video by adding streaming memory and a data engine built around user interaction.
LLM Reasoning · DeepSeek
Reinforcement learning alone, with no supervised reasoning traces, can make a base language model develop step-by-step reasoning strong enough to rival top closed models.