Latest paper explainers

The latest AI research paper explainers across language models, embodied intelligence, and frontier science.

Text-to-Image · Huazhong University of Science and Technology

Moebius: a 0.22B Inpainting Model That Matches FLUX.1-Fill

Moebius is a 0.22B diffusion inpainter that rivals the 11.9B FLUX.1-Fill-Dev. On CelebA-HQ it scores FID 5.39 vs FLUX 10.13, using under 2% of the parameters and roughly 15x less total inference time.

Robotics · UC Berkeley

Playful Agentic Robot Learning: RATs Learn Skills Before the Task

RATs lets a robot play before it is told what to do: agent teams propose their own goals, write code-as-policy, and bank wins as reusable skills. The library lifts LIBERO-PRO success from 23.2% to 43.8%.

World Models · SenseTime

Kairos: A 4B Native World Model Stack for Physical AI

Kairos packs a world model into 4B params with linear-time Hybrid Linear Temporal Attention, leads PAI-Bench at 80.84, and runs a 480P rollout in 11.4s on one RTX 5090.

Vision-Language-Action · ACE Robotics

ACE-Ego-0: Unifying Egocentric Human and Robot Data for VLA Pretraining

ACE-Ego-0 pretrains a VLA on 6,000+ hours mixing robot trajectories with human egocentric video turned into pseudo-actions. It averages 78.3% on six real bimanual tasks vs 71.7% for pi-0.5 and 35.6% for GR00T-N1.7.

Robotics · Peking University

DragMesh-2: Dexterous Articulated Manipulation Through Contact

DragMesh-2 opens doors and drawers with a 51-DoF hand and no actuator on the object joint, so motion comes only from contact. PICA training hits 0.89 success at nominal damping and 0.56 at 4x, with no tactile sensing.

AI Agents · Tencent

From Chatbot to Digital Colleague: A Survey of Persistent Autonomous AI

A Tencent YouTu Lab survey maps the chatbot-to-agent shift on two axes: cognitive core (Chatbot then Thinking LLM) and task execution (Agent then Workspace plus Skill), arguing persistent state is the real leap.

AI Agents · University of Science and Technology of China

Role-Agent: One LLM Plays Both Agent and Its Own Environment

Role-Agent makes a single LLM act as agent and environment at once, generating its own process reward and curriculum. It beats GiGPO by 4.2% on ALFWorld and 6.9% on WebShop with Qwen2.5-1.5B.

Retrieval-Augmented Generation · KAIST

Rethinking RAG in Long Videos: V-RAGBench and CARVE Explained

CARVE picks a modality-granularity config per video chunk instead of per query, lifting Recall@5 to 0.603 from a best baseline of 0.510 on V-RAGBench, a leak-filtered egocentric VideoRAG benchmark.

Video Generation · Tsinghua University

OmniDirector: Multi-Shot Camera Cloning without Cross-Paired Data

OmniDirector copies camera motion from a reference video into new generations and cuts rotation error to 2.64 degrees, beating CamCloneMaster (4.11), with no cross-paired training data.

AI Agents · Tsinghua University

EurekAgent: Environment Engineering for AI Science

EurekAgent argues the bottleneck is agent environment design, reporting 2.635999 on 26-circle packing, 2005.03 µs on TriMul, and 85.71% any-medal on a seven-task MLE-Bench subset.

Multimodal Models · Nanjing University

HYDRA-X: One Visual Tokenizer for Images and Video

HYDRA-X unifies image and video tokenization in one ViT; tubelet attention and hierarchical temporal patchify improve DAVIS rFVD to 11.19 and editing overall to 4.34.

Agent Memory · National University of Singapore

EvoArena: Why Agent Memory Must Track Environment Changes

EvoArena turns static agent tasks into evolving chains and finds current agents average only 39.6% accuracy; EvoMem adds patch memory and improves chain-level accuracy by 3.7 points.

Text-to-Image · The Chinese University of Hong Kong

InterleaveThinker: Planner-Critic Agents for Interleaved Image Generation

InterleaveThinker adds planner and critic agents around frozen image generators, reaching 66.3 to 67.2 average on UEval and lifting FLUX.2-klein WISE from 0.47 to 0.73.

Vision-Language-Action · Zhejiang University

LabVLA: A VLA Model for Scientific Lab Robots

LabVLA trains a Qwen3-VL-4B backbone plus DiT action expert on laboratory workflows and reports 71.1% ID and 70.0% OOD success on LabUtopia.

Theorem Proving · MiniMax AI

MaxProof: How MiniMax M3 Reaches Gold-Level Proof Scores

MaxProof turns MiniMax-M3 into a generator, verifier, fixer, and ranker; with population-level test-time scaling it reports 35/42 on IMO 2025 and 36/42 on USAMO 2026.

Long Context · MiniMax AI

MiniMax Sparse Attention: 1M Context Without Dense Attention

MiniMax Sparse Attention keeps only 2,048 selected KV tokens per query group and reports 28.4x lower attention FLOPs plus 14.2x prefill speedup at 1M context.

AI Agents · NVIDIA

SpatialClaw: Why VLM Spatial Agents Need a Python Workspace

SpatialClaw replaces rigid tool calls with a persistent Python kernel and reaches 59.9% average accuracy across 20 spatial reasoning benchmarks, +11.2 points over the recent spatial-agent baseline.

Multimodal Models · Shanghai AI Laboratory

InternVideo3: An 8B Video Agent with Contextual Reasoning

InternVideo3 is an 8B video model that scores 73.8 on Video-MME single-pass, then adds +2.7 from an agentic reasoning loop on top. M2LA attention holds 768K tokens on one H200 where the base OOMs at 512K.

Reinforcement Learning · Alibaba Qwen Team

Breaking Entropy Bounds: MTP Rejection Sampling for Faster RL Training

Bebop proves multi-token prediction acceptance is bounded by rising RL entropy. A total-variation loss plus rejection sampling holds acceptance near 95% and flattens the entropy slope to -0.06, for up to 1.8x faster RL.

LLM Reasoning · Shanghai Jiao Tong University

ReRe: Cross-view Revisiting Lifts MLLM Spatial Reasoning

ReRe lets an MLLM answer a spatial question, then re-watch a synthesized novel-view video and revise. Training-free, it pushes Qwen3-VL-2B from 22.5 to 31.0 on VSI-Bench (+8.5).

Vision-Language-Action · CASIA

World Pilot: Steering a VLA Policy with World-Action Priors

World Pilot adds two world-model priors to a VLA policy and hits 84.7% total success on LIBERO-Plus zero-shot OOD, up 4.2 over the ABot-M0 base, and the scene prior works even from a world model never action-trained.

AI Agents · The Chinese University of Hong Kong

Orchestra-o1: Omnimodal Agent Orchestration

Orchestra-o1 orchestrates text, image, audio, and video sub-agents and hits 72.8% on OmniGAIA with a GPT-5 brain (+10.3 over Gemini-3-Pro). Its trained 8B orchestrator reaches 30.0%, best among open omnimodal agents.

AI Agents · CASIA

Agentic Environment Engineering for LLMs: A Survey of the Field

A CASIA survey maps agentic environments for LLM agents along eight attribute axes and eight domains, unifying synthesis, evaluation, and co-evolution. Sharpest finding: environments barely fit multi-agent settings.

Reinforcement Learning · Alibaba Qwen Team

APPO: Agentic Procedural Policy Optimization for RL Agents

APPO branches RL rollouts at high-uncertainty, high-influence tokens instead of tool-call boundaries, lifting Qwen2.5-7B by 3.9 points over ARPO across 13 math, multi-hop, and deep-search benchmarks.

Sequence Modeling · JKU Linz

On Subquadratic Architectures: xLSTM vs Mamba-2 vs Gated DeltaNet

A JKU Linz study runs xLSTM, Mamba-2, and Gated DeltaNet through code, time-series, and synthetic tasks, then traces xLSTM's lead to two primitives: counting-style accumulation and finite-state tracking.

AI Agents · Renmin University of China

FORT-Searcher: Training Search Agents Without Shortcuts

FORT-Searcher trains a 3B-active search agent on shortcut-resistant tasks, scoring 66.2 overall among comparable open agents and delaying answer hit time from 18.7 to 46.9 versus REDSearcher data.

AI Agents · Google DeepMind

From AGI to ASI: DeepMind's Map of Superintelligence Pathways

Google DeepMind's report lays out four non-exclusive paths from AGI to ASI and treats each bottleneck, from data walls to regulation, as an open research question.

AI Agents · Renmin University of China

Arbor: Autonomous Research With Hypothesis Trees

Arbor stores research attempts in a persistent hypothesis tree, then admits changes only through held-out evaluation. It reports best held-out results on six AO tasks and 86.36% Any Medal on MLE-Bench Lite.

AI Agents · TokenRhythm Technologies

Claw-SWE-Bench: Why Coding Agent Harnesses Matter

Claw-SWE-Bench evaluates OpenClaw-style coding-agent harnesses on 350 GitHub issue tasks. OpenClaw jumps from 19.1% to 73.4% Pass@1 with a full adapter.

Mixture of Experts · Renmin University of China

Manifold Power Iteration: A Better Router for MoE Models

MPI redesigns MoE routers by aligning router rows with expert weight directions. On 11B MoE, average benchmark accuracy rises from 40.92 to 42.76 with only 0.2% training slowdown.

Reinforcement Learning · Tencent

CPPO: Beyond Uniform Token-Level Trust Regions in LLM RL

CPPO replaces PPO's one-size threshold with stricter early-token clipping and a running prefix-divergence budget, lifting Qwen3-30B-A3B-Base AIME from 49.23 (DPPO) to 54.79.

Code Generation · Renmin University of China

DeNovoSWE: 4,818 Auto-Built Repos to Train Whole-Repo Code Agents

DeNovoSWE auto-constructs 4,818 verifiable whole-repository generation tasks. Fine-tuning Qwen3-30B-A3B on them lifts BeyondSWE-Doc2Repo pass rate from 0.058 to 0.472.

Reinforcement Learning · Xi'an Jiaotong University

Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching

Flow-DPPO swaps PPO ratio clipping for an exact per-step Gaussian KL term, lifting GenEval2 to 48.1 on SD3.5 (vs 39.9 for Flow-GRPO) while cutting policy drift roughly 4x.

Video Generation · KAIST

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Sync

First autoregressive-diffusion lip-sync method: distills a 14B bidirectional teacher into causal 1.3B/14B students that generate each chunk in 2 steps, hitting 31.58 FPS with sub-millisecond time-to-first-frame.

Multimodal Models · Fudan University

ARM: An AutoRegressive Multimodal Model with Unified Discrete Tokens

ARM is a 7B autoregressive model that does image understanding, text-to-image, and editing in one next-token framework, on a shared discrete tokenizer; RL lifts GenEval 0.79 to 0.86 and GEdit-EN overall 5.75 to 6.68.

Video Generation · Zhipu AI

SCAIL-2: End-to-end In-Context Conditioning for Character Animation

SCAIL-2 feeds the raw driving video into the generation sequence instead of a pose skeleton, cutting FVD to 287 vs 305 for Wan-Animate on Studio-Bench, with one model covering animation and replacement.

Reinforcement Learning · Zhejiang University

N-GRPO: Semantic Neighbor Mixing for RL Rollouts

N-GRPO perturbs rollout embeddings with semantic neighbors, lifting DeepSeek-R1-Distill-Qwen-1.5B average Pass@32 to 79.17 and AIME25 Pass@32 to 50.28.

Multimodal Models · Kuaishou Technology

Kwai Keye-VL-2.0: Open Long-Video Multimodal Agent Model

Kwai Keye-VL-2.0 is a 30B-A3B open MoE multimodal model with 256K context, strong long-video scores, and 62.0 on SWE-bench Verified.

Reinforcement Learning · Tencent

DRPO: Rethinking Divergence Regularization in LLM RL

DRPO swaps DPPO's hard divergence mask for a smooth advantage-weighted quadratic regularizer, keeping the Binary-TV trust region but with bounded gradient weights, and trains Qwen3 LLMs more stably under FP8.

AI Agents · Ant Group

SearchSwarm: Delegation Intelligence for Deep Research

SearchSwarm fine-tunes Tongyi DeepResearch-30B-A3B on harness-generated delegation trajectories, lifting BrowseComp from 43.4 to 68.1 and topping every 30B-A3B model on four deep-research benchmarks.

Reinforcement Learning · Alibaba Qwen Team

Z-Reward: Internalizing Reasoning into Score Distributions for T2I

Z-Reward predicts a distribution over rubric scores instead of one scalar. A 9B student hits 88.6% human-preference accuracy with a single output token, and downstream T2I tuning gains 41.3% net GSB over SFT.

Long Context · University of Maryland

End-to-End Context Compression at Scale with LCLMs

LCLMs train a 0.6B encoder and 4B decoder jointly to compress long context into soft tokens at 1:4, 1:8 and 1:16, cutting prefill memory and time-to-first-token while staying close to the uncompressed baseline.

AI Agents · Zhejiang University

WeaveBench: Hybrid Computer-Use Agents Still Fail

WeaveBench tests 114 real hybrid GUI plus CLI tasks; the best frontier pairing reaches only 41.2% PassRate, and final-only grading overstates GPT-5.5 by 20.2 points.

World Models · Alibaba Qwen Team

ABot-Earth 0.5: Generating 3D Cities From Satellite Images

ABot-Earth 0.5 uses satellite imagery to generate 3D Gaussian Splatting city scenes, reporting under 10 minutes per square kilometer and FID 16.1.

World Models · JD.com (Joy Future Academy)

Echo-Memory: Which Memory Lets a World Model Remember a Room?

When a camera revisits an old spot, block-wise state-space recurrence scored 69.0 open-domain VLM consistency vs 12.25 for the no-memory baseline; aggressive compression and spatial summaries mostly collapsed.

AI Agents · Independent Researcher

SpatialWorld: Interactive Spatial Reasoning for Agents

SpatialWorld turns interactive spatial reasoning into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Long Context · Tencent

FlashMemory-DeepSeek-V4: Cutting KV Cache to 13.5% for 500K Context

FlashMemory-DeepSeek-V4 keeps only the KV chunks a neural indexer predicts you will need, shrinking physical KV cache to 13.5% of full-context decoding while accuracy stays flat or edges up ~0.6%.

World Models · Microsoft Research

Mirage: Latent Spatial Memory Makes Video World Models 10x Faster

Mirage stores a video world model's 3D memory inside diffusion latent space instead of an RGB point cloud, hitting state-of-the-art WorldScore (70.36) while running 10.57x faster and using 55x less GPU memory.

Video Generation · Nanjing University

CoVEBench: Can Video Editors Follow Complex Instructions?

CoVEBench turns complex instruction following for video editing into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Multimodal Models · HKUST

Robust-U1: MLLMs Recover Corrupted Images First

Robust-U1 trains an MLLM to reconstruct corrupted visual content, reaching 0.7398 overall on R-Bench versus 0.5770 for BAGEL and 0.5017 for Robust-R1.

Multimodal Models · Ant Group

MemDreamer: Hierarchical Graph Memory for Long Video Understanding

MemDreamer turns hours-long video QA into agentic retrieval over a 3-tier graph memory, lifting LVBench from 78.2 to 90.7 (+12.5) while the reasoning model reads ~6K tokens instead of 240K-784K.

World Models · Independent Researcher

AnchorWorld: Egocentric World Simulation for Embodied AI

AnchorWorld turns egocentric world simulation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Multimodal Models · Peking University

Watch, Remember, Reason: A Human-View Map of Video MLLMs

A survey that reframes long-video MLLMs as three abilities (watch, remember, reason), comparing against 11 prior surveys and organizing 100+ methods plus 5 application domains.

Speech Synthesis · Independent Researcher

MMAE: A Massive Benchmark for Audio Editing Models

MMAE turns audio editing evaluation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Shanghai Jiao Tong University

SWE-Explore: Can Coding Agents Find the Right Code?

SWE-Explore isolates the repo-exploration stage of coding agents over 848 issues. Agentic explorers crush BM25 (HitFile 0.65 vs 0.08), but line-level recall stalls at 0.15-0.20, and that gap is what limits repairs.

Fine-Tuning & Adaptation · HKUST

On the Geometry of On-Policy Distillation: A Distinct Update Regime

On-policy distillation does not sit between SFT and RLVR — it carves its own geometry. Its updates touch fewer weights, avoid principal directions, and lock into a narrow low-dimensional subspace early in training.

Text Embeddings · Renmin University of China

EmbFilter: Turning an LLM's UnEmbedding Matrix Into a Feature Lens

EmbFilter reads the LLM unembedding matrix as a lens, strips the subspace that ties text embeddings to high-frequency junk tokens, and lifts zero-shot retrieval while shrinking dimensions.

Reinforcement Learning · University of Zurich

RL Elicits Contextual Learning of Unseen Language Translation

RL with a chrF reward teaches an LLM to translate from in-context linguistic packets rather than memorize languages. On five unseen languages it averages 0.3335 chrF vs 0.2300 for SFT, which drops below the base model.

Video Generation · Peking University

LoomVideo: A 5B Unified Video Generator That Edits Without Concatenation

LoomVideo runs text-to-video, editing, and multi-image-to-video in one 5B model, matching 13B baselines on VBench (63.15 vs 63.01) and editing 5.41x faster by adding the source latent instead of concatenating it.

AI Agents · City University of Hong Kong

RHO: Retrospective Harness Optimization via Self-Preference

RHO tunes an LLM agent harness from past unlabeled trajectories using self-consistency and pairwise self-preference, lifting SWE-Bench Pro from 59% to 78% in one round with no external grading.

Agent Memory · National University of Singapore

MRAgent: Graph Memory That Reconstructs Instead of Retrieves

MRAgent gives LLM agents a Cue-Tag-Content memory graph and lets the model reason while it traverses it, lifting LoCoMo LLM-Judge from 68.3 to 84.2 while cutting tokens to 118k per sample.

AI Agents · Independent Researcher

AdaPlanBench: Testing Adaptive Planning in LLM Agents

AdaPlanBench turns adaptive planning under constraints into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

ArcANE: Measuring When Role-Playing Agents Break Character

ArcANE turns role-playing language agent reliability into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Text-to-Image · Independent Researcher

DIRECT: 3D-Aware Object Insertion with Visual Proxies

DIRECT turns 3D-aware object insertion into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Shanghai Jiao Tong University

LatentSkill: Bake Agent Skills Into LoRA Weights, Not the Prompt

A hypernetwork compiles a textual skill into a LoRA adapter in one forward pass. On ALFWorld, LatentSkill lifts success by 21.4 points (seen) with 64.1% fewer prefill tokens.

AI Agents · Lehigh University

OpenSkill: Self-Evolving LLM Agents With No Task Supervision

OpenSkill lets agents build skills and their own verifiers from the open web, hitting 43.6% on SkillsBench (+8.9 over the best baseline) with zero target-task answers.

AI Agents · Independent Researcher

SoCRATES: Evaluating Proactive LLM Mediation

SoCRATES turns proactive LLM mediation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

ToolMaze: When LLM Agents Must Replan After Tool Failures

ToolMaze turns dynamic replanning after tool failures into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Code Generation · University of Waterloo

Code2LoRA: Hypernetworks That Generate Repo-Specific LoRA Adapters

Code2LoRA trains a hypernetwork to emit a repo-specific LoRA adapter for a code model with no inference-time token cost: 66.2% in-repo and 63.8% cross-repo exact match, plus an Evo variant that tracks diffs with a GRU.

Vision-Language-Action · ETH Zurich

Robots Need More than VLA and World Models: Four Missing Interfaces

A position paper from ETH Zurich, Stanford and TU Darmstadt argues scaling VLA and world models is not enough — robots need four interfaces to turn unstructured human and video behaviour into grounded supervision.

Vision Foundation Models · ETH Zurich

ZipSplat: Fewer Gaussians, Better Splats

ZipSplat predicts 3D Gaussians from k-means scene tokens instead of one per pixel, hitting 24.14 PSNR on DL3DV with 249K Gaussians vs YoNoSplat's 22.01 at 1.2M, all pose-free.

AI Agents · UC Berkeley

Agents' Last Exam: Why AI Agents Still Fail at Work

Agents' Last Exam tests AI agents on 1,490 expert-built professional tasks across 55 digital industries; the hardest tier averages only 2.6% full pass.

Reinforcement Learning · Tsinghua University

CHERRL: A Controlled Sandbox for Reward Hacking in Rubric RL

CHERRL injects four known judge biases to reliably reproduce reward hacking in rubric RL; an agent reading only training logs pinned the onset with 11-step total interval error and missed none of six runs.

AI Agents · HKUST

StreamMA: Streaming Beats Waiting in Multi-Agent Reasoning

StreamMA pipes each reasoning step to the next agent the moment it is written, not after the full chain. Across 8 benchmarks it gains +7.3 pp on average (max +22.4 pp on HMMT 2026) and runs up to 26.9x faster.

AI Agents · Independent Researcher

TIDE: Proactive Multi-Problem Discovery with Templates

TIDE turns proactive problem discovery into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Multimodal Models · Independent Researcher

VideoKR: Knowledge-Intensive Video Understanding

VideoKR turns knowledge and reasoning in video understanding into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Video Generation · HKUST

Echo-Infinity: Learnable Evolving Memory for 24-Hour Real-Time Video

Echo-Infinity is an autoregressive video model with a learnable evolving memory that compresses any-length history at constant cost, hitting 24-hour rollouts (over 1.3M frames) in real time at 18.5 FPS on an H100.

Multimodal Models · Skywork AI

Audio Interaction Model: A Streaming Audio LLM That Decides When to Speak

The Audio Interaction Model runs a perceive-decide-respond loop so an audio LLM listens, decides if and when to reply, and answers on the fly; it is trained on StreamAudio-2M and is competitive across 8 benchmarks.

Biomolecular Modeling · AIRI

GENEB: Why Genomic Foundation Models Are So Hard to Compare

GENEB probes frozen representations from 40 genomic foundation models across 100 tasks in 13 functional categories, and finds rankings flip across categories while extra parameters buy only modest, inconsistent gains.

Text-to-Image · NVIDIA

Bootstrap Your Generator: Unpaired Editing That Beats Supervised Models

ByG trains image and video editing models with no paired data and no reward model, winning 75.3% of head-to-head video votes against Ditto, which was supervised on millions of pairs.

Reinforcement Learning · UCLA

SDPG: Self-Distilled Policy Gradient for Sparse-Reward RL

SDPG adds a full-vocabulary self-distillation loss to verifier RL, learning from a hint-conditioned teacher. On Qwen3-4B it lifts AIME25 from 0.242 (GRPO) to 0.327 and AMC23 from 0.714 to 0.870 last-checkpoint.

World Models · NVIDIA

NVIDIA OmniDreams: Real-Time Generative World Model for AV Simulation

OmniDreams is a Cosmos-based generative driving simulator that renders 68 FPS single-view on one GB300 and replaces NuRec in a closed-loop AV stack, cutting closed-loop incidents from 10.1% to 4.7%.

Multimodal Models · Shanghai AI Laboratory

OVO-S-Bench: Streaming Spatial Intelligence in MLLMs

OVO-S-Bench turns streaming spatial intelligence in MLLMs into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

LLM Reasoning · Shanghai AI Laboratory

ThoughtFold: Cutting 56% of Reasoning Tokens Without Losing Accuracy

ThoughtFold trims the redundant reasoning of DeepSeek-R1-Distill-Qwen-7B by about 56% of tokens while keeping accuracy on AIME, MATH-500, and GPQA-Diamond intact, using a masked preference objective.

World Models · University of Macau

PF-OPSD: When Should an MLLM Trust a World Model's Video?

PF-OPSD teaches a Qwen3.5-9B MLLM to decide when to simulate the future with a video world model, verify the rollout, and fold it into its answer, lifting accuracy +10.6 and +10.9 points on two new QA benchmarks.

Robotics · Tsinghua University

Humanoid-GPT: A GPT-Style Transformer for Humanoid Motion Tracking

Humanoid-GPT treats humanoid control like language modeling: a causal Transformer distilled from ~384 PPO experts on a 2-billion-frame corpus, 200x the prior data. It hits 92.58% sim success at under 1.5ms.

Language Models · Google Research

Language Models Need Sleep: A Consolidate-and-Dream Recipe

Google Research argues LLMs need an offline sleep phase to turn short-term context into stable weights. With sleep, Qwen3-8B hits 79.2% on AIME-24 and a Transformer reaches 80% on ARC few-shot, beating SEAL.

Text-to-Image · Alibaba Qwen Team

Qwen-Image-Flash: Beyond Objective Design in Few-Step Distillation

Qwen-Image-Flash distills Qwen-Image-2.0 to 4 sampling steps for both text-to-image and editing. The Alibaba Qwen team shows the training recipe — data, teachers, task mix — matters as much as the distillation objective.

Multimodal Models · University of Washington

Imaginative Perception Tokens: Letting VLMs Picture Space, Not Describe It

Imaginative Perception Tokens (IPT) make a VLM render a new viewpoint instead of reasoning in text — lifting multiview counting 3.4%, rivaling closed models on path tracing, while text chain-of-thought sometimes hurts.

Efficient AI · Huawei

KVarN: 2-Bit KV-Cache Quantization Without Calibration

KVarN compresses the KV-cache to 2 bits with no calibration data, using a Hadamard rotation plus dual-axis variance normalization to stop quantization errors from snowballing across long reasoning chains.

Vision-Language-Action · X Square Robot

WALL-WM: Event-Grounded World Action Modeling for Robots

WALL-WM organizes VLA pretraining around semantic action events, not fixed-length chunks. Its event mode scores 75.86 Task Progress on diverse real-robot manipulation versus 55.64 for pi0.5.

AI Agents · Nanjing University

DRIFT: Pinpointing Where Deep-Research Agents Go Wrong

TELBench asks models to find the span that broke a 12-step research trajectory. DRIFT audits claims against evidence, lifting macro-F1 to 54.91% with Claude-Sonnet-4.6, up to 30 points over raw inspection.

AI Agents · University of Illinois Urbana-Champaign

Harness-1: Move Search-Agent Bookkeeping Out of the Policy

Harness-1 is a 20B RL search agent that hands working memory to the environment, hitting 0.730 average curated recall and beating the next open subagent by +11.4 points.

AI Agents · Independent Researcher

K-BrowseComp: Korean Web-Browsing Agent Benchmark

K-BrowseComp turns Korean-context web-browsing agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Reinforcement Learning · Tianjin University

Why Multi-Domain RL Forgets, and How a Math Refresh Heals It

When you RL-tune an LLM across math, code, QA, and writing in sequence, math drops from 66.49 to 57.66 even though gradients look orthogonal. A short math refresh pulls it back to 66.04 without wrecking the other three.

Video Generation · Kuaishou Technology

VLM Teachers Score Video-Model Reasoning at Test Time

Instead of asking a video model to reason directly, a VLM grades its in-progress frames and fine-tunes a per-instance LoRA. The trick lifts RULER-Bench from 46.4 to 68.2.

Multimodal Models · The Chinese University of Hong Kong

X-Stream: Why MLLMs Score ~50% on Multi-Stream Video

X-Stream is the first benchmark for watching several live video streams at once. The best model, Gemini 3 Pro, hits 49.6% versus a 91.84% human baseline, and proactive ability collapses below 21%.

Multimodal Models · NVIDIA

Cosmos 3 Explained: NVIDIA's Omnimodal World Model for Physical AI

Cosmos 3 packs language, image, video, audio, and robot actions into one mixture-of-transformers model; NVIDIA reports it ranks first among open models on text-to-image, image-to-video, and RoboArena policy.

Fine-Tuning & Adaptation · Mind Lab

Scaling PEFT: Toward a Million Personal Models on One Base

A position paper reframing LoRA adapters as persistent personal state, not a cheap full-finetune substitute, across three axes: scale up the base, scale down the adapter, scale out to millions. It also proposes a serving stack, MinT.

AI Agents · Ant Group

SkillAdaptor: How LLM Agents Rewrite Their Own Skills

SkillAdaptor edits an agent's skill library from failed trajectories without touching model weights, lifting WebShop score +2.3 and PinchBench +1.5 over the frozen backbone.