Retrieval-Augmented Generation

Grounding language model outputs in retrieved documents to improve factuality and freshness.

Rethinking RAG in Long Videos: V-RAGBench and CARVE Explained

CARVE picks a modality-granularity config per video chunk instead of per query, lifting Recall@5 to 0.603 from a best baseline of 0.510 on V-RAGBench, a leak-filtered egocentric VideoRAG benchmark.

AI Agents · Ant Group

SearchSwarm: Delegation Intelligence for Deep Research

SearchSwarm fine-tunes Tongyi DeepResearch-30B-A3B on harness-generated delegation trajectories, lifting BrowseComp from 43.4 to 68.1 and topping every 30B-A3B model on four deep-research benchmarks.

Agent Memory · National University of Singapore

MRAgent: Graph Memory That Reconstructs Instead of Retrieves

MRAgent gives LLM agents a Cue-Tag-Content memory graph and lets the model reason while it traverses it, lifting LoCoMo LLM-Judge from 68.3 to 84.2 while cutting tokens to 118k per sample.

AI Agents · Renmin University of China

FORT-Searcher: Training Search Agents Without Shortcuts

FORT-Searcher trains a 3B-active search agent on shortcut-resistant tasks, scoring 66.2 overall among comparable open agents and delaying answer hit time from 18.7 to 46.9 versus REDSearcher data.

Text Embeddings · Microsoft Research

E5: Weakly-Supervised Contrastive Text Embeddings

E5 turns general-purpose text embeddings into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

AI Agents · University of Illinois Urbana-Champaign

Harness-1: Move Search-Agent Bookkeeping Out of the Policy

Harness-1 is a 20B RL search agent that hands working memory to the environment, hitting 0.730 average curated recall and beating the next open subagent by 11.4 points.

AI Agents · Independent Researcher

K-BrowseComp: Korean Web-Browsing Agent Benchmark

K-BrowseComp turns Korean-context web-browsing agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Theorem Proving · Princeton University

LeanDojo: Retrieval-Augmented Language Models for Theorem Proving

LeanDojo turns retrieval-augmented theorem proving in Lean into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

AI Agents · Independent Researcher

When Masking Stale Observations Helps Search Agents

This piece turns context management for search agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Xiamen University

SAAS: Teaching Search Agents When to Stop Searching

SAAS uses self-aware RL to cut a Qwen2.5-7B search agent's average queries from 2.19 to 0.97 per question, while keeping accuracy near the best baseline (48.7% vs 49.8%).

Text Embeddings · Independent Researcher

Sentence-BERT: Sentence Embeddings with Siamese BERT

Sentence-BERT turns sentence embeddings for semantic similarity into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

AI Agents · Shanghai Jiao Tong University

SWE-Explore: Can Coding Agents Find the Right Code?

SWE-Explore isolates the repo-exploration stage of coding agents over 848 issues. Agentic explorers crush BM25 (HitFile 0.65 vs 0.08), but line-level recall stalls at 0.15-0.20, and that gap is what limits repairs.

Retrieval-Augmented Generation · Universidad de San Andres

Active Learners as Efficient PRP Rerankers: Fewer LLM Calls

Treating pairwise LLM reranking as active learning, a tournament selector hits 68.00 NDCG@10 on TREC DL while cutting LLM calls 3-5x versus sorting-based PRP. A randomized-direction oracle also debiases in one call.

AI Agents · University of Waterloo

Direct Corpus Interaction: Letting Agents grep Instead of a Retriever

Direct Corpus Interaction (DCI) lets a search agent grep the raw corpus instead of calling a retriever. On BrowseComp-Plus it lifts accuracy from 69.0% to 80.0% while cutting cost 29.4%.

Multimodal Models · Shanghai AI Laboratory

CiteVQA: A Benchmark That Catches Document AI Citing the Wrong Evidence

CiteVQA makes document QA models return bounding-box citations with every answer. The top model scores 76.0 Strict Attributed Accuracy; the best open model just 22.5. Most answer right but cite the wrong region.

Retrieval-Augmented Generation · University of Massachusetts Amherst

GrepSeek: Search Agents That grep the Corpus Instead of a Vector Index

GrepSeek trains an LLM to answer questions by issuing shell commands like grep against the raw corpus — no embedding index — and posts the best F1 and Exact Match across seven open-domain QA benchmarks.

Retrieval-Augmented Generation · AIRI

OCC-RAG: Small Models Built Only to Read Context Faithfully

OCC-RAG is a pair of 0.6B and 1.7B reasoning models trained to answer strictly from the given context and refuse when the answer isn't there — matching or beating general models 2-6x their size on multi-hop QA.

Retrieval-Augmented Generation · Meta AI

RAG (2020): The Paper That Named Retrieval-Augmented Generation

The original RAG paper bolts a dense Wikipedia retriever (DPR) onto a BART seq2seq generator, sets a new state of the art on three open-domain QA tasks, and updates knowledge by swapping the index, with no retraining.
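
The retrieve-then-generate pattern and the "swap the index" idea can be sketched in a few lines. This is a toy illustration, not the paper's method: the term-overlap scorer stands in for a dense retriever such as DPR, the prompt string stands in for the input handed to a generator, and the names `tokenize`, `retrieve`, `build_prompt`, `index_v1`, and `index_v2` are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokens with basic punctuation stripped."""
    tokens = (t.strip(".,?!").lower() for t in text.split())
    return [t for t in tokens if t]


def retrieve(query, index, k=2):
    """Return up to k passages ranked by term overlap with the query.

    Assumption: a toy lexical scorer replaces the dense retriever.
    Passages with zero overlap are dropped.
    """
    q = Counter(tokenize(query))
    scored = []
    for passage in index:
        p = Counter(tokenize(passage))
        score = sum(min(q[t], p[t]) for t in q)
        scored.append((score, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for score, passage in scored[:k] if score > 0]


def build_prompt(query, passages):
    """Assemble the grounded input a generator model would receive."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Two versions of the index; the "model" (here, just the prompt builder) is unchanged.
index_v1 = [
    "The capital of Australia is Canberra.",
    "BART is a seq2seq model.",
]
index_v2 = index_v1 + [
    "DPR encodes questions and passages with two BERT encoders.",
]

question = "How does DPR encode passages?"
print(build_prompt(question, retrieve(question, index_v1)))
print(build_prompt(question, retrieve(question, index_v2)))
```

With `index_v1` nothing relevant is retrieved, so the prompt carries no context. With `index_v2` the new passage shows up in the context even though no code or weights changed, which is the sense in which knowledge is updated by swapping the index.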