Theorem Proving · AI for Science · LLM Reasoning

AI Formal Proof Search for Open Math Problems

This work evaluates AI-aided formal proof search on open math problems: the strongest agent resolves 9 of 353 Erdős problems and proves 44 of 492 OEIS conjectures.

Quick answer

This paper asks a sharper question than olympiad theorem proving: can AI-assisted formal proof search solve open mathematical problems? The headline result is that the strongest agent resolves 9 of 353 open Erdős problems and proves 44 of 492 OEIS conjectures. The paper also reports per-problem costs of a few hundred dollars for that system.

Why this paper matters now

The contribution is easy to misread from the title alone, so it helps to pin down three things: how the method works, which numbers are the real results rather than hype, and which caveats apply before anyone relies on it. The practical question is not only what the authors built, but what new behavior becomes possible and where the claim stops.

How the method works

The system alternates LLM-driven generation with formal verification in proof assistants. The formal checker turns proof search from persuasive writing into an executable loop: propose a lemma or proof, verify it, use the failure to guide the next attempt. The authors compare more capable agents against a simpler baseline that alternates generation and verification, then evaluate on curated open problems rather than only textbook or contest tasks.
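The propose, verify, retry loop described above can be sketched in a few lines. This is a toy under stated assumptions, not the paper's system: the "proof assistant" is replaced by an exact arithmetic check, the "LLM" by a midpoint guesser, and the names `Verdict`, `verify`, `propose`, and `search` are hypothetical. The only points it illustrates are that the checker's verdict is binding and that its feedback steers the next attempt within a fixed budget.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Verdict:
    ok: bool
    feedback: str  # checker message handed back to the proposer


def verify(candidate: int, target: int) -> Verdict:
    """Toy stand-in for a proof assistant: accepts only an exact integer square root."""
    square = candidate * candidate
    if square == target:
        return Verdict(True, "accepted")
    return Verdict(False, "too small" if square < target else "too large")


def propose(lo: int, hi: int) -> int:
    """Toy stand-in for the LLM proposer: guess the midpoint of the live range."""
    return (lo + hi) // 2


def search(target: int, budget: int) -> Tuple[Optional[int], int]:
    """Alternate proposal and verification; return (witness or None, attempts used)."""
    lo, hi = 0, target
    for attempt in range(1, budget + 1):
        if lo > hi:
            # The search space is exhausted: no witness exists.
            return None, attempt - 1
        candidate = propose(lo, hi)
        verdict = verify(candidate, target)
        if verdict.ok:
            return candidate, attempt
        # Use the failure message to narrow the next proposal.
        if verdict.feedback == "too small":
            lo = candidate + 1
        else:
            hi = candidate - 1
    return None, budget  # budget spent without an accepted candidate


if __name__ == "__main__":
    print(search(144, budget=20))  # enough budget to find the witness
    print(search(144, budget=3))   # budget runs out first
```

The budget parameter mirrors the article's point about cost: the same loop can succeed or fail depending only on how many verified attempts it is allowed.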

Key results

  • The strongest agent resolves 9/353 open Erdős problems.
  • It proves 44/492 OEIS conjectures, giving a second large-scale test outside olympiad-style theorem proving.
  • A basic LLM-plus-verifier loop can replicate the Erdős successes but is costlier on the hardest problems.
  • The system is reported to be in use in combinatorics, optimization, graph theory, algebraic geometry, and quantum optics research.
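To put the two headline counts on a common scale, the arithmetic below converts them to success rates. The counts come from the results listed above; the helper name `success_rate` is only illustrative. The rates work out to roughly 2.5% and 8.9%.

```python
def success_rate(solved: int, total: int) -> float:
    """Return the share of solved problems as a percentage."""
    return 100.0 * solved / total


# Counts as reported for the strongest agent.
results = {
    "Erdős problems": (9, 353),
    "OEIS conjectures": (44, 492),
}

if __name__ == "__main__":
    for name, (solved, total) in results.items():
        print(f"{name}: {solved}/{total} = {success_rate(solved, total):.1f}%")
```

The two rates are not directly comparable, since the problem sets differ in difficulty and in how they were curated.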

My honest read

The useful contribution is the evaluation regime. Many math-AI papers stop at polished benchmarks; this one pushes into open-problem territory where false confidence is expensive. The numbers are modest but meaningful: 9/353 is not automation of mathematics, yet each verified open result is qualitatively different from scoring another benchmark point.

Limits and open questions

The headline successes depend on formalizable problem statements and on search budgets. A few hundred dollars per problem is plausible for research triage but not casual exploration. Formal verification prevents invalid proofs, but it does not remove the work of choosing the right definitions, importing the right libraries, and deciding whether a resolved variant matches the informal mathematical intent. A second open question is reproducibility: many of these systems depend on data scale, hidden engineering choices, or evaluation protocols that are hard to replicate exactly. For readers, the safe takeaway is to treat the reported numbers as evidence for the paper’s setting, not as a guarantee that the method will transfer unchanged to every downstream product.
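The gap between a verified statement and the informal intent can be shown with a small Lean 4 example. It is illustrative only and is not taken from the paper: subtraction on natural numbers truncates at zero, so the checker accepts a statement that looks false under ordinary arithmetic.

```lean
-- On Nat, subtraction truncates at zero, so this is accepted.
example : (2 : Nat) - 3 = 0 := by decide

-- On Int, the same expression has the value most readers intend.
example : (2 : Int) - 3 = -1 := by decide
```

Both statements are formally correct. Whether the first one "resolves" anything depends on whether natural-number subtraction is what the original problem meant, and that judgment still falls to a human reader.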

What to compare next

The right follow-up comparison is not simply the newest paper with a bigger model. Compare the evaluation target (open problems versus contest or textbook tasks), the search budget, and the cost of a wrong answer. A method that wins on a curated benchmark can still fail when problems are harder to formalize or when the budget per problem is smaller. For this paper, the most useful next read is a work that stresses the same bottleneck from another angle: scaling, verification, interpretability, latency, or real-world deployment. That comparison keeps the result grounded and prevents the page from becoming a one-paper advertisement.

Practical takeaway

For builders, the immediate takeaway is to copy the evaluation habit before copying the architecture. Identify the bottleneck the paper actually attacks, choose a baseline that stresses that bottleneck, and report the failure cases with the same visibility as the wins. That is the difference between using the paper as research evidence and using it as a slogan.

FAQ

What is AI Formal Proof Search for Open Math Problems?

AI Formal Proof Search for Open Math Problems is the title this page uses for the paper, not the name of a single model. In one sentence: it evaluates agents that alternate LLM-driven proof generation with formal verification in proof assistants, tested on curated open problems (Erdős problems and OEIS conjectures) rather than only textbook or contest tasks.

What number should I remember from this paper?

Two numbers: 9 of 353 open Erdős problems resolved and 44 of 492 OEIS conjectures proved by the strongest agent, at a reported cost of a few hundred dollars per problem. They matter because they are specific enough to compare against future work, unlike vague claims of better quality or stronger performance.

Who should read this paper?

Read it if you track theorem proving research, need a concrete reference point for open-problem evaluation, or want to see how far an LLM-plus-verifier loop gets outside contest tasks. Skip it if you only need a production-ready recipe; the limits still matter.
