Goedel-Architect: 99.2% on MiniF2F — Open-Source, 500× Cheaper (2026)

arXiv:2606.06468 · June 4, 2026 · Princeton / NVIDIA

Goedel-Architect is a new agentic framework for formal theorem proving in Lean 4, built around a single idea, the blueprint: a global dependency graph of lemmas that is generated, proved in parallel, and globally rewritten when proofs fail. The authors report the strongest open-source pass@1 performance on MiniF2F and PutnamBench to date, at a cost that undercuts comparable pipelines by up to 500×.

What Happened

The Goedel-LM team (Princeton, NVIDIA, collaborators) released Goedel-Architect on June 4, 2026. With the open-weight DeepSeek-V4-Flash (284B-A13B) as backbone, the paper reports:

MiniF2F-test: 99.2% pass@1 (242/244) in autonomous mode; 100% with optional natural-language proof seeding — the first pipeline to close all 244 problems.
PutnamBench: 75.6% pass@1 autonomous; 88.8% pass@4 (597/672) with NL seeding — surpassing Hilbert's 70.0% (which required ~$163,000 vs. Goedel-Architect's ~$294).
Competition results (NL-seeded mode): 4/6 IMO 2025, 11/12 Putnam 2025, 3/6 USAMO 2026.
Full pipeline and model weights open-sourced at github.com/Goedel-LM/Goedel-Prover-V2.

Prior state of the art on PutnamBench, for comparison: Seed-Prover 1.5 at 87.9% (NL-seeded, closed-source backbone).

How It Works

Most theorem-proving pipelines decompose a goal recursively, building a top-down tree that can loop on dead-end strategies. Goedel-Architect instead iterates over one global plan in three stages:

Blueprint generation: Given only the target theorem statement, a planner builds a blueprint — a DAG of formally stated definitions and lemmas, with all proof bodies left empty. An optional NL proof can seed the structure.
Parallel proving: Each open leaf node in the DAG is dispatched to a Lean prover that can only use that node's declared dependencies. The prover has access to the Lean compiler and Mathlib search. Success freezes the node; failure emits structured diagnostics (and counterexamples if the statement is wrong).
Blueprint refinement: Failed nodes drive global blueprint rewrites: split an over-hard lemma, fix a misformalized statement, or add auxiliary lemmas. Already-proved nodes are preserved. The loop continues until all nodes are closed or an iteration cap is hit.
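
To make the blueprint idea concrete, here is a minimal Lean 4 sketch of what one might look like for a toy target. It is an illustration only and is not taken from the paper: the namespace, lemma names, and the choice of target are all hypothetical. Every statement is formal, and every proof body is left open with sorry, which is the state a blueprint is in before the parallel proving stage begins.

```lean
import Mathlib

namespace BlueprintSketch  -- hypothetical namespace, avoids clashes with Mathlib names

/-- Leaf node: no declared dependencies. -/
lemma sq_sub_nonneg (a b : ℝ) : 0 ≤ (a - b) ^ 2 := by
  sorry

/-- Leaf node: no declared dependencies. -/
lemma sq_sub_expand (a b : ℝ) :
    (a - b) ^ 2 = a ^ 2 - 2 * a * b + b ^ 2 := by
  sorry

/-- Target node: declared dependencies are the two lemmas above. -/
theorem two_mul_le_sq_add_sq (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  sorry

end BlueprintSketch
```

In this toy, the two leaf lemmas could be handed to independent provers at the same time. If one of them turned out to be misstated or too hard, the refinement stage would rewrite the graph while leaving any already-proved node untouched.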

The key mechanism: global blueprint rewriting avoids the dead-end recursion that plagues tree-based approaches. The blueprint is also the shared system of record — no hidden state outside it.
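
The generate, prove, refine loop described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the released pipeline: the Node class, the difficulty field, toy_prover, and refine are all hypothetical stand-ins, the prover is a stub that never calls Lean, and "open leaf" is read here as an unproved node whose dependencies are all already proved.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Node:
    """One blueprint entry: a formal statement plus its declared dependencies."""
    name: str
    statement: str
    deps: list = field(default_factory=list)
    difficulty: int = 1   # hypothetical stand-in for how hard the proof is
    proved: bool = False


def toy_prover(node, dep_statements):
    """Stub for the Lean prover; it is given only the declared dependencies.
    It 'succeeds' only on easy nodes. A real prover would invoke Lean."""
    return node.difficulty <= 1


def ready_nodes(blueprint):
    """Open nodes whose dependencies are all proved (our reading of 'open leaf')."""
    return [n for n in blueprint.values()
            if not n.proved and all(blueprint[d].proved for d in n.deps)]


def refine(blueprint, failed):
    """Global rewrite reduced to a single move: split each over-hard lemma by
    adding an auxiliary lemma. Already-proved nodes are never modified."""
    for node in failed:
        aux = Node(name=f"{node.name}_aux{len(blueprint)}",
                   statement=f"helper for {node.name}",
                   deps=list(node.deps),
                   difficulty=node.difficulty - 1)
        blueprint[aux.name] = aux
        node.deps.append(aux.name)
        node.difficulty = 1  # with the helper available, the node becomes easy


def run_blueprint_loop(blueprint, max_iters=10):
    """Prove ready nodes in parallel, refine on failure, stop when closed or capped."""
    for _ in range(max_iters):
        if all(n.proved for n in blueprint.values()):
            return True
        batch = ready_nodes(blueprint)
        with ThreadPoolExecutor() as pool:
            outcomes = list(pool.map(
                lambda n: toy_prover(n, [blueprint[d].statement for d in n.deps]),
                batch))
        failed = []
        for node, ok in zip(batch, outcomes):
            if ok:
                node.proved = True   # success freezes the node
            else:
                failed.append(node)
        refine(blueprint, failed)
    return all(n.proved for n in blueprint.values())


def toy_blueprint():
    """Two-node example: one easy lemma and one over-hard target."""
    return {
        "lemma_a": Node("lemma_a", "easy fact"),
        "target": Node("target", "main theorem", deps=["lemma_a"], difficulty=3),
    }


if __name__ == "__main__":
    bp = toy_blueprint()
    print(run_blueprint_loop(bp), sorted(bp))
```

In the toy run, the target fails, gets split twice, and the loop closes every node within the iteration cap. The point of the design is visible even at this scale: all state lives in the blueprint dictionary, and a failure changes the plan itself rather than pushing another frame onto a recursive search.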

Claim Audit

Dimension	Assessment
Benchmark numbers	MiniF2F 99.2% pass@1 and PutnamBench 75.6% pass@1 as reported. Independently reproducing full runs is expensive but methodology is auditable.
Verification method	Lean 4 kernel + Mathlib throughout. All accepted proofs compile clean. Zero sorry in final output. ✓ Kernel-verified.
Autonomy level	Fully autonomous mode exists (target statement only). Competition scores (IMO/Putnam/USAMO) use NL-seeded mode — not autonomous.
Cost claim	~$294 for the full PutnamBench run (~$0.44/problem). Hilbert: ~$163,000. AxProverBase (Claude Opus 4.5): $8,467. The 500× claim is measured against Hilbert; against AxProverBase the gap is ~29×.
Statement faithfulness	Not independently audited in the paper. Auto-formalization from NL proofs introduces faithfulness risk — the paper does not report domain-expert re-checking beyond the automated pipeline.
Openness	Full pipeline + DeepSeek-V4-Flash weights open. No closed API dependency in the default setup.
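
The cost ratios in the table can be rechecked with a few lines of arithmetic. The sketch below uses only the dollar figures and the 672-problem count quoted in this article; the constant and function names are made up for illustration, and the inputs are the approximate values as reported, so the outputs are approximate as well.

```python
# Figures as quoted in this article (USD, approximate).
GOEDEL_ARCHITECT_COST = 294
HILBERT_COST = 163_000
AXPROVERBASE_COST = 8_467
PUTNAMBENCH_PROBLEMS = 672  # denominator of the reported 597/672


def cost_ratio(baseline, ours=GOEDEL_ARCHITECT_COST):
    """How many times more expensive the baseline run is than ours."""
    return baseline / ours


per_problem = GOEDEL_ARCHITECT_COST / PUTNAMBENCH_PROBLEMS

if __name__ == "__main__":
    print(f"per problem:     ${per_problem:.2f}")
    print(f"vs Hilbert:      {cost_ratio(HILBERT_COST):.0f}x")
    print(f"vs AxProverBase: {cost_ratio(AXPROVERBASE_COST):.0f}x")
```

On these inputs the Hilbert ratio comes out slightly above 550×, which the headline rounds down to 500×, and the AxProverBase ratio is just under 29×.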

⚠️ Caveat: Competition results (IMO 4/6, Putnam 11/12, USAMO 3/6) use NL-seeded mode. Fully autonomous numbers on competition-level problems are lower. The distinction matters for anyone asking whether the AI proved a result without human proof hints.

Primary Sources

Paper: arXiv:2606.06468 — Goedel-Architect: Streamlining Formal Theorem Proving with Blueprint Generation and Refinement
Code: github.com/Goedel-LM/Goedel-Prover-V2
