1/3

TheoremBench: Classical Math Exposes New Gaps in Lean 4 Provers

TheoremBench (arXiv:2606.09450, Jun 8 2026) introduces a new Lean 4 benchmark built from ~100 classical theorems (Wiedijk list), expanded into 1,142 instances in two formats: plain-main and premised (with explicit supporting subtheorems). Key result: explicit premises produce a 5× pass@64 lift for DeepSeek-Prover-V2-7B (5.3% → 26.3%) but zero gain for the non-reasoning SFT baseline. New finding: provers write 8–16× more tokens than reference proofs — brute-force tactic traces, not compact proof plans. Lean 4 kernel verification throughout. Single-lab evaluation; no SOTA agentic systems (LEAP, Goedel-Architect full scale) tested.

2026/6/12 · 16:25

ギャラリー

A new Lean 4 benchmark from Skoltech, HSE University, AIRI, and Sberbank targets the regime between competition-style sprints and real-world Lean projects — classical mathematical theorems. The results reveal that current specialized provers struggle even more than competition numbers suggest.

What happened

TheoremBench (arXiv:2606.09450, submitted June 8 2026) introduces a Lean 4 benchmark built from ~100 classical theorems drawn from Freek Wiedijk's canonical list, expanded into 1,142 instances across two complementary dataset views:
  • Plain-main: one standalone target theorem per instance
  • Premised: each theorem expanded into a structured group including the main theorem plus automatically extracted supporting subtheorems, with prior results exposed as explicit premise binders

How it works

The benchmark construction pipeline parses raw Lean 4 formalizations of classical results, reconstructs required compilation context for each extracted instance, and verifies every instance against the Lean 4 kernel before inclusion. The "premised" format converts relevant prior Lean results from surrounding developments into explicit binder assumptions in theorem declarations — producing self-contained snippets that can be solved without reconstructing the full file context.
Four provers were evaluated: DeepSeek-Prover-V2-7B, Goedel-Prover-V2-8B, Kimina-Prover-Distill-8B, and Goedel-Prover-SFT (non-reasoning baseline). All candidates were sampled up to k=64, with every attempt checked by Lean 4.
New metrics introduced:
  • Theorem-level coverage: fraction of supporting subtheorems proved within a parent theorem group
  • Token-efficiency: ratio of generated proof tokens to ground-truth proof tokens (measures verbosity)

Key results

統計カードを読み込んでいます…
ModelPlain-main pass@64Premised pass@64Median token-efficiency
DeepSeek-Prover-V2-7B5.3%26.3%7.8×
Goedel-Prover-V2-8B5.3%12.3%16×
Kimina-Prover-Distill-8B5.3%7.0%3.6×
Goedel-Prover-SFT3.5%3.5%1.44×
Explicit premises produce a 5× lift for DeepSeek-Prover-V2-7B (5.3% → 26.3%) but essentially no gain for the non-reasoning SFT model — confirming the benefit is model-dependent, not mechanical.
Token-efficiency ratios are striking: Goedel-Prover-V2-8B generates proofs with a median 16× the token count of the reference proof. These are valid Lean proofs, but they are padded, inefficient tactic traces rather than compact proof plans.

Claim audit

DimensionAssessment
VerificationLean 4 kernel throughout — machine-checkable, no human review substitution
Benchmark typeClassical theorems (Wiedijk list) — structural departure from miniF2F / PutnamBench competition style
AutonomyFully automated evaluation; no human-assisted proof attempts
Coverage~100 parent theorems → 1,142 instances; algebra, number theory, analysis, topology, combinatorics, probability
IndependenceSingle lab evaluation (Skoltech/HSE/AIRI/Sberbank) — no external replication yet reported
Key limitationModels tested are smaller parameter variants (7B–8B); frontier agentic systems (LEAP, Goedel-Architect at full scale) not evaluated

Context

TheoremBench sits between two already-covered benchmarks on this channel:
  • MiniF2F / PutnamBench: competition problems, typically self-contained, Goedel-Architect already hits 99.2% and 88.8% respectively
  • SorryDB: real-world sorry-closure from live Lean 4 projects, where Kimina hits only 1.0% pass@1
TheoremBench fills the gap: classical mathematical developments with dependency structure, not adversarially crafted competition problems and not messy real-world sorry holes. The premised variant mirrors how a real Lean development is structured — you have intermediate lemmas, and a prover that cannot exploit them is navigating blind.
The token-efficiency finding adds a new diagnostic axis: passing is necessary but not sufficient. Provers that write 16× the reference proof length are using brute-force tactic enumeration, not genuine proof understanding.

Sources

コメント