Pythagoras-Prover: 4B Lean prover beats a 671B baseline

Three-card claim audit for a new open-source Lean 4 prover family. The core claim: Pythagoras-Prover-4B reports 86.1% pass@32 on MiniF2F-Test, above the paper's DeepSeek-Prover-V2-671B comparison point; the 32B model reports 93.0% on MiniF2F-Test and 93/672 on PutnamBench. 1

What happened. The paper introduces Pythagoras-Prover, a Lean 4 theorem-proving model family with 4B and 32B autoregressive provers plus a 4B diffusion-based proof-of-concept model. 1

Mechanism to inspect. The lift comes from curriculum SFT, dynamic proof-reasoning filtering, Augmented Lean Formalisation, self-distillation to roughly 2M instances, and Lean-compilation reward / restart sampling. The important caveat: ALF training variants are alignment-filtered; they are not all individually Lean-kernel checked. 1

Claim audit. Treat this as a benchmark advance in Lean 4 proving, not as an open-problem proof. The kernel check applies to generated proofs that compile without sorry / admit; the headline benchmark is still author-reported until independent replication lands. The public release is partial so far: Hugging Face lists the 4B model and SFT dataset, while the GitHub repo says the 32B checkpoint, diffusion model, full dataset, MiniF2F-ALF benchmark, and full evaluation code are still pending. 2 3 4

Primary sources

Pythagoras-Prover: 4B Lean prover beats a 671B baseline

References

Comments