GSM8K dead at 29 months, four new benchmarks land: the lifecycle read for June 5-11

GSM8K hit its effective ceiling at 97% in early 2024, 29 months after launch. This week's proposals include Agents' Last Exam (2.6% average pass rate on real professional tasks), Lean-IMO-Bench (formal math, <10% to 70% debut jump by proposing team), UPBench (urban planning reasoning), and Harness-Bench (scaffolding effect isolation). Plus: a new paper showing 51.9% of multi-reporter benchmark scores disagree by more than 5 points.

Benchmark Lifecycle Tracker
2026/6/12 · 3:30

Research Brief

GSM8K, released in October 2021, crossed 97% accuracy in early 2024: a grade-school math benchmark with a 29-month lifespan from first publication to effective ceiling. That event is not surprising in isolation. What makes it the right opening for this channel is the timing: in the same week that a new benchmark lands with a 2.6% average pass rate, the research community is explicitly building infrastructure to track why scores mean different things to different people. The lifecycle machinery is accelerating at both ends.

This week's lifecycle event

Agents' Last Exam (ALE) landed on arXiv June 3, 2026 from UC Berkeley's RDI, with input from MIT, Harvard, Stanford, Goldman Sachs, JPMorgan, and others [1]. The average full pass rate across mainstream agent configurations is 2.6%. The top performer, Codex running on GPT-5.5, reached 26.2%.
The capability gap ALE targets is specific: existing agent benchmarks test isolated tasks or short-horizon Q&A, and none of them require agents to complete end-to-end, verifiable professional deliverables with real economic stakes. ALE anchors 1,490 tasks to the O*NET/SOC 2018 U.S. occupational taxonomy, spans 55 industries in 13 clusters, and scores by whether a final deliverable is produced correctly - not by whether the agent "got close." Tasks interleave GUI interaction and CLI operations. The 26.2% top score is the reported figure as of v1 and has not yet been independently replicated on a third-party harness.
Why this benchmark probably has staying power: the scoring discipline (verifiable output, not partial credit on intermediate steps) makes inflation difficult without genuine capability. The contamination risk is low - tasks are built from real professional workflows, not from static text corpora models could have memorized.

GSM8K: saturation reviewed

Released: October 2021 (OpenAI)
Effective saturation: Early 2024, when frontier models consistently exceeded 97% accuracy
Lifespan: ~29 months
Route to ceiling: 8% (GPT-3 base at launch, November 2021) to 58% (January 2022) to 92%+ (early 2023) to 97%+ (2024) [2]
The benchmark contains 8,500 grade-school math word problems requiring multi-step arithmetic reasoning. For years it was used appropriately. The problem is what came after saturation: models continued to be reported on GSM8K as if it still differentiated them, and it did not. A 2024 study found that simple perturbations (GSM-Symbolic, GSM8K-Platinum) caused frontier-model accuracy to drop 5-20 percentage points, suggesting the 97%+ figure reflected some degree of distributional memorization rather than clean generalization [3]. Whether that constitutes contamination or overfitting is disputed.
A broader survey published June 7, 2026 (arXiv:2606.08728) documents the full pattern across mathematical-reasoning benchmarks: grade-school arithmetic and olympiad geometry reached effective saturation by 2025, and each generation of methods produced its own ceiling before driving the construction of harder replacements [4].
Figure: GSM8K SOTA trajectory from launch to saturation (reported scores, not independently verified on identical eval setups) [2].
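The 29-month lifespan quoted above is simple calendar arithmetic, and it is sensitive to how "early 2024" is read. The sketch below is only an illustration: the function name `months_between` is hypothetical, and treating "early 2024" as March 2024 is an assumption made here to show which reading reproduces the stated figure.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end (day of month is ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


released = date(2021, 10, 1)   # October 2021, as stated in the text
saturated = date(2024, 3, 1)   # ASSUMPTION: "early 2024" read as March 2024

lifespan_months = months_between(released, saturated)
print(f"GSM8K lifespan: {lifespan_months} months")
```

Reading "early 2024" as January instead would give 27 months, so the headline number depends on which month the saturation call is pinned to.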

The week's other proposed benchmarks

Lean-IMO-Bench (introduced alongside LEAP, arXiv:2606.03303, June 2, 2026): formal math problems in Lean 4, derived from 60 IMO-style problems. The authors' own LEAP system raised the pass rate from under 10% to 70% in the same paper [5]. Note that the 70% figure is a debut score from the proposing team, not a third-party replication. The benchmark is new enough that its own ceiling is unknown. This pattern - a paper introduces a benchmark and immediately sets the high score on it - is worth flagging as a structural incentive problem, not necessarily as evidence of contamination in this specific case.
UPBench (arXiv:2606.11678, June 10, 2026): a 4x5 matrix benchmark for urban planning reasoning, crossing four knowledge pillars (principles, cross-disciplinary, governance, practice) against five cognitive levels adapted from Bloom's taxonomy [6]. The identified gap: existing evaluation conflates factual recall with professional judgment. Initial results show a non-monotonic cognitive curve - LLMs score 89.6% on "Remember" tasks but only 37.9% on "Evaluate" tasks, with a collapse at the "Understand" level (55.3%). The benchmark is bilingual, covering US and Chinese planning regulatory contexts.
Harness-Bench (Kili Technology blog, citing a paper, June 8, 2026): built to measure scaffolding effects rather than model ability [7]. The premise: the same model weights score 10-20 percentage points differently depending on harness design choices (retry strategy, tool call budget, context management). No existing benchmark controls for this. The ALE paper itself uses a fixed-harness comparison table, suggesting the field is beginning to take harness effects seriously.
| Benchmark | Proposed | Target gap | Initial top score | Status |
| --- | --- | --- | --- | --- |
| Agents' Last Exam (ALE) | Jun 3, 2026 | Long-horizon professional tasks with verifiable deliverables | 26.2% (Codex/GPT-5.5) | Reported; not yet independently replicated |
| Lean-IMO-Bench | Jun 2, 2026 | Formal math proof in Lean 4, IMO-style | 70% (LEAP/same paper) | Debut; proposer-reported only |
| UPBench | Jun 10, 2026 | Domain professional reasoning (urban planning) | 89.6% (Remember); 37.9% (Evaluate) | Initial; 25 models evaluated |
| Harness-Bench | Jun 8, 2026 | Benchmark harness effect isolation | N/A (measures harnesses, not models) | Active |

The meta-layer: reporting reliability

A paper published June 8, 2026 (arXiv:2606.09809) analyzed 5,816 models, 635 benchmarks, and 101,955 reported results and found that 51.9% of cases where multiple organizations reported the same (model, benchmark) pair showed score discrepancies exceeding 5 percentage points [8]. Developer-reported results filled 0.0% of minimal reproducibility fields on average, versus 16.6% for third-party reporters. The paper proposes a structured "Evaluation Cards" format to make discrepancies traceable.
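To make the discrepancy statistic concrete, here is a minimal sketch of how such a rate could be computed from a flat list of reported results. It is not the paper's method: the record layout, the function name `discrepancy_rate`, the max-minus-min definition of a discrepancy, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict


def discrepancy_rate(results, threshold=5.0):
    """Fraction of multi-reporter (model, benchmark) pairs whose reported
    scores differ by more than `threshold` percentage points.

    `results` is an iterable of (model, benchmark, reporter, score) tuples.
    ASSUMPTION: a pair counts as discrepant when max score - min score
    exceeds the threshold; the paper may define this differently.
    """
    scores = defaultdict(dict)
    for model, benchmark, reporter, score in results:
        # If one reporter lists the same pair twice, the last value wins.
        scores[(model, benchmark)][reporter] = score

    # Only pairs reported by at least two distinct organizations qualify.
    multi = [list(s.values()) for s in scores.values() if len(s) >= 2]
    if not multi:
        return 0.0
    flagged = sum(1 for vals in multi if max(vals) - min(vals) > threshold)
    return flagged / len(multi)


# Hypothetical records, invented for this example only.
sample = [
    ("model-a", "bench-x", "developer", 91.0),
    ("model-a", "bench-x", "third-party", 84.5),  # spread 6.5 -> flagged
    ("model-b", "bench-x", "developer", 70.0),
    ("model-b", "bench-x", "third-party", 68.0),  # spread 2.0 -> not flagged
    ("model-c", "bench-y", "developer", 55.0),    # single reporter -> excluded
]
rate = discrepancy_rate(sample)
print(f"{rate:.1%} of multi-reporter pairs disagree by more than 5 points")
```

Single-reporter pairs are excluded from the denominator, which matches the framing of the statistic as a share of cases where multiple organizations reported the same pair.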
The practical implication for lifecycle tracking: a saturation call based solely on developer-reported leaderboard scores is unreliable. When this channel reports a benchmark as "saturated," the qualifying question is whether independently-verified scores on a fixed eval setup confirm the ceiling - or whether the 97%+ figure is a best-of-reported-configs number from a single lab.
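The qualifying question above can be written down as an explicit rule. The sketch below is one possible reading, not this channel's actual procedure: the `Report` fields, the `is_saturated` name, the 97% ceiling default, and the requirement of at least two independent reports on the same setup are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Report:
    score: float        # accuracy in percent
    independent: bool   # True if reported by someone other than the developer
    eval_setup: str     # hypothetical label for the eval configuration


def is_saturated(reports, ceiling=97.0, min_independent=2):
    """Call saturation only if at least `min_independent` independent reports
    on one shared eval setup reach the ceiling.

    ASSUMPTION: both thresholds are illustrative defaults, not published rules.
    """
    confirmed_by_setup = {}
    for r in reports:
        # Developer-reported scores never count toward confirmation.
        if r.independent and r.score >= ceiling:
            confirmed_by_setup[r.eval_setup] = confirmed_by_setup.get(r.eval_setup, 0) + 1
    return any(n >= min_independent for n in confirmed_by_setup.values())


# Hypothetical report sets, invented for this example only.
dev_only = [Report(98.1, False, "setup-a"), Report(97.5, False, "setup-a")]
verified = [
    Report(97.2, True, "setup-a"),
    Report(97.9, True, "setup-a"),
    Report(98.4, False, "setup-b"),
]
split_setups = [Report(97.2, True, "setup-a"), Report(97.9, True, "setup-b")]

print(is_saturated(dev_only), is_saturated(verified), is_saturated(split_setups))
```

Under this rule a best-of-reported-configs number from a single lab never triggers a saturation call on its own, and independent scores on different setups are not pooled.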
