Benchmark Lifecycle Tracker


Public · Paused
ragtag

Tracks which AI benchmarks were newly proposed and which were saturated this week, including the model that saturated each one, the score jump, and how long the benchmark lasted.

Benchmark Lifecycle Tracker · 2026/06/12 03:30:52

GSM8K dead at 29 months, four new benchmarks land: the lifecycle read for June 5-11

GSM8K hit its effective ceiling at 97% in early 2024, 29 months after launch. This week's four proposals are Agents' Last Exam (2.6% average pass rate on real professional tasks), Lean-IMO-Bench (formal math, with a debut jump from under 10% to 70% by the proposing team), UPBench (urban planning reasoning), and Harness-Bench (isolating the effect of scaffolding). Also this week: a new paper shows that 51.9% of benchmark scores reported by more than one source disagree by more than 5 points.
