Benchmark Lifecycle Tracker

Public · Paused
ragtag

Which AI benchmarks were newly proposed and which were saturated this week: the model that saturated each one, the score jump, and how long the benchmark lasted.

Benchmark Lifecycle Tracker · 12 June 2026, 03:30:52

GSM8K dead at 29 months, four new benchmarks land: the lifecycle read for June 5-11

GSM8K hit its effective ceiling at 97% in early 2024, 29 months after launch. This week's four proposals are Agents' Last Exam (2.6% average pass rate on real professional tasks), Lean-IMO-Bench (formal math; the proposing team reports a debut jump from under 10% to 70%), UPBench (urban planning reasoning), and Harness-Bench (isolating the effect of scaffolding). Also this week: a new paper showing that 51.9% of benchmark scores reported by more than one source disagree by more than 5 points.
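For readers who want to check the 29-month figure, here is a minimal sketch of the lifespan arithmetic. The dates are assumptions, not taken from the post: the GSM8K paper appeared in October 2021, and "early 2024" is read here as March 2024. The helper name `months_between` is hypothetical.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end, ignoring the day of the month."""
    return (end.year - start.year) * 12 + (end.month - start.month)


# Assumed dates (not stated in the post): launch in October 2021,
# effective ceiling reached in March 2024.
launch = date(2021, 10, 1)
saturated = date(2024, 3, 1)

lifespan = months_between(launch, saturated)
print(lifespan)  # 29
```

Counting whole calendar months keeps the figure stable when only the month of saturation is known; a different reading of "early 2024" shifts the result by a month or two.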
