
2026/7/1 · 8:18
Math shows why AI progress is jagged
Grant Sanderson argues that AI's rapid progress in math is real but uneven: geometry-like tasks lend themselves to verifiable, replayable training, while conceptual breakthroughs, explanation, and curation remain harder to measure.
Grant Sanderson’s most useful warning is not that mathematics will be the first domain to fall to AI. It is that math makes AI progress unusually easy to misread. In Dwarkesh Patel’s new 94-minute conversation with Sanderson, the headline claim is sharp: math is the field where we may see superintelligence first, but even inside math the frontier is jagged, local, and hard to summarize with one benchmark number 1.
Sanderson is a good person to make that distinction. He is the math educator behind 3Blue1Brown, a visual-math project whose stated goal is to make more people love math through deep understanding; his site also notes that Manim, the animation engine behind the videos, began as his own open-source Python side project 2. That background matters because this episode is less about model scores than about what counts as understanding.
The central thesis: math is a preview, not a map
Patel opens by asking why an AI system reaching International Mathematical Olympiad gold-medal level did not feel like an AGI moment. Sanderson’s answer is that the frontier is not just spiky across fields; it is fractal inside a field. A model can be spectacular at one kind of contest problem and weak at another. In his telling, geometry has been a spike, while combinatorics remains a tougher, more playful area where the obvious training route is less direct 1.
That distinction is easy to miss because public AI milestones compress many unlike things into one phrase. DeepMind’s AlphaGeometry, for example, solved 25 of 30 Olympiad geometry problems under competition time limits, close to the average human gold-medalist score on that geometry benchmark 3. That is an important result. But it is also narrow: it says something about a formal, verifiable, diagram-based slice of math, not about whether a system can generate the next research program.
The episode’s deeper claim is that math lets us watch several kinds of AI progress separate from each other. There is solving a known problem. There is finding the right bridge between two known fields. There is inventing the definition or conceptual frame that makes future theorems possible. The first can become a benchmark. The second may show up as a surprising connection. The third is much harder to score.
The benchmark AI cannot easily train for
Patel pushes on the difference between proving theorems and creating the concepts that make theorems matter. Sanderson’s historical example is Galois theory. Abel had already shown that the general quintic could not be solved by radicals; Galois helped shift attention toward the symmetries underneath equations. The payoff was not immediately legible. Sanderson describes a long chain from Lagrange’s instinct about symmetry, to Galois’s scattered and initially rejected ideas, to later mathematical clean-up, to twentieth-century uses of group theory in physics 1.
That example matters because reinforcement learning likes short feedback loops. If a model solves a problem, the reward can be checked. If it invents a concept whose importance becomes clear only after decades of downstream work, the reward signal is almost invisible. Sanderson's point is not a claim of technical impossibility. It is a measurement problem: how do you reward the Galois-like instinct before history has had time to prove it was valuable?
This is where the conversation becomes more interesting than another "AI will automate mathematicians" debate. Sanderson does not say AI cannot build new mountains of theory. He says that if it does, the result might come in several forms. It could be a clean bridge between existing domains, like a lightning bolt. It could be a new mountain that humans must climb. Or it could be a huge proof with little explanatory value, leaving humans with an "unsolved expository problem": the proof exists, but nobody yet has the satisfying reason why it works 1.
Why math moves faster than computer use
Patel offers a useful training-side explanation: math is not only verifiable, it is grindable. A system can try many proof attempts, compare successes and failures, and run variations cheaply. Code has a similar property when tasks can be placed in deterministic containers. By contrast, many real-world computer-use tasks are technically verifiable but hard to replay at scale. You cannot run thousands of parallel rollouts through the same live checkout flow without bot detectors, changing pages, and messy state getting in the way 1.
That framing clarifies why progress can look uneven even when tasks seem superficially similar. Booking an event, checking a package, or operating a web app may have a clear yes-or-no outcome. But if the environment cannot be cloned, replayed, and farmed for feedback, the training loop is weaker. Math and coding are special because the world can often be frozen.
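The replay property can be made concrete with a toy sketch. This is an illustration only, not anything described in the episode: the task (finding a nontrivial divisor of 91) and the names `verify`, `rollout`, and `grind` are hypothetical stand-ins for a real proof or coding environment. What it shows is the shape of a frozen world: every attempt is reproducible from its seed, and every outcome is checked by an exact, cheap verifier.

```python
import random


def verify(candidate: int, target: int) -> bool:
    """Exact, cheap check: is candidate a nontrivial divisor of target?"""
    return 1 < candidate < target and target % candidate == 0


def rollout(target: int, seed: int) -> tuple[int, bool]:
    """One attempt in a frozen environment.

    The seed fully determines the attempt, so the same rollout can be
    replayed later and will produce the same candidate and the same reward.
    """
    rng = random.Random(seed)  # private generator, no shared global state
    candidate = rng.randint(2, target - 1)
    return candidate, verify(candidate, target)


def grind(target: int, n_rollouts: int) -> list[int]:
    """Run many independent attempts and return the seeds that were rewarded."""
    return [seed for seed in range(n_rollouts) if rollout(target, seed)[1]]


if __name__ == "__main__":
    winners = grind(target=91, n_rollouts=200)
    print(f"{len(winners)} rewarded rollouts out of 200")
```

A live checkout flow has no equivalent of `random.Random(seed)`: the page, the session state, and the bot detector all change between attempts, so a failed rollout cannot simply be rerun and compared against a successful one.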
Sanderson then adds a reason not to write off formal systems such as Lean. Even if current math progress does not depend entirely on formal proof assistants, a formal library offers something rare: a green checkmark that a proof is correct. In a future where AI systems generate many mathematical claims, that checkmark becomes a filter for attention. Without it, mathematicians may spend their days debugging plausible but false papers. With it, they can at least know that the difficult object in front of them is worth trying to understand 1.
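As a minimal illustration of that checkmark, assuming a plain Lean 4 setup with no extra libraries, the snippet below states two small claims that the checker accepts. The theorem name `add_comm_example` is made up for this sketch; `Nat.add_comm` is a lemma from Lean's core library. Nothing here comes from the episode itself.

```lean
-- Accepted: the proof term has exactly the stated type.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Accepted: both sides evaluate to the same numeral.
example : 2 + 2 = 4 := rfl

-- A plausible-looking but false claim does not get through.
-- Uncommenting the next line makes Lean report an error:
-- example : 2 + 2 = 5 := rfl
```

The check says nothing about whether a statement is interesting or well motivated. It only guarantees that whatever a reader chooses to spend time on is at least correct, which is the filtering role described above.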
The human role shifts toward taste and sequence
The most durable part of the episode is Sanderson’s claim that learning will still depend on people who choose what is worth attending to. He compares future mathematicians, at least in one role, to museum curators. If AI generates a vast landscape of proofs, explanations, and new structures, the scarce function may be deciding which ideas deserve human time, how to order them, and how to attach motivation to them 1.
That is also his practical advice for students using LLMs to learn. Use them less as the teacher of record and more as a search-and-pruning tool around a strong human artifact. Sanderson says the human author matters: a good book, lecture, or explainer does not merely contain correct sentences. It builds motivation in the right order. Patel agrees, describing his best learning sessions as a human-curated lecture or textbook on one side and an LLM helping around the edges 1.
That point also explains why AI writing remains uneven. Sanderson distinguishes between distilling known material and deciding what insight is worth presenting in the first place. Patel adds that good writing requires a live model of the reader’s mind: sentence by sentence, the writer is tracking what the reader is likely to think next. The models can explain many known concepts well, but they often fail at that deeper mentalizing and re-framing 1.
Why this episode matters
The episode’s value is that it gives builders a better vocabulary than "AI is good at math". The useful split is between verifiability, grindability, conceptual invention, human-understandable compression, and curation. A domain can score high on one and low on another.
That vocabulary travels outside mathematics. Coding agents improve quickly when tasks are replayable and outcomes are clear. Business work, design judgment, education, research direction, and product taste are harder because the reward signal is slower, more social, and more dependent on context. Sanderson’s math examples make that abstract point concrete.
So the takeaway is not that math is isolated from the rest of the economy. It is that math is a clean laboratory for watching AI progress split into parts. Some parts will race ahead. Some will need new training environments. Some will still depend on human judgment about what is worth understanding. If AI reaches far beyond human mathematicians, the bottleneck may not be access to proofs. It may be finding the people and institutions that can turn a flood of correct results into a map anyone else can use.
