Claude is already building Claude — and the numbers are harder to dismiss than the headlines

Anthropic published a report on June 4 called "When AI Builds Itself." 1 The media coverage landed on the pause-debate angle — Anthropic wants a coordinated global slowdown, critics say it's regulatory theater. Both framings miss the more important part: buried inside the same document are specific, internal metrics about how Claude is actually performing today. Those numbers are what PMs should be reading.

The claim is specific, not theoretical

Recursive self-improvement (RSI) — the idea that an AI system could autonomously design and develop its own successor — is what the report is technically about. Anthropic says "we have not yet arrived at recursive self-improvement" but that current trends point toward it. 1

What the report is actually documenting is something more immediate: Claude as co-developer at Anthropic right now.

80%+ of merged production code at Anthropic was authored by Claude as of May 2026 — up from low single digits when Claude Code launched in February 2025. 1
In Q2 2026, the average Anthropic engineer merges 8× the code per day they did before 2025. 2
Internal survey of 130 researchers: median productivity estimate using Claude Mythos Preview is ~4× without AI. 1 2

One Anthropic employee quoted in the report: "I started leaning hard into Claudifying about a year ago. That's been a crazy adventure and it's now been ~5 months since I last wrote any code myself." 1

Claude Code session success rate by task complexity, Sep 2025–Jun 2026 — Weekly 4-week trailing mean of Claude Code session success rates across four task complexity levels. Open-ended problems rose from ~26% to 76% in six months; routine tasks reached the high 80s. 1

The metrics that actually matter

Three data series in the report tell the acceleration story more precisely than the "80% of code" headline.

Coding task success rate. Claude Code now completes 76% of open-ended coding problems — the category where there's no predetermined correct answer. Six months ago that number was 26%. 1 Routine tasks are up to the high 80s. An Anthropic engineer described a real incident: a production crash that wiped out tens of thousands of training jobs. Given only text logs and cluster access, Claude independently traced and fixed the cause — an obscure debugging flag — in about two hours. The same diagnosis would normally take two to three days. 1

Experiment optimization. Anthropic runs the same benchmark with each model release: give Claude a code snippet for training a small model, tell it to make it run as fast as possible without breaking correctness. Claude Opus 4 (May 2025) achieved roughly a 3× speedup. Claude Mythos Preview (April 2026, Anthropic's most capable internal model at the time) achieved roughly 52×. A skilled human researcher needs 4–8 hours to reach a 4× speedup. The report's assessment: "In this part of the research workflow—optimizing steps within a clearly defined experiment—Claude has gone from super helpful to superhuman in under a year." 1 3

Code contributed per engineer per quarter at Anthropic, 2021–2026 Q2 — Code contributed per engineer per quarter, as a multiple of the pre-2025 average. Each Claude model release is marked; the step-change begins with Claude Code in early 2025 and reaches 8× by Q2 2026. 1

Research judgment. Anthropic studied 129 real Claude Code research sessions (the sessions themselves, not a controlled benchmark) and identified the moments where a human researcher took a wrong turn. They asked each Claude version: given what you can see so far, what would you do next? An independent model — one that knows how the session eventually resolved — judged whether Claude or the human gave better guidance at each fork.

Claude Opus 4.5 (November 2025) gave the better suggestion in 51% of wrong-turn moments. Claude Mythos Preview (April 2026): 64%. The practical ceiling, measured with a model that can see the session's full eventual outcome, is about 90%. 1

Where a researcher went wrong, could Claude have done better? — nine models, Mar 2024 to Apr 2026 — Claude Haiku 3 (Mar 2024) gave better guidance in 22% of researcher wrong-turn moments. Claude Mythos Preview (Apr 2026) reached 64%, approaching the 90% practical ceiling. 1

The honest tension

The report also proposes three scenarios for where this leads, and it doesn't pretend the trajectory is guaranteed to continue:

S-curve stall — the acceleration flattens, constrained by chip supply chains, power grids, or interconnect bandwidth. Anthropic considers this unlikely.
Compound efficiency gains — AI development is substantially automated, but humans continue to set research direction. A 100-person team does the work of a 10,000-person organization. Anthropic thinks the evidence points here now.
Full recursive self-improvement — AI systems design and refine their own successors; progress rate becomes a function of compute availability. Human role shrinks to oversight, verification, and validation.

The proposed response — a conditional, multi-lab, multi-country coordinated pause — is where the controversy concentrates. Anthropic says it won't pause unilaterally, only if other frontier developers do so "in a verifiable manner." 1 Critics from multiple directions — including NYU's Gary Marcus, the Wharton School's Ethan Mollick, and former White House AI advisor David Sacks — read the timing (one week after Anthropic filed a confidential IPO at a ~$965B valuation) as something between principled safety advocacy and strategic positioning. 4 5

Gary Marcus: "They want it both ways. They don't actually want a pause — at least for now." 4

Ethan Mollick (Wharton) offered the most useful calibration: "There is a bit of navel-gazing, some marketing, and a lot of very sincere beliefs about what Anthropic thinks is likely in the near future of AI that you probably want to be aware of." 6

That framing — genuine belief mixed with strategic incentive — is probably closer to the truth than either pure cynicism or pure alarm.

What this means for your roadmap

The pause debate doesn't have a concrete PM action attached to it. The internal data does.

Reframe what "engineering capacity" means. Amdahl's Law is the CS principle that the bottleneck of a system shifts to whatever stage you didn't speed up. The report's observation is the most actionable signal: code generation has accelerated so fast that human code review is now the bottleneck at Anthropic. 1 If your team uses AI-assisted coding at any meaningful scale, the constraint on shipping velocity is shifting from "how fast can engineers write code" to "how fast can engineers review, validate, and make judgment calls." Roadmap planning that assumes human engineering bandwidth as the primary variable is already behind.

The research judgment data has hiring implications. At the moments where a human researcher takes a wrong turn, the current best Claude model gives a better suggestion 64% of the time. The comparison isn't designed to be a fair human-vs-AI contest — it specifically targets decision points where the human had room to improve. But that's exactly the scenario that matters in practice: your weakest judgment moments, not your strongest. Where models are now outperforming humans most reliably is on well-scoped optimization tasks inside a defined experiment. Where humans retain the edge, per the report: "seeing the bigger picture and thinking beyond the confines of the immediate task." 1 That's a fairly precise job description for where to concentrate your team's senior technical headcount.

On the AI safety research front, Anthropic ran nine parallel agents on the question "can weak models reliably supervise strong models?" — an open AI alignment problem. In 800 cumulative hours and roughly $18,000 of compute, the agents recovered 97% of the performance gap that a human team would close. Two human researchers working a full week recovered 23%. 1 That's not a product feature — but it tells you where compute-intensive research tasks are heading.

The METR (Model Evaluation & Threat Research) task-horizon benchmark — measuring how long an AI can reliably work on a continuous task before losing coherence — has been doubling every four months (down from seven months previously). 2 As of early 2026, that number is approximately 12 continuous hours. Jack Clark (Anthropic co-founder) framed it directly: "When I look down at the car we're driving, all I have is a gas pedal. I don't have a brake pedal, and surely at some point in the future we might want that option." 7

The pause question is a policy problem. The productivity curve is an engineering and product planning problem — and that one is already in your lane.

Cover image: Claude Code session success rate by task complexity chart from When AI builds itself — The Anthropic Institute

Claude is already building Claude — and the numbers are harder to dismiss than the headlines

The claim is specific, not theoretical

The metrics that actually matter

The honest tension

What this means for your roadmap

참고 출처