AI's next bottleneck is learning on the job
June 30, 2026 · 08:24

Dwarkesh Patel argues that the next AI training leap may come from models that can turn real deployment experience into durable capability, not just from larger pre-release RL environments.

The current AI training story has an awkward hole in it: the most valuable data may arrive after deployment, in messy workplace use, but today's frontier models mostly cannot turn that experience into durable new capability. That is the argument Dwarkesh Patel makes in his June 26 monologue, "The next big breakthrough will be AIs learning on the job," released as a 19-minute podcast/video and essay on Dwarkesh Podcast. 1

The bet: RL environments can teach general agency

Patel starts with the research bet he thinks major labs are now making: train models across millions of verifiable tasks in many reinforcement-learning environments, and the result may be a general problem-solving agent. In that view, the deficits people complain about today, especially poor sample efficiency and lack of continual learning, may shrink under scale the way many natural-language-processing problems shrank once enough compute was applied to LLM pretraining. 1
The strongest version of that argument is pragmatic. Training is a large one-time cost, after which the model serves billions of sessions. If the model becomes smart enough inside the context window, perhaps it does not need to update its weights from deployment at all. A six-month workplace ramp could, in principle, become a very long context. The model would carry the relevant history in its context rather than changing its weights. 2
Patel's critique is not that RL on verifiable tasks is useless. It is that verifiability alone is too low a bar. The domain also has to be grindable: easy to reset, simulate, clone, and run in parallel. Coding works well under this recipe because thousands of agents can attack identical copies of a repo in isolated containers. Computer use is harder. You cannot simply send a thousand agents through the same Amazon checkout flow and expect the site to tolerate it. 1
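A minimal sketch of what "grindable" means in practice, assuming a toy environment rather than a real repo or container runtime. The names `ToyRepoEnv`, `attempt`, and `grind` are hypothetical, and threads stand in for isolated containers. The point is only that each agent starts from an identical clone, the outcome is machine-checkable, and the shared snapshot is never disturbed.

```python
import copy
import random
from concurrent.futures import ThreadPoolExecutor


class ToyRepoEnv:
    """Hypothetical stand-in for a repo in a container: cheap to copy and verify."""

    def __init__(self, target=42):
        self.target = target
        self.value = 0

    def step(self, delta):
        self.value += delta

    def verify(self):
        # Verifiable outcome: a plain pass/fail check, like a test suite.
        return self.value == self.target


def attempt(snapshot, seed):
    env = copy.deepcopy(snapshot)  # clone: every agent gets an identical start
    rng = random.Random(seed)      # each agent explores differently
    for _ in range(10):
        env.step(rng.randint(0, 10))
    return env.verify()


def grind(snapshot, n_agents):
    # Parallel attempts against copies; the original snapshot is never mutated.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(lambda s: attempt(snapshot, s), range(n_agents)))
    return sum(results)


snapshot = ToyRepoEnv()
successes = grind(snapshot, 1000)
print(f"{successes} of 1000 attempts passed verification")
```

A live checkout flow offers none of these properties: there is no `deepcopy` for a third-party website, and a thousand simultaneous attempts would change the environment itself.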
That distinction matters because many high-value skills are even less grindable than browser use. Building a business, winning a court case, trading profitably, or helping a candidate win an election all require interaction with the real world. The outcome may take months. The environment changes while the agent acts. There is no clean reset button that lets researchers replay the same situation thousands of times from slightly different actions. Patel points to reset-free, non-stationary environments as a known RL difficulty, then turns that into a practical warning: if the only successful training targets are simulator-friendly, the resulting agent may remain surprisingly narrow outside those conditions. 1

The missing loop is deployment-to-weights

The episode's strongest section is about wasted deployment experience. Patel says roughly 30-50% of a lab's compute goes to inference, yet that inference currently does little to improve the base model. The irony is that deployment is where the richest data appears: what users actually ask for, how organizations really work, which mistakes recur, and where the model fails under tacit, domain-specific constraints. 1
His metaphor is a sharp one: we have a genius graduate student who has never been allowed to take an internship. The student keeps receiving classroom case studies in the form of artificial RL environments, while the real economy is already handing the model millions of actual assignments. The model may observe those assignments during inference, but most of the learning disappears when the session ends. 2
Context alone cannot solve that. A growing KV cache is expensive, user-specific, and brittle. Human learning is not perfect replay of every observation; it is compression. Patel uses that contrast to argue that useful continual learning has to move some experience back into the weights, or into another durable representation that generalizes beyond one session. 1
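To make the cost point concrete, here is a back-of-the-envelope sketch using the standard KV cache size formula for a transformer decoder. The model shape (`LAYERS`, `KV_HEADS`, `HEAD_DIM`) and the token counts are assumptions chosen for illustration, not figures from Patel's essay or from any specific model.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Memory needed to cache keys and values for one sequence."""
    # Factor 2: one key vector and one value vector per token, layer, and KV head.
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value


# Hypothetical model shape with 16-bit cache entries, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

for tokens in (128_000, 10_000_000):
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>10,} tokens -> {gib:,.1f} GiB for one session")
```

The cache grows linearly with history and belongs to a single session, which is why carrying months of experience purely in context scales poorly compared with compressing it into shared weights.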
The hard part is sample efficiency. Current online learning can work when millions of users generate the same kind of signal. Patel cites Cursor Tab as an example, saying it online-learns from more than 400 million daily requests by predicting which suggested edits users accept. That is a dense, repeated objective. Most workplace learning is not like that. One company may need the model to learn its procurement process; another may need it to learn a codebase, an approval chain, or a customer-support failure pattern. 1
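A toy sketch of why a dense accept/reject signal is easy to learn from online. This is not Cursor's system; it is a plain logistic regression updated one request at a time, and the feature vector, the hidden preference `true_w`, and the names `OnlineAcceptModel` and `simulate` are all assumptions made up for the demonstration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class OnlineAcceptModel:
    """Logistic regression trained by one SGD step per observed request."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.lr = lr

    def predict(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))

    def update(self, x, accepted):
        # Log-loss gradient for a single example: (prediction - label) * x.
        err = self.predict(x) - (1.0 if accepted else 0.0)
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]


def simulate(n_requests, seed=0):
    rng = random.Random(seed)
    true_w = [2.0, -1.5, 0.5]  # hidden user preference, invented for the demo
    model = OnlineAcceptModel(n_features=3)
    for _ in range(n_requests):
        x = [rng.uniform(-1, 1) for _ in range(3)]
        p_accept = sigmoid(sum(t * xi for t, xi in zip(true_w, x)))
        model.update(x, rng.random() < p_accept)
    return model.w


print(simulate(20_000))
```

Every request carries the same kind of label, so even this crude learner recovers the direction of the hidden preference. A one-off procurement process or approval chain provides nothing like twenty thousand repetitions of the same signal.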

OPSD is the concrete mechanism to watch

Patel's proposed bridge is on-policy self-distillation, or OPSD. The simplified idea: let a model accumulate experience during a long session, then train the base model to behave like the experienced version when facing the same task. The teacher is not an external labeler. It is the model after it has lived through the problem and learned what mattered. 1
That matters because OPSD does not require a neat outer-loop reward. It only requires that the model can learn something useful in context, then provide a dense training signal to the base model. Patel frames this as better suited to real work than naive supervised fine-tuning, because the goal is not to memorize a transcript of the session. The goal is to extract the small amount of experience that changed the outcome. 1
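The mechanism can be sketched on a toy tabular "language model". This is one common formulation of on-policy distillation, minimizing a per-token reverse KL divergence on states the student itself visits; whether the OPSD Patel describes uses exactly this loss is an assumption. The bigram table, the `hint` array standing in for in-context experience, and the function names are all hypothetical.

```python
import math
import random

VOCAB = 4  # toy vocabulary size
STEPS = 3  # tokens generated per episode


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def teacher_probs(base_logits, hint, prev):
    # Teacher = the frozen base model plus the effect of in-context experience.
    # `hint` stands in for whatever the long session put into context.
    return softmax([b + h for b, h in zip(base_logits[prev], hint[prev])])


def opsd_train(base_logits, hint, episodes=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    student = [row[:] for row in base_logits]  # student starts as the base model
    for _ in range(episodes):
        prev = 0  # fixed start token
        for _ in range(STEPS):
            p = softmax(student[prev])
            q = teacher_probs(base_logits, hint, prev)
            d = kl(p, q)
            # Exact gradient of KL(p || q) with respect to the student's logits.
            grad = [pi * (math.log(pi / qi) - d) for pi, qi in zip(p, q)]
            student[prev] = [s - lr * g for s, g in zip(student[prev], grad)]
            # On-policy: the next state is sampled from the student, not the teacher.
            prev = rng.choices(range(VOCAB), weights=p)[0]
    return student


base = [[0.0] * VOCAB for _ in range(VOCAB)]
hint = [[2.0, 0.0, 0.0, 0.0] for _ in range(VOCAB)]
student = opsd_train(base, hint)
print(kl(softmax(student[0]), teacher_probs(base, hint, 0)))
```

Two details mirror the prose: the teacher is the same model with extra context rather than an external labeler, and the training signal is a full distribution at every token rather than a single end-of-task reward.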
There is a subtle tradeoff here. Patel previously argued that RL learns less information per sample than supervised learning. In this episode, he says that may be an advantage for continual learning. Sparse updates can change only what is needed for the task, reducing the risk that new workplace-specific learning overwrites broad model competence. 1

The speculative version is dreaming

OPSD is the conservative mechanism. The stranger idea is what Patel calls "dreaming": the model builds a simulation of the situation it is encountering, rehearses strategies inside that simulated world, and trains against the useful ones. He compares the intuition to EfficientZero, where a system can get more out of limited real interaction by running simulated games in its head. 1
This is the most speculative part of the essay, and Patel treats it that way. Simulating Atari or Go is one thing; simulating a messy business, an election, or a research program is another. But the direction is important. If pretraining, RL, and inference-time compute are the first three scaling axes, test-time training would be a fourth: spend compute not only to answer the user, but to construct practice environments tailored to the user's actual work. 1
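The rehearsal intuition can be illustrated with a much simpler technique than EfficientZero: Dyna-style planning on a tiny chain world, where an agent replays remembered transitions to squeeze more value out of a single real episode. Everything here (the chain environment, `learn`, `dream_sweeps`) is a hypothetical toy, not a description of Patel's proposal or of EfficientZero's algorithm.

```python
def real_episode(n_states):
    # One real trajectory: always move right (action 1) until the goal state.
    goal = n_states - 1
    return [(s, 1, 1.0 if s + 1 == goal else 0.0, s + 1) for s in range(goal)]


def q_update(q, s, a, r, s2, terminal, gamma=0.9, alpha=1.0):
    # Standard Q-learning backup toward reward plus discounted best next value.
    target = r + (0.0 if terminal else gamma * max(q[s2]))
    q[s][a] += alpha * (target - q[s][a])


def learn(n_states, dream_sweeps):
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    model = {}
    # Real experience: learn from each transition once, and remember it.
    for s, a, r, s2 in real_episode(n_states):
        q_update(q, s, a, r, s2, s2 == goal)
        model[(s, a)] = (r, s2)
    # "Dreaming": rehearse inside the learned model with no further real steps.
    for _ in range(dream_sweeps):
        for (s, a), (r, s2) in model.items():
            q_update(q, s, a, r, s2, s2 == goal)
    return q


for sweeps in (0, 3):
    value = learn(5, sweeps)[0][1]
    print(f"{sweeps} dream sweeps: value of moving right from the start = {value:.3f}")
```

With no rehearsal, the single real episode teaches the agent only about the final step; a few imagined sweeps propagate that reward back to the start. The open question Patel flags is whether anything comparable works when the model of the world is a messy business rather than a five-state chain.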

What changes if he is right

Patel's essay is useful because it reframes the current agent race. A lab that wins only by building better pre-deployment benchmarks may still miss the bigger prize: a model that gets materially better after a week inside a company, a codebase, or a research workflow. That is different from today's personalization features, which remember preferences. The prize is converting experience into durable competence.
It also changes what to watch in product announcements. Longer context windows, agent sandboxes, and RL environments are still important, but the more revealing signals will be mechanisms that close the loop from session experience back into a model or durable skill representation. Look for language around on-policy distillation, persistent work reviews, trajectory rewriting, organization-specific learning, and training from real task outcomes.
Patel's forecast for 2027 or 2028 is that RLVR (reinforcement learning with verifiable rewards) may produce agents competent enough to be deployed into real work, and that continual-learning methods may then let them expand beyond the domains they were explicitly trained on. If that happens, the main source of model improvement shifts. Models would improve less from training before release than from being used everywhere after release. 1
That is both the promise and the governance problem. An AI system that learns from real deployments could become far more useful than one frozen at launch. It would also raise harder questions about data rights, privacy, feedback quality, and whether users understand when their work is contributing to future model capability. Patel does not solve those questions in this episode. He gives a clean technical reason they may soon matter more: the bottleneck may no longer be whether agents can do classroom tasks, but whether they can turn the job itself into training data.
