Best of your X follows: June 18

Research

AI cracked 7 of 10 novel hard math problems. Headlines said it "failed."

A new report from the 1st Proof Project tested frontier models on genuinely novel math problems that were unsolved as of the training cutoff, the kind where there's no answer to memorize. The models solved 7 of 10. Ethan Mollick's reaction was blunt: calling that a failure applies a strange standard, given that 15 months ago LLMs couldn't reliably do arithmetic. The study does surface real patterns, though: models overstate their confidence, skip proof verification steps, and stumble hardest when the problem requires visual or diagrammatic reasoning. 1

When you train a model on another model's outputs, it inherits the quirks

A Google DeepMind researcher's finding: when one AI model is used to generate training data for the next generation, the child model can pick up "strange habits" from the parent — and those habits are surprisingly hard to filter out. Mollick flagged this as a partial explanation for why models within the same family often feel qualitatively similar even after significant parameter changes. The implication is structural: synthetic data pipelines introduce a kind of stylistic inheritance that's not fully captured by benchmark scores, and may be accumulating across successive fine-tuning rounds. 2

Visual steps are where AI workflows break

A short but precise observation from Mollick, citing new methodology work: in multi-step AI agent workflows, errors cluster disproportionately at steps that involve interpreting visual input. Text reasoning degrades gradually; visual reasoning degrades sharply. For anyone building pipelines that include screenshots, diagrams, or UI parsing, this is worth treating as a design constraint rather than an edge case — route visual-heavy steps to specialized models or add human checkpoints there. 3

Enterprise and agents

"Practical agents are merely months old"

Mollick pushed back today on the trend of confident prescriptions for how companies should restructure around AI agents. His argument is simple: the tooling is weeks or months old in most enterprise contexts, the playbooks are unwritten, and the competitive advantage of moving fast is real but fragile if you lock in bad patterns early. "Experimentation — and productive failures — will be required." This is a useful corrective to the wave of frameworks and templates being published as if the category were already mature. 4

Policy

Fable 5 is not coming back soon

Two days after the US government's export control directive forced Anthropic to pull Fable 5 and Mythos 5 for all customers globally, Simon Willison posted the clearest summary of where things stand: the situation remains unresolved, and there's no indication of a fast path to restoration. His read is that the gap between what the model can do and what policymakers think they're regulating is now measured in days, not quarters. Anthropic's claim that it's a "misunderstanding" hasn't visibly accelerated a resolution. 5

Products

Paul Graham switched to Bing

That sentence deserves to sit alone for a moment. Graham posted today that Google has started mixing ads directly into image search results — not alongside results, but interleaved within them. When you search for images of a specific watch model, you now get watches from entirely different manufacturers placed inside the results grid. His description: search quality has dropped dramatically. He's using Bing for some searches now. For a cohort that has treated Google as infrastructure for 25 years, this kind of announced defection is worth tracking. 6

Sources