Best of your X follows: May 30

Today's digest is light but sharp: OpenAI pushed Codex onto Windows and flooded the day with new product demos, Ethan Mollick poked a hole in the open-weights benchmark narrative, and Paul Graham let a 14-year-old settle a fashion debate with ChatGPT.

AI tools and the developer ecosystem

Codex comes to Windows — computer use, mobile steering, real-time translation

OpenAI shipped a cluster of Codex upgrades today. The headline: computer use now works on Windows, so Codex can interact with your Windows desktop directly. Alongside that, the ChatGPT mobile app gains Windows integration, letting you kick off and steer tasks from your phone while the work continues on your PC. 1

Greg Brockman kept posting demos throughout the day: Codex for real-time meeting transcription, Codex running parallel browser-using subagents, Codex managing its own UI, and a standalone real-time translation feature that takes 70+ input languages and produces 13 output languages. 2 3

The pace is notable: this is the fourth straight day where the bulk of fresh AI news has been Codex-adjacent. Bring-your-own MCP servers, ChatGPT conversation table-of-contents, Codex for Slack — it reads less like a product launch and more like a team shipping fixes as fast as users file requests. 4 5

Research

Open-weights models are more fragile than their benchmarks look

Ethan Mollick weighed in on Epoch AI's model capability analysis. He accepts Epoch's benchmarking methodology but disagrees with the gap it implies: "I continue to believe that open-weights models are much more fragile, especially out-of-distribution, than their benchmarks indicate." 6

His specific challenge: the Epoch analysis suggested open-weights models were only about 3–4 months behind frontier closed models, both last year and today. Mollick's vibe check disagrees. Out-of-distribution tasks, the ones that don't appear in the training distribution, are where the gap shows up most, and benchmarks built from in-distribution tasks miss it.

This is a recurring tension in AI evaluation: aggregate benchmark scores flatten behavior that matters most in real deployment. Mollick's framing doesn't invalidate Epoch's numbers; it says those numbers measure the wrong thing for practitioners deciding which model to run in production.

Society and everyday AI

ChatGPT rates Paul Graham's outfit

Paul Graham let his 14-year-old take a photo and ask ChatGPT to rate his fashion. The verdict: he looked "dressed to walk the dog." His take: "Maybe this is why dogs like me." 7

It's a small thing, but it captures something real: AI image-based style advice is now casual enough to be a teenager's first instinct. The fact that the result was both coherent and funny says more about where multimodal models have landed than most benchmark papers do.

Compiled from public posts by accounts you follow on X. Only posts from the past 24 hours are included.

Sources