
Weekly YouTube Digest — May 22–28, 2026
4 videos from Matthew Berman and Microsoft Research this week: a new coding benchmark where GPT-5.5 leads by 15+ points, Cursor's Composer 2.5 model at 1/20th the frontier cost, Karpathy joining Anthropic, and a research seminar on non-Markovian diffusion sampling.

This week (May 22–28, 2026), four videos from channels in this subscription list were worth pulling out of the queue. The lineup skews toward AI product and industry commentary — specifically coding benchmarks, model economics, and Anthropic's talent strategy — plus one dense research talk from Microsoft on diffusion models.
1. Finally a good benchmark (DeepSWE)
Channel: Matthew Berman · Duration: 17:03 · Published: May 28
A new coding benchmark called DeepSWE — from a company called data.curve.ai — went viral for actually reflecting how developers prompt agentic coders in practice. Tasks are short and behavior-focused ("go fix this"), not prescriptive multi-paragraph issues scraped from GitHub. GPT-5.5 leads the leaderboard at 70%; Claude Opus 4.7 comes in around 55% at nearly 3× the cost per trial ($16 vs. $5.80) and takes 37 minutes per task vs. GPT-5.5's 20. The benchmark's false-positive rate is 0.3% vs. SWE-Bench Verified's 8.5% — a real improvement in verification reliability. Behavioral analysis shows Claude frequently misses "support both X and Y" style prompts by only implementing one branch, while GPT-5.5 reads prompts literally and honors them.
Worth watching? Yes, if you're choosing a coding model for any non-trivial use. The cost-performance chart alone will change your defaults. Skip if you already know your team is on GPT-5.5.
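The cost and speed comparison for DeepSWE can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this entry (Claude's score is given as "around 55%", so it is treated as exactly 55 here). The `cost_per_solved` metric is an assumption added for illustration, not something the benchmark or the video reports: it divides cost per trial by pass rate to estimate dollars spent per solved task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    score_pct: float  # leaderboard pass rate, in percent
    cost_usd: float   # cost per trial, in US dollars
    minutes: float    # minutes per task

    def cost_per_solved(self) -> float:
        """Hypothetical metric: expected dollars per solved task."""
        return self.cost_usd / (self.score_pct / 100)


# Figures as quoted in the digest entry above.
gpt = Result("GPT-5.5", 70.0, 5.80, 20)
claude = Result("Claude Opus 4.7", 55.0, 16.00, 37)


def compare(a: Result, b: Result) -> dict:
    """Express b relative to a (ratios above 1 mean b costs more or is slower)."""
    return {
        "score_gap_points": a.score_pct - b.score_pct,
        "cost_ratio": b.cost_usd / a.cost_usd,
        "time_ratio": b.minutes / a.minutes,
        "cost_per_solved_ratio": b.cost_per_solved() / a.cost_per_solved(),
    }


summary = compare(gpt, claude)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

On these numbers the per-trial cost ratio is about 2.76 (the "nearly 3×" above), the time ratio is 1.85, and the assumed cost-per-solved ratio is about 3.5, since the lower pass rate compounds the higher price.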
2. Cursor just won.
Channel: Matthew Berman · Duration: 31:22 · Published: May 27
Cursor released Composer 2.5, its in-house coding model built on the open-source Kimi K2.5 weights. It scores ~64% on Cursor Bench, about 1.5 points below the frontier (Opus 4.7 Max at ~65.5%), at roughly $0.50/million input tokens vs. $30/million for the frontier models. The video frames this as roughly a 20× cost difference for a quality gap of about 1.5 points. Its broader argument is that "workhorse" models in this price band will matter more than the absolute frontier for most production workloads. The SpaceX AI / Cursor acquisition subthread gets extended coverage: SpaceX is training a new model from scratch on Colossus 2, Cursor gets bought ~30 days after the SpaceX IPO, and separately, Anthropic is paying SpaceX $1.25 billion/month through May 2029 to run Claude on Colossus infrastructure. Google's Gemini 3.5 Flash gets benchmarked as worse than Composer 2.5 at 4× the cost.
Worth watching? Yes, if you care about the business and infrastructure layer under these models, not just the leaderboards. The acquisition structure and the Elon-Anthropic compute deal are genuinely complicated and Berman walks through them carefully. Skip if you just want the benchmark number; it's in the previous video.
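To get a feel for what the quoted prices mean at scale, here is a minimal sketch that prices a workload on input tokens alone. The monthly token volume is a hypothetical figure chosen for illustration; only the two per-million input prices and the two Cursor Bench scores come from this entry. Note that the input prices by themselves imply a ratio of about 60×, larger than the roughly 20× headline figure; the entry does not give output-token prices, so a blended ratio cannot be reproduced here.

```python
# Input prices per million tokens, as quoted in the digest entry above.
COMPOSER_INPUT_USD_PER_M = 0.50
FRONTIER_INPUT_USD_PER_M = 30.00

# Cursor Bench scores, as quoted (both are approximate in the source).
COMPOSER_SCORE_PCT = 64.0
FRONTIER_SCORE_PCT = 65.5

# Hypothetical workload, assumed for illustration: 2 billion input tokens/month.
MONTHLY_INPUT_TOKENS = 2_000_000_000


def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of the input tokens only; output tokens are not modeled."""
    return tokens / 1_000_000 * usd_per_million


composer_cost = input_cost_usd(MONTHLY_INPUT_TOKENS, COMPOSER_INPUT_USD_PER_M)
frontier_cost = input_cost_usd(MONTHLY_INPUT_TOKENS, FRONTIER_INPUT_USD_PER_M)
input_price_ratio = frontier_cost / composer_cost
quality_gap_points = FRONTIER_SCORE_PCT - COMPOSER_SCORE_PCT

print(f"Composer 2.5 input cost: ${composer_cost:,.0f}/month")
print(f"Frontier input cost:     ${frontier_cost:,.0f}/month")
print(f"Input price ratio:       {input_price_ratio:.0f}x")
print(f"Quality gap:             {quality_gap_points:.1f} points")
```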
3. This is absolutely CRAZY — Andrej Karpathy joins Anthropic
Channel: Matthew Berman · Duration: 18:39 · Published: May 22
Andrej Karpathy (OpenAI co-founder, creator of nanoGPT, prolific AI educator) announced he's joining Anthropic. The video's core observation is that by joining, Karpathy is implicitly endorsing Anthropic's worldview: that AI poses serious risks, open source is a mistake, and only a handful of labs are qualified to develop it responsibly. Berman's take is that this is a "loss of an independent voice" and a signal that pessimism about AI's near-term societal impact has become the consensus among serious researchers, even if the companies presenting the gloomiest scenarios are simultaneously printing the most revenue. He cites Pew Research data: 50% of Americans are more concerned than excited about AI, a number that's been climbing since ChatGPT launched. Berman also admits to an editorial incentive of his own: negative coverage of Anthropic outperforms positive coverage in his metrics.
Worth watching? Yes. The cultural and talent-market analysis is more interesting than the announcement itself. Karpathy's move is analyzed through the lens of what researchers actually optimize for when they can work anywhere. Skip if you don't follow lab politics — there's no product news here.
4. A non-Markovian approach to diffusion-based sampling
Channel: Microsoft Research · Duration: 1:12:40 · Published: May 27
Lorenz Richter (TU Berlin) gives a research seminar on moving beyond Markovian dynamics for the diffusion-based sampling problem, specifically the "data-free" setting where you have an unnormalized density rather than a dataset of samples. The first half covers existing approaches: path-space divergences, the log-variance loss as a variance-reduced surrogate for the KL gradient, integrating sequential Monte Carlo into diffusion samplers, and underdamped Langevin dynamics with splitting integrators. The second half introduces new non-Markovian ideas using diffusion bridges and Markovian projections, which the speaker says yield objectives that scale better in high dimensions. A paper from Denis Blessing is imminent on arXiv.
Worth watching? Only if you work on generative modeling, Bayesian inference, or stochastic control at a research level. This is a 70-minute seminar with heavy notation, not a product explainer. For anyone else, the practical upshot is: diffusion sampling in high-dimensional physics or fine-tuning settings has a theory gap that people are actively closing.
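For readers who have not met the log-variance loss mentioned in the summary, its usual definition in the path-space literature is sketched below next to the KL divergence it is compared against. The notation is an assumption of this digest and may not match the talk's slides: \mathbb{P}^{u} is the path measure of the learned (controlled) process, \mathbb{Q} is the target path measure, and \mathbb{W} is a reference measure under which trajectories are simulated.

```latex
% KL divergence: expectation taken under the learned process itself
D_{\mathrm{KL}}\left(\mathbb{P}^{u} \,\middle\|\, \mathbb{Q}\right)
  = \mathbb{E}_{\mathbb{P}^{u}}\!\left[\log \frac{\mathrm{d}\mathbb{P}^{u}}{\mathrm{d}\mathbb{Q}}\right]

% Log-variance loss: variance of the same log-ratio under a reference measure W
\mathcal{L}_{\mathrm{LV}}^{\mathbb{W}}\left(\mathbb{P}^{u}, \mathbb{Q}\right)
  = \operatorname{Var}_{\mathbb{W}}\!\left[\log \frac{\mathrm{d}\mathbb{P}^{u}}{\mathrm{d}\mathbb{Q}}\right]
```

Because a variance ignores additive constants, the unknown normalizing constant of the target density drops out of the second objective, and because \mathbb{W} can be chosen independently of u, there is no need to differentiate through the simulated trajectories.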
Channels covered this week: Matthew Berman (@matthew_berman), Microsoft Research (@MicrosoftResearch). 4 videos total from the week of May 22–28, 2026.