Best of your X follows: AI grading loops, benchmark traps, and sandbox trust (2026)

The useful thread today was less about a single launch and more about what breaks when AI becomes part of ordinary work: grading, coding, benchmarking, sandbox trust, and organizational improvisation.

Scope note: today's source pool is the configured public AI/tech account list, not a full personal following crawl. Window: posts captured from June 25 18:15 to June 26 18:00 UTC. Pure retweets, family updates, context-light quote reactions, and non-tech small talk were left out.

Tools and developer infrastructure

Simon Willison on sandbox trust

What happened: Simon Willison, creator of Datasette and co-creator of Django, argued that Daytona's security messaging is a poor ad for a sandboxing product if the company says it cannot expose its source code safely [1].
Why it matters: for developer sandboxes, trust is partly architectural; if users cannot inspect the boundary, the pitch has to carry more proof than a normal hosted tool [1].
Signal: the post had 58,368 views, 373 likes, 85 bookmarks, and 38 replies at capture [1].

Willison's post is the cleanest trust-and-tooling signal in the pool.

Greg Brockman on Codex remote sessions

What happened: Greg Brockman, OpenAI's president and co-founder, pointed to DigitalOcean as a place to run a Codex remote session [2].
Why it matters: the post is terse, but the workflow it points to is concrete: coding agents are moving from chat windows into remote development environments [2].
Signal: the post had 1,626,135 views, 577 likes, 184 bookmarks, and 56 replies at capture [2].

François Chollet on protecting the codebase

What happened: François Chollet, co-founder of ARC Prize and creator of Keras, said the measure of a software engineer is not clever code but protecting the codebase from unnecessary cleverness [3].
Why it matters: this lines up with the agentic-coding shift: when execution gets cheaper, maintainability and judgment become harder to outsource [3].
Signal: this was the strongest original engineering post in the window by engagement, with 67,152 views, 1,885 likes, 271 bookmarks, and 79 replies at capture [3].

Evaluation and autonomy

François Chollet on autonomy as learning

What happened: Chollet drew a line between acting without supervision and learning without human bottlenecks; in his framing, systems dependent on human training data and RL environments are imprints of human knowledge [4].
Why it matters: it pushes the autonomy debate away from whether an agent can click around by itself and toward whether it can improve outside a human-shaped data pipeline [4].
Signal: the post had 17,219 views, 321 likes, 51 bookmarks, and 35 replies at capture [4].

Chollet's autonomy post anchors the research cluster.

François Chollet on static benchmarks

What happened: Chollet warned that benchmarks built on static datasets, or static distributions known densely at training time, mainly measure memorization or retrieval [5].
Why it matters: the criticism is not that retrieval benchmarks are useless; it is that they should not be confused with tests of intelligence [5].
Signal: the post was newer than most selected items, with 9,121 views, 145 likes, 25 bookmarks, and 15 replies at capture [5].

AI at work and in institutions

Paul Graham on AI writing and AI grading

What happened: Paul Graham, co-founder of Y Combinator, described a loop in which students use AI for writing, professors use AI for grading, and humans merely transmit the output [6].
Why it matters: the line works because it treats the classroom as a pipeline. If both ends are automated, the remaining human role starts to look like routing rather than learning [6].
Signal: it was the highest-engagement original AI post in the window, with 201,227 views, 3,838 likes, 558 bookmarks, and 286 replies at capture [6].

Graham's post is the sharpest education-and-AI item today.

Ethan Mollick on muddling through AI gains

What happened: Ethan Mollick, a Wharton professor who studies AI, innovation, and startups, said many first reactions to gains in AI capability will be "just muddling through" rather than a rational plan [7].
Why it matters: this is a useful counterweight to neat strategy decks. Organizations often absorb new capability through local improvisation before they redesign work around it [7].
Signal: the post had 9,333 views, 123 likes, 10 bookmarks, and 13 replies at capture [7].

What I would read first

Start with Graham if you care about education or workplace incentives, Chollet's two research posts if you care about evaluation, and Willison if you build or buy sandboxed developer infrastructure. The common thread is simple enough: once AI can perform more of the visible work, the remaining hard parts move to trust boundaries, benchmark design, taste, and institutional incentives.
