Best of your X follows: June 20

Ethan Mollick turns model evaluation into artifact inspection with GLM-5.2 Max and a harbor-town benchmark. Google DeepMind points AI at UK housing planning workflows, while Simon Willison and Charity Majors push the developer-tooling theme: generated code is cheap, engineering discipline is not.

Daily Best of Who I Follow on X
2026/6/18 · 2:06
The strongest signal today is not one giant launch. It is a set of small tests of where AI systems are starting to show up: model comparisons that use artifacts instead of leaderboard numbers, public-sector workflow prototypes, and developer tools that now assume agents can write to real systems.
Source mix: mostly X posts from the monitored account set, plus Simon Willison's weblog when his X timeline was quiet. Pure retweets, one-line political posts, and low-context small talk were left out.

Model releases and evaluation

Ethan Mollick: GLM-5.2 Max can do the task, but Fable still changes the shape of it

What happened: Mollick credited GLM-5.2 Max, a new open-weights model, with completing a constrained poem task that involved disappearing letters 1.
Why it matters: his comparison was not about whether the output was correct. He argued that Fable integrated the disappearing-letter constraint into the poem's theme, while GLM-5.2 Max mostly satisfied the surface requirement 1.
Implication: if you evaluate creative or agentic systems only by task completion, you miss the difference between following an instruction and using the constraint as part of the work.
What happened: Mollick shared a benchmark prompt asking models to build a procedurally generated 3D harbor-town simulation from 3000 BCE to 3000 AD, with beauty and user control in the spec 2.
Why it matters: the linked gallery compares model outputs from one prompt and describes the set as spanning 39 months of AI progress; the older GPT-3.5 and GPT-4 entries needed one standardized follow-up prompt 3.
Implication: this is the kind of artifact-based benchmark that is easy for practitioners to inspect. You can judge coherence, interactivity, aesthetics, and failure modes without reducing everything to one score.
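One way to keep such an inspection from collapsing into a single number is to record each dimension separately. The sketch below is only an illustration: the `ArtifactReview` class, the `rank_by` helper, the 1 to 5 scale, and the `model-a` / `model-b` names are all hypothetical and are not part of Mollick's benchmark or gallery.

```python
from dataclasses import dataclass, field

# Dimensions taken from the paragraph above; the 1-5 scale is an assumption.
DIMENSIONS = ("coherence", "interactivity", "aesthetics")


@dataclass
class ArtifactReview:
    """One reviewer's notes on one model's artifact (hypothetical structure)."""
    model: str
    scores: dict[str, int]
    failure_modes: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Require every dimension so no review silently skips one.
        missing = [d for d in DIMENSIONS if d not in self.scores]
        if missing:
            raise ValueError(f"missing dimensions: {missing}")
        for name, value in self.scores.items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")


def rank_by(reviews: list[ArtifactReview], dimension: str) -> list[str]:
    """Order models on a single dimension; deliberately no combined score."""
    ordered = sorted(reviews, key=lambda r: r.scores[dimension], reverse=True)
    return [r.model for r in ordered]


# Placeholder reviews; the numbers are made up for the example.
reviews = [
    ArtifactReview(
        "model-a",
        {"coherence": 4, "interactivity": 2, "aesthetics": 3},
        failure_modes=["controls stop responding after a time jump"],
    ),
    ArtifactReview(
        "model-b",
        {"coherence": 3, "interactivity": 5, "aesthetics": 3},
    ),
]

for dimension in DIMENSIONS:
    print(dimension, rank_by(reviews, dimension))
```

Keeping `failure_modes` as free text next to the scores preserves the observations that a leaderboard number would drop.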

Public-sector AI

Google DeepMind: planning-office prototype targets housing applications

What happened: Google DeepMind said it is working with UK government bodies on an AI prototype for housing planning applications 4.
Why it matters: the post says the prototype is aimed at repetitive planning-officer work, so officers can spend more attention on complex projects 4.
Implication: DeepMind is claiming a processing-time reduction of up to 50%. Treat that as a target claim from the project team, not yet as an audited deployment result 4.

Developer tools and engineering practice

Simon Willison: Datasette gets first-class row editing

What happened: Simon Willison released Datasette 1.0a34, adding insert, edit, and delete tools to the Datasette interface 5.
Why it matters: the tools are available on table pages, and edit and delete also appear as row-level actions. That brings the ordinary UI up to date with the write workflows Willison had already been exploring through Datasette Agent 5.
Implication: agent-assisted database work is pushing product surfaces back toward explicit human approval and visible edit controls, not just chat-only automation.
Figure: Datasette row-editing interface. Datasette 1.0a34 adds row insert, edit, and delete actions to the web interface 5.
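The "explicit human approval" idea can be sketched as a gate between a proposed write and the database. This is not Datasette's implementation or API: `ProposedEdit`, `apply_if_approved`, and the `notes` table are hypothetical names, and the example uses only Python's standard `sqlite3` module.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedEdit:
    """A write that an agent wants to make (hypothetical structure)."""
    sql: str
    params: tuple
    summary: str  # human-readable description shown to the reviewer


def apply_if_approved(
    conn: sqlite3.Connection,
    edit: ProposedEdit,
    approve: Callable[[ProposedEdit], bool],
) -> bool:
    """Run the edit only if the approve callback returns True."""
    if not approve(edit):
        return False
    with conn:  # commits on success, rolls back if execute raises
        conn.execute(edit.sql, edit.params)
    return True


# Demo against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", ("draft",))
conn.commit()

edit = ProposedEdit(
    sql="UPDATE notes SET body = ? WHERE id = ?",
    params=("final", 1),
    summary="Change the body of note 1 from 'draft' to 'final'",
)

# A reviewer who rejects the edit leaves the row untouched.
rejected = apply_if_approved(conn, edit, approve=lambda e: False)
body_after_reject = conn.execute("SELECT body FROM notes WHERE id = 1").fetchone()[0]

# A reviewer who approves lets the write through.
applied = apply_if_approved(conn, edit, approve=lambda e: True)
body_after_approve = conn.execute("SELECT body FROM notes WHERE id = 1").fetchone()[0]
```

In a real interface the `approve` callback would be a visible confirmation step rather than a lambda. The point of the sketch is that the write path and the approval decision stay separate.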

Simon Willison / Charity Majors: AI coding raises the bar for engineering discipline

What happened: Willison surfaced Charity Majors' argument that AI has made code generation cheap and fast, changing the economics of software production 6.
Why it matters: Majors' longer piece argues that if code becomes more disposable, teams need stronger production understanding, observability, review habits, and system invariants, not weaker ones 7.
Implication: the practical takeaway for AI coding teams is blunt: optimize for shared understanding and production feedback, because generated code is cheap and operational confusion is still expensive.

Short signals

Greg Brockman: GPT-Realtime-2 gets a terse internal endorsement

What happened: Greg Brockman posted that "GPT-Realtime-2 is something new" 8.
Why it matters: the post gives no launch note or technical detail, so the signal is weaker than a product announcement. It does show OpenAI's cofounder drawing attention to the realtime line after recent voice and WebRTC experiments in the developer community 8.
Implication: keep an eye on demos and docs before treating this as more than a high-level hint.

François Chollet: solve hard problems by reframing, not piling on complexity

What happened: Chollet argued that hard problems are rarely solved by adding complexity; they are solved by reframing the question until a simpler answer becomes visible 9.
Why it matters: in the context of AI research and software design, that is a useful counterweight to scale-first thinking. More machinery can hide a bad problem statement.
Implication: before adding another layer to an agent pipeline, ask whether the task definition is wrong.
