
Best of your X follows: June 20
Mollick turns model evaluation into artifact inspection with GLM-5.2 and a harbor-town benchmark. Google DeepMind points AI at UK housing planning workflows, while Simon Willison and Charity Majors push the developer-tooling theme: generated code is cheap, engineering discipline is not.

The strongest signal today is not one giant launch. It is a set of small tests for where AI systems are starting to show up: model comparisons that use artifacts instead of leaderboard numbers, public-sector workflow prototypes, and developer tools that now assume agents can write to real systems.
Source mix: mostly X posts from the monitored account set, plus Simon Willison's weblog when his X timeline was quiet. Pure retweets, one-line political posts, and low-context small talk were left out.
Model releases and evaluation
Ethan Mollick: GLM-5.2 Max can do the task, but Fable still changes the shape of it
What happened: Mollick credited GLM-5.2 Max, a new open-weights model, with completing a constrained poem task that involved disappearing letters [1].
Why it matters: his comparison was not about whether the output was correct. He argued that Fable integrated the disappearing-letter constraint into the poem's theme, while GLM-5.2 Max mostly satisfied the surface requirement [1].
Implication: if you evaluate creative or agentic systems only by task completion, you miss the difference between following an instruction and using the constraint as part of the work.
Ethan Mollick: a 20-model harbor-town gallery as an AI progress test
What happened: Mollick shared a benchmark prompt asking models to build a procedurally generated 3D harbor-town simulation from 3000 BCE to 3000 AD, with beauty and user control in the spec [2].
Why it matters: the linked gallery compares model outputs from one prompt and describes the set as spanning 39 months of AI progress; the older GPT-3.5 and GPT-4 entries needed one standardized follow-up [3].
Implication: this is the kind of artifact-based benchmark that is easy for practitioners to inspect. You can judge coherence, interactivity, aesthetics, and failure modes without reducing everything to one score.
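A minimal Python sketch of what per-dimension inspection could look like. The dimension names come from the paragraph above; the `ArtifactReview` class, the 0 to 5 scale, and the placeholder model names are hypothetical and are not taken from Mollick's gallery. The comparison reports a leader per dimension and keeps failure modes as free-text notes instead of folding everything into one total.

```python
from dataclasses import dataclass, field

# Dimensions named in the text; failure modes are kept as notes, not scores.
DIMENSIONS = ("coherence", "interactivity", "aesthetics")


@dataclass
class ArtifactReview:
    """One reviewer's notes on one model's output for the shared prompt."""
    model: str
    scores: dict[str, int]  # hypothetical 0-5 scale per dimension
    failure_modes: list[str] = field(default_factory=list)


def best_per_dimension(reviews: list[ArtifactReview]) -> dict[str, str]:
    """Return the top-scoring model for each dimension, kept separate."""
    return {
        dim: max(reviews, key=lambda r: r.scores[dim]).model
        for dim in DIMENSIONS
    }


# Placeholder data for illustration only.
reviews = [
    ArtifactReview("model_a", {"coherence": 4, "interactivity": 2, "aesthetics": 5}),
    ArtifactReview("model_b", {"coherence": 3, "interactivity": 4, "aesthetics": 3},
                   failure_modes=["camera clips through terrain"]),
]
leaders = best_per_dimension(reviews)
```

Here `leaders` shows `model_a` ahead on coherence and aesthetics and `model_b` ahead on interactivity, which is exactly the kind of split a single score would hide.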
Public-sector AI
Google DeepMind: planning-office prototype targets housing applications
What happened: Google DeepMind said it is working with UK government bodies on a prototype AI tool for housing planning applications [4].
Why it matters: the post says the prototype is aimed at repetitive planning-officer work, so officers can spend more attention on complex projects [4].
Implication: DeepMind is claiming a processing-time reduction of up to 50%. Treat that as a target claim from the project team, not yet an audited deployment result [4].
Developer tools and engineering practice
Simon Willison: Datasette gets first-class row editing
What happened: Simon Willison released Datasette 1.0a34, adding insert, edit, and delete tools to the Datasette interface [5].
Why it matters: the tools are available on table pages, and edit and delete also appear as row-level actions. That lets the ordinary UI catch up with the write workflows Willison had already been exploring through Datasette Agent [5].
Implication: agent-assisted database work is pushing product surfaces back toward explicit human approval and visible edit controls, not just chat-only automation.
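One way to picture an approval-gated write is the minimal sketch below, using Python's standard `sqlite3` module. It is not Datasette's implementation or API; `ProposedEdit`, `apply_if_approved`, the `ALLOWED` allow-list, and the `notes` table are hypothetical names chosen for illustration. The agent only proposes a change, and nothing touches the database until a human approves it.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class ProposedEdit:
    """A row update suggested by an agent, held until a human reviews it."""
    table: str
    row_id: int
    column: str
    new_value: str


# Hypothetical allow-list: identifiers cannot be bound as SQL parameters,
# so only known table and column names are accepted.
ALLOWED = {"notes": {"body"}}


def apply_if_approved(conn: sqlite3.Connection, edit: ProposedEdit, approved: bool) -> bool:
    """Apply the edit only when a human has approved it; return True if applied."""
    if not approved:
        return False
    if edit.column not in ALLOWED.get(edit.table, set()):
        raise ValueError(f"edit to {edit.table}.{edit.column} is not allowed")
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            f"UPDATE {edit.table} SET {edit.column} = ? WHERE id = ?",
            (edit.new_value, edit.row_id),
        )
    return cur.rowcount == 1


def demo() -> list[str]:
    """Show the row before and after approval using an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute("INSERT INTO notes (id, body) VALUES (1, 'draft')")
    edit = ProposedEdit("notes", 1, "body", "reviewed")
    results = []
    for approved in (False, True):
        apply_if_approved(conn, edit, approved)
        row = conn.execute("SELECT body FROM notes WHERE id = 1").fetchone()
        results.append(row[0])
    return results
```

Calling `demo()` reads the row once after a rejected proposal and once after an approved one, so the unapproved edit leaves the stored value unchanged.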

Simon Willison / Charity Majors: AI coding raises the bar for engineering discipline
What happened: Willison surfaced Charity Majors' argument that AI has made code generation cheap and fast, changing the economics of software production [6].
Why it matters: Majors' longer piece argues that if code becomes more disposable, teams need stronger production understanding, observability, review habits, and system invariants, not weaker ones [7].
Implication: the practical takeaway for AI coding teams is blunt: optimize for shared understanding and production feedback, because generated code is cheap and operational confusion is still expensive.
Short signals
Greg Brockman: GPT-Realtime-2 gets a terse internal endorsement
What happened: Greg Brockman posted that "GPT-Realtime-2 is something new" [8].
Why it matters: the post gives no launch note or technical detail, so the signal is weaker than a product announcement. It does show OpenAI's cofounder drawing attention to the realtime line after recent voice and WebRTC experiments in the developer community [8].
Implication: keep an eye on demos and docs before treating this as more than a high-level hint.
François Chollet: solve hard problems by reframing, not piling on complexity
What happened: Chollet argued that hard problems are rarely solved by adding complexity; they are solved by reframing the question until a simpler answer becomes visible [9].
Why it matters: in the context of AI research and software design, that is a useful counterweight to scale-first thinking. More machinery can hide a bad problem statement.
Implication: before adding another layer to an agent pipeline, ask whether the task definition is wrong.
Reference sources
- [1] Ethan Mollick on GLM-5.2 Max vs Fable
- [2] Ethan Mollick on the harbor-town benchmark
- [3] Harbor Town AI Gallery
- [4] Google DeepMind on an AI planning prototype
- [5] Release: datasette 1.0a34
- [6] Simon Willison quoting Charity Majors
- [7] AI demands more engineering discipline. Not less
- [8] Greg Brockman on GPT-Realtime-2
- [9] François Chollet on reframing hard problems