Indie Agent Builders — Week of May 15

Simon Willison shipped the Datasette Agent alpha, the convergence of three years of his `llm` library with Datasette, while swyx open-sourced Kakuna, a five-layer framework for hardening vibe-coded codebases. Both independently diagnosed the same gap: agents generate code, but don't yet close the distance to production-grade software. Also: Gemini 3.5 Flash triples the API price of its predecessor, and five standout trending repos (agent-browser at 34k★, OpenViking, hermes-agent v0.14.0 with a 180× CDP speedup, and more).

AI Agent Builders Worth Following
2026/5/24 · 2:31

Research notes

Two things happened in the same week that, taken together, point at the same gap. Simon Willison shipped the Datasette Agent alpha, in which three years of work on his llm Python library and Datasette finally converge into a single plugin, and within the same few days swyx released Kakuna, an open-source checklist framework designed to harden vibe-coded codebases into production-ready repos. Both arrived independently at the same diagnosis: agents generate code, but they don't yet close the gap between MVP output and production-grade software.

Simon Willison

Simon Willison (creator of Datasette and co-creator of Django) published eight blog posts this week, centered on Datasette Agent's first public alpha and a detailed look at Google I/O 2026.

Datasette Agent alpha

On May 21, Willison released datasette-agent 0.1a3 — the first public alpha of an extensible AI assistant plugin for Datasette, his open-source SQLite exploration tool 1 2.
"I've been working on my LLM Python library for just over three years now, and Datasette Agent represents the moment that LLM and Datasette finally come together."
The plugin ships with a /-/agent chat UI that lets users query any SQLite database in plain language. The live demo at agent.datasette.io runs on Gemini 3.1 Flash-Lite; locally, it's one uv command and supports OpenAI, Claude Code (via llm-openai-via-codex), and local models including Qwen 3.5 9B via LM Studio 1.
Three official companion plugins launched the same day 2:
  • datasette-agent-charts — renders query results as Observable Plot charts
  • datasette-agent-openai-imagegen — integrates ChatGPT Images 2.0 for image generation from within the agent
  • datasette-agent-sprites — executes code in Fly Sprites sandboxes
The architectural decision worth replicating: the plugin exposes a register_agent_tools hook so any other Datasette plugin can contribute tools to the agent. Willison noted that both Claude Code and OpenAI Codex are "proving excellent at writing plugins": you "just point them at a checkout of the datasette-agent repo for reference and tell them what you want to build." 1
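A minimal sketch of the plugin-hook pattern described above, in plain Python. It does not use Datasette's or datasette-agent's real API: the `ToolRegistry` class, the `Tool` dataclass, and the zero-argument signature given to the hook functions here are all hypothetical, chosen only to show how independent plugins can each contribute tools to one agent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Tool:
    """A named callable the agent may invoke (hypothetical shape)."""
    name: str
    description: str
    run: Callable[[str], str]


class ToolRegistry:
    """Collects tools from any number of plugins (not Datasette's real API)."""

    def __init__(self) -> None:
        self._hooks: List[Callable[[], List[Tool]]] = []

    def add_plugin(self, register_agent_tools: Callable[[], List[Tool]]) -> None:
        # Each plugin hands over a hook function; it is called later, on collect().
        self._hooks.append(register_agent_tools)

    def collect(self) -> Dict[str, Tool]:
        # Call every registered hook and merge the results, rejecting name clashes.
        tools: Dict[str, Tool] = {}
        for hook in self._hooks:
            for tool in hook():
                if tool.name in tools:
                    raise ValueError(f"duplicate tool name: {tool.name}")
                tools[tool.name] = tool
        return tools


# Two independent "plugins", each supplying its own hook implementation.
def charts_plugin_tools() -> List[Tool]:
    return [Tool("render_chart", "Render rows as a chart", lambda q: f"chart({q})")]


def shout_plugin_tools() -> List[Tool]:
    return [Tool("shout", "Upper-case the input", lambda q: q.upper())]


registry = ToolRegistry()
registry.add_plugin(charts_plugin_tools)
registry.add_plugin(shout_plugin_tools)
available = registry.collect()
```

The design point is that the agent core never imports the plugins: it only knows the hook contract, so a new capability is a new plugin rather than a change to the agent.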
This is the second release in the same week: 0.1a2 shipped May 15 with permission-based tool availability controls 3, and 0.1a3 added a "View SQL query" button and improved truncated-response handling. The plugin already has 47 GitHub stars. More meaningfully, Willison says building it has triggered a major refactor of llm 0.32a0, with plans to extract more general "LLM agent" abstractions upstream.
Figure: Datasette Agent chat UI generating an Observable Plot bar chart of top US nuclear plant capacity directly from a plain-language query 1

Gemini 3.5 Flash pricing and the API price-probe pattern

On May 19, Willison published a close read of Gemini 3.5 Flash's pricing 4. The numbers: $1.50 per million input tokens and $9 per million output tokens — three times the price of Gemini 3 Flash Preview and six times the cost of Gemini 3.1 Flash-Lite. That puts it near Gemini 3.1 Pro ($2/$12 per million), which previously sat at the expensive end of Google's lineup.
"It feels like all three of the major AI labs are starting to probe the price tolerance of their API customers."
Willison draws the same pattern across labs: GPT-5.5 is roughly double the cost of GPT-5.4; Claude Opus 4.7 runs about 1.46× the price of 4.6. He notes the oddity that Google is simultaneously deploying Gemini 3.5 Flash across free consumer products (Gemini app, Google Search AI Mode) while raising API prices, but leaves the business logic without a firm conclusion 4.
For agent builders: if your per-call economics are anchored to Gemini 3 Flash, this generation's equivalent is materially more expensive than the last.
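To make the per-call impact concrete, here is a small worked calculation. The Gemini 3.5 Flash and Gemini 3.1 Pro prices are the ones quoted above; the Gemini 3 Flash Preview and Gemini 3.1 Flash-Lite prices are derived from the stated 3× and 6× ratios rather than quoted directly, and the 10,000-input / 1,000-output token call is a hypothetical workload chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per million tokens."""
    input_per_m: float
    output_per_m: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """USD cost of a single call."""
        total = input_tokens * self.input_per_m + output_tokens * self.output_per_m
        return total / 1_000_000


PRICES = {
    "gemini-3.5-flash": Price(1.50, 9.00),        # quoted in the text
    "gemini-3.1-pro": Price(2.00, 12.00),         # quoted in the text
    # Derived from the stated ratios (3x and 6x), not independently quoted:
    "gemini-3-flash-preview": Price(1.50 / 3, 9.00 / 3),
    "gemini-3.1-flash-lite": Price(1.50 / 6, 9.00 / 6),
}

# Hypothetical agent call: 10k tokens of context in, 1k tokens out.
INPUT_TOKENS = 10_000
OUTPUT_TOKENS = 1_000

costs = {name: p.cost(INPUT_TOKENS, OUTPUT_TOKENS) for name, p in PRICES.items()}

for name, usd in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name:24s} ${usd:.4f} per call")
```

Under these assumptions the same call costs $0.024 on Gemini 3.5 Flash versus $0.008 on Gemini 3 Flash Preview, so a budget sized for the older model covers one third as many calls.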

Google I/O, Gemini Spark, and a prompt-injection concern

Willison's May 20 Google I/O notes 5 focus on Gemini Spark — Google's personal AI agent for Gmail, Calendar, Drive, Docs, Sheets, YouTube, and Maps. The underlying infrastructure, per Google's Spark FAQ, is Gemini 3.5 Flash and a product called Antigravity. Antigravity appears to be a four-component stack: a desktop app, a Go-based CLI agent tool, an open-source Python SDK, and an Antigravity IDE (a VS Code fork) 5.
Willison flags two things. First, Google announced that the open-source Gemini CLI (Apache 2.0, TypeScript) will stop working with paid AI subscriptions on June 18, replaced by the closed-source Antigravity CLI. Second, on Spark's prompt-injection risk, Google's enterprise blog claims the agent runs in a "fully managed secure runtime" with per-task "pristine, strictly isolated ephemeral VMs" and DLP enforcement via Agent Gateway. Willison is not fully reassured:
"Given how many people are going to be piping very sensitive data through Gemini Spark in the near future I hope they've made this bullet-proof, or this could be a top candidate for the agent security Challenger disaster that we still haven't seen."

PyCon lightning talk: coding agents crossed a threshold in November

On May 16 at PyCon US 2026, Willison gave a five-minute talk on the last six months in LLMs, later annotated and published on his blog on May 19 6.
His central claim: November 2025 was the inflection point when coding agents went from "often-work" to "mostly-work," crossing the quality threshold for daily-driver use. The best-model title changed hands five times in a single month among Claude Sonnet 4.5, GPT-5.1, Gemini 3, GPT-5.1 Codex Max, and Claude Opus 4.5. He also highlighted Qwen3.6-35B-A3B (20.9 GB, laptop-runnable) outperforming Claude Opus 4.7 on his SVG pelican benchmark, a sign that locally runnable open models are narrowing the gap faster than most engineers expect 6.

Quick signals

  • datasette-llm-limits 0.1a0 (May 15): New Datasette plugin for per-user LLM spending caps, using rolling daily dollar limits — the kind of cost-control infrastructure you need before putting an agent-powered tool in front of real users 7.
  • FTC vs. Cox Media Group (May 22): The FTC fined Cox Media Group and two others nearly $1 million for falsely marketing an "Active Listening" AI ad-targeting service that claimed to monitor device microphones. Willison had been debunking this claim since 2024; the FTC ruling confirmed the service was repackaging purchased email lists, not microphone data 8.
  • GDS pushes back on NHS open-source retreat (May 17): The UK Government Digital Service published guidance recommending "default to open" practices without naming the NHS directly, which Willison read as a significant escalation of an internal government dispute over the NHS's decision to close previously open repositories 9.
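The rolling daily dollar limit mentioned in the datasette-llm-limits item above can be sketched as follows. This is not that plugin's code or API: the `RollingSpendLimiter` class and its method names are hypothetical, and the sketch keeps everything in memory, whereas a real deployment would presumably persist spend records.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Tuple

WINDOW_SECONDS = 24 * 60 * 60  # "rolling daily" = the trailing 24 hours


class RollingSpendLimiter:
    """Per-user spend cap over a trailing 24-hour window (illustrative only)."""

    def __init__(self, daily_limit_usd: float) -> None:
        self.daily_limit_usd = daily_limit_usd
        # user -> queue of (timestamp, cost_usd), oldest first
        self._events: Dict[str, Deque[Tuple[float, float]]] = defaultdict(deque)

    def _prune(self, user: str, now: float) -> None:
        # Drop spend records that have aged out of the window.
        events = self._events[user]
        while events and events[0][0] <= now - WINDOW_SECONDS:
            events.popleft()

    def spent(self, user: str, now: Optional[float] = None) -> float:
        now = time.time() if now is None else now
        self._prune(user, now)
        return sum(cost for _, cost in self._events[user])

    def try_spend(self, user: str, cost_usd: float, now: Optional[float] = None) -> bool:
        """Record the spend and return True, or return False if it would exceed the cap."""
        now = time.time() if now is None else now
        if self.spent(user, now) + cost_usd > self.daily_limit_usd:
            return False
        self._events[user].append((now, cost_usd))
        return True
```

A rolling window avoids the midnight-reset problem of calendar-day limits, where a user can spend a full day's budget just before and just after the reset.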

Swyx

Shawn "Swyx" Wang (@swyx, founder of Latent Space) posted across X and his blog this week, with Kakuna as the standout shipped artifact.

Kakuna: a hardening framework for vibe-coded codebases

On May 22, swyx open-sourced Kakuna — a five-layer agent skill framework specifically designed to harden vibe-coded codebases, hosted at swyxio/skills on GitHub 10 11.
The five layers run in order: Foundation → Productization → Safety → Operability → Quality, with seven discrete skill modules covering codebase-maintainability-guardrails, antislop-codebase, productionize-app-with-services, security-hardening, observability-hardening, release-readiness-hardening, and test-strategy-hardening. Each module is a separate SKILL.md compatible with Claude Code, OpenAI Codex, and Cursor 10.
The framing swyx uses is the "mullet factory":
"instead of dark factory, go 'mullet factory' — party in front (ship unique lovable features), dark in the back (timeless production principles)."
You run /plan with Kakuna, then let it /goal for a day. It returns with the same application functionality, but the underlying codebase has been audited and hardened — and it produces an audit trail of its own work 11.
The most useful framing from the comments came from community member spanlens: checklists are interesting because they encode invariants that models can verify but can't yet derive on their own. The gap between "can generate" and "production-ready" is knowing which 50 things will go wrong, not knowing how to write code.
This is also a convergence moment: Eric Zakariasson (Cursor team) posted his /thermo-nuclear-code-quality-review skill the same week. It is Cursor's most-used internal skill, with rules like "delete rather than move complexity," "block files over 1,000 lines," and "reject PRs that work but make the codebase worse." 12 Swyx called it "great minds think alike."
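Of the rules quoted above, the file-length one is the easiest to check mechanically. The sketch below is one way such a check could look; it is not the Cursor skill's implementation, and the `oversized_files` helper, its default `.py` suffix filter, and the idea of running it as a standalone script are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

MAX_LINES = 1_000  # threshold quoted in the rule above


def count_lines(path: Path) -> int:
    """Count lines without loading the whole file into memory."""
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def oversized_files(
    root: Path,
    suffixes: Iterable[str] = (".py",),
    max_lines: int = MAX_LINES,
) -> List[Tuple[Path, int]]:
    """Return (path, line_count) for every matching file strictly over max_lines."""
    wanted = tuple(suffixes)
    offenders: List[Tuple[Path, int]] = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in wanted:
            lines = count_lines(path)
            if lines > max_lines:
                offenders.append((path, lines))
    return offenders


if __name__ == "__main__":
    # Exit non-zero when any file is too long, so a CI job can block on it.
    found = oversized_files(Path("."))
    for path, lines in found:
        print(f"{path}: {lines} lines (limit {MAX_LINES})")
    raise SystemExit(1 if found else 0)
```

This is the kind of invariant the spanlens comment describes: trivial to verify once stated, but something a model will not impose on itself unless the checklist says so.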
Figure: The Kakuna Codebase Hardening Suite mascot, a yellow armored cocoon in a code-symbol shield, from swyxio/skills 10

The vibecoded-to-production skill (forthcoming)

The day before Kakuna, swyx posted about a separate in-progress skill for converting a vibe-coded MVP into a production-ready, end-to-end-tested, parallelizable agent repo 13:
"this thing ran for ~16 hours yesterday and made 103 commits all told and i ended up with exactly the same app but instead of fragile mvp it now looks like a codebase i can actually build on for the long run"
That tweet drew 597 likes and 638 bookmarks, among the highest-engagement posts of the week. Swyx noted the repo isn't public yet; it "still needs more testing." The 103 commits over 16 hours can't be independently verified until the repo ships, but the concept is consistent with Kakuna and the Cursor internal skill 13.

"Deep research is dead since o3"

On May 20, swyx posted a short thesis that has been circulating ever since 14:
"IMO deep research has been ~dead since o3 and interactivity was always more impt for active learning and eliciting intention. thoughtless prompt → long ass report nobody reads is inferior to read → think → ask → read → think → ask"
The argument is that once models can reason interactively at the quality o3 brought, the "dump a long report and walk away" pattern loses most of its value. The claim drew 270 likes and 64 replies, with significant pushback from people still using deep research productively. As a product design principle for agent UX, though, the directional point has weight.

Agent Labs and the Q4 2025 inflection

On May 20, swyx connected Sam Altman's recurring "build a business that gets better as models get better" frame to his own previously published "Agent Labs" concept 15:
"in retrospect i think @sama's mythical 'build a business that gets better when models get better' is basically what I called Agent Labs here. seeing a very direct correlation with model performance and agent lab revenue, discontinuity in Q4 2025"
If his observation is accurate, the Q4 2025 model improvements weren't just a capability story — they showed up as a revenue step-change for teams running agents.

Daytona infra numbers

The Latent Space podcast episode on Daytona (published May 21, hosted by swyx's network) featured CEO Ivan Burazin sharing specific numbers on agent compute infrastructure 16:
  • 60 ms sandbox spin-up time (including network latency, full request→reply cycle)
  • 50,000 concurrent sandboxes in 75 seconds
  • 850,000 daily sandbox runs
  • 15% average utilization with 90% peaks — a ratio driven by the burst-heavy nature of reinforcement learning and eval workloads
Burazin described Daytona as a custom bare-metal scheduler built on 10+ years of Codeanywhere infrastructure, pivoted from human developer environments to agent sandboxes on New Year's Eve 2025 (the first MVP was partly vibe-coded). 74% month-over-month growth; Singapore is currently the city with the highest user count 16.
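A quick back-of-the-envelope reading of the numbers above shows why the workload is described as burst-heavy. The inputs are the figures quoted in the episode; treating the 50,000-sandbox ramp as linear and the 850,000 daily runs as evenly spread over 24 hours are simplifying assumptions, not claims from the source.

```python
# Figures quoted in the episode.
SANDBOXES_IN_RAMP = 50_000
RAMP_SECONDS = 75
DAILY_RUNS = 850_000
AVG_UTILIZATION = 0.15
PEAK_UTILIZATION = 0.90

SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: the ramp to 50,000 concurrent sandboxes is linear.
ramp_rate_per_second = SANDBOXES_IN_RAMP / RAMP_SECONDS

# Assumption: daily runs are spread evenly, giving a day-average start rate.
avg_starts_per_second = DAILY_RUNS / SECONDS_PER_DAY

# How much hotter a burst is than the average day, under both assumptions.
burst_multiple = ramp_rate_per_second / avg_starts_per_second

# Capacity must be sized for the peak, which is this many times the average load.
peak_to_average = PEAK_UTILIZATION / AVG_UTILIZATION

print(f"ramp rate:        {ramp_rate_per_second:.0f} sandboxes/s")
print(f"day-average rate: {avg_starts_per_second:.1f} sandboxes/s")
print(f"burst multiple:   {burst_multiple:.0f}x")
print(f"peak/average:     {peak_to_average:.1f}x")
```

On these assumptions a burst starts sandboxes at roughly 667 per second against a day-average of about 10 per second, and the fleet has to be provisioned for about six times its average load.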
The Latent Space Railway episode (May 19, Jake Cooper) covered the agent-native cloud angle: 3M users, 100K weekly signups, and over $200K in coding-agent spend. The full Substack post is paywalled.

Quick signals

  • AIE Singapore keynote (May 17): swyx gave the closing keynote at the first AI Engineer Singapore conference, titled "The Agentic Nation." His blog post described it as: "burned some bridges but said what i felt." 17
  • Exa bake-off (May 20): swyx's team ran a competitive evaluation of Exa (an agent-focused search API) against alternatives; the team reached unanimous consensus in 1.5 hours 18.
  • 10M Transformer challenge (May 18): swyx challenged the community to live-code a ~10M parameter Transformer in JAX/Flax/Optax on free Colab, in a 2–3 hour workshop window. The post got 636 likes and 710 bookmarks — the most-bookmarked tweet this week 19.

Five repos dominated GitHub AI agent trending between May 15–23.
| Repo | Stars | What it does |
| --- | --- | --- |
| vercel-labs/agent-browser | 34,100 | Rust-native browser automation CLI for AI agents: accessibility-tree snapshots, semantic ARIA locators, batch commands, OAuth/2FA state persistence, AES-256-GCM auth vault, built-in natural-language chat |
| volcengine/OpenViking | 24,600 | ByteDance/Volcengine's "context database" for agents: file-system paradigm for memories, resources, and skills; L0/L1/L2 tiered context loading to reduce token use; visualized retrieval trajectories for debugging |
| ComposioHQ/agent-orchestrator | 7,200 | Parallel coding-agent orchestration: each agent in its own isolated git worktree, autonomous CI-fix loop, web dashboard at localhost:3000; agent-agnostic (Claude Code, Codex, Aider), runtime-agnostic (tmux, Docker) |
| NousResearch/hermes-agent | 151,000 | v0.14.0 released May 16: 808 commits, 633 merged PRs, 215 contributors since v0.13.0; new xAI Grok integration with 1M-token context, OAuth proxy exposing Claude Pro and ChatGPT Pro as local OpenAI-compatible endpoints, native X/Twitter search tool, MS Teams integration, pip install hermes-agent now on PyPI, CDP browser speed improved 180× via persistent connection reuse |
| future-agi/future-agi | 1,000 | Open-source eval + observability platform: 50+ eval metrics, 18 guardrail scanners, OpenTelemetry-native tracing across 50+ frameworks, OpenAI-compatible gateway (~29K req/s, P99 ≤ 21 ms), 6 prompt-optimization algorithms; nightly release status |
Sources: 20 21 22 23 24
A few observations on the list.
agent-browser and OpenViking sit at opposite ends of the agent infrastructure problem: Vercel Labs is solving how agents act on the web (a browser interface with security controls), while ByteDance is solving how agents remember and access context across sessions. Both are now at a scale that suggests serious adoption, not just interest. hermes-agent's v0.14.0 note about 180× CDP speed improvement via persistent Chrome connections is the kind of operational detail that matters when you're running agents at scale — the prior per-request Chrome spawn cost was non-trivial. And agent-orchestrator's git worktree isolation model for parallel agents is the cleanest published pattern for managing concurrent coding agents without repo conflicts.
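The worktree-per-agent isolation model mentioned above can be sketched with plain `git worktree add`. This is not agent-orchestrator's code: the `agent/<id>` branch naming, the sibling `<repo>-worktrees/` directory layout, and both helper functions are hypothetical choices for illustration. Only the `git -C <repo> worktree add -b <branch> <path> <commit-ish>` invocation is standard git.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_add_command(repo: Path, agent_id: str, base: str = "HEAD") -> List[str]:
    """Build the git command that gives one agent its own branch and checkout."""
    branch = f"agent/{agent_id}"                               # hypothetical naming scheme
    path = repo.parent / f"{repo.name}-worktrees" / agent_id   # hypothetical layout
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


def create_worktrees(repo: Path, agent_ids: List[str]) -> List[Path]:
    """Create one isolated worktree per agent; raises if any git call fails."""
    paths: List[Path] = []
    for agent_id in agent_ids:
        cmd = worktree_add_command(repo, agent_id)
        subprocess.run(cmd, check=True)
        paths.append(Path(cmd[-2]))  # the worktree path argument
    return paths
```

Calling `create_worktrees(Path("/path/to/repo"), ["fixer-1", "fixer-2"])` would give each agent a separate directory and branch that share one object store, so concurrent edits cannot collide in the working tree and results can be merged or discarded per branch.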
One real adoption signal: independent builder @rakheOmar (frostmage) shared a personal stack this week, with Hermes Agent feeding Syncthing + Obsidian + GitHub for a fully private, version-controlled, phone-deliverable setup that took one day to build 25.

Geoffrey Huntley

Geoffrey Huntley (@ghuntley, creator of how-to-build-a-coding-agent at 5.6k★ and the Ralph Wiggum Technique) published a blog post in May 2026 titled "the aussie gov just screwed startup founders. my musings." at ghuntley.com/livid/, covering the impact of Australian government policy on startup founders 26. No new GitHub commits, repos, or code activity appeared in the May 15–23 window 27.

Cover image: AI-generated
