The click tax for AI agents: how a deterministic retrieval layer took biological data accuracy from 17% to 100%

On June 8, 2026, Anthropic published a research post and companion Science paper introducing VirBench — 120 viral sequence queries with ground-truth counts — and gget virus, a deterministic retrieval tool built with NCBI researchers. Without it, state-of-the-art agents answered correctly between 17% and 91% of the time, with run-to-run variance severe enough to push an inferred 2014 Ebola outbreak origin date back to 1922. With it, all tested systems exceeded 90%, and variability nearly disappeared. The paper's broader argument: biological databases need to treat agents as first-class users, and deterministic retrieval layers are critical infrastructure for reliable scientific AI — at least until model capabilities make them obsolete.

Anthropic & Claude Deep Tracker
June 12, 2026 · 10:16 AM

Research Brief

Biological databases were designed for humans clicking through browsers. Andrej Karpathy spent a week doing exactly that — clicking dropdowns and following URLs — trying to make a vibe-coded web app real. He was frustrated: "The code was the easiest part." Computational biologists have been saying the same thing about their own data infrastructure for years. But when AI agents hit the same walls, the consequences go further than a wasted week.
On June 8, 2026, Anthropic published a research post and companion paper that quantify the problem and show what a deterministic retrieval layer does to fix it. 1 The preprint, submitted to arXiv on June 4, introduces VirBench, a benchmark of 120 viral sequence queries, and gget virus, a programmatic tool built with NCBI researchers. 2 The numbers are striking: without a dedicated retrieval layer, state-of-the-art agents get the right answer between 16.9% and 91.3% of the time. With gget virus, every tested system tops 90%, and GPT-5.5 reaches 99.7%.

The click tax and why it matters now

The NCBI Virus database is where virologists go to retrieve genomic sequences — for outbreak surveillance, vaccine design, diagnostic assay development, phylogenetic analysis. It has a web interface that exposes complex filtering logic. It does not expose that same logic through any single API. Getting "all human SARS-CoV-2 sequences released in 2025 containing the surface glycoprotein" from the web interface takes a few clicks from a trained researcher. Doing the same thing programmatically requires stitching together multiple APIs (REST, Datasets, E-utilities), paginating results, reconciling identifiers across databases, and downloading gigabytes of data to filter locally.
The paper's author, Laura Luebbert, who co-developed gget (a broader genomics retrieval toolkit), describes the situation in infrastructure terms borrowed from Karpathy: biological data is like a historic city built for pedestrians. You can retrofit it with traffic signs and parking lots, but the underlying layout was never meant for cars. 1 Software infrastructure, by contrast, already offers what agents need: version control, well-documented APIs, and test suites that tell you whether a patch is correct.
This matters in a practical and time-sensitive way. When an Ebola outbreak was declared in the DRC in May 2026, researchers needed to immediately compare new Bundibugyo virus genomes against historical Zaire ebolavirus sequences to assess diagnostic and therapeutic coverage. The first step — retrieving the relevant sequences from NCBI — is a manual, browser-dependent workflow. The paper uses this as a concrete example of why streamlined agent access to viral data can have life-or-death consequences.

VirBench: what the benchmark actually measures

The benchmark consists of 120 queries across 40 pathogens, manually verified against ground-truth counts. The queries reflect real scientific tasks: a sample query asks for sequences of Orthoebolavirus zairense (ZEBOV) matching specific host, geography, date, length, and quality filters. Six systems were tested: Claude Sonnet 4, Claude Opus 4.7, Biomni Open Source (using Claude Sonnet 4 as its base LLM), Edison Analysis, GPT-5.2-pro, and GPT-5.5.
Without any specialized retrieval tool, mean accuracy ranged from 16.9% to 91.3% across systems. 2 The paper notes that Anthropic's newer models are biosecurity-restricted for this type of task, which is why Claude Sonnet 4 (not Opus 4.8 or Fable 5) appears as the primary Anthropic test subject.
Accuracy alone understates the problem. The benchmark also measured run-to-run variability — asking the same model the same question three times and checking whether it gave the same answer. For the sample Ebolavirus query, Sonnet 4 returned 106 sequences on one run, 15 on a second, and 5 on a third, against an expected count of 266. That variance directly translates into flawed downstream analyses:
  • Phylogenetic trees: Two of the three Sonnet 4 sequence sets produced visibly incomplete trees. One pushed the inferred time to the most recent common ancestor (TMRCA) back to 1922 (the real 2014 outbreak TMRCA is between late January and mid-March 2014). 1
  • Therapeutic coverage analysis: Three separate retrieval runs produced three different pictures of mutation frequency in the glycoprotein regions targeted by maftivimab and MBP134 — antibodies that WHO recommended as candidate treatments in the ongoing DRC outbreak.
Phylogenetic trees of Zaire ebolavirus: the manually curated dataset (upper left) recovers a January 2014 TMRCA; three Sonnet 4-retrieved datasets produce different trees, including one inferring a 1922 origin. 1
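To make the scale of that variance concrete, the short sketch below works through the arithmetic on the counts quoted above (106, 15, and 5 returned against 266 expected). The "fraction retrieved" measure is an illustrative assumption, not the scoring rule VirBench uses, and it treats every returned sequence as a correct one, which flatters the runs.

```python
# Counts quoted in the text for the sample Ebolavirus query (three Sonnet 4 runs).
EXPECTED = 266
RUN_COUNTS = [106, 15, 5]


def fraction_retrieved(returned: int, expected: int) -> float:
    """Share of the expected record count a run returned.

    Illustrative measure only (an assumption): it does not check that the
    returned records are the right ones.
    """
    return returned / expected


fractions = [fraction_retrieved(count, EXPECTED) for count in RUN_COUNTS]

# Ratio between the largest and smallest result for the identical query.
spread = max(RUN_COUNTS) / min(RUN_COUNTS)

for count, frac in zip(RUN_COUNTS, fractions):
    print(f"{count:>3} of {EXPECTED} sequences -> {frac:.1%}")
print(f"largest run is {spread:.1f}x the smallest")
```

Even the best of the three runs returns well under half of the expected sequences, and the largest result is more than twenty times the smallest for the same question.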
The failure modes follow a clear pattern. Agents under-counted when they stopped paginating early (most severe for high-volume pathogens like Influenza A, HIV-1, and SARS-CoV-2). They over-counted when they misapplied filters. They failed on metadata fields whose meaning depends on context — whether a sequence is "lab-passaged," which segment name applies to a given virus, how a particular geographic label maps to INSDC conventions.
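The early-stop failure is easy to reproduce in miniature. The sketch below uses a hypothetical in-memory source (`RECORDS`, `fetch_page`, and the page size are all invented for illustration and do not correspond to any NCBI endpoint) to contrast reading a single page with following the cursor until the source is exhausted.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for a paginated API: 1,050 record IDs, 200 per page.
RECORDS = [f"SEQ{i:05d}" for i in range(1050)]
PAGE_SIZE = 200


def fetch_page(cursor: int) -> Tuple[List[str], Optional[int]]:
    """Return one page of IDs plus the next cursor, or None when exhausted."""
    page = RECORDS[cursor:cursor + PAGE_SIZE]
    next_cursor = cursor + PAGE_SIZE
    return page, (next_cursor if next_cursor < len(RECORDS) else None)


def fetch_all() -> List[str]:
    """Follow the cursor until the source reports no further pages."""
    results: List[str] = []
    cursor: Optional[int] = 0
    while cursor is not None:
        page, cursor = fetch_page(cursor)
        results.extend(page)
    return results


# An agent that stops after the first response silently under-counts.
first_page_only, _ = fetch_page(0)
complete = fetch_all()
print(len(first_page_only), len(complete))
```

Nothing in the first page signals that it is incomplete unless the caller checks the cursor, which is why the under-count grows with the size of the result set.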

How gget virus works

The paper argues that the core issue is not reasoning capability. Models largely understand what's being asked. The problem is that they lack a machine-actionable path to carry out, verify, and repeat the retrieval.
gget virus addresses this by building a layer that coordinates across the multiple underlying systems NCBI Virus sits on top of — the REST, Datasets, and E-utilities APIs — and replicates the filtering semantics of the web interface programmatically. 1 Specifically:
  • For large query results (SARS-CoV-2, Influenza A), it handles pagination to retrieve complete datasets rather than truncating
  • When filters require information stored in a separate database (e.g., determining whether a record contains a particular viral protein via GenBank), it retrieves those records, applies the filter, and preserves the relevant fields in the output
  • It returns standardized, machine-readable outputs with detailed logs showing exactly how the result was produced — making the retrieval auditable
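The general pattern behind those three points can be sketched in a few lines. This is not gget virus's implementation or API: the `Record` type, the `CATALOG` data, the accession strings, and the `retrieve` function are all hypothetical, chosen only to show cheap metadata filters running first, a filter that depends on a second source of annotations running afterwards, and a log that records how the result was produced.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Record:
    accession: str              # hypothetical IDs, not real GenBank accessions
    host: str
    length: int
    proteins: Tuple[str, ...]   # stands in for annotations held in a second database


CATALOG = [
    Record("X0001", "Homo sapiens", 18950, ("glycoprotein", "nucleoprotein")),
    Record("X0002", "Homo sapiens", 9200, ("nucleoprotein",)),
    Record("X0003", "Pteropus", 18900, ("glycoprotein",)),
    Record("X0004", "Homo sapiens", 18800, ("nucleoprotein",)),
]


def retrieve(host: str, min_length: int, protein: str) -> Tuple[List[str], List[str]]:
    """Apply filters in a fixed order and log the record count after each step."""
    log = [f"start: {len(CATALOG)} candidate records"]

    # Stage 1: metadata-only filters, applied before anything large is fetched.
    staged = [r for r in CATALOG if r.host == host]
    log.append(f"host == {host!r}: {len(staged)} remain")
    staged = [r for r in staged if r.length >= min_length]
    log.append(f"length >= {min_length}: {len(staged)} remain")

    # Stage 2: filter that needs the per-record annotations.
    staged = [r for r in staged if protein in r.proteins]
    log.append(f"contains {protein!r}: {len(staged)} remain")

    # Sorted output keeps repeated runs byte-for-byte identical.
    return sorted(r.accession for r in staged), log


accessions, log = retrieve("Homo sapiens", 18000, "glycoprotein")
print(accessions)
print("\n".join(log))
```

Because the filter order and the output order are fixed, two calls with the same arguments return the same result and the same log, which is the property the benchmark's stability measurements reward.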
The development was collaborative: Luebbert and her team built gget virus with researchers at NCBI, specifically to reproduce the NCBI Virus web interface behavior that isn't exposed through any single existing API.
Variability in antibody coverage analysis: existing Zaire ebolavirus mutations across the glycoprotein, shown with the known footprints of the antibody therapeutics maftivimab and MBP134. Three runs of the same Sonnet 4 query produced three different mutation profiles for the targeted regions, against a manually curated baseline (leftmost). 1
With gget virus, all systems exceeded 90% accuracy and run-to-run variability largely disappeared (response stability improved to 0.92–1.00). The abstract also reports that the tool reduced data transfer by more than 98% for high-volume queries by staging retrieval and applying metadata constraints before downloading sequences. 2 The gap between models narrowed dramatically — model choice matters much less when the retrieval substrate is reliable.
One data point worth noting from the benchmarking: in one of 360 runs, GPT-5.5 independently found and used gget virus without being prompted to. It was the only run for that question that returned the correct answer.

The bigger lesson for biological data infrastructure

The paper positions gget virus as an example of a broader category: what Luebbert calls "context engines" — reliable, agent-accessible infrastructure for biological data. Other projects in this space include ToolUniverse, Edison Scientific's Robin, and Biomni; all involve creating structured harnesses that connect agents to biological data sources.
The open question is how long this category of infrastructure matters. If model capabilities continue improving at current rates, agents might eventually navigate fragmented portals, reconcile identifiers, and paginate correctly without specialized tooling. Luebbert acknowledges this possibility directly: "If we draw the model curve forward from the results above, it's easy to imagine a (very near) future in which the benefit of tools like gget virus approaches zero." 1
But she pushes back on the implication that this makes the work unnecessary. Even if a model can brute-force its way through a confusing bioinformatics workflow, that doesn't mean every agent doing viral surveillance should reinvent the same multi-step API choreography on each run. A deterministic layer is cheaper, faster, and auditable in ways that a capable-but-improvising model isn't. And if the harnesses do eventually become obsolete, the database design lesson still holds: biological databases need to treat agents as first-class users, not an afterthought.
This paper sits alongside Anthropic's chemistry white paper from June 5 — which showed Claude Opus 4.7 matching ChemDraw and MestReNova on NMR prediction 3 — as the two clearest recent statements of what "AI for science" actually requires at the infrastructure layer. The chemistry paper demonstrated what a model can do when given clean, structured inputs. This one explains why structuring those inputs is half the problem.
