25/6/2026 · 10:20

scientific-agent-skills: the benchmark data that didn't exist on June 1

First controlled benchmarks: 96→100% success on mass spec, but weak models auto-trigger the skill only 8% of the time.

Research at a glance

On June 1, this channel ran an overview of K-Dense-AI/scientific-agent-skills and flagged the same gap in every sentence: no benchmarks. The install was clean, the skill catalog was deep, the "160,000+ scientists" claim was unverified. The honest answer on whether these skills actually improved agent performance was: we don't know yet.

That answer changed on June 12. K-Dense published 250 controlled runs of the pyOpenMS skill, and the numbers are specific enough to be useful. 1 The same week, SkillsBench 1.1 released cross-model data showing a +16.6 percentage-point average lift from curated skills across multiple model families. 2 Today the library is at 29,300 stars, v2.53.0 (released June 23), and has been renamed from "Claude Scientific Skills" to "Scientific Agent Skills" to reflect that it runs on Cursor, Codex, Pi, Antigravity, and anything else that reads the open Agent Skills standard. 3

github.com · GitHub repository

K-Dense-AI/scientific-agent-skills

https://github.com/K-Dense-AI/scientific-agent-skills

What the pyOpenMS benchmark actually found

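The test was straightforward: 10 mass spectrometry analysis tasks, run 5 times each, across 2 model tiers (Claude Sonnet 4.6 as "strong," Claude Haiku 4.5 as "cheap"), with and without the pyOpenMS skill, for 250 total runs. 1 Tasks × repeats × tiers × skill on/off accounts for only 200 of those; the remaining 50 are consistent with a third Haiku arm, since the results below distinguish auto-triggered from explicitly invoked skill use.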

Results for the strong model:

Success rate: 96% → 100%
API errors: 1.00 per run → 0.08 per run (92% reduction)
Time per task: down 20%
Cost per task: down 10%

The 96%→100% delta sounds small. The K-Dense team's framing of why it matters is more interesting than the number itself. Without the skill, the 4% failure mode was not "agent returns an error and stops." It was "agent returns output that looks correct but contains zero adduct annotations, or returns m/z values (mass-to-charge ratios) that are plausible but wrong." K-Dense called this the real contribution of the skill: 1

"The skill's biggest contribution may be that it prevents silent, plausible-looking wrong science."

That framing is harder to quantify but more operationally relevant for anyone running real scientific workflows: without the skill, the failure is not a crash but a confident wrong answer.

The gotcha: weak models don't trigger skills automatically

The cheap-model numbers are where the benchmark gets interesting. Haiku without the skill: 74% success. Haiku with the skill, when the skill is explicitly invoked: 100% success — cost cut in half ($0.152 → $0.074/task), task time cut from 121.5 seconds to 31.1 seconds. 1
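As a quick check on the percentages quoted so far, the sketch below recomputes the relative changes from the raw before/after figures given in this article. The helper name `relative_change` is illustrative and not part of any K-Dense tooling.

```python
def relative_change(before: float, after: float) -> float:
    """Return the fractional change from before to after (negative = reduction)."""
    return (after - before) / before


# Before/after figures as quoted in this article for the pyOpenMS benchmark.
sonnet_api_errors = relative_change(1.00, 0.08)   # API errors per run
haiku_cost = relative_change(0.152, 0.074)        # USD per task
haiku_time = relative_change(121.5, 31.1)         # seconds per task

print(f"Sonnet API errors: {sonnet_api_errors:+.0%}")  # -92%
print(f"Haiku cost:        {haiku_cost:+.1%}")         # -51.3%
print(f"Haiku time:        {haiku_time:+.1%}")         # -74.4%
```

The Haiku cost drop works out to roughly half, and the time drop to roughly three quarters, which matches the prose.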

But here's the catch buried in the data: Haiku's auto-trigger rate was 8%. Left to its own judgment, the model invoked the pyOpenMS skill in only about 8 of every 100 attempts. The benchmark had to force 100% adoption for the 100% success rate to materialize.

K-Dense's conclusion was explicit: 1

"For smaller or cheaper models, do not rely on auto-trigger. Invoke the skill explicitly."

This has a direct workflow implication. If you're running Haiku or any cost-optimized model and expecting it to automatically reach for the scientific skills you've installed, you'll get 74% performance — indistinguishable from having no skill at all. The pattern that actually works: prefix your task prompt with the skill name (e.g., "Use the pyopenms skill to...") or invoke via the slash command directly.
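A rough way to see why auto-trigger on Haiku lands near the no-skill baseline is to treat the outcome as a mixture of triggered and untriggered runs. This is a back-of-the-envelope sketch, not a figure from the benchmark: it assumes triggered runs succeed at the explicit-invocation rate, untriggered runs succeed at the no-skill rate, and the two are independent.

```python
def expected_success(trigger_rate: float, with_skill: float, without_skill: float) -> float:
    """Blend the two success rates by how often the skill actually fires."""
    return trigger_rate * with_skill + (1.0 - trigger_rate) * without_skill


# Rates quoted in this article for Haiku; the mixture model itself is an assumption.
haiku_auto = expected_success(trigger_rate=0.08, with_skill=1.00, without_skill=0.74)
print(f"Estimated Haiku success with auto-trigger only: {haiku_auto:.1%}")  # 76.1%
```

Under those assumptions an 8% trigger rate buys about two percentage points over the 74% baseline, which is why explicit invocation is the practical fix.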

SkillsBench 1.1: cross-model signal

The pyOpenMS benchmark covers one skill on two model tiers. SkillsBench 1.1 extends the picture across models and harnesses. The top configuration reported by @xdotli (SkillsBench founder): GPT-5.5 on OpenHands at 67.3% resolution rate on scientific tasks — with curated skills producing an average +16.6 percentage-point lift across the full model fleet. 2 The K-Dense scientific-agent-skills bundle was one of the validated skill sets in that evaluation.

SkillsBench 1.1 also confirmed the K-Dense team's partnership framing: they're working with the SkillsBench project to maintain scientific-agent-skills as a benchmark fixture, which means future version updates will have externally validated performance data attached. That's a structural improvement over the zero-benchmark situation from June 1.

Community signal since June 1

Public feedback has accumulated since June 1. A thread in r/bioinformatics titled "Anyone using Claude or other bioinformatics agents" gathered 120 upvotes and 84 comments, with user nickomez1 sharing field experience using K-Dense skills in production bioinformatics pipelines. 4 That thread is the most substantive public community feedback the library has generated so far.

On X, @ruffy0369 integrated scientific-agent-skills into Hermes Agent as a core component — calling it an "autonomous AI Scientist" — with the PR drawing 33,000+ views. 5 @zanehkoch (EdisonSci) used the library as one of three key components in the Ren research harness, specifically for dynamic skill discovery.

Cristian Munteanu's LinkedIn take on the project has been the most-shared external framing: 6

"Skills are the SOPs of the AI era — codified expertise that any agent can use, regardless of the underlying model or framework."

The library has reached 66,300 total installs on Claude Marketplaces; the top four skills are scientific-writing (622), scientific-critical-thinking (604), scientific-visualization (603), and literature-review (586). 7

Install (still one command)

Prerequisites: Python 3.13+, uv package manager, macOS/Linux/WSL2. 3

npx skills add K-Dense-AI/scientific-agent-skills

Or with GitHub CLI (pass one of the listed values to --agent, not the whole pipe-separated string):

gh skill install K-Dense-AI/scientific-agent-skills --agent cursor|claude-code|codex|gemini

Progressive disclosure keeps the always-on token cost at 768 tokens across all 147 skills — the full docs only load when a skill is invoked. Compare that to loading all 147 SKILL.md files in full: ~12,910 tokens. K-Dense calls this the "many-tool scalability mechanism." 3
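To put those two token figures side by side, the sketch below computes how much of the full-load cost the always-on footprint represents. It only uses the numbers quoted above; the variable names are illustrative.

```python
# Token figures as quoted in this article.
always_on_tokens = 768      # always-on cost with progressive disclosure
full_load_tokens = 12_910   # approximate cost of loading every SKILL.md in full

share = always_on_tokens / full_load_tokens
savings = 1.0 - share

print(f"Always-on share of full load: {share:.1%}")   # 5.9%
print(f"Context saved before any skill runs: {savings:.1%}")  # 94.1%
```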

Install selectively for the skills you'll actually use — the README says this explicitly, and the benchmark data backs the reasoning: token overhead matters on cheap models. 3

Usage: prompts that actually trigger the skills

Based on the auto-trigger data above, here are the patterns that reliably invoke the skills across model tiers.

Mass spec analysis (explicit invocation):

Use the pyopenms skill to analyze this LC-MS/MS data file.
Identify all adducts and generate an annotated spectrum plot.

Literature review (slash command, Claude Code):

/literature-review transformer-based protein structure prediction 2022–2025,
PRISMA-compliant output, Nature citation style

Bioinformatics pipeline with sequential skill chaining:

Use the bioinformatics-pipeline skill followed by scientific-visualization
to process this RNA-seq count matrix and produce publication-ready volcano plots.

On auto-trigger for strong models (Claude Sonnet 4.6 and later): the benchmark shows these models trigger skills correctly when the task context matches the skill domain. For anything involving pyOpenMS, scientific writing, or literature review with a capable model, you can rely on natural-language prompts and let the model discover the skill. On weak models, always name the skill explicitly.
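If you script your prompts, the strong-versus-weak rule can be encoded in a small helper. This is a minimal sketch: `build_prompt` and its `name_skill` flag are hypothetical names, not part of the skills library or any agent client, and the prefix wording simply mirrors the "Use the pyopenms skill to..." pattern shown earlier.

```python
def build_prompt(task: str, skill: str, name_skill: bool = True) -> str:
    """Return a prompt that names the skill explicitly when name_skill is True."""
    task = task.strip()
    if not task:
        raise ValueError("task must be a non-empty instruction")
    if not name_skill:
        # Strong models: leave the request in natural language.
        return task[0].upper() + task[1:]
    # Weak models: prefix the request with the skill name.
    return f"Use the {skill} skill to {task}"


# Weak model: always name the skill.
print(build_prompt("analyze this LC-MS/MS data file.", "pyopenms"))
# Strong model: natural-language request, skill discovery left to the model.
print(build_prompt("analyze this LC-MS/MS data file.", "pyopenms", name_skill=False))
```

Defaulting `name_skill` to True keeps the safe behavior as the default, so a cheap model is never left to auto-trigger by accident.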

The independent review at knightli.com stated the limit plainly: 8

"It is not a magic button that automatically does science after installation."

Caveats that didn't change since June 1

Four issues from the original review still apply:

The "160,000+ scientists" user claim remains unverifiable. Every occurrence of that figure still traces back to the README. Timothy Kassis (K-Dense CTO) has stated it on X, but no independent source corroborates the number. 9
Community-contributed skills need manual review. The README is explicit: "It is ultimately your responsibility to review the skills you install and decide which ones to trust." K-Dense uses Cisco AI Defense Skill Scanner but notes that community contributions haven't had the same review depth as the core library. 3
No single-command full-automation. K-Dense's own positioning: the open-source skills put you as the orchestration layer. If you want autonomous multi-hour runs without human checkpoints, that's what their paid K-Dense Web platform is built for. 10 The comparison is direct: 1–4 hours of manual-guided work with the open-source skills vs. ~15 minutes of automated execution on the paid tier.
Setup friction is real. Python 3.13+, uv, and the right agent client configuration add up to 1–4 hours if you're starting from scratch. The install command is one line; the environment it drops into is not.

Author and maintenance health

The library comes from K-Dense Inc. and was founded and is maintained by its CTO, Timothy Kassis (MIT/Georgia Tech PhD, former Biostate AI). 3 It stands at 537 commits and 91 releases, with v2.53.0 current as of June 23, 2026. Release cadence has been steady, roughly every 5–7 days across recent versions. The Trivy CI false-negative bug (Issue #162 from the June 1 review) has been closed. Issue triage has improved; the GitHub issues page is more actively managed than it was at the June 1 snapshot.

When to install / when to skip

Install if:

You're a researcher or AI engineer running scientific workflows — bioinformatics, mass spec, drug discovery, clinical research, or any of the 18 domain categories — in Claude Code, Cursor, Codex, or Antigravity
You want the benchmark-verified gains on strong models (Sonnet-tier): 96%→100% success and a 92% API error reduction on the pyOpenMS mass spec tasks, the only skill benchmarked in this way so far
You're cost-optimizing with cheap models and are willing to explicitly name skills in your prompts (the auto-trigger problem is avoidable, just not automatic)
You want unified access to 100+ scientific databases (PubChem, ChEMBL, UniProt, ClinicalTrials.gov, and 78 others) without wiring them up manually

Skip if:

You need fully autonomous, end-to-end execution without human orchestration checkpoints — that's K-Dense Web (paid), not this
Your domain falls outside natural and exact sciences; social science and humanities coverage is sparse
You're on Windows without WSL configured
You want community-vetted quality on the full 147-skill catalog — the top 10 skills have real usage signals; the specialty skills in proteomics and neuroscience still have thin issue history

Quick reference

Repository	K-Dense-AI/scientific-agent-skills 3
License	MIT 3
Latest version	v2.53.0 (June 23, 2026) 3
Stars / forks / commits	29,300 stars · 3,000 forks · 537 commits 3
Skills	147 across 18 scientific domains 3
Supported agents	Claude Code, Cursor, Codex, Pi, Google Antigravity, OpenClaw, NemoClaw, Hermes, and any Agent Skills-standard client 3
Prerequisites	Python 3.13+, `uv`, macOS/Linux/WSL2 3
Install	`npx skills add K-Dense-AI/scientific-agent-skills` 3
Benchmark (strong model)	96% → 100% task success, 92% API error reduction 1
Benchmark (weak model)	74% → 100% when explicitly invoked; 8% auto-trigger rate without explicit call 1
SkillsBench 1.1 lift	+16.6 pp average across model fleet 2
Total installs	66,300+ (Claude Marketplaces) 7
Author	Timothy Kassis (@TimothyKassis), K-Dense Inc. CTO 3