AIL Player Card #014 — Kimi K2.6: The Agentic Swarm

92 OVR. CW. SWE-Bench Pro 58.6%, just ahead of GPT-5.5. AIME 2026 96.4%. 981 tokens/sec on Cerebras. 300-agent swarm built into the model. Open-weight, Modified MIT, $0.95/M input. Moonshot Challengers just fielded the most dangerous open-weight player in the league. #AILeague

AIL·Player Card
2026/6/11 · 8:06

Research note

Nobody gave Moonshot AI a lane.
They're a Beijing-based startup, founded in 2023, without the compute fortress of Google or the brand gravity of OpenAI. Their consumer app, Kimi, was popular in China. Their model line was progressing fast but quietly. And then on April 20, 2026, they dropped a trillion-parameter open-weight bomb on the frontier — and Cursor immediately built their coding product on top of it.
Kimi K2.6 is on the pitch. The Moonshot Challengers have arrived in the AI League. 1

📋 Player card

Field | Value
Player | Kimi K2.6
Club | Moonshot Challengers
Position | CW (Community Wing)
Overall | 92
Season | AI League 2026

Stat breakdown

Dimension | Score | What it measures
RZN (Reasoning) | 91 | AIME 2026 96.4%, GPQA-Diamond 90.5%, HMMT 2026 92.7%
CRE (Creativity) | 85 | Code Arena WebDev Elo 1,529 (6th of 67 models)
SPD (Speed) | 94 | 981 TPS on Cerebras, the fastest trillion-param open model ever clocked
MLT (Multimodal) | 82 | MMMU-Pro 79.4%, MathVision 93.2%, 400M MoonViT vision encoder
SAF (Safety) | 79 | Hallucination rate 39% (down from K2.5's 65%), approaching Opus 4.7
VAL (Value) | 93 | $0.95/$4.00 per M input/output tokens, 5–6× cheaper than Claude Opus 4.7
Position note: CW (Community Wing) = open-weight / open-source roster pick; competes on accessibility, cost, and self-hostability; community-first. K2.6 extends this ceiling into frontier coding territory — prior CW archetype was about accessibility; K2.6 is about frontier agentic performance at community prices.

What Moonshot just fielded

K2.6 is a 1-trillion-parameter sparse Mixture-of-Experts model, 32 billion parameters active per token. 1 That architecture isn't the headline — DeepSeek runs MoE too. What separates K2.6 is what Moonshot post-trained into it: a native Agent Swarm primitive that lets the model autonomously fan out to 300 parallel sub-agents and coordinate 4,000 steps of tool-call execution across the whole run.
Most agent frameworks today — LangGraph, CrewAI, AutoGen — bolt orchestration on top of the model from the outside. The model is a black box. K2.6 internalises the orchestration. It decides when to fan out, how many sub-agents to spawn, and how to reconcile their results. The reference run Moonshot published is a 12-plus-hour autonomous port of a Qwen inference engine to Zig — 4,000+ tool calls, 14 iterations, throughput climbing from 15 to 193 tokens per second by the end. 1
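For readers who want the shape of the play on paper, here is a minimal sketch of the fan-out / reconcile pattern that external frameworks implement by hand and that K2.6 is described as doing internally. It is not Moonshot's implementation or API: `run_sub_agent`, `fan_out`, `reconcile`, and `run_swarm` are hypothetical names, and the sub-agent is a stub standing in for a real model call.

```python
import asyncio

MAX_SUB_AGENTS = 300  # swarm width quoted in the article


async def run_sub_agent(task: str) -> str:
    """Hypothetical sub-agent: a stub standing in for a model call plus tool use."""
    await asyncio.sleep(0)  # placeholder for real I/O
    return f"done:{task}"


async def fan_out(tasks: list[str], width: int = MAX_SUB_AGENTS) -> list[str]:
    """Run all tasks concurrently, with at most `width` in flight at once."""
    gate = asyncio.Semaphore(width)

    async def guarded(task: str) -> str:
        async with gate:
            return await run_sub_agent(task)

    # gather() returns results in the same order as the submitted tasks
    return await asyncio.gather(*(guarded(t) for t in tasks))


def reconcile(results: list[str]) -> dict[str, int]:
    """Toy reconcile step: count how many sub-agents reported completion."""
    return {"finished": sum(r.startswith("done:") for r in results)}


def run_swarm(tasks: list[str]) -> dict[str, int]:
    """Fan out, wait for every sub-agent, then merge the results."""
    return reconcile(asyncio.run(fan_out(tasks)))
```

In an external framework, the caller decides the width and the reconcile logic. The claim in the text is that K2.6 makes those decisions itself during the run.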
On BrowseComp, plain K2.6 scores 83.2%. With swarms enabled: 86.3%. The lift is bigger on parallelisable work — batch file reads across a large monorepo, multi-path literature reviews — and near zero on inherently sequential tasks. That's honest engineering: Moonshot tells you when to disable it.
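One rough way to see why swarms help parallelisable work and barely move sequential work is Amdahl's law. This is a textbook model of wall-clock speedup, not something Moonshot published, and the function names below are illustrative.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup when only part of the work can run in parallel."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)


def implied_parallel_fraction(speedup: float, workers: int) -> float:
    """Invert Amdahl's law: p = (1 - 1/S) / (1 - 1/N)."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / workers)


if __name__ == "__main__":
    # A fully sequential task gains nothing, however many sub-agents are spawned.
    print(amdahl_speedup(0.0, 300))
    # Parallel fraction implied by a 4x speedup with 300 workers.
    print(implied_parallel_fraction(4.0, 300))
```

Under this model, the 20-minute to 3–5 minute sprint quoted in the verdict (a 4× to 6.7× speedup) with 300 workers would imply that roughly 75% to 85% of the run is parallelisable. The remaining serial portion, not the swarm width, sets the ceiling.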

The benchmark picture

K2.6's benchmark numbers aren't the soft "competitive with frontier" claims you hear every launch cycle. They are precise results against specific named opponents.
SWE-bench Pro: 58.6%, effectively level with GPT-5.5 (57.7%, a 0.9-point gap), ahead of Gemini 3.1 Pro (54.2%), behind Claude Opus 4.7 (64.3%). 2
SWE-bench Verified: 80.2%, within a tight band of every top-tier model. DeepSeek V4 Pro sits 0.4 points ahead at 80.6%.
AIME 2026: 96.4%, HMMT 2026: 92.7%, GPQA-Diamond: 90.5% — highest of any open-weights model on competition math and graduate science.
Artificial Analysis Intelligence Index: 54 — highest of any open-weights model. Three points behind the closed flagship cluster (Anthropic, Google, OpenAI all score 57). 1
Tool-invocation success: 96.6% — the number that actually matters for agentic work, because a model that drops tools at 90% reliability loses a multi-hour autonomous run to compounding errors.
The hallucination rate — 39% on AA-Omniscience, down from K2.5's 65% — tells the story of the K2.5→K2.6 upgrade in a single number. The model learned to be less confident in things it doesn't know. 1
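The compounding-error point about tool-invocation success can be made concrete with a little arithmetic. The sketch below assumes every tool call succeeds or fails independently with the same probability, which is a simplification; the function names are illustrative.

```python
def clean_run_probability(success_rate: float, calls: int) -> float:
    """Probability that every call in a chain succeeds (independence assumed)."""
    return success_rate ** calls


def expected_failures(success_rate: float, calls: int) -> float:
    """Average number of failed calls the agent has to detect and recover from."""
    return calls * (1.0 - success_rate)


if __name__ == "__main__":
    for rate in (0.966, 0.90):
        print(
            f"rate={rate:.3f} "
            f"clean 20-call chain={clean_run_probability(rate, 20):.2f} "
            f"failures per 4,000 calls={expected_failures(rate, 4000):.0f}"
        )
```

Under these assumptions, a 20-call chain runs clean about half the time at 96.6% and about 12% of the time at 90%. Over a 4,000-call run, that is roughly 136 failures to recover from versus roughly 400.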

The speed dimension nobody expected

Benchmarks are one thing. Speed is what convinced Cursor.
Cerebras clocked K2.6 at 981 output tokens per second on their Wafer Scale Engines — 6.7× faster than the next-fastest GPU cloud, 23× faster than the median inference provider. 2 For a 10,000-token prompt requesting 500 output tokens, Cerebras delivers the full response in 5.6 seconds. The official Kimi endpoint takes 163.7 seconds. That is a 29× improvement in wall-clock time.
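The wall-clock figures above can be checked with a few lines of arithmetic. One assumption to flag: the sketch applies the 981 tokens/sec headline rate to this specific 500-token request, which the article does not state explicitly.

```python
def decode_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time spent generating output at a steady decode rate."""
    return output_tokens / tokens_per_second


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """How many times faster the fast path is in wall-clock terms."""
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    decode = decode_seconds(500, 981.0)   # generation time at the headline rate
    print(f"decode time: {decode:.2f} s")
    print(f"everything else: {5.6 - decode:.2f} s of the 5.6 s total")
    print(f"speedup vs official endpoint: {speedup(163.7, 5.6):.1f}x")
```

If that assumption holds, generation accounts for only about half a second of the 5.6-second response, so most of the wall-clock time goes to processing the 10,000-token prompt and other overhead rather than to decoding.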
The Cerebras infrastructure angle is separate from model quality. But what it reveals is that K2.6 — unlike many MoE models that get slow at serving time because weight routing creates memory bottlenecks — runs cleanly on highly parallelised hardware. The architecture scales. 3
And then there is the voice latency number: 452ms time-to-first-token with chain-of-thought enabled on the aiewf-eval voice benchmark — the first trillion-parameter model to clear the 500ms conversational threshold. 3 For context: Gemini 3.5 Flash, explicitly built for speed, sits at 960ms. A model with ten times the parameters is twice as responsive.

Head-to-head vs same-position rivals

K2.6 competes with other CW-class players on open-weight frontier coding:
Model | Club | OVR | SWE-Bench Pro | AIME 2026 | AA Intelligence Index | API Price (input/M)
Kimi K2.6 | Moonshot Challengers | 92 | 58.6% | 96.4% | 54 | $0.95
MiniMax M3 | MiniMax Challengers | 91 | 59.0% | ~88% | ~52 | $0.30
Llama 4 Maverick | Meta Open | 88 | ~45% | ~84% | ~46 | $0.18
DeepSeek V4 Pro | DeepSeek Athletic | 95 | 55.1% | 93.5% | ~55 | $0.14
Footnote: DeepSeek V4 Pro is listed as VE-class (not CW), but included for cross-arch context. MiniMax M3 edges K2.6 on raw SWE-bench Pro by 0.4 points; K2.6 leads on AIME, reasoning depth, and Agent Swarm infrastructure. 4 5

License, deployment, and the open-weight story

This is where the Moonshot Challengers make their case as a club, not just a model.
The weights are published on Hugging Face under a Modified MIT license: commercially usable below ~100M MAU or $20M/month revenue, auditable by anyone, self-hostable on your own hardware. 1 The open-source Kimi Code CLI agent also ships under a clean MIT license.
Native INT4 quantisation means the self-hosted path is real: the weights are ~594 GB in INT4, roughly 2× faster and 50% less memory than FP16, and Moonshot used Quantisation-Aware Training so quality loss is negligible, not an afterthought.
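A back-of-envelope check on the self-hosting footprint, assuming 1 GB = 10^9 bytes and a round 1 trillion parameters. The function names are illustrative, and the per-parameter figures are derived only from the numbers quoted in this article.

```python
def footprint_gb(params: float, bits_per_param: float) -> float:
    """Weight storage in GB (10^9 bytes) at a uniform precision."""
    return params * bits_per_param / 8 / 1e9


def effective_bits(size_gb: float, params: float) -> float:
    """Average bits stored per parameter for a checkpoint of a given size."""
    return size_gb * 1e9 * 8 / params


if __name__ == "__main__":
    params = 1e12  # "1-trillion-parameter" as stated in the article
    print(f"uniform INT4: {footprint_gb(params, 4):.0f} GB")
    print(f"uniform FP16: {footprint_gb(params, 16):.0f} GB")
    print(f"quoted 594 GB: {effective_bits(594, params):.2f} bits per parameter")
```

A uniform 4-bit checkpoint would come to 500 GB, so the quoted ~594 GB works out to about 4.75 bits per parameter on average. A plausible reading, though an assumption here, is that quantisation scales and some layers kept at higher precision account for the difference. The same arithmetic puts a uniform FP16 copy at 2,000 GB, so the "50% less memory" figure may be measured against a different baseline than pure FP16.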
The enterprise path runs through NVIDIA NIM (there's an official K2.6 NIM container on NGC), Cloudflare Workers AI, DeepInfra, and GMI Cloud. OpenRouter offers K2.6 at $0.74/$3.50 — under the Moonshot direct price.
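To put the two price points side by side, here is a small cost calculator using only the per-million-token prices quoted in this article. The dictionary keys and function name are illustrative, and real invoices may include caching discounts or fees not modelled here.

```python
# (input USD per 1M tokens, output USD per 1M tokens), as quoted in the article
PRICES = {
    "moonshot_direct": (0.95, 4.00),
    "openrouter": (0.74, 3.50),
}


def request_cost(input_tokens: int, output_tokens: int, provider: str) -> float:
    """USD cost of one request at flat per-token pricing."""
    price_in, price_out = PRICES[provider]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Same request shape as the speed example: 10,000 tokens in, 500 out.
    for provider in PRICES:
        print(f"{provider}: ${request_cost(10_000, 500, provider):.5f}")
```

For that request shape, the direct price comes to about 1.15 cents and the OpenRouter price to about 0.92 cents, so the gap only becomes material at high volume.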
And Cursor's Composer 2.5, built on K2.5 with heavy RL fine-tuning on user coding data, scores 79.8% on SWE-bench Verified and costs $0.50/M. 6 That is arguably the strongest validation in the league: the leading AI coding tool picked Moonshot's architecture as the base for their premium product.

Season highlights

Three plays define K2.6's debut season:
The Swarm play. 300 parallel sub-agents, 4,000 coordinated steps, 12-hour autonomous coding runs. No other open-weight model can field this shape of work without falling apart.
The Speed record. 981 tokens/second, 452ms voice latency. When the Cerebras partnership went live, K2.6 became the fastest trillion-parameter model anyone has ever measured. Closed-source rivals are not even in the conversation at this speed tier.
The Cursor Endorsement. When the IDE your team already pays for chooses your architecture as its coding brain, that is a signed contract the league can read. K2.6 doesn't need to win a press release war — it already powers developer workflows at scale.

Scout's verdict

The CW position in this league has meant different things at different times. Llama 4 Maverick brought accessibility and community scale. MiniMax M3 brought frontier coding in an open package. Kimi K2.6 brings all of that plus the first native agent swarm architecture, the fastest inference numbers at scale, and a Cursor integration that proves enterprise adoption doesn't require a closed-source badge.
The limitation is context window: 256K tokens vs. DeepSeek's 1M. For truly massive context tasks, that still matters. And Moonshot's 39% hallucination rate — while massively improved from K2.5 — still trails Claude Opus 4.7's ~31%.
But at five to six times the cost efficiency of Opus 4.7, with open weights you can audit and self-host, and a swarm architecture that turns a 20-minute single-agent run into a 3–5 minute parallel sprint: 92 OVR is earned, not inflated. The Moonshot Challengers walked into this league carrying a player nobody had scouted, and it immediately made the starting XI.
#AILeague
