
AIL Player Card #009 — Gemini 3.5 Flash: The Agentic Sprinter
91 OVR. AS. Terminal-Bench 76.2%. SWE-Bench Pro 55.1%. 4× faster than frontier rivals. $1.50/M input. Google National finally fields a player built for the agentic era — faster, cheaper, and better at multi-agent loops than its own flagship. #AILeague

Research brief
91 OVR. AS. Google National finally fields a player who does the work instead of talking about it. Gemini 3.5 Flash shipped at Google I/O on May 19, 2026 — and it outperformed Google's own flagship Gemini 3.1 Pro on nearly every coding and agentic benchmark while running 4× faster than comparable frontier models. At $1.50/$9.00 per million tokens, it's 40% cheaper than 3.1 Pro. The big-budget national team just sent its most cost-effective sub-agent to the pitch. #AILeague
The scouting report
Google National has had a reputation problem. Generous CapEx, world-class research, but a habit of fielding squads that promise more than they deliver on match day. Gemini 2.5 Pro finally changed the narrative in the spring — but it played the slow, deep-thinking center. For this issue, Google has promoted a different archetype.
Gemini 3.5 Flash is not trying to be the cleverest player on the pitch. It is trying to be the fastest, most relentless sub-agent in the league — the one that keeps running at 289 tokens per second while the opposing midfield stops to think. 1
DeepMind chief technologist Koray Kavukcuoglu put it plainly before the I/O keynote: "3.5 Flash offers an incredible combination of quality and low latency. It outperforms our latest frontier model, 3.1 Pro, on nearly all the benchmarks." 2 That is a rare thing for a Flash-tier model to do — and it changes the position classification entirely.
Stat card
| Dimension | Score | What it measures |
|---|---|---|
| OVR | 91 | Weighted composite |
| RZN (Reasoning) | 84 | GPQA Diamond 82.8%, ARC-AGI-2 72.1%, HLE 40.2% |
| CRE (Creativity) | 82 | CharXiv Reasoning 84.2%, MMMU-Pro 83.6% |
| SPD (Speed) | 97 | 289 tokens/sec, 4× faster than frontier rivals |
| MLT (Multimodal) | 88 | Native text / image / video / audio / PDF input |
| SAF (Safety) | 79 | Strengthened CBRN safeguards; better-calibrated refusals |
| VAL (Value) | 88 | $1.50 input / $9.00 output per million tokens; 40% cheaper than 3.1 Pro |
Position: AS — Agentic Sprinter. Designed for multi-agent loops and sub-agent deployment at scale. Runs autonomously for multiple hours on complex coding pipelines. Trades pure reasoning ceiling for unmatched throughput at frontier-adjacent intelligence. The AS designation is new to the AI League taxonomy and reflects a generation of models built explicitly for the agentic era rather than the chatbot era.
Season highlights
Terminal-Bench 2.1: 76.2%, ahead of Gemini 3.1 Pro (70.3%) but behind GPT-5.5 (82.7%), the SC-position comparison. Flash sits clearly in the agentic tier. 3
SWE-Bench Pro: 55.1%, making it one of the first Flash-tier models to break the 50% threshold on the coding benchmark that actually matters in 2026. Previous-generation Flash models were stuck in the 30s.

Agentic finance benchmark (Finance Agent v2): 57.9%, up 14.9 percentage points from Gemini 3.1 Pro's 43.0%. This is the outlier performance in the eval card. Multi-step financial workflows are precisely the kind of task 3.5 Flash was tuned for through its co-development with Antigravity. 3
GDPval-AA agentic Elo: 1656 — versus 1314 for 3.1 Pro. This is the evaluation that Google sees as the long-term replacement for text-chat arena ranking, measuring economic value creation per agent session. A 342-Elo gap over the flagship in the same team's lineup is not a rounding error.
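To put the 342-point gap in perspective, here is a minimal sketch that converts an Elo difference into an expected head-to-head score. It assumes GDPval-AA follows the conventional logistic Elo curve with a 400-point scale, which the source does not confirm; `expected_score` is an illustrative helper name.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


flash_elo = 1656  # GDPval-AA figure quoted for Gemini 3.5 Flash
pro_elo = 1314    # GDPval-AA figure quoted for Gemini 3.1 Pro

gap = flash_elo - pro_elo
win_share = expected_score(flash_elo, pro_elo)

print(f"Elo gap: {gap}")
print(f"Expected score for Flash vs 3.1 Pro: {win_share:.1%}")
```

Under that assumption, the gap corresponds to Flash being preferred in roughly seven of every eight pairwise comparisons.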
Deployed as the default model in the Gemini app, AI Mode in Search, and Gemini Spark — Google's 24/7 personal agent. When a company deploys its newest Flash model as the backbone of its consumer products the same week it ships, the adoption signal is real. 2
The I/O keynote demo put the agentic scope in numbers: 93 sub-agents running in parallel inside Antigravity 2.0, collectively processing 2.6 billion tokens to build a working operating system from scratch. Flash's AS position is not a marketing angle. It is what the benchmark stack describes.
Where it falls short
The SPD 97 and VAL 88 are genuine. The RZN 84 is honest about the ceiling.
ARC-AGI-2 at 72.1% is 5 points below Gemini 3.1 Pro's 77.1%. GPQA Diamond at 82.8% is 11 points below 3.1 Pro's 94.1%. Humanity's Last Exam at 40.2% trails both 3.1 Pro (44.4%) and Claude Opus 4.8. 1 On pure hard-science reasoning, Flash does not beat the Pro tier — and that is the honest read.
Long-context recall also dips: MRCR v2 at 128k shows 77.3% for Flash versus 84.9% for 3.1 Pro. At 1M tokens the gap closes (26.6% vs 26.3%), but neither number is impressive. Teams that need sustained 100k+ retrieval fidelity should route to 3.1 Pro. 3
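The routing advice above can be written down as a simple rule. This is a hypothetical sketch: the 100k-token threshold comes from the text, while the model identifier strings and the `needs_high_recall` flag are illustrative assumptions, not real API model names.

```python
LONG_CONTEXT_THRESHOLD = 100_000  # tokens; taken from the "100k+" guidance in the text


def pick_model(context_tokens: int, needs_high_recall: bool) -> str:
    """Route recall-sensitive long-context work to 3.1 Pro, everything else to Flash."""
    if needs_high_recall and context_tokens >= LONG_CONTEXT_THRESHOLD:
        return "gemini-3.1-pro"  # hypothetical identifier
    return "gemini-3.5-flash"  # hypothetical identifier


print(pick_model(128_000, needs_high_recall=True))
print(pick_model(128_000, needs_high_recall=False))
print(pick_model(8_000, needs_high_recall=True))
```

A real router would likely weigh latency and cost as well; the point here is only that the Pro tier is reserved for the narrow case where the recall numbers favor it.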
The safety penalty is moderate rather than severe — SAF 79 is better than Grok 4's 48 from last week, reflecting genuine progress on CBRN calibration. But Gemini is still facing legal scrutiny over the 2025 chatbot-linked suicide incident, and shipping an always-on consumer agent (Spark) during active litigation is a reputational exposure the league office is watching. 2
Head-to-head: Agentic Sprinter class
| Model | Team | OVR | Terminal-Bench 2.1 | SWE-Bench Pro | Speed | Input price ($/M tokens) |
|---|---|---|---|---|---|---|
| Gemini 3.5 Flash | Google National | 91 | 76.2% | 55.1% | 289 tok/s | $1.50 |
| GPT-5.5 | OpenAI United | 93 | 82.7% | — | ~70 tok/s | $5.00 |
| Gemini 3.1 Pro | Google National | — | 70.3% | 54.2% | ~70 tok/s | $2.50 |
| Gemini 3.5 Flash (12× Opt.) | Google National | — | est. same | est. same | ~3,500 tok/s | — |
The 12× optimized variant referenced at I/O is expected to match Flash's benchmark quality at roughly 12× the throughput, and Google says it is coming. That would push the AS position's speed ceiling well above what is available today. 2
GPT-5.5 still leads on Terminal-Bench and pure agentic task completion (SC position, OVR 93). 3.5 Flash is not the league's best agentic executor, but it is the most cost-efficient one at this intelligence tier: on the input prices listed above, roughly three Flash agents running in parallel cost the same as one GPT-5.5 call.
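The pricing comparison can be checked with a few lines of arithmetic. The sketch below uses only the input prices from the table; output prices for the other models are not given in this card, so a full per-call cost would need figures it does not supply. The helper names are illustrative.

```python
# Input prices in USD per million tokens, as listed in the head-to-head table.
INPUT_PRICE = {
    "Gemini 3.5 Flash": 1.50,
    "Gemini 3.1 Pro": 2.50,
    "GPT-5.5": 5.00,
}


def discount(cheaper: str, pricier: str) -> float:
    """Fractional saving on input price when choosing `cheaper` over `pricier`."""
    return 1.0 - INPUT_PRICE[cheaper] / INPUT_PRICE[pricier]


def budget_ratio(cheaper: str, pricier: str) -> float:
    """Input tokens of `cheaper` purchasable for the price of one input token of `pricier`."""
    return INPUT_PRICE[pricier] / INPUT_PRICE[cheaper]


saving = discount("Gemini 3.5 Flash", "Gemini 3.1 Pro")
ratio = budget_ratio("Gemini 3.5 Flash", "GPT-5.5")

print(f"Flash vs 3.1 Pro input saving: {saving:.0%}")
print(f"Flash input tokens per GPT-5.5 input token of spend: {ratio:.2f}x")
```

On input price alone, this reproduces the 40% saving over 3.1 Pro and gives a ratio of about 3.3 against GPT-5.5.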
Coach's verdict
Google National has spent two years being outrun by OpenAI United and out-reasoned by Anthropic FC. Gemini 3.5 Flash does not fix both problems at once, but it solves the one that actually moves developer adoption: sub-agent economics at frontier-adjacent quality.
The Antigravity integration is worth noting outside the marketing layer. Co-developing the model with its deployment platform means Flash was optimized against real multi-agent loop patterns, not just benchmark prompt formats. That's a different kind of training signal, and the Finance Agent v2 and GDPval-AA numbers suggest it shows. 1
The next card in the rotation at Google National is likely 3.5 Pro — the orchestrator that Flash is explicitly designed to serve under. When that ships, the AS + RP pairing could give Google National a legitimate two-player combination for the first time. Until then, Flash holds the field alone, and it's earning its minutes.
91 OVR. AS. Speed 97. Google National sends its fastest player to the pitch — and this time the benchmarks agree. #AILeague