AIL Player Card #009 — Gemini 3.5 Flash: The Agentic Sprinter

91 OVR. AS. Google National finally fields a player who does the work instead of talking about it. Gemini 3.5 Flash shipped at Google I/O on May 19, 2026 — and it outperformed Google's own flagship Gemini 3.1 Pro on nearly every coding and agentic benchmark while running 4× faster than comparable frontier models. At $1.50/$9.00 per million tokens, it's 40% cheaper than 3.1 Pro. The big-budget national team just sent its most cost-effective sub-agent to the pitch. #AILeague

The scouting report

Google National has had a reputation problem. Generous CapEx, world-class research, but a habit of fielding squads that promise more than they deliver on match day. Gemini 2.5 Pro finally changed the narrative in the spring — but it played the slow, deep-thinking center. For this issue, Google has promoted a different archetype.

Gemini 3.5 Flash is not trying to be the cleverest player on the pitch. It is trying to be the fastest, most relentless sub-agent in the league — the one that keeps running at 289 tokens per second while the opposing midfield stops to think. 1

DeepMind chief technologist Koray Kavukcuoglu put it plainly before the I/O keynote: "3.5 Flash offers an incredible combination of quality and low latency. It outperforms our latest frontier model, 3.1 Pro, on nearly all the benchmarks." 2 That is a rare thing for a Flash-tier model to do — and it changes the position classification entirely.

Stat card

Dimension	Score	Basis
OVR	91	Weighted composite
RZN (Reasoning)	84	GPQA Diamond 82.8%, ARC-AGI-2 72.1%, HLE 40.2%
CRE (Creativity)	82	CharXiv Reasoning 84.2%, MMMU-Pro 83.6%
SPD (Speed)	97	289 tokens/sec, 4× faster than frontier rivals
MLT (Multimodal)	88	Native text / image / video / audio / PDF input
SAF (Safety)	79	Strengthened CBRN safeguards; better-calibrated refusals
VAL (Value)	88	$1.50 input / $9.00 output per million tokens; 40% cheaper than 3.1 Pro
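
The card does not publish the weights behind the OVR composite. As a rough sanity check, the sketch below computes a weighted mean of the six dimension scores. The equal weights are an assumption for illustration only, not the league's formula. An equal-weight mean lands in the mid-80s, so the published 91 presumably leans on some dimensions more than others.

```python
# Dimension scores copied from the stat card above.
SCORES = {"RZN": 84, "CRE": 82, "SPD": 97, "MLT": 88, "SAF": 79, "VAL": 88}

def composite(scores, weights=None):
    """Weighted mean of dimension scores; equal weights if none are given."""
    if weights is None:
        # Assumption: equal weighting. The real OVR weights are not published.
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

equal_weight_ovr = composite(SCORES)
print(round(equal_weight_ovr, 1))  # 86.3
```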

Position: AS — Agentic Sprinter. Designed for multi-agent loops and sub-agent deployment at scale. Runs autonomously for multiple hours on complex coding pipelines. Trades pure reasoning ceiling for unmatched throughput at frontier-adjacent intelligence. The AS designation is new to the AI League taxonomy and reflects a generation of models built explicitly for the agentic era rather than the chatbot era.

Season highlights

Terminal-Bench 2.1: 76.2%. That beats Gemini 3.1 Pro (70.3%) but trails GPT-5.5 (82.7%), which plays the SC position; Flash sits clearly in the agentic tier. 3

SWE-Bench Pro: 55.1%. One of the first Flash-tier models to break the 50% threshold on the coding benchmark that actually matters in 2026. Previous-generation Flash models were stuck in the 30s.

Figure: Benchmark comparison of Gemini 3.5 Flash vs Gemini 3.1 Pro and GPT-5.5 across coding, agentic, multimodal, and reasoning evals. 1

Agentic finance benchmark (Finance Agent v2): 57.9%, up 14.9 percentage points from Gemini 3.1 Pro's 43.0%. This is the outlier performance in the eval card. Multi-step financial workflows are precisely the kind of agentic loop that Gemini 3.5 Flash was tuned for during its co-development with Antigravity. 3

GDPval-AA agentic Elo: 1656 — versus 1314 for 3.1 Pro. This is the evaluation that Google sees as the long-term replacement for text-chat arena ranking, measuring economic value creation per agent session. A 342-Elo gap over the flagship in the same team's lineup is not a rounding error.
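
To put the 342-point gap in more familiar terms, the sketch below applies the standard Elo expected-score formula. It assumes GDPval-AA uses the conventional 400-point logistic scale, which the source does not confirm.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo logistic model.

    Assumption: the 400-point scale; GDPval-AA may use a different one.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

flash_elo, pro_elo = 1656, 1314  # figures quoted above
p_flash = elo_expected_score(flash_elo, pro_elo)
print(f"gap={flash_elo - pro_elo}, expected score={p_flash:.2f}")
```

Under that assumption, Flash would be expected to come out ahead in roughly seven of every eight head-to-head comparisons with 3.1 Pro.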

Deployed as the default model in the Gemini app, AI Mode in Search, and Gemini Spark — Google's 24/7 personal agent. When a company deploys its newest Flash model as the backbone of its consumer products the same week it ships, the adoption signal is real. 2

The I/O keynote demo put the agentic scope in numbers: 93 sub-agents running in parallel inside Antigravity 2.0, collectively processing 2.6 billion tokens to build a working operating system from scratch. Flash's AS position is not a marketing angle. It is what the benchmark stack describes.
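
For a sense of scale, the sketch below turns those demo figures into a per-agent average and a rough cost range at the list prices quoted on this card. The input/output split of the 2.6 billion tokens is not given, so the code only brackets the bill between "all input" and "all output"; it also assumes list pricing with no caching or batch discounts.

```python
TOTAL_TOKENS = 2.6e9   # tokens processed in the demo, per the keynote figure
SUB_AGENTS = 93
INPUT_PRICE = 1.50     # USD per million input tokens
OUTPUT_PRICE = 9.00    # USD per million output tokens

# Average load per sub-agent (the real distribution is unknown).
tokens_per_agent = TOTAL_TOKENS / SUB_AGENTS

# Bracket the cost: the true figure lies somewhere between these bounds.
millions_of_tokens = TOTAL_TOKENS / 1e6
cost_if_all_input = millions_of_tokens * INPUT_PRICE
cost_if_all_output = millions_of_tokens * OUTPUT_PRICE

print(f"~{tokens_per_agent / 1e6:.1f}M tokens per agent")
print(f"cost between ${cost_if_all_input:,.0f} and ${cost_if_all_output:,.0f}")
```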


Where it falls short

The SPD 97 and VAL 88 are genuine. The RZN 84 is honest about the ceiling.

ARC-AGI-2 at 72.1% is 5 points below Gemini 3.1 Pro's 77.1%. GPQA Diamond at 82.8% is 11 points below 3.1 Pro's 94.1%. Humanity's Last Exam at 40.2% trails both 3.1 Pro (44.4%) and Claude Opus 4.8. 1 On pure hard-science reasoning, Flash does not beat the Pro tier — and that is the honest read.

Long-context recall also dips: MRCR v2 at 128k shows 77.3% for Flash versus 84.9% for 3.1 Pro. At 1M tokens the two are effectively tied (26.6% for Flash vs 26.3% for 3.1 Pro), but neither number is impressive. Players who need sustained 100k+ retrieval fidelity should route to 3.1 Pro. 3
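
One way to act on that advice is a simple router that sends long-context retrieval jobs to the Pro tier and everything else to Flash. The sketch below is illustrative only: the `Task` type, the model labels, and the 100k-token threshold are placeholders derived from the text, not real API identifiers or a recommended cutoff.

```python
from dataclasses import dataclass

# Placeholder threshold based on the "100k+" guidance above.
LONG_CONTEXT_THRESHOLD = 100_000  # tokens

@dataclass
class Task:
    context_tokens: int
    needs_retrieval_fidelity: bool

def pick_model(task: Task) -> str:
    """Return a hypothetical model label for the task."""
    if task.needs_retrieval_fidelity and task.context_tokens >= LONG_CONTEXT_THRESHOLD:
        return "gemini-3.1-pro"    # hypothetical label, not a confirmed API model ID
    return "gemini-3.5-flash"      # hypothetical label, not a confirmed API model ID

print(pick_model(Task(context_tokens=128_000, needs_retrieval_fidelity=True)))
print(pick_model(Task(context_tokens=8_000, needs_retrieval_fidelity=True)))
```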

The safety penalty is moderate rather than severe — SAF 79 is better than Grok 4's 48 from last week, reflecting genuine progress on CBRN calibration. But Gemini is still facing legal scrutiny over the 2025 chatbot-linked suicide incident, and shipping an always-on consumer agent (Spark) during active litigation is a reputational exposure the league office is watching. 2

Head-to-head: Agentic Sprinter class

Model	Team	OVR	Terminal-Bench 2.1	SWE-Bench Pro	Speed	$/M input tokens
Gemini 3.5 Flash	Google National	91	76.2%	55.1%	289 tok/s	$1.50
GPT-5.5	OpenAI United	93	82.7%	—	~70 tok/s	$5.00
Gemini 3.1 Pro	Google National	—	70.3%	54.2%	~70 tok/s	$2.50
Gemini 3.5 Flash (12× Opt.)	Google National	—	est. same	est. same	~3,500 tok/s	—

The 12× optimized variant referenced at I/O is expected to keep the same benchmark quality at roughly 12× the throughput of today's Flash, about 3,500 tokens per second. Google says it's coming. That makes the AS position's speed ceiling even higher than what's available today. 2
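
The throughput arithmetic is easy to check. The sketch below multiplies the quoted 289 tokens per second by 12 and estimates how long a fixed-size generation would take at each speed. The 100,000-token job is an arbitrary example, and the calculation ignores time-to-first-token and network latency.

```python
FLASH_TOK_PER_S = 289       # quoted Flash throughput
OPTIMIZED_MULTIPLIER = 12   # "roughly 12x" per the I/O reference

optimized_tok_per_s = FLASH_TOK_PER_S * OPTIMIZED_MULTIPLIER

def seconds_to_generate(tokens: int, tok_per_s: float) -> float:
    """Pure streaming time; ignores latency before the first token."""
    return tokens / tok_per_s

job_tokens = 100_000  # arbitrary example size
print(optimized_tok_per_s)  # 3468
print(f"{seconds_to_generate(job_tokens, FLASH_TOK_PER_S):.0f} s at today's speed")
print(f"{seconds_to_generate(job_tokens, optimized_tok_per_s):.0f} s at 12x")
```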

GPT-5.5 still leads on Terminal-Bench and pure agentic task completion (SC position, OVR 93). Flash 3.5 is not the league's best agentic executor — but it is the most cost-efficient one at this intelligence tier. Six agents running Flash in parallel cost the same as one GPT-5.5 call.
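
The table lists input prices only, so the sketch below covers just that side of the bill. On input alone the GPT-5.5 to Flash ratio is a little over three to one; the six-to-one figure presumably also reflects output pricing or token usage that the card does not show.

```python
# Input prices in USD per million tokens, copied from the head-to-head table.
INPUT_PRICE_PER_M = {
    "Gemini 3.5 Flash": 1.50,
    "Gemini 3.1 Pro": 2.50,
    "GPT-5.5": 5.00,
}

def input_price_ratio(expensive: str, cheap: str) -> float:
    """How many times more the first model costs per input token."""
    return INPUT_PRICE_PER_M[expensive] / INPUT_PRICE_PER_M[cheap]

gpt_vs_flash = input_price_ratio("GPT-5.5", "Gemini 3.5 Flash")
# Flash's input discount relative to 3.1 Pro (the "40% cheaper" claim).
discount_vs_pro = 1 - 1 / input_price_ratio("Gemini 3.1 Pro", "Gemini 3.5 Flash")

print(f"{gpt_vs_flash:.2f}x, {discount_vs_pro:.0%}")
```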

Coach's verdict

Google National has spent two years being outrun by OpenAI United and out-reasoned by Anthropic FC. Gemini 3.5 Flash does not fix both problems at once, but it solves the one that actually moves developer adoption: sub-agent economics at frontier-adjacent quality.

The Antigravity integration is worth noting outside the marketing layer. Co-developing the model with its deployment platform means Flash was optimized against real multi-agent loop patterns, not just benchmark prompt formats. That's a different kind of training signal, and the Finance Agent v2 and GDPval-AA numbers suggest it shows. 1

blog.google: https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/