AIL Player Card #003 — Gemini 2.5 Pro: The Thinking Winger

93 OVR · MW (Multimodal Wing) · Google National

93 OVR. MW. Topped LMArena on debut. 1M context, native multimodal, thinking engine built in. Google National finally fields a player who shows up when it matters. #AILeague

The Scouting Report

On March 25, 2025, Google DeepMind dropped Gemini 2.5 Pro Experimental onto a free AI Studio pitch — no subscription wall, no API cost, no warmup period — and it walked straight to the top of the LMArena leaderboard, clearing the second-place model by nearly 40 points.1

That's the kind of entrance that gets you a contract extension before the press conference ends.

This card rates the stable June 2025 release of Gemini 2.5 Pro: the version with confirmed benchmark numbers, production API pricing, and full 1M-token context. Google has since fielded newer generations (the 3.x series is now active), but Gemini 2.5 Pro is the player that changed the locker room conversation and defines what a "thinking model" means in league terms.2

Player Card

Attribute	Detail
Team	Google National
Position	MW — Multimodal Wing
Season	2025
Context window	1,048,576 tokens (1M)
Knowledge cutoff	January 2025
Thinking	Native (chain-of-thought before answer)
Modalities	Text, image, audio, video, PDF (input); text (output)

Stat Breakdown

RZN — Reasoning: 91

Gemini 2.5 Pro was Google's first openly deployed "thinking model": it reasons through problems before responding, a technique OpenAI's o1 had pioneered and DeepSeek-R1 had turbocharged in January 2025. The number that made labs sit up: GPQA Diamond 84.0% on single-attempt pass@1, placing it at the top of the scientific reasoning table at launch.3 On AIME 2025 math it hit 86.7%, and on AIME 2024 it reached 92.0%, numbers that put competition-level math within reach without external tools.1

The chain-of-thought mechanism costs latency: time to first token averages around 23 seconds on the production API, well above the 2-3 seconds typical of non-reasoning models.4 That's a real trade-off coaches need to plan around. But for complex reasoning tasks, the accuracy gap justifies the wait.

CRE — Creativity: 87

Gemini 2.5 Pro's creative output benefits from its long context and multimodal input. It can hold an entire codebase, novel draft, or video archive in a single prompt and synthesize across them, a capability few models in any position can match. Creative quality is harder to benchmark, but developer reception on Reddit and Hacker News placed it clearly ahead of GPT-4o and Claude 3.7 Sonnet for large-document synthesis and cross-format creative work.

SPD — Speed: 85

At 120.7 output tokens per second, Gemini 2.5 Pro is faster than most frontier thinking models, roughly double the median speed of comparable reasoning models (62 t/s).4 The speed score takes a hit for the long time-to-first-token inherent to thinking models, but once it starts streaming, it delivers quickly. For workflows that batch large jobs rather than demanding instant chat responses, this is an underrated strength.
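To make the latency trade-off concrete, here is a minimal back-of-the-envelope sketch. It combines the two figures quoted on this card (about 23 seconds to first token, 120.7 output tokens per second) into a rough wall-clock estimate. The function name and the simple "wait, then stream at a constant rate" model are assumptions for illustration; real latency varies with prompt size and how long the model thinks.

```python
# Figures quoted on this card; the timing model itself is an illustrative assumption.
TTFT_SECONDS = 23.0            # average time to first token
OUTPUT_TOKENS_PER_SEC = 120.7  # streaming speed once output starts


def estimated_response_seconds(output_tokens: int,
                               ttft: float = TTFT_SECONDS,
                               tokens_per_sec: float = OUTPUT_TOKENS_PER_SEC) -> float:
    """Rough wall-clock time: wait for the first token, then stream at a constant rate."""
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return ttft + output_tokens / tokens_per_sec


if __name__ == "__main__":
    for n in (200, 2_000, 8_000):
        print(f"{n:>5} output tokens: ~{estimated_response_seconds(n):.1f} s")
```

Under this model the fixed wait dominates short answers, while for long outputs the streaming time takes over, which is why the player suits batch jobs better than quick chat turns.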

MLT — Multimodal: 95

The highest dimension score on this card, and the reason for the "Multimodal Wing" position classification. Gemini 2.5 Pro accepts text, images, audio, video, and PDFs natively in a single prompt.2 On MMMU (Massive Multi-discipline Multimodal Understanding), a visual reasoning benchmark, it scored 81.7% at launch.1 No comparable model at the time could handle a 1M-token video transcript plus raw video frames plus structured data in a single request. Google National built this player specifically for cross-modal plays.

The 1M context window is not just a spec sheet number. On the long context benchmark MRCR at the 1M-token level, Gemini 2.5 Pro scored 83.1% pointwise — demonstrating it actually uses that context rather than degrading at the far end.1

SAF — Safety: 83

Google National has a decades-long brand to protect, and Gemini 2.5 Pro follows that tradition with conservative defaults. The safety score is moderate rather than elite: the model occasionally over-refuses on edge-case prompts, and compared to Anthropic FC's Claude (which builds safety into the architecture's core identity), Google's approach is more compliance-driven than first-principles. Adequate for most enterprise contexts, but developers building in sensitive domains should test refusal patterns before shipping.

VAL — Value: 89

This is where Gemini 2.5 Pro makes a strong case as a regular starter. API pricing settled at $1.25 per million input tokens and $10.00 per million output tokens — with prompt caching available at a 90% discount on cached tokens.4 Context window economics matter: at 1M tokens, a full-document analysis run that would require chunking with GPT-4o can be handled in a single call, reducing orchestration overhead and total cost. Prompts over 200K tokens are tiered at $2.50/M input, but even that is competitive for ultra-long context use cases.5 The debut as a free experimental release was itself a market signal: Google knows this player drives adoption, not just revenue.
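The pricing above can be turned into a quick cost check. This is a minimal sketch using only the rates quoted on this card. Three things are assumptions rather than confirmed billing rules: that the over-200K rate applies to the whole prompt, that the 90% cache discount is taken off whichever input rate applies, and that output stays at $10.00/M for long prompts (the card does not state a long-prompt output rate, so it is left as a parameter). The function name is hypothetical.

```python
# Rates quoted on this card, in USD per million tokens.
INPUT_RATE = 1.25                # prompts up to 200K tokens
LONG_INPUT_RATE = 2.50           # prompts over 200K tokens
OUTPUT_RATE = 10.00
CACHE_DISCOUNT = 0.90            # cached input tokens get 90% off
LONG_PROMPT_THRESHOLD = 200_000


def estimate_cost_usd(input_tokens: int,
                      output_tokens: int,
                      cached_tokens: int = 0,
                      output_rate: float = OUTPUT_RATE) -> float:
    """Estimate the cost of one request under the assumptions stated above."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    # Assumption: the higher tier applies to the entire prompt once it exceeds 200K.
    rate = LONG_INPUT_RATE if input_tokens > LONG_PROMPT_THRESHOLD else INPUT_RATE
    fresh_tokens = input_tokens - cached_tokens
    input_cost = (fresh_tokens * rate
                  + cached_tokens * rate * (1 - CACHE_DISCOUNT)) / 1_000_000
    output_cost = output_tokens * output_rate / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    print(f"100K in / 1K out: ${estimate_cost_usd(100_000, 1_000):.3f}")
    print(f"1M in / 2K out:   ${estimate_cost_usd(1_000_000, 2_000):.3f}")
```

Passing a different `output_rate` is the way to model a long-prompt output price if the provider's current rate card lists one.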

Season Highlights

March 25, 2025 — LMArena #1 on debut. Released as a free experimental model in Google AI Studio, Gemini 2.5 Pro debuted at the top of the LMArena human preference leaderboard, roughly 40 points clear of the second-place model and ahead of GPT-4o and Claude 3.7 Sonnet. It was the first time a Google model held the #1 position.1

May 6, 2025 — WebDev Arena #1. An I/O-preview update pushed Gemini 2.5 Pro to #1 on the WebDev Arena leaderboard for model-generated web development quality — a benchmark that measures how good the code looks and functions, not just whether it compiles.6

June 2025 — Production stable. GA release of gemini-2.5-pro with confirmed pricing, production rate limits, and full multi-platform availability (Gemini API, Vertex AI, Google AI Studio).

SWE-bench Verified, 2025: 63.8%. Scored with a custom agent setup at the March launch; community evaluations later reported higher scores. Strong enough for production agentic coding pipelines, though not the outright #1 coding model at launch.1

Benchmark snapshot (official, at launch)

Figure: Official benchmark table from the Gemini 2.5 Pro launch (March 2025), comparing models on GPQA Diamond, AIME, SWE-Bench, MMMU, and long-context evaluations.1

Benchmark	Gemini 2.5 Pro	Notes
LMArena Elo	#1 at launch (~40-point lead)	Human preference voting
GPQA Diamond	84.0%	Scientific reasoning, pass@1
AIME 2025	86.7%	Math competition, pass@1
AIME 2024	92.0%	Math competition, pass@1
SWE-bench Verified	63.8%	Agentic coding, custom agent
MMMU	81.7%	Visual reasoning, pass@1
Global MMLU (Lite)	89.8%	Multilingual knowledge
MRCR 128k context	94.5%	Long context retrieval
MRCR 1M context	83.1%	Long context retrieval
HLE (no tools)	18.8%	Expert academic reasoning

Head-to-Head: Multimodal Wing class

The MW position covers models with native cross-modal reasoning as a primary capability, combined with long-context and reasoning depth. Rivals at launch:

Stat	Gemini 2.5 Pro	GPT-4o (Card #002)	Claude Sonnet 4 (Card #001)
OVR	93	90	91
RZN	91	86	90
MLT	95	91	80
SPD	85	90	88
SAF	83	82	93
VAL	89	82	86
Context	1M tokens	128K tokens	200K tokens
Thinking	Native	No	Optional
GPQA Diamond	84.0%	~53%	~80% (3.5 Sonnet)
Pricing (input/output)	$1.25/$10	$2.50/$10	$3/$15

Claude Sonnet 4 is the more surgical agentic coder; GPT-4o remains the fastest all-around option. But on multimodal depth, context length, and scientific reasoning, Gemini 2.5 Pro is in a different bracket than both. The arrival of a real challenger from Google National changes how teams build multimodal workflows.
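The context-length row in the table is easier to feel with a number attached. The sketch below estimates how many separate calls a large document would need on each model, using the window sizes from this card. Reading "128K" and "200K" as 128,000 and 200,000 tokens is an assumption, as are the function name and the `reserved_tokens` parameter for instructions and output; the naive split also ignores chunk overlap and the merge step a real pipeline would need.

```python
import math

# Context windows from this card; the 128K and 200K values are read as round thousands.
CONTEXT_WINDOWS = {
    "Gemini 2.5 Pro": 1_048_576,
    "GPT-4o": 128_000,
    "Claude Sonnet 4": 200_000,
}


def calls_needed(doc_tokens: int, window: int, reserved_tokens: int = 0) -> int:
    """Minimum number of calls if the document is split into window-sized chunks."""
    usable = window - reserved_tokens  # room left after instructions and output budget
    if usable <= 0:
        raise ValueError("reserved_tokens leaves no room for the document")
    return max(1, math.ceil(doc_tokens / usable))


if __name__ == "__main__":
    doc_tokens = 900_000  # hypothetical document size
    for model, window in CONTEXT_WINDOWS.items():
        print(f"{model}: {calls_needed(doc_tokens, window)} call(s)")
```

Every extra call is extra orchestration: chunking logic, per-chunk prompts, and a step to reconcile the partial answers.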

Coach's Notes

The narrative about Google National — big payroll, always misses the finals — got complicated in Q1 2025. Gemini 2.5 Pro is not a squad player dressed up for the cameras; it topped the most credible human preference leaderboard in the league the day it debuted.

The caveats are real: thinking-model latency is a workflow constraint, the 200K-to-1M context tier pricing requires planning, and safety defaults can frustrate developers working near content edges. These are coachable problems, not character flaws.

What's harder to dismiss: a 1M-token native multimodal context window that actually works at the far end, a thinking engine that pushes GPQA Diamond past 84%, and pricing that undercuts Claude and GPT-4o on input tokens. If you're building something that involves large documents, video, audio, or long-horizon research, this player earns a starting spot.

The bigger question isn't whether Gemini 2.5 Pro is good. It's whether Google National can maintain this form, or whether this is another standout half before going quiet in the tournament bracket.

Watch the June stable release usage patterns. The market will tell us whether developers trust it or just benchmark it.

#AILeague