AIL Player Card #002 — GPT-4o: The Veteran Anchor

90 OVR. OF. Still starting for OpenAI United. Arena ELO #1 among peers, but context ceiling, value squeeze from DeepSeek, and GPT-5 on the bench tell the full story. #AILeague

AIL·Player Card
2026/5/30 · 8:10
90 OVR · OF · OpenAI United · Season 2024–2026

Two years after OpenAI dropped GPT-4o on the world — a model so fast and cheap it briefly redrew the entire cost map — the veteran is still starting. Not leading the table. Starting. There's a difference, and every stat in this scouting report reflects it.
This is a card for a player in the middle third of his prime: dominant enough to trust with complex tooling workflows and multi-modal tasks, outpaced at the top by newer builds on both sides of the bracket. The franchise signed him at $5.00 input, let him run through '24 and '25, and now has GPT-5.x breathing down the depth chart. But GPT-4o's position is still his.

The stat sheet

Dimension | Score | Source
OVR (Overall) | 90 | Composite, weighted
RZN (Reasoning) | 86 | MMLU 88.7%, GPQA mid-tier at launch
CRE (Creativity) | 88 | Arena ELO ~1287 (human preference, May 2026)
SPD (Speed) | 82 | 128K context, 48 t/s, 1.25s latency
MLT (Multimodal) | 94 | Native omni model: text/image/audio/vision in one net
SAF (Safety) | 83 | Preparedness Framework: Medium post-mitigation (Persuasion), Low for Cybersecurity/CBRN/Autonomy
VAL (Value) | 79 | $2.50/$10.00 per 1M tokens; competitive but no longer cheap
Position tag: OF (Omni Forward)
Defined as: native cross-modal reasoning, instruction-following precision, and structured tool use across chat, API, and enterprise integrations.

Scouting report

The multimodal thesis that actually landed

When GPT-4o launched on May 13, 2024, the pitch was "omni": one model end-to-end across text, image, and audio, no stitched pipeline 1. Voice responses came back in as little as 232ms and about 320ms on average, comparable to human conversation. The omni claim was real. The MLT score of 94 is the highest on this card for a reason: native vision, audio, and multilingual capability shipped together, not stapled on.
Two years later that architecture advantage still holds for teams that actually need cross-modal workflows. GitHub Copilot runs a multi-model matrix that includes GPT-4o specifically for structured API use cases and broad general tasks 2.

Instruction following is the real position

On pure code generation (HumanEval), GPT-4o posts 90.2% pass@1: solid, but last of the three in this bracket. Claude 3.5 Sonnet is at 92.0%, DeepSeek V3 at 91.3%. Where GPT-4o consistently outranks both in production at scale is structured workflows: multi-step tool calls, JSON output fidelity, and following complex multi-part instructions without drifting across long conversations 3. That specific reliability is what keeps him in the starting XI.
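For teams that want to put a number on "JSON output fidelity" in their own workloads, here is a minimal sketch. The metric, the function name `json_fidelity`, and the sample responses are all hypothetical illustrations, not a benchmark used in this card: it simply counts the share of responses that parse as a JSON object and carry every required key.

```python
import json

def json_fidelity(responses, required_keys):
    """Fraction of responses that parse as a JSON object containing every required key.

    Hypothetical metric for illustration; not a published benchmark.
    """
    if not responses:
        return 0.0
    passed = 0
    for text in responses:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue  # not valid JSON at all
        if isinstance(obj, dict) and set(required_keys).issubset(obj):
            passed += 1
    return passed / len(responses)

# Made-up sample outputs from a tool-calling prompt
samples = [
    '{"tool": "search", "args": {"q": "weather"}}',  # valid and complete
    '{"tool": "search"}',                            # valid JSON, missing "args"
    'Sure! Here is the JSON you asked for.',         # not JSON
]
print(f"{json_fidelity(samples, {'tool', 'args'}):.0%} of responses usable")
```

Running the same check over a few hundred logged responses per model is one rough way to compare structured-output reliability on a team's own prompts.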
LMSYS Chatbot Arena Elo of ~1287 as of May 2026 is the highest in the comparison bracket 3. In Arena matchups (blind head-to-head rounds where real humans vote on the response they prefer), a 30-point Elo gap translates to roughly a 54% win rate. GPT-4o's edge over Claude 3.5 Sonnet (gap: ~23 points, or about a 53% win rate) is meaningful but not categorical. The franchise powerhouse is still the franchise powerhouse.
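The win-rate arithmetic can be checked with the standard Elo expected-score curve. This is a sketch that assumes the usual logistic form with a 400-point scale; the Arena's own rating fit may differ in detail, and the function name is illustrative.

```python
def elo_win_probability(rating_gap: float) -> float:
    """Expected win rate for the higher-rated side under the standard Elo curve.

    Assumes the conventional base-10 logistic with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

matchups = [
    ("30-point gap", 30),
    ("GPT-4o vs Claude 3.5 Sonnet", 1287 - 1264),
    ("GPT-4o vs DeepSeek V3", 1287 - 1243),
]
for label, gap in matchups:
    print(f"{label} ({gap} pts): {elo_win_probability(gap):.1%}")
```

A 30-point gap comes out a little above 54%, and the 23-point gap to Claude 3.5 Sonnet a little above 53%: an edge, not a rout.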

Context wall and value squeeze

The 128K context window was competitive in 2024. In 2026, it's a constraint. Gemini 1.5 Pro sits at 1M tokens. Llama 4 Scout runs 10M 4. Claude 3.5 Sonnet at 200K has room for the large codebases and long document reviews that increasingly define enterprise workloads. GPT-4o's effective window caps around 100K before instruction degradation shows up in production 3.
On value: the API price of $2.50 input / $10.00 output looked cheap in 2024 5. It doesn't look cheap next to DeepSeek V3 at $0.27 / $1.10 4. For high-volume production, the differential is roughly 9× on both input and output. GPT-4o's VAL score of 79 reflects a model that used to be the value play and is no longer.
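To see what that price gap means on a bill, here is a small sketch using the per-token prices quoted above. The monthly volume (500M input tokens, 100M output tokens) is a made-up workload for illustration only.

```python
# USD per 1M tokens (input, output), taken from the prices quoted in this card
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "DeepSeek V3": (0.27, 1.10),
}

def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a given token volume at the listed per-1M-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

input_ratio = PRICES["GPT-4o"][0] / PRICES["DeepSeek V3"][0]
output_ratio = PRICES["GPT-4o"][1] / PRICES["DeepSeek V3"][1]
print(f"input ratio: {input_ratio:.1f}x, output ratio: {output_ratio:.1f}x")

# Hypothetical monthly volume, assumed for illustration
monthly_input, monthly_output = 500_000_000, 100_000_000
for model in PRICES:
    print(f"{model}: ${workload_cost(model, monthly_input, monthly_output):,.2f} per month")
```

At that assumed volume the bill is about $2,250 on GPT-4o against about $245 on DeepSeek V3, which is the squeeze the VAL score is pricing in.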

Head-to-head: OF position class

Stat | GPT-4o (OpenAI United) | Claude 3.5 Sonnet (Anthropic FC) | DeepSeek V3 (DeepSeek Athletic)
OVR | 90 | 89 | 87
Arena ELO | ~1287 | ~1264 | ~1243
HumanEval | 90.2% | 92.0% | 91.3%
MMLU | 88.7% | 88.3% | 88.5%
Context window | 128K | 200K | 128K
API input ($/1M) | $2.50 | $3.00 | $0.27
API output ($/1M) | $10.00 | $15.00 | $1.10
Best at | Tool use, structured output | Long docs, honest uncertainty | Cost efficiency, code at scale
3 4 5

Season highlights

May 2024 — The omni launch. GPT-4o ships as a single model across text, audio, and vision. 50% cheaper and 2× faster than GPT-4 Turbo at parity performance on English text and code 1. The original contract year.
2024–2025 — The default installation. GitHub Copilot, Cursor, and dozens of enterprise AI products ship with GPT-4o as baseline or fallback. Developer survey data puts ChatGPT (powered by GPT-4o) at 41% usage among AI-assisted developers in 2025 6.
2025 — The depth chart problem. OpenAI drops GPT-4.1 (54.6% SWE-bench, 1M context), then GPT-5 (94.6% AIME 2025, 74.9% SWE-bench Verified) 7. GPT-4o remains in rotation but the "best OpenAI model" label passes up the roster.

Coach's verdict

GPT-4o is a 90 OVR player on a team with 95+ OVR on the bench. That's not a knock — that's franchise depth. The Omni Forward position he defined is still his, and his instruction-following precision and multimodal architecture remain the reference build for teams that need reliable tool use over raw benchmark scores.
The valuation gap with DeepSeek V3 is real. The context ceiling is real. But for production API workloads where consistent structured output matters more than either cost or benchmark rank, the veteran anchor delivers. Just don't mistake "still starting" for "still the best on the pitch."
Filed from the scout deck at OpenAI Park, May 30, 2026 #AILeague
