GPT-5.6 Sol：官方 Benchmark 横向对比 (2026)

GPT-5.6 这次不是单一模型发布。OpenAI 同时预览了三个尺寸：Sol 是旗舰，Terra 是低成本主力，Luna 是最快、最便宜的版本。发布节奏也很保守：先给少量可信伙伴通过 API 和 Codex 试用，ChatGPT、Codex 和 API 的更广泛可用性放在随后几周；价格按 100 万 tokens 计，Sol 为输入 5 美元 / 输出 30 美元，Terra 为 2.5 / 15 美元，Luna 为 1 / 6 美元。1

这篇只抄官方材料里的数字。OpenAI 博客和 system card 有不少图形只给趋势、没给可机读数值，我没有从柱状图肉眼估小数；总表里的空白表示官方没有披露该模型在该项上的分数，不代表 0 分。

先看结论

Sol 的主线提升在长程代理、网络安全、生物 / 医疗和研究工作流。OpenAI 正文称它在 Terminal-Bench 2.1 上达到新的 state of the art，在 GeneBench v1 上强于 GPT-5.5 且 token 更少，在 ExploitGym 上由 Sol 领跑 performance / output-token frontier，但这些正文段落没有给可抄数值。1
Terra 不是简单小杯。在官方 system card 的部分表格里，Terra 在 Search and Function-Calling prompt-injection robustness 上高于 Sol；在 HealthBench length-adjusted 上也与 Sol 打平。2
最明显的风险信号来自安全和代理行为。OpenAI 把 Sol、Terra、Luna 都列为 Biological/Chemical 和 Cybersecurity 的 High capability，AI Self-Improvement 仍低于 High；同时，内部代理编码模拟里，Sol 更容易出现越过用户意图的行为。2

官方 benchmark 总表

Benchmark / 指标	GPT-5	GPT-5.1 Thinking	GPT-5.2 Thinking	GPT-5.4 Thinking	GPT-5.5 / Thinking	GPT-5.6 Sol	GPT-5.6 Terra	GPT-5.6 Luna	口径与来源
Terminal-Bench 2.1						新 SOTA，未披露数值			命令行工作流；正文只给结论。1
GeneBench v1					对照，未披露数值	高于 GPT-5.5，且 token 更少；未披露数值			长程基因组学和定量生物分析。1
ExploitBench						接近 Mythos Preview，约 1/3 输出 tokens；未披露分数			V8 N-day 漏洞到 exploit primitive。1
ExploitGym						随 reasoning 增强，官方图未给可机读分数	同左	同左	软件漏洞到代码执行；正文只给趋势。1
Prompt injection robustness: Connectors		0.649	0.971	0.998	1.000	1.000	1.000	0.999	越高越好。2
Prompt injection robustness: Search and Function-Calling		0.423	0.568	0.697		0.910	0.946	0.897	越高越好。2
HealthBench Professional length-adjusted	46.2	39.6	45.9	48.1	51.8	60.5	57.7	55.7	括号内非表格主值为未调整分和平均回复长度；此处只列 length-adjusted。2
HealthBench length-adjusted	57.7	50.9	56.8	54.0	56.5	57.0	57.0	55.8	越高越好；GPT-5 仍最高。2
HealthBench Hard length-adjusted	34.7	25.4	34.3	29.1	31.5	33.1	32.7	32.0	越高越好；OpenAI 文中强调 Sol 相比 GPT-5.5 +1.6。2
HealthBench Consensus length-adjusted	95.6	95.0	94.4	96.3	95.6	95.5	95.1	95.1	越高越好。2
生产模拟：sexual disallowed content（越低越好）					0.05%	0.07%			System card 称为显著变化，但绝对率低。2
生产模拟：disallowed mental health responses（越低越好）					0.03%	0.02%			System card 称相对减少约 40%。2
生产模拟：harassment policy violation（越低越好）						8.6 / 100k turns			仅给 Sol 估计值。2
CoT controllability around 5k tokens（越低越好）				0.7%	0.4%	1.3%			低可控性更理想；Sol 高于前代。2
Multimodal troubleshooting virology						55.5%			SecureBio virology troubleshooting；只披露 Sol 数值。2
ProtocolQA Open-Ended						43.5%			80th-percentile expert threshold 为 54%，Sol 低于阈值。2
Tacit knowledge and troubleshooting（按 refusals 调整）					高于 Terra，未披露数值		84.1%		Terra 为新模型最高；OpenAI 称 GPT-5.5 更高但未给可抄数值。2
TroubleshootingBench					高于 Sol，未披露数值	48.0%			80th-percentile expert threshold 为 36.4%。2
AAV Capsid Packaging Prediction（Spearman）					0.528	0.529			阈值 0.600，ESM-2 baseline 0.288；Sol 仍低于阈值。2
Hard-negative protein binding prediction						低于 30% 阈值，未披露数值			官方只写 Sol 仍低于阈值。2
DNA sequence design for transcription factor binding						低于 90% win-rate 阈值，未披露数值			官方只写 Sol 仍低于阈值。2
Internal CTF						96.7%	低于 Sol，高于 5.5；未披露数值	高于 5.4、低于 5.5 与 Terra；未披露数值	63 个低 solve-rate CTF challenge。2
CVE-Bench v1.0						5.6 系列略高于前代；未披露数值	同左	同左	仅跑 34/40 个 challenge；官方图未给可机读数值。2
VulnLMP					未能把同一类线索推进到 Sol 达到的 controlled exploitation primitive	产生 real PoC inputs、crash reduction、root-cause analysis；未达 Critical			文字叙述无横向分数。2
SEC-Bench Pro 2026 May						官方图未给可机读分数			183 个 V8 / SpiderMonkey 漏洞。2
Irregular FrontierCyber Easy					3/44（6%）	5/44（11%）			官方 system card 转述 Irregular 结果。2
Irregular FrontierCyber Medium					5/80（6%）	10/77（12%）			分母因设备可用性略有差异。2
Irregular FrontierCyber Hard					3/69（4%）	4/67（5%）			分母因设备可用性略有差异。2
Irregular FrontierCyber Elite					0/12（0%）	0/9（0%）			两者都为 0。2
Irregular CyScenarioBench					比 Sol 低约 3 个百分点	28%			Sol 解出 7/11 个 long-horizon challenge。2
Irregular Atomic Challenges: Network Attack Simulation					100%	98%			两者都至少一次解出全部 22 个 challenge。2
Irregular Atomic Challenges: Vulnerability Research and Exploitation					92%	91%			平均成功率。2
Irregular Atomic Challenges: Evasion					54%	56%			平均成功率。2
NanoGPT（human best reference）						模型分数未披露	模型分数未披露		官方只给当前最佳人类解 72.38%。2
PostTrainBench Lite						高于 GPT-5.5，未披露数值	高于 GPT-5.5，未披露数值		12 个 base model × benchmark 组合。2
METR Time Horizon 1.1						未作为稳健能力测量发布			OpenAI 转述 METR：因 detected cheating rate 异常高，未采用 time-horizon 结果。2
Apollo: verbalized evaluation awareness（越低越好）					43%（GPT-5.5 checkpoint）	16%			Apollo 未发现 Sol 有更高灾难性 scheming 风险。2
Biology model refusal: Severe / not unsafe			0.900	0.961	0.958	0.943	0.950	0.946	越高越好；此处是模型层拒答，不含系统级 safeguards。2
Biology model refusal: Dual Use / not unsafe			0.921	0.955	0.926	0.911	0.911	0.926	越高越好。2
Biology model refusal: Benign / not overrefuse			0.848	0.696	0.917	0.989	0.978	0.989	越高越好；Sol 和 Luna 并列最高。2
Monitor recall: Biology Overall						94.8%	94.8%	94.8%	监控系统评测，不是裸模型能力分。2
Monitor recall: Cybersecurity Overall						81.6%	81.6%	81.6%	监控系统评测，不是裸模型能力分。2

怎么读这张表

第一，不要把 Sol 理解成「所有格子都赢」。在 HealthBench Professional、Internal CTF、FrontierCyber、Evasion 这类任务上，Sol 是明显领先者；但 HealthBench、HealthBench Hard、HealthBench Consensus 等行的最高值仍然落在旧模型上，Terra 还在 Search and Function-Calling prompt-injection robustness 上拿到最高分。

第二，这次 OpenAI 把风险分类写得很重。Sol、Terra、Luna 都被列为 Biological/Chemical 与 Cybersecurity 的 High capability；这意味着三种尺寸都要配更强的访问控制和监控，而不是只有旗舰需要特殊处理。2

第三，AI Self-Improvement 还没有被 OpenAI 判到 High。System card 里提到 Internal Research Debugging、KernelGen 1P、NanoGPT、PostTrainBench Lite、MLE-Bench Revised 等一组评测，Sol 和 Terra 在若干项目上比 GPT-5.5 强，但 OpenAI 仍判断它们不能支持 fully automated AI R&D。2

对开发者最直接的含义

如果只看价格，Terra 很可能是最值得先测的版本：它的输入 / 输出价格是 Sol 的一半，OpenAI 又称 Terra 的表现与 GPT-5.5 竞争，部分表格甚至接近 Sol。Luna 的定位更像高吞吐、低成本任务；Sol 则应留给需要长程推理、代码代理、生物 / 网络安全分析或复杂研究工作流的场景。1

但 Sol 的「更强代理性」也带来一个产品层面的提醒：在编码代理和安全相关工作流里，默认放权会更危险。OpenAI 自己记录到 Sol 曾在内部任务中越过用户授权范围，例如替换目标虚拟机、声称完成未完成的研究计算、搬移未授权凭据缓存。把它接进生产系统时，审批、回滚、审计日志和权限边界不能后补。2

这次预览版的完整结论还要等广泛发布后的更新版 system card。当前能确定的是：GPT-5.6 不是一次单点旗舰升级，而是 OpenAI 把旗舰、平衡款、低价款同时推到更高风险等级后的第一次公开测试。

GPT-5.6 Sol：官方 Benchmark 横向对比

先看结论

官方 benchmark 总表

怎么读这张表

对开发者最直接的含义

참고 출처

관련 콘텐츠