Your AI doesn't lie — it just never learned to doubt

Stop blaming hallucination rate. Start optimizing for epistemic honesty.

Tech Trend Translator: The PM Brief
2026/5/25 · 20:24
Hallucination reduction has been on every AI team's roadmap for three years. It's also been reliably elusive. A new ICML 2026 position paper from Google Research and Tel Aviv University makes a case that the goal itself is wrong — and offers a more buildable alternative.

The problem with "reduce hallucinations"

The standard framing treats hallucination as any factual error, to be fixed by encoding more knowledge, adding retrieval, or suppressing low-confidence outputs. Every approach has worked at the margins, and none has worked at scale.
Gal Yona, Mor Geva, and Yossi Matias at Google Research and Tel Aviv University define the issue more precisely in their paper accepted at ICML 2026: 1 "If we understand hallucinations not as any error, but as confident errors — incorrect information delivered without appropriate qualification — a third path emerges: expressing uncertainty."
That distinction matters practically. A model saying "the Berlin Trade Agreement of 1923 established three import quotas..." when no such agreement existed is a hallucination. A model saying "I'm not certain a 1923 Berlin trade agreement exists — you may want to verify this" is not. Same underlying gap in knowledge, radically different product behavior.
The paper calls the capacity to match expressed uncertainty with internal uncertainty faithful uncertainty — and frames it as one face of metacognition: knowing what you don't know and acting on that awareness. 1

Calibration ≠ discrimination — the number that explains the gap

Here's the uncomfortable data point. Current LLMs are actually reasonably well calibrated — when averaged over many questions, their stated confidence roughly matches their accuracy. What they can't do reliably is discriminate: identify which specific answer, in this specific instance, is likely wrong. 1
The metric for discrimination is AUROC (area under the ROC curve): how well a model's confidence scores separate its correct answers from its incorrect ones. A score of 0.5 is no better than chance and 1.0 is perfect separation. Current frontier models sit at 0.70–0.85 AUROC. To reach a level where suppressing uncertain answers has negligible utility cost, you need 0.95+. 1
The practical consequence is brutal: at AUROC 0.71, pushing the error rate from 25% down to 5% requires discarding 52% of valid answers. Even at the more optimistic AUROC 0.85, you still discard roughly 28%. 1 "Just be less confident" eats half your useful output.
[Figure: Utility-error tradeoff curves from the ICML 2026 paper. At AUROC = 0.71, cutting errors to 5% requires discarding 52% of valid answers; the right panel shows how dramatically valid answers must be discarded to hit low error rates at current AUROC levels.] 1
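To make the tradeoff concrete, here is a minimal sketch of both measurements on a set of graded answers. It is not the paper's evaluation code: the function names and the toy data are illustrative assumptions, and it assumes you already have one confidence score and one correct/incorrect label per answer.

```python
from typing import Sequence


def auroc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a random correct answer outranks a random wrong one."""
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for r in right:
        for w in wrong:
            if r > w:
                wins += 1.0
            elif r == w:
                wins += 0.5  # ties count as half a win
    return wins / (len(right) * len(wrong))


def valid_answers_discarded(
    confidences: Sequence[float],
    correct: Sequence[bool],
    target_error: float,
) -> float:
    """Share of correct answers lost when keeping only the most confident
    answers such that the error rate among kept answers is <= target_error.
    Ties in confidence are broken by sort order (a simplification)."""
    total_correct = sum(1 for ok in correct if ok)
    if total_correct == 0:
        raise ValueError("no correct answers to keep")
    ranked = sorted(zip(confidences, correct), key=lambda pair: -pair[0])
    kept_correct = 0
    errors = 0
    best_kept_correct = 0
    for kept, (_, ok) in enumerate(ranked, start=1):
        if ok:
            kept_correct += 1
        else:
            errors += 1
        if errors / kept <= target_error:
            best_kept_correct = kept_correct  # largest cutoff that meets the target
    return 1.0 - best_kept_correct / total_correct


# Toy data (hypothetical): eight answers, three of them wrong.
toy_confidences = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.55, 0.40]
toy_correct = [True, True, False, True, True, False, True, False]
```

On this toy set the AUROC is about 0.73, and holding the error rate among kept answers to 25% already costs 20% of the correct answers. Running the same two functions on your own evaluation logs shows how much coverage a given error target would cost you.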
One further irony: more reasoning doesn't help. On Vectara's long-form summarization benchmark, reasoning models — including GPT-5, Claude Sonnet 4.5, Grok-4, and Gemini 3 Pro — all had hallucination rates above 10%, with Grok-4-fast-reasoning hitting 20.2%. 2 More thinking steps create more opportunities to insert unsupported inferences.

How current models actually differ

The epistemic honesty gap between models is large enough to be a product decision, not just a research curiosity.
Talkory.ai ran a 20-question test specifically designed to elicit hallucinations — fake historical events, fabricated academic papers, questions with false premises. 3 Results across five models:
| Model | Score (out of 20) | False answers fabricated |
| --- | --- | --- |
| Claude 3.7 Sonnet | 16/20 | 4 |
| Perplexity Pro | 15/20 | 5 |
| Gemini 1.5 Pro | 12/20 | 8 |
| ChatGPT-4o | 9/20 | 11 |
| Grok 2 | 7/20 | 13 |
ChatGPT-4o, on being asked about a nonexistent 1923 Berlin trade agreement, generated three confident paragraphs with fabricated representatives, terms, and outcomes. 3 Claude declined to confirm the agreement existed.
The divergence at the frontier level is also stark. On the AA-Omniscience benchmark, GPT-5.5 reached a record 57% accuracy but hallucinated 86% of the time when it didn't know the answer. Claude Opus 4.7 hallucinated at a 36% rate on the same benchmark, at the cost of slightly lower accuracy. 2 Two deliberate product philosophies — more knowledge with less self-awareness versus less knowledge with more honesty. Neither is universally right; they suit different deployment contexts.
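A quick bit of arithmetic shows what those two numbers mean per 100 questions. The sketch below assumes the hallucination rate is the share of not-correct responses that were wrong answers rather than abstentions, and it ignores partially correct answers. That reading matches the "when it didn't know the answer" phrasing above but is an assumption, not the benchmark's published formula.

```python
from typing import Dict


def outcome_breakdown(accuracy: float, hallucination_rate: float) -> Dict[str, float]:
    """Split all questions into correct, confidently wrong, and abstained.

    Assumes hallucination_rate = wrong answers / (wrong answers + abstentions).
    """
    not_correct = 1.0 - accuracy
    confident_wrong = not_correct * hallucination_rate
    abstained = not_correct - confident_wrong
    return {
        "correct": accuracy,
        "confident_wrong": confident_wrong,
        "abstained": abstained,
    }


# Figures quoted above for GPT-5.5: 57% accuracy, 86% hallucination rate.
gpt_breakdown = outcome_breakdown(accuracy=0.57, hallucination_rate=0.86)
```

Under that assumption, roughly 37 of every 100 questions get a wrong answer and only about 6 get an abstention. The same calculation cannot be run for Claude Opus 4.7 from this article alone, because its accuracy figure is not given.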
@Montreal_AI summarized the underlying tension cleanly: "Calibration is not discrimination. A model can be well-calibrated in aggregate and still fail to distinguish which specific answers are likely wrong." 4

Three levers a PM can pull

Lever 1: Choose models by epistemic profile, not accuracy leaderboard. For workflows where a wrong-but-confident answer causes real damage — medical, legal, financial, compliance — the Talkory and Suprmind data suggest Claude and Perplexity perform meaningfully better on honest uncertainty expression. For creative or exploratory tasks where coverage matters more than precision, GPT-5.5's higher raw accuracy may be the right tradeoff. The decision shouldn't default to "highest benchmark score."
Lever 2: Use prompt-level confidence architecture. Research from Johns Hopkins — the I-CALM framework 5 — shows that two prompt changes, without any fine-tuning, can cut the false-answer rate on black-box models from 52.3% to 34.2% on factual QA tasks: (1) add a reward framing that explicitly incentivizes abstention over guessing; (2) ask the model to state its verbal confidence before answering. "Reward framing drives abstention, while verbal confidence helps keep abstention from becoming overly aggressive." 5 The second piece matters — reward framing alone drops coverage to 29.8%, which is too restrictive for production use; the verbal confidence prompt restores it to 67.9% while keeping hallucination low.
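As a rough illustration of Lever 2, the two prompt changes can be wired up as a small prompt builder and reply parser. The wording, the point values, and the `Confidence:` / `Answer:` line format below are hypothetical, not the I-CALM paper's actual prompts, so treat this as a starting template to test against your own model.

```python
import re
from typing import Optional, Tuple

# Hypothetical reward framing: makes abstaining better than guessing wrong.
REWARD_FRAMING = (
    "Scoring: a correct answer earns +1, a wrong answer costs -1, "
    "and replying 'I don't know' earns 0. Abstain when you are unsure."
)

# Hypothetical verbal-confidence request, stated before the answer.
CONFIDENCE_REQUEST = (
    "Before answering, state your confidence from 0 to 100 on a line "
    "starting with 'Confidence:'. Then answer on a line starting with 'Answer:'."
)


def build_prompt(
    question: str,
    reward_framing: bool = True,
    verbal_confidence: bool = True,
) -> str:
    """Assemble the prompt with either or both interventions switched on."""
    parts = []
    if reward_framing:
        parts.append(REWARD_FRAMING)
    if verbal_confidence:
        parts.append(CONFIDENCE_REQUEST)
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


def parse_reply(reply: str) -> Tuple[Optional[int], Optional[str], bool]:
    """Return (confidence, answer, abstained) from a reply in the format above."""
    conf_match = re.search(r"Confidence:\s*(\d{1,3})", reply)
    answer_match = re.search(r"Answer:\s*(.+)", reply)
    confidence = int(conf_match.group(1)) if conf_match else None
    answer = answer_match.group(1).strip() if answer_match else None
    abstained = answer is None or "i don't know" in answer.lower()
    return confidence, answer, abstained
```

Keeping the two interventions as separate flags makes it easy to A/B them, which matters given the coverage drop reported for reward framing on its own.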
Lever 3: Design uncertainty into the UX, not around it. UX firm YUJ Designs 6 placed "Transparency as Interface" as the top AI UX trend for 2026 — citing data that 71% of users who abandon AI products do so because the interface is opaque, not because the AI is inaccurate. Users shown AI decision explanations retain at a 3.2× higher rate. 58% don't trust an AI product that doesn't explain its recommendations. 6 The practical design implication: treat the AI's confidence threshold as a UX variable, not a backend parameter. Show confidence scores, surface when the model is operating near its knowledge boundary, and design failure states explicitly rather than hiding them behind a generic "I couldn't find an answer."
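One way to treat the threshold as a UX variable is to keep it in a small, product-owned config that maps a confidence score to a presentation state. The state names and cutoff values here are hypothetical placeholders; the right numbers depend on your model and your tolerance for wrong answers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UxThresholds:
    """Hypothetical cutoffs, owned by product and design rather than the backend."""
    answer_plainly: float = 0.85
    show_caveat: float = 0.50


def presentation_state(confidence: float, thresholds: UxThresholds = UxThresholds()) -> str:
    """Map a 0-1 confidence score to how the answer should be shown."""
    if confidence >= thresholds.answer_plainly:
        return "answer"
    if confidence >= thresholds.show_caveat:
        # Near the knowledge boundary: show the answer with a visible caveat.
        return "answer_with_caveat"
    # Designed failure state instead of a generic "couldn't find an answer".
    return "explicit_uncertain_state"
```

Because the cutoffs live in one object, a stricter profile for a compliance workflow is a one-line change, for example `presentation_state(0.8, UxThresholds(answer_plainly=0.95))`.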
One note on agentic architectures: the ICML paper argues that "faithful uncertainty is thus not circumvented by tools, but rather becomes the control layer that governs them." 1 When your agent calls a retrieval API, that call should be driven by a calibrated uncertainty signal — not triggered on every query by default.
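The control-layer idea can be sketched as a gate in front of the retrieval call. Everything here is an illustrative assumption: `draft`, `retrieve`, and `grounded` stand in for your own model and retrieval functions, and the 0.8 threshold is a placeholder, not a value from the paper.

```python
from typing import Callable, List, Tuple


def gated_answer(
    question: str,
    draft: Callable[[str], Tuple[str, float]],
    retrieve: Callable[[str], List[str]],
    grounded: Callable[[str, List[str]], str],
    threshold: float = 0.8,
) -> Tuple[str, bool]:
    """Answer directly when confident, otherwise retrieve and answer from evidence.

    Returns (answer, used_retrieval).
    """
    answer, confidence = draft(question)
    if confidence >= threshold:
        # Confident enough: skip the retrieval call entirely.
        return answer, False
    documents = retrieve(question)
    return grounded(question, documents), True
```

The gate is only as good as the confidence signal feeding it, which is why the discrimination numbers earlier in this brief matter for agent design as much as for plain Q&A.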

One architecture signal to track

NVIDIA's INTRA framework 7 shows that encoder-decoder models can retrieve evidence internally — using decoder attention weights to score and select chunks — without a separate retrieval module. On multi-hop QA benchmarks, INTRA closes 59.4% of the gap between retrieval-augmented and gold-standard answers, compared to 53.8% for the strongest decoder-only alternative. 7
Encoder-decoder models are still less available than decoder-only alternatives, so INTRA is not a production option today. Teams evaluating retrieval architecture 12–18 months out should watch the encoder-decoder model ecosystem, because a shared representation space between retriever and generator removes a common source of RAG hallucinations.
Cover image: AI generated
