May 2026 LLM Hallucination-Mitigation Digest
A curated monthly digest of 24 hallucination-mitigation papers and 4 engineering tools from April 15 – May 15, 2026 — organized by methodology theme, with structured cards covering venue, authors, core contribution, benchmarks, and code availability, plus a synthesis of momentum signals and emerging sub-directions.
Coverage window: April 15 – May 15, 2026 · 24 core papers · 4 venue papers · 4 engineering tools
Monthly snapshot
Six themes defined this window.
Detection is going single-pass. Token-level and black-box detectors no longer need multiple sampling rounds or external retrieval. TokenHD's 0.6B model outperforms QwQ-32B 1; PCNET achieves AUROC up to 99% with a probabilistic circuit that triggers correction only when needed 2; a Koopman operator method delivers SOTA detection without any secondary sampling 3.
Multimodal hallucination is the most active sub-area. Nine papers this month target large vision-language models (LVLMs) and multimodal LLMs (MLLMs). Methods span attention steering (CAST 4, AR hallucinations 5), decoding-time intervention (PTI 6, LIME 7), preference fine-tuning (UE-DPO 8, HalluVL-DPO 9), and label-free post-hoc editing (VCE 10).
Benchmarks are moving beyond English. New benchmarks address 7 programming languages (Delulu 11), 306 natural languages (MultiWikiQHalluA 12), RAG long-context with label noise (TRIVIA+ 13), and LVLM prompt factors (HalluScope 9).
Domain-specific deployment is accelerating. Separate papers address optimization modeling, scientific measurement extraction, software package references, and API migration — all high-stakes narrow contexts where hallucination causes direct downstream errors.
Interventions are getting lighter. PTI intervenes once in the prefill stage rather than every decoding step 6; CAST is training-free 4; VCE requires no labeled data or fine-tuning 10; PC-LDCD corrects only tokens flagged as anomalous 2.
Hallucination is visible in production. An empirical audit of 111 million references across 2.5 million papers found a conservative estimate of 146,932 hallucinated citations in 2025 alone, disproportionately crediting already-prominent and male scholars 14.
Detection methods
LaaB: logical consistency as a bridge
ArXiv 2605.03971 · Venue ACL 2026 Main Conference · Status Accepted
Authors Hao Mi (first), Juan Cao (corresponding) · Institutions not disclosed on abstract page
Methodology Hybrid (neural uncertainty quantification + symbolic self-judgment)
Code/data Not available
Existing detectors treat micro-level neural uncertainty (token probabilities, internal activations) and macro-level symbolic self-judgment (verbalized prompts asking the model to rate its own outputs) as separate signals. LaaB bridges them via a "meta-judgment" step: the model generates a self-judgment label, then LaaB maps that label back into feature space and exploits the logical constraint that a response label and its meta-judgment label must be either identical or opposite, depending on the semantics of the self-judgment. This enables mutual learning between the two views. 15
Comparison with prior work: Evaluated on 4 public datasets across 4 LLMs against 8 baselines, demonstrating superiority. Specific dataset names and margin numbers are not reported in the abstract; full experimental tables require paper full-text access.
Evaluation 4 datasets, 4 LLMs, 8 baselines — specific benchmark names not disclosed in abstract.
Koopman dynamical system: black-box single-pass detection
ArXiv 2605.05134 · Venue Preprint · Status Preprint
Authors Dan Wilson, Mohamed Akrout · Institutions not disclosed on abstract page
Methodology Black-box detection using Koopman operator theory
Code/data Not available
The paper treats the LLM as a black-box dynamical system. Responses are projected into a high-dimensional embedding manifold; Koopman operator theory — a framework from dynamical systems that represents nonlinear dynamics as a linear operator on an infinite-dimensional function space — is used to fit transition operators for factual and hallucinated regimes separately. A differential residual score between the two operators flags a given response as factual or hallucinated in a single forward pass, with no secondary sampling or external retrieval. A preference-aware calibration mechanism handles distribution shift. 3
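A minimal numpy sketch of the differential-residual idea, assuming per-sentence (or per-token) embedding trajectories are available for factual and hallucinated reference corpora; the paper's manifold construction, operator parameterization, and preference-aware calibration are not described in the abstract and are not reproduced here.

```python
import numpy as np

def fit_transition_operator(trajectories):
    """Least-squares linear operator K such that e_{t+1} ≈ e_t @ K.
    trajectories: list of (T, d) arrays of response embeddings."""
    X = np.vstack([traj[:-1] for traj in trajectories])   # states at time t
    Y = np.vstack([traj[1:] for traj in trajectories])    # states at time t+1
    K, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return K

def residual(traj, K):
    """Mean one-step prediction error of operator K on one response trajectory."""
    pred = traj[:-1] @ K
    return float(np.mean(np.linalg.norm(traj[1:] - pred, axis=1)))

def differential_residual_score(traj, K_factual, K_halluc):
    # Positive score: the response fits the hallucinated regime better.
    return residual(traj, K_factual) - residual(traj, K_halluc)
```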
Comparison with prior work: Targets the cost bottleneck of consistency-based methods (e.g., SelfCheckGPT) that require multiple samples, and of retrieval-based methods that need external knowledge. Demonstrates SOTA on 3 benchmarks (not named in abstract) with reduced resource overhead.
Evaluation 3 benchmarks — names not disclosed in abstract.
TokenHD: scalable token-level detection
ArXiv 2605.12384 · Venue Preprint · Status Preprint
Authors Rui Min (first), Yi R. Fung (corresponding) · Institutions not disclosed on abstract page
Methodology Fine-tuning-based token-level detector with data synthesis engine
Code/data Not available
Token-level analysis has finer granularity than step-level analysis, but producing ground-truth annotations at token scale is expensive. TokenHD introduces a scalable data engine that synthesizes large-scale hallucination annotations automatically, paired with an importance-weighted training recipe that does not require segmenting a response into steps. A 0.6B detector trained with this pipeline surpasses QwQ-32B on mathematical and STEM benchmarks; performance scales consistently from 0.6B to 8B. 1
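One plausible form of an importance-weighted, segmentation-free token objective, shown only to make the granularity concrete; the paper's actual training recipe and weighting scheme are not disclosed in the abstract.

```python
import torch.nn.functional as F

def weighted_token_detection_loss(token_logits, labels, weights):
    """token_logits: (N,) raw per-token scores; labels: (N,) 0/1 hallucination flags
    from the synthesis engine; weights: (N,) per-token importance weights."""
    loss = F.binary_cross_entropy_with_logits(
        token_logits, labels.float(), reduction="none"
    )
    return (loss * weights).sum() / weights.sum()
```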
Comparison with prior work: The 0.6B-vs-QwQ-32B gap (~53× parameter difference) is the headline result. Step-level methods (e.g., process reward models used in reasoning chains) are the direct comparison class; TokenHD eliminates the segmentation requirement.
Evaluation Mathematical and STEM benchmarks; QwQ-32B baseline; 0.6B-to-8B scaling curve.
PCNET + PC-LDCD: hallucination as a geometric anomaly
ArXiv 2605.05953 · Venue Preprint · Status Preprint
Authors Erik Nielsen (first), Giovanni Iacca (corresponding) · Institutions not disclosed on abstract page
Methodology Probabilistic Circuit density estimator + conditional contrastive decoding
Code/data GitHub (URL not specified in abstract; described as publicly available)
PCNET is a Probabilistic Circuit (PC) trained as a tractable density estimator over the LLM's residual stream. At inference time, a token's position on the factual manifold is measured via exact Negative Log-Likelihood — no sampling required. PC-LDCD (PC-Latent Contrastive Decoding) triggers contrastive decoding only when the PC detects geometric deviation from the factual region, leaving correct generations untouched. Across 4 LLMs (1B–8B parameters) and 4 benchmarks, AUROC reaches up to 99%. On TruthfulQA specifically, PC-LDCD achieves the highest True+Info, MC2, and MC3 scores on 3 of 4 models while reducing corruption rate to 53.7% and achieving a 79.3% preservation rate. 2
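A sketch of the conditional trigger only, with a generic contrastive adjustment standing in for the paper's latent contrastive decoding; the probabilistic circuit that would supply `token_nll`, and the actual contrast source, are not reproduced here.

```python
import torch

def gated_decode_step(expert_logits, contrast_logits, token_nll,
                      nll_threshold, alpha=1.0):
    """Contrastive adjustment fires only when the token's residual-stream NLL
    signals a departure from the factual manifold; otherwise the original
    distribution is left untouched, preserving correct generations."""
    if token_nll > nll_threshold:
        logits = (1 + alpha) * expert_logits - alpha * contrast_logits
    else:
        logits = expert_logits
    return torch.argmax(logits, dim=-1)
```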
Comparison with prior work: The 53.7% corruption rate and 79.3% preservation rate should be read together: methods that apply corrections uniformly (e.g., standard contrastive decoding baselines) degrade fluency and factuality even when the original generation was correct. PC-LDCD's conditional trigger avoids this penalty.
Evaluation CoQA, SQuAD v2.0, TriviaQA, TruthfulQA; 4 LLMs (1B–8B); SOTA baselines.
Internal attention divergence: lightweight single-pass UQ
ArXiv 2605.05025 · Venue ACL SRW 2026 · Status Accepted
Authors Gijs van Dijk (sole author) · Institution not disclosed on abstract page
Methodology White-box uncertainty quantification via KL divergence over attention distributions
Code/data Not available
The method measures KL divergence between each attention head's distribution and a uniform reference distribution, then trains a logistic regression probe on the resulting attention features. No repeated sampling, no external models. Signal concentrates in middle layers and on factual tokens (named entities and numbers). Competitive with existing uncertainty quantification (UQ) methods across multiple datasets and model families. 16
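The pipeline is simple enough to sketch end to end, assuming per-token attention maps can be exported from the model; the feature layout, aggregation choices, and probe hyperparameters below are illustrative, not the paper's exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def attention_kl_features(attn, eps=1e-12):
    """attn: (layers, heads, seq) attention distribution for one generated token.
    Returns KL(attn || uniform) per head, flattened into a feature vector."""
    seq_len = attn.shape[-1]
    kl = np.sum(attn * (np.log(attn + eps) - np.log(1.0 / seq_len)), axis=-1)
    return kl.reshape(-1)

# X = np.stack([attention_kl_features(a) for a in attention_maps])
# probe = LogisticRegression(max_iter=1000).fit(X, hallucination_labels)
```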
Comparison with prior work: Competes with sampling-based UQ (e.g., semantic entropy) at a fraction of the cost. Specific benchmark names and margin numbers not disclosed in abstract.
Evaluation Multiple datasets across task types and model families — names not disclosed in abstract.
GSAR: typed grounding for multi-agent recovery
ArXiv 2604.23366 · Venue Preprint · Status Preprint
Authors Federico A. Kamelhar (sole author) · Institution not disclosed on abstract page
Methodology Grounding evaluation + replanning framework with formal structural properties
Code/data Not available
GSAR partitions claims into a four-way typology: grounded, ungrounded, contradicted, complementary. Evidence-type-specific weights and an asymmetric contradiction penalty produce a weighted groundedness score. The score feeds a three-tier decision function (proceed / regenerate / replan) under an explicit compute budget. Six structural properties are proven. Evaluated on FEVER with gold Wikipedia evidence under four LLM judges (gpt-5.4, claude-sonnet-4-6, claude-opus-4-7, gemini-2.5-pro), with a head-to-head comparison against Vectara HHEM-2.1-Open. Every ablation direction reproduced across all four judges. 17
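A sketch of the weighted scoring and tiered decision logic with illustrative weights and thresholds; the paper's evidence-type weights, budget accounting, and the six proven structural properties are not reproduced.

```python
# Illustrative constants; GSAR's actual values are not given in the abstract.
TYPE_WEIGHT = {"grounded": 1.0, "complementary": 0.5, "ungrounded": 0.0}
CONTRADICTION_PENALTY = 2.0  # asymmetric: contradictions cost more than gaps

def groundedness_score(claims):
    """claims: list of (claim_type, evidence_weight) pairs for one response."""
    total = sum(w for _, w in claims) or 1.0
    score = 0.0
    for claim_type, w in claims:
        if claim_type == "contradicted":
            score -= CONTRADICTION_PENALTY * w
        else:
            score += TYPE_WEIGHT[claim_type] * w
    return score / total

def decide(score, budget_remaining, proceed_at=0.8, regenerate_at=0.5):
    """Three-tier decision under an explicit compute budget."""
    if score >= proceed_at or budget_remaining <= 0:
        return "proceed"
    return "regenerate" if score >= regenerate_at else "replan"
```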
Comparison with prior work: Vectara HHEM-2.1-Open is a direct baseline. The paper is the first published groundedness framework that couples evidence-typed scoring with tiered recovery under an explicit compute budget.
Evaluation FEVER with gold Wikipedia evidence; four frontier LLM judges; Vectara HHEM-2.1-Open baseline.
Multimodal and LVLM hallucinations
Nine papers address vision-language models (LVLMs) and multimodal LLMs this month — the highest volume among sub-areas. Seven are covered in depth here; two additional tangentially-related papers (audio-visual sink tokens and VLM relation-perturbation analysis) are noted in the tangential list.
LIME: inference-time relevance propagation
ArXiv 2605.01766 · Venue Preprint · Status Preprint
Authors Itai Allouche, Joseph Keshet · Institutions not disclosed on abstract page
Methodology Training-free decoding-time intervention via Layer-wise Relevance Propagation (LRP)
Code/data Not available
LIME addresses modality imbalance: MLLMs often rely too heavily on language priors during generation and insufficiently on the visual (or audio) input. LRP (Layer-wise Relevance Propagation) quantifies each token's contribution to a prediction; LIME defines a relevance-based objective that promotes increased reliance on perceptual inputs, enforced through inference-time updates to key-value representations. No parameter changes. Evaluated across vision and audio benchmarks with consistent hallucination reductions while preserving generation quality. 7
Comparison with prior work: Most training-free LVLM hallucination methods operate on attention maps (CAST, VCD) or contrastive decoding strategies. LIME's LRP-based KV update is mechanistically distinct. Specific benchmark names and numbers not disclosed in abstract.
Evaluation Multiple multimodal benchmarks in vision and audio domains — names not disclosed in abstract.
AR hallucinations / RVE: action-relation visual enhancement
ArXiv 2605.11808 · Venue Preprint · Status Preprint
Authors Zhenxin Qin (first), Wen Shen (corresponding) · Institutions not disclosed on abstract page
Methodology Attention-steering via Action-Relation Sensitivity (ARS) scoring
Code/data Not available
Object hallucination has received substantial attention; this paper targets action-relation hallucinations — where the model correctly identifies objects but misidentifies their interactions (e.g., "the dog biting the man" vs. "the man biting the dog"). The core claim is that insufficient visual attention to the interaction region drives these errors. ARS score identifies attention heads sensitive to action-relation changes; Relation-aware Visual Enhancement (RVE) then steers those heads toward the action-relevant image region. The method generalizes to spatial-relation and object hallucinations with negligible inference cost. 5
Comparison with prior work: Prior attention-steering works (CAST, IVE) focus on object presence; this is the first approach with an explicit action-relation taxonomy. Specific benchmark names and numbers not disclosed in abstract.
Evaluation Existing LVLM hallucination baselines — specific benchmark names not disclosed in abstract.
UE-DPO: uncertainty-aware preference optimization
ArXiv 2605.04874 · Venue Preprint · Status Preprint
Authors Huatian Zhang (first), Yongdong Zhang (corresponding) · Institutions not disclosed on abstract page
Methodology Fine-tuning via DPO (Direct Preference Optimization) with token-level epistemic uncertainty
Code/data Not available
Standard DPO for MLLM hallucination mitigation allocates training emphasis based on the model's self-assessed visual sensitivity — a form of self-referential bias, because the model will already have high confidence on tokens it learned well, and low learning pressure on tokens where visual grounding is weakest. UE-DPO replaces self-assessed sensitivity with token-level epistemic uncertainty derived from the model's failure to ground predictions in images. This directs more gradient toward visually deficient tokens in preferred samples, while reducing over-penalization in dispreferred samples. Theoretical justification is provided. 8
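A hedged sketch of what a token-uncertainty-weighted DPO objective could look like; the exact weighting scheme, the uncertainty estimator, and the theoretical terms are not given in the abstract, so the forms below are assumptions.

```python
import torch.nn.functional as F

def ue_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, u_w, u_l, beta=0.1):
    """Token-weighted DPO sketch. Per-token log-ratios on the preferred sample
    are up-weighted by epistemic uncertainty u_w (visually under-grounded
    tokens get more gradient), while uncertainty on the dispreferred sample
    u_l tempers its penalty. All inputs: (batch, seq) tensors."""
    pref = ((logp_w - ref_logp_w) * u_w).sum(-1)
    disp = ((logp_l - ref_logp_l) * (1.0 - u_l)).sum(-1)
    return -F.logsigmoid(beta * (pref - disp)).mean()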
Comparison with prior work: Direct competitor to LLaVA-RLHF, RLHF-V, and POVID-style DPO methods. Specific benchmark names and numbers not disclosed in abstract.
Evaluation Extensive experiments — benchmark names not disclosed in abstract.
PTI: prefill-stage intervention
ArXiv 2604.25642 · Venue CVPR 2026 · Status Accepted
Authors Chengsheng Zhang (first), Xinmei Tian (corresponding) · Institutions not disclosed on abstract page
Methodology Decoding-time intervention — single intervention at prefill stage
Code/data https://github.com/huaiyi66/PTI
Most steering vector approaches intervene at each decoding step as errors accumulate. PTI argues that the KV cache built during prefill is where the model's "understanding" of the visual context is consolidated, and that correcting at the source is more efficient than correcting during token generation. PTI intervenes once during prefill: keys are steered toward visually-grounded objects; values are filtered to suppress background noise. The steering directions are modality-aware (separate directions for visual vs. textual representations). Because it operates only on the prefill KV cache, PTI is orthogonal to any decoding-stage method and can be combined plug-and-play. 6
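A sketch of a one-time prefill KV edit under the stated key/value split (steer visual keys, damp non-visual values); the modality-aware direction estimation and the scaling constants are assumptions here, not the released implementation.

```python
import torch

def steer_prefill_kv(past_key_values, visual_mask, key_direction, value_scale=0.5):
    """One-time edit of the prefill KV cache. visual_mask: bool tensor over
    sequence positions; key_direction: (head_dim,) grounding direction."""
    edited = []
    for keys, values in past_key_values:          # each: (batch, heads, seq, dim)
        keys, values = keys.clone(), values.clone()
        keys[:, :, visual_mask, :] += key_direction       # steer visual keys
        values[:, :, ~visual_mask, :] *= value_scale      # damp non-visual values
        edited.append((keys, values))
    return tuple(edited)
```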
Comparison with prior work: ICD (Instruction Contrastive Decoding), VCD (Visual Contrastive Decoding), and OPERA apply corrections at every decoding step. PTI's one-time prefill intervention has lower inference overhead by design, and is orthogonal rather than competing.
Evaluation Diverse decoding strategies, LVLMs, and hallucination benchmarks — specific names not disclosed in abstract.
HalluScope + HalluVL-DPO: prompt-induced hallucination
ArXiv 2604.21911 · Venue Preprint · Status Preprint
Authors Pegah Khayatan (first), Matthieu Cord (corresponding) · Institutions not disclosed on abstract page
Methodology Benchmark (HalluScope) + preference optimization mitigation (HalluVL-DPO)
Code/data https://pegah-kh.github.io/projects/prompts-override-vision/
The standard framing blames LVLM hallucinations on the vision backbone's limited resolution or representational capacity. HalluScope shows empirically that hallucinations largely stem from textual instruction priors — the language component dominates and overrides visual evidence when there is any ambiguity in the prompt. The HalluVL-DPO training data is curated specifically to prefer visually-grounded responses over responses driven by language priors. Dataset, benchmark, and code are to be released publicly. 9
Comparison with prior work: Challenges the vision-backbone-centric explanation. Competitors include RLHF-V and standard DPO baselines; specific numbers not disclosed in abstract.
Evaluation HalluScope benchmark + other hallucination benchmarks and visual capability evaluations.
VCE: zero-cost visual contrastive editing
ArXiv 2604.19412 · Venue ICASSP 2026 · Status Accepted
Authors Yanbin Huang (first), Xuelong Li (corresponding) · Institutions not disclosed on abstract page
Methodology Post-hoc parameter editing via SVD, label-free
Code/data Not available
VCE (Visual Contrastive Editing) identifies and suppresses object hallucinations by analyzing the model's response to contrastive visual perturbations — pairs of inputs where one contains the object in question and one does not. SVD decomposes the resulting activation difference to isolate the hallucination subspace; targeted parameter edits then suppress that subspace. No fine-tuning, no labeled data, no inference-time overhead beyond the initial parameter edit. Scalable to resource-constrained settings. 10
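A sketch of the SVD step over contrastive activation pairs and one illustrative form of the downstream edit (projecting the subspace out of a weight's input space); the paper's exact edit rule and layer selection are not specified in the abstract.

```python
import torch

def hallucination_subspace(acts_with, acts_without, rank=4):
    """SVD over paired activation differences (image containing the object vs.
    not); top right-singular vectors span the candidate hallucination subspace."""
    diffs = acts_with - acts_without                # (num_pairs, hidden_dim)
    _, _, vh = torch.linalg.svd(diffs, full_matrices=False)
    return vh[:rank]                                # (rank, hidden_dim)

def project_out(weight, subspace):
    """One illustrative 'targeted parameter edit': remove the subspace from the
    weight's input space."""
    eye = torch.eye(weight.shape[1], dtype=weight.dtype, device=weight.device)
    return weight @ (eye - subspace.T @ subspace)
```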
Comparison with prior work: Label-free post-hoc editing is mechanistically distinct from fine-tuning methods (UE-DPO, HalluVL-DPO) and from inference-time steering (CAST, PTI). The "zero-cost" label refers to the absence of labeled training data and the absence of per-inference overhead after the one-time parameter edit.
Evaluation Multiple object hallucination benchmarks — specific names not disclosed in abstract.
CAST: caption-guided attention steering
ArXiv 2605.04641 · Venue Preprint · Status Preprint
Authors Qiming Li (first), Bing Qin (corresponding) · Institutions not disclosed on abstract page
Methodology Training-free attention-steering via caption-query probing
Code/data Not available
CAST starts from an empirical observation: when an LVLM is asked a captioning query ("describe this image"), its attention to visual tokens is significantly higher than when asked a non-captioning query ("answer this question about the image"). CAST uses that difference to identify which attention heads are "caption-sensitive," then estimates steering directions from those heads' caption-query activation patterns and applies them during non-caption inference. Training-free, plug-and-play. Average reduction in object hallucination: 6.03% across 5 LVLMs and 5 benchmarks, achieving SOTA with minimal inference cost. 4
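A sketch of the head-selection step implied by that observation, assuming attention maps from paired caption and non-caption prompts over the same image are available; the subsequent steering-direction estimation is not shown.

```python
def caption_sensitive_heads(attn_caption, attn_query, visual_mask, top_k=16):
    """attn_*: (layers, heads, seq) attention from the final token under a
    captioning vs. a non-captioning prompt. Heads are ranked by how much extra
    visual attention they allocate under the captioning prompt."""
    gap = (attn_caption[..., visual_mask].sum(-1)
           - attn_query[..., visual_mask].sum(-1))        # (layers, heads)
    top = gap.flatten().topk(top_k).indices
    n_heads = gap.shape[1]
    return [(int(i) // n_heads, int(i) % n_heads) for i in top]
```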
Comparison with prior work: IVE (Image-grounded Vision Enhancement) and other attention-steering baselines are the direct comparison. 6.03% average reduction across 5 LVLMs × 5 benchmarks is the headline number.
Evaluation 5 LVLMs, 5 benchmarks (discriminative and generative tasks).
Benchmarks and evaluation frameworks
TRIVIA+: RAG-based long-context benchmark with label noise
ArXiv 2605.11330 · Venue ACL 2026 Main Conference · Status Accepted
Authors Wenbo Chen (first), Leman Akoglu (corresponding) · Institutions not disclosed on abstract page
Methodology New benchmark + desiderata formalization
Code/data Open-sourced (URL not specified in abstract)
The paper first establishes a set of formal desiderata properties for Hallucination Detection Benchmarks (HDBs). Auditing existing HDBs shows no single existing benchmark satisfies all properties. The two largest gaps: (1) absence of RAG-based benchmarks with long context — existing HDBs use short passages where even simple retrieval suffices; (2) absence of realistic label noise — HDBs with clean labels overestimate real-world detector performance since human annotation of borderline cases is inherently noisy. TRIVIA+ fills both gaps: it is the longest-context HDB in the literature and provides four sets of noisy labels with rigorous human annotation. Key finding: LLM-as-a-Judge performs competitively with specialized detectors on TRIVIA+, and label noise significantly degrades all current detection methods. 13
Comparison with prior work: TriviaQA, PopQA, and HaluEval are standard short-context HDBs. The long-context + label-noise combination is novel. Specific detector performance gaps not disclosed in abstract.
Evaluation TRIVIA+ (new); SOTA detector comparisons; LLM-as-a-Judge baseline.
HalluScan: 72-configuration systematic benchmark
ArXiv 2605.02443 · Venue Submitted to Neural Computing and Applications · Status Under review
Authors Ahmed Cherif (sole author) · Institution not disclosed on abstract page
Methodology Systematic benchmark framework + HalluScore composite metric + Adaptive Detection Routing
Code/data Not available
HalluScan covers 6 detection methods × 4 open-weight model families × 3 domains = 72 configurations. Three contributions. First, HalluScore, a composite metric that correlates at Pearson r = 0.41 with human judgments across these configurations. Second, Adaptive Detection Routing (ADR), which routes each input to the cheapest detection method that still meets a performance threshold: 2.0× cost reduction with only 0.1% AUROC degradation vs. always using the best method. Third, systematic error cascade decomposition showing how errors propagate differently across detection types. NLI Verification achieves the highest AUROC (0.88); RAV (Retrieval-Augmented Verification) achieves 0.66. 18
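ADR's routing rule reduces to a small decision loop. The sketch below uses placeholder cost and quality numbers (only the NLI and RAV AUROCs come from the paper) and omits whatever per-input features the framework conditions on.

```python
# Placeholder detector table; names, costs, and the first AUROC are illustrative.
DETECTORS = [
    {"name": "token_probability",   "cost": 1.0, "expected_auroc": 0.74},
    {"name": "nli_verification",    "cost": 4.0, "expected_auroc": 0.88},
    {"name": "retrieval_augmented", "cost": 9.0, "expected_auroc": 0.66},
]

def route(min_auroc=0.80):
    """Pick the cheapest detector expected to clear the quality threshold;
    fall back to the strongest detector if none does."""
    eligible = [d for d in sorted(DETECTORS, key=lambda d: d["cost"])
                if d["expected_auroc"] >= min_auroc]
    chosen = eligible[0] if eligible else max(DETECTORS,
                                              key=lambda d: d["expected_auroc"])
    return chosen["name"]
```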
Comparison with prior work: Most benchmark papers evaluate a single model family or single detection method; the 4×6×3 grid enables cross-family and cross-method comparison. ADR's 2.0× cost reduction baseline is the "always use best method" strategy.
Evaluation TruthfulQA, HaluEval; 6 detection methods; 4 model families; 3 domains.
Note on reproducibility: This is a sole-author submission with no institution disclosed. HalluScore's r = 0.41 with human judgments is a modest correlation. The AUROC gap between NLI Verification (0.88) and RAV (0.66) is substantial and should be independently verified before adopting these numbers as representative baselines.
PRISM: diagnostic hallucination evaluation
ArXiv 2604.16909 · Venue ACL 2026 Main Conference · Status Accepted
Authors Yuhe Wu (first), Zhuang Liu (corresponding) · Institutions not disclosed on abstract page
Methodology Diagnostic benchmark disentangling hallucination by generation stage
Code/data Not available
PRISM (Probing Reasoning, Instruction, and Source Memory) reformulates hallucination evaluation as a diagnostic problem rather than a binary detection problem. It decomposes hallucinations into four dimensions — knowledge missing, knowledge errors, reasoning errors, instruction-following errors — grounded in three generation stages: memory retrieval, instruction following, and reasoning. The benchmark contains 9,448 instances across 65 tasks, evaluated on 24 mainstream open-source and proprietary LLMs. Key finding: mitigation strategies that improve one dimension (e.g., improving instruction following via RLHF) tend to degrade another (e.g., increase reasoning errors). This trade-off structure has direct implications for deciding which mitigation approach to apply in a given deployment context. 19
Comparison with prior work: TruthfulQA, HaluEval, and FactScore treat hallucination detection as a binary task and do not attribute errors to specific generation stages. PRISM's four-dimension decomposition is more actionable for targeted mitigation.
Evaluation PRISM benchmark (9,448 instances, 65 tasks); 24 open-source and proprietary LLMs.
Delulu: code FIM hallucination benchmark (7 languages)
ArXiv 2605.07024 · Venue Preprint · Status Preprint
Authors Mahdi Erfanian (first), Shengyu Fu (corresponding) · Institution Microsoft (inferred from github.com/microsoft/delulu)
Methodology Verified benchmark via Docker-based test execution
Code/data https://github.com/microsoft/delulu
Delulu targets code generation's Fill-in-the-Middle (FIM) task — completing a code span given preceding and following context — where models frequently invent non-existent API methods, invalid parameters, undefined variables, and phantom imports. The 1,951 benchmark samples span 7 programming languages and 4 hallucination types, curated through an adversarial pipeline: frontier LLM generation → 4 diverse judge models → embedding-based clustering to ensure diversity → Docker-based execution verification → human expert review. Evaluated on 11 open-weight FIM models (the Qwen2.5-Coder size series plus CodeLlama, DeepSeek-Coder-V2, and StarCoder2), ranging from 0.5B to 32B parameters. The strongest model reaches only 84.5% pass@1; no family exceeds 0.77 Edit Similarity. 11
Comparison with prior work: HumanEval and MBPP measure code correctness but not hallucination type. Delulu's Docker verification pipeline separates "wrong but syntactically valid" from "contains hallucinated symbols."
Evaluation 11 open-weight FIM models (0.5B–32B); pass@1, Edit Similarity.
MultiWikiQHalluA: 306-language hallucination evaluation
ArXiv 2605.02504 · Venue RESOURCEFUL 2026 (camera-ready) · Status Accepted
Authors Freja Thoresen (first), Dan Saattrup Smart (corresponding) · Institutions not disclosed on abstract page
Methodology Multilingual synthetic hallucination dataset + token-level classifiers
Code/data Not available in abstract
MultiWikiQHalluA extends the LettuceDetect framework — an open-source token-level hallucination classification model trained on question-answering datasets — and the MultiWikiQA dataset to create synthetic hallucination datasets for 306 languages, training classifiers for 30 European languages. Models evaluated: Qwen3-0.6B, Qwen3-14B, Gemma-3-12B-IT (developed by Google), cogito-v1-preview-qwen-32B, and cogito-v1-preview-llama-70B (developed by Deep Cogito). Results are consistent: smaller models hallucinate more, and lower-resource languages see higher hallucination rates. Qwen3-0.6B reaches a hallucination rate of up to 60% in Icelandic, the highest observed. 12
Comparison with prior work: Most hallucination benchmarks are English-only or cover fewer than 10 languages. The 306-language scope is the broadest coverage in the literature for this task.
Evaluation MultiWikiQA base dataset; 5 models (0.6B–70B); 30 European languages for classifier training; 306 languages for dataset creation.
Domain-specific and text-level methods
OptArgus: multi-agent detection for optimization modeling
ArXiv 2605.11738 · Venue Preprint · Status Preprint
Authors Zhong Li (first), Chao Shen (corresponding) · Institutions not disclosed on abstract page
Methodology Multi-agent detection with conductor routing and specialist auditors
Code/data Not available
LLMs are increasingly used to translate natural-language optimization problems (scheduling, routing, allocation) into mathematical formulations and solver code (e.g., CPLEX, Gurobi inputs). The paper observes that matching the reference objective value is not a reliable correctness check: a model can produce a syntactically valid formulation with the right objective value by coincidence, while misrepresenting constraint structure. OptArgus defines the first fine-grained hallucination taxonomy for optimization modeling — objective failures, variable misspecification, constraint hallucinations, and implementation failures — and introduces a multi-agent detector with a conductor that routes inputs to specialist auditors. Evaluated on a three-part benchmark: 484 clean artifacts, 1,266 controlled injected hallucinations, and 6,292 natural LLM-generated artifacts. Compared to a matched single-agent baseline, OptArgus produces fewer false alarms on clean inputs and stronger detection on natural outputs. 20
Comparison with prior work: No prior hallucination taxonomy or detection system exists specifically for optimization modeling. The paper establishes the benchmark and baseline simultaneously.
Evaluation 484 clean + 1,266 injected + 6,292 natural artifacts; matched single-agent baseline.
LLM Ghostbusters: adaptive unlearning for package hallucinations
ArXiv 2605.01047 · Venue Preprint · Status Preprint
Authors Joseph Spracklen (first), Murtuza Jadliwala (corresponding) · Institutions not disclosed on abstract page
Methodology Post-deployment fine-tuning via hybrid token-level objective + adaptive discovery loop
Code/data Not available
LLMs frequently generate non-existent software package names in code completions — a phenomenon dubbed "slopsquatting," where attackers pre-register the hallucinated package names to deliver malicious payloads. Adaptive Unlearning (AU) addresses this post-deployment: a hybrid token-level objective simultaneously reinforces valid package outputs and suppresses hallucinated ones. Crucially, an adaptive discovery loop continuously surfaces new hallucination-inducing prompts without human supervision, so the suppression keeps pace with new prompt patterns. The method operates on model-generated data, requires no human annotation, and generalizes across coding domains. Package hallucination rate reduction: 81% while maintaining performance on standard coding benchmarks. 21
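A hedged sketch of how a hybrid reinforce-and-suppress token objective can be written; the paper's actual loss, masking policy, and discovery loop are not disclosed in the abstract, so the form below is an assumption.

```python
import torch.nn.functional as F

def hybrid_unlearning_loss(logits, target_ids, valid_mask, halluc_mask, beta=1.0):
    """logits: (batch, seq, vocab); target_ids: (batch, seq); masks: 0/1 floats
    marking tokens of verified vs. hallucinated package names. Standard NLL
    reinforces valid package tokens; a gradient-ascent-style term pushes down
    the likelihood of package names the discovery loop confirmed do not exist."""
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    reinforce = -(token_logp * valid_mask).sum() / valid_mask.sum().clamp(min=1)
    suppress = (token_logp * halluc_mask).sum() / halluc_mask.sum().clamp(min=1)
    return reinforce + beta * suppress
```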
Comparison with prior work: RLHF-based safety methods suppress broad categories of harmful output but are not targeted at package hallucination specifically. The 81% reduction on this narrow target is the headline number; baseline absolute rate not disclosed in abstract.
Evaluation Standard coding benchmarks; package hallucination rates; no standard benchmark named in abstract.
MeasHalu: scientific measurement hallucination mitigation
ArXiv 2604.16929 · Venue ACL 2026 · Status Accepted
Authors Ruijun Huang (first), Min Yang (corresponding) · Institutions not disclosed on abstract page
Methodology Fine-tuning-based (two-stage reasoning-aware + progressive reward curriculum)
Code/data Not available
In AI4Science applications, models extract measurement mentions from scientific papers — quantities, units, modifiers, and the relations between them. Errors here are high-stakes: a model reporting "100 mg/kg" instead of "100 μg/kg" changes a toxicology result by three orders of magnitude. MeasHalu presents a fine-grained taxonomy of measurement-specific hallucination types (quantity hallucination, unit hallucination, modifier hallucination, relation hallucination), then applies a two-stage approach: first, reasoning-aware fine-tuning on augmented scientific data with process-based supervision; second, a progressive reward curriculum that applies increasing penalties for each hallucination type across training. Substantially reduces hallucination rates on MeasEval. 22
Comparison with prior work: Generic fine-tuning baselines on MeasEval do not apply measurement-type-specific supervision. Specific margin improvement numbers not disclosed in abstract.
Evaluation MeasEval benchmark.
Hallucination Inspector: static analysis for API migration
ArXiv 2604.20202 · Venue Preprint · Status Preprint
Authors Marcos Tileria (first), Earl T. Barr (corresponding) · Institutions not disclosed on abstract page
Methodology Domain-specific detection via AST + API documentation knowledge base
Code/data Not available
API migration — converting code from a deprecated API to a replacement (e.g., Android's Java-to-Kotlin migrations) — is a task where LLMs frequently invent symbols that do not exist in the target API. The paper coins these Phantom Symbols: imaginary imports, constructors, and constants that the model generates because they look plausible given the old API structure. Standard code correctness metrics cannot catch these, because Phantom Symbol code may compile or run without error until the specific code path is executed. Hallucination Inspector uses static analysis: it extracts all referenced symbols from the AST, then cross-checks them against an API documentation knowledge base. Preliminary evaluation on Android API migrations shows successful identification of Phantom Symbols and significant reduction in false positives compared to standard metrics and probabilistic LLM judges. 23
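The check itself is straightforward static analysis. The sketch below is a Python analogue (the tool targets Android Java/Kotlin migrations), assuming a pre-built set of documented symbols for the target API; it is not the paper's implementation.

```python
import ast

def referenced_symbols(source: str) -> set:
    """Collect imported names, attribute accesses, and called functions."""
    tree = ast.parse(source)
    symbols = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            symbols.update(alias.name for alias in node.names)
        elif isinstance(node, ast.Attribute):
            symbols.add(node.attr)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            symbols.add(node.func.id)
    return symbols

def phantom_symbols(source: str, documented_api: set) -> set:
    """Symbols referenced by the migrated code but absent from the API docs."""
    return referenced_symbols(source) - documented_api
```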
Comparison with prior work: Standard metrics (CodeBLEU, execution-based) and probabilistic judge baselines. Margin numbers not disclosed in abstract.
Evaluation Android API migration corpus; standard metrics and LLM judge baselines.
CFB: watermark-inspired faithfulness boosting
ArXiv 2604.22335 · Venue ACL 2026 · Status Accepted
Authors Weixu Zhang (first), Xue Liu (corresponding) · Institutions not disclosed on abstract page
Methodology Decoding-time logit adjustment (no retraining)
Code/data Open-sourced (URL not specified in abstract)
Faithfulness hallucination — generating content that contradicts the input context — is the target. CFB (Context-Fidelity Boosting) borrows from watermarking: watermarking adds token-level logit perturbations to embed a detectable signal; CFB applies additive token-level logit boosts based on each token's degree of contextual support. Three strategies of increasing sophistication: static boosting (uniform boost for context-supported tokens), context-aware boosting (boost magnitude proportional to distribution divergence between context-conditioned and unconditioned generation), and token-aware boosting (using source-position attention scores plus semantic similarity to quantify context support). No retraining or architectural changes. Consistently improves faithfulness metrics on summarization and QA tasks across multiple open-source LLMs with minimal overhead. 24
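The static variant is the easiest to picture: an additive boost on context-supported tokens, applied to the logits before sampling. The sketch below assumes a precomputed set of context token ids and a hand-picked boost; it is not the released implementation.

```python
def static_context_boost(logits, context_token_ids, delta=2.0):
    """Simplest of the three strategies: a uniform additive logit boost for
    vocabulary ids that appear in the source context (watermark-style
    perturbation). The context-aware and token-aware variants instead scale
    delta per token using divergence or attention/similarity signals."""
    boosted = logits.clone()
    boosted[..., sorted(set(context_token_ids))] += delta
    return boosted
```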
Comparison with prior work: Context-aware decoding (CAD) and contrastive decoding methods also boost context reliance; CFB's token-level granularity and source-position attention mechanism are the distinguishing factors. Specific faithfulness metric names and numbers not disclosed in abstract.
Evaluation Summarization and QA tasks; multiple open-source LLMs; faithfulness metrics.
Hallucination in the wild: 146,932 fabricated citations in 2025
ArXiv 2605.07723 · Venue Preprint · Status Preprint
Authors Zhenyue Zhao (first), Yian Yin (corresponding) · Institutions not disclosed on abstract page
Methodology Large-scale empirical audit using citation verification
Code/data Not available
This paper audits 111 million references across 2.5 million papers from arXiv, bioRxiv, SSRN, and PubMed Central. Scientific citations are uniquely verifiable — a paper either exists or it does not — making them a clean signal for hallucination detection at scale. The audit finds a sharp rise in non-existent references following widespread LLM adoption: conservative estimate of 146,932 hallucinated citations in 2025 alone 14.
The social dynamics are notable. Hallucinated references are more pronounced in fields with rapid AI uptake, in papers with AI-assisted writing signals, and among small or early-career author teams. More troublingly, hallucinated citations disproportionately assign credit to already-prominent and male scholars — the model's training-data frequency bias translating into a tangible citation-inflation mechanism for researchers who are already well-represented. Preprint moderation captures only a small fraction of these errors.
Comparison with prior work: Small-scale citation audits (dozens to hundreds of papers) and crowdsourced retraction lists exist; the 2.5M-paper scale and the attribution-bias finding are both novel. The field-specific variation and team-size correlation are the most actionable findings for journal and preprint editors.
Evaluation 111M references, 2.5M papers across arXiv, bioRxiv, SSRN, PubMed Central.
Engineering tools and OpenReview contributions
Entroly: context engine with WITNESS verification
Source GitHub juyterman1000/entroly · License Apache 2.0 · Stars 368 · Commits 428
Stack Rust + WASM, Python (pip install entroly), npm (entroly-wasm)
Entroly is a context engine for LLM agents that combines two mechanisms. PRISM (not related to the benchmark above — Entroly's PRISM stands for an adaptive weighting system) learns codebase structure over time, starting from a generic configuration and adapting to a specific repository by Day 30 with zero manual configuration. WITNESS is a proof-carrying output gateway: it flags unsupported claims before the model bills tokens for reasoning over thousands of lines it did not actually read. Three WITNESS modes: strict (block unsupported output), audit (log and flag), annotate (inline attribution). Supports 38 agent wrappers including Claude, Cursor, Copilot, and Codex. 25
Claimed results on NeedleInAHaystack (n = 20, gpt-4o-mini): 100% accuracy; claimed token savings: 70–95%. On LongBench HotpotQA (n = 50), Entroly improves accuracy from 64.0% to 68.0% (Wilson 95% CI). Ranked #1 on MCP Market as of 2026-03-26. The benchmark sample sizes (n = 20, n = 50) are small; independent evaluation at scale would be needed to validate the 70–95% token savings claim broadly.
Hallucinator v0.2.0: academic reference validator
Source GitHub gianlucasb/hallucinator · License AGPL-3.0 · Stars 197 · Commits 596
Stack Rust (73.2%), Python (21.2%); TUI, CLI, Python bindings (pip install hallucinator)
Author Gianluca Stringhini
Hallucinator extracts references from PDFs and cross-checks them against 12 academic databases: CrossRef, arXiv, DBLP, Semantic Scholar, ACL Anthology, Europe PMC, PubMed, DOI Resolver, OpenAlex (250M+ works), Open Library, GovInfo, and IACR ePrint. Offline snapshots are available for DBLP (~4.6 GB), arXiv (~4 GB Kaggle snapshot), ACL Anthology, IACR ePrint, and OpenAlex. Retracted papers are detected automatically via CrossRef retraction metadata and Retraction Watch. A SearxNG meta-search fallback handles papers not indexed in any of the 12 databases. 26
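For readers who want the flavor of the lookup without installing the tool, the sketch below runs the same kind of check against one of the twelve sources via CrossRef's public REST API; it is not Hallucinator's internal code, and the score threshold is an arbitrary choice.

```python
import requests

def crossref_lookup(reference: str, min_score: float = 60.0):
    """Return the best-matching DOI for a free-text reference string, or None
    if CrossRef finds no plausible match (a candidate hallucinated citation)."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": reference, "rows": 1},
        timeout=10,
    )
    items = resp.json().get("message", {}).get("items", [])
    if not items or items[0].get("score", 0.0) < min_score:
        return None
    return items[0].get("DOI")
```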
v0.2.0 was released on 2026-04-30. Stringhini describes the tool's motivation directly: "Academia is under attack from AI-generated slop — fake citations, fabricated papers, LLM-written reviews." The ICLR 2025 security event (21% of reviews potentially AI-generated, 199 papers potentially AI-authored) and ICML 2026 (506 reviewers flagged for LLM policy violations) are cited as evidence for the problem Hallucinator addresses.
IBM Granite RAG library: LoRA adapters for hallucination detection in RAG pipelines
Source HuggingFace ibm-granite/granitelib-rag-gpt-oss-r1.0 · Downloads 437 (last month) · Status Experimental
Base model openai/gpt-oss-20b
IBM Granite's experimental RAG library provides a collection of LoRA adapters for task-specific augmentation of gpt-oss-20b in RAG pipelines. The Hallucination Detection (HD) adapter takes a conversation (including an assistant response) and a set of retrieved passages, and outputs a per-sentence hallucination risk score. Other adapters in the collection: Query Rewriting (QR), Query Clarification (QC), Answerability Detection (AD), and Citation Generation (CG). Recommended invocation path is via the Mellea framework (generative-computing/mellea). The HD adapter can be combined with sampling: generate multiple response variants, then filter by HD score. 27
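For prototyping outside Mellea, a LoRA adapter like this would typically be attached with PEFT; the sketch below uses the collection id as a placeholder adapter path and should be checked against the model card before use.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "openai/gpt-oss-20b"
HD_ADAPTER_ID = "ibm-granite/granitelib-rag-gpt-oss-r1.0"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID, device_map="auto")
model = PeftModel.from_pretrained(base, HD_ADAPTER_ID)
# The HD adapter's prompt template (conversation + retrieved passages) and its
# per-sentence risk-score output format are defined on the model card; the
# recommended invocation path remains the Mellea framework.
```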
Note: experimental status, 437 monthly downloads, and dependency on gpt-oss-20b mean this is best treated as a research integration for prototyping rather than a production-ready library at this point.
Vectara HHEM 2.1: industry hallucination benchmark data (May 2026)
Source Vectara (via third-party analysis, Confident AI and Suprmind.ai)
Vectara HHEM (Hughes Hallucination Evaluation Model) is the most widely cited open-source hallucination evaluation model for summarization tasks, measuring factual consistency between generated content and a reference context. Current data from the 2026 leaderboard: Gemini 2.0 Flash has the lowest hallucination rate at 0.7%; DeepSeek-R1 has the highest among tested models at 14.3%, approximately 4× DeepSeek-V3's rate. Third-party analysis from Suprmind.ai reports that every reasoning model tested in May 2026 exceeded 10% hallucination rate on Vectara's dataset. 28
Note: the 14.3% and 0.7% figures are from third-party analytical reports, not Vectara's direct publication. The exact test conditions, context lengths, and document types used in HHEM 2.1 scoring affect comparability across models.
VISTA: claim-level evaluation for conversational RAG
OpenReview MSLD 2026 Poster · License CC BY 4.0
Author Ashley Lewis · Venue MSLD 2026 (Multi-lingual Speech and Language Data Workshop)
VISTA (Verification in Sequential Turn-based Assessment) applies linguistic common ground theory to conversational RAG evaluation. Each dialogue turn is decomposed into atomic claims, classified into five categories: verified, contradicted, lacking evidence, abstention, and out of scope. The framework addresses four structural sources of hallucination: fluency optimization pressure, frequency effects (popularity biasing away from truth), a statistical preference for plausibility over factuality, and epistemological confusion in training data. Evaluated on multiple conversational RAG datasets across 8 models; outperforms LLM-as-a-Judge and FActScore, with larger gains on small open-source models. 29
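The claim-level decomposition maps naturally onto a small data model; the sketch below shows the five-way label set and one illustrative per-turn aggregate, not VISTA's actual scoring or common-ground tracking.

```python
from dataclasses import dataclass, field
from typing import List, Literal

ClaimLabel = Literal["verified", "contradicted", "lacking_evidence",
                     "abstention", "out_of_scope"]

@dataclass
class AtomicClaim:
    turn_index: int
    text: str
    label: ClaimLabel
    evidence_ids: List[str] = field(default_factory=list)

def per_turn_verified_rate(claims: List[AtomicClaim]) -> dict:
    """Share of verified claims per dialogue turn (illustrative aggregate)."""
    by_turn = {}
    for c in claims:
        by_turn.setdefault(c.turn_index, []).append(c.label == "verified")
    return {turn: sum(v) / len(v) for turn, v in by_turn.items()}
```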
Lewis's framing: "Hallucination is not a bug but a structural feature of LLMs — plausibility is optimized; truth is incidental." The claim-level atomic decomposition approach aligns with FActScore but extends to multi-turn conversational structure.
Thematic synthesis
Detection: efficiency has caught up with accuracy
The detection sub-area has matured rapidly in one specific direction: eliminating the cost of consistency-based methods. SelfCheckGPT and semantic entropy (both 2023) required sampling 5–20 outputs per query to estimate factuality. This month's papers achieve competitive or superior performance in a single forward pass — TokenHD (0.6B > QwQ-32B), PCNET (AUROC 99%), Koopman differential residual, and attention KL divergence all work without secondary sampling. The open question is external retrieval: grounding-based detection (GSAR, retrieval-augmented verification in HalluScan) still outperforms internal-signal methods on knowledge-intensive tasks where the LLM genuinely lacks the relevant fact. The two regimes — internal signal vs. external grounding — are converging on different deployment contexts: internal methods suit latency-sensitive production inference; grounding methods suit research evaluation and audit tasks.
Multimodal: the vision backbone is not the bottleneck
HalluScope's finding that textual instruction priors drive LVLM hallucinations more than vision architecture is consistent with several other methods this month. PTI intervenes on the prefill KV cache, not on visual features; LIME adjusts KV representations based on modality relevance; UE-DPO targets visually-deficient token positions rather than visual tokens per se; VCE isolates hallucination subspaces through contrastive activation analysis rather than contrastive visual perturbation. The practical implication for model architects: improvements to the vision encoder will deliver diminishing returns on hallucination rates if the language component's prior is strong enough to override visual evidence.
Benchmarks: three under-served dimensions now have coverage
Evaluating against only English text on factoid QA tasks (TruthfulQA, HaluEval) leaves three large gaps unaddressed. This month partially fills all three: (1) Low-resource languages — MultiWikiQHalluA at 306 languages, Delulu at 7 programming languages, with the consistent finding that lower-resource targets see higher rates; (2) Long-context RAG — TRIVIA+, showing that current detectors still have substantial room for improvement when context length increases; (3) Diagnostic decomposition — PRISM's four-dimensional breakdown showing that mitigation trade-offs are real and measurable across 24 models. TRIVIA+'s label noise finding is particularly actionable: if a benchmark's labels are cleaner than real-world annotation quality, detector performance numbers are optimistic by an unknown margin.
Directions under-represented this window
Three sub-areas from the channel's scope are largely absent this month. Knowledge editing (e.g., ROME, MEMIT-style fact editing) produced no papers in the window — unusual given the activity in prior months. RAG grounding papers appeared (CFB, TRIVIA+, VISTA), but no new retrieval-architecture papers targeting hallucination directly. Factuality fine-tuning at scale (instruction tuning designed specifically to improve factual grounding across tasks) also produced nothing notable. These gaps may reflect conference submission cycles (NeurIPS 2026 deadline pressure) rather than declining interest; all three are worth monitoring in the June–July window.
Cover image: AI-generated.
References
- [1] TokenHD: Scalable Token-Level Hallucination Detection
- [2] PCNET: Hallucination as an Anomaly via Probabilistic Circuits
- [3] Low-Cost Black-Box Hallucination Detection via Dynamical System Prediction
- [4] CAST: Caption-Guided Visual Attention Steering
- [5] Mitigating Action-Relation Hallucinations
- [6] PTI: Prefill-Time Intervention
- [7] LIME: Mitigating Multimodal Hallucinations via Relevance Propagation
- [8] UE-DPO: Uncertainty-Aware Exploratory DPO
- [9] HalluScope Benchmark
- [10] VCE: Zero-Cost Hallucination Mitigation
- [11] Delulu: Verified Multi-Lingual Benchmark for Code Hallucination
- [12] MultiWikiQHalluA: Multilingual Hallucination Benchmark
- [13] Rethinking Evaluation for LLM Hallucination Detection with TRIVIA+
- [14] LLM Hallucinations in the Wild: Large-Scale Evidence from Non-Existent Citations
- [15] LaaB: Logical Consistency as a Bridge for Hallucination Detection
- [16] Internal Attention Divergence Signals for Hallucination Detection
- [17] GSAR: Typed Grounding for Hallucination Detection and Recovery
- [18] HalluScan: Systematic Benchmark for Detection and Mitigation
- [19] PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations
- [20] OptArgus: Multi-Agent Hallucination Detection for Optimization Modeling
- [21] LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning
- [22] MeasHalu: Mitigation of Scientific Measurement Hallucinations
- [23] Hallucination Inspector: Fact-Checking Judge for API Migration
- [24] CFB: Context-Fidelity Boosting via Watermark-Inspired Decoding
- [25] Entroly GitHub
- [26] Hallucinator GitHub
- [27] IBM Granite RAG Library
- [28] Confident AI: Top 7 LLM Evaluation Tools in 2026
- [29] VISTA: Hallucination in the Wild — MSLD 2026