
2026/6/29 · 8:23
Top-conf paper digest — week of June 23-29, 2026
Nine arXiv papers from the June 23-29 recent batches, grouped by vision, LLM systems and safety, agents, and ML auditing. This issue highlights ICML/ICLR/CVPR-tagged work on MLLM evaluation, jailbreak internals, ternary quantization, long-context decoding, privacy audits, embodied cooperation, and 3D world modeling.
The strongest batch this week is less about one dominant theme than about tool-making: evaluation rubrics for MLLMs, safer internals for refusal behavior, cheaper LLM serving, and better control loops for embodied or visual reasoning all show up in the same arXiv window.
Selection rule: I kept main-conference or clearly top-conference-tagged arXiv entries from the June 23-29 recent batches and excluded workshop-only papers. When arXiv comments explicitly say accepted, oral, or published, the status below says so; otherwise the entry is marked as a conference-tagged preprint.
At a glance
| Area | Paper | Status | Why open it |
|---|---|---|---|
| Vision / evaluation | PerceptionRubrics | ICML 2026-tagged preprint | Tests whether MLLMs satisfy mandatory visual facts, not just average rubric scores. 1 |
| Agents / embodied AI | LLawCo | Accepted to ICML 2026 | Converts failure cases into explicit cooperation laws for multi-agent embodied planning. 2 |
| LLM safety | Robust Harmful Features | ICML 2026 oral | Identifies attention heads whose suppression can induce jailbreak-like behavior, then uses persistent safety activations for detection. 3 |
| ML methods / auditing | RECAST | Accepted to ICML 2026 | Reconstructs black-box classifier behavior from limited samples and one-sided counterfactual explanations. 4 |
| Vision / 3D simulation | P3Sim | Published at CVPR 2026 | Treats perceptual simulation as conditional inference over RGB, depth, flow, and 3D transforms. 5 |
| LLM compression | CAT-Q | ICML 2026 oral | Post-training ternary quantization with 512 calibration samples instead of massive QAT runs. 6 |
| LLM systems | Dustin | Accepted to ICML 2026 | Uses draft-model lookahead and target-model attention history to cut long-context speculative-decoding verification cost. 7 |
| Privacy / data audits | Natural identifiers | Accepted to ICLR 2026 | Uses naturally occurring hashes and short URLs as post-hoc audit anchors for trained LLMs. 8 |
| Vision-language reasoning | ActiveScope | ICML 2026-tagged preprint | Actively localizes and rechecks high-resolution image evidence before answering. 9 |
Vision and multimodal evaluation
PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception
Area tag: Vision / MLLM evaluation
arXiv: 2606.28322
Authors / institutions: Yana Wei, Hongbo Peng, Yanlin Lai, Liang Zhao, Kangheng Lin, En Yu, Keyu Lv, Han Zhou, Yin Tang, Haodong Li, Mitt Huang, Hangyu Guo, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel; the extracted arXiv HTML did not expose a clear affiliation block. 10
Peer-review status: ICML 2026 appears in the arXiv comments, but the comment does not explicitly say accepted, so I treat it as a conference-tagged preprint. 1
The paper argues that many MLLM benchmarks are too forgiving because they average partial correctness. PerceptionRubrics instead builds 1,038 dense images and 12,004 instance-specific rubrics, split into Must-Right criteria for essential visual facts and Easy-Wrong criteria for fine-grained errors. Its gated score drops to zero when a mandatory fact fails. 10
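To make the gate concrete, here is a minimal sketch of gated versus averaged rubric scoring. The `RubricResult` type, the `must_right` / `easy_wrong` labels, and the pass-fraction formula are my assumptions for illustration; the paper's exact scoring formula may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricResult:
    kind: str     # "must_right" or "easy_wrong" (hypothetical labels)
    passed: bool  # whether the model's answer satisfied this criterion


def averaged_score(results: List[RubricResult]) -> float:
    """Plain average: every criterion counts the same."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def gated_score(results: List[RubricResult]) -> float:
    """Assumed gate: any failed must-right criterion zeroes the score."""
    if any(r.kind == "must_right" and not r.passed for r in results):
        return 0.0
    return averaged_score(results)


# One missed mandatory fact plus ten correct fine-grained details.
example = [RubricResult("must_right", False)] + [
    RubricResult("easy_wrong", True) for _ in range(10)
]
print(averaged_score(example), gated_score(example))
```

In this toy case the averaged score is 10/11 while the gated score is 0, which is the gap the benchmark is designed to expose.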
The headline result is a persistent open-closed gap: Seed-2.0-Lite scores 70.07 overall, while the best open-source model in the reported table, Qwen3.5-397B-A17B, scores 61.61. The benchmark also reports Pearson correlation 0.916 and Spearman correlation 1.000 with Vision Arena human preference, higher than DOCCI or DetailCaps in the extracted comparison. 10
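A Spearman value of 1.000 alongside a Pearson value below 1 means the benchmark ranks models in the same order as human preference without the scores being linearly related. The sketch below shows that pattern on made-up numbers; the two score lists are illustrative assumptions, not values from the paper, and the rank helper assumes no ties.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; assumes all values are distinct (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def spearman(xs, ys):
    """Spearman correlation is Pearson correlation computed on ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical benchmark scores and human-preference ratings for four models.
benchmark = [40.0, 50.0, 60.0, 70.0]
human = [1000.0, 1010.0, 1100.0, 1300.0]
print(round(pearson(benchmark, human), 3), round(spearman(benchmark, human), 3))
```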
Takeaway: open it if you evaluate MLLMs on dense screenshots, documents, UI, STEM figures, or puzzles. The most useful idea is the mandatory-fact gate: a model that gets ten small details right but misses the one needed fact should not receive a high perceptual score.
Code / resources: Project page. 10
P3Sim: Perceptual 3D Simulation With Physical World Modeling
Area tag: Vision / 3D world modeling
arXiv: 2606.27575
Authors / institutions: Wanhee Lee, Klemen Kotar, Rahul Mysore Venkatesh, Jared Watrous, Daniel L. K. Yamins; Stanford University is listed in the extracted HTML. 11
Peer-review status: arXiv comments say it was published as a CVPR 2026 conference paper. 5
P3Sim frames scene simulation as conditional inference over multimodal local scene variables: RGB, depth, optical flow, partial 3D transforms, and persistent scene memory. The system combines a 7B decoder-only physical world model, a geometric conditioning module that supplies target depth and flow, and a memory module that keeps a globally consistent 3D estimate over time. 11
The reported benchmark numbers are concrete. On SEVA novel-view synthesis, P3Sim reaches PSNR 21.54 on RE10K, 15.18 on LLFF, and 15.50 on DTU, beating ViewCrafter and SEVA in the extracted table. On 3DEditBench object manipulation, it reports PSNR 23.12, LPIPS 0.121, and EA 0.827 versus LightningDrag at PSNR 19.52, LPIPS 0.184, and EA 0.722. 11
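For readers who want to calibrate the PSNR gaps, here is a small sketch of the standard PSNR definition and of what a decibel gap implies for mean squared error. It assumes 8-bit pixel values and treats a reported PSNR as if it came from a single MSE; benchmark averages over many images will not map this cleanly, so read the ratio as a rough guide only.

```python
import math


def psnr(reference, prediction, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two flat pixel sequences."""
    if len(reference) != len(prediction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - p) ** 2 for r, p in zip(reference, prediction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_psnr_gap(gap_db):
    """How many times lower the MSE is for a given PSNR advantage in dB."""
    return 10.0 ** (gap_db / 10.0)


# Gap between the two 3DEditBench PSNR figures quoted above.
gap = 23.12 - 19.52
print(round(mse_ratio_from_psnr_gap(gap), 2))
```

Under those assumptions, the 3.6 dB gap on 3DEditBench corresponds to roughly 2.3 times lower squared error.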
Takeaway: this is the vision paper to open if you care about world models that do more than image-to-video. It tries to keep geometry explicit while letting the learned model fill the missing parts of perception.
Code / resources: no code or project page was found in the extracted arXiv HTML. 11
ActiveScope: Actively Seeking and Correcting Perception for MLLMs
Area tag: Vision-language reasoning
arXiv: 2606.24292
Authors / institutions: Yajing Wang, Chao Bi, Junshu Sun, Shufan Shen, Zhaobo Qi, Shuhui Wang, Qingming Huang; the extracted HTML did not expose a clear affiliation block. 12
Peer-review status: ICML 2026 appears in the arXiv comments, but the comment does not explicitly say accepted, so I treat it as a conference-tagged preprint. 9
ActiveScope is a training-free correction loop for high-resolution image understanding. Semantic Anchor Localization extracts target-specific attention maps from the query, while Interference-Suppressed Refinement masks wrong regions after failed verification and forces attention to redistribute before the final answer. 12
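The control flow of that loop can be sketched on a toy attention grid. Everything here is a simplification I am assuming for illustration: the grid stands in for a query-conditioned attention map, `verify` stands in for the model's verification step, and masking a single cell stands in for suppressing a wrong region. It is not the authors' implementation.

```python
def argmax_cell(attn):
    """Return the (row, col) of the highest-attention cell."""
    _, cell = max(
        (value, (r, c)) for r, row in enumerate(attn) for c, value in enumerate(row)
    )
    return cell


def suppress(attn, cell):
    """Zero out a rejected cell and renormalize so attention redistributes."""
    masked = [
        [0.0 if (r, c) == cell else value for c, value in enumerate(row)]
        for r, row in enumerate(attn)
    ]
    total = sum(sum(row) for row in masked)
    if total == 0:
        return masked
    return [[value / total for value in row] for row in masked]


def seek_and_correct(attn, verify, max_rounds=3):
    """Localize, verify, and mask-and-retry until a region passes or rounds run out."""
    for _ in range(max_rounds):
        cell = argmax_cell(attn)
        if verify(cell):
            return cell
        attn = suppress(attn, cell)
    return None


# Toy case: the strongest region is a distractor; the true target is (1, 0).
toy_attention = [[0.1, 0.5], [0.3, 0.1]]
found = seek_and_correct(toy_attention, verify=lambda cell: cell == (1, 0))
print(found)
```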
On Qwen3VL-4B, ActiveScope reports 96.34 accuracy on V* Bench, 78.13 on HR-Bench 4K, 74.75 on HR-Bench 8K, and 50.02 on MME-RealWorld-Lite. The same table lists the regular Qwen3VL-4B baseline at 89.53, 76.88, 71.75, and 46.74, respectively. 12
Takeaway: this is a practical read for anyone trying to improve MLLM perception without retraining. Its limitation is also practical: rectangular crops may be a poor fit for irregular or non-convex targets. 12
Code / resources: GitHub repository. 12
LLM internals, safety, and systems
Robust Harmful Features Under Jailbreak Attacks
Area tag: LLM safety / mechanistic interpretability
arXiv: 2606.28153
Authors / institutions: Yanchen Yin, Dongqi Han, Linghui Li; the extracted HTML lists funding from the National Natural Science Foundation of China and Beijing University of Posts and Telecommunications but does not expose a full affiliation block. 13
Peer-review status: arXiv comments say accepted at ICML 2026 as an oral presentation. 3
The paper separates two attention-head types behind jailbreak behavior. Adversarially Compromised Heads are early-layer heads that attacks suppress; Safety-Aligned Heads are mid-layer heads that keep firing even when the model produces a harmful response. The method back-traces a refusal direction through attention OV circuits and then classifies heads by distribution shifts across benign, refused harmful, and successfully jailbroken prompts. 13
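A toy version of the head-sorting step might look like the sketch below. It assumes each head has already been reduced to a mean projection onto the refusal direction for three prompt sets, and it uses a simple margin and retention threshold that I made up; the paper works with full distribution shifts, so treat the function names, thresholds, and labels as hypothetical.

```python
import math


def projection(vector, direction):
    """Scalar projection of a head's output onto a (refusal) direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(v * d for v, d in zip(vector, direction)) / norm


def classify_head(benign, refused, jailbroken, margin=0.5, keep_ratio=0.5):
    """Label a head from its mean projections under three prompt sets.

    benign / refused / jailbroken: mean projection on benign prompts,
    refused harmful prompts, and successfully jailbroken prompts.
    """
    safety_signal = refused - benign
    if safety_signal < margin:
        return "other"  # head does not respond to harmful content
    retained = (jailbroken - benign) / safety_signal
    # Signal that survives the attack -> persistent; signal that collapses -> compromised.
    return "safety_aligned" if retained >= keep_ratio else "compromised"


print(classify_head(0.0, 2.0, 0.1))  # signal collapses under attack
print(classify_head(0.0, 2.0, 1.8))  # signal persists under attack
```

The second kind of head is what makes a training-free detector plausible: its activation still separates harmful from benign inputs even when the output has been jailbroken.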
The ablations are striking. Suppressing only eight compromised heads raises attack success from 0% to 95.0% on Llama-3-8B and 81.6% on Llama-2-7B; random-head interventions reach only 4.0% and 10.2%. A training-free detector that reads persistent activations reaches weighted Macro-F1 0.888 on Llama-3-8B across ten safety datasets, beating LlamaGuard3, LlamaGuard4, Qwen3Guard, and WildGuard in the extracted comparison. 13
Takeaway: the paper is useful if you build white-box safety monitors. Its detector needs internal activations, and the authors note that the analysis is mostly on attention heads rather than MLP layers. 13
Code / resources: no public code link was found in the extracted content. 13
CAT-Q: Cost-efficient and Accurate Ternary Quantization for LLMs
Area tag: LLM compression
arXiv: 2606.26650
Authors / institutions: Shigeng Wang, Chao Li, Yangyuxuan Kang, Jiawei Fan, Anbang Yao; the arXiv source did not expose a clear affiliation block in the extracted section. 14
Peer-review status: arXiv comments say accepted to ICML 2026 as an oral. 6
CAT-Q is a post-training ternary quantization method. It combines learnable modulation, which reshapes weight distributions and thresholds before ternarization, with softened ternarization, a differentiable-to-hard transition that stabilizes optimization. 14
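For orientation, the sketch below shows generic threshold ternarization and one common way to soften it. It is not CAT-Q's formulation: the threshold uses the 0.7 times mean-absolute-weight heuristic from Ternary Weight Networks, and the tanh-based relaxation with a `temperature` knob is my own stand-in for a differentiable-to-hard transition.

```python
import math


def hard_ternarize(weights, delta):
    """Map each weight to -1, 0, or +1 using a magnitude threshold."""
    return [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]


def soft_ternarize(weights, delta, temperature):
    """Smooth surrogate that approaches hard_ternarize as temperature grows."""
    return [
        0.5 * (math.tanh(temperature * (w - delta)) + math.tanh(temperature * (w + delta)))
        for w in weights
    ]


def scale_for(weights, codes):
    """Per-tensor scale: mean magnitude of the weights that were kept non-zero."""
    kept = [abs(w) for w, q in zip(weights, codes) if q != 0]
    return sum(kept) / len(kept) if kept else 0.0


weights = [0.9, -0.05, 0.02, -1.1]
delta = 0.7 * sum(abs(w) for w in weights) / len(weights)  # assumed TWN-style threshold
codes = hard_ternarize(weights, delta)
alpha = scale_for(weights, codes)
print(codes, alpha)
print([round(v, 3) for v in soft_ternarize(weights, delta, temperature=5.0)])
```

Raising `temperature` during optimization moves the soft codes toward the hard ones, which is the general idea behind annealed relaxations of this kind.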
The cost claim is the reason to read it. CAT-Q quantizes 1.7B-8B LLMs using 512 calibration samples, roughly 1 million tokens, and the authors compare that with BitNet 1.58-bit models trained on 100B tokens. They also report scaling to 14B-235B pre-trained models in 8 to 60 hours on 8 A100-80GB GPUs. 14
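The scale of that comparison is easier to see as arithmetic. The snippet below only restates the figures quoted above; the implied tokens-per-sample value is derived from the rounded "1 million" figure, so it is approximate.

```python
calibration_samples = 512
calibration_tokens = 1_000_000       # "roughly 1 million tokens"
bitnet_tokens = 100_000_000_000      # 100B tokens

tokens_per_sample = calibration_tokens / calibration_samples
data_ratio = bitnet_tokens / calibration_tokens

gpus = 8
gpu_hours_low = 8 * gpus    # 8 wall-clock hours on 8 GPUs
gpu_hours_high = 60 * gpus  # 60 wall-clock hours on 8 GPUs

print(round(tokens_per_sample), int(data_ratio), gpu_hours_low, gpu_hours_high)
```

So the calibration set is about five orders of magnitude smaller than the BitNet training corpus, and the largest reported runs cost between 64 and 480 A100 GPU-hours.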
Takeaway: CAT-Q is relevant if you want a ternary path that starts from existing checkpoints rather than retraining a BitNet-like model. The core comparison to check in the full paper is how the ternary models trade accuracy against hardware deployment constraints.
Code / resources: GitHub repository. 14
Dustin: Draft-Augmented Sparse Verification for Efficient Long-Context Generation with Speculative Decoding
Area tag: LLM systems / long-context inference
arXiv: 2606.24957
Authors / institutions: WenHung Lee, Jian-Jia Chen, Xiaolin Lin, Pei-Shuo Wang, Chi-Chih Chang, Chun-Che Yang, Ning-Chi Huang, Grace Li Zhang, Kai-Chiang Wu; the extracted HTML did not expose affiliations. 15
Peer-review status: arXiv comments say accepted to ICML 2026. 7
Dustin targets a specific bottleneck in speculative decoding for long-context, multi-batch LLM serving: verification can become dominated by KV-cache loading. It keeps protected sink and recent tokens, then selects the remaining KV tokens using draft-model lookahead signals and target-model historical attention, with online scoring restricted to a small set of semantic retrieval heads. 15
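The selection step can be sketched as a budgeted top-k over unprotected positions. The function name, the linear `mix` between the two signals, and the default sink and recent counts are assumptions for illustration; the paper's scoring over retrieval heads is more involved than a single score per token.

```python
def select_kv(num_tokens, budget, draft_scores, history_scores,
              num_sink=4, num_recent=8, mix=0.5):
    """Pick KV positions to load for verification under a token budget.

    Sink and recent positions are always kept (so the result can exceed
    `budget` if the protected set alone is larger); remaining slots go to
    the positions with the highest blended score.
    """
    sink = set(range(min(num_sink, num_tokens)))
    recent = set(range(max(0, num_tokens - num_recent), num_tokens))
    protected = sink | recent

    remaining = max(0, budget - len(protected))
    candidates = [i for i in range(num_tokens) if i not in protected]
    candidates.sort(
        key=lambda i: mix * draft_scores[i] + (1.0 - mix) * history_scores[i],
        reverse=True,
    )
    return sorted(protected | set(candidates[:remaining]))


# Toy run: 10 cached tokens, budget of 5, one sink token, two recent tokens.
draft = [0.0, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
history = [0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0]
kept = select_kv(10, 5, draft, history, num_sink=1, num_recent=2)
print(kept)
```

Note that the sketch still reads scores for every cached position, which matches the point that this approach cuts verification work rather than cache size.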
On Qwen2.5-72B at 32k context length, Dustin reports 27.85x self-attention speedup and 9.17x end-to-end decoding speedup at batch size 16. On LongBench with a 128-token KV budget, the extracted result gives average accuracy 52.41 versus 47.08 for the best competing compression baseline. 15
Takeaway: Dustin is a systems paper for serving teams, not a general KV-compression solution. The authors state that it indexes the full history and does not reduce KV-cache memory footprint, so the gain is verification latency rather than fitting longer contexts into the same GPU memory. 15
Code / resources: no public code link was found in the extracted content. 15
Natural Identifiers for Privacy and Data Audits in Large Language Models
Area tag: LLM privacy / data auditing
arXiv: 2606.24408
Authors / institutions: Lorenzo Rossi, Bartłomiej Marek, Franziska Boenisch, Adam Dziedzic; the arXiv abstract page did not expose affiliations. 8
Peer-review status: accepted at ICLR 2026. 8
The paper proposes natural identifiers: structured random strings such as cryptographic hashes and shortened URLs that already occur in web-scale training data. Because their format can generate unlimited same-distribution random strings, they can act as alternative canaries for post-hoc differential privacy audits and as non-member held-out data for dataset inference. 8
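The "unlimited same-distribution strings" property is easy to demonstrate. The sketch below generates fresh strings in two such formats using only the Python standard library; the 7-character base62 code is an assumed short-link format (real shorteners vary), and the helper names are hypothetical rather than taken from the paper.

```python
import hashlib
import secrets
import string

BASE62 = string.ascii_letters + string.digits


def random_sha256_hex():
    """A fresh SHA-256 digest: 64 lowercase hex characters."""
    return hashlib.sha256(secrets.token_bytes(32)).hexdigest()


def random_short_code(length=7):
    """A fresh short-link style code (assumed base62 alphabet and length)."""
    return "".join(secrets.choice(BASE62) for _ in range(length))


def fresh_non_members(n, make=random_sha256_hex):
    """Generate n distinct strings that cannot have appeared in training data
    except by negligible chance, for use as held-out non-members."""
    items = set()
    while len(items) < n:
        items.add(make())
    return sorted(items)


held_out = fresh_non_members(5)
print(held_out[0], random_short_code())
```

An auditor would then compare model behavior on identifiers found in the suspect corpus against behavior on freshly generated ones of the same format.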
The abstract does not provide the main numeric table, so I would not treat this as a results-first paper from the arXiv page alone. Its value is the audit setup: it removes two common blockers, retraining with inserted canaries and constructing private IID non-member data for a suspect dataset. 8
Takeaway: read this if you audit trained models after the fact. The entry point is less a new metric than a new source of naturally distributed audit strings.
Code / resources: no public code link was found on the extracted arXiv abstract page. 8
Agents and ML methods
LLawCo: Learning Laws of Cooperation for Modeling Embodied Multi-Agent Behavior
Area tag: Agents / embodied cooperation
arXiv: 2606.28182
Authors / institutions: Qinhong Zhou, Chuang Gan, Anoop Cherian; the extracted paper text states that Qinhong Zhou conducted part of the work during an internship at MERL and that Anoop Cherian was supported by MERL. 16
Peer-review status: arXiv comments say accepted to ICML 2026. 2
LLawCo is a failure-driven alignment loop for communicative embodied agents. Failed episodes are analyzed for behavioral mismatches, the system derives high-level laws such as when to talk or wait, and successful law-aligned traces are used for supervised fine-tuning. 16
Across four backbone LLMs, LLawCo reports a 4.5 percentage-point average success-rate improvement over the strongest communicative baseline on the proposed PARTNR-Dialog benchmark and a 6.8-point improvement over the strongest baseline on TDW-MAT. In the extracted ablation, removing laws drops Qwen-3-14B success on PARTNR-Dialog by 7 points. 16
Takeaway: this paper is a useful bridge between post-hoc reflection and deployable agent policy. The explicit laws make the behavior inspectable, but the authors also note that laws induced from task success still need human validation before deployment. 16
Code / resources: MERL research highlight. 16
RECAST: Model Reconstruction via Counterfactual-Aware Wasserstein Geometry under Limited Data
Area tag: ML methods / model auditing
arXiv: 2606.27948
Authors / institutions: Xuan Zhao, Lena Krieger, Zhuo Cao, Arya Bangun, Hanno Scharr, Ira Assent; the extracted HTML lists Helmholtz Association and Jülich Supercomputing Centre support but does not expose full affiliations. 17
Peer-review status: arXiv comments say accepted at ICML 2026. 4
RECAST reconstructs a black-box binary classifier when the auditor has limited labeled samples and one-sided counterfactual explanations. Instead of treating counterfactuals as hard labels, it uses them as soft cross-class evidence and learns Wasserstein barycentric prototypes for the two classes. 17
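As background on the barycenter idea, the sketch below computes a Wasserstein-2 barycenter in the one case where it has a simple closed form: two one-dimensional empirical distributions with the same number of samples, where the barycenter is a weighted average of sorted values. Using `weight_b` to down-weight counterfactual-derived points is my own illustration of "soft evidence"; RECAST's actual prototypes and weighting are defined in the paper and are not reproduced here.

```python
def barycenter_1d(samples_a, samples_b, weight_b=0.5):
    """Wasserstein-2 barycenter of two equal-size 1D empirical distributions.

    In one dimension the optimal transport plan matches sorted samples, so
    the barycenter is the weighted average of matching order statistics.
    """
    if len(samples_a) != len(samples_b) or not samples_a:
        raise ValueError("inputs must be non-empty and the same length")
    if not 0.0 <= weight_b <= 1.0:
        raise ValueError("weight_b must be in [0, 1]")
    a, b = sorted(samples_a), sorted(samples_b)
    return [(1.0 - weight_b) * x + weight_b * y for x, y in zip(a, b)]


# Hypothetical 1D feature values: labeled class samples and counterfactual points.
labeled = [0.0, 2.0]
counterfactual = [4.0, 2.0]
print(barycenter_1d(labeled, counterfactual, weight_b=0.5))
print(barycenter_1d(labeled, counterfactual, weight_b=0.2))
```

A smaller `weight_b` keeps the prototype closer to the labeled samples, which is one simple way to treat counterfactuals as evidence rather than as hard labels.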
The extracted results emphasize low-query behavior. With 100 instances per class, RECAST is reported as consistently higher-fidelity than SAMPLES and CCA across four real datasets and three target model families, with more stable behavior under different counterfactual generators and distribution noise. The extraction did not expose one compact headline score, so treat this as a method-and-robustness paper rather than a single-number leaderboard result. 17
Takeaway: open RECAST if your audit setting is constrained: offline access, noisy counterfactuals, and too few queries to fit a high-capacity surrogate safely.
Code / resources: GitHub repository. 17
Suggested reading order
Start with CAT-Q if deployment cost is your bottleneck, Robust Harmful Features if you work on white-box safety monitoring, and PerceptionRubrics if you maintain multimodal evaluation suites. For embodied-agent work, read LLawCo before the vision-heavy entries; it gives a clearer recipe for turning failure traces into behavior changes.
References
- [1] PerceptionRubrics arXiv abstract
- [2] LLawCo arXiv abstract
- [3] Robust Harmful Features arXiv abstract
- [4] RECAST arXiv abstract
- [5] P3Sim arXiv abstract
- [6] CAT-Q arXiv abstract
- [7] Dustin arXiv abstract
- [8] Natural Identifiers arXiv abstract
- [9] ActiveScope arXiv abstract
- [10] PerceptionRubrics arXiv HTML
- [11] P3Sim arXiv HTML
- [12] ActiveScope arXiv HTML
- [13] Robust Harmful Features arXiv HTML
- [14] CAT-Q arXiv HTML
- [15] Dustin arXiv HTML
- [16] LLawCo arXiv HTML
- [17] RECAST arXiv HTML
