Top-conf paper digest — week of June 23-29, 2026

This week's strongest batch has no single dominant theme; the common thread is tool-making. Evaluation rubrics for MLLMs, safer internals for refusal behavior, cheaper LLM serving, and better control loops for embodied or visual reasoning all appear in the same arXiv window.

Selection rule: I kept main-conference or clearly top-conference-tagged entries from the arXiv recent-submission listings for June 23-29 and excluded workshop-only papers. When the arXiv comments explicitly say accepted, oral, or published, the status below says so; otherwise the entry is marked as a conference-tagged preprint.

At a glance

Area	Paper	Status	Why open it
Vision / evaluation	PerceptionRubrics	ICML 2026-tagged preprint	Tests whether MLLMs satisfy mandatory visual facts, not just average rubric scores. 1
Agents / embodied AI	LLawCo	Accepted to ICML 2026	Converts failure cases into explicit cooperation laws for multi-agent embodied planning. 2
LLM safety	Robust Harmful Features	ICML 2026 oral	Identifies attention heads whose suppression can induce jailbreak-like behavior, then uses persistent safety activations for detection. 3
ML methods / auditing	RECAST	Accepted to ICML 2026	Reconstructs black-box classifier behavior from limited samples and one-sided counterfactual explanations. 4
Vision / 3D simulation	P3Sim	Published at CVPR 2026	Treats perceptual simulation as conditional inference over RGB, depth, flow, and 3D transforms. 5
LLM compression	CAT-Q	ICML 2026 oral	Post-training ternary quantization with 512 calibration samples instead of massive QAT runs. 6
LLM systems	Dustin	Accepted to ICML 2026	Uses draft-model lookahead and target-model attention history to cut long-context speculative-decoding verification cost. 7
Privacy / data audits	Natural identifiers	Accepted to ICLR 2026	Uses naturally occurring hashes and short URLs as post-hoc audit anchors for trained LLMs. 8
Vision-language reasoning	ActiveScope	ICML 2026-tagged preprint	Actively localizes and rechecks high-resolution image evidence before answering. 9

Vision and multimodal evaluation

PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception

Area tag: Vision / MLLM evaluation
arXiv: 2606.28322
Authors / institutions: Yana Wei, Hongbo Peng, Yanlin Lai, Liang Zhao, Kangheng Lin, En Yu, Keyu Lv, Han Zhou, Yin Tang, Haodong Li, Mitt Huang, Hangyu Guo, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel; the extracted arXiv HTML did not expose a clear affiliation block. 10
Peer-review status: ICML 2026 appears in the arXiv comments, but the comment does not explicitly say accepted, so I treat it as a conference-tagged preprint. 1

The paper argues that many MLLM benchmarks are too forgiving because they average partial correctness. PerceptionRubrics instead builds 1,038 dense images and 12,004 instance-specific rubrics, split into Must-Right criteria for essential visual facts and Easy-Wrong criteria for fine-grained errors. Its gated score drops to zero when a mandatory fact fails. 10
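
The gating idea can be sketched in a few lines. This is a minimal illustration, assuming each rubric is a pass/fail check and that the ungated score is the plain pass rate over all criteria; the function name `gated_score` and the averaging rule are hypothetical stand-ins, not the paper's exact aggregation.

```python
from typing import Sequence


def gated_score(must_right: Sequence[bool], easy_wrong: Sequence[bool]) -> float:
    """Return 0.0 if any mandatory check fails, else the overall pass rate.

    Assumption: the pass rate over all criteria stands in for the paper's
    real aggregation, which this digest does not spell out.
    """
    if not all(must_right):
        return 0.0  # the gate: one missed essential fact zeroes the score
    checks = list(must_right) + list(easy_wrong)
    return sum(checks) / len(checks) if checks else 0.0


# Ten fine details right, one essential fact wrong: plain averaging would
# reward this answer with 10 of 11 checks, the gate gives it nothing.
print(gated_score([False], [True] * 10))

# Essential fact right, two fine-grained slips: 9 of 11 checks pass.
print(gated_score([True], [True] * 8 + [False] * 2))
```

The design choice worth noticing is that the gate is applied before any averaging, so no number of correct minor details can compensate for a missed mandatory fact.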

The headline result is a persistent gap between open and closed models: Seed-2.0-Lite scores 70.07 overall, while the best open-source model in the reported table, Qwen3.5-397B-A17B, scores 61.61. The benchmark also reports a Pearson correlation of 0.916 and a Spearman correlation of 1.000 with Vision Arena human preference, both higher than DOCCI or DetailCaps in the extracted comparison. 10
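
A Spearman correlation of 1.000 alongside a Pearson correlation below 1 simply means the two rankings agree perfectly while the score gaps are not proportional. The sketch below shows that pattern on made-up numbers; the `benchmark` and `human_pref` lists are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """Ranks from 1 to n. Assumes no ties, which holds for the toy data."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = float(rank)
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for four models: same ordering, uneven gaps.
benchmark = [40.0, 50.0, 60.0, 70.0]
human_pref = [1000.0, 1010.0, 1050.0, 1200.0]

print(spearman(benchmark, human_pref))  # rankings agree exactly
print(pearson(benchmark, human_pref))   # noticeably below 1
```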

Takeaway: open it if you evaluate MLLMs on dense screenshots, documents, UI, STEM figures, or puzzles. The most useful idea is the mandatory-fact gate: a model that gets ten small details right but misses the one needed fact should not receive a high perceptual score.

Code / resources: Project page. 10

P3Sim: Perceptual 3D Simulation With Physical World Modeling

Area tag: Vision / 3D world modeling
arXiv: 2606.27575
Authors / institutions: Wanhee Lee, Klemen Kotar, Rahul Mysore Venkatesh, Jared Watrous, Daniel L. K. Yamins; Stanford University is listed in the extracted HTML. 11
Peer-review status: arXiv comments say it was published as a CVPR 2026 conference paper. 5

P3Sim frames scene simulation as conditional inference over multimodal local scene variables: RGB, depth, optical flow, partial 3D transforms, and persistent scene memory. The system combines a 7B decoder-only physical world model, a geometric conditioning module that supplies target depth and flow, and a memory module that keeps a globally consistent 3D estimate over time. 11

The reported benchmark numbers are concrete. On SEVA novel-view synthesis, P3Sim reaches PSNR 21.54 on RE10K, 15.18 on LLFF, and 15.50 on DTU, beating ViewCrafter and SEVA in the extracted table. On 3DEditBench object manipulation, it reports PSNR 23.12, LPIPS 0.121, and EA 0.827 versus LightningDrag at PSNR 19.52, LPIPS 0.184, and EA 0.722. 11
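
To get a feel for what a PSNR gap means, the standard definition (10 times the base-10 log of peak-squared over mean squared error) can be inverted. The sketch below treats the two reported 3DEditBench averages as if they were single-image values with the same peak intensity, which is a simplifying assumption; the helper names are hypothetical.

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Standard PSNR in dB: 10 * log10(MAX^2 / MSE)."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_psnr_gap(gap_db: float) -> float:
    """Factor by which MSE shrinks when PSNR rises by gap_db (same MAX)."""
    return 10.0 ** (gap_db / 10.0)


# P3Sim versus LightningDrag on 3DEditBench, numbers from the digest.
gap = 23.12 - 19.52
print(round(gap, 2))                           # gap in dB
print(round(mse_ratio_from_psnr_gap(gap), 2))  # implied MSE reduction factor
```

Under those assumptions, the 3.6 dB gap corresponds to a bit more than a twofold reduction in mean squared error.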

Takeaway: this is the vision paper to open if you care about world models that do more than image-to-video. It tries to keep geometry explicit while letting the learned model fill the missing parts of perception.

Code / resources: no code or project page was found in the extracted arXiv HTML. 11

ActiveScope: Actively Seeking and Correcting Perception for MLLMs

Area tag: Vision-language reasoning
arXiv: 2606.24292
Authors / institutions: Yajing Wang, Chao Bi, Junshu Sun, Shufan Shen, Zhaobo Qi, Shuhui Wang, Qingming Huang; the extracted HTML did not expose a clear affiliation block. 12
Peer-review status: ICML 2026 appears in the arXiv comments, but the comment does not explicitly say accepted, so I treat it as a conference-tagged preprint. 9

ActiveScope is a training-free correction loop for high-resolution image understanding. Semantic Anchor Localization extracts target-specific attention maps from the query, while Interference-Suppressed Refinement masks wrong regions after failed verification and forces attention to redistribute before the final answer. 12
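
The localize, verify, and suppress cycle described above can be sketched as a small loop. This is a toy under loud assumptions: attention is reduced to one weight per named region, verification is an arbitrary callback, and `seek_and_correct` is a hypothetical name. It is not the authors' implementation.

```python
from typing import Callable, Dict, Optional


def seek_and_correct(
    attention: Dict[str, float],
    verify: Callable[[str], bool],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first region that passes verification, or None."""
    weights = dict(attention)  # work on a copy
    for _ in range(max_rounds):
        if not weights:
            return None
        region = max(weights, key=weights.get)  # localize: strongest region
        if verify(region):                      # recheck the evidence
            return region
        del weights[region]                     # suppress the wrong region
        total = sum(weights.values())
        if total > 0:                           # redistribute the attention
            weights = {r: w / total for r, w in weights.items()}
    return None


# Hypothetical regions; the most-attended one is a distractor.
toy_attention = {"top_left": 0.5, "center": 0.3, "bottom_right": 0.2}
print(seek_and_correct(toy_attention, verify=lambda r: r == "center"))
```

With one round the loop gives up on the distractor; with two or more it recovers the correct region after masking.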

On Qwen3VL-4B, ActiveScope reports 96.34 accuracy on V* Bench, 78.13 on HR-Bench 4K, 74.75 on HR-Bench 8K, and 50.02 on MME-RealWorld-Lite. The same table lists the regular Qwen3VL-4B baseline at 89.53, 76.88, 71.75, and 46.74, respectively. 12

Takeaway: this is a practical read for anyone trying to improve MLLM perception without retraining. Its limitation is also practical: rectangular crops may be a poor fit for irregular or non-convex targets. 12

Code / resources: GitHub repository. 12

LLM internals, safety, and systems

Robust Harmful Features Under Jailbreak Attacks

Area tag: LLM safety / mechanistic interpretability
arXiv: 2606.28153
Authors / institutions: Yanchen Yin, Dongqi Han, Linghui Li; the extracted HTML lists funding from the National Natural Science Foundation of China and Beijing University of Posts and Telecommunications but does not expose a full affiliation block. 13
Peer-review status: arXiv comments say accepted at ICML 2026 as an oral presentation. 3

The paper separates two attention-head types behind jailbreak behavior. Adversarially Compromised Heads are early-layer heads that attacks suppress; Safety-Aligned Heads are mid-layer heads that keep firing even when the model produces a harmful response. The method back-traces a refusal direction through attention OV circuits and then classifies heads by distribution shifts across benign, refused harmful, and successfully jailbroken prompts. 13
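
A toy version of the classification step might look like the following. It assumes each head has already been summarized by one number per prompt group (for example, its mean contribution along the refusal direction); the `HeadStats` container, the `margin` threshold, and the head names are all hypothetical, and the paper's actual statistical test is not reproduced here.

```python
from typing import Dict, NamedTuple


class HeadStats(NamedTuple):
    """Mean refusal-direction activation of one head per prompt group."""
    benign: float
    refused: float     # harmful prompts the model refused
    jailbroken: float  # harmful prompts that succeeded as jailbreaks


def classify_head(stats: HeadStats, margin: float = 0.5) -> str:
    """Label a head by how its activation shifts across the three groups."""
    fires_on_harm = stats.refused - stats.benign > margin
    survives_attack = stats.jailbroken - stats.benign > margin
    if fires_on_harm and survives_attack:
        return "safety-aligned"  # keeps firing even when the attack works
    if fires_on_harm:
        return "compromised"     # the attack suppresses it
    return "other"


# Hypothetical heads with made-up activations.
heads: Dict[str, HeadStats] = {
    "L2H5": HeadStats(benign=0.1, refused=1.2, jailbroken=0.2),
    "L14H3": HeadStats(benign=0.1, refused=1.1, jailbroken=1.0),
    "L20H7": HeadStats(benign=0.1, refused=0.2, jailbroken=0.1),
}
labels = {name: classify_head(s) for name, s in heads.items()}
print(labels)
```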

The ablations are striking. Suppressing only eight compromised heads raises attack success from 0% to 95.0% on Llama-3-8B and 81.6% on Llama-2-7B; random-head interventions reach only 4.0% and 10.2%. A training-free detector that reads persistent activations reaches weighted Macro-F1 0.888 on Llama-3-8B across ten safety datasets, beating LlamaGuard3, LlamaGuard4, Qwen3Guard, and WildGuard in the extracted comparison. 13

Takeaway: the paper is useful if you build white-box safety monitors. Its detector needs internal activations, and the authors note that the analysis is mostly on attention heads rather than MLP layers. 13

Code / resources: no public code link was found in the extracted content. 13

CAT-Q: Cost-efficient and Accurate Ternary Quantization for LLMs

Area tag: LLM compression
arXiv: 2606.26650
Authors / institutions: Shigeng Wang, Chao Li, Yangyuxuan Kang, Jiawei Fan, Anbang Yao; the arXiv source did not expose a clear affiliation block in the extracted section. 14
Peer-review status: arXiv comments say accepted to ICML 2026 as an oral. 6

CAT-Q is a post-training ternary quantization method. It combines learnable modulation, which reshapes weight distributions and thresholds before ternarization, with softened ternarization, a differentiable-to-hard transition that stabilizes optimization. 14
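
For readers new to ternary quantization, the sketch below shows a plain threshold-based ternarizer and a generic smooth surrogate that approaches it as a sharpness parameter grows. The 0.7 times mean-absolute-weight threshold is the heuristic from Ternary Weight Networks, swapped in here as a baseline; neither function is CAT-Q's learnable modulation or its softened ternarization, and all names are hypothetical.

```python
import math
from typing import List, Tuple


def hard_ternarize(weights: List[float]) -> Tuple[List[int], float, float]:
    """Map weights to codes in {-1, 0, +1} plus a scale and the threshold."""
    delta = 0.7 * sum(abs(w) for w in weights) / len(weights)
    codes = [(w > delta) - (w < -delta) for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    alpha = sum(kept) / len(kept) if kept else 0.0  # scale for nonzero codes
    return codes, alpha, delta


def soft_ternarize(weights: List[float], delta: float, sharpness: float) -> List[float]:
    """Differentiable stand-in: two tanh steps centered at -delta and +delta."""
    return [
        0.5 * (math.tanh(sharpness * (w - delta)) + math.tanh(sharpness * (w + delta)))
        for w in weights
    ]


toy_weights = [0.9, -0.8, 0.05, -0.1, 0.6]
codes, alpha, delta = hard_ternarize(toy_weights)
print(codes, round(alpha, 4), round(delta, 4))

# A gentle slope gives fractional values; a steep slope matches the hard codes.
print([round(v, 3) for v in soft_ternarize(toy_weights, delta, sharpness=5.0)])
print([round(v) for v in soft_ternarize(toy_weights, delta, sharpness=1000.0)])
```

Annealing the sharpness from small to large is one common way to move from a differentiable relaxation to hard codes during optimization.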

The cost claim is the reason to read it. CAT-Q quantizes 1.7B-8B LLMs using 512 calibration samples, roughly 1 million tokens, and the authors compare that with BitNet 1.58-bit models trained on 100B tokens. They also report scaling to 14B-235B pre-trained models in 8 to 60 hours on 8 A100-80GB GPUs. 14
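
The scale of that cost claim is easier to see as arithmetic. The snippet below uses only the figures quoted above and rounds "roughly 1 million tokens" to exactly 1,000,000, which is an assumption.

```python
calib_samples = 512
calib_tokens = 1_000_000         # "roughly 1 million", rounded (assumption)
bitnet_tokens = 100_000_000_000  # 100B training tokens, as quoted

tokens_per_sample = calib_tokens / calib_samples
data_ratio = bitnet_tokens / calib_tokens

# 8 to 60 wall-clock hours on 8 GPUs, converted to GPU-hours.
gpu_hours = [hours * 8 for hours in (8, 60)]

print(tokens_per_sample)  # about 2k tokens per calibration sample
print(data_ratio)         # five orders of magnitude less data
print(gpu_hours)          # low and high end for the 14B-235B models
```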

Takeaway: CAT-Q is relevant if you want a ternary path that starts from existing checkpoints rather than retraining a BitNet-like model. The core comparison to check in the full paper is how the ternary models trade accuracy against hardware deployment constraints.

Code / resources: GitHub repository. 14

Dustin: Draft-Augmented Sparse Verification for Efficient Long-Context Generation with Speculative Decoding

Area tag: LLM systems / long-context inference
arXiv: 2606.24957
Authors / institutions: WenHung Lee, Jian-Jia Chen, Xiaolin Lin, Pei-Shuo Wang, Chi-Chih Chang, Chun-Che Yang, Ning-Chi Huang, Grace Li Zhang, Kai-Chiang Wu; the extracted HTML did not expose affiliations. 15
Peer-review status: arXiv comments say accepted to ICML 2026. 7

Dustin targets a specific bottleneck in speculative decoding for long-context, multi-batch LLM serving: verification can become dominated by KV-cache loading. It keeps protected sink and recent tokens, then selects the remaining KV tokens using draft-model lookahead signals and target-model historical attention, with online scoring restricted to a small set of semantic retrieval heads. 15
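
The selection rule can be sketched as follows: always keep the sink and recent positions, then spend the rest of the budget on the highest-scoring remaining tokens. The weighted sum used to combine the two signals, the `mix` parameter, and the function name are assumptions made for illustration; the paper's actual scoring, including the retrieval-head restriction, is not reproduced.

```python
from typing import List, Sequence


def select_kv_tokens(
    draft_scores: Sequence[float],
    history_scores: Sequence[float],
    budget: int,
    num_sink: int = 4,
    num_recent: int = 4,
    mix: float = 0.5,
) -> List[int]:
    """Return sorted token positions to load during verification."""
    n = len(draft_scores)
    # Protected positions: the first few (sink) and the last few (recent).
    protected = set(range(min(num_sink, n))) | set(range(max(n - num_recent, 0), n))
    remaining = max(budget - len(protected), 0)
    candidates = [i for i in range(n) if i not in protected]
    # Hypothetical combination of draft lookahead and historical attention.
    combined = {
        i: mix * draft_scores[i] + (1.0 - mix) * history_scores[i]
        for i in candidates
    }
    top = sorted(candidates, key=lambda i: combined[i], reverse=True)[:remaining]
    return sorted(protected | set(top))


# Toy context of 12 tokens with two clearly important middle positions.
draft = [0.0] * 12
history = [0.0] * 12
draft[5], history[5], history[8] = 0.9, 0.1, 0.8
selected = select_kv_tokens(draft, history, budget=6, num_sink=2, num_recent=2)
print(selected)
```

Note that this only chooses which cached entries to read; nothing is evicted, which matches the digest's point that the method targets latency rather than memory footprint.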

On Qwen2.5-72B at 32k context length, Dustin reports 27.85x self-attention speedup and 9.17x end-to-end decoding speedup at batch size 16. On LongBench with a 128-token KV budget, the extracted result gives average accuracy 52.41 versus 47.08 for the best competing compression baseline. 15
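
As a rough sanity check on how the two speedups relate, Amdahl's law can be solved for the share of baseline decoding time spent in self-attention. This assumes self-attention is the only component that gets faster and everything else is unchanged, which the digest does not confirm; treat the result as a back-of-the-envelope estimate, not a reported number.

```python
def attention_fraction(attn_speedup: float, end_to_end_speedup: float) -> float:
    """Solve Amdahl's law, 1/S = (1 - f) + f/s, for the accelerated share f."""
    return (1.0 - 1.0 / end_to_end_speedup) / (1.0 - 1.0 / attn_speedup)


# Speedups quoted for Qwen2.5-72B at 32k context, batch size 16.
f = attention_fraction(attn_speedup=27.85, end_to_end_speedup=9.17)
print(round(f, 3))
```

Under that assumption, self-attention would account for a little over nine tenths of baseline decoding time, consistent with the claim that KV-cache loading dominates verification at long context.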

Takeaway: Dustin is a systems paper for serving teams, not a general KV-compression solution. The authors state that it indexes the full history and does not reduce KV-cache memory footprint, so the gain is verification latency rather than fitting longer contexts into the same GPU memory. 15

Code / resources: no public code link was found in the extracted content. 15

Natural Identifiers for Privacy and Data Audits in Large Language Models

Area tag: LLM privacy / data auditing
arXiv: 2606.24408
Authors / institutions: Lorenzo Rossi, Bartłomiej Marek, Franziska Boenisch, Adam Dziedzic; the arXiv abstract page did not expose affiliations. 8
Peer-review status: accepted at ICLR 2026. 8

The paper proposes natural identifiers: structured random strings such as cryptographic hashes and shortened URLs that already occur in web-scale training data. Because their format can generate unlimited same-distribution random strings, they can act as alternative canaries for post-hoc differential privacy audits and as non-member held-out data for dataset inference. 8
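
The property that makes these strings useful is that fresh ones with the same format are trivial to generate. The sketch below produces SHA-256-style hex strings and short-URL-style strings using the standard library; the domain, slug length, and alphabet are hypothetical, and the paper's own sampling procedure may differ.

```python
import hashlib
import secrets
import string


def random_sha256_hex() -> str:
    """A fresh 64-character hex string in the format of a SHA-256 digest."""
    return hashlib.sha256(secrets.token_bytes(32)).hexdigest()


def random_short_url(domain: str = "example.com", length: int = 7) -> str:
    """A short-URL-style string; domain and slug format are placeholders."""
    alphabet = string.ascii_letters + string.digits
    slug = "".join(secrets.choice(alphabet) for _ in range(length))
    return f"https://{domain}/{slug}"


# Fresh strings like these cannot have appeared in any training set, so they
# can play the role of non-member data in a post-hoc audit.
non_members = [random_sha256_hex() for _ in range(3)]
print(non_members)
print(random_short_url())
```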

The abstract does not provide the main numeric table, so I would not treat this as a results-first paper from the arXiv page alone. Its value is the audit setup: it removes two common blockers, retraining with inserted canaries and constructing private IID non-member data for a suspect dataset. 8

Takeaway: read this if you audit trained models after the fact. The entry point is less a new metric than a new source of naturally distributed audit strings.

Code / resources: no public code link was found on the extracted arXiv abstract page. 8

Agents and ML methods

LLawCo: Learning Laws of Cooperation for Modeling Embodied Multi-Agent Behavior

Area tag: Agents / embodied cooperation
arXiv: 2606.28182
Authors / institutions: Qinhong Zhou, Chuang Gan, Anoop Cherian; the extracted paper text states that Qinhong Zhou conducted part of the work during an internship at MERL and that Anoop Cherian was supported by MERL. 16
Peer-review status: arXiv comments say accepted to ICML 2026. 2

LLawCo is a failure-driven alignment loop for communicative embodied agents. Failed episodes are analyzed for behavioral mismatches, the system derives high-level laws such as when to talk or wait, and successful law-aligned traces are used for supervised fine-tuning. 16

Across four backbone LLMs, LLawCo reports a 4.5 percentage-point average success-rate improvement over the strongest communicative baseline on the proposed PARTNR-Dialog benchmark and a 6.8-point improvement over the strongest baseline on TDW-MAT. In the extracted ablation, removing the laws drops Qwen-3-14B success on PARTNR-Dialog by 7 points. 16

Takeaway: this paper is a useful bridge between post-hoc reflection and deployable agent policy. The explicit laws make the behavior inspectable, but the authors also note that laws induced from task success still need human validation before deployment. 16

Code / resources: MERL research highlight. 16

RECAST: Model Reconstruction via Counterfactual-Aware Wasserstein Geometry under Limited Data

Area tag: ML methods / model auditing
arXiv: 2606.27948
Authors / institutions: Xuan Zhao, Lena Krieger, Zhuo Cao, Arya Bangun, Hanno Scharr, Ira Assent; the extracted HTML lists Helmholtz Association and Jülich Supercomputing Centre support but does not expose full affiliations. 17
Peer-review status: arXiv comments say accepted at ICML 2026. 4

RECAST reconstructs a black-box binary classifier when the auditor has limited labeled samples and one-sided counterfactual explanations. Instead of treating counterfactuals as hard labels, it uses them as soft cross-class evidence and learns Wasserstein barycentric prototypes for the two classes. 17
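
To make "Wasserstein barycentric prototype" less abstract, here is the one-dimensional special case, where the 2-Wasserstein barycenter of two equal-size empirical samples is obtained by averaging their sorted values. RECAST works in higher dimensions and weights counterfactuals as soft evidence, so this is only an illustration of the underlying geometry; the function name and the toy samples are hypothetical.

```python
from typing import List, Sequence


def barycenter_1d(
    a: Sequence[float], b: Sequence[float], weight_a: float = 0.5
) -> List[float]:
    """1D W2 barycenter of two equal-size samples via quantile averaging."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample counts")
    return [
        weight_a * x + (1.0 - weight_a) * y
        for x, y in zip(sorted(a), sorted(b))
    ]


# Hypothetical 1D feature values: labeled samples and counterfactual points.
labeled = [0.0, 1.0, 2.0]
counterfactuals = [10.0, 12.0, 11.0]
print(barycenter_1d(labeled, counterfactuals))                # equal weights
print(barycenter_1d(labeled, counterfactuals, weight_a=1.0))  # ignores b
```

Lowering the weight on one sample set pulls the prototype toward the other, which is one simple way to picture treating counterfactuals as soft rather than hard evidence.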

The extracted results emphasize low-query behavior. With 100 instances per class, RECAST is reported as consistently higher-fidelity than SAMPLES and CCA across four real datasets and three target model families, with more stable behavior under different counterfactual generators and distribution noise. The extraction did not expose one compact headline score, so treat this as a method-and-robustness paper rather than a single-number leaderboard result. 17

Takeaway: open RECAST if your audit setting is constrained: offline access, noisy counterfactuals, and too few queries to fit a high-capacity surrogate safely.

Code / resources: GitHub repository. 17