
Top-conf paper digest — week of June 5–11, 2026
Twelve arXiv papers posted June 5–8 with confirmed top-conference acceptance or preprint submission: ICML 2026 (five main papers plus one Spotlight), ACL 2026, RLC 2026, ICLR 2026, KDD 2026, IJCV 2026, and one NeurIPS 2026 submission. Areas covered: Agents, LLM, Generative models, Vision/Video, RL, ML Methods, and Scientific ML.

Research at a glance
Twelve papers posted June 5–8 on arXiv with confirmed top-conference acceptance or submission, grouped by research area.
Agents
Q-Evolve: self-improving LLM agents via in-distribution RL
Area: Agents · Venue: ICML 2026 · arXiv: 2606.07367
Authors: Yudi Zhang, Meng Fang, Zhenfang Chen, Mykola Pechenizkiy
Long-horizon LLM agents struggle with credit assignment when rewards only arrive at episode end. Q-Evolve handles this by jointly learning an in-distribution critic and a process-reward signal — in each iteration it trains a value function from a hybrid dataset mixing expert demonstrations with agent trajectories, derives per-step advantages, and then runs behavior-proximal policy optimization over the same distribution. The key claim: iterative self-improvement without distribution shift, because supervision and policy stay in the same in-distribution loop. Evaluated on AlfWorld, WebShop, and ScienceWorld, Q-Evolve outperforms strong baselines on sample efficiency and task completion rate. Prior work such as ReAct and Reflexion relies on heuristic or human-provided process rewards; Q-Evolve automates this labeling. No code repo listed at submission. 1
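To make the credit-assignment step concrete, here is a minimal sketch of how per-step advantages can be derived from a learned value function when the only reward arrives at episode end. The one-step TD form, the discount `gamma`, and the name `td_advantages` are assumptions for illustration; the digest does not say which advantage estimator Q-Evolve uses.

```python
from typing import List


def td_advantages(values: List[float], final_reward: float, gamma: float = 1.0) -> List[float]:
    """One-step TD advantages for an episode whose only reward arrives at the end.

    values[t] is the critic's estimate V(s_t) for each visited state.
    The value after the terminal step is taken to be 0.
    """
    advantages = []
    last = len(values) - 1
    for t, v in enumerate(values):
        reward = final_reward if t == last else 0.0
        next_value = 0.0 if t == last else values[t + 1]
        advantages.append(reward + gamma * next_value - v)
    return advantages


# Hypothetical critic outputs for a 4-step episode that ends in success (reward 1).
values = [0.2, 0.5, 0.4, 0.9]
print(td_advantages(values, final_reward=1.0))
```

In this toy episode the second step receives a negative advantage because the critic's estimate drops after it, even though the episode as a whole succeeds. That per-step signal is what a sparse terminal reward alone cannot provide.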
LLM
MDP-GRPO: fixing GRPO instability under discrete rewards
Area: LLM · Venue: ACL 2026 Main · arXiv: 2606.06058
Authors: Mohammad Mahdi Salmani-Zarchi, Zahra Rahimi, Heshaam Faili, Mohammad Javad Dousti
Standard GRPO becomes pathological when rewards are discrete and low-dispersion: within-group reward distributions are often homogeneous, so z-score normalization produces either zero gradients or amplified noise. The paper formalizes three failure modes (low-variance amplification, mean-centering blindness, zero-variance collapse) and addresses them with four changes: multi-temperature sampling to spread the reward distribution, dual-anchor advantages to restore gradients in homogeneous groups, prospect-theoretic shaping based on Kahneman–Tversky loss, and asymmetric KL regularization. On FollowBench and IFEval, MDP-GRPO improves strict constraint satisfaction by up to 5% on Llama-3.2-3B over standard GRPO while preserving general capability on MMLU and ARC. It also trains stably with small group sizes, which is useful when compute per rollout is limited. Code not linked at submission. 2
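The failure modes are easy to reproduce with the standard group-relative advantage, the z-score of each reward within its rollout group. The sketch below shows only the baseline behaviour, not MDP-GRPO's fixes; the use of the population standard deviation and the `eps` value are assumptions, since implementations differ on both.

```python
from statistics import mean, pstdev
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: z-score of each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Zero-variance collapse: every rollout in the group received the same reward,
# so every advantage is zero and the group contributes no gradient.
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))

# Mean-centering blindness: an all-fail group looks identical to an all-pass group.
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))

# Low-variance amplification: a 0.01 reward gap is normalized to almost the
# same advantage as a full 1.0 gap.
print(grpo_advantages([1.0, 1.0, 1.0, 0.99]))
print(grpo_advantages([1.0, 1.0, 1.0, 0.0]))
```

With discrete rewards and small groups, homogeneous groups like the first two are common, which is why the paper treats them as a structural problem rather than an edge case.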
Generative models
GILC: plug-and-play guidance for discrete diffusion
Area: Generative · Venue: ICML 2026 · arXiv: 2606.06303
Authors: Hongkun Dou, Zike Chen, Fengji Li, Hongjue Li, Yue Deng
Controlling discrete diffusion (DNA, protein, molecule generation) without retraining is hard because gradient signals are unstable in high-dimensional discrete spaces. GILC (Gradient-Informed Logit Correction) sidesteps this by using the pretrained denoiser as a variational proxy and applying a Jacobian-free correction directly to the clean prediction logits — no backprop through the full diffusion chain required. It supports both differentiable and non-differentiable reward functions. Results across DNA sequence design, protein sequence generation, and molecular generation show GILC at or above fine-tuned baselines without any additional training. The Jacobian-free design is the notable departure from classifier guidance approaches that require computing score Jacobians. 3
PhaseLock: preserving motion physics in video diffusion
Area: Generative · Venue: ICML 2026 · arXiv: 2606.06361
Authors: Woojung Han, Seil Kang, Youngjun Jun, Min-Hung Chen, Fu-En Yang, Seong Jae Hwang
Image-to-video diffusion models generate visually convincing frames but frequently violate physical motion. The paper makes a surprising observation: a 2-step diffusion output often has better physical consistency than a 50-step output from the same model. Via spectral analysis, the authors show the phase component of the latent (which encodes motion structure) degrades by ~18% from step 2 to step 50, while magnitude stays relatively stable. PhaseLock is a training-free framework that extracts the motion prior from just two denoising steps and enforces it throughout the full generation trajectory via Latent Delta Guidance. Across several video diffusion models, PhaseLock improves physical consistency scores by an average of 6.2 points with only 1.06× inference time and 1.02× memory overhead — a considerably lighter overhead than external guidance methods that run ~5× slower. 4
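The intuition that phase carries motion structure while magnitude carries appearance follows from the Fourier shift property, which the toy example below illustrates on a 1-D signal. This is not PhaseLock's analysis, which operates on diffusion latents and is not detailed in this digest; the signals and the naive `dft` helper are assumptions chosen to keep the sketch dependency-free.

```python
import cmath
import math
from typing import List


def dft(signal: List[float]) -> List[complex]:
    """Naive discrete Fourier transform (O(n^2)), adequate for a toy signal."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


# A 1-D "frame" with a bright blob at position 2, and the same blob moved to position 5.
frame_a = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
frame_b = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]

spec_a, spec_b = dft(frame_a), dft(frame_b)

magnitudes_a = [abs(c) for c in spec_a]
magnitudes_b = [abs(c) for c in spec_b]
phases_a = [cmath.phase(c) for c in spec_a]
phases_b = [cmath.phase(c) for c in spec_b]

# The magnitude spectra agree because the blob itself is unchanged;
# the phase spectra differ because the blob moved.
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(magnitudes_a, magnitudes_b)))
print([round(p, 3) for p in phases_a])
print([round(p, 3) for p in phases_b])
```

A translation changes only the phase spectrum, so a method that wants to preserve where things move has reason to protect phase specifically.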
GReinSS: policy gradients for discrete latent structure recovery
Area: Generative / Scientific ML · Venue: ICML 2026 · arXiv: 2606.07400
Authors: Stefan Ivanovic, Ge Liu, Mohammed El-Kebir
Recovering mechanistic latent states from indirect observations is a core challenge in computational biology and systems science. EM-based approaches don't scale to combinatorially large spaces; VAEs tend to produce artifacts rather than ground-truth latent structure. GReinSS frames this as policy learning with dynamically rescaled rewards, learning distributions over latent sets and graphs that maximize observed data likelihood. On simulated data it recovers latent sets and latent graphs more accurately than baselines. On real RNA sequencing data, GReinSS reconstructs isoforms from short-read data that match long-read sequencing results better than the RSEM baseline does, a concrete empirical anchor beyond synthetic benchmarks. The dynamic reward rescaling is the mechanism that enables stable training in combinatorially large latent spaces. 5
Vision and video
OMTG: one-to-many temporal grounding in video
Area: Vision / Video · Venue: ICML 2026 · arXiv: 2606.06294
Authors: Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li
Prior temporal grounding assumes a query maps to a single video segment. OMTG targets the harder one-to-many setting where a query can match multiple disjoint segments, requiring cardinality perception. State-of-the-art MLLMs optimized for one-to-one settings score near zero on this task. The paper introduces three contributions: a benchmark with two new metrics, Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1); a 56K-sample training dataset built via a chain-of-thought construction pipeline; and novel temporal and caption reward functions. The caption reward explicitly uses CoT reasoning over dense video captions to guide policy optimization toward both precision and completeness. The resulting model achieves 43.65% EtF1 on the benchmark, exceeding Gemini 2.5 Pro and Seed-1.8 by 15.85 and 15.61 percentage points respectively. 6
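The digest does not give formulas for C-Acc or EtF1, so the following is not an implementation of either. It is a hypothetical sketch of the two ingredients a one-to-many evaluation needs, temporal IoU between intervals and a comparison of predicted and true segment counts. The greedy matching rule, the 0.5 IoU threshold, and all names are assumptions.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_f1(pred: List[Segment], gold: List[Segment], iou_threshold: float = 0.5) -> float:
    """F1 over segments, using greedy one-to-one matching at an IoU threshold."""
    unmatched = list(gold)
    hits = 0
    for p in pred:
        best = max(unmatched, key=lambda g: temporal_iou(p, g), default=None)
        if best is not None and temporal_iou(p, best) >= iou_threshold:
            unmatched.remove(best)
            hits += 1
    if hits == 0:
        return 0.0
    precision, recall = hits / len(pred), hits / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy query with three disjoint ground-truth segments.
gold = [(3.0, 8.0), (20.0, 26.0), (41.0, 45.0)]
one_to_one_guess = [(3.0, 8.0)]  # returns a single segment only
one_to_many_guess = [(3.5, 8.0), (20.0, 25.0), (60.0, 62.0)]

for pred in (one_to_one_guess, one_to_many_guess):
    count_matches = len(pred) == len(gold)
    print(count_matches, round(segment_f1(pred, gold), 3))
```

A model that always returns one segment can be perfectly precise and still miss most of the answer, which is why a count term and a recall-sensitive score are evaluated together.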
StoryVideoQA: deep video understanding at scale
Area: Vision / Video · Venue: IJCV 2026 · arXiv: 2606.06338
Authors: Zhengqian Wu, Zhixian Liu, Aodong Chen, Jingyang Zhang, Ruizhe Li, Hanlin Ge, Zhongyuan Wang, Chunxia Xiao, Chao Liang
Existing VideoQA datasets focus on factoid questions; deep video understanding (DVU) requires comprehension of storylines spanning full TV episodes or movies. StoryVideoQA is the largest DVU dataset to date: 363K+ QA pairs over 393 hours of diverse story video (TV series averaging ~27 minutes; movies averaging ~131 minutes per clip). Construction uses StoryMindv2, a multi-agent framework with supervisor-guided generation and multi-reviewer voting. Evaluating 20 VideoQA methods on the benchmark reveals that none maintain long-range character associations or coherent storyline understanding at this scale. The paper also proposes PlotTree, a video understanding agent that reorganizes video into hierarchical plot structures for storyline reasoning. Code and project page available at github.com/nercms-mmap/StoryVideoQA. 7
DBD: adversarial attacks as test-time defenders for VLMs
Area: Vision / Robustness · Venue: ICLR 2026 · arXiv: 2606.06186
Authors: Liangsheng Liu, Si Chen, Jiamin Wu, Weiwei Feng, Zhixin Cheng, Xiaotian Yin, Wenfei Yang, Tianzhu Zhang
Standard adversarial defenses for VLMs (e.g., CLIP) require either retraining or expensive inference-time denoising. DBD (Directional Bias-guided Defense) starts from an empirical finding: under diverse input transformations, adversarial images in CLIP's feature space consistently shift along a single dominant direction, while clean images scatter. The paper argues this "Defense Direction" points back toward the correct class center, i.e., the adversarial perturbation itself encodes directional information about the true decision boundary. DBD estimates this direction at test time and applies a two-stream reconstruction strategy using a DB-score. Across 15 datasets, DBD reaches SOTA adversarial robustness while preserving clean accuracy — and in some cases adversarial accuracy exceeds clean accuracy, supporting the hypothesis that perturbations encode useful priors. No retraining required. 8
RL
Online KL-regularized RL under model misspecification
Area: RL · Venue: RLC 2026 · arXiv: 2606.06053
Authors: Haoyang Hong, Zichen Wang, Quanquan Gu, Huazheng Wang
KL-regularized RL (the basis of RLHF-style policy optimization) is typically analyzed under realizability. This paper studies what happens when the model is misspecified — i.e., when the hypothesis class doesn't contain the true model. The authors introduce KL misspecification formulations for contextual bandits and episodic RL, then analyze regression-based algorithms with Gibbs policy updates. The resulting high-probability regret bounds include explicit misspecification error terms and reduce to standard realizable bounds as a special case. This gives a theoretical foundation for understanding performance degradation in RLHF when reward models or policy classes are approximate, a common practical scenario. 9
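For a finite action set, the Gibbs policy update has a well-known closed form: the policy that maximizes expected reward minus `beta` times the KL divergence to a reference policy is proportional to the reference probability times `exp(reward / beta)`. The sketch below shows that update in isolation. The function name, the toy rewards, and the use of a single context are assumptions; in the paper's setting the rewards would be regression estimates that may be misspecified.

```python
import math
from typing import List


def gibbs_policy(ref_probs: List[float], rewards: List[float], beta: float) -> List[float]:
    """Maximizer of E_pi[r] - beta * KL(pi || pi_ref) over a finite action set.

    pi(a) is proportional to pi_ref(a) * exp(r(a) / beta).
    """
    shift = max(rewards)  # subtract the max reward for numerical stability
    weights = [p * math.exp((r - shift) / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


ref = [1 / 3, 1 / 3, 1 / 3]
estimated_rewards = [1.0, 0.0, 0.5]  # hypothetical regression output

# Large beta keeps the policy near the reference; small beta approaches greedy.
print(gibbs_policy(ref, estimated_rewards, beta=100.0))
print(gibbs_policy(ref, estimated_rewards, beta=0.05))
```

Because the policy depends on the estimated rewards through an exponential, errors in those estimates propagate directly into the action distribution, which is the effect the misspecification terms in the regret bounds account for.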
ML methods
TabSwift: efficient tabular foundation model (ICML Spotlight)
Area: ML Methods · Venue: ICML 2026 Spotlight · arXiv: 2606.07345
Authors: Si-Yang Liu, Han-Jia Ye
Recent tabular foundation models improve accuracy by adding architectural complexity, at the cost of inference latency. TabSwift revisits the minimal TabPFN design and shows that a row-wise attention-only backbone with two additions — gated attention stabilization and a small set of learnable register tokens — is competitive with heavier models (TabPFN v2, TabICL) on both classification and regression. An additional adaptive layer-wise early-exit mechanism allows dynamic adjustment of inference depth per sample at serving time. The result is a tabular in-context learner that is competitive on accuracy while substantially faster to deploy. Awarded Spotlight at ICML 2026. Code available at github.com/automl/AlphaPFN (via companion α-PFN repo). 10
CorSW: Sliced-Wasserstein for EEG domain generalization
Area: ML Methods / BCI · Venue: KDD 2026 · arXiv: 2606.06104
Authors: Chen Hu, Rui Wang, Jiale Zhou, Jingjun Yi, Shaocheng Jin, Yidong Song, Yefeng Zheng
EEG decoding pipelines commonly use covariance matrices as features, but covariance is sensitive to channel-wise scaling. Full-rank correlation matrices are scale-invariant but geometrically non-Euclidean, complicating Wasserstein-based distance computations. CorSW extends Sliced Wasserstein (SW) to correlation matrix manifolds via a Pullback Euclidean Metric framework, instantiating two correlation geometries (Off-Log Metric and Log-Scaled Metric). A domain generalization framework for EEG decoding built on CorSW shows improved generalization under distribution shifts across three EEG datasets with low training overhead and no additional inference cost. Code at github.com/ChenHu-ML/CorSW. 11
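The motivation for moving from covariance to correlation can be checked directly: rescaling one channel changes the covariance but leaves the correlation untouched. The sketch below demonstrates only that property on two made-up channels; it does not implement CorSW, the Pullback Euclidean Metric, or Sliced Wasserstein.

```python
from statistics import mean
from typing import List


def covariance(x: List[float], y: List[float]) -> float:
    """Population covariance of two equally long channels."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def correlation(x: List[float], y: List[float]) -> float:
    """Pearson correlation: covariance normalized by both standard deviations."""
    return covariance(x, y) / (covariance(x, x) * covariance(y, y)) ** 0.5


# Two toy EEG channels (made-up values); in the second recording, channel 1
# is amplified 10x, as might happen with a different electrode gain.
ch1 = [0.1, 0.4, -0.2, 0.3, -0.1]
ch2 = [0.0, 0.5, -0.1, 0.2, -0.3]
ch1_scaled = [10.0 * v for v in ch1]

# Covariance scales with the gain; correlation does not.
print(covariance(ch1, ch2), covariance(ch1_scaled, ch2))
print(correlation(ch1, ch2), correlation(ch1_scaled, ch2))
```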
Scientific ML
Reactive Flux Matching: data-driven reaction coordinates for molecular simulation
Area: Scientific ML · Venue: NeurIPS 2026 (submitted) · arXiv: 2606.06295
Authors: Rishal Aggarwal, David Ryan Koes, Nicholas M. Boffi, Eric Vanden-Eijnden
Path sampling methods generate reactive trajectories between molecular metastable states, but extracting mechanistic insight from trajectory ensembles is non-trivial. Reactive Flux Matching learns two objects directly from reactive path data without knowing the underlying dynamics: a current velocity u(z) whose streamlines trace dominant reaction pathways, and a scalar potential h(z) from a weighted Helmholtz–Hodge decomposition that serves as a data-driven reaction coordinate. Both quantities minimize quadratic functionals analogous to flow matching objectives in generative modeling. Unlike committor-based methods, u and h remain well-defined under non-Markovian projections onto collective variables. The method is validated on molecular systems for current velocity generation and rate constant estimation. Submitted to NeurIPS 2026 (preprint). 12