Top-conf Paper Digest — Week of May 22–28, 2026

Papers from arXiv cs.LG and cs.CV posted this week, tagged by research area. Acceptance status is noted where confirmed in the arXiv submission record; entries without a confirmed venue are preprints with unknown review status.

LLM

ZO fine-tuning as an inference workload — 8× speedup on OPT-13B

arXiv:2605.28760 · Zelin Li, Caiwen Ding · Preprint

Zeroth-order (ZO) fine-tuning avoids backpropagation by repeatedly scoring the model under nearby parameter states. The standard practice is to run this loop inside a conventional training framework, even though most of the compute is just structured forward passes. This paper reframes ZO fine-tuning as an inference workload and routes the scoring phase through vLLM.

Method. The dominant cost in ZO algorithms (e.g., LoZO, MeZO) is a large number of forward-pass evaluations under slightly perturbed weights. The paper shows these can be batched and served by an inference runtime rather than iterated one-by-one in a training loop. LoRA adapter states are represented as dynamic inference state, so the same vLLM serving path handles the repeated scoring with batching and KV-cache reuse.
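To make the "forward passes only" point concrete, here is a minimal sketch of a MeZO-style two-point zeroth-order step on a toy quadratic. The names `loss`, `zo_step`, and `run`, the toy objective, and the hyperparameters are illustrative assumptions, not the paper's code; the vLLM batching path and LoRA state handling are not shown.

```python
import random

def loss(theta):
    # Toy stand-in for one forward pass: quadratic with minimum at (1, -2).
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2

def zo_step(theta, rng, lr=0.05, eps=1e-3):
    # Sample one random direction z, score the model at theta + eps*z and
    # theta - eps*z (two forward passes, no backprop), then step along z.
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = loss([t + eps * zi for t, zi in zip(theta, z)])
    minus = loss([t - eps * zi for t, zi in zip(theta, z)])
    g = (plus - minus) / (2.0 * eps)  # scalar projected-gradient estimate
    return [t - lr * g * zi for t, zi in zip(theta, z)]

def run(steps=2000, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(steps):
        theta = zo_step(theta, rng)
    return theta

initial_loss = loss([0.0, 0.0])
final_theta = run()
final_loss = loss(final_theta)
```

Every update consumes only two loss evaluations, which is why the scoring calls are candidates for batching in an inference runtime.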

Results vs. baseline. On OPT-13B / SST-2, the vLLM path completes a 20k-step LoZO run in 0.51 hours versus 4.15 hours for the official LoZO baseline — an 8.13× speedup — while achieving 0.922 final eval accuracy. Across OPT-1.3B to OPT-13B, the same reorganization yields 2.34–7.72× speedups. A MeZO-style high-rank experiment tracks a comparable loss trajectory at up to 2.55× faster.

Key takeaway. ZO fine-tuning's bottleneck is inference throughput, not gradient computation. Routing it through a serving runtime instead of a training loop is a near-zero-cost engineering change that removes most of the wall-clock penalty, which makes ZO methods practical for adapting large models without the GPU memory that backpropagation requires.

PEFT-Arena: orthogonal finetuning sits on the best stability-plasticity frontier

arXiv:2605.28819 · Yangyi Huang et al. (incl. Bernhard Schölkopf, Weiyang Liu) · Preprint

Most PEFT evaluations report downstream task accuracy and stop there. This paper argues that forgetting pretrained general capabilities is equally important and introduces a benchmark that measures both sides simultaneously.

Method. PEFT-Arena evaluates methods (LoRA, full fine-tuning, orthogonal finetuning, and others) on a stability-plasticity axis: how much does each method gain on the target task versus how much does it degrade on held-out general-capability tasks? The authors analyze the difference geometrically in two spaces: (1) weight space — spectral analysis of PEFT update matrices relative to pretrained singular-value structure; (2) activation space — whether finetuning preserves isometric representation structure for general-capability inputs.
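The weight-space view can be illustrated with a toy 2×2 case. The sketch below assumes orthogonal finetuning acts as a multiplicative orthogonal transform of the pretrained weight, and compares it with a rank-1 additive update of the kind LoRA produces. The matrices and the helper `singular_values_2x2` are hypothetical illustrations, not PEFT-Arena code.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def singular_values_2x2(w):
    # Singular values are the square roots of the eigenvalues of W^T W.
    (a, b), (c, d) = w
    tr = a * a + b * b + c * c + d * d      # trace of W^T W
    det = (a * d - b * c) ** 2              # determinant of W^T W
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    return (math.sqrt((tr + disc) / 2.0),
            math.sqrt(max((tr - disc) / 2.0, 0.0)))

W = [[3.0, 0.0], [0.0, 1.0]]                # toy "pretrained" weight
angle = 0.7
R = [[math.cos(angle), -math.sin(angle)],
     [math.sin(angle), math.cos(angle)]]    # orthogonal (rotation) matrix
W_orth = matmul(R, W)                       # multiplicative orthogonal update
delta = [[0.5, 0.5], [0.5, 0.5]]            # rank-1 additive update
W_add = [[W[i][j] + delta[i][j] for j in range(2)] for i in range(2)]

sv_pre = singular_values_2x2(W)
sv_orth = singular_values_2x2(W_orth)       # spectrum unchanged
sv_add = singular_values_2x2(W_add)         # spectrum shifted
```

In this toy case the orthogonal update leaves the singular values of the pretrained weight untouched, while the additive update moves both of them.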

Key finding. Under comparable parameter budgets, orthogonal finetuning achieves the most favorable Pareto frontier across downstream gain and capability retention. Forgetting correlates with non-isometric representation distortion in activation space. Final SFT checkpoints often overshoot a better stability-plasticity operating point; rewinding along the training path after the fact recovers it.
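For readers unfamiliar with the term, a Pareto frontier over (downstream gain, retention) pairs can be computed as in the sketch below. The method labels "A" through "D" and their numbers are made-up placeholders, not results from the paper.

```python
def pareto_front(points):
    # points: {name: (downstream_gain, retention)}; higher is better on both.
    # A point is dominated if another point is at least as good on both axes
    # and strictly better on at least one.
    front = []
    for name, (g, r) in points.items():
        dominated = any(
            g2 >= g and r2 >= r and (g2 > g or r2 > r)
            for other, (g2, r2) in points.items() if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)

# Hypothetical placeholder values, chosen only to exercise the function.
toy = {"A": (5.0, 0.90), "B": (7.0, 0.70), "C": (4.0, 0.80), "D": (6.0, 0.95)}
front = pareto_front(toy)
```

Here "B" and "D" remain on the frontier: neither beats the other on both axes, while "A" and "C" are each dominated.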

Key takeaway. Comparing PEFT methods only on target accuracy obscures trade-offs that matter in production: a method that scores 2 points higher on a benchmark but degrades general reasoning or instruction-following may be the wrong choice. Orthogonal finetuning's geometric constraint appears to be the mechanism behind its superior retention.

Transformers provably learn to internalize Chain-of-Thought

arXiv:2605.28600 · Yixiao Huang, Hanlin Zhu, Zixuan Wang, Jiantao Jiao, Stuart Russell, Somayeh Sojoudi, Song Mei · Preprint

Implicit Chain-of-Thought (ICoT) trains models to hide intermediate reasoning inside hidden states, eliminating inference-time token generation for each reasoning step. Until now, theory for ICoT was absent.

Method. The paper provides the first theoretical analysis of ICoT. It proposes a Log-ICoT curriculum: instead of removing thinking tokens one at a time (standard ICoT, which requires O(k) stages for k-parity), Log-ICoT removes them in geometric chunks, cutting training stages to O(log k). An L-layer transformer trained under this curriculum learns k-parity with poly(n) samples and L = log₂ k stages — matching the sample efficiency of explicit CoT while eliminating its inference-time token cost.

Comparison. Standard ICoT removes one thinking token per stage, requiring k stages. Log-ICoT requires log₂ k stages. Both match the sample complexity of explicit CoT. The prior theory covered only single-layer architectures; this result extends to multi-layer transformers.
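The stage counts can be sketched with a toy schedule. The halving rule below is an assumption about what "geometric chunks" means, and both functions stop once a single token remains, so exact counts may differ by one from the paper's convention.

```python
def standard_icot_schedule(k):
    # One thinking token removed per stage, until a single token remains.
    return list(range(k - 1, 0, -1))

def log_icot_schedule(k):
    # Assumed geometric rule: halve the remaining thinking tokens each stage.
    remaining, schedule = k, []
    while remaining > 1:
        remaining = (remaining + 1) // 2   # ceil division handles odd counts
        schedule.append(remaining)
    return schedule

k = 16
standard = standard_icot_schedule(k)   # 15, 14, ..., 1 tokens remaining
geometric = log_icot_schedule(k)       # 8, 4, 2, 1 tokens remaining
```

For k = 16 the one-at-a-time schedule needs 15 stages under this convention, while the halving schedule needs 4, which equals log₂ 16.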

Key takeaway. ICoT is not just an empirical speed trick — it has a provable theoretical basis. The Log-ICoT curriculum suggests a principled schedule for training future models to internalize reasoning chains more efficiently, with direct implications for inference-time compute budgets.

Vision

NEO-ov: native one-vision model without external encoders

arXiv:2605.28820 · Haiwen Diao et al. (incl. Dahua Lin, Ziwei Liu) · Preprint · Code: EvolvingLMMs-Lab/NEO

Most vision-language models (VLMs) fuse a separate image encoder with a language decoder via multi-stage alignment. Pixel-level signals get fragmented across frames and early pixel-word interactions are delayed by the modular boundary. Native VLMs — those that learn visual and language representations in a single model without external encoders — have shown strong results on single images but have not been seriously explored for multi-image and video settings.

Method. NEO-ov is a native foundation model that learns cross-frame and pixel-word correspondence end-to-end, with no external encoder, auxiliary adapter, or post-hoc fusion module. All spatiotemporal modeling emerges inside a single model trained directly on pixels and text. The paper also includes detailed architectural analyses and training recipes for native multimodal modeling.

Comparison. NEO-ov narrows the accuracy gap to modular counterparts (which benefit from encoder pretraining) while outperforming them on fine-grained visual perception tasks. The result suggests native architectures are not just theoretically appealing but competitive at scale.

Key takeaway. Eliminating encoder-decoder module boundaries is feasible at scale. If native architectures can close the remaining gap with modular VLMs, future models may train end-to-end on raw pixels without needing separately pretrained vision encoders — simplifying deployment and potentially improving cross-frame consistency in video.

SSR3D-LLM: structured spatial reasoning for 3D grounding

arXiv:2605.28490 · Jiawei Li et al. · Preprint

In 3D scene understanding, fine-grained grounding — finding a specific chair among many chairs using relational language like "the red one to the left of the table" — requires ruling out same-class candidates by reasoning through context. Unified 3D-LLMs that compress this into a single pointer selection are brittle on such queries.

Method. Given fixed Mask3D object proposals, the LLM in SSR3D-LLM writes a sequence of latent spatial reasoning steps from the query. A geometry-aware scorer reads these steps in order, refining candidate rankings step-by-step with step-length masking. The latent steps are learned from standard benchmark supervision plus auxiliary referential-cue supervision during training; inference uses only the input query and Mask3D proposals.
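A toy version of step-by-step candidate refinement is sketched below. It assumes that "step-length masking" means ignoring padded steps beyond the number of reasoning steps actually produced; the additive scoring rule, the function `refine_ranking`, and the numbers are hypothetical, not the paper's scorer.

```python
def refine_ranking(step_scores, n_valid_steps):
    # step_scores[t][c]: score that reasoning step t assigns to candidate c.
    # Steps at index >= n_valid_steps are padding and are masked out.
    n_candidates = len(step_scores[0])
    totals = [0.0] * n_candidates
    for t, scores in enumerate(step_scores):
        if t >= n_valid_steps:          # step-length mask
            continue
        totals = [acc + s for acc, s in zip(totals, scores)]
    best = max(range(n_candidates), key=lambda c: totals[c])
    return best, totals

# Toy query: "the red chair to the left of the table", three chair proposals.
step_scores = [
    [1.0, 1.0, 1.0],   # step 1: is a chair
    [1.0, 0.0, 1.0],   # step 2: is red
    [0.0, 0.0, 1.0],   # step 3: is left of the table
    [0.0, 5.0, 0.0],   # padding step with junk scores
]
best_masked, totals_masked = refine_ranking(step_scores, n_valid_steps=3)
best_unmasked, _ = refine_ranking(step_scores, n_valid_steps=4)
```

With the mask applied, candidate 2 wins because it satisfies every step; without it, the junk padding step flips the choice to candidate 1.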

Results. SSR3D-LLM achieves the strongest results among unified 3D-LLM baselines on ReferIt3D, ScanRefer, and Multi3DRef, with substantial gains over the single-pointer QPG baseline on fine-grained queries, while preserving standard language-task performance.

Key takeaway. The paper's central insight is that fine-grained spatial grounding fails when compressed into a single selection step — multi-step latent reasoning with a geometry-aware ranker is more reliable. The approach doesn't require additional test-time object proposals beyond Mask3D.

RL

ProRL: fixing gradient estimation for RL-based proactive recommendation (ICML 2025)

arXiv:2605.28293 · Hongru Hou et al. · Accepted: ICML 2025 · Code: hongruhou89/ProRL

Proactive recommender systems (PRSs) aim to guide users toward target items through a sequence of intermediate recommendations. RL is a natural fit for this — the full path reward captures both step-level acceptance and long-term guidance. But direct policy gradient application to PRS produces biased, high-variance gradient estimates.

Method. ProRL identifies two specific gradient deficiencies: (1) path-level rewards decompose into step-level rewards with positive mean, creating a length-dependent bias that makes gradients favor longer paths over better paths; (2) weighting each step by the full path reward ignores reward decomposition structure, raising variance. Two mechanisms address this: Stepwise Reward Centering subtracts expected rewards to neutralize length bias (path extension yields zero expected gradient); Position-Specific Advantage Estimation uses step-dependent baselines derived from the decomposition structure to cut variance.
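The length bias is easy to reproduce in a toy simulation. The sketch below assumes Bernoulli step rewards with a known positive mean and a constant per-step baseline; ProRL derives its baselines from the reward decomposition, so `path_weights` and `mean_first_step_weight` are simplified stand-ins, not the released implementation.

```python
import random

def path_weights(rewards, baselines):
    # Naive policy-gradient weight: every step uses the full path reward.
    naive = [sum(rewards)] * len(rewards)
    # Centered, position-specific weight: step t only sees centered rewards
    # from position t onward (reward-to-go minus per-step baselines).
    centered = [r - b for r, b in zip(rewards, baselines)]
    positional = [sum(centered[t:]) for t in range(len(rewards))]
    return naive, positional

def mean_first_step_weight(length, n_paths=20000, mean_reward=0.6, seed=0):
    rng = random.Random(seed)
    naive_total = positional_total = 0.0
    for _ in range(n_paths):
        # Step rewards: 1.0 if the recommendation is accepted, else 0.0.
        rewards = [1.0 if rng.random() < mean_reward else 0.0
                   for _ in range(length)]
        naive, positional = path_weights(rewards, [mean_reward] * length)
        naive_total += naive[0]
        positional_total += positional[0]
    return naive_total / n_paths, positional_total / n_paths

short_naive, short_centered = mean_first_step_weight(length=3)
long_naive, long_centered = mean_first_step_weight(length=9)
```

The average naive weight grows roughly in proportion to path length (about 1.8 versus 5.4 in expectation here), while the centered weight stays near zero for both lengths, so extending a path no longer earns gradient credit by itself.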

Results. ProRL significantly outperforms state-of-the-art PRSs on three real-world datasets (specific margins in the paper). Peer review status: accepted at ICML 2025.

Key takeaway. The bias and variance problems are structural — they arise from the reward decomposition in any multi-step PRS, not from particular dataset choices. Fixing gradient estimation at the algorithm level (rather than network architecture) is sufficient to get large gains across datasets. The two mechanisms are modular and applicable to other RL sequential recommendation setups.

Extrapolative weight averaging navigates correctness-efficiency frontiers in code RL

arXiv:2605.28751 · Kunhao Zheng et al. (incl. Gabriel Synnaeve) · Preprint

RL training for competitive programming enforces both functional correctness (test cases pass) and computational efficiency (time/memory limits). Higher test-case coverage during training improves correctness on hard problems but increases efficiency failures — the two objectives trade off.

Method. Starting from a shared initialization, checkpoints are trained under nested unit-test coverage: low-coverage checkpoints only need to pass small-input tests; high-coverage checkpoints must pass tests up to the full suite. This sweep reveals a correctness-efficiency Pareto frontier. Linear interpolation between low- and high-coverage checkpoints recovers this frontier; extrapolation beyond the trained endpoints extends it to checkpoints not reachable by additional RL training.
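Interpolation and extrapolation between two checkpoints reduce to one line of weight arithmetic. In the minimal sketch below, the parameter dictionaries hold scalars for readability (real checkpoints would hold tensors), and the name `blend` and the mixing coefficient `alpha` are illustrative assumptions.

```python
def blend(theta_low, theta_high, alpha):
    # alpha in [0, 1] interpolates between the two checkpoints;
    # alpha < 0 or alpha > 1 extrapolates past a trained endpoint
    # along the same line in weight space.
    return {name: (1.0 - alpha) * theta_low[name] + alpha * theta_high[name]
            for name in theta_low}

theta_low = {"w": 1.0, "b": 0.0}     # toy low-coverage checkpoint
theta_high = {"w": 3.0, "b": 0.5}    # toy high-coverage checkpoint

midpoint = blend(theta_low, theta_high, 0.5)   # between the endpoints
beyond = blend(theta_low, theta_high, 1.5)     # past the high-coverage end
```

Setting `alpha` to 0 or 1 returns the original checkpoints, so a single sweep over `alpha` covers both the interpolated frontier and its extrapolated continuation.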

Results. The frontier and its extrapolative continuation appear across three inference settings (pure reasoning, tool use, agentic coding) and two model scales (32B and 7B). Ensembles with extrapolative weight averaging improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget.

Key takeaway. Nested unit-test coverage induces a geometric frontier in checkpoint space that weight space arithmetic can navigate and extend without retraining. Extrapolated checkpoints are complementary policies — they solve different problems at the hard end — making them more valuable for inference-time scaling ensembles than independent restarts would be.

Agents

LearnWeak: weakness-targeted specialization for small computer-use agents

arXiv:2605.28775 · Suji Kim, Kangsan Kim, Sung Ju Hwang · Preprint

Small open computer-use agents (CUAs) are more practical deployment targets than large expert models, but they show large, uneven performance gaps across software domains. Simply generating large-scale training data for the target domain produces only marginal improvements.

Method. LearnWeak is an annotation-free specialization framework: a stronger reference agent is used to identify the student's domain-specific weaknesses, synthesize targeted tasks around those failure modes, and construct supervision automatically. An error-aware objective disentangles planning errors from execution errors during training, enabling more targeted updates than broad uniform supervision.

Results. On OSWorld, LearnWeak achieves average gains of 11.6 pp over EvoCUA-8B and 11.1 pp over OpenCUA-7B across eight domains. Student-aware dataset generation outperforms existing autonomous trajectory generation baselines; the error-aware training objective outperforms standard behavior cloning on the synthesized data.

Key takeaway. The key shift is moving from domain-scale data generation to student-aware weakness targeting: knowing which tasks the student fails and why (planning vs. execution) before synthesizing training data. The 11 pp average gain over strong 7–8B baselines suggests that targeted data synthesis, rather than more data overall, is the lever for specializing small agents efficiently.

Gamma-World: multi-agent world modeling that scales beyond two players

arXiv:2605.28816 · Fangfu Liu et al. (incl. Sanja Fidler, Jun Gao, Igor Gilitschenski) · Preprint

Interactive video world models have operated almost entirely in single-agent settings. Extending to multiplayer environments — where multiple players or robots act simultaneously in a shared space — requires agents that are independently controllable, permutation-symmetric, and scalable in number.

Method. Two architectural contributions:

(1) Simplex Rotary Agent Encoding: a parameter-free extension of 3D RoPE that places each agent at a vertex of a regular simplex in rotary angle space. Each agent gets a distinct phase, but all agents are permutation-equivalent, with no learned per-slot identities or fixed ordering.
(2) Sparse Hub Attention: learnable hub tokens mediate token interaction across agents, reducing cross-agent attention from O(N²) to O(N) in the number of agents.
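Both ideas have simple geometric and counting analogues, sketched below. The simplex construction is the standard centered-basis one; how the paper maps vertices onto rotary phases is not described here, so this only checks the equidistance property that makes agents interchangeable. The link count in `cross_agent_links` is an agent-level simplification with a hypothetical `n_hubs` parameter.

```python
import math

def simplex_vertices(n_agents):
    # Standard construction: unit basis vectors in R^n, shifted so the
    # centroid sits at the origin. Every pair of vertices is sqrt(2) apart.
    c = 1.0 / n_agents
    return [[(1.0 if i == j else 0.0) - c for j in range(n_agents)]
            for i in range(n_agents)]

def pairwise_distances(points):
    return [math.dist(p, q)
            for i, p in enumerate(points) for q in points[i + 1:]]

def cross_agent_links(n_agents, n_hubs=1):
    dense = n_agents * (n_agents - 1)   # every agent attends to every other
    hub = 2 * n_agents * n_hubs         # agent -> hub and hub -> agent only
    return dense, hub

distances = pairwise_distances(simplex_vertices(4))
dense_links, hub_links = cross_agent_links(8)
```

Because all pairwise distances are equal, relabeling the agents leaves the geometry unchanged, and the hub pattern grows linearly with the number of agents instead of quadratically.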

For real-time rollout, a full-context diffusion teacher is distilled into a causal student with KV caching, enabling action-responsive generation at 24 FPS.

Results. In multiplayer virtual environments, Gamma-World improves video fidelity, action controllability, and inter-agent consistency over slot-based and dense-attention baselines. The model generalizes from two to four players without additional training.

Key takeaway. The simplex encoding and sparse hub attention together solve the two core scalability problems in multi-agent world models: identity representation and communication cost. Generalizing to four players with no retraining suggests the architecture may scale further; the 24 FPS causal distillation makes it practical for interactive sim applications.

All papers are preprints unless otherwise noted. ProRL (arXiv:2605.28293) is confirmed accepted at ICML 2025 per the submission record. Peer-review status for all other entries: unknown.
