Five diffusion papers worth reading today (May 29, 2026)

Friday's last pre-weekend batch (262 cs.CV + 524 cs.LG scanned, 25 diffusion candidates). Two clusters dominate: inference-time spectral/noise control (CNS's training-free colored-noise SDE delivers 17–30% FID reduction across three architectures; Spectral Guidance unifies label/CLIP/mask guidance under a singular-function framework at +37pp accuracy and 4× speed) and alignment/efficiency at scale (AGSM fixes SoftREPA's counting failures via score-level guidance, with over 35% better counting accuracy on GenEval; Veda achieves 5.1× end-to-end speedup on a 12B video DiT; GDSD eliminates ELBO training-inference mismatch in diffusion LLM RL fine-tuning at up to +19.6% accuracy). All five are preprints; GDSD is the only one with a day-one code release.

ArXiv Diffusion Models Digest
29/5/2026 · 22:23

Research overview

Friday's batch is the last before ArXiv's weekend gap. Of 262 cs.CV and 524 cs.LG new submissions scanned, 25 diffusion-related candidates were identified and ranked; five made the cut. The standout cluster this batch is inference-time spectrum and noise control — two papers independently revisit the stochastic part of diffusion sampling that prior work largely left fixed. A second cluster covers alignment and efficiency: fixing score-matching T2I failures, distilling sparse attention for video DiTs at production scale, and removing a systematic bias from RL fine-tuning of diffusion LLMs.
Ranking signals: FID reduction magnitude and architectural breadth (CNS), phase-transition theory plus 4× speed gain (Spectral Guidance), direct GenEval counting improvement over SoftREPA with no reward model (AGSM), wall-clock speedup on a 12B production video model (Veda), ELBO bias elimination with code release (GDSD).

1. CNS: training-free colored-noise SDE sampling with consistent FID gains across architectures

ArXiv: 2605.30332 | Hadar Davidson; Noam Issachar; Sagie Benaim | cs.CV
Peer-review status: Preprint.
Standard SDE samplers — Euler-Maruyama, ancestral sampling — inject isotropic white noise uniformly across all frequencies at every denoising timestep. That uniform budget ignores a well-known property of the generative trajectory: diffusion models resolve low-frequency structure (global layout, coarse shapes) early, and high-frequency detail (textures, edges) late. Pouring equal noise energy into already-resolved bands is wasteful. 1
CNS replaces the uniform schedule with a dynamic, frequency-dependent colored noise schedule that allocates injected energy toward bands the model is currently resolving. The method is entirely training-free and plug-and-play: it modifies only the noise schedule during sampling, with no architectural changes and no additional networks. As Davidson et al. write, "Conventional stochastic differential equation (SDE) solvers fail to account for this dynamic, naively injecting uniform white noise throughout the entire process and misusing the finite energy budget." 1
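To make the idea of frequency-dependent noise concrete, here is a minimal NumPy sketch. It is not the authors' schedule: the Gaussian band-pass shape, the linear `band_center` schedule, and every function and parameter name below are assumptions chosen for illustration. It only shows how white Gaussian noise can be reshaped in the Fourier domain so that its energy sits in a chosen band and then rescaled to unit variance.

```python
import numpy as np

def band_center(t, f_lo=0.05, f_hi=0.4):
    """Hypothetical schedule: t=1 (start of sampling) targets low frequencies,
    t=0 (end of sampling) targets high frequencies."""
    return f_hi + (f_lo - f_hi) * t

def colored_noise(shape, center, width, rng):
    """Gaussian noise with unit sample variance whose power is concentrated
    around the radial frequency `center` (in cycles per pixel)."""
    h, w = shape
    white = rng.standard_normal((h, w))
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx**2 + fy**2)
    # Assumed band-pass gain; it is symmetric in frequency, so the output stays real.
    gain = np.exp(-0.5 * ((radius - center) / width) ** 2)
    shaped = np.fft.ifft2(np.fft.fft2(white) * gain).real
    return shaped / shaped.std()

def low_frequency_power_fraction(x, cutoff=0.15):
    """Share of spectral power below `cutoff` cycles per pixel."""
    power = np.abs(np.fft.fft2(x)) ** 2
    fy = np.fft.fftfreq(x.shape[0])[:, None]
    fx = np.fft.fftfreq(x.shape[1])[None, :]
    return power[np.sqrt(fx**2 + fy**2) < cutoff].sum() / power.sum()

rng = np.random.default_rng(0)
for t in (1.0, 0.5, 0.0):
    eps = colored_noise((64, 64), band_center(t), width=0.03, rng=rng)
    print(f"t={t:.1f}  low-frequency power share={low_frequency_power_fraction(eps):.3f}")
```

In an SDE sampler, a draw like `eps` would stand in for the white-noise term at each step; how CNS actually chooses and normalizes the spectrum is specified in the paper, not here.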
Benchmark results — ImageNet-256 unguided FID:
| Model | Standard SDE | CNS | Reduction |
|---|---|---|---|
| SiT-XL/2 | 8.26 | 6.27 | 24.1% |
| JiT-B/16 | 32.39 | 26.69 | 17.6% |
| JiT-H/16 | 11.88 | 8.31 | 30.1% |
Consistent relative improvements carry over to CFG-guided runs as well. The SiT-XL/2 unguided FID of 6.27 is competitive with results reported by dedicated distillation methods on the same benchmark. 1
Code/resources: No repository listed. Project page: hadardavidson.github.io/CNS
Why read it: CNS is arguably the most deployment-ready paper in this batch. It drops into an existing SDE sampler without touching model weights, training data, or architecture. A 17–30% FID reduction with no retraining is a strong argument for adopting it as a default in any inference pipeline that currently uses ancestral or Euler-Maruyama sampling. The remaining open question is whether the colored schedule carries over to large text-to-image latent models (FLUX, SD3) as cleanly as it does on the ImageNet-256 SiT and JiT models; the project page should clarify that.

2. Spectral Guidance: singular-function-based control unifying label, CLIP, and mask guidance

ArXiv: 2605.28900 | Gabriel Moreira; Manuel Marques; João Paulo Costeira (IST Lisbon); Chenyan Xiong (CMU) | cs.LG
Peer-review status: Preprint.
Guidance in diffusion models typically comes in three forms: classifier guidance (requires a trained classifier + denoiser backpropagation), classifier-free guidance (requires conditional training), or training-free approaches (often heuristic, slow, or limited to specific signal types). Spectral Guidance provides a theoretical unification by characterizing what information survives progressive noise corruption. 2
The core idea: as data is corrupted by noise, the informative features for control do not vanish uniformly — they concentrate in a low-dimensional subspace characterized as the singular functions of a conditional expectation operator. Moreira et al. show that these singular functions can be recovered via self-supervised learning, and that the resulting basis supports projection of arbitrary guidance signals — class labels, CLIP embeddings, spatial masks — directly onto the sampling trajectory without retraining or denoiser backpropagation. A secondary result is empirical: the framework reveals a phase transition in the generative process, identifying a time window during which guidance is maximally effective. 2
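A toy discrete analogue may help with the "singular functions of a conditional expectation operator" phrasing. This is not the paper's construction: the four-state variable, the two-group noise channel, and all names below are assumptions. For a discrete joint distribution P(x, y), the singular values of the map f ↦ E[f(X) | Y] are those of the matrix P(x, y) / sqrt(p(x) p(y)), and the sketch shows that under noise some features (here, group membership) survive far longer than others.

```python
import numpy as np

def conditional_expectation_singular_values(joint):
    """Singular values of f -> E[f(X) | Y] for a discrete joint P(x, y)."""
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    # Dividing by sqrt(p(x) p(y)) writes the operator in orthonormal bases
    # of L2(p_x) and L2(p_y), so a plain SVD gives its singular values.
    q = joint / np.sqrt(np.outer(px, py))
    return np.linalg.svd(q, compute_uv=False)

def hierarchical_channel(within, across):
    """Hypothetical noise channel on 4 states split into 2 groups.
    `within`: probability of flipping to the partner in the same group.
    `across`: probability of flipping to each state of the other group."""
    stay = 1.0 - within - 2.0 * across
    return np.array([
        [stay, within, across, across],
        [within, stay, across, across],
        [across, across, stay, within],
        [across, across, within, stay],
    ])

prior = np.full(4, 0.25)  # uniform clean variable X
for within, across in [(0.0, 0.0), (0.2, 0.02), (0.4, 0.05)]:
    joint = prior[:, None] * hierarchical_channel(within, across)  # P(x, y)
    sv = conditional_expectation_singular_values(joint)
    print(f"within={within} across={across} singular values={np.round(sv, 3)}")
```

In this toy channel the trivial singular value stays at 1, the group-level one equals 1 − 4·across, and the two within-group ones equal 1 − 2·within − 2·across, so the coarse feature dominates once noise grows. That mirrors, in miniature, the claim that controllable information concentrates in a low-dimensional subspace.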
Benchmark results — CIFAR-10 (vs. strongest training-free baseline):
| Metric | Spectral Guidance | Best prior training-free baseline |
|---|---|---|
| Conditional accuracy | +37 percentage points | baseline reference |
| Sampling speed | 4× faster | baseline reference |
The same representation that enables label guidance also supports mask-based spatial control without auxiliary models. 2
Code/resources: No repository or project page listed at time of writing.
Why read it: The combination of a clean mathematical foundation and strong empirical gains is relatively rare in training-free guidance work. The +37pp accuracy gap over the strongest prior training-free method is large enough to be practically meaningful, not just a leaderboard footnote. More interestingly, the phase-transition result is independent of whether you use this specific guidance method — it says something fundamental about when guidance is useful in any diffusion sampler, and that observation is likely to seed follow-on sampler design work. The CMU co-authorship (Chenyan Xiong) is a reasonable signal of NLP/retrieval community crossover interest.

3. AGSM: fixing SoftREPA's counting and repetition failures via score-level alignment

ArXiv: 2605.30038 | Jaa-Yeon Lee; Yeobin Hong; Taesung Kwon; Jong Chul Ye (KAIST) | cs.LG / cs.AI / cs.CV
Peer-review status: Preprint. Project page: jaayeon.github.io/AGSM
SoftREPA introduced contrastive learning on soft text tokens to improve text-image alignment in diffusion models. The approach works for general fidelity, but the contrastive formulation has a specific pathology: excessive penalization of negative pairs causes the model to over-count objects and generate repetitive elements. These aren't subtle artifacts — they are characteristic failure cases that show up reliably across SD backbone versions. 3
AGSM (Alignment-Guided Score Matching) fixes this by moving the alignment intervention from the representation space to the score function itself. Rather than applying contrastive loss on embeddings, AGSM assigns alignment directions at the score level — telling the model which direction in score space improves text-image agreement at each denoising step. This integration into score matching is what prevents over-penalization: the score function is the diffusion model's native mathematical object, and corrections in that space are more coherent than post-hoc embedding constraints. The method is reward-free and refines only soft tokens, keeping it lightweight. 3
Benchmark results — GenEval:
  • Counting accuracy: over 35% improvement relative to the SoftREPA baseline 3
  • Overall quality metrics: matches SoftREPA on general generation quality
  • Compatible backbones: SD1.5, SDXL, SD3
As Lee et al. state: "Our method matches SoftREPA while substantially improving its failure cases, achieving over 35% improvement in counting accuracy on the GenEval benchmark." 3
Code/resources: No repository listed. Project page above.
Why read it: If you use SoftREPA or are evaluating it, AGSM is a direct upgrade on its worst failure modes. The reward-free constraint matters in practice: there is no reward model to train and no RLHF pipeline to set up. The method also positions itself as complementary to existing RL-based post-training (DDPO, Diffusion-DPO), meaning it can be stacked without conflict. Jong Chul Ye's group at KAIST has a consistent track record of delivering usable improvements to diffusion post-training, and having a project page up at submission time is a positive signal for reproducibility.

4. Veda: distilled sparse attention for production-scale video diffusion transformers

ArXiv: 2605.30325 | Shihao Han; Hao Yang; Xinting Hu; Xiaofeng Mei; Yi Jiang; Xiaojuan Qi (HKU) | cs.CV
Peer-review status: Preprint.
Attention's O(n²) cost is the primary bottleneck for long video generation in DiT-based models. Sparse attention approaches address this in theory, but most published results report FLOPs reductions rather than actual wall-clock speedup — the gap between the two is large when the sparse mask does not map cleanly to hardware-efficient tile operations. Veda closes that gap with a principled approach: instead of choosing sparsity patterns heuristically (local windows, strided patterns, random pruning), it formulates tile selection as a reconstruction problem against full attention. 4
The key finding from Han et al.: "generation quality is determined not by the sparsity ratio itself, but by how well the sparse mask aligns with the tile-wise geometry of full attention." Veda operationalizes this with statistics-aware tile scoring and head-aware tiling to minimize reconstruction error, then distills the resulting pattern into a hardware-efficient tile-skipping kernel that converts sparsity into actual throughput gains. 4
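The "tile selection as reconstruction against full attention" framing can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not Veda's method: tile probability mass stands in for the paper's statistics-aware scoring, there is no head-aware tiling or custom kernel, and the function names are hypothetical. The sketch keeps the highest-mass key tiles for each query tile and measures how far the sparse output drifts from full attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tile_mask_from_full_attention(q, k, tile, keep):
    """For each query tile, keep the `keep` key tiles holding the most
    full-attention probability mass (a stand-in tile score)."""
    n, d = q.shape
    probs = softmax(q @ k.T / np.sqrt(d))            # full attention, (n, n)
    nt = n // tile
    # Sum probability inside every (query tile, key tile) block.
    mass = probs.reshape(nt, tile, nt, tile).sum(axis=(1, 3))
    top = np.argsort(-mass, axis=1)[:, :keep]
    tile_mask = np.zeros((nt, nt), dtype=bool)
    np.put_along_axis(tile_mask, top, True, axis=1)
    return tile_mask, probs

def sparse_attention_error(q, k, v, tile, keep):
    """Relative error of tile-sparse attention output versus full attention."""
    tile_mask, probs = tile_mask_from_full_attention(q, k, tile, keep)
    # Expand the tile-level mask to token resolution.
    mask = np.repeat(np.repeat(tile_mask, tile, axis=0), tile, axis=1)
    scores = np.where(mask, q @ k.T / np.sqrt(q.shape[1]), -np.inf)
    sparse_out = softmax(scores) @ v
    full_out = probs @ v
    return np.linalg.norm(sparse_out - full_out) / np.linalg.norm(full_out)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((32, 8)) for _ in range(3))
for keep in (1, 2, 4):
    err = sparse_attention_error(q, k, v, tile=8, keep=keep)
    print(f"kept key tiles per query tile: {keep}/4  relative error: {err:.4f}")
```

Because whole tiles are either computed or skipped, a mask of this shape maps onto block-wise GPU work, which is the property the paper relies on to turn sparsity into wall-clock gains.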
Benchmark results — Waver-T2V-12B, 720P 10-second video:
| Metric | Baseline | Veda |
|---|---|---|
| End-to-end speedup | 1.0× | 5.1× |
| Self-attention speedup | 1.0× | 10.5× |
| Attention overhead | 92% | 50% |
Speedup scales favorably with sequence length — longer videos and higher resolutions benefit proportionally more, which is the right direction for production deployments. Results on Wan2.1 also show substantial acceleration. 4
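As a rough consistency check on the reported figures, Amdahl's law relates the self-attention speedup to the end-to-end one. The sketch assumes that "attention overhead" means attention's share of wall-clock time and that nothing outside attention changes; neither assumption is confirmed by the digest, and the function names are illustrative.

```python
def amdahl_speedup(attn_fraction, attn_speedup):
    """End-to-end speedup when only the attention share of runtime is accelerated."""
    return 1.0 / ((1.0 - attn_fraction) + attn_fraction / attn_speedup)

def attention_share_after(attn_fraction, attn_speedup):
    """Attention's share of the new, shorter runtime."""
    accelerated = attn_fraction / attn_speedup
    return accelerated / ((1.0 - attn_fraction) + accelerated)

baseline_share = 0.92   # reported attention overhead before Veda
kernel_speedup = 10.5   # reported self-attention speedup

print(f"ideal end-to-end speedup: {amdahl_speedup(baseline_share, kernel_speedup):.2f}x")
print(f"attention share afterwards: {attention_share_after(baseline_share, kernel_speedup):.1%}")
```

Under these assumptions the ideal end-to-end figure is about 6.0× with attention at roughly 52% of the remaining runtime, close to the reported 5.1× and 50%. A measured number somewhat below the ideal bound is what one would expect once kernel launch and non-attention overheads are included.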
Code/resources: No repository listed at time of writing.
Why read it: A 5.1× end-to-end speedup on a 12B model for 720P video, reported without quality degradation, is a significant practical result; most sparse attention papers neither operate at this scale nor report end-to-end numbers. Reducing attention overhead from 92% to 50% changes the cost structure of video generation: attention stops being the dominant bottleneck, and further speedups must come from the other model components as well. Groups deploying video DiTs (Wan, HunyuanVideo, or similar) should treat Veda as a candidate drop-in acceleration layer, particularly for production inference where wall-clock latency, not just FLOPs, is the constraint.

5. GDSD: eliminating ELBO bias from RL fine-tuning of diffusion language models

ArXiv: 2605.29398 | Xiaohang Tang; Keyue Jiang; Che Liu; Qifang Zhao; Xiaoxiao Xu; Sangwoong Yoon; Ilija Bogunovic | cs.LG / cs.AI
Peer-review status: Preprint. Code: github.com/GaryBall/GDSD
Diffusion language models (dLLMs) — models like LLaDA and Dream that generate text via iterative denoising rather than autoregressive prediction — benefit from RL fine-tuning for the same reasons autoregressive LLMs do: reward signals can steer generation beyond what maximum-likelihood training captures. The standard approach borrows GRPO or similar policy gradient methods, but with a substitution: the intractable policy likelihood is replaced by an ELBO estimate, computed from randomly masked sequences. That substitution introduces a training-inference mismatch (TIM) bias, because the ELBO is computed under the masking distribution used at training time, not the one used at inference. 5
GDSD (Guided Denoiser Self-Distillation) sidesteps this entirely. Tang et al. derive the closed-form optimum of a reverse-KL regularized RL objective, then reframe fine-tuning as directly distilling the denoiser logits to match this teacher — no ELBO estimation, no masking-distribution mismatch. The method reinterprets existing ELBO-based approaches as instances of applying different distillation divergences, each with identifiable failure modes; GDSD's normalization-free objective avoids those modes. Training reward dynamics are also more stable compared to ELBO baselines, which is a practical benefit for anyone doing iterative fine-tuning. 5
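For readers unfamiliar with the closed-form optimum mentioned above: for the standard objective E_π[r] − β·KL(π ‖ π_ref), the maximizer is π*(y) ∝ π_ref(y)·exp(r(y)/β). The sketch below checks this on a toy categorical distribution. It assumes that this standard form is the one meant by "reverse-KL regularized", and it does not reproduce GDSD's denoiser-level distillation loss; all names and numbers are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def tilted_teacher(ref_logits, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), computed in logit space."""
    return softmax([l + r / beta for l, r in zip(ref_logits, rewards)])

def objective(pi, ref, rewards, beta):
    """E_pi[r] - beta * KL(pi || ref) for categorical distributions."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, ref) if p > 0)
    return expected_reward - beta * kl

ref_logits = [0.5, 0.0, -0.5, 1.0]   # toy reference policy over 4 tokens
rewards = [1.0, 0.0, 2.0, -1.0]      # toy per-token rewards
beta = 0.5

ref = softmax(ref_logits)
teacher = tilted_teacher(ref_logits, rewards, beta)
print("teacher:", [round(p, 3) for p in teacher])
print("objective(teacher):", round(objective(teacher, ref, rewards, beta), 4))
print("objective(ref):    ", round(objective(ref, ref, rewards, beta), 4))
```

Because the teacher is available in closed form up to normalization, a student can be trained to match it directly in logit space, which is the general shape of the distillation view the paper describes.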
Benchmark results — planning, math, and coding benchmarks:
  • LLaDA-8B and Dream-7B: up to +19.6% test accuracy vs. prior SOTA ELBO-based methods 5
  • Training reward dynamics: more stable than ELBO baselines across all tested configurations
Code/resources: github.com/GaryBall/GDSD
Why read it: dLLMs are a live research front, and RL alignment for them is still being figured out. GDSD provides a principled diagnosis of why current approaches underperform — they all inherit TIM bias from the ELBO substitution — and a clean fix. The +19.6% accuracy gain over the current SOTA is not marginal. Code is available at submission, which is the right move for a method-correctness paper: the claim is that existing approaches are systematically biased, and releasing code lets reviewers and replicators verify the argument directly.

Quick reference

| Paper | ArXiv ID | Core method | Venue | Code |
|---|---|---|---|---|
| CNS | 2605.30332 | Frequency-dependent colored noise schedule; drop-in SDE replacement | Preprint | Project page only |
| Spectral Guidance | 2605.28900 | Singular-function basis via self-supervision; unified label/CLIP/mask guidance; phase-transition discovery | Preprint | Not released |
| AGSM | 2605.30038 | Score-level alignment guidance; fixes SoftREPA counting/repetition; reward-free | Preprint | Project page only |
| Veda | 2605.30325 | Tile-wise reconstruction-optimal sparse attention; hardware-efficient kernel; 5.1× on Waver-T2V-12B | Preprint | Not released |
| GDSD | 2605.29398 | Denoiser self-distillation via reverse-KL RL; eliminates ELBO TIM bias; LLaDA-8B +19.6% | Preprint | GitHub |
The five papers connect more than they might appear to: CNS and Spectral Guidance both identify overlooked structure in the diffusion sampling process (spectral energy budget, informative feature subspace) that prior work treated as a black box. AGSM, Veda, and GDSD each address a known limitation in an active deployment context (T2I alignment, video DiT inference cost, dLLM RL fine-tuning) rather than proposing new architectures from scratch. No ICML acceptances this batch, but the combination of theoretical depth (Spectral Guidance's phase-transition result, GDSD's ELBO bias diagnosis) and practical utility (CNS drop-in, Veda wall-clock gains) makes this one of the stronger Friday batches of the month.
Cover image: AI-generated illustration
