Five diffusion papers worth reading today (May 26, 2026)

Tuesday's batch is the most architecturally diverse single-day selection this week. SKILD (MIT) encodes scale as a diffusion coordinate rather than a conditioning variable, enabling one unconditional model to handle both generation (FID 2.65 on CIFAR-10) and 2×–8× super-resolution. LoopMDM (KAIST) shows that selective transformer-layer looping delivers depth scaling for masked diffusion language models, matching same-size non-looped models with 3.3× fewer training FLOPs and beating deeper non-looped models by 8.5 GSM8K points. DRM (Peking University, ICML 2026) repurposes a Flux-based diffusion model as a step-wise reward evaluator, replacing VLM-based reward models for alignment. A UC Berkeley theory paper (Cheng et al., with Malik and Abbeel among the authors) establishes generalization bounds for multi-objective diffusion learning under semi-supervised regimes. Paris 2.0 demonstrates that video diffusion training, typically assumed to require monolithic GPU clusters, can be decentralized, cutting FVD from 561.04 to 279.01 at matched compute.

ArXiv Diffusion Models Digest
2026/5/26 · 22:25
Research note

Tuesday's batch is unusually diverse in problem class. SKILD (MIT) redesigns the forward process itself to unify generation and super-resolution in a single unconditional model. LoopMDM (KAIST) asks whether masked diffusion language models can borrow depth from looping rather than adding parameters. DRM (Peking University, ICML 2026) turns a generation model inside-out to use it as a reward evaluator. A UC Berkeley theory paper formalizes what happens statistically when one diffusion model must serve multiple objectives. And Paris 2.0 asks whether video diffusion training requires a monolithic GPU cluster at all — and answers no.

1. SKILD: MIT unifies generation and continuous super-resolution without task-specific architecture

ArXiv: 2605.26032 | Zixin Jessie Chen, Zhuo Chen, Archer Wang, Jeff Gore, William T. Freeman, Congyue Deng, Marin Soljačić | cs.CV | MIT
Peer-review status: Preprint. Code available at github.com/JazzyCH/SKILD.
The standard approach to combining generation and super-resolution is to train two separate models — or one conditional model that takes a scale factor as input. SKILD (Scale-Invariant K-space Image Learning Diffusion) takes a different route: it trains a single unconditional diffusion model and routes both tasks through it by varying only the starting timestep. Generation starts from pure noise; super-resolution starts from a partially noised version of the low-resolution input, at a timestep calibrated to the target scale. No conditioning branch, no classifier-free guidance, no per-scale retraining. 1
The mechanism rests on scale invariance in natural image statistics. Natural images exhibit power-law decay in their variance spectra — a property the authors verify across CIFAR-10, ImageNet-128, and ImageNet-256. SKILD's forward process is designed to match this: it attenuates image content from fine to coarse scales while injecting spectrum-matched Gaussian noise, so scale becomes an explicit coordinate of the diffusion dynamics rather than a conditioning variable. The authors describe this as shifting the modeling burden from conditional mappings and task-specific architectures onto the design of the forward process itself. 1
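The fine-to-coarse attenuation with spectrum-matched noise described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the hard frequency cutoff, the log-spaced schedule in `cutoff`, the exponent `alpha`, the scale `sigma`, and the `sr_start_time` calibration are all assumptions chosen for illustration. The sketch only shows how a k-space forward process can replace fine-scale modes with power-law Gaussian noise, and how a super-resolution start time could in principle be tied to the target scale factor.

```python
import numpy as np

def radial_frequency(n):
    """|k| grid (cycles per pixel) for an n x n image."""
    f = np.fft.fftfreq(n)
    kx, ky = np.meshgrid(f, f, indexing="ij")
    return np.sqrt(kx ** 2 + ky ** 2)

def cutoff(t, n):
    """Hypothetical log-spaced cutoff: all modes kept at t=0, only DC at t=1."""
    k_max = radial_frequency(n).max()
    k_low = 0.5 / n  # below the smallest nonzero frequency (1/n)
    return k_max * (k_low / k_max) ** t

def forward_sample(image, t, alpha=2.0, sigma=1.0, rng=None):
    """Keep modes with |k| <= cutoff(t); replace the rest with Gaussian noise
    whose variance decays as |k|^-alpha (assumed power-law spectrum)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = image.shape[0]
    k = radial_frequency(n)
    keep = k <= cutoff(t, n)
    amp = np.zeros_like(k)
    amp[k > 0] = sigma * k[k > 0] ** (-alpha / 2.0)
    # Filtering real white noise in k-space keeps the result real-valued.
    noise_hat = np.fft.fft2(rng.standard_normal((n, n))) * amp
    mixed = np.where(keep, np.fft.fft2(image), noise_hat)
    return np.fft.ifft2(mixed).real

def sr_start_time(scale, n):
    """Assumed calibration: the time at which the cutoff equals the Nyquist
    frequency (0.5 / scale) of an input downsampled by `scale`."""
    k_max = radial_frequency(n).max()
    k_low = 0.5 / n
    return np.log((0.5 / scale) / k_max) / np.log(k_low / k_max)

if __name__ == "__main__":
    img = np.random.default_rng(1).standard_normal((32, 32))
    for s in (2, 4, 8):
        t = sr_start_time(s, 32)
        print(f"scale {s}x -> start time {t:.3f}, cutoff {cutoff(t, 32):.4f}")
```

In this toy setup, generation would start from `t = 1` (only the mean survives), while super-resolution would start from the intermediate time whose cutoff matches the resolution of the low-resolution input, so larger scale factors start later in the process.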
Figure: SKILD forward process on a self-similar fractal input. (a) Effective signal resolution decreases; (b) injected noise correlation length increases; (c) the two trade off so that the process obeys scale invariance. 1
Benchmark results: On unconditional CIFAR-10 generation, SKILD achieves FID 2.65 and Inception Score 9.63, competitive with DDPM and EDM baselines. On ImageNet, the same checkpoint performs 2×–8× super-resolution, outperforming conditional models (BSRGAN and others) across PSNR, SSIM, LPIPS, CLIPIQA, and MUSIQ. A physics validation experiment on critical Ising model configurations — where connected four-point correlations closely track ground truth — extends the claim beyond natural images. 1
Code/resources: github.com/JazzyCH/SKILD
Why read it: The transferable idea is treating scale as a diffusion coordinate rather than a conditioning input. That reframing applies to any downstream task where scale variation is a deployment reality — medical imaging, satellite imagery, generative SR pipelines. The physics validation is an unusual credibility check: if the model's spectra match Ising correlations, the scale invariance claim has support outside the benchmark regime.

2. LoopMDM: KAIST cuts masked diffusion LM training FLOPs by 3.3× with selective layer looping

ArXiv: 2605.26106 | Sanghyun Lee, Chunsan Hong, Seungryong Kim, Jonghyun Lee, Jongho Park, Dongmin Park | cs.LG | KAIST
Peer-review status: Preprint. Code and weights to be publicly released.
Masked Diffusion Models (MDMs) for language — models that generate text by iteratively unmasking tokens across multiple denoising rounds — have remained architecturally conservative compared to their autoregressive counterparts. LoopMDM's premise is that depth is what MDMs are missing, and that depth can be obtained without adding parameters: selectively looping the early-to-middle transformer layers forces the model to process its input multiple times per denoising step, producing a depth-scaling effect from a shallow stack. 2
Two knobs control the behavior. At training time, looping reduces the FLOPs required to reach a target test negative log-likelihood: the model trains on matched-compute budgets and reaches the same NLL faster than a non-looped baseline. At inference time, the loop count can be increased beyond the training configuration, providing a dial for compute-quality trade-offs after the model is already trained. An attention analysis reveals why the loop count matters: looping promotes attention interactions among masked positions — positions that, in a single-pass model, never directly attend to each other within a step. 2
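A minimal sketch of selective looping, assuming a toy residual layer in place of a real transformer block. The names `make_layer` and `looped_forward`, and the choice of which layer indices to loop, are hypothetical; the paper's actual layer selection and training recipe are not reproduced here. The point is only that the loop count is a runtime argument, so it can be raised at inference without changing the parameters.

```python
import numpy as np

def make_layer(dim, seed):
    """Toy residual block standing in for a transformer layer (assumption)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def layer(h):
        return h + np.tanh(h @ w)

    return layer

def looped_forward(h, layers, loop_start, loop_end, num_loops):
    """Run layers[:loop_start] once, layers[loop_start:loop_end] num_loops
    times with shared weights, then layers[loop_end:] once.
    Returns the output and the number of layer applications."""
    calls = 0
    for layer in layers[:loop_start]:
        h = layer(h)
        calls += 1
    for _ in range(num_loops):
        for layer in layers[loop_start:loop_end]:
            h = layer(h)
            calls += 1
    for layer in layers[loop_end:]:
        h = layer(h)
        calls += 1
    return h, calls

if __name__ == "__main__":
    layers = [make_layer(4, seed) for seed in range(6)]
    h0 = np.ones((2, 4))
    for loops in (1, 2, 4):
        _, calls = looped_forward(h0, layers, 1, 4, loops)
        print(f"loops={loops}: {calls} layer applications, 6 layers of parameters")
```

With `num_loops=1` this reduces to the plain stack, which is why the loop count can act as a dial: effective depth grows with the loop count while the parameter count stays fixed.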
Figure: LoopMDM overview. (Left) selective looping on early-to-middle denoising layers; (center) matched-compute training curves showing faster NLL convergence; (right) GSM8K accuracy improving monotonically with more inference loops. 2
Benchmark results: LoopMDM matches same-size non-looped MDMs with 3.3× fewer training FLOPs. On GSM8K (a math reasoning benchmark), it surpasses deeper non-looped MDMs by +8.5 points at comparable per-step compute. Generative perplexity decreases monotonically with loop count: from 116.71 ± 2.72 at S=1 to 42.56 ± 1.09 at S=4 (1024 denoising steps). Zero-shot perplexity improves across PTB, WikiText, LM1B, Lambada, AG News, PubMed, and ArXiv benchmarks. 2
Code/resources: Not yet released; the authors indicate public release is planned.
Why read it: The training FLOPs reduction (3.3×) and the inference-time dial are independent benefits — you get both from the same architectural change. For teams training MDMs under compute constraints, this is a training-side intervention with no penalty at inference if you keep the loop count fixed. The GSM8K improvement also suggests that looping is doing something specifically useful for multi-step reasoning, which matters if you're targeting coding or math tasks with discrete diffusion.

3. DRM: Peking University turns a diffusion model into its own reward evaluator (ICML 2026)

ArXiv: 2605.25661 | Jaxon Zhang, Binxin Yang, Hubery Yin, Chen Li, Jing Lyu | cs.CV | Peking University
Peer-review status: Accepted at ICML 2026. Code available at github.com/jjaxonx/DRM.
Reward models for diffusion model alignment, such as HPSv3, PickScore, and ImageReward, are built on VLMs (Vision-Language Models) fine-tuned for preference prediction. The DRM argument is that VLMs are the wrong backbone: they are pre-trained for semantic alignment, not for the aesthetic and compositional attributes that drive human preference at the pixel level. A model that can generate high-fidelity images must, the authors argue, have learned an implicit understanding of those attributes. DRM (Diffusion-based Reward Model) operationalizes this by removing the last three transformer layers from a pre-trained Flux-based diffusion model and training the resulting architecture on preference data (HPDv3, Pick-A-Pic, and ImageReward subsets). 3
The structural advantage is step-wise evaluation. Standard reward models treat generation as a black box: they score only the final output. DRM can score any noisy intermediate latent at any denoising stage — it was trained on those latents, so it has representations for them. This enables two downstream uses. Step-wise GRPO provides dense per-step rewards during reinforcement learning training, addressing the credit-assignment problem that arises when a single terminal reward must propagate back through dozens of denoising steps. Step-wise Sampling uses DRM as an inference-time guide: at each step, multiple candidate continuations are scored and the best one is carried forward. 3
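The Step-wise Sampling loop described above can be sketched as a best-of-K selection at every denoising step. Everything below is a toy stand-in: `toy_denoise_step` is not a real sampler, `toy_reward` is not DRM, and the function names are hypothetical. The sketch only shows the control flow, in which a reward model that accepts noisy intermediates picks which candidate to carry forward.

```python
import numpy as np

def toy_denoise_step(x, rng, noise_scale=0.3):
    """Stand-in for one stochastic denoising step (assumption, not a real sampler)."""
    return 0.8 * x + noise_scale * rng.standard_normal(x.shape)

def toy_reward(x):
    """Stand-in for a step-wise reward model that can score noisy latents."""
    return -float(np.sum(x ** 2))

def stepwise_best_of_k(x0, num_steps, num_candidates, rng):
    """At each step, draw several candidate continuations, score each one,
    and keep the highest-scoring candidate as the next state."""
    x = x0
    trace = []
    for _ in range(num_steps):
        candidates = [toy_denoise_step(x, rng) for _ in range(num_candidates)]
        scores = [toy_reward(c) for c in candidates]
        best = int(np.argmax(scores))
        x = candidates[best]
        trace.append((scores[best], max(scores)))
    return x, trace

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.standard_normal(16)
    final, trace = stepwise_best_of_k(x0, num_steps=10, num_candidates=4, rng=rng)
    print("final toy reward:", toy_reward(final))
```

A terminal-only reward model could not be used in this loop, because it has no meaningful score for the intermediate states; the same per-step scores are what a dense-reward RL setup such as Step-wise GRPO would consume.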
Figure: DRM vs. VLM-based reward models. Existing VLM-based RMs score only the final image (terminal reward); DRM evaluates at any intermediate denoising stage. 3
Benchmark results: Experiments on SD3.5-Medium (Stable Diffusion 3.5 Medium) show DRM-optimized generations achieve superior visual quality compared to HPSv3, PickScore, and ImageReward baselines. Quantitative comparisons from the preference tables were not fully accessible in HTML-rendered form; the ICML 2026 acceptance provides a credibility check that the methodology was independently reviewed. 3
Code/resources: github.com/jjaxonx/DRM
Why read it: The backbone-substitution argument is the idea to stress-test. If a generation model's internal representations genuinely contain more perceptual signal than a VLM fine-tuned on preference data, the principle extends: any sufficiently capable generative model could serve as a better reward model for its own domain than an external discriminator trained separately. Step-wise GRPO is also directly practical — teams running GRPO on diffusion models have reported instability from sparse terminal rewards, and a dense per-step signal from the same architectural family is a low-overhead fix to try.

4. Multi-objective diffusion learning: UC Berkeley establishes a statistical theory for Pareto trade-offs

ArXiv: 2605.25210 | Ziheng Cheng, Yixiao Huang, Hanlin Zhu, Haoran Geng, Somayeh Sojoudi, Jitendra Malik, Pieter Abbeel, Xin Guo | cs.LG | UC Berkeley
Peer-review status: Preprint (submitted 2026-05-24). No code repository confirmed at time of writing.
A deployed diffusion model rarely serves one distribution. A text-to-image model must handle diverse prompt domains. A robotic diffusion policy must generalize across environments. Each condition defines a different target distribution, and achieving a good Pareto trade-off across all of them simultaneously requires model capacity, which drives up sample complexity. This paper develops the statistical theory for that trade-off, formalized as multi-objective learning (MOL) of conditional diffusion models under a semi-supervised regime. 4
The key result: the number of paired samples required for the generalist model to reach a target error bound depends only on the complexity of the specialist models, not the generalist. This matters because the generalist, by absorbing pseudo-samples generated from abundant unlabeled condition data, can grow in capacity without proportionally increasing the paired-data requirement. The two-stage procedure is: (1) fit lightweight specialists from limited paired data; (2) generate pseudo-samples from abundant unlabeled condition data, then distill into a generalist. The theory extends to diffusion policies for sequential decision-making, accounting for distribution shift between training and on-policy rollouts. 4
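The two-stage procedure can be illustrated with a deliberately simple regression analogue. This is not a diffusion model and not the paper's algorithm: the linear specialists, the one-hot generalist, and all sample sizes are assumptions made so the example runs in a few lines. It shows only the data flow, where scarce paired data touches the specialists and the generalist sees nothing but pseudo-samples built from plentiful unlabeled conditions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_slopes = {0: 2.0, 1: -1.0}  # one hypothetical target per condition

def fit_specialist(x, y):
    """Least-squares slope for one condition from a few paired samples."""
    return float(x @ y / (x @ x))

# Stage 1: fit lightweight specialists from limited paired data.
specialists = {}
for cond, slope in true_slopes.items():
    x = rng.standard_normal(8)
    y = slope * x + 0.1 * rng.standard_normal(8)
    specialists[cond] = fit_specialist(x, y)

# Stage 2: label abundant unpaired inputs with the specialists,
# then distill the pseudo-samples into a single generalist.
features, targets = [], []
for cond in true_slopes:
    x = rng.standard_normal(2000)
    onehot = np.eye(len(true_slopes))[cond]
    features.append(x[:, None] * onehot[None, :])
    targets.append(specialists[cond] * x)
features = np.concatenate(features)
targets = np.concatenate(targets)
generalist, *_ = np.linalg.lstsq(features, targets, rcond=None)

if __name__ == "__main__":
    print("specialist slopes:", specialists)
    print("generalist weights:", generalist)
```

In this toy the generalist reproduces the specialists, so its accuracy on the true targets is limited by how well the specialists were fit from the paired data, which loosely mirrors the claim that the paired-sample requirement is tied to specialist complexity rather than generalist complexity.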
Figure: Semi-supervised multi-objective framework. Limited paired data feeds specialist fitting; the specialists then generate pseudo-samples from abundant unlabeled condition data, which feed generalist distillation. 4
Benchmark results: Experiments on robotic manipulation with domain randomization at four difficulty levels, and on image restoration tasks, validate the theoretical generalization bounds. The paper does not claim state-of-the-art on any single benchmark; the contribution is the proof that specialist-to-generalist distillation reduces paired sample requirements, and the experimental section confirms that the bound is not vacuous. 4
Code/resources: Not confirmed at time of writing.
Why read it: The senior authors — Jitendra Malik and Pieter Abbeel, both UC Berkeley faculty — bring credibility to the theoretical framing. More practically, the result has direct implications for anyone training a single diffusion model across multiple domains with limited labeled data per domain. The semi-supervised setup (scarce paired data, abundant unlabeled conditions) is the norm rather than the exception in robotics and medical imaging, and the theory gives a principled justification for the common-practice intuition that specialist-then-generalist training is data-efficient.

5. Paris 2.0: the first video diffusion model trained through fully decentralized computation

ArXiv: 2605.26064 | Ali Rouzbayani, Bidhan Roy, Marcos Villagra, Zhiying Jiang | cs.CV | independent (Paris project)
Peer-review status: Preprint (submitted 2026-05-25). No code repository confirmed at time of writing.
Video diffusion training is assumed to require tightly coupled GPU clusters: the combination of spatial, temporal, and cross-attention across long token sequences demands fast inter-GPU communication, and the community has treated this as a fixed constraint. Paris 2.0 challenges that assumption by extending the Decentralized Diffusion Model (DDM) paradigm — first demonstrated for images in Paris 1.0 (arXiv:2510.03434) — to temporally coherent video generation. The key challenge is that video generation requires synchronizing not just spatial consistency but temporal coherence across frames, a problem that does not arise in single-image generation. 5
The training communication stack handles data, pipeline, tensor, and context parallelism, each with different synchronization costs across distributed, heterogeneous GPUs. The authors do not claim that decentralized training is free — each parallelism mode introduces its own latency. What they claim is that at matched total compute budget, a DDM-trained video model can produce better outputs than the monolithic centralized baseline. The proposed explanation is implicit regularization from asynchronous updates, though this is the authors' interpretation rather than a formally derived result. 5
Figure: Paris 2.0 qualitative video samples. Each row shows eight consecutive frames from one generated video, demonstrating frame-to-frame temporal coherence. 5
Benchmark results: In the low-resolution text-to-video setting, Paris 2.0 cuts Fréchet Video Distance (FVD, lower is better) from 561.04 (monolithic baseline, identical data and compute) to 279.01, roughly a 2× reduction. CLIP text-video similarity and aesthetic scores both exceed the monolithic baseline at matched compute. 5
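As a quick check of the headline figure, the arithmetic on the two reported FVD values can be reproduced directly. The variable names are illustrative; only the two numbers come from the digest.

```python
baseline_fvd = 561.04  # monolithic baseline, as reported
paris_fvd = 279.01     # Paris 2.0, as reported

ratio = baseline_fvd / paris_fvd
reduction = 1 - paris_fvd / baseline_fvd

print(f"{ratio:.2f}x lower FVD, {reduction:.1%} reduction")
# prints "2.01x lower FVD, 50.3% reduction"
```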
Code/resources: Not confirmed at time of writing.
Why read it: The FVD gain is large enough to rule out noise — a 2× improvement at matched compute is a strong result. But the more consequential claim is structural: if video diffusion training does not require a monolithic cluster, then the compute required to train competitive video models becomes accessible to research groups without NVIDIA infrastructure contracts. Whether that generalizes from the low-resolution experimental regime to production-scale video is an open question the paper does not answer, which is what makes it worth tracking.

Quick reference

Paper | Core contribution | Institution | Peer-review status | Code
2605.26032 (SKILD) | Scale-invariant forward process; single unconditional model handles generation + 2×–8× SR; FID 2.65 on CIFAR-10 | MIT | Preprint | GitHub (open)
2605.26106 (LoopMDM) | Selective layer looping for MDMs; 3.3× training FLOPs reduction, +8.5 pts GSM8K | KAIST | Preprint | Planned release
2605.25661 (DRM) | Diffusion model as reward backbone; step-wise evaluation at any denoising stage; Step-wise GRPO + Sampling | Peking University | ICML 2026 | GitHub (open)
2605.25210 (MOL) | Statistical theory for multi-objective diffusion learning; specialist-to-generalist distillation reduces paired sample requirements | UC Berkeley | Preprint | Not confirmed
2605.26064 (Paris 2.0) | First video diffusion model trained via decentralized computation; FVD 561 → 279 vs. monolithic baseline at matched compute | Paris project | Preprint | Not confirmed

Tuesday's papers share a common structural move: each one relocates a burden that the field has been placing on conditioning or post-hoc mechanisms. SKILD moves super-resolution from a conditioning branch into the forward process design. LoopMDM moves depth from parameter count into loop count. DRM moves reward estimation from a separate VLM into the generative model's own representations. The MOL paper moves multi-domain generalization from single-task repetition into a principled specialist-distillation framework. Paris 2.0 moves cluster dependency from a hard infrastructure constraint into a design variable. Whether these moves prove durable is a question for peer review and replication — but they represent coherent theoretical bets rather than incremental tuning.
Cover image: AI-generated illustration