Five diffusion papers worth reading: June 18, 2026

DRL drops SiT FID 72%, Moebius 0.22B beats 11.9B FLUX, Sumi first 7B UDLM open, PhaseLock ICML, DFD DMD2 fix

ArXiv Diffusion Models Digest
June 18, 2026 · 22:18

Research brief

Thursday's batch brought 172 cs.CV and 307 cs.LG new submissions. Of ~25 diffusion-model candidates, five stand out: a Meta AI/FAIR result that drops SiT FID from 9.38 to 2.62 using discriminator-guided RL without human preferences, a 0.22B inpainting model from HUST that beats 11.9B FLUX.1-Fill-Dev at 15× the speed, the first 7B uniform diffusion language model trained from scratch (fully open, Tohoku University), an ICML 2026 paper that shows 2-step video generation beats 50-step on physical consistency, and a single-line post-training fix that restores diversity in DMD2-distilled video models.

Speed-read table

| # | Paper | arXiv | Institution | One-line highlight |
| --- | --- | --- | --- | --- |
| 1 | DRL | 2606.19162 | Meta AI / FAIR | Discriminator-guided RL cuts SiT FID 9.38 → 2.62; no human preferences needed |
| 2 | Moebius | 2606.19195 | HUST + VIVO AI Lab | 0.22B beats FLUX.1-Fill-Dev (11.9B) on 6 inpainting benchmarks at >15× speedup |
| 3 | Sumi | 2606.19005 | Tohoku University NLP | First 7B UDLM trained from scratch; full CC-BY-4.0 open release |
| 4 | PhaseLock | 2606.06361 | Multi-institution | ICML 2026: 2-step beats 50-step on physical consistency; training-free, 1.06× overhead |
| 5 | DFD | 2606.18478 |  | Single-line code change + 100–300 finetuning steps fixes DMD2 mode collapse |

1. The reward was in your data all along: correcting flow matching with discriminator-guided RL

arXiv: 2606.19162 (submitted June 17) · Nicolas Beltran-Velez, Felix Friedrich, Zhang Xiaofeng, Reyhane Askari-Hemmat, Xiaochuang Han, Adriana Romero-Soriano, Michal Drozdzal; 7 authors including Meta AI / FAIR researchers · cs.CV / cs.LG 1
Code/demo: No public repository at time of writing.
Peer-review status: Preprint; 84 pages including appendices.
Core contribution. Flow-matching training minimizes an ℓ₂ regression error on the velocity field under training-time noise marginals. The authors argue that this proxy is structurally misaligned with the visual and semantic properties that determine sample quality at inference — the model learns to be a good ℓ₂ regressor under training marginals, not a good generator under model marginals. DRL (Discriminator-guided RL) fixes this gap without human preference labels: a discriminator is trained to separate real data from base-model samples in a pretrained representation space (e.g., DINOv3), then its logit is used as a reward in a KL-regularized RL objective. 1
Key technical insight. Using a pretrained representation space for the discriminator does two things. First, it constrains the discriminator to perceptually meaningful directions in feature space rather than pixel-level artifacts. Second, the discriminator logit approximates the log-likelihood ratio between the data distribution and the model distribution — which is the theoretically optimal reward for targeting the data distribution. This gives DRL a principled objective without any human annotation. The representation space also acts as a regularizer: discriminators trained in pixel space tend to latch onto high-frequency texture artifacts unrelated to image quality; pretrained feature space prevents this. 1
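The log-likelihood-ratio claim can be made concrete with a standard derivation. What follows is a sketch under simplifying assumptions: balanced real and generated classes, an optimal discriminator, and a generic KL-regularized objective with temperature β. The symbols p_data, p_θ, r, and β are notation chosen here and may not match the paper's. In DRL the densities would be those induced in the pretrained feature space rather than in pixel space.

```latex
\begin{aligned}
D^{*}(x) &= \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_{\theta}(x)}, \\
r(x) &:= \log\frac{D^{*}(x)}{1 - D^{*}(x)} = \log\frac{p_{\mathrm{data}}(x)}{p_{\theta}(x)}, \\
p^{*} &= \arg\max_{p}\; \mathbb{E}_{x \sim p}\,[r(x)] - \beta\,\mathrm{KL}\!\left(p \,\|\, p_{\theta}\right)
\;\;\Longrightarrow\;\; p^{*}(x) \propto p_{\theta}(x)\,\exp\!\big(r(x)/\beta\big), \\
\beta = 1 &:\quad p^{*}(x) \propto p_{\theta}(x)\cdot\frac{p_{\mathrm{data}}(x)}{p_{\theta}(x)} = p_{\mathrm{data}}(x).
\end{aligned}
```

Under these idealized assumptions the reward-tilted model recovers the data distribution exactly at β = 1. With an imperfect discriminator the correction is only approximate.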
Quantitative results. Across four flow-matching backbones on ImageNet 256×256 (guidance-free FID, class-conditional): 1
| Backbone | FID (base) | FID (DRL) | Semantic FD, DINOv3 (base) | Semantic FD, DINOv3 (DRL) |
| --- | --- | --- | --- | --- |
| SiT | 9.38 | 2.62 | 88.2 | 19.3 |
| JiT | (baseline) | improved | (baseline) | improved |
| REPA | (baseline) | improved | (baseline) | improved |
| RAE | (baseline) | improved | (baseline) | improved |
The SiT result is the sharpest: FID drops from 9.38 to 2.62 (−72%), and semantic-space FD (measured in DINOv3 feature space) collapses from 88.2 to 19.3 (−78%). DRL also improves human-preference reward scores without being trained on them — suggesting the discriminator-derived reward aligns well with human perception. A further result shows DRL provides a better starting point for subsequent human-preference post-training: the DRL-initialized model sits on a better Pareto frontier (quality vs. KL-divergence from the original model) than the flow-matching baseline when preference optimization is applied on top. 1
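The relative reductions quoted for SiT can be re-derived from the base and DRL values in the table. A small check, where `percent_reduction` is a helper name introduced here for illustration:

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


sit_fid_drop = percent_reduction(9.38, 2.62)  # guidance-free FID, base vs. DRL
sit_fd_drop = percent_reduction(88.2, 19.3)   # DINOv3 semantic FD, base vs. DRL

print(f"FID: -{sit_fid_drop:.1f}%  semantic FD: -{sit_fd_drop:.1f}%")
```

Both values round to the figures quoted in the text (72% and 78%).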
Why read it. The central claim — that ℓ₂ matching loss is a poor proxy for sample quality — has been stated informally in the field for a while, but this paper provides the clearest mechanistic argument and the most dramatic quantitative demonstration to date. The 72% FID reduction on SiT is substantial. If you work on flow-matching post-training, the combination of (a) the theoretical framing, (b) the discriminator construction recipe, and (c) the demonstrated Pareto-frontier improvement for downstream preference training is actionable. The 84-page paper is long; the appendix contains full derivations worth reading if you intend to adapt the discriminator design.

2. Moebius: 0.22B inpainting specialist matches 11.9B FLUX.1-Fill-Dev at 26 ms per step

arXiv: 2606.19195 (submitted June 17) · Kangsheng Duan, Ziyang Xu, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang; 6 authors · Huazhong University of Science and Technology + VIVO AI Lab · cs.CV 2
Project page: hustvl.github.io/Moebius (visual comparisons)
Peer-review status: Preprint.
Core contribution. FLUX.1-Fill-Dev is an 11.9B-parameter inpainting model; Moebius accomplishes the same task with 226M parameters — less than 2% of FLUX's size. The key question is how to preserve the expressiveness of a large model in a specialist with ~50× fewer parameters. Moebius introduces the Local-λ Mix Interaction (LλMI) block, which compresses spatial context and global semantic priors from the full latent field into fixed-size linear matrices. An adaptive multi-granularity distillation strategy then transfers knowledge from a teacher model (called PixelHacker, also from the HUST/VIVO team) within the latent space, avoiding expensive pixel-space decoding during training. A gradient norm adaptive loss weighting scheme stabilizes the distillation across granularity levels. 2 3
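This digest does not spell out the exact form of the gradient norm adaptive weighting, so the following is only a sketch of one common balancing rule, not the Moebius implementation. It assumes each granularity level has its own loss term with a measured gradient norm, and weights each term inversely to that norm. The function name, the normalization, and the example norms are all hypothetical.

```python
from typing import Sequence


def adaptive_loss_weights(grad_norms: Sequence[float], eps: float = 1e-8) -> list[float]:
    """Hypothetical balancing rule: weight each loss inversely to its gradient norm.

    Weights are rescaled to sum to the number of losses, so equal norms give all-ones.
    """
    inverse = [1.0 / (g + eps) for g in grad_norms]
    scale = len(grad_norms) / sum(inverse)
    return [w * scale for w in inverse]


# Assumed gradient norms for three distillation granularities (illustrative values).
norms = [4.0, 1.0, 0.25]
weights = adaptive_loss_weights(norms)

# Effective gradient scale of each weighted term; roughly equal by construction.
balanced = [w * g for w, g in zip(weights, norms)]
print(weights, balanced)
```

With a rule of this kind, no single granularity level dominates the update merely because its raw gradients are larger, which is one plausible reading of how such a scheme would stabilize multi-level distillation.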
Key technical insight. Standard inpainting models apply computation uniformly across the image. LλMI blocks depart from this: they route spatial and semantic information through a compressed matrix representation, so the model's representational capacity is concentrated on the interface between masked and unmasked regions rather than spread uniformly. This is why a 0.22B model can recover the perceptual quality of an 11.9B model on inpainting specifically: the task structure allows specialization in a way that general text-to-image generation does not. 3
[Figure: Moebius overall pipeline. LλMI blocks compress spatial context; distillation operates in latent space to avoid pixel-space decoding cost.] 3
Quantitative results. Evaluated across 6 benchmarks covering natural scenes (Places2) and portrait scenes (CelebA-HQ, FFHQ): 2
  • Matches or surpasses FLUX.1-Fill-Dev (11.9B) and SD3.5 Large-Inpainting on all 6 benchmarks
  • Inference latency: 26.01 ms/step on a single GPU
  • Total runtime: >15× faster than FLUX.1-Fill-Dev (parameter count ratio: 0.22B vs. 11.9B = 54× compression)
The benchmark table on the project page (Places2 natural scenes) shows Moebius consistently ahead of FLUX.1-Fill-Dev on L1, PSNR, SSIM, and LPIPS metrics. 3
[Figure: Places2 benchmark results. Moebius (0.22B) vs. FLUX.1-Fill-Dev (11.9B) and SD3.5 Large-Inpainting.] 3
Why read it. The result inverts the intuition that inpainting quality requires large generalist models. If you build production pipelines around FLUX.1-Fill-Dev or SD3.5 Large-Inpainting, a 15× speedup at matched quality is directly deployable. The code is public and the project page has rich side-by-side comparisons. The LλMI block design is also worth examining if you work on task-specific compression of diffusion models more broadly — the latent-space distillation strategy cleanly avoids the VAE decoder bottleneck.

3. Sumi: first 7B uniform diffusion language model trained from scratch

arXiv: 2606.19005 (submitted June 17) · Mengyu Ye, Keito Kudo, Wataru Ikeda, Ryosuke Matsuda, Keisuke Sakaguchi, Jun Suzuki; 6 authors · Tohoku University NLP · cs.CL / cs.LG 4
Code / weights: github.com/tohoku-nlp/sumi · License: CC-BY-4.0
Peer-review status: Preprint.
Core contribution. Uniform diffusion language models (UDLMs) allow any token at any position to be updated at any denoising step, unlike masked diffusion models (MDMs), in which a token stays fixed once it has been unmasked. In principle, UDLMs offer greater generation flexibility, since every token remains a candidate for revision throughout the trajectory. In practice, no UDLM had previously been pretrained at the 7B-parameter scale on large text corpora. Sumi fills this gap: it is the first 7B UDLM trained from scratch on 1.5T tokens, with full CC-BY-4.0 open release of model weights, training checkpoints, training recipe, and data mixture specification. 4
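The difference between the two model families is easiest to see in their forward (noising) processes. The sketch below uses the textbook forms of absorbing and uniform token noise, not Sumi's actual training code. The `MASK` symbol, the toy vocabulary, and the function names are made up for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, deliberately outside the vocabulary


def corrupt_masked(tokens, t, rng):
    """Absorbing (masked) noise: each token becomes MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def corrupt_uniform(tokens, t, vocab, rng):
    """Uniform noise: each token is resampled uniformly from vocab with probability t."""
    return [rng.choice(vocab) if rng.random() < t else tok for tok in tokens]


vocab = ["the", "cat", "sat", "on", "a", "mat"]
sentence = ["the", "cat", "sat", "on", "the", "mat"]
rng = random.Random(0)

masked = corrupt_masked(sentence, 0.5, rng)
uniform = corrupt_uniform(sentence, 0.5, vocab, rng)
print(masked)
print(uniform)
```

Under masked noise the corrupted positions are visibly marked, so the reverse process only has to fill `MASK` slots and can leave everything else alone. Under uniform noise a corrupted token looks like any other word, so the denoiser must be willing to revise every position at every step.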
Key technical insight. The Tohoku team's central motivation is not to claim Sumi beats autoregressive models, but to provide the diffusion language model community with a tractable reference point for studying what a large-scale UDLM actually learns — something that has been missing because all prior work on UDLMs stopped at much smaller scale. The CC-BY-4.0 release of weights, checkpoints, and the full training recipe is the paper's primary contribution, explicitly designed to let other researchers reproduce, extend, and interrogate the training dynamics at scale. 4
Quantitative results. The paper reports Sumi performs comparably to autoregressive models trained on an equivalent token budget across knowledge, reasoning, and coding benchmarks. The abstract does not give specific benchmark numbers (e.g., MMLU scores, HumanEval pass rates); those are in the full paper. Sumi underperforms on commonsense benchmarks, which the authors attribute to a higher proportion of educational data in the training mixture. 4
Why read it. The release is a research enabler more than a performance claim. If you work on discrete diffusion or masked/uniform language models, Sumi is the first large-scale UDLM you can actually download, fine-tune, and study. The full training recipe release is especially valuable for understanding what choices matter when scaling UDLMs — a question that has been difficult to study without a public baseline at this scale. Two things worth noting: (1) the code is live on GitHub now, (2) this is a preprint and benchmark numbers should be treated as preliminary pending community replication.

4. PhaseLock: 2-step generation beats 50-step on physical consistency (ICML 2026)

arXiv: 2606.06361 (v2, updated June 17; ICML 2026 accepted) · Woojung Han, Seil Kang, Youngjun Jun, Min-Hung Chen, Fu-En Yang, Seong Jae Hwang; 6 authors · multi-institution · cs.CV 5
Peer-review status: ICML 2026 accepted.
Core contribution. A widely held assumption in video diffusion is that more denoising steps produce better outputs. PhaseLock challenges this for physical consistency. The authors show that in image-to-video (I2V) diffusion models, the motion prior — the information encoding how objects should move according to physical laws — is established in the early denoising steps and systematically degraded by subsequent steps. The counterintuitive finding: generating with just 2 denoising steps often produces better physical consistency than 50-step generation from the same model. 5
Key technical insight. The authors analyze the frequency characteristics of latent representations across denoising steps. Phase (which encodes structural and motion information) drops by approximately 18% from step 2 to step 50, while magnitude (which encodes appearance) remains stable across the same range. This dissociation means that visual refinement steps — which are necessary for fidelity — come at the cost of eroding the motion prior encoded in the phase. PhaseLock is a training-free approach: it extracts the 2-step motion prior, then enforces it as a Latent Delta Guidance constraint during full-step generation. The 2-step output provides the "what direction should motion go" signal; the full-step run provides the "what should it look like" signal. 5
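The intuition that phase carries structure and motion while magnitude carries appearance follows from a basic Fourier property: shifting a signal changes the phase of its spectrum but leaves the magnitude untouched. The toy example below demonstrates that property on a 1D signal with a naive DFT. It is not the paper's analysis pipeline, and the "frames" are invented values.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform (O(n^2)), enough for a toy signal."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def magnitude_and_phase(signal):
    """Split the spectrum into per-frequency magnitude and phase (radians)."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


# Toy 1D "frames": the same bump, moved two positions to the right.
frame_a = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
frame_b = frame_a[-2:] + frame_a[:-2]

mag_a, phase_a = magnitude_and_phase(frame_a)
mag_b, phase_b = magnitude_and_phase(frame_b)

# Largest magnitude difference across frequencies (zero up to rounding error).
max_mag_gap = max(abs(a - b) for a, b in zip(mag_a, mag_b))
# Phase change at frequency k=1, expressed as a unit complex number.
phase_shift_k1 = cmath.exp(1j * (phase_b[1] - phase_a[1]))
print(max_mag_gap, phase_shift_k1)
```

The motion between the two frames is recorded entirely in the phase. A process that perturbs phase while preserving magnitude would therefore alter where things are without changing what they look like, which is consistent with the dissociation the authors report.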
As the authors put it: "a 2-step generation often exhibits better physical consistency than a 50-step output from the same model." 5
Quantitative results. 5
  • Physical consistency improvement: average +6.2 points across diverse I2V models
  • Visual fidelity: largely maintained (PhaseLock does not trade fidelity for consistency)
  • Computational overhead: 1.06× time, 1.02× memory (the 2-step extraction adds negligible cost)
  • Comparison to external guidance methods (e.g., physics simulators): those incur ~5× time overhead; PhaseLock achieves comparable or better physical consistency at 1.06×
Why read it. ICML 2026 acceptance provides peer-reviewed backing for the core finding. The result is immediately applicable: any researcher using I2V diffusion models for physical simulation, robotics data generation, or video world models can apply PhaseLock as a drop-in guidance wrapper at essentially zero cost. The phase-degradation diagnosis is the conceptual advance — it reframes the "more steps = better quality" assumption and opens a new analysis direction for video diffusion quality.

5. Data-Forcing Distillation: single-line fix for diversity collapse in distilled video diffusion

arXiv: 2606.18478 (submitted June 16) · Siyi Chen, Shaowei Liu, Yixuan Jia, Zian Wang, Huan Ling, Qing Qu, Jun Gao; 7 authors · cs.CV 6
Code/demo: No public repository at time of writing.
Peer-review status: Preprint.
Core contribution. Distribution matching distillation methods such as DMD and DMD2 accelerate video diffusion inference by training a student to match the teacher's score field using a reverse KL objective. Because the reverse KL is mode-seeking, the student concentrates probability mass on the modes of the teacher's distribution, which causes two failure modes in practice: mode collapse (the student ignores low-probability but valid output modes) and over-saturation (the student assigns probability mass to implausible regions not present in real data). Data-Forcing Distillation (DFD) fixes both with a single mechanism: the teacher score discrepancy (the difference between the teacher's score under the student's samples and the teacher's score under real data) is used to guide the student toward the real-data distribution. This requires changing one line in the training loop. 6
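The mode-seeking behavior of the reverse KL can be checked on a toy discrete distribution. The sketch below illustrates that general property only, not DFD itself. The teacher and student distributions are invented values.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy teacher with two strong modes (bins 0 and 3); values are illustrative.
teacher = [0.49, 0.01, 0.01, 0.49]
collapsed_student = [0.97, 0.01, 0.01, 0.01]  # covers only one mode
spread_student = [0.25, 0.25, 0.25, 0.25]     # covers everything, blurrily

for name, student in [("collapsed", collapsed_student), ("spread", spread_student)]:
    reverse = kl(student, teacher)  # KL(student || teacher): mode-seeking direction
    forward = kl(teacher, student)  # KL(teacher || student): mode-covering direction
    print(f"{name}: reverse={reverse:.3f} forward={forward:.3f}")
```

The reverse KL scores the collapsed student as the better fit even though it drops half of the teacher's mass, while the forward KL prefers the spread student. A student trained purely on the reverse direction therefore has no incentive to recover a missing mode, which is the gap a real-data signal is meant to close.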
Key technical insight. The teacher score discrepancy acts as a two-sided corrective signal. When the student generates samples in modes absent from real data (over-saturation), the teacher score under those samples diverges from the teacher score under real data — the signal pushes the student away. When the student ignores valid modes present in real data (mode collapse), the teacher score under real data pulls the student toward those missing modes. The same scalar term addresses both failure modes simultaneously. This is why the fix requires minimal intervention: the existing teacher is already computing the score; DFD only changes how that score is used. 6
Quantitative results. DFD is validated on: 6
  • Text-to-video, image-to-video, and autoregressive video generation
  • Tested backbones: Wan2.1-1.3B and Cosmos-Predict2.5-2B
  • Post-training budget: 100–300 steps of finetuning
  • Outcome: DFD models match the teacher's fidelity while recovering diversity; in several evaluations, DFD students outperform the teacher on joint diversity-fidelity metrics
The paper does not report FID or FVD numbers in the abstract; quantitative tables are in the full paper. The diversity-fidelity improvement is consistent across all three generation paradigms tested.
Why read it. The practical value is high for anyone who has deployed or is evaluating DMD/DMD2-based video distillation. Mode collapse and over-saturation in distilled video models are well-known pain points in production deployments, and the available fix until now has been accepting the quality degradation or reverting to the slower teacher. A 100–300 step post-training intervention is negligible cost relative to the original distillation training. The claimed "single-line code change" framing requires independent verification (no code is yet available), but the mechanism is clearly specified in the paper.

Three distinct repair themes dominate today's batch, and they are not independent.
RL as a flow-matching corrective. DRL (paper 1) and the general trajectory of reward-based post-training for diffusion models (visible also in last week's Flash-GRPO and DiPOD) share a premise: the pretraining objective is a poor proxy for what you actually want, and you need a separate signal to correct the distribution after the fact. DRL's contribution is making that corrective signal self-supervised — the discriminator constructs the reward from the data itself, without requiring human raters. This matters because human preference labeling is expensive and slow; a discriminator-derived reward scales with your data.
Specialist compression, not scale. Moebius (paper 2) inverts the "bigger is better" story that dominates text-to-image coverage. When the task is narrowly defined — inpainting specifically — a 226M specialist outperforms an 11.9B generalist at 15× the speed. This is consistent with a pattern visible across recent diffusion weeks: task-specific distillation (TEASR last Tuesday, Flash-GRPO, DFD this week) consistently produces models that are smaller, faster, and competitive or better on their target task. The tradeoff is that the specialist cannot step outside its task boundary.
Phase vs. magnitude in video diffusion. PhaseLock (paper 4) and DFD (paper 5) both diagnose failures in video diffusion that are invisible in image generation. PhaseLock shows that the denoising process itself degrades motion priors at the frequency level, a finding that has no parallel in image diffusion, where there is no temporal coherence to lose. DFD shows that the reverse KL objective of distribution matching distillation interacts badly with the multi-modal distribution of natural video, producing collapse and over-saturation that require a targeted fix. Both results suggest video diffusion has failure modes that image-diffusion intuitions will not predict.
Sumi (paper 3) stands apart: it is infrastructure for future research rather than a performance result. Its value is that it makes a previously opaque training regime (large-scale UDLM pretraining) reproducible and inspectable. The Tohoku release — weights, checkpoints, full recipe — is the kind of open artifact that shifts which experiments are feasible for researchers who are not at large labs.
