Five diffusion papers worth reading today (May 23, 2026)

Friday's cs.CV/cs.LG window yields five papers that all fix an underspecified default in the standard toolkit. UDM Revisited (2605.22765) corrects the leave-one-out denoiser mismatch in uniform diffusion and closes the gap with masked diffusion. The Lanczos Gaussian Sampler (2605.22723, MIT) proves full covariance matching breaks the Ω(1/T) path-KL barrier, reaching FID 5.21 on CelebA at T=100 with no retraining. DecQ (2605.22777) adds 8 detail-condensing query tokens to frozen VFM encoders, gaining +3.63 dB PSNR and 3.3× convergence speed. DiTo (2605.22011, KAIST) reframes token reduction as an output-similarity problem, outperforming ToMeSD by +3.62 dB PSNR on FLUX. Lens (2605.21573, Microsoft) is a 3.8B MMDiT T2I model trained at 19.3% of Z-Image's compute with GenEval 0.930 and a 4-step Turbo at 0.84s/image.

ArXiv Diffusion Models Digest
May 23, 2026 · 22:29
Research brief

Friday's cs.CV and cs.LG listings brought five papers that each fix something underspecified in the standard toolkit: a misidentified denoiser, a scalar approximation to a full matrix, a frozen encoder whose final-layer features discard reconstruction detail, an input-similarity heuristic applied where output similarity is the right proxy, and a training-efficiency story backed by actual GPU accounting. Each replacement is grounded in a specific mechanism rather than empirical search.

1. Uniform diffusion models revisited: leave-one-out denoiser and absorbing-state reformulation

ArXiv: 2605.22765 | Samson Gourevitch et al. | cs.LG, stat.ML
Peer-review status: Preprint (submitted 2026-05-21). Code released at github.com/samsongourevitch/rev_udm.
Uniform diffusion models (UDM) use a forward process that corrupts tokens by resampling them uniformly from the vocabulary, a simple discrete alternative to the absorbing [MASK] state of masked diffusion. But there has been a persistent empirical gap: masked diffusion tends to outperform uniform diffusion despite seemingly similar theoretical footing. This paper identifies the root cause. 1
The authors show that the standard plug-in bridge parameterization in UDM is not optimized by the denoising posterior. It is optimized by the leave-one-out posterior — a distribution that predicts each clean token without observing its own noisy version. This creates a mismatch between the plug-in ELBO and the standard cross-entropy denoising objective that practitioners actually use. The paper derives exact conversion formulas between the denoiser, the leave-one-out posterior, and the score, which decouples the choice of parameterization from the choice of training objective. 1
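To make the distinction between the two posteriors concrete, here is a minimal sketch of how they relate under Bayes' rule. It assumes the standard per-token uniform kernel, in which each token is kept with probability `alpha` (with `alpha < 1`) and otherwise resampled uniformly from the vocabulary. The function names are hypothetical, and this is not necessarily the form in which the paper states its conversion formulas.

```python
def token_likelihood(observed, alpha, vocab_size):
    """q(x_t^i = observed | x_0^i = a) for every candidate clean token a."""
    base = (1.0 - alpha) / vocab_size
    return [base + (alpha if a == observed else 0.0) for a in range(vocab_size)]


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def loo_to_denoiser(loo, observed, alpha):
    """Denoising posterior = leave-one-out posterior reweighted by the token's own likelihood."""
    lik = token_likelihood(observed, alpha, len(loo))
    return normalize([p * l for p, l in zip(loo, lik)])


def denoiser_to_loo(denoiser, observed, alpha):
    """Inverse map: divide out the token's own likelihood, then renormalize."""
    lik = token_likelihood(observed, alpha, len(denoiser))
    return normalize([p / l for p, l in zip(denoiser, lik)])


# Assumed toy example: 3-token vocabulary, the noisy token at this position is 2.
loo = [0.5, 0.3, 0.2]
den = loo_to_denoiser(loo, observed=2, alpha=0.6)
back = denoiser_to_loo(den, observed=2, alpha=0.6)
```

The denoiser puts more mass on the observed token than the leave-one-out posterior does, because it also sees that position's own noisy value. The round trip recovers the original leave-one-out distribution.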
From this foundation, the paper offers two practical contributions. First, an informed predictor-corrector sampler and improved temperature sampling based on the leave-one-out predictor — both are inference-time improvements that require no additional training. Second, an absorbing-state reformulation that rewrites uniform diffusion as a masked-diffusion-like sampling operation, with simpler denoising posteriors, carry-over unmasking, and a natural remasking mechanism. In language modeling experiments, leave-one-out parameterizations consistently improve UDM generation, and the absorbing construction matches or exceeds masked diffusion. 1
"These results suggest that the empirical gap between masked and uniform diffusion is driven less by the choice of marginals themselves than by parameterization and sampling design." 1
Why read it: If you work with discrete diffusion for language, this is a direct fix. The leave-one-out insight changes which parameterizations are theoretically justified for UDM, and the inference-time improvements are applicable without retraining. The absorbing reformulation also narrows the previously mysterious gap between masked and uniform diffusion — it points to design choices, not marginal structure, as the controllable variable.

2. Full covariance matching in Gaussian DDPMs breaks the Ω(1/T) path-KL barrier via Lanczos sampler

ArXiv: 2605.22723 | Md Sahil Akhtar et al. | cs.LG, cs.AI, cs.IT | MIT (EECS / Operations Research Center / Sloan)
Peer-review status: Preprint (submitted 2026-05-21). No code repository confirmed at time of writing.
Standard DDPM reverse processes approximate the posterior covariance as isotropic, a scalar times the identity. Diagonal approximations exist, but prior theory establishes an Ω(1/T) lower bound on the path-space KL divergence for any sampler restricted to isotropic or diagonal covariances, where T is the number of denoising steps. In other words, however the scalar or diagonal entries are tuned, the path-KL cannot decay faster than 1/T. 2
This paper proves that matching the full posterior covariance breaks that barrier. With full covariance matching, the path-KL drops to O(1/T²) — a quadratic improvement in convergence rate. 2
"We show that matching the full posterior covariance breaks this barrier, yielding an order-wise improvement that reduces the path KL to O(1/T²)." 2
The obvious obstacle is that full covariance matrices are dense and expensive to compute. The paper's second contribution is the Lanczos Gaussian Sampler (LGS): a training-free, matrix-free method that samples from the optimal reverse covariance using only covariance-vector products, obtained via Jacobian-vector products through the score network. No dense covariance storage is needed. Each Lanczos step costs one Jacobian-vector product, and the approximation error decreases exponentially with the number of steps: with O(3⁻ᵐ) decay, m=3 Lanczos steps already yield a substantial improvement. 2
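The matrix-free idea can be illustrated with the textbook Lanczos approximation of Σ^{1/2} z, which touches Σ only through matrix-vector products. The sketch below is a generic version of that technique, not the paper's LGS implementation. The explicit matrix `sigma` is a stand-in (an assumption) for the Jacobian-vector products through a score network, and `lanczos_sqrt_apply` is a hypothetical name.

```python
import numpy as np


def lanczos_sqrt_apply(matvec, z, m):
    """Approximate Sigma^{1/2} z with m Lanczos steps, using only matvec(v) = Sigma v."""
    z_norm = np.linalg.norm(z)
    basis = [z / z_norm]          # orthonormal Krylov basis vectors
    alphas, betas = [], []        # diagonal / off-diagonal of the tridiagonal matrix
    q_prev, beta = np.zeros_like(z), 0.0
    for j in range(m):
        w = matvec(basis[j]) - beta * q_prev   # one matvec per Lanczos step
        alpha = float(basis[j] @ w)
        w = w - alpha * basis[j]
        for q in basis:                        # full reorthogonalization for stability
            w = w - (q @ w) * q
        alphas.append(alpha)
        beta = float(np.linalg.norm(w))
        if j == m - 1 or beta < 1e-12:         # done, or Krylov space exhausted
            break
        betas.append(beta)
        q_prev = basis[j]
        basis.append(w / beta)
    k = len(alphas)
    tri = np.diag(alphas) + np.diag(betas[:k - 1], 1) + np.diag(betas[:k - 1], -1)
    evals, evecs = np.linalg.eigh(tri)
    evals = np.clip(evals, 0.0, None)
    # T^{1/2} e_1 = V diag(sqrt(lambda)) V^T e_1, and V^T e_1 is the first row of V.
    sqrt_t_e1 = evecs @ (np.sqrt(evals) * evecs[0, :])
    return z_norm * (np.stack(basis[:k], axis=1) @ sqrt_t_e1)


# Toy usage: a small explicit SPD matrix stands in for the reverse covariance.
rng = np.random.default_rng(0)
a = rng.standard_normal((6, 6))
sigma = a @ a.T + np.eye(6)
z = rng.standard_normal(6)
noise_m3 = lanczos_sqrt_apply(lambda v: sigma @ v, z, m=3)  # draw with approx. covariance sigma
```

With `m` equal to the dimension, the result coincides with the exact Σ^{1/2} z up to floating-point error. Smaller `m` trades accuracy for fewer matrix-vector products.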
[Figure: path-KL vs. number of denoising steps for full covariance (blue), diagonal (orange), and isotropic (green) samplers]
Figure 1 from 2605.22723: full covariance (blue) achieves O(1/T²) decay while isotropic (green) and diagonal (orange) remain at Θ(1/T). Right panel: CelebA 64×64 FID results at varying T. 2
Empirical results on CIFAR-10, CelebA 64×64, and ImageNet 64×64 confirm the theoretical gain. On CelebA at T=100, LGS with m=3 steps achieves FID 5.21 vs. 7.09 for OCM-DDPM, the prior method for optimal covariance modeling, which requires a separately trained auxiliary covariance network. At T=25, LGS reaches FID 9.58 vs. OCM-DDPM's 12.66. 2 An amortized batching variant (LGSb, l=2) matches OCM-DDPM's wall-clock time while maintaining the FID advantage.
Why read it: The result is theoretically clean — a provable quadratic improvement from a principled modification — and the LGS implementation is training-free, so it drops into existing score networks without retraining. The proof technique (Gaussian channel mutual information + I-MMSE identity) is also worth reading for anyone interested in information-theoretic analysis of diffusion sampling.

3. DecQ: detail-condensing queries improve reconstruction and generation in latent diffusion autoencoders

ArXiv: 2605.22777 | Tianhang Wang et al. | cs.CV | Zhejiang University, Fudan University, Shanghai Innovation Institute, Westlake University, JD.COM
Peer-review status: Preprint (submitted 2026-05-21). Code released at github.com/Tianhang-Wang/DecQ.
Representation autoencoders (RAEs) built on frozen vision foundation models (VFMs), such as the frozen DINOv2 encoder (Meta's self-supervised Vision Transformer) used in RiT, covered yesterday, face a well-known trade-off: features optimized for semantic discrimination tend to discard fine-grained pixel-level detail, which reconstruction requires. The frozen latent space cannot be changed; the question is whether the decoder side can compensate. 3
DecQ's answer: add 8 learnable detail-condensing query tokens that interact with intermediate VFM layers via lightweight condenser modules. The condenser modules extract fine-grained information from the frozen encoder's middle layers — where detail is still preserved before being suppressed in the final semantic representation. A dual-stream decoder then processes both the standard patch tokens and the detail-condensing queries together during reconstruction. 3
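As a rough illustration of the data flow, the sketch below lets 8 query tokens read from intermediate-layer patch features through single-head cross-attention and then appends them to the final patch tokens. Everything beyond the count of 8 queries is an assumption made for illustration: the cross-attention form of the condenser, the residual update, the choice of three layers, the dimensions, and the concatenation before the decoder. The released code is the reference for the actual design.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def condense(queries, layer_feats, w_k, w_v):
    """Hypothetical condenser: queries (Q, d) attend over frozen patch features (N, d)."""
    keys = layer_feats @ w_k
    values = layer_feats @ w_v
    attn = softmax(queries @ keys.T / np.sqrt(queries.shape[-1]))
    return queries + attn @ values  # residual update of the query tokens


rng = np.random.default_rng(0)
d, num_patches, num_queries = 16, 256, 8   # d and num_patches are toy values
queries = 0.02 * rng.standard_normal((num_queries, d))  # stand-in for learned queries

for _ in range(3):  # three assumed intermediate layers of the frozen encoder
    feats = rng.standard_normal((num_patches, d))        # stand-in for frozen features
    w_k = rng.standard_normal((d, d)) / np.sqrt(d)
    w_v = rng.standard_normal((d, d)) / np.sqrt(d)
    queries = condense(queries, feats, w_k, w_v)

final_patch_tokens = rng.standard_normal((num_patches, d))
# Assumed hand-off: the decoder sees patch tokens and detail queries side by side.
decoder_input = np.concatenate([final_patch_tokens, queries], axis=0)
```

In this toy setup the decoder input grows from 256 to 264 tokens, which gives a feel for why a handful of extra queries adds little compute.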
[Figure: DecQ architecture, with a frozen VFM encoder, condenser modules extracting detail-condensing queries from intermediate layers, and a dual-stream ViT decoder that takes them alongside patch tokens]
DecQ architecture overview. Condenser modules tap intermediate VFM layers to supply 8 query tokens that carry fine-grained detail the frozen semantic features discard. 3
The overhead is +3.9% compute over the frozen RAE baseline. The gains are substantial: PSNR improves from 19.13 dB to 22.76 dB on ImageNet 256×256. For generation — using DiTDH-XL as the backbone with a flow matching objective, evaluated at 50 ODE steps — DecQ reaches FID 1.41 without classifier-free guidance and FID 1.05 with guidance at 800 epochs, on ImageNet 256×256. Convergence also accelerates: DecQ at 80 epochs matches the baseline RAE's FID at 800 epochs (FID 1.80 vs. 1.51), a 3.3× speedup. 3
An interpretability finding: queries derived from shallow VFM layers primarily improve reconstruction quality, while queries from deep layers primarily improve generation. This layerwise specialization was not designed in — it emerged from training.
"DecQ effectively mitigates the reconstruction–generation trade-off, improving both reconstruction quality and generative performance." 3
Why read it: The 3.3× convergence speedup is arguably the most actionable number: it suggests that DecQ's detail queries provide a training-time regularization effect, not just an inference-time quality boost. For practitioners training RAEs on top of frozen VFMs, DecQ is a low-cost (+3.9% compute, 8 tokens) upgrade path with open code.

4. DiTo: output-similarity-aware token reduction for diffusion transformers

ArXiv: 2605.22011 | Hangyeol Lee et al. | cs.CV | KAIST (Daejeon, Korea)
Peer-review status: Preprint (submitted 2026-05-21). No code repository confirmed at time of writing.
Token merging methods for DiTs — such as ToMeSD (Token Merging for Stable Diffusion), which merges similar tokens in attention to reduce compute — inherit their matching criterion from ViT acceleration work: find pairs of tokens that look similar in the input feature space and merge them. DiTo's central argument is that this is the wrong proxy for diffusion models. The correct criterion is output similarity: which token reductions minimize the error in the attention output, not in the input features. 4
The practical obstacle to output-similarity matching is cost — computing output similarity requires running the attention mechanism, which is what you're trying to avoid. DiTo sidesteps this with a key observation about diffusion models' temporal consistency: the output token similarities at timestep t are highly correlated with output token similarities at timestep t−1. The previous step's output similarity is therefore a cheap and accurate proxy for the current step's matching. 4
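The proxy idea can be sketched as follows: similarities are computed on the previous step's attention outputs, which are already available, and the resulting pairs are used to merge the current step's tokens. The alternating source/destination split follows ToMe-style bipartite matching and is an assumption here, as are the function names. The paper's exact matching and recovery procedure may differ.

```python
import numpy as np


def match_from_previous_outputs(prev_out, ratio):
    """Choose (source, destination) pairs by cosine similarity of last step's outputs."""
    n = prev_out.shape[0]
    unit = prev_out / np.linalg.norm(prev_out, axis=1, keepdims=True)
    src_idx = np.arange(0, n, 2)           # candidates to be merged away
    dst_idx = np.arange(1, n, 2)           # tokens that are kept
    sim = unit[src_idx] @ unit[dst_idx].T
    best_dst = sim.argmax(axis=1)
    best_sim = sim.max(axis=1)
    num_merge = min(int(ratio * n), len(src_idx))
    chosen = np.argsort(-best_sim)[:num_merge]   # most redundant sources first
    return src_idx[chosen], dst_idx[best_dst[chosen]]


def reduce_tokens(x, src, dst):
    """Average each source token into its destination, then drop the sources."""
    out = x.copy()
    counts = np.ones(len(x))
    np.add.at(out, dst, x[src])
    np.add.at(counts, dst, 1.0)
    out = out / counts[:, None]
    keep = np.setdiff1d(np.arange(len(x)), src)
    return out[keep]


rng = np.random.default_rng(0)
prev_out = rng.standard_normal((8, 4))   # toy attention outputs from step t-1
prev_out[1] = prev_out[0]                # tokens 0 and 1 had identical outputs
prev_out[5] = prev_out[2]                # tokens 2 and 5 had identical outputs
src, dst = match_from_previous_outputs(prev_out, ratio=0.25)

current = rng.standard_normal((8, 4))    # toy tokens at step t
reduced = reduce_tokens(current, src, dst)
```

At a 0.25 ratio, the 8 toy tokens shrink to 6, and the two pairs with identical previous-step outputs are the ones merged.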
[Figure: DiTo method comparison, showing (a) existing input-based token reduction vs. (b) DiTo's output-based reduction using prior-step similarity as proxy]
Input-based vs. output-based token reduction. DiTo (b) uses the previous step's output similarity to select reduction targets, directly minimizing recovery error rather than input-space approximation error. 4
Two additional mechanisms complete the system. The PMR (Pair Match Ratio) metric quantifies how well the prior-step output similarity aligns with the current step's optimal matching — high PMR steps can safely use aggressive reduction; low PMR steps should skip reduction or use less. This drives an interval scheduling policy that controls when to apply matching vs. reduction across the denoising trajectory. A frequency-aware token matching penalty then addresses a separate artifact: tokens that are frequently selected as reduction targets across steps tend to create blocking artifacts. Penalizing repeated selection distributes the reduction load more evenly. 4
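The two mechanisms above can be sketched in a few lines. The PMR definition used here, the fraction of source tokens assigned the same destination by the proxy matching and by a reference matching, is one plausible reading of the description and not the paper's formula. The threshold and penalty values are hypothetical.

```python
def pair_match_ratio(proxy_match, reference_match):
    """Fraction of source tokens given the same destination by both matchings."""
    if not reference_match:
        return 0.0
    agree = sum(1 for s, d in reference_match.items() if proxy_match.get(s) == d)
    return agree / len(reference_match)


def should_reduce(pmr, threshold=0.6):
    """Hypothetical schedule rule: only reduce on steps where the proxy is reliable."""
    return pmr >= threshold


def penalized_similarity(sim_row, selection_counts, penalty):
    """Lower the score of destinations that were already picked often."""
    return [s - penalty * c for s, c in zip(sim_row, selection_counts)]


# Toy example: source token -> destination token under each matching.
proxy = {0: 1, 2: 5, 4: 3}
reference = {0: 1, 2: 3, 4: 3}
pmr = pair_match_ratio(proxy, reference)

# Destination 0 was already chosen 3 times, destination 1 never.
scores = penalized_similarity([0.90, 0.85], selection_counts=[3, 0], penalty=0.05)
best = scores.index(max(scores))
```

In the toy example the penalty moves the best match from the over-used destination to the fresh one, which is the load-spreading effect described above.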
Benchmark results on FLUX (1024×1024) at 0.25 reduction ratio: DiTo achieves PSNR 25.21 dB vs. ToMeSD's 21.59 dB (+3.62 dB). On SD3 at the same ratio: PSNR 26.83 dB vs. 23.32 dB (+3.51 dB). At the more aggressive 0.50 reduction ratio on FLUX, DiTo holds FID 33.47 while ToMeSD reaches 90.00 — a 2.7× FID gap at high compression. Latency reduction on FLUX reaches up to 18.6%, comparable to ToMA while maintaining higher quality. 4
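The reported gaps can be checked directly from the figures quoted above. The conversion from a PSNR difference to a mean-squared-error ratio uses the standard definition PSNR = 10·log10(MAX²/MSE). It treats the dataset-level PSNR as if it came from a single MSE, which is a simplification.

```python
psnr = {
    "FLUX @ 0.25": {"DiTo": 25.21, "ToMeSD": 21.59},
    "SD3 @ 0.25": {"DiTo": 26.83, "ToMeSD": 23.32},
}

# PSNR gain in dB for each model.
psnr_gain = {k: round(v["DiTo"] - v["ToMeSD"], 2) for k, v in psnr.items()}

# A gain of g dB corresponds to an MSE ratio of 10^(-g/10).
mse_ratio_flux = 10 ** (-psnr_gain["FLUX @ 0.25"] / 10)

# FID gap at the 0.50 reduction ratio on FLUX.
fid_ratio = round(90.00 / 33.47, 2)
```

Under that simplification, a gain of 3.62 dB means DiTo's reconstruction error is a bit under half of ToMeSD's.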
"DiTo consistently outperforms existing TR methods with 1.6–3.9 dB higher PSNR at comparable speedups, achieving a superior Pareto frontier." 4
Why read it: The FID collapse in ToMeSD at 0.50 ratio (FID 90 vs. DiTo's 33) shows that input-similarity matching becomes unreliable at high compression — a failure mode relevant to anyone pushing inference budgets. The output-similarity framing also has broader applicability: any attention-based acceleration method that borrows its matching criterion from ViT literature should be reconsidered with this lens.

5. Lens: Microsoft's 3.8B flow-matching T2I model trained at 19.3% of Z-Image's compute

ArXiv: 2605.21573 | Microsoft Lens Team (21 authors) | cs.CV
Peer-review status: Preprint (submitted 2026-05-20). No public code or model weights confirmed at time of writing.
Training efficiency claims in T2I research are usually vague. Lens is specific: the paper reports 192,000 A100 GPU hours for Lens vs. approximately 314,000 H800 GPU hours for Z-Image, and argues that the 19.3% compute ratio comes from three identifiable levers — model size, per-batch data information density, and convergence speed — rather than from hardware differences alone. 5
The architecture is a 3.8B parameter MMDiT-style (multi-modal Diffusion Transformer) latent diffusion model: 48 MMDiT blocks, a FLUX.2 semantic VAE (Variational Autoencoder), and a GPT-OSS (20B MoE, 3B activated parameters) language encoder. Training data is Lens-800M, an 800-million-image dataset with GPT-4.1-generated dense captions averaging 109 words each. Multi-resolution, multi-aspect-ratio training covers ratios from 1:2 to 2:1 at up to 1440² resolution. 5
[Figure: Lens MMDiT architecture, showing 48 MMDiT blocks with dual image/text branches (left) and MMDiT block detail with cross-attention structure (right)]
Lens architecture from Figure 6 of 2605.21573. Left: full 48-block stack with FLUX.2 VAE and GPT-OSS text encoder. Right: individual MMDiT block with separate image and text self-attention streams merging via cross-attention. 5
RL post-training uses Lens-RL-8K, an 8,406-prompt dataset built via taxonomy-driven sampling across semantic categories. The reward signal combines DiffusionNFT (a preference feedback model for image quality) with GPT-4.1-mini evaluating against rubric-based criteria. The paper reports that RL prompt diversity is a decisive variable: a smaller subset of Lens-RL-8K underperforms the full set, suggesting that coverage across the taxonomy matters more than raw prompt count. 5
Benchmark results: OneIG 0.557, GenEval 0.930, LongText (EN) 0.937, CVTG Avg 0.869. A distilled 4-step variant, Lens-Turbo, generates in 0.84 seconds per image on H100 vs. 3.15 seconds for the full 20-step model. Multi-language support covers English, Chinese, French, Japanese, and Spanish. 5
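For teams comparing inference budgets, the latency figures above reduce to a couple of ratios. This is plain arithmetic on the quoted numbers, with no assumption about where the time is spent.

```python
full_latency_s, full_steps = 3.15, 20    # full model on H100
turbo_latency_s, turbo_steps = 0.84, 4   # Lens-Turbo on H100

latency_speedup = round(full_latency_s / turbo_latency_s, 2)   # end-to-end speedup
step_reduction = full_steps / turbo_steps                      # reduction in step count
images_per_minute_turbo = round(60 / turbo_latency_s, 1)       # single-stream throughput
```

The end-to-end speedup (3.75×) is smaller than the reduction in step count (5×), so per-image latency does not scale purely with the number of denoising steps.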
"Lens achieves performance competitive with, and in several cases surpassing, prior state-of-the-art larger models across multiple benchmarks, while substantially reducing training cost." 5
Why read it: The efficiency story is the most interesting part. Dense captions (109 words average vs. the typical 20–40 words in datasets like LAION) and taxonomy-driven RL prompts are both choices that trade data pipeline complexity for training compute — the paper gives enough detail to evaluate whether that trade-off is reproducible. The 0.84s/image Turbo variant is also a practical deployment data point for teams benchmarking inference costs at 4-step generation.

Quick reference

| Paper | Core contribution | Institution | Peer-review status | Code |
|---|---|---|---|---|
| 2605.22765 (UDM Revisited) | Identifies leave-one-out denoiser mismatch in UDM; absorbing-state reformulation closes gap with masked diffusion | Not confirmed | Preprint | GitHub (open) |
| 2605.22723 (Lanczos Gaussian Sampler) | Full covariance matching reduces path-KL to O(1/T²); training-free LGS via Jacobian-vector products | MIT | Preprint | Not confirmed |
| 2605.22777 (DecQ) | 8 detail-condensing queries on frozen VFM encoder lift PSNR by +3.63 dB and accelerate convergence 3.3× | Zhejiang / Fudan / Westlake / JD.COM | Preprint | GitHub (open) |
| 2605.22011 (DiTo) | Output-similarity token reduction for DiTs; +3.62 dB PSNR over ToMeSD on FLUX at 0.25 ratio, 18.6% latency reduction | KAIST | Preprint | Not confirmed |
| 2605.21573 (Lens) | 3.8B MMDiT T2I model at 19.3% of Z-Image's training compute; GenEval 0.930, Lens-Turbo at 0.84s/image | Microsoft | Preprint | Not confirmed |
The connecting thread: UDM Revisited and LGS both replace a convenient default (the wrong posterior target, an isotropic covariance) with its correct counterpart; DecQ and DiTo both find that the right information source is richer than what the default pipeline uses (intermediate VFM layers, output-space similarity); and Lens quantifies, for the first time in this paper series, exactly where training efficiency comes from in a large-scale T2I system.
