Five diffusion papers worth reading today (May 18, 2026)

Today's cs.CV listing is dense. After scanning the new submissions and the week's high-signal preprints, five papers stand out on novelty, result quality, and early community traction. Here they are, in descending order of impact.

1. Asymmetric Flow Models (AsymFlow)

ArXiv: 2605.12964 | Stanford University | Code & weights released

The core observation is deceptively simple: in flow-matching models, noise prediction sits in a much higher-dimensional space than the actual data distribution warrants, especially when that data lives on a low-rank manifold. Hansheng Chen, Jan Ackermann, Minseo Kim, Gordon Wetzstein, and Leonidas Guibas at Stanford propose asymmetric velocity parameterization — noise is predicted in a low-rank subspace, while data prediction stays full-dimensional. From those asymmetric predictions the full-dimensional velocity is recovered analytically, requiring no architectural surgery or modified sampling.
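To make the "recovered analytically" step concrete, here is a minimal NumPy sketch. It rests on assumptions that are not taken from the paper: a linear path x_t = (1 - t) x0 + t eps with target velocity v = eps - x0, an orthonormal basis U for the low-rank subspace, and a recombination rule that trusts the noise prediction inside the subspace and the data prediction in its orthogonal complement. The function name `recover_velocity` and the rule itself are hypothetical; the authors' actual formula may differ.

```python
import numpy as np

def recover_velocity(x_t, t, x0_hat, eps_low, U):
    """Assumed recombination rule (illustrative, not the paper's formula).

    x_t     : (d,) noisy sample on the path x_t = (1 - t) * x0 + t * eps
    t       : scalar in (0, 1]
    x0_hat  : (d,) full-dimensional data prediction
    eps_low : (r,) noise prediction expressed in the rank-r basis U
    U       : (d, r) matrix with orthonormal columns
    """
    # Noise implied by the data prediction alone.
    eps_from_x0 = (x_t - (1.0 - t) * x0_hat) / t
    # Keep only the component of that noise lying outside the subspace.
    eps_outside = eps_from_x0 - U @ (U.T @ eps_from_x0)
    # Inside the subspace, use the low-rank noise prediction instead.
    eps_full = U @ eps_low + eps_outside
    return eps_full - x0_hat

# Sanity check with oracle predictions: the recovered velocity should
# match the true velocity eps - x0.
rng = np.random.default_rng(0)
d, r, t = 16, 4, 0.3
U, _ = np.linalg.qr(rng.standard_normal((d, r)))  # orthonormal columns
x0 = rng.standard_normal(d)
eps = rng.standard_normal(d)
x_t = (1.0 - t) * x0 + t * eps

v = recover_velocity(x_t, t, x0_hat=x0, eps_low=U.T @ eps, U=U)
v_true = eps - x0
print(np.allclose(v, v_true))
```

The point of the sketch is only that a rank-r noise prediction plus a full-dimensional data prediction carry enough information to reconstruct a full-dimensional velocity in closed form, so the sampler does not need to change.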

On ImageNet 256×256, AsymFlow reaches 1.57 FID, placing it ahead of all prior DiT- and JiT-style pixel diffusion models by a clear margin. The more striking result: the authors show the first principled route for finetuning a pretrained latent flow model into a pixel-space model. Aligning the low-rank pixel subspace to the existing latent space gives a smooth initialization, so finetuning mostly fixes low-level texture mismatches rather than relearning generation from scratch. The resulting AsymFLUX.2 klein 9B pixel model, finetuned from FLUX.2 klein, outperforms its latent base on HPSv3, DPG-Bench, and GenEval [1].

Weights are already on Hugging Face and the code is live on GitHub [2]. Community reaction on Reddit's r/StableDiffusion has been quick: multiple workflow threads appeared within 24 hours of release.

Why read it: If you've been skeptical of pixel-space generation because of training cost, this paper directly attacks that assumption. The latent-to-pixel finetuning route is novel and could be applied to any flow-matching backbone.

2. L2P: Unlocking Latent Potential for Pixel Generation

ArXiv: 2605.12013 | Nanjing University (NJU-PCALab) | Code on GitHub

Pixel diffusion models avoid the VAE bottleneck but historically require prohibitive compute to train from scratch. L2P from Zhennan Chen, Junwei Zhu, and colleagues at NJU's PCALab offers a different route: transfer the learned priors from an existing latent diffusion model directly into pixel space, requiring only 8 GPUs and a fine-tuning budget rather than a full-scale training run [3].

The mechanism leverages the smooth manifold structure of pretrained LDMs. By projecting into the pixel domain along this manifold, rather than training pixel models cold, the model inherits high-level semantic structure and bypasses the expensive from-scratch regime. The practical payoff: L2P achieves 93% of the source model's GenEval score while enabling native 4K ultra-high-resolution generation, a capability that falls out naturally once the VAE compression stage is eliminated [4].

Code and a Gradio demo are already public. The roadmap lists 8K/10K UHR generation as in progress.

Why read it: Directly practical for anyone building pixel-space generation pipelines. The 8-GPU requirement puts it within reach of university labs and mid-tier compute budgets.

3. RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations

ArXiv: 2605.15908 | Yanhao Ge, Shanyan Guan, Weihao Wang, Ying Tai, Mingyu You

Nearly every generative model, whether latent or pixel-space, is trained on fixed-resolution grids and inherits resolution as a hard hyperparameter. RaPD sidesteps this by performing diffusion in a continuous Neural Image Field (NIF) latent space [5]. Rather than encoding an image into a fixed-size tensor, a semantics-enriched implicit representation captures the image as a continuous function, allowing generation and decoding at any target resolution without retraining.
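The idea of an image as a continuous function can be sketched with a toy coordinate network. This is an illustration of implicit image representations in general, not RaPD's architecture: the helper names `make_field` and `decode` are hypothetical, the weights are random stand-ins for a trained decoder, and the latent is a random vector where RaPD would use a sample produced by its diffusion process.

```python
import numpy as np

def make_field(latent_dim=8, hidden=32, seed=0):
    """Build a tiny random coordinate MLP: (x, y, latent) -> RGB."""
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((latent_dim + 2, hidden)) / np.sqrt(latent_dim + 2)
    W2 = rng.standard_normal((hidden, 3)) / np.sqrt(hidden)

    def field(latent, coords):
        # coords: (N, 2) with values in [-1, 1]; latent: (latent_dim,)
        z = np.broadcast_to(latent, (coords.shape[0], latent.shape[0]))
        h = np.tanh(np.concatenate([coords, z], axis=1) @ W1)
        return 1.0 / (1.0 + np.exp(-(h @ W2)))  # RGB values in (0, 1)

    return field

def decode(field, latent, height, width):
    """Query the same field on a grid of any size."""
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    coords = np.stack([xx.ravel(), yy.ravel()], axis=1)
    return field(latent, coords).reshape(height, width, 3)

field = make_field()
latent = np.random.default_rng(1).standard_normal(8)

low = decode(field, latent, 32, 32)      # small preview
high = decode(field, latent, 256, 384)   # larger, different aspect ratio
print(low.shape, high.shape)
```

Because resolution only enters through the query grid, the same latent decodes to both outputs, and pixels that share a coordinate (the corners, for instance) get the same color at either size.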

The paper addresses the reconstruction-generation gap that plagues prior resolution-agnostic approaches — a mismatch between what the model can reconstruct from compressed representations and what it can synthesize during generation. The semantics-enriched design closes this gap by tying the implicit representation to high-level visual content, not just pixel-level statistics.

Why read it: Resolution-agnostic generation is a practical problem that rarely gets clean solutions. If the approach generalizes well to downstream applications (editing, SR, inpainting at arbitrary scales), RaPD could become a useful building block.

4. Infinite Mask Diffusion for Few-Step Distillation (IMDM)

ArXiv: 2605.10518 | KAIST | Code available

Masked Diffusion Models (MDMs) have attracted attention as an autoregressive alternative for discrete text generation, offering parallel decoding with bidirectional context. Their main limitation: many sampling steps are required because simultaneous token updates introduce factorization errors. Jaehoon Yoo, Wonjung Kim, Chanhyuk Lee, and Seunghoon Hong at KAIST prove that standard MDMs face a theoretical lower bound on factorization error that cannot be broken by training alone, because they use a single deterministic mask state [6].
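Factorization error is easiest to see on a two-token toy. The sketch below is not from the paper; it assumes a vocabulary {A, B}, a target distribution in which only "AA" and "BB" occur, and a single decoding step from a fully masked input, so each position can only be sampled from its own marginal.

```python
import numpy as np

# Target joint over (first token, second token); rows and columns are A, B.
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])

# One-step parallel decoding samples each position independently,
# so the best it can do is the product of the two marginals.
p_first = joint.sum(axis=1)
p_second = joint.sum(axis=0)
factorized = np.outer(p_first, p_second)

# Total variation distance between the target and the factorized sampler.
tv_error = 0.5 * np.abs(joint - factorized).sum()

# Probability mass placed on sequences the target never produces ("AB", "BA").
invalid_mass = factorized[0, 1] + factorized[1, 0]

print(factorized)
print(tv_error, invalid_mass)  # 0.5 0.5
```

Both marginals are perfectly learned here, yet half the samples are invalid. No amount of training on the marginals fixes this, which is the intuition behind a lower bound that only more steps (or a different mask design) can get around.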

Their fix: replace the binary mask with a stochastic infinite-state mask (IMDM). This breaks the lower bound while preserving compatibility with pretrained MDM weights. On a synthetic few-step generation task where standard MDMs fail entirely, IMDM finds an efficient solution. When paired with appropriate distillation, IMDM surpasses existing few-step distillation baselines on LM1B and OpenWebText at small step counts.
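For intuition on why a stochastic mask state can help, consider a toy with two tokens that must always agree (only "00" or "11" is valid). The construction below is an interpretation for illustration, not the paper's mechanism: it assumes the mask carries a random value `z` that every position can read through bidirectional context, so one-step outputs can be coordinated. All function names are hypothetical.

```python
import numpy as np

def decode_deterministic_mask(rng):
    # Every position sees the same fixed mask symbol, so each token is
    # drawn independently from its marginal (uniform over {0, 1}).
    return int(rng.integers(2)), int(rng.integers(2))

def decode_stochastic_mask(rng):
    # Assumed: the mask state is a random value visible to all positions.
    z = rng.random()
    token = 0 if z < 0.5 else 1
    # Both positions apply the same deterministic map to z.
    return token, token

def invalid_rate(decoder, n=10_000, seed=0):
    """Fraction of one-step samples where the two tokens disagree."""
    rng = np.random.default_rng(seed)
    pairs = [decoder(rng) for _ in range(n)]
    return sum(a != b for a, b in pairs) / n

rate_det = invalid_rate(decode_deterministic_mask)
rate_sto = invalid_rate(decode_stochastic_mask)
print(rate_det, rate_sto)
```

With the fixed mask, roughly half of the one-step samples are invalid; with the shared random state, none are, and each valid sequence still appears about half the time. The randomness has moved from the per-token sampling step into the mask itself.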

Why read it: The theoretical argument — proving a hard lower bound and then constructing a mechanism to escape it — is clean and transferable. The technique could propagate to image-domain masked diffusion variants (e.g., Meissonic, MAGE) facing the same distillation bottleneck.

5. DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer

ArXiv: 2605.15682 | CVPR 2026

When diffusion-based SR models handle very high-resolution images through patch-wise inference, a familiar failure mode appears: the global text prompt describes the whole image, but each patch only sees a fragment, causing the model to over-generate textures inconsistent with neighboring patches. DreamSR addresses this with a dual-branch MM-ControlNet [7][8].

One ControlNet branch generates local textual features using patch-level prompts; the pretrained DiT branch handles global prompts. These two streams are fused to keep semantic consistency across patches while suppressing patch-level hallucination. A Receptive-Field Enhancement training strategy improves the model's ability to capture patch context and restore local texture. The result is visually faithful high-resolution output at scales where prior diffusion SR methods tend to degrade or hallucinate.
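The patch-wise setup that motivates this design can be sketched as a tiling plan in which every patch carries both the global prompt and its own local prompt. This is a generic illustration, not DreamSR's code: `PatchJob`, `tile_starts`, `plan_patches`, and the `describe_patch` callback (a stand-in for whatever produces patch-level prompts) are all hypothetical names, and the tile and stride values are arbitrary.

```python
from dataclasses import dataclass

@dataclass
class PatchJob:
    top: int
    left: int
    height: int
    width: int
    global_prompt: str   # describes the whole image
    local_prompt: str    # describes only this patch

def tile_starts(size, tile, stride):
    """Start offsets of overlapping tiles covering [0, size).

    Assumes stride <= tile so neighboring tiles overlap or touch.
    """
    if size <= tile:
        return [0]
    starts = list(range(0, size - tile, stride))
    starts.append(size - tile)  # last tile is flush with the border
    return starts

def plan_patches(height, width, tile, stride, global_prompt, describe_patch):
    h, w = min(tile, height), min(tile, width)
    jobs = []
    for top in tile_starts(height, tile, stride):
        for left in tile_starts(width, tile, stride):
            local = describe_patch(top, left, h, w)
            jobs.append(PatchJob(top, left, h, w, global_prompt, local))
    return jobs

jobs = plan_patches(
    1024, 1536, tile=512, stride=384,
    global_prompt="a harbor at dusk",
    describe_patch=lambda t, l, h, w: f"region at ({t}, {l})",
)
print(len(jobs))  # 12
```

Feeding each patch only `global_prompt` reproduces the failure mode described above; keeping the two prompts separate is what lets a model route them to different branches and fuse them later.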

Why read it: DreamSR is a CVPR 2026 paper with a clean architectural answer to a long-standing problem in diffusion-based SR. The dual-branch design is modular and could be adapted to other patch-wise generation scenarios.

Quick reference

Paper	Key idea	Venue/status	Code
AsymFlow	Rank-asymmetric velocity parameterization; latent-to-pixel finetuning	arXiv May 2026	GitHub + HF weights
L2P	Transfer LDM priors to pixel space with 8 GPUs; 4K generation	arXiv May 2026	GitHub
RaPD	Diffusion in continuous NIF latent space; any-resolution generation	arXiv May 2026	—
IMDM	Stochastic infinite-state mask breaks MDM factorization error bound	arXiv May 2026	GitHub
DreamSR	Dual-branch MM-ControlNet for patch-wise SR; receptive-field training	CVPR 2026	—