
ArXiv Diffusion Models Digest
2026-05-19 22:43:20 | NeoDrop Official
Five diffusion papers worth reading today (May 19, 2026)
Today's top five: SURGE (ICML 2026) brings gradient-free inference guidance via Girsanov reweighting; FMwC from Max Welling's group at UvA adds per-sample confidence to flow matching without ensemble cost; GeoFlow uses RL finetuning to enforce geometric consistency in video diffusion; Dual-Rate Diffusion from Hoogeboom's lab achieves 2–4× speedup via interleaved heavy-light networks; and Drift Flow Matching unifies single-step and multi-step generation in one framework.
Research brief
This digest covers a slightly extended window of roughly 29 hours (2026-05-18T14:12Z to 2026-05-19T19:00Z), capturing 8 new diffusion / flow-matching preprints. Five cleared the bar on novelty, author signal, and practical relevance. No numeric baseline comparisons are available yet: all were posted under 48 hours ago and the full experimental tables remain behind ArXiv HTML timeouts. Quantitative claims below are therefore drawn from paper abstracts and the papers.cool summary pages, with gaps explicitly noted.
1. SURGE: gradient-free inference-time scaling via Girsanov reweighting
ArXiv: 2605.18745 | ICML 2026 accepted | Lifu Wei, Yinuo Ren, Naichen Shi, Yiping Lu
Peer-review status: Accepted, ICML 2026.
Every inference-time guidance method in the diffusion literature (classifier guidance, Doob's h-transform, MPGD, i.e. Manifold Preserving Guided Diffusion) shares a core requirement: it needs a score function, a Hessian, or some PDE evaluation at each step. That coupling between guidance and differentiability has been treated as a given.
SURGE (the paper's arXiv title uses URGE, Unbiased Resampling via Girsanov Estimation, as the algorithm name) breaks that assumption [1]. The key tool is the Girsanov theorem, which gives the likelihood ratio between two continuous-time stochastic processes without evaluating either process's score. Concretely: for each simulated diffusion trajectory, the algorithm attaches a scalar multiplicative weight derived from the Girsanov change of measure, then periodically resamples trajectories according to those weights. This is a sequential Monte Carlo step that requires no gradient computation anywhere.
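The weight-then-resample loop described above can be illustrated on a toy problem. What follows is a minimal sketch under stated assumptions, not the paper's algorithm: the function name `girsanov_smc`, the two drift callables, and the 1D SDE are all hypothetical. Particles are simulated under a proposal drift with Euler-Maruyama, each step adds the discretized Girsanov log-likelihood ratio toward a target drift, and particles are periodically resampled in proportion to their weights. Only drift evaluations are used; no gradient is taken anywhere.

```python
import numpy as np

def girsanov_smc(drift_target, drift_proposal, sigma, n_particles=20000,
                 n_steps=50, T=1.0, resample_every=10, seed=0):
    """Toy 1D SMC: simulate under the proposal drift, reweight toward the target drift."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.zeros(n_particles)
    log_w = np.zeros(n_particles)
    for step in range(1, n_steps + 1):
        dW = rng.normal(0.0, np.sqrt(dt), size=n_particles)
        # Discretized Girsanov log-ratio d(target)/d(proposal) for this step.
        theta = (drift_target(x) - drift_proposal(x)) / sigma
        log_w += theta * dW - 0.5 * theta**2 * dt
        # Euler-Maruyama step under the proposal dynamics.
        x = x + drift_proposal(x) * dt + sigma * dW
        if step % resample_every == 0 and step < n_steps:
            w = np.exp(log_w - log_w.max())
            w /= w.sum()
            idx = rng.choice(n_particles, size=n_particles, p=w)
            x = x[idx]                       # multinomial resampling
            log_w = np.zeros(n_particles)    # weights reset after resampling
    w = np.exp(log_w - log_w.max())
    return x, w / w.sum()

# Proposal: driftless Brownian motion. Target: constant drift 1, so E[X_T] = 1 at T = 1.
particles, weights = girsanov_smc(lambda x: np.ones_like(x),
                                  lambda x: np.zeros_like(x), sigma=1.0)
weighted_mean = float(np.sum(weights * particles))
```

In this toy setting the weighted particle mean should land near the target mean of 1.0 even though every trajectory was simulated without the target drift. How the paper constructs its reference and target processes for a diffusion sampler is not specified in the abstract.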
The paper also establishes a theoretical result with independent value: path-level SMC (Sequential Monte Carlo) and particle-level SMC are equivalent for this reweighting scheme. The Girsanov path weights recover particle-level weights through backward conditional expectations, unifying two framings that were previously treated as separate approaches.
On synthetic tests and diffusion-model benchmarks, SURGE reportedly outperforms existing inference-time guidance baselines in generation quality while being simpler to implement [1]. Specific benchmark names, metrics, and baseline numbers are not available from the abstract alone.
Code/resources: No public repository linked at time of writing; ICML 2026 camera-ready may include one.
Why read it: The gradient requirement is the main barrier to applying inference-time guidance to black-box or non-differentiable reward models. SURGE removes that barrier entirely. The theoretical path/particle SMC equivalence is a clean result that will likely be cited regardless of whether practitioners adopt the specific algorithm.
2. Flowing with Confidence (FMwC): per-sample confidence for flow matching at standard cost
ArXiv: 2605.18472 | University of Amsterdam | Friso de Kruiff, Dario Coscia, Max Welling, Erik Bekkers
Peer-review status: Preprint (submitted 2026-05-18).
Flow matching models have no native mechanism for saying "I'm unsure about this sample." Getting that signal currently means running k independent trajectories — ensemble or stochastic rollouts — at k× the compute cost, and what you measure is disagreement between runs, not the model's confidence in a single prediction.
FMwC addresses this directly [2][3]. The method injects input-dependent multiplicative noise into selected network layers (not the standard additive noise used in Bayesian neural network approximations). The noise variance then propagates through the network in closed form, with no Monte Carlo sampling inside the forward pass, and is integrated along the ODE trajectory, producing a per-sample confidence score at the end of a single standard inference call.
Three demonstrated applications: filtering (discard low-confidence outputs to raise average image quality and crystal thermodynamic stability), editing (backtrack the ODE trajectory to the timestep where the model first became uncertain, then redirect), and adaptive step sizing (concentrate ODE solver steps in regions where the velocity field is ambiguous). The paper also reports that the confidence score correlates with the divergence magnitude of the learned velocity field, a connection that offers a new handle on interpreting what a flow model has and has not learned [2].
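To make "closed-form variance propagation" concrete, here is a minimal sketch for a single linear layer, assuming independent activations and independent zero-mean multiplicative noise. The function `noisy_linear_moments` and the fixed noise variances `alpha` are hypothetical; in the paper the noise is input-dependent and the exact layer treatment is not described in the abstract. For y = W (h ⊙ (1 + ε)) with Var(ε_i) = α_i, the output variance follows analytically, and the Monte Carlo run below is there only to check the formula.

```python
import numpy as np

def noisy_linear_moments(mean_in, var_in, W, alpha):
    """Mean and variance of y = W @ (h * (1 + eps)), assuming independent h_i and eps_i."""
    # Var[h_i (1 + eps_i)] = var_i + alpha_i * (mean_i^2 + var_i)
    var_noisy = var_in + alpha * (mean_in**2 + var_in)
    mean_out = W @ mean_in
    var_out = (W**2) @ var_noisy   # independent inputs: variances add with squared weights
    return mean_out, var_out

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
h = rng.normal(size=4)
alpha = np.full(4, 0.05)           # hypothetical per-unit noise variances
mean_out, var_out = noisy_linear_moments(h, np.zeros(4), W, alpha)

# Monte Carlo check of the closed form (not needed at inference time).
eps = rng.normal(0.0, np.sqrt(alpha), size=(200_000, 4))
samples = (h * (1.0 + eps)) @ W.T
mc_var = samples.var(axis=0)
```

The analytic pass costs about as much as one ordinary forward pass through the layer, which is the property the single-inference-call claim depends on. Nonlinearities would need their own moment approximations, which this sketch does not attempt.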
Specific FID or IS numbers versus baselines are not reported in the abstract.
Code/resources: No repository found; the paper was posted under 24 hours before this digest.
Why read it: Per-sample uncertainty without ensemble cost is a widely wanted property that has resisted clean solutions in the flow-matching literature. The closed-form variance propagation approach is technically distinct from prior work and the three application modes are concrete enough to evaluate. Max Welling's group at University of Amsterdam has a strong track record in principled probabilistic inference — the theoretical framing here is likely rigorous.
3. GeoFlow: RL finetuning turns geometric consistency from an emergent property into an optimization target
ArXiv: 2605.18365 | Jan Ackermann, Shengqu Cai, Boyang Deng, Zhengfei Kuang, Songyou Peng et al. | Code and weights released
Peer-review status: Preprint (submitted 2026-05-18).
Video diffusion models can generate plausible-looking footage that fails basic geometric tests: background pixels drift in ways no real camera motion would produce, and independently moving objects change appearance between frames. These artifacts are not bugs in the loss function — they emerge from the fact that geometric consistency was never an explicit training objective.
GeoFlow reframes this as an RL finetuning problem [4]. The authors construct a geometric consistency reward function that measures whether generated video motion is compatible with a coherent 3D scene: background motion should be explainable by a rigid camera transform, and independently moving objects should maintain appearance consistency along their motion trajectories. The reward is computed using optical flow, depth-and-pose prediction, and feature-based correspondence matching to separate rigid and dynamic regions. DDPO (Denoising Diffusion Policy Optimization), paired with a DDIM sampler, is then used to finetune a CogVideoX-2B backbone against this reward.
The paper reports a significant reduction in temporal geometric artifacts versus strong baselines across diverse dynamic scenes, covering both camera motion and independently moving objects, while maintaining perceptual quality [4]. Metrics used include MEt3R (a multi-view reconstruction consistency score) and Sampson error (a geometric reprojection residual measuring epipolar constraint violations); specific numeric comparisons are not available from the abstract.
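For readers unfamiliar with the Sampson error, the sketch below computes the standard textbook form for point correspondences under a fundamental matrix F. It assumes pixel coordinates and a known F; how GeoFlow obtains correspondences and F, and how it aggregates the errors, is not stated in the abstract. The function name `sampson_error` is illustrative.

```python
import numpy as np

def sampson_error(F, x1, x2):
    """First-order epipolar error for correspondences x1 <-> x2, each of shape (N, 2)."""
    ones = np.ones((x1.shape[0], 1))
    p1 = np.hstack([x1, ones])          # homogeneous coordinates in frame 1
    p2 = np.hstack([x2, ones])          # homogeneous coordinates in frame 2
    Fp1 = p1 @ F.T                      # row i is F @ p1_i
    Ftp2 = p2 @ F                       # row i is F.T @ p2_i
    num = np.sum(p2 * Fp1, axis=1) ** 2 # (p2^T F p1)^2
    den = Fp1[:, 0]**2 + Fp1[:, 1]**2 + Ftp2[:, 0]**2 + Ftp2[:, 1]**2
    return num / den

# Pure sideways camera translation: matching points must share the same y coordinate.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
x1 = np.array([[10.0, 5.0], [10.0, 5.0]])
x2 = np.array([[13.0, 5.0], [13.0, 6.0]])   # second match is off by one pixel in y
errors = sampson_error(F, x1, x2)           # -> [0.0, 0.5]
```

A background pixel that drifts off its epipolar line, as in the second correspondence, gets a nonzero error, which is the kind of violation the metric is meant to flag.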
Code/resources: Code and model weights are publicly released. The specific GitHub URL was not included in the abstract; check the paper's ArXiv page or project page linked from the PDF.
Why read it: The approach is model-agnostic — the geometric reward function and DDPO finetuning loop can be applied to any video diffusion backbone, not just CogVideoX. The fact that code and weights are already live makes this immediately testable. This is also one of the first clean demonstrations of RL-based geometric reward shaping for video diffusion, as opposed to the more common approach of architectural modifications or self-supervised consistency losses.
4. Dual-Rate Diffusion: interleaved heavy-light networks for 2–4× speedup
ArXiv: 2605.18190 | Grigory Bartosh, David Ruhe, Emiel Hoogeboom et al.
Peer-review status: Preprint (submitted 2026-05-18).
Diffusion model acceleration research has converged on a few standard playbooks: distillation, step count reduction, quantization, token merging. All of them either shrink the number of denoising steps or compress per-step computation. Dual-Rate Diffusion takes a different angle: it questions whether global context needs to be recomputed at every step at all [5].
The architecture pairs a heavy, high-capacity context encoder with a lightweight denoising model. The context encoder runs infrequently — it extracts rich, high-dimensional feature representations — and the lightweight denoiser runs at every step, reusing those cached features for fine-grained refinement. The frequency mismatch (sparse heavy + dense light) is the source of the speedup. The analogy to video codec design is direct: the heavy encoder works like an I-frame, the lightweight denoiser like P-frames that only compute incremental updates.
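The sparse-heavy, dense-light schedule can be sketched as a sampling loop. This is an illustration of the scheduling idea only: `dual_rate_sample`, `refresh_every`, and the two toy callables are hypothetical stand-ins, and the paper's actual refresh schedule and network interfaces are not given in the abstract.

```python
import numpy as np

def dual_rate_sample(x, heavy_encoder, light_denoiser, n_steps=8, refresh_every=4):
    """Run the light denoiser every step; refresh the cached heavy context only occasionally."""
    calls = {"heavy": 0, "light": 0}
    context = None
    for step in range(n_steps):
        t = 1.0 - step / n_steps
        if step % refresh_every == 0:
            context = heavy_encoder(x, t)    # expensive, infrequent
            calls["heavy"] += 1
        x = light_denoiser(x, t, context)    # cheap, every step, reuses cached context
        calls["light"] += 1
    return x, calls

# Toy stand-ins so the loop runs; real networks would replace both callables.
def toy_heavy(x, t):
    return x.mean()

def toy_light(x, t, context):
    return x - 0.1 * (x - context)

x_out, calls = dual_rate_sample(np.random.default_rng(0).normal(size=16),
                                toy_heavy, toy_light)
# calls == {"heavy": 2, "light": 8}
```

With these settings the heavy network runs on 2 of 8 steps. The realized speedup then depends on the relative cost of the two networks, which the abstract does not break down.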
On ImageNet, Dual-Rate Diffusion matches standard baseline performance at 2–4× lower compute cost [5]. The FID and IS values for the baseline and the Dual-Rate model are not reported in the abstract. The approach is also compatible with Moment Matching Distillation (MMD), a few-step distillation technique, allowing the two speedup mechanisms to compound.
Code/resources: No repository found at time of writing.
Why read it: The compute decomposition idea — separating global context extraction from per-step denoising — is architecturally novel in diffusion. If the ImageNet results hold across architectures and resolutions, this is a practical speedup technique that works orthogonally to distillation. The compatibility with MMD is a useful design choice: practitioners who already use distillation can layer Dual-Rate on top.
5. Drift Flow Matching: bridging single-step and multi-step generation
ArXiv: 2605.17244 | Chenrui Ma, Xi Xiao, Lin Zhao, Tianyang Wang, Ferdinando Fioretto, Yanning Shen
Peer-review status: Preprint (submitted ~2026-05-17).
Two previously separate families of generative models have occupied different ends of the efficiency-quality tradeoff. Drift Models learn direct transport maps and generate in one step — fast but with limited iterative refinement. Flow Matching models (and diffusion models more generally) iterate across many steps — slower but with quality that scales with compute. Until now, the two have been developed as competing paradigms rather than a unified design space.
Drift Flow Matching (DFM) proposes a single framework that contains both as special cases [6][7]. The core mechanism preserves the direct transport map efficiency of Drift Models while adding the ability to run iterative refinement steps when generation quality needs to improve. A single trained DFM model can then operate anywhere from 1 step to N steps, with the compute budget chosen at inference time to match the quality-efficiency requirements of the deployment context.
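What "step count as a runtime parameter" looks like at the interface level can be shown with a generic Euler sampler over a velocity field. This is not DFM itself, whose formulation the abstract does not spell out; it is a plain flow-matching-style sketch with a hypothetical toy velocity field whose trajectories are straight lines, so that one step and many steps reach the same endpoint.

```python
import numpy as np

def sample(velocity, x0, n_steps):
    """Euler-integrate dx/dt = velocity(x, t) from t=0 to t=1 with a caller-chosen step count."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt)
    return x

target = np.array([2.0, -1.0])   # hypothetical data point the toy field transports to

def toy_velocity(x, t):
    # Velocity of the straight path from the current point to `target`, arriving at t=1.
    return (target - x) / (1.0 - t)

one_step = sample(toy_velocity, np.zeros(2), n_steps=1)
many_steps = sample(toy_velocity, np.zeros(2), n_steps=16)
```

Both calls use the same "model" and differ only in the `n_steps` argument. A learned field is generally not perfectly straight, which is where extra refinement steps would be expected to help.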
The paper reports that extensive experiments across different tasks and datasets confirm the framework's effectiveness and generality. Specific dataset names, metrics, and baseline comparisons are not available from the abstract.
Code/resources: No repository found at time of writing.
Why read it: The single-step vs. multi-step tradeoff sits at the center of inference compute debates that have widened since large-scale models made step cost economically significant. DFM's unified design space — where step count is a runtime parameter rather than a training-time choice — is a conceptually clean answer to that debate. The connection to consistency models (which distill diffusion into 1–2 steps from a fixed base) and rectified flows (which straighten ODE trajectories for fewer required steps) is worth working through when reading the paper.
Quick reference
| Paper | Core idea | Status | Code |
|---|---|---|---|
| SURGE (2605.18745) | Gradient-free inference-time scaling via Girsanov path weights + SMC resampling | ICML 2026 accepted | Not yet public |
| FMwC (2605.18472) | Per-sample confidence for flow matching via closed-form variance propagation | Preprint | Not yet public |
| GeoFlow (2605.18365) | RL (DDPO) finetuning with geometric consistency reward; model-agnostic | Preprint | Released |
| Dual-Rate Diffusion (2605.18190) | Interleaved heavy context encoder + light denoiser; 2–4× speedup on ImageNet | Preprint | Not yet public |
| Drift Flow Matching (2605.17244) | Unified framework for 1-step drift models and N-step flow matching | Preprint | Not yet public |
References
- [1] SURGE: Approximation-free Training Free Particle Filter for Diffusion Surrogate
- [2] Flowing with Confidence
- [3] Flowing with Confidence (arXiv abstract)
- [4] GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation
- [5] Dual-Rate Diffusion: Accelerating diffusion models with an interleaved heavy-light network
- [6] Drift Flow Matching
- [7] Drift Flow Matching (arXiv abstract)