
July 1, 2026 · 09:24
Five diffusion papers: July 1, 2026
Today’s digest prioritizes five diffusion and flow-matching papers from the July 1 arXiv window: Cross-Space Distillation, Entropy-Controlled Flow Matching (ECFM), OTCache, Multi-Block Diffusion Language Models (MBD-LMs), and SyncCache.
This issue covers the arXiv window from June 30, 09:20 to July 1, 09:00 in the channel's display time. The window is slightly shorter than a normal daily cycle because the previous issue closed after 09:00.
The ranking below favors four signals: method novelty, relevance to active diffusion-model research, venue or lab signal, and concrete evidence in the available paper record. Today's strongest cluster is practical: three of the five selected papers try to make diffusion or diffusion-like models smaller, faster, or easier to deploy. The theory pick matters because it asks what flow matching must guarantee before those faster systems can be trusted.
Speed-read table
| # | Paper | First-read reason | Evidence strength |
|---|---|---|---|
| 1 | Cross-Space Distillation | It breaks the shared-latent-space assumption in one-step distillation and reports SD 1.5 HPSv3 improving from 5.4 to 9.4 while preserving one-step inference. 1 | Strong practical claim, ECCV 2026 signal, and directly relevant teacher-student setup. |
| 2 | Entropy-Controlled Flow Matching | It turns mode coverage into an explicit entropy-rate constraint and connects flow matching to a Schrödinger bridge formulation. 2 | Strong theory contribution; evidence is mathematical rather than benchmark-led. |
| 3 | OTCache | It uses optimal-transport-inspired cache schedule modeling and reports 4.5× FLUX.1, 4.7× Qwen-Image, and 3.66× HunyuanVideo acceleration. 3 | Strong systems evidence across three model families, plus ECCV 2026 and public code. |
| 4 | Multi-Block Diffusion Language Models | It extends block diffusion language models to multi-block inference and reports TPF rising from 3.47 to 6.19, a 78.4% gain. 4 | Strong for diffusion-LM readers; less central for image/video researchers. |
| 5 | SyncCache | It targets audio-driven portrait animation and reports 4.12× HunyuanVideo-Avatar and 3.75× Wan-S2V acceleration with near-lossless visual fidelity. 5 | Strong application-specific acceleration paper; scope is narrower than OTCache. |
1. Cross-Space Distillation: large teachers, compact one-step students
Decision: open this first if your work touches one-step diffusion distillation, deployable text-to-image models, or the practical gap between modern high-capacity teachers and older compact students. Cross-Space Distillation is the clearest choice for the top slot because it attacks a constraint that has kept SD 3.5 and Flux-style teachers from transferring cleanly into SD 1.5-scale one-step students. 1
What it does: the paper proposes a Cross-Space Distillation paradigm in which teacher and student models can use different latent spaces, including different resolutions and VAE parameterizations. The bridge is a lightweight latent interface: the authors freeze the student VAE decoder as a spatial prior and learn a compact projector between the teacher-side and student-side spaces. 1
Technical read: the important claim is that one-step distillation does not need a shared latent space. The paper reports that SD 1.5 improves from 5.4 to 9.4 HPSv3 while keeping one-step inference, low latency, and broad SD 1.5 ecosystem compatibility. 1 That combination is why this ranks above the caching papers: it is not just faster sampling, but a route for using better teachers without abandoning a deployable student base.
Evidence and limits: the venue signal is strong because the paper is listed as accepted to ECCV 2026, the European Conference on Computer Vision. 1 The available summary does not list public code or a complete metric table. Readers should inspect whether the HPSv3 gain transfers across prompt categories and whether the bridge adds training complexity that offsets the one-step inference benefit.
2. Entropy-Controlled Flow Matching: mode coverage as a constraint, not a hope
Decision: read Entropy-Controlled Flow Matching if your work depends on flow matching guarantees, mode coverage, or theoretical links between deterministic transport and stochastic control. This is the most theory-heavy pick, and it earns the second slot because it formalizes a failure mode that faster generators can otherwise hide.
What it does: ECFM imposes an entropy-rate budget on the continuity-equation path, written as d/dt H(μ_t) ≥ -λ, while minimizing the L2 distance to a reference drift. The paper states that the resulting problem is convex in Wasserstein space and has a KKT/Pontryagin optimality system. 2
Technical read: the paper connects that constrained variational principle to a Schrödinger bridge representation under a Brownian reference law. It also says ECFM recovers entropic optimal-transport geodesics in the pure transport regime and Γ-converges to classical optimal transport as λ approaches 0. 2 For readers, the point is not the acronym; it is the shift from measuring only trajectory fit to enforcing an information-geometry condition along the path.
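For readers who prefer the entropy-rate budget written out, the block below is a sketch of the constrained problem as described in the summary, not the paper's exact statement. The symbols v_t (learned drift) and b_t (reference drift), the unit time horizon, the μ_t-weighted L2 norm, and the reading of H as the differential entropy of a smooth density ρ_t are all notational assumptions made here.

```latex
\begin{aligned}
\min_{(\mu_t,\, v_t)} \quad
  & \int_0^1 \!\! \int \lVert v_t(x) - b_t(x) \rVert^2 \, d\mu_t(x)\, dt \\
\text{s.t.} \quad
  & \partial_t \mu_t + \nabla \cdot (\mu_t v_t) = 0, \\
  & \frac{d}{dt} H(\mu_t) \;\ge\; -\lambda .
\end{aligned}

% Standard identity, assuming \mu_t has a smooth density \rho_t that decays
% fast enough to integrate by parts, with H(\mu_t) = -\int \rho_t \log \rho_t \, dx:
\frac{d}{dt} H(\mu_t)
  = -\int \partial_t \rho_t \,(\log \rho_t + 1)\, dx
  = \int (\nabla \cdot v_t)(x)\, d\mu_t(x).
```

Under those assumptions the budget says the average divergence of the drift cannot fall below -λ, so the flow cannot contract probability mass faster than the allowed rate at any point along the path.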
Evidence and limits: the paper reports certificate-style mode-coverage guarantees, density-floor bounds, Lipschitz stability, and counterexamples in which unconstrained flow matching can have near-optimal loss while collapsing semantic modes. 2 It is also a single-author paper by Chika Maduabuchi of William & Mary and is listed as accepted to ECCV 2026. 2 The limit is practical translation: read the algorithm section carefully before assuming the certificate-style guarantees will be easy to carry into large-scale image or video training.
3. OTCache: cache schedules as geometry, not per-budget hacks
Decision: read OTCache first if you work on diffusion inference systems, DiT caching, or production serving budgets for image and video generation. It is highly actionable and reports numbers across multiple backbones, but it ranks just below ECFM because its contribution stays within caching policy design.
What it does: OTCache is a training-free three-stage framework that uses optimal transport to predict cache schedules for diffusion models. Stage 1 uses a graph-based method under a conservative budget to get a high-fidelity reference schedule; Stage 2 uses Optuna and CMA-ES under an extreme low budget to get an anchor schedule; Stage 3 lifts both schedules into continuous warping curves and interpolates them in Wasserstein space. 3
Technical read: the useful idea is that optimal cache schedules across NFE budgets may share a smooth structural relationship. OTCache treats those schedules as points along a policy-space trajectory rather than independent black-box searches for every target budget. 3 The authors frame this as an answer to a specific failure: graph-based caching methods can rely on additive independence assumptions that break down in low-NFE regimes. 3
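The idea of interpolating schedules in Wasserstein space can be illustrated in one dimension, where the Wasserstein-2 geodesic between two distributions is a linear blend of their quantile functions. The sketch below is not OTCache's implementation: the function names, the piecewise-linear quantile curves, and the linear mapping from target budget to blend weight are all assumptions chosen for illustration.

```python
def quantile(points, q):
    """Piecewise-linear quantile curve through sorted points, for q in [0, 1]."""
    pts = sorted(points)
    if len(pts) == 1:
        return float(pts[0])
    pos = q * (len(pts) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(pts) - 1)
    frac = pos - lo
    return pts[lo] * (1 - frac) + pts[hi] * frac


def interpolate_schedule(reference, anchor, num_steps, budget):
    """Blend two schedules of full-compute step indices for a target budget.

    reference: high-budget schedule (more full-compute steps).
    anchor:    low-budget schedule (fewer full-compute steps).
    The blend weight is linear in the budget (an assumption of this sketch).
    """
    if len(reference) <= len(anchor):
        raise ValueError("reference must use more steps than anchor")
    # Normalize step indices to [0, 1] so both schedules share one axis.
    ref_t = [s / (num_steps - 1) for s in reference]
    anc_t = [s / (num_steps - 1) for s in anchor]
    alpha = (budget - len(anchor)) / (len(reference) - len(anchor))
    alpha = min(1.0, max(0.0, alpha))
    steps = []
    for k in range(budget):
        q = k / (budget - 1) if budget > 1 else 0.0
        # 1D displacement interpolation: blend the two quantile curves.
        t = (1 - alpha) * quantile(anc_t, q) + alpha * quantile(ref_t, q)
        steps.append(round(t * (num_steps - 1)))
    # Rounding can merge neighbors, so the result may be shorter than budget.
    return sorted(set(steps))


reference = list(range(0, 50, 2))   # hypothetical 25-step reference schedule
anchor = [0, 5, 15, 30, 49]         # hypothetical 5-step anchor schedule
print(interpolate_schedule(reference, anchor, num_steps=50, budget=12))
```

At the two endpoint budgets the function returns the anchor and reference schedules unchanged, and intermediate budgets slide each full-compute step along a straight line between its two endpoint positions.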
Evidence and limits: the reported acceleration is concrete: 4.5× on FLUX.1 [dev], 4.7× on Qwen-Image, and 3.66× on HunyuanVideo, with LPIPS used in the schedule-search objective. 3 The paper is listed as ECCV 2026 accepted, and the authors report code at github.com/UnicomAI/OTCache. 3 The adoption question is whether a team can afford the offline search and reference-schedule construction for every model family it serves.
4. Multi-Block Diffusion Language Models: more parallelism for diffusion LMs
Decision: read MBD-LMs if you work on diffusion language models (DLMs), masked decoding, or serving engines for semi-autoregressive text generation. Image researchers can skim it; DLM researchers should not.
What it does: Multi-Block Diffusion Language Models extend Block Diffusion Language Models from single-block inference to multi-block inference. The proposed Multi-block Teacher Forcing method trains bounded noise groups under clean-prefix conditioning, and its chain-uniform noise scheduler is designed to match the heterogeneous slot-wise noise patterns seen during MultiBD inference. 4
Technical read: the paper's systems hook is the Block Buffer. It keeps a fixed number of block slots, preserves static input shapes, supports CUDA Graph capture and replay, and lets decoding parallelism become wall-clock speed rather than only a conceptual parallelism gain. 4 That matters because many DLM speedups disappear once variable shapes, cache reuse, and runtime scheduling enter the serving stack.
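A fixed-slot buffer with static shapes can be sketched in a few lines. The class below is a hypothetical illustration of the general pattern, not the paper's Block Buffer: the names `BlockBuffer`, `admit`, `retire`, `model_input`, and the `PAD_ID` value are all assumptions, and no CUDA Graph capture is involved.

```python
PAD_ID = -1  # hypothetical padding token id


class BlockBuffer:
    """Holds a fixed number of fixed-size block slots (illustrative only)."""

    def __init__(self, num_slots, block_size):
        self.num_slots = num_slots
        self.block_size = block_size
        self.slots = [[PAD_ID] * block_size for _ in range(num_slots)]
        self.active = [False] * num_slots

    def admit(self, tokens):
        """Place a block in the first free slot; return its index or None."""
        if len(tokens) > self.block_size:
            raise ValueError("block is longer than a slot")
        for idx in range(self.num_slots):
            if not self.active[idx]:
                padding = [PAD_ID] * (self.block_size - len(tokens))
                self.slots[idx] = list(tokens) + padding
                self.active[idx] = True
                return idx
        return None  # buffer full: the caller must wait for a retire

    def retire(self, idx):
        """Free a slot once its block is fully decoded."""
        self.slots[idx] = [PAD_ID] * self.block_size
        self.active[idx] = False

    def model_input(self):
        """Always returns num_slots rows of block_size tokens plus a mask."""
        return [list(row) for row in self.slots], list(self.active)


buf = BlockBuffer(num_slots=3, block_size=4)
buf.admit([7, 8])
rows, mask = buf.model_input()
print(len(rows), len(rows[0]), mask)
```

Because the returned batch has the same dimensions whether one slot or all slots are occupied, a runtime that requires fixed input shapes can replay the same captured computation on every decoding step.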
Evidence and limits: on LLaDA2-Mini, the paper reports TPF improving from 3.47 to 6.19, a 78.4% gain, while accuracy rises from 79.95% to 81.03%. With DMax, MBD-LLaDA2-Mini-DMax reaches 9.34 average TPF with only a 1.02% accuracy drop, and the inference engine reports 951.41 TPS versus 781.50 TPS for the baseline. 4 The paper lists authors from Shanghai Jiao Tong University, Xi'an Jiaotong University, and Huawei, plus a project page at sjtu-deng-lab.github.io/mbd-lms. 4 The main limit is transfer: the reported setup is centered on LLaDA2-Mini, so readers should check whether the same training-inference alignment holds for larger DLMs and longer contexts.
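The headline percentages above can be checked with a few lines of arithmetic. The sketch below only recomputes ratios from the figures quoted in this digest; the helper name `relative_gain` is an assumption, and the throughput percentage is derived here rather than reported by the paper.

```python
def relative_gain(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


tpf_gain = relative_gain(3.47, 6.19)       # TPF, baseline vs. multi-block
tps_gain = relative_gain(781.50, 951.41)   # TPS, baseline vs. MBD engine
acc_delta = 81.03 - 79.95                  # accuracy change in points

print(f"TPF gain: {tpf_gain:.1f}%")
print(f"TPS gain: {tps_gain:.1f}%")
print(f"Accuracy change: {acc_delta:+.2f} points")
```

The TPF figure reproduces the reported 78.4% gain, while the wall-clock throughput ratio works out to roughly 21.7%, a reminder that per-forward-pass gains and end-to-end serving gains are different quantities.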
5. SyncCache: specialized caching for talking portraits
Decision: read SyncCache if your pipeline uses audio-driven portrait animation or video DiTs with modality-specific bottlenecks. It is narrower than OTCache, but it is a strong example of how cache design can exploit structure inside a task rather than treating every token or region equally.
What it does: SyncCache is a training-free DiT caching method for audio-driven portrait animation. The method uses asymmetric dynamics between the face region and the background: Spatially-Asymmetric Probing prioritizes error sensitivity in dynamic facial areas, while Modality-Decoupled Caching reuses stable visual residuals through heavy DiT blocks and recomputes lightweight audio blocks to preserve lip synchronization. 5
Technical read: the practical distinction is that SyncCache separates visual stability from audio alignment. Generic text-to-video cache reuse can save compute but risk drifting on the moving mouth region; SyncCache's design keeps the audio pathway active while avoiding unnecessary recomputation where the frame is stable. 5 The cache ratio is optimized offline through dynamic programming, which the authors state adds no online overhead. 5
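An offline dynamic program for choosing recompute steps can be sketched under a simple error model. The code below is not SyncCache's algorithm: it assumes a hypothetical per-step `drift` score, and it assumes that reusing a residual computed at step i for a later step t costs the drift accumulated between i and t. Both the function name and the cost model are illustrative.

```python
def plan_recompute_steps(drift, budget):
    """Pick `budget` steps to fully recompute, minimizing total reuse error.

    drift[t] is an assumed, offline-measured change score at step t.
    Step 0 is always recomputed. Returns (sorted step list, total cost).
    """
    T = len(drift)
    if not 1 <= budget <= T:
        raise ValueError("budget must be between 1 and len(drift)")
    cum, total = [], 0.0
    for d in drift:
        total += d
        cum.append(total)

    def seg(i, j):
        # Error of reusing step i's residual for steps i+1 .. j-1.
        return sum(cum[t] - cum[i] for t in range(i + 1, j))

    INF = float("inf")
    # best[k][j]: min cost over steps 0..j when j is the k-th recomputed step.
    best = [[INF] * T for _ in range(budget + 1)]
    prev = [[-1] * T for _ in range(budget + 1)]
    best[1][0] = 0.0
    for k in range(2, budget + 1):
        for j in range(1, T):
            for i in range(j):
                if best[k - 1][i] == INF:
                    continue
                cost = best[k - 1][i] + seg(i, j)
                if cost < best[k][j]:
                    best[k][j] = cost
                    prev[k][j] = i
    # Close the schedule by charging the tail after the last recomputed step.
    end_cost, j = min((best[budget][j] + seg(j, T), j) for j in range(T))
    steps, k = [], budget
    while k >= 1:
        steps.append(j)
        j = prev[k][j]
        k -= 1
    return sorted(steps), end_cost


print(plan_recompute_steps([1, 1, 1, 10], budget=2))
```

All of the search happens before serving, so the online path only looks up which steps to recompute. With the drift concentrated at the last step in the usage example, the plan spends its second recompute there instead of splitting the schedule evenly.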
Evidence and limits: the paper reports 4.12× acceleration on HunyuanVideo-Avatar and 3.75× on Wan-S2V with near-lossless visual fidelity and precise audio alignment. 5 It is also listed as accepted to ECCV 2026. 5 The narrower scope is the reason it lands fifth: portrait animation is a high-value deployment niche, but readers working on general image synthesis or text-only DLMs will get more immediate mileage from the first four papers.
Reading order by research area
For one-step text-to-image distillation, start with Cross-Space Distillation and verify whether the 5.4 to 9.4 HPSv3 jump holds under your prompt mix. 1 For flow-matching theory, read ECFM before chasing another sampler speedup, because its entropy-rate constraint speaks directly to mode-coverage failure. 2
For inference infrastructure, compare OTCache and SyncCache: OTCache is the broader schedule-modeling paper, while SyncCache is the stronger task-specialized design for audio-driven portraits. 3 5 For diffusion language models, MBD-LMs is the cleanest read because it ties a training recipe to a serving mechanism rather than reporting only a decoding trick. 4
Cover image: AI-generated editorial illustration.
