Five diffusion papers worth reading: June 25, 2026

Thursday's batch from arXiv closes a week that has been unusually heavy on video diffusion. Five papers make the cut: a new consistency-model distillation framework that applies continuous-time theory to autoregressive video generation; an ICML 2026 paper showing how to remove concepts from 12B-parameter frontier image models in a way that survives adversarial attack; an ECCV 2026 work that jointly denoises perception outputs and ego-trajectory in a single diffusion pass; a serving-infrastructure paper that halves I2V latency by sharing sparse attention patterns across similar requests; and a scaled-up masked diffusion language model that closes most of the gap with autoregressive LLMs at 8B parameters.

Speed-read table

#	Paper	arXiv	Key result
1	Causal-rCM	2606.25473	2-step Wan2.1-1.3B reaches VBench-T2V 84.63; continuous-time CM converges 10× faster than discrete
2	BLOCK (concept removal)	2606.25548	Style removal UA 88.6% on Flux 12B; adversarial attack success rate cut from 58% to 25%
3	UniTeD	2606.25736	NAVSIM v1 PDMS 90.2, camera-only, beats LiDAR baselines
4	Chorus II	2606.25040	2.16× I2V end-to-end speedup without quality loss; 6.71× attention-only acceleration at a 35.2% keep ratio
5	iLLaDA	2606.25331	8B masked diffusion LLM, 12T-token pretraining; BBH +21.6 pts over LLaDA; competitive with Qwen2.5 7B

1. Causal-rCM: continuous-time consistency models reach autoregressive video

arXiv: 2606.25473 · Kaiwen Zheng, Guande He, Min Zhao et al. · Tsinghua + NVIDIA · Submitted June 24, 2026 1

Code: github.com/NVlabs/rcm

Venue: Preprint.

What it does

Consistency model distillation has worked well for single-step image generation but has not extended cleanly to autoregressive (AR) video diffusion: the causal structure complicates the forward/reverse split that rCM (the score-regularized continuous-time consistency model) relies on. Causal-rCM resolves this by unifying two training objectives: teacher-forcing consistency models (TF-CM) handle the forward, offline pass, and self-forcing distribution matching distillation (SF-DMD) handles the reverse, on-policy pass. Together they preserve the rCM complementarity principle within autoregressive video generation.

The framework ships with a custom FlashAttention-2 JVP (Jacobian-vector product) kernel that makes continuous-time score matching tractable under the causal mask required for AR generation. This engineering contribution is what unlocks continuous-time CMs for video: without the JVP kernel, the teacher-forcing training step is computationally prohibitive. 1
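To make the role of the JVP concrete: continuous-time consistency training needs the derivative of the student output along the sampling trajectory, which is a Jacobian-vector product with tangent (dx/dt, 1). The sketch below illustrates that quantity with forward-mode differentiation on dual numbers in plain Python. It is not the paper's FlashAttention-2 kernel; `toy_consistency_fn`, its weight `w`, and the scalar setting are hypothetical stand-ins chosen only so the example runs.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value paired with its directional derivative (tangent)."""
    val: float
    tan: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule carries the tangent through the multiplication.
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

    __rmul__ = __mul__


def _lift(x):
    """Promote a plain number to a Dual with zero tangent."""
    return x if isinstance(x, Dual) else Dual(float(x))


def tanh(x):
    """tanh on dual numbers: d/du tanh(u) = 1 - tanh(u)^2."""
    y = math.tanh(x.val)
    return Dual(y, (1.0 - y * y) * x.tan)


def toy_consistency_fn(x, t, w=0.5):
    """Hypothetical scalar stand-in for the student network f(x_t, t)."""
    return tanh(w * x + t * x)


def total_time_derivative(f, x, t, dx_dt):
    """JVP of f at (x, t) along the tangent (dx/dt, 1).

    Returns (f(x, t), df/dt), where df/dt = df/dx * dx/dt + df/dt_partial.
    """
    out = f(Dual(x, dx_dt), Dual(t, 1.0))
    return out.val, out.tan


value, tangent = total_time_derivative(toy_consistency_fn, 1.0, 0.3, -2.0)
print(value, tangent)
```

A single forward pass yields both the output and its time derivative, which is why a fused attention kernel with JVP support matters at video scale: without one, the tangent has to be propagated through attention by slower means.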

Strength of evidence

Applied to Wan2.1-1.3B, the distilled 2-step model scores VBench-T2V 84.63 — a concrete benchmark on a standardized evaluation suite. The paper also reports that continuous-time CM (sCM/MeanFlow) converges 10× faster than discrete-time CM (dCM) on the same AR video setup. 1 Infrastructure comparison against Self-Forcing, FastVideo, and FastGen across seven capability axes (training paradigms, parallelism, JVP support) is tabulated in the paper.

Separately, NVIDIA reports applying the same distillation recipe to Cosmos 3, their full-modality world foundation model, enabling action-conditioned interactive world models at 1–2 inference steps. The training uses synthetic data only, which limits direct comparison against models trained on diverse real-video corpora.

What it means

For practitioners building on Wan-class video models, a 2-step inference path with VBench-T2V 84.63 is directly deployable. The GitHub repository is live. For researchers: the JVP kernel and the TF-CM/SF-DMD unification are re-usable components; the paper explicitly supports frame-wise and chunk-wise AR modes and two lightweight inference acceleration variants (noisy context and custom step schedule).

2. BLOCK: concept removal that survives adversarial pressure (ICML 2026)

arXiv: 2606.25548 · Aditya Kumar, Pierre Joly, Adam Dziedzic, Franziska Boenisch · CISPA Helmholtz Center for Information Security · Submitted June 24, 2026; accepted at ICML 2026 2

Code: github.com/sprintml/block-concept-removal

Venue: ICML 2026.

What it does

Most concept-removal methods attach an external safety module — a classifier, a guidance signal, or a steering vector — that can be removed or bypassed. BLOCK (Bottleneck-Layer-Oriented Concept Knockout) takes a different path: it replaces the bottleneck transformation layer between the text encoder and the generative backbone with a transcoder trained in isolation, then redirects the decoder weights associated with a target concept to an empty token. 2

The transcoder uses TopK activations instead of ReLU to avoid dead-unit and activation-shrinkage problems under ℓ₁ regularization, and trains with a three-part loss (multi-TopK + auxiliary). The method covers SD3.5 Large, Flux.1-dev (12B), Infinity-2B, and Infinity-8B.
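As a rough illustration of the two ideas above, the sketch below applies a TopK activation to a vector of pre-activations and then "redirects" the decoder row of one latent unit to a fixed target direction. Everything here is a toy assumption: the function names, the list-based weights, the zero vector standing in for the empty-token target, and the way a concept unit is identified are hypothetical and not taken from the BLOCK code.

```python
def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations and zero out the rest."""
    if k <= 0:
        return [0.0] * len(pre_acts)
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_acts)]


def decode(latents, decoder_rows):
    """Linear decoder: output = sum_i latents[i] * decoder_rows[i]."""
    dim = len(decoder_rows[0])
    out = [0.0] * dim
    for act, row in zip(latents, decoder_rows):
        if act != 0.0:
            for j in range(dim):
                out[j] += act * row[j]
    return out


def redirect_concept(decoder_rows, concept_units, target_direction):
    """Return a copy of the decoder whose concept rows point at target_direction."""
    return [
        list(target_direction) if i in concept_units else list(row)
        for i, row in enumerate(decoder_rows)
    ]


# Toy example: 4 latent units, 2-dimensional output.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
acts = topk_activation([0.1, 2.0, -1.0, 0.7], k=2)

# Assume unit 1 encodes the concept; the zero vector stands in for the empty token.
edited = redirect_concept(rows, concept_units={1}, target_direction=[0.0, 0.0])

print(decode(acts, rows))
print(decode(acts, edited))
```

After the edit, the concept unit can still fire, but its contribution to the output is gone while the other units decode as before. That is the sense in which the removal lives in the weights rather than in a detachable filter.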

Strength of evidence

On Flux.1-dev, style removal reaches UA 88.6% (how often the concept is absent from generated images) with CRA 96.4% (how often unrelated content is preserved), versus UCE at 67.43% and LOCOEDIT at 66.45% UA. Object removal scores UA 93.2% with IRA 96.61%. 2

When subjected to the Ring-A-Bell adversarial attack on SD3.5, the attack success rate drops from 58.08% to 25.21% — compared to baselines that remain above 50%. The paper also reports sequential multi-concept removal for up to 10 concepts: BLOCK maintains performance while UCE and LOCOEDIT degrade sharply. 2

Training cost is reported as orders of magnitude lower than retraining-based methods like EraseAnything.

What it means

ICML acceptance and the adversarial robustness results are the two signals that separate this from the backlog of concept-removal papers. Because the transcoder is permanently integrated into the model rather than attached as a separate module, there is no safety component to detach at inference, which removes the simplest bypass. For organizations deploying frontier text-to-image models under content policies, this is a candidate replacement for add-on classifiers that can be detached.

The sequential multi-concept evaluation (up to 10) is also a useful practical signal: most earlier methods degrade as the concept list grows, which is the real deployment scenario.

3. UniTeD: joint perception and planning in one diffusion pass (ECCV 2026)

arXiv: 2606.25736 · Bo Zhao, Xinting Zhao, Naifan Li, Erkang Cheng, Haibin Ling · Nullmax + Westlake University · Submitted June 24, 2026; accepted at ECCV 2026 3

Code: Not released at preprint stage.

Venue: ECCV 2026.

What it does

End-to-end autonomous driving systems typically run perception (detection, mapping, motion prediction) and planning as separate stages. Diffusion-based planners introduced recently still treat perception outputs as fixed conditioning inputs, which means perception errors propagate downstream without correction. UniTeD proposes a Unified Temporal Diffusion framework that models both perception and planning simultaneously in a single shared generative denoising space, allowing the two tasks to refine each other through iterative denoising. 3

Three components make this work: a Unified Diffusion Decoder that combines self-attention, spatial deformable attention, and conditional modulation; a Temporal Transition Module (TTM) that aligns the noise level of historical frames with the current frame (the key technical challenge when chaining across timesteps in video); and an Anchor Refresh Strategy (ARS) that reduces training–inference distribution shift in sparse diffusion settings.

Strength of evidence

NAVSIM (Non-reactive Autonomous Vehicle Simulation benchmark) v1: PDMS (Predictive Driver Model Score) 90.2, using camera input only with a ResNet-34 backbone. This exceeds LiDAR-based methods including TransFuser (84.0), Hydra-MDP (86.5), and WoTE (88.3). NAVSIM v2: EPDMS 90.1, surpassing DiffRefiner (86.2) by 3.9 points. Bench2Drive closed-loop: Driving Score 87.25, above DiffRefiner (87.1) and HiP-AD (86.8). 3
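The margins behind these comparisons are easy to misread, so a short script that recomputes them from the scores quoted above may help. The dictionary layout and the `margins` helper are illustrative only; the numbers are the ones stated in this digest, not re-derived from the paper.

```python
# Scores as quoted in this digest; higher is better on all three benchmarks.
UNITED = {
    "NAVSIM v1 PDMS": 90.2,
    "NAVSIM v2 EPDMS": 90.1,
    "Bench2Drive DS": 87.25,
}
BASELINES = {
    "NAVSIM v1 PDMS": {"TransFuser": 84.0, "Hydra-MDP": 86.5, "WoTE": 88.3},
    "NAVSIM v2 EPDMS": {"DiffRefiner": 86.2},
    "Bench2Drive DS": {"DiffRefiner": 87.1, "HiP-AD": 86.8},
}


def margins(benchmark):
    """UniTeD's lead over each quoted baseline, rounded to 2 decimals."""
    ours = UNITED[benchmark]
    return {
        name: round(ours - score, 2)
        for name, score in BASELINES[benchmark].items()
    }


for bench in UNITED:
    print(bench, margins(bench))
```

On NAVSIM v1 the smallest margin is the one over WoTE and the largest is the one over TransFuser; on Bench2Drive the lead over DiffRefiner is a small fraction of a point.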

Ablations are informative: removing the diffusion objective from perception tasks (reverting to discriminative) costs 0.7 PDMS; removing TTM costs 0.8 PDMS; removing ARS costs 2.0 PDMS. The largest single contributor is ARS, which addresses the training–inference gap directly.

What it means

Camera-only input surpassing LiDAR-based methods on NAVSIM v1 is an unusual result: UniTeD's 90.2 PDMS is 1.9 points above the strongest LiDAR baseline listed (WoTE, 88.3) and 6.2 points above TransFuser (84.0). The critical caveat: NAVSIM is a non-reactive simulation, not open-world driving, and the three benchmarks measure different non-reactive or closed-loop scenarios. Actual deployment gaps remain.

For the research community, the TTM formulation is the reusable insight: any method that chains temporal diffusion frames faces the noise-level mismatch problem, and TTM provides a tested solution. ARS's 2.0-point PDMS contribution suggests that training–inference distribution mismatch has been underweighted in prior diffusion-planning work.

4. Chorus II: sharing sparse attention patterns across I2V requests

arXiv: 2606.25040 · Hao Liu, Chenghuan Huang, Xing Cai et al. · Sun Yat-sen University + Tencent WeChat · Submitted June 23, 2026 4

Code: Not released.

Venue: Preprint.

What it does

Sparse attention acceleration for diffusion inference has a well-known problem: constructing a high-quality sparsity mask requires routing computation (predicting which block pairs to skip), which adds overhead that partially cancels the attention savings. Chorus II reframes the problem: if two I2V requests use similar input images, their block-level attention sparsity patterns are likely similar too, so an existing mask from a prior request can substitute for computing a new one. 4

The mechanism: DINO-based visual similarity retrieval selects a historical request. The paper verifies that when DINO similarity exceeds 0.4, the retrieved mask has IoU ≥ 85% with a freshly computed ground-truth mask, so the historical mask can be applied directly without recomputation. Two safety fallbacks prevent compounding errors: a block-pair refresh step (every 8 layers, it guarantees that all Q/K block pairs have been visited at least once) and a minimum top-k guarantee.
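The retrieval-and-reuse idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Chorus II implementation: embeddings are plain lists standing in for DINO features, a mask is a set of kept (query_block, key_block) pairs, the store is an in-memory list, and the two fallbacks are not modeled. All function names are hypothetical; only the 0.4 similarity threshold and the IoU check come from the description above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def block_mask_iou(mask_a, mask_b):
    """IoU between two sets of kept (query_block, key_block) pairs."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 1.0


def retrieve_mask(query_emb, store, min_similarity=0.4):
    """Return the mask of the most similar stored request, or None.

    `store` is a list of (embedding, mask) pairs from earlier requests.
    None means no stored request is similar enough, so the caller
    should compute a fresh mask instead of reusing one.
    """
    best_mask, best_sim = None, min_similarity
    for emb, mask in store:
        sim = cosine_similarity(query_emb, emb)
        if sim > best_sim:
            best_mask, best_sim = mask, sim
    return best_mask


store = [
    ([1.0, 0.0], {(0, 0), (0, 1)}),
    ([0.0, 1.0], {(1, 1)}),
]
reused = retrieve_mask([0.9, 0.1], store)
print(reused)

# Offline check of the reuse assumption against a freshly computed mask.
print(block_mask_iou(reused, {(0, 0)}))
```

The IoU function is the kind of offline validation the paper uses to justify the threshold; in serving, only the similarity lookup would run on the request path.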

An optional extension adds feature reuse — downsampled latent features from the matched historical request are reused alongside the mask, with guidance enhancement to correct for the approximation. This pushes end-to-end speedup from 2.16× to 2.59×.

Strength of evidence

Tested on Wan2.2-I2V (4-step distilled), single H20 GPU, 720p 61-frame video. The FP8 block-sparse attention backend delivers 6.71× attention-layer acceleration at 35.2% keep ratio versus FlashAttention-2. 4 End-to-end latency: 78.9 seconds with sparsity reuse versus 160 seconds without, against an SVG2-quality baseline, which is a 2.03× wall-clock reduction. The headline 2.16× figure corresponds to a slightly different configuration.
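For readers who want to sanity-check how an attention-only speedup relates to an end-to-end one, Amdahl's law gives a back-of-envelope estimate. The calculation below assumes that the 6.71× and 2.16× figures describe the same configuration and that attention is the only accelerated component; neither assumption is confirmed by the paper, so the resulting fraction is indicative at best.

```python
def wall_clock_speedup(baseline_s, accelerated_s):
    """Ratio of baseline latency to accelerated latency."""
    return baseline_s / accelerated_s


def amdahl_speedup(fraction, component_speedup):
    """Overall speedup when `fraction` of runtime is sped up by `component_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


def implied_fraction(end_to_end_speedup, component_speedup):
    """Invert Amdahl's law: share of baseline runtime spent in the component."""
    return (1.0 - 1.0 / end_to_end_speedup) / (1.0 - 1.0 / component_speedup)


# Latencies quoted above: 160 s without reuse, 78.9 s with reuse.
print(round(wall_clock_speedup(160.0, 78.9), 2))

# If attention is 6.71x faster and the whole pipeline is 2.16x faster,
# this is the share of baseline runtime attention would have to occupy.
p = implied_fraction(2.16, 6.71)
print(round(p, 3))

# Round trip: plugging the fraction back in recovers the end-to-end figure.
print(round(amdahl_speedup(p, 6.71), 2))
```

Under those assumptions, attention would account for roughly 63% of the baseline runtime, which also bounds how much further attention-only work can help.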

Quality is measured against SVG2 baseline outputs; the paper reports equivalent generation quality.

What it means

The practical scope is I2V serving infrastructure, not research training. If your deployment receives batches of I2V requests with correlated input images — stock media workflows, user-generated content pipelines with thematically similar submissions, reference-guided generation — the cross-request reuse assumption holds. On uncorrelated request streams the DINO similarity condition may fail often enough to erode the speedup.

The sparsity-reuse approach also requires a retrieval store of historical request masks, which means it is a stateful serving component rather than a drop-in replacement for the attention module.

5. iLLaDA: scaling masked diffusion to 8B and 12 trillion tokens

arXiv: 2606.25331 · Shen Nie, Qiyang Min, Shaoxuan Xu et al. · Renmin University of China, Gaoling School of AI (GSAI) · Submitted June 24, 2026 5

Code and weights: github.com/ML-GSAI/LLaDA

Venue: Preprint.

What it does

LLaDA (2025) showed that masked diffusion language models could reach competitive performance at smaller scales. iLLaDA (Improved LLaDA) scales the same paradigm to 8 billion parameters pretrained on 12 trillion tokens, followed by instruction fine-tuning on a 25B-token corpus for 12 epochs. The model uses fully bidirectional attention (no causal mask) throughout both pretraining and SFT, maintaining the masked diffusion objective end-to-end. Two inference improvements are added: variable-length generation (reducing padding overhead) and confidence-based scoring for multiple-choice evaluation. 5
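For readers unfamiliar with the masked diffusion objective, the sketch below shows one training term in the LLaDA style as commonly described: sample a masking level t, mask each token independently with probability t, and score the model only on the masked positions with a 1/t weight. iLLaDA's exact recipe may differ. The `uniform_log_prob` model, the `[MASK]` string, and the function names are toy assumptions so the example runs without a neural network.

```python
import math
import random

MASK = "[MASK]"


def forward_mask(tokens, t, rng):
    """Forward process: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def uniform_log_prob(noisy, position, token, vocab_size=2):
    """Toy predictor: every token is equally likely at every position."""
    return -math.log(vocab_size)


def masked_diffusion_loss(tokens, noisy, log_prob_fn, t):
    """One Monte Carlo term: -(1/t) * sum of log p(token | noisy) over masked positions."""
    total = 0.0
    for i, (tok, seen) in enumerate(zip(tokens, noisy)):
        if seen == MASK:
            # The model sees the whole noisy sequence (bidirectional context).
            total += log_prob_fn(noisy, i, tok)
    return -total / t


rng = random.Random(0)
tokens = ["the", "cat", "sat", "down"]
t = 0.5
noisy = forward_mask(tokens, t, rng)
print(noisy)
print(masked_diffusion_loss(tokens, noisy, uniform_log_prob, t))
```

Because the predictor conditions on the full noisy sequence rather than a left-to-right prefix, no causal mask is involved at any point, which is the property the paragraph above refers to.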

Strength of evidence

Against LLaDA (the direct predecessor):

iLLaDA-Base: +21.6 points on BBH (BIG-Bench Hard, a multi-task reasoning benchmark), +14.9 points on ARC-Challenge 5
iLLaDA-Instruct: +14.5 points on MATH, +16.5 points on HumanEval 5

The paper reports that iLLaDA performs competitively with Qwen2.5 7B across multiple benchmarks despite using non-autoregressive generation. Weights and code are public.

The gaps versus autoregressive models are not fully closed — the paper characterizes performance as "competitive," not superior. The specific benchmarks and magnitude of any remaining gaps are in the full paper.

What it means

The gap between masked diffusion LLMs and autoregressive LLMs has been the central question for this research line since the original MDLM and LLaDA papers. iLLaDA's results suggest the gap is primarily a scale and training-recipe problem, not a fundamental architectural one — at 8B + 12T, a bidirectional masked diffusion model reaches the range where comparison to mainstream 7B autoregressive models is reasonable. 5

The practical implications are narrow for now: autoregressive inference benefits from KV-cache and speculative decoding toolchains that don't transfer directly to masked diffusion, and deployment infrastructure strongly favors autoregressive models. But the benchmark trajectory from LLaDA → iLLaDA provides evidence that continued scaling is worth pursuing.

Cross-paper themes

Two patterns stand out across this batch.

Speed by reuse, not by simplification. Causal-rCM and Chorus II both achieve speedup by doing less of the same computation (fewer denoising steps or fewer attention ops) rather than by approximating the model architecture. Causal-rCM reuses the rCM recipe (TF-CM + SF-DMD complementarity) in the autoregressive domain; Chorus II reuses sparsity masks across requests. Neither changes the model architecture. The pattern suggests that for production systems, the next round of throughput gains will come from smarter scheduling and reuse rather than architectural compression alone.

Bidirectional attention as the contested boundary. UniTeD and iLLaDA both operate with bidirectional context — UniTeD across perception and planning outputs simultaneously, iLLaDA across the full token sequence. The BLOCK transcoder also uses bidirectional feature access within the bottleneck. Autoregressive video generation (Causal-rCM) and autoregressive text (standard LLMs) represent the other side of that boundary. This week's papers don't settle which side wins in the long run, but they show that bidirectional diffusion has matured enough to compete on benchmarks that previously belonged exclusively to the autoregressive camp.

Paper	Core mechanism	Evaluation benchmark	Evidence level
Causal-rCM	TF-CM + SF-DMD unification with JVP kernel	VBench-T2V	Preprint
BLOCK	Transcoder replacement + concept weight redirect	UA / CRA / adversarial	ICML 2026
UniTeD	Joint perception+planning diffusion decoder + TTM + ARS	NAVSIM PDMS / Bench2Drive DS	ECCV 2026
Chorus II	Cross-request sparsity mask reuse via DINO similarity	Wall-clock latency vs. SVG2	Preprint
iLLaDA	Masked diffusion at 8B + 12T tokens	BBH / ARC-C / MATH / HumanEval	Preprint
