
25/6/2026 · 9:20
Five diffusion papers worth reading: June 25, 2026
Thursday's arXiv batch: Causal-rCM (Tsinghua+NVIDIA) brings continuous-time consistency models to AR video (2-step VBench-T2V 84.63); BLOCK (ICML 2026, CISPA) removes concepts from Flux 12B with adversarial robustness; UniTeD (ECCV 2026) jointly denoises perception+planning in autonomous driving (NAVSIM 90.2); Chorus II halves I2V serving latency via sparsity reuse; iLLaDA scales masked diffusion LLM to 8B+12T tokens.
Research at a glance
Thursday's batch from arXiv closes a week that has been unusually heavy on video diffusion. Five papers make the cut: a new consistency-model distillation framework that applies continuous-time theory to autoregressive video generation; an ICML 2026 paper showing how to remove concepts from 12B-parameter frontier image models in a way that survives adversarial attack; an ECCV 2026 work that jointly denoises perception outputs and ego-trajectory in a single diffusion pass; a serving-infrastructure paper that halves I2V latency by sharing sparse attention patterns across similar requests; and a scaled-up masked diffusion language model that closes most of the gap with autoregressive LLMs at 8B parameters.
Speed-read table
| # | Paper | arXiv | Key result |
|---|---|---|---|
| 1 | Causal-rCM | 2606.25473 | 2-step Wan2.1-1.3B reaches VBench-T2V 84.63; continuous-time CM converges 10× faster than discrete |
| 2 | BLOCK (concept removal) | 2606.25548 | Style removal UA 88.6% on Flux.1-dev (12B); Ring-A-Bell attack success rate on SD3.5 cut from 58% to 25% |
| 3 | UniTeD | 2606.25736 | NAVSIM v1 PDMS 90.2, camera-only, beats LiDAR baselines |
| 4 | Chorus II | 2606.25040 | 2.16× I2V end-to-end speedup without quality loss; 6.71× attention-only acceleration at 35.2% keep ratio |
| 5 | iLLaDA | 2606.25331 | 8B masked diffusion LLM, 12T-token pretraining; BBH +21.6 pts over LLaDA; competitive with Qwen2.5 7B |
1. Causal-rCM: continuous-time consistency models reach autoregressive video
arXiv: 2606.25473 · Kaiwen Zheng, Guande He, Min Zhao et al. · Tsinghua + NVIDIA · Submitted June 24, 2026 1
Code: github.com/NVlabs/rcm
Venue: Preprint.
What it does
Consistency model distillation has worked well for single-step image generation but has not extended cleanly to autoregressive (AR) video diffusion: the causal structure complicates the forward/reverse split that rCM (score-regularized continuous-time consistency model) relies on. Causal-rCM resolves this by unifying two training objectives. Teacher-forcing consistency models (TF-CM) handle the forward, offline pass; self-forcing distribution matching distillation (SF-DMD) handles the reverse, on-policy pass. Together they preserve the rCM complementarity principle within autoregressive video generation.
The framework ships with a custom FlashAttention-2 JVP (Jacobian-vector product) kernel that makes continuous-time score matching tractable under the causal mask required for AR generation. This engineering contribution is what unlocks continuous-time CMs for video: without the JVP kernel, the teacher-forcing training step is computationally prohibitive. 1
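To make the role of the JVP concrete: continuous-time consistency objectives need the total derivative of the network output along the sampling trajectory, which is a Jacobian-vector product with tangents (dx/dt, 1). The sketch below is a toy scalar illustration in plain Python using forward-mode dual numbers. `Dual`, `lift`, `toy_f`, and `jvp` are hypothetical names introduced here for illustration, and nothing in it reflects the paper's FlashAttention-2 kernel or its actual network.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Forward-mode AD value: a primal `val` and its tangent `tan`."""
    val: float
    tan: float = 0.0

    def __add__(self, other):
        other = lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    def __mul__(self, other):
        other = lift(other)
        # Product rule carries the tangent forward alongside the value.
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

    __radd__ = __add__
    __rmul__ = __mul__


def lift(x):
    """Treat plain numbers as constants (zero tangent)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def d_sin(d):
    return Dual(math.sin(d.val), math.cos(d.val) * d.tan)


def d_cos(d):
    return Dual(math.cos(d.val), -math.sin(d.val) * d.tan)


def toy_f(x, t):
    """Hypothetical stand-in for a consistency network f(x_t, t)."""
    return d_cos(t) * x + 0.5 * d_sin(t)


def jvp(f, primals, tangents):
    """Return (f(primals), directional derivative of f along tangents)."""
    out = f(*(Dual(p, v) for p, v in zip(primals, tangents)))
    return out.val, out.tan


# Total derivative d/dt f(x_t, t) along a trajectory with velocity dx/dt:
# tangents are (dx/dt, 1), evaluated in a single forward pass.
x_t, t, dx_dt = 2.0, 0.3, -1.0
value, total_derivative = jvp(toy_f, (x_t, t), (dx_dt, 1.0))
print(value, total_derivative)
```

The point of the design is that the tangent is computed in the same forward pass as the output, with no backward pass. For an attention layer, that forward pass has to run inside the attention kernel itself, which is why a dedicated kernel matters.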
Strength of evidence
Applied to Wan2.1-1.3B, the distilled 2-step model scores VBench-T2V 84.63 — a concrete benchmark on a standardized evaluation suite. The paper also reports that continuous-time CM (sCM/MeanFlow) converges 10× faster than discrete-time CM (dCM) on the same AR video setup. 1 Infrastructure comparison against Self-Forcing, FastVideo, and FastGen across seven capability axes (training paradigms, parallelism, JVP support) is tabulated in the paper.
Separately, NVIDIA reports applying the same distillation recipe to Cosmos 3, their full-modality world foundation model, enabling action-conditioned interactive world models at 1–2 inference steps. The training uses synthetic data only, which limits direct comparison against models trained on diverse real-video corpora.
What it means
For practitioners building on Wan-class video models, a 2-step inference path with VBench-T2V 84.63 is directly deployable. The GitHub repository is live. For researchers: the JVP kernel and the TF-CM/SF-DMD unification are reusable components; the paper explicitly supports frame-wise and chunk-wise AR modes and two lightweight inference acceleration variants (noisy context and custom step schedule).
2. BLOCK: concept removal that survives adversarial pressure (ICML 2026)
arXiv: 2606.25548 · Aditya Kumar, Pierre Joly, Adam Dziedzic, Franziska Boenisch · CISPA Helmholtz Center for Information Security · Submitted June 24, 2026; accepted at ICML 2026 2
Venue: ICML 2026.
What it does
Most concept-removal methods attach an external safety module — a classifier, a guidance signal, or a steering vector — that can be removed or bypassed. BLOCK (Bottleneck-Layer-Oriented Concept Knockout) takes a different path: it replaces the bottleneck transformation layer between the text encoder and the generative backbone with a transcoder trained in isolation, then redirects the decoder weights associated with a target concept to an empty token. 2
The transcoder uses TopK activations instead of ReLU to avoid dead-unit and activation-shrinkage problems under ℓ₁ regularization, and trains with a three-part loss (multi-TopK + auxiliary). The method covers SD3.5 Large, Flux.1-dev (12B), Infinity-2B, and Infinity-8B.
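As a rough illustration of the TopK-versus-ReLU distinction, the sketch below applies both activations to the same vector of pre-activations. It is a minimal plain-Python sketch, not BLOCK's implementation: `relu` and `topk_activation` are hypothetical helper names, and the paper's actual transcoder, its value of k, and its loss terms are not reproduced here.

```python
def relu(pre_activations):
    """Standard ReLU: every positive unit stays active."""
    return [max(0.0, x) for x in pre_activations]


def topk_activation(pre_activations, k):
    """Keep the k largest pre-activations and zero out the rest.

    Sparsity is fixed by construction (exactly k active units), so no
    l1 penalty is needed to enforce it. Assumption for this sketch:
    ties are broken by index order.
    """
    if k <= 0:
        return [0.0] * len(pre_activations)
    order = sorted(range(len(pre_activations)),
                   key=lambda i: pre_activations[i], reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(pre_activations)]


pre = [0.2, -1.0, 3.0, 0.7, 1.5]
print(relu(pre))                # four units stay active
print(topk_activation(pre, 2))  # only the two strongest units survive
```

Because the number of active units is set directly, the magnitudes of the surviving activations are not pushed toward zero by a sparsity penalty, which is the shrinkage problem the paragraph above refers to.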
Strength of evidence
On Flux.1-dev, style removal reaches UA 88.6% (how often the concept is absent from generated images) with CRA 96.4% (how often unrelated content is preserved), versus UCE at 67.43% and LOCOEDIT at 66.45% UA. Object removal scores UA 93.2% with IRA 96.61%. 2
When subjected to the Ring-A-Bell adversarial attack on SD3.5, the attack success rate drops from 58.08% to 25.21% — compared to baselines that remain above 50%. The paper also reports sequential multi-concept removal for up to 10 concepts: BLOCK maintains performance while UCE and LOCOEDIT degrade sharply. 2
Training cost is reported as orders of magnitude lower than retraining-based methods like EraseAnything.
What it means
ICML acceptance and the adversarial robustness results are the two signals that separate this from the backlog of concept-removal papers. Because the transcoder permanently replaces the original bottleneck layer, there is no separate safety module for an attacker to detach or bypass at inference, which removes the easiest white-box route around add-on filters. For organizations deploying frontier text-to-image models under content policies, this is a candidate replacement for add-on classifiers that can be detached.
The sequential multi-concept evaluation (up to 10) is also a useful practical signal: most earlier methods degrade as the concept list grows, which is the real deployment scenario.
3. UniTeD: joint perception and planning in one diffusion pass (ECCV 2026)
arXiv: 2606.25736 · Bo Zhao, Xinting Zhao, Naifan Li, Erkang Cheng, Haibin Ling · Nullmax + Westlake University · Submitted June 24, 2026; accepted at ECCV 2026 3
Code: Not released at preprint stage.
Venue: ECCV 2026.
What it does
End-to-end autonomous driving systems typically run perception (detection, mapping, motion prediction) and planning as separate stages. Diffusion-based planners introduced recently still treat perception outputs as fixed conditioning inputs, which means perception errors propagate downstream without correction. UniTeD proposes a Unified Temporal Diffusion framework that models both perception and planning simultaneously in a single shared generative denoising space, allowing the two tasks to refine each other through iterative denoising. 3
Three components make this work: a Unified Diffusion Decoder that combines self-attention, spatial deformable attention, and conditional modulation; a Temporal Transition Module (TTM) that aligns the noise level of historical frames with the current frame (the key technical challenge when chaining across timesteps in video); and an Anchor Refresh Strategy (ARS) that reduces training–inference distribution shift in sparse diffusion settings.
Strength of evidence
NAVSIM (Non-reactive Autonomous Vehicle Simulation benchmark) v1: PDMS (Predictive Driver Model Score) 90.2, using camera input only with a ResNet-34 backbone. This exceeds LiDAR-based methods including TransFuser (84.0), Hydra-MDP (86.5), and WoTE (88.3). NAVSIM v2: EPDMS 90.1, surpassing DiffRefiner (86.2) by 3.9 points. Bench2Drive closed-loop: Driving Score 87.25, above DiffRefiner (87.1) and HiP-AD (86.8). 3
Ablations are informative: removing the diffusion objective from perception tasks (reverting to discriminative) costs 0.7 PDMS; removing TTM costs 0.8 PDMS; removing ARS costs 2.0 PDMS. The largest single contributor is ARS, which addresses the training–inference gap directly.
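The margins behind these numbers are easy to check by hand. The short script below recomputes the NAVSIM v1 gaps over each LiDAR baseline and ranks the ablations, using only the figures quoted above. The variable names are illustrative and carry no meaning beyond this sketch.

```python
# NAVSIM v1 PDMS figures as quoted in this digest.
united_pdms = 90.2
lidar_baselines = {"TransFuser": 84.0, "Hydra-MDP": 86.5, "WoTE": 88.3}

# Margin of UniTeD over each LiDAR-based baseline, rounded to one decimal.
margins = {name: round(united_pdms - score, 1)
           for name, score in lidar_baselines.items()}

# PDMS lost when each component is removed, as quoted above.
ablation_cost = {
    "diffusion objective on perception": 0.7,
    "TTM": 0.8,
    "ARS": 2.0,
}
largest_contributor = max(ablation_cost, key=ablation_cost.get)

for name, margin in margins.items():
    print(f"UniTeD vs {name}: +{margin} PDMS")
print("Smallest margin:", min(margins.values()))
print("Largest ablation cost:", largest_contributor)
```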
What it means
Camera-only input surpassing LiDAR-based methods on NAVSIM v1, by 1.9 PDMS points over the nearest LiDAR baseline (WoTE, 88.3) and 6.2 over TransFuser (84.0), is an unusual result. The critical caveat: NAVSIM is a non-reactive simulation, not open-world driving, and the three benchmarks cover different non-reactive and closed-loop scenarios. Actual deployment gaps remain.
For the research community, the TTM formulation is the reusable insight: any method that chains temporal diffusion frames faces the noise-level mismatch problem, and TTM provides a tested solution. ARS's 2.0-point PDMS contribution suggests that training–inference distribution mismatch has been underweighted in prior diffusion-planning work.
4. Chorus II: sharing sparse attention patterns across I2V requests
arXiv: 2606.25040 · Hao Liu, Chenghuan Huang, Xing Cai et al. · Sun Yat-sen University + Tencent WeChat · Submitted June 23, 2026 4
Code: Not released.
Venue: Preprint.
What it does
Sparse attention acceleration for diffusion inference has a well-known problem: constructing a high-quality sparsity mask requires routing computation (predicting which block pairs to skip), which adds overhead that partially cancels the attention savings. Chorus II reframes the problem: if two I2V requests use similar input images, their block-level attention sparsity patterns are likely similar too, so an existing mask from a prior request can substitute for computing a new one. 4
The mechanism: DINO-based visual similarity retrieval selects a historical request, and that request's mask is applied directly when the DINO similarity exceeds 0.4. The paper verifies that above this threshold the retrieved mask has IoU ≥ 85% with a freshly computed ground-truth mask. Two safety fallbacks prevent compounding errors: a block-pair visit refresh (every 8 layers, it guarantees that all Q/K block pairs are visited at least once) and a minimum top-k guarantee.
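A minimal sketch of the retrieval gate described above, under several assumptions: "DINO similarity" is taken to be cosine similarity between image embeddings, a block mask is represented as a set of kept (query block, key block) pairs, and the store is a plain list. `cosine`, `mask_iou`, and `retrieve_mask` are hypothetical names, not functions from the paper, and the refresh and top-k fallbacks are omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mask_iou(mask_a, mask_b):
    """IoU of two block masks given as sets of kept (q_block, k_block) pairs."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 1.0


def retrieve_mask(query_emb, store, sim_threshold=0.4):
    """Return the mask of the most similar past request, or None.

    None means no stored request is similar enough, so the caller
    should fall back to computing a fresh mask.
    """
    best = max(store, key=lambda e: cosine(query_emb, e["emb"]), default=None)
    if best is None or cosine(query_emb, best["emb"]) <= sim_threshold:
        return None
    return best["mask"]


# Toy store with two historical requests (2-D embeddings for readability).
store = [
    {"emb": [1.0, 0.0], "mask": {(0, 0), (0, 1), (1, 1)}},
    {"emb": [0.0, 1.0], "mask": {(1, 0)}},
]
reused = retrieve_mask([0.9, 0.1], store)   # close to the first request
missed = retrieve_mask([-1.0, 0.0], store)  # nothing similar enough
print(reused, missed)
```

In an offline validation pass, `mask_iou` would compare a reused mask against a freshly computed one. At serving time only the similarity gate runs, which is where the routing cost is saved.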
An optional extension adds feature reuse — downsampled latent features from the matched historical request are reused alongside the mask, with guidance enhancement to correct for the approximation. This pushes end-to-end speedup from 2.16× to 2.59×.
Strength of evidence
Tested on Wan2.2-I2V (4-step distilled), single H20 GPU, 720p 61-frame video. The FP8 block-sparse attention backend delivers 6.71× attention-layer acceleration at 35.2% keep ratio versus FlashAttention-2. 4 End-to-end latency: 78.9 seconds with sparsity reuse versus 160 seconds without, against an SVG2-quality baseline, a 2.03× wall-clock reduction (160 / 78.9). The headline 2.16× figure comes from a slightly different configuration.
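A back-of-envelope way to see why a 6.71× attention speedup yields roughly 2× end to end is Amdahl's law. The sketch below is an illustration under loud assumptions: it pretends attention is the only accelerated component, and the attention fractions it sweeps (0.5, 0.6, 0.7) are hypothetical values, not measurements from the paper. `amdahl_speedup` is a helper name introduced here.

```python
def amdahl_speedup(fraction, local_speedup):
    """End-to-end speedup when `fraction` of runtime is sped up by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Figures quoted above: 160 s without reuse, 78.9 s with reuse.
measured_speedup = 160.0 / 78.9
attention_speedup = 6.71

print(f"measured end-to-end speedup: {measured_speedup:.2f}x")
# Hypothetical attention shares of total runtime (assumed, not reported).
for fraction in (0.5, 0.6, 0.7):
    estimate = amdahl_speedup(fraction, attention_speedup)
    print(f"attention share {fraction:.0%}: about {estimate:.2f}x end to end")
```

The non-attention part of the pipeline caps the end-to-end gain, so large kernel-level speedups translate into much smaller wall-clock improvements.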
Quality is measured against SVG2 baseline outputs; the paper reports equivalent generation quality.
What it means
The practical scope is I2V serving infrastructure, not research training. If your deployment receives batches of I2V requests with correlated input images — stock media workflows, user-generated content pipelines with thematically similar submissions, reference-guided generation — the cross-request reuse assumption holds. On uncorrelated request streams the DINO similarity condition may fail often enough to erode the speedup.
The sparsity-reuse approach also requires a retrieval store of historical request masks, which means it is a stateful serving component rather than a drop-in replacement for the attention module.
5. iLLaDA: scaling masked diffusion to 8B and 12 trillion tokens
arXiv: 2606.25331 · Shen Nie, Qiyang Min, Shaoxuan Xu et al. · Renmin University of China, Gaoling School of AI (GSAI) · Submitted June 24, 2026 5
Code and weights: github.com/ML-GSAI/LLaDA
Venue: Preprint.
What it does
LLaDA (2025) showed that masked diffusion language models could reach competitive performance at smaller scales. iLLaDA (Improved LLaDA) scales the same paradigm to 8 billion parameters trained on 12 trillion tokens, with a 25B-token instruction fine-tuning corpus trained for 12 epochs. The model uses fully bidirectional attention — no causal mask — throughout both pretraining and SFT, maintaining the masked diffusion objective end-to-end. Two inference improvements are added: variable-length generation (reducing padding overhead) and confidence-based scoring for multiple-choice evaluation. 5
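For readers new to the paradigm, the sketch below shows a generic masked-diffusion training step in the style of the original LLaDA formulation: sample a masking ratio t, mask each token independently with probability t, and score the model only on masked positions with a 1/t weight. It is an assumption-laden illustration, not iLLaDA's code. `MASK`, `forward_mask`, and `masked_diffusion_loss` are hypothetical names, and the per-token log-probabilities are hard-coded stand-ins for a real bidirectional model's output.

```python
import random

MASK = "[MASK]"


def forward_mask(tokens, t, rng):
    """Forward process: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(token_log_probs, noisy_tokens, t):
    """Negative log-likelihood on masked positions, weighted by 1/t.

    token_log_probs[i] is the model's log-probability of the true token
    at position i given the partially masked sequence (bidirectional
    context, no causal mask). Normalizing by sequence length is an
    assumption of this sketch.
    """
    masked = [i for i, tok in enumerate(noisy_tokens) if tok == MASK]
    total = sum(token_log_probs[i] for i in masked)
    return -total / (t * len(noisy_tokens))


rng = random.Random(0)
tokens = ["diffusion", "models", "can", "write", "text"]
t = rng.uniform(0.1, 1.0)            # masking ratio for this training step
noisy = forward_mask(tokens, t, rng)
fake_log_probs = [-0.1, -0.2, -0.3, -0.4, -0.5]  # stand-in model outputs
print(noisy, masked_diffusion_loss(fake_log_probs, noisy, t))
```

Unmasked positions contribute nothing to the loss, so the model is trained to fill in arbitrary subsets of a sequence rather than to predict the next token left to right.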
Strength of evidence
Against LLaDA (the direct predecessor):
- iLLaDA-Base: +21.6 points on BBH (BIG-Bench Hard, a multi-task reasoning benchmark), +14.9 points on ARC-Challenge 5
- iLLaDA-Instruct: +14.5 points on MATH, +16.5 points on HumanEval 5
The paper reports that iLLaDA performs competitively with Qwen2.5 7B across multiple benchmarks despite using non-autoregressive generation. Weights and code are public.
The gaps versus autoregressive models are not fully closed — the paper characterizes performance as "competitive," not superior. The specific benchmarks and magnitude of any remaining gaps are in the full paper.
What it means
The gap between masked diffusion LLMs and autoregressive LLMs has been the central question for this research line since the original MDLM and LLaDA papers. iLLaDA's results suggest the gap is primarily a scale and training-recipe problem, not a fundamental architectural one — at 8B + 12T, a bidirectional masked diffusion model reaches the range where comparison to mainstream 7B autoregressive models is reasonable. 5
The practical implications are narrow for now: autoregressive inference benefits from KV-cache and speculative-decoding toolchains that don't transfer directly to masked diffusion, and deployment infrastructure strongly favors autoregressive models. But the benchmark trajectory from LLaDA to iLLaDA provides evidence that continued scaling is worth pursuing.
Cross-paper themes
Two patterns stand out across this batch.
Speed by reuse, not by simplification. Causal-rCM and Chorus II both get their speedups by cutting redundant work (fewer denoising steps, fewer attention ops) rather than by shrinking or approximating the model architecture. Causal-rCM reuses the rCM recipe (TF-CM + SF-DMD complementarity) in the autoregressive domain; Chorus II reuses sparsity masks across requests. Neither changes the model architecture, although Causal-rCM does retrain weights through distillation. The pattern suggests that for production systems, the next round of throughput gains will come from smarter scheduling and reuse rather than architectural compression alone.
Bidirectional attention as the contested boundary. UniTeD and iLLaDA both operate with bidirectional context — UniTeD across perception and planning outputs simultaneously, iLLaDA across the full token sequence. The BLOCK transcoder also uses bidirectional feature access within the bottleneck. Autoregressive video generation (Causal-rCM) and autoregressive text (standard LLMs) represent the other side of that boundary. This week's papers don't settle which side wins in the long run, but they show that bidirectional diffusion has matured enough to compete on benchmarks that previously belonged exclusively to the autoregressive camp.
| Paper | Core mechanism | Evaluation benchmark | Evidence level |
|---|---|---|---|
| Causal-rCM | TF-CM + SF-DMD unification with JVP kernel | VBench-T2V | Preprint |
| BLOCK | Transcoder replacement + concept weight redirect | UA / CRA / adversarial | ICML 2026 |
| UniTeD | Joint perception+planning diffusion decoder + TTM + ARS | NAVSIM PDMS / Bench2Drive DS | ECCV 2026 |
| Chorus II | Cross-request sparsity mask reuse via DINO similarity | Wall-clock latency vs. SVG2 | Preprint |
| iLLaDA | Masked diffusion at 8B + 12T tokens | BBH / ARC-C / MATH / HumanEval | Preprint |



