X-Slim: No Cache Left Idle -- Accelerating Diffusion Model via Extreme-Slimming Caching

🎥 Demo Video

📝 Abstract

Diffusion models deliver strong generative quality, but inference cost scales with timestep count, model depth, and token length. Feature caching reuses nearby computations, yet aggressive timestep skipping often hurts fidelity while conservative block or token refresh yields limited speedup. We present X-Slim (eXtreme-Slimming Caching), a training-free, cache-based accelerator that jointly exploits redundancy across temporal, structural, and spatial dimensions. X-Slim introduces a dual-threshold push-then-polish controller: it first pushes timestep-level reuse up to an early-warning line, then polishes residual error with lightweight block- and token-level refresh; a critical line triggers full inference to reset error. Level-specific, context-aware indicators guide when and where to cache, shrinking search overhead. On FLUX.1-dev and HunyuanVideo, X-Slim reduces latency by up to 4.97x and 3.52x with minimal perceptual loss, and on DiT-XL/2 it reaches 3.13x acceleration with a FID improvement of 2.42 over prior methods.

💡 Motivation

(1) Inference-time redundancy appears in three dimensions: many iterative timesteps, deep stacks of Transformer blocks, and long token sequences at high resolution. Adjacent timesteps show strong feature similarity across all levels!

🌟 Is it possible to remove redundancy across temporal, structural, and spatial dimensions via caching?

(2) Pure timestep reuse can quickly cross a quality threshold, while only refreshing blocks or tokens leaves speed on the table.

🌟 How can we harness the benefits of aggressive step-level and conservative block/token reuse to achieve maximal speed while preserving quality?

🛠️ Methodology

Analysis of feature similarity across step-level and block-level dimensions

Push-then-Polish Caching. Rather than directly mixing reuse at all three levels, X-Slim introduces a push-then-polish caching paradigm that exploits cacheable redundancy across temporal, structural, and spatial dimensions. Step-level reuse is pushed until accumulated error reaches an early-warning line, then polished with partial refresh of selected blocks and tokens. Once error reaches a critical line, a full step is run to reset it.
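The dual-threshold logic can be illustrated with a minimal Python sketch. It is not the released implementation: the scalar error estimate, the per-step `step_drift` input, the `polish_factor` that models how much error a partial refresh removes, and all class and field names are assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class Action(Enum):
    FULL = "full"        # run every block on every token; resets error
    PARTIAL = "partial"  # refresh only selected blocks/tokens ("polish")
    REUSE = "reuse"      # reuse the cached step output ("push")


@dataclass
class PushThenPolishController:
    """Hypothetical controller with an early-warning and a critical line."""

    warn_line: float             # early-warning line
    critical_line: float         # critical line, assumed > warn_line
    polish_factor: float = 0.5   # assumed share of error left after a partial refresh
    error: float = 0.0           # running estimate of accumulated error

    def decide(self, step_drift: float) -> Action:
        """Choose an action given an estimated drift for the upcoming step."""
        projected = self.error + step_drift
        if projected >= self.critical_line:
            # Critical line crossed: full inference resets the error.
            self.error = 0.0
            return Action.FULL
        if projected >= self.warn_line:
            # Early-warning line crossed: polish with a partial refresh.
            self.error = projected * self.polish_factor
            return Action.PARTIAL
        # Below both lines: keep pushing step-level reuse.
        self.error = projected
        return Action.REUSE


def plan(controller: PushThenPolishController,
         drifts: Iterable[float]) -> List[Action]:
    """Apply the controller to a sequence of per-step drift estimates."""
    return [controller.decide(d) for d in drifts]
```

For example, with `warn_line=0.375`, `critical_line=0.75`, and the drift sequence `[1.0, 0.125, 0.125, 0.125, 0.125, 0.5, 0.5]`, this sketch yields full, reuse, reuse, partial, reuse, full, partial: cheap reuse dominates while drift is small, and full inference appears only where drift is large.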

Level-specific Strategy. Different levels exhibit distinct reuse dynamics. Adjacent steps follow a U-shaped pattern and are weakly prompt dependent. Block sensitivity varies with depth yet follows a consistent depth-wise pattern across timesteps. Tokens are largely content dependent. Building on these properties, X-Slim adopts a hybrid level-specific strategy that plays to each level's strengths.
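Because tokens are content dependent, a token-level refresh has to pick its targets at run time for each input. The sketch below shows one way such a selection could look. The indicator used here (mean absolute change of each token's input between the cached step and the current step) and the names `select_tokens_to_refresh` and `refresh_ratio` are hypothetical, not taken from the paper.

```python
from typing import List, Sequence


def select_tokens_to_refresh(
    cached_input: Sequence[Sequence[float]],
    current_input: Sequence[Sequence[float]],
    refresh_ratio: float,
) -> List[int]:
    """Return sorted indices of the tokens whose inputs changed the most.

    Each argument is a list of per-token feature vectors. Tokens not
    returned would keep their cached outputs in this sketch.
    """
    if len(cached_input) != len(current_input):
        raise ValueError("token counts must match")

    # Assumed indicator: mean absolute change per token.
    scores = [
        sum(abs(a - b) for a, b in zip(old, new)) / len(old)
        for old, new in zip(cached_input, current_input)
    ]

    # Refresh at least one token, at most the requested share.
    k = max(1, int(refresh_ratio * len(scores)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)
```

A block-level counterpart could instead use a fixed depth-wise mask, since block sensitivity is described as following a consistent pattern across timesteps.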

🖼️ Qualitative Comparison

Across text-to-image and text-to-video tasks, X-Slim produces sharper structures and more consistent details than competing baselines while running at a higher speedup.

Comparison on FLUX.1-dev

Comparison on HunyuanVideo

📊 Quantitative Evaluation

Quantitative evaluation shows that X-Slim achieves the best speed–quality trade-off across tasks among the compared methods. It reduces latency by up to 4.97x on FLUX.1-dev and 3.52x on HunyuanVideo, and reaches 3.13x acceleration on DiT-XL/2 with an FID improvement of 2.42 over prior methods.

Evaluation on FLUX.1-dev & DiT-XL/2

Quantitative results on FLUX.1-dev & DiT-XL/2

Evaluation on HunyuanVideo

📚 BibTeX

If you find this work helpful, please cite:

@article{xslimcache2025,
  title={No Cache Left Idle: Accelerating Diffusion Model via Extreme-Slimming Caching},
  author={Wen, Tingyan and Li, Haoyu and Chen, Yihuang and Zhou, Xing and Zhu, Lifei and Wang, XueQian},
  journal={arXiv preprint arXiv:2512.12604},
  year={2025}
}
