Fast-dLLM is a family of acceleration techniques for diffusion-based Large Language Models (dLLMs) and Vision-Language Models (dVLMs). This repository contains:
| Fast-dLLM v1 | Fast-dLLM v2 | Fast-dVLM | |
|---|---|---|---|
| Paper | Training-free Acceleration of Diffusion LLM | Efficient Block-Diffusion LLM | Block-Diffusion VLM via Direct Conversion |
| Modality | Text | Text | Vision + Text |
| Approach | Training-free inference acceleration | Block diffusion with fine-tuning | Direct AR-to-diffusion VLM conversion |
| Backbone | Dream, LLaDA | Qwen2.5 | Qwen2.5-VL |
| Key Techniques | KV Cache + Parallel Decoding | Block Diffusion + Hierarchical Caching | Block-Size Annealing + Speculative Decoding |
| Code | v1/ |
v2/ |
fast_dvlm/ |
| Model | — | Fast_dLLM_v2_7B | Fast_dVLM_3B |
- (🔥 New) [2026/04/10] Fast-dVLM is released! Up to 6.18x speedup over AR baseline while matching quality across 11 benchmarks. Check out our webpage, model, and paper!
- (🔥 New) [2026/01/26] Fast-dLLM v1/v2 is accepted by ICLR-2026. 🎉🎉🎉
- [2025.10.08] We have open sourced Fast-dLLM v2. Have a look at our webpage, model, and paper!
- [2025.08.01] Our new online demo of Fast-dLLM: https://fast-dllm.hanlab.ai/, welcome to try!
- [2025.07.06] Added factor-based parallel strategy and LLaDA-1.5 evaluation in
v1/llada/eval_gsm8k.sh. - [2025.07.04] We updated our paper with latest improvements and evaluation results.
- [2025.06.30] Fast-dLLM has been integrated into LLaDA-V. With Fast-dLLM, it accelerates the inference latency from 60s to 6s! Have a try here!!
- [✅] Inference and evaluation code
- [✅] Training code of Fast-dLLM v2
- [✅] Fast-dVLM: Block-diffusion VLM
- [🚀] vLLM support
Fast-dLLM/
├── v1/ # Fast-dLLM v1: Training-free acceleration (LLM)
│ ├── dream/ # Dream model support
│ ├── llada/ # LLaDA model support
│ ├── requirements.txt
│ └── README.md
├── v2/ # Fast-dLLM v2: Block diffusion (LLM)
│ ├── src/ # LMFlow training framework
│ ├── train_scripts/ # Fine-tuning scripts
│ ├── configs/ # DeepSpeed configs
│ ├── generation_functions.py
│ ├── eval.py / eval_script.sh
│ ├── app.py / run_chatbot.py
│ ├── requirements.txt
│ └── README.md
├── fast_dvlm/ # Fast-dVLM: Block-diffusion VLM
│ ├── run_chatbot.py # Command-line chatbot
│ ├── requirements.txt
│ └── README.md
├── CONTRIBUTING.md
├── LICENSE
└── README.md # This file
cd v1
pip install -r requirements.txt
# LLaDA interactive chat
python llada/chat.py --gen_length 128 --steps 128 --block_size 32
# Dream evaluation
accelerate launch dream/eval.py --model dream \
--model_args pretrained=Dream-org/Dream-v0-Base-7B,max_new_tokens=256,diffusion_steps=8,add_bos_token=true,alg=confidence_threshold,threshold=0.9,use_cache=true \
--tasks gsm8k --num_fewshot 5 --batch_size 1For full details, see v1/README.md.
cd v2
pip install -e .
# Gradio web demo
python app.py
# Evaluation
bash eval_script.shFor full details, see v2/README.md.
cd fast_dvlm
pip install -r requirements.txt
# Quick inference
python run_chatbot.py \
--model-name Efficient-Large-Model/Fast_dVLM_3B \
--image path/to/image.jpg \
--prompt "Describe this image in detail."
# Interactive mode
python run_chatbot.pyFor full details, see fast_dvlm/README.md.
Issues and Pull Requests are welcome! Please see CONTRIBUTING.md for details.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
If you find this work useful, please cite our papers:
@misc{wu2026fastdvlmefficientblockdiffusionvlm,
title={Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM},
author={Chengyue Wu and Shiyi Lan and Yonggan Fu and Sensen Gao and Jin Wang and Jincheng Yu and Jose M. Alvarez and Pavlo Molchanov and Ping Luo and Song Han and Ligeng Zhu and Enze Xie},
year={2026},
eprint={2604.06832},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.06832},
}
@misc{wu2025fastdllmv2efficientblockdiffusion,
title={Fast-dLLM v2: Efficient Block-Diffusion LLM},
author={Chengyue Wu and Hao Zhang and Shuchen Xue and Shizhe Diao and Yonggan Fu and Zhijian Liu and Pavlo Molchanov and Ping Luo and Song Han and Enze Xie},
year={2025},
eprint={2509.26328},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.26328},
}
@misc{wu2025fastdllmtrainingfreeaccelerationdiffusion,
title={Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding},
author={Chengyue Wu and Hao Zhang and Shuchen Xue and Zhijian Liu and Shizhe Diao and Ligeng Zhu and Ping Luo and Song Han and Enze Xie},
year={2025},
eprint={2505.22618},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.22618},
}We would like to thank the authors of LLaDA and Dream for their excellent work and open-source contributions. We thank Qwen2.5 and Qwen2.5-VL for the base model architectures and LMFlow for the training framework.