Fast-dLLM v2 is a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained autoregressive (AR) models into dLLMs for parallel text generation, requiring only approximately 1B tokens of fine-tuning. This represents a 500x reduction in training data compared to full-attention diffusion LLMs while preserving the original model's performance.
See `asset/demo.mp4` for a demo of Fast-dLLM v2's parallel decoding.
- Novel training recipe combining block diffusion with complementary attention masks
- Enables blockwise bidirectional context modeling
- Token shift mechanism to retain autoregressive characteristics
- Block-level cache: Stores historical context representations across blocks
- Sub-block cache: Enables efficient parallel generation within partially decoded blocks
- Achieves up to 2.5x speedup over standard AR decoding
- Real-time visualization of the denoising process
- Maintains generation quality while delivering state-of-the-art efficiency
Fast-dLLM v2 significantly outperforms baselines in both efficiency and accuracy:
- 2.54× higher throughput than Qwen2.5-7B-Instruct
- 5.2% accuracy improvement over Fast-dLLM-LLaDA
Comprehensive evaluation across diverse tasks:
| Model Size | Model | HumanEval-Base | HumanEval-Plus | MBPP-Base | MBPP-Plus | GSM8K | Math | IFEval | MMLU | GPQA | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1B-scale | Fast-dLLM v2 (1.5B) | 43.9 | 40.2 | 50.0 | 41.3 | 62.0 | 38.1 | 47.0 | 55.1 | 27.7 | 45.0 |
| 7B+ scale | Fast-dLLM v2 (7B) | 63.4 | 58.5 | 63.0 | 52.3 | 83.7 | 61.6 | 61.4 | 66.6 | 31.9 | 60.3 |
First, create and activate a conda environment:
```bash
conda create -n lmflow python=3.9 -y
conda activate lmflow
conda install mpi4py
```

Install the package in development mode:

```bash
pip install -e .
```

Download the training data (e.g., the Alpaca dataset):

```bash
cd data
bash download.sh alpaca
```

Run the fine-tuning script:

```bash
bash train_scripts/finetune_alpaca.sh
```

This starts training on the Alpaca dataset with the optimized block diffusion training recipe.
Launch the Gradio-based web interface:
```bash
python app.py
```

This starts a web server at http://localhost:10086 with:
- Real-time conversation interface
- Live visualization of the denoising process
- Adjustable generation parameters (block size, temperature, threshold)
- Performance metrics display
For a simple command-line interface:
```bash
python run_chatbot.py
```

Commands:
- Type your message and press Enter to chat
- `clear`: clear the conversation history
- `exit`: quit the chatbot
Execute the evaluation script for comprehensive benchmarking:
```bash
bash eval_script.sh
```

This script evaluates the model on:
- MMLU: Massive Multitask Language Understanding
- GPQA: Graduate-level Google-Proof Q&A
- GSM8K: Grade School Math 8K
- Minerva Math: Mathematical reasoning
- IFEval: Instruction following evaluation
For custom evaluation with specific parameters:
```bash
accelerate launch eval.py \
    --tasks gsm8k \
    --batch_size 32 \
    --num_fewshot 0 \
    --model fast_dllm_v2 \
    --model_args model_path=Efficient-Large-Model/Fast_dLLM_v2_7B,threshold=0.9
```

- Token Shift Mechanism: Each masked token is predicted using the logits of its preceding token
- Block-wise Causal Attention: Access to all clean tokens from previous blocks and noisy tokens within current block
- Complementary Masks: Alternate masking patterns ensure every token position is learned
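The masking pattern and token shift above can be sketched in plain Python. This is a toy illustration, not the actual training code; the function names and the boolean-mask representation are our own:

```python
def blockwise_causal_mask(seq_len, block_size):
    """allowed[i][j] is True iff query position i may attend to key
    position j: all tokens in earlier (clean) blocks, plus every token
    inside i's own block (bidirectional within the block)."""
    block = [i // block_size for i in range(seq_len)]
    return [[block[j] <= block[i] for j in range(seq_len)]
            for i in range(seq_len)]


def token_shift_pairs(token_ids):
    """Token shift: the hidden state at position i-1 produces the
    prediction for the token at position i, preserving the next-token
    objective of the AR base model."""
    return [(token_ids[i - 1], token_ids[i]) for i in range(1, len(token_ids))]


# With block_size=4, positions 0-3 see each other bidirectionally,
# but position 0 cannot see position 4 (a future block).
mask = blockwise_causal_mask(seq_len=8, block_size=4)
```

The complementary masks alternate which positions are noised across the two training passes so every position receives a learning signal; that bookkeeping is omitted here.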
- Block-level Generation: Autoregressive at the block level
- Sub-block Parallelization: Parallel decoding within blocks for efficiency
- Hierarchical Caching: Block and sub-block level caching for speed optimization
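A minimal sketch of this hierarchy, with hypothetical names (the real implementation lives in `generation_functions.py`): the outer loop reveals the block sub-block by sub-block, and within the active sub-block every masked position whose confidence clears the threshold is committed in the same step. Caching is omitted for brevity.

```python
MASK = -1  # sentinel for a not-yet-decoded position (our convention)

def decode_block(predict, block_len, sub_block, threshold):
    """Toy block decoder. `predict(tokens)` stands in for a model call
    and must return {pos: (token, confidence)} for each masked position."""
    tokens = [MASK] * block_len
    for start in range(0, block_len, sub_block):
        window = range(start, start + sub_block)
        while any(tokens[i] == MASK for i in window):
            preds = predict(tokens)
            # Parallel step: commit every position above the threshold.
            step = [i for i in window
                    if tokens[i] == MASK and preds[i][1] >= threshold]
            if not step:
                # Always make progress: commit the most confident position.
                step = [max((i for i in window if tokens[i] == MASK),
                            key=lambda i: preds[i][1])]
            for i in step:
                tokens[i] = preds[i][0]
    return tokens
```

With a high-confidence stub model, each sub-block resolves in a single model call, which is where the speedup over one-token-at-a-time AR decoding comes from.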
```
v2/
├── app.py                     # Gradio web interface
├── run_chatbot.py             # Command-line chatbot
├── eval.py                    # Evaluation harness integration
├── eval_script.sh             # Benchmark evaluation script
├── generation_functions.py    # Core generation algorithms
├── index.html                 # Project webpage
├── asset/                     # Visual assets
│   ├── demo.mp4
│   ├── benchmark_results.png
│   ├── throughput.png
│   ├── training_recipe.png
│   └── visualization_animation.gif
└── README.md                  # This file
```
The web interface provides real-time visualization of:
- Denoising Process: Watch tokens being unmasked in real-time
- Generation Progress: Visual feedback of the generation pipeline
- Performance Metrics: Live throughput and timing information
- Slow Motion Replay: Detailed step-by-step visualization
- Based on Qwen2.5 architecture with block diffusion modifications
- 7B parameter model with efficient parallel decoding capabilities
- Custom attention mechanisms for block-wise processing
- Block-level KV caching for reduced computation
- Sub-block parallel processing for improved throughput
- Confidence-aware token unmasking for quality preservation
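The block-level caching idea can be sketched as follows (the class and method names are hypothetical, not the project's API): the key/value states of a finished block are computed once and then served from the cache for every later block instead of being recomputed.

```python
class BlockKVCache:
    """Minimal sketch of block-level KV caching: completed blocks are
    frozen, so their states are computed exactly once."""

    def __init__(self):
        self._blocks = {}     # block_index -> cached (key, value) states
        self.hits = 0
        self.recomputes = 0

    def get(self, block_idx, compute_kv):
        """Return cached KV for a block, computing it on first access."""
        if block_idx in self._blocks:
            self.hits += 1
        else:
            self.recomputes += 1
            self._blocks[block_idx] = compute_kv(block_idx)
        return self._blocks[block_idx]
```

A sub-block cache extends the same idea one level down, letting partially decoded blocks reuse the states of already-committed positions.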
We welcome contributions! Please see our Contributing Guidelines for details.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
If you find this work useful, please cite our paper:
```bibtex
@misc{wu2025fastdllmv2efficientblockdiffusion,
      title={Fast-dLLM v2: Efficient Block-Diffusion LLM},
      author={Chengyue Wu and Hao Zhang and Shuchen Xue and Shizhe Diao and Yonggan Fu and Zhijian Liu and Pavlo Molchanov and Ping Luo and Song Han and Enze Xie},
      year={2025},
      eprint={2509.26328},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.26328},
}
```

We thank the Qwen team for the Qwen2.5 base model architecture.



