From 67967e8689455df22a8520d621c8462ba4c52926 Mon Sep 17 00:00:00 2001
From: hongping
Date: Tue, 3 Mar 2026 12:37:28 +0800
Subject: [PATCH] docs: add energy efficiency considerations to bitsandbytes
 quantization guide

---
 docs/source/en/quantization/bitsandbytes.md | 23 +++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/docs/source/en/quantization/bitsandbytes.md b/docs/source/en/quantization/bitsandbytes.md
index da5106a23347..e7a0f150806c 100644
--- a/docs/source/en/quantization/bitsandbytes.md
+++ b/docs/source/en/quantization/bitsandbytes.md
@@ -253,6 +253,29 @@ model_8bit = AutoModelForCausalLM.from_pretrained(
 )
 ```
 
+### Energy efficiency considerations
+
+When deploying quantized models, it is important to understand how different quantization configurations affect energy consumption.
+
+**INT8 mixed-precision trade-offs**
+
+The default `llm_int8_threshold=6.0` configuration preserves accuracy well, but may increase energy consumption by 17-33% compared to FP16 because of the mixed-precision decomposition overhead. For production deployments where model quality matters, this is usually a justified trade-off.
+
+**Threshold configuration**
+
+While setting `llm_int8_threshold=0.0` can speed up inference, it is not recommended for quality-sensitive workloads. Benchmarking shows that `threshold=0.0` saves only ~3% energy but can cause significant accuracy degradation (up to a 25% perplexity increase in some models). The default `threshold=6.0` strikes the best balance between accuracy and performance for most use cases.
+
+**Model size considerations**
+
+- **NF4 quantization**: most energy-efficient for models with >5B parameters, where memory bandwidth savings outweigh dequantization overhead
+- **Small models** (<5B parameters): may not benefit from NF4 quantization because dequantization costs can exceed the memory bandwidth savings
+
+**Batch size impact**
+
+Energy efficiency improves significantly with larger batch sizes. Increasing from `batch_size=1` to `batch_size=8-64` can reduce energy consumption by 84-96%, making batching a critical optimization for production deployments.
+
+For detailed benchmarking data and methodology, see the [EcoCompute AI benchmark repository](https://github.com/hongping-zh/ecocompute-ai).
+
 ### Skip module conversion
 
 For some models, like [Jukebox](model_doc/jukebox), you don't need to quantize every module to 8-bit because it can actually cause instability. With Jukebox, there are several `lm_head` modules that should be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`].