This document provides a thorough analysis of the FLUTE (Flexible Lookup Table Engine) kernel for lookup-table-quantized LLM inference. It covers the kernel architecture, implementation details, performance characteristics, and relevance to the bitsandbytes kbit GEMM kernel design.
FLUTE and the bitsandbytes kbit GEMM kernel are two different approaches to the same problem — fused dequantization + matrix multiplication for lookup-table-quantized LLM weights — with comparable instruction-level efficiency.
They are similar in:
- Core operation: load compressed weights, dequant via codebook, tensor core MMA
- Instruction count per element: roughly comparable (~3-6 ops depending on bit width)
- Performance regime: both achieve 2-4x over FP16 at small batch, converging to dense throughput at large batch (fundamental property of weight-only quantization)
- Both require offline weight repacking for GEMM-friendly tile layout
FLUTE trades flexibility for per-shape optimization:
- Built on CUTLASS 3 / CuTe — gets multi-stage pipelining and Stream-K for free
- Requires per-(shape, bits, group_size, GPU) compilation and auto-tuning
- Shape-specialized binaries limit deployment flexibility
- CUTLASS dependency (pinned to v3.4.1)
- 3-bit uses bit-slice decomposition (1+2 split) — different code path, ~33% more instructions than 4-bit
- No 5-bit support
- Focused on A100/A6000; RTX 4090 supported but less tuned
kbit trades CUTLASS infrastructure for simplicity and breadth:
- Self-contained hand-written CUDA, no external dependencies
- Uniform code path for K=2,3,4,5 via bit-plane format — no special cases
- No per-shape recompilation or tuning needed
- Register-based codebook lookup via
__shfl_sync(zero memory, 1 cycle) - E4M4 absmax (1 byte per block of 32) — finer granularity than FLUTE's FP16 scales
- Developed and tested on RTX 4090; not yet tuned for data center GPUs
Bottom line: FLUTE does not have a fundamental architectural advantage over the kbit design. The two kernels have similar instruction-level efficiency with different engineering trade-offs. FLUTE's head start is that it exists as a working fused GEMM today and has been benchmarked on data center GPUs. Once the kbit GEMM is implemented and tuned for A100/H100, there is no reason to expect FLUTE would be meaningfully faster. The bitsandbytes ecosystem integration (Transformers, PEFT, Accelerate) and broader bit-width support (K=2-5 uniform) are practical advantages that matter more than marginal kernel-level performance differences.
FLUTE has limited real-world adoption despite its EMNLP 2024 publication — it is not a default in any major inference framework and has known issues (shape specialization, numerical instability at some configurations, bfloat16 underperformance). It is best understood as an academic contribution that validates the LUT-quantized GEMM approach, not as a production system to compete against.
- Overview and Motivation
- The Core Problem: LUT-Quantized GEMM on GPUs
- Three-Part Solution Architecture
- Offline Weight Restructuring (Section 3.1)
- Vectorized Lookup Table with Duplication (Section 3.2)
- Stream-K Workload Partitioning (Section 3.3)
- CUTLASS 3 / CuTe Implementation
- Source Code Structure
- Kernel Configuration and Tuning
- NormalFloat and NFL (Learned NormalFloat)
- Performance Analysis
- Comparison with Other Kernels
- Relevance to Bitsandbytes kbit GEMM
- Limitations and Known Issues
- Links and References
Paper: "Fast Matrix Multiplications for Lookup Table-Quantized LLMs" Authors: Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim Published: EMNLP 2024 (Findings) ArXiv: 2407.10960 (v4, January 17, 2025)
FLUTE is a CUDA kernel engine for efficient inference of weight-quantized LLMs where the quantization is based on lookup tables (LUT) rather than uniform (linear) integer quantization. This distinction is critical:
-
Uniform quantization (e.g., standard INT4):
dequant(q) = q * scale + zeroSimple arithmetic, easily fused with GEMM. -
LUT quantization (e.g., NF4, custom codebooks):
dequant(q) = table[q] * scaleRequires a table lookup per element, which is fundamentally different from arithmetic dequantization and presents unique GPU optimization challenges.
FLUTE supports arbitrary lookup tables, making it compatible with:
- Integer quantization: int4, int3, int2
- Floating-point: fp4, fp3, fp2
- Normal float variants: nf4, nf3, nf2
- Learned Normal Float (NFL): A learnable extension to QLoRA's nf4
- Custom arbitrary tables (any 2^K values)
At batch sizes < 32 with group size 128 (typical LLM inference), FLUTE achieves 2-4x speedup over existing GEMM kernels and 1.5-2x end-to-end throughput improvement on LLaMA-3 models.
The paper identifies three fundamental challenges for building a high-performance LUT-quantized matmul kernel on GPUs:
Tensor Cores have strict requirements on data types, shapes, and layouts. Quantized weights at non-standard bit widths (especially 3-bit) cannot be packed evenly into the 128-bit vectorized memory accesses that feed the tensor core pipeline. For example:
- 4-bit: 32 values per 128-bit word (clean)
- 3-bit: 42.67 values per 128-bit word (does not divide evenly)
- 2-bit: 64 values per 128-bit word (clean)
The 3-bit case is problematic: you cannot load a clean set of 3-bit values with a single 128-bit async copy instruction.
LUT-based dequantization requires dynamic indexing into a table. GPUs do not natively support efficient dynamic indexing of data in their fastest on-chip storage (registers). The alternatives are:
- Registers: No dynamic indexing. Would need a switch/case statement.
- Shared memory: Supports dynamic indexing but has limited bandwidth (32 banks, 32-bit each) and potential bank conflicts.
- Constant memory: Broadcasts to all threads if they access the same address, but serializes if they access different addresses.
Since each thread typically looks up a different index, shared memory is the natural choice, but naive implementations suffer from bank conflicts.
With low-bit quantization and small batch sizes, the weight matrix is small, producing fewer output tiles. If the number of tiles doesn't fill all SMs evenly, some SMs sit idle in the last "wave" (wave quantization). This is a significant efficiency loss for the small-matrix regime that LLM inference typically operates in.
FLUTE addresses these challenges with three complementary techniques:
-
Offline weight restructuring (Section 3.1): Reorder quantized weights at model-load time so that after dequantization, the data is already in the layout that tensor cores expect. This moves bit-manipulation overhead from runtime to load time.
-
Vectorized and duplicated lookup table (Section 3.2): Store the LUT in shared memory, but access two values simultaneously (vectorization) and duplicate the table across banks (duplication) to eliminate bank conflicts.
-
Stream-K workload partitioning (Section 3.3): Use fine-grained work distribution across SMs to minimize wave quantization effects.
Consider 3-bit quantization. Each weight is a 3-bit index into a lookup table. Packing these into 128-bit words for async copy:
- 128 / 3 = 42.67 — doesn't divide evenly
- You can't load exactly N complete 3-bit values with a single vector load
Standard approaches pad to 4 bits (wasting 25% of storage) or use complex runtime bit manipulation to extract 3-bit fields from packed words.
FLUTE splits the 3-bit representation into two "bit-slices":
- A 1-bit partition (the most significant bit)
- A 2-bit partition (the two least significant bits)
Each partition is stored separately and can be loaded with standard 128-bit async copy instructions:
- The 1-bit partition: 128 values per 128-bit word
- The 2-bit partition: 64 values per 128-bit word
After loading both slices into registers, they are combined via bit manipulation:
combined_index = (bit_slice_1 << 2) | bit_slice_2
This avoids any runtime overhead from non-aligned bit extraction.
The quantized weight matrix is permuted offline (at model load time) so that after
the bit-slices are loaded and dequantized, the resulting values are already in the
exact register layout that the m16n8k16 tensor core instruction expects.
This is possible because the quantized weights are static during inference — they never change. So the permutation is computed once and applied once. At runtime, the kernel simply loads pre-permuted data and feeds it to tensor cores without any reordering overhead.
The permutation accounts for:
- The thread-to-element mapping of the MMA instruction
- The shared-memory-to-register copy layout (ldmatrix)
- The bit-slice separation
4-bit is simpler: 32 values per 128-bit word, clean division. No bit-slice decomposition needed. The offline restructuring still applies — weights are permuted so that the dequantized layout matches tensor core expectations.
2-bit is also clean: 64 values per 128-bit word. Same approach as 4-bit.
The lookup table for dequantization is stored in shared memory. For K-bit quantization, the table has 2^K entries. When 32 threads in a warp each look up a different index, the access pattern can cause bank conflicts.
Shared memory has 32 banks, each 4 bytes wide. If two threads access different 4-byte words in the same bank, the accesses are serialized.
For a 4-bit LUT with 16 entries of 2 bytes (half precision) each:
- Total LUT size: 32 bytes
- The 16 half values occupy banks 0-7 (2 half values per 4-byte bank)
- Threads accessing different indices in the same bank conflict
FLUTE creates an expanded lookup table containing every possible pair of consecutive indices. Instead of looking up one value at a time, it looks up two values simultaneously.
For 4-bit quantization:
- Original table: 2^4 = 16 entries of
half(2 bytes each) = 32 bytes - Vectorized table: 2^8 = 256 entries of
half2(4 bytes each) = 1024 bytes
The kernel extracts pairs of 4-bit indices from packed data, forms an 8-bit index,
and uses it to load a half2 containing both dequantized values in a single shared
memory transaction. This halves the number of shared memory accesses.
For 3-bit quantization:
- Original: 2^3 = 8 entries
- Vectorized: 2^6 = 64 entries of
half2= 256 bytes
Even with vectorization, bank conflicts can still occur. For the 4-bit vectorized table (256 × 4 bytes = 1024 bytes), the entries map across 256 banks positions, cycling through all 32 banks 8 times. If 8 threads in a warp happen to access entries that map to the same bank, you get an 8-way conflict.
FLUTE mitigates this by duplicating the entire vectorized table multiple times in shared memory, placing each copy at a different base address that shifts the bank alignment. When a thread would conflict on one copy, it can access a different copy that maps to a different bank.
The number of duplicates is a tuning parameter. For 4-bit with 256 entries:
- 1 copy: up to 8-way conflicts
- 2 copies: up to 4-way conflicts
- 4 copies: up to 2-way conflicts
- 8 copies: conflict-free (8 KB total — still small vs. 48-164 KB shared memory)
For 3-bit with 64 entries:
- Vectorized table is only 256 bytes
- 2-way conflicts max, so fewer duplicates needed
The duplication count is selected during auto-tuning (see Section 9).
The dequantization in the kernel (packbits_utils.hpp) supports multiple modes:
enum QuantMapModeEnum {
Basic, // Standard per-element LUT lookup
Vectorized, // Vectorized half2 lookup (default)
Vectorized_32, // Vectorized with 32-entry table
Vectorized_16, // Vectorized with 16-entry table
Vectorized_8, // Vectorized with 8-entry table
WarpShuffle, // __shfl_sync-based lookup (registers)
Marlin // Marlin-style arithmetic dequant
};The Vectorized mode is the default and primary mode. The WarpShuffle mode uses
__shfl_sync() for in-register lookups (similar to bitsandbytes' approach). The
Marlin mode delegates to Marlin's lop3-based arithmetic dequantization for
uniform INT4.
Standard GEMM kernels partition the output matrix into tiles and launch one threadblock per tile. If the number of tiles doesn't divide evenly by the number of SMs, the last wave has idle SMs.
Example: 32 output tiles on 132 SMs (H100). Only 32/132 = 24% utilization. Even with split-K to create more blocks, the granularity is coarse.
Stream-K (introduced by CUTLASS) partitions work at a finer granularity than output tiles. Instead of assigning one complete output tile to each threadblock, it distributes individual K-tiles across threadblocks.
The work is linearized: all (M-tile, N-tile, K-tile) combinations are laid out in a 1D sequence and distributed evenly across a fixed number of threadblocks (typically = num_SMs).
When multiple threadblocks contribute to the same output tile (because they process different K-ranges), they synchronize via a semaphore-based fixup:
- Non-finishing blocks store partial accumulator values in a global workspace
- Synchronization via
cutlass::Barrierprimitives (wait_lt,wait_eq,arrive_inc) - The finishing block reads, reduces, and writes the final result
FLUTE's TileScheduler (tile_scheduler_utils.hpp) implements both Split-K
and Stream-K modes:
enum DecompositionModeEnum {
SplitK, // Fixed K-split across slices
StreamK // Fine-grained K-tile distribution
};In Stream-K mode:
- Total tiles =
tiles_M × tiles_N × tiles_K tiles_per_block = total_tiles / num_blocksblocks_special = total_tiles % num_blocks(these get one extra tile)
The FixupHelper handles the inter-block reduction:
BACKWARDSflag reverses logical block ordering so the last block coordinates- Partial sums accumulated in FP32 for numerical stability
- Global reduction done in FP16 to minimize memory traffic
FLUTE is built entirely on CUTLASS 3.x (specifically v3.4.1) using the CuTe (CUDA Templates) abstraction layer. This is a significant architectural choice that differs from hand-written CUDA kernels like Marlin.
CUTLASS 3.x decomposes GEMM into composable layers:
- Device layer: Top-level API, manages grid launch
- Kernel layer: Thread block-level orchestration
- Collective layer: Multi-thread cooperation patterns (sync, pipelining)
- Tiled MMA/Copy: Spatial micro-kernels for tiling
- Atom layer: Hardware-specific instructions (MMA, ldmatrix, cp.async)
FLUTE customizes the Collective and Tiled Copy layers to inject LUT dequantization into the standard GEMM pipeline.
- Layouts:
SmemLayoutA,SmemLayoutQ,SmemLayoutS, etc. with 3x3x3 swizzle patterns for bank-conflict-free shared memory access - TiledCopy: Separate copy operations for A matrix (activations), Q matrix (packed quantized weights), Q2 (second bit-slice for 3-bit), and S (scales)
- TiledMma: SM80_16x8x16 MMA operations for half/bfloat16
- Async copy:
cp.asyncfor global → shared memory transfers with predication - Register fragments:
FragA,FragB,FragC,FragSfor tensor core inputs
The kernel's main loop (from qgemm_kernel.hpp) follows this pattern:
1. PREFETCH: Load lookup table from global → shared memory (once)
2. TILE LOOP: For each K-tile:
a. Async copy: input tile (X) from global → shared
b. Async copy: quantized weight slices (Q1, Q2, S) from global → shared
c. Wait for copies to complete
3. FRAGMENT LOOP: For each register-backed fragment within the tile:
a. Copy fragment data from shared → registers (ldmatrix for A)
b. Load packed weight data from shared → registers
c. For 3-bit: Combine bit-slices in registers
Q_combined = combine(Q1_reg, Q2_reg)
d. Vectorized dequantization:
W_dequant = vec_dequantize(Q_combined, scale_reg, LUT_shared)
e. Tensor core MMA:
Y_reg = tensor_core_mma(Y_reg, X_reg, W_dequant)
4. EPILOGUE: Convert FP32 accumulators → FP16, write to global memory
(with Stream-K fixup if needed)
The kernel uses circular shared memory buffers with configurable pipeline depth
(Stages template parameter, typically 2-4). This overlaps global→shared copies
with shared→register copies and computation:
- Stage N: Computing MMA on fragments from shared memory
- Stage N+1: Loading next tile from global to shared memory
The number of stages is a tuning parameter (see Section 9).
Repository: https://github.com/HanGuo97/flute
| File | Purpose |
|---|---|
qgemm_kernel.hpp |
Main kernel: Template device function qgemm_device and host launcher qgemm_host. Contains the full GEMM pipeline with dequantization. |
config.hpp |
Configuration: GemmConfig template with all tile sizes, thread counts, shared memory layouts, MMA configurations, copy operations. |
packbits_utils.hpp |
Dequantization: DequantizationTraits template with specializations for 2/3/4-bit, vectorized/shuffle/Marlin modes. Core dequant logic. |
tile_scheduler_utils.hpp |
Work distribution: TileScheduler with Split-K and Stream-K modes. FixupHelper for inter-block reduction. |
conversion_utils.hpp |
Type conversion: Register-level tensor type conversion using CUTLASS converters. |
marlin_utils.hpp |
Marlin compatibility: Marlin-style lop3-based INT4 dequantization for uniform quantization mode. |
qgemm_kernel_raw_generated.cu |
Generated instantiations: Pre-compiled kernel variants for supported shapes/configs. |
qgemm_kernel_example.cu |
Example: Template instantiation example showing how to configure a kernel. |
qgemm.cpp |
PyTorch binding: C++ entry point that dispatches to the appropriate kernel template. |
hadamard_transform_cuda.cu |
Hadamard transform: CUDA kernel for the HadaCore integration. |
cutlass_extensions_bf16.h |
BF16 extensions: Additional bfloat16 support utilities. |
| File | Purpose |
|---|---|
ops.py |
PyTorch custom op registration with fake tensor implementations for torch.compile. |
tune.py |
Auto-tuning: benchmarks multiple kernel configurations and selects the fastest. |
packbits_utils.py |
Weight packing: to_binary, from_binary, pack_bools_into_integers, pack_integer_tensors. |
nf_utils.py |
NormalFloat codebook generation via inverse Gaussian CDF. Quantization/dequantization. |
utils.py |
General utilities. |
codegen_utils.py |
Code generation helpers for kernel instantiation. |
The GemmConfig template is parameterized by:
Data types:
T — compute type (half, bfloat16)
TQ — quantized weight type (int16)
TC — accumulation type (float)
TR — reduction type
Threading:
Warps — number of warps per block
Threads — total threads (must be multiple of 128)
Quantization:
NumBits — 2, 3, or 4
GroupSize — 32, 64, 128, or 256
NumPacked — number of packed elements per int16
Tiling:
TileM, TileK, TileP — tile dimensions for M, K, packed-weight axes
Stages — pipeline depth (2-4)
StagesG — pipeline stages for scale loading
Copy operations:
G2SCopySizeA, G2SCopySizeQ, etc. — transfer granularity
MMA configuration:
MmaThrM, MmaThrN, MmaThrK — thread layout within MMA
MmaPrmM, MmaPrmN, MmaPrmK — permutation within MMA
FLUTE is shape-specialized — for each combination of (M, N, K, num_bits, group_size, dtype, GPU), a specific kernel configuration is selected via benchmarking.
The template_id parameter encodes a specific combination of:
- Tile sizes (TileM, TileN, TileK)
- Pipeline stages
- Number of LUT duplicates (for bank conflict mitigation)
- Thread block configuration
- MMA layout
From tune.py:
- For a given matrix shape and quantization config, enumerate candidate
template_idvalues - For each candidate, run the kernel at least 100 times
- Measure average execution time
- Select the fastest
template_id - Cache the result for future use
The tuned template_id is stored in the model's metadata and passed to qgemm()
at inference time.
After tuning, the framework runs correctness checks:
- Generates test cases with known-good outputs
- Compares against thresholds: FP16 ≤ 2.0e-3, BF16 ≤ 1.1e-2
- Each new model shape requires re-tuning
- Different tensor parallel configurations create different shapes
- The team is working on JIT tuning to reduce this constraint
- As of January 2025, experimental auto-tune support removes some shape/GPU specialization
The standard NF codebook (same concept as QLoRA's NF4) generates quantization levels from the inverse Gaussian CDF:
- Generate 2^(b-1) evenly-spaced probability values in [δ, 1/2] and [1/2, 1-δ] where δ = 1/2 × (1/30 + 1/32)
- Convert to quantiles via inverse CDF: q_i = Φ^(-1)(p_i)
- Normalize: q̃_i = q_i / q_{2^b - 1}
The result is a symmetric codebook in [-1, 1] optimized for normally-distributed weights.
For a weight group u with absmax s = max(|u|):
- Quantize: c_j = argmin_i |q̃_i - u_j/s|
- Dequantize: T[Q_{ij}] × s_{(i×j) mod B}
NFL extends NF by learning the scale parameter σ̃:
- Reformulate quantization: c_j = argmin_i |sσ̃q_i - u_j|
- Initialize σ̃ from the standard NF normalization constant: σ̃ = 1/Φ^(-1)(1-δ)
- Optimize σ̃ via gradient descent on negative log-likelihood
- Use calibration data: 128 examples × 2048 tokens from WikiText-2
- Apply straight-through estimator for the argmin gradient
- Save the learned scale as sσ̃/σ (preserves dequantization format)
This adds minimal overhead (learning one scalar per group) but measurably improves quantization quality.
LLaMA-3.1 8B with NFL W4G64:
- WikiText-2 perplexity: 6.24 (vs 6.31 unquantized — actually better due to the calibration fitting)
LLaMA-3.1 70B with NFL W4G64:
- WikiText-2 perplexity: 3.09 (vs 2.82 unquantized)
4-bit quantization, group size 128:
- 2-4× speedup over FP16
torch.mmat batch < 32 - Outperforms bitsandbytes and BitBLAS-NF4 LUT kernels
- Competitive with uniform-quantization kernels (Marlin, BitBLAS-INT4)
- At batch sizes > 32, advantage diminishes (GEMM becomes compute-bound)
3-bit quantization:
- Supported where most other LUT kernels don't support it at all
- Consistent speedups across group sizes 32, 64, 128, 256
LLaMA-3 8B (batch=1, single GPU):
- 4-bit, group=128: ~2.2× tokens/s improvement, perplexity 6.2
- 3-bit, group=128: ~2.4× tokens/s improvement, perplexity 4.6
LLaMA-3 70B (tensor parallelism):
- 4-bit, group=256: ~1.9-2.0× improvement (4×A6000, 2×A100)
- 3-bit, group=256: ~1.7-2.0× improvement (4×A6000, 2×A100)
LLaMA-3.1 405B: Enables single-node inference (impossible without quantization)
Optimized for Ampere GPUs (A100, A6000, RTX 4090). Not yet optimized for Hopper (H100), though it runs. bfloat16 is slower than float16, likely due to lack of Ampere hardware-accelerated bfloat16 atomic-add.
| Aspect | FLUTE | Marlin |
|---|---|---|
| Quantization type | LUT-based (arbitrary codebooks) | Uniform (INT4/INT8 linear) |
| Bit widths | 2, 3, 4 | 4, 8 |
| Dequant method | Shared memory LUT lookup | lop3 bit manipulation in registers |
| Work distribution | Stream-K (CUTLASS) | Custom stripe partitioning |
| Implementation | CUTLASS 3 / CuTe templates | Hand-written CUDA |
| Weight format | Offline-restructured, bit-sliced | Custom tiled INT4 packing |
| Bank conflict handling | LUT duplication + vectorization | N/A (arithmetic dequant) |
| Target GPU | Ampere (SM80) | Ampere + Hopper |
| Performance (4-bit) | Competitive at batch < 32 | Slightly faster at small batch |
| 3-bit support | Yes | No |
| Codebook flexibility | Arbitrary | Linear only |
Key insight: Marlin uses register-level arithmetic for dequantization (no memory access), while FLUTE uses shared memory lookup. For uniform quantization, Marlin's approach is faster. For non-uniform/codebook quantization, FLUTE's approach is necessary.
FLUTE also includes a Marlin mode in its QuantMapModeEnum that delegates to
Marlin-style lop3 dequantization for the uniform INT4 case.
| Aspect | FLUTE | bitsandbytes |
|---|---|---|
| Approach | Fused dequant+GEMM | Separate dequant, then cuBLAS |
| Tensor cores | Yes (via CUTLASS MMA) | No (dequant only, cuBLAS for GEMM) |
| LUT mechanism | Vectorized shared memory | __shfl_sync in registers |
| Bit widths | 2, 3, 4 | 2, 3, 4, 5 (kbit branch) |
| Performance | 2-4× over dequant+cuBLAS | Baseline (dequant+cuBLAS) |
| Aspect | FLUTE | Proposed kbit GEMM |
|---|---|---|
| Framework | CUTLASS 3 / CuTe | Hand-written CUDA |
| LUT storage | Shared memory (vectorized+duplicated) | Registers (__shfl_sync) |
| Work distribution | Stream-K (CUTLASS built-in) | Persistent kernel with split-K |
| Bit widths | 2, 3, 4 | 2, 3, 4, 5 |
| Weight format | Bit-slice decomposed, offline restructured | Bit-plane (from __ballot_sync), tiled |
| Scale format | FP16 group scales | E4M4 absmax (1 byte per block of 32) |
| Block size | Configurable (32, 64, 128, 256) | Fixed at 32 |
| Target GPU | Ampere | Ampere + Hopper |
This section provides a side-by-side analysis of every major design decision,
referencing the actual bitsandbytes kbit implementation on the
feature/kbit-quantization branch (csrc/ops.cu lines 649-869) and the planned
GEMM kernel design from agents/kbit_gemm_context.md.
This is the single biggest architectural difference between the two kernels.
FLUTE: Vectorized shared memory LUT with duplication
FLUTE stores the lookup table in shared memory. To reduce the number of shared memory transactions, it creates a "vectorized" table containing every possible pair of consecutive indices. For 4-bit quantization:
- Original table: 16 entries × 2 bytes (half) = 32 bytes
- Vectorized table: 256 entries × 4 bytes (half2) = 1024 bytes
The kernel extracts pairs of 4-bit indices from packed weight data, forms an
8-bit combined index, and fetches a half2 from shared memory in one transaction.
This halves the number of shared memory reads.
To handle bank conflicts (up to 8-way for 4-bit), FLUTE duplicates the entire vectorized table multiple times in shared memory at different base addresses, shifting bank alignment. The duplication count is auto-tuned per shape/GPU. Worst case: 8 copies × 1 KB = 8 KB of shared memory for the table alone.
Modes in packbits_utils.hpp:
enum QuantMapModeEnum {
Basic, // Per-element LUT lookup
Vectorized, // Vectorized half2 lookup (default)
WarpShuffle, // __shfl_sync-based (register)
Marlin // lop3 arithmetic dequant
};kbit: Register shuffle via __shfl_sync
The bitsandbytes kbit kernel stores the codebook in a single register per lane:
// ops.cu line ~766 (standalone dequant), GEMM plan uses same pattern:
float cb = (lane_id < (1 << K)) ? codebook[lane_id] : 0.0f;
// ...
float val = __shfl_sync(0xFFFFFFFF, cb, idx) * amax;For the GEMM kernel, the codebook is pre-converted to half at kernel start:
half cb_h = (lane < (1 << K_BITS))
? __float2half(codebook[lane]) : __float2half(0.0f);
// In inner loop:
half val = __shfl_sync(0xFFFFFFFF, cb_h, idx);Each lane holds one codebook entry in a register. Lookup is a warp shuffle with arbitrary per-thread source lane selection. Cost: 1 cycle on the shuffle unit, zero memory bandwidth consumed.
Why kbit's approach is better for our use case:
- Our codebooks have at most 2^5 = 32 entries (K=2..5), fitting exactly in a 32-lane warp. No shared memory needed at all.
- Shuffle is 1 cycle with zero bank conflicts by definition.
- No shared memory space consumed by the table — more room for A and B tiles.
- No duplication/tuning complexity.
- The shuffle approach is already proven in the existing standalone dequant
kernel (
ops.culine 783).
FLUTE needs shared memory because it's designed to be generic — it supports
arbitrary table sizes that could exceed 32 entries. For exactly this reason,
FLUTE also offers a WarpShuffle mode, but it isn't the default.
FLUTE: Contiguous K-bit packing with bit-slice decomposition
FLUTE packs quantized indices contiguously. For 4-bit: two 4-bit indices per
uint8, or 8 per uint32. The packed int16 values are loaded via 128-bit
async copies.
For 3-bit (which doesn't divide evenly into 128-bit words), FLUTE uses bit-slice decomposition: split each 3-bit index into a 1-bit MSB and a 2-bit LSB, store them in separate arrays, load each with clean 128-bit copies, and combine in registers:
combined_index = (bit_slice_1 << 2) | bit_slice_2
The offline restructuring permutes packed weights so that after loading and
dequantization, values land in the exact register positions that m16n8k16
tensor cores expect. This means the kernel never does runtime reordering.
kbit: Bit-plane format via __ballot_sync
The bitsandbytes quantize kernel (ops.cu line 706) produces K separate
uint32 bit-plane words per block of 32 elements:
// pack_kbit_warp<K>:
for (int bit = 0; bit < K; bit++)
packed_words[bit] = __ballot_sync(0xFFFFFFFF, (qval >> bit) & 1);Bit-plane 0 contains bit 0 of all 32 elements, bit-plane 1 contains bit 1, etc.
The GEMM repack kernel retiles this from flat sequential into
[k_tile][n_tile][col][k_block][bit_plane] order for coalesced tile loads.
To extract an index in the GEMM kernel:
for (int b = 0; b < K_BITS; b++)
idx |= ((planes[b] >> row) & 1) << b;Comparison:
| Aspect | FLUTE | kbit |
|---|---|---|
| Storage unit | Contiguous K-bit fields in int16 | K separate uint32 bit-plane words |
| 3-bit handling | Bit-slice split (1+2), two separate loads | Natural: K=3 bit-planes, same as K=2,4,5 |
| 5-bit handling | Not supported | Natural: K=5 bit-planes |
| Extraction cost | Shift+mask to isolate K-bit field from packed word | K shift+mask+OR to assemble index from planes |
| Memory footprint | K bits per element | K bits per element (identical) |
| Runtime reordering | None (offline permutation matches tensor core layout) | None (repack kernel produces tile-aligned layout) |
The bit-plane format's key advantage is uniformity: K=2,3,4,5 all work identically with no special cases. FLUTE needs separate code paths for 3-bit (the bit-slice decomposition). The bit-plane extraction cost (K INT32 ops per element) runs on integer ALU concurrent with tensor core MMA, so it's effectively hidden.
FLUTE: FP16 group scales
FLUTE uses standard half-precision scales with configurable group sizes
(32, 64, 128, 256). Dequantization is: value = table[index] * scale.
The scales are loaded from global → shared memory alongside the packed weights,
with their own pipeline stage (StagesG). Inside the fragment loop, scale values
are applied via __hmul2() paired half multiplication.
Storage overhead per element: 2 bytes / group_size. For group_size=128: 0.0156 bytes/element. For group_size=32: 0.0625 bytes/element.
kbit: E4M4 absmax (1 byte per block of 32)
The kbit system uses a custom 8-bit floating point format for the per-block
absmax value (ops.cu line 722):
// E4M4: 4-bit exponent (bias=11) + 4-bit mantissa
// Normal: 2^(e-11) * (1 + m/16), range ~[6.1e-5, 31.0]
// Decode: construct IEEE 754 float via bit manipulation
unsigned int ieee = (unsigned int)(e - E4M4_BIAS + 127) << 23
| (unsigned int)m << 19;
return __uint_as_float(ieee);Dequantization is: value = codebook[index] * absmax. The absmax is always
per-block (blocksize=32), giving fine-grained scaling.
Storage overhead: 1 byte / 32 = 0.03125 bytes/element. This is:
- 2× less than FLUTE with group_size=32 (0.0625 bytes/element)
- Same as FLUTE with group_size=64 in absolute bytes, but kbit gets per-32-element granularity vs FLUTE's per-64-element granularity
- Max relative error from E4M4: 6.25% (1/16 from 4-bit mantissa)
In the GEMM kernel, absmax decode happens once per block-of-32 per column per K-tile (256 decodes total for TILE_N=128, TILE_K=64). The decode is ~5 integer ALU ops, negligible compared to MMA throughput.
Why E4M4 matters:
At K=2 (2-bit quantization), each element is 2 bits = 0.25 bytes. FLUTE's FP16 scale at group_size=128 adds 0.0156 bytes/element (6.25% overhead). kbit's E4M4 at blocksize=32 adds 0.03125 bytes/element (12.5% overhead) but with 4× finer granularity — and in 1 byte instead of 2. The finer granularity typically improves quantization quality more than the coarser group hurts it.
FLUTE: Stream-K via CUTLASS
FLUTE uses CUTLASS's built-in Stream-K decomposition (tile_scheduler_utils.hpp).
All (M,N,K) tiles are linearized into a 1D work sequence and distributed evenly
across num_blocks threadblocks:
tiles_per_block = total_tiles / num_blocks;
blocks_special = total_tiles % num_blocks; // get +1 tileWhen multiple blocks contribute to the same output tile (different K-ranges),
the FixupHelper coordinates via cutlass::Barrier primitives. Partial sums
are stored in FP32 in a global workspace; the finishing block reduces and
converts to FP16.
Grid launch: dim3(num_blocks) for Stream-K mode.
kbit: Persistent kernel with linearized work assignment
The kbit GEMM plan launches exactly num_SMs blocks. Work items are linearized
as (m_tile, n_tile, k_chunk) triples, ordered so that all k_chunks for a given
(m,n) output tile are contiguous:
int work_per_block = div_ceil(total_work, gridDim.x);
int my_start = blockIdx.x * work_per_block;
int my_end = min(my_start + work_per_block, total_work);Key optimization: when consecutive work items share the same output tile, the block keeps accumulators in registers across k_chunks — no intermediate write. The pipeline restarts between chunks (~2-tile cost), but accumulators persist.
Output write uses a three-way branch:
- Full K-range ownership → write FP16 directly (common case for large M)
- First contributor → write FP32 to workspace (overwrite, acts as zero+write)
- Subsequent contributors → atomicAdd FP32 to workspace
A per-tile atomic counter tracks when the last contributor finishes, which then converts FP32 → FP16 in the final output.
Comparison:
| Aspect | FLUTE (Stream-K) | kbit (Persistent) |
|---|---|---|
| Implementation | CUTLASS built-in | Hand-written |
| Launch config | dim3(num_blocks) |
dim3(num_SMs) |
| Granularity | Per K-tile | Per k_chunk (multiple K-tiles) |
| Sync mechanism | cutlass::Barrier semaphores |
atomicAdd + atomic counter |
| Accumulator reuse | Each block handles isolated work items | Consecutive same-(m,n) items share accumulators |
| Reduction | Finishing block reduces all partials | Last contributor (via counter) converts to FP16 |
| Dependency | Requires CUTLASS | Self-contained |
The persistent kernel's accumulator-reuse optimization is significant: for problems where each block handles multiple k_chunks for the same output tile, it avoids writing and re-reading intermediate FP32 partials. Stream-K doesn't have this optimization — each block writes its partial to global memory.
| Bits | FLUTE | kbit |
|---|---|---|
| 2-bit | Yes (build from source) | Yes |
| 3-bit | Yes (bit-slice decomposition) | Yes (bit-plane, no special case) |
| 4-bit | Yes (primary target) | Yes |
| 5-bit | No | Yes |
FLUTE's lack of 5-bit support is likely because the bit-slice approach would need a 2+3 or 1+4 split, adding another code path. The kbit bit-plane format handles K=5 identically to K=2,3,4.
FLUTE: CUTLASS 3 / CuTe templates
- All tiling, pipelining, and MMA via CUTLASS abstractions
- Shared memory layouts use CuTe's swizzle patterns (3×3×3)
- Async copies via
cp.asyncmanaged by CUTLASS pipeline stages TiledCopyandTiledMmahandle thread-to-data mappingGemmConfigtemplate encodes the full kernel configuration- Code generation produces template instantiations per (shape, bits, GPU)
Pros: Less custom infrastructure to write, well-tested pipeline/sync code. Cons: Massive template expansion, slow compile, CUTLASS version dependency (pinned to v3.4.1), shape-specialized binaries.
kbit: Hand-written CUDA
- Custom tiling with explicit loop structures
- Manual
cp.asyncpipeline (2-stage double buffer) - Inline PTX for
ldmatrixandmma.syncinstructions - No external dependencies beyond CUDA toolkit
- Single compilation unit (
kernels.cu) with template params<K_BITS, M_BLOCKS> - Kernel config selected at launch time based on M dimension
Pros: Full control over register allocation and scheduling, no dependency management, single binary works for all shapes of the same (K, M_BLOCKS). Cons: Must implement all infrastructure manually, more potential for bugs in pipeline/sync code.
Both kernels use the same fundamental MMA instruction: m16n8k16 with FP16
inputs and FP32 accumulation.
FLUTE: CuTe's SM80_16x8x16_F32F16F16F32 atom, configured via TiledMma
with customizable thread layout (MmaThrM × MmaThrN × MmaThrK) and
permutation (MmaPrmM × MmaPrmN × MmaPrmK).
kbit: Direct inline PTX mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
instruction. Thread-to-fragment mapping hand-computed:
- 4 threads per column (lane/4 = column index)
- Row indices: {2i, 2i+1, 2i+8, 2i+9} where i = lane%4
- FragA: M_BLOCKS × half2[2] per k-sub-tile
- FragB: half2[2] per N-block (dequantized on the fly, not stored)
The kbit design explicitly exploits the 4-threads-per-column property for shared memory access: when loading bit-plane words, 4 threads read the same K addresses, getting a free 4-way broadcast with zero bank conflicts. FLUTE doesn't need this optimization because its offline restructuring already places data in the correct register positions.
FLUTE: Configurable multi-stage pipeline (2-4 stages, auto-tuned). Separate pipeline stages for different data streams:
Stages: Main pipeline depth for A and Q tilesStagesG: Separate depth for scale factor loadingStagesGView: View stages for handling GroupSize/TileK relationships
Circular shared memory buffers managed by CUTLASS pipeline abstractions.
kbit: 2-stage double-buffered pipeline (fixed).
- Stage 0 and Stage 1 alternate in shared memory
cp_async_fence()andcp_async_wait<1>()for synchronization- Pipeline restarts when switching k_chunks (2-tile cost)
The kbit approach is simpler but less flexible. FLUTE's ability to tune the pipeline depth per shape can yield better performance in specific cases.
Both require offline weight restructuring, but the details differ.
FLUTE offline restructuring:
- Quantize weights to K-bit indices using a codebook (NF or custom)
- Pack indices contiguously (for 3-bit: split into 1+2 bit-slices)
- Permute packed words so that after loading and dequantization, values land directly in tensor core register positions
- The permutation encodes: thread-to-element MMA mapping + ldmatrix layout + bit-slice separation
This is a single combined permutation that folds multiple concerns together.
kbit offline restructuring:
- Quantize weights via
kQuantizeBlockwise_kbit→ flat bit-plane format (K uint32 words per block of 32 elements, sequential) - Encode absmax from float32 to E4M4 uint8
- Retile bit-planes from flat →
[k_tile][n_tile][col][k_block][bit_plane] - Retile absmax from flat →
[k_tile][n_tile][col][k_block]
The kbit repack is a simpler gather/permutation — it only changes the tile
layout, not the data format within tiles. No MMA-layout-aware permutation is
needed because the GEMM kernel handles the thread-to-element mapping at runtime
via the bit-plane extraction + __shfl_sync codebook lookup.
FLUTE is better when:
- You need arbitrary codebook sizes (> 32 entries)
- You want to leverage CUTLASS's tested infrastructure
- You need auto-tuning across many different matrix shapes
- You need Stream-K's sophisticated edge-case handling
- 3-bit and 4-bit are the primary targets
kbit is better when:
- Codebooks are ≤ 32 entries (K ≤ 5) — register shuffle is strictly faster
- You need 5-bit support
- You want zero external dependencies
- Fine-grained E4M4 absmax (per-32-element) is important
- You need a single binary that works across all shapes (no re-tuning)
- You want Hopper GPU support from the start
- The bit-plane format naturally handles all K values uniformly
-
Shape specialization: Each matrix shape requires separate tuning and compilation. Different tensor parallel configurations create different shapes, limiting supported models. (Partial mitigation via auto-tune as of Jan 2025.)
-
Ampere-only optimization: Not yet leveraging Hopper features (TMA, warp specialization, distributed shared memory). Runs on H100 but not at peak.
-
bfloat16 performance: Slower than float16 on Ampere due to lack of hardware-accelerated bfloat16 atomic-add (needed for Stream-K reduction).
-
Large batch degradation: Performance advantage diminishes at batch > 32 as the GEMM becomes compute-bound rather than memory-bandwidth-bound.
-
Numerical issues: Some instability reported with 4-bit, group-size=256 on A100.
-
No 5-bit support: FLUTE supports 2, 3, 4-bit only. The kbit design supports 5-bit as well.
- Paper (ArXiv): https://arxiv.org/abs/2407.10960
- Paper (PDF): https://arxiv.org/pdf/2407.10960
- Paper (HTML): https://arxiv.org/html/2407.10960
- Paper (ACL Anthology): https://aclanthology.org/2024.findings-emnlp.724/
- GitHub Repository: https://github.com/HanGuo97/flute
- HuggingFace Paper Page: https://huggingface.co/papers/2407.10960
- Main kernel: https://github.com/HanGuo97/flute/blob/main/flute/csrc/qgemm_kernel.hpp
- Configuration: https://github.com/HanGuo97/flute/blob/main/flute/csrc/config.hpp
- Dequantization: https://github.com/HanGuo97/flute/blob/main/flute/csrc/packbits_utils.hpp
- Tile scheduling: https://github.com/HanGuo97/flute/blob/main/flute/csrc/tile_scheduler_utils.hpp
- Weight packing: https://github.com/HanGuo97/flute/blob/main/flute/packbits_utils.py
- NF utilities: https://github.com/HanGuo97/flute/blob/main/flute/nf_utils.py
- Auto-tuning: https://github.com/HanGuo97/flute/blob/main/flute/tune.py
- Ops/dispatch: https://github.com/HanGuo97/flute/blob/main/flute/ops.py
- HuggingFace Hub: Models under the
HanGuo97organization- LLaMA-3.1: 8B, 70B, 405B (base + instruct, NFL W4G64 default)
- LLaMA-3: 8B, 70B
- Gemma-2: 9B, 27B
- CUTLASS 3.x: https://github.com/NVIDIA/cutlass (required dependency, v3.4.1)
- HIGGS: Vector dequantization extension, NAACL 2025
- HadaCore: Hadamard transform integration
- Marlin: https://github.com/IST-DASLab/marlin (comparison kernel for uniform INT4)
- LUT-GEMM: Earlier work on lookup-table-based GEMM kernels
- LUT Tensor Core (arxiv 2408.06003): Hardware/software co-design for LUT operations
- MarkTechPost: https://www.marktechpost.com/2024/07/26/flute-a-cuda-kernel-designed-for-fused-quantized-matrix-multiplications-to-accelerate-llm-inference/
- Semantic Scholar: https://www.semanticscholar.org/paper/Fast-Matrix-Multiplications-for-Lookup-LLMs-Guo-Brandon/be66705b36912679ea373184aaf057aa365d292a
- AlphaXiv Discussion: https://www.alphaxiv.org/abs/2407.10960
# Default (CUDA 12.1)
pip install flute-kernel
# CUDA 11.8
pip install flute-kernel -i https://flute-ai.github.io/whl/cu118
# CUDA 12.4
pip install flute-kernel -i https://flute-ai.github.io/whl/cu124
# From source (required for 2-bit)
git clone https://github.com/HanGuo97/flute.git
cd flute
pip install -e .@inproceedings{guo2024flute,
title={Fast Matrix Multiplications for Lookup Table-Quantized LLMs},
author={Guo, Han and Brandon, William and Cholakov, Radostin and
Ragan-Kelley, Jonathan and Xing, Eric P. and Kim, Yoon},
booktitle={Findings of EMNLP},
year={2024}
}