Fused INT4 weight dequantization + matrix-vector multiplication in a
single kernel pass. Loads packed uint4 weights from DDR, dequantizes
in-register using the aie::unpack chain, and MACs with the bf16 activation
vector, for a 4x DDR bandwidth reduction vs. bf16 GEMV.
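For reference, here is a host-side sketch of how a tile could be packed into the [uint4 weights | bf16 per-group scales] layout this PR describes. The function name, the nibble order (low nibble holds the even-indexed weight), and the little-endian scale bytes are illustrative assumptions, not taken from the operator code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// bf16 as a raw uint16_t: the top 16 bits of an IEEE-754 float (truncation).
static uint16_t float_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);
}

// Hypothetical packer for one weight tile.
// Layout per the PR description: [uint4 weights | bf16 per-group scales].
std::vector<uint8_t> pack_tile(const std::vector<uint8_t>& codes,   // one 4-bit code per weight
                               const std::vector<float>& scales,    // one scale per group
                               int group_size) {
    assert(codes.size() % (2 * group_size) == 0);
    std::vector<uint8_t> buf;
    buf.reserve(codes.size() / 2 + scales.size() * 2);
    for (size_t i = 0; i < codes.size(); i += 2)  // two uint4 codes per byte
        buf.push_back(static_cast<uint8_t>((codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)));
    for (float s : scales) {                      // bf16 scales appended after the weights
        uint16_t b = float_to_bf16(s);
        buf.push_back(static_cast<uint8_t>(b & 0xFF));
        buf.push_back(static_cast<uint8_t>(b >> 8));
    }
    return buf;
}
```

With GROUP_SIZE = 32, a 64-weight tile packs into 32 weight bytes plus 4 scale bytes.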
Kernel optimizations:
- Compile-time GROUP_SIZE and DIM_K for loop count optimization
- Double-pump: processes 2 groups (64 elements) per iteration, giving
the compiler two independent unpack chains to interleave
- AIE_PREPARE_FOR_PIPELINING and AIE_LOOP_MIN_ITERATION_COUNT hints
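The arithmetic of the unpack chain (uint4 → uint8 → uint16 → bf16 → scale → MAC) can be mirrored with a scalar reference; this is only a sketch of the math, not the vectorized kernel — the real code widens whole vectors with aie::unpack, and any zero-point handling is omitted here because the chain above does not mention one.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// bf16 stored as uint16_t -> float (pad the low 16 mantissa bits with zero).
static float bf16_to_float(uint16_t b) {
    uint32_t bits = static_cast<uint32_t>(b) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Scalar reference of the fused dequant-GEMV inner loop for one output row.
float dot_row_int4(const uint8_t* packed,   // K/2 bytes of packed uint4 codes
                   const uint16_t* scales,  // K/group_size bf16 per-group scales
                   const float* x,          // activation vector (bf16 in the kernel)
                   int K, int group_size) {
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
        uint8_t byte = packed[k / 2];
        uint8_t code = (k % 2 == 0) ? (byte & 0xF) : (byte >> 4);     // uint4 -> uint8
        float w = static_cast<float>(code)                            // -> float
                * bf16_to_float(scales[k / group_size]);              // scale
        acc += w * x[k];                                              // MAC
    }
    return acc;
}
```

The double-pump version would run two such group bodies per loop iteration so the compiler can interleave their independent unpack chains.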
Tested on AMD Ryzen AI 9 HX 370 (NPU2, 8 columns):
- 2048x8192 (Llama up_proj): 561 us, 16.9 GB/s effective bandwidth
- 8192x2048 (Llama down_proj): 665 us, 14.2 GB/s effective bandwidth
- Integration tested with real Llama 3.2 1B weights (cosine sim >0.999)
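As a sanity check on these numbers: bytes moved per GEMV are K*N/2 of packed weights plus 2*(K*N/GROUP_SIZE) of bf16 scales. Assuming GROUP_SIZE = 32 (implied by "2 groups (64 elements)" above), 2048x8192 moves 9,437,184 bytes, and 9.44 MB / 561 us ≈ 16.8 GB/s, consistent with the table.

```cpp
#include <cassert>
#include <cstdint>

// Effective bandwidth for an INT4 GEMV, assuming GROUP_SIZE = 32 and
// decimal GB (1e9 bytes). Sketch for checking the reported figures.
double effective_gb_per_s(int64_t rows, int64_t cols, double micros, int group_size) {
    int64_t n = rows * cols;
    int64_t bytes = n / 2                   // packed uint4 weights
                  + (n / group_size) * 2;   // bf16 per-group scales
    return static_cast<double>(bytes) / (micros * 1e-6) / 1e9;
}
```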
Added
aie_kernels/generic/fused_dequant_gemv.cc: C++ AIE kernel that fuses INT4 weight dequantization with matrix-vector multiplication in a single pass. Uses the proven aie::unpack chain from expand.cc (uint4 → uint8 → uint16 → bf16 → scale → MAC). Includes a double-pump optimization that processes 2 groups per iteration for instruction interleaving, and compile-time DIM_K/GROUP_SIZE for loop specialization.
iron/operators/gemv_int4/: Python operator (op.py, design.py, reference.py, test.py) following the existing gemv/ and dequant/ patterns. Packed weight buffer layout: [uint4 weights | bf16 per-group scales] per tile.
Changed
Nothing. This is a new operator with no modifications to existing code.
Removed
Nothing.
Motivation
Decode inference is bandwidth-bound: GEMV loads each weight once per token. INT4 weights are half the size of INT8 and a quarter of bf16, making INT4 GEMV the highest-impact single operator for decode throughput.
This is a foundation operator: a standalone, individually tested building block for INT4 weight inference. It is not intended as an optimized fused decode kernel; the goal is to provide a correct, well-tested INT4 GEMV that can be composed into fused pipelines.
Credit to @jgmelber for the original fused dequant-GEMV concept in PR #79 and the INT4 GEMV benchmarks in PR #71 (21.4 GB/s on 8192×2048). That work demonstrated the viability of fused INT4 inference on AIE and informed the design here. If you have insights on further kernel optimizations, I'd love to hear them; my focus was on getting the foundation block right rather than a fully optimized fused kernel.
Test results
Checklist