Add GEMV INT 4 #101

Open
albiol2004 wants to merge 4 commits into amd:devel from albiol2004:gemv-int4

Conversation

@albiol2004

Added

  • aie_kernels/generic/fused_dequant_gemv.cc : C++ AIE kernel that fuses INT4 weight dequantization with matrix-vector multiplication in a single pass. Uses the proven aie::unpack chain from expand.cc (uint4 → uint8 → uint16 → bf16 → scale → MAC). Includes a double-pump optimization that processes 2 groups per iteration for instruction interleaving, and compile-time DIM_K/GROUP_SIZE for loop specialization.
  • iron/operators/gemv_int4/ : Python operator (op.py, design.py, reference.py, test.py) following the existing gemv/ and dequant/ patterns. Packed weight buffer layout: [uint4 weights | bf16 per-group scales] per tile.
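For intuition, the operator's math can be sketched in NumPy. This is a hypothetical reference in the spirit of reference.py, not the actual code from the PR; the nibble ordering (low nibble first) and the default group size are assumptions:

```python
import numpy as np

def dequant_gemv_ref(packed, scales, x, group_size=32):
    """Reference INT4 GEMV: dequantize packed uint4 weights, then
    multiply with the activation vector x.

    packed: (M, K // 2) uint8, two uint4 weights per byte
    scales: (M, K // group_size) float per-group scales
    x:      (K,) float activation vector
    """
    M, half_k = packed.shape
    K = half_k * 2
    # Unpack uint4 pairs; low-nibble-first ordering is an assumption.
    lo = packed & 0x0F
    hi = packed >> 4
    w = np.empty((M, K), dtype=np.float32)
    w[:, 0::2] = lo
    w[:, 1::2] = hi
    # Broadcast each per-group scale over its group of weights.
    w = w.reshape(M, K // group_size, group_size)
    w *= scales[:, :, None]
    return w.reshape(M, K) @ x
```

The kernel fuses the unpack, scale, and MAC steps in-register instead of materializing `w`, but the output should agree with this reference up to bf16 rounding.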

Changed

Nothing. This is a new operator with no modifications to existing code.

Removed

Nothing.

Motivation

Decode inference is bandwidth-bound: GEMV loads each weight once per token. INT4 weights are half the size of INT8 and a quarter the size of bf16, making INT4 GEMV the highest-impact single operator for decode throughput.

This is a foundation operator: a standalone, individually tested building block for INT4 weight inference. It is not intended as an optimized fused decode kernel; the goal is to provide a correct, well-tested INT4 GEMV that can be composed into fused pipelines.

Credit to @jgmelber for the original fused dequant-GEMV concept in PR #79 and the INT4 GEMV benchmarks in PR #71 (21.4 GB/s on 8192×2048). That work demonstrated the viability of fused INT4 inference on AIE and informed the design here. If you have insights on further kernel optimizations I'd love to hear them; my focus was on getting the foundation block right rather than a fully optimized fused kernel.

Test results

  • 7/7 parametrized tests pass (2048×2048, 8192×2048, 2048×8192 at 4/8 columns, tsi=1/4)
  • Integration tested with real Llama 3.2 1B weights (cosine similarity >0.999 vs CPU reference)
  • Performance (8 columns): 2048×8192 at 561 us / 16.9 GB/s, 8192×2048 at 665 us / 14.2 GB/s
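The effective-bandwidth figures follow from bytes moved per call divided by time. A quick back-of-envelope check (counting packed uint4 weights, bf16 per-group scales, the bf16 activation vector, and the bf16 output; GROUP_SIZE = 32 is my assumption, not stated in the PR):

```python
def effective_bw_gbs(M, K, t_us, group_size=32):
    """Effective DDR bandwidth in GB/s for one M x K INT4 GEMV call."""
    weight_bytes = M * K // 2                 # two uint4 weights per byte
    scale_bytes = (M * K // group_size) * 2   # one bf16 scale per group
    act_bytes = K * 2                         # bf16 activation vector
    out_bytes = M * 2                         # bf16 output vector
    total = weight_bytes + scale_bytes + act_bytes + out_bytes
    return total / (t_us * 1e-6) / 1e9

print(round(effective_bw_gbs(2048, 8192, 561), 1))  # ~16.9, matches the reported figure
print(round(effective_bw_gbs(8192, 2048, 665), 1))  # ~14.2, matches the reported figure
```

Both reported numbers are consistent with this accounting, which is why GROUP_SIZE = 32 seems a plausible configuration.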

Checklist

  • Tests pass locally
  • Code formatted (black + clang-format)
  • License headers on all files
  • No changes to existing code

  Fused INT4 weight dequantization + matrix-vector multiplication in a
  single kernel pass. Loads packed uint4 weights from DDR, dequantizes
  in-register using the aie::unpack chain, and MACs with the bf16
  activation vector, giving a 4x DDR bandwidth reduction vs bf16 GEMV.

  Kernel optimizations:
  - Compile-time GROUP_SIZE and DIM_K for loop count optimization
  - Double-pump: processes 2 groups (64 elements) per iteration, giving
    the compiler two independent unpack chains to interleave
  - AIE_PREPARE_FOR_PIPELINING and AIE_LOOP_MIN_ITERATION_COUNT hints
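  The double-pump structure can be sketched as follows. This is a Python
  stand-in for the C++ AIE kernel, not the PR's code: the real kernel uses
  aie::unpack and vector MACs, and all names here are illustrative (it
  also assumes an even group count and low-nibble-first packing):

```python
def gemv_row_double_pump(packed_row, scales, x, group_size=32):
    """Accumulate one output element, processing two weight groups per
    loop trip so the compiler sees two independent unpack/dequant/MAC
    chains it can interleave. Assumes an even number of groups."""
    acc0 = acc1 = 0.0
    bytes_per_group = group_size // 2  # two uint4 weights per byte
    n_groups = len(scales)
    for g in range(0, n_groups, 2):    # double-pump: two groups per trip
        # Chain 0: group g (independent of chain 1 below).
        for i in range(group_size):
            byte = packed_row[g * bytes_per_group + i // 2]
            w = (byte & 0x0F) if i % 2 == 0 else (byte >> 4)
            acc0 += scales[g] * w * x[g * group_size + i]
        # Chain 1: group g + 1, accumulating into a separate register.
        for i in range(group_size):
            byte = packed_row[(g + 1) * bytes_per_group + i // 2]
            w = (byte & 0x0F) if i % 2 == 0 else (byte >> 4)
            acc1 += scales[g + 1] * w * x[(g + 1) * group_size + i]
    return acc0 + acc1
```

  Keeping two accumulators avoids a serial dependence between the chains,
  which is what gives the compiler room to interleave them.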

  Tested on AMD Ryzen AI 9 HX 370 (NPU2, 8 columns):
  - 2048x8192 (Llama up_proj): 561 us, 16.9 GB/s effective bandwidth
  - 8192x2048 (Llama down_proj): 665 us, 14.2 GB/s effective bandwidth
  - Integration tested with real Llama 3.2 1B weights (cosine sim >0.999)