Add GEMV INT 4 #101

Open
albiol2004 wants to merge 4 commits into amd:devel from albiol2004:gemv-int4

Conversation

@albiol2004

Added

  • aie_kernels/generic/fused_dequant_gemv.cc : C++ AIE kernel that fuses INT4 weight dequantization with matrix-vector multiplication in a single pass. Uses the proven aie::unpack chain from expand.cc (uint4 → uint8 → uint16 → bf16 → scale → MAC). Includes a double-pump optimization that processes 2 groups per iteration for instruction interleaving, and compile-time DIM_K/GROUP_SIZE for loop specialization.
  • iron/operators/gemv_int4/ : Python operator (op.py, design.py, reference.py, test.py) following the existing gemv/ and dequant/ patterns. Packed weight buffer layout: [uint4 weights | bf16 per-group scales] per tile.
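For intuition, the operator's math can be sketched in NumPy. This is a hypothetical reference in the spirit of reference.py, not the actual code from the PR; the nibble ordering (low nibble first) and the default group size are assumptions:

```python
import numpy as np

def dequant_gemv_ref(packed, scales, x, group_size=32):
    """Reference INT4 GEMV: dequantize packed uint4 weights, then
    multiply with the activation vector x.

    packed: (M, K // 2) uint8, two uint4 weights per byte
    scales: (M, K // group_size) float per-group scales
    x:      (K,) float activation vector
    """
    M, half_k = packed.shape
    K = half_k * 2
    # Unpack uint4 pairs; low-nibble-first ordering is an assumption.
    lo = packed & 0x0F
    hi = packed >> 4
    w = np.empty((M, K), dtype=np.float32)
    w[:, 0::2] = lo
    w[:, 1::2] = hi
    # Broadcast each per-group scale over its group of weights.
    w = w.reshape(M, K // group_size, group_size)
    w *= scales[:, :, None]
    return w.reshape(M, K) @ x
```

The kernel fuses the unpack, scale, and MAC steps in-register instead of materializing `w`, but the output should agree with this reference up to bf16 rounding.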

Changed

Nothing. This is a new operator with no modifications to existing code.

Removed

Nothing.

Motivation

Decode inference is bandwidth-bound: GEMV loads each weight once per token. INT4 weights are half the size of INT8 and a quarter the size of bf16, making INT4 GEMV the highest-impact single operator for decode throughput.

This is a foundation operator: a standalone, individually tested building block for INT4 weight inference. It is not intended as an optimized fused decode kernel; the goal is to provide a correct, well-tested INT4 GEMV that can be composed into fused pipelines.

Credit to @jgmelber for the original fused dequant-GEMV concept in PR #79 and the INT4 GEMV benchmarks in PR #71 (21.4 GB/s on 8192×2048). That work demonstrated the viability of fused INT4 inference on AIE and informed the design here. If you have insights on further kernel optimizations I'd love to hear them; my focus was on getting the foundation block right rather than a fully optimized fused kernel.

Test results

  • 7/7 parametrized tests pass (2048×2048, 8192×2048, 2048×8192 at 4/8 columns, tsi=1/4)
  • Integration tested with real Llama 3.2 1B weights (cosine similarity >0.999 vs CPU reference)
  • Performance (8 columns): 2048×8192 at 561 us / 16.9 GB/s, 8192×2048 at 665 us / 14.2 GB/s
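The effective-bandwidth figures follow from bytes moved per call divided by time. A quick back-of-envelope check (counting packed uint4 weights, bf16 per-group scales, the bf16 activation vector, and the bf16 output; GROUP_SIZE = 32 is my assumption, not stated in the PR):

```python
def effective_bw_gbs(M, K, t_us, group_size=32):
    """Effective DDR bandwidth in GB/s for one M x K INT4 GEMV call."""
    weight_bytes = M * K // 2                 # two uint4 weights per byte
    scale_bytes = (M * K // group_size) * 2   # one bf16 scale per group
    act_bytes = K * 2                         # bf16 activation vector
    out_bytes = M * 2                         # bf16 output vector
    total = weight_bytes + scale_bytes + act_bytes + out_bytes
    return total / (t_us * 1e-6) / 1e9

print(round(effective_bw_gbs(2048, 8192, 561), 1))  # ~16.9, matches the reported figure
print(round(effective_bw_gbs(8192, 2048, 665), 1))  # ~14.2, matches the reported figure
```

Both reported numbers are consistent with this accounting, which is why GROUP_SIZE = 32 seems a plausible configuration.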

Checklist

  • Tests pass locally
  • Code formatted (black + clang-format)
  • License headers on all files
  • No changes to existing code

  Fused INT4 weight dequantization + matrix-vector multiplication in a
  single kernel pass. Loads packed uint4 weights from DDR, dequantizes
  in-register using the aie::unpack chain, and MACs with the bf16
  activation vector, giving a 4x DDR bandwidth reduction vs bf16 GEMV.

  Kernel optimizations:
  - Compile-time GROUP_SIZE and DIM_K for loop count optimization
  - Double-pump: processes 2 groups (64 elements) per iteration, giving
    the compiler two independent unpack chains to interleave
  - AIE_PREPARE_FOR_PIPELINING and AIE_LOOP_MIN_ITERATION_COUNT hints
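  The double-pump structure can be sketched as follows. This is a Python
  stand-in for the C++ AIE kernel, not the PR's code: the real kernel uses
  aie::unpack and vector MACs, and all names here are illustrative (it
  also assumes an even group count and low-nibble-first packing):

```python
def gemv_row_double_pump(packed_row, scales, x, group_size=32):
    """Accumulate one output element, processing two weight groups per
    loop trip so the compiler sees two independent unpack/dequant/MAC
    chains it can interleave. Assumes an even number of groups."""
    acc0 = acc1 = 0.0
    bytes_per_group = group_size // 2  # two uint4 weights per byte
    n_groups = len(scales)
    for g in range(0, n_groups, 2):    # double-pump: two groups per trip
        # Chain 0: group g (independent of chain 1 below).
        for i in range(group_size):
            byte = packed_row[g * bytes_per_group + i // 2]
            w = (byte & 0x0F) if i % 2 == 0 else (byte >> 4)
            acc0 += scales[g] * w * x[g * group_size + i]
        # Chain 1: group g + 1, accumulating into a separate register.
        for i in range(group_size):
            byte = packed_row[(g + 1) * bytes_per_group + i // 2]
            w = (byte & 0x0F) if i % 2 == 0 else (byte >> 4)
            acc1 += scales[g + 1] * w * x[(g + 1) * group_size + i]
    return acc0 + acc1
```

  Keeping two accumulators avoids a serial dependence between the chains,
  which is what gives the compiler room to interleave them.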

  Tested on AMD Ryzen AI 9 HX 370 (NPU2, 8 columns):
  - 2048x8192 (Llama up_proj): 561 us, 16.9 GB/s effective bandwidth
  - 8192x2048 (Llama down_proj): 665 us, 14.2 GB/s effective bandwidth
  - Integration tested with real Llama 3.2 1B weights (cosine sim >0.999)