I2_S is the quantization format used to store BitNet b1.58 ternary weights in GGUF files. It packs 4 ternary values per byte using a block-interleaved layout.
This document is intended for developers building alternative inference runtimes (WebGPU, Vulkan, Metal, etc.) who need to load and dequantize BitNet weights directly.
Each weight is one of three values, stored in 2 bits:
| Bits | Value |
|---|---|
| 00 | 0 |
| 01 | +1 |
| 10 | -1 |
| 11 | unused |
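The 2-bit mapping above can be expressed as a simple lookup (a sketch in Python; the function name is illustrative, and the unused code `11` is defensively mapped to 0 here):

```python
# Index is the 2-bit code, value is the decoded ternary weight.
# Code 0b11 is unused in I2_S; it is mapped to 0 as a defensive default.
I2S_LUT = [0, 1, -1, 0]

def decode_2bit(code: int) -> int:
    """Decode a single 2-bit I2_S code (0..3) to a ternary weight."""
    return I2S_LUT[code & 0x03]
```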
Weights are stored in blocks of 128 elements (32 bytes each). Within a block, the 128 elements are split into 4 groups of 32. Each byte at offset `gp` within the block (0–31) encodes one element from each group:

- bits [7:6] → element at position `gp` (group 0, offset 0)
- bits [5:4] → element at position `32 + gp` (group 1, offset 32)
- bits [3:2] → element at position `64 + gp` (group 2, offset 64)
- bits [1:0] → element at position `96 + gp` (group 3, offset 96)
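The interleaving described above can be sketched as a single-byte unpack (Python, with an illustrative function name):

```python
def unpack_byte(b: int) -> list[int]:
    """Split one packed byte into 4 ternary values, one per group.

    Returned list index g corresponds to the element at position
    g * 32 + gp within the 128-element block.
    """
    lut = [0, 1, -1, 0]  # 2-bit code -> ternary weight (0b11 unused)
    # Group 0 sits in the high bits, group 3 in the low bits.
    return [lut[(b >> (6 - 2 * g)) & 0x03] for g in range(4)]
```

For example, `unpack_byte(0b01100001)` yields `[1, -1, 0, 1]`: codes `01`, `10`, `00`, `01` from high bits to low.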
To extract the element at logical index `k`:

```
block = k // 128
pos = k % 128
group = pos // 32
gp = pos % 32
byte_offset = block * 32 + gp
byte = data[byte_offset]
shift = 6 - 2 * group
code = (byte >> shift) & 0x03   # 2-bit code; map to a weight via the table above
```

The total byte size per tensor is:

```
ceil(num_elements / 4) + 32
```
The trailing 32 bytes store a single float32 scale value, replicated 8 times.
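Putting the layout, size formula, and trailing scale together, a dequantizer might look like the following (a sketch; the function name is illustrative, and it assumes `num_elements` is a multiple of 128, which holds for typical model tensors):

```python
import math
import struct

def dequantize_i2s(data: bytes, num_elements: int) -> list[float]:
    """Dequantize an I2_S tensor: packed ternary weights times one scale.

    Layout: ceil(num_elements / 4) packed bytes, followed by 32 trailing
    bytes holding a float32 scale replicated 8 times.
    Assumes num_elements is a multiple of 128 (full blocks only).
    """
    assert num_elements % 128 == 0, "partial blocks not handled here"
    packed_size = math.ceil(num_elements / 4)
    assert len(data) == packed_size + 32, "unexpected tensor byte size"

    # The scale is the same float32 repeated 8 times; read the first copy.
    scale = struct.unpack_from("<f", data, packed_size)[0]

    lut = [0, 1, -1, 0]  # 2-bit code -> ternary weight (0b11 unused)
    out = []
    for k in range(num_elements):
        block, pos = divmod(k, 128)
        group, gp = divmod(pos, 32)
        byte = data[block * 32 + gp]
        code = (byte >> (6 - 2 * group)) & 0x03
        out.append(lut[code] * scale)
    return out
```

A real runtime would vectorize this per block rather than looping per element, but the addressing is the same.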
- Type ID is 36 in the Eddie-Wang1120/llama.cpp fork (not type 27, which is I64 in upstream ggml)
- GGUF metadata uses architecture prefix `bitnet-25`, not `bitnet` or `llama`
- `token_embd.weight` is stored as F16 (type 1); embeddings are not quantized to I2_S
- There is no `output.weight` tensor; the model uses tied embeddings (`lm_head` reuses `token_embd`)
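A loader checking tensor types against these notes might record the IDs as constants (a sketch; the constant names are illustrative, not taken from either codebase):

```python
# Type IDs from the notes above. In the Eddie-Wang1120/llama.cpp fork,
# I2_S is type 36; upstream ggml assigns 27 to I64, so a loader must
# know which type table the file was written against.
GGML_TYPE_F16 = 1    # used for token_embd.weight
I2S_FORK_TYPE_ID = 36

def is_i2s(tensor_type: int) -> bool:
    """True if a tensor uses the fork's I2_S type ID."""
    return tensor_type == I2S_FORK_TYPE_ID
```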