Add bf16 avx2 8bit dequant and Nbit dequant#5634

Open
cyyever wants to merge 6 commits into pytorch:main from cyyever:impl-bf16-avx2-8bit-dequant
Conversation

@cyyever
Contributor

@cyyever cyyever commented Apr 15, 2026

No description provided.

@meta-cla meta-cla Bot added the cla signed label Apr 15, 2026
@cyyever cyyever changed the title Add bf16 avx2 8bit dequant Add bf16 avx2 8bit dequant and Nbit dequant Apr 15, 2026
@meta-codesync
Contributor

meta-codesync Bot commented Apr 15, 2026

@q10 has imported this pull request. If you are a Meta employee, you can view this in D100932926.

@q10
Contributor

q10 commented Apr 15, 2026

@cyyever looks like the CPU builds are failing

@cyyever
Contributor Author

cyyever commented Apr 15, 2026

@q10 It is likely that this PR depends on #5635, because FBGEMM_CODE affects runtime instruction selection. I will rebase once it is merged.

@cyyever cyyever marked this pull request as draft April 15, 2026 09:04
@cyyever cyyever force-pushed the impl-bf16-avx2-8bit-dequant branch from 13bb9db to 48c01f8 Compare April 16, 2026 01:39
@cyyever cyyever marked this pull request as ready for review April 16, 2026 01:39
@cyyever
Contributor Author

cyyever commented Apr 16, 2026

@q10 rebased it

@q10
Contributor

q10 commented Apr 16, 2026

@cyyever looks like there are still build failures

@cyyever
Contributor Author

cyyever commented Apr 16, 2026

@q10 I have interesting findings

@cyyever cyyever force-pushed the impl-bf16-avx2-8bit-dequant branch from 48c01f8 to a808f38 Compare April 16, 2026 06:08
@cyyever cyyever marked this pull request as draft April 16, 2026 06:09
@cyyever cyyever force-pushed the impl-bf16-avx2-8bit-dequant branch from a808f38 to 4ccc4b6 Compare April 16, 2026 06:30
@cyyever cyyever force-pushed the impl-bf16-avx2-8bit-dequant branch from 4ccc4b6 to bcdc3bf Compare April 23, 2026 09:00
@cyyever cyyever marked this pull request as ready for review April 23, 2026 09:04
@cyyever
Contributor Author

cyyever commented Apr 23, 2026

@q10 I have relaxed the comparison error bounds

@cyyever cyyever force-pushed the impl-bf16-avx2-8bit-dequant branch 4 times, most recently from 40ab93c to 12bbd45 Compare May 2, 2026 00:23
cyyever added 5 commits May 5, 2026 17:23
Ref path computes in double precision while AVX2 path uses fp32 FMA;
results may differ by ~1 fp32 ULP which can cross a bf16 bucket
boundary, so bit-exact EXPECT_EQ is too strict. Compare as float with
~2 bf16 ULPs tolerance.
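The relaxed bound described above can be sketched as a scalar float comparison with a tolerance of ~2 bf16 ULPs. This is a minimal illustration, not the PR's test code; the helper names are hypothetical, and it exploits the fact that a bf16 ULP near x is roughly 2^(exponent(x) - 7) because bf16 keeps 7 explicit mantissa bits:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: width of one bf16 ULP at the magnitude of `x`.
// frexp writes x = m * 2^e with m in [0.5, 1), so the bf16 spacing near x
// is 2^(e - 1) * 2^-7 = 2^(e - 8).
static float Bf16Ulp(float x) {
  int e = 0;
  std::frexp(std::fabs(x), &e);
  return std::ldexp(1.0f, e - 8);
}

// Compare two floats with an `ulps`-wide bf16 tolerance, instead of a
// bit-exact EXPECT_EQ that a 1-ULP fp32 difference could trip.
static bool NearlyEqualBf16(float actual, float expected, float ulps = 2.0f) {
  return std::fabs(actual - expected) <= ulps * Bf16Ulp(expected);
}
```

For example, 1.0f and 1.0078125f (adjacent bf16 values) pass under this bound, while a genuinely wrong result such as 1.1f versus 1.0f still fails.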
…utput path

The SVE dispatcher in FusedNBitRowwiseQuantizedSBHalfToFloatOrHalf was
silently dropping the is_uint16_t_of_type_bf16 template parameter, so on
SVE hardware a <float16, /*bf16=*/true> call produced fp16-formatted bits
where bf16 was expected.

Thread a bool IS_BF16_OUT template arg through
FusedNBitRowwiseQuantizedSBHalfToFloatOrHalfNeon and add a NEON bf16
write-out branch using the same round-to-nearest-ties-to-even formula as
Bf16ConvertAvx2.h (val + ((val>>16)&1) + 0x7FFF, take high 16 bits).
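The round-to-nearest-ties-to-even formula named in the commit message (val + ((val>>16)&1) + 0x7FFF, take the high 16 bits) can be sketched in scalar form. This is an illustrative sketch, not FBGEMM's vectorized code, and it does not give NaN payloads any special treatment:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Float -> bf16 with round-to-nearest-ties-to-even: adding
// ((bits >> 16) & 1) + 0x7FFF carries into bit 16 exactly when the
// discarded low half is above the halfway point, or is exactly halfway
// and the kept mantissa bit is odd.
static uint16_t FloatToBf16Rne(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits += ((bits >> 16) & 1u) + 0x7FFFu;
  return static_cast<uint16_t>(bits >> 16);
}

// bf16 -> float is exact: place the 16 bits in the high half.
static float Bf16ToFloat(uint16_t h) {
  uint32_t bits = static_cast<uint32_t>(h) << 16;
  float x;
  std::memcpy(&x, &bits, sizeof(x));
  return x;
}
```

A tie such as 1 + 1/256 (fp32 bits 0x3F808000) rounds down to the even bf16 value 0x3F80, while 1 + 3/256 (0x3F818000) rounds up to 0x3F82.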
@cyyever cyyever force-pushed the impl-bf16-avx2-8bit-dequant branch from 4141f97 to b1b1739 Compare May 5, 2026 09:29

2 participants