Fix intra-warp and inter-warp race conditions in bounds_check_indices v1 and v2 CUDA kernels by gchalump · Pull Request #5638 · pytorch/FBGEMM

gchalump · 2026-04-15T15:57:50Z

Summary:
Fix race conditions in bounds_check_indices_kernel_v1 and
bounds_check_indices_kernel_v2.

Both kernels launch with dim3(kWarpSize, kNumThreads/kWarpSize), so all 32 lanes
(threadIdx.x) in a warp share the same (b, t) pair (determined by threadIdx.y).

Intra-warp race fix (v1 and v2):

- Gate the offset check and correction on threadIdx.x == 0 so that only lane 0
  performs the check, the warning print, and the adjust_offset_kernel write.
- Broadcast the corrected indices_start and indices_end from lane 0 to all lanes
  via __shfl_sync(0xFFFFFFFF, ..., 0) before entering the index loop.
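
The lane-0 gate and broadcast can be sketched as follows. This is a minimal illustration, not the FBGEMM kernel: the kernel name, the separate `starts`/`ends` arrays (used instead of a shared `offsets` array so that only the intra-warp pattern is shown), the clamping rule, and the host driver are all assumptions made for the example.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;  // assumed warp width (NVIDIA GPUs)

// Hypothetical demo kernel: warp w (selected by threadIdx.y) owns the range
// [starts[w], ends[w]]; its 32 lanes (threadIdx.x) all need the same bounds.
__global__ void clamp_then_broadcast(
    long long* starts, long long* ends, long long num_indices, int num_ranges,
    long long* seen_start, long long* seen_end) {
  const int w = blockIdx.x * blockDim.y + threadIdx.y;
  // Depends on threadIdx.y only, so a warp exits or continues as a unit.
  if (w >= num_ranges) return;

  long long s = 0, e = 0;
  if (threadIdx.x == 0) {  // lane 0 alone reads, checks, corrects, and writes
    s = starts[w];
    e = ends[w];
    if (s < 0 || s > e || e > num_indices) {
      e = e < 0 ? 0 : (e > num_indices ? num_indices : e);
      s = s < 0 ? 0 : (s > e ? e : s);
      starts[w] = s;
      ends[w] = e;
    }
  }
  // Every lane takes part in the shuffle and receives lane 0's values.
  s = __shfl_sync(0xFFFFFFFF, s, 0);
  e = __shfl_sync(0xFFFFFFFF, e, 0);

  // Record what each lane observed so the host can compare them.
  seen_start[w * kWarpSize + threadIdx.x] = s;
  seen_end[w * kWarpSize + threadIdx.x] = e;
}

// Host driver: returns true if every lane observed the clamped range.
bool run_broadcast_demo() {
  constexpr int n = 2;
  const long long num_indices = 10;
  const long long h_s[n] = {-3, 4};   // both ranges start out invalid
  const long long h_e[n] = {5, 99};
  const long long want_s[n] = {0, 4};
  const long long want_e[n] = {5, 10};
  const size_t lanes = n * kWarpSize;
  const size_t lane_bytes = lanes * sizeof(long long);

  long long *d_s, *d_e, *d_ss, *d_se;
  cudaMalloc(&d_s, sizeof(h_s));
  cudaMalloc(&d_e, sizeof(h_e));
  cudaMalloc(&d_ss, lane_bytes);
  cudaMalloc(&d_se, lane_bytes);
  cudaMemcpy(d_s, h_s, sizeof(h_s), cudaMemcpyHostToDevice);
  cudaMemcpy(d_e, h_e, sizeof(h_e), cudaMemcpyHostToDevice);

  clamp_then_broadcast<<<1, dim3(kWarpSize, n)>>>(
      d_s, d_e, num_indices, n, d_ss, d_se);

  std::vector<long long> ss(lanes), se(lanes);
  bool ok =
      cudaMemcpy(ss.data(), d_ss, lane_bytes, cudaMemcpyDeviceToHost) == cudaSuccess &&
      cudaMemcpy(se.data(), d_se, lane_bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
  for (size_t i = 0; ok && i < lanes; ++i) {
    ok = ss[i] == want_s[i / kWarpSize] && se[i] == want_e[i / kWarpSize];
  }
  cudaFree(d_s); cudaFree(d_e); cudaFree(d_ss); cudaFree(d_se);
  return ok;
}
```

Because the correcting write happens once per warp rather than 32 times, the values the lanes iterate over come from a single source, lane 0's registers, rather than from 32 independent reads of memory that may be changing.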

Inter-warp race fix — last-element correction (v1 only):

- Previously every warp ran the last-element correction (offsets[total_B]),
  causing N concurrent writes to the same address and inflated warning counts.
- Gate on warp_idx == 0 so exactly one warp in the grid performs the check,
  matching v2's existing b_t_start == 0 guard.
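
A rough sketch of the single-warp gate, under stated assumptions: `warp_idx` is taken to be the grid-wide warp index `blockIdx.x * blockDim.y + threadIdx.y`, the check compares `offsets[total_B]` against `num_indices`, and the warning is modelled as an atomic counter. The kernel and driver names are hypothetical, not FBGEMM's.

```cuda
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;  // assumed warp width (NVIDIA GPUs)

// Hypothetical demo kernel: many warps are launched, but only lane 0 of the
// grid's first warp inspects and corrects the final offset.
__global__ void fix_last_offset(
    long long* offsets, int total_B, long long num_indices, int* warnings) {
  const int warp_idx = blockIdx.x * blockDim.y + threadIdx.y;  // assumed definition
  if (warp_idx == 0 && threadIdx.x == 0) {
    if (offsets[total_B] != num_indices) {
      atomicAdd(warnings, 1);           // stands in for the warning print
      offsets[total_B] = num_indices;   // single writer, so no write race
    }
  }
}

// Host driver: returns the number of warnings raised, or -1 on a CUDA error.
int count_last_offset_warnings(int num_blocks, int warps_per_block) {
  const int total_B = 3;
  const long long num_indices = 8;
  const long long h_offsets[total_B + 1] = {0, 2, 5, 99};  // last entry is wrong
  long long* d_offsets;
  int* d_warnings;
  cudaMalloc(&d_offsets, sizeof(h_offsets));
  cudaMalloc(&d_warnings, sizeof(int));
  cudaMemcpy(d_offsets, h_offsets, sizeof(h_offsets), cudaMemcpyHostToDevice);
  cudaMemset(d_warnings, 0, sizeof(int));

  fix_last_offset<<<num_blocks, dim3(kWarpSize, warps_per_block)>>>(
      d_offsets, total_B, num_indices, d_warnings);

  int warnings = -1;
  if (cudaMemcpy(&warnings, d_warnings, sizeof(int), cudaMemcpyDeviceToHost) !=
      cudaSuccess) {
    warnings = -1;
  }
  cudaFree(d_offsets);
  cudaFree(d_warnings);
  return warnings;
}
```

With the gate in place, the warning count stays at one however many warps are launched. Without it, each warp that read the stale value before another warp's write landed would add its own warning.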

Inter-warp race fix — IGNORE mode (v1 and v2):

- IGNORE mode previously called adjust_offset_kernel unconditionally. Warp N and
  warp N+1 both touch offsets[N+1] (as warp N's end and as warp N+1's start), so
  warp N+1 could write back the valid-but-uncorrected offsets[N+1] it had read
  and overwrite warp N's correction of the same element.
- Add the same bounds-check guard used by WARNING mode so adjust_offset_kernel
  only runs when offsets are actually out of range.
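
The guard can be illustrated with a small sketch. It assumes a simple clamping rule in place of the real `adjust_offset_kernel`, whose body is not shown in this PR description; the kernel and driver names are hypothetical. Warps whose pair is already in range perform no write at all, so they cannot overwrite a neighbour's correction.

```cuda
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;  // assumed warp width (NVIDIA GPUs)

// Hypothetical demo kernel: warp b_t owns the pair (offsets[b_t], offsets[b_t + 1]).
// Only lane 0 works, and it writes only when the pair is out of range.
__global__ void guarded_adjust(
    long long* offsets, int total_B, long long num_indices) {
  const int b_t = blockIdx.x * blockDim.y + threadIdx.y;
  if (b_t >= total_B || threadIdx.x != 0) return;

  const long long s = offsets[b_t];
  const long long e = offsets[b_t + 1];
  if (s < 0 || s > e || e > num_indices) {  // the guard: skip valid pairs
    // Assumed correction rule: clamp end into [0, num_indices], start into [0, end].
    const long long ce = e < 0 ? 0 : (e > num_indices ? num_indices : e);
    const long long cs = s < 0 ? 0 : (s > ce ? ce : s);
    if (cs != s) offsets[b_t] = cs;
    if (ce != e) offsets[b_t + 1] = ce;
  }
}

// Host driver: returns true if only the out-of-range element was changed.
bool run_guarded_adjust_demo() {
  const int total_B = 3;
  const long long num_indices = 8;
  const long long h_in[total_B + 1] = {0, 2, 5, 99};  // only the last pair is bad
  const long long want[total_B + 1] = {0, 2, 5, 8};
  long long h_out[total_B + 1] = {0, 0, 0, 0};
  long long* d_offsets;
  cudaMalloc(&d_offsets, sizeof(h_in));
  cudaMemcpy(d_offsets, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

  guarded_adjust<<<1, dim3(kWarpSize, total_B)>>>(d_offsets, total_B, num_indices);

  bool ok = cudaMemcpy(h_out, d_offsets, sizeof(h_out), cudaMemcpyDeviceToHost) ==
            cudaSuccess;
  for (int i = 0; ok && i <= total_B; ++i) {
    ok = h_out[i] == want[i];
  }
  cudaFree(d_offsets);
  return ok;
}
```

In this input only the warp that owns the last pair writes, and it writes only the element that changed, so the result does not depend on how the warps are scheduled.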

Correction ordering (v1 only):

- Reorder corrections so the last-element check (offsets[total_B]) runs before
  the per-b_t offset correction within the same lane-0 block. This ensures
  indices_end reflects the corrected offsets[total_B] when b_t + 1 == total_B.

Post-correction assertions (v1 and v2):

- Add post-broadcast CUDA_KERNEL_ASSERT checks on every lane to validate
  indices_start >= 0, indices_start <= indices_end, and
  indices_end <= num_indices after correction.
- Add a best-effort backward monotonicity check: assert that
  indices_start >= offsets[b_t - 1] to detect non-monotonic offsets that the
  per-pair checks miss. This read may race with the previous warp's correction,
  but it catches the common case.
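
The invariants listed above can be sketched as follows. The example uses the standard device-side `assert` from `<cassert>` as a stand-in for `CUDA_KERNEL_ASSERT`, and the kernel and driver names are hypothetical. The input is already valid, so no correction runs and the backward read does not race here.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;  // assumed warp width (NVIDIA GPUs)

// Hypothetical demo kernel: lane 0 loads the pair, broadcasts it, and then
// every lane validates the range it is about to iterate over.
__global__ void check_ranges(
    const long long* offsets, int total_B, long long num_indices) {
  const int b_t = blockIdx.x * blockDim.y + threadIdx.y;
  if (b_t >= total_B) return;  // uniform across the warp

  long long indices_start = 0, indices_end = 0;
  if (threadIdx.x == 0) {
    indices_start = offsets[b_t];
    indices_end = offsets[b_t + 1];
  }
  indices_start = __shfl_sync(0xFFFFFFFF, indices_start, 0);
  indices_end = __shfl_sync(0xFFFFFFFF, indices_end, 0);

  // Per-pair invariants, checked on every lane.
  assert(indices_start >= 0);
  assert(indices_start <= indices_end);
  assert(indices_end <= num_indices);
  // Best-effort backward monotonicity check against the previous offset.
  if (b_t > 0) {
    assert(indices_start >= offsets[b_t - 1]);
  }
}

// Host driver: returns true if the kernel finished without a failed assertion.
bool ranges_pass_assertions() {
  const int total_B = 3;
  const long long num_indices = 8;
  const long long h_offsets[total_B + 1] = {0, 2, 5, 8};  // valid and monotonic
  long long* d_offsets;
  cudaMalloc(&d_offsets, sizeof(h_offsets));
  cudaMemcpy(d_offsets, h_offsets, sizeof(h_offsets), cudaMemcpyHostToDevice);

  check_ranges<<<1, dim3(kWarpSize, total_B)>>>(d_offsets, total_B, num_indices);

  // A failed device-side assert surfaces as an error at synchronization.
  const bool ok = cudaDeviceSynchronize() == cudaSuccess;
  cudaFree(d_offsets);
  return ok;
}
```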

The __shfl_sync is safe because the loop/branch conditions depend only on
threadIdx.y (not threadIdx.x), so all lanes in a warp always take the same
code paths and reach the shuffle together.

Differential Revision: D100898565


meta-codesync · 2026-04-15T15:57:58Z

@gchalump has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100898565.

meta-cla Bot added the cla signed label Apr 15, 2026

meta-codesync Bot added fb-exported meta-exported labels Apr 15, 2026

gchalump wants to merge 1 commit into pytorch:main from gchalump:export-D100898565
