Skip to content

[CK] Use SpaceFillingCurve in ThreadwiseTensorSliceTransfer_v3r1#5047

Open
tenpercent wants to merge 6 commits into
developfrom
users/tenpercent/ck/threadwise-transfer-v3r1-sfc
Open

[CK] Use SpaceFillingCurve in ThreadwiseTensorSliceTransfer_v3r1#5047
tenpercent wants to merge 6 commits into
developfrom
users/tenpercent/ck/threadwise-transfer-v3r1-sfc

Conversation

@tenpercent
Copy link
Copy Markdown
Contributor

@tenpercent tenpercent commented Mar 3, 2026

Summary

Refactors ThreadwiseTensorSliceTransfer_v3r1 to use SpaceFillingCurve for index computation, replacing O(N^2) nested ComputeForwardSweep/ComputeDataIndex/ComputeMoveOnDim calls with O(1) precomputed lookup via SFC::GetIndex().

This reuses the existing SpaceFillingCurve infrastructure already used by v6r1, v7r2, etc. Changes are confined to a single file.

What changed

  • RunRead(): replaced static_ford serpentine loop with static_for over SrcSFC::GetIndex()/GetForwardStep()
  • RunWrite(): same pattern with DstSFC
  • TransferDataFromSrcThreadScratchToDstThreadScratch(): OOB check path uses SrcSFC::GetIndex()
  • Coordinate reset: uses ComputeSFCCoordinateResetStep instead of ComputeCoordinateResetStep

Results (example_grouped_conv_fwd_xdl_fp16, n=10, interleaved, -j1, -ftime-trace)

TU Baseline (mean) New (mean) Delta Wilcoxon p Mann-Whitney p
grouped_conv_fwd_xdl_fp16 (host) 14,883 ms 14,829 ms -0.4% 0.625 0.427
grouped_conv_fwd_xdl_fp16 (device) 27,675 ms 26,938 ms -2.7% 0.0039 0.0002
Total (all TUs) 57,502 ms 56,741 ms -1.3%

Negative controls (unrelated TUs — no significant change)

TU Delta Wilcoxon p
device_memory (host) -0.3% 0.92
device_memory (device) +0.3% 0.77
host_tensor (host) +0.7% 0.43
host_tensor (device) +0.9% 0.32

Assembly equivalence

GPU assembly compared with -save-temps=obj on example_grouped_conv_fwd_xdl_fp16 (gfx942). Result: 7656 instructions in both builds, identical instruction mix. Only difference is a register allocation swap (v[160:161] <-> v[162:163] in two buffer_load_dwordx2 / v_cndmask_b32_e64 pairs) — functionally identical, no runtime impact.

Methodology

  • 10 interleaved runs (baseline1, new1, baseline2, new2, ...) on same node to eliminate ordering/warmup bias
  • Wilcoxon signed-rank test (paired) and Mann-Whitney U test (unpaired)
  • Built with patched clang (LLVM 22) on ctr2-alola-compile-12, -j1 for accurate per-TU timing
  • Raw data: Slurm job 277012, assembly comparison: Slurm job 277013

Test plan

  • Compile-time benchmark with statistical significance (p < 0.01)
  • Negative controls confirm stable node
  • GPU assembly equivalence verified
  • Full CI

Tracking issue: #4229


Generated-by: Claude Code (claude-opus-4-6)

@tenpercent
Copy link
Copy Markdown
Contributor Author

Link #4229 for tracking

@tenpercent tenpercent force-pushed the users/tenpercent/ck/threadwise-transfer-v3r1-sfc branch from 01719db to 5050701 Compare March 10, 2026 23:24
@github-actions
Copy link
Copy Markdown
Contributor

This pull request has been inactive for 25 days and will be marked as stale.

If you would like to keep this PR open, please:

  • Add new commits
  • Add a comment explaining why it should remain open

This PR will be automatically closed in 5 days if no further activity occurs.

@github-actions github-actions Bot added the Stale PR has no activity for 25+ days label Apr 22, 2026
@tenpercent tenpercent force-pushed the users/tenpercent/ck/threadwise-transfer-v3r1-sfc branch from 5050701 to 28863cf Compare April 23, 2026 18:47
@tenpercent tenpercent marked this pull request as ready for review April 23, 2026 18:51
@tenpercent tenpercent requested a review from a team as a code owner April 23, 2026 18:51
@tenpercent tenpercent removed the Stale PR has no activity for 25+ days label Apr 23, 2026
@tenpercent tenpercent changed the base branch from develop to users/tenpercent/ck/tensor-descriptor-lambda-elimination April 23, 2026 19:11
@tenpercent tenpercent changed the base branch from users/tenpercent/ck/tensor-descriptor-lambda-elimination to develop April 23, 2026 19:14
tenpercent added a commit that referenced this pull request Apr 23, 2026
Replace O(N^2) ComputeForwardSweep/ComputeDataIndex/ComputeMoveOnDim
calls with O(1) SpaceFillingCurve::GetIndex() lookups in RunRead(),
RunWrite(), and TransferDataFromSrcThreadScratchToDstThreadScratch().

This reuses the existing SpaceFillingCurve infrastructure (already used
by v6r1, v7r2, etc.) to precompute the serpentine traversal indices at
compile time, reducing template instantiation depth significantly.

Salvages the approach from PR #5047, rebased onto current develop which
has the ThreadwiseTransferHelper refactoring.

Generated-by: Claude Code (claude-opus-4-6)
Replace O(N^2) ComputeForwardSweep/ComputeDataIndex/ComputeMoveOnDim
calls with O(1) SpaceFillingCurve::GetIndex() lookups in RunRead(),
RunWrite(), and TransferDataFromSrcThreadScratchToDstThreadScratch().

This reuses the existing SpaceFillingCurve infrastructure (already used
by v6r1, v7r2, etc.) to precompute the serpentine traversal indices at
compile time, reducing template instantiation depth significantly.

Salvages the approach from PR #5047, rebased onto current develop which
has the ThreadwiseTransferHelper refactoring.

Generated-by: Claude Code (claude-opus-4-6)
@tenpercent tenpercent marked this pull request as draft April 23, 2026 19:20
@tenpercent tenpercent force-pushed the users/tenpercent/ck/threadwise-transfer-v3r1-sfc branch from 28863cf to 8f82d59 Compare April 23, 2026 19:28
@tenpercent tenpercent marked this pull request as ready for review April 23, 2026 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant