[CK] Use SpaceFillingCurve in ThreadwiseTensorSliceTransfer_v3r1 by tenpercent · Pull Request #5047 · ROCm/rocm-libraries

tenpercent · 2026-03-03T05:11:37Z

Summary

Refactors ThreadwiseTensorSliceTransfer_v3r1 to use SpaceFillingCurve for index computation, replacing O(N^2) nested ComputeForwardSweep/ComputeDataIndex/ComputeMoveOnDim calls with O(1) precomputed lookup via SFC::GetIndex().

This reuses the existing SpaceFillingCurve infrastructure already used by v6r1, v7r2, etc. Changes are confined to a single file.

What changed

RunRead(): replaced static_ford serpentine loop with static_for over SrcSFC::GetIndex()/GetForwardStep()
RunWrite(): same pattern with DstSFC
TransferDataFromSrcThreadScratchToDstThreadScratch(): OOB check path uses SrcSFC::GetIndex()
Coordinate reset: uses ComputeSFCCoordinateResetStep instead of ComputeCoordinateResetStep

Results (`example_grouped_conv_fwd_xdl_fp16`, n=10, interleaved, `-j1`, `-ftime-trace`)

TU	Baseline (mean)	New (mean)	Delta	Wilcoxon p	Mann-Whitney p
`grouped_conv_fwd_xdl_fp16` (host)	14,883 ms	14,829 ms	-0.4%	0.625	0.427
`grouped_conv_fwd_xdl_fp16` (device)	27,675 ms	26,938 ms	-2.7%	0.0039	0.0002
Total (all TUs)	57,502 ms	56,741 ms	-1.3%

Negative controls (unrelated TUs — no significant change)

TU	Delta	Wilcoxon p
`device_memory` (host)	-0.3%	0.92
`device_memory` (device)	+0.3%	0.77
`host_tensor` (host)	+0.7%	0.43
`host_tensor` (device)	+0.9%	0.32

Assembly equivalence

GPU assembly compared with -save-temps=obj on example_grouped_conv_fwd_xdl_fp16 (gfx942). Result: 7656 instructions in both builds, identical instruction mix. Only difference is a register allocation swap (v[160:161] <-> v[162:163] in two buffer_load_dwordx2 / v_cndmask_b32_e64 pairs) — functionally identical, no runtime impact.

Methodology

10 interleaved runs (baseline1, new1, baseline2, new2, ...) on same node to eliminate ordering/warmup bias
Wilcoxon signed-rank test (paired) and Mann-Whitney U test (unpaired)
Built with patched clang (LLVM 22) on ctr2-alola-compile-12, -j1 for accurate per-TU timing
Raw data: Slurm job 277012, assembly comparison: Slurm job 277013

Test plan

Compile-time benchmark with statistical significance (p < 0.01)
Negative controls confirm stable node
GPU assembly equivalence verified
Full CI

Tracking issue: #4229

Generated-by: Claude Code (claude-opus-4-6)

tenpercent · 2026-03-03T22:36:55Z

Link #4229 for tracking

github-actions · 2026-04-22T21:35:01Z

This pull request has been inactive for 25 days and will be marked as stale.

If you would like to keep this PR open, please:

Add new commits
Add a comment explaining why it should remain open

This PR will be automatically closed in 5 days if no further activity occurs.

Replace O(N^2) ComputeForwardSweep/ComputeDataIndex/ComputeMoveOnDim calls with O(1) SpaceFillingCurve::GetIndex() lookups in RunRead(), RunWrite(), and TransferDataFromSrcThreadScratchToDstThreadScratch(). This reuses the existing SpaceFillingCurve infrastructure (already used by v6r1, v7r2, etc.) to precompute the serpentine traversal indices at compile time, reducing template instantiation depth significantly. Salvages the approach from PR #5047, rebased onto current develop which has the ThreadwiseTransferHelper refactoring. Generated-by: Claude Code (claude-opus-4-6)

…3r1-sfc

github-actions Bot added the project: composablekernel label Mar 3, 2026

assistant-librarian Bot added the organization: ROCm label Mar 3, 2026

tenpercent force-pushed the users/tenpercent/ck/threadwise-transfer-v3r1-sfc branch from 01719db to 5050701 Compare March 10, 2026 23:24

github-actions Bot added the Stale PR has no activity for 25+ days label Apr 22, 2026

tenpercent force-pushed the users/tenpercent/ck/threadwise-transfer-v3r1-sfc branch from 5050701 to 28863cf Compare April 23, 2026 18:47

tenpercent marked this pull request as ready for review April 23, 2026 18:51

tenpercent requested a review from a team as a code owner April 23, 2026 18:51

tenpercent removed the Stale PR has no activity for 25+ days label Apr 23, 2026

tenpercent changed the base branch from develop to users/tenpercent/ck/tensor-descriptor-lambda-elimination April 23, 2026 19:11

tenpercent changed the base branch from users/tenpercent/ck/tensor-descriptor-lambda-elimination to develop April 23, 2026 19:14

tenpercent marked this pull request as draft April 23, 2026 19:20

tenpercent force-pushed the users/tenpercent/ck/threadwise-transfer-v3r1-sfc branch from 28863cf to 8f82d59 Compare April 23, 2026 19:28

tenpercent marked this pull request as ready for review April 23, 2026 19:49

tenpercent added 5 commits April 23, 2026 15:35

Merge branch 'develop' into users/tenpercent/ck/threadwise-transfer-v…

90d2294

…3r1-sfc

Merge branch 'develop' into users/tenpercent/ck/threadwise-transfer-v…

6fa744b

…3r1-sfc

Merge branch 'develop' into users/tenpercent/ck/threadwise-transfer-v…

5456afb

…3r1-sfc

Merge branch 'develop' into users/tenpercent/ck/threadwise-transfer-v…

41f60c5

…3r1-sfc

Merge branch 'develop' into users/tenpercent/ck/threadwise-transfer-v…

447d0db

…3r1-sfc

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CK] Use SpaceFillingCurve in ThreadwiseTensorSliceTransfer_v3r1#5047

[CK] Use SpaceFillingCurve in ThreadwiseTensorSliceTransfer_v3r1#5047
tenpercent wants to merge 6 commits into
developfrom
users/tenpercent/ck/threadwise-transfer-v3r1-sfc

tenpercent commented Mar 3, 2026 •

edited

Loading

Uh oh!

tenpercent commented Mar 3, 2026

Uh oh!

github-actions Bot commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tenpercent commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Results (example_grouped_conv_fwd_xdl_fp16, n=10, interleaved, -j1, -ftime-trace)

Negative controls (unrelated TUs — no significant change)

Assembly equivalence

Methodology

Test plan

Uh oh!

tenpercent commented Mar 3, 2026

Uh oh!

github-actions Bot commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tenpercent commented Mar 3, 2026 •

edited

Loading

Results (`example_grouped_conv_fwd_xdl_fp16`, n=10, interleaved, `-j1`, `-ftime-trace`)