Replace nested static_for lambdas with compile-time search helper by tenpercent · Pull Request #6696 · ROCm/rocm-libraries

tenpercent · 2026-04-22T23:50:44Z

Summary

- Add `sequence_find_value` and `find_in_tuple_of_sequences` compile-time search helpers with O(1) template depth
- Replace nested `static_for` lambdas in `TensorDescriptor::GetTransformAndItsUpperDimension` and `InitializeElementSize`
- Apply the same optimizations to `TensorAdaptor`

Supersedes #4287. Conflict-resolved rebase of ROCm/composable_kernel#3600 onto current develop.

Motivation

The `TensorDescriptor` and `TensorAdaptor` classes had excessive template instantiation from:

- Nested `static_for` loops with lambdas, which create unique closure types at every call site
- `generate_tuple` with lambdas, which causes per-type instantiation overhead

The new helpers use constexpr array lookup and pack expansion instead of recursive template patterns, achieving O(1) template depth.
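To illustrate the idea, here is a minimal C++17 sketch of a search that combines a constexpr array lookup with pack expansion. It is not the code from this PR: `Seq`, `TypeList`, `FindResult`, `sketch_find_value`, and `sketch_find_in_seqs` are hypothetical names chosen for the example, and the real helpers operate on CK's own sequence and tuple types.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-ins for CK's sequence and tuple-of-sequences types.
template <int... Is>
struct Seq {};

template <typename... Ts>
struct TypeList {};

// Result of searching a list of sequences (hypothetical type).
struct FindResult
{
    int seq_index; // which sequence contains the value, or -1
    int pos;       // position inside that sequence, or -1
    bool found;
};

// Expand the pack once into a constexpr array, then scan it with an ordinary
// loop. No recursive template instantiation is needed, so template depth does
// not grow with the sequence length.
template <int... Is>
constexpr int sketch_find_value(Seq<Is...>, int value)
{
    constexpr std::array<int, sizeof...(Is)> values{Is...};
    for(std::size_t i = 0; i < values.size(); ++i)
    {
        if(values[i] == value)
            return static_cast<int>(i);
    }
    return -1; // not found
}

// One pack expansion runs the per-sequence search for every sequence; a plain
// loop then picks the first hit.
template <typename... Seqs>
constexpr FindResult sketch_find_in_seqs(TypeList<Seqs...>, int value)
{
    const std::array<int, sizeof...(Seqs)> positions{sketch_find_value(Seqs{}, value)...};
    for(std::size_t s = 0; s < positions.size(); ++s)
    {
        if(positions[s] >= 0)
            return {static_cast<int>(s), positions[s], true};
    }
    return {-1, -1, false};
}

// Usage: everything is evaluated at compile time.
static_assert(sketch_find_value(Seq<4, 7, 9>{}, 7) == 1);
static_assert(sketch_find_value(Seq<4, 7, 9>{}, 5) == -1);
static_assert(sketch_find_in_seqs(TypeList<Seq<0>, Seq<1, 2>, Seq<3>>{}, 2).seq_index == 1);
```

Compared with a nested `static_for` over lambdas, this shape needs one function template instantiation per distinct sequence type and creates no closure types.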

Results (`example_grouped_conv_fwd_xdl_fp16`, n=10, interleaved, `-j1`, `-ftime-trace`)

| TU | Baseline (mean) | New (mean) | Delta | Wilcoxon p | Mann-Whitney p |
| --- | --- | --- | --- | --- | --- |
| `grouped_conv_fwd_xdl_fp16` (host) | 14,886 ms | 13,353 ms | -10.3% | 0.002 | 0.0002 |
| `grouped_conv_fwd_xdl_fp16` (device) | 27,762 ms | 25,629 ms | -7.7% | 0.002 | 0.0002 |
| Total (all TUs) | 57,732 ms | 54,030 ms | -6.4% | | |

Unrelated TUs (device_memory, host_tensor, convolution_parameter) show no significant difference (p > 0.3), serving as negative controls.

Methodology

- 10 interleaved runs (baseline₁, new₁, baseline₂, new₂, ...) on the same node to eliminate ordering/warmup bias
- Wilcoxon signed-rank test (paired, non-parametric) and Mann-Whitney U test (unpaired)
- Built with patched clang (LLVM 22) on ctr2-alola-compile-11, with `-j1` for accurate per-TU timing
- Raw data available in Slurm job 275230 results

Test plan

- 11 unit tests added (5 for `sequence_find_value`, 6 for `find_in_tuple_of_sequences`)
- Compile-time benchmark with statistical significance (p < 0.01)
- Full CI

Tracking issue: #4229

Generated-by: Claude Code (claude-sonnet-4-6)

Add sequence_find_value and find_in_tuple_of_sequences helpers using O(1) template depth via constexpr array lookup and pack expansion.

Replace nested static_for lambdas in TensorDescriptor and TensorAdaptor:
- GetTransformAndItsUpperDimension: use find_in_tuple_of_sequences
- InitializeElementSize: replace generate_tuple lambda with pack expansion

This eliminates the bulk of applier::operator() instantiations that previously came from nested lambda closures in the tensor descriptor transform search.

Conflict-resolved rebase of ROCm/composable_kernel#3600 onto develop. Supersedes #4287.

Generated-by: Claude Code (claude-sonnet-4-6)
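The `InitializeElementSize` change can be pictured with a small C++17 sketch: the product of per-dimension lengths is computed with a fold expression over an index pack, without first materializing a tuple through a per-index lambda. `SketchDescriptor`, `length_of`, and `element_size` are hypothetical names for illustration only, and the real descriptor derives its lengths from its transforms; the flat array used here is an assumption.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical descriptor that stores one length per visible dimension.
template <std::size_t N>
struct SketchDescriptor
{
    std::array<long, N> lengths;

    // Stand-in for the per-dimension length query.
    template <std::size_t I>
    constexpr long length_of() const
    {
        return lengths[I];
    }

    // Multiply all lengths in a single fold expression over the index pack.
    // No lambda is involved, so no closure type is created per call site.
    template <std::size_t... Is>
    constexpr long element_size_impl(std::index_sequence<Is...>) const
    {
        return (1L * ... * this->template length_of<Is>());
    }

    constexpr long element_size() const
    {
        return element_size_impl(std::make_index_sequence<N>{});
    }
};

// Usage: a 2 x 3 x 4 descriptor has 24 elements.
static_assert(SketchDescriptor<3>{{2, 3, 4}}.element_size() == 24);
```

The initial value `1L` in the fold makes the zero-dimensional case well defined.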

tenpercent requested a review from a team as a code owner April 22, 2026 23:50

github-actions Bot added the project: composablekernel label Apr 22, 2026

assistant-librarian Bot added the organization: ROCm label Apr 23, 2026

tenpercent mentioned this pull request Apr 23, 2026

[CK] Use SpaceFillingCurve in ThreadwiseTensorSliceTransfer_v3r1 #5047

Open

4 tasks

tenpercent wants to merge 1 commit into develop from users/tenpercent/ck/tensor-descriptor-lambda-elimination
