
Replace nested static_for lambdas with compile-time search helper #6696

Open
tenpercent wants to merge 1 commit into develop from users/tenpercent/ck/tensor-descriptor-lambda-elimination


tenpercent commented Apr 22, 2026

Summary

  • Add sequence_find_value and find_in_tuple_of_sequences compile-time search helpers with O(1) template depth
  • Replace nested static_for lambdas in TensorDescriptor::GetTransformAndItsUpperDimension and InitializeElementSize
  • Apply same optimizations to TensorAdaptor
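As a rough illustration of what the first helper does, here is a minimal sketch of a value search over a compile-time integer sequence. The `Sequence` type, the function signature, and the `-1` "not found" convention are assumptions made for this example; the PR's actual helper may differ in all three.

```cpp
#include <cstddef>

// Hypothetical stand-in for CK's Sequence type (illustration only).
template <int... Is>
struct Sequence
{
};

// Returns the index of the first element equal to `target`, or -1 if absent.
// The pack is expanded once into a local constexpr array and scanned with a
// plain loop, so no recursive template instantiation is involved.
template <int... Is>
constexpr int sequence_find_value(Sequence<Is...>, int target)
{
    // One extra trailing element avoids a zero-length array for Sequence<>.
    constexpr int values[sizeof...(Is) + 1] = {Is..., 0};
    for(std::size_t i = 0; i < sizeof...(Is); ++i)
    {
        if(values[i] == target)
        {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Usage: the search is evaluated entirely at compile time.
static_assert(sequence_find_value(Sequence<4, 7, 9>{}, 9) == 2, "9 is at index 2");
static_assert(sequence_find_value(Sequence<4, 7, 9>{}, 5) == -1, "5 is not present");
```

The template depth stays constant because the only template instantiated per call is the function itself; the scan is ordinary constexpr evaluation.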

Supersedes #4287. Conflict-resolved rebase of ROCm/composable_kernel#3600 onto current develop.

Motivation

The TensorDescriptor and TensorAdaptor classes had excessive template instantiation from:

  1. Nested static_for loops with lambdas creating unique closure types at every call site
  2. generate_tuple with lambdas causing per-type instantiation overhead

The new helpers use constexpr array lookup and pack expansion instead of recursive template patterns, achieving O(1) template depth.
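The combination of pack expansion and array lookup can be sketched as follows for the tuple-of-sequences case. This is not the PR's code: `Sequence`, `Tuple`, `FindResult`, and `find_in_sequence` are hypothetical names chosen so the example is self-contained, and the real helper may return its result in a different form.

```cpp
#include <cstddef>

// Hypothetical stand-ins for CK's Sequence and Tuple types (illustration only).
template <int... Is>
struct Sequence
{
};

template <typename... Ts>
struct Tuple
{
};

// Assumed result shape: which sequence matched, and where inside it.
struct FindResult
{
    int seq_index; // index of the sequence containing the value, -1 if none
    int pos;       // position inside that sequence, -1 if none
};

// Scans a single sequence with a loop over a constexpr array.
template <int... Is>
constexpr int find_in_sequence(Sequence<Is...>, int target)
{
    constexpr int values[sizeof...(Is) + 1] = {Is..., 0};
    for(std::size_t i = 0; i < sizeof...(Is); ++i)
    {
        if(values[i] == target)
        {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// A single pack expansion evaluates the per-sequence search for every
// sequence; a loop then picks the first hit. No lambdas or static_for needed.
template <typename... Seqs>
constexpr FindResult find_in_tuple_of_sequences(Tuple<Seqs...>, int target)
{
    constexpr std::size_t n = sizeof...(Seqs);
    const int pos[n + 1]    = {find_in_sequence(Seqs{}, target)..., -1};
    for(std::size_t s = 0; s < n; ++s)
    {
        if(pos[s] >= 0)
        {
            return FindResult{static_cast<int>(s), pos[s]};
        }
    }
    return FindResult{-1, -1};
}

// Usage: find which "transform" owns dimension id 2 and at which position.
using ExampleUpperIds     = Tuple<Sequence<0>, Sequence<1, 2>, Sequence<3>>;
constexpr auto example_hit = find_in_tuple_of_sequences(ExampleUpperIds{}, 2);
static_assert(example_hit.seq_index == 1 && example_hit.pos == 1, "2 lives in sequence 1");
```

Compared with nested `static_for` lambdas, this shape instantiates one function template per distinct sequence type rather than a fresh closure type per loop iteration and call site.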

Results (example_grouped_conv_fwd_xdl_fp16, n=10, interleaved, -j1, -ftime-trace)

| TU | Baseline (mean) | New (mean) | Delta | Wilcoxon p | Mann-Whitney p |
|---|---|---|---|---|---|
| grouped_conv_fwd_xdl_fp16 (host) | 14,886 ms | 13,353 ms | -10.3% | 0.002 | 0.0002 |
| grouped_conv_fwd_xdl_fp16 (device) | 27,762 ms | 25,629 ms | -7.7% | 0.002 | 0.0002 |
| Total (all TUs) | 57,732 ms | 54,030 ms | -6.4% | | |

Unrelated TUs (device_memory, host_tensor, convolution_parameter) show no significant difference (p > 0.3), serving as negative controls.

Methodology

  • 10 interleaved runs (baseline₁, new₁, baseline₂, new₂, ...) on the same node to minimize ordering and warmup bias
  • Wilcoxon signed-rank test (paired, non-parametric) and Mann-Whitney U test (unpaired)
  • Built with patched clang (LLVM 22) on ctr2-alola-compile-11, -j1 for accurate per-TU timing
  • Raw data available in Slurm job 275230 results

Test plan

  • 11 unit tests added (5 for sequence_find_value, 6 for find_in_tuple_of_sequences)
  • Compile-time benchmark with statistical significance (p < 0.01)
  • Full CI

Tracking issue: #4229


Generated-by: Claude Code (claude-sonnet-4-6)

Add sequence_find_value and find_in_tuple_of_sequences helpers using
O(1) template depth via constexpr array lookup and pack expansion.

Replace nested static_for lambdas in TensorDescriptor and TensorAdaptor:
- GetTransformAndItsUpperDimension: use find_in_tuple_of_sequences
- InitializeElementSize: replace generate_tuple lambda with pack expansion
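
For illustration only, the pack-expansion idea behind the InitializeElementSize change can be sketched with a C++17 fold expression. The function names and the use of std::array for the lengths are assumptions for this example; the actual code works on CK's own tuple and transform types.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Multiplies all lengths in one fold expression instead of building an
// intermediate tuple through a per-dimension lambda and then reducing it.
template <std::size_t N, std::size_t... Is>
constexpr long element_size_impl(const std::array<long, N>& lengths,
                                 std::index_sequence<Is...>)
{
    // Binary left fold; yields 1 for an empty pack.
    return (1L * ... * lengths[Is]);
}

// Hypothetical entry point: product of all visible-dimension lengths.
template <std::size_t N>
constexpr long element_size(const std::array<long, N>& lengths)
{
    return element_size_impl(lengths, std::make_index_sequence<N>{});
}

// Usage: a 2 x 3 x 4 descriptor holds 24 elements.
constexpr std::array<long, 3> example_lengths{{2, 3, 4}};
static_assert(element_size(example_lengths) == 24, "2 * 3 * 4 == 24");
```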

This eliminates the bulk of applier::operator() instantiations that
previously came from nested lambda closures in the tensor descriptor
transform search.

Conflict-resolved rebase of ROCm/composable_kernel#3600 onto develop.
Supersedes #4287.

Generated-by: Claude Code (claude-sonnet-4-6)