[Reduce_then_scan refactor pt 2] Relaxing requirement subgroup size#2657
Conversation
f46fcce to
191a9e3
Compare
There was a problem hiding this comment.
Pull request overview
This PR continues the reduce_then_scan refactor by removing hard-coded compile-time sub-group sizes (e.g., 32/16) and expanding applicability of the reduce-then-scan pattern across more devices (including CPU), while attempting to preserve performance via runtime sub-group queries and adjusted work-group sizing.
Changes:
- Removes the device capability gating around
reduce_then_scanand switches several algorithms to always use it (with remaining gating only for limited-output cases). - Refactors sub-group scan building blocks to query sub-group sizing at runtime and updates downstream KT utilities to match the new API.
- Adjusts CPU work-group sizing caps and communication strategy (favoring SLM-based comms on CPU / non-trivially-copyable types).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl.h |
Removes gating/fallbacks so more scan/copy/set operations use reduce-then-scan by default. |
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h |
Removes compile-time sub-group size template params; adds runtime device sub-group size queries and new work-group caps. |
include/oneapi/dpl/experimental/kt/internal/sub_group/sub_group_scan.h |
Updates KT sub-group scan wrapper calls to the new reduce-then-scan scan primitive signatures. |
include/oneapi/dpl/experimental/kt/internal/cooperative_lookback.h |
Updates cooperative lookback’s use of sub-group scan primitives to match new templates. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const std::uint8_t __min_sub_group_size = | ||
| *std::min_element(__supported_sg_sizes.begin(), __supported_sg_sizes.end()); | ||
| const std::uint8_t __max_sub_group_size = | ||
| *std::max_element(__supported_sg_sizes.begin(), __supported_sg_sizes.end()); |
There was a problem hiding this comment.
sub_group_sizes can be empty on some devices, but this code unconditionally dereferences min_element/max_element results. That is undefined behavior and can crash when get_info<sycl::info::device::sub_group_sizes>() returns an empty list. Handle the empty case (e.g., fall back to the multi-pass scan path / return an error / choose a safe default) and consider reusing the existing helpers that already document this behavior (see parallel_backend_sycl_utils.h where empty is explicitly handled).
| const std::uint8_t __min_sub_group_size = | |
| *std::min_element(__supported_sg_sizes.begin(), __supported_sg_sizes.end()); | |
| const std::uint8_t __max_sub_group_size = | |
| *std::max_element(__supported_sg_sizes.begin(), __supported_sg_sizes.end()); | |
| const bool __has_supported_sg_sizes = !__supported_sg_sizes.empty(); | |
| const std::uint8_t __min_sub_group_size = __has_supported_sg_sizes | |
| ? *std::min_element(__supported_sg_sizes.begin(), __supported_sg_sizes.end()) | |
| : std::uint8_t{1}; | |
| const std::uint8_t __max_sub_group_size = __has_supported_sg_sizes | |
| ? *std::max_element(__supported_sg_sizes.begin(), __supported_sg_sizes.end()) | |
| : std::uint8_t{1}; |
| std::uint8_t __sub_group_local_id = __sub_group.get_local_linear_id(); | ||
| const std::uint8_t __sub_group_size = __sub_group.get_max_local_range()[0]; | ||
| _ONEDPL_PRAGMA_UNROLL | ||
| for (std::uint8_t __shift = 1; __shift <= __sub_group_size / 2; __shift <<= 1) |
There was a problem hiding this comment.
The scan loop bound __shift <= __sub_group_size / 2 only produces a correct Hillis–Steele style scan when the sub-group size is a power of two. Since this refactor aims to support arbitrary sub-group sizes, this can compute incorrect results for sizes like 6/24/etc. Consider iterating while __shift < __sub_group_size (and keeping the mask logic) so all needed shift distances are covered for non-power-of-two sizes.
| for (std::uint8_t __shift = 1; __shift <= __sub_group_size / 2; __shift <<= 1) | |
| for (std::uint8_t __shift = 1; __shift < __sub_group_size; __shift <<= 1) |
| std::uint8_t __sub_group_local_id = __sub_group.get_local_linear_id(); | ||
| const std::uint8_t __sub_group_size = __sub_group.get_max_local_range()[0]; | ||
| _ONEDPL_PRAGMA_UNROLL | ||
| for (std::uint8_t __shift = 1; __shift <= __sub_group_size / 2; __shift <<= 1) | ||
| { |
There was a problem hiding this comment.
Same issue as the exclusive variant: __shift <= __sub_group_size / 2 assumes power-of-two sub-group sizes and can yield incorrect scans for non-power-of-two sizes. If arbitrary sub-group sizes are supported, the shift loop should cover all powers of two < __sub_group_size.
| _GenInput __gen_transform{__unary_op}; | ||
|
|
||
| const std::size_t __n = oneapi::dpl::__ranges::__size(__in_rng); | ||
| return __parallel_transform_reduce_then_scan<sizeof(typename _InitType::__value_type), _CustomName>( | ||
| __q_local, __n, std::forward<_Range1>(__in_rng), std::forward<_Range2>(__out_rng), __gen_transform, | ||
| __binary_op, __gen_transform, _ScanInputTransform{}, _WriteOp{}, __init, _Inclusive{}, |
There was a problem hiding this comment.
This block re-declares __n (shadowing the function parameter) and then uses the re-computed size instead of the already-provided __n. This makes control flow harder to follow and can re-trigger a potentially non-trivial __ranges::__size() computation. Prefer using the existing __n parameter (or rename the local variable if a recomputation is truly needed).
191a9e3 to
1c2fcc1
Compare
5cd54c4 to
c32888d
Compare
This reverts commit 4f46e97.
This reverts commit 0af5084.
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
7dc127d to
4a1560b
Compare
Relaxes the requirement of subgroup size 32 / 16 for reduce_then_scan (without sacrificing performance).
sycl::reqd_sub_group_sizethis can be treated in practice as a constexpr to enable optimizations anyway[[sycl::reqd_sub_group_size(...)]]with[[_ONEDPL_SYCL_REQD_SUB_GROUP_SIZE_IF_SUPPORTED(32)]]to allow the kernel to run on devices that don't support sub-group size 32Full picture: