Skip to content

[Reduce_then_scan refactor pt 2] Relaxing requirement subgroup size#2657

Open
danhoeflinger wants to merge 11 commits intodev/dhoeflin/enable_reduce_then_scan_everywherefrom
dev/dhoeflin/remove_subgroup_size_requirement
Open

[Reduce_then_scan refactor pt 2] Relaxing requirement subgroup size#2657
danhoeflinger wants to merge 11 commits intodev/dhoeflin/enable_reduce_then_scan_everywherefrom
dev/dhoeflin/remove_subgroup_size_requirement

Conversation

@danhoeflinger
Copy link
Copy Markdown
Contributor

@danhoeflinger danhoeflinger commented Apr 7, 2026

Relaxes the requirement of subgroup size 32 / 16 for reduce_then_scan (without sacrificing performance).

  • Remove the compile-time __sub_group_size template parameter from scan building blocks, replacing it with a runtime query via sub_group::get_max_local_range() to support arbitrary sub-group sizes. For compilers which support sycl::reqd_sub_group_size this can be treated in practice as a constexpr to enable optimizations anyway
  • Remove helpers to hardcode sub group size and to determine if the required subgroup size is available. Remove gating around reduce_then_scan (except for in cases of output limited checks).
  • Replace [[sycl::reqd_sub_group_size(...)]] with [[_ONEDPL_SYCL_REQD_SUB_GROUP_SIZE_IF_SUPPORTED(32)]] to allow the kernel to run on devices that don't support sub-group size 32
  • Limit workgroup size on CPU to 256, rather than 8k. This is a big difference in performance for reduce_then_scan on CPU targets (along with SLM implementations rather than subgroup communication).

Full picture:

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR continues the reduce_then_scan refactor by removing hard-coded compile-time sub-group sizes (e.g., 32/16) and expanding applicability of the reduce-then-scan pattern across more devices (including CPU), while attempting to preserve performance via runtime sub-group queries and adjusted work-group sizing.

Changes:

  • Removes the device capability gating around reduce_then_scan and switches several algorithms to always use it (with remaining gating only for limited-output cases).
  • Refactors sub-group scan building blocks to query sub-group sizing at runtime and updates downstream KT utilities to match the new API.
  • Adjusts CPU work-group sizing caps and communication strategy (favoring SLM-based comms on CPU / non-trivially-copyable types).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl.h Removes gating/fallbacks so more scan/copy/set operations use reduce-then-scan by default.
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h Removes compile-time sub-group size template params; adds runtime device sub-group size queries and new work-group caps.
include/oneapi/dpl/experimental/kt/internal/sub_group/sub_group_scan.h Updates KT sub-group scan wrapper calls to the new reduce-then-scan scan primitive signatures.
include/oneapi/dpl/experimental/kt/internal/cooperative_lookback.h Updates cooperative lookback’s use of sub-group scan primitives to match new templates.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +2078 to +2081
const std::uint8_t __min_sub_group_size =
*std::min_element(__supported_sg_sizes.begin(), __supported_sg_sizes.end());
const std::uint8_t __max_sub_group_size =
*std::max_element(__supported_sg_sizes.begin(), __supported_sg_sizes.end());
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sub_group_sizes can be empty on some devices, but this code unconditionally dereferences min_element/max_element results. That is undefined behavior and can crash when get_info<sycl::info::device::sub_group_sizes>() returns an empty list. Handle the empty case (e.g., fall back to the multi-pass scan path / return an error / choose a safe default) and consider reusing the existing helpers that already document this behavior (see parallel_backend_sycl_utils.h where empty is explicitly handled).

Suggested change
const std::uint8_t __min_sub_group_size =
*std::min_element(__supported_sg_sizes.begin(), __supported_sg_sizes.end());
const std::uint8_t __max_sub_group_size =
*std::max_element(__supported_sg_sizes.begin(), __supported_sg_sizes.end());
const bool __has_supported_sg_sizes = !__supported_sg_sizes.empty();
const std::uint8_t __min_sub_group_size = __has_supported_sg_sizes
? *std::min_element(__supported_sg_sizes.begin(), __supported_sg_sizes.end())
: std::uint8_t{1};
const std::uint8_t __max_sub_group_size = __has_supported_sg_sizes
? *std::max_element(__supported_sg_sizes.begin(), __supported_sg_sizes.end())
: std::uint8_t{1};

Copilot uses AI. Check for mistakes.
std::uint8_t __sub_group_local_id = __sub_group.get_local_linear_id();
const std::uint8_t __sub_group_size = __sub_group.get_max_local_range()[0];
_ONEDPL_PRAGMA_UNROLL
for (std::uint8_t __shift = 1; __shift <= __sub_group_size / 2; __shift <<= 1)
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scan loop bound __shift <= __sub_group_size / 2 only produces a correct Hillis–Steele style scan when the sub-group size is a power of two. Since this refactor aims to support arbitrary sub-group sizes, this can compute incorrect results for sizes like 6/24/etc. Consider iterating while __shift < __sub_group_size (and keeping the mask logic) so all needed shift distances are covered for non-power-of-two sizes.

Suggested change
for (std::uint8_t __shift = 1; __shift <= __sub_group_size / 2; __shift <<= 1)
for (std::uint8_t __shift = 1; __shift < __sub_group_size; __shift <<= 1)

Copilot uses AI. Check for mistakes.
Comment on lines 1305 to 1309
std::uint8_t __sub_group_local_id = __sub_group.get_local_linear_id();
const std::uint8_t __sub_group_size = __sub_group.get_max_local_range()[0];
_ONEDPL_PRAGMA_UNROLL
for (std::uint8_t __shift = 1; __shift <= __sub_group_size / 2; __shift <<= 1)
{
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as the exclusive variant: __shift <= __sub_group_size / 2 assumes power-of-two sub-group sizes and can yield incorrect scans for non-power-of-two sizes. If arbitrary sub-group sizes are supported, the shift loop should cover all powers of two < __sub_group_size.

Copilot uses AI. Check for mistakes.
Comment on lines 719 to 724
_GenInput __gen_transform{__unary_op};

const std::size_t __n = oneapi::dpl::__ranges::__size(__in_rng);
return __parallel_transform_reduce_then_scan<sizeof(typename _InitType::__value_type), _CustomName>(
__q_local, __n, std::forward<_Range1>(__in_rng), std::forward<_Range2>(__out_rng), __gen_transform,
__binary_op, __gen_transform, _ScanInputTransform{}, _WriteOp{}, __init, _Inclusive{},
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block re-declares __n (shadowing the function parameter) and then uses the re-computed size instead of the already-provided __n. This makes control flow harder to follow and can re-trigger a potentially non-trivial __ranges::__size() computation. Prefer using the existing __n parameter (or rename the local variable if a recomputation is truly needed).

Copilot uses AI. Check for mistakes.
@danhoeflinger danhoeflinger added this to the 2022.13.0 milestone Apr 13, 2026
@danhoeflinger danhoeflinger force-pushed the dev/dhoeflin/remove_subgroup_size_requirement branch from 191a9e3 to 1c2fcc1 Compare April 13, 2026 19:32
@danhoeflinger danhoeflinger marked this pull request as ready for review May 6, 2026 13:23
@danhoeflinger danhoeflinger force-pushed the dev/dhoeflin/enable_reduce_then_scan_everywhere branch from 5cd54c4 to c32888d Compare May 7, 2026 02:47
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants