[Reduce_then_scan refactor pt 2] Relaxing requirement subgroup size by danhoeflinger · Pull Request #2657 · uxlfoundation/oneDPL

danhoeflinger · 2026-04-07T16:14:58Z

Relaxes the requirement of subgroup size 32 / 16 for reduce_then_scan (without sacrificing performance).

Remove the compile-time __sub_group_size template parameter from scan building blocks, replacing it with a runtime query via sub_group::get_max_local_range() to support arbitrary sub-group sizes. For compilers that support sycl::reqd_sub_group_size, the value can in practice be treated as a constexpr, so the optimizations remain available.
Remove the helpers that hardcode the sub-group size and that determine whether the required sub-group size is available. Remove the gating around reduce_then_scan (except for output-limited checks).
Replace [[sycl::reqd_sub_group_size(...)]] with [[_ONEDPL_SYCL_REQD_SUB_GROUP_SIZE_IF_SUPPORTED(32)]] to allow the kernel to run on devices that don't support sub-group size 32.
Limit the work-group size on CPU to 256, rather than 8k. This makes a big difference in performance for reduce_then_scan on CPU targets (along with using SLM implementations rather than sub-group communication).
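
The CPU work-group cap described above can be illustrated with a small host-side sketch. This is not the oneDPL implementation: `floor_power_of_two` and `choose_work_group_size` are hypothetical names, and rounding down to a power of two is an assumption taken from the commit titled "forcing workgroups to power of 2". Only the 256 cap on CPU comes from the description.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not oneDPL code): largest power of two <= value.
inline std::size_t floor_power_of_two(std::size_t value)
{
    assert(value > 0);
    std::size_t result = 1;
    while (result <= value / 2)
        result *= 2;
    return result;
}

// Hypothetical helper: derive a work-group size from the device maximum.
// The cap of 256 on CPU devices is the value named in the description;
// leaving other devices uncapped is an assumption of this sketch.
inline std::size_t choose_work_group_size(std::size_t device_max, bool is_cpu)
{
    constexpr std::size_t cpu_cap = 256;
    const std::size_t capped = (is_cpu && device_max > cpu_cap) ? cpu_cap : device_max;
    return floor_power_of_two(capped);
}
```

With these assumptions, a CPU device reporting a maximum of 8192 would get a work-group size of 256, while a non-CPU device reporting 1000 would get 512.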

Full picture:

Relax trivially copyable requirement
Relax subgroup size requirements <------
Implement output limited scan_copy
Remove dead code implementations (scan_then_propagate, reduce_by_segment_fallback, scan_by_segment_fallback)

Copilot

Pull request overview

This PR continues the reduce_then_scan refactor by removing hard-coded compile-time sub-group sizes (e.g., 32/16) and expanding applicability of the reduce-then-scan pattern across more devices (including CPU), while attempting to preserve performance via runtime sub-group queries and adjusted work-group sizing.

Changes:

Removes the device capability gating around reduce_then_scan and switches several algorithms to always use it (with remaining gating only for limited-output cases).
Refactors sub-group scan building blocks to query sub-group sizing at runtime and updates downstream KT utilities to match the new API.
Adjusts CPU work-group sizing caps and communication strategy (favoring SLM-based comms on CPU / non-trivially-copyable types).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
`include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl.h`	Removes gating/fallbacks so more scan/copy/set operations use reduce-then-scan by default.
`include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h`	Removes compile-time sub-group size template params; adds runtime device sub-group size queries and new work-group caps.
`include/oneapi/dpl/experimental/kt/internal/sub_group/sub_group_scan.h`	Updates KT sub-group scan wrapper calls to the new reduce-then-scan scan primitive signatures.
`include/oneapi/dpl/experimental/kt/internal/cooperative_lookback.h`	Updates cooperative lookback’s use of sub-group scan primitives to match new templates.


Copilot · 2026-04-09T18:36:39Z

+    const std::uint8_t __min_sub_group_size =
+        *std::min_element(__supported_sg_sizes.begin(), __supported_sg_sizes.end());
+    const std::uint8_t __max_sub_group_size =
+        *std::max_element(__supported_sg_sizes.begin(), __supported_sg_sizes.end());


sub_group_sizes can be empty on some devices, but this code unconditionally dereferences min_element/max_element results. That is undefined behavior and can crash when get_info<sycl::info::device::sub_group_sizes>() returns an empty list. Handle the empty case (e.g., fall back to the multi-pass scan path / return an error / choose a safe default) and consider reusing the existing helpers that already document this behavior (see parallel_backend_sycl_utils.h where empty is explicitly handled).

Suggested change

-    const std::uint8_t __min_sub_group_size =
-        *std::min_element(__supported_sg_sizes.begin(), __supported_sg_sizes.end());
-    const std::uint8_t __max_sub_group_size =
-        *std::max_element(__supported_sg_sizes.begin(), __supported_sg_sizes.end());
+    const bool __has_supported_sg_sizes = !__supported_sg_sizes.empty();
+    const std::uint8_t __min_sub_group_size = __has_supported_sg_sizes
+        ? *std::min_element(__supported_sg_sizes.begin(), __supported_sg_sizes.end())
+        : std::uint8_t{1};
+    const std::uint8_t __max_sub_group_size = __has_supported_sg_sizes
+        ? *std::max_element(__supported_sg_sizes.begin(), __supported_sg_sizes.end())
+        : std::uint8_t{1};
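
The empty-list guard in this suggestion can be exercised on the host without a SYCL device. The sketch below is an illustration only: `min_max_sub_group_size` is a hypothetical helper, the input stands in for the `std::vector<std::size_t>` returned by the `sub_group_sizes` device query, and the fallback of 1 mirrors the suggested change rather than any confirmed oneDPL policy.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not oneDPL code): returns {min, max} of the reported
// sub-group sizes, or {fallback, fallback} when the device reports none.
// Checking empty() first avoids dereferencing an end iterator.
inline std::pair<std::size_t, std::size_t>
min_max_sub_group_size(const std::vector<std::size_t>& supported, std::size_t fallback = 1)
{
    if (supported.empty())
        return {fallback, fallback};
    const auto range = std::minmax_element(supported.begin(), supported.end());
    return {*range.first, *range.second};
}
```

Whether a fallback value, an error, or a switch to the multi-pass scan path is the right response to an empty list is a design decision for the PR author; the sketch only shows that the dereference has to be guarded.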

Copilot · 2026-04-09T18:36:40Z

    std::uint8_t __sub_group_local_id = __sub_group.get_local_linear_id();
+    const std::uint8_t __sub_group_size = __sub_group.get_max_local_range()[0];
    _ONEDPL_PRAGMA_UNROLL
    for (std::uint8_t __shift = 1; __shift <= __sub_group_size / 2; __shift <<= 1)


The scan loop bound __shift <= __sub_group_size / 2 only produces a correct Hillis–Steele style scan when the sub-group size is a power of two. Since this refactor aims to support arbitrary sub-group sizes, this can compute incorrect results for sizes like 6/24/etc. Consider iterating while __shift < __sub_group_size (and keeping the mask logic) so all needed shift distances are covered for non-power-of-two sizes.

Suggested change

-    for (std::uint8_t __shift = 1; __shift <= __sub_group_size / 2; __shift <<= 1)
+    for (std::uint8_t __shift = 1; __shift < __sub_group_size; __shift <<= 1)
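
The effect of the two loop bounds can be checked with a host-side simulation. The sketch below is not the oneDPL kernel code: `simulate_sub_group_scan` is a hypothetical function that models each lane as a vector element and each shuffle step as a read from a snapshot of the previous values. It only illustrates which shift distances each bound visits.

```cpp
#include <cstddef>
#include <vector>

// Host-side model of a Hillis-Steele inclusive scan over the lanes of one
// sub-group (hypothetical, for illustration only).
// cover_all_shifts == false uses the bound `shift <= n / 2`;
// cover_all_shifts == true uses the bound `shift < n`.
inline std::vector<int> simulate_sub_group_scan(std::vector<int> lanes, bool cover_all_shifts)
{
    const std::size_t n = lanes.size();
    for (std::size_t shift = 1; cover_all_shifts ? shift < n : shift <= n / 2; shift <<= 1)
    {
        // Snapshot of the values every lane held before this step,
        // standing in for a shuffle from the lane `shift` positions below.
        const std::vector<int> previous = lanes;
        for (std::size_t lane = shift; lane < n; ++lane)
            lanes[lane] = previous[lane] + previous[lane - shift];
    }
    return lanes;
}
```

For six lanes holding 1, the `shift <= n / 2` bound stops after shifts 1 and 2 and yields 1, 2, 3, 4, 4, 4, while the `shift < n` bound also applies shift 4 and yields 1, 2, 3, 4, 5, 6. For eight lanes both bounds visit shifts 1, 2 and 4 and agree.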

Copilot · 2026-04-09T18:36:40Z

    std::uint8_t __sub_group_local_id = __sub_group.get_local_linear_id();
+    const std::uint8_t __sub_group_size = __sub_group.get_max_local_range()[0];
    _ONEDPL_PRAGMA_UNROLL
    for (std::uint8_t __shift = 1; __shift <= __sub_group_size / 2; __shift <<= 1)
    {


Same issue as the exclusive variant: __shift <= __sub_group_size / 2 assumes power-of-two sub-group sizes and can yield incorrect scans for non-power-of-two sizes. If arbitrary sub-group sizes are supported, the shift loop should cover all powers of two < __sub_group_size.

Copilot · 2026-04-09T18:36:40Z

        _GenInput __gen_transform{__unary_op};

        const std::size_t __n = oneapi::dpl::__ranges::__size(__in_rng);
        return __parallel_transform_reduce_then_scan<sizeof(typename _InitType::__value_type), _CustomName>(
            __q_local, __n, std::forward<_Range1>(__in_rng), std::forward<_Range2>(__out_rng), __gen_transform,
            __binary_op, __gen_transform, _ScanInputTransform{}, _WriteOp{}, __init, _Inclusive{},


This block re-declares __n (shadowing the function parameter) and then uses the re-computed size instead of the already-provided __n. This makes control flow harder to follow and can re-trigger a potentially non-trivial __ranges::__size() computation. Prefer using the existing __n parameter (or rename the local variable if a recomputation is truly needed).

danhoeflinger marked this pull request as draft April 7, 2026 16:15

danhoeflinger force-pushed the dev/dhoeflin/remove_subgroup_size_requirement branch from f46fcce to 191a9e3 Compare April 7, 2026 18:21

danhoeflinger mentioned this pull request Apr 9, 2026

[Reduce_then_scan refactor pt 1] Relaxing requirement of trivial copyable types #2656

Open

danhoeflinger requested a review from Copilot April 9, 2026 18:30

Copilot started reviewing on behalf of danhoeflinger April 9, 2026 18:31

Copilot AI reviewed Apr 9, 2026

danhoeflinger added this to the 2022.13.0 milestone Apr 13, 2026

danhoeflinger force-pushed the dev/dhoeflin/remove_subgroup_size_requirement branch from 191a9e3 to 1c2fcc1 Compare April 13, 2026 19:32

danhoeflinger marked this pull request as ready for review May 6, 2026 13:23

danhoeflinger force-pushed the dev/dhoeflin/enable_reduce_then_scan_everywhere branch from 5cd54c4 to c32888d Compare May 7, 2026 02:47

danhoeflinger added 11 commits May 6, 2026 22:47

Revert "properly deciding threshold"

676b715

This reverts commit 4f46e97.

Revert "reverting subgroup size changes (for now)"

bb413b8

This reverts commit 0af5084.

formatting

0f0b5da

Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>

fix bad rebase

43284f0

Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>

subgroup=32 if supported

6ec70ac

Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>

remove explicit unrolling (for test / benchmark)

25fc4d2

Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>

adjust block to be based upon compute units

d3ac9e0

Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>

forcing workgroups to power of 2

267b2c9

Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>

typo

71bb47e

Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>

remove in-place exclusive scan workaround

84be0d3

Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>

Improve PVC tuning

4a1560b

Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>

danhoeflinger force-pushed the dev/dhoeflin/remove_subgroup_size_requirement branch from 7dc127d to 4a1560b Compare May 7, 2026 02:47

SergeyKopienko mentioned this pull request May 8, 2026

Add proper bounded output support to set-algorithms from oneapi::dpl::ranges namespace with hetero policies #2681

Draft

danhoeflinger wants to merge 11 commits into dev/dhoeflin/enable_reduce_then_scan_everywhere from dev/dhoeflin/remove_subgroup_size_requirement


