Commits (77)
c193d39
First draft device_vector background
danhoeflinger Apr 3, 2026
830b2d7
adding key design decisions
danhoeflinger Apr 3, 2026
0d6c2ed
adding a usage example
danhoeflinger Apr 3, 2026
650a25e
experimental device vector
danhoeflinger Apr 3, 2026
82c6e4d
minor improvements
danhoeflinger Apr 7, 2026
a48f51d
replacing table with targeted section
danhoeflinger Apr 7, 2026
af51e09
adding extra decisions and open questions
danhoeflinger Apr 7, 2026
5b7e982
removing redundant section
danhoeflinger Apr 7, 2026
d5ab776
wording
danhoeflinger Apr 7, 2026
8dedb9c
removing allocator
danhoeflinger Apr 7, 2026
5d0077e
discussion of queue association and open questions
danhoeflinger Apr 8, 2026
de91e6c
iterate from self-review
danhoeflinger Apr 9, 2026
ba93348
removing unused vector from example
danhoeflinger Apr 9, 2026
012561c
More iteration on self review
danhoeflinger Apr 9, 2026
7b12b88
language improvements for decisions
danhoeflinger Apr 13, 2026
f313319
Updates for table to show proposed
danhoeflinger Apr 13, 2026
ea49728
removing redundancy
danhoeflinger Apr 13, 2026
4a398f5
improve examples
danhoeflinger Apr 13, 2026
83cf635
remove boost compute from table
danhoeflinger Apr 13, 2026
9818222
add language about boost compute
danhoeflinger Apr 13, 2026
94be8a2
fix comment
danhoeflinger Apr 15, 2026
ac39f92
adjust language of multi-device
danhoeflinger Apr 15, 2026
55de5ed
Notes on queue vs device&context
danhoeflinger Apr 15, 2026
de933fd
remove noexcept / some const until we know better from implementation…
danhoeflinger Apr 15, 2026
49d9ab1
adding ranges proposal
danhoeflinger Apr 15, 2026
3e8db04
update rfc
danhoeflinger Apr 21, 2026
439fdbf
initial investigation into usage
danhoeflinger Apr 16, 2026
05d0e14
wider usage of thrust device_vector
danhoeflinger Apr 16, 2026
d6c17ed
adding evolutions from device_vector
danhoeflinger Apr 16, 2026
eadc7e9
separate usage pattern study
danhoeflinger Apr 16, 2026
797e7f2
separate POC lessons
danhoeflinger Apr 16, 2026
5bddf11
removing unnecessary bloat
danhoeflinger Apr 16, 2026
9422da6
support no_init
danhoeflinger Apr 16, 2026
e1078cf
update no_init tag in table
danhoeflinger Apr 16, 2026
1bd160a
example usage updates
danhoeflinger Apr 16, 2026
5fee623
incorporate sycl-thrust impl into table
danhoeflinger Apr 16, 2026
726cdb0
remove boost compute ref
danhoeflinger Apr 16, 2026
a88b9f6
improve open question; minor improvements
danhoeflinger Apr 20, 2026
3305e82
adding device & context overloads
danhoeflinger Apr 20, 2026
78154c5
adding pointer ref and optional explicit queue args
danhoeflinger Apr 20, 2026
b36ae14
improve open question organization
danhoeflinger Apr 20, 2026
7680ccc
changing from default queue to default device
danhoeflinger Apr 20, 2026
f395e9f
adding open question on less used apis
danhoeflinger Apr 20, 2026
5eec3b2
further improve open questions
danhoeflinger Apr 20, 2026
4a1d7d3
better sychronization design
danhoeflinger Apr 20, 2026
57eaf3b
minor API fixesl language updates
danhoeflinger Apr 20, 2026
a9d7f1c
remove redundancy
danhoeflinger Apr 20, 2026
8c39ac7
remove decision of allocator
danhoeflinger Apr 20, 2026
d4a42e9
removing helper class section
danhoeflinger Apr 20, 2026
6fc2a41
fixed incorrect language in device_view
danhoeflinger Apr 20, 2026
ac287e8
removing unnecessary text
danhoeflinger Apr 20, 2026
b496fd0
improve usage pattern study
danhoeflinger Apr 21, 2026
b75eedd
fixing example usage and filling in missing APIs
danhoeflinger Apr 21, 2026
751bd4a
improve usage study specifics
danhoeflinger Apr 23, 2026
162c495
minor range language changes
danhoeflinger Apr 23, 2026
7879272
adjust table ordering
danhoeflinger Apr 23, 2026
9721af1
simplified device_array proposal
danhoeflinger Apr 23, 2026
ac21e46
refine device_array proposal
danhoeflinger Apr 24, 2026
01d358c
refine device_vector compat proposal
danhoeflinger Apr 24, 2026
7e6faa3
adjust top line RFC to point to compat layer & device_array
danhoeflinger Apr 24, 2026
e60de0b
Add details to range support
danhoeflinger Apr 28, 2026
5a3e69d
minor details to device_vector
danhoeflinger Apr 28, 2026
557894d
adding open questions
danhoeflinger Apr 28, 2026
ad17c41
distill device_vector compat
danhoeflinger Apr 28, 2026
6f0672d
distill top readme
danhoeflinger Apr 28, 2026
0509bfc
distill device array
danhoeflinger Apr 28, 2026
0befbb9
diagram updates
danhoeflinger Apr 28, 2026
02a0ba9
fix table
danhoeflinger Apr 28, 2026
dbbdfca
diagram typo
danhoeflinger Apr 28, 2026
5c1f519
remove incorrect example
danhoeflinger Apr 28, 2026
f6811fb
reword question
danhoeflinger Apr 28, 2026
b7c1e21
rewording
danhoeflinger Apr 28, 2026
be87126
adjust namespace section; reduce redundancy
danhoeflinger Apr 28, 2026
51ba663
minor fixes
danhoeflinger Apr 28, 2026
22f5123
mentioning no default construction
danhoeflinger Apr 28, 2026
be4169c
copy semantics
danhoeflinger Apr 28, 2026
c6cb82b
minor fixes
danhoeflinger Apr 28, 2026
294 changes: 294 additions & 0 deletions rfcs/proposed/device_vector/README.md
# `device_vector` for oneDPL

## Introduction

This RFC proposes adding a `device_vector` container to oneDPL that provides
a `std::vector`-like interface for managing device memory. A `device_vector`
is a widely expected companion to parallel/device algorithm libraries, giving
users a familiar, RAII-managed container for data that lives on an accelerator.

### Motivation

- **Migration from CUDA/Thrust** - Thrust's `device_vector` is heavily used
in CUDA codebases. Providing an equivalent in oneDPL lowers the barrier
for porting to SYCL backends. SYCLomatic already generates code targeting
a `dpct::device_vector` compatibility shim, and having an official oneDPL
type would give that migration a stable target in an actively maintained
repository.
- **Ease of use** - Users currently must manually manage USM allocations or
SYCL buffers and pair them with raw pointers or iterators. A
`device_vector` encapsulates allocation, sizing, and lifetime in a
single object and integrates directly with oneDPL algorithms.

## Comparison of Existing Implementations

| Implementation | Source |
|---|---|
| **Thrust** (`thrust::device_vector`) | [NVIDIA/cccl - device_vector.h](https://github.com/NVIDIA/cccl/blob/main/thrust/thrust/device_vector.h) |
| **SYCLomatic** (`dpct::device_vector`) | [SYCLomatic - vector.h](https://github.com/oneapi-src/SYCLomatic/blob/SYCLomatic/clang/runtime/dpct-rt/include/dpct/dpl_extras/vector.h) |
| **Distributed Ranges** (`dr::sp::device_vector`) | [distributed-ranges - device_vector.hpp](https://github.com/oneapi-src/distributed-ranges/blob/main/include/dr/sp/device_vector.hpp) |

### 1. How They Differ

| Aspect | Thrust | SYCLomatic | Distributed Ranges | Proposed (oneDPL) |
|---|---|---|---|---|
| **Default Allocator** | `thrust::device_allocator<T>` (CUDA `cudaMalloc`) | USM: `sycl::usm_allocator<T, shared>` / Buffer: `__buffer_allocator<T>` | None (must be specified; typically `device_allocator<T>`) | N/A; Always uses `sycl::malloc_device` directly |
| **Memory Model** | **Device memory** via `cudaMalloc`; host access triggers explicit transfers | **Shared memory** via USM shared or SYCL buffer/accessor; runtime manages placement | **Device memory** via `sycl::malloc_device`; host access triggers explicit transfers | **Device memory** via `sycl::malloc_device`; host access triggers explicit transfers |
| **Host Element Access** | Via `device_reference` proxy (explicit device-to-host copy) | Via `device_reference` proxy (runtime-managed migration) | Via `device_ref` proxy (explicit `queue.memcpy().wait()`) | Via `device_reference` proxy (explicit device-to-host copy) |
| **std::vector Interop** | Copy constructors from/to `std::vector` | Copy/move + implicit `operator std::vector()` | No direct interop | Explicit constructor + `operator std::vector()` |
| **Multi-device** | No | No | Yes (`rank()` tracks owning device) | see [open question](#open-questions) |
| **Queue Association** | Implicit (CUDA stream) | Global default queue | Global default queue | Explicit `sycl::queue` parameter on constructors (see [open question](#open-questions)) |
| **Uninitialized Construction** | `default_init_t`, `no_init_t` tags | Not supported | Not supported | see [open question](#open-questions) |

### 2. Boost.Compute

[Boost.Compute](https://github.com/boostorg/compute/blob/master/include/boost/compute/container/vector.hpp)
(`boost::compute::vector`) is an OpenCL-based device vector built on `cl::Buffer`.
Its most relevant design choice is **explicit queue association**: constructors and
operations accept a `command_queue` parameter, making the queue relationship clear
rather than relying on a global default. This is the closest existing precedent
for our proposed explicit `sycl::queue` constructor parameter (see
[open question](#open-questions) on queue association).

## Proposal

The proposal consists of the high-level design decisions below and an API
skeleton that mirrors `std::vector` where applicable, adapted for device
memory semantics.


### High Level Decisions

- **Use device memory as the baseline; copy to/from the host on demand**
This matches the semantics of all pre-existing implementations other than
SYCLomatic, where the runtime decides where memory lives. Shared memory has
significantly worse performance than device memory; users who want those
semantics can use USM shared memory or SYCL buffers directly.

- **Use USM; no support for a buffer-backed `device_vector`**
As mentioned above, buffers provide an alternative with similar semantics to
`device_vector`. We learned from the SYCLomatic implementation that offering
`device_vector` functionality with SYCL buffer backing is awkward and breaks
our decision to keep the data on the device. Using a `device_vector` therefore
requires USM support.

- **Type T should only require device copyability**
We should not need anything except device copyability (for copy to and from
the device).

- **We don't need a tag system for dispatch to specific hardware**
Execution policies dictate where algorithms run. We don't intend to provide
other flavors of vector / iterator, so there is no need for tag-based dispatch.

- **device_pointer should be [device copyable](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec::device.copyable), indirectly device accessible, and usable with good performance on the device**
The intent is for these to be directly usable in SYCL kernels and oneDPL algorithms.
Ideally, no hard-coded support for `device_vector` will be required within oneDPL's
input processing code. The indirectly-device-accessible trait should provide everything
we need here, including composability with custom iterator adapters.

- **`device_reference` supports all compound assignment and increment/decrement operators**
Following Thrust's convention, `device_reference<T>` will support all compound
assignment operators (`+=`, `-=`, `*=`, `/=`, `%=`, `&=`, `|=`, `^=`, `<<=`, `>>=`)
and increment/decrement (`++`, `--`). This makes host-side element access behave
as close to a real `T&` as possible and eases migration from Thrust codebases
where users expect expressions like `d_vec[i] += 1` to work. Each such operation
implies a synchronous round trip (read-modify-write) to device memory
(sketched under Helper Types below).

- **No custom allocator template parameter; use `sycl::malloc_device` directly**
The SYCL 2020 spec intentionally excludes `sycl::usm_allocator` for device
memory because the C++ `Allocator` named requirement mandates host-accessible
memory, which device USM is not. A device memory allocator cannot satisfy
`Allocator`, so a custom allocator template parameter in the `std::vector`
sense is not meaningful here. We use `sycl::malloc_device` / `sycl::free`
directly, keeping the implementation simpler (see the sketch after this list).
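
To make these decisions concrete, the sketch below shows how a size-and-value
constructor could allocate device USM directly (no allocator parameter) and
value-initialize the elements on the device. This is an illustrative sketch only,
not the proposed implementation; the type and member names are hypothetical.

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

// Hypothetical sketch of the storage strategy implied by the decisions above.
template <typename T>
struct storage_sketch {
    sycl::queue q;
    T* data = nullptr;
    std::size_t count = 0;

    storage_sketch(std::size_t n, const T& value, sycl::queue queue) : q(queue), count(n)
    {
        // Device USM allocated directly; throws errc::feature_not_supported if the
        // device lacks aspect::usm_device_allocations.
        data = sycl::malloc_device<T>(n, q);
        // Initialize the elements on the device.
        q.fill(data, value, n).wait();
    }

    ~storage_sketch() { sycl::free(data, q); }
};
```

A host-side compound assignment such as `vec[i] += 1` would then be a synchronous
read-modify-write round trip through this queue, as sketched in the Helper Types
section below.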

### API Skeleton

```cpp
namespace oneapi::dpl::experimental {

template <typename T>
class device_vector {
    // [Review thread]
    // Contributor: What about an allocator? At least I can use USM-shared and
    // USM-device memory. How can we specify that?
    //
    // Contributor: The RFC explicitly states that this choice is deliberately not
    // provided. That can be discussed, of course. Also note that sycl::usm_allocator
    // does not support usm::alloc::device.
    //
    // Author (danhoeflinger, Apr 14, 2026): Yes, I believe the SYCL folks have
    // deliberately not provided such an allocator because of some requirements on a
    // C++ allocator which do not align with allocation on the device. Generally, I
    // think providing a generic allocator betrays the semantics here, as we always
    // want the data to live on the device. It's worth thinking about whether there
    // are use cases which would benefit from a C++ allocator-like template argument
    // defined to provide only USM device memory, but I don't have a convincing
    // motivating case at this time. The best I can think of is some sort of debugging
    // feature like device memory bookkeeping, but that is a significantly more
    // complex thing to add (and we could always extend it in the future).

public:
    // Standard container typedefs
    using value_type = T;
    using size_type = std::size_t;
    using difference_type = std::ptrdiff_t;
    using reference = device_reference<T>;             // proxy
    using const_reference = device_reference<const T>; // proxy
    using pointer = device_pointer<T>;
    using const_pointer = device_pointer<const T>;
    using iterator = /* device-aware random access iterator */;
    using const_iterator = /* const version */;

    // Constructors / assignment / destructor -- mirrors std::vector.
    // All constructors currently accept an optional sycl::queue argument
    // (defaulted, e.g. to a global default queue), but this is an open question.
    device_vector(sycl::queue q = /* default queue */);
    explicit device_vector(size_type count,
                           sycl::queue q = /* default queue */);
    device_vector(size_type count, const T& value,
                  sycl::queue q = /* default queue */);
    template <typename InputIt>
    device_vector(InputIt first, InputIt last,
                  sycl::queue q = /* default queue */);
    device_vector(std::initializer_list<T> init,
                  sycl::queue q = /* default queue */);

    // [Review thread]
    // Contributor: At first look this default may cause a lot of problems: what is
    // the default queue? For example, if I create a sycl::queue with a CPU selector,
    // this default will not work correctly together with it. Maybe it makes sense to
    // remove it altogether?
    //
    // Author (danhoeflinger, Apr 14, 2026): My expectation would be to use the same
    // default queue as the device policy, __internal::__get_default_queue().
    // My understanding of what happens if you provide a CPU-selector queue: if the
    // device does not support aspect::usm_device_allocations, sycl::malloc_device
    // will throw errc::feature_not_supported; I think this is appropriate for a
    // device_vector. For the Intel implementation of SYCL, I believe CPU devices do
    // have that aspect (though it really depends upon the implementation of the
    // Unified Runtime), and the allocation is equivalent to a malloc on the host.
    // Dereferencing this pointer on the "host" is UB according to the SYCL spec.
    // Currently it happens to work, but our implementation of device_vector will not
    // rely upon that. I think all this works as intended, with no special handling
    // on our side.

    device_vector(const device_vector&); // also copies the queue
    device_vector(device_vector&&) noexcept;
    device_vector& operator=(const device_vector&);
    device_vector& operator=(device_vector&&) noexcept;
    ~device_vector();

    // Interop with std::vector
    explicit device_vector(const std::vector<T>&,
                           sycl::queue q = /* default queue */);
    explicit operator std::vector<T>() const;

    // Element access (proxy references, implies host-device transfer)
    reference operator[](size_type pos);
    const_reference operator[](size_type pos) const;
    reference front();
    const_reference front() const;
    reference back();
    const_reference back() const;
    pointer data() noexcept;
    const_pointer data() const noexcept;

    // Iterators
    iterator begin() noexcept;
    iterator end() noexcept;
    // + const/reverse variants

    // Queue access
    sycl::queue get_queue() const noexcept;

    // Capacity
    bool empty() const noexcept;
    size_type size() const noexcept;
    size_type capacity() const noexcept;
    void reserve(size_type new_cap);
    void shrink_to_fit();

    // Modifiers
    void assign(size_type count, const T& value);
    template <typename InputIt>
    void assign(InputIt first, InputIt last);
    void assign(std::initializer_list<T> ilist);
    void clear() noexcept;
    void push_back(const T& value);
    void pop_back();
    iterator insert(const_iterator pos, const T& value);
    iterator insert(const_iterator pos, size_type count, const T& value);
    template <typename InputIt>
    iterator insert(const_iterator pos, InputIt first, InputIt last);
    iterator erase(const_iterator pos);
    iterator erase(const_iterator first, const_iterator last);
    void resize(size_type count);
    void resize(size_type count, const T& value);
    void swap(device_vector& other) noexcept;
};

} // namespace oneapi::dpl::experimental
```

### Usage Examples

```cpp
#include <oneapi/dpl/device_vector>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>

// Basic construction and algorithm use
sycl::queue q;
oneapi::dpl::experimental::device_vector<float> d_vec(1024, q); // 1024 elements on q's device

// Fill from host data (interop constructor)
std::vector<float> host_data(1024, 3.14f);
oneapi::dpl::experimental::device_vector<float> d_vec2(host_data, q);

// Use with oneDPL algorithms -- iterators work directly
auto policy = oneapi::dpl::execution::make_device_policy(q);
std::sort(policy, d_vec2.begin(), d_vec2.end());

// Host-side element access via proxy reference (synchronous, slow)
float val = d_vec2[0]; // device-to-host transfer
d_vec2[0] = 42.0f; // host-to-device transfer

// Extract raw pointer for use in SYCL kernels
oneapi::dpl::experimental::device_pointer<float> d_ptr = d_vec2.data();

// device_pointer or possibly device_iterator (see open question)
auto d_iter = d_vec.begin();
q.parallel_for(sycl::range<1>(1024), [=](sycl::id<1> i) {
    // Device-side access (direct USM pointer dereference, fast)
    d_ptr[i] *= 2.0f;
    d_iter[i] += 3.0f;
}).wait();

// Copy back to host
std::vector<float> result = static_cast<std::vector<float>>(d_vec2);

// Read the element at d_vec[5] on the host
d_iter += 5;
std::cout << *d_iter;
```

### Helper Types

A `device_vector` requires two supporting types (a rough sketch follows the list):

- **`device_pointer<T>`** -- wraps a raw device pointer; models a random
access iterator; dereference returns `device_reference<T>`.
- **`device_reference<T>`** -- proxy reference for host-side element
access; reads/writes trigger synchronous `memcpy` on the host path,
direct dereference on the device path. Existing implementations range
from full-featured (Thrust: all compound assignment and
increment/decrement operators) to minimal (Distributed Ranges: only
`operator T()` and `operator=`). Our proposal follows Thrust's
full-featured approach (see design decisions above).
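
The following is a minimal, illustrative sketch of how these two proxies could behave
on the host path, assuming option 1 from the open questions below (the pointer carries
a `sycl::queue*` so host-side dereference knows which queue to use). It is not the
proposed implementation; the names are hypothetical, and the real types would
dereference the raw pointer directly on the device path (for example, guarded by
`__SYCL_DEVICE_ONLY__`) instead of calling `memcpy`.

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

// Hypothetical host-side proxy: every read/write is a synchronous copy.
template <typename T>
class device_reference_sketch {
    T* ptr_;
    sycl::queue* q_;
public:
    device_reference_sketch(T* p, sycl::queue* q) : ptr_(p), q_(q) {}

    // Read: device-to-host copy.
    operator T() const
    {
        T tmp;
        q_->memcpy(&tmp, ptr_, sizeof(T)).wait();
        return tmp;
    }
    // Write: host-to-device copy.
    device_reference_sketch& operator=(const T& value)
    {
        q_->memcpy(ptr_, &value, sizeof(T)).wait();
        return *this;
    }
    // Compound assignment (Thrust-style): read-modify-write round trip.
    device_reference_sketch& operator+=(const T& value)
    {
        T tmp = static_cast<T>(*this);
        tmp += value;
        return *this = tmp;
    }
};

// Hypothetical pointer: two raw pointers, so it stays trivially (device) copyable.
template <typename T>
struct device_pointer_sketch {
    T* ptr = nullptr;
    sycl::queue* q = nullptr; // unused dead bits inside a kernel

    device_reference_sketch<T> operator*() const { return {ptr, q}; }
    device_reference_sketch<T> operator[](std::size_t i) const { return {ptr + i, q}; }
};
```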

## Open Questions

- **Should we try to support multiple devices?**
The Distributed Ranges work has been halted, but is this an important use case to
preserve, or an unnecessary complication for users? We could always add a separate
`distributed_device_vector` type in the future if it proves necessary.

- **Queue association model and its impact on device_pointer**
We want `device_pointer` to be device copyable so it can be used directly
in kernels and oneDPL algorithms. However, host-side dereference (returning
a proxy reference that does `memcpy`) requires a `sycl::queue`, and
`sycl::queue` is not device copyable. This creates a tension with several
possible resolutions; the two leading options are:

1. **Store a raw `sycl::queue*` in `device_pointer`.** `device_pointer`
holds `T*` + `sycl::queue*` -- a struct of two raw pointers is trivially
copyable and therefore device copyable (a small compile-time check appears at
the end of this section). On the host, the queue pointer is dereferenced to
obtain the queue for `memcpy`. On the device, the queue pointer is unused dead
bits. This preserves both device copyability and host-side dereference with the
correct queue. The queue must outlive the pointer, but this is fine: the queue
is stored in the vector alongside the memory itself, which is freed when the
vector goes out of scope.
2. **Use a global default queue only.** `device_vector` does not accept a
queue on construction. `device_pointer` holds only a raw `T*`, is device
copyable, and host-side dereference uses the default queue. Simple, but
limits users to a single device/queue.

Among existing implementations, Boost.Compute's explicit `command_queue`
parameter on constructors is the closest precedent for queue association.
CUDA-based Thrust avoids the problem entirely since `cudaMemcpy` is a
global function that doesn't require a queue object.

- **Should we support uninitialized / default-initialized construction?**
Thrust provides `default_init_t` and `no_init_t` tags that let users skip
value-initialization when constructing a `device_vector`. No other existing
implementation supports this. It fits nicely with uninitialized_* APIs.
Should we support similar tags?

- **Should we use device_pointer as the device iterator?**
It seems there is no use case for a separate device_iterator, but it's
worth considering.
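
As a small compile-time illustration of option 1 above, the layout it describes (two
raw pointers) is trivially copyable, and trivially copyable types are device copyable
in SYCL 2020 without any further specialization. The type name below is hypothetical.

```cpp
#include <sycl/sycl.hpp>
#include <type_traits>

template <typename T>
struct pointer_layout_sketch {
    T* ptr;
    sycl::queue* q; // dereferenced only on the host
};

static_assert(std::is_trivially_copyable_v<pointer_layout_sketch<float>>);
static_assert(sycl::is_device_copyable_v<pointer_layout_sketch<float>>);
```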