[RFC] Device Array and compat Device Vector #2660
# `device_vector` for oneDPL

## Introduction

This RFC proposes adding a `device_vector` container to oneDPL that provides
a `std::vector`-like interface for managing device memory. A `device_vector`
is a widely expected companion to parallel/device algorithm libraries, giving
users a familiar, RAII-managed container for data that lives on an accelerator.

### Motivation

- **Migration from CUDA/Thrust** - Thrust's `device_vector` is heavily used
  in CUDA codebases. Providing an equivalent in oneDPL lowers the barrier
  for porting to SYCL backends. SYCLomatic already generates code targeting
  a `dpct::device_vector` compatibility shim; an official oneDPL type would
  give that migration a stable target in an actively maintained repository.
- **Ease of use** - Users currently must manually manage USM allocations or
  SYCL buffers and pair them with raw pointers or iterators. A
  `device_vector` encapsulates allocation, sizing, and lifetime in a
  single object and integrates directly with oneDPL algorithms.
## Comparison of Existing Implementations

| Implementation | Source |
|---|---|
| **Thrust** (`thrust::device_vector`) | [NVIDIA/cccl - device_vector.h](https://github.com/NVIDIA/cccl/blob/main/thrust/thrust/device_vector.h) |
| **SYCLomatic** (`dpct::device_vector`) | [SYCLomatic - vector.h](https://github.com/oneapi-src/SYCLomatic/blob/SYCLomatic/clang/runtime/dpct-rt/include/dpct/dpl_extras/vector.h) |
| **Distributed Ranges** (`dr::sp::device_vector`) | [distributed-ranges - device_vector.hpp](https://github.com/oneapi-src/distributed-ranges/blob/main/include/dr/sp/device_vector.hpp) |

### 1. How They Differ

| Aspect | Thrust | SYCLomatic | Distributed Ranges | Proposed (oneDPL) |
|---|---|---|---|---|
| **Default Allocator** | `thrust::device_allocator<T>` (CUDA `cudaMalloc`) | USM: `sycl::usm_allocator<T, shared>` / Buffer: `__buffer_allocator<T>` | None (must be specified; typically `device_allocator<T>`) | N/A; always uses `sycl::malloc_device` directly |
| **Memory Model** | **Device memory** via `cudaMalloc`; host access triggers explicit transfers | **Shared memory** via USM shared or SYCL buffer/accessor; runtime manages placement | **Device memory** via `sycl::malloc_device`; host access triggers explicit transfers | **Device memory** via `sycl::malloc_device`; host access triggers explicit transfers |
| **Host Element Access** | Via `device_reference` proxy (explicit device-to-host copy) | Via `device_reference` proxy (runtime-managed migration) | Via `device_ref` proxy (explicit `queue.memcpy().wait()`) | Via `device_reference` proxy (explicit device-to-host copy) |
| **std::vector Interop** | Copy constructors from/to `std::vector` | Copy/move + implicit `operator std::vector()` | No direct interop | Explicit constructor + `operator std::vector()` |
| **Multi-device** | No | No | Yes (`rank()` tracks owning device) | see [open question](#open-questions) |
| **Queue Association** | Implicit (CUDA stream) | Global default queue | Global default queue | Explicit `sycl::queue` parameter on constructors (see [open question](#open-questions)) |
| **Uninitialized Construction** | `default_init_t`, `no_init_t` tags | Not supported | Not supported | see [open question](#open-questions) |

### 2. Boost.Compute

[Boost.Compute](https://github.com/boostorg/compute/blob/master/include/boost/compute/container/vector.hpp)
(`boost::compute::vector`) is an OpenCL-based device vector built on `cl::Buffer`.
Its most relevant design choice is **explicit queue association**: constructors and
operations accept a `command_queue` parameter, making the queue relationship clear
rather than relying on a global default. This is the closest existing precedent
for our proposed explicit `sycl::queue` constructor parameter (see
[open question](#open-questions) on queue association).

## Proposal

The proposal consists of the high-level design decisions below and an API
skeleton that mirrors `std::vector` where applicable, adapted for device
memory semantics.

### High Level Decisions
- **Use device memory as baseline; copy to/from host on demand when required**
  This matches the semantics of all pre-existing implementations other than
  SYCLomatic, where the runtime decides where memory lives. Shared memory has
  significantly worse performance than device memory, and users who want those
  semantics can use USM shared memory or SYCL buffers directly.

- **Use USM; no support for a buffer-backed `device_vector`**
  As mentioned above, buffers provide an alternative but similar semantic to
  `device_vector`. We learned from the SYCLomatic implementation that offering
  `device_vector` functionality with SYCL buffer backing is awkward and breaks
  our decision to have the data live on the device. Users who want a
  `device_vector` need USM support.

- **Type `T` should only require device copyability**
  We should not need anything except device copyability (for copying to and
  from the device).

- **We don't need a tag system for dispatch to specific hardware**
  Execution policies dictate where algorithms run. We don't intend to provide
  other flavors of vector/iterator with different tags, which is what would
  make tag-based dispatch necessary.

- **`device_pointer` should be [device copyable](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec::device.copyable), indirectly device accessible, and usable with good performance on the device**
  The intent is for these to be directly usable in SYCL kernels and oneDPL
  algorithms. Ideally, no hard-coded support for `device_vector` will be
  required within oneDPL's input processing code. The indirectly device
  accessible trait should provide everything we need here, including
  composability with custom iterator adapters.

- **`device_reference` supports all compound assignment and increment/decrement operators**
  Following Thrust's convention, `device_reference<T>` will support all compound
  assignment operators (`+=`, `-=`, `*=`, `/=`, `%=`, `&=`, `|=`, `^=`, `<<=`, `>>=`)
  and increment/decrement (`++`, `--`). This makes host-side element access behave
  as close to a real `T&` as possible and eases migration from Thrust codebases
  where users expect expressions like `d_vec[i] += 1` to work. Each such operation
  implies a synchronous round trip (read-modify-write) to device memory.
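The round-trip behavior can be sketched in plain, host-only C++. All names here are hypothetical, and `std::memcpy` stands in for the `queue.memcpy(...).wait()` transfers a real implementation would issue against USM device memory:

```cpp
#include <cassert>
#include <cstring>

// Host-only sketch of the proxy-reference pattern. In a real
// device_reference, the std::memcpy calls below would instead be
// synchronous queue.memcpy(...).wait() transfers to/from device USM.
template <typename T>
class device_reference_sketch
{
    T* ptr_; // stands in for a pointer into device memory

  public:
    explicit device_reference_sketch(T* p) : ptr_(p) {}

    operator T() const // read: device-to-host copy
    {
        T tmp;
        std::memcpy(&tmp, ptr_, sizeof(T));
        return tmp;
    }

    device_reference_sketch&
    operator=(const T& v) // write: host-to-device copy
    {
        std::memcpy(ptr_, &v, sizeof(T));
        return *this;
    }

    device_reference_sketch&
    operator+=(const T& v) // compound op: synchronous read-modify-write
    {
        T tmp = static_cast<T>(*this); // device-to-host
        tmp += v;                      // modify on host
        return *this = tmp;           // host-to-device
    }
};
```

With this shape, `d_vec[i] += 1` performs one full round trip per call, which is why the design decision above flags compound operators as convenient but slow.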

- **No custom allocator template parameter; use `sycl::malloc_device` directly**
  The SYCL 2020 spec intentionally excludes `sycl::usm_allocator` for device
  memory because the C++ `Allocator` named requirement mandates host-accessible
  memory, which device USM is not. A device memory allocator cannot satisfy
  `Allocator`, so a custom allocator template parameter in the `std::vector`
  sense is not meaningful here. We use `sycl::malloc_device` / `sycl::free`
  directly, keeping the implementation simpler.
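A minimal RAII sketch of this decision, with hypothetical names and `std::malloc`/`std::free` standing in for `sycl::malloc_device<T>(count, queue)`/`sycl::free(ptr, queue)` so the example stays host-only:

```cpp
#include <cstddef>
#include <cstdlib>

// Ownership sketch. std::malloc/std::free stand in for
// sycl::malloc_device<T>(count, queue) / sycl::free(ptr, queue);
// note there is no allocator template parameter anywhere.
template <typename T>
class device_storage_sketch
{
    T* ptr_ = nullptr;
    std::size_t count_ = 0;

  public:
    explicit device_storage_sketch(std::size_t count)
        : ptr_(static_cast<T*>(std::malloc(count * sizeof(T)))), count_(count)
    {
    }

    ~device_storage_sketch() { std::free(ptr_); }

    // Non-copyable here for simplicity; device_vector's copy operations
    // would instead perform a device-to-device copy.
    device_storage_sketch(const device_storage_sketch&) = delete;
    device_storage_sketch& operator=(const device_storage_sketch&) = delete;

    T* data() noexcept { return ptr_; }
    std::size_t size() const noexcept { return count_; }
};
```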

### API Skeleton

```cpp
namespace oneapi::dpl::experimental {

template <typename T>
class device_vector {
public:
    // Standard container typedefs
    using value_type = T;
    using size_type = std::size_t;
    using difference_type = std::ptrdiff_t;
    using reference = device_reference<T>;             // proxy
    using const_reference = device_reference<const T>; // proxy
    using pointer = device_pointer<T>;
    using const_pointer = device_pointer<const T>;
    using iterator = /* device-aware random access iterator */;
    using const_iterator = /* const version */;

    // Constructors / assignment / destructor -- mirrors std::vector.
    // All constructors currently accept an optional sycl::queue argument
    // (defaulted, e.g. to a global default queue), but this is an open
    // question.
    device_vector(sycl::queue q = /* default queue */);
    explicit device_vector(size_type count,
                           sycl::queue q = /* default queue */);
    device_vector(size_type count, const T& value,
                  sycl::queue q = /* default queue */);
    template <typename InputIt>
    device_vector(InputIt first, InputIt last,
                  sycl::queue q = /* default queue */);
    device_vector(std::initializer_list<T> init,
                  sycl::queue q = /* default queue */);

    device_vector(const device_vector&); // also copies the queue
    device_vector(device_vector&&) noexcept;
    device_vector& operator=(const device_vector&);
    device_vector& operator=(device_vector&&) noexcept;
    ~device_vector();

    // Interop with std::vector
    explicit device_vector(const std::vector<T>&,
                           sycl::queue q = /* default queue */);
    explicit operator std::vector<T>() const;

    // Element access (proxy references, implies host-device transfer)
    reference operator[](size_type pos);
    const_reference operator[](size_type pos) const;
    reference front();
    const_reference front() const;
    reference back();
    const_reference back() const;
    pointer data() noexcept;
    const_pointer data() const noexcept;

    // Iterators
    iterator begin() noexcept;
    iterator end() noexcept;
    // + const/reverse variants

    // Queue access
    sycl::queue get_queue() const noexcept;

    // Capacity
    bool empty() const noexcept;
    size_type size() const noexcept;
    size_type capacity() const noexcept;
    void reserve(size_type new_cap);
    void shrink_to_fit();

    // Modifiers
    void assign(size_type count, const T& value);
    template <typename InputIt>
    void assign(InputIt first, InputIt last);
    void assign(std::initializer_list<T> ilist);
    void clear() noexcept;
    void push_back(const T& value);
    void pop_back();
    iterator insert(const_iterator pos, const T& value);
    iterator insert(const_iterator pos, size_type count, const T& value);
    template <typename InputIt>
    iterator insert(const_iterator pos, InputIt first, InputIt last);
    iterator erase(const_iterator pos);
    iterator erase(const_iterator first, const_iterator last);
    void resize(size_type count);
    void resize(size_type count, const T& value);
    void swap(device_vector& other) noexcept;
};

} // namespace oneapi::dpl::experimental
```

### Usage Examples

```cpp
#include <oneapi/dpl/device_vector>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include <sycl/sycl.hpp>

#include <iostream>
#include <vector>

// Basic construction and algorithm use
sycl::queue q;
oneapi::dpl::experimental::device_vector<float> d_vec(1024, q); // 1024 elements on q's device

// Fill from host data (interop constructor)
std::vector<float> host_data(1024, 3.14f);
oneapi::dpl::experimental::device_vector<float> d_vec2(host_data, q);

// Use with oneDPL algorithms -- iterators work directly
auto policy = oneapi::dpl::execution::make_device_policy(q);
std::sort(policy, d_vec2.begin(), d_vec2.end());

// Host-side element access via proxy reference (synchronous, slow)
float val = d_vec2[0]; // device-to-host transfer
d_vec2[0] = 42.0f;     // host-to-device transfer

// Extract a raw device pointer for use in SYCL kernels
oneapi::dpl::experimental::device_pointer<float> d_ptr = d_vec2.data();

// device_pointer or possibly device_iterator (see open question)
auto d_iter = d_vec.begin();
q.parallel_for(sycl::range<1>(1024), [=](sycl::id<1> i) {
    // Device-side access (direct USM pointer dereference, fast)
    d_ptr[i] *= 2.0f;
    d_iter[i] += 3.0f;
}).wait();

// Copy back to host
std::vector<float> result = static_cast<std::vector<float>>(d_vec2);

// Read the element at d_vec[5] on the host
d_iter += 5;
std::cout << *d_iter;
```

### Helper Types

A `device_vector` requires two supporting types:

- **`device_pointer<T>`** -- wraps a raw device pointer; models random
  access iterator; dereference returns `device_reference<T>`.
- **`device_reference<T>`** -- proxy reference for host-side element
  access; reads/writes trigger a synchronous `memcpy` on the host path and
  a direct dereference on the device path. Existing implementations range
  from full-featured (Thrust: all compound assignment and
  increment/decrement operators) to minimal (Distributed Ranges: only
  `operator T()` and `operator=`). Our proposal follows Thrust's
  full-featured approach (see design decisions above).
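A host-only sketch of the `device_pointer` shape (hypothetical name; dereference returns `T&` here rather than a `device_reference` proxy, to keep the sketch self-contained):

```cpp
#include <cstddef>
#include <type_traits>

// Minimal random-access pointer wrapper sketch. A real device_pointer
// would return a device_reference proxy from operator* / operator[].
template <typename T>
class device_pointer_sketch
{
    T* ptr_ = nullptr;

  public:
    using difference_type = std::ptrdiff_t;

    device_pointer_sketch() = default;
    explicit device_pointer_sketch(T* p) : ptr_(p) {}

    T& operator*() const { return *ptr_; }                      // proxy in a real impl
    T& operator[](difference_type n) const { return ptr_[n]; }  // proxy in a real impl

    device_pointer_sketch& operator+=(difference_type n)
    {
        ptr_ += n;
        return *this;
    }
    device_pointer_sketch operator+(difference_type n) const
    {
        return device_pointer_sketch(ptr_ + n);
    }
    friend difference_type operator-(device_pointer_sketch a, device_pointer_sketch b)
    {
        return a.ptr_ - b.ptr_;
    }
};

// A raw-pointer wrapper is trivially copyable, which is what makes it
// device copyable for SYCL purposes.
static_assert(std::is_trivially_copyable_v<device_pointer_sketch<int>>);
```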

## Open Questions

- **Should we try to support multiple devices?**
  Distributed Ranges work has been halted, but is this an important use case to
  preserve, or an unnecessary complication for users? We could always add
  another type, `distributed_device_vector`, in the future if it seems
  necessary.

- **Queue association model and its impact on `device_pointer`**
  We want `device_pointer` to be device copyable so it can be used directly
  in kernels and oneDPL algorithms. However, host-side dereference (returning
  a proxy reference that does `memcpy`) requires a `sycl::queue`, and
  `sycl::queue` is not device copyable. This creates a tension with several
  possible resolutions; here are the two leading options:

  1. **Store a raw `sycl::queue*` in `device_pointer`.** `device_pointer`
     holds `T*` + `sycl::queue*` -- a struct of two raw pointers is trivially
     copyable and therefore device copyable. On the host, the queue pointer
     is dereferenced to obtain the queue for `memcpy`. On the device, the
     queue pointer is unused dead bits. This preserves both device copyability
     and host-side dereference with the correct queue. The queue must
     outlive the pointer, but this is fine because it lives in the vector
     alongside the memory itself, both of which are released when the vector
     leaves scope.
  2. **Use a global default queue only.** `device_vector` does not accept a
     queue on construction. `device_pointer` holds only a raw `T*`, is device
     copyable, and host-side dereference uses the default queue. Simple, but
     limits users to a single device/queue.

  Among existing implementations, Boost.Compute's explicit `command_queue`
  parameter on constructors is the closest precedent for queue association.
  CUDA-based Thrust avoids the problem entirely since `cudaMemcpy` is a
  global function that doesn't require a queue object.
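The device-copyability claim behind option 1 can be checked mechanically: a struct of two raw pointers is trivially copyable, and trivially copyable types are device copyable in SYCL 2020. A sketch with a hypothetical incomplete type standing in for `sycl::queue` (only a pointer to it is stored, so the type never needs to be complete here):

```cpp
#include <type_traits>

// Stand-in for sycl::queue; hypothetical, declared but never defined.
struct queue_stand_in;

// Sketch of option 1: T* plus a queue pointer.
template <typename T>
struct device_pointer_option1
{
    T* ptr;                // dereferenced on the device
    queue_stand_in* queue; // dereferenced only on the host; must outlive ptr
};

// Two raw pointers => trivially copyable => device copyable (SYCL 2020).
static_assert(std::is_trivially_copyable_v<device_pointer_option1<float>>);
```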

- **Should we support uninitialized / default-initialized construction?**
  Thrust provides `default_init_t` and `no_init_t` tags that let users skip
  value-initialization when constructing a `device_vector`. No other existing
  implementation supports this, but it fits nicely with the `uninitialized_*`
  APIs. Should we support similar tags?
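The tag-dispatch pattern such support would use can be sketched in plain C++ (hypothetical names, not Thrust's or oneDPL's exact API; for device memory, the tag overload would additionally skip the fill kernel that value-initialization requires):

```cpp
#include <cstddef>

// Hypothetical tag type selecting the "skip value-initialization" overload.
struct no_init_t
{
    explicit no_init_t() = default;
};
inline constexpr no_init_t no_init{};

template <typename T>
class vector_sketch
{
    T* data_;
    std::size_t size_;

  public:
    explicit vector_sketch(std::size_t n)
        : data_(new T[n]()), size_(n) {} // value-initializes every element
    vector_sketch(std::size_t n, no_init_t)
        : data_(new T[n]), size_(n) {} // default-init: trivial T left untouched
    ~vector_sketch() { delete[] data_; }
    vector_sketch(const vector_sketch&) = delete;
    vector_sketch& operator=(const vector_sketch&) = delete;

    T* data() noexcept { return data_; }
    std::size_t size() const noexcept { return size_; }
};
```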

- **Should we use `device_pointer` as the device iterator?**
  There seems to be no use case for a separate `device_iterator`, but it's
  worth considering.

---

**Review discussion: allocator support**

> What about an allocator? At least I can use USM-shared and USM-device
> memory. How can we specify that?

The RFC explicitly states that this choice is deliberately not provided. That
can be discussed, of course. Also note that `sycl::usm_allocator` does not
support `usm::alloc::device`.

Yes, I believe the SYCL folks have deliberately not provided such an allocator
because some requirements on a C++ allocator do not align with allocation on
the device. Generally, I think providing a generic allocator betrays the
semantics here, as we always want the data to live on the device.

It's worth thinking about whether there are use cases which would benefit from
a C++ allocator-like template argument defined to only provide USM device
memory, but I don't have a convincing motivating case at this time. The best I
can think of is some sort of debugging feature like device memory bookkeeping,
but adding such a feature is significantly more complex (and we could always
extend it in the future).