diff --git a/rfcs/proposed/device_vector/README.md b/rfcs/proposed/device_vector/README.md
new file mode 100644
index 00000000000..87fe87de5ea
--- /dev/null
+++ b/rfcs/proposed/device_vector/README.md
@@ -0,0 +1,172 @@
# `device_array` and `compat::device_vector` for oneDPL

## Introduction

This RFC proposes adding a container to oneDPL that provides
a `std::vector`-like interface for managing device memory.

### Motivation

- **Migration from CUDA/Thrust** - Thrust's `device_vector` is heavily used
  in CUDA codebases. Providing an equivalent in oneDPL lowers the barrier
  for porting to SYCL backends. SYCLomatic already generates code targeting
  a `dpct::device_vector` compatibility shim; an official oneDPL type would
  give that migration a stable target in an actively maintained repository.
- **Ease of use** - Users currently must manually manage USM allocations or
  SYCL buffers and pair them with raw pointers or iterators. A
  `device_vector` encapsulates allocation, sizing, and lifetime in a
  single object and integrates directly with oneDPL algorithms.
- **Real-world usage patterns** - A [detailed survey](usage_pattern_study.md)
  of real-world usage informed the design. Key findings:

  1. **Construction + bulk transfer + raw pointer extraction** are the core
     operations across all domains. `device_vector` is primarily used as an
     RAII device memory manager and host-device data shuttle.
  2. **`begin()`/`end()` integration with parallel algorithms** is the
     second-most critical capability.
  3. **Some popular AI/ML projects** (FAISS, cuDF, cuML) have **moved away from
     `thrust::device_vector`** due to unwanted value initialization, lack of
     stream parameters, and header bloat, and then built alternatives that
     prioritize explicit async control and uninitialized allocation. Other HPC
     and ML projects remain heavy users.
  4. **Full `std::vector`-like modifiers** (`push_back`, `insert`, `erase`)
     are rarely used in real workloads.

## Comparison of Existing device_vector Implementations

| Implementation | Source |
|---|---|
| **Thrust** (`thrust::device_vector`) | [NVIDIA/cccl - device_vector.h](https://github.com/NVIDIA/cccl/blob/main/thrust/thrust/device_vector.h) |
| **SYCLomatic** (`dpct::device_vector`) | [SYCLomatic - vector.h](https://github.com/oneapi-src/SYCLomatic/blob/SYCLomatic/clang/runtime/dpct-rt/include/dpct/dpl_extras/vector.h) |
| **Distributed Ranges** (`dr::sp::device_vector`) | [distributed-ranges - device_vector.hpp](https://github.com/oneapi-src/distributed-ranges/blob/main/include/dr/sp/device_vector.hpp) |
| **sycl-thrust** (`thrust::device_vector`) | [SparseBLAS/sycl-thrust - device_vector.h](https://github.com/SparseBLAS/sycl-thrust/blob/main/include/thrust/device_vector.h) |

### How They Differ

| Aspect | Proposed (oneDPL) | Thrust | sycl-thrust | SYCLomatic |
|---|---|---|---|---|
| **Default Allocator** | `device_allocator` wrapping `sycl::malloc_device`; custom `DeviceAllocator` concept | `thrust::device_allocator` (CUDA `cudaMalloc`) | `device_allocator` (`sycl::malloc_device`); supports alignment template parameter | USM: `sycl::usm_allocator` / Buffer: `__buffer_allocator` |
| **Memory Model** | **Device memory** via `sycl::malloc_device`; host access triggers explicit transfers | **Device memory** via `cudaMalloc`; host access triggers explicit transfers | **Device memory** via `sycl::malloc_device`; explicit transfers | **Shared memory** via USM shared or SYCL buffer/accessor; runtime manages placement |
| **Host Element Access** | `device_array`: explicit `read()`/`write()`; compat `device_vector`: `device_reference` proxy | Via `device_reference` proxy (explicit device-to-host copy) | Via `device_reference` proxy (`__SYCL_DEVICE_ONLY__` bifurcation) | Via `device_reference` proxy (runtime-managed migration) |
| **std::vector Interop** | Explicit constructor + `to_vector()` | Copy constructors from/to `std::vector` | Constructor from `std::vector` | Copy/move + implicit `operator std::vector()` |
| **Queue Association** | Stores context + device; queue provided per-operation or created on demand | Implicit (CUDA stream) | Allocator stores `device` + `context`; queue resolved at runtime via pointer introspection | Global default queue |
| **Uninitialized Construction** | `device_array`: uninitialized by default; compat `device_vector`: `no_init_t` tag | `default_init_t`, `no_init_t` tags | Not supported | Not supported |

## Proposal

The proposal consists of two complementary types that share an underlying
implementation:

1. **[`device_array`](device_array.md)** — the primary API.
   A clean, explicit container for device memory with no proxy types. Raw `T*`
   iterators, explicit `read()`/`write()` for host access, uninitialized by
   default, and range support via `device_span`.

2. **[`compat::device_vector`](device_vector_compat.md)** — a
   Thrust compatibility layer built on `device_array`. Adds `device_pointer`,
   `device_reference`, and `operator[]` proxy semantics for drop-in migration
   from `thrust::device_vector`.
### Class Relationships

```mermaid
classDiagram
    direction LR

    namespace experimental {
        class device_array~T, Alloc~ {
            iterators: T*
            host access: read() / write()
        }

        class device_span~T~ {
            non-owning view
            trivially copyable
        }
    }

    namespace experimental_compat {
        class device_vector~T, Alloc~ {
            Thrust compat layer
        }

        class device_pointer~T~ {
            wraps T* + context*
        }

        class device_reference~T~ {
            proxy for host access
        }
    }

    device_array --> device_span : .span()
    device_vector *-- device_array : contains
    device_vector --> device_pointer : begin()/end()
    device_pointer --> device_reference : operator*() / operator[]()
```

### Design Decisions

- **Use USM device memory as the baseline; copy to/from host on demand.**
  This matches the semantics of all pre-existing implementations except
  SYCLomatic's, where the runtime decides where memory lives. Shared memory
  has significantly worse performance than device memory; users who want
  those semantics can use USM shared memory or SYCL buffers directly.

- **Store context, not queue.** `sycl::malloc_device` requires only
  a context (and a device, which can later be recovered from the pointer).
  Storing a queue would tie the container to a particular queue and imply
  synchronization semantics. Queues are accepted per-operation
  or created on demand.

- **Type `T` need only be device copyable.**
  Device copyability is sufficient for transfers to and from the device; no
  further requirements are imposed on `T`.

- **No tag system for dispatch to specific hardware.**
  Execution policies dictate where algorithms are run. We do not intend to
  provide other flavors of vector/iterator with different tags, which tag-based
  dispatch would require.
- **Custom `DeviceAllocator` concept for pluggable allocation.**
  A minimal allocator interface — just `allocate(n, ctx, dev)` and
  `deallocate(p, n, ctx, dev)` — that avoids the `std::allocator` named
  requirements (which mandate host-accessible memory). Enables pool allocators,
  aligned allocation, and other strategies. See the
  [device_array allocator section](device_array.md#allocator) for details.

- **No `push_back`, `insert`, `erase`.**
  These are rarely used in practice (see the [usage study](usage_pattern_study.md))
  and have high implementation complexity for device memory.

- **Host-side operations block but do not synchronize with prior work.**
  The user is responsible for ensuring prior kernels have completed before
  host-side access. This can be achieved via an in-order queue or explicit
  event waits. `device_array` additionally offers async overloads with
  `depends_on` events.

## Open Questions

- **Should `device_array`'s async overloads be in the initial release or
  deferred?**
  See [device_array](device_array.md).

- **Header organization?**
  - We could have a single umbrella header that automatically includes
    `device_array` along with the other oneDPL includes. Alternatives:
    - Individual headers, one per type: `device_vector` would transitively
      include `device_array` since it depends on it.
    - We could have a `compat` header and an individual `device_array` header.
      However, if we intend to use `device_array` within our own SYCL
      implementations, that may affect our decision here.

- **Compatibility namespace naming?** This proposal places the
  Thrust-compatible types in `oneapi::dpl::experimental::compat`. Several
  aspects are worth discussing:
  - Should `compat` be nested under `experimental`, or should it be
    `oneapi::dpl::compat` directly? Or alternatively `oneapi::dpl::ext::compat`.
  - Is `compat` the right name? Alternatives: `thrust_compat`, `migration`,
    `legacy`. `compat` is concise but doesn't indicate what it's compatible
    *with*.
    `thrust_compat` is more explicit but ties the namespace to a
    specific vendor's API.
  - Moreover, is this repository where we want the compatibility headers to
    live? I think yes, since otherwise they would be too cumbersome to use,
    but it is worth raising.

diff --git a/rfcs/proposed/device_vector/device_array.md b/rfcs/proposed/device_vector/device_array.md
new file mode 100644
index 00000000000..3811b638ca9
--- /dev/null
+++ b/rfcs/proposed/device_vector/device_array.md
@@ -0,0 +1,336 @@
# `device_array` — Simplified Device Memory Container

`device_array` provides RAII device allocation, explicit host-device transfers,
and direct use of raw USM pointers as iterators. It emphasizes convenience
while making host-side access explicit, and it gives the user control over how
host-side operations synchronize with device work.

This focus supports the dominant usage pattern of `device_vector` users and
fits naturally within SYCL while avoiding much of `device_vector`'s complexity.

See the [device_vector RFC](../../../rfcs/proposed/device_vector/README.md) for
full motivation, the usage study, and a comparison of existing implementations.
This document describes only `device_array`.
## API

```cpp
namespace oneapi::dpl::experimental {

template <typename T, typename Alloc = device_allocator<T>>
class device_array {
public:
    using value_type = T;
    using allocator_type = Alloc;
    using size_type = std::size_t;
    using difference_type = std::ptrdiff_t;
    using pointer = T*;
    using const_pointer = const T*;
    using iterator = T*;
    using const_iterator = const T*;

    // Construction

    // Allocate uninitialized device memory (default — no kernel launch or memset)
    // From queue (extracts context + device; queue is not retained)
    device_array(size_type count, sycl::queue q);
    // From context + device
    device_array(size_type count, sycl::context ctx, sycl::device dev);

    // Allocate and fill with value (requires kernel launch or memset)
    device_array(size_type count, const T& value, sycl::queue q);
    device_array(size_type count, const T& value,
                 sycl::context ctx, sycl::device dev);

    // Construct from host data (upload)
    template <typename InputIt>
    device_array(InputIt first, InputIt last, sycl::queue q);
    device_array(std::initializer_list<T> init, sycl::queue q);
    device_array(const std::vector<T>& src, sycl::queue q);

    template <typename InputIt>
    device_array(InputIt first, InputIt last, sycl::context ctx, sycl::device dev);
    device_array(std::initializer_list<T> init, sycl::context ctx, sycl::device dev);
    device_array(const std::vector<T>& src, sycl::context ctx, sycl::device dev);

    // Copy is deleted — use copy_from for explicit device-to-device copies
    device_array(const device_array&) = delete;
    device_array& operator=(const device_array&) = delete;

    // Move
    device_array(device_array&&);
    device_array& operator=(device_array&&);

    ~device_array();

    // Device-to-device copy (allocates on the provided context + device)
    // Supports cross-device copies: source and destination may be on different devices
    static device_array copy_from(const device_array& src, sycl::queue q);
    static device_array copy_from(const device_array& src,
                                  size_type offset, size_type count, sycl::queue q);
    static device_array copy_from(const device_array& src,
                                  sycl::context ctx, sycl::device dev);
    static device_array copy_from(const device_array& src,
                                  size_type offset, size_type count,
                                  sycl::context ctx, sycl::device dev);

    // Host-device transfer

    // Bulk download
    std::vector<T> to_vector() const;
    std::vector<T> to_vector(sycl::queue q) const;

    // Bulk upload (resizes to match)
    void assign(const T* first, const T* last);
    void assign(const T* first, const T* last, sycl::queue q);
    void assign(const std::vector<T>& src);
    void assign(const std::vector<T>& src, sycl::queue q);

    // Single-element host access (blocking; creates a queue from context + device)
    T read(size_type pos) const;
    void write(size_type pos, const T& value);

    // Single-element host access (blocking; the provided queue is used for copy submissions)
    T read(size_type pos, sycl::queue q) const;
    void write(size_type pos, const T& value, sycl::queue q);

    // Asynchronous single-element access; events allow synchronization with
    // event-driven workloads
    sycl::event async_read(size_type pos, T& out,
                           sycl::queue q,
                           const std::vector<sycl::event>& depends_on = {}) const;
    sycl::event async_write(size_type pos, const T& value,
                            sycl::queue q,
                            const std::vector<sycl::event>& depends_on = {});

    // Asynchronous bulk transfer
    sycl::event async_to_vector(std::vector<T>& out,
                                sycl::queue q,
                                const std::vector<sycl::event>& depends_on = {}) const;
    sycl::event async_assign(const T* first, const T* last,
                             sycl::queue q,
                             const std::vector<sycl::event>& depends_on = {});

    // Device iteration — raw USM pointers
    iterator begin();
    const_iterator begin() const;
    iterator end();
    const_iterator end() const;
    pointer data();
    const_pointer data() const;

    // Capacity
    size_type size() const;
    size_type capacity() const;
    bool empty() const;

    // Resize — new elements are uninitialized by default
    void resize(size_type count);
    void resize(size_type count, sycl::queue q);
    // Resize — new elements filled with value
    void resize(size_type count, const T& value);
    void resize(size_type count, const T& value, sycl::queue q);

    void reserve(size_type new_cap);
    void clear();
    void swap(device_array& other);

    // Views
    device_span<T> span();
    device_span<const T> span() const;

    // Allocator access
    allocator_type get_allocator() const;

    // Context / device access
    sycl::context get_context() const;
    sycl::device get_device() const;
};

} // namespace oneapi::dpl::experimental
```

## Allocator

`device_array` accepts an optional allocator template parameter for device
memory allocation. The default allocator wraps `sycl::malloc_device` /
`sycl::free`.

### Allocator Requirements

A type `Alloc` satisfies `DeviceAllocator` for type `T` if, given an instance
`a` of type `Alloc`, a pointer `p` of type `T*`, a `std::size_t n`, a
`sycl::context ctx`, and a `sycl::device dev`, the following expressions are
valid:

| Expression | Return type | Semantics |
|---|---|---|
| `a.allocate(n, ctx, dev)` | `T*` | Allocate device memory for `n` objects of type `T` |
| `a.deallocate(p, n, ctx, dev)` | `void` | Free device memory previously allocated by `allocate` |

The allocator is not required to support `construct`, `destroy`, or any of the
`std::allocator` named requirements beyond `allocate`/`deallocate`. Device
memory is not host-accessible, so construction and destruction happen via
kernel launches or memcpy, managed by `device_array` itself.

The allocator must be copy-constructible and copy-assignable.
```cpp
// Default allocator
template <typename T>
struct device_allocator {
    T* allocate(std::size_t n, sycl::context ctx, sycl::device dev) {
        return sycl::malloc_device<T>(n, dev, ctx);
    }
    void deallocate(T* p, std::size_t n, sycl::context ctx, sycl::device dev) {
        sycl::free(p, ctx);
    }
};
```

### C++20 Concept (informational; enforced via SFINAE on C++17)

```cpp
template <typename Alloc, typename T>
concept DeviceAllocator = requires(Alloc a, T* p, std::size_t n,
                                   sycl::context ctx, sycl::device dev) {
    { a.allocate(n, ctx, dev) } -> std::same_as<T*>;
    { a.deallocate(p, n, ctx, dev) } -> std::same_as<void>;
};
```

## `device_span`

`device_array` is not device-copyable (it owns memory). For kernel capture,
non-owning views, and range composition, use `device_span` via `.span()`.

`device_span` is guaranteed trivially copyable (and therefore device copyable),
has `enable_borrowed_range = true` and `enable_view = true`, and models
`contiguous_range` + `sized_range`.

### Definition

**C++23 and later:** `std::span` is guaranteed trivially copyable (per P2251R1),
so `device_span` is simply an alias:

```cpp
#if __cplusplus >= 202302L // C++23

template <typename T>
using device_span = std::span<T>;

#else // C++20

template <typename T>
class device_span {
    T* __ptr = nullptr;
    std::size_t __size = 0;
public:
    using element_type = T;
    using value_type = std::remove_cv_t<T>;
    using size_type = std::size_t;
    using difference_type = std::ptrdiff_t;
    using pointer = T*;
    using reference = T&;
    using iterator = T*;

    device_span() = default;
    device_span(T* ptr, std::size_t size) : __ptr(ptr), __size(size) {}

    template <std::size_t N>
    device_span(T (&arr)[N]) : __ptr(arr), __size(N) {}

    // Implicit conversion from device_array
    device_span(device_array<T>& arr);

    T* begin() const { return __ptr; }
    T* end() const { return __ptr + __size; }
    T* data() const { return __ptr; }
    std::size_t size() const { return __size; }
    bool empty() const { return __size == 0; }

    T& operator[](std::size_t i) const { return __ptr[i]; }
    T& front() const { return __ptr[0]; }
    T& back() const { return __ptr[__size - 1]; }

    device_span first(std::size_t count) const { return {__ptr, count}; }
    device_span last(std::size_t count) const { return {__ptr + __size - count, count}; }
    device_span subspan(std::size_t offset, std::size_t count) const {
        return {__ptr + offset, count};
    }
};

template <typename T>
inline constexpr bool std::ranges::enable_borrowed_range<
    oneapi::dpl::experimental::device_span<T>> = true;

template <typename T>
inline constexpr bool std::ranges::enable_view<
    oneapi::dpl::experimental::device_span<T>> = true;

#endif
```

## Usage Examples

```cpp
#include <sycl/sycl.hpp>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
// plus the (to-be-named) header providing device_array; see the header
// organization discussion in the RFC README's Open Questions

namespace dpl = oneapi::dpl::experimental;

sycl::queue q{sycl::property::queue::in_order{}};

// --- RAII allocation + upload from host ---
std::vector<float> host_data(1024, 3.14f);
dpl::device_array<float> d(host_data, q);

// --- Use with oneDPL algorithms (raw T* iterators) ---
auto policy = oneapi::dpl::execution::make_device_policy(q);
std::sort(policy, d.begin(), d.end());

// --- Use in a SYCL kernel ---
float* ptr = d.data();
q.parallel_for(sycl::range<1>(d.size()), [=](sycl::id<1> i) {
    ptr[i] *= 2.0f;
}).wait();

// --- Explicit single-element host access ---
float val = d.read(0, q);  // synchronous read
d.write(0, 42.0f, q);      // synchronous write

// --- Async transfer with dependency ---
float result;
sycl::event e = d.async_read(0, result, q, {some_prior_event});
e.wait();

// --- Bulk download ---
std::vector<float> out = d.to_vector(q);

// --- Output buffer (uninitialized by default — no memset) ---
dpl::device_array<float> output(1024, q);
std::transform(policy, d.begin(), d.end(), output.begin(),
               [](float x) { return x * 2.0f; });

// --- Zero-initialized allocation (opt-in) ---
dpl::device_array<float> zeroed(1024, 0.0f, q);

// For kernel capture or composition with range adaptors, use device_span:
auto s = d.span(); // returns device_span<float>
auto pipeline = s | std::views::take(100);
oneapi::dpl::ranges::for_each(policy, pipeline, [](float& x) { x += 1.0f; });

// Capture a device_span into a kernel:
auto s2 = d.span();
q.parallel_for(sycl::range<1>(s2.size()), [=](sycl::id<1> i) {
    s2[i] *= 2.0f;
}).wait();
```

## Open Questions

- **Should async overloads be in the initial proposal or deferred?**
  They provide more control over synchronization than an in-order queue alone,
  but it is unclear whether users who want this control would prefer to work
  with USM pointers and memcpy directly.

diff --git a/rfcs/proposed/device_vector/device_vector_compat.md b/rfcs/proposed/device_vector/device_vector_compat.md
new file mode 100644
index 00000000000..a98c8761cfc
--- /dev/null
+++ b/rfcs/proposed/device_vector/device_vector_compat.md
@@ -0,0 +1,242 @@
# `device_vector` Compatibility Layer

A Thrust-compatible device-memory vector, living in a compatibility namespace.
It adds `device_pointer`, `device_reference`, and implicit host-access
semantics on top of `device_array`'s explicit API.

See the [usage study](usage_pattern_study.md) for evidence on which Thrust APIs
are actually used, and [device_array](device_array.md) for the underlying
container.

The goal is a near drop-in replacement for `thrust::device_vector`, covering
the functionality that is actually used in practice, adapted to fit within SYCL.

## Relationship to `device_array`

`compat::device_vector` contains a `device_array` that provides the core
functionality.

On top of that it layers an iterator/pointer type, `device_pointer`, which
wraps raw USM pointers, and a proxy reference type, `device_reference`, which
enables host-side element access with implicit memory transfers. Both types
hold a pointer to a `sycl::context` so that a queue can be created for the
underlying memcpy.

## Differences from Thrust

1. **Context + device (or queue) always required** — no implicit default device.
2. **No default constructor** — a device association is always required.
3. **No `push_back`, `insert`, `erase`** — rarely used, unnecessary complexity.
4. **No `host_vector` type** — use `std::vector` directly.
5. **No system tag dispatch** — execution policies determine where algorithms run.

## Namespace

We are using `oneapi::dpl::experimental::compat` for these compatibility
classes, with the intention of promoting them to `oneapi::dpl::compat` as they
exit experimental. We could add `ext` as well, but that further elongates an
already long name. Other components that graduate from SYCLomatic but don't
belong in oneDPL proper may also end up living in this `compat` namespace in
the future.

## API

```cpp
namespace oneapi::dpl::experimental::compat {

// =========================================================================
// device_pointer
// =========================================================================
// Wraps a raw T* from device_array. Dereferencing yields a device_reference.
template <typename T>
class device_pointer {
    T* __ptr = nullptr;
    const sycl::context* __ctx = nullptr; // non-owning, from device_vector

public:
    using iterator_concept = std::random_access_iterator_tag;
    using value_type = std::remove_cv_t<T>;
    using difference_type = std::ptrdiff_t;
    using reference = device_reference<T>;

    device_pointer() = default;
    explicit device_pointer(T* ptr, const sycl::context* ctx = nullptr);

    // Raw pointer access — unwraps back to the T* that device_array uses
    T* get() const;

    reference operator*() const;
    reference operator[](difference_type n) const;

    /* Full random access iterator arithmetic + comparison */
};

// =========================================================================
// device_reference
// =========================================================================

template <typename T>
class device_reference {
public:
    operator T() const;                                    // read
    const device_reference& operator=(const T& val) const; // write
    const device_reference& operator=(const device_reference&) const;

    // Compound assignment (each is a synchronous read-modify-write)
    const device_reference& operator+=(const T&) const;
    /* all other compound assignments...
    */

    const device_reference& operator++() const;
    T operator++(int) const;
    const device_reference& operator--() const;
    T operator--(int) const;

    device_pointer<T> operator&() const;

    friend void swap(const device_reference& a, const device_reference& b);
};

// =========================================================================
// device_vector
// =========================================================================

template <typename T, typename Alloc = device_allocator<T>>
class device_vector {
    device_array<T, Alloc> __impl; // stores context + device + allocator

public:
    using value_type = T;
    using allocator_type = Alloc;
    using size_type = std::size_t;
    using difference_type = std::ptrdiff_t;
    using reference = device_reference<T>;
    using const_reference = device_reference<const T>;
    using pointer = device_pointer<T>;
    using const_pointer = device_pointer<const T>;
    using iterator = device_pointer<T>;
    using const_iterator = device_pointer<const T>;

    // Construction from size
    device_vector(size_type count, sycl::context ctx, sycl::device dev);
    // Construction from size and value
    device_vector(size_type count, const T& value,
                  sycl::context ctx, sycl::device dev);

    // Construction from iterators
    template <typename InputIt>
    device_vector(InputIt first, InputIt last,
                  sycl::context ctx, sycl::device dev);
    // Construction from initializer_list
    device_vector(std::initializer_list<T> init,
                  sycl::context ctx, sycl::device dev);
    // Construction from std::vector
    explicit device_vector(const std::vector<T>& src,
                           sycl::context ctx, sycl::device dev);

    /* Copies of all the above constructors taking a sycl::queue (which
       extracts context + device), taking no_init_t to skip initialization,
       and taking an explicit allocator */

    // Copy / move
    device_vector(const device_vector&);
    device_vector(device_vector&&) noexcept;
    device_vector& operator=(const device_vector&);
    device_vector& operator=(device_vector&&) noexcept;

    ~device_vector();

    // Assign from host vector (bulk upload)
    device_vector& operator=(const std::vector<T>& src);

    // Convert to host vector (bulk download)
    explicit operator std::vector<T>() const;

    // --- Element access (proxy references) ---
    reference operator[](size_type pos);
    const_reference operator[](size_type pos) const;
    reference front();
    const_reference front() const;
    reference back();
    const_reference back() const;

    // --- Pointer access (device_pointer wrapping device_array's T*) ---
    pointer data();
    const_pointer data() const;

    // --- Iterators (device_pointer wrapping device_array's T*) ---
    iterator begin();
    const_iterator begin() const;
    iterator end();
    const_iterator end() const;

    // --- Capacity (forwarded to device_array) ---
    size_type size() const;
    size_type capacity() const;
    bool empty() const;

    void resize(size_type count);
    void resize(size_type count, const T& value);
    void resize(size_type count, no_init_t);
    void reserve(size_type new_cap);
    void clear();

    // --- Swap ---
    void swap(device_vector& other);

    // --- Access to underlying device_array ---
    device_array<T, Alloc>& base();
    const device_array<T, Alloc>& base() const;

    // --- Allocator ---
    allocator_type get_allocator() const;

    // --- Context / device ---
    sycl::context get_context() const;
    sycl::device get_device() const;
};

} // namespace oneapi::dpl::experimental::compat
```

## Usage Example

```cpp
#include <sycl/sycl.hpp>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
// plus the (to-be-named) header providing compat::device_vector

namespace compat = oneapi::dpl::experimental::compat;

sycl::queue q{sycl::property::queue::in_order{}};

// --- Thrust-like construction ---
std::vector<float> host_data(1024, 3.14f);
compat::device_vector<float> d(host_data, q);

// --- Thrust-like algorithm use ---
auto policy = oneapi::dpl::execution::make_device_policy(q);
std::sort(policy, d.begin(), d.end());

// --- Thrust-like element access (proxy, synchronous) ---
float val = d[0]; // implicit device-to-host
d[0] = 42.0f;     // implicit host-to-device
d[1] += 10.0f;    // read-modify-write round-trip

// --- Raw pointer for kernels ---
float* ptr = d.data().get();
q.parallel_for(sycl::range<1>(d.size()), [=](sycl::id<1> i) {
    ptr[i] *= 2.0f;
}).wait();

// --- Device-to-device copy via device_pointer ---
compat::device_vector<float> d2(d.begin(), d.end(), q); // D2D copy

// --- Bulk copy back to host ---
std::vector<float> result = static_cast<std::vector<float>>(d);

// --- Construction from context + device (no queue needed) ---
sycl::context ctx = q.get_context();
sycl::device dev = q.get_device();
compat::device_vector<float> d3(1024, ctx, dev);

// --- Gradual migration to device_array ---
auto& arr = d.base();
float v = arr.read(5, q);  // explicit, no proxy
arr.write(5, v * 2.0f, q); // explicit
std::vector<float> out = arr.to_vector(q);
```

diff --git a/rfcs/proposed/device_vector/usage_pattern_study.md b/rfcs/proposed/device_vector/usage_pattern_study.md
new file mode 100644
index 00000000000..66f610c4657
--- /dev/null
+++ b/rfcs/proposed/device_vector/usage_pattern_study.md
@@ -0,0 +1,259 @@
# `device_vector` Usage Pattern Study

This document surveys real-world usage of `thrust::device_vector`,
`dpct::device_vector`, and alternative device vector implementations across
open-source projects. It serves as supporting evidence for the design
decisions in the [device_vector RFC](README.md).

## Thrust (`thrust::device_vector`) — Native CUDA Projects

GitHub code search reports ~2,520 files containing `thrust::device_vector`
across the platform, spanning AI/ML, HPC, scientific computing, robotics,
graph analytics, and databases. A survey of notable projects follows.

### Sparse BLAS (spblas-reference)

[spblas-reference](https://github.com/SparseBLAS/spblas-reference) (sparse
BLAS standard reference implementation) demonstrates the minimal-but-dominant
pattern: `device_vector` as an **RAII device memory manager and host-device
data shuttle**:

1. **Constructing from `std::vector`** (~90% of uses) — bulk host-to-device
   transfer at setup time.
2. **Allocating output buffers by size** — e.g. after a symbolic phase
   computes the output NNZ, a `device_vector` is constructed with just a count.
3. **Extracting raw device pointers** via `.data().get()` — every
   `device_vector` is ultimately unwrapped to a raw pointer for passing
   to library APIs (`csr_view`, `std::span`).
4. **Copying results back to host** via `thrust::copy(d.begin(), d.end(),
   host.begin())` — used in every test for verification.

Notably absent from spblas: element-level access (`operator[]`), `resize()`,
`push_back()`, `insert()`/`erase()`, and device-side algorithms on iterators.

### AI/ML Projects

**Notable finding:** A few high-profile, performance-sensitive AI/ML codebases
have **explicitly moved away from `thrust::device_vector`**, while other ML
projects remain heavy users.

- [CUTLASS](https://github.com/NVIDIA/cutlass) (NVIDIA, 9.5k stars) — Uses
  `device_vector` in **30 files**, but only in examples and tests as
  scaffolding. Pattern: construct from host data, extract raw pointer via
  `.data().get()`, pass to GEMM kernels. Never used in hot-path
  implementations.
- [FAISS](https://github.com/facebookresearch/faiss) (Meta, 39.7k stars) —
  **Rolled their own `DeviceVector`** instead of using
  `thrust::device_vector`. Reasons cited: control over streams, avoiding
  unwanted `T()` initialization on `resize()`, and a custom memory growth
  strategy (power-of-2 below 4MB, 1.25x to 128MB, exact above).
- [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) (23.8k stars) —
  Uses `device_vector` in **24 files** for GPU kernel implementations (mode,
  graph reindex, CTC align, kron). Pattern: temporary containers within GPU
  kernels combined with `thrust::sort_by_key`, `thrust::reduce_by_key`;
  raw pointers or `begin()` are passed directly into Thrust APIs. Some
  [usage of element access after a Thrust API](https://github.com/PaddlePaddle/Paddle/blob/d6e489cc39412ac278bbc1dda352742dddb7e57d/paddle/phi/kernels/funcs/mode.h#L189).
- [cuDF](https://github.com/rapidsai/cudf) (RAPIDS, 9.6k stars) —
  **Explicitly discourages `thrust::device_vector`** in its developer guide.
  Recommends `rmm::device_uvector` for uninitialized allocation and
  stream-ordered operations.
- [cuML](https://github.com/rapidsai/cuml) (RAPIDS, 5.2k stars) — Only **4
  files** still use it. The developer guide discourages it in favor of
  `MLCommon::device_buffer` for stream-safe allocation.
- [H2O4GPU](https://github.com/h2oai/h2o4gpu) (466 stars) — Heavy user
  (**17 files**) for K-means, GLM, TSVD, ARIMA. Notable patterns: arrays of
  `device_vector` pointers for multi-GPU (`thrust::device_vector<T>
  *centroid_dots[n_gpu]`), `thrust::inner_product()` for convergence checks,
  raw pointer extraction for cuBLAS/cuSOLVER calls. Access is via raw
  pointers or `begin()`/`cbegin()` into Thrust APIs, with element-wise host
  access used only for printing.

### HPC / Scientific Computing / Graph Analytics

These domains remain the heaviest `thrust::device_vector` users:

- [Gunrock](https://github.com/gunrock/gunrock) (1k stars) — **49 files**,
  the heaviest user among notable projects. Type alias
  `device_vector_t = thrust::device_vector<T>`. Used for BFS outputs and
  graph frontier data. Raw pointers via `.data().get()`; uses `resize()` and
  `begin()`/`end()` with Thrust APIs. Copies to `ostream` from `device_ptr`
  for printing.
- [AmgX](https://github.com/NVIDIA/AMGX) (NVIDIA, 662 stars) — Custom
  allocator wrapper using `cudaMallocAsync`/`cudaFreeAsync` for stream-ordered
  allocation, passed as `thrust::device_vector`'s second template parameter.
  Extracts raw pointers; some direct element access after Thrust APIs.
- [GPU-Voxels](https://github.com/fzi-forschungszentrum-informatik/gpu-voxels)
  (315 stars) — **29 files**, robotics collision detection. Class members for
  octree nodes and voxel lists. Tracks allocations via
  `thrust::device_vector`.
- [ISCE3](https://github.com/isce-framework/isce3) (204 stars) — SAR radar,
  **21 files**.
Class members for satellite orbit data
  (`thrust::device_vector _position, _velocity`). Extracts raw pointers to
  pass directly to kernels. Uses `resize()`.
- [Feltor](https://github.com/feltor-dev/feltor) (38 stars) — Plasma physics.
  Type aliases as vocabulary types:
  `using DVec = thrust::device_vector`.

### Consolidated Construction Patterns (ordered by frequency)

1. `thrust::device_vector<T> d_v = h_v;` (copy from host_vector)
2. `thrust::device_vector<T> d_v(N);` (sized, value-initialized)
3. `thrust::device_vector<T> d_v(N, val);` (sized with fill value)
4. `thrust::device_vector<T> d_v(ptr, ptr + N);` (from host pointer range)
5. `new thrust::device_vector<T>(N);` (heap-allocated, multi-GPU)

### Why AI/ML Projects Rejected `thrust::device_vector`

The reasons cited by FAISS, cuDF, and cuML for moving away are instructive
for our design:

1. **Unwanted value initialization** — `resize()` and sized construction
   zero-initialize elements via a device kernel. For large temporary buffers
   this is wasted work. (Supports our open question on `no_init_t` tags.)
2. **No stream/queue parameter** — Operations are synchronous or use a
   default stream, preventing overlap with other work.
3. **Header includes device code** — Forces `.cu` compilation even for host
   code that just manages `device_vector`s.
## dpct (`dpct::device_vector`) — Migrated CUDA-to-SYCL Projects

A broader survey of `dpct::device_vector` usage across ~18 repositories
(111 code results on GitHub) shows additional patterns beyond the
spblas-minimal case:

**Projects surveyed include:**
- [HeCBench](https://github.com/ORNL/HeCBench) (ORNL, 285+ stars) — HPC
  benchmark suite
- [oneAPI-samples](https://github.com/oneapi-src/oneAPI-samples) (Intel,
  1139+ stars) — radix sort migration samples
- [SYCLomatic-test](https://github.com/oneapi-src/SYCLomatic-test) —
  official compatibility test suite
- Various sparse matrix, radio astronomy, and optimization codes

**Consolidated operation frequency:**

| Operation | Frequency | Example |
|---|---|---|
| Construction from size | Very high | `dpct::device_vector<T> v(N)` |
| Assignment from `std::vector` | Very high | `d_vec = h_vec` (H2D) |
| `begin()`/`end()` for algorithms | Very high | `sort(policy, dv.begin(), dv.end())` |
| `data()` + raw pointer extraction | Very high | `get_raw_pointer(dv.data())` for kernels |
| Copy back to host | High | `std::copy(policy, dv.begin(), dv.end(), h.begin())` or `h = d` |
| Construction from `std::vector` | High | `dpct::device_vector<T> dv(host_vec)` |
| `size()` | Medium | For bounds checks, kernel launch args |
| `operator[]` | Medium | Host-side element access |
| `resize()` | Medium | Output buffer sizing |
| As class/struct member | Medium | Sparse matrices, particle systems |
| `clear()`, `push_back()`, `insert()`, `erase()` | Low | Mostly in tests, not real workloads |

**Notable real-world patterns:**
- **Default-construct then assign** — oneAPI-samples shows a "reset and
  re-sort" loop: `dpct::device_vector<T> d_keys;` then `d_keys = h_keys;`
  each iteration.
- **Raw pointer extraction is ubiquitous** — nearly every project that passes
  data to SYCL kernels unwraps `device_vector` to a raw pointer.
This
validates our `device_pointer::get()` and the importance of making
`device_pointer` device-copyable for direct kernel capture. This is likely
in part a carryover of CUDA habits, but it also removes any possibility of
iterator overhead for performance-sensitive code, so the pattern is well
motivated.

## Alternatives Built by Projects That Rejected `thrust::device_vector`

Two notable alternatives were built by high-performance projects that found
`thrust::device_vector` insufficient. Understanding their designs is
instructive for our proposal.

### FAISS `DeviceVector` (Meta, 39.7k stars)

[Source: `faiss/gpu/utils/DeviceVector.cuh`](https://github.com/facebookresearch/faiss/blob/main/faiss/gpu/utils/DeviceVector.cuh)

FAISS built a minimal replacement explicitly motivated by three deficiencies
in Thrust (from the class comment): *"has more control over streams, whether
resize() initializes new space with T() (which we don't want), and control
on how much the reserved space grows."* It is restricted to POD types.

**Key design choices:**
- **Explicit `cudaStream_t` on every host mutating operation** — `resize(n, stream)`,
  `append(ptr, n, stream)`, `setAt(i, val, stream)`, `getAt(i, stream)`,
  `reserve(n, stream)`, `reclaim(exact, stream)`.
- **No initialization on `resize()`** — comment: *"Don't bother zero
  initializing the newly accessible memory (unlike thrust::device_vector)"*.
  However, newly allocated raw capacity *is* zeroed.
- **Tiered growth strategy** — power-of-2 below 4M elements, 1.25x up to
  128M elements, exact allocation above. Prevents overallocation for large
  buffers.
- **No iterators, no `operator[]`** — element access only via explicit
  `setAt()`/`getAt()` methods with a stream parameter. No proxy references.
- **Auto-detects host vs device source** — `append()` uses
  `cudaPointerGetAttributes` to pick the right `memcpy` direction.
- **Custom memory resource** — allocates through FAISS's `GpuResources`
  abstraction (pool for temporaries, `cudaMalloc` for persistent data),
  not through CUDA allocator APIs directly.
- **No copy semantics** — move-only via the underlying RAII memory handle.

**What it strips out vs Thrust:** iterators, `operator[]`, `push_back`,
`insert`/`erase`, copy construction, implicit host↔device conversion,
value initialization, STL container compatibility. What it **adds:**
an explicit stream parameter, `append()` with auto-direction detection,
`reclaim()` for capacity shrinking, and return values indicating whether
reallocation occurred.

### RAPIDS `rmm::device_uvector`

[Source: `rmm/include/rmm/device_uvector.hpp`](https://github.com/rapidsai/rmm/blob/main/cpp/include/rmm/device_uvector.hpp)

RMM's `device_uvector` (the "u" stands for uninitialized) is the
recommended replacement for `thrust::device_vector` across the RAPIDS
ecosystem (cuDF, cuML, cuGraph). It was motivated by the same concerns
as FAISS, plus a desire for pluggable memory resources.

**Key design choices:**
- **Explicit `cuda_stream_view` on every operation** — construction,
  resize, reserve, element access, and copy construction all require a
  stream.
- **No initialization** — construction and resize never launch a kernel
  to zero-fill or value-initialize. This is the defining feature.
- **No geometric growth** — `resize()` allocates exactly the requested
  size. RMM's pool memory resources handle allocation performance, so
  container-level overallocation is unnecessary.
- **Deleted default and copy constructors** — a stream must be provided to
  construct. Copying requires an explicit call: `device_uvector(other, stream)`.
- **Iterators are raw `T*` pointers** — usable in device code and Thrust
  algorithms, but dereferencing on the host is undefined behavior. No proxy
  references.
- **No `operator[]`** — element access via explicit `element(i, stream)`
  (synchronous D→H) and `set_element_async(i, val, stream)` (async H→D).
  The async setter deliberately deletes its rvalue overload to prevent
  dangling references.
- **No bulk host↔device transfer API** — no constructor from host data and
  no `assign()` from a host range. Users must call `cudaMemcpy` directly.
- **Pluggable memory resource** — uses `device_async_resource_ref`
  (type-erased, `std::pmr`-style). The default is `cudaMalloc`, but it can
  be a pool, arena, etc.
- **`static_assert(is_trivially_copyable)`** — only trivially copyable
  types are allowed.
- **Implicit conversion to `cuda::std::span`** — lightweight view
  interop.
- **Stores the device ID** — the destructor deallocates on the correct
  device even if a different device is current.

**What it strips out vs Thrust:** `operator[]`, `push_back`, `insert`/`erase`,
`assign`, `clear`, `swap`, implicit copy, host range construction, value
initialization, geometric growth, non-trivial types. What it **adds:**
an explicit stream everywhere, pluggable memory resources, span conversion,
device-aware destruction.

Note: RMM also provides `rmm::device_vector`, which is just a type alias
for `thrust::device_vector<T, rmm::mr::thrust_allocator<T>>` — the same
Thrust interface but with RMM-backed allocation.

## Summary

Most usage treats `device_vector` as a convenient way to allocate and
control the lifetime of device memory.
Usage largely focuses on:
 * copies to and from a host-side vector all at once
 * getting raw pointers to use directly on the device
 * using `begin()` and `end()` iterators as input to algorithms

Host-side element access:
 * Largely not present
 * Where present, mostly in tests / useful for debugging
 * dpct migrations use `operator[]` in some cases, but this may stem from
   CUDA migration patterns rather than intentional design
 * Projects that do need host access (FAISS, RMM) have replaced
   `device_vector` with alternatives that allow access with explicit stream
   synchronization.