172 changes: 172 additions & 0 deletions rfcs/proposed/device_vector/README.md
# `device_array` and `compat::device_vector` for oneDPL

## Introduction

This RFC proposes adding a container to oneDPL that provides
a `std::vector`-like interface for managing device memory.

### Motivation

- **Migration from CUDA/Thrust** - Thrust's `device_vector` is heavily used
in CUDA codebases. Providing an equivalent in oneDPL lowers the barrier
for porting to SYCL backends. SYCLomatic already generates code targeting
a `dpct::device_vector` compatibility shim; an official oneDPL type would
give that migration a stable target in an actively maintained repository.
- **Ease of use** - Users currently must manually manage USM allocations or
SYCL buffers and pair them with raw pointers or iterators. A
`device_vector` encapsulates allocation, sizing, and lifetime in a
single object and integrates directly with oneDPL algorithms.
- **Real-world usage patterns** - A [detailed survey](usage_pattern_study.md)
of real-world usage informed the design. Key findings:

1. **Construction + bulk transfer + raw pointer extraction** are the core
operations across all domains. `device_vector` is primarily used as an
RAII device memory manager and host-device data shuttle.
2. **`begin()`/`end()` integration with parallel algorithms** is the
second-most critical capability.
3. **Some popular AI/ML projects** (FAISS, cuDF, cuML) have **moved away from
`thrust::device_vector`** due to unwanted value initialization, lack of
stream parameters, and header bloat — then built alternatives that
prioritize explicit async control and uninitialized allocation. Other HPC
and ML projects remain heavy users.
4. **Full `std::vector`-like modifiers** (`push_back`, `insert`, `erase`)
are rarely used in real workloads.

## Comparison of Existing device_vector Implementations

| Implementation | Source |
|---|---|
| **Thrust** (`thrust::device_vector`) | [NVIDIA/cccl - device_vector.h](https://github.com/NVIDIA/cccl/blob/main/thrust/thrust/device_vector.h) |
| **SYCLomatic** (`dpct::device_vector`) | [SYCLomatic - vector.h](https://github.com/oneapi-src/SYCLomatic/blob/SYCLomatic/clang/runtime/dpct-rt/include/dpct/dpl_extras/vector.h) |
| **Distributed Ranges** (`dr::sp::device_vector`) | [distributed-ranges - device_vector.hpp](https://github.com/oneapi-src/distributed-ranges/blob/main/include/dr/sp/device_vector.hpp) |
| **sycl-thrust** (`thrust::device_vector`) | [SparseBLAS/sycl-thrust - device_vector.h](https://github.com/SparseBLAS/sycl-thrust/blob/main/include/thrust/device_vector.h) |

### How They Differ

| Aspect | Proposed (oneDPL) | Thrust | sycl-thrust | SYCLomatic |
|---|---|---|---|---|
| **Default Allocator** | `device_allocator<T>` wrapping `sycl::malloc_device`; custom `DeviceAllocator` concept | `thrust::device_allocator<T>` (CUDA `cudaMalloc`) | `device_allocator<T>` (`sycl::malloc_device`); supports alignment template parameter | USM: `sycl::usm_allocator<T, shared>` / Buffer: `__buffer_allocator<T>` |
| **Memory Model** | **Device memory** via `sycl::malloc_device`; host access requires explicit transfers | **Device memory** via `cudaMalloc`; element proxies perform implicit per-access transfers | **Device memory** via `sycl::malloc_device`; explicit transfers | **Shared memory** via USM shared or SYCL buffer/accessor; runtime manages placement |
| **Host Element Access** | `device_array`: explicit `read()`/`write()`; compat `device_vector`: `device_reference` proxy | Via `device_reference` proxy (explicit device-to-host copy) | Via `device_reference` proxy (`__SYCL_DEVICE_ONLY__` bifurcation) | Via `device_reference` proxy (runtime-managed migration) |
| **std::vector Interop** | Explicit constructor + `to_vector()` | Copy constructors from/to `std::vector` | Constructor from `std::vector` | Copy/move + implicit `operator std::vector()` |
| **Queue Association** | Stores context + device; queue provided per-operation or created on demand | Implicit (CUDA stream) | Allocator stores `device` + `context`; queue resolved at runtime via pointer introspection | Global default queue |
| **Uninitialized Construction** | `device_array`: uninitialized by default; compat `device_vector`: `no_init_t` tag | `default_init_t`, `no_init_t` tags | Not supported | Not supported |
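
To illustrate the last row, here is an uncompiled pseudocode sketch of how uninitialized construction would look under this proposal. The constructor shapes and the `no_init` spelling are assumptions for illustration, not a settled API:

```cpp
// Proposed device_array: storage is uninitialized by default (no fill kernel).
oneapi::dpl::experimental::device_array<float> a(n, ctx, dev);

// compat::device_vector keeps Thrust's default of value initialization...
oneapi::dpl::experimental::compat::device_vector<float> v(n);  // zero-filled

// ...with an opt-out tag mirroring thrust::no_init_t (hypothetical spelling).
oneapi::dpl::experimental::compat::device_vector<float> w(n, oneapi::dpl::no_init);
```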

## Proposal

The proposal consists of two complementary types that share an underlying
implementation:

1. **[`device_array<T, Alloc>`](device_array.md)** — the primary API.
A clean, explicit container for device memory with no proxy types. Raw `T*`
iterators, explicit `read()`/`write()` for host access, uninitialized by
default, and range support via `device_span`.

2. **[`compat::device_vector<T, Alloc>`](device_vector_compat.md)** — a
Thrust compatibility layer built on `device_array`. Adds `device_pointer`,
`device_reference`, and `operator[]` proxy semantics for drop-in migration
from `thrust::device_vector`.

### Class Relationships

```mermaid
classDiagram
direction LR

namespace experimental {
class device_array~T, Alloc~ {
iterators: T*
host access: read() / write()
}

class device_span~T~ {
non-owning view
trivially copyable
}
}

namespace experimental_compat {
class device_vector~T, Alloc~ {
Thrust compat layer
}

class device_pointer~T~ {
wraps T* + context*
}

class device_reference~T~ {
proxy for host access
}
}

device_array --> device_span : .span()
device_vector *-- device_array : contains
device_vector --> device_pointer : begin()/end()
device_pointer --> device_reference : operator*() / operator[]()
```

### Design Decisions

- **Use USM device memory as the baseline; copy to/from the host on demand.**
This matches the semantics of all pre-existing implementations except
SYCLomatic, where the runtime decides where memory lives. Shared memory has
significantly worse performance than device memory; users who want shared
semantics can use USM shared memory or SYCL buffers directly.

- **Store context, not queue.** `sycl::malloc_device` requires only a context
(and a device, which can be looked up from the pointer). Storing a queue
would tie the container to a particular queue and imply synchronization
semantics. Queues are accepted per-operation or created on demand.
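
The practical consequence is visible in the signatures. An uncompiled, SYCL-flavored sketch; the exact `write`/`read` signatures are assumptions for illustration:

```cpp
// Construction needs only a context and device (here derived from a queue);
// neither the queue nor its ordering semantics is captured by the container.
device_array<float> a(n, q.get_context(), q.get_device());

// Each operation that touches the device takes a queue explicitly...
a.write(q, host_ptr, n);  // host-to-device copy submitted on q
oneapi::dpl::sort(oneapi::dpl::execution::make_device_policy(q),
                  a.begin(), a.end());

// ...so the same container can be used with different queues over its
// lifetime, as long as they share the stored context.
a.read(other_queue_same_context, host_ptr, n);
```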

- **Require only device copyability of `T`.**
Nothing beyond device copyability, which is needed for copies to and from
the device, should be required of the element type.

- **No tag system for dispatch to specific hardware.**
Execution policies dictate where algorithms run. We do not intend to provide
other flavors of vector or iterator with different tags, which is what
tag-based dispatch would require.

- **Custom `DeviceAllocator` concept for pluggable allocation.**
A minimal allocator interface — just `allocate(n, ctx, dev)` and
`deallocate(p, n, ctx, dev)` — that avoids the `std::allocator` named
requirements (which mandate host-accessible memory). Enables pool allocators,
aligned allocation, and other strategies. See the
[device_array allocator section](device_array.md#allocator) for details.

- **No `push_back`, `insert`, `erase`.**
Rarely used in practice (see [usage study](usage_pattern_study.md)),
high implementation complexity for device memory.

- **Host-side operations block but do not synchronize with prior work.**
A host-side operation waits until its own transfer completes before
returning, but there is no guarantee that previously submitted queue work
has finished before it starts. This is a deliberate departure from
`thrust::device_vector` that fits SYCL's execution model and matches the
preference for explicit async control found in the usage study. The user is
responsible for ensuring prior kernels have completed before host-side
access, e.g. via an in-order queue or explicit event waits. `device_array`
additionally offers async overloads with `depends_on` events.
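
An uncompiled sketch of the two synchronization options; the `read`/`read_async` signatures are assumed from this proposal, not an existing API:

```cpp
// Option 1: an in-order queue serializes submissions.
sycl::queue q{sycl::property::queue::in_order{}};
device_array<float> a(n, q.get_context(), q.get_device());

sycl::event e = q.fill(a.begin(), 0.0f, n);  // some prior device work

// Blocking read: waits for *this* copy only. On an in-order queue the fill
// is guaranteed to have finished first; on an out-of-order queue it is not.
a.read(q, host.data(), n);

// Option 2: explicit dependency via the async overload.
a.read_async(q, host.data(), n, {e}).wait();
```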

## Open Questions

- **Should `device_array`'s async overloads be in the initial release or
deferred?** See [device_array](device_array.md).

- **Header organization?** Options include:
  - A single `<oneapi/dpl/compat>` header that also pulls `device_array` in
    alongside other includes.
  - Individual headers: `<oneapi/dpl/device_array>` and
    `<oneapi/dpl/device_vector>`; `device_vector` would transitively include
    `device_array` since it depends on it.
  - A `compat` header plus an individual `device_array` header. If oneDPL's
    own SYCL-based implementation uses `device_array` internally, that header
    would need to be includable without the rest of the compat layer, which
    may affect this choice.
- **Compatibility namespace naming?** This proposal places the
Thrust-compatible types in `oneapi::dpl::experimental::compat`. Several
aspects are worth discussing:
- Should `compat` be nested under `experimental`, or should it be
`oneapi::dpl::compat` directly? Or alternatively `oneapi::dpl::ext::compat`.
- Is `compat` the right name? Alternatives: `thrust_compat`, `migration`,
`legacy`. `compat` is concise but doesn't indicate what it's compatible
*with*. `thrust_compat` is more explicit but ties the namespace to a
specific vendor's API.
  - Moreover, is this repository the right home for the compatibility
    headers? We believe so, since anywhere else they would be too cumbersome
    to use, but it is worth raising.
