diff --git a/rfcs/proposed/device_vector/README.md b/rfcs/proposed/device_vector/README.md
new file mode 100644
index 00000000000..87fe87de5ea
--- /dev/null
+++ b/rfcs/proposed/device_vector/README.md
@@ -0,0 +1,172 @@
# `device_array` and `compat::device_vector` for oneDPL

## Introduction

This RFC proposes adding a container to oneDPL that provides
a `std::vector`-like interface for managing device memory.

### Motivation

- **Migration from CUDA/Thrust** - Thrust's `device_vector` is heavily used
  in CUDA codebases. Providing an equivalent in oneDPL lowers the barrier
  for porting to SYCL backends. SYCLomatic already generates code targeting
  a `dpct::device_vector` compatibility shim; an official oneDPL type would
  give that migration a stable target in an actively maintained repository.
- **Ease of use** - Users currently must manually manage USM allocations or
  SYCL buffers and pair them with raw pointers or iterators. A
  `device_vector` encapsulates allocation, sizing, and lifetime in a
  single object and integrates directly with oneDPL algorithms.
- **Real-world usage patterns** - A [detailed survey](usage_pattern_study.md)
  of real-world usage informed the design. Key findings:

  1. **Construction + bulk transfer + raw pointer extraction** are the core
     operations across all domains. `device_vector` is primarily used as an
     RAII device memory manager and host-device data shuttle.
  2. **`begin()`/`end()` integration with parallel algorithms** is the
     second-most critical capability.
  3. **Some popular AI/ML projects** (FAISS, cuDF, cuML) have **moved away from
     `thrust::device_vector`** due to unwanted value initialization, lack of
     stream parameters, and header bloat, and then built alternatives that
     prioritize explicit async control and uninitialized allocation. Other HPC
     and ML projects remain heavy users.
  4. **Full `std::vector`-like modifiers** (`push_back`, `insert`, `erase`)
     are rarely used in real workloads.

## Comparison of Existing device_vector Implementations

| Implementation | Source |
|---|---|
| **Thrust** (`thrust::device_vector`) | [NVIDIA/cccl - device_vector.h](https://github.com/NVIDIA/cccl/blob/main/thrust/thrust/device_vector.h) |
| **SYCLomatic** (`dpct::device_vector`) | [SYCLomatic - vector.h](https://github.com/oneapi-src/SYCLomatic/blob/SYCLomatic/clang/runtime/dpct-rt/include/dpct/dpl_extras/vector.h) |
| **Distributed Ranges** (`dr::sp::device_vector`) | [distributed-ranges - device_vector.hpp](https://github.com/oneapi-src/distributed-ranges/blob/main/include/dr/sp/device_vector.hpp) |
| **sycl-thrust** (`thrust::device_vector`) | [SparseBLAS/sycl-thrust - device_vector.h](https://github.com/SparseBLAS/sycl-thrust/blob/main/include/thrust/device_vector.h) |

### How They Differ

| Aspect | Proposed (oneDPL) | Thrust | sycl-thrust | SYCLomatic |
|---|---|---|---|---|
| **Default Allocator** | `device_allocator` wrapping `sycl::malloc_device`; custom `DeviceAllocator` concept | `thrust::device_allocator` (CUDA `cudaMalloc`) | `device_allocator` (`sycl::malloc_device`); supports alignment template parameter | USM: `sycl::usm_allocator` / Buffer: `__buffer_allocator` |
| **Memory Model** | **Device memory** via `sycl::malloc_device`; host access triggers explicit transfers | **Device memory** via `cudaMalloc`; host access triggers explicit transfers | **Device memory** via `sycl::malloc_device`; explicit transfers | **Shared memory** via USM shared or SYCL buffer/accessor; runtime manages placement |
| **Host Element Access** | `device_array`: explicit `read()`/`write()`; compat `device_vector`: `device_reference` proxy | Via `device_reference` proxy (explicit device-to-host copy) | Via `device_reference` proxy (`__SYCL_DEVICE_ONLY__` bifurcation) | Via `device_reference` proxy (runtime-managed migration) |
| **std::vector Interop** | Explicit constructor + `to_vector()` | Copy constructors from/to `std::vector` | Constructor from `std::vector` | Copy/move + implicit `operator std::vector()` |
| **Queue Association** | Stores context + device; queue provided per-operation or created on demand | Implicit (CUDA stream) | Allocator stores `device` + `context`; queue resolved at runtime via pointer introspection | Global default queue |
| **Uninitialized Construction** | `device_array`: uninitialized by default; compat `device_vector`: `no_init_t` tag | `default_init_t`, `no_init_t` tags | Not supported | Not supported |

## Proposal

The proposal consists of two complementary types that share an underlying
implementation:

1. **[`device_array`](device_array.md)** — the primary API.
   A clean, explicit container for device memory with no proxy types. Raw `T*`
   iterators, explicit `read()`/`write()` for host access, uninitialized by
   default, and range support via `device_span`.

2. **[`compat::device_vector`](device_vector_compat.md)** — a
   Thrust compatibility layer built on `device_array`. Adds `device_pointer`,
   `device_reference`, and `operator[]` proxy semantics for drop-in migration
   from `thrust::device_vector`.
### Class Relationships

```mermaid
classDiagram
    direction LR

    namespace experimental {
        class device_array~T, Alloc~ {
            iterators: T*
            host access: read() / write()
        }

        class device_span~T~ {
            non-owning view
            trivially copyable
        }
    }

    namespace experimental_compat {
        class device_vector~T, Alloc~ {
            Thrust compat layer
        }

        class device_pointer~T~ {
            wraps T* + context*
        }

        class device_reference~T~ {
            proxy for host access
        }
    }

    device_array --> device_span : .span()
    device_vector *-- device_array : contains
    device_vector --> device_pointer : begin()/end()
    device_pointer --> device_reference : operator*() / operator[]()
```

### Design Decisions

- **Use USM device memory as the baseline; copy to/from host on demand.**
  This matches the semantics of all pre-existing implementations except
  SYCLomatic's, where the runtime decides where memory lives. Shared memory
  has significantly worse performance than device memory; users who want
  those semantics can use USM shared memory or SYCL buffers directly.

- **Store context, not queue.** `sycl::malloc_device` requires only
  a context (and a device, which can later be recovered from the pointer).
  Storing a queue would tie the container to a particular queue and imply
  synchronization semantics. Queues are accepted per-operation
  or created on demand.

- **Type `T` need only be device copyable.**
  Device copyability is sufficient for transfers to and from the device; no
  further requirements are imposed on `T`.

- **No tag system for dispatch to specific hardware.**
  Execution policies dictate where algorithms are run. We do not intend to
  provide other flavors of vector/iterator with different tags, which tag-based
  dispatch would require.
- **Custom `DeviceAllocator` concept for pluggable allocation.**
  A minimal allocator interface — just `allocate(n, ctx, dev)` and
  `deallocate(p, n, ctx, dev)` — that avoids the `std::allocator` named
  requirements (which mandate host-accessible memory). Enables pool allocators,
  aligned allocation, and other strategies. See the
  [device_array allocator section](device_array.md#allocator) for details.

- **No `push_back`, `insert`, `erase`.**
  These are rarely used in practice (see the [usage study](usage_pattern_study.md))
  and have high implementation complexity for device memory.

- **Host-side operations block but do not synchronize with prior work.**
  The user is responsible for ensuring prior kernels have completed before
  host-side access. This can be achieved via an in-order queue or explicit
  event waits. `device_array` additionally offers async overloads with
  `depends_on` events.

## Open Questions

- **Should `device_array`'s async overloads be in the initial release or
  deferred?**
  See [device_array](device_array.md).

- **Header organization?**
  - We could have a single umbrella header that automatically includes
    `device_array` along with the other oneDPL includes. Alternatives:
    - Individual headers, one per type: `device_vector` would transitively
      include `device_array` since it depends on it.
    - We could have a `compat` header and an individual `device_array` header.
      However, if we intend to use `device_array` within our own SYCL
      implementations, that may affect our decision here.

- **Compatibility namespace naming?** This proposal places the
  Thrust-compatible types in `oneapi::dpl::experimental::compat`. Several
  aspects are worth discussing:
  - Should `compat` be nested under `experimental`, or should it be
    `oneapi::dpl::compat` directly? Or alternatively `oneapi::dpl::ext::compat`.
  - Is `compat` the right name? Alternatives: `thrust_compat`, `migration`,
    `legacy`. `compat` is concise but doesn't indicate what it's compatible
    *with*.
    `thrust_compat` is more explicit but ties the namespace to a
    specific vendor's API.
  - Moreover, is this repository where we want the compatibility headers to
    live? I think yes, since otherwise they would be too cumbersome to use,
    but it is worth raising.

diff --git a/rfcs/proposed/device_vector/device_array.md b/rfcs/proposed/device_vector/device_array.md
new file mode 100644
index 00000000000..3811b638ca9
--- /dev/null
+++ b/rfcs/proposed/device_vector/device_array.md
@@ -0,0 +1,336 @@
# `device_array` — Simplified Device Memory Container

`device_array` provides RAII device allocation, explicit host-device transfers,
and direct use of raw USM pointers as iterators. It emphasizes convenience
while making host-side access explicit, and it gives the user control over how
host-side operations synchronize with device work.

This focus supports the dominant usage pattern of `device_vector` users and
fits naturally within SYCL while avoiding much of `device_vector`'s complexity.

See the [device_vector RFC](../../../rfcs/proposed/device_vector/README.md) for
full motivation, the usage study, and a comparison of existing implementations.
This document describes only `device_array`.
## API

```cpp
namespace oneapi::dpl::experimental {

template <typename T, typename Alloc = device_allocator<T>>
class device_array {
public:
    using value_type = T;
    using allocator_type = Alloc;
    using size_type = std::size_t;
    using difference_type = std::ptrdiff_t;
    using pointer = T*;
    using const_pointer = const T*;
    using iterator = T*;
    using const_iterator = const T*;

    // Construction

    // Allocate uninitialized device memory (default — no kernel launch or memset)
    // From queue (extracts context + device; queue is not retained)
    device_array(size_type count, sycl::queue q);
    // From context + device
    device_array(size_type count, sycl::context ctx, sycl::device dev);

    // Allocate and fill with value (requires kernel launch or memset)
    device_array(size_type count, const T& value, sycl::queue q);
    device_array(size_type count, const T& value,
                 sycl::context ctx, sycl::device dev);

    // Construct from host data (upload)
    template <typename InputIt>
    device_array(InputIt first, InputIt last, sycl::queue q);
    device_array(std::initializer_list<T> init, sycl::queue q);
    device_array(const std::vector<T>& src, sycl::queue q);

    template <typename InputIt>
    device_array(InputIt first, InputIt last, sycl::context ctx, sycl::device dev);
    device_array(std::initializer_list<T> init, sycl::context ctx, sycl::device dev);
    device_array(const std::vector<T>& src, sycl::context ctx, sycl::device dev);

    // Copy is deleted — use copy_from for explicit device-to-device copies
    device_array(const device_array&) = delete;
    device_array& operator=(const device_array&) = delete;

    // Move
    device_array(device_array&&);
    device_array& operator=(device_array&&);

    ~device_array();

    // Device-to-device copy (allocates on the provided context + device)
    // Supports cross-device copies: source and destination may be on different devices
    static device_array copy_from(const device_array& src, sycl::queue q);
    static device_array copy_from(const device_array& src,
                                  size_type offset, size_type count, sycl::queue q);
    static device_array copy_from(const device_array& src,
                                  sycl::context ctx, sycl::device dev);
    static device_array copy_from(const device_array& src,
                                  size_type offset, size_type count,
                                  sycl::context ctx, sycl::device dev);

    // Host-device transfer

    // Bulk download
    std::vector<T> to_vector() const;
    std::vector<T> to_vector(sycl::queue q) const;

    // Bulk upload (resizes to match)
    void assign(const T* first, const T* last);
    void assign(const T* first, const T* last, sycl::queue q);
    void assign(const std::vector<T>& src);
    void assign(const std::vector<T>& src, sycl::queue q);

    // Single-element host access (blocking; creates a queue from context + device)
    T read(size_type pos) const;
    void write(size_type pos, const T& value);

    // Single-element host access (blocking; the provided queue is used for copy submissions)
    T read(size_type pos, sycl::queue q) const;
    void write(size_type pos, const T& value, sycl::queue q);

    // Asynchronous single-element access; events allow synchronization with
    // event-driven workloads
    sycl::event async_read(size_type pos, T& out,
                           sycl::queue q,
                           const std::vector<sycl::event>& depends_on = {}) const;
    sycl::event async_write(size_type pos, const T& value,
                            sycl::queue q,
                            const std::vector<sycl::event>& depends_on = {});

    // Asynchronous bulk transfer
    sycl::event async_to_vector(std::vector<T>& out,
                                sycl::queue q,
                                const std::vector<sycl::event>& depends_on = {}) const;
    sycl::event async_assign(const T* first, const T* last,
                             sycl::queue q,
                             const std::vector<sycl::event>& depends_on = {});

    // Device iteration — raw USM pointers
    iterator begin();
    const_iterator begin() const;
    iterator end();
    const_iterator end() const;
    pointer data();
    const_pointer data() const;

    // Capacity
    size_type size() const;
    size_type capacity() const;
    bool empty() const;

    // Resize — new elements are uninitialized by default
    void resize(size_type count);
    void resize(size_type count, sycl::queue q);
    // Resize — new elements filled with value
    void resize(size_type count, const T& value);
    void resize(size_type count, const T& value, sycl::queue q);

    void reserve(size_type new_cap);
    void clear();
    void swap(device_array& other);

    // Views
    device_span<T> span();
    device_span<const T> span() const;

    // Allocator access
    allocator_type get_allocator() const;

    // Context / device access
    sycl::context get_context() const;
    sycl::device get_device() const;
};

} // namespace oneapi::dpl::experimental
```

## Allocator

`device_array` accepts an optional allocator template parameter for device
memory allocation. The default allocator wraps `sycl::malloc_device` /
`sycl::free`.

### Allocator Requirements

A type `Alloc` satisfies `DeviceAllocator` for type `T` if, given an instance
`a` of type `Alloc`, a pointer `p` of type `T*`, a `std::size_t n`, a
`sycl::context ctx`, and a `sycl::device dev`, the following expressions are
valid:

| Expression | Return type | Semantics |
|---|---|---|
| `a.allocate(n, ctx, dev)` | `T*` | Allocate device memory for `n` objects of type `T` |
| `a.deallocate(p, n, ctx, dev)` | `void` | Free device memory previously allocated by `allocate` |

The allocator is not required to support `construct`, `destroy`, or any of the
`std::allocator` named requirements beyond `allocate`/`deallocate`. Device
memory is not host-accessible, so construction and destruction happen via
kernel launches or memcpy, managed by `device_array` itself.

The allocator must be copy-constructible and copy-assignable.
```cpp
// Default allocator
template <typename T>
struct device_allocator {
    T* allocate(std::size_t n, sycl::context ctx, sycl::device dev) {
        return sycl::malloc_device<T>(n, dev, ctx);
    }
    void deallocate(T* p, std::size_t n, sycl::context ctx, sycl::device dev) {
        sycl::free(p, ctx);
    }
};
```

### C++20 Concept (informational; enforced via SFINAE on C++17)

```cpp
template <typename Alloc, typename T>
concept DeviceAllocator = requires(Alloc a, T* p, std::size_t n,
                                   sycl::context ctx, sycl::device dev) {
    { a.allocate(n, ctx, dev) } -> std::same_as<T*>;
    { a.deallocate(p, n, ctx, dev) } -> std::same_as<void>;
};
```

## `device_span`

`device_array` is not device-copyable (it owns memory). For kernel capture,
non-owning views, and range composition, use `device_span` via `.span()`.

`device_span` is guaranteed trivially copyable (and therefore device copyable),
has `enable_borrowed_range = true` and `enable_view = true`, and models
`contiguous_range` + `sized_range`.

### Definition

**C++23 and later:** `std::span` is guaranteed trivially copyable (per P2251R1),
so `device_span` is simply an alias:

```cpp
#if __cplusplus >= 202302L // C++23

template <typename T>
using device_span = std::span<T>;

#else // C++20

template <typename T>
class device_span {
    T* __ptr = nullptr;
    std::size_t __size = 0;
public:
    using element_type = T;
    using value_type = std::remove_cv_t<T>;
    using size_type = std::size_t;
    using difference_type = std::ptrdiff_t;
    using pointer = T*;
    using reference = T&;
    using iterator = T*;

    device_span() = default;
    device_span(T* ptr, std::size_t size) : __ptr(ptr), __size(size) {}

    template <std::size_t N>
    device_span(T (&arr)[N]) : __ptr(arr), __size(N) {}

    // Implicit conversion from device_array
    device_span(device_array<T>& arr);

    T* begin() const { return __ptr; }
    T* end() const { return __ptr + __size; }
    T* data() const { return __ptr; }
    std::size_t size() const { return __size; }
    bool empty() const { return __size == 0; }

    T& operator[](std::size_t i) const { return __ptr[i]; }
    T& front() const { return __ptr[0]; }
    T& back() const { return __ptr[__size - 1]; }

    device_span first(std::size_t count) const { return {__ptr, count}; }
    device_span last(std::size_t count) const { return {__ptr + __size - count, count}; }
    device_span subspan(std::size_t offset, std::size_t count) const {
        return {__ptr + offset, count};
    }
};

template <typename T>
inline constexpr bool std::ranges::enable_borrowed_range<
    oneapi::dpl::experimental::device_span<T>> = true;

template <typename T>
inline constexpr bool std::ranges::enable_view<
    oneapi::dpl::experimental::device_span<T>> = true;

#endif
```

## Usage Examples

```cpp
#include <sycl/sycl.hpp>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
// plus the (to-be-named) header providing device_array; see the header
// organization discussion in the RFC README's Open Questions

namespace dpl = oneapi::dpl::experimental;

sycl::queue q{sycl::property::queue::in_order{}};

// --- RAII allocation + upload from host ---
std::vector<float> host_data(1024, 3.14f);
dpl::device_array<float> d(host_data, q);

// --- Use with oneDPL algorithms (raw T* iterators) ---
auto policy = oneapi::dpl::execution::make_device_policy(q);
std::sort(policy, d.begin(), d.end());

// --- Use in a SYCL kernel ---
float* ptr = d.data();
q.parallel_for(sycl::range<1>(d.size()), [=](sycl::id<1> i) {
    ptr[i] *= 2.0f;
}).wait();

// --- Explicit single-element host access ---
float val = d.read(0, q);  // synchronous read
d.write(0, 42.0f, q);      // synchronous write

// --- Async transfer with dependency ---
float result;
sycl::event e = d.async_read(0, result, q, {some_prior_event});
e.wait();

// --- Bulk download ---
std::vector<float> out = d.to_vector(q);

// --- Output buffer (uninitialized by default — no memset) ---
dpl::device_array<float> output(1024, q);
std::transform(policy, d.begin(), d.end(), output.begin(),
               [](float x) { return x * 2.0f; });

// --- Zero-initialized allocation (opt-in) ---
dpl::device_array<float> zeroed(1024, 0.0f, q);

// For kernel capture or composition with range adaptors, use device_span:
auto s = d.span(); // returns device_span<float>
auto pipeline = s | std::views::take(100);
oneapi::dpl::ranges::for_each(policy, pipeline, [](float& x) { x += 1.0f; });

// Capture a device_span into a kernel:
auto s2 = d.span();
q.parallel_for(sycl::range<1>(s2.size()), [=](sycl::id<1> i) {
    s2[i] *= 2.0f;
}).wait();
```

## Open Questions

- **Should async overloads be in the initial proposal or deferred?**
  They provide more control over synchronization than an in-order queue alone,
  but it is unclear whether users who want this control would prefer to work
  with USM pointers and memcpy directly.

diff --git a/rfcs/proposed/device_vector/device_vector_compat.md b/rfcs/proposed/device_vector/device_vector_compat.md
new file mode 100644
index 00000000000..a98c8761cfc
--- /dev/null
+++ b/rfcs/proposed/device_vector/device_vector_compat.md
@@ -0,0 +1,242 @@
# `device_vector` Compatibility Layer

A Thrust-compatible device-memory vector, living in a compatibility namespace.
It adds `device_pointer`, `device_reference`, and implicit host-access
semantics on top of `device_array`'s explicit API.

See the [usage study](usage_pattern_study.md) for evidence on which Thrust APIs
are actually used, and [device_array](device_array.md) for the underlying
container.

The goal is a near drop-in replacement for `thrust::device_vector`, covering
the functionality that is actually used in practice, adapted to fit within SYCL.

## Relationship to `device_array`

`compat::device_vector` contains a `device_array` that provides the core
functionality.

On top of that it layers an iterator/pointer type, `device_pointer`, which
wraps raw USM pointers, and a proxy reference type, `device_reference`, which
enables host-side element access with implicit memory transfers. Both types
hold a pointer to a `sycl::context` so that a queue can be created for the
underlying memcpy.

## Differences from Thrust

1. **Context + device (or queue) always required** — no implicit default device.
2. **No default constructor** — a device association is always required.
3. **No `push_back`, `insert`, `erase`** — rarely used, unnecessary complexity.
4. **No `host_vector` type** — use `std::vector` directly.
5. **No system tag dispatch** — execution policies determine where algorithms run.

## Namespace

We are using `oneapi::dpl::experimental::compat` for these compatibility
classes, with the intention of promoting them to `oneapi::dpl::compat` as they
exit experimental. We could add `ext` as well, but that further elongates an
already long name. Other components that graduate from SYCLomatic but don't
belong in oneDPL proper may also end up living in this `compat` namespace in
the future.

## API

```cpp
namespace oneapi::dpl::experimental::compat {

// =========================================================================
// device_pointer
// =========================================================================
// Wraps a raw T* from device_array. Dereferencing yields a device_reference.
template <typename T>
class device_pointer {
    T* __ptr = nullptr;
    const sycl::context* __ctx = nullptr; // non-owning, from device_vector

public:
    using iterator_concept = std::random_access_iterator_tag;
    using value_type = std::remove_cv_t<T>;
    using difference_type = std::ptrdiff_t;
    using reference = device_reference<T>;

    device_pointer() = default;
    explicit device_pointer(T* ptr, const sycl::context* ctx = nullptr);

    // Raw pointer access — unwraps back to the T* that device_array uses
    T* get() const;

    reference operator*() const;
    reference operator[](difference_type n) const;

    /* Full random access iterator arithmetic + comparison */
};

// =========================================================================
// device_reference
// =========================================================================

template <typename T>
class device_reference {
public:
    operator T() const;                                    // read
    const device_reference& operator=(const T& val) const; // write
    const device_reference& operator=(const device_reference&) const;

    // Compound assignment (each is a synchronous read-modify-write)
    const device_reference& operator+=(const T&) const;
    /* all other compound assignments...
    */

    const device_reference& operator++() const;
    T operator++(int) const;
    const device_reference& operator--() const;
    T operator--(int) const;

    device_pointer<T> operator&() const;

    friend void swap(const device_reference& a, const device_reference& b);
};

// =========================================================================
// device_vector
// =========================================================================

template <typename T, typename Alloc = device_allocator<T>>
class device_vector {
    device_array<T, Alloc> __impl; // stores context + device + allocator

public:
    using value_type = T;
    using allocator_type = Alloc;
    using size_type = std::size_t;
    using difference_type = std::ptrdiff_t;
    using reference = device_reference<T>;
    using const_reference = device_reference<const T>;
    using pointer = device_pointer<T>;
    using const_pointer = device_pointer<const T>;
    using iterator = device_pointer<T>;
    using const_iterator = device_pointer<const T>;

    // Construction from size
    device_vector(size_type count, sycl::context ctx, sycl::device dev);
    // Construction from size and value
    device_vector(size_type count, const T& value,
                  sycl::context ctx, sycl::device dev);

    // Construction from iterators
    template <typename InputIt>
    device_vector(InputIt first, InputIt last,
                  sycl::context ctx, sycl::device dev);
    // Construction from initializer_list
    device_vector(std::initializer_list<T> init,
                  sycl::context ctx, sycl::device dev);
    // Construction from std::vector
    explicit device_vector(const std::vector<T>& src,
                           sycl::context ctx, sycl::device dev);

    /* Copies of all the above constructors taking a sycl::queue (which
       extracts context + device), taking no_init_t to skip initialization,
       and taking an explicit allocator */

    // Copy / move
    device_vector(const device_vector&);
    device_vector(device_vector&&) noexcept;
    device_vector& operator=(const device_vector&);
    device_vector& operator=(device_vector&&) noexcept;

    ~device_vector();

    // Assign from host vector (bulk upload)
    device_vector& operator=(const std::vector<T>& src);

    // Convert to host vector (bulk download)
    explicit operator std::vector<T>() const;

    // --- Element access (proxy references) ---
    reference operator[](size_type pos);
    const_reference operator[](size_type pos) const;
    reference front();
    const_reference front() const;
    reference back();
    const_reference back() const;

    // --- Pointer access (device_pointer wrapping device_array's T*) ---
    pointer data();
    const_pointer data() const;

    // --- Iterators (device_pointer wrapping device_array's T*) ---
    iterator begin();
    const_iterator begin() const;
    iterator end();
    const_iterator end() const;

    // --- Capacity (forwarded to device_array) ---
    size_type size() const;
    size_type capacity() const;
    bool empty() const;

    void resize(size_type count);
    void resize(size_type count, const T& value);
    void resize(size_type count, no_init_t);
    void reserve(size_type new_cap);
    void clear();

    // --- Swap ---
    void swap(device_vector& other);

    // --- Access to underlying device_array ---
    device_array<T, Alloc>& base();
    const device_array<T, Alloc>& base() const;

    // --- Allocator ---
    allocator_type get_allocator() const;

    // --- Context / device ---
    sycl::context get_context() const;
    sycl::device get_device() const;
};

} // namespace oneapi::dpl::experimental::compat
```

## Usage Example

```cpp
#include <sycl/sycl.hpp>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
// plus the (to-be-named) header providing compat::device_vector

namespace compat = oneapi::dpl::experimental::compat;

sycl::queue q{sycl::property::queue::in_order{}};

// --- Thrust-like construction ---
std::vector<float> host_data(1024, 3.14f);
compat::device_vector<float> d(host_data, q);

// --- Thrust-like algorithm use ---
auto policy = oneapi::dpl::execution::make_device_policy(q);
std::sort(policy, d.begin(), d.end());

// --- Thrust-like element access (proxy, synchronous) ---
float val = d[0]; // implicit device-to-host
d[0] = 42.0f;     // implicit host-to-device
d[1] += 10.0f;    // read-modify-write round-trip

// --- Raw pointer for kernels ---
float* ptr = d.data().get();
q.parallel_for(sycl::range<1>(d.size()), [=](sycl::id<1> i) {
    ptr[i] *= 2.0f;
}).wait();

// --- Device-to-device copy via device_pointer ---
compat::device_vector<float> d2(d.begin(), d.end(), q); // D2D copy

// --- Bulk copy back to host ---
std::vector<float> result = static_cast<std::vector<float>>(d);

// --- Construction from context + device (no queue needed) ---
sycl::context ctx = q.get_context();
sycl::device dev = q.get_device();
compat::device_vector<float> d3(1024, ctx, dev);

// --- Gradual migration to device_array ---
auto& arr = d.base();
float v = arr.read(5, q);  // explicit, no proxy
arr.write(5, v * 2.0f, q); // explicit
std::vector<float> out = arr.to_vector(q);
```

diff --git a/rfcs/proposed/device_vector/usage_pattern_study.md b/rfcs/proposed/device_vector/usage_pattern_study.md
new file mode 100644
index 00000000000..66f610c4657
--- /dev/null
+++ b/rfcs/proposed/device_vector/usage_pattern_study.md
@@ -0,0 +1,259 @@
# `device_vector` Usage Pattern Study

This document surveys real-world usage of `thrust::device_vector`,
`dpct::device_vector`, and alternative device vector implementations across
open-source projects. It serves as supporting evidence for the design
decisions in the [device_vector RFC](README.md).

## Thrust (`thrust::device_vector`) — Native CUDA Projects

GitHub code search reports ~2,520 files containing `thrust::device_vector`
across the platform, spanning AI/ML, HPC, scientific computing, robotics,
graph analytics, and databases. A survey of notable projects follows.

### Sparse BLAS (spblas-reference)

[spblas-reference](https://github.com/SparseBLAS/spblas-reference) (sparse
BLAS standard reference implementation) demonstrates the minimal-but-dominant
pattern: `device_vector` as an **RAII device memory manager and host-device
data shuttle**:

1. **Constructing from `std::vector`** (~90% of uses) — bulk host-to-device
   transfer at setup time.
2. **Allocating output buffers by size** — e.g. after a symbolic phase
   computes the output NNZ, a `device_vector` is constructed with just a count.
3. **Extracting raw device pointers** via `.data().get()` — every
   `device_vector` is ultimately unwrapped to a raw pointer for passing
   to library APIs (`csr_view`, `std::span`).
4. **Copying results back to host** via `thrust::copy(d.begin(), d.end(),
   host.begin())` — used in every test for verification.

Notably absent from spblas: element-level access (`operator[]`), `resize()`,
`push_back()`, `insert()`/`erase()`, and device-side algorithms on iterators.

### AI/ML Projects

**Notable finding:** A few high-profile, performance-sensitive AI/ML codebases
have **explicitly moved away from `thrust::device_vector`**, while other ML
projects remain heavy users.

- [CUTLASS](https://github.com/NVIDIA/cutlass) (NVIDIA, 9.5k stars) — Uses
  `device_vector` in **30 files**, but only in examples and tests as
  scaffolding. Pattern: construct from host data, extract raw pointer via
  `.data().get()`, pass to GEMM kernels. Never used in hot-path
  implementations.
- [FAISS](https://github.com/facebookresearch/faiss) (Meta, 39.7k stars) —
  **Rolled their own `DeviceVector`** instead of using
  `thrust::device_vector`. Reasons cited: control over streams, avoiding
  unwanted `T()` initialization on `resize()`, and a custom memory growth
  strategy (power-of-2 below 4MB, 1.25x to 128MB, exact above).
- [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) (23.8k stars) —
  Uses `device_vector` in **24 files** for GPU kernel implementations (mode,
  graph reindex, CTC align, kron). Pattern: temporary containers within GPU
  kernels combined with `thrust::sort_by_key`, `thrust::reduce_by_key`;
  raw pointers or `begin()` are passed directly into Thrust APIs. Some
  [usage of element access after a Thrust API](https://github.com/PaddlePaddle/Paddle/blob/d6e489cc39412ac278bbc1dda352742dddb7e57d/paddle/phi/kernels/funcs/mode.h#L189).
- [cuDF](https://github.com/rapidsai/cudf) (RAPIDS, 9.6k stars) —
  **Explicitly discourages `thrust::device_vector`** in its developer guide.
  Recommends `rmm::device_uvector` for uninitialized allocation and
  stream-ordered operations.
- [cuML](https://github.com/rapidsai/cuml) (RAPIDS, 5.2k stars) — Only **4
  files** still use it. The developer guide discourages it in favor of
  `MLCommon::device_buffer` for stream-safe allocation.
- [H2O4GPU](https://github.com/h2oai/h2o4gpu) (466 stars) — Heavy user
  (**17 files**) for K-means, GLM, TSVD, ARIMA. Notable patterns: arrays of
  `device_vector` pointers for multi-GPU (`thrust::device_vector<T>
  *centroid_dots[n_gpu]`), `thrust::inner_product()` for convergence checks,
  raw pointer extraction for cuBLAS/cuSOLVER calls. Access is via raw
  pointers or `begin()`/`cbegin()` into Thrust APIs, with element-wise host
  access used only for printing.

### HPC / Scientific Computing / Graph Analytics

These domains remain the heaviest `thrust::device_vector` users:

- [Gunrock](https://github.com/gunrock/gunrock) (1k stars) — **49 files**,
  the heaviest user among notable projects. Type alias
  `device_vector_t = thrust::device_vector<T>`. Used for BFS outputs and
  graph frontier data. Raw pointers via `.data().get()`; uses `resize()` and
  `begin()`/`end()` with Thrust APIs. Copies to `ostream` from `device_ptr`
  for printing.
- [AmgX](https://github.com/NVIDIA/AMGX) (NVIDIA, 662 stars) — Custom
  allocator wrapper using `cudaMallocAsync`/`cudaFreeAsync` for stream-ordered
  allocation, passed as `thrust::device_vector`'s second template parameter.
  Extracts raw pointers; some direct element access after Thrust APIs.
- [GPU-Voxels](https://github.com/fzi-forschungszentrum-informatik/gpu-voxels)
  (315 stars) — **29 files**, robotics collision detection. Class members for
  octree nodes and voxel lists. Tracks allocations via
  `thrust::device_vector`.
- [ISCE3](https://github.com/isce-framework/isce3) (204 stars) — SAR radar,
  **21 files**.
Class members for satellite orbit data
  (`thrust::device_vector _position, _velocity`). Extracts raw pointers to
  pass directly to kernels. Uses `resize()`.
- [Feltor](https://github.com/feltor-dev/feltor) (38 stars) — Plasma physics.
  Type aliases as vocabulary types:
  `using DVec = thrust::device_vector`.

### Consolidated Construction Patterns (ordered by frequency)

1. `thrust::device_vector<T> d_v = h_v;` (copy from host_vector)
2. `thrust::device_vector<T> d_v(N);` (sized, value-initialized)
3. `thrust::device_vector<T> d_v(N, val);` (sized with fill value)
4. `thrust::device_vector<T> d_v(ptr, ptr + N);` (from host pointer range)
5. `new thrust::device_vector<T>(N);` (heap-allocated, multi-GPU)

### Why AI/ML Projects Rejected `thrust::device_vector`

The reasons cited by FAISS, cuDF, and cuML for moving away are instructive
for our design:

1. **Unwanted value initialization** — `resize()` and sized construction
   zero-initialize elements via a device kernel. For large temporary buffers
   this is wasted work. (Supports our open question on `no_init_t` tags.)
2. **No stream/queue parameter** — Operations are synchronous or use a
   default stream, preventing overlap with other work.
3. **Header includes device code** — Forces `.cu` compilation even for host
   code that just manages `device_vector`s.
## dpct (`dpct::device_vector`) — Migrated CUDA-to-SYCL Projects

A broader survey of `dpct::device_vector` usage across ~18 repositories
(111 code results on GitHub) shows additional patterns beyond the
spblas-minimal case:

**Projects surveyed include:**
- [HeCBench](https://github.com/ORNL/HeCBench) (ORNL, 285+ stars) — HPC
  benchmark suite
- [oneAPI-samples](https://github.com/oneapi-src/oneAPI-samples) (Intel,
  1139+ stars) — radix sort migration samples
- [SYCLomatic-test](https://github.com/oneapi-src/SYCLomatic-test) —
  official compatibility test suite
- Various sparse matrix, radio astronomy, and optimization codes

**Consolidated operation frequency:**

| Operation | Frequency | Example |
|---|---|---|
| Construction from size | Very high | `dpct::device_vector<T> v(N)` |
| Assignment from `std::vector` | Very high | `d_vec = h_vec` (H2D) |
| `begin()`/`end()` for algorithms | Very high | `sort(policy, dv.begin(), dv.end())` |
| `data()` + raw pointer extraction | Very high | `get_raw_pointer(dv.data())` for kernels |
| Copy back to host | High | `std::copy(policy, dv.begin(), dv.end(), h.begin())` or `h = d` |
| Construction from `std::vector` | High | `dpct::device_vector<T> dv(host_vec)` |
| `size()` | Medium | For bounds checks, kernel launch args |
| `operator[]` | Medium | Host-side element access |
| `resize()` | Medium | Output buffer sizing |
| As class/struct member | Medium | Sparse matrices, particle systems |
| `clear()`, `push_back()`, `insert()`, `erase()` | Low | Mostly in tests, not real workloads |

**Notable real-world patterns:**
- **Default-construct then assign** — oneAPI-samples shows a "reset and
  re-sort" loop: `dpct::device_vector<T> d_keys;` then `d_keys = h_keys;`
  each iteration.
- **Raw pointer extraction is ubiquitous** — nearly every project that passes
  data to SYCL kernels unwraps `device_vector` to a raw pointer.
This
validates our `device_pointer::get()` and the importance of making
`device_pointer` device-copyable for direct kernel capture. This is likely
in part a carryover of CUDA habits, but it also removes any possibility of
iterator overhead for performance-sensitive code, so the pattern is well
motivated.

## Alternatives Built by Projects That Rejected `thrust::device_vector`

Two notable alternatives were built by high-performance projects that found
`thrust::device_vector` insufficient. Understanding their designs is
instructive for our proposal.

### FAISS `DeviceVector` (Meta, 39.7k stars)

[Source: `faiss/gpu/utils/DeviceVector.cuh`](https://github.com/facebookresearch/faiss/blob/main/faiss/gpu/utils/DeviceVector.cuh)

FAISS built a minimal replacement explicitly motivated by three deficiencies
in Thrust (from the class comment): *"has more control over streams, whether
resize() initializes new space with T() (which we don't want), and control
on how much the reserved space grows."* It is restricted to POD types.

**Key design choices:**
- **Explicit `cudaStream_t` on every host mutating operation** — `resize(n, stream)`,
  `append(ptr, n, stream)`, `setAt(i, val, stream)`, `getAt(i, stream)`,
  `reserve(n, stream)`, `reclaim(exact, stream)`.
- **No initialization on `resize()`** — comment: *"Don't bother zero
  initializing the newly accessible memory (unlike thrust::device_vector)"*.
  However, newly allocated raw capacity *is* zeroed.
- **Tiered growth strategy** — power-of-2 below 4M elements, 1.25x up to
  128M elements, exact allocation above. Prevents overallocation for large
  buffers.
- **No iterators, no `operator[]`** — element access only via explicit
  `setAt()`/`getAt()` methods with a stream parameter. No proxy references.
- **Auto-detects host vs device source** — `append()` uses
  `cudaPointerGetAttributes` to pick the right `memcpy` direction.
- **Custom memory resource** — allocates through FAISS's `GpuResources`
  abstraction (pool for temporaries, `cudaMalloc` for persistent data),
  not through CUDA allocator APIs directly.
- **No copy semantics** — move-only via the underlying RAII memory handle.

**What it strips out vs Thrust:** iterators, `operator[]`, `push_back`,
`insert`/`erase`, copy construction, implicit host↔device conversion,
value initialization, STL container compatibility. What it **adds:**
an explicit stream parameter, `append()` with auto-direction detection,
`reclaim()` for capacity shrinking, and return values indicating whether
reallocation occurred.

### RAPIDS `rmm::device_uvector`

[Source: `rmm/include/rmm/device_uvector.hpp`](https://github.com/rapidsai/rmm/blob/main/cpp/include/rmm/device_uvector.hpp)

RMM's `device_uvector` (the "u" stands for uninitialized) is the
recommended replacement for `thrust::device_vector` across the RAPIDS
ecosystem (cuDF, cuML, cuGraph). It was motivated by the same concerns
as FAISS, plus a desire for pluggable memory resources.

**Key design choices:**
- **Explicit `cuda_stream_view` on every operation** — construction,
  resize, reserve, element access, and copy construction all require a
  stream.
- **No initialization** — construction and resize never launch a kernel
  to zero-fill or value-initialize. This is the defining feature.
- **No geometric growth** — `resize()` allocates exactly the requested
  size. RMM's pool memory resources handle allocation performance, so
  container-level overallocation is unnecessary.
- **Deleted default and copy constructors** — a stream must be provided to
  construct. Copying requires an explicit call: `device_uvector(other, stream)`.
- **Iterators are raw `T*` pointers** — usable in device code and Thrust
  algorithms, but dereferencing on the host is undefined behavior. No proxy
  references.
- **No `operator[]`** — element access via explicit `element(i, stream)`
  (synchronous D→H) and `set_element_async(i, val, stream)` (async H→D).
  The async setter deliberately deletes its rvalue overload to prevent
  dangling references.
- **No bulk host↔device transfer API** — no constructor from host data and
  no `assign()` from a host range. Users must call `cudaMemcpy` directly.
- **Pluggable memory resource** — uses `device_async_resource_ref`
  (type-erased, `std::pmr`-style). The default is `cudaMalloc`, but it can
  be a pool, arena, etc.
- **`static_assert(is_trivially_copyable)`** — only trivially copyable
  types are allowed.
- **Implicit conversion to `cuda::std::span`** — lightweight view
  interop.
- **Stores the device ID** — the destructor deallocates on the correct
  device even if a different device is current.

**What it strips out vs Thrust:** `operator[]`, `push_back`, `insert`/`erase`,
`assign`, `clear`, `swap`, implicit copy, host range construction, value
initialization, geometric growth, non-trivial types. What it **adds:**
an explicit stream everywhere, pluggable memory resources, span conversion,
device-aware destruction.

Note: RMM also provides `rmm::device_vector`, which is just a type alias
for `thrust::device_vector<T, rmm::mr::thrust_allocator<T>>` — the same
Thrust interface but with RMM-backed allocation.

## Summary

Most usage treats `device_vector` as a convenient way to allocate and
control the lifetime of device memory.
Usage largely focuses on:
 * copies to and from a host-side vector all at once
 * getting raw pointers to use directly on the device
 * using `begin()` and `end()` iterators as input to algorithms

Host-side element access:
 * Largely not present
 * Where present, mostly in tests / useful for debugging
 * dpct migrations use `operator[]` in some cases, but this may stem from
   CUDA migration patterns rather than intentional design
 * Projects that do need host access (FAISS, RMM) have replaced
   `device_vector` with alternatives that allow access with explicit stream
   synchronization.