feat: Add custom memory pool support for tensor allocation #96

elibol merged 2 commits into NVlabs:main
Conversation
Force-pushed 2f37344 to b909509
- Add `MemPool` wrapping `CUmemoryPool` with device ownership and RAII drop
- `set_device_pool()` registers a pool per device; rejects cross-device pools
- `ExecutionContext::new()` auto-resolves the pool from the stream's device via `pool_for_stream()`
- All scheduling paths (`.sync`, `.await`, `.schedule`, `.sync_on`, `.async_on`) carry the pool
- Add cuda-async integration tests covering lifecycle, scheduling freeze, and cross-device rejection
I've refactored the implementation since the last push; it's cleaner and less intrusive now. @elibol PTAL when you're available, thank you!
No immediate concerns! We already reviewed the approach extensively. I will begin looking at these once we have v0.0.2 cut. I'd like to have your PRs merged in after that, so we can iterate and address bugs while folks try out v0.0.2. Hoping to get it out soon. We might just cut v0.1.0 immediately after v0.0.2. Just trying to align versions and features in a manageable way. I'll prioritize this and your other PR next week 🙂
Sounds good to me, thank you for the update.

LGTM! Merging!
props.allocType = cuda_bindings::CUmemAllocationType_enum_CU_MEM_ALLOCATION_TYPE_PINNED;
props.handleTypes = cuda_bindings::CUmemAllocationHandleType_enum_CU_MEM_HANDLE_TYPE_NONE;
props.location.type_ = cuda_bindings::CUmemLocationType_enum_CU_MEM_LOCATION_TYPE_DEVICE;
props.location.__bindgen_anon_1.id = self.ordinal as c_int;
I have multiple CUDA environments, so the generated bindings vary accordingly.
@ur4t Thanks for the investigation! Could you share which CUDA version you're using? Does that include pre-13.x?
> Could you share which CUDA version you're using? Does that include pre-13.x?
@goog00 CUDA 13.0, installed along with PyTorch. I have also checked cuda.h in CUDA 13.1; it uses a plain int. Only CUDA 13.2 uses a union wrapper.
@ur4t Thank you for checking! I'll open a follow-up PR to handle the version compatibility across different CUDA releases.
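As a rough illustration only (not the actual follow-up PR): one way to absorb the bindgen difference is a small accessor behind a Cargo feature. The feature name `cuda_13_2_bindings` and the exact struct paths below are assumptions, not confirmed project API:

```rust
// Hypothetical sketch: hide the difference between CUDA headers where
// CUmemLocation's id is a plain int vs. wrapped in a bindgen anonymous union.
// The `cuda_13_2_bindings` feature name is invented for illustration.
fn set_location_id(location: &mut cuda_bindings::CUmemLocation, ordinal: i32) {
    #[cfg(feature = "cuda_13_2_bindings")]
    {
        // CUDA 13.2-style bindings: id sits inside an anonymous union.
        location.__bindgen_anon_1.id = ordinal as std::os::raw::c_int;
    }
    #[cfg(not(feature = "cuda_13_2_bindings"))]
    {
        // Earlier bindings: id is a plain int field.
        location.id = ordinal as std::os::raw::c_int;
    }
}
```

A build script that probes the installed CUDA headers could enable the feature automatically, so users would not have to pick the right flag by hand.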
Summary
- `MemPool` RAII wrapper in `cuda-core` with create/default/threshold/destroy lifecycle
- `set_device_pool`/`get_device_pool`/`clear_device_pool` on `AsyncDeviceContext`; the pool is frozen into `ExecutionContext` at scheduling time, so in-flight ops are unaffected by later changes
- `ExecutionContext::alloc_async()` routes through `cuMemAllocFromPoolAsync` when a pool is set and falls back to `cuMemAllocAsync` otherwise; all tensor/copy callsites updated
- `set_device_pool` validates device affinity; `.schedule()` now reads the pool (it was always `None` before), aligned with the `.sync()`/`.await` paths
- No behavior change when no pool is configured
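For orientation, here is a rough usage sketch of the API described above. It is not code from this PR; the constructor arguments, the `zeros` tensor constructor, and the error handling are assumptions made for illustration:

```rust
// Hypothetical usage sketch based on the names in this PR's summary;
// exact signatures and module paths are assumptions, not the project's API.
use cuda_async::AsyncDeviceContext;
use cuda_core::MemPool;

async fn run() -> Result<(), Box<dyn std::error::Error>> {
    // Assumed constructors taking a device ordinal.
    let ctx = AsyncDeviceContext::new(0)?;
    let pool = MemPool::new(0)?;

    // Register the pool for the device; set_device_pool validates device
    // affinity and rejects a pool created on a different device.
    ctx.set_device_pool(pool)?;

    // Ops scheduled from here on freeze the pool into their ExecutionContext,
    // so alloc_async routes through cuMemAllocFromPoolAsync for them even if
    // the device pool is changed or cleared later.
    let _tensor = ctx.zeros([1024, 1024]).schedule().await?; // `zeros` is assumed

    // Clearing the pool only affects ops scheduled after this point;
    // subsequent allocations fall back to cuMemAllocAsync.
    ctx.clear_device_pool();
    Ok(())
}
```

The point of freezing the pool at scheduling time is that a later `set_device_pool`/`clear_device_pool` call cannot change where already-scheduled work allocates from.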
Test plan
- `cargo test -p cuda-async --test pool_allocation` passes
- `./scripts/run_all.sh` passes