
feat: Add custom memory pool support for tensor allocation #96

Merged

elibol merged 2 commits into NVlabs:main from goog00:allocator on Apr 29, 2026

Conversation

Contributor

goog00 commented Apr 15, 2026

Summary

  • Pool primitives: MemPool RAII wrapper in cuda-core with create/default/threshold/destroy lifecycle
  • Device-level configuration: set_device_pool / get_device_pool / clear_device_pool on AsyncDeviceContext; pool is frozen into ExecutionContext at scheduling time so in-flight ops are unaffected by later changes
  • Unified allocation path: ExecutionContext::alloc_async() routes through cuMemAllocFromPoolAsync when a pool is set and falls back to cuMemAllocAsync otherwise (see the sketch after this list). All tensor/copy callsites updated
  • Safety: set_device_pool validates device affinity; .schedule() now reads the pool (it previously always passed None), aligning it with .sync() / .await
  • Tests: 12 integration tests covering pool lifecycle, allocation routing, freeze semantics, and the explicit .schedule() path

No behavior change when no pool is configured.
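
To make the routing concrete, here is a minimal sketch of the allocation path. Only cuMemAllocFromPoolAsync / cuMemAllocAsync and their argument order are the real CUDA driver API; ExecutionContext's pool field, the raw() accessor, and the error handling are illustrative assumptions, not this crate's actual internals.

// Sketch only; names other than the CUDA driver entry points are illustrative.
impl ExecutionContext {
    pub unsafe fn alloc_async(
        &self,
        bytes: usize,
        stream: cuda_bindings::CUstream,
    ) -> Result<cuda_bindings::CUdeviceptr, cuda_bindings::CUresult> {
        let mut ptr: cuda_bindings::CUdeviceptr = 0;
        let status = match &self.pool {
            // A pool was frozen into this context at scheduling time:
            // route the allocation through it.
            Some(pool) => cuda_bindings::cuMemAllocFromPoolAsync(
                &mut ptr, bytes, pool.raw(), stream,
            ),
            // No pool configured: stream-ordered default allocator,
            // matching the pre-existing behavior.
            None => cuda_bindings::cuMemAllocAsync(&mut ptr, bytes, stream),
        };
        if status == 0 { Ok(ptr) } else { Err(status) } // 0 == CUDA_SUCCESS
    }
}

Because the pool is read once when work is scheduled, a later set_device_pool or clear_device_pool call cannot retarget allocations for operations already in flight.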

Test plan

  • cargo test -p cuda-async --test pool_allocation passes
  • ./scripts/run_all.sh passes


copy-pr-bot Bot commented Apr 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

goog00 force-pushed the allocator branch 3 times, most recently from 2f37344 to b909509 on April 24, 2026 at 06:42
- Add MemPool wrapping CUmemoryPool with device ownership and RAII drop
- set_device_pool() registers the pool per-device and rejects cross-device pools (see the sketch after this list)
- ExecutionContext::new() auto-resolves pool from stream's device via pool_for_stream()
- All scheduling paths (.sync, .await, .schedule, .sync_on, .async_on) carry pool
- Add cuda-async integration tests covering lifecycle, scheduling freeze, and cross-device rejection
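
The wrapper and the cross-device check might look roughly like the sketch below. Everything here is illustrative: cuMemPoolDestroy is the real driver call, but the field names, the PoolError type, and the AsyncDeviceContext layout are assumptions, not the crate's actual definitions.

// Illustrative shape only; the crate's real types and signatures may differ.
#[derive(Debug)]
pub enum PoolError {
    // The pool was created on a different device than this context.
    WrongDevice { expected: i32, got: i32 },
}

pub struct MemPool {
    raw: cuda_bindings::CUmemoryPool,
    device: i32, // ordinal of the owning device
}

impl Drop for MemPool {
    fn drop(&mut self) {
        // RAII: release the driver-side pool together with the wrapper.
        let _ = unsafe { cuda_bindings::cuMemPoolDestroy(self.raw) };
    }
}

pub struct AsyncDeviceContext {
    ordinal: i32,
    pool: Option<MemPool>,
    // other per-device state elided
}

impl AsyncDeviceContext {
    // Registers the pool for this device; rejects cross-device pools.
    pub fn set_device_pool(&mut self, pool: MemPool) -> Result<(), PoolError> {
        if pool.device != self.ordinal {
            return Err(PoolError::WrongDevice {
                expected: self.ordinal,
                got: pool.device,
            });
        }
        self.pool = Some(pool);
        Ok(())
    }
}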
Contributor Author

goog00 commented Apr 24, 2026

I've refactored the implementation since the last push; it's now cleaner and less intrusive.
Happy to discuss any concerns.

@elibol PTAL when you're available, thank you!

Collaborator

elibol commented Apr 24, 2026

No immediate concerns! We already reviewed the approach extensively. I will begin looking at these once we have v0.0.2 cut. I'd like to have your PRs merged in after that, so we can iterate and address bugs while folks try out v0.0.2. Hoping to get it out soon. We might just cut v0.1.0 immediately after v0.0.2. Just trying to align versions and features in a manageable way.

I'll prioritize this and your other PR next week 🙂

jollylili added the P0 label on Apr 24, 2026
Contributor Author

goog00 commented Apr 25, 2026

Sounds good to me, thank you for the update.

Collaborator

elibol commented Apr 29, 2026

LGTM! Merging!

elibol merged commit c831c24 into NVlabs:main on Apr 29, 2026
4 checks passed
Comment thread: cuda-core/src/runtime.rs
props.allocType = cuda_bindings::CUmemAllocationType_enum_CU_MEM_ALLOCATION_TYPE_PINNED;
props.handleTypes = cuda_bindings::CUmemAllocationHandleType_enum_CU_MEM_HANDLE_TYPE_NONE;
props.location.type_ = cuda_bindings::CUmemLocationType_enum_CU_MEM_LOCATION_TYPE_DEVICE;
props.location.__bindgen_anon_1.id = self.ordinal as c_int;
ur4t (Contributor)


@goog00 @elibol id in CUmemLocation_st is actually an int field which should be accessed directly, but bindgen sometimes produces a c_int as expected and sometimes produces a union wrapper struct. Investigating.

ur4t (Contributor)


@goog00 @elibol id in CUmemLocation_st is a union wrapper in CUDA 13.2. I have multiple CUDA environments, so the generated bindings vary accordingly.

goog00 (Contributor, Author)


I have multiple CUDA environments, so the generated bindings vary accordingly.

@ur4t Thanks for the investigation! Could you share which CUDA versions you're using, including any pre-13.x ones?

ur4t (Contributor)


Could you share which CUDA versions you're using, including any pre-13.x ones?

@goog00 CUDA 13.0, installed alongside PyTorch. I have also checked cuda.h in CUDA 13.1; it uses a plain int. Only CUDA 13.2 uses a union wrapper.

goog00 (Contributor, Author)


@ur4t Thank you for checking! I'll open a follow-up PR to handle the version compatibility across different CUDA releases.

goog00 (Contributor, Author)


@ur4t @elibol Fixed in #128
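
For readers following along, one plausible shape for such a version shim is sketched below. It is purely hypothetical: the actual fix lives in #128 and may differ, and cuda_13_2_bindings is an invented cfg name that a build script could set after probing the generated bindings.

// Hypothetical shim; the real fix is in #128 and may differ.
#[inline]
fn set_location_id(loc: &mut cuda_bindings::CUmemLocation, ordinal: std::os::raw::c_int) {
    // CUDA 13.2 headers wrap `id` in a union, so bindgen emits an
    // anonymous union field. Writing a Copy field of a union is safe;
    // only reads require `unsafe`.
    #[cfg(cuda_13_2_bindings)]
    {
        loc.__bindgen_anon_1.id = ordinal;
    }
    // Pre-13.2 headers declare `id` as a plain int.
    #[cfg(not(cuda_13_2_bindings))]
    {
        loc.id = ordinal;
    }
}

The call site shown in the review hunk above would then read something like set_location_id(&mut props.location, self.ordinal as c_int).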

