Skip to content

shim and vxdna: update to support unlimited dev no allocation#1234

Open
wendyliang25 wants to merge 2 commits intoamd:mainfrom
wendyliang25:accel-dev-bo-shim
Open

shim and vxdna: update to support unlimited dev no allocation#1234
wendyliang25 wants to merge 2 commits intoamd:mainfrom
wendyliang25:accel-dev-bo-shim

Conversation

@wendyliang25
Copy link
Copy Markdown
Contributor

See details description in this PR: https://github.com/amd/xdna-driver/pull/1233

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the shim and vxdna virt-accel implementation to support “unlimited” device BO allocations by enabling a dynamically expandable device heap made up of multiple 64MB chunks (up to a configured maximum).

Changes:

  • Add DEV_HEAP VA-range reservation and fixed-address mappings in vxdna to place heap chunks contiguously in userspace VA.
  • Add an expandable heap buffer in the shim with a max size (512MB) and retry/expand behavior on allocation retry errors.
  • Adjust ccmd exception handling/logging behavior to avoid QEMU-side cascading failures.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/vxdna/src/vaccel_amdxdna.h Adds heap VA tracking in vxdna_context, BO type accessor, and changes ccmd error-wrap behavior.
src/vxdna/src/vaccel_amdxdna.cpp Implements heap VA reservation + MAP_FIXED mapping for DEV_HEAP and updates error response behavior.
src/shim/kmq/pcidev.cpp Creates an expandable DEV_HEAP BO on first open and retries DEV BO allocation by expanding heap on retry error.
src/shim/buffer.h Introduces shared heap_page_size constant and new expandable buffer constructor signature.
src/shim/buffer.cpp Implements expandable buffer constructor and refines expand() behavior for heap growth.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

maxzhen
maxzhen previously approved these changes Apr 2, 2026
Copy link
Copy Markdown
Collaborator

@maxzhen maxzhen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comment and clarify locking schema when we need to expand heap.

Copilot AI review requested due to automatic review settings April 3, 2026 06:41
@wendyliang25
Copy link
Copy Markdown
Contributor Author

wendyliang25 commented Apr 3, 2026

Please add comment and clarify locking schema when we need to expand heap.

Hi @maxzhen , i have added the comments about the locking:

  // we need to lock when we are trying to allocate DEV BO as
  // it is possible that the heap is being expanded by another thread.

Please check again. thanks

@wendyliang25 wendyliang25 requested a review from maxzhen April 3, 2026 06:46
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/shim/kmq/pcidev.cpp:19

  • Debug.num_heap_pages can be configured to 0, which makes heap_sz become 0 and results in constructing an expandable DEV_HEAP buffer with initial_size == 0 (eventually attempting to create a 0-byte BO). Consider clamping get_heap_num_pages() to at least 1 (or explicitly rejecting 0 with a clear error) so the first-open heap allocation is always valid.
unsigned int
get_heap_num_pages()
{
  static unsigned int num = 0;

  if (!num)
    num = xrt_core::config::detail::get_uint_value("Debug.num_heap_pages", 1);
  return num;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

maxzhen
maxzhen previously approved these changes Apr 3, 2026
Add userspace shim support for the kernel's expandable device heap.
The shim drives heap expansion by catching -EAGAIN from BO allocation
ioctls and requesting heap growth before retrying.

- buffer.cpp: Add bo_buf::expand() to create a new DEV_HEAP BO via
  CREATE_BO ioctl, then mmap it at the correct offset to maintain a
  contiguous virtual address range. Track expansion chunks and unmap
  them on destruction.
- buffer.h: Add m_expand_bo_handles vector and expand() declaration.
- kmq/pcidev.cpp: Catch -EAGAIN in pdev::alloc_bo(), call expand()
  on the heap BO to grow it, and retry the allocation. Log when
  expansion is triggered.

Signed-off-by: Wendy Liang <wendy.liang@amd.com>
Add expandable device heap support in vxdna (QEMU virtio-accel backend)
with contiguous virtual address management and robust error handling.

Heap VA management:
- Reserve a contiguous HEAP_MAX_SIZE (512MB) VA range with PROT_NONE
  on first DEV_HEAP BO creation. Use MAP_FIXED to place each expansion
  chunk at m_heap_base + m_heap_cur_offset, ensuring the guest kernel
  sees a contiguous UVA space matching its expectations.
- Track heap state with m_heap_destroyed flag: once any DEV_HEAP BO is
  destroyed, the contiguous VA invariant is broken so all further
  DEV/DEV_HEAP allocations are rejected with -EINVAL.
- Add bounds check: reject expansion if m_heap_cur_offset + size
  exceeds HEAP_MAX_SIZE with -ENOSPC.
- Move m_heap_cur_offset update after add_bo() and response write to
  avoid permanently wasting VA budget on failed operations.

Error handling:
- vxdna_ccmd_error_wrap: Remove exception re-throwing after catch.
  Re-throwing caused QEMU to send 0x1200 error responses and skip
  fence signaling, hanging the guest indefinitely on sync_wait.
  Now errors are logged and written to the response buffer only.
- write_err_rsp: Log and return instead of throwing when resp resource
  is missing, preventing exceptions from escaping the error wrapper.
- vxdna_bo constructor: Add mmap_target parameter to support MAP_FIXED
  placement. Add get_type() accessor for BO type queries.

Cleanup:
- Clean up reserved VA range in vxdna_context destructor.
- Update Doxygen for vxdna_ccmd_error_wrap to reflect non-re-throwing
  behavior.

Signed-off-by: Wendy Liang <wendy.liang@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants