Skip to content

accel/amdxdna: implement unlimited dev bo support#1233

Open
wendyliang25 wants to merge 2 commits intoamd:mainfrom
wendyliang25:accel-dev-bo
Open

accel/amdxdna: implement unlimited dev bo support#1233
wendyliang25 wants to merge 2 commits intoamd:mainfrom
wendyliang25:accel-dev-bo

Conversation

@wendyliang25
Copy link
Copy Markdown
Contributor

@wendyliang25 wendyliang25 commented Apr 2, 2026

implement unlimited dev bo support.
AIE2 executes instructions on a buffer which is internal to XRT. Due to TLB requirement from NPU firmware, each chunk of the memory has 64MB limitation. If we want to have more than 64MB npu heap buffer, we need to have multiple chunks.

Today the staging driver in drivers/accel/amdxdna only supports single chunk. this patch set enable multiple chunks support.
In the beginning, when the shim opens drm client, it will allocate first 64MB chunk of heap memory. At runtime, if heap buffer allocation fails, userspace shim layer will request to expand the heap, every expansion will add one 64MB chunk until it meets the requirement or hit the max heap size, and then try to allocate the buffer from heap again.
also see PR: #1234

Test with baremetal Linux with/without force_iova= module parameter, and with QEMU KVM guest.

example output from QEMU KVM:

installed_testcases/xrt_unlimited_bo/xrt_unlimited_bo
-------------------------------------------------------
    [#37] : runlist
-------------------------------------------------------
        Going to start the runlist #5
                Sync bos to device
                Sync bos to device
                Sync ofm from device
                Validating ofm  passed
                Sync ofm from device
                Validating ofm  passed
-------------------------------------------------------
    [#38] : runlist
-------------------------------------------------------
        Going to start the runlist #6
                Sync bos to device
                Sync bos to device
                Sync bos to device
                Sync ofm from device
                Validating ofm  passed
                Sync ofm from device
                Validating ofm  passed
                Sync ofm from device
                Validating ofm  passed
INFO: TEST PASSED
                                       .--.
                _______             .-"  .'
        .---u"""       """"---._  ."    %
      .'                        "--.    %
 __.--'  o                          "".. "
(____.                                  ":
 `----.__                                 ".
         `----------__                     ".
               ".   . ""--.                 ".
                 ". ". bIt ""-.              ".
                   "-.)        ""-.           ".
                                   "".         ".
                                      "".       ".
                                         "".      ".
                                            "".    ".
                      ^~^~^~^~^~^~^~^~^~^~^~^~^"".  "^~^~^~^~^
                                            ^~^~^~^  ~^~
                                                 ^~^~^~


=== PASSED: configs/runlist_bo_on_another_bank.ini ===

===============================
Results: 15 passed, 0 failed out of 15 total
===================s

Copilot AI review requested due to automatic review settings April 2, 2026 20:37
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds “unlimited” device BO support for the amdxdna stack by enabling the device heap to grow in 64MB chunks (up to a configured maximum) and updating both the userspace shim and kernel driver to coordinate heap expansion and firmware mappings.

Changes:

  • Kernel: track a per-client list of DEV_HEAP chunks, grow committed heap on demand, and notify firmware/hwctxs when new heap chunks are added.
  • Userspace shim: retry DEV BO allocations on -EAGAIN by expanding the heap one 64MB chunk at a time (up to 512MB).
  • vxdna virt layer: reserve a large VA range for a contiguous heap layout and map new heap chunks into it with fixed-address mmaps.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/vxdna/src/vaccel_amdxdna.h Adds heap VA reservation state and adjusts teardown/error wrapping behavior.
src/vxdna/src/vaccel_amdxdna.cpp Maps DEV_HEAP chunks into a reserved VA window; tracks committed heap growth.
src/shim/kmq/pcidev.cpp Retries DEV BO allocation by expanding heap on EAGAIN.
src/shim/buffer.h Moves heap page size constant to a shared header; adds expandable buffer ctor.
src/shim/buffer.cpp Implements expandable heap buffer allocation via multiple DRM BO chunks.
drivers/accel/amdxdna/npu6_regs.c Plumbs dev_heap_max_size for NPU6.
drivers/accel/amdxdna/npu5_regs.c Plumbs dev_heap_max_size for NPU5.
drivers/accel/amdxdna/npu4_regs.c Plumbs dev_heap_max_size for NPU4.
drivers/accel/amdxdna/npu1_regs.c Plumbs dev_heap_max_size for NPU1.
drivers/accel/amdxdna/amdxdna_pci_drv.h Adds heap expansion notification hook and client heap chunk tracking fields.
drivers/accel/amdxdna/amdxdna_pci_drv.c Initializes and tears down per-client heap chunk list.
drivers/accel/amdxdna/amdxdna_gem.h Adds per-heap-chunk list linkage in GEM objects.
drivers/accel/amdxdna/amdxdna_gem.c Implements committed-heap allocation windowing, chunk expansion, and DEV BO vmap across chunks.
drivers/accel/amdxdna/aie2_pci.h Adds AIE2 heap size constant and new FW API prototypes.
drivers/accel/amdxdna/aie2_pci.c Wires heap expansion notification op into AIE2 ops.
drivers/accel/amdxdna/aie2_msg_priv.h Adds new mailbox opcode for “add host buffer”.
drivers/accel/amdxdna/aie2_message.c Splits host buffer mapping into MAP + ADD messages for multi-chunk heaps.
drivers/accel/amdxdna/aie2_ctx.c Pins/unpins all heap chunks and maps/extends heap buffers per hwctx; implements notify hook.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@maxzhen
Copy link
Copy Markdown
Collaborator

maxzhen commented Apr 2, 2026

Hi @wendyliang25 , can we split the change into multiple pull requests. The first PR is for driver change only. This will ensure the driver change does not break backward compatibility. Then, we can file another PR for user space change. I'd like to set up the rule that UMD and KMD changes are not allowed in one PR to avoid breaking changes for old UMD.

Copilot AI review requested due to automatic review settings April 2, 2026 22:32
@wendyliang25
Copy link
Copy Markdown
Contributor Author

Hi @wendyliang25 , can we split the change into multiple pull requests. The first PR is for driver change only. This will ensure the driver change does not break backward compatibility. Then, we can file another PR for user space change. I'd like to set up the rule that UMD and KMD changes are not allowed in one PR to avoid breaking changes for old UMD.

Hi @maxzhen , i have split the original PR, this one only has the kernel driver change, the UMD (shim and vxdna) changes are in this PR: #1234. could you check again? thanks

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

drivers/accel/amdxdna/amdxdna_gem.c:1110

  • amdxdna_gem_pin_nolock()/amdxdna_gem_unpin() now no-op for AMDXDNA_BO_DEV, but call sites like amdxdna_drm_sync_bo_ioctl() rely on pinning to ensure a stable page backing before calling amdxdna_gem_vmap() (which will invoke the DEV BO .vmap path and expects heap-chunk pages/sgt to exist). With the current change, syncing/vmapping a DEV BO can fail (or map invalid pages) unless some other path already pinned all heap chunks.

Consider making DEV BO pin/unpin propagate to the underlying heap-chunk(s) (e.g., pin/unpin the chunk list, or at least the chunk range that covers abo->mm_node), or special-case the sync/vmap paths to pin the heap chunks when operating on a DEV BO.

	if (abo->type == AMDXDNA_BO_DEV)
		return 0; /* Heap chunks are pinned at expansion time */

	if (is_import_bo(abo))
		return 0;

drivers/accel/amdxdna/amdxdna_pci_drv.c:111

  • PR description says the driver allocates the first 64MB heap chunk when the DRM client is opened, but amdxdna_drm_open() still only initializes client state (no DEV_HEAP BO is created/queued there). The only creation path for heap chunks appears to be AMDXDNA_BO_DEV_HEAP in amdxdna_drm_create_bo_ioctl().

Either update the PR description to match the actual behavior (heap created by userspace via ioctl), or add the missing kernel-side allocation at open if that’s required for the design.

	init_srcu_struct(&client->hwctx_srcu);
	xa_init_flags(&client->hwctx_xa, XA_FLAGS_ALLOC);
	mutex_init(&client->mm_lock);
	INIT_LIST_HEAD(&client->dev_heap_chunks);

	mutex_lock(&xdna->dev_lock);
	list_add_tail(&client->node, &xdna->client_list);
	mutex_unlock(&xdna->dev_lock);

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings April 3, 2026 01:33
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

drivers/accel/amdxdna/npu1_regs.c:145

  • dev_heap_max_size for NPU1 is set to AIE2_DEVM_SIZE (64MB) while NPU4/5/6 use AIE2_DEVM_MAX_SIZE (512MB). If NPU1 hardware/firmware supports the same multi-chunk heap expansion, this will unintentionally disable “unlimited dev bo” on NPU1. If the smaller max is intentional for NPU1, please add a brief comment explaining why to avoid confusion.
	.dev_mem_base      = AIE2_DEVM_BASE,
	.dev_mem_size      = AIE2_DEVM_SIZE,
	.default_vbnv      = "RyzenAI-npu1",
	.dev_heap_max_size = AIE2_DEVM_SIZE,
	.device_type       = AMDXDNA_DEV_TYPE_KMQ,
	.dev_priv          = &npu1_dev_priv,
	.fw_feature_tbl    = npu1_fw_feature_table,
	.ops               = &aie2_ops,

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings April 3, 2026 06:20
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings April 6, 2026 06:01
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings April 7, 2026 07:38
@wendyliang25 wendyliang25 requested a review from houlz0507 April 7, 2026 07:38
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Implement userspace-driven device heap expansion for the AMDXDNA
accelerator driver. Instead of allocating a single large heap upfront,
the heap starts with a single 64MB chunk and grows on demand when
device BO allocations fail with -EAGAIN.

Each heap chunk is vmapped once at creation time, caching the kernel
virtual address in chunk->mem.kva. DEV BO CPU access (e.g. command
buffer fill) uses direct pointer arithmetic into the chunk's cached
kva, avoiding per-access vmap/kvmalloc overhead. Command buffer DEV
BOs cache their resolved kva at creation for zero-overhead access on
the submission hot path.

Key changes:
- amdxdna_gem.c: Add heap chunk creation, expansion, and drm_mm-based
  device address allocation. Implement amdxdna_gem_dev_bo_kva() to
  resolve DEV BO kernel addresses from chunk kva, and clflush using
  cached chunk mappings.
- aie2_ctx.c: Add heap chunk pin/unpin and firmware mapping helpers.
  Handle firmware notification of heap expansion via
  aie2_notify_heap_expand() with MSG_OP_ADD_HOST_BUFFER. Cache
  cmdbuf kva at hwctx init for lock-free submission.
- aie2_message.c: Add aie2_add_host_buf() for firmware heap expansion
  messages. Use cached cmdbuf kva directly in execbuf paths.
- npu*_regs.c: Add max heap size (512MB) and chunk size (64MB) device
  info fields.

Signed-off-by: Wendy Liang <[email protected]>
Set gobj->import_attach in amdxdna_gem_prime_import() so that
drm_gem_shmem_vmap_locked() takes the dma-buf import path instead of
calling drm_gem_get_pages() on an object with a NULL file pointer.

Without this, imported BOs (e.g. from virtio-gpu in QEMU) trigger a
WARNING in drm_gem_get_pages and vmap fails with -EINVAL.

Signed-off-by: Wendy Liang <[email protected]>
@wendyliang25
Copy link
Copy Markdown
Contributor Author

retest this please

}

/* Unpin all heap chunks. Caller must hold client->mm_lock. */
static void aie2_unpin_heap_chunks_locked(struct amdxdna_client *client)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is not needed and you can inline it to unpin_heap_chunks. Just like pin chunks.

list_for_each_entry_continue(chunk, &client->dev_heap_chunks,
heap_chunk_node) {
if (!AIE_FEATURE_ON(&xdna->dev_handle->aie,
AIE2_ADD_HOST_BUFFER)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why check here? Maybe it is good enough to check when adding more chunks?


chunk = list_first_entry(&client->dev_heap_chunks,
struct amdxdna_gem_obj, heap_chunk_node);
addr = amdxdna_pasid_on(client) ? heap_uva : chunk->mem.dma_addr;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe amdxdna_obj_dma_addr()?

}

addr = amdxdna_pasid_on(client) ?
heap_uva + offset : chunk->mem.dma_addr;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the heap_uva the first chunk uva? heap_uva + offset would be the uva of other chunks?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @houlz0507, the heap_uva is the the first chunk uva, when the first dev_heap BO created from shim, shim reserved the max heap size of address for the whole heap. and thus, in case of PASID, the uva of chunks of the whole heap will be continuous.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i used heap_uva + offset to get the chunks' here to line up with the heap expansion, in dev heap expansion, we need to notify the firmware, i did this in bo creation ioctl which i also lock the device as sending message to firmware needs it. in bo creation, since the chunk is not mapped yet, the uva is not there and thus, use heap_uva + offset.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @houlz0507 , had a discussion with @maxzhen . If we notify firmware before mmap(), the firmware will get fake address. Since when dev bo is created, the chunk of heap should be mapped already, will move the firmware expansion notification to when the dev bo is created.

ret = aie2_add_host_buf(xdna->dev_handle, hwctx->fw_ctx_id,
addr, new_chunk->mem.size);
if (ret) {
XDNA_WARN(xdna, "Add host buf hwctx %d failed, retrying",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does retrying help?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as this function is to notify firmware to add dev heap BO, i was thinking if there is any case in heavy traffic, the cmd will get lost, and thus retry. if the underline mailbox implementation covers that, can remove the retry.

* device address cursor cannot be rolled back. Commit the chunk
* regardless — a failed hwctx is degraded and may need reset.
*/
amdxdna_for_each_hwctx(client, hwctx_id, hwctx) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge with above pin loop?

* Caller ensures device is resumed before notification, so all
* hwctxs in the xarray have valid firmware contexts.
*/
amdxdna_for_each_hwctx(client, hwctx_id, hwctx) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mm_lock required?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @houlz0507 , this function i notify with the newly added chunk, it doesn't change the dev_heap, the mm_lock is for the heaps and bo usage protection, don't look like i need mm_lock here. as I just walk through the hwctx lists which is not under mm_lock protection.

ret = drm_gem_vmap(to_gobj(abo), &map);
if (ret)
XDNA_ERR(abo->client->xdna, "Vmap bo failed, ret %d", ret);
drm_err(to_gobj(abo)->dev, "Vmap bo failed, ret %d", ret);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change to drm_err()?


drm_WARN_ON(&xdna->ddev, !mutex_is_locked(&xdna->dev_lock));

if (!xdna->dev_info->ops->notify_heap_expand) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check feature here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the ADD host bo is AIE2 specific feature, if we want to check the feature, will need to add a wrapper for the amdxdna_gem.c to call.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants