Skip to content

api-query: Continue device initialization when a backend fails#1032

Open
MarijnS95 wants to merge 1 commit intollvm:mainfrom
Traverse-Research:api-query-continue-on-backend-init-failure
Open

api-query: Continue device initialization when a backend fails#1032
MarijnS95 wants to merge 1 commit intollvm:mainfrom
Traverse-Research:api-query-continue-on-backend-init-failure

Conversation

@MarijnS95
Copy link
Copy Markdown
Contributor

@MarijnS95 MarijnS95 commented Mar 30, 2026

Device::initialize() returns early when any backend fails to initialize, preventing subsequent backends from being discovered. For example, a Vulkan initialization failure (e.g. VK_ERROR_INCOMPATIBLE_DRIVER when MoltenVK is not installed) blocks Metal device discovery on macOS.

Collect backend initialization errors with joinErrors() and return them together instead. Also make lit.cfg.py resilient to an empty device list from api-query.

Test plan

  • Verified api-query outputs Metal device when Vulkan init fails (no MoltenVK)
  • Verified check-hlsl runs Metal tests successfully with this change

🤖 Generated with Claude Code

@MarijnS95 MarijnS95 force-pushed the api-query-continue-on-backend-init-failure branch from f4cbd80 to de099b0 Compare March 30, 2026 13:14
@MarijnS95 MarijnS95 force-pushed the api-query-continue-on-backend-init-failure branch from de099b0 to d9b6e6a Compare March 30, 2026 16:35
@MarijnS95 MarijnS95 requested a review from llvm-beanz March 31, 2026 07:51
@manon-traverse
Copy link
Copy Markdown
Contributor

LGTM

@MarijnS95 MarijnS95 force-pushed the api-query-continue-on-backend-init-failure branch from d9b6e6a to 0fe8047 Compare April 2, 2026 16:17
@bogner
Copy link
Copy Markdown
Contributor

bogner commented Apr 6, 2026

A few things:

  1. This has conflicts due to Restructure how devices are created and automatically cleaned up. #1035 and needs to be updated
  2. Wouldn't we also need a change in api-query.cpp to not abort after calling initializeDevices? This might be something that changed since Restructure how devices are created and automatically cleaned up. #1035
  3. The commit message here could be significantly simplified, which would help make it clear what this does. Commit messages shouldn't repeat every change that a commit will make, but instead focus on pointing out why the change is being made. Here, it would suffice to point out the problem (we quit after each device) and the high level of how we're fixing it (collect the errors, report them, and continue on).

@MarijnS95 MarijnS95 force-pushed the api-query-continue-on-backend-init-failure branch from 0fe8047 to 1d5330c Compare April 7, 2026 06:35
@MarijnS95
Copy link
Copy Markdown
Contributor Author

MarijnS95 commented Apr 7, 2026

  1. Wouldn't we also need a change in api-query.cpp to not abort after calling initializeDevices? This might be something that changed since Restructure how devices are created and automatically cleaned up. #1035

Yup, the original claim that Device::initialize() doesn't bail on error no longer holds true, so the flow had to be rewritten slightly to still print errors but not return them out of initializeDevices().

  1. The commit message here could be significantly simplified, which would help make it clear what this does. Commit messages shouldn't repeat every change that a commit will make, but instead focus on pointing out why the change is being made. Here, it would suffice to point out the problem (we quit after each device) and the high level of how we're fixing it (collect the errors, report them, and continue on).

@bogner just to make sure we're talking about the same commit message:

Device::initialize() returns early when any backend fails to initialize, preventing subsequent backends from being discovered. For example, a Vulkan initialization failure (e.g. VK_ERROR_INCOMPATIBLE_DRIVER when MoltenVK is not installed) blocks Metal device discovery on macOS.

Collect backend initialization errors with joinErrors() and return them together instead. Also make lit.cfg.py resilient to an empty device list from api-query.

This explains why something is wrong together with an example case of when that has an effect (Vulkan init on MacOS), followed by how it was fixed (using joinError(), and the drive-by note aboutlit.cfg.py not allowing empty yaml lists initially).

Is that still too much?

@MarijnS95 MarijnS95 force-pushed the api-query-continue-on-backend-init-failure branch 2 times, most recently from 8a379ad to 68112e4 Compare April 8, 2026 14:31
Copy link
Copy Markdown
Contributor

@bogner bogner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

llvm::Expected<llvm::SmallVector<std::unique_ptr<Device>>>
offloadtest::initializeDevices(const DeviceConfig Config) {
llvm::SmallVector<std::unique_ptr<Device>> Devices;
llvm::Error Result = llvm::Error::success();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the name Result a little confusing here, since it's just the error case of the result. Maybe Err would be a better name for this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason Err isn't chosen is because those are already used inside the if blocks to join into this Result. Perhaps we can rename those to E to allow this to become Err?

@bogner
Copy link
Copy Markdown
Contributor

bogner commented Apr 8, 2026

@bogner just to make sure we're talking about the same commit message:

Device::initialize() returns early when any backend fails to initialize, preventing subsequent backends from being discovered. For example, a Vulkan initialization failure (e.g. VK_ERROR_INCOMPATIBLE_DRIVER when MoltenVK is not installed) blocks Metal device discovery on macOS.
Collect backend initialization errors with joinErrors() and return them together instead. Also make lit.cfg.py resilient to an empty device list from api-query.

This explains why something is wrong together with an example case of when that has an effect (Vulkan init on MacOS), followed by how it was fixed (using joinError(), and the drive-by note aboutlit.cfg.py not allowing empty yaml lists initially).

Is that still too much?

I was looking in the wrong place (this was before our other conversation about PR desciptions matching commit messages). This does look much better. I do still find this a little verbose where it goes into listing the specific APIs that its using, but that's mostly a matter of taste.

Device::initialize() returns early when any backend fails to initialize,
preventing subsequent backends from being discovered. For example, a
Vulkan initialization failure (e.g. VK_ERROR_INCOMPATIBLE_DRIVER when
MoltenVK is not installed) blocks Metal device discovery on macOS.

Collect backend initialization errors with joinErrors() and return them
together instead. Also make lit.cfg.py resilient to an empty device list
from api-query.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MarijnS95
Copy link
Copy Markdown
Contributor Author

MarijnS95 commented Apr 9, 2026

Would dropping with joinErrors() solve that? It doesn't seem to mention any other APIs being used, only APIs being changed (Device::initialize(), lit.cfg.py, and the example about VK_ERROR_INCOMPATIBLE_DRIVER being triggered without MoltenVK).

@MarijnS95 MarijnS95 force-pushed the api-query-continue-on-backend-init-failure branch from 68112e4 to 53ead32 Compare April 9, 2026 07:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants