Don't exclude strix halo in build #4661

Closed
charest wants to merge 2 commits into main from users/cmarc/strix_halo_fix

Contributor

@charest charest commented Apr 17, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Comment on lines -239 to -240
rccl # https://github.com/ROCm/TheRock/issues/150
rccl-tests
Member

We can't support additional RCCL targets until #2130 is resolved (see the other exclusions in this file now). Each additional build target adds too much link time to the critical path for every CI/CD build.

On this PR, the comm-libs stage took 3h54m (https://github.com/ROCm/TheRock/actions/runs/24734903539/job/72365287361?pr=4661), compared to the baseline of ~1h30m (https://github.com/ROCm/TheRock/actions/runs/24795953473/job/72570414857). Some part of that can be attributed to cache misses, but in general RCCL's current architecture is not compatible with project requirements. That is being worked on with increasing priority now that RCCL builds are the bottleneck.
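For scale, the two timings quoted above work out to roughly a 2.6x slowdown. A minimal sketch of that arithmetic follows; `parse_minutes` is a hypothetical helper written for this illustration (it is not part of TheRock), and the baseline uses the approximate ~1h30m figure from the comment.

```python
import re


def parse_minutes(text: str) -> int:
    """Convert a duration such as '3h54m' into whole minutes (hypothetical helper)."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?", text)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


pr_run = parse_minutes("3h54m")    # comm-libs stage on this PR
baseline = parse_minutes("1h30m")  # approximate baseline quoted in the comment

extra = pr_run - baseline
print(f"slowdown: {pr_run / baseline:.1f}x, extra: {extra // 60}h{extra % 60:02d}m")
```

Part of the extra time may come from cache misses, as noted above, so the ratio is an upper bound on the cost of the added target rather than a precise measurement.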

@alex-breslow-amd alex-breslow-amd Apr 23, 2026

Can you see if this resolves your issue? @ScottTodd #2130 (comment) @stellaraccident

Member

Build times have improved substantially after pulling in ROCm/rocm-systems#4795. We could attempt to enable gfx1151 (and other targets) in the RCCL build now.

Member

Enabled gfx1151 (and other targets) in #4935, closing this PR.

ScottTodd added a commit that referenced this pull request Apr 30, 2026
## Motivation

RCCL builds have historically been too slow to support for all targets,
see
* #2130
* ROCm/rocm-systems#4795
* #4661

Build time has improved recently. Let's see if it's improved enough to
re-enable these targets.

## Test Plan

Multi-arch CI on this PR, watch comm-libs stage
* Build should succeed
* Build time should be reasonable (not a bottleneck compared to other
stages)
* Tests should run (and probably pass?)
* Inspect the configure/build logs and uploaded artifacts (look for what
was actually enabled, what kpack files are included, etc.)

## Test Result

* comm-libs build stage succeeded ([logs
here](https://github.com/ROCm/TheRock/actions/runs/25134138451/job/73674820623?pr=4935))
* comm-libs build time: 1h43m, math-libs gfx94X build time: 1h41m
* RCCL tests passed on Linux gfx942 ([logs
here](https://github.com/ROCm/TheRock/actions/runs/25134138451/job/73686990909?pr=4935))
* RCCL tests passed on Linux gfx950 ([logs
here](https://github.com/ROCm/TheRock/actions/runs/25134138451/job/73686961882?pr=4935))
* RCCL artifacts were split for each architecture, e.g.
`rccl_lib_gfx1151.tar.zst` which includes
`.kpack/rccl_lib_gfx1151.kpack`
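
The per-architecture split could be spot-checked with something like the sketch below. It assumes the naming pattern generalizes from the single `gfx1151` example above, which this PR does not confirm for every target. `expected_artifact` and `missing_kpacks` are hypothetical helpers, and the listing is made-up sample data rather than real CI output.

```python
def expected_artifact(target: str) -> tuple[str, str]:
    """Return (archive name, kpack member path) for one GPU target.

    Assumption: the pattern is extrapolated from the gfx1151 example only.
    """
    return f"rccl_lib_{target}.tar.zst", f".kpack/rccl_lib_{target}.kpack"


def missing_kpacks(targets, listing):
    """List targets whose archive is absent or lacks the expected kpack file.

    `listing` maps archive name -> member paths found in that archive.
    """
    missing = []
    for target in targets:
        archive, kpack = expected_artifact(target)
        if kpack not in listing.get(archive, ()):
            missing.append(target)
    return missing


# Made-up listing, deliberately incomplete to show the failure case.
listing = {"rccl_lib_gfx1151.tar.zst": [".kpack/rccl_lib_gfx1151.kpack"]}
print(missing_kpacks(["gfx1151", "gfx950"], listing))
```

In practice the listing would have to come from the uploaded artifacts themselves; how to obtain it is left out here.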

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
@ScottTodd ScottTodd closed this Apr 30, 2026
@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Apr 30, 2026