Don't exclude strix halo in build #4661
| rccl # https://github.com/ROCm/TheRock/issues/150 | ||
| rccl-tests |
We can't support additional RCCL targets until #2130 is resolved (see the other exclusions in this file now). Each additional build target adds too much link time to the critical path for every CI/CD build.
On this PR, the comm-libs stage took 3h54m (https://github.com/ROCm/TheRock/actions/runs/24734903539/job/72365287361?pr=4661), compared to the baseline of ~1h30m (https://github.com/ROCm/TheRock/actions/runs/24795953473/job/72570414857). Some part of that can be attributed to cache misses, but in general RCCL's current architecture is not compatible with project requirements. That is being worked on with increasing priority now that RCCL builds are the bottleneck.
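For a rough sense of scale, the two stage times quoted above can be compared directly. This is a minimal Python sketch; `to_minutes` is a hypothetical helper written for this comparison, the baseline is only approximate, and some of the gap may come from cache misses rather than the added target.

```python
def to_minutes(duration: str) -> int:
    """Convert a duration such as '3h54m' to whole minutes (hypothetical helper)."""
    hours, _, rest = duration.partition("h")
    minutes = rest.rstrip("m")
    return int(hours) * 60 + (int(minutes) if minutes else 0)


pr_minutes = to_minutes("3h54m")        # comm-libs stage on this PR
baseline_minutes = to_minutes("1h30m")  # approximate baseline quoted above

extra = pr_minutes - baseline_minutes   # added wall-clock time on the critical path
ratio = pr_minutes / baseline_minutes   # slowdown relative to the baseline

print(f"extra: {extra} min, ratio: {ratio:.1f}x")
```

Under these numbers the stage takes well over twice as long as the baseline, which is the cost the comment objects to paying on every CI/CD build.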
Can you see if this resolves your issue? @ScottTodd #2130 (comment) @stellaraccident
Build times have improved substantially after pulling in ROCm/rocm-systems#4795. We could attempt to enable gfx1151 (and other targets) in the RCCL build now.
Enabled gfx1151 (and other targets) in #4935, closing this PR.
## Motivation

RCCL builds have historically been too slow to support for all targets, see

* #2130
* ROCm/rocm-systems#4795
* #4661

Build time has improved recently. Let's see if it's improved enough to re-enable these targets.

## Test Plan

Multi-arch CI on this PR, watch comm-libs stage

* Build should succeed
* Build time should be reasonable (not a bottleneck compared to other stages)
* Tests should run (and probably pass?)
* Inspect the configure/build logs and uploaded artifacts (look for what was actually enabled, what kpack files are included, etc.)

## Test Result

* comm-libs build stage succeeded ([logs here](https://github.com/ROCm/TheRock/actions/runs/25134138451/job/73674820623?pr=4935))
* comm-libs build time: 1h43m, math-libs gfx94X build time: 1h41m
* RCCL tests passed on Linux gfx942 ([logs here](https://github.com/ROCm/TheRock/actions/runs/25134138451/job/73686990909?pr=4935))
* RCCL tests passed on Linux gfx950 ([logs here](https://github.com/ROCm/TheRock/actions/runs/25134138451/job/73686961882?pr=4935))
* RCCL artifacts were split for each architecture, e.g. `rccl_lib_gfx1151.tar.zst` which includes `.kpack/rccl_lib_gfx1151.kpack`

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.