runners(mi355x): exclude broken nodes mia1-p01-g09 + mia1-p01-g11

48c3388
Merged

[Klaud Cold] runners(mi355x): exclude broken nodes mia1-p01-g09 + mia1-p01-g11 #1498

Claude / Claude Code Review completed May 18, 2026 in 5m 25s

Code review found 1 potential issue

Found 1 candidate, confirmed 1. See review comments for details.

Details

| Severity | Count |
| --- | --- |
| 🔴 Important | 0 |
| 🟡 Nit | 1 |
| 🟣 Pre-existing | 0 |

| Severity | File:Line | Issue |
| --- | --- | --- |
| 🟡 Nit | `runners/launch_mi355x-amds.sh:190-194` | Exclude list misses g12/g31 which share the same docker.sock failure as g11 |

Annotations

Check warning on line 194 in runners/launch_mi355x-amds.sh

@claude claude / Claude Code Review

Exclude list misses g12/g31 which share the same docker.sock failure as g11

The new `--exclude=mia1-p01-g09,mia1-p01-g11` only covers 1 of the 3 nodes that `KLAUD_DEBUG.md §5.2` explicitly groups as sharing the docker.sock-permissions failure (`mia1-p01-g11 / g12 / g31`). §5.2 also states "Recipe-level workaround: none" — i.e. g12 and g31 are not drained at the SLURM level, so salloc can still land on them and the very next `srun ... docker stop $(docker ps -a -q)` (line 197) will hit the identical cascade this PR is trying to prevent. Consider extending to `--exclude=m