[Klaud Cold] runners(mi355x): exclude broken nodes mia1-p01-g09 + mia1-p01-g11 #1498
+5
−1
Claude / Claude Code Review
completed
May 18, 2026 in 5m 25s
Code review found 1 potential issue
Found 1 candidate, confirmed 1. See review comments for details.
Details
| Severity | Count |
|---|---|
| 🔴 Important | 0 |
| 🟡 Nit | 1 |
| 🟣 Pre-existing | 0 |

| Severity | File:Line | Issue |
|---|---|---|
| 🟡 Nit | runners/launch_mi355x-amds.sh:190-194 | Exclude list misses g12/g31 which share the same docker.sock failure as g11 |
Annotations
Check warning on line 194 in runners/launch_mi355x-amds.sh
claude / Claude Code Review
Exclude list misses g12/g31 which share the same docker.sock failure as g11
The new `--exclude=mia1-p01-g09,mia1-p01-g11` only covers 1 of the 3 nodes that `KLAUD_DEBUG.md §5.2` explicitly groups as sharing the docker.sock-permissions failure (`mia1-p01-g11 / g12 / g31`). §5.2 also states "Recipe-level workaround: none" — i.e. g12 and g31 are not drained at the SLURM level, so salloc can still land on them and the very next `srun ... docker stop $(docker ps -a -q)` (line 197) will hit the identical cascade this PR is trying to prevent. Consider extending to `--exclude=m
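
The concern above can be illustrated with a minimal sketch that builds the `--exclude` flag from a single list, so that nodes sharing a failure mode are added in one place. This is an assumption-laden illustration, not the contents of `runners/launch_mi355x-amds.sh`: `BROKEN_NODES` and `build_exclude_flag` are hypothetical names, and the node list simply combines the two nodes already excluded by the PR with the two the review says share the docker.sock failure.

```shell
# Hypothetical list: g09 and g11 are excluded by the PR; g12 and g31 are the
# nodes the review says share g11's docker.sock-permissions failure.
BROKEN_NODES="mia1-p01-g09 mia1-p01-g11 mia1-p01-g12 mia1-p01-g31"

# Hypothetical helper: join its arguments with commas and print a
# SLURM-style --exclude=<node list> flag.
build_exclude_flag() {
  # printf reuses the format for every argument, yielding "a,b,c,"
  _list=$(printf '%s,' "$@")
  # Strip the trailing comma before emitting the flag
  printf '%s%s\n' '--exclude=' "${_list%,}"
}

# Word splitting on $BROKEN_NODES is intentional here.
EXCLUDE_FLAG=$(build_exclude_flag $BROKEN_NODES)
echo "$EXCLUDE_FLAG"
```

The resulting value could then be passed to `salloc` in place of a hard-coded flag. Whether g12 and g31 actually belong on the list depends on the state of those nodes, which this sketch does not verify.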