Summary
Cyclic (periodic) boundaries run correctly in serial but deadlock at MPIsize ≥ 2 when sediment transport is active (erosion on). This is the second of two cyclic-parallel bugs; the first (a partition-dependent flat-vs-curved geometry/velocity branch, which broke cyclic advection) is fixed in #476. With #476 in place, cyclic advection is parallel-correct, but cyclic flow+sediment still hangs.
The serial-only guard (cyclic bc + MPIsize>1 → error, added in #475) stays until this is fixed.
Symptom
tests/fixtures/cyclic_cyl.yml (cylinder mesh, bc:'ococ', spl.K>0) under mpirun -n 2 (and -n 3, -n 4) hangs indefinitely (watchdog-killed at 90s). Serial (-n 1) completes fine.
Diagnosis (per-rank log files, to avoid mpirun stdout buffering)
The two ranks desync by one or more collectives — a classic collective-count mismatch → deadlock:
rank 0 blocks at the post-deposition localToGlobal/globalToLocal in sedplex._updateSinks (it logs "updateSinks: before applyDeposit" but never reaches "after applyDeposit").
rank 1 has already passed that point, finished _updateSinks, entered seaplex.seaChange, and blocks in _distanceCoasts → _globalCoastsTree's Allreduce/Allgatherv.
So from some point in the sedChange pipeline onward, one rank issues a different number of MPI collective calls than the other on the cyclic mesh. By the time they reach _updateSinks's post-deposit localToGlobal, they are permanently offset.
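For anyone repeating the diagnosis, here is a minimal sketch of per-rank log files using only the standard library. The helper name make_rank_logger and the rank_N.log file naming are hypothetical, not taken from goSPL. The rank is passed in as a plain integer; in a real run it would presumably come from the MPI communicator.

```python
import logging
from pathlib import Path


def make_rank_logger(rank: int, log_dir: str = ".") -> logging.Logger:
    """Return a logger that writes to its own file, one file per rank.

    Hypothetical helper: the name and the rank_<N>.log pattern are
    illustrative only.
    """
    logger = logging.getLogger(f"rank{rank}")
    logger.setLevel(logging.DEBUG)
    # Keep records out of the root logger, so nothing goes through the
    # shared (and buffered) mpirun stdout.
    logger.propagate = False
    if not logger.handlers:
        path = Path(log_dir) / f"rank_{rank}.log"
        # FileHandler flushes after every record, so the last message
        # written before a hang is already in the file.
        handler = logging.FileHandler(path, mode="w")
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

Each rank calls make_rank_logger once at start-up and logs a marker such as "updateSinks: before applyDeposit" immediately before each suspect call. After a hang, the last line of each rank's file shows where that rank stopped.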
Ruled out
With no active pits (active=0 on both ranks — confirmed) _updateSinks itself is symmetric, so the divergence is upstream: in _getSedFlux, fillElevation(sed=True) (a second pit-filling on the sediment-filled topology), or the _distributeSediment/_moveDownstream cascade.
The .any()/early-return gates in _updateSinks/_diffuseLargePit/_addPitMicroTilt are collective-safe (they key on global per-pit data — pitParams/pitVol), so it's NOT those.
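To illustrate why a gate keyed on global data is collective-safe while a gate keyed on rank-local data is not, here is a small pure-Python sketch. It does not reproduce the actual gates in _updateSinks. The function names are hypothetical, and the global reduction is stood in for by a plain any() over all ranks.

```python
def gate_decisions(per_rank_flags):
    """Return (rank_local, globally_keyed) gate decisions, one entry per rank."""
    # Rank-local gate: each rank looks only at its own partition.
    rank_local = [any(flags) for flags in per_rank_flags]
    # Globally keyed gate: every rank sees the same reduced value
    # (stand-in for a logical-OR reduction over all ranks).
    global_value = any(rank_local)
    globally_keyed = [global_value for _ in per_rank_flags]
    return rank_local, globally_keyed


def is_collective_safe(decisions):
    """A gate is safe only if every rank takes the same branch."""
    return len(set(decisions)) <= 1


if __name__ == "__main__":
    # Rank 0 owns no flagged nodes, rank 1 owns one.
    local, keyed = gate_decisions([[False, False], [False, True]])
    print(is_collective_safe(local), is_collective_safe(keyed))  # False True
```

With the rank-local gate, rank 1 would enter the guarded branch and issue its collectives while rank 0 returns early, which is exactly the kind of asymmetry the audit is looking for.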
Suggested next step
Run a systematic per-rank collective-count audit of the sedChange path: instrument every MPI.*reduce/Allgatherv/bcast, dm.localToGlobal/globalToLocal, and ksp.solve with a per-rank counter, run cyclic_cyl.yml at np=2, and diff the counts to find the first divergent call. The prime suspects are a while-loop whose iteration count differs per rank, or a collective gated by a rank-local condition, somewhere in _getSedFlux / fillElevation(sed=True) / _moveDownstream.
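A rough sketch of what such an audit could look like, in pure Python so it runs without MPI. CollectiveAudit, first_divergence and simulate_rank are hypothetical names, and fake_collective is a stand-in for the real calls. How the wrapper gets attached to the actual mpi4py/petsc4py calls is left open here. The simulated pipeline uses a loop with a rank-dependent trip count, one of the two suspect patterns.

```python
from functools import wraps


class CollectiveAudit:
    """Records, in call order, every wrapped collective issued by one rank."""

    def __init__(self):
        self.calls = []

    def wrap(self, label, func):
        """Return func wrapped so that each call is recorded under label."""
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Record before calling, so a call that blocks forever is
            # still listed.
            self.calls.append(label)
            return func(*args, **kwargs)
        return wrapper


def first_divergence(calls_a, calls_b):
    """Return (index, label_a, label_b) of the first mismatch, or None."""
    for i in range(max(len(calls_a), len(calls_b))):
        a = calls_a[i] if i < len(calls_a) else None
        b = calls_b[i] if i < len(calls_b) else None
        if a != b:
            return i, a, b
    return None


def fake_collective():
    """Stand-in for a real collective; does nothing."""


def simulate_rank(n_iterations):
    """Hypothetical per-rank pipeline with a rank-local loop trip count."""
    audit = CollectiveAudit()
    allreduce = audit.wrap("Allreduce", fake_collective)
    local_to_global = audit.wrap("localToGlobal", fake_collective)
    for _ in range(n_iterations):
        allreduce()
    local_to_global()
    return audit.calls


if __name__ == "__main__":
    rank0 = simulate_rank(2)  # loop runs twice on rank 0
    rank1 = simulate_rank(3)  # loop runs three times on rank 1
    print(first_divergence(rank0, rank1))  # (2, 'localToGlobal', 'Allreduce')
```

In a real run each rank would dump its own call list to a per-rank file, and the lists would be compared offline. Recording the ordered sequence rather than only totals makes the first mismatched pair visible, not just the fact that the counts differ.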
Once fixed, add parallel cyclic regression tests (np=2/3/4) — all current cyclic tests are serial-only (the conftest fixtures are single-rank), which is why this shipped undetected — and lift the serial-only guard in inputparser._readDomain.
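Because the failure mode is a hang rather than a wrong answer, the parallel regression tests need a watchdog. Below is a minimal sketch of one using the standard library; completes_within is a hypothetical helper name, and the 90 s default mirrors the watchdog mentioned under Symptom. The command to run is passed in as-is, since the exact mpirun invocation for the fixture is not specified here.

```python
import subprocess
import sys


def completes_within(cmd, timeout_s=90.0):
    """Run cmd; return True only if it exits with status 0 before the timeout.

    Hypothetical helper. subprocess.run kills the direct child process
    when the timeout expires; whether an MPI launcher then cleans up all
    of its ranks is an assumption to verify on the target system.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Self-check with plain Python child processes instead of mpirun.
    print(completes_within([sys.executable, "-c", "pass"], timeout_s=30))
    print(completes_within(
        [sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=1))
```

A regression test would call this once per rank count (np=2, 3, 4) with the project's usual launch command for tests/fixtures/cyclic_cyl.yml and assert that it returns True.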
References
tests/fixtures/cyclic_cyl.{npz,yml}
gospl/sed/sedplex.py (sedChange / _distributeSediment / _moveDownstream / _updateSinks), gospl/sed/seaplex.py (seaChange / _distanceCoasts), gospl/flow/pitfilling.py (fillElevation)