
Cyclic flow+sediment deadlocks at MPIsize>1 (collective-count desync in sedChange) #477

@tristan-salles

Summary

Cyclic (periodic) boundaries run correctly in serial but deadlock at MPIsize ≥ 2 when sediment transport is active (erosion on). This is the second of two cyclic-parallel bugs; the first (a partition-dependent flat-vs-curved geometry/velocity branch, which broke cyclic advection) is fixed in #476. With #476 in place, cyclic advection is parallel-correct, but cyclic flow+sediment still hangs.

The serial-only guard (cyclic bc + MPIsize>1 → error, added in #475) stays until this is fixed.

Symptom

tests/fixtures/cyclic_cyl.yml (cylinder mesh, bc:'ococ', spl.K>0) under mpirun -n 2 (and -n 3, -n 4) hangs indefinitely (watchdog-killed at 90s). Serial (-n 1) completes fine.

Diagnosis (per-rank log files, to avoid mpirun stdout buffering)

The two ranks desync by one or more collectives, a classic collective-count mismatch that ends in deadlock:

  • rank 0 blocks at the post-deposition localToGlobal/globalToLocal in sedplex._updateSinks (logged updateSinks: before applyDeposit, never reaches after applyDeposit).
  • rank 1 has already passed that, finished _updateSinks, entered seaplex.seaChange, and blocks in _distanceCoasts_globalCoastsTree's Allreduce/Allgatherv.

So from some point in the sedChange pipeline onward, one rank issues a different number of MPI collective calls than the other on the cyclic mesh. By the time they reach _updateSinks's post-deposit localToGlobal, they are permanently offset.

Ruled out

  • With no active pits (active=0 on both ranks — confirmed) _updateSinks itself is symmetric, so the divergence is upstream: in _getSedFlux, fillElevation(sed=True) (a second pit-filling on the sediment-filled topology), or the _distributeSediment/_moveDownstream cascade.
  • The .any()/early-return gates in _updateSinks/_diffuseLargePit/_addPitMicroTilt are collective-safe (they key on global per-pit data — pitParams/pitVol), so it's NOT those.
  • It's NOT the geometry branch fixed in #476 (fix(mesh): partition-invariant flat-vs-curved branch). That fix is in and advection is now consistent, yet this hang remains.
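For contrast with the collective-safe gates above, here is a minimal sketch of how a loop can keep its collective count identical on every rank by making the exit decision global. It is illustrative only: `any_rank`, `relax_until_converged` and `step` are hypothetical names, not functions from this codebase, and `comm` is assumed to be an mpi4py communicator (its lowercase `allreduce` defaults to a sum).

```python
def any_rank(comm, local_flag):
    """Collective: return True on every rank if local_flag is True on any rank.

    Assumes comm.allreduce(x) sums x over all ranks (mpi4py's default op).
    """
    return comm.allreduce(1 if local_flag else 0) > 0


def relax_until_converged(comm, step, max_iter=1000):
    """Run step() until ALL ranks report convergence.

    step() is a hypothetical callable doing rank-local work and returning
    True once this rank is converged. The loop exits on a global decision,
    so every rank performs the same number of any_rank() collectives.

    The unsafe variant would be `while not step(): <collective>`, where the
    iteration count (and hence the collective count) depends on local data.
    """
    n = 0
    while n < max_iter:
        local_done = step()
        n += 1
        # Keep iterating while any rank still has work, even if this one is done.
        if not any_rank(comm, not local_done):
            break
    return n
```

A rank that converges early still enters `any_rank` each iteration, so it cannot run ahead into the next collective while its neighbour is still looping.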

Suggested next step

Systematic per-rank collective-count audit of the sedChange path: instrument every MPI.*reduce/Allgatherv/bcast, dm.localToGlobal/globalToLocal, and ksp.solve with a per-rank counter, run cyclic_cyl.yml at np=2, and diff the counts to find the first divergent call. The prime suspects are a while-loop whose iteration count differs per rank, or a collective gated by a rank-local condition, somewhere in _getSedFlux / fillElevation(sed=True) / _moveDownstream.
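A minimal sketch of the audit tooling, assuming plain per-rank text logs are enough. `CollectiveTrace`, `load_labels`, `first_divergence` and the `collectives_rank{N}.log` file name are hypothetical, not existing helpers. Each call site would call `trace.record("<label>")` immediately before the collective it guards. Lines are flushed as they are written so the log is still complete when the watchdog kills a hung rank.

```python
import os


class CollectiveTrace:
    """Append one line per collective call to a per-rank log file."""

    def __init__(self, rank, log_dir="."):
        self.path = os.path.join(log_dir, f"collectives_rank{rank}.log")
        # buffering=1 means line-buffered in text mode: each record is
        # flushed at its newline, so it survives a later kill.
        self._fh = open(self.path, "w", buffering=1)
        self._seq = 0

    def record(self, label):
        """Log the label of the collective about to be issued."""
        self._fh.write(f"{self._seq} {label}\n")
        self._seq += 1

    def close(self):
        self._fh.close()


def load_labels(path):
    """Read a trace file back as an ordered list of labels."""
    with open(path) as fh:
        return [line.split(" ", 1)[1].rstrip("\n") for line in fh if line.strip()]


def first_divergence(a, b):
    """Return (index, label_a, label_b) at the first mismatch, or None.

    If one trace is a strict prefix of the other, the missing side is None.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, x, y
    if len(a) != len(b):
        i = min(len(a), len(b))
        return i, (a[i] if i < len(a) else None), (b[i] if i < len(b) else None)
    return None
```

After an np=2 run, `first_divergence(load_labels("collectives_rank0.log"), load_labels("collectives_rank1.log"))` gives the index of the first call where the two ranks disagree, which is where to start reading the code.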

Once fixed, add parallel cyclic regression tests (np=2/3/4) and lift the serial-only guard in inputparser._readDomain. All current cyclic tests are serial-only (the conftest fixtures are single-rank), which is why this shipped undetected.
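A minimal sketch of a building block for those tests, assuming a POSIX system and that the model is launched through a driver script. `run_model.py` is a hypothetical name, not a file in the repository, and `run_with_watchdog` / `mpi_cmd` are illustrative helpers. The point is that a deadlock becomes a test failure instead of a stuck CI job.

```python
import os
import signal
import subprocess
import sys


def run_with_watchdog(cmd, timeout_s):
    """Run cmd and return 'ok', 'failed', or 'hang'.

    The child gets its own session (and process group) so that on timeout
    the whole group, i.e. the launcher and the ranks it spawned, can be
    killed together. POSIX only.
    """
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)  # pgid == pid for a session leader
        proc.wait()
        return "hang"
    return "ok" if rc == 0 else "failed"


def mpi_cmd(nprocs, script, config):
    """Build an mpirun command line; script is a hypothetical driver."""
    return ["mpirun", "-n", str(nprocs), sys.executable, script, config]


# Intended use in a parametrised test (np = 2, 3, 4), with a hypothetical driver:
#   status = run_with_watchdog(
#       mpi_cmd(np, "run_model.py", "tests/fixtures/cyclic_cyl.yml"), 90)
#   assert status == "ok"
```

The 90 s limit mirrors the watchdog used in the symptom report. A real test would likely also compare the parallel output against the serial result.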

Labels: bug, enhancement
