
Determinism support 2/N #1300

Draft
mar-yan24 wants to merge 7 commits into google-deepmind:main from mar-yan24:mark/determinism2-draft

@mar-yan24

Contributor

Determinism support 2/N: deterministic constraint row allocation

For now this is a draft, opened early against main for review. Checks won't pass because the 1/N changes are omitted.

Summary

This PR extends the opt-in opt.deterministic flag introduced in #1281 to make constraint row allocation reproducible across repeated runs of the same input. The wp.atomic_add(nefc_out, worldid, N) allocation used inside each constraint kernel is replaced with a deterministic count -> exclusive-scan -> emit pipeline. After 2/N, every constraint row should be at the same position on every run, so d.nefc, d.efc.*, and d.efc.J are bitwise stable.
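As a rough illustration of the count -> exclusive-scan -> emit idea, here is a minimal host-side sketch in plain Python. It is not the Warp kernel code from this PR: the function names, the single-world / single-family setup, and the `(item, k)` row payload are all assumptions made for the example.

```python
from itertools import accumulate


def exclusive_scan(counts):
    """Return offsets with offsets[i] == sum(counts[:i])."""
    if not counts:
        return []
    inclusive = list(accumulate(counts))
    return [0] + inclusive[:-1]


def allocate_rows(active, rows_per_item):
    """Hypothetical single-world, single-family row allocation."""
    # Pass 1 (count): each candidate reports how many rows it needs.
    counts = [rows_per_item if a else 0 for a in active]
    # Pass 2 (scan): the exclusive prefix sum gives each candidate a base row
    # that depends only on its index, never on thread scheduling.
    offsets = exclusive_scan(counts)
    nefc = sum(counts)
    # Pass 3 (emit): candidate i writes rows offsets[i] .. offsets[i] + counts[i] - 1.
    rows = [None] * nefc
    for i, (count, base) in enumerate(zip(counts, offsets)):
        for k in range(count):
            rows[base + k] = (i, k)
    return nefc, offsets, rows


nefc, offsets, rows = allocate_rows([True, False, True, True], rows_per_item=2)
print(nefc, offsets, rows)
```

With an atomic counter, the base row of each candidate depends on which thread reaches the counter first. Here the base row is a pure function of the counts, which is what makes the row positions repeatable.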

Guarantees after 2/N (on top of 1/N)

  • Stable d.contact.* ordering (from 1/N).
  • Stable d.nefc across runs.
  • Stable per-row d.efc.* values.
  • Stable dense and sparse d.efc.J (including J_rownnz, J_rowadr, J_colind).

Not yet guaranteed (deliberately out of scope)

  • Bitwise qacc, qvel, qpos. Still gated on deterministic solver reductions.
  • CUDA-graph capture with opt.deterministic=True. Blocked by the host-side overflow readback.

Changes

  • mujoco_warp/_src/constraint.py: replaces atomic slot allocation with the deterministic count -> scan -> emit pipeline across all constraint families (equality, friction, limit, contact pyramidal, contact elliptic).
  • mujoco_warp/_src/constraint.py: persisted deterministic scratch buffers. All per-family counts, nnz_counts, offsets, nnz_offsets, nefc_base, nnz_base, plus contact world_start / world_end, are allocated once per (m, d) and reused across steps.
  • mujoco_warp/_src/constraint.py: skip zero-size families. A Python-side early-skip for families whose size == 0 avoids ~10–20 no-op kernel launches per step on models like humanoid.
  • mujoco_warp/_src/types.py: opt.deterministic docstring updated to reflect that the flag now also covers constraint row allocation.
  • mujoco_warp/_src/determinism_test.py: expanded determinism regression coverage for nefc, per-row efc.*, dense and sparse efc.J, canonicalized det-on vs det-off row-multiset equivalence, and benchmark-path smoke coverage for both solver paths.
  • mujoco_warp/_src/benchmark.py, mujoco_warp/testspeed.py: expose --use_cuda_graph at the CLI and pipe it through benchmark() so opt.deterministic=True can be benchmarked on the non-captured path while the host-side overflow readback remains (can remove this if unwanted, mainly for helping me test).
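The canonicalized det-on vs det-off row-multiset check mentioned for determinism_test.py can be pictured roughly as below. This is only a sketch of one possible canonicalization (sort the rows, then compare exactly); the helper names are hypothetical and the actual test may canonicalize differently.

```python
def canonical_rows(rows):
    """Sort rows so that the comparison ignores row order but keeps duplicates."""
    return sorted(tuple(row) for row in rows)


def same_row_multiset(rows_a, rows_b):
    """True if both inputs hold the same rows with the same multiplicities."""
    return canonical_rows(rows_a) == canonical_rows(rows_b)


# Illustrative data: the same three rows, emitted in a different order.
det_off = [(1.0, 0.0), (0.0, 2.0), (1.0, 0.0)]
det_on = [(0.0, 2.0), (1.0, 0.0), (1.0, 0.0)]
print(same_row_multiset(det_off, det_on))
```

The point of comparing multisets rather than arrays is that deterministic mode is only expected to change where each row lands, not which rows exist.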

Benchmarks

I used the same benchmarking approach as in the last PR, just extended (thanks to Claude lol).

Environment:

  • GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8 GiB, sm_89)
  • Warp: 1.13.0.dev20260227, CUDA Toolkit 12.9, Driver 12.5
  • Methodology: 3 trials × 500 measured steps, 50 warmup steps, explicit sync around the timing window, us/step = 1e6 * run_duration / (nworld * nstep).
  • use_cuda_graph=False on both off and on runs (required for deterministic mode today, since the host-side overflow readback is not capture-safe).
  • Capacity margin applied to both off and on: njmax = baseline + 32, njmax_nnz = baseline + 32 * nv (deterministic overflow validation would otherwise trip after warmup).
  • collision.xml nworld=512 used nccdmax=8 to fit in 8 GiB, applied to both runs.
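The per-step timing formula and the capacity margin from the list above can be written out as a small sketch. The function names and the example inputs are assumptions for illustration, not values or code from the benchmark script.

```python
def us_per_step(run_duration_s, nworld, nstep):
    """Microseconds per step per world: 1e6 * run_duration / (nworld * nstep)."""
    return 1e6 * run_duration_s / (nworld * nstep)


def with_margin(njmax_baseline, njmax_nnz_baseline, nv, margin_rows=32):
    """Apply the capacity margin: +32 rows, and +32 * nv Jacobian nonzeros."""
    return njmax_baseline + margin_rows, njmax_nnz_baseline + margin_rows * nv


# Hypothetical inputs: a 0.5 s timing window over 64 worlds and 500 steps.
print(us_per_step(0.5, nworld=64, nstep=500))
# Hypothetical baselines for a model with nv = 27 degrees of freedom.
print(with_margin(100, 2700, nv=27))
```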

Newton + Dense

| model | nworld | mean ncon | mean nefc | off (us/step) | on (us/step) | overhead |
|---|---|---|---|---|---|---|
| humanoid/humanoid.xml | 1 | 9.53 | 30.63 | 4318.30 | 6022.23 | +39.5% |
| humanoid/humanoid.xml | 64 | 11.16 | 40.05 | 86.05 | 110.70 | +28.7% |
| humanoid/humanoid.xml | 512 | 11.22 | 44.95 | 12.23 | 15.17 | +24.1% |
| collision.xml | 1 | 10.82 | 23.47 | 4836.62 | 6363.59 | +31.6% |
| collision.xml | 64 | 10.81 | 23.57 | 77.49 | 100.23 | +29.3% |
| collision.xml | 512 | 10.82 | 24.19 | 11.87 | 15.02 | +26.5% |

CG + Sparse

| model | nworld | mean ncon | mean nefc | off (us/step) | on (us/step) | overhead |
|---|---|---|---|---|---|---|
| humanoid/humanoid.xml | 1 | 9.16 | 27.82 | 8374.90 | 10129.02 | +20.9% |
| humanoid/humanoid.xml | 64 | 11.18 | 38.88 | 128.68 | 155.16 | +20.6% |
| humanoid/humanoid.xml | 512 | 11.18 | 43.58 | 16.21 | 21.31 | +31.4% |
| collision.xml | 1 | 10.74 | 23.31 | 5421.41 | 7149.21 | +31.9% |
| collision.xml | 64 | 10.74 | 23.31 | 93.56 | 111.70 | +19.4% |
| collision.xml | 512 | 10.74 | 23.31 | 12.15 | 14.05 | +15.6% |

Overhead range across the matrix: +15.6% .. +39.5%.
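For reference, the overhead column is the relative slowdown of the on run against the off run. A minimal sketch, using a hypothetical helper name and two rows from the tables above (recomputing from the rounded us/step values can differ from the reported figure in the last digit for some rows):

```python
def overhead_pct(off_us, on_us):
    """Relative cost of deterministic mode, in percent of the off timing."""
    return 100.0 * (on_us - off_us) / off_us


# Newton + Dense, humanoid, nworld=1 (largest overhead in the matrix).
print(round(overhead_pct(4318.30, 6022.23), 1))
# CG + Sparse, collision.xml, nworld=512 (smallest overhead in the matrix).
print(round(overhead_pct(12.15, 14.05), 1))
```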

The benchmarks are decent: not amazing, not horrible. There are several performance enhancements I have in mind (aside from CUDA graph support, of course) that might bring the overhead down quite a bit. I have already implemented two, which helped a decent bit.

Luckily, I was able to test this a couple of days ago. Since then I installed the latest Windows updates, and I am 99.9% sure they caused a CUDA latency regression, because I can no longer reproduce the benchmark numbers for either this PR or the last PR. The latency is so bad that I effectively cannot test on this Windows machine, so I may struggle to get better performance benchmarks until I find a workaround for this security update.

Performance enhancements in this PR

Two small perf enhancements are included so that 2/N is closer to the 1/N cost. Neither changes the determinism semantics; both are measured against the same matrix.

Persisted deterministic scratch buffers. Allocates every scratch buffer once per (m, d) and reuses it across steps, instead of calling wp.empty(...) for every constraint family. Measured on this branch (humanoid.xml and collision.xml, 3 trials × 500 steps):

| model | nworld | overhead before | overhead after | pp reduction | % reduction |
|---|---|---|---|---|---|
| humanoid/humanoid.xml | 1 | +36.6% | +20.3% | −16.3 | −45% |
| humanoid/humanoid.xml | 64 | +40.5% | +21.5% | −19.0 | −47% |
| humanoid/humanoid.xml | 512 | +42.5% | +36.0% | −6.5 | −15% |
| collision.xml | 64 | +29.1% | +12.4% | −16.7 | −57% |
| collision.xml | 512 | +23.1% | +13.6% | −9.5 | −41% |

Saves ~370–830 µs/step of device-side work.
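The allocate-once-and-reuse pattern behind the persisted scratch buffers can be sketched as follows. This is a plain-Python illustration under assumptions: the `ScratchCache` class, the `id(...)`-based key, and the list-backed allocator are hypothetical stand-ins, not the structure used in constraint.py, where the buffers would be Warp arrays.

```python
class ScratchCache:
    """Hypothetical cache that allocates each named buffer once per (m, d)."""

    def __init__(self, alloc):
        self._alloc = alloc      # callable(size) -> buffer
        self._buffers = {}       # (id(m), id(d), name) -> buffer

    def get(self, m, d, name, size):
        # Keying on id() assumes m and d outlive the cache entry.
        key = (id(m), id(d), name)
        buf = self._buffers.get(key)
        if buf is None or len(buf) < size:
            # First request (or a capacity increase): allocate and remember.
            buf = self._alloc(size)
            self._buffers[key] = buf
        return buf


alloc_calls = []


def counting_alloc(size):
    """Stand-in allocator that records every allocation request."""
    alloc_calls.append(size)
    return [0] * size


cache = ScratchCache(counting_alloc)
m, d = object(), object()
for _ in range(3):               # three simulated steps
    counts = cache.get(m, d, "counts", 8)
    offsets = cache.get(m, d, "offsets", 8)
print(len(alloc_calls))          # allocations performed over all steps
```

In this sketch the three simulated steps trigger one allocation per buffer name rather than one per step, which is the effect the optimization is after.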

Skip zero-size families. Python-side early-skip for constraint families whose family-size is 0 (e.g. unused equality / friction / limit). Avoids ~10–20 no-op kernel launches per step on humanoid-like models. Small but consistent saving, largest impact at small nworld where launch overhead dominates.
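A minimal sketch of the early-skip, assuming a hypothetical `launch` callable and a dict of per-family sizes known on the Python side (the family names and sizes below are illustrative):

```python
def launch_constraint_families(family_sizes, launch):
    """Call launch(name, size) only for families that have work to do."""
    launched = []
    for name, size in family_sizes.items():
        if size == 0:
            # Nothing to count, scan, or emit: skip the kernel launches entirely.
            continue
        launch(name, size)
        launched.append(name)
    return launched


launch_log = []
sizes = {"equality": 0, "friction": 0, "limit": 5, "contact": 12}
launched = launch_constraint_families(
    sizes, lambda name, size: launch_log.append((name, size))
)
print(launched)
```

Because the check happens on the host before any launch is issued, an empty family costs a dictionary lookup instead of a kernel launch.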

mar-yan24 and others added 7 commits April 17, 2026 22:27
(cherry picked from commit 689867c)
(cherry picked from commit e5deba2)
(cherry picked from commit 7664301)
(cherry picked from commit 28270ce)
(cherry picked from commit 96c6e8a)