Determinism support 2/N by mar-yan24 · Pull Request #1300 · google-deepmind/mujoco_warp

mar-yan24 · 2026-04-19T04:24:57Z

Determinism support 2/N: deterministic constraint row allocation

This is a draft for now, opened early against main for review. Checks won't pass because the 1/N changes are omitted.

Summary

This PR extends the opt-in opt.deterministic flag introduced in #1281 to make constraint row allocation reproducible across repeated runs of the same input. The wp.atomic_add(nefc_out, worldid, N) allocation used inside each constraint kernel is replaced with a deterministic count -> exclusive-scan -> emit pipeline. After 2/N, every constraint row should be at the same position on every run, so d.nefc, d.efc.*, and d.efc.J are bitwise stable.
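To illustrate the count -> exclusive-scan -> emit idea, here is a minimal CPU-side sketch in plain Python. It is not the Warp kernel code from this PR: `exclusive_scan` and `allocate_rows` are hypothetical names, and the shuffled loop only imitates arbitrary GPU thread scheduling. The point is that each constraint's first row is fixed by the scan before anything is written, so the layout does not depend on execution order.

```python
from itertools import accumulate
import random


def exclusive_scan(counts):
    """offsets[i] = sum(counts[:i]); empty input gives an empty list."""
    return list(accumulate(counts, initial=0))[:-1]


def allocate_rows(counts, seed=None):
    """Count -> exclusive scan -> emit for a single world.

    counts[i] is the number of rows constraint i emits (0 if inactive).
    Returns (nefc, rows) where rows[r] = (constraint index, local row).
    """
    offsets = exclusive_scan(counts)  # scan: first row of each constraint
    nefc = sum(counts)                # total rows for this world
    rows = [None] * nefc

    # Emit in a shuffled order to mimic arbitrary thread scheduling.
    order = list(range(len(counts)))
    random.Random(seed).shuffle(order)
    for i in order:
        for k in range(counts[i]):
            rows[offsets[i] + k] = (i, k)  # slot was fixed before emit began
    return nefc, rows


# Two differently ordered "runs" produce the same layout.
run_a = allocate_rows([2, 0, 1, 3], seed=1)
run_b = allocate_rows([2, 0, 1, 3], seed=2)
assert run_a == run_b
```

With an atomic counter, by contrast, the slot each constraint receives depends on which thread reaches the counter first, which is what makes the row order vary between runs.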

Guarantees after 2/N (on top of 1/N)

Stable d.contact.* ordering (from 1/N).
Stable d.nefc across runs.
Stable per-row d.efc.* values.
Stable dense and sparse d.efc.J (including J_rownnz, J_rowadr, J_colind).

Not yet guaranteed (deliberately out of scope)

Bitwise qacc, qvel, qpos. Still gated on deterministic solver reductions.
CUDA-graph capture with opt.deterministic=True. Blocked by the host-side overflow readback.

Changes

mujoco_warp/_src/constraint.py: replaces atomic slot allocation with the deterministic count -> scan -> emit pipeline across all constraint families (equality, friction, limit, contact pyramidal, contact elliptic).
mujoco_warp/_src/constraint.py: persists the deterministic scratch buffers. All per-family counts, nnz_counts, offsets, nnz_offsets, nefc_base, nnz_base, plus contact world_start / world_end, are allocated once per (m, d) and reused across steps.
mujoco_warp/_src/constraint.py: skips zero-size families. A Python-side early skip for families whose size == 0 avoids ~10–20 no-op kernel launches per step on models like humanoid.
mujoco_warp/_src/types.py: opt.deterministic docstring updated to reflect that the flag now also covers constraint row allocation.
mujoco_warp/_src/determinism_test.py: expanded determinism regression coverage for nefc, per-row efc.*, dense and sparse efc.J, canonicalized det-on vs det-off row-multiset equivalence, and benchmark-path smoke coverage for both solver paths.
mujoco_warp/_src/benchmark.py, mujoco_warp/testspeed.py: expose --use_cuda_graph at the CLI and pipe it through benchmark() so opt.deterministic=True can be benchmarked on the non-captured path while the host-side overflow readback remains (can remove this if unwanted, mainly for helping me test).
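As a rough illustration of what "canonicalized det-on vs det-off row-multiset equivalence" means, the sketch below compares two runs as multisets of rows, so only the row order is allowed to differ. It is plain Python rather than the actual test code: `row_multiset`, `equivalent_up_to_order`, and the `per_row` list (standing in for any per-row efc.* field) are hypothetical, and it assumes row values match exactly between the two modes.

```python
from collections import Counter


def row_multiset(J, per_row, nefc):
    """Multiset of (Jacobian row, per-row value) over the first nefc rows."""
    return Counter((tuple(J[r]), per_row[r]) for r in range(nefc))


def equivalent_up_to_order(a, b):
    """a and b are (J, per_row, nefc) triples taken from two runs."""
    return a[2] == b[2] and row_multiset(*a) == row_multiset(*b)


# Same three rows, emitted in a different order by the two modes.
det_off = ([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], [0.1, 0.2, 0.3], 3)
det_on = ([[0.0, 2.0], [3.0, 3.0], [1.0, 0.0]], [0.2, 0.3, 0.1], 3)
assert equivalent_up_to_order(det_off, det_on)
```

Using a multiset rather than a set keeps duplicate rows significant, so a run that drops or repeats a row is still caught.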

Benchmarks

I used the same benchmarking setup as in the last PR, just extended (with help from Claude).

Environment:

GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8 GiB, sm_89)
Warp: 1.13.0.dev20260227, CUDA Toolkit 12.9, Driver 12.5
Methodology: 3 trials × 500 measured steps, 50 warmup steps, explicit sync around the timing window, us/step = 1e6 * run_duration / (nworld * nstep).
use_cuda_graph=False on both off and on runs (required for deterministic mode today, because the host-side overflow readback is not capture-safe).
Capacity margin applied to both off and on: njmax = baseline + 32, njmax_nnz = baseline + 32 * nv (deterministic overflow validation would otherwise trip after warmup).
collision.xml nworld=512 used nccdmax=8 to fit in 8 GiB, applied to both runs.
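The arithmetic behind the methodology can be written out as a small Python sketch. The helper names (`us_per_step`, `overhead_pct`, `with_margin`) are made up for illustration and are not part of benchmark.py; the formulas are the ones stated in the list above, and the overhead is assumed to be the relative slowdown of on vs off.

```python
def us_per_step(run_duration_s, nworld, nstep):
    """us/step = 1e6 * run_duration / (nworld * nstep)."""
    return 1e6 * run_duration_s / (nworld * nstep)


def overhead_pct(off_us, on_us):
    """Relative slowdown of the deterministic path, in percent."""
    return 100.0 * (on_us / off_us - 1.0)


def with_margin(njmax_baseline, njmax_nnz_baseline, nv, margin=32):
    """Capacity margin: njmax + 32 rows, njmax_nnz + 32 * nv entries."""
    return njmax_baseline + margin, njmax_nnz_baseline + margin * nv


# humanoid, nworld=1, Newton + Dense row from the table in this PR.
print(f"{overhead_pct(4318.30, 6022.23):+.1f}%")  # prints "+39.5%"
```

Some table entries may differ from this calculation in the last digit, presumably because the reported overheads were computed from unrounded timings.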

Newton + Dense

| model | nworld | mean ncon | mean nefc | off (us/step) | on (us/step) | overhead |
|---|---|---|---|---|---|---|
| `humanoid/humanoid.xml` | 1 | 9.53 | 30.63 | 4318.30 | 6022.23 | +39.5% |
| `humanoid/humanoid.xml` | 64 | 11.16 | 40.05 | 86.05 | 110.70 | +28.7% |
| `humanoid/humanoid.xml` | 512 | 11.22 | 44.95 | 12.23 | 15.17 | +24.1% |
| `collision.xml` | 1 | 10.82 | 23.47 | 4836.62 | 6363.59 | +31.6% |
| `collision.xml` | 64 | 10.81 | 23.57 | 77.49 | 100.23 | +29.3% |
| `collision.xml` | 512 | 10.82 | 24.19 | 11.87 | 15.02 | +26.5% |

CG + Sparse

| model | nworld | mean ncon | mean nefc | off (us/step) | on (us/step) | overhead |
|---|---|---|---|---|---|---|
| `humanoid/humanoid.xml` | 1 | 9.16 | 27.82 | 8374.90 | 10129.02 | +20.9% |
| `humanoid/humanoid.xml` | 64 | 11.18 | 38.88 | 128.68 | 155.16 | +20.6% |
| `humanoid/humanoid.xml` | 512 | 11.18 | 43.58 | 16.21 | 21.31 | +31.4% |
| `collision.xml` | 1 | 10.74 | 23.31 | 5421.41 | 7149.21 | +31.9% |
| `collision.xml` | 64 | 10.74 | 23.31 | 93.56 | 111.70 | +19.4% |
| `collision.xml` | 512 | 10.74 | 23.31 | 12.15 | 14.05 | +15.6% |

Overhead range across the matrix: +15.6% .. +39.5%.

The benchmarks are decent: not amazing, not horrible. I have several performance enhancements in mind (aside from CUDA graph support, of course) that might bring the overhead down quite a bit. I have already implemented two, which helped noticeably.

Luckily, I was able to benchmark this a couple of days ago. Since then I installed the latest Windows updates, and I am 99.9% sure they caused a CUDA latency regression, because I can no longer reproduce these numbers on this PR or on the last one. The latency is bad enough that I cannot test on this Windows machine at all, so I may struggle to get better performance benchmarks until I find a workaround for this security update.

Performance enhancements in this PR

Two small perf enhancements are included so that 2/N is closer to the 1/N cost. Neither changes the determinism semantics; both are measured against the same matrix.

Persisted deterministic scratch buffers. Allocates every scratch buffer once per (m, d) and reuses it across steps instead of calling wp.empty(...) for every constraint family. Measured on this branch (humanoid.xml and collision.xml, 3 trials × 500 steps):

| model | nworld | overhead before | overhead after | pp reduction | % reduction |
|---|---|---|---|---|---|
| `humanoid/humanoid.xml` | 1 | +36.6% | +20.3% | −16.3 | −45% |
| `humanoid/humanoid.xml` | 64 | +40.5% | +21.5% | −19.0 | −47% |
| `humanoid/humanoid.xml` | 512 | +42.5% | +36.0% | −6.5 | −15% |
| `collision.xml` | 64 | +29.1% | +12.4% | −16.7 | −57% |
| `collision.xml` | 512 | +23.1% | +13.6% | −9.5 | −41% |
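For reference, the two reduction columns above appear to follow from the before/after overheads as a percentage-point difference and a relative change. The helper below is a hypothetical illustration of that arithmetic, not code from the PR.

```python
def reduction(before_pct, after_pct):
    """Return (percentage-point change, relative change in percent)."""
    pp = after_pct - before_pct
    rel = 100.0 * pp / before_pct
    return pp, rel


# humanoid, nworld=1: overhead went from +36.6% to +20.3%.
pp, rel = reduction(36.6, 20.3)
print(f"{pp:+.1f} pp, {rel:+.0f}%")  # prints "-16.3 pp, -45%"
```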

Saves ~370–830 µs/step of device-side work.
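The allocate-once pattern can be sketched as follows. This is a plain-Python stand-in, not the PR's implementation: `Scratch`, `get_scratch`, and the explicit `key` are hypothetical, and lists take the place of the wp.array buffers that the real code would hold per (m, d).

```python
class Scratch:
    """Stand-in for device scratch arrays (real code would hold wp.array)."""

    def __init__(self, nworld, nfamily):
        self.counts = [[0] * nfamily for _ in range(nworld)]
        self.offsets = [[0] * nfamily for _ in range(nworld)]
        self.nefc_base = [0] * nworld


_cache = {}
alloc_count = 0


def get_scratch(key, nworld, nfamily):
    """Return the scratch for `key`, allocating only on first use."""
    global alloc_count
    scratch = _cache.get(key)
    if scratch is None:
        scratch = Scratch(nworld, nfamily)
        _cache[key] = scratch
        alloc_count += 1
    return scratch


# 500 simulated steps trigger a single allocation.
for _ in range(500):
    s = get_scratch(("model0", "data0"), nworld=64, nfamily=5)
assert alloc_count == 1
```

One consequence of reuse is that the buffers carry over values from the previous step, so a scheme like this relies on every pass fully overwriting (or explicitly clearing) what it reads.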

Skip zero-size families. A Python-side early skip for constraint families whose size is 0 (e.g. unused equality / friction / limit) avoids ~10–20 no-op kernel launches per step on humanoid-like models. The saving is small but consistent, with the largest impact at small nworld, where launch overhead dominates.
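A minimal sketch of the early skip, assuming one launch per family for simplicity. `launch_constraint_families` and the `launch` callback are hypothetical placeholders for the actual kernel launches in constraint.py.

```python
def launch_constraint_families(family_sizes, launch):
    """Launch work only for families that have at least one constraint."""
    launched = 0
    for name, size in family_sizes.items():
        if size == 0:
            continue  # early skip on the Python side: no kernel launch
        launch(name, size)
        launched += 1
    return launched


sizes = {"equality": 0, "friction": 0, "limit": 4, "contact": 12}
calls = []
n = launch_constraint_families(sizes, lambda name, size: calls.append(name))
assert n == 2 and calls == ["limit", "contact"]
```

The check is valid on the host because family sizes come from the model and are known before the step runs, unlike the number of active rows, which is only known on the device.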


mar-yan24 and others added 7 commits April 17, 2026 22:27

beta determinism 2

7ce36c2

(cherry picked from commit 689867c)

testing for row allocation

06598f5

(cherry picked from commit e5deba2)

cleanup

0b2fecd

(cherry picked from commit 7664301)

collision test expansion

bd5a8cd

(cherry picked from commit 28270ce)

persisted deterministic scratch for performance

e57cd53

(cherry picked from commit 6fb877c)

get rid of colon

f64d5fb

(cherry picked from commit 96c6e8a)

performance: skip zero-size deterministic constraint families

afc9ef9

(cherry picked from commit 56401ad)

eric-heiden mentioned this pull request May 1, 2026

Determinism in Newton newton-physics/newton#2479

Open

mar-yan24 wants to merge 7 commits into google-deepmind:main from mar-yan24:mark/determinism2-draft
