Determinism support 2/N #1300
Draft
mar-yan24 wants to merge 7 commits into
Determinism support 2/N: deterministic constraint row allocation
For now I am just putting this in as a draft, opening it up early against main for review. Checks won't pass because 1/N changes are omitted.
Summary
This PR extends the opt-in `opt.deterministic` flag introduced in #1281 to make constraint row allocation reproducible across repeated runs of the same input. The `wp.atomic_add(nefc_out, worldid, N)` allocation used inside each constraint kernel is replaced with a deterministic count -> exclusive-scan -> emit pipeline. After 2/N, every constraint row should be at the same position on every run, so `d.nefc`, `d.efc.*`, and `d.efc.J` are bitwise stable.

Guarantees after 2/N (on top of 1/N)
- `d.contact.*` ordering (from 1/N).
- `d.nefc` across runs.
- `d.efc.*` values.
- `d.efc.J` (including `J_rownnz`, `J_rowadr`, `J_colind`).

Not yet guaranteed (deliberately out of scope)
- `qacc`, `qvel`, `qpos`. Still gated on deterministic solver reductions.
- `opt.deterministic=True`. Blocked by host-side overflow readback.

Changes
- `mujoco_warp/_src/constraint.py`: replaces atomic slot allocation with the deterministic count -> scan -> emit pipeline across all constraint families (equality, friction, limit, contact pyramidal, contact elliptic).
- `mujoco_warp/_src/constraint.py`: persisted deterministic scratch buffers. All per-family `counts`, `nnz_counts`, `offsets`, `nnz_offsets`, `nefc_base`, `nnz_base`, plus contact `world_start` / `world_end`, are allocated once per `(m, d)` and reused across steps.
- `mujoco_warp/_src/constraint.py`: skip zero-size families. A Python-side early-skip for families whose `size == 0` avoids ~10–20 no-op kernel launches per step on models like humanoid.
- `mujoco_warp/_src/types.py`: `opt.deterministic` docstring updated to reflect that the flag now also covers constraint row allocation.
- `mujoco_warp/_src/determinism_test.py`: expanded determinism regression coverage for `nefc`, per-row `efc.*`, dense and sparse `efc.J`, canonicalized det-on vs det-off row-multiset equivalence, and benchmark-path smoke coverage for both solver paths.
- `mujoco_warp/_src/benchmark.py`, `mujoco_warp/testspeed.py`: expose `--use_cuda_graph` at the CLI and pipe it through `benchmark()` so `opt.deterministic=True` can be benchmarked on the non-captured path while the host-side overflow readback remains (I can remove this if unwanted, it is mainly there to help me test).

Benchmarks
I used the same benchmarking approach as in the last PR, just extended, thanks to Claude lol.
Environment:
- NVIDIA GeForce RTX 4060 Laptop GPU (8 GiB, `sm_89`)
- 1.13.0.dev20260227, CUDA Toolkit 12.9, Driver 12.5
- `us/step = 1e6 * run_duration / (nworld * nstep)`.
- `use_cuda_graph=False` on both off and on runs (required for deterministic mode today, host-side overflow readback is not capture-safe).
- `njmax = baseline + 32`, `njmax_nnz = baseline + 32 * nv` (deterministic overflow validation would otherwise trip after warmup).
- `collision.xml` `nworld=512` used `nccdmax=8` to fit in 8 GiB, applied to both runs.

Newton + Dense
[Table: 3 rows for `humanoid/humanoid.xml`, 3 rows for `collision.xml`]
CG + Sparse
[Table: 3 rows for `humanoid/humanoid.xml`, 3 rows for `collision.xml`]
Overhead range across the matrix: +15.6% .. +39.5%.
The benchmarks are decent: not amazing, not horrible. There are several performance enhancements I have in mind (aside from CUDA graph support, of course) that might bring the overhead down quite a bit. I already implemented two, which helped a decent bit.
Luckily, I was able to test this a couple of days ago. Since then I installed the new Windows updates, and I am 99.9% sure they caused a CUDA latency regression, because I can no longer reproduce the benchmark numbers for either this PR or the last one. The latency is so bad that I literally cannot test on this Windows slop machine, so I might struggle to get better benchmarks until I find a workaround for this security update.
Performance enhancements in this PR
Two small perf enhancements are included so that 2/N is closer to the 1/N cost. Neither changes the determinism semantics; both are measured against the same matrix.
Persisted deterministic scratch buffers. Allocates every scratch buffer once per `(m, d)` and reuses it across steps instead of `wp.empty(...)` on every constraint family. Measured on this branch (`humanoid.xml` and `collision.xml`, 3 trials × 500 steps):
[Table: 3 rows for `humanoid/humanoid.xml`, 2 rows for `collision.xml`]
Saves ~370–830 µs/step of device-side work.
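As a rough illustration of the allocate-once pattern described above, here is a minimal CPU-only sketch in plain Python. `ScratchCache`, `get`, `Dummy`, and the buffer layout are hypothetical names made up for this example, plain lists stand in for Warp arrays, and the actual code in `constraint.py` may be organized quite differently.

```python
class ScratchCache:
    """Hypothetical cache holding one set of scratch buffers per (m, d) pair."""

    # Buffer names borrowed from the PR description; the layout is an assumption.
    BUFFER_NAMES = ("counts", "nnz_counts", "offsets", "nnz_offsets")

    def __init__(self):
        self._entries = {}
        self.allocations = 0  # number of times buffers were actually allocated

    def get(self, m, d, nworld, nfamily):
        key = (id(m), id(d))
        entry = self._entries.get(key)
        if entry is None:
            # First call for this (m, d): allocate every buffer once.
            self.allocations += 1
            buffers = {
                name: [[0] * nworld for _ in range(nfamily)]
                for name in self.BUFFER_NAMES
            }
            # Keep m and d alive next to the buffers so their id() values
            # cannot be reused by other objects while the entry exists.
            entry = (m, d, buffers)
            self._entries[key] = entry
        return entry[2]


class Dummy:
    """Stand-in for the model / data objects (assumption for this sketch)."""


cache = ScratchCache()
m, d = Dummy(), Dummy()
first = cache.get(m, d, nworld=4, nfamily=5)
for _ in range(500):  # simulate 500 steps
    assert cache.get(m, d, nworld=4, nfamily=5) is first
print(cache.allocations)  # 1
```

The sketch ignores the case where `nworld` or the family count changes between steps; a real cache would presumably need to detect that and reallocate.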
Skip zero-size families. Python-side early-skip for constraint families whose family-size is 0 (e.g. unused equality / friction / limit). Avoids ~10–20 no-op kernel launches per step on humanoid-like models. Small but consistent saving, largest impact at small `nworld` where launch overhead dominates.
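To make the count -> exclusive-scan -> emit idea from the summary and the zero-size skip more concrete, here is a minimal single-world sketch in plain Python. It is not the Warp kernel code from this PR: `exclusive_scan`, `allocate_rows`, and the `(name, active_flags)` input format are hypothetical, and serial loops stand in for GPU kernels. The point it tries to show is that each row index is computed from the input alone, so it cannot depend on thread scheduling the way an atomic counter does.

```python
from itertools import accumulate


def exclusive_scan(counts):
    """Return offsets where offsets[i] == sum(counts[:i])."""
    if not counts:
        return []
    return [0] + list(accumulate(counts))[:-1]


def allocate_rows(families):
    """Assign a fixed row to every active constraint.

    families: list of (name, active_flags), one flag per candidate constraint.
    Returns (nefc, rows) where rows[r] == (family_name, candidate_index).
    """
    # Early skip: families with no candidates never reach the passes below.
    live = [(name, flags) for name, flags in families if len(flags) > 0]

    # Count pass: number of active rows per family.
    counts = [sum(1 for f in flags if f) for _, flags in live]

    # Scan pass: base row of each family.
    bases = exclusive_scan(counts)
    nefc = sum(counts)
    rows = [None] * nefc

    # Emit pass: each active candidate writes to base + local offset.
    for (name, flags), base in zip(live, bases):
        local = exclusive_scan([1 if f else 0 for f in flags])
        for i, active in enumerate(flags):
            if active:
                rows[base + local[i]] = (name, i)
    return nefc, rows


families = [
    ("equality", []),                  # zero-size family, skipped
    ("limit", [True, False, True]),
    ("contact", [False, True]),
]
nefc, rows = allocate_rows(families)
print(nefc)  # 3
print(rows)  # [('limit', 0), ('limit', 2), ('contact', 1)]
```

Running `allocate_rows` repeatedly on the same input always yields the same `rows`, which is the property the determinism tests check for `d.nefc` and `d.efc.*`.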