Warp determinism#1355
Conversation
Introduce wp.config.deterministic flag that makes floating-point atomic operations produce bit-exact reproducible results across runs. Two atomic usage patterns are handled transparently: Pattern A (accumulation): atomic_add/sub with unused return values are redirected to scatter buffers during kernel execution, then sorted by (dest_index, thread_id) and reduced in fixed order post-kernel via CUB radix sort and a custom segmented reduce kernel. Pattern B (counter/allocator): atomic_add with consumed return values (slot = atomic_add(counter, 0, 1)) use automatic two-pass execution: Phase 0 records per-thread contributions with side effects suppressed, prefix sum computes deterministic offsets, Phase 1 re-executes with deterministic slot assignments. Both patterns can coexist in a single kernel. Integer atomics with unused return values are left unchanged (already deterministic). CPU execution is unaffected (already sequential). Configurable at global, module, and kernel levels. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds an opt-in deterministic execution mode for supported atomic operations: new global/module/kernel config flags, codegen interception to emit scatter or two‑phase counter patterns, deterministic metadata and buffer management, CUDA device/CPU stubs for deterministic sort‑reduce, build updates, and extensive tests and benchmarks. Changes
Sequence Diagram(s)sequenceDiagram
participant User as User Kernel
participant Python as Python Runtime
participant CUDA as CUDA Device
participant ScatterBuf as Scatter Buffers
participant SortReduce as Sort-Reduce Kernel
participant Dest as Destination Array
User->>CUDA: Launch kernel (scatter pattern)
CUDA->>ScatterBuf: wp::deterministic::scatter (pack key+value, inc counter)
CUDA->>Python: Kernel returns
Python->>ScatterBuf: Read record count
Python->>SortReduce: Call device sort-reduce entry
SortReduce->>ScatterBuf: Radix sort keys/values
SortReduce->>Dest: Deterministic segment-wise reduce into dest
SortReduce->>Python: Complete
Python->>User: Return results
sequenceDiagram
participant User as User Kernel
participant Python as Python Runtime
participant Phase0 as Phase 0 (Count)
participant Scan as Prefix Scan
participant Phase1 as Phase 1 (Execute)
participant Counter as Counter Array
User->>Python: Launch deterministic kernel (counter pattern)
Python->>Phase0: Run kernel with _wp_det_phase=0 (suppress side effects)
Phase0->>Python: Accumulated per-counter counts
Python->>Scan: Compute deterministic prefix sums (offsets)
Scan->>Python: Offsets ready
Python->>Phase1: Run kernel with _wp_det_phase=1 (use offsets)
Phase1->>Counter: Writeback using deterministic offsets
Phase1->>Python: Complete
Python->>User: Return results
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
Greptile SummaryThis PR adds a deterministic execution mode (
Confidence Score: 3/5Safe to merge for the common usage patterns; one code path silently double-accumulates integer atomics in helper functions called from counter kernels. The counter two-pass mode guards integer atomics directly in the kernel body, but the analogous guard is missing when those atomics live in a @wp.func helper called by the counter kernel. The helper's _det_has_consumed_atomic is computed from the helper's own AST and is False for helpers with only integer atomics, so _emit_deterministic_atomic returns None with no phase guard. The helper fires in Phase 0 and Phase 1, doubling the accumulated result. The array-store analog was fixed via _det_has_side_effect_store; the same mechanism is not yet applied to integer atomics in user functions. warp/_src/codegen.py — the _emit_deterministic_atomic integer-atomic fall-through does not guard helper-function atomics when called from a two-pass counter kernel Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant launch as wp.launch()
participant LD as _launch_deterministic()
participant GPU as CUDA Kernel
participant SR as run_sort_reduce / run_counter_scan
User->>launch: wp.launch(kernel, dim, ...)
launch->>launch: detect det_meta (scatter/counter targets)
alt Scatter-only kernel
launch->>LD: _launch_deterministic(...)
LD->>LD: allocate scatter_bufs + overflow_buf
LD->>GPU: Phase 1 - scatter (key,val) records
LD->>SR: CUB sort + segmented reduce into dest_array
LD->>LD: check overflow_buf
else Counter (two-pass) kernel
launch->>LD: _launch_deterministic(...)
LD->>LD: allocate counter_bufs + scatter_bufs + overflow_buf
LD->>GPU: Phase 0 - count reservations (side effects suppressed)
LD->>LD: read count[], check overflow
LD->>SR: CUB sort + inclusive prefix scan - per-record slot offsets
LD->>LD: zero cursors[]
LD->>GPU: Phase 1 - execute with deterministic slots + scatter
LD->>SR: sort-reduce scatter records (if any)
LD->>LD: check overflow_buf
end
LD->>User: return (bit-exact results)
|
There was a problem hiding this comment.
Actionable comments posted: 9
🧹 Nitpick comments (1)
warp/native/deterministic.h (1)
41-49: Clarify the non-CUDA branch comment.Line 41 says “direct accumulation,” but Lines 42-48 are a no-op. Rewording this would prevent confusion during maintenance.
✏️ Suggested comment tweak
- // CPU path: direct accumulation (CPU kernels are sequential). + // Non-CUDA path: no-op for this helper (CPU execution does not use scatter buffers).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@warp/native/deterministic.h` around lines 41 - 49, The comment "CPU path: direct accumulation (CPU kernels are sequential)" is misleading because the non-CUDA branch simply voids a set of unused variables (keys, values, counter, capacity, dest_flat_idx, thread_id, value); change the comment near that void list in deterministic.h to clearly state these variables are intentionally unused in the non-CUDA build (e.g., "Non-CUDA build: no per-thread accumulation — explicitly mark these kernel-specific variables as unused") so maintainers understand why the (void)XXX lines are present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CHANGELOG.md`:
- Around line 7-11: Update the CHANGELOG entry (mentioning
wp.config.deterministic) to be API-level and include a GitHub issue/PR
reference: remove internal implementation details like "scatter-sort-reduce" and
"two-pass execution with prefix-sum-based slot assignment", instead describe the
user-visible change (e.g., "Added a deterministic execution mode for atomic
operations via wp.config.deterministic = True that makes atomic accumulations
reproducible across runs"), append a short note about scope
(global/module/kernel) and add the GH issue/PR number (e.g., "See `#1234`") and
affected version/release tag.
In `@design/deterministic-execution.md`:
- Around line 108-109: The doc currently mixes exclusive and inclusive scan
conventions: update the design to use a single convention for
wp.utils.array_scan(writeback rule) — either always document it as exclusive
(inclusive=False) and state that the total is computed as prefix[last] +
contrib[last], or document it as inclusive (inclusive=True) and state that the
total is prefix[last]; make the descriptions at the wp.utils.array_scan example
near "Prefix sum" (previously showing inclusive=False) and the section that
references "total comes from the last element" (lines ~159-160) consistent by
choosing one convention and adjusting the text to reference
wp.utils.array_scan(contrib, prefix, inclusive=<True/False>) and the
corresponding rule for computing the thread-local total.
In `@warp/_src/codegen.py`:
- Around line 42-44: The frozenset _DET_ORDER_DEPENDENT_ATOMICS (containing
"atomic_cas" and "atomic_exch") is defined but unused; either remove it or
implement the promised "warn but don't intercept" behavior: locate the code path
that dispatches/handles atomic ops (e.g., the function that processes atomic
intrinsics or emits atomics such as the atomic handling/emit function in
codegen.py), and when an atomic op name is in _DET_ORDER_DEPENDENT_ATOMICS, emit
a clear warning (use warnings.warn or the module logger) stating that
order-dependent atomics are not intercepted and will run with native ordering,
then continue normal processing without interception; if you prefer removal,
delete the _DET_ORDER_DEPENDENT_ATOMICS constant and any related comment and add
a brief unit test or code comment documenting the choice.
- Around line 1881-1882: The fallback that sets flat_idx_expr = "0" when ndim >
4 is unsafe because it silently writes everything to index 0; replace this
silent fallback with an explicit error: detect the unsupported ndim case in the
same code path in codegen.py (the branch that currently assigns flat_idx_expr)
and raise a clear exception (e.g., ValueError or RuntimeError) that includes the
invalid ndim value and a message stating Warp arrays support up to 4 dimensions;
do not assign "0" as a default index.
In `@warp/_src/context.py`:
- Around line 7477-7480: set_param_at_index_from_ctype() and
set_params_from_ctypes() mutate self.params but do not keep self.fwd_args in
sync, so _launch_deterministic() can replay using stale array objects; update
those methods to mirror the logic in set_param_at_index(): when adjoint is False
and the target index is within range(len(self.fwd_args)), assign the new value
into self.fwd_args[index] (or for bulk updates, update the corresponding
slice/indices) — or refactor those methods to call set_param_at_index(index,
value, adjoint) for each changed param so fwd_args stays consistent with params
for deterministic replay.
- Around line 7589-7602: The two-pass path must preserve the caller stream and
keep the counter's initial value: ensure the intermediate scans
(warp._src.utils.array_scan calls), the warp.copy that writes the total back
into the counter array, and the subsequent run_sort_reduce invocation are
executed on the same stream passed into this path (propagate the local stream
object into those calls or use stream-aware variants) and when writing the
counter combine the existing counter value with this-launch total (read the
current counter_arr[0], add inclusive_out[dim_size-1] and write the sum back)
instead of overwriting; update references around array_scan, inclusive_out,
counter_arr (found via kernel.adj.args and fwd_args), warp.copy and
run_sort_reduce so they all use the caller stream and perform an atomic/ordered
add of the previous counter value plus the new total.
In `@warp/_src/deterministic.py`:
- Around line 140-161: The mapping and helpers currently claim support for
warp.float16 but the native reducer entrypoint
wp_deterministic_sort_reduce_float_device (and the C++ reinterpret-casts in
native/deterministic.cu) do not handle 16-bit halves; update _WARP_TO_CTYPE and
the conversion helpers to fail fast for half: remove or change the "wp::half"
mapping so warp_type_to_ctype raises for warp.float16 (or explicitly check and
raise in is_float_type/warp_type_to_ctype), and add a clear ValueError
mentioning warp.float16 and wp_deterministic_sort_reduce_float_device so any
attempt to use half reductions immediately errors until a real half reducer is
implemented.
In `@warp/tests/test_deterministic.py`:
- Around line 789-790: Remove the call to wp.clear_kernel_cache() from this test
module (including the __main__ block) because wp.clear_kernel_cache() is
disallowed in test files; simply delete the line invoking
wp.clear_kernel_cache() so no cache-clearing is performed here.
In `@warp/tests/unittest_suites.py`:
- Line 142: The default_suite() function imports TestDeterministic but never
includes it in the test_classes list, so deterministic tests are skipped; update
the test_classes array/variable inside default_suite() to include
TestDeterministic (alongside the other classes), ensuring the symbol
TestDeterministic is added to the list used to build the suite returned by
default_suite().
---
Nitpick comments:
In `@warp/native/deterministic.h`:
- Around line 41-49: The comment "CPU path: direct accumulation (CPU kernels are
sequential)" is misleading because the non-CUDA branch simply voids a set of
unused variables (keys, values, counter, capacity, dest_flat_idx, thread_id,
value); change the comment near that void list in deterministic.h to clearly
state these variables are intentionally unused in the non-CUDA build (e.g.,
"Non-CUDA build: no per-thread accumulation — explicitly mark these
kernel-specific variables as unused") so maintainers understand why the
(void)XXX lines are present.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 2e7bc7a6-3bed-4194-bd96-d94feaed2cf7
📒 Files selected for processing (13)
CHANGELOG.mdbuild_lib.pydesign/deterministic-execution.mdwarp/_src/codegen.pywarp/_src/context.pywarp/_src/deterministic.pywarp/config.pywarp/native/deterministic.cppwarp/native/deterministic.cuwarp/native/deterministic.hwarp/native/warp.hwarp/tests/test_deterministic.pywarp/tests/unittest_suites.py
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (3)
warp/_src/deterministic.py (1)
140-143:⚠️ Potential issue | 🔴 CriticalFail fast for
warp.float16deterministic reductions.
warp.float16is still wired through the deterministic scatter/reduce path, but the native entry points only reinterpret buffers asfloat*ordouble*. Half buffers will therefore be read with the wrong element size and produce corrupted reductions. Please reject this path until a real half reducer exists.Minimal safe guard
- if target.value_ctype in ("float", "wp::half"): + if target.value_ctype == "float": fn = runtime.core.wp_deterministic_sort_reduce_float_device + elif target.value_ctype == "wp::half": + raise RuntimeError("Deterministic float16 atomics are not supported yet.") elif target.value_ctype == "double": fn = runtime.core.wp_deterministic_sort_reduce_double_deviceAlso applies to: 159-161, 177-180, 219-223
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@warp/_src/deterministic.py` around lines 140 - 143, The mapping _WARP_TO_CTYPE currently includes warp.float16 but the deterministic scatter/reduce path does not support half-precision and will read buffers with wrong element size; update deterministic.py to fail fast whenever warp.float16 is encountered in the deterministic reduction code paths by removing or guarding the warp.float16 entry in _WARP_TO_CTYPE and adding an explicit check that raises a clear exception (or returns an error) when a reducer or conversion function (the code paths that reference _WARP_TO_CTYPE) sees warp.float16; ensure the exception message names warp.float16 and the deterministic reduction path so callers get a clear rejection until a proper half reducer is implemented.warp/tests/test_deterministic.py (1)
955-956:⚠️ Potential issue | 🟠 MajorRemove the kernel-cache clear from this test module.
wp.clear_kernel_cache()is disallowed in tests and can crash parallel CI runs.Minimal fix
if __name__ == "__main__": - wp.clear_kernel_cache() unittest.main(verbosity=2)As per coding guidelines, "Never call
wp.clear_kernel_cache()orwp.clear_lto_cache()in test files—not in__main__blocks, test methods, or module scope. Cache clearing is not multi-process-safe; concurrent clears cause LLVM crashes."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@warp/tests/test_deterministic.py` around lines 955 - 956, Remove the disallowed kernel-cache clear from the module-level __main__ block: delete the call to wp.clear_kernel_cache() found in the if __name__ == "__main__": section of warp/tests/test_deterministic.py; do not replace it with any cache-clearing call (wp.clear_lto_cache or similar) and ensure no other module-scope or __main__-scoped cache-clear calls remain.design/deterministic-execution.md (1)
111-112:⚠️ Potential issue | 🟡 MinorUse one scan convention throughout the doc.
Lines 111-112 describe an exclusive scan, but Lines 170-172 still say the total comes from the last element of an inclusive scan. Please make the writeback rule consistent with the convention the implementation actually uses.
Also applies to: 170-172
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@design/deterministic-execution.md` around lines 111 - 112, The doc is inconsistent about scan convention: wp.utils.array_scan(contrib, prefix, inclusive=False) is described as exclusive at lines 111-112 but later (lines 170-172) claims the total comes from the last element of an inclusive scan; pick one convention and make the writeback rule consistent with the actual implementation. Update the description of wp.utils.array_scan, the example/notation for "prefix" and the writeback rule (the statement about where the total/last-offset is read) so they all use the same convention (either inclusive or exclusive) and mention the inclusive flag (inclusive=False/True) in the writeback explanation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@warp/_src/codegen.py`:
- Around line 1647-1650: Pattern B (two-pass deterministic atomic interception)
only runs when _det_in_assign is set during emit_Assign(), so atomic calls
nested in subscripts, call arguments, or larger expressions (e.g.,
output[wp.atomic_add(...)]) never use adj._emit_deterministic_atomic and remain
nondeterministic; update the compiler to propagate the
"deterministic-assignment" context beyond plain Assign RHS evaluation by
checking _det_in_assign (or equivalent flag) within expression emitters that can
produce targets/indices/args — specifically modify emit_Subscript, emit_Call
(and other expression emitters referenced around lines 3336-3345) to consult
adj._emit_deterministic_atomic for funcs in _DET_INTERCEPTABLE_ATOMICS (using
func.is_builtin() and func.key) and call
adj._emit_deterministic_atomic(bound_args, return_type, output, output_list)
when the flag is active so atomics inside subscripts/args follow the two-pass
deterministic path.
- Around line 1834-1849: The Pattern B counter/allocator block that runs when
return_is_consumed currently assumes an add-style atomic and unconditionally
emits contrib += value and prefix returns; restrict this path to only add-style
atomics: inside the branch guarded by return_is_consumed (the block that calls
get_or_create_counter_target and writes _wp_det_contrib and _wp_det_prefix via
adj.add_forward), inspect the counter target's atomic/op kind (from the target
returned by get_or_create_counter_target — e.g., target.op, target.atomic_op, or
whatever property names the CounterTarget uses) and only emit Pattern B when
that property indicates an atomic_add; otherwise fail fast (raise a clear
NotImplementedError or emit an error) explaining that consumed-return semantics
for atomic_sub/atomic_max are not implemented. Ensure the check is placed before
emitting the adj.add_forward code so non-add atomics do not get incorrect
rewrites.
In `@warp/_src/deterministic.py`:
- Around line 191-200: The scratch buffers in allocate_counter_buffers always
use warp.int32 which truncates non-32-bit counters; update
allocate_counter_buffers to inspect each CounterTarget.value_ctype and allocate
contrib/prefix with the matching dtype (e.g., warp.int64 or warp.uint64) so the
generated ABI names (_wp_det_contrib/_wp_det_prefix) match the counter width, or
if supporting only 32-bit is preferred, raise an explicit error when a
CounterTarget has a non-32-bit value_ctype; modify allocate_counter_buffers to
perform this dtype-selection/check using the CounterTarget.value_ctype before
creating warp.zeros/warp.empty.
---
Duplicate comments:
In `@design/deterministic-execution.md`:
- Around line 111-112: The doc is inconsistent about scan convention:
wp.utils.array_scan(contrib, prefix, inclusive=False) is described as exclusive
at lines 111-112 but later (lines 170-172) claims the total comes from the last
element of an inclusive scan; pick one convention and make the writeback rule
consistent with the actual implementation. Update the description of
wp.utils.array_scan, the example/notation for "prefix" and the writeback rule
(the statement about where the total/last-offset is read) so they all use the
same convention (either inclusive or exclusive) and mention the inclusive flag
(inclusive=False/True) in the writeback explanation.
In `@warp/_src/deterministic.py`:
- Around line 140-143: The mapping _WARP_TO_CTYPE currently includes
warp.float16 but the deterministic scatter/reduce path does not support
half-precision and will read buffers with wrong element size; update
deterministic.py to fail fast whenever warp.float16 is encountered in the
deterministic reduction code paths by removing or guarding the warp.float16
entry in _WARP_TO_CTYPE and adding an explicit check that raises a clear
exception (or returns an error) when a reducer or conversion function (the code
paths that reference _WARP_TO_CTYPE) sees warp.float16; ensure the exception
message names warp.float16 and the deterministic reduction path so callers get a
clear rejection until a proper half reducer is implemented.
In `@warp/tests/test_deterministic.py`:
- Around line 955-956: Remove the disallowed kernel-cache clear from the
module-level __main__ block: delete the call to wp.clear_kernel_cache() found in
the if __name__ == "__main__": section of warp/tests/test_deterministic.py; do
not replace it with any cache-clearing call (wp.clear_lto_cache or similar) and
ensure no other module-scope or __main__-scoped cache-clear calls remain.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 4ca0283f-84b8-4939-90c1-9a656d3d0098
📒 Files selected for processing (8)
design/deterministic-execution.mdwarp/_src/codegen.pywarp/_src/context.pywarp/_src/deterministic.pywarp/config.pywarp/native/deterministic.cuwarp/native/deterministic.hwarp/tests/test_deterministic.py
✅ Files skipped from review due to trivial changes (2)
- warp/native/deterministic.h
- warp/native/deterministic.cu
🚧 Files skipped from review as they are similar to previous changes (1)
- warp/_src/context.py
|
Tip: Greploop — Automatically fix all review issues by running Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal. |
Generalize deterministic scatter-reduce to operate on Warp leaf types instead of only scalar float values. This preserves existing component-wise atomic semantics for vectors and matrices while keeping the graph-capture-safe fixed-capacity launch path. Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
There was a problem hiding this comment.
♻️ Duplicate comments (2)
warp/tests/test_deterministic.py (1)
1143-1145:⚠️ Potential issue | 🟠 MajorRemove the kernel-cache clear from this test module.
wp.clear_kernel_cache()is explicitly disallowed in test files and can break parallel test runs.Minimal fix
if __name__ == "__main__": - wp.clear_kernel_cache() unittest.main(verbosity=2)As per coding guidelines, "Never call
wp.clear_kernel_cache()orwp.clear_lto_cache()in test files—not in__main__blocks, test methods, or module scope. Cache clearing is not multi-process-safe; concurrent clears cause LLVM crashes."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@warp/tests/test_deterministic.py` around lines 1143 - 1145, Remove the call to wp.clear_kernel_cache() from the module-level __main__ block; locate the if __name__ == "__main__": block (where wp.clear_kernel_cache() is called before unittest.main) and delete that wp.clear_kernel_cache() invocation so tests no longer call kernel-cache clearing from the test module.design/deterministic-execution.md (1)
114-115:⚠️ Potential issue | 🟡 MinorUse one scan convention throughout the design.
The Pattern B section documents
array_scan(..., inclusive=False), but the writeback section still says the total comes from the last element of an inclusive scan. Please pick one convention and make the total-count rule match it everywhere.Also applies to: 173-175
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@design/deterministic-execution.md` around lines 114 - 115, The docs use two different scan conventions; standardize on one and make all references consistent: choose whether wp.utils.array_scan(contrib, prefix, inclusive=True) or inclusive=False is the canonical API, then update Pattern B and the writeback section (and the other occurrence at lines ~173-175) so the phrase "total comes from the last element" matches that convention (e.g., if using inclusive=False, state that total = prefix[-1] + contrib[-1] or if inclusive=True, state total = prefix[-1]); update any explanatory text and examples to reference wp.utils.array_scan(..., inclusive=...) and the single total-count rule consistently.
🧹 Nitpick comments (1)
warp/tests/test_deterministic.py (1)
692-753: Assert the actual slot order, not just counts/permutations.
counter_kernelshould yieldoutput == data_np, andconditional_counter_kernelshould yieldoutput[:expected_count] == data_np[data_np > threshold]. Right now a stable but wrong permutation would still pass because these tests only check counts, sorting, and cross-run reproducibility.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@warp/tests/test_deterministic.py` around lines 692 - 753, Update the two tests to assert actual slot order rather than just counts/permutations: in test_counter_correctness replace the sorted/permutation check with a direct equality assertion that output.numpy() (or output.numpy().tolist()) exactly equals data_np (ensuring dtype/shape match) to verify counter_kernel produces output == data_np; in test_conditional_counter replace the permutation-based checks with a direct equality assertion that output.numpy()[:expected_count] equals data_np[data_np > threshold] to verify conditional_counter_kernel preserves the original relative order of selected elements; keep the existing cross-run determinism checks but change the comparisons to elementwise equality rather than sorted/comparison of permutations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@design/deterministic-execution.md`:
- Around line 114-115: The docs use two different scan conventions; standardize
on one and make all references consistent: choose whether
wp.utils.array_scan(contrib, prefix, inclusive=True) or inclusive=False is the
canonical API, then update Pattern B and the writeback section (and the other
occurrence at lines ~173-175) so the phrase "total comes from the last element"
matches that convention (e.g., if using inclusive=False, state that total =
prefix[-1] + contrib[-1] or if inclusive=True, state total = prefix[-1]); update
any explanatory text and examples to reference wp.utils.array_scan(...,
inclusive=...) and the single total-count rule consistently.
In `@warp/tests/test_deterministic.py`:
- Around line 1143-1145: Remove the call to wp.clear_kernel_cache() from the
module-level __main__ block; locate the if __name__ == "__main__": block (where
wp.clear_kernel_cache() is called before unittest.main) and delete that
wp.clear_kernel_cache() invocation so tests no longer call kernel-cache clearing
from the test module.
---
Nitpick comments:
In `@warp/tests/test_deterministic.py`:
- Around line 692-753: Update the two tests to assert actual slot order rather
than just counts/permutations: in test_counter_correctness replace the
sorted/permutation check with a direct equality assertion that output.numpy()
(or output.numpy().tolist()) exactly equals data_np (ensuring dtype/shape match)
to verify counter_kernel produces output == data_np; in test_conditional_counter
replace the permutation-based checks with a direct equality assertion that
output.numpy()[:expected_count] equals data_np[data_np > threshold] to verify
conditional_counter_kernel preserves the original relative order of selected
elements; keep the existing cross-run determinism checks but change the
comparisons to elementwise equality rather than sorted/comparison of
permutations.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: a316d310-4647-410d-a32d-1acc2968e8e5
📒 Files selected for processing (8)
design/deterministic-execution.mdwarp/_src/codegen.pywarp/_src/context.pywarp/_src/deterministic.pywarp/native/deterministic.cppwarp/native/deterministic.cuwarp/native/warp.hwarp/tests/test_deterministic.py
🚧 Files skipped from review as they are similar to previous changes (4)
- warp/native/deterministic.cpp
- warp/_src/context.py
- warp/_src/codegen.py
- warp/_src/deterministic.py
Replace the capacity-style deterministic scatter option with deterministic_max_records so the API matches the codegen lower-bound model. This lets users size dynamic-loop workloads in terms of per-thread atomic record counts instead of raw buffer capacity. Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (5)
warp/_src/context.py (2)
7628-7641:⚠️ Potential issue | 🟠 MajorKeep the count/scan/reduce fixups on
stream, and accumulate the existing counter value.
array_scan(),warp.copy(), andrun_sort_reduce()all omitstreamhere, so an explicit non-current stream or graph capture can reorder or escape the deterministic pass. Also, Line 7641 writes only this launch's total intocounter_arr[0]; a counter that started atNwill end attotalinstead ofN + total.Also applies to: 7670-7670
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@warp/_src/context.py` around lines 7628 - 7641, The code omits the stream argument and overwrites existing counter values; update calls to warp._src.utils.array_scan(contrib, prefix, inclusive=False) and the inclusive scan warp._src.utils.array_scan(contrib, inclusive_out, inclusive=True) as well as warp.copy and any run_sort_reduce invocations to pass the current non-default stream (preserve graph capture/determinism), and change the write into counter_arr (found by iterating kernel.adj.args for ct.array_var_label and obtained from fwd_args[j]) to add the inclusive_out last-element to the existing counter value instead of replacing it (i.e., read counter_arr[0], add inclusive_out[dim_size-1], and write the sum back using warp.copy on the given stream); apply the same stream+accumulate fix to the similar block around the other occurrence near the 7670 location.
7505-7508:⚠️ Potential issue | 🟠 MajorRaw-ctype array updates are still unsafe for deterministic replays.
Line 7507 only keeps
fwd_argsaligned forset_param_at_index(). The inheritedset_param_at_index_from_ctype()/set_params_from_ctypes()can still replace an array descriptor while_launch_deterministic()later uses the stale Python array object forwarp.copy()andrun_sort_reduce(). Recorded deterministic launches can therefore replay into the wrong array, or fail oncefwd_argsandself.paramsdiverge. If these APIs stay exposed,DeterministicLaunchneeds to reject array-ctype updates or carry the owning Warp array alongside the packed descriptor.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@warp/_src/context.py` around lines 7505 - 7508, The deterministic replay bug: set_param_at_index only updates fwd_args but inherited set_param_at_index_from_ctype / set_params_from_ctypes can replace raw C-type array descriptors in self.params leaving fwd_args stale, which breaks _launch_deterministic and later warp.copy/run_sort_reduce replay; to fix, either (A) make DeterministicLaunch reject/raise when set_param_at_index_from_ctype or set_params_from_ctypes is used with array-ctypes, or (B) ensure those ctype-updating paths also update the corresponding Python owning Warp array stored in fwd_args (and keep self.params and fwd_args synchronized) so that set_param_at_index_from_ctype / set_params_from_ctypes maintain alignment with fwd_args before _launch_deterministic runs.design/deterministic-execution.md (1)
114-115:⚠️ Potential issue | 🟡 MinorUse one scan convention in the design doc.
Line 114 documents
wp.utils.array_scan(contrib, prefix, inclusive=False), but Lines 176-178 say the total comes from the last element of an inclusive scan. Please make both sections describe the same writeback rule. If the scan stays exclusive, the total isprefix[-1] + contrib[-1].Also applies to: 176-178
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@design/deterministic-execution.md` around lines 114 - 115, The doc currently mixes scan conventions: update both occurrences to use the same convention for wp.utils.array_scan(contrib, prefix, inclusive=False) and the writeback rule; either change the function call to inclusive=True and state the total is prefix[-1], or keep inclusive=False and change the total calculation text (the section that currently says the total comes from the last element of an inclusive scan) to the exclusive rule: total = prefix[-1] + contrib[-1]; make this consistent for the wp.utils.array_scan(contrib, prefix, inclusive=False) mention and the later total computation paragraph.warp/tests/test_deterministic.py (1)
1190-1192:⚠️ Potential issue | 🟠 MajorRemove the kernel-cache clear from this test module.
wp.clear_kernel_cache()is explicitly disallowed in test files and can crash parallel test runs.unittest.main()is enough here.🧹 Minimal fix
if __name__ == "__main__": - wp.clear_kernel_cache() unittest.main(verbosity=2)As per coding guidelines, "Never call
wp.clear_kernel_cache()orwp.clear_lto_cache()in test files—not in__main__blocks, test methods, or module scope. Cache clearing is not multi-process-safe; concurrent clears cause LLVM crashes."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@warp/tests/test_deterministic.py` around lines 1190 - 1192, Remove the explicit wp.clear_kernel_cache() call from the module-level __main__ block in this test file; keep only the unittest.main(verbosity=2) invocation so the test can run without clearing kernel caches (do not add wp.clear_lto_cache() either). Locate the __main__ guard containing wp.clear_kernel_cache() and delete that call, ensuring the block now simply calls unittest.main(...) and nothing else.warp/_src/deterministic.py (1)
228-231:⚠️ Potential issue | 🟠 MajorCounter scratch buffers still hard-code
wp.int32.
CounterTarget.value_ctypeis tracked, butcontribandprefixare always allocated aswp.int32. Any deterministicwp.int64/wp.uint64counter will silently wrap in Phase 0/1. Please make these scratch buffers followvalue_ctype, or fail fast for non-32-bit counters until the wider ABI is wired through end-to-end.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@warp/_src/deterministic.py` around lines 228 - 231, The loop that allocates counter scratch buffers currently hardcodes dtype=warp.int32 for contrib and prefix; change the allocation to use each CounterTarget's value_ctype (e.g., use dtype=target.value_ctype or map that C type to the corresponding warp dtype) so contrib and prefix match CounterTarget.value_ctype and avoid silent wrapping, or add an explicit fast-fail that raises if target.value_ctype is not a 32-bit integer type (when full 64-bit ABI isn't supported). Update the allocations in the for _target in counter_targets block (the contrib and prefix warp.zeros/warp.empty calls) to derive dtype from the target's value_ctype and keep shape= (dim_size,) and device=device.
🧹 Nitpick comments (2)
warp/tests/test_deterministic.py (1)
850-879:test_module_option_override()doesn't prove the override is active on CUDA.A single approximate-sum assertion will pass whether this kernel took the deterministic path or the normal atomic path, especially on CPU where both paths are deterministic. Please mirror
test_kernel_decorator_override()here: run the kernel several times on CUDA and assert bit-exact equality across runs, then keep the sum check as a secondary sanity check if you want.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@warp/tests/test_deterministic.py` around lines 850 - 879, test_module_option_override currently only checks approximate sums so it doesn't verify the per-module deterministic=True override on CUDA; update the test (function test_module_option_override) to mirror test_kernel_decorator_override by: when device is CUDA (check device == "cuda" or via wp.get_device_name/device backend), run the kernel per_kernel_det multiple times (e.g., 3+ runs) into separate outputs and assert bit-exact equality of the outputs across runs (use output.numpy() equality checks) to prove determinism, while retaining the existing sum/assert_allclose sanity check as a secondary assertion; ensure you still toggle wp.config.deterministic = False around the launch to verify the per-module override takes effect.warp/_src/deterministic.py (1)
193-193: AnnotatedevicewithDeviceLike.These helper signatures take
deviceuntyped, which diverges from the repo's device-parameter convention and makes this launch path harder to type-check consistently.As per coding guidelines, "Use
DeviceLiketype annotation (fromwarp._src.context) fordeviceparameters. Import underTYPE_CHECKINGto avoid circular imports."Also applies to: 222-222, 235-235
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@warp/_src/deterministic.py` at line 193, The device parameter in allocate_scatter_buffers should be annotated with DeviceLike: import DeviceLike from warp._src.context inside a TYPE_CHECKING block (from typing import TYPE_CHECKING) to avoid circular imports, then change the function signature to accept device: DeviceLike; apply the same change to the other two nearby helper functions in this module that take a device parameter so all device-typed helper signatures use DeviceLike consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@warp/_src/context.py`:
- Around line 7510-7526: Before calling _launch_deterministic in launch, guard
the deterministic path with the same null checks used by the regular CUDA path:
verify self.kernel is not None and that self.hooks.forward exists; if either
check fails raise a clear Python-side RuntimeError (or appropriate exception)
with a descriptive message instead of calling _launch_deterministic with a null
kernel handle. Apply the same change to DeterministicLaunch.launch to keep
behavior consistent.
- Around line 2198-2205: ModuleHasher.hash_kernel currently only hashes
kernel.key and kernel.adj which allows reuse of stale cached modules when
per-kernel overrides (e.g., kernel.options like deterministic or
deterministic_max_records) change; update ModuleHasher.hash_kernel to
incorporate kernel.options (or at minimum the deterministic and
deterministic_max_records flags/values) into the computed hash so that
Module.load and the unique-module cache distinguish kernels with different
per-kernel deterministic overrides, and ensure any code paths that call
hash_kernel (e.g., Module.load) will therefore generate distinct hashes when
those option values differ.
In `@warp/_src/deterministic.py`:
- Around line 68-74: CounterTarget is currently keyed only by array_var_label
causing get_or_create_counter_target() to collapse all logical counters (e.g.,
wp.atomic_add(counter, bucket, 1) across different bucket indices) into one
global sequence; fix by including the logical index in the CounterTarget
key/metadata (add a field for the logical counter index and propagate it from
the codegen caller into get_or_create_counter_target()), or alternatively
enforce deterministic mode by detecting non-constant-zero indices in
wp.atomic_add (and reject/raise an error) so Phase 1 cannot hand out global
offsets for variable indices; update all related uses (CounterTarget class,
get_or_create_counter_target(), and the call sites that build Pattern B
metadata) to reflect the new keying or the rejection behavior.
---
Duplicate comments:
In `@design/deterministic-execution.md`:
- Around line 114-115: The doc currently mixes scan conventions: update both
occurrences to use the same convention for wp.utils.array_scan(contrib, prefix,
inclusive=False) and the writeback rule; either change the function call to
inclusive=True and state the total is prefix[-1], or keep inclusive=False and
change the total calculation text (the section that currently says the total
comes from the last element of an inclusive scan) to the exclusive rule: total =
prefix[-1] + contrib[-1]; make this consistent for the
wp.utils.array_scan(contrib, prefix, inclusive=False) mention and the later
total computation paragraph.
In `@warp/_src/context.py`:
- Around line 7628-7641: The code omits the stream argument and overwrites
existing counter values; update calls to warp._src.utils.array_scan(contrib,
prefix, inclusive=False) and the inclusive scan
warp._src.utils.array_scan(contrib, inclusive_out, inclusive=True) as well as
warp.copy and any run_sort_reduce invocations to pass the current non-default
stream (preserve graph capture/determinism), and change the write into
counter_arr (found by iterating kernel.adj.args for ct.array_var_label and
obtained from fwd_args[j]) to add the inclusive_out last-element to the existing
counter value instead of replacing it (i.e., read counter_arr[0], add
inclusive_out[dim_size-1], and write the sum back using warp.copy on the given
stream); apply the same stream+accumulate fix to the similar block around the
other occurrence near the 7670 location.
- Around line 7505-7508: The deterministic replay bug: set_param_at_index only
updates fwd_args but inherited set_param_at_index_from_ctype /
set_params_from_ctypes can replace raw C-type array descriptors in self.params
leaving fwd_args stale, which breaks _launch_deterministic and later
warp.copy/run_sort_reduce replay; to fix, either (A) make DeterministicLaunch
reject/raise when set_param_at_index_from_ctype or set_params_from_ctypes is
used with array-ctypes, or (B) ensure those ctype-updating paths also update the
corresponding Python owning Warp array stored in fwd_args (and keep self.params
and fwd_args synchronized) so that set_param_at_index_from_ctype /
set_params_from_ctypes maintain alignment with fwd_args before
_launch_deterministic runs.
In `@warp/_src/deterministic.py`:
- Around line 228-231: The loop that allocates counter scratch buffers currently
hardcodes dtype=warp.int32 for contrib and prefix; change the allocation to use
each CounterTarget's value_ctype (e.g., use dtype=target.value_ctype or map that
C type to the corresponding warp dtype) so contrib and prefix match
CounterTarget.value_ctype and avoid silent wrapping, or add an explicit
fast-fail that raises if target.value_ctype is not a 32-bit integer type (when
full 64-bit ABI isn't supported). Update the allocations in the for _target in
counter_targets block (the contrib and prefix warp.zeros/warp.empty calls) to
derive dtype from the target's value_ctype and keep shape= (dim_size,) and
device=device.
In `@warp/tests/test_deterministic.py`:
- Around line 1190-1192: Remove the explicit wp.clear_kernel_cache() call from
the module-level __main__ block in this test file; keep only the
unittest.main(verbosity=2) invocation so the test can run without clearing
kernel caches (do not add wp.clear_lto_cache() either). Locate the __main__
guard containing wp.clear_kernel_cache() and delete that call, ensuring the
block now simply calls unittest.main(...) and nothing else.
---
Nitpick comments:
In `@warp/_src/deterministic.py`:
- Line 193: The device parameter in allocate_scatter_buffers should be annotated
with DeviceLike: import DeviceLike from warp._src.context inside a TYPE_CHECKING
block (from typing import TYPE_CHECKING) to avoid circular imports, then change
the function signature to accept device: DeviceLike; apply the same change to
the other two nearby helper functions in this module that take a device
parameter so all device-typed helper signatures use DeviceLike consistently.
In `@warp/tests/test_deterministic.py`:
- Around line 850-879: test_module_option_override currently only checks
approximate sums so it doesn't verify the per-module deterministic=True override
on CUDA; update the test (function test_module_option_override) to mirror
test_kernel_decorator_override by: when device is CUDA (check device == "cuda"
or via wp.get_device_name/device backend), run the kernel per_kernel_det
multiple times (e.g., 3+ runs) into separate outputs and assert bit-exact
equality of the outputs across runs (use output.numpy() equality checks) to
prove determinism, while retaining the existing sum/assert_allclose sanity check
as a secondary assertion; ensure you still toggle wp.config.deterministic =
False around the launch to verify the per-module override takes effect.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: f76535da-266c-45ce-9df4-48647bb5e6f1
📒 Files selected for processing (4)
design/deterministic-execution.mdwarp/_src/context.pywarp/_src/deterministic.pywarp/tests/test_deterministic.py
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (6)
warp/_src/context.py (3)
7532-7535:⚠️ Potential issue | 🟠 MajorKeep
fwd_argssynchronized for ctype-based param mutators too.Line 7532 only syncs
self.fwd_argsforset_param_at_index(). Calls throughset_param_at_index_from_ctype()/set_params_from_ctypes()can still leavefwd_argsstale, so deterministic replay can post-process the wrong arrays.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@warp/_src/context.py` around lines 7532 - 7535, The fwd_args list is only updated in set_param_at_index, so calls that mutate params via ctype helpers leave fwd_args stale; modify set_param_at_index_from_ctype and set_params_from_ctypes to also synchronize self.fwd_args (either by delegating to set_param_at_index for each change or by applying the same index-check-and-assign logic used there), honoring the adjoint flag so adjoint updates do not overwrite fwd_args.
7541-7553:⚠️ Potential issue | 🟡 MinorPreserve the forward-hook null guard in deterministic launch routing.
These deterministic branches call
_launch_deterministic()without the regular forward-hook existence check. Ifhooks.forwardis missing, this can pass a null kernel handle to CUDA launch instead of raising the clearer Python-side error.Also applies to: 7825-7854
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@warp/_src/context.py` around lines 7541 - 7553, The deterministic launch branches call _launch_deterministic(...) without checking for hooks.forward, which can allow a null kernel handle to reach CUDA; add the same forward-hook null guard used in the non-deterministic path: verify hooks.forward (or equivalent forward kernel handle) is present before invoking _launch_deterministic, and if missing raise the same Python-side error/exception used elsewhere so we fail early and clearly; apply the same fix to the other deterministic branch around the second occurrence noted (near the 7825-7854 block).
7657-7670:⚠️ Potential issue | 🟠 MajorTwo-pass deterministic path still breaks explicit-stream ordering and counter continuity.
Line 7657/Line 7664 scans, Line 7670 counter writeback, and Line 7699 sort-reduce are not stream-bound in this path. On non-current streams/capture this can reorder work. Also, Line 7670 overwrites the counter with this-launch total instead of preserving and adding the pre-existing counter value.
Also applies to: 7699-7699
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@warp/_src/context.py` around lines 7657 - 7670, The two-pass deterministic path uses array_scan and warp.copy without binding them to the current stream/capture and overwrites the counter rather than adding to the pre-existing value; update the logic around warp._src.utils.array_scan(contrib, ...), the inclusive_out buffer, and the writeback that finds counter_arr via kernel.adj.args and ct.array_var_label so that (1) all scans, the inclusive_out usage, the warp.copy writeback, and the later sort-reduce are submitted/bound to the same current stream/capture to preserve ordering, and (2) when writing the total into counter_arr you first read the existing counter value from counter_arr and add the inclusive_out last element (the this-launch total) to it (or perform an atomic-add equivalent) instead of blindly overwriting; ensure the same stream-binding fix is applied to the sort-reduce at the other location (around the existing sort-reduce call).design/deterministic-execution.md (1)
129-130:⚠️ Potential issue | 🟡 MinorUse one scan convention throughout the design doc.
Pattern B says the prefix pass uses
inclusive=False, but the writeback section says the total comes from “the last element of the inclusive scan.” Those are different rules, so the doc currently describes two incompatible implementations.Also applies to: 191-193
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@design/deterministic-execution.md` around lines 129 - 130, The doc inconsistently specifies scan conventions: Pattern B states the prefix pass uses wp.utils.array_scan(contrib, prefix, inclusive=False) while the writeback section refers to “the last element of the inclusive scan”; pick one convention and update all references to match. Concretely, decide whether prefix scans are inclusive or exclusive, then change mentions in Pattern B, the writeback section, and other occurrences (e.g., lines ~191-193) so they all reference the same wp.utils.array_scan(..., inclusive=...) behavior and explain how the total is derived (either from the last element of the inclusive scan or from last+contrib for exclusive).warp/_src/deterministic.py (2)
283-293:⚠️ Potential issue | 🟠 MajorMatch the counter scratch buffers to the counter width.
CounterTarget.value_ctypeis tracked, butallocate_counter_buffers()always allocateswp.int32for both scratch arrays. That will truncate deterministicint64/uint64counters unless they are rejected earlier. Either allocate these buffers fromvalue_ctype, or fail fast for non-32-bit counters.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@warp/_src/deterministic.py` around lines 283 - 293, The counter scratch buffers are always allocated as warp.int32 in allocate_counter_buffers, which will truncate 64-bit counters; update allocate_counter_buffers to use each CounterTarget's value_ctype (CounterTarget.value_ctype) when creating contrib and prefix (i.e., pass dtype=_target.value_ctype to warp.zeros/warp.empty) or alternatively raise/assert if _target.value_ctype is not a 32-bit type so non-32-bit counters fail fast; ensure you reference the function allocate_counter_buffers and the CounterTarget.value_ctype field when making the change.
189-200:⚠️ Potential issue | 🟠 MajorDon't merge every logical counter in an array into one deterministic target.
get_or_create_counter_target()only keys onarray_var_label. If deterministic mode ever seeswp.atomic_add(counter, bucket, 1), every bucket in that array will reuse one contrib/prefix stream and Phase 1 will hand out slots from a single global sequence. Please include the logical counter index in the target identity, or reject non-constant-zero counter indices before creating aCounterTarget.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@warp/_src/deterministic.py` around lines 189 - 200, get_or_create_counter_target currently deduplicates targets solely by array_var_label which causes all array elements to share one deterministic counter; update get_or_create_counter_target to include the logical counter index in the identity (e.g., use a tuple of (array_var_label, logical_index) or add a logical_index field to CounterTarget) so each distinct bucket gets its own CounterTarget, and before creating a new CounterTarget validate the provided index expression is a constant zero (or reject/non-deterministic indices) if your design only allows index 0; ensure you update the lookup loop to compare the new key (array_var_label plus logical index) and append the new CounterTarget with the logical index populated when creating it.
🧹 Nitpick comments (2)
warp/_src/context.py (1)
1339-1341: Validatedeterministic_max_recordsearly for type/range.This value is currently accepted broadly and coerced later in launch. Adding an upfront
int+ non-negative validation in option ingestion would fail fast and avoid silent coercion paths.Also applies to: 2583-2585
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@warp/_src/context.py` around lines 1339 - 1341, Validate deterministic_max_records immediately before assigning into kernel_options: ensure the value is an int (or can be safely converted) and is >= 0, raising a TypeError for non-int types and a ValueError for negative values so callers fail fast; then assign kernel_options["deterministic_max_records"] = deterministic_max_records only after validation. Apply the same validation logic at the other ingestion site referenced around lines 2583-2585 to keep behavior consistent.warp/tests/test_unique_module.py (1)
170-205: Keep the hash assertion runnable on CPU-only jobs.Line 172 skips the whole test, but the
module.namecomparison at Lines 189-193 is device-independent. Splitting this into a CPU-safe hashing assertion plus a CUDA-only launch check would preserve coverage for the unique-module hashing change on non-CUDA runners.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@warp/tests/test_unique_module.py` around lines 170 - 205, The test currently skips the entire test when CUDA is unavailable, but the module name hashing assertion in test_kernel_options_affect_unique_module_identity (comparing _scatter_normal.module.name and _scatter_deterministic.module.name) is device-independent; change the test to perform the module.name comparison unconditionally, and only guard the CUDA-specific array creation and wp.launch calls with if not wp.is_cuda_available(): self.skipTest(...) or conditional blocks around the CUDA-only code (values/indices/out_* with device="cuda:0" and the wp.launch calls) so the hashing assertion still runs on CPU-only CI while the deterministic launch checks remain CUDA-only.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@asv/benchmarks/atomics.py`:
- Around line 242-243: The class attribute params (and param_names) in the
benchmark classes are defined as lists which creates mutable class state; change
them to tuples so they are immutable (e.g., replace params = (["normal",
"deterministic"], [1, 65536], DETERMINISTIC_BENCHMARK_SIZES) with a
tuple-of-tuples and param_names likewise) and apply the same change to the
second occurrence around lines 317-318; update the definitions referenced as
params and param_names in asv/benchmarks/atomics.py so both benchmark classes
use immutable tuples instead of lists.
In `@warp/_src/codegen.py`:
- Around line 1785-1788: The early-return that skips codegen for integer atomics
(the condition using is_float_type(scalar_dtype) and return_is_consumed) is
unsafe when a counting pass exists; modify that condition to also detect whether
the current kernel contains any consumed/counter atomic (i.e., only return None
when not is_float_type(scalar_dtype) AND not return_is_consumed AND there is no
consumed/counter atomic in the kernel). Use the existing kernel/AST context or
add a predicate (e.g., has_consumed_counter_atomic) to check for consumed
atomics before returning so phase 0 will not double-apply native integer atomics
in mixed kernels.
---
Duplicate comments:
In `@design/deterministic-execution.md`:
- Around line 129-130: The doc inconsistently specifies scan conventions:
Pattern B states the prefix pass uses wp.utils.array_scan(contrib, prefix,
inclusive=False) while the writeback section refers to “the last element of the
inclusive scan”; pick one convention and update all references to match.
Concretely, decide whether prefix scans are inclusive or exclusive, then change
mentions in Pattern B, the writeback section, and other occurrences (e.g., lines
~191-193) so they all reference the same wp.utils.array_scan(..., inclusive=...)
behavior and explain how the total is derived (either from the last element of
the inclusive scan or from last+contrib for exclusive).
In `@warp/_src/context.py`:
- Around line 7532-7535: The fwd_args list is only updated in
set_param_at_index, so calls that mutate params via ctype helpers leave fwd_args
stale; modify set_param_at_index_from_ctype and set_params_from_ctypes to also
synchronize self.fwd_args (either by delegating to set_param_at_index for each
change or by applying the same index-check-and-assign logic used there),
honoring the adjoint flag so adjoint updates do not overwrite fwd_args.
- Around line 7541-7553: The deterministic launch branches call
_launch_deterministic(...) without checking for hooks.forward, which can allow a
null kernel handle to reach CUDA; add the same forward-hook null guard used in
the non-deterministic path: verify hooks.forward (or equivalent forward kernel
handle) is present before invoking _launch_deterministic, and if missing raise
the same Python-side error/exception used elsewhere so we fail early and
clearly; apply the same fix to the other deterministic branch around the second
occurrence noted (near the 7825-7854 block).
- Around line 7657-7670: The two-pass deterministic path uses array_scan and
warp.copy without binding them to the current stream/capture and overwrites the
counter rather than adding to the pre-existing value; update the logic around
warp._src.utils.array_scan(contrib, ...), the inclusive_out buffer, and the
writeback that finds counter_arr via kernel.adj.args and ct.array_var_label so
that (1) all scans, the inclusive_out usage, the warp.copy writeback, and the
later sort-reduce are submitted/bound to the same current stream/capture to
preserve ordering, and (2) when writing the total into counter_arr you first
read the existing counter value from counter_arr and add the inclusive_out last
element (the this-launch total) to it (or perform an atomic-add equivalent)
instead of blindly overwriting; ensure the same stream-binding fix is applied to
the sort-reduce at the other location (around the existing sort-reduce call).
In `@warp/_src/deterministic.py`:
- Around line 283-293: The counter scratch buffers are always allocated as
warp.int32 in allocate_counter_buffers, which will truncate 64-bit counters;
update allocate_counter_buffers to use each CounterTarget's value_ctype
(CounterTarget.value_ctype) when creating contrib and prefix (i.e., pass
dtype=_target.value_ctype to warp.zeros/warp.empty) or alternatively
raise/assert if _target.value_ctype is not a 32-bit type so non-32-bit counters
fail fast; ensure you reference the function allocate_counter_buffers and the
CounterTarget.value_ctype field when making the change.
- Around line 189-200: get_or_create_counter_target currently deduplicates
targets solely by array_var_label which causes all array elements to share one
deterministic counter; update get_or_create_counter_target to include the
logical counter index in the identity (e.g., use a tuple of (array_var_label,
logical_index) or add a logical_index field to CounterTarget) so each distinct
bucket gets its own CounterTarget, and before creating a new CounterTarget
validate the provided index expression is a constant zero (or
reject/non-deterministic indices) if your design only allows index 0; ensure you
update the lookup loop to compare the new key (array_var_label plus logical
index) and append the new CounterTarget with the logical index populated when
creating it.
---
Nitpick comments:
In `@warp/_src/context.py`:
- Around line 1339-1341: Validate deterministic_max_records immediately before
assigning into kernel_options: ensure the value is an int (or can be safely
converted) and is >= 0, raising a TypeError for non-int types and a ValueError
for negative values so callers fail fast; then assign
kernel_options["deterministic_max_records"] = deterministic_max_records only
after validation. Apply the same validation logic at the other ingestion site
referenced around lines 2583-2585 to keep behavior consistent.
In `@warp/tests/test_unique_module.py`:
- Around line 170-205: The test currently skips the entire test when CUDA is
unavailable, but the module name hashing assertion in
test_kernel_options_affect_unique_module_identity (comparing
_scatter_normal.module.name and _scatter_deterministic.module.name) is
device-independent; change the test to perform the module.name comparison
unconditionally, and only guard the CUDA-specific array creation and wp.launch
calls with if not wp.is_cuda_available(): self.skipTest(...) or conditional
blocks around the CUDA-only code (values/indices/out_* with device="cuda:0" and
the wp.launch calls) so the hashing assertion still runs on CPU-only CI while
the deterministic launch checks remain CUDA-only.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 493e06fd-9309-43dc-ad53-7cc5a97aa0b8
📒 Files selected for processing (11)
asv/benchmarks/atomics.pydesign/deterministic-execution.mdwarp/_src/codegen.pywarp/_src/context.pywarp/_src/deterministic.pywarp/config.pywarp/native/deterministic.cppwarp/native/deterministic.cuwarp/native/warp.hwarp/tests/test_deterministic.pywarp/tests/test_unique_module.py
🚧 Files skipped from review as they are similar to previous changes (3)
- warp/native/warp.h
- warp/native/deterministic.cpp
- warp/config.py
|
This determinism mode doesn't yet support atomics inside Warp functions, which is used in MJWarp, for example in smooth.py:
Minimal repro: import warp as wp
wp.init()
a = wp.zeros(1, dtype=wp.float32, device="cuda")
@wp.func
def add_one(x: wp.array(dtype=wp.float32)):
wp.atomic_add(x, 0, 1.0)
@wp.kernel
def k(x: wp.array(dtype=wp.float32)):
add_one(x)
wp.launch(k, dim=1, inputs=[a], device="cuda")
wp.synchronize_device("cuda")
print(a.numpy())With this PR's deterministic path enabled, The generated helper function body gets rewritten to use deterministic scatter / debug plumbing, but its signature does not receive the hidden deterministic params that are appended for kernel entry points ( |
Signed-off-by: Miles Macklin <mmacklin@nvidia.com>
Signed-off-by: Miles Macklin <mmacklin@nvidia.com>
…nostic
scripts/check_tdmpc2_determinism.py — three subchecks (--check init |
update | env). Each runs in <30s; run twice in separate processes and
diff the printed SHA-256 hashes to localize where non-determinism enters.
.context/lessons/determinism.md — empirical results + root cause +
upstream fix-in-flight tracking. TL;DR:
- JAX/XLA algo: bit-ID across processes IFF
XLA_FLAGS=--xla_gpu_deterministic_ops=true
- mujoco_warp env: NOT bit-ID across processes; XLA flag has no effect
on Warp CUDA kernels. wp.set_device("cpu") is the only workaround
today. Fix in NVIDIA/warp#1355 (Warp 1.14, ~Jun 2026) +
mujoco_warp#1281, #1300.
- Therefore end-to-end training trajectories cannot be byte-identical
across two GPU processes. Same-process re-runs appear deterministic.
Rationale for adding now: the TD-MPC2 eval-key isolation fix (b5a8c70)
couldn't be empirically verified by diffing two-process trajectories
because env non-determinism dominates; needed the within-process
key-inspection workaround. Future repeatability questions for any algo
in this repo should consult this lesson.
LESSONS.md updated with index entry.
Keep ordinary CUDA launches on the pre-existing direct native call path so launch microbenchmarks avoid deterministic helper overhead. Deterministic multi-pass launches continue to use the shared helper. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Add a hands-on deep dive for deterministic atomic patterns, configuration, CUDA graph caveats, and the CUB/CCCL building blocks behind the implementation. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Document that consumed-return deterministic counters return a base slot and reserve a contiguous range when the atomic add value is greater than one. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Fail closed for deterministic consumed-return counters during CUDA graph capture because the current scan path allocates CUB temporary storage inside the captured call. Also reject oversized counter launches before allocating int32 prefix buffers and update the docs/tests to match. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Start the Deep Dive with module-level determinism for existing programs, and clarify when kernel-level deterministic options and unique modules are appropriate. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Describe True and False deterministic modes as ease-of-use aliases rather than backward compatibility behavior for a new feature. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Avoid CUDA graph capture for the deterministic consumed-return counter ASV benchmark, annotate boolean deterministic aliases, and document the scatter buffer capacity floor. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Expand the mode selection guidance, simplify the limitations heading, improve the NVIDIA libraries wording, and add small examples for the most concrete limitations. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Match the internal deterministic-mode docstring to the public docs by describing boolean aliases as ease-of-use support. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Explain manual deterministic lowering patterns, add a mixed-reduction workaround, and include a small measured performance table for deterministic atomic modes. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Call out that highly contended floating-point atomics can speed up under deterministic run-to-run reductions, while sparse atomics usually pay sorting overhead. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Move performance discussion into its own section with benchmark snippets and separate measurements for deterministic accumulation and slot allocation patterns. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Split the performance discussion into accumulation and slot-allocation subsections and state that both measurements use CUDA event timing. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Publish consumed-return counter totals from the exclusive prefix and final contribution on device. This avoids the redundant full inclusive scan and temporary output buffer while keeping the write stream-ordered. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Keep block-dim-dependent shared tile type and initializer emission consistent by using WP_TILE_BLOCK_DIM for default shared strides, not just shape dimensions. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Broaden consumed-return atomic_add lowering to use sorted counter records so dynamic indices, sliced counter views, graph capture, and custom adjoint call paths work in deterministic mode. Keep unsupported consumed-return non-add atomics fail-closed and document the remaining limits. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Custom gradient codegen can run with deterministic support disabled, which leaves deterministic metadata unset. Guard those lookups so custom adjoints use ordinary reverse calls in not_guaranteed mode while preserving metadata propagation when deterministic mode is enabled. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Clean up user-facing determinism wording and add immediate CUDA launch error checks for deterministic native postpass kernels. Widen launch block arithmetic to avoid int overflow near the supported record limit. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
The deterministic counter scan now publishes final counter totals directly, so the old counter-total native entry point and ctypes binding are dead code. Remove the stale export to keep the native API surface tidy. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Custom adjoint functions compile in deterministic mode, but atomics through wp.adjoint[...] are not automatically rewritten. Document that boundary so the public guide and design limitations match the implementation. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Allow the ASV counter benchmark to exercise deterministic graph capture and guard deterministic counter record ordinal computation with widened arithmetic. Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
ad273f5 to
d0284f4
Compare
Description
Add deterministic execution mode for supported atomic patterns via
wp.config.deterministic, with module-level and unique-kernel overrides through the existing module options system.When enabled, floating-point accumulation atomics (
atomic_add,atomic_sub,atomic_min,atomic_max) are transparently redirected through a deterministic scatter-sort-reduce path, and counter / allocator patterns that consume the atomic return value use an automatic two-pass count-scan-execute path. This provides bit-exact reproducible CUDA results across runs without requiring users to manually rewrite kernels.This PR also adds targeted fixes and coverage for deterministic launch edge cases:
wp.launch(..., record_cmd=True)support for deterministic kernelsChecklist
Test plan
Verified with:
The deterministic test module covers:
+=lowering to deterministic atomicsrecord_cmd=Truesupport for deterministic launchesNew feature / enhancement
Summary by CodeRabbit
New Features
Documentation
Tests
Chores
Closes #1443