Skip to content

Warp determinism#1355

Draft
mmacklin wants to merge 72 commits into
NVIDIA:mainfrom
mmacklin:warp-deterministic
Draft

Warp determinism#1355
mmacklin wants to merge 72 commits into
NVIDIA:mainfrom
mmacklin:warp-deterministic

Conversation

@mmacklin
Copy link
Copy Markdown
Collaborator

@mmacklin mmacklin commented Apr 10, 2026

Description

Add deterministic execution mode for supported atomic patterns via wp.config.deterministic, with module-level and unique-kernel overrides through the existing module options system.

When enabled, floating-point accumulation atomics (atomic_add, atomic_sub, atomic_min, atomic_max) are transparently redirected through a deterministic scatter-sort-reduce path, and counter / allocator patterns that consume the atomic return value use an automatic two-pass count-scan-execute path. This provides bit-exact reproducible CUDA results across runs without requiring users to manually rewrite kernels.

This PR also adds targeted fixes and coverage for deterministic launch edge cases:

  • suppress side effects during the Phase 0 counting pass
  • separate scatter buffers by target and reduction op
  • improve scatter buffer capacity accounting
  • preserve wp.launch(..., record_cmd=True) support for deterministic kernels

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Test plan

Verified with:

uv run warp/tests/test_deterministic.py
uvx pre-commit run --files warp/_src/codegen.py warp/_src/context.py warp/_src/deterministic.py warp/tests/test_deterministic.py
uvx pre-commit run --files design/deterministic-execution.md warp/_src/context.py

The deterministic test module covers:

  • reproducibility for float accumulation atomics
  • += lowering to deterministic atomics
  • float64 support
  • multi-array and 2D indexing cases
  • counter / allocator reproducibility and correctness
  • mixed counter + accumulation kernels
  • per-kernel module option override
  • phase-0 side-effect suppression
  • mixed reduce ops on the same array
  • record_cmd=True support for deterministic launches

New feature / enhancement

import numpy as np
import warp as wp

wp.init()
wp.config.deterministic = True

@wp.kernel
def scatter_add(values: wp.array(dtype=wp.float32),
                indices: wp.array(dtype=wp.int32),
                out: wp.array(dtype=wp.float32)):
    tid = wp.tid()
    wp.atomic_add(out, indices[tid], values[tid])

n = 1024
values_np = np.ones(n, dtype=np.float32)
indices_np = np.arange(n, dtype=np.int32) % 16

values = wp.array(values_np, dtype=wp.float32, device="cuda")
indices = wp.array(indices_np, dtype=wp.int32, device="cuda")

results = []
for _ in range(5):
    out = wp.zeros(16, dtype=wp.float32, device="cuda")
    wp.launch(scatter_add, dim=n, inputs=[values, indices], outputs=[out], device="cuda")
    results.append(out.numpy())

# Bit-exact reproducibility across runs
for i in range(1, len(results)):
    np.testing.assert_array_equal(results[0], results[i])

Summary by CodeRabbit

  • New Features

    • Added a deterministic execution mode for atomics with global/module/kernel toggles, configurable record limits and debug diagnostics; supports ordered scatter reductions and two-pass counter/allocator semantics with GPU implementation and CPU fallbacks.
  • Documentation

    • Added a detailed design specification and changelog entry describing modes, supported patterns, limitations, and configuration semantics.
  • Tests

    • Added extensive deterministic tests covering reproducibility, correctness, overrides, capacity/overflow, and capture/replay.
  • Chores

    • Build system extended to include native deterministic sources for CPU/GPU.

Closes #1443

OpenClaw Bot and others added 3 commits April 10, 2026 03:03
Introduce wp.config.deterministic flag that makes floating-point atomic
operations produce bit-exact reproducible results across runs. Two atomic
usage patterns are handled transparently:

Pattern A (accumulation): atomic_add/sub with unused return values are
redirected to scatter buffers during kernel execution, then sorted by
(dest_index, thread_id) and reduced in fixed order post-kernel via CUB
radix sort and a custom segmented reduce kernel.

Pattern B (counter/allocator): atomic_add with consumed return values
(slot = atomic_add(counter, 0, 1)) use automatic two-pass execution:
Phase 0 records per-thread contributions with side effects suppressed,
prefix sum computes deterministic offsets, Phase 1 re-executes with
deterministic slot assignments.

Both patterns can coexist in a single kernel. Integer atomics with
unused return values are left unchanged (already deterministic). CPU
execution is unaffected (already sequential). Configurable at global,
module, and kernel levels.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds an opt-in deterministic execution mode for supported atomic operations: new global/module/kernel config flags, codegen interception to emit scatter or two‑phase counter patterns, deterministic metadata and buffer management, CUDA device/CPU stubs for deterministic sort‑reduce, build updates, and extensive tests and benchmarks.

Changes

Cohort / File(s) Summary
Configuration & Design
CHANGELOG.md, design/deterministic-execution.md, warp/config.py
Introduces wp.config.deterministic and deterministic_debug, adds changelog entry and a design doc describing determinism modes, supported atomic patterns, limits, and debug behavior.
Codegen & Deterministic Metadata
warp/_src/codegen.py, warp/_src/deterministic.py
Adds interceptable atomic allowlist and order-dependent classification, DeterministicMeta and target dataclasses, codegen paths for Pattern A (scatter sort‑reduce) and Pattern B (two‑pass counter), assign-time return-use tracking, store suppression in phase 0, and helpers for target deduplication and buffer sizing.
Context, Kernel Options & Launch Runtime
warp/_src/context.py
Adds kernel decorator params deterministic and deterministic_max_records, per-kernel/module resolution and hashing to populate det_meta, DeterministicLaunch and _launch_deterministic to allocate buffers, orchestrate phase runs, update counters, and invoke post‑kernel sort/reduce; augments generated kernel signatures with hidden deterministic parameters and includes new ctypes binding.
Native Device Support & Headers
warp/native/deterministic.cu, warp/native/deterministic.h, warp/native/deterministic.cpp, warp/native/warp.h
Adds CUDA implementation for deterministic scatter/sort + segmented reduction with scalar-type dispatch, device scatter helper, header API (wp::deterministic::scatter), public entry wp_deterministic_sort_reduce_device, and CPU stub for non-CUDA builds.
Build System
build_lib.py
Adds native/deterministic.cpp to CPU build units and native/deterministic.cu to CUDA build units so deterministic native code is compiled into libraries.
Tests & Suite Integration
warp/tests/test_deterministic.py, warp/tests/unittest_suites.py, warp/tests/test_unique_module.py
Adds comprehensive deterministic tests covering scatter, counters, mixed ops, float64, capacity/overflow, per-kernel/module overrides, CUDA graph/recording/graph capture, and module-hashing tests; integrates TestDeterministic into the default suite.
Benchmarks
asv/benchmarks/atomics.py
Adds deterministic benchmark variants and sizes, new deterministic kernels, captures CUDA graphs for repeated timed replays to measure determinism overhead.

Sequence Diagram(s)

sequenceDiagram
    participant User as User Kernel
    participant Python as Python Runtime
    participant CUDA as CUDA Device
    participant ScatterBuf as Scatter Buffers
    participant SortReduce as Sort-Reduce Kernel
    participant Dest as Destination Array

    User->>CUDA: Launch kernel (scatter pattern)
    CUDA->>ScatterBuf: wp::deterministic::scatter (pack key+value, inc counter)
    CUDA->>Python: Kernel returns
    Python->>ScatterBuf: Read record count
    Python->>SortReduce: Call device sort-reduce entry
    SortReduce->>ScatterBuf: Radix sort keys/values
    SortReduce->>Dest: Deterministic segment-wise reduce into dest
    SortReduce->>Python: Complete
    Python->>User: Return results
Loading
sequenceDiagram
    participant User as User Kernel
    participant Python as Python Runtime
    participant Phase0 as Phase 0 (Count)
    participant Scan as Prefix Scan
    participant Phase1 as Phase 1 (Execute)
    participant Counter as Counter Array

    User->>Python: Launch deterministic kernel (counter pattern)
    Python->>Phase0: Run kernel with _wp_det_phase=0 (suppress side effects)
    Phase0->>Python: Accumulated per-counter counts
    Python->>Scan: Compute deterministic prefix sums (offsets)
    Scan->>Python: Offsets ready
    Python->>Phase1: Run kernel with _wp_det_phase=1 (use offsets)
    Phase1->>Counter: Writeback using deterministic offsets
    Phase1->>Python: Complete
    Python->>User: Return results
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 58.02% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Warp determinism' is vague and generic. While it references determinism (a real part of the changeset), it is too broad and non-descriptive to clearly convey the primary change to someone scanning pull request history. Consider a more descriptive title that clarifies the main feature, such as 'Add deterministic execution mode for atomic operations' or 'Implement configurable deterministic atomics with scatter-sort-reduce'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Apr 10, 2026

Greptile Summary

This PR adds a deterministic execution mode (wp.config.deterministic) for CUDA atomic operations, routing floating-point accumulation atomics through a CUB-based scatter-sort-reduce pass and consumed-return counter atomics through a two-pass count-scan-execute scheme. Module- and kernel-level option overrides, CUDA graph capture support, backward-kernel inert locals, and overflow detection are included.

  • Scatter path (atomic_add/atomic_sub/atomic_min/atomic_max with discarded return): kernel writes (dest_idx, thread_id, value) records to a temporary buffer; a post-kernel CUB radix sort + segmented reduce applies them in a fixed order.
  • Counter path (slot = wp.atomic_add(counter, 0, 1)): Phase 0 counts reservations (all side effects suppressed), a prefix scan assigns deterministic slot indices, Phase 1 replays with those indices; Graph._deterministic_buffer_refs keeps scratch buffers alive for graph replay.
  • Backward / helper propagation: reverse kernels use inert det_ctx locals (use_launch_buffers=False); @wp.func helpers with array stores propagate det_ctx via _det_has_side_effect_store.

Confidence Score: 3/5

Safe to merge for the common usage patterns; one code path silently double-accumulates integer atomics in helper functions called from counter kernels.

The counter two-pass mode guards integer atomics directly in the kernel body, but the analogous guard is missing when those atomics live in a @wp.func helper called by the counter kernel. The helper's _det_has_consumed_atomic is computed from the helper's own AST and is False for helpers with only integer atomics, so _emit_deterministic_atomic returns None with no phase guard. The helper fires in Phase 0 and Phase 1, doubling the accumulated result. The array-store analog was fixed via _det_has_side_effect_store; the same mechanism is not yet applied to integer atomics in user functions.

warp/_src/codegen.py — the _emit_deterministic_atomic integer-atomic fall-through does not guard helper-function atomics when called from a two-pass counter kernel

Important Files Changed

Filename Overview
warp/_src/codegen.py Adds 806 lines implementing deterministic mode codegen: atomic interception, scatter/counter emission, backward-kernel inert locals, and helper-function det_ctx propagation. Integer atomics in helper functions called from two-pass counter kernels are not phase-guarded.
warp/_src/context.py Adds _launch_deterministic, DeterministicLaunch, Graph._deterministic_buffer_refs for buffer lifetime, and two-pass orchestration. Overflow detection is thorough; stream/capture handling looks correct.
warp/_src/deterministic.py New module implementing scatter-sort-reduce helpers, scatter/counter buffer allocation, and sort-reduce/scan dispatch. Float16 now correctly dispatches to wp::half; capacity guards look correct.
warp/native/deterministic.cu CUB-based sort-reduce and counter-scan GPU implementation. Generic workspace is used for vector types and gpu_to_gpu mode; scalar run_to_run uses ReduceByKey.
warp/native/deterministic.h Header-only device/host macro definitions for WP_DET_SCATTER_OR_FALLBACK, WP_DET_COUNTER_OR_FALLBACK, WP_DET_STORE_IF_ACTIVE, WP_DET_SIDE_EFFECT_IF_ACTIVE with correct CPU fallbacks.
warp/tests/test_deterministic.py Extensive 2897-line test module; main block uses unittest.main() only. Missing coverage for integer atomics in @wp.func helpers called from counter kernels.
warp/config.py Adds deterministic and deterministic_debug global config options with accurate docstrings.

Sequence Diagram

sequenceDiagram
    participant User
    participant launch as wp.launch()
    participant LD as _launch_deterministic()
    participant GPU as CUDA Kernel
    participant SR as run_sort_reduce / run_counter_scan

    User->>launch: wp.launch(kernel, dim, ...)
    launch->>launch: detect det_meta (scatter/counter targets)

    alt Scatter-only kernel
        launch->>LD: _launch_deterministic(...)
        LD->>LD: allocate scatter_bufs + overflow_buf
        LD->>GPU: Phase 1 - scatter (key,val) records
        LD->>SR: CUB sort + segmented reduce into dest_array
        LD->>LD: check overflow_buf
    else Counter (two-pass) kernel
        launch->>LD: _launch_deterministic(...)
        LD->>LD: allocate counter_bufs + scatter_bufs + overflow_buf
        LD->>GPU: Phase 0 - count reservations (side effects suppressed)
        LD->>LD: read count[], check overflow
        LD->>SR: CUB sort + inclusive prefix scan - per-record slot offsets
        LD->>LD: zero cursors[]
        LD->>GPU: Phase 1 - execute with deterministic slots + scatter
        LD->>SR: sort-reduce scatter records (if any)
        LD->>LD: check overflow_buf
    end

    LD->>User: return (bit-exact results)
Loading

Comments Outside Diff (1)

  1. warp/_src/codegen.py, line 460-465 (link)

    P1 Integer atomics in @wp.func helpers double-execute in two-pass counter kernels

    When a @wp.func helper contains an integer atomic_add / atomic_sub with the return value discarded, and the calling kernel is a two-pass counter kernel, the helper's integer atomic fires unconditionally in Phase 0 (the counting pass). Because Phase 0 runs the kernel body a second time before Phase 1, the integer accumulation doubles.

    The direct-kernel fix (WP_DET_SIDE_EFFECT_IF_ACTIVE when adj._det_has_consumed_atomic) does not extend to helpers because: (1) the helper's _det_has_consumed_atomic is computed from the helper's own AST—False for helpers that only have integer atomics—so the branch is not taken; (2) _emit_deterministic_atomic returns None, falling through to the normal wp::atomic_add(...) without any phase guard; and (3) needs_deterministic is False for this helper so det_ctx is never passed to it.

    Concrete failure: a kernel with both a counter pattern (slot = wp.atomic_add(counter, 0, 1)) and a call to a helper that does wp.atomic_add(hist, bucket, 1) (return unused) will write twice the correct value into hist in every two-pass deterministic launch.

    The existing test_counter_with_integer_accumulation covers the case where both atomics are directly in the kernel body, but not this helper-function variant. The analogous array-store fix (_det_has_side_effect_store) introduced a similar per-helper flag and forced det_ctx propagation; the same mechanism is needed for integer atomics in user functions.

Reviews (17): Last reviewed commit: "Align APIC capture guard with main" | Re-trigger Greptile

Comment thread warp/_src/deterministic.py Outdated
Comment thread warp/tests/test_deterministic.py Outdated
Comment thread warp/_src/context.py Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🧹 Nitpick comments (1)
warp/native/deterministic.h (1)

41-49: Clarify the non-CUDA branch comment.

Line 41 says “direct accumulation,” but Lines 42-48 are a no-op. Rewording this would prevent confusion during maintenance.

✏️ Suggested comment tweak
-    // CPU path: direct accumulation (CPU kernels are sequential).
+    // Non-CUDA path: no-op for this helper (CPU execution does not use scatter buffers).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/native/deterministic.h` around lines 41 - 49, The comment "CPU path:
direct accumulation (CPU kernels are sequential)" is misleading because the
non-CUDA branch simply voids a set of unused variables (keys, values, counter,
capacity, dest_flat_idx, thread_id, value); change the comment near that void
list in deterministic.h to clearly state these variables are intentionally
unused in the non-CUDA build (e.g., "Non-CUDA build: no per-thread accumulation
— explicitly mark these kernel-specific variables as unused") so maintainers
understand why the (void)XXX lines are present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CHANGELOG.md`:
- Around line 7-11: Update the CHANGELOG entry (mentioning
wp.config.deterministic) to be API-level and include a GitHub issue/PR
reference: remove internal implementation details like "scatter-sort-reduce" and
"two-pass execution with prefix-sum-based slot assignment", instead describe the
user-visible change (e.g., "Added a deterministic execution mode for atomic
operations via wp.config.deterministic = True that makes atomic accumulations
reproducible across runs"), append a short note about scope
(global/module/kernel) and add the GH issue/PR number (e.g., "See `#1234`") and
affected version/release tag.

In `@design/deterministic-execution.md`:
- Around line 108-109: The doc currently mixes exclusive and inclusive scan
conventions: update the design to use a single convention for
wp.utils.array_scan(writeback rule) — either always document it as exclusive
(inclusive=False) and state that the total is computed as prefix[last] +
contrib[last], or document it as inclusive (inclusive=True) and state that the
total is prefix[last]; make the descriptions at the wp.utils.array_scan example
near "Prefix sum" (previously showing inclusive=False) and the section that
references "total comes from the last element" (lines ~159-160) consistent by
choosing one convention and adjusting the text to reference
wp.utils.array_scan(contrib, prefix, inclusive=<True/False>) and the
corresponding rule for computing the thread-local total.

In `@warp/_src/codegen.py`:
- Around line 42-44: The frozenset _DET_ORDER_DEPENDENT_ATOMICS (containing
"atomic_cas" and "atomic_exch") is defined but unused; either remove it or
implement the promised "warn but don't intercept" behavior: locate the code path
that dispatches/handles atomic ops (e.g., the function that processes atomic
intrinsics or emits atomics such as the atomic handling/emit function in
codegen.py), and when an atomic op name is in _DET_ORDER_DEPENDENT_ATOMICS, emit
a clear warning (use warnings.warn or the module logger) stating that
order-dependent atomics are not intercepted and will run with native ordering,
then continue normal processing without interception; if you prefer removal,
delete the _DET_ORDER_DEPENDENT_ATOMICS constant and any related comment and add
a brief unit test or code comment documenting the choice.
- Around line 1881-1882: The fallback that sets flat_idx_expr = "0" when ndim >
4 is unsafe because it silently writes everything to index 0; replace this
silent fallback with an explicit error: detect the unsupported ndim case in the
same code path in codegen.py (the branch that currently assigns flat_idx_expr)
and raise a clear exception (e.g., ValueError or RuntimeError) that includes the
invalid ndim value and a message stating Warp arrays support up to 4 dimensions;
do not assign "0" as a default index.

In `@warp/_src/context.py`:
- Around line 7477-7480: set_param_at_index_from_ctype() and
set_params_from_ctypes() mutate self.params but do not keep self.fwd_args in
sync, so _launch_deterministic() can replay using stale array objects; update
those methods to mirror the logic in set_param_at_index(): when adjoint is False
and the target index is within range(len(self.fwd_args)), assign the new value
into self.fwd_args[index] (or for bulk updates, update the corresponding
slice/indices) — or refactor those methods to call set_param_at_index(index,
value, adjoint) for each changed param so fwd_args stays consistent with params
for deterministic replay.
- Around line 7589-7602: The two-pass path must preserve the caller stream and
keep the counter's initial value: ensure the intermediate scans
(warp._src.utils.array_scan calls), the warp.copy that writes the total back
into the counter array, and the subsequent run_sort_reduce invocation are
executed on the same stream passed into this path (propagate the local stream
object into those calls or use stream-aware variants) and when writing the
counter combine the existing counter value with this-launch total (read the
current counter_arr[0], add inclusive_out[dim_size-1] and write the sum back)
instead of overwriting; update references around array_scan, inclusive_out,
counter_arr (found via kernel.adj.args and fwd_args), warp.copy and
run_sort_reduce so they all use the caller stream and perform an atomic/ordered
add of the previous counter value plus the new total.

In `@warp/_src/deterministic.py`:
- Around line 140-161: The mapping and helpers currently claim support for
warp.float16 but the native reducer entrypoint
wp_deterministic_sort_reduce_float_device (and the C++ reinterpret-casts in
native/deterministic.cu) do not handle 16-bit halves; update _WARP_TO_CTYPE and
the conversion helpers to fail fast for half: remove or change the "wp::half"
mapping so warp_type_to_ctype raises for warp.float16 (or explicitly check and
raise in is_float_type/warp_type_to_ctype), and add a clear ValueError
mentioning warp.float16 and wp_deterministic_sort_reduce_float_device so any
attempt to use half reductions immediately errors until a real half reducer is
implemented.

In `@warp/tests/test_deterministic.py`:
- Around line 789-790: Remove the call to wp.clear_kernel_cache() from this test
module (including the __main__ block) because wp.clear_kernel_cache() is
disallowed in test files; simply delete the line invoking
wp.clear_kernel_cache() so no cache-clearing is performed here.

In `@warp/tests/unittest_suites.py`:
- Line 142: The default_suite() function imports TestDeterministic but never
includes it in the test_classes list, so deterministic tests are skipped; update
the test_classes array/variable inside default_suite() to include
TestDeterministic (alongside the other classes), ensuring the symbol
TestDeterministic is added to the list used to build the suite returned by
default_suite().

---

Nitpick comments:
In `@warp/native/deterministic.h`:
- Around line 41-49: The comment "CPU path: direct accumulation (CPU kernels are
sequential)" is misleading because the non-CUDA branch simply voids a set of
unused variables (keys, values, counter, capacity, dest_flat_idx, thread_id,
value); change the comment near that void list in deterministic.h to clearly
state these variables are intentionally unused in the non-CUDA build (e.g.,
"Non-CUDA build: no per-thread accumulation — explicitly mark these
kernel-specific variables as unused") so maintainers understand why the
(void)XXX lines are present.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 2e7bc7a6-3bed-4194-bd96-d94feaed2cf7

📥 Commits

Reviewing files that changed from the base of the PR and between 53a7bf5 and 5ec9b25.

📒 Files selected for processing (13)
  • CHANGELOG.md
  • build_lib.py
  • design/deterministic-execution.md
  • warp/_src/codegen.py
  • warp/_src/context.py
  • warp/_src/deterministic.py
  • warp/config.py
  • warp/native/deterministic.cpp
  • warp/native/deterministic.cu
  • warp/native/deterministic.h
  • warp/native/warp.h
  • warp/tests/test_deterministic.py
  • warp/tests/unittest_suites.py

Comment thread CHANGELOG.md Outdated
Comment thread design/deterministic-execution.md Outdated
Comment thread warp/_src/codegen.py Outdated
Comment thread warp/_src/codegen.py Outdated
Comment thread warp/_src/context.py Outdated
Comment thread warp/_src/context.py Outdated
Comment thread warp/_src/deterministic.py
Comment thread warp/tests/test_deterministic.py Outdated
Comment thread warp/tests/unittest_suites.py Outdated
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (3)
warp/_src/deterministic.py (1)

140-143: ⚠️ Potential issue | 🔴 Critical

Fail fast for warp.float16 deterministic reductions.

warp.float16 is still wired through the deterministic scatter/reduce path, but the native entry points only reinterpret buffers as float* or double*. Half buffers will therefore be read with the wrong element size and produce corrupted reductions. Please reject this path until a real half reducer exists.

Minimal safe guard
-        if target.value_ctype in ("float", "wp::half"):
+        if target.value_ctype == "float":
             fn = runtime.core.wp_deterministic_sort_reduce_float_device
+        elif target.value_ctype == "wp::half":
+            raise RuntimeError("Deterministic float16 atomics are not supported yet.")
         elif target.value_ctype == "double":
             fn = runtime.core.wp_deterministic_sort_reduce_double_device

Also applies to: 159-161, 177-180, 219-223

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/deterministic.py` around lines 140 - 143, The mapping
_WARP_TO_CTYPE currently includes warp.float16 but the deterministic
scatter/reduce path does not support half-precision and will read buffers with
wrong element size; update deterministic.py to fail fast whenever warp.float16
is encountered in the deterministic reduction code paths by removing or guarding
the warp.float16 entry in _WARP_TO_CTYPE and adding an explicit check that
raises a clear exception (or returns an error) when a reducer or conversion
function (the code paths that reference _WARP_TO_CTYPE) sees warp.float16;
ensure the exception message names warp.float16 and the deterministic reduction
path so callers get a clear rejection until a proper half reducer is
implemented.
warp/tests/test_deterministic.py (1)

955-956: ⚠️ Potential issue | 🟠 Major

Remove the kernel-cache clear from this test module.

wp.clear_kernel_cache() is disallowed in tests and can crash parallel CI runs.

Minimal fix
 if __name__ == "__main__":
-    wp.clear_kernel_cache()
     unittest.main(verbosity=2)

As per coding guidelines, "Never call wp.clear_kernel_cache() or wp.clear_lto_cache() in test files—not in __main__ blocks, test methods, or module scope. Cache clearing is not multi-process-safe; concurrent clears cause LLVM crashes."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_deterministic.py` around lines 955 - 956, Remove the
disallowed kernel-cache clear from the module-level __main__ block: delete the
call to wp.clear_kernel_cache() found in the if __name__ == "__main__": section
of warp/tests/test_deterministic.py; do not replace it with any cache-clearing
call (wp.clear_lto_cache or similar) and ensure no other module-scope or
__main__-scoped cache-clear calls remain.
design/deterministic-execution.md (1)

111-112: ⚠️ Potential issue | 🟡 Minor

Use one scan convention throughout the doc.

Lines 111-112 describe an exclusive scan, but Lines 170-172 still say the total comes from the last element of an inclusive scan. Please make the writeback rule consistent with the convention the implementation actually uses.

Also applies to: 170-172

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/deterministic-execution.md` around lines 111 - 112, The doc is
inconsistent about scan convention: wp.utils.array_scan(contrib, prefix,
inclusive=False) is described as exclusive at lines 111-112 but later (lines
170-172) claims the total comes from the last element of an inclusive scan; pick
one convention and make the writeback rule consistent with the actual
implementation. Update the description of wp.utils.array_scan, the
example/notation for "prefix" and the writeback rule (the statement about where
the total/last-offset is read) so they all use the same convention (either
inclusive or exclusive) and mention the inclusive flag (inclusive=False/True) in
the writeback explanation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@warp/_src/codegen.py`:
- Around line 1647-1650: Pattern B (two-pass deterministic atomic interception)
only runs when _det_in_assign is set during emit_Assign(), so atomic calls
nested in subscripts, call arguments, or larger expressions (e.g.,
output[wp.atomic_add(...)]) never use adj._emit_deterministic_atomic and remain
nondeterministic; update the compiler to propagate the
"deterministic-assignment" context beyond plain Assign RHS evaluation by
checking _det_in_assign (or equivalent flag) within expression emitters that can
produce targets/indices/args — specifically modify emit_Subscript, emit_Call
(and other expression emitters referenced around lines 3336-3345) to consult
adj._emit_deterministic_atomic for funcs in _DET_INTERCEPTABLE_ATOMICS (using
func.is_builtin() and func.key) and call
adj._emit_deterministic_atomic(bound_args, return_type, output, output_list)
when the flag is active so atomics inside subscripts/args follow the two-pass
deterministic path.
- Around line 1834-1849: The Pattern B counter/allocator block that runs when
return_is_consumed currently assumes an add-style atomic and unconditionally
emits contrib += value and prefix returns; restrict this path to only add-style
atomics: inside the branch guarded by return_is_consumed (the block that calls
get_or_create_counter_target and writes _wp_det_contrib and _wp_det_prefix via
adj.add_forward), inspect the counter target's atomic/op kind (from the target
returned by get_or_create_counter_target — e.g., target.op, target.atomic_op, or
whatever property names the CounterTarget uses) and only emit Pattern B when
that property indicates an atomic_add; otherwise fail fast (raise a clear
NotImplementedError or emit an error) explaining that consumed-return semantics
for atomic_sub/atomic_max are not implemented. Ensure the check is placed before
emitting the adj.add_forward code so non-add atomics do not get incorrect
rewrites.

In `@warp/_src/deterministic.py`:
- Around line 191-200: The scratch buffers in allocate_counter_buffers always
use warp.int32 which truncates non-32-bit counters; update
allocate_counter_buffers to inspect each CounterTarget.value_ctype and allocate
contrib/prefix with the matching dtype (e.g., warp.int64 or warp.uint64) so the
generated ABI names (_wp_det_contrib/_wp_det_prefix) match the counter width, or
if supporting only 32-bit is preferred, raise an explicit error when a
CounterTarget has a non-32-bit value_ctype; modify allocate_counter_buffers to
perform this dtype-selection/check using the CounterTarget.value_ctype before
creating warp.zeros/warp.empty.

---

Duplicate comments:
In `@design/deterministic-execution.md`:
- Around line 111-112: The doc is inconsistent about scan convention:
wp.utils.array_scan(contrib, prefix, inclusive=False) is described as exclusive
at lines 111-112 but later (lines 170-172) claims the total comes from the last
element of an inclusive scan; pick one convention and make the writeback rule
consistent with the actual implementation. Update the description of
wp.utils.array_scan, the example/notation for "prefix" and the writeback rule
(the statement about where the total/last-offset is read) so they all use the
same convention (either inclusive or exclusive) and mention the inclusive flag
(inclusive=False/True) in the writeback explanation.

In `@warp/_src/deterministic.py`:
- Around line 140-143: The mapping _WARP_TO_CTYPE currently includes
warp.float16 but the deterministic scatter/reduce path does not support
half-precision and will read buffers with wrong element size; update
deterministic.py to fail fast whenever warp.float16 is encountered in the
deterministic reduction code paths by removing or guarding the warp.float16
entry in _WARP_TO_CTYPE and adding an explicit check that raises a clear
exception (or returns an error) when a reducer or conversion function (the code
paths that reference _WARP_TO_CTYPE) sees warp.float16; ensure the exception
message names warp.float16 and the deterministic reduction path so callers get a
clear rejection until a proper half reducer is implemented.

In `@warp/tests/test_deterministic.py`:
- Around line 955-956: Remove the disallowed kernel-cache clear from the
module-level __main__ block: delete the call to wp.clear_kernel_cache() found in
the if __name__ == "__main__": section of warp/tests/test_deterministic.py; do
not replace it with any cache-clearing call (wp.clear_lto_cache or similar) and
ensure no other module-scope or __main__-scoped cache-clear calls remain.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 4ca0283f-84b8-4939-90c1-9a656d3d0098

📥 Commits

Reviewing files that changed from the base of the PR and between 5ec9b25 and 72e8e3f.

📒 Files selected for processing (8)
  • design/deterministic-execution.md
  • warp/_src/codegen.py
  • warp/_src/context.py
  • warp/_src/deterministic.py
  • warp/config.py
  • warp/native/deterministic.cu
  • warp/native/deterministic.h
  • warp/tests/test_deterministic.py
✅ Files skipped from review due to trivial changes (2)
  • warp/native/deterministic.h
  • warp/native/deterministic.cu
🚧 Files skipped from review as they are similar to previous changes (1)
  • warp/_src/context.py

Comment thread warp/_src/codegen.py
Comment thread warp/_src/codegen.py Outdated
Comment thread warp/_src/deterministic.py Outdated
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Apr 10, 2026

Tip:

Greploop — Automatically fix all review issues by running /greploops in Claude Code. It iterates: fix, push, re-review, repeat until 5/5 confidence.

Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal.

Comment thread warp/_src/deterministic.py Outdated
Generalize deterministic scatter-reduce to operate on Warp leaf types instead of only scalar float values. This preserves existing component-wise atomic semantics for vectors and matrices while keeping the graph-capture-safe fixed-capacity launch path.

Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
warp/tests/test_deterministic.py (1)

1143-1145: ⚠️ Potential issue | 🟠 Major

Remove the kernel-cache clear from this test module.

wp.clear_kernel_cache() is explicitly disallowed in test files and can break parallel test runs.

Minimal fix
 if __name__ == "__main__":
-    wp.clear_kernel_cache()
     unittest.main(verbosity=2)

As per coding guidelines, "Never call wp.clear_kernel_cache() or wp.clear_lto_cache() in test files—not in __main__ blocks, test methods, or module scope. Cache clearing is not multi-process-safe; concurrent clears cause LLVM crashes."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_deterministic.py` around lines 1143 - 1145, Remove the call
to wp.clear_kernel_cache() from the module-level __main__ block; locate the if
__name__ == "__main__": block (where wp.clear_kernel_cache() is called before
unittest.main) and delete that wp.clear_kernel_cache() invocation so tests no
longer call kernel-cache clearing from the test module.
design/deterministic-execution.md (1)

114-115: ⚠️ Potential issue | 🟡 Minor

Use one scan convention throughout the design.

The Pattern B section documents array_scan(..., inclusive=False), but the writeback section still says the total comes from the last element of an inclusive scan. Please pick one convention and make the total-count rule match it everywhere.

Also applies to: 173-175

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/deterministic-execution.md` around lines 114 - 115, The docs use two
different scan conventions; standardize on one and make all references
consistent: choose whether wp.utils.array_scan(contrib, prefix, inclusive=True)
or inclusive=False is the canonical API, then update Pattern B and the writeback
section (and the other occurrence at lines ~173-175) so the phrase "total comes
from the last element" matches that convention (e.g., if using inclusive=False,
state that total = prefix[-1] + contrib[-1] or if inclusive=True, state total =
prefix[-1]); update any explanatory text and examples to reference
wp.utils.array_scan(..., inclusive=...) and the single total-count rule
consistently.
🧹 Nitpick comments (1)
warp/tests/test_deterministic.py (1)

692-753: Assert the actual slot order, not just counts/permutations.

counter_kernel should yield output == data_np, and conditional_counter_kernel should yield output[:expected_count] == data_np[data_np > threshold]. Right now a stable but wrong permutation would still pass because these tests only check counts, sorting, and cross-run reproducibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_deterministic.py` around lines 692 - 753, Update the two
tests to assert actual slot order rather than just counts/permutations: in
test_counter_correctness replace the sorted/permutation check with a direct
equality assertion that output.numpy() (or output.numpy().tolist()) exactly
equals data_np (ensuring dtype/shape match) to verify counter_kernel produces
output == data_np; in test_conditional_counter replace the permutation-based
checks with a direct equality assertion that output.numpy()[:expected_count]
equals data_np[data_np > threshold] to verify conditional_counter_kernel
preserves the original relative order of selected elements; keep the existing
cross-run determinism checks but change the comparisons to elementwise equality
rather than sorted/comparison of permutations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@design/deterministic-execution.md`:
- Around line 114-115: The docs use two different scan conventions; standardize
on one and make all references consistent: choose whether
wp.utils.array_scan(contrib, prefix, inclusive=True) or inclusive=False is the
canonical API, then update Pattern B and the writeback section (and the other
occurrence at lines ~173-175) so the phrase "total comes from the last element"
matches that convention (e.g., if using inclusive=False, state that total =
prefix[-1] + contrib[-1] or if inclusive=True, state total = prefix[-1]); update
any explanatory text and examples to reference wp.utils.array_scan(...,
inclusive=...) and the single total-count rule consistently.

In `@warp/tests/test_deterministic.py`:
- Around line 1143-1145: Remove the call to wp.clear_kernel_cache() from the
module-level __main__ block; locate the if __name__ == "__main__": block (where
wp.clear_kernel_cache() is called before unittest.main) and delete that
wp.clear_kernel_cache() invocation so tests no longer call kernel-cache clearing
from the test module.

---

Nitpick comments:
In `@warp/tests/test_deterministic.py`:
- Around line 692-753: Update the two tests to assert actual slot order rather
than just counts/permutations: in test_counter_correctness replace the
sorted/permutation check with a direct equality assertion that output.numpy()
(or output.numpy().tolist()) exactly equals data_np (ensuring dtype/shape match)
to verify counter_kernel produces output == data_np; in test_conditional_counter
replace the permutation-based checks with a direct equality assertion that
output.numpy()[:expected_count] equals data_np[data_np > threshold] to verify
conditional_counter_kernel preserves the original relative order of selected
elements; keep the existing cross-run determinism checks but change the
comparisons to elementwise equality rather than sorted/comparison of
permutations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: a316d310-4647-410d-a32d-1acc2968e8e5

📥 Commits

Reviewing files that changed from the base of the PR and between 72e8e3f and 48e7207.

📒 Files selected for processing (8)
  • design/deterministic-execution.md
  • warp/_src/codegen.py
  • warp/_src/context.py
  • warp/_src/deterministic.py
  • warp/native/deterministic.cpp
  • warp/native/deterministic.cu
  • warp/native/warp.h
  • warp/tests/test_deterministic.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • warp/native/deterministic.cpp
  • warp/_src/context.py
  • warp/_src/codegen.py
  • warp/_src/deterministic.py

@shi-eric shi-eric added this to the 1.14.0 milestone Apr 10, 2026
Replace the capacity-style deterministic scatter option with deterministic_max_records so the API matches the codegen lower-bound model. This lets users size dynamic-loop workloads in terms of per-thread atomic record counts instead of raw buffer capacity.

Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
Comment thread warp/_src/codegen.py Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (5)
warp/_src/context.py (2)

7628-7641: ⚠️ Potential issue | 🟠 Major

Keep the count/scan/reduce fixups on stream, and accumulate the existing counter value.

array_scan(), warp.copy(), and run_sort_reduce() all omit stream here, so an explicit non-current stream or graph capture can reorder or escape the deterministic pass. Also, Line 7641 writes only this launch's total into counter_arr[0]; a counter that started at N will end at total instead of N + total.

Also applies to: 7670-7670

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 7628 - 7641, The code omits the stream
argument and overwrites existing counter values; update calls to
warp._src.utils.array_scan(contrib, prefix, inclusive=False) and the inclusive
scan warp._src.utils.array_scan(contrib, inclusive_out, inclusive=True) as well
as warp.copy and any run_sort_reduce invocations to pass the current non-default
stream (preserve graph capture/determinism), and change the write into
counter_arr (found by iterating kernel.adj.args for ct.array_var_label and
obtained from fwd_args[j]) to add the inclusive_out last-element to the existing
counter value instead of replacing it (i.e., read counter_arr[0], add
inclusive_out[dim_size-1], and write the sum back using warp.copy on the given
stream); apply the same stream+accumulate fix to the similar block around the
other occurrence near the 7670 location.

7505-7508: ⚠️ Potential issue | 🟠 Major

Raw-ctype array updates are still unsafe for deterministic replays.

Line 7507 only keeps fwd_args aligned for set_param_at_index(). The inherited set_param_at_index_from_ctype() / set_params_from_ctypes() can still replace an array descriptor while _launch_deterministic() later uses the stale Python array object for warp.copy() and run_sort_reduce(). Recorded deterministic launches can therefore replay into the wrong array, or fail once fwd_args and self.params diverge. If these APIs stay exposed, DeterministicLaunch needs to reject array-ctype updates or carry the owning Warp array alongside the packed descriptor.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 7505 - 7508, The deterministic replay bug:
set_param_at_index only updates fwd_args but inherited
set_param_at_index_from_ctype / set_params_from_ctypes can replace raw C-type
array descriptors in self.params leaving fwd_args stale, which breaks
_launch_deterministic and later warp.copy/run_sort_reduce replay; to fix, either
(A) make DeterministicLaunch reject/raise when set_param_at_index_from_ctype or
set_params_from_ctypes is used with array-ctypes, or (B) ensure those
ctype-updating paths also update the corresponding Python owning Warp array
stored in fwd_args (and keep self.params and fwd_args synchronized) so that
set_param_at_index_from_ctype / set_params_from_ctypes maintain alignment with
fwd_args before _launch_deterministic runs.
design/deterministic-execution.md (1)

114-115: ⚠️ Potential issue | 🟡 Minor

Use one scan convention in the design doc.

Line 114 documents wp.utils.array_scan(contrib, prefix, inclusive=False), but Lines 176-178 say the total comes from the last element of an inclusive scan. Please make both sections describe the same writeback rule. If the scan stays exclusive, the total is prefix[-1] + contrib[-1].

Also applies to: 176-178

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/deterministic-execution.md` around lines 114 - 115, The doc currently
mixes scan conventions: update both occurrences to use the same convention for
wp.utils.array_scan(contrib, prefix, inclusive=False) and the writeback rule;
either change the function call to inclusive=True and state the total is
prefix[-1], or keep inclusive=False and change the total calculation text (the
section that currently says the total comes from the last element of an
inclusive scan) to the exclusive rule: total = prefix[-1] + contrib[-1]; make
this consistent for the wp.utils.array_scan(contrib, prefix, inclusive=False)
mention and the later total computation paragraph.
warp/tests/test_deterministic.py (1)

1190-1192: ⚠️ Potential issue | 🟠 Major

Remove the kernel-cache clear from this test module.

wp.clear_kernel_cache() is explicitly disallowed in test files and can crash parallel test runs. unittest.main() is enough here.

🧹 Minimal fix
 if __name__ == "__main__":
-    wp.clear_kernel_cache()
     unittest.main(verbosity=2)

As per coding guidelines, "Never call wp.clear_kernel_cache() or wp.clear_lto_cache() in test files—not in __main__ blocks, test methods, or module scope. Cache clearing is not multi-process-safe; concurrent clears cause LLVM crashes."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_deterministic.py` around lines 1190 - 1192, Remove the
explicit wp.clear_kernel_cache() call from the module-level __main__ block in
this test file; keep only the unittest.main(verbosity=2) invocation so the test
can run without clearing kernel caches (do not add wp.clear_lto_cache() either).
Locate the __main__ guard containing wp.clear_kernel_cache() and delete that
call, ensuring the block now simply calls unittest.main(...) and nothing else.
warp/_src/deterministic.py (1)

228-231: ⚠️ Potential issue | 🟠 Major

Counter scratch buffers still hard-code wp.int32.

CounterTarget.value_ctype is tracked, but contrib and prefix are always allocated as wp.int32. Any deterministic wp.int64/wp.uint64 counter will silently wrap in Phase 0/1. Please make these scratch buffers follow value_ctype, or fail fast for non-32-bit counters until the wider ABI is wired through end-to-end.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/deterministic.py` around lines 228 - 231, The loop that allocates
counter scratch buffers currently hardcodes dtype=warp.int32 for contrib and
prefix; change the allocation to use each CounterTarget's value_ctype (e.g., use
dtype=target.value_ctype or map that C type to the corresponding warp dtype) so
contrib and prefix match CounterTarget.value_ctype and avoid silent wrapping, or
add an explicit fast-fail that raises if target.value_ctype is not a 32-bit
integer type (when full 64-bit ABI isn't supported). Update the allocations in
the for _target in counter_targets block (the contrib and prefix
warp.zeros/warp.empty calls) to derive dtype from the target's value_ctype and
keep shape= (dim_size,) and device=device.
🧹 Nitpick comments (2)
warp/tests/test_deterministic.py (1)

850-879: test_module_option_override() doesn't prove the override is active on CUDA.

A single approximate-sum assertion will pass whether this kernel took the deterministic path or the normal atomic path, especially on CPU where both paths are deterministic. Please mirror test_kernel_decorator_override() here: run the kernel several times on CUDA and assert bit-exact equality across runs, then keep the sum check as a secondary sanity check if you want.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_deterministic.py` around lines 850 - 879,
test_module_option_override currently only checks approximate sums so it doesn't
verify the per-module deterministic=True override on CUDA; update the test
(function test_module_option_override) to mirror test_kernel_decorator_override
by: when device is CUDA (check device == "cuda" or via wp.get_device_name/device
backend), run the kernel per_kernel_det multiple times (e.g., 3+ runs) into
separate outputs and assert bit-exact equality of the outputs across runs (use
output.numpy() equality checks) to prove determinism, while retaining the
existing sum/assert_allclose sanity check as a secondary assertion; ensure you
still toggle wp.config.deterministic = False around the launch to verify the
per-module override takes effect.
warp/_src/deterministic.py (1)

193-193: Annotate device with DeviceLike.

These helper signatures take device untyped, which diverges from the repo's device-parameter convention and makes this launch path harder to type-check consistently.

As per coding guidelines, "Use DeviceLike type annotation (from warp._src.context) for device parameters. Import under TYPE_CHECKING to avoid circular imports."

Also applies to: 222-222, 235-235

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/deterministic.py` at line 193, The device parameter in
allocate_scatter_buffers should be annotated with DeviceLike: import DeviceLike
from warp._src.context inside a TYPE_CHECKING block (from typing import
TYPE_CHECKING) to avoid circular imports, then change the function signature to
accept device: DeviceLike; apply the same change to the other two nearby helper
functions in this module that take a device parameter so all device-typed helper
signatures use DeviceLike consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@warp/_src/context.py`:
- Around line 7510-7526: Before calling _launch_deterministic in launch, guard
the deterministic path with the same null checks used by the regular CUDA path:
verify self.kernel is not None and that self.hooks.forward exists; if either
check fails raise a clear Python-side RuntimeError (or appropriate exception)
with a descriptive message instead of calling _launch_deterministic with a null
kernel handle. Apply the same change to DeterministicLaunch.launch to keep
behavior consistent.
- Around line 2198-2205: ModuleHasher.hash_kernel currently only hashes
kernel.key and kernel.adj which allows reuse of stale cached modules when
per-kernel overrides (e.g., kernel.options like deterministic or
deterministic_max_records) change; update ModuleHasher.hash_kernel to
incorporate kernel.options (or at minimum the deterministic and
deterministic_max_records flags/values) into the computed hash so that
Module.load and the unique-module cache distinguish kernels with different
per-kernel deterministic overrides, and ensure any code paths that call
hash_kernel (e.g., Module.load) will therefore generate distinct hashes when
those option values differ.

In `@warp/_src/deterministic.py`:
- Around line 68-74: CounterTarget is currently keyed only by array_var_label
causing get_or_create_counter_target() to collapse all logical counters (e.g.,
wp.atomic_add(counter, bucket, 1) across different bucket indices) into one
global sequence; fix by including the logical index in the CounterTarget
key/metadata (add a field for the logical counter index and propagate it from
the codegen caller into get_or_create_counter_target()), or alternatively
enforce deterministic mode by detecting non-constant-zero indices in
wp.atomic_add (and reject/raise an error) so Phase 1 cannot hand out global
offsets for variable indices; update all related uses (CounterTarget class,
get_or_create_counter_target(), and the call sites that build Pattern B
metadata) to reflect the new keying or the rejection behavior.

---

Duplicate comments:
In `@design/deterministic-execution.md`:
- Around line 114-115: The doc currently mixes scan conventions: update both
occurrences to use the same convention for wp.utils.array_scan(contrib, prefix,
inclusive=False) and the writeback rule; either change the function call to
inclusive=True and state the total is prefix[-1], or keep inclusive=False and
change the total calculation text (the section that currently says the total
comes from the last element of an inclusive scan) to the exclusive rule: total =
prefix[-1] + contrib[-1]; make this consistent for the
wp.utils.array_scan(contrib, prefix, inclusive=False) mention and the later
total computation paragraph.

In `@warp/_src/context.py`:
- Around line 7628-7641: The code omits the stream argument and overwrites
existing counter values; update calls to warp._src.utils.array_scan(contrib,
prefix, inclusive=False) and the inclusive scan
warp._src.utils.array_scan(contrib, inclusive_out, inclusive=True) as well as
warp.copy and any run_sort_reduce invocations to pass the current non-default
stream (preserve graph capture/determinism), and change the write into
counter_arr (found by iterating kernel.adj.args for ct.array_var_label and
obtained from fwd_args[j]) to add the inclusive_out last-element to the existing
counter value instead of replacing it (i.e., read counter_arr[0], add
inclusive_out[dim_size-1], and write the sum back using warp.copy on the given
stream); apply the same stream+accumulate fix to the similar block around the
other occurrence near the 7670 location.
- Around line 7505-7508: The deterministic replay bug: set_param_at_index only
updates fwd_args but inherited set_param_at_index_from_ctype /
set_params_from_ctypes can replace raw C-type array descriptors in self.params
leaving fwd_args stale, which breaks _launch_deterministic and later
warp.copy/run_sort_reduce replay; to fix, either (A) make DeterministicLaunch
reject/raise when set_param_at_index_from_ctype or set_params_from_ctypes is
used with array-ctypes, or (B) ensure those ctype-updating paths also update the
corresponding Python owning Warp array stored in fwd_args (and keep self.params
and fwd_args synchronized) so that set_param_at_index_from_ctype /
set_params_from_ctypes maintain alignment with fwd_args before
_launch_deterministic runs.

In `@warp/_src/deterministic.py`:
- Around line 228-231: The loop that allocates counter scratch buffers currently
hardcodes dtype=warp.int32 for contrib and prefix; change the allocation to use
each CounterTarget's value_ctype (e.g., use dtype=target.value_ctype or map that
C type to the corresponding warp dtype) so contrib and prefix match
CounterTarget.value_ctype and avoid silent wrapping, or add an explicit
fast-fail that raises if target.value_ctype is not a 32-bit integer type (when
full 64-bit ABI isn't supported). Update the allocations in the for _target in
counter_targets block (the contrib and prefix warp.zeros/warp.empty calls) to
derive dtype from the target's value_ctype and keep shape= (dim_size,) and
device=device.

In `@warp/tests/test_deterministic.py`:
- Around line 1190-1192: Remove the explicit wp.clear_kernel_cache() call from
the module-level __main__ block in this test file; keep only the
unittest.main(verbosity=2) invocation so the test can run without clearing
kernel caches (do not add wp.clear_lto_cache() either). Locate the __main__
guard containing wp.clear_kernel_cache() and delete that call, ensuring the
block now simply calls unittest.main(...) and nothing else.

---

Nitpick comments:
In `@warp/_src/deterministic.py`:
- Line 193: The device parameter in allocate_scatter_buffers should be annotated
with DeviceLike: import DeviceLike from warp._src.context inside a TYPE_CHECKING
block (from typing import TYPE_CHECKING) to avoid circular imports, then change
the function signature to accept device: DeviceLike; apply the same change to
the other two nearby helper functions in this module that take a device
parameter so all device-typed helper signatures use DeviceLike consistently.

In `@warp/tests/test_deterministic.py`:
- Around line 850-879: test_module_option_override currently only checks
approximate sums so it doesn't verify the per-module deterministic=True override
on CUDA; update the test (function test_module_option_override) to mirror
test_kernel_decorator_override by: when device is CUDA (check device == "cuda"
or via wp.get_device_name/device backend), run the kernel per_kernel_det
multiple times (e.g., 3+ runs) into separate outputs and assert bit-exact
equality of the outputs across runs (use output.numpy() equality checks) to
prove determinism, while retaining the existing sum/assert_allclose sanity check
as a secondary assertion; ensure you still toggle wp.config.deterministic =
False around the launch to verify the per-module override takes effect.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: f76535da-266c-45ce-9df4-48647bb5e6f1

📥 Commits

Reviewing files that changed from the base of the PR and between 48e7207 and 3170255.

📒 Files selected for processing (4)
  • design/deterministic-execution.md
  • warp/_src/context.py
  • warp/_src/deterministic.py
  • warp/tests/test_deterministic.py

Comment thread warp/_src/context.py
Comment thread warp/_src/context.py Outdated
Comment thread warp/_src/deterministic.py
OpenClaw Bot added 4 commits April 12, 2026 22:11
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
Signed-off-by: OpenClaw Bot <bot@openclaw.ai>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (6)
warp/_src/context.py (3)

7532-7535: ⚠️ Potential issue | 🟠 Major

Keep fwd_args synchronized for ctype-based param mutators too.

Line 7532 only syncs self.fwd_args for set_param_at_index(). Calls through set_param_at_index_from_ctype() / set_params_from_ctypes() can still leave fwd_args stale, so deterministic replay can post-process the wrong arrays.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 7532 - 7535, The fwd_args list is only
updated in set_param_at_index, so calls that mutate params via ctype helpers
leave fwd_args stale; modify set_param_at_index_from_ctype and
set_params_from_ctypes to also synchronize self.fwd_args (either by delegating
to set_param_at_index for each change or by applying the same
index-check-and-assign logic used there), honoring the adjoint flag so adjoint
updates do not overwrite fwd_args.

7541-7553: ⚠️ Potential issue | 🟡 Minor

Preserve the forward-hook null guard in deterministic launch routing.

These deterministic branches call _launch_deterministic() without the regular forward-hook existence check. If hooks.forward is missing, this can pass a null kernel handle to CUDA launch instead of raising the clearer Python-side error.

Also applies to: 7825-7854

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 7541 - 7553, The deterministic launch
branches call _launch_deterministic(...) without checking for hooks.forward,
which can allow a null kernel handle to reach CUDA; add the same forward-hook
null guard used in the non-deterministic path: verify hooks.forward (or
equivalent forward kernel handle) is present before invoking
_launch_deterministic, and if missing raise the same Python-side error/exception
used elsewhere so we fail early and clearly; apply the same fix to the other
deterministic branch around the second occurrence noted (near the 7825-7854
block).

7657-7670: ⚠️ Potential issue | 🟠 Major

Two-pass deterministic path still breaks explicit-stream ordering and counter continuity.

Line 7657/Line 7664 scans, Line 7670 counter writeback, and Line 7699 sort-reduce are not stream-bound in this path. On non-current streams/capture this can reorder work. Also, Line 7670 overwrites the counter with this-launch total instead of preserving and adding the pre-existing counter value.

Also applies to: 7699-7699

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 7657 - 7670, The two-pass deterministic
path uses array_scan and warp.copy without binding them to the current
stream/capture and overwrites the counter rather than adding to the pre-existing
value; update the logic around warp._src.utils.array_scan(contrib, ...), the
inclusive_out buffer, and the writeback that finds counter_arr via
kernel.adj.args and ct.array_var_label so that (1) all scans, the inclusive_out
usage, the warp.copy writeback, and the later sort-reduce are submitted/bound to
the same current stream/capture to preserve ordering, and (2) when writing the
total into counter_arr you first read the existing counter value from
counter_arr and add the inclusive_out last element (the this-launch total) to it
(or perform an atomic-add equivalent) instead of blindly overwriting; ensure the
same stream-binding fix is applied to the sort-reduce at the other location
(around the existing sort-reduce call).
design/deterministic-execution.md (1)

129-130: ⚠️ Potential issue | 🟡 Minor

Use one scan convention throughout the design doc.

Pattern B says the prefix pass uses inclusive=False, but the writeback section says the total comes from “the last element of the inclusive scan.” Those are different rules, so the doc currently describes two incompatible implementations.

Also applies to: 191-193

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@design/deterministic-execution.md` around lines 129 - 130, The doc
inconsistently specifies scan conventions: Pattern B states the prefix pass uses
wp.utils.array_scan(contrib, prefix, inclusive=False) while the writeback
section refers to “the last element of the inclusive scan”; pick one convention
and update all references to match. Concretely, decide whether prefix scans are
inclusive or exclusive, then change mentions in Pattern B, the writeback
section, and other occurrences (e.g., lines ~191-193) so they all reference the
same wp.utils.array_scan(..., inclusive=...) behavior and explain how the total
is derived (either from the last element of the inclusive scan or from
last+contrib for exclusive).
warp/_src/deterministic.py (2)

283-293: ⚠️ Potential issue | 🟠 Major

Match the counter scratch buffers to the counter width.

CounterTarget.value_ctype is tracked, but allocate_counter_buffers() always allocates wp.int32 for both scratch arrays. That will truncate deterministic int64/uint64 counters unless they are rejected earlier. Either allocate these buffers from value_ctype, or fail fast for non-32-bit counters.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/deterministic.py` around lines 283 - 293, The counter scratch
buffers are always allocated as warp.int32 in allocate_counter_buffers, which
will truncate 64-bit counters; update allocate_counter_buffers to use each
CounterTarget's value_ctype (CounterTarget.value_ctype) when creating contrib
and prefix (i.e., pass dtype=_target.value_ctype to warp.zeros/warp.empty) or
alternatively raise/assert if _target.value_ctype is not a 32-bit type so
non-32-bit counters fail fast; ensure you reference the function
allocate_counter_buffers and the CounterTarget.value_ctype field when making the
change.

189-200: ⚠️ Potential issue | 🟠 Major

Don't merge every logical counter in an array into one deterministic target.

get_or_create_counter_target() only keys on array_var_label. If deterministic mode ever sees wp.atomic_add(counter, bucket, 1), every bucket in that array will reuse one contrib/prefix stream and Phase 1 will hand out slots from a single global sequence. Please include the logical counter index in the target identity, or reject non-constant-zero counter indices before creating a CounterTarget.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/deterministic.py` around lines 189 - 200,
get_or_create_counter_target currently deduplicates targets solely by
array_var_label which causes all array elements to share one deterministic
counter; update get_or_create_counter_target to include the logical counter
index in the identity (e.g., use a tuple of (array_var_label, logical_index) or
add a logical_index field to CounterTarget) so each distinct bucket gets its own
CounterTarget, and before creating a new CounterTarget validate the provided
index expression is a constant zero (or reject/non-deterministic indices) if
your design only allows index 0; ensure you update the lookup loop to compare
the new key (array_var_label plus logical index) and append the new
CounterTarget with the logical index populated when creating it.
🧹 Nitpick comments (2)
warp/_src/context.py (1)

1339-1341: Validate deterministic_max_records early for type/range.

This value is currently accepted broadly and coerced later in launch. Adding an upfront int + non-negative validation in option ingestion would fail fast and avoid silent coercion paths.

Also applies to: 2583-2585

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/_src/context.py` around lines 1339 - 1341, Validate
deterministic_max_records immediately before assigning into kernel_options:
ensure the value is an int (or can be safely converted) and is >= 0, raising a
TypeError for non-int types and a ValueError for negative values so callers fail
fast; then assign kernel_options["deterministic_max_records"] =
deterministic_max_records only after validation. Apply the same validation logic
at the other ingestion site referenced around lines 2583-2585 to keep behavior
consistent.
warp/tests/test_unique_module.py (1)

170-205: Keep the hash assertion runnable on CPU-only jobs.

Line 172 skips the whole test, but the module.name comparison at Lines 189-193 is device-independent. Splitting this into a CPU-safe hashing assertion plus a CUDA-only launch check would preserve coverage for the unique-module hashing change on non-CUDA runners.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@warp/tests/test_unique_module.py` around lines 170 - 205, The test currently
skips the entire test when CUDA is unavailable, but the module name hashing
assertion in test_kernel_options_affect_unique_module_identity (comparing
_scatter_normal.module.name and _scatter_deterministic.module.name) is
device-independent; change the test to perform the module.name comparison
unconditionally, and only guard the CUDA-specific array creation and wp.launch
calls with if not wp.is_cuda_available(): self.skipTest(...) or conditional
blocks around the CUDA-only code (values/indices/out_* with device="cuda:0" and
the wp.launch calls) so the hashing assertion still runs on CPU-only CI while
the deterministic launch checks remain CUDA-only.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@asv/benchmarks/atomics.py`:
- Around line 242-243: The class attribute params (and param_names) in the
benchmark classes are defined as lists which creates mutable class state; change
them to tuples so they are immutable (e.g., replace params = (["normal",
"deterministic"], [1, 65536], DETERMINISTIC_BENCHMARK_SIZES) with a
tuple-of-tuples and param_names likewise) and apply the same change to the
second occurrence around lines 317-318; update the definitions referenced as
params and param_names in asv/benchmarks/atomics.py so both benchmark classes
use immutable tuples instead of lists.

In `@warp/_src/codegen.py`:
- Around line 1785-1788: The early-return that skips codegen for integer atomics
(the condition using is_float_type(scalar_dtype) and return_is_consumed) is
unsafe when a counting pass exists; modify that condition to also detect whether
the current kernel contains any consumed/counter atomic (i.e., only return None
when not is_float_type(scalar_dtype) AND not return_is_consumed AND there is no
consumed/counter atomic in the kernel). Use the existing kernel/AST context or
add a predicate (e.g., has_consumed_counter_atomic) to check for consumed
atomics before returning so phase 0 will not double-apply native integer atomics
in mixed kernels.

---

Duplicate comments:
In `@design/deterministic-execution.md`:
- Around line 129-130: The doc inconsistently specifies scan conventions:
Pattern B states the prefix pass uses wp.utils.array_scan(contrib, prefix,
inclusive=False) while the writeback section refers to “the last element of the
inclusive scan”; pick one convention and update all references to match.
Concretely, decide whether prefix scans are inclusive or exclusive, then change
mentions in Pattern B, the writeback section, and other occurrences (e.g., lines
~191-193) so they all reference the same wp.utils.array_scan(..., inclusive=...)
behavior and explain how the total is derived (either from the last element of
the inclusive scan or from last+contrib for exclusive).

In `@warp/_src/context.py`:
- Around line 7532-7535: The fwd_args list is only updated in
set_param_at_index, so calls that mutate params via ctype helpers leave fwd_args
stale; modify set_param_at_index_from_ctype and set_params_from_ctypes to also
synchronize self.fwd_args (either by delegating to set_param_at_index for each
change or by applying the same index-check-and-assign logic used there),
honoring the adjoint flag so adjoint updates do not overwrite fwd_args.
- Around line 7541-7553: The deterministic launch branches call
_launch_deterministic(...) without checking for hooks.forward, which can allow a
null kernel handle to reach CUDA; add the same forward-hook null guard used in
the non-deterministic path: verify hooks.forward (or equivalent forward kernel
handle) is present before invoking _launch_deterministic, and if missing raise
the same Python-side error/exception used elsewhere so we fail early and
clearly; apply the same fix to the other deterministic branch around the second
occurrence noted (near the 7825-7854 block).
- Around line 7657-7670: The two-pass deterministic path uses array_scan and
warp.copy without binding them to the current stream/capture and overwrites the
counter rather than adding to the pre-existing value; update the logic around
warp._src.utils.array_scan(contrib, ...), the inclusive_out buffer, and the
writeback that finds counter_arr via kernel.adj.args and ct.array_var_label so
that (1) all scans, the inclusive_out usage, the warp.copy writeback, and the
later sort-reduce are submitted/bound to the same current stream/capture to
preserve ordering, and (2) when writing the total into counter_arr you first
read the existing counter value from counter_arr and add the inclusive_out last
element (the this-launch total) to it (or perform an atomic-add equivalent)
instead of blindly overwriting; ensure the same stream-binding fix is applied to
the sort-reduce at the other location (around the existing sort-reduce call).

In `@warp/_src/deterministic.py`:
- Around line 283-293: The counter scratch buffers are always allocated as
warp.int32 in allocate_counter_buffers, which will truncate 64-bit counters;
update allocate_counter_buffers to use each CounterTarget's value_ctype
(CounterTarget.value_ctype) when creating contrib and prefix (i.e., pass
dtype=_target.value_ctype to warp.zeros/warp.empty) or alternatively
raise/assert if _target.value_ctype is not a 32-bit type so non-32-bit counters
fail fast; ensure you reference the function allocate_counter_buffers and the
CounterTarget.value_ctype field when making the change.
- Around line 189-200: get_or_create_counter_target currently deduplicates
targets solely by array_var_label which causes all array elements to share one
deterministic counter; update get_or_create_counter_target to include the
logical counter index in the identity (e.g., use a tuple of (array_var_label,
logical_index) or add a logical_index field to CounterTarget) so each distinct
bucket gets its own CounterTarget, and before creating a new CounterTarget
validate the provided index expression is a constant zero (or
reject/non-deterministic indices) if your design only allows index 0; ensure you
update the lookup loop to compare the new key (array_var_label plus logical
index) and append the new CounterTarget with the logical index populated when
creating it.

---

Nitpick comments:
In `@warp/_src/context.py`:
- Around line 1339-1341: Validate deterministic_max_records immediately before
assigning into kernel_options: ensure the value is an int (or can be safely
converted) and is >= 0, raising a TypeError for non-int types and a ValueError
for negative values so callers fail fast; then assign
kernel_options["deterministic_max_records"] = deterministic_max_records only
after validation. Apply the same validation logic at the other ingestion site
referenced around lines 2583-2585 to keep behavior consistent.

In `@warp/tests/test_unique_module.py`:
- Around line 170-205: The test currently skips the entire test when CUDA is
unavailable, but the module name hashing assertion in
test_kernel_options_affect_unique_module_identity (comparing
_scatter_normal.module.name and _scatter_deterministic.module.name) is
device-independent; change the test to perform the module.name comparison
unconditionally, and only guard the CUDA-specific array creation and wp.launch
calls with if not wp.is_cuda_available(): self.skipTest(...) or conditional
blocks around the CUDA-only code (values/indices/out_* with device="cuda:0" and
the wp.launch calls) so the hashing assertion still runs on CPU-only CI while
the deterministic launch checks remain CUDA-only.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 493e06fd-9309-43dc-ad53-7cc5a97aa0b8

📥 Commits

Reviewing files that changed from the base of the PR and between 3170255 and e34a93c.

📒 Files selected for processing (11)
  • asv/benchmarks/atomics.py
  • design/deterministic-execution.md
  • warp/_src/codegen.py
  • warp/_src/context.py
  • warp/_src/deterministic.py
  • warp/config.py
  • warp/native/deterministic.cpp
  • warp/native/deterministic.cu
  • warp/native/warp.h
  • warp/tests/test_deterministic.py
  • warp/tests/test_unique_module.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • warp/native/warp.h
  • warp/native/deterministic.cpp
  • warp/config.py

Comment thread asv/benchmarks/atomics.py Outdated
Comment thread warp/_src/codegen.py Outdated
@eric-heiden
Copy link
Copy Markdown
Member

eric-heiden commented Apr 16, 2026

This determinism mode doesn't yet support atomics inside Warp functions, which is used in MJWarp, for example in smooth.py:

  • _accumulate_jac_chain()
  • _accumulate_jac_dot_chain()

Minimal repro:

import warp as wp

wp.init()

a = wp.zeros(1, dtype=wp.float32, device="cuda")

@wp.func
def add_one(x: wp.array(dtype=wp.float32)):
    wp.atomic_add(x, 0, 1.0)

@wp.kernel
def k(x: wp.array(dtype=wp.float32)):
    add_one(x)

wp.launch(k, dim=1, inputs=[a], device="cuda")
wp.synchronize_device("cuda")
print(a.numpy())

With this PR's deterministic path enabled, atomic_add inline in a @wp.kernel works, but the same atomic inside a @wp.func helper called from a kernel fails NVRTC compilation.

The generated helper function body gets rewritten to use deterministic scatter / debug plumbing, but its signature does not receive the hidden deterministic params that are appended for kernel entry points (_wp_scatter_*, _wp_det_debug, _idx, etc.).

Signed-off-by: Miles Macklin <mmacklin@nvidia.com>
Signed-off-by: Miles Macklin <mmacklin@nvidia.com>
Comment thread warp/_src/context.py Outdated
stevenwman added a commit to stevenwman/jax-learning that referenced this pull request Apr 27, 2026
…nostic

scripts/check_tdmpc2_determinism.py — three subchecks (--check init |
update | env). Each runs in <30s; run twice in separate processes and
diff the printed SHA-256 hashes to localize where non-determinism enters.

.context/lessons/determinism.md — empirical results + root cause +
upstream fix-in-flight tracking. TL;DR:
  - JAX/XLA algo: bit-ID across processes IFF
    XLA_FLAGS=--xla_gpu_deterministic_ops=true
  - mujoco_warp env: NOT bit-ID across processes; XLA flag has no effect
    on Warp CUDA kernels. wp.set_device("cpu") is the only workaround
    today. Fix in NVIDIA/warp#1355 (Warp 1.14, ~Jun 2026) +
    mujoco_warp#1281, #1300.
  - Therefore end-to-end training trajectories cannot be byte-identical
    across two GPU processes. Same-process re-runs appear deterministic.

Rationale for adding now: the TD-MPC2 eval-key isolation fix (b5a8c70)
couldn't be empirically verified by diffing two-process trajectories
because env non-determinism dominates; needed the within-process
key-inspection workaround. Future repeatability questions for any algo
in this repo should consult this lesson.

LESSONS.md updated with index entry.
Keep ordinary CUDA launches on the pre-existing direct native call path so launch microbenchmarks avoid deterministic helper overhead. Deterministic multi-pass launches continue to use the shared helper.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Add a hands-on deep dive for deterministic atomic patterns, configuration, CUDA graph caveats, and the CUB/CCCL building blocks behind the implementation.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Document that consumed-return deterministic counters return a base slot and reserve a contiguous range when the atomic add value is greater than one.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Fail closed for deterministic consumed-return counters during CUDA graph capture because the current scan path allocates CUB temporary storage inside the captured call. Also reject oversized counter launches before allocating int32 prefix buffers and update the docs/tests to match.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Start the Deep Dive with module-level determinism for existing programs, and clarify when kernel-level deterministic options and unique modules are appropriate.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Describe True and False deterministic modes as ease-of-use aliases rather than backward compatibility behavior for a new feature.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Avoid CUDA graph capture for the deterministic consumed-return counter ASV benchmark, annotate boolean deterministic aliases, and document the scatter buffer capacity floor.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Expand the mode selection guidance, simplify the limitations heading, improve the NVIDIA libraries wording, and add small examples for the most concrete limitations.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Match the internal deterministic-mode docstring to the public docs by describing boolean aliases as ease-of-use support.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Explain manual deterministic lowering patterns, add a mixed-reduction workaround, and include a small measured performance table for deterministic atomic modes.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Call out that highly contended floating-point atomics can speed up under deterministic run-to-run reductions, while sparse atomics usually pay sorting overhead.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Move performance discussion into its own section with benchmark snippets and separate measurements for deterministic accumulation and slot allocation patterns.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Split the performance discussion into accumulation and slot-allocation subsections and state that both measurements use CUDA event timing.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Publish consumed-return counter totals from the exclusive prefix and final contribution on device. This avoids the redundant full inclusive scan and temporary output buffer while keeping the write stream-ordered.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Keep block-dim-dependent shared tile type and initializer emission consistent by using WP_TILE_BLOCK_DIM for default shared strides, not just shape dimensions.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Broaden consumed-return atomic_add lowering to use sorted counter records so dynamic indices, sliced counter views, graph capture, and custom adjoint call paths work in deterministic mode. Keep unsupported consumed-return non-add atomics fail-closed and document the remaining limits.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Custom gradient codegen can run with deterministic support disabled, which leaves deterministic metadata unset. Guard those lookups so custom adjoints use ordinary reverse calls in not_guaranteed mode while preserving metadata propagation when deterministic mode is enabled.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Clean up user-facing determinism wording and add immediate CUDA launch error checks for deterministic native postpass kernels. Widen launch block arithmetic to avoid int overflow near the supported record limit.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
The deterministic counter scan now publishes final counter totals directly, so the old counter-total native entry point and ctypes binding are dead code. Remove the stale export to keep the native API surface tidy.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Custom adjoint functions compile in deterministic mode, but atomics through wp.adjoint[...] are not automatically rewritten. Document that boundary so the public guide and design limitations match the implementation.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Allow the ASV counter benchmark to exercise deterministic graph capture and guard deterministic counter record ordinal computation with widened arithmetic.

Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
Signed-off-by: Eric Heiden <eric-heiden@outlook.com>
@eric-heiden eric-heiden force-pushed the warp-deterministic branch from ad273f5 to d0284f4 Compare May 11, 2026 16:54
@shi-eric shi-eric removed this from the 1.14.0 milestone May 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add deterministic execution mode for atomic operations

4 participants