diff --git a/docs/deep_dive/allocators.rst b/docs/deep_dive/allocators.rst
index b27219f040..37aae3735d 100644
--- a/docs/deep_dive/allocators.rst
+++ b/docs/deep_dive/allocators.rst
@@ -246,7 +246,7 @@ Limitations
 Mempool-to-Mempool Copies Between GPUs During Graph Capture
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Copying data between different GPUs will fail during graph capture if the source and destination are allocated using mempool allocators and mempool access is not enabled between devices.  Note that this only applies to capturing mempool-to-mempool copies in a graph; copies done outside of graph capture are not affected.  Copies within the same mempool (i.e., same device) are also not affected.
+Copying data between different GPUs will fail during graph capture if the source and destination are allocated using mempool allocators and mempool access is not enabled between devices.  Note that this only applies to capturing mempool-to-mempool copies in a graph.  Copies done outside of graph capture are not affected.  Copies within the same mempool (i.e., same device) are also not affected.
 
 There are two workarounds.  If mempool access is supported, you can simply enable mempool access between the devices prior to graph capture, as shown in :ref:`mempool_access`.
 
@@ -398,11 +398,11 @@ PyTorch's cache, implement a small custom allocator that calls
 PyTorch tracks the device and stream for pointers returned by
 ``caching_allocator_alloc()``, so ``caching_allocator_delete()`` only needs the
 pointer. The ``_active_allocations`` dictionary above is for validation and
-debugging; applications can customize this tracking for their own accounting,
+debugging. Applications can customize this tracking for their own accounting,
 thread-safety, or distributed runtime needs.
 
-RMM Integration
-~~~~~~~~~~~~~~~
+RAPIDS Memory Manager (RMM) Integration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 `RAPIDS Memory Manager (RMM) <https://github.com/rapidsai/rmm>`_ provides high-performance
 pooled allocators for CUDA. Warp includes a built-in adapter, :class:`~warp.utils.AllocatorRmm`, that
diff --git a/docs/deep_dive/codegen.rst b/docs/deep_dive/codegen.rst
index 3338c01ef0..45942efcec 100644
--- a/docs/deep_dive/codegen.rst
+++ b/docs/deep_dive/codegen.rst
@@ -97,6 +97,11 @@ Warp generates C++/CUDA source code for CPU/GPU and stores the .cpp/.cu source f
 The kernel cache folder path is printed during the :ref:`Warp initialization <warp-initialization>` and
 can be retrieved after Warp has been initialized from the ``warp.config.kernel_cache_dir`` :ref:`configuration setting <global-settings>`.
 
+When needed, users can intentionally insert small C++/CUDA snippets into
+generated modules with :func:`@wp.func_native <warp.func_native>`. See
+:ref:`Native Snippets in Warp Kernels <native_functions>` for the public
+native-function API.
+
 In addition to Warp's kernel cache, the NVIDIA CUDA driver maintains a separate
 compute cache that stores JIT-compiled GPU binaries (e.g., native code produced
 from PTX). This driver-level cache is not managed by Warp and is not affected by
@@ -1216,242 +1221,11 @@ Output:
 
 Kernel ``k1`` uses the latest definition of function ``f``, while kernel ``k2`` uses the definition of ``f`` when the kernel was declared.
 
+
 Ahead-of-Time Compilation Workflows
 -----------------------------------
 
-Under typical use cases, Warp will compile and load modules automatically at runtime when kernels are launched using
-:func:`wp.launch() <warp.launch>` or :func:`wp.launch_tiled() <warp.launch_tiled>`.
-
-However, there are some cases where it is convenient to compile a module ahead of time, perhaps even on a different
-machine.
-For example, you might want to use Warp to generate source code that can be included in a native CUDA C++ application.
-
-By disabling module hashing, you can even distribute pre-compiled modules without the original source code.
-
-The :func:`wp.compile_aot_module() <warp.compile_aot_module>` and :func:`wp.load_aot_module() <warp.load_aot_module>`
-functions can be used to compile and load modules manually.
-
-Example: Compile a module manually without launching any kernels
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The following example shows how to compile a module manually without launching any kernels.
-We pass ``__name__`` as the first argument to :func:`wp.compile_aot_module() <warp.compile_aot_module>` to indicate
-that the current module is the one that should be compiled, but it is also possible to pass the name of another
-module containing Warp kernels.
-
-.. code:: python
-
-    import warp as wp
-    
-
-    @wp.kernel
-    def multiply_arrays(a: wp.array[float], b: wp.array[float], out: wp.array[float]):
-        i = wp.tid()
-        out[i] = a[i] * b[i]
-
-    if __name__ == "__main__":
-        wp.compile_aot_module(__name__, module_dir="output")
-
-This will create the following files:
-
-.. code-block:: text
-
-   output/
-   ├── wp___main___97ca746.cu
-   ├── wp___main___97ca746.meta
-   └── wp___main___97ca746.sm86.ptx
-
-Example: Distributing a pre-compiled module
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-In this example, our goal is to load and run kernels from a pre-compiled module without the original source
-code that produced the module (either the Python function or the generated CUDA C++ code).
-
-First, we will disable module hashing when we ahead-of-time compile the code by passing ``strip_hash=True`` to
-:func:`wp.compile_aot_module() <warp.compile_aot_module>`.
-When hashing is disabled, the generated function names and filenames become more readable.
-For example, instead of a function named ``sum_kernel_b6a8b352_cuda_kernel_forward()`` in the file
-``wp___main___3a2e577/wp___main___3a2e577.cu``, you get the cleaner function ``sum_kernel_cuda_kernel_forward()``
-in the file ``wp___main__/wp___main__.cu``. Therefore, ``strip_hash=True`` can also be used to prevent name mangling.
-
-.. important::
-
-    Not all modules can be compiled successfully with hashing disabled. Name collisions can occur when
-    hashes are removed from the function names, particularly when using :doc:`Generics <../user_guide/generics>`, which can
-    result in multiple versions of the same function or kernel.
-
-Only use ``strip_hash=True`` when your module has a single version of each function and kernel.
-
-.. code:: python
-
-    import warp as wp
-    
-
-    @wp.kernel
-    def multiply_arrays(a: wp.array[float], b: wp.array[float], out: wp.array[float]):
-        i = wp.tid()
-        out[i] = a[i] * b[i]
-
-    if __name__ == "__main__":
-        wp.compile_aot_module(__name__, module_dir="output", strip_hash=True)
-
-This results in the following files:
-
-.. code:: text
-
-   output/
-   ├── wp___main__.cu
-   ├── wp___main__.meta
-   └── wp___main__.sm86.ptx
-
-The second part of this workflow involves writing a script that can load the pre-compiled module and run the kernels,
-all without access to the original script that compiled the module.
-
-We will remove the ``output/wp___main__.cu`` to mock up a potential user that has been provided with only the
-``output/wp___main__.sm86.ptx`` and ``output/wp___main__.meta`` files.
-For example, intellectual-property concerns may prevent disclosure of the original source code.
-
-We can write a script that also defines the ``multiply_arrays`` kernel with the same signature as in the previous
-script, but contains an empty body.
-
-.. code:: python
-
-    import warp as wp
-
-
-    @wp.kernel
-    def multiply_arrays(a: wp.array[float], b: wp.array[float], out: wp.array[float]):
-        pass
-
-
-    if __name__ == "__main__":
-        wp.load_aot_module(__name__, module_dir="output", strip_hash=True)
-
-        a = wp.full((10,), 5.0, dtype=float)
-        b = wp.full((10,), 10.0, dtype=float)
-        c = wp.empty_like(a)
-
-        wp.launch(multiply_arrays, dim=a.shape, inputs=[a, b], outputs=[c])
-
-        print(c)
-
-When we run the above script, we see the following output:
-
-.. code:: text
-
-    Module __main__ load on device 'cuda:0' took 46.33 ms  (cached)
-    [50. 50. 50. 50. 50. 50. 50. 50. 50. 50.]
-
-The ``(cached)`` in the module-load message indicates that the module is loaded from cache instead of being compiled.
-The output also shows that the kernel multiplied the elements of the arrays ``a`` and ``b`` together despite the
-empty body of the ``multiply_arrays`` kernel.
-
-This example assumed both the module compilation and loading happened on the same machine, but it might be necessary
-to compile a module for multiple GPU architectures.
-A list of GPU architectures for which to generate code as well as whether to generate PTX or CUBIN files can be
-specified when calling :func:`wp.compile_aot_module() <warp.compile_aot_module>`.
-
-The following example shows how to compile a module without hashing for multiple architectures as CUBIN files
-by passing in a list of architectures returned by
-:func:`wp.get_cuda_supported_archs() <warp.get_cuda_supported_archs>` to the ``arch`` argument.
-
-.. code:: python
-
-    import warp as wp
-
-
-    @wp.kernel
-    def multiply_arrays(
-        a: wp.array[float],
-        b: wp.array[float],
-        out: wp.array[float],
-    ):
-        i = wp.tid()
-        out[i] = a[i] * b[i]
-
-
-    if __name__ == "__main__":
-        wp.init()
-        wp.compile_aot_module(
-            __name__,
-            arch=wp.get_cuda_supported_archs(),
-            module_dir="output",
-            strip_hash=True,
-            use_ptx=False,
-        )
-
-On a CUDA 13.0 build of Warp, this results in the following files:
-
-.. code:: text
-
-   output/
-   ├── wp___main__.cu
-   ├── wp___main__.meta
-   ├── wp___main__.sm100.cubin
-   ├── wp___main__.sm103.cubin
-   ├── wp___main__.sm110.cubin
-   ├── wp___main__.sm120.cubin
-   ├── wp___main__.sm121.cubin
-   ├── wp___main__.sm75.cubin
-   ├── wp___main__.sm80.cubin
-   ├── wp___main__.sm86.cubin
-   ├── wp___main__.sm87.cubin
-   ├── wp___main__.sm88.cubin
-   ├── wp___main__.sm89.cubin
-   └── wp___main__.sm90.cubin
-
-Example: Compiling without a CUDA driver (Docker build steps)
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Warp statically links NVRTC (the NVIDIA Runtime Compiler), which means CUDA kernel compilation
-does not require a CUDA driver or GPU to be present. This is useful for ahead-of-time compilation
-during Docker image builds, where GPUs are typically unavailable.
-
-When Warp is initialized without a CUDA driver, it detects that NVRTC is still available and
-prints a message like:
-
-.. code:: text
-
-   Warp 1.12.0 initialized:
-      CUDA Toolkit 13.1, CUDA driver not available (NVRTC compilation available)
-      Devices:
-        "cpu"      : "CPU"
-
-In this mode, :func:`wp.get_cuda_supported_archs() <warp.get_cuda_supported_archs>` returns the
-full list of architectures supported by NVRTC, and
-:func:`wp.compile_aot_module() <warp.compile_aot_module>` can compile CUDA kernels as long as the
-``arch`` parameter is specified.
-
-The following self-contained ``Dockerfile`` demonstrates compiling Warp kernels during
-``docker build`` (without ``--gpus``). Save it and run ``docker build -t warp-aot-example .``
-to try it:
-
-.. code:: dockerfile
-
-   FROM python:3.12-slim
-   RUN pip install warp-lang
-
-   RUN cat <<'EOF' > compile.py
-   import os
-   import warp as wp
-
-   @wp.kernel
-   def my_kernel(a: wp.array[float], b: wp.array[float]):
-       i = wp.tid()
-       b[i] = a[i] * 2.0
-
-   os.makedirs("/app/warp_cache", exist_ok=True)
-   wp.compile_aot_module(__name__, arch=[75, 80, 86, 90], module_dir="/app/warp_cache")
-
-   files = [f for f in os.listdir("/app/warp_cache") if f.endswith((".ptx", ".cubin"))]
-   print(f"Compiled {len(files)} CUDA files: {sorted(files)}")
-   EOF
-   RUN python compile.py
-
-At runtime, launch the container with ``--gpus`` and the pre-compiled PTX/CUBIN will be loaded
-from cache, avoiding JIT compilation delays.
-
-**The** ``arch`` **parameter is required when compiling without a GPU**, since Warp cannot infer
-the target architecture from a device. Use
-:func:`wp.get_cuda_supported_archs() <warp.get_cuda_supported_archs>` to query which architectures
-are available for compilation.
+Ahead-of-time compilation is one of the C++/CUDA integration workflows. See
+:ref:`ahead_of_time_compilation_workflows` for instructions on generating
+CUDA source, PTX, and CUBIN files; distributing pre-compiled modules; and
+compiling CUDA kernels in build environments without a CUDA driver.
diff --git a/docs/index.rst b/docs/index.rst
index 31499f1ded..f4488ce648 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -181,6 +181,7 @@ warp/examples/tile
     user_guide/tiles
     user_guide/interoperability
     user_guide/configuration
+    user_guide/cpp_cuda_workflows
     user_guide/debugging
     user_guide/limitations
     user_guide/contribution_guide
diff --git a/docs/user_guide/basics.rst b/docs/user_guide/basics.rst
index 3c8d9bedd4..25aa481156 100644
--- a/docs/user_guide/basics.rst
+++ b/docs/user_guide/basics.rst
@@ -274,8 +274,9 @@ determine the tile argument type:
 
 See :ref:`Generic Functions` for details on using ``typing.Any`` in user function signatures.
 
-See :doc:`differentiability` for details on how to define custom gradient functions,
-custom replay functions, and custom native functions.
+See :doc:`differentiability` for details on custom gradient and replay
+functions. See :doc:`cpp_cuda_workflows` for native C++/CUDA snippets and
+other non-Python integration workflows.
 
 User Structs
 --------------
diff --git a/docs/user_guide/cpp_cuda_workflows.rst b/docs/user_guide/cpp_cuda_workflows.rst
new file mode 100644
index 0000000000..c6b1e590e1
--- /dev/null
+++ b/docs/user_guide/cpp_cuda_workflows.rst
@@ -0,0 +1,752 @@
+C++ and CUDA Workflows
+======================
+
+.. currentmodule:: warp
+
+Warp applications are typically authored in Python, but several workflows expose
+generated C++/CUDA code or replay captured Warp work from a native host
+application. This page collects the public entry points for those non-Python
+integration paths and links to the detailed workflow documentation and examples.
+
+This page covers topics such as:
+
+- inserting native C++/CUDA snippets into generated Warp kernels.
+- ahead-of-time compiling Warp kernels into source, PTX, or CUBIN files.
+- loading generated Warp binaries or source from a CUDA C++ application.
+- serializing captured Warp work and replaying it from C++ without a Python
+  runtime.
+
+.. _native_functions:
+
+Native Snippets in Warp Kernels
+-------------------------------
+
+Use :func:`@wp.func_native <warp.func_native>` to insert native C++/CUDA code
+into generated Warp modules. Native functions are useful when Warp does not
+provide a built-in operation, CUDA intrinsic, synchronization pattern, or
+low-level expression that your kernel needs.
+
+Pure C++ snippets, meaning snippets without CUDA-only constructs, can be used
+by CPU kernels. The same snippet can also be used by CUDA kernels if the code is
+valid device code. Snippets that use CUDA-only constructs must be called only
+from kernels launched on CUDA devices; CPU code generation cannot compile
+features such as ``__shared__`` memory, ``__syncthreads()``, or CUDA atomics.
+
+The decorator takes native source code as a string. The decorated Python
+function is a typed stub: its arguments define the names and types available to
+the snippet, and its body should be ``...`` because Warp replaces the body with
+the native snippet during code generation.
+
+The thread index should be computed by the caller and passed explicitly. Native
+snippets are inserted into generated C++/CUDA, so they cannot call
+:func:`wp.tid() <warp.tid>` directly.
+
+CUDA Shared Memory Histogram
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following CUDA-only native function builds a 16-bin histogram within a
+single 128-thread block using shared memory. This pattern is different from a
+tile reduction: each thread chooses a bin dynamically and updates a per-block
+scratch array with CUDA shared-memory atomics before the block writes the final
+bins to global memory. It assumes the launch uses exactly one block.
+
+.. testcode:: cuda_shared_memory_histogram
+    :skipif: wp.get_cuda_device_count() == 0
+
+    import numpy as np
+    import warp as wp
+
+    snippet = """
+        __shared__ unsigned int bins[16];
+
+        if (tid < 16) {
+            bins[tid] = 0;
+        }
+        __syncthreads();
+
+        unsigned int bin = values[tid] & 15u;
+        atomicAdd(&bins[bin], 1u);
+        __syncthreads();
+
+        if (tid < 16) {
+            out[tid] = bins[tid];
+        }
+        """
+
+
+    @wp.func_native(snippet)
+    def histogram_block(values: wp.array[wp.uint32], out: wp.array[wp.uint32], tid: int):
+        ...
+
+
+    @wp.kernel(module="unique")
+    def histogram_kernel(values: wp.array[wp.uint32], out: wp.array[wp.uint32]):
+        tid = wp.tid()
+        histogram_block(values, out, tid)
+
+
+    values_host = np.arange(128, dtype=np.uint32) % 16
+    values = wp.array(values_host, dtype=wp.uint32, device="cuda")
+    out = wp.zeros(16, dtype=wp.uint32, device="cuda")
+    wp.launch(
+        histogram_kernel,
+        dim=128,
+        inputs=[values],
+        outputs=[out],
+        block_dim=128,
+        device="cuda",
+    )
+
+    print(out.numpy())
+
+.. testoutput:: cuda_shared_memory_histogram
+    :skipif: wp.get_cuda_device_count() == 0
+
+    [8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8]
+
+Inline PTX
+~~~~~~~~~~
+
+Native snippets can also use `inline Parallel Thread Execution (PTX) assembly
+<https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html>`_ inside CUDA
+code. Inline PTX is useful when you need a GPU instruction that is not exposed
+directly through Warp or CUDA C++.
+
+The following example computes the sum of four byte-wise absolute differences
+between two packed 8-bit values. The `PTX vabsdiff4 instruction
+<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#simd-video-instructions-vadd4-vsub4-vavrg4-vabsdiff4-vmin4-vmax4>`_
+performs four byte-wise absolute differences and, with the ``.add`` modifier,
+accumulates them into one 32-bit result.
+
+.. testcode:: inline_ptx
+    :skipif: wp.get_cuda_device_count() == 0
+
+    import numpy as np
+    import warp as wp
+
+    snippet = r"""
+        unsigned int result;
+        unsigned int zero = 0;
+        asm("vabsdiff4.u32.u32.u32.add %0, %1, %2, %3;"
+            : "=r"(result)
+            : "r"(a), "r"(b), "r"(zero));
+        return result;
+        """
+
+
+    @wp.func_native(snippet)
+    def absdiff4_sum_u8(a: wp.uint32, b: wp.uint32) -> wp.uint32:
+        ...
+
+
+    @wp.kernel(module="unique")
+    def absdiff4_kernel(
+        a: wp.array[wp.uint32],
+        b: wp.array[wp.uint32],
+        out: wp.array[wp.uint32],
+    ):
+        tid = wp.tid()
+        out[tid] = absdiff4_sum_u8(a[tid], b[tid])
+
+
+    def pack4(values):
+        return np.uint32(values[0] | (values[1] << 8) | (values[2] << 16) | (values[3] << 24))
+
+
+    a_host = np.array([pack4([10, 20, 30, 40]), pack4([0, 128, 255, 13])], dtype=np.uint32)
+    b_host = np.array([pack4([13, 18, 41, 35]), pack4([255, 120, 0, 15])], dtype=np.uint32)
+
+    a = wp.array(a_host, dtype=wp.uint32, device="cuda")
+    b = wp.array(b_host, dtype=wp.uint32, device="cuda")
+    out = wp.zeros_like(a)
+    wp.launch(absdiff4_kernel, dim=a.shape, inputs=[a, b], outputs=[out], device="cuda")
+
+    # [3 + 2 + 11 + 5, 255 + 8 + 255 + 2]
+    print(out.numpy())
+
+.. testoutput:: inline_ptx
+    :skipif: wp.get_cuda_device_count() == 0
+
+    [ 21 520]
+
+The ``"r"`` constraints bind the operands to 32-bit integer registers, which
+matches the ``.u32`` instruction operands. The final PTX operand is an
+accumulator and is supplied as a zero-initialized register in this example. If
+the assembly reads or writes memory through pointers, add the appropriate
+``"memory"`` clobber as described in NVIDIA's inline PTX documentation.
+
+Returning Values
+~~~~~~~~~~~~~~~~
+
+A native snippet can return a value when the Python stub declares a return type.
+Warp supports scalar, vector, matrix, quaternion, array, and fixed-array return
+types. Struct return values are not supported. The following example decodes a
+packed ``0xRRGGBB`` color and returns a :class:`wp.vec3f <warp.vec3f>`.
+
+.. testcode:: returning_values
+
+    import numpy as np
+    import warp as wp
+
+    snippet = """
+        const float inv_255 = 1.0f / 255.0f;
+        float r = ((packed >> 16) & 0xffu) * inv_255;
+        float g = ((packed >> 8) & 0xffu) * inv_255;
+        float b = (packed & 0xffu) * inv_255;
+        return wp::vec_t<3, wp::float32>(r, g, b);
+        """
+
+
+    @wp.func_native(snippet)
+    def unpack_rgb8(packed: wp.uint32) -> wp.vec3f:
+        ...
+
+
+    @wp.kernel(module="unique")
+    def unpack_rgb8_kernel(packed: wp.array[wp.uint32], out: wp.array[wp.vec3f]):
+        tid = wp.tid()
+        out[tid] = unpack_rgb8(packed[tid])
+
+
+    packed_host = np.array([0x336699, 0xFF8000], dtype=np.uint32)
+    packed = wp.array(packed_host, dtype=wp.uint32, device="cpu")
+    out = wp.zeros(len(packed_host), dtype=wp.vec3f, device="cpu")
+
+    wp.launch(unpack_rgb8_kernel, dim=len(packed_host), inputs=[packed], outputs=[out], device="cpu")
+    print(out.numpy())
+
+.. testoutput:: returning_values
+
+    [[0.20000002 0.40000004 0.6       ]
+     [1.         0.5019608  0.        ]]
+
+Differentiable Native Functions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a native function participates in a tape-recorded computation, provide an
+``adj_snippet`` that accumulates adjoints for the native operation. Adjoint
+variables use the ``adj_`` prefix, and return-value adjoints are named
+``adj_ret``. See :ref:`Custom Native Functions <custom-native-functions>` for
+more detail on how native snippets interact with tape replay and backward
+passes.
+
+.. code-block:: python
+
+    snippet = "out[tid] = 2.0f * x[tid] + y[tid];"
+    adj_snippet = """
+        adj_x[tid] += 2.0f * adj_out[tid];
+        adj_y[tid] += adj_out[tid];
+        """
+
+
+    @wp.func_native(snippet=snippet, adj_snippet=adj_snippet)
+    def axpy(
+        x: wp.array[wp.float32],
+        y: wp.array[wp.float32],
+        out: wp.array[wp.float32],
+        tid: int,
+    ):
+        ...
+
+During the backward pass, Warp runs a forward replay phase to recompute forward
+values that adjoint code needs when applying the chain rule. By default, native
+functions replay the original ``snippet``. If the forward snippet has side
+effects that should not be repeated, such as mutating a counter with an atomic
+operation, provide ``replay_snippet``. An empty string is a valid no-op replay
+snippet.
+
+.. code-block:: python
+
+    snippet = """
+        int next_index = atomicAdd(counter, 1);
+        thread_values[tid] = next_index;
+        """
+    replay_snippet = ""
+
+
+    @wp.func_native(snippet, replay_snippet=replay_snippet)
+    def record_index(counter: wp.array[int], thread_values: wp.array[int], tid: int):
+        ...
+
+Native Function Limitations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- Native snippets are inserted into generated C++/CUDA and are not parsed as
+  Warp code.
+- The snippet can refer to variables named after the typed Python stub
+  arguments.
+- CUDA-specific snippets cannot run on CPU devices.
+- Type hints must accurately describe the stub arguments and return type.
+- Struct return values are unsupported.
+- Users are responsible for native-code correctness, synchronization, memory
+  safety, and portability.
+
+.. _ahead_of_time_compilation_workflows:
+
+Ahead-of-Time C++/CUDA Workflows
+--------------------------------
+
+Under typical use cases, Warp compiles and loads modules automatically at
+runtime when kernels are launched using :func:`wp.launch() <warp.launch>` or
+:func:`wp.launch_tiled() <warp.launch_tiled>`.
+
+However, there are some cases where it is useful to compile a module ahead of
+time, perhaps even on a different machine. For example, you might want to use
+Warp to generate source code that can be included in a native CUDA C++
+application.
+
+By disabling module hashing, you can even distribute pre-compiled modules
+without the original source code.
+
+The :func:`wp.compile_aot_module() <warp.compile_aot_module>` and
+:func:`wp.load_aot_module() <warp.load_aot_module>` functions can be used to
+compile and load modules manually.
+
+The C++ examples under ``warp/examples/cpp/`` show two native deployment
+patterns:
+
+- `00_cubin_launch <https://github.com/NVIDIA/warp/tree/main/warp/examples/cpp/00_cubin_launch>`_
+  compiles a Warp kernel to a CUBIN, loads that module with the CUDA Driver API,
+  and launches the generated kernel with ``cuLaunchKernel()``.
+- `01_source_include <https://github.com/NVIDIA/warp/tree/main/warp/examples/cpp/01_source_include>`_
+  includes the generated ``.cu`` source in a CUDA C++ translation unit and
+  launches the generated forward and backward kernels directly.
+
+Both examples use ``warp/native/aot.h`` for Warp's generated type definitions,
+CUDA setup helpers, and error-checking macros. The generated code also depends
+on the native type headers such as ``builtin.h`` that ship in ``warp/native/``.
+
+Example: Compile a Module Manually Without Launching Kernels
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following example shows how to compile a module manually without launching
+any kernels. We pass ``__name__`` as the first argument to
+:func:`wp.compile_aot_module() <warp.compile_aot_module>` to indicate that the
+current module is the one that should be compiled, but it is also possible to
+pass the name of another module containing Warp kernels.
+
+.. code:: python
+
+    import warp as wp
+
+
+    @wp.kernel
+    def multiply_arrays(a: wp.array[float], b: wp.array[float], out: wp.array[float]):
+        i = wp.tid()
+        out[i] = a[i] * b[i]
+
+    if __name__ == "__main__":
+        wp.compile_aot_module(__name__, module_dir="output")
+
+This creates the following files:
+
+.. code-block:: text
+
+   output/
+   ├── wp___main___97ca746.cu
+   ├── wp___main___97ca746.meta
+   └── wp___main___97ca746.sm86.ptx
+
+Example: Distribute a Pre-Compiled Module
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In this example, our goal is to load and run kernels from a pre-compiled module
+without the original source code that produced the module, either the Python
+function or the generated CUDA C++ code.
+
+First, disable module hashing when ahead-of-time compiling the code by passing
+``strip_hash=True`` to
+:func:`wp.compile_aot_module() <warp.compile_aot_module>`. When hashing is
+disabled, the generated function names and filenames become more readable. For
+example, instead of a function named ``sum_kernel_b6a8b352_cuda_kernel_forward()``
+in the file ``wp___main___3a2e577/wp___main___3a2e577.cu``, you get the cleaner
+function ``sum_kernel_cuda_kernel_forward()`` in the file
+``wp___main__/wp___main__.cu``. Therefore, ``strip_hash=True`` can also be used
+to prevent name mangling.
+
+.. important::
+
+    Not all modules can be compiled successfully with hashing disabled. Name
+    collisions can occur when hashes are removed from the function names,
+    particularly when using :doc:`Generics <generics>`, which can result in
+    multiple versions of the same function or kernel.
+
+Only use ``strip_hash=True`` when your module has a single version of each
+function and kernel.
+
+.. code:: python
+
+    import warp as wp
+
+
+    @wp.kernel
+    def multiply_arrays(a: wp.array[float], b: wp.array[float], out: wp.array[float]):
+        i = wp.tid()
+        out[i] = a[i] * b[i]
+
+    if __name__ == "__main__":
+        wp.compile_aot_module(__name__, module_dir="output", strip_hash=True)
+
+This results in the following files:
+
+.. code:: text
+
+   output/
+   ├── wp___main__.cu
+   ├── wp___main__.meta
+   └── wp___main__.sm86.ptx
+
+The second part of this workflow involves writing a script that can load the
+pre-compiled module and run the kernels, all without access to the original
+script that compiled the module.
+
+We will remove ``output/wp___main__.cu`` to mock up a potential user that has
+been provided with only the ``output/wp___main__.sm86.ptx`` and
+``output/wp___main__.meta`` files. For example, intellectual-property concerns
+may prevent disclosure of the original source code.
+
+We can write a script that also defines the ``multiply_arrays`` kernel with the
+same signature as in the previous script, but contains an empty body.
+
+.. code:: python
+
+    import warp as wp
+
+
+    @wp.kernel
+    def multiply_arrays(a: wp.array[float], b: wp.array[float], out: wp.array[float]):
+        pass
+
+
+    if __name__ == "__main__":
+        wp.load_aot_module(__name__, module_dir="output", strip_hash=True)
+
+        a = wp.full((10,), 5.0, dtype=float)
+        b = wp.full((10,), 10.0, dtype=float)
+        c = wp.empty_like(a)
+
+        wp.launch(multiply_arrays, dim=a.shape, inputs=[a, b], outputs=[c])
+
+        print(c)
+
+When we run the above script, we see the following output:
+
+.. code:: text
+
+    Module __main__ load on device 'cuda:0' took 46.33 ms  (cached)
+    [50. 50. 50. 50. 50. 50. 50. 50. 50. 50.]
+
+The ``(cached)`` in the module-load message indicates that the module is loaded
+from cache instead of being compiled. The output also shows that the kernel
+multiplied the elements of the arrays ``a`` and ``b`` together despite the empty
+body of the ``multiply_arrays`` kernel.
+
+This example assumed both the module compilation and loading happened on the
+same machine, but it might be necessary to compile a module for multiple GPU
+architectures. A list of GPU architectures for which to generate code as well
+as whether to generate PTX or CUBIN files can be specified when calling
+:func:`wp.compile_aot_module() <warp.compile_aot_module>`.
+
+The following example shows how to compile a module without hashing for
+multiple architectures as CUBIN files by passing in a list of architectures
+returned by :func:`wp.get_cuda_supported_archs() <warp.get_cuda_supported_archs>`
+to the ``arch`` argument.
+
+.. code:: python
+
+    import warp as wp
+
+
+    @wp.kernel
+    def multiply_arrays(
+        a: wp.array[float],
+        b: wp.array[float],
+        out: wp.array[float],
+    ):
+        i = wp.tid()
+        out[i] = a[i] * b[i]
+
+
+    if __name__ == "__main__":
+        wp.init()
+        wp.compile_aot_module(
+            __name__,
+            arch=wp.get_cuda_supported_archs(),
+            module_dir="output",
+            strip_hash=True,
+            use_ptx=False,
+        )
+
+On a CUDA 13.0 build of Warp, this results in the following files:
+
+.. code:: text
+
+   output/
+   ├── wp___main__.cu
+   ├── wp___main__.meta
+   ├── wp___main__.sm100.cubin
+   ├── wp___main__.sm103.cubin
+   ├── wp___main__.sm110.cubin
+   ├── wp___main__.sm120.cubin
+   ├── wp___main__.sm121.cubin
+   ├── wp___main__.sm75.cubin
+   ├── wp___main__.sm80.cubin
+   ├── wp___main__.sm86.cubin
+   ├── wp___main__.sm87.cubin
+   ├── wp___main__.sm88.cubin
+   ├── wp___main__.sm89.cubin
+   └── wp___main__.sm90.cubin
+
+Example: Compile Without a CUDA Driver in a Docker Build
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Warp statically links NVRTC (the NVIDIA Runtime Compiler), which means CUDA
+kernel compilation does not require a CUDA driver or GPU to be present. This is
+useful for ahead-of-time compilation during Docker image builds, where GPUs are
+typically unavailable.
+
+When Warp is initialized without a CUDA driver, it detects that NVRTC is still
+available and prints a message like:
+
+.. code:: text
+
+   Warp 1.12.0 initialized:
+      CUDA Toolkit 13.1, CUDA driver not available (NVRTC compilation available)
+      Devices:
+        "cpu"      : "CPU"
+
+In this mode,
+:func:`wp.get_cuda_supported_archs() <warp.get_cuda_supported_archs>` returns
+the full list of architectures supported by NVRTC, and
+:func:`wp.compile_aot_module() <warp.compile_aot_module>` can compile CUDA
+kernels as long as the ``arch`` parameter is specified.
+
+The following self-contained ``Dockerfile`` demonstrates compiling Warp kernels
+during ``docker build`` without ``--gpus``:
+
+.. code:: dockerfile
+
+   FROM python:3.12-slim
+   RUN pip install warp-lang
+
+   RUN cat <<'EOF' > compile.py
+   import os
+   import warp as wp
+
+   @wp.kernel
+   def my_kernel(a: wp.array[float], b: wp.array[float]):
+       i = wp.tid()
+       b[i] = a[i] * 2.0
+
+   os.makedirs("/app/warp_cache", exist_ok=True)
+   wp.compile_aot_module(__name__, arch=[75, 80, 86, 90], module_dir="/app/warp_cache")
+
+   files = [f for f in os.listdir("/app/warp_cache") if f.endswith((".ptx", ".cubin"))]
+   print(f"Compiled {len(files)} CUDA files: {sorted(files)}")
+   EOF
+   RUN python compile.py
+
+At runtime, launch the container with ``--gpus`` and the pre-compiled PTX/CUBIN
+will be loaded from cache, avoiding JIT compilation delays.
+
+**The** ``arch`` **parameter is required when compiling without a GPU**, since
+Warp cannot infer the target architecture from a device. Use
+:func:`wp.get_cuda_supported_archs() <warp.get_cuda_supported_archs>` to query
+which architectures are available for compilation.
+
+.. _apic_cpp_replay:
+
+API Capture Replay from C++
+---------------------------
+
+API Capture (APIC) can serialize a captured Warp graph to a ``.wrp`` file plus a
+companion module directory. The saved graph can later be loaded from Python or
+from a standalone C++ program that links against the Warp native library.
+
+See :ref:`apic_save_load` for the Python capture/save/load workflow,
+serialization format notes, and current API Capture limitations. The same
+``.wrp`` file can be loaded and launched from a C++ program with no Python
+runtime. The Warp native library exposes a small C API for this purpose,
+declared in `warp/native/apic.h <https://github.com/NVIDIA/warp/blob/main/warp/native/apic.h>`_.
+The core entry points are:
+
+.. code:: c
+
+    // Load a .wrp file. device_type: 0 = CUDA, 1 = CPU.
+    // For CUDA, context is a CUcontext; for CPU it must be NULL.
+    APICGraph* wp_apic_load_graph(void* context, const char* path, int device_type);
+
+    // Update or read named parameter regions on the loaded graph.
+    bool wp_apic_set_param(APICGraph* graph, const char* name,
+                           const void* data, size_t size);
+    bool wp_apic_get_param(APICGraph* graph, const char* name,
+                           void* data, size_t size);
+
+    // CUDA replay: get the CUDA graph executable (built lazily on first call)
+    // and launch it via cudaGraphLaunch().
+    void* wp_apic_get_cuda_graph_exec(APICGraph* graph);
+
+    // CPU replay: walk the recorded operation stream and execute it directly.
+    bool wp_apic_cpu_replay_graph(APICGraph* graph);
+
+    // Release the loaded graph and its associated allocations.
+    void wp_apic_destroy_graph(APICGraph* graph);
+
+Two reference C++ examples ship with Warp under ``warp/examples/cpp/``. Both
+implement the same interactive 2-D wave simulation visualized with GLFW/OpenGL,
+and both take their captured ``.wrp`` file from a Python ``capture_wave.py``
+script that writes ``generated/wave_sim.wrp`` plus a
+``generated/wave_sim_modules/`` directory of compiled kernels.
+
+CUDA Replay (``02_apic_visualization``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Source: `warp/examples/cpp/02_apic_visualization <https://github.com/NVIDIA/warp/tree/main/warp/examples/cpp/02_apic_visualization>`_
+
+This example captures a full simulation frame (one displacement kernel followed
+by 16 wave-equation integration substeps) as a single CUDA graph. C++ then
+launches the entire frame with one ``cudaGraphLaunch()`` per rendered frame:
+
+.. code:: cpp
+
+    #include "aot.h"   // pulls in <cuda.h> and <cuda_runtime.h>
+    #include "warp.h"  // Warp C API
+    #include "apic.h"  // APIC graph loading and execution
+
+    APICGraph* graph = wp_apic_load_graph(context, "generated/wave_sim", 0);
+
+    // Build the executable on first call.
+    cudaGraphExec_t exec = (cudaGraphExec_t)wp_apic_get_cuda_graph_exec(graph);
+
+    while (!glfwWindowShouldClose(window))
+    {
+        // Update inputs each frame (mouse, double-buffered height fields).
+        cudaMemcpyAsync(d_mouse_pos, mouse_grid, 2 * sizeof(float),
+                        cudaMemcpyHostToDevice, stream);
+        wp_apic_set_param(graph, "heights",      d_heights[cur],     heights_size);
+        wp_apic_set_param(graph, "heights_prev", d_heights[1 - cur], heights_size);
+        wp_apic_set_param(graph, "mouse_pos",    d_mouse_pos,        2 * sizeof(float));
+
+        cudaGraphLaunch(exec, stream);
+        cudaStreamSynchronize(stream);  // Must sync before reading results
+
+        wp_apic_get_param(graph, "heights_out",      d_heights[1 - cur], heights_size);
+        wp_apic_get_param(graph, "heights_prev_out", d_heights[cur],     heights_size);
+
+        // ... render with OpenGL ...
+    }
+
+    wp_apic_destroy_graph(graph);
+
+The C++ side links against the Warp native library plus the CUDA Driver API
+(``cuda``) and Runtime API (``cudart``). It does not need to know how many
+kernels the graph contains or what their parameter signatures look like. That
+is all encoded in the ``.wrp`` file and the bundled module binaries.
+
+CPU Replay (``03_apic_visualization_cpu``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Source: `warp/examples/cpp/03_apic_visualization_cpu <https://github.com/NVIDIA/warp/tree/main/warp/examples/cpp/03_apic_visualization_cpu>`_
+
+This example mirrors ``02_apic_visualization`` but captures and replays on the
+CPU device. ``main.cpp`` does not link against CUDA at all. Replay goes through
+``wp_apic_cpu_replay_graph()`` and walks the recorded operation stream directly:
+
+.. code:: cpp
+
+    #include "apic.h"  // APIC graph loading and execution
+    #include "warp.h"  // Warp C API
+
+    APICGraph* graph = wp_apic_load_graph(nullptr, "generated/wave_sim", 1);
+
+    // ... resolve CPU kernel function pointers (see below) ...
+
+    while (!glfwWindowShouldClose(window))
+    {
+        wp_apic_set_param(graph, "heights",      h_heights[cur],     heights_size);
+        wp_apic_set_param(graph, "heights_prev", h_heights[1 - cur], heights_size);
+        wp_apic_set_param(graph, "mouse_pos",    h_mouse_pos,        2 * sizeof(float));
+
+        wp_apic_cpu_replay_graph(graph);
+
+        wp_apic_get_param(graph, "heights_out",      h_heights[1 - cur], heights_size);
+        wp_apic_get_param(graph, "heights_prev_out", h_heights[cur],     heights_size);
+
+        // ... render with OpenGL ...
+    }
+
+    wp_apic_destroy_graph(graph);
+
+CPU graph replay needs function pointers for each recorded kernel. These are
+resolved from the ``.o`` files in the companion ``_modules/`` directory by
+loading the ``warp-clang`` library at runtime and calling its ``wp_load_obj`` /
+``wp_lookup`` entry points, then registering each pointer with the loaded graph
+via ``wp_apic_register_loaded_cpu_kernel``. The example walks every kernel
+returned by ``wp_apic_get_num_kernels`` and ``wp_apic_get_kernel_key`` and does
+this resolution once at startup. The C API surface for this lookup is:
+
+.. code:: c
+
+    int         wp_apic_get_num_kernels(APICGraph* graph);
+    const char* wp_apic_get_kernel_key(APICGraph* graph, int index);
+    const char* wp_apic_get_kernel_forward_name(APICGraph* graph, const char* key);
+    const char* wp_apic_get_kernel_backward_name(APICGraph* graph, const char* key);
+    void        wp_apic_register_loaded_cpu_kernel(APICGraph* graph, const char* key,
+                                                   void* forward_fn, void* backward_fn);
+
+Loading a CPU ``.wrp`` graph uses the pure-C++ APIC loader in the Warp native
+library and does not require a CUDA-enabled build. CPU replay
+(``wp_apic_cpu_replay_graph``) still requires the warp-clang backend and the
+companion ``_modules`` directory described above. Loading a CUDA ``.wrp`` graph
+requires a CUDA-enabled build.
+
+Building and Running
+~~~~~~~~~~~~~~~~~~~~
+
+Both examples ship with a ``Makefile`` (Unix/Linux) and a ``CMakeLists.txt``
+(cross-platform). The typical workflow is:
+
+.. code:: bash
+
+    cd warp/examples/cpp/02_apic_visualization     # or 03_apic_visualization_cpu
+
+    # 1. Capture the graph (requires Warp + Python)
+    uv run capture_wave.py
+
+    # 2. Build the C++ viewer
+    cmake -B build -DCMAKE_BUILD_TYPE=Release
+    cmake --build build --config Release
+
+    # 3. Run
+    ./build/02_apic_visualization                  # or 03_apic_visualization_cpu
+
+Each example directory contains a ``README.md`` with full prerequisites, build
+options, controls, and platform-specific notes.
+
+Native Library Headers
+----------------------
+
+The C++ integration examples intentionally use a small native surface:
+
+- ``warp/native/aot.h`` exposes utilities for generated AOT kernels and includes
+  Warp's generated type support.
+- ``warp/native/warp.h`` declares the core Warp C API exported by the native
+  library.
+- ``warp/native/apic.h`` declares the APIC graph loading and replay API used by
+  ``.wrp`` graph consumers.
+
+Other files in ``warp/native/`` implement Warp's runtime and kernel support
+library. They are useful when inspecting generated code, but the examples above
+are the recommended starting points for native host integration.
+
+Related Topics
+--------------
+
+- :doc:`basics` covers regular :func:`@wp.func <warp.func>` functions called
+  from kernels.
+- :doc:`differentiability` covers custom gradients, custom replay functions,
+  tapes, and native-function adjoints.
+- :doc:`runtime` covers runtime kernel creation, CUDA graph capture, and APIC
+  save/load.
+- :doc:`../deep_dive/codegen` covers generated C++/CUDA source and code
+  generation internals.
diff --git a/docs/user_guide/debugging.rst b/docs/user_guide/debugging.rst
index 095f87e245..489b23b6f7 100644
--- a/docs/user_guide/debugging.rst
+++ b/docs/user_guide/debugging.rst
@@ -78,7 +78,7 @@ non-differentiable.
 
     Reading or setting either deprecated flag emits a one-time
     ``DeprecationWarning``. During the deprecation window the flag is still
-    honored alongside ``log_level``, so existing code keeps working; remove the
+    honored alongside ``log_level``, so existing code keeps working. Remove the
     flag once your code sets ``log_level`` directly.
 
     ``wp.config.verbose_warnings`` is not deprecated. It is an orthogonal
@@ -167,7 +167,7 @@ Debug Mode Compilation
 In debug mode, Warp kernels will perform the following additional checks:
 
 * Raise an assertion if there is an array access outside the defined shape.
-* Warn if :func:`wp.tid() <warp._src.lang.tid>` will return an overflowed value on large grids.
+* Warn if :func:`wp.tid() <warp.tid>` will return an overflowed value on large grids.
 * (GPU-only) Warn if the CUDA grid dimensions have been capped due to an overflowed number of blocks.
 * (GPU-only) Generate line-number information for device code.
 
diff --git a/docs/user_guide/differentiability.rst b/docs/user_guide/differentiability.rst
index b1d7e66303..b9aa9dd41b 100644
--- a/docs/user_guide/differentiability.rst
+++ b/docs/user_guide/differentiability.rst
@@ -649,83 +649,33 @@ for the input array:
     input.grad:  [0.5        0.35355338 0.28867513 0.25       0.2236068  0.20412414
      0.18898225 0.17677669]
 
+.. _custom-native-functions:
+
 Custom Native Functions
 #######################
 
-Users may insert native C++/CUDA code in Warp kernels using :func:`@wp.func_native <warp.func_native>` decorated functions.
-These accept native code as strings that get compiled after code generation, and are called within :func:`@wp.kernel <warp.kernel>` functions.
-For example:
-
-.. testcode::
-    :skipif: wp.get_cuda_device_count() == 0
-
-    snippet = """
-        __shared__ int sum[128];
-
-        sum[tid] = arr[tid];
-        __syncthreads();
-
-        for (int stride = 64; stride > 0; stride >>= 1) {
-            if (tid < stride) {
-                sum[tid] += sum[tid + stride];
-            }
-            __syncthreads();
-        }
-
-        if (tid == 0) {
-            out[0] = sum[0];
-        }
-        """
-
-    @wp.func_native(snippet)
-    def reduce(arr: wp.array[int], out: wp.array[int], tid: int): ...
-
-
-    @wp.kernel
-    def reduce_kernel(arr: wp.array[int], out: wp.array[int]):
-        tid = wp.tid()
-        reduce(arr, out, tid)
-
-
-    N = 128
-    x = wp.array(np.arange(N, dtype=int), dtype=int)
-    out = wp.zeros(1, dtype=int)
-
-    wp.launch(kernel=reduce_kernel, dim=N, inputs=[x, out])
-
-    print(out)
-
-.. testoutput::
-    :skipif: wp.get_cuda_device_count() == 0
-
-    [8128]
-
-Notice the use of shared memory here: The Warp library does not expose shared memory as a feature, but the CUDA compiler will
-readily accept the above snippet. This means CUDA features not exposed in Warp are still accessible in Warp scripts.
-Warp kernels meant for the CPU won't be able to leverage CUDA features of course, but this same mechanism supports pure C++ snippets as well.
+Native functions created with :func:`@wp.func_native <warp.func_native>` insert
+C++/CUDA snippets into generated Warp modules. The general feature is documented
+in :ref:`Native Snippets in Warp Kernels <native_functions>`; this section only
+covers how native snippets interact with tape replay and backward passes.
 
-Please bear in mind the following: the thread index in your snippet should be computed in a :func:`@wp.kernel <warp.kernel>` and passed to your snippet,
-as in the above example. This means your :func:`@wp.func_native <warp.func_native>` function signature should include the variables used in your snippet, 
-as well as a thread index of type ``int``. The function body itself should be stubbed with ``...`` (the snippet will be inserted during compilation).
-
-Should you wish to record your native function on the tape and then subsequently rewind the tape, you must include an adjoint snippet
-alongside your snippet as an additional input to the decorator, as in the following example:
+When a native function participates in a tape-recorded computation, Warp needs
+an adjoint implementation for the native operation. Provide it with the
+``adj_snippet`` argument:
 
 .. testcode::
     :skipif: wp.get_cuda_device_count() == 0
 
     snippet = """
-    out[tid] = a * x[tid] + y[tid];
+    out[tid] = 2.0f * x[tid] + y[tid];
     """
     adj_snippet = """
-    adj_a += x[tid] * adj_out[tid];
-    adj_x[tid] += a * adj_out[tid];
+    adj_x[tid] += 2.0f * adj_out[tid];
     adj_y[tid] += adj_out[tid];
     """
 
-    @wp.func_native(snippet, adj_snippet)
-    def saxpy(
-        a: float,
+    @wp.func_native(snippet=snippet, adj_snippet=adj_snippet)
+    def axpy(
         x: wp.array[float],
         y: wp.array[float],
         out: wp.array[float],
@@ -734,28 +684,23 @@ alongside your snippet as an additional input to the decorator, as in the follow
         ...
 
     @wp.kernel
-    def saxpy_kernel(
-        a: float,
+    def axpy_kernel(
         x: wp.array[float],
         y: wp.array[float],
-        out: wp.array[float]
+        out: wp.array[float],
     ):
         tid = wp.tid()
-        saxpy(a, x, y, out, tid)
+        axpy(x, y, out, tid)
 
-    N = 128
-    a = 2.0
+    N = 8
     x = wp.array(np.arange(N, dtype=np.float32), dtype=wp.float32, requires_grad=True)
     y = wp.zeros_like(x)
-    out = wp.array(np.arange(N, dtype=np.float32), dtype=wp.float32)
-    adj_out = wp.array(np.ones(N, dtype=np.float32), dtype=wp.float32)
+    out = wp.zeros_like(x)
 
-    tape = wp.Tape()
-
-    with tape:
-        wp.launch(kernel=saxpy_kernel, dim=N, inputs=[a, x, y], outputs=[out])
+    with wp.Tape() as tape:
+        wp.launch(kernel=axpy_kernel, dim=N, inputs=[x, y], outputs=[out])
 
-    tape.backward(grads={out: adj_out})
+    tape.backward(grads={out: wp.ones_like(out)})
 
     print(f"x.grad = {x.grad}")
     print(f"y.grad = {y.grad}")
@@ -763,111 +708,20 @@ alongside your snippet as an additional input to the decorator, as in the follow
 .. testoutput::
     :skipif: wp.get_cuda_device_count() == 0
 
-    x.grad = [2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
-     2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
-     2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
-     2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
-     2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
-     2. 2. 2. 2. 2. 2. 2. 2.]
-    y.grad = [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
-     1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
-     1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
-     1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
-     1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
-     1. 1. 1. 1. 1. 1. 1. 1.]
-
-You may also include a custom replay snippet to be executed as part of the adjoint (see `Custom Gradient Functions`_ for a full explanation).
-Consider the following example:
-
-.. testcode::
-    :skipif: wp.get_cuda_device_count() == 0
-
-    num_threads = 8
-    counter = wp.zeros(1, dtype=wp.int32)
-    thread_values = wp.zeros(num_threads, dtype=wp.int32)
-    inputs = wp.array(np.arange(num_threads, dtype=np.float32), requires_grad=True)
-    outputs = wp.zeros_like(inputs)
-
-    snippet = """
-        int next_index = atomicAdd(counter, 1);
-        thread_values[tid] = next_index;
-        """
-    replay_snippet = ""
-
-
-    @wp.func_native(snippet, replay_snippet=replay_snippet)
-    def reversible_increment(counter: wp.array[int], thread_values: wp.array[int], tid: int):
-        ...
-
-
-    @wp.kernel
-    def run_atomic_add(
-        input: wp.array[float],
-        counter: wp.array[int],
-        thread_values: wp.array[int],
-        output: wp.array[float],
-    ):
-        tid = wp.tid()
-        reversible_increment(counter, thread_values, tid)
-        idx = thread_values[tid]
-        output[idx] = input[idx] ** 2.0
-
-
-    with wp.Tape() as tape:
-        wp.launch(run_atomic_add, dim=num_threads, inputs=[inputs, counter, thread_values], outputs=[outputs])
-
-    tape.backward(grads={outputs: wp.ones(num_threads, dtype=wp.float32)})
-
-    print(f"inputs.grad = {np.round(inputs.grad.numpy(), 5)}")
-
-.. testoutput::
-    :skipif: wp.get_cuda_device_count() == 0
-
-    inputs.grad = [ 0.  2.  4.  6.  8. 10. 12. 14.]
-
-By default, ``snippet`` would be called in the backward pass, but in this case, we have defined a custom replay snippet that is called instead.
-``replay_snippet`` is a no-op, which is all that we require, since ``thread_values`` are cached in the forward pass.
-If we did not have a ``replay_snippet`` defined, ``thread_values`` would be overwritten with counter values that exceed the input array size in the backward pass.
-
-A native snippet may also include a return statement. If this is the case, you must specify the return type in the native function definition, as in the following example:
-
-.. testcode::
-
-    snippet = """
-        float sq = x * x;
-        return sq;
-        """
-    adj_snippet = """
-        adj_x += 2.f * x * adj_ret;
-        """
-
-
-    @wp.func_native(snippet, adj_snippet)
-    def square(x: float) -> float:
-        ...
-
-
-    @wp.kernel
-    def square_kernel(input: wp.array[Any], output: wp.array[Any]):
-        tid = wp.tid()
-        x = input[tid]
-        output[tid] = square(x)
-
-
-    N = 5
-    x = wp.array(np.arange(N, dtype=float), dtype=float, requires_grad=True)
-    y = wp.zeros_like(x)
-
-    with wp.Tape() as tape:
-        wp.launch(kernel=square_kernel, dim=N, inputs=[x, y])
-
-    tape.backward(grads={y: wp.ones(N, dtype=float)})
+    x.grad = [2. 2. 2. 2. 2. 2. 2. 2.]
+    y.grad = [1. 1. 1. 1. 1. 1. 1. 1.]
 
-    print(f"x.grad = {x.grad}")
-
-.. testoutput::
+During the backward pass, Warp runs a forward replay phase to recompute forward
+values that adjoint code needs when applying the chain rule. By default, native
+functions replay the original ``snippet``. If replaying the native snippet would
+repeat an unsafe side effect, provide ``replay_snippet``.
+For example, a native snippet that writes cached indices using an atomic counter
+can use an empty replay snippet so the cached forward values are reused instead
+of overwritten during the backward pass.
 
-    x.grad = [0. 2. 4. 6. 8.]
+See :ref:`Native Snippets in Warp Kernels <native_functions>` for the full
+native-function syntax, CPU/CUDA behavior, return values, replay snippets, and
+limitations.
 
 ``grad()``
 #############
diff --git a/docs/user_guide/runtime.rst b/docs/user_guide/runtime.rst
index 06c432a989..713c3b6912 100644
--- a/docs/user_guide/runtime.rst
+++ b/docs/user_guide/runtime.rst
@@ -235,6 +235,7 @@ Creating an Indexed Array
 Pass the *data* array together with a list of :class:`wp.int32 <warp.int32>` index arrays, one for each dimension:
 
 .. testcode::
+    :skipif: wp.get_cuda_device_count() == 0
 
     import warp as wp
 
@@ -247,6 +248,7 @@ Pass the *data* array together with a list of :class:`wp.int32 <warp.int32>` ind
     print(sub)
 
 .. testoutput::
+    :skipif: wp.get_cuda_device_count() == 0
 
     [2.34 4.56 6.78]
 
@@ -389,6 +391,7 @@ Nested Structs and Vector Types
 Structured arrays fully support nested structs and Warp vector (and matrix) types:
 
 .. testcode::
+    :skipif: wp.get_cuda_device_count() == 0
 
     import warp as wp
     import numpy as np
@@ -414,6 +417,7 @@ Structured arrays fully support nested structs and Warp vector (and matrix) type
     print(a.numpy())
 
 .. testoutput::
+    :skipif: wp.get_cuda_device_count() == 0
 
     [(42,  0.  , ([0., 0., 0.],)) ( 0,  0.  , ([0., 0., 0.],))
      ( 0, 13.37, ([0., 0., 0.],)) ( 0,  0.  , ([0., 0., 0.],))
@@ -1311,8 +1315,9 @@ To address this, Warp exposes the concept of `CUDA graphs <https://developer.nvi
 to allow recording large batches of kernels and replaying them with very little CPU overhead.
 The same :func:`wp.capture_begin() <warp.capture_begin>` / :func:`wp.capture_end() <warp.capture_end>` /
 :func:`wp.capture_launch() <warp.capture_launch>` API also supports recording on CPU devices and
-serializing captured graphs to a file for later replay from Python or a standalone C++ program (see
-:ref:`cpu_graphs` and :ref:`apic_save_load` below).
+serializing captured graphs to a file for later replay from Python (see
+:ref:`cpu_graphs` and :ref:`apic_save_load`). For standalone C++ replay of saved
+graphs, see :ref:`apic_cpp_replay`.
 
 To record a series of kernel launches use the :func:`wp.capture_begin() <warp.capture_begin>` and
 :func:`wp.capture_end() <warp.capture_end>` API as follows:
@@ -1693,11 +1698,9 @@ Saving and Loading Graphs
     subject to change without a formal deprecation cycle. ``.wrp`` files written
     by one version of Warp may not be loadable by another.
 
-API Capture lets you serialize a captured graph to disk and load it back later, either
-from another Python program, or from a standalone C++ application that links only
-against the Warp native library. This is useful for shipping a precomputed
-simulation pipeline as part of a binary, or for amortizing capture cost across
-many runs.
+API Capture lets you serialize a captured graph to disk and load it back later
+from another Python program. This is useful for shipping a precomputed
+simulation pipeline or for amortizing capture cost across many runs.
 
 Capturing for serialization
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -1793,165 +1796,8 @@ be loaded on a CPU device. The ``device`` passed to
 :func:`wp.capture_load() <warp.capture_load>` must match the device family the
 graph was captured on.
 
-Standalone C++ replay
-^^^^^^^^^^^^^^^^^^^^^
-
-The same ``.wrp`` file can be loaded and launched from a C++ program with no
-Python runtime. The Warp native library exposes a small C API for this purpose,
-declared in `warp/native/apic.h <https://github.com/NVIDIA/warp/blob/main/warp/native/apic.h>`_. The core entry points are:
-
-.. code:: c
-
-    // Load a .wrp file. device_type: 0 = CUDA, 1 = CPU.
-    // For CUDA, context is a CUcontext; for CPU it must be NULL.
-    APICGraph* wp_apic_load_graph(void* context, const char* path, int device_type);
-
-    // Update or read named parameter regions on the loaded graph.
-    bool wp_apic_set_param(APICGraph* graph, const char* name,
-                           const void* data, size_t size);
-    bool wp_apic_get_param(APICGraph* graph, const char* name,
-                           void* data, size_t size);
-
-    // CUDA replay: get the CUDA graph executable (built lazily on first call)
-    // and launch it via cudaGraphLaunch().
-    void* wp_apic_get_cuda_graph_exec(APICGraph* graph);
-
-    // CPU replay: walk the recorded operation stream and execute it directly.
-    bool wp_apic_cpu_replay_graph(APICGraph* graph);
-
-    // Release the loaded graph and its associated allocations.
-    void wp_apic_destroy_graph(APICGraph* graph);
-
-Two reference C++ examples ship with Warp under ``warp/examples/cpp/``. Both
-implement the same interactive 2-D wave simulation visualized with GLFW/OpenGL,
-and both take their captured ``.wrp`` file from a Python ``capture_wave.py``
-script that writes ``generated/wave_sim.wrp`` plus a ``generated/wave_sim_modules/``
-directory of compiled kernels.
-
-CUDA replay (``02_apic_visualization``)
-"""""""""""""""""""""""""""""""""""""""
-
-Source: `warp/examples/cpp/02_apic_visualization <https://github.com/NVIDIA/warp/tree/main/warp/examples/cpp/02_apic_visualization>`_
-
-This example captures a full simulation frame (one displacement kernel followed
-by 16 wave-equation integration substeps) as a single CUDA graph. C++ then
-launches the entire frame with one ``cudaGraphLaunch()`` per rendered frame:
-
-.. code:: cpp
-
-    #include "aot.h"   // pulls in <cuda.h> and <cuda_runtime.h>
-    #include "warp.h"  // Warp C API
-    #include "apic.h"  // APIC graph loading and execution
-
-    APICGraph* graph = wp_apic_load_graph(context, "generated/wave_sim", 0);
-
-    // Build the executable on first call.
-    cudaGraphExec_t exec = (cudaGraphExec_t)wp_apic_get_cuda_graph_exec(graph);
-
-    while (!glfwWindowShouldClose(window))
-    {
-        // Update inputs each frame (mouse, double-buffered height fields).
-        cudaMemcpyAsync(d_mouse_pos, mouse_grid, 2 * sizeof(float),
-                        cudaMemcpyHostToDevice, stream);
-        wp_apic_set_param(graph, "heights",      d_heights[cur],     heights_size);
-        wp_apic_set_param(graph, "heights_prev", d_heights[1 - cur], heights_size);
-        wp_apic_set_param(graph, "mouse_pos",    d_mouse_pos,        2 * sizeof(float));
-
-        cudaGraphLaunch(exec, stream);
-        cudaStreamSynchronize(stream);  // Must sync before reading results
-
-        wp_apic_get_param(graph, "heights_out",      d_heights[1 - cur], heights_size);
-        wp_apic_get_param(graph, "heights_prev_out", d_heights[cur],     heights_size);
-
-        // ... render with OpenGL ...
-    }
-
-    wp_apic_destroy_graph(graph);
-
-The C++ side links against the Warp native library plus the CUDA Driver API
-(``cuda``) and Runtime API (``cudart``). It does not need to know how many
-kernels the graph contains or what their parameter signatures look like. That
-is all encoded in the ``.wrp`` file and the bundled module binaries.
-
-CPU replay (``03_apic_visualization_cpu``)
-""""""""""""""""""""""""""""""""""""""""""
-
-Source: `warp/examples/cpp/03_apic_visualization_cpu <https://github.com/NVIDIA/warp/tree/main/warp/examples/cpp/03_apic_visualization_cpu>`_
-
-This example mirrors ``02_apic_visualization`` but captures and replays on the
-CPU device. ``main.cpp`` does not link against CUDA at all. Replay goes through
-``wp_apic_cpu_replay_graph()`` and walks the recorded operation stream directly:
-
-.. code:: cpp
-
-    #include "apic.h"  // APIC graph loading and execution
-    #include "warp.h"  // Warp C API
-
-    APICGraph* graph = wp_apic_load_graph(nullptr, "generated/wave_sim", 1);
-
-    // ... resolve CPU kernel function pointers (see below) ...
-
-    while (!glfwWindowShouldClose(window))
-    {
-        wp_apic_set_param(graph, "heights",      h_heights[cur],     heights_size);
-        wp_apic_set_param(graph, "heights_prev", h_heights[1 - cur], heights_size);
-        wp_apic_set_param(graph, "mouse_pos",    h_mouse_pos,        2 * sizeof(float));
-
-        wp_apic_cpu_replay_graph(graph);
-
-        wp_apic_get_param(graph, "heights_out",      h_heights[1 - cur], heights_size);
-        wp_apic_get_param(graph, "heights_prev_out", h_heights[cur],     heights_size);
-
-        // ... render with OpenGL ...
-    }
-
-    wp_apic_destroy_graph(graph);
-
-CPU graph replay needs function pointers for each recorded kernel. These are
-resolved from the ``.o`` files in the companion ``_modules/`` directory by
-loading the ``warp-clang`` library at runtime and calling its ``wp_load_obj`` /
-``wp_lookup`` entry points, then registering each pointer with the loaded graph
-via ``wp_apic_register_loaded_cpu_kernel``. The example walks every kernel
-returned by ``wp_apic_get_num_kernels`` and ``wp_apic_get_kernel_key`` and does
-this resolution once at startup. The C API surface for this lookup is:
-
-.. code:: c
-
-    int         wp_apic_get_num_kernels(APICGraph* graph);
-    const char* wp_apic_get_kernel_key(APICGraph* graph, int index);
-    const char* wp_apic_get_kernel_forward_name(APICGraph* graph, const char* key);
-    const char* wp_apic_get_kernel_backward_name(APICGraph* graph, const char* key);
-    void        wp_apic_register_loaded_cpu_kernel(APICGraph* graph, const char* key,
-                                                   void* forward_fn, void* backward_fn);
-
-Loading a CPU ``.wrp`` graph uses the pure-C++ APIC loader in the Warp native
-library and does not require a CUDA-enabled build. CPU replay
-(``wp_apic_cpu_replay_graph``) still requires the warp-clang backend and the
-companion ``_modules`` directory described above. Loading a CUDA ``.wrp`` graph
-requires a CUDA-enabled build.
-
-Building and running
-""""""""""""""""""""
-
-Both examples ship with a ``Makefile`` (Unix/Linux) and a ``CMakeLists.txt``
-(cross-platform). The typical workflow is:
-
-.. code:: bash
-
-    cd warp/examples/cpp/02_apic_visualization     # or 03_apic_visualization_cpu
-
-    # 1. Capture the graph (requires Warp + Python)
-    uv run capture_wave.py
-
-    # 2. Build the C++ viewer
-    cmake -B build -DCMAKE_BUILD_TYPE=Release
-    cmake --build build --config Release
-
-    # 3. Run
-    ./build/02_apic_visualization                  # or 03_apic_visualization_cpu
-
-Each example directory contains a ``README.md`` with full prerequisites, build
-options, controls, and platform-specific notes.
+For standalone C++ replay of saved ``.wrp`` graphs, see
+:ref:`apic_cpp_replay`.
 
 Current limitations of API Capture:
 
diff --git a/warp/_src/context.py b/warp/_src/context.py
index 479189b4a3..5dfba89654 100644
--- a/warp/_src/context.py
+++ b/warp/_src/context.py
@@ -999,8 +999,97 @@ def wrapper(f, *args, **kwargs):
 
 
 def func_native(snippet: str, adj_snippet: str | None = None, replay_snippet: str | None = None):
-    """
-    Decorator to register native code snippet, @func_native
+    """Register a Warp function implemented by a native C++/CUDA snippet.
+
+    ``@wp.func_native`` is an escape hatch for operations that are easier to
+    express in native code than in Warp's kernel language, such as CUDA
+    intrinsics, shared-memory synchronization patterns, or small C++ helper
+    expressions. The decorated Python function is a typed stub: its argument
+    names become the variable names available inside the snippet, and its body
+    should be ``...``.
+
+    Args:
+        snippet: Native C++/CUDA code inserted into the generated forward
+            function body.
+        adj_snippet: Optional native code inserted into the generated adjoint
+            function body. Use ``adj_``-prefixed argument names, and ``adj_ret``
+            for the adjoint of a return value.
+        replay_snippet: Optional native code used when Warp replays the forward
+            function while executing the generated backward pass. Use this when
+            replaying ``snippet`` would repeat an unsafe side effect, such as
+            incrementing an atomic counter.
+
+    Returns:
+        A decorator that registers the typed stub as a Warp function.
+
+    Note:
+        Compute thread indices in the calling kernel or function and pass them
+        explicitly; snippets cannot call :func:`wp.tid() <warp.tid>`.
+        Pure C++ snippets can be used by CPU kernels, while CUDA-specific
+        constructs require CUDA kernels. If the snippet returns a value, the
+        Python stub must declare the return type. Struct return values are not
+        supported.
+
+    Example:
+        Insert a native element-wise operation:
+
+        .. code-block:: python
+
+            snippet = "out[tid] = x[tid] + 1.0f;"
+
+
+            @wp.func_native(snippet)
+            def increment(
+                x: wp.array[wp.float32],
+                out: wp.array[wp.float32],
+                tid: int,
+            ): ...
+
+
+            @wp.kernel
+            def kernel(x: wp.array[wp.float32], out: wp.array[wp.float32]):
+                tid = wp.tid()
+                increment(x, out, tid)
+
+        Use CUDA shared memory:
+
+        This pattern assumes the kernel is launched with ``block_dim=128``.
+
+        .. code-block:: python
+
+            snippet = '''
+                int local_tid = threadIdx.x;
+                __shared__ int values[128];
+                values[local_tid] = arr[tid];
+                __syncthreads();
+                out[tid] = values[127 - local_tid];
+                '''
+
+
+            @wp.func_native(snippet)
+            def reverse_block(arr: wp.array[int], out: wp.array[int], tid: int): ...
+
+        Provide an adjoint snippet for differentiability:
+
+        .. code-block:: python
+
+            snippet = "out[tid] = 2.0f * x[tid] + y[tid];"
+            adj_snippet = '''
+                adj_x[tid] += 2.0f * adj_out[tid];
+                adj_y[tid] += adj_out[tid];
+                '''
+
+
+            @wp.func_native(snippet=snippet, adj_snippet=adj_snippet)
+            def axpy(
+                x: wp.array[wp.float32],
+                y: wp.array[wp.float32],
+                out: wp.array[wp.float32],
+                tid: int,
+            ): ...
+
+    See Also:
+        :ref:`Native Snippets in Warp Kernels <native_functions>`
     """
 
     frame = inspect.currentframe()
@@ -1310,7 +1399,7 @@ def my_kernel_with_launch_bounds(a: wp.array[float]):
 
 
         @wp.kernel(module_options={"fast_math": True}, module="unique")
-        def my_kernel_fast(a: wp.array(dtype=float), b: wp.array(dtype=float)):
+        def my_kernel_fast(a: wp.array[float], b: wp.array[float]):
             # fast_math is a module-level option, so module="unique" is required
             tid = wp.tid()
             b[tid] = a[tid] + 1.0