
feat: dynamically load bindings #114

Merged
elibol merged 6 commits into NVlabs:main from drbh:dynamic-cuda-loading
Apr 29, 2026
Conversation

@drbh
Contributor

drbh commented Apr 23, 2026

This PR is a follow-up to #30 and explores dynamically loading the CUDA bindings via libloading.

This change avoids the feature-flag solution proposed in #30 and allows the same binary to be built on systems with and without a CUDA driver.

The CUDA availability check moves from compile time to runtime, which also relieves downstream crates of handling feature flags.

In addition to the bindings, a new KernelCompiler struct/impl is added that provides a simple interface for compile-only operations. It is used in the cutile-examples/examples/compile_only.rs example.

Lastly, this PR includes changes to the flake so that the compile-only example runs on a MacBook with the following command:

nix develop --command cargo run -p cutile-examples --example compile_only --no-default-features

Output:

CUDA driver: not available (compile-only mode)
Target GPU: sm_80
Compiling my_kernels::tile_math

Generated Tile IR:

cuda_tile.module @my_kernels {
  entry @tile_math_entry(%0: tile<ptr<f32>>, %1: tile<i32>, %2: tile<i32>, %3: tile<i32>, %4: tile<i32>, %5: tile<f32>) {
    %6 = constant <i32: 32> : tile<i32>
    %7 = make_token : token
    %8, %9, %10 = get_tile_block_id : tile<i32>
    %11 = assume bounded<0, ?>, %8 : tile<i32>
    %12 = assume bounded<0, ?>, %9 : tile<i32>
    %13 = assume bounded<0, ?>, %10 : tile<i32>
    %14 = muli %11, %3 : tile<i32>
    %15 = muli %14, %2 : tile<i32>
    %16 = offset %0, %15 : tile<ptr<f32>>, tile<i32> -> tile<ptr<f32>>
    %17 = muli %11, %3 : tile<i32>
    %18 = subi %1, %17 : tile<i32>
    %19 = mini %3, %18 signed : tile<i32>
    %20 = make_tensor_view %16, shape = [%19], strides = [1] : tile<i32> -> tensor_view<?xf32, strides=[1]>
    %21 = constant <i32: 32> : tile<i32>
    %22 = constant <i32: 32> : tile<i32>
    %23 = constant <i32: 32> : tile<i32>
    %24 = constant <i32: 1> : tile<i32>
    %25 = constant <i32: 1> : tile<i32>
    %26 = reshape %5 : tile<f32> -> tile<1xf32>
    %27 = constant <i32: 1> : tile<i32>
    %28 = constant <i32: 32> : tile<i32>
    %29 = broadcast %26 : tile<1xf32> -> tile<32xf32>
    %30 = constant <f32: 1.0> : tile<f32>
    %31 = constant <i32: 32> : tile<i32>
    %32 = constant <i32: 32> : tile<i32>
    %33 = constant <i32: 1> : tile<i32>
    %34 = constant <i32: 1> : tile<i32>
    %35 = reshape %30 : tile<f32> -> tile<1xf32>
    %36 = constant <i32: 32> : tile<i32>
    %37 = constant <i32: 1> : tile<i32>
    %38 = broadcast %35 : tile<1xf32> -> tile<32xf32>
    %39 = addf %29, %38 : tile<32xf32>
    %40 = constant <i32: 32> : tile<i32>
    %41 = constant <i32: 32> : tile<i32>
    %42 = constant <i32: 32> : tile<i32>
    %43 = make_partition_view %20 : partition_view<tile=(32), padding_value = zero, tensor_view<?xf32, strides=[1]>>
    %44 = constant <i32: 0> : tile<i32>
    %45 = store_view_tko weak %39, %43[%44] token = %7 : tile<32xf32>, partition_view<tile=(32), padding_value = zero, tensor_view<?xf32, strides=[1]>>, tile<i32> -> token
    return
  }
}

Compiled bytecode: 754 bytes
First 32 bytes (hex): [7f, 54, 69, 6c, 65, 49, 52, 00, 0d, 02, 00, 00, 82, a0, 01, 08, 01, 01, 06, 02, 00, 97, 01, 10, 04, 00, 44, 07, 30, 04, 04, 04]

@copy-pr-bot

copy-pr-bot Bot commented Apr 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@elibol
Collaborator

elibol commented Apr 24, 2026

Thanks for pushing on the new direction! It feels right. I'm pretty excited to get it merged. Lots of folks have asked about it 🙂

Mostly LGTM:

  • The const _: fn() -> $ret = || $load_error_value compile-time return-type check is elegant.
  • Send + Sync impls have correct SAFETY justification.
  • OnceLock<Result<T, E>> + _lib: libloading::Library keep-alive is clean.
  • flake.nix Darwin gating is tidy.
  • KernelCompiler is a real ergonomics improvement over the 10-arg constructor.
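For readers who have not seen the first and third patterns, here is a minimal sketch of how they can fit together. It is not the code from this PR: `Table`, `try_load`, `table`, `shim!`, and `add_one` are hypothetical names, plain `fn` pointers stand in for the loaded `extern "C"` symbols, and the loader is hard-wired to fail so the snippet runs without a driver.

```rust
use std::sync::OnceLock;

/// Numeric value of CUDA_ERROR_SHARED_OBJECT_INIT_FAILED in cuda.h.
pub const SHARED_OBJECT_INIT_FAILED: i32 = 303;

/// Hypothetical table of loaded entry points. A real loader would also store
/// the library handle here so the pointers stay valid.
pub struct Table {
    pub add_one: fn(i32) -> i32,
}

/// The load is attempted once; success or failure is cached for later calls.
static TABLE: OnceLock<Result<Table, String>> = OnceLock::new();

/// Stand-in for the real loader (assumption: it always fails in this sketch).
fn try_load() -> Result<Table, String> {
    Err("driver library not found".to_string())
}

fn table() -> Result<&'static Table, &'static String> {
    TABLE.get_or_init(try_load).as_ref()
}

/// Generates a shim that forwards to the loaded symbol, or returns the
/// fallback value when the library could not be loaded.
macro_rules! shim {
    ($name:ident($($arg:ident: $ty:ty),*) -> $ret:ty, $load_error_value:expr) => {
        // Compile-time check: the fallback expression must have type `$ret`.
        // The closure is never called; it only has to type-check.
        const _: fn() -> $ret = || $load_error_value;

        pub fn $name($($arg: $ty),*) -> $ret {
            match table() {
                Ok(t) => (t.$name)($($arg),*),
                Err(_) => $load_error_value,
            }
        }
    };
}

shim!(add_one(x: i32) -> i32, SHARED_OBJECT_INIT_FAILED);
```

With the loader failing, `add_one(41)` returns the fallback code instead of panicking. Passing a fallback of the wrong type (say, a string literal) to `shim!` is rejected at compile time by the `const _` line.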

Some feedback:

1. Letting bindgen generate the loader

Wondering if there's a strong reason we aren't letting bindgen generate the loader.

build.rs:31 blocklists every function and dyn_load.rs:828-1078 re-declares ~50 of them by hand. bindgen::Builder::dynamic_library_name("cuda") can generate the same struct-of-fn-pointers + libloading pattern directly from cuda.h, so signatures stay in sync with the header automatically.

If there's a disagreement between the hand-written signatures and the ABI (wrong param type, missed _v2 suffix, etc.), the signature transmutes the loaded symbol into the wrong unsafe extern "C" fn and silently corrupts state at call time.

If there's a reason to prefer hand-written here (e.g. deliberately freezing the exported surface) we ought to note the reason in dyn_load.rs so future maintainers know.

2. Make the pinned CUDA minor explicit

If the crate is pinned to a specific CUDA minor (13.2), it might be worth making that explicit and failing loudly on a mismatch rather than hoping the drift never bites, at least for now. If you agree, here's what would need to change:

  • dyn_load.rs:731: libcurand.so is the dev symlink; production systems often ship only versioned sonames. Pinning to 13.2's exact name (libcurand.so.12 or whatever 13.2 actually ships) would make the requirement concrete.
  • dyn_load.rs:735: curand64_10.dll is CUDA 10's DLL name; probably wants to be 13.2's equivalent.
  • After Library::new succeeds, it could call cuDriverGetVersion and compare against CUDA_VERSION from the bindgen header, surfacing a clear error on mismatch.
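The "versioned soname first, dev symlink second" idea can be sketched as a small helper. This is only an illustration: `curand_candidates` and `load_first` are hypothetical names, the library names are the ones mentioned in this thread, and the `open` closure stands in for the real loading call (with libloading that would presumably be an `unsafe` call to `Library::new`).

```rust
/// Candidate cuRAND library names for a platform, most specific first.
/// The names are taken from this thread; other platforms get no candidates.
pub fn curand_candidates(os: &str) -> &'static [&'static str] {
    match os {
        "linux" => &["libcurand.so.10", "libcurand.so"],
        "windows" => &["curand64_10.dll"],
        _ => &[],
    }
}

/// Tries each candidate in order and returns the first library that opens.
/// On total failure, every attempted name is reported with its error.
pub fn load_first<T, E: std::fmt::Display>(
    candidates: &[&str],
    mut open: impl FnMut(&str) -> Result<T, E>,
) -> Result<T, String> {
    let mut failures = Vec::new();
    for &name in candidates {
        match open(name) {
            Ok(lib) => return Ok(lib),
            Err(e) => failures.push(format!("{name}: {e}")),
        }
    }
    if failures.is_empty() {
        Err("no candidate library names for this platform".to_string())
    } else {
        Err(failures.join("; "))
    }
}
```

Keeping every failed name in the error message makes it easier to tell "only the dev symlink is missing" apart from "the library is not installed at all".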

I'd like to encourage disagreement here. I tend to be paranoid about these sorts of things 🙂

3. The library-missing failure could be easier to debug

The fallback at dyn_load.rs:803-813 returns CUDA_ERROR_SHARED_OBJECT_INIT_FAILED with no hint of what actually went wrong. The real DynLoadError is only reachable via cuda_driver_load_error(), which most users won't know to call. Logging it on first failure (tracing / log), or having cuda-core surface it via Display at its init path, would keep today's behavior for return-code-checking code while giving humans something to go on.
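A rough sketch of the "surface it via Display" option, under stated assumptions: the `DriverError` shape, the cached `String` reason, and the message wording are all made up for illustration, and the cache is pre-filled with a failure so the snippet runs anywhere. Only the name `cuda_driver_load_error` and the error code come from the discussion above.

```rust
use std::fmt;
use std::sync::OnceLock;

/// Numeric value of CUDA_ERROR_SHARED_OBJECT_INIT_FAILED in cuda.h.
pub const SHARED_OBJECT_INIT_FAILED: u32 = 303;

/// Cached outcome of the one-time library load (assumption: it failed).
static LOAD_RESULT: OnceLock<Result<(), String>> = OnceLock::new();

fn load_result() -> &'static Result<(), String> {
    LOAD_RESULT
        .get_or_init(|| Err("libcuda.so.1: cannot open shared object file".to_string()))
}

/// Returns the cached load failure, if any.
pub fn cuda_driver_load_error() -> Option<&'static str> {
    load_result().as_ref().err().map(String::as_str)
}

/// Hypothetical error wrapper around a raw driver return code.
#[derive(Debug)]
pub struct DriverError(pub u32);

impl fmt::Display for DriverError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "CUDA driver error {}", self.0)?;
        // Only the load-failure code gets the cached reason appended, so
        // code that checks return values keeps seeing the same number.
        if self.0 == SHARED_OBJECT_INIT_FAILED {
            if let Some(reason) = cuda_driver_load_error() {
                write!(f, " (driver library failed to load: {reason})")?;
            }
        }
        Ok(())
    }
}

impl std::error::Error for DriverError {}
```

The return code itself is unchanged; only the human-readable rendering gains the underlying reason.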

4. A few minor things

  • compile_api.rs:1245 — the scalar_hints field is threaded through to CUDATileFunctionCompiler::new but has no builder setter, so users of KernelCompiler can't populate it. Worth either adding pub fn scalar_hints(...) or dropping the field.
  • Cargo.lock has ~20 unrelated version bumps; might be cleaner to split those off into a separate PR.
  • There's no unit test for KernelCompiler yet — the compile_only example is a nice smoke test but a single "trivial kernel → non-empty IR + bytecode" assertion would protect the builder plumbing against regressions.

@drbh
Contributor Author

drbh commented Apr 27, 2026

thanks for the detailed review! pushed updates for all four points.

  1. bindgen-generated loader - agreed, no good reason to hand-write these. build.rs now uses dynamic_library_name() + dynamic_link_require_all(false) to generate the fn-pointer struct from the headers directly. a syn/quote codegen step emits shim functions on top. no more manual signature maintenance.

  2. version pinning - added a cuDriverGetVersion check after load that compares runtime vs compile-time CUDA_VERSION. some pushback on the details though:

  • check is major-version only (runtime_major < compile_major). since cuda 11, minor-version compatibility lets newer toolkits run on older same-major drivers - so rejecting e.g. a 13.0 driver with 13.2 headers would break supported configs.
  • libcurand.so now prefers libcurand.so.10 (the versioned soname). worth noting: 10 here is curand's library-major, not the cuda version - same soname across cuda 11/12/13.
  • curand64_10.dll is also curand's library-major, not cuda 10. it's the correct name for 13.2 on windows. dropped the incorrect curand64_12.dll that was in there.

on mismatch users get: "CUDA driver too old: built against 13.2 but runtime is 12.x"
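For reference, the major-only comparison can be sketched like this. It relies on the CUDA convention that both `cuDriverGetVersion` and the header's `CUDA_VERSION` encode a version as `1000 * major + 10 * minor`; the function names are hypothetical, the version integers are passed in rather than read from a driver, and the message prints the actual runtime minor where the text above writes `12.x`.

```rust
/// Compile-time version for CUDA 13.2 under the 1000*major + 10*minor encoding.
pub const CUDA_VERSION: i32 = 13020;

/// Splits an encoded CUDA version into (major, minor).
pub fn split_version(v: i32) -> (i32, i32) {
    (v / 1000, (v % 1000) / 10)
}

/// Rejects a runtime driver whose major version is older than the one the
/// bindings were built against. Minor versions are deliberately not compared.
pub fn check_driver_version(runtime: i32, compiled: i32) -> Result<(), String> {
    let (rt_major, rt_minor) = split_version(runtime);
    let (ct_major, ct_minor) = split_version(compiled);
    if rt_major < ct_major {
        return Err(format!(
            "CUDA driver too old: built against {ct_major}.{ct_minor} \
             but runtime is {rt_major}.{rt_minor}"
        ));
    }
    Ok(())
}
```

Under this rule a 13.0 driver passes against 13.2 headers, while any 12.x driver is rejected.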

  3. debugging - DriverError::Display now checks for SHARED_OBJECT_INIT_FAILED and surfaces the cached DynLoadError inline, so the actual failure reason shows up without needing to call cuda_driver_load_error() separately.

  4. minor items

  • scalar_hints: removed the dead field, pass &[] directly
  • cargo.lock: only has additions from this pr (prettyplease, proc-macro2, quote, syn)
  • added cutile/tests/kernel_compiler.rs - compiles a trivial kernel, asserts non-empty IR + valid bytecode magic
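As an illustration of what a "valid bytecode magic" assertion can look like, here is a small sketch. The magic value is read off the hex dump in the PR description (`7f 54 69 6c 65 49 52 00`, i.e. `\x7fTileIR\0`); treating exactly these 8 bytes as the magic is an assumption, and `has_tile_ir_magic` is a hypothetical helper, not the function used in cutile/tests/kernel_compiler.rs.

```rust
/// Leading bytes of the compiled output as shown in the PR description.
/// Assumption: the magic is these 8 bytes, including the trailing NUL.
pub const TILE_IR_MAGIC: [u8; 8] = [0x7f, b'T', b'i', b'l', b'e', b'I', b'R', 0x00];

/// Returns true when the buffer starts with the Tile IR magic bytes.
pub fn has_tile_ir_magic(bytecode: &[u8]) -> bool {
    bytecode.starts_with(&TILE_IR_MAGIC)
}
```

A test built on this would compile a trivial kernel, assert that the IR string is non-empty, and then assert `has_tile_ir_magic(&bytecode)`.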

let me know what you think of the latest changes - happy to make any other updates/change the approach as you see fit

@elibol
Collaborator

elibol commented Apr 29, 2026

Looks good to me! Thanks so much for working through the design changes and addressing the earlier feedback. I think the current design is the right direction.

One small follow-up: cutile-examples/examples/compile_only.rs calls _module_asts, but the module macro emits __module_ast_self. I’m going to merge this and then fix it by making KernelCompiler take __module_ast_self() -> Module and use CUDATileModules::from_kernel(...). Will tag you for review on the follow-up.

@elibol elibol merged commit 9c8e303 into NVlabs:main Apr 29, 2026
4 checks passed

3 participants