feat: dynamically load bindings #114
Conversation
Thanks for pushing on the new direction! It feels right. I'm pretty excited to get it merged. Lots of folks have asked about it 🙂 Mostly LGTM:
Some feedback:

**1. Letting bindgen generate the loader**

Wondering if there's a strong reason we aren't letting bindgen generate the loader. If there's a disagreement between the hand-written signatures and the ABI (wrong param type, missed …) … If there's a reason to prefer hand-written here (e.g. deliberately freezing the exported surface) we ought to note the reason in …

**2. Make the pinned CUDA minor explicit**

If the crate is pinned to a specific CUDA minor (13.2), it might be worth making that explicit and failing loudly on a mismatch rather than hoping the drift never bites, at least for now. If you agree, here's what would need to change: …

I'd like to encourage disagreement here. I tend to be paranoid about these sorts of things 🙂

**3. The library-missing failure could be easier to debug**

The fallback at …

**4. A few minor things**

…
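A loud version check along the lines of point 2 might look roughly like the sketch below. The constant and function names here are assumptions for illustration, not the crate's actual API; the version encoding follows CUDA's `major * 1000 + minor * 10` convention (so 13.2 is 13020), as reported by the driver's `cuDriverGetVersion`.

```rust
/// Hypothetical pinned driver version the bindings were generated
/// against, in CUDA's `major * 1000 + minor * 10` encoding (13.2 -> 13020).
const PINNED_CUDA_VERSION: u32 = 13020;

/// Fail loudly when the loaded driver reports a different version than
/// the pinned one, instead of silently tolerating the drift.
/// `reported` would come from the driver at load time.
fn check_driver_version(reported: u32) -> Result<(), String> {
    if reported != PINNED_CUDA_VERSION {
        return Err(format!(
            "CUDA driver reports version {reported}, but these bindings \
             were generated against {PINNED_CUDA_VERSION}; refusing to load"
        ));
    }
    Ok(())
}

fn main() {
    // Matching version: load proceeds.
    assert!(check_driver_version(13020).is_ok());
    // Older driver (12.4): fails loudly with a descriptive message.
    let err = check_driver_version(12040).unwrap_err();
    assert!(err.contains("12040"));
    println!("version check behaves as expected");
}
```

Whether the check should reject only a differing minor or any difference at all is exactly the kind of policy worth writing down explicitly in the crate.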
thanks for the detailed review! pushed updates for all four points.
on mismatch users get: …
let me know what you think of the latest changes - happy to make any other updates or change the approach as you see fit
Looks good to me! Thanks so much for working through the design changes and addressing the earlier feedback. I think the current design is the right direction. One small follow-up:
This PR is a follow-up on #30 and explores dynamically loading CUDA bindings via `libloading`. This change avoids the feature-flag solution proposed in #30 and allows the same binary to be built on systems with and without a CUDA driver.

Checking CUDA availability moves from compile time to runtime, which also has the benefit of relieving downstream crates from handling feature flags.
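The runtime check could be structured roughly as follows. This is a self-contained sketch: the `Driver` struct, `try_load_driver`, and the stubbed probe are made-up names for illustration; the real crate would attempt something like `libloading::Library::new("libcuda.so.1")` in the probe and fall back to compile-only mode when that fails.

```rust
use std::sync::OnceLock;

/// Hypothetical result of probing for the CUDA driver at runtime.
/// In the real crate this would hold a `libloading::Library` handle;
/// the probe is stubbed here so the sketch stays self-contained.
struct Driver {
    available: bool,
}

fn try_load_driver() -> Driver {
    // Real code would do something like:
    //   unsafe { libloading::Library::new("libcuda.so.1") }
    // and record the handle on success. The stub always reports "missing".
    Driver { available: false }
}

/// The load is attempted once and memoized for the process lifetime,
/// so downstream crates can query availability without feature flags.
fn driver() -> &'static Driver {
    static DRIVER: OnceLock<Driver> = OnceLock::new();
    DRIVER.get_or_init(try_load_driver)
}

fn main() {
    if driver().available {
        println!("CUDA driver: available");
    } else {
        println!("CUDA driver: not available (compile-only mode)");
    }
}
```

The `OnceLock` keeps the `dlopen` attempt to a single probe per process, which is the property that lets one binary serve both driver and driverless systems.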
In addition to the bindings, a new `KernelCompiler` struct/impl is added that provides a simple interface for compile-only operations. This functionality is used in the `cutile-examples/examples/compile_only.rs` example.

Lastly, this PR includes changes to the flake to run the compile-only example on a MacBook with the command below.
output
```
CUDA driver: not available (compile-only mode)
Target GPU: sm_80
Compiling my_kernels::tile_math
Generated Tile IR:
cuda_tile.module @my_kernels {
  entry @tile_math_entry(%0: tile<ptr<f32>>, %1: tile<i32>, %2: tile<i32>, %3: tile<i32>, %4: tile<i32>, %5: tile<f32>) {
    %6 = constant <i32: 32> : tile<i32>
    %7 = make_token : token
    %8, %9, %10 = get_tile_block_id : tile<i32>
    %11 = assume bounded<0, ?>, %8 : tile<i32>
    %12 = assume bounded<0, ?>, %9 : tile<i32>
    %13 = assume bounded<0, ?>, %10 : tile<i32>
    %14 = muli %11, %3 : tile<i32>
    %15 = muli %14, %2 : tile<i32>
    %16 = offset %0, %15 : tile<ptr<f32>>, tile<i32> -> tile<ptr<f32>>
    %17 = muli %11, %3 : tile<i32>
    %18 = subi %1, %17 : tile<i32>
    %19 = mini %3, %18 signed : tile<i32>
    %20 = make_tensor_view %16, shape = [%19], strides = [1] : tile<i32> -> tensor_view<?xf32, strides=[1]>
    %21 = constant <i32: 32> : tile<i32>
    %22 = constant <i32: 32> : tile<i32>
    %23 = constant <i32: 32> : tile<i32>
    %24 = constant <i32: 1> : tile<i32>
    %25 = constant <i32: 1> : tile<i32>
    %26 = reshape %5 : tile<f32> -> tile<1xf32>
    %27 = constant <i32: 1> : tile<i32>
    %28 = constant <i32: 32> : tile<i32>
    %29 = broadcast %26 : tile<1xf32> -> tile<32xf32>
    %30 = constant <f32: 1.0> : tile<f32>
    %31 = constant <i32: 32> : tile<i32>
    %32 = constant <i32: 32> : tile<i32>
    %33 = constant <i32: 1> : tile<i32>
    %34 = constant <i32: 1> : tile<i32>
    %35 = reshape %30 : tile<f32> -> tile<1xf32>
    %36 = constant <i32: 32> : tile<i32>
    %37 = constant <i32: 1> : tile<i32>
    %38 = broadcast %35 : tile<1xf32> -> tile<32xf32>
    %39 = addf %29, %38 : tile<32xf32>
    %40 = constant <i32: 32> : tile<i32>
    %41 = constant <i32: 32> : tile<i32>
    %42 = constant <i32: 32> : tile<i32>
    %43 = make_partition_view %20 : partition_view<tile=(32), padding_value = zero, tensor_view<?xf32, strides=[1]>>
    %44 = constant <i32: 0> : tile<i32>
    %45 = store_view_tko weak %39, %43[%44] token = %7 : tile<32xf32>, partition_view<tile=(32), padding_value = zero, tensor_view<?xf32, strides=[1]>>, tile<i32> -> token
    return
  }
}
Compiled bytecode: 754 bytes
First 32 bytes (hex): [7f, 54, 69, 6c, 65, 49, 52, 00, 0d, 02, 00, 00, 82, a0, 01, 08, 01, 01, 06, 02, 00, 97, 01, 10, 04, 00, 44, 07, 30, 04, 04, 04]
```
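The compile-only flow that produces output like the above can be sketched as follows. The `KernelCompiler` method names (`new`, `compile`) and the stubbed body are assumptions for illustration, not the crate's actual API; the only detail taken from the dump is the leading `\x7fTileIR\0` magic in the bytecode.

```rust
/// Illustrative stand-in for the compile-only interface; the real
/// `KernelCompiler` lives in the crate and these names are assumptions.
struct KernelCompiler {
    target: String,
}

impl KernelCompiler {
    fn new(target: &str) -> Self {
        KernelCompiler { target: target.to_string() }
    }

    /// Stubbed compile step: the real implementation would lower the
    /// kernel to Tile IR and serialize it to bytecode; this stub just
    /// echoes the steps and returns the bytecode's magic header.
    fn compile(&self, kernel: &str) -> Vec<u8> {
        println!("Target GPU: {}", self.target);
        println!("Compiling {}", kernel);
        // `\x7fTileIR\0`, matching the first eight bytes of the hex dump.
        vec![0x7f, b'T', b'i', b'l', b'e', b'I', b'R', 0x00]
    }
}

fn main() {
    let compiler = KernelCompiler::new("sm_80");
    let bytecode = compiler.compile("my_kernels::tile_math");
    println!("Compiled bytecode: {} bytes", bytecode.len());
}
```

The point of the struct is that nothing in it touches the driver, so this path works on a machine (like the MacBook above) with no CUDA driver installed.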