feat: enable compilation without requiring a GPU #30
Conversation
/ok to test 48e37c4
/ok to test d333356
elibol left a comment
Thanks for looking into this! This looks like a much-needed feature flag. I just want to note that emitting the MLIR code is not the bulk of compilation time; most of the compilation time is spent running tileiras.
I will need help understanding all the changes so we can see if better layering at the crate level can achieve the same effect, or get us closer with fewer modules behind the feature flag.
use cuda_core::{CudaContext, CudaFunction, CudaModule, CudaStream};
pub use cutile_compiler::validator::{
    PointerParamType, ScalarParamType, TensorParamType, ValidParamType, Validator,
};
The main thing we need here is to preserve the generality of cuda-async. cutile-compiler is tile-specific, whereas cuda-async should be usable by both tile and custom CUDA kernels written outside of tile. At least that is the intent :)
Perhaps the intent behind your changes here is that the validator bit is closer, from a separation-of-concerns point of view, to the compiler? I am okay with moving the validator stuff to the compiler, but we ought to avoid having cuda-async depend on cutile-compiler.
Very good points; I'm going to rework the changes to remove this dependency.
Yeah, the original intent was to avoid depending on cuda-async inside cutile-compiler, since cutile-compiler only seemed to need the validator structs from cuda-async.
Will push updates shortly.
One option is to move the validator code to cuda-core. I haven't thought through the GPU-dependence implications of this (there are a lot of moving parts right now), but I think it should work as a solution that gives both cuda-async and cutile-compiler visibility of the validator code.
The dependency graph should look something like this (I'll add it to the README):
cutile-compiler
├── cuda-tile-rs
├── cuda-async
└── cuda-core
cuda-async
└── cuda-core
cuda-core
└── cuda-bindings
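For illustration, a minimal sketch of what dependency-free validator types in cuda-core might look like, reusing the type names from the diff quoted above; the fields and variants here are invented, not the actual definitions:

// Hypothetical shape of the validator types as plain data with no
// CUDA-driver dependency, so any crate can depend on them freely.
pub enum ScalarParamType { I32, F32 }

pub struct PointerParamType { pub pointee: ScalarParamType }

pub struct TensorParamType { pub element: ScalarParamType, pub rank: usize }

pub enum ValidParamType {
    Scalar(ScalarParamType),
    Pointer(PointerParamType),
    Tensor(TensorParamType),
}

pub struct Validator { pub params: Vec<ValidParamType> }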
Okay, I've opted to move the type-validation structs into their own crate, and now the deps are:
$ wtree
cuda-core
└── cuda-bindings
cutile-compiler
├── cuda-kernel-interface
└── cuda-tile-rs
cuda-async
├── cuda-core
│ └── cuda-bindings
└── cuda-kernel-interface
cutile-examples
├── cuda-async
│ ├── cuda-core
│ │ └── cuda-bindings
│ └── cuda-kernel-interface
├── cuda-core
│ └── cuda-bindings
└── cutile
    ├── cuda-async
    │   ├── cuda-core
    │   │   └── cuda-bindings
    │   └── cuda-kernel-interface
    ├── cuda-core
    │   └── cuda-bindings
    ├── cutile-compiler
    │   ├── cuda-core
    │   │   └── cuda-bindings
    │   ├── cuda-kernel-interface
    │   └── cuda-tile-rs
    └── cutile-macro
        └── cutile-compiler
            ├── cuda-kernel-interface
            └── cuda-tile-rs
and when --no-default-features is added (namely, the examples no longer depend on either cuda-core or cuda-async):
$ wtree --no-default-features
cuda-core
└── cuda-bindings
cutile-compiler
├── cuda-kernel-interface
└── cuda-tile-rs
cuda-async
├── cuda-core
│ └── cuda-bindings
└── cuda-kernel-interface
cutile-examples
└── cutile
    ├── cutile-compiler
    │   ├── cuda-kernel-interface
    │   └── cuda-tile-rs
    └── cutile-macro
        └── cutile-compiler
            ├── cuda-kernel-interface
            └── cuda-tile-rs
PS: the helper used to dump the dep tree is:
wtree() {
  # print each workspace crate's dependency tree, stripping " vX.Y.Z" suffixes
  for p in cuda-core cutile-compiler cuda-async cutile-examples; do
    cargo tree -p "$p" --edges normal --depth workspace --no-dedupe --format "{p}" "$@" \
      | sed -E 's/ v[0-9].*//'
    echo
  done
}
@elibol please let me know your thoughts on the refactors! I also added some changes to the Nix setup: nix develop --command cargo run -p cutile-examples --example compile_only --no-default-features should work out of the box for those with Nix installed on macOS.
elibol left a comment
Thanks @drbh! The overall feature-flag approach on cutile/cutile-compiler is solid.
A few thoughts on simplifying things:
- Moving cuda-kernel-interface into cuda-core
Since the validator types are pure data structs with zero dependencies, they feel like CUDA kernel metadata that could live naturally in cuda-core as a validator module. That would let us avoid the extra crate and keep the dependency graph simple:
cutile-compiler
├── cuda-tile-rs
└── cuda-core
cuda-async
└── cuda-core
cuda-core
└── cuda-bindings
What do you think?
- cfg-gated codegen as an alternative to compile_only
I was thinking: instead of the compile_only = true attribute on #[cutile::module], the macro could always emit launcher code wrapped in #[cfg(feature = "cuda")] (see the sketch after this list). That way the same module definition would work on both GPU and non-GPU builds without any special annotation, and we'd avoid the extra code path in the macro. Might be simpler long-term, but I'm curious if you see a reason to prefer the attribute approach.
- candle-core in cutile
I noticed candle-core moved into cutile's dependencies (behind the cuda feature). Since it's mainly used by the examples for reference computations, it might be better to keep it in cutile-examples so we don't add it to the core crate's dependency surface. Happy to discuss if there's a reason it needs to be there though.
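To illustrate the cfg-gating idea from the second point, here is a minimal sketch of the expansion #[cutile::module] could emit; the function names and bodies below are invented for illustration, not the macro's actual output:

mod my_kernels {
    // Always emitted: the compile-only path (MLIR/bytecode generation),
    // which needs no GPU or driver. `tile_math_mlir` is a made-up name.
    pub fn tile_math_mlir() -> String {
        // ... call into cutile-compiler here ...
        String::new()
    }

    // Emitted behind the `cuda` feature: the launcher, which needs a driver.
    #[cfg(feature = "cuda")]
    pub fn launch_tile_math() {
        // ... create a stream, load the module, launch the kernel ...
    }
}

fn main() {
    // GPU-free builds still get the compiler path; GPU builds also get the launcher.
    let _mlir = my_kernels::tile_math_mlir();
    #[cfg(feature = "cuda")]
    my_kernels::launch_tile_math();
}

Downstream crates would then opt into the launcher with features = ["cuda"], and everything else builds without a driver.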
Everything else looks good!
Hey @elibol, thanks for the comments!
Originally I was avoiding moving the changes into cuda-core, but since the validator types are pure data structs, I've gone ahead and moved them there as suggested.
I like this idea too! Thanks for the suggestion. I've updated to prefer using a feature flag in the latest changes.
I agree that moving candle-core back into cutile-examples is the better call. Updated dep tree:
$ wtree
cuda-core
└── cuda-bindings
cutile-compiler
├── cuda-core
└── cuda-tile-rs
cuda-async
└── cuda-core
    └── cuda-bindings
cutile-examples
├── cuda-async
│ └── cuda-core
│       └── cuda-bindings
├── cuda-core
│ └── cuda-bindings
└── cutile
    ├── cuda-async
    │   └── cuda-core
    │       └── cuda-bindings
    ├── cuda-core
    │   └── cuda-bindings
    ├── cutile-compiler
    │   ├── cuda-core
    │   │   └── cuda-bindings
    │   └── cuda-tile-rs
    └── cutile-macro
        └── cutile-compiler
            ├── cuda-core
            └── cuda-tile-rs

and thanks again for the suggestions! please let me know if the PR requires any more changes
Thanks for driving this; your work here helped clarify quite a bit.
For the CUDA build-time dependency: what do you think about dynamic loading instead of feature flags? The root cause is the build-time link against the CUDA driver.
The appeal over feature flags is that "can I use a GPU" becomes a runtime property rather than a compile-time one, which avoids the forwarding burden on downstream crates.
Since we require CUDA 13.2+, the version story would be straightforward: generate bindings against 13.2 headers at build time (headers don't require a GPU), load the driver dynamically at runtime, and fail with a clear error if the runtime driver is too old.
I'm thinking this direction would pair well with a compile-only API. Something roughly like this:
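(a minimal sketch of the idea, assuming the libloading crate; every name and path below is illustrative, not this repo's API)

use libloading::{Library, Symbol};

fn try_load_driver() -> Result<Library, String> {
    // Resolve the CUDA driver at runtime instead of linking it at build time.
    unsafe { Library::new("libcuda.so.1") }
        .map_err(|e| format!("no usable CUDA driver: {e}"))
}

fn main() {
    match try_load_driver() {
        Ok(lib) => {
            // Look up a driver entry point; fail with a clear error if absent.
            let cu_init: Symbol<unsafe extern "C" fn(u32) -> i32> =
                unsafe { lib.get(b"cuInit") }.expect("driver missing cuInit");
            let rc = unsafe { cu_init(0) };
            println!("cuInit returned {rc}");
        }
        // With no driver at all, the compile-only path still works.
        Err(e) => println!("{e}; compile-only mode still available"),
    }
}

Thoughts? If that makes sense, I can begin looking into the implementation.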
Hey @elibol! Apologies for the delay on this PR (was traveling and out of routine). I think the dynamic loading route is a great idea and will make this much simpler/cleaner. I'm happy to look into it; I'll spend a bit of time on this today/this weekend and will open a new PR in place of this one soon. I also need to catch up on the recent changes; it looks like there have been a lot of improvements to the repo!
Sounds good, and welcome back! Yes, lots of improvements :) There are a few more breaking changes I'd like to get in for 0.0.2, and then I have this work + other major PRs planned for 0.0.3.
closing PR in favor of #114
This PR explores the ability to compile cutile to MLIR and bytecode without requiring a GPU.
This is achieved by gating the dependencies that require a driver behind feature flags (enabled by default). A new example was added that outputs a .mlir and a .bc file.
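For a sense of the shape such an example might take, here is a hypothetical sketch; compile_to_mlir and compile_to_bytecode are stand-in names, not the crate's actual API:

use std::fs;

fn main() -> std::io::Result<()> {
    // Compile a kernel to MLIR text and bytecode without touching a GPU,
    // then write both artifacts to disk.
    let mlir = compile_to_mlir("my_kernels::tile_math", "sm_90");
    let bytecode = compile_to_bytecode(&mlir);
    fs::write("tile_math.mlir", &mlir)?;
    fs::write("tile_math.bc", &bytecode)?;
    println!("Compiled bytecode: {} bytes", bytecode.len());
    Ok(())
}

// Stand-ins for the real compiler entry points.
fn compile_to_mlir(_kernel: &str, _target: &str) -> String { String::new() }
fn compile_to_bytecode(_mlir: &str) -> Vec<u8> { Vec::new() }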
Target GPU: sm_90
Compiling my_kernels::tile_math
Generated MLIR IR:
cuda_tile.module @my_kernels {
  entry @tile_math_entry(%arg0: tile<ptr<f32>>, %arg1: tile<i32>, %arg2: tile<i32>, %arg3: tile<i32>, %arg4: tile<i32>, %arg5: tile<f32>) {
    %cst_32_i32 = constant <i32: 32> : tile<i32>
    %0 = make_token : token
    %blockId_x, %blockId_y, %blockId_z = get_tile_block_id : tile<i32>
    %assume_blockId_x = assume bounded<0, ?>, %blockId_x : tile<i32>
    %assume_blockId_y = assume bounded<0, ?>, %blockId_y : tile<i32>
    %assume_blockId_z = assume bounded<0, ?>, %blockId_z : tile<i32>
    %1 = muli %assume_blockId_x, %arg3 : tile<i32>
    %2 = muli %1, %arg2 : tile<i32>
    %3 = offset %arg0, %2 : tile<ptr<f32>>, tile<i32> -> tile<ptr<f32>>
    %tview = make_tensor_view %3, shape = [32], strides = [1] : tensor_view<32xf32, strides=[1]>
    %cst_32_i32_0 = constant <i32: 32> : tile<i32>
    %blockId_x_1, %blockId_y_2, %blockId_z_3 = get_tile_block_id : tile<i32>
    %assume_blockId_x_4 = assume bounded<0, ?>, %blockId_x_1 : tile<i32>
    %assume_blockId_y_5 = assume bounded<0, ?>, %blockId_y_2 : tile<i32>
    %assume_blockId_z_6 = assume bounded<0, ?>, %blockId_z_3 : tile<i32>
    %cst_32_i32_7 = constant <i32: 32> : tile<i32>
    %cst_32_i32_8 = constant <i32: 32> : tile<i32>
    %cst_1_i32 = constant <i32: 1> : tile<i32>
    %cst_1_i32_9 = constant <i32: 1> : tile<i32>
    %reshape = reshape %arg5 : tile<f32> -> tile<1xf32>
    %cst_32_i32_10 = constant <i32: 32> : tile<i32>
    %cst_1_i32_11 = constant <i32: 1> : tile<i32>
    %bcast = broadcast %reshape : tile<1xf32> -> tile<32xf32>
    %cst_1_f32 = constant <f32: 1.000000e+00> : tile<f32>
    %cst_32_i32_12 = constant <i32: 32> : tile<i32>
    %cst_32_i32_13 = constant <i32: 32> : tile<i32>
    %cst_1_i32_14 = constant <i32: 1> : tile<i32>
    %cst_1_i32_15 = constant <i32: 1> : tile<i32>
    %reshape_16 = reshape %cst_1_f32 : tile<f32> -> tile<1xf32>
    %cst_32_i32_17 = constant <i32: 32> : tile<i32>
    %cst_1_i32_18 = constant <i32: 1> : tile<i32>
    %bcast_19 = broadcast %reshape_16 : tile<1xf32> -> tile<32xf32>
    %4 = addf %bcast, %bcast_19 : tile<32xf32>
    %cst_32_i32_20 = constant <i32: 32> : tile<i32>
    %cst_32_i32_21 = constant <i32: 32> : tile<i32>
    %cst_32_i32_22 = constant <i32: 32> : tile<i32>
    %pview = make_partition_view %tview : partition_view<tile=(32), tensor_view<32xf32, strides=[1]>>
    %cst_0_i32 = constant <i32: 0> : tile<i32>
    %5 = store_view_tko weak %4, %pview[%cst_0_i32] token = %0 : tile<32xf32>, partition_view<tile=(32), tensor_view<32xf32, strides=[1]>>, tile<i32> -> token
    return
  }
}
Compiled bytecode: 416 bytes
First 32 bytes (hex): [7f, 54, 69, 6c, 65, 49, 52, 00, 0d, 01, 00, 00, 82, a8, 01, 08, 01, 00, 06, 02, 00, 9d, 01, 10, 04, 00, 44, 07, 30, 04, 04, 04]

note: the output above was generated on a macbook pro