
feat: enable compilation without requiring a GPU #30

Closed
drbh wants to merge 13 commits into NVlabs:main from drbh:support-gpuless-compilation

Conversation

drbh (Contributor) commented Mar 25, 2026

This PR explores compiling cutile kernels to MLIR and bytecode without requiring a GPU.

This is achieved by making the dependencies that require a driver optional and exposing them as default features, so a build with --no-default-features never touches them. A new example was added that writes out a .mlir and a .bc file.

cargo run -p cutile-examples --example compile_only --no-default-features
Target GPU: sm_90
Compiling my_kernels::tile_math
Generated MLIR IR:

cuda_tile.module @my_kernels {
  entry @tile_math_entry(%arg0: tile<ptr<f32>>, %arg1: tile<i32>, %arg2: tile<i32>, %arg3: tile<i32>, %arg4: tile<i32>, %arg5: tile<f32>) {
    %cst_32_i32 = constant <i32: 32> : tile<i32>
    %0 = make_token : token
    %blockId_x, %blockId_y, %blockId_z = get_tile_block_id : tile<i32>
    %assume_blockId_x = assume bounded<0, ?>, %blockId_x : tile<i32>
    %assume_blockId_y = assume bounded<0, ?>, %blockId_y : tile<i32>
    %assume_blockId_z = assume bounded<0, ?>, %blockId_z : tile<i32>
    %1 = muli %assume_blockId_x, %arg3 : tile<i32>
    %2 = muli %1, %arg2 : tile<i32>
    %3 = offset %arg0, %2 : tile<ptr<f32>>, tile<i32> -> tile<ptr<f32>>
    %tview = make_tensor_view %3, shape = [32], strides = [1] : tensor_view<32xf32, strides=[1]>
    %cst_32_i32_0 = constant <i32: 32> : tile<i32>
    %blockId_x_1, %blockId_y_2, %blockId_z_3 = get_tile_block_id : tile<i32>
    %assume_blockId_x_4 = assume bounded<0, ?>, %blockId_x_1 : tile<i32>
    %assume_blockId_y_5 = assume bounded<0, ?>, %blockId_y_2 : tile<i32>
    %assume_blockId_z_6 = assume bounded<0, ?>, %blockId_z_3 : tile<i32>
    %cst_32_i32_7 = constant <i32: 32> : tile<i32>
    %cst_32_i32_8 = constant <i32: 32> : tile<i32>
    %cst_1_i32 = constant <i32: 1> : tile<i32>
    %cst_1_i32_9 = constant <i32: 1> : tile<i32>
    %reshape = reshape %arg5 : tile<f32> -> tile<1xf32>
    %cst_32_i32_10 = constant <i32: 32> : tile<i32>
    %cst_1_i32_11 = constant <i32: 1> : tile<i32>
    %bcast = broadcast %reshape : tile<1xf32> -> tile<32xf32>
    %cst_1_f32 = constant <f32: 1.000000e+00> : tile<f32>
    %cst_32_i32_12 = constant <i32: 32> : tile<i32>
    %cst_32_i32_13 = constant <i32: 32> : tile<i32>
    %cst_1_i32_14 = constant <i32: 1> : tile<i32>
    %cst_1_i32_15 = constant <i32: 1> : tile<i32>
    %reshape_16 = reshape %cst_1_f32 : tile<f32> -> tile<1xf32>
    %cst_32_i32_17 = constant <i32: 32> : tile<i32>
    %cst_1_i32_18 = constant <i32: 1> : tile<i32>
    %bcast_19 = broadcast %reshape_16 : tile<1xf32> -> tile<32xf32>
    %4 = addf %bcast, %bcast_19  : tile<32xf32>
    %cst_32_i32_20 = constant <i32: 32> : tile<i32>
    %cst_32_i32_21 = constant <i32: 32> : tile<i32>
    %cst_32_i32_22 = constant <i32: 32> : tile<i32>
    %pview = make_partition_view %tview : partition_view<tile=(32), tensor_view<32xf32, strides=[1]>>
    %cst_0_i32 = constant <i32: 0> : tile<i32>
    %5 = store_view_tko weak %4, %pview[%cst_0_i32] token = %0 : tile<32xf32>, partition_view<tile=(32), tensor_view<32xf32, strides=[1]>>, tile<i32> -> token
    return
  }
}


Compiled bytecode: 416 bytes
First 32 bytes (hex): [7f, 54, 69, 6c, 65, 49, 52, 00, 0d, 01, 00, 00, 82, a8, 01, 08, 01, 00, 06, 02, 00, 9d, 01, 10, 04, 00, 44, 07, 30, 04, 04, 04]

Note: the output above was generated on a MacBook Pro.

copy-pr-bot (Bot) commented Mar 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

elibol (Collaborator) commented Mar 26, 2026

/ok to test 48e37c4

cryos (Collaborator) commented Mar 30, 2026

/ok to test d333356

elibol (Collaborator) left a comment:

Thanks for looking into this! This looks like a much-needed feature flag. I just want to note that emitting the MLIR is not the bulk of compilation time; most of it is spent running tileiras.

I will need help understanding all the changes so we can see whether better layering at the crate level can achieve the same effect, or get us closer with fewer modules behind the feature flag.

use cuda_core::{CudaContext, CudaFunction, CudaModule, CudaStream};
pub use cutile_compiler::validator::{
    PointerParamType, ScalarParamType, TensorParamType, ValidParamType, Validator,
};
Collaborator


The main thing we need here is to preserve the generality of cuda-async. cutile-compiler is tile-specific, whereas cuda-async should be usable by both tile and custom CUDA kernels written outside of tile. At least that is the intent :)

Perhaps the intent behind your changes here is that the validator bit is closer, from a separation-of-concerns point of view, to the compiler? I am okay with moving the validator stuff to the compiler, but we ought to avoid having cuda-async depend on cutile-compiler.

Contributor Author


Very good points; I'm going to rework the changes to remove this dependency.

Yeah, the original intent was to avoid depending on cuda-async inside cutile-compiler, since cutile-compiler only seemed to need the validator structs from cuda-async.

Will push updates shortly.

Collaborator


One option is to move the validator code to cuda-core. I haven't thought through the GPU-dependence implications of this (there are a lot of moving parts right now), but I think it should work as a solution that gives both cuda-async and cutile-compiler visibility into the validator code.

The dependency graph should look something like this (I'll add it to the README):

cutile-compiler
├── cuda-tile-rs
├── cuda-async
└── cuda-core

cuda-async
└── cuda-core

cuda-core
└── cuda-bindings

Contributor Author


Okay, I've opted to move the type validation structs into their own crate, and the deps are now:

$ wtree
cuda-core
└── cuda-bindings

cutile-compiler
├── cuda-kernel-interface
└── cuda-tile-rs

cuda-async
├── cuda-core
│   └── cuda-bindings
└── cuda-kernel-interface

cutile-examples
├── cuda-async
│   ├── cuda-core
│   │   └── cuda-bindings
│   └── cuda-kernel-interface
├── cuda-core
│   └── cuda-bindings
└── cutile
    ├── cuda-async
    │   ├── cuda-core
    │   │   └── cuda-bindings
    │   └── cuda-kernel-interface
    ├── cuda-core
    │   └── cuda-bindings
    ├── cutile-compiler
    │   ├── cuda-core
    │   │   └── cuda-bindings
    │   ├── cuda-kernel-interface
    │   └── cuda-tile-rs
    └── cutile-macro
        └── cutile-compiler
            ├── cuda-kernel-interface
            └── cuda-tile-rs

And when --no-default-features is added (note that the examples no longer depend on either cuda-core or cuda-async in this case):

$ wtree --no-default-features
cuda-core
└── cuda-bindings

cutile-compiler
├── cuda-kernel-interface
└── cuda-tile-rs

cuda-async
├── cuda-core
│   └── cuda-bindings
└── cuda-kernel-interface

cutile-examples
└── cutile
    ├── cutile-compiler
    │   ├── cuda-kernel-interface
    │   └── cuda-tile-rs
    └── cutile-macro
        └── cutile-compiler
            ├── cuda-kernel-interface
            └── cuda-tile-rs

PS: the helper used to dump the dep tree is:

# Print each workspace crate's dependency tree, with version numbers stripped.
wtree() {
  for p in cuda-core cutile-compiler cuda-async cutile-examples; do
    cargo tree -p "$p" --edges normal --depth workspace --no-dedupe --format "{p}" "$@" \
      | sed -E 's/ v[0-9].*//'  # drop the " vX.Y.Z" suffix from each node
    echo
  done
}

Additional review comment threads (collapsed) on cutile-examples/examples/compile_only.rs, cutile-examples/Cargo.toml, and cutile-macro/src/_module.rs; several are marked outdated.
drbh force-pushed the support-gpuless-compilation branch from 0dd3219 to 3a339bf on April 1, 2026 at 18:57.
drbh (Contributor Author) commented Apr 1, 2026

@elibol please let me know your thoughts on the refactors!

I also added some changes to flake.nix that I've been using to test on my MacBook:

nix develop --command cargo run -p cutile-examples --example compile_only --no-default-features

This should work out of the box for anyone with Nix installed on macOS.

elibol (Collaborator) left a comment:

Thanks @drbh! The overall feature-flag approach on cutile/cutile-compiler is solid.

A few thoughts on simplifying things:

1. Moving cuda-kernel-interface into cuda-core

Since the validator types are pure data structs with zero dependencies, they feel like CUDA kernel metadata that could live naturally in cuda-core as a validator module. That would let us avoid the extra crate and keep the dependency graph simple:

cutile-compiler
├── cuda-tile-rs
└── cuda-core

cuda-async
└── cuda-core

cuda-core
└── cuda-bindings

What do you think?

2. cfg-gated codegen as an alternative to compile_only

I was thinking: instead of the compile_only = true attribute on #[cutile::module], the macro could always emit launcher code wrapped in #[cfg(feature = "cuda")]. That way the same module definition would work on both GPU and non-GPU builds without any special annotation, and we'd avoid the extra code path in the macro (a rough sketch of the idea is below, after this list). It might be simpler long-term, but I'm curious if you see a reason to prefer the attribute approach.

3. candle-core in cutile

I noticed candle-core moved into cutile's dependencies (behind the cuda feature). Since it's mainly used by the examples for reference computations, it might be better to keep it in cutile-examples so we don't add it to the core crate's dependency surface. Happy to discuss if there's a reason it needs to be there though.
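
A rough, hypothetical sketch of the cfg-gating idea from point 2 (the cuda feature name and all identifiers below are placeholders, not actual cutile macro output):

// Illustrative only: roughly what the macro-emitted code could look like with cfg-gating.

// IR generation has no driver dependency, so the macro can always emit it.
pub fn tile_math_ir() -> String {
    // ... build and return the MLIR text for the kernel ...
    String::new()
}

// Launcher code that links against the driver is only emitted when the `cuda`
// feature is enabled, so the same #[cutile::module] definition builds on both
// GPU and GPU-less hosts without a compile_only attribute.
#[cfg(feature = "cuda")]
pub fn tile_math_launch() {
    // ... set up the stream, load the compiled module, and launch ...
}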

Everything else looks good!

drbh (Contributor Author) commented Apr 7, 2026

Hey @elibol, thanks for the comments!

Moving cuda-kernel-interface into cuda-core

Originally I avoided moving the changes into cuda-core because of its dependency on cuda-bindings; however, I've now moved the validator changes into core and removed the cuda-kernel-interface crate. I've opted to feature-flag the bindings in cuda-core, which feels like a better approach.
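
For illustration only, the gating inside cuda-core could look roughly like this (the bindings feature name and the items shown are assumptions, not the actual code in this PR):

// Hypothetical sketch of cuda-core with its driver-dependent pieces behind a feature.

// Validator metadata is plain data, so it stays unconditionally available.
pub mod validator {
    pub struct ScalarParam {
        pub name: String,
    }
}

// Anything that touches libcuda is only compiled when the `bindings` feature is on,
// so --no-default-features builds never need the driver at link time.
#[cfg(feature = "bindings")]
pub mod driver {
    pub fn init() {
        // ... call into cuda-bindings here ...
    }
}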

cfg-gated codegen as an alternative to compile_only

I like this idea too, thanks for the suggestion! The latest changes use a feature flag instead of the attribute.

candle-core in cutile

I agree that moving candle-core into the examples makes more sense; that said, I believe candle-core was previously a non-optional dependency of cutile. I've opted to move candle into cutile-examples and drop the dependency from cutile.

Updated dep tree:

wtree
cuda-core
└── cuda-bindings

cutile-compiler
├── cuda-core
└── cuda-tile-rs

cuda-async
└── cuda-core
    └── cuda-bindings

cutile-examples
├── cuda-async
│   └── cuda-core
│       └── cuda-bindings
├── cuda-core
│   └── cuda-bindings
└── cutile
    ├── cuda-async
    │   └── cuda-core
    │       └── cuda-bindings
    ├── cuda-core
    │   └── cuda-bindings
    ├── cutile-compiler
    │   ├── cuda-core
    │   │   └── cuda-bindings
    │   └── cuda-tile-rs
    └── cutile-macro
        └── cutile-compiler
            ├── cuda-core
            └── cuda-tile-rs

and

wtree --no-default-features
cuda-core

cutile-compiler
├── cuda-core
└── cuda-tile-rs

cuda-async
└── cuda-core
    └── cuda-bindings

cutile-examples
└── cutile
    ├── cutile-compiler
    │   ├── cuda-core
    │   └── cuda-tile-rs
    └── cutile-macro
        └── cutile-compiler
            ├── cuda-core
            └── cuda-tile-rs

Thanks again for the suggestions! Please let me know if the PR needs any more changes.

elibol (Collaborator) commented Apr 8, 2026

Thanks for driving this — your work here helped clarify quite a bit.

For the CUDA build-time dependency: what do you think about dynamic loading instead of feature flags? The root cause is cuda-bindings/build.rs emitting cargo:rustc-link-lib=dylib=cuda, which forces the linker to find libcuda.so at build time. Switching to libloading would eliminate this: the same code compiles everywhere, and CUDA availability becomes a runtime check.

The appeal over feature flags is that "can I use a GPU" becomes a runtime property rather than a compile-time one, which avoids the forwarding burden on downstream crates.

Since we require CUDA 13.2+, the version story would be straightforward: Generate bindings against 13.2 headers at build time (headers don't require a GPU), load the driver dynamically at runtime, and fail with a clear error if the runtime driver is too old.
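
A minimal sketch of that runtime-loading idea, using the libloading crate and the driver entry points cuInit and cuDriverGetVersion (error handling and symbol coverage are simplified here; this is not the actual cuda-bindings code):

use libloading::{Library, Symbol};

// Try to load the CUDA driver at runtime instead of linking libcuda at build time.
// Returns the driver version, or a descriptive error if no usable driver is present.
fn try_load_cuda_driver() -> Result<i32, Box<dyn std::error::Error>> {
    // On Linux the driver library is libcuda.so.1; other platforms need their own names.
    let lib = unsafe { Library::new("libcuda.so.1")? };
    unsafe {
        let cu_init: Symbol<unsafe extern "C" fn(u32) -> i32> = lib.get(b"cuInit")?;
        let cu_driver_get_version: Symbol<unsafe extern "C" fn(*mut i32) -> i32> =
            lib.get(b"cuDriverGetVersion")?;

        if cu_init(0) != 0 {
            return Err("cuInit failed: no usable CUDA driver on this machine".into());
        }
        let mut version = 0;
        cu_driver_get_version(&mut version);
        // A real implementation would reject driver versions older than the minimum supported.
        Ok(version)
    }
}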

I'm thinking this direction would pair well with a compile-only API. Something roughly like this:

// Normal path: compile + launch (needs GPU)
let output = gemm_kernel(z, x, y)
    .generics(generics)
    .grid((m_tiles, n_tiles, 1))
    .sync()?;

// Compile-only path: just generics + target
let artifacts = compiler::compile(gemm_kernel)
    .generics(generics)
    .grid((m_tiles, n_tiles, 1))
    .compile("sm_80")?;

artifacts.ir_text();       // tile IR for debugging
artifacts.bytecode();      // .bc bytes
artifacts.cubin_bytes();   // compiled cubin (requires tileiras)

Thoughts? If that makes sense, I can begin looking into the cuda-bindings changes. If the compile API is something you're interested in, you're welcome to look into it (I am also happy to pick it up if there's something else you'd like to work on / focus on).

drbh (Contributor Author) commented Apr 17, 2026

Hey @elibol!

Apologies for the delay on this PR (I was traveling and out of routine). I think the dynamic loading route is a great idea and will make this much simpler and cleaner; I'm happy to look into libloading and drop the feature flags.

I'm going to spend a bit of time on this today/this weekend and will open a new PR in place of this one soon. I also need to catch up on the recent changes; it looks like there have been a lot of improvements to the repo!

elibol (Collaborator) commented Apr 17, 2026

Sounds good, and welcome back! Yes, lots of improvements :) There are a few more breaking changes I'd like to get in for 0.0.2, and then I have this work plus other major PRs planned for 0.0.3.

drbh (Contributor Author) commented Apr 23, 2026

Closing this PR in favor of #114.

drbh closed this on Apr 23, 2026.