
feat: enable compilation without requiring a GPU #30

Closed
drbh wants to merge 13 commits into NVlabs:main from drbh:support-gpuless-compilation

Conversation

drbh (Contributor) commented Mar 25, 2026

This PR explores compiling cutile kernels to MLIR and bytecode without requiring a GPU.

This is achieved by making the dependencies that require a driver optional and exposing them as default features, so a build with --no-default-features never touches them. A new example was added that writes out a .mlir and a .bc file.

cargo run -p cutile-examples --example compile_only --no-default-features
Target GPU: sm_90
Compiling my_kernels::tile_math
Generated MLIR IR:

cuda_tile.module @my_kernels {
  entry @tile_math_entry(%arg0: tile<ptr<f32>>, %arg1: tile<i32>, %arg2: tile<i32>, %arg3: tile<i32>, %arg4: tile<i32>, %arg5: tile<f32>) {
    %cst_32_i32 = constant <i32: 32> : tile<i32>
    %0 = make_token : token
    %blockId_x, %blockId_y, %blockId_z = get_tile_block_id : tile<i32>
    %assume_blockId_x = assume bounded<0, ?>, %blockId_x : tile<i32>
    %assume_blockId_y = assume bounded<0, ?>, %blockId_y : tile<i32>
    %assume_blockId_z = assume bounded<0, ?>, %blockId_z : tile<i32>
    %1 = muli %assume_blockId_x, %arg3 : tile<i32>
    %2 = muli %1, %arg2 : tile<i32>
    %3 = offset %arg0, %2 : tile<ptr<f32>>, tile<i32> -> tile<ptr<f32>>
    %tview = make_tensor_view %3, shape = [32], strides = [1] : tensor_view<32xf32, strides=[1]>
    %cst_32_i32_0 = constant <i32: 32> : tile<i32>
    %blockId_x_1, %blockId_y_2, %blockId_z_3 = get_tile_block_id : tile<i32>
    %assume_blockId_x_4 = assume bounded<0, ?>, %blockId_x_1 : tile<i32>
    %assume_blockId_y_5 = assume bounded<0, ?>, %blockId_y_2 : tile<i32>
    %assume_blockId_z_6 = assume bounded<0, ?>, %blockId_z_3 : tile<i32>
    %cst_32_i32_7 = constant <i32: 32> : tile<i32>
    %cst_32_i32_8 = constant <i32: 32> : tile<i32>
    %cst_1_i32 = constant <i32: 1> : tile<i32>
    %cst_1_i32_9 = constant <i32: 1> : tile<i32>
    %reshape = reshape %arg5 : tile<f32> -> tile<1xf32>
    %cst_32_i32_10 = constant <i32: 32> : tile<i32>
    %cst_1_i32_11 = constant <i32: 1> : tile<i32>
    %bcast = broadcast %reshape : tile<1xf32> -> tile<32xf32>
    %cst_1_f32 = constant <f32: 1.000000e+00> : tile<f32>
    %cst_32_i32_12 = constant <i32: 32> : tile<i32>
    %cst_32_i32_13 = constant <i32: 32> : tile<i32>
    %cst_1_i32_14 = constant <i32: 1> : tile<i32>
    %cst_1_i32_15 = constant <i32: 1> : tile<i32>
    %reshape_16 = reshape %cst_1_f32 : tile<f32> -> tile<1xf32>
    %cst_32_i32_17 = constant <i32: 32> : tile<i32>
    %cst_1_i32_18 = constant <i32: 1> : tile<i32>
    %bcast_19 = broadcast %reshape_16 : tile<1xf32> -> tile<32xf32>
    %4 = addf %bcast, %bcast_19  : tile<32xf32>
    %cst_32_i32_20 = constant <i32: 32> : tile<i32>
    %cst_32_i32_21 = constant <i32: 32> : tile<i32>
    %cst_32_i32_22 = constant <i32: 32> : tile<i32>
    %pview = make_partition_view %tview : partition_view<tile=(32), tensor_view<32xf32, strides=[1]>>
    %cst_0_i32 = constant <i32: 0> : tile<i32>
    %5 = store_view_tko weak %4, %pview[%cst_0_i32] token = %0 : tile<32xf32>, partition_view<tile=(32), tensor_view<32xf32, strides=[1]>>, tile<i32> -> token
    return
  }
}


Compiled bytecode: 416 bytes
First 32 bytes (hex): [7f, 54, 69, 6c, 65, 49, 52, 00, 0d, 01, 00, 00, 82, a8, 01, 08, 01, 00, 06, 02, 00, 9d, 01, 10, 04, 00, 44, 07, 30, 04, 04, 04]

Note: the output above was generated on a MacBook Pro.

copy-pr-bot (Bot) commented Mar 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

elibol (Collaborator) commented Mar 26, 2026

/ok to test 48e37c4

cryos (Collaborator) commented Mar 30, 2026

/ok to test d333356

elibol (Collaborator) left a comment:

Thanks for looking into this! This looks like a much-needed feature flag. I just want to note that emitting the MLIR is not the bulk of compilation time; most of it is spent running tileiras.

I will need help understanding all the changes so we can see whether better layering at the crate level can achieve the same effect, or get us closer with fewer modules behind the feature flag.

use cuda_core::{CudaContext, CudaFunction, CudaModule, CudaStream};
pub use cutile_compiler::validator::{
    PointerParamType, ScalarParamType, TensorParamType, ValidParamType, Validator,
};
Collaborator


The main thing we need here is to preserve the generality of cuda-async. cutile-compiler is tile-specific, whereas cuda-async should be usable by both tile and custom CUDA kernels written outside of tile. At least that is the intent :)

Perhaps the intent behind your changes here is that the validator bit is closer, from a separation-of-concerns point of view, to the compiler? I am okay with moving the validator stuff to the compiler, but we ought to avoid having cuda-async depend on cutile-compiler.

Contributor Author


Very good points; I'm going to rework the changes to remove this dependency.

Yeah, the original intent was to avoid depending on cuda-async inside cutile-compiler, since cutile-compiler only seemed to need the validator structs from cuda-async.

Will push updates shortly.

Collaborator


One option is to move the validator code to cuda-core. I haven't thought through the GPU-dependence implications of this (there are a lot of moving parts right now), but I think it should work as a solution that gives both cuda-async and cutile-compiler visibility into the validator code.

The dependency graph should look something like this (I'll add it to the README):

cutile-compiler
├── cuda-tile-rs
├── cuda-async
└── cuda-core

cuda-async
└── cuda-core

cuda-core
└── cuda-bindings

Contributor Author


Okay, I've opted to move the type validation structs into their own crate, and the deps are now:

$ wtree
cuda-core
└── cuda-bindings

cutile-compiler
├── cuda-kernel-interface
└── cuda-tile-rs

cuda-async
├── cuda-core
│   └── cuda-bindings
└── cuda-kernel-interface

cutile-examples
├── cuda-async
│   ├── cuda-core
│   │   └── cuda-bindings
│   └── cuda-kernel-interface
├── cuda-core
│   └── cuda-bindings
└── cutile
    ├── cuda-async
    │   ├── cuda-core
    │   │   └── cuda-bindings
    │   └── cuda-kernel-interface
    ├── cuda-core
    │   └── cuda-bindings
    ├── cutile-compiler
    │   ├── cuda-core
    │   │   └── cuda-bindings
    │   ├── cuda-kernel-interface
    │   └── cuda-tile-rs
    └── cutile-macro
        └── cutile-compiler
            ├── cuda-kernel-interface
            └── cuda-tile-rs

And when --no-default-features is added (note that the examples no longer depend on either cuda-core or cuda-async in this case):

$ wtree --no-default-features
cuda-core
└── cuda-bindings

cutile-compiler
├── cuda-kernel-interface
└── cuda-tile-rs

cuda-async
├── cuda-core
│   └── cuda-bindings
└── cuda-kernel-interface

cutile-examples
└── cutile
    ├── cutile-compiler
    │   ├── cuda-kernel-interface
    │   └── cuda-tile-rs
    └── cutile-macro
        └── cutile-compiler
            ├── cuda-kernel-interface
            └── cuda-tile-rs

PS: the helper used to dump the dep tree is:

# Print each workspace crate's dependency tree, with version numbers stripped.
wtree() {
  for p in cuda-core cutile-compiler cuda-async cutile-examples; do
    cargo tree -p "$p" --edges normal --depth workspace --no-dedupe --format "{p}" "$@" \
      | sed -E 's/ v[0-9].*//'  # drop the " vX.Y.Z" suffix from each node
    echo
  done
}

Additional review comment threads (collapsed) on cutile-examples/examples/compile_only.rs, cutile-examples/Cargo.toml, and cutile-macro/src/_module.rs; several are marked outdated.
drbh force-pushed the support-gpuless-compilation branch from 0dd3219 to 3a339bf on April 1, 2026 at 18:57.
drbh (Contributor Author) commented Apr 1, 2026

@elibol please let me know your thoughts on the refactors!

I also added some changes to flake.nix that I've been using to test on my MacBook:

nix develop --command cargo run -p cutile-examples --example compile_only --no-default-features

This should work out of the box for anyone with Nix installed on macOS.

elibol (Collaborator) left a comment:

Thanks @drbh! The overall feature-flag approach on cutile/cutile-compiler is solid.

A few thoughts on simplifying things:

1. Moving cuda-kernel-interface into cuda-core

Since the validator types are pure data structs with zero dependencies, they feel like CUDA kernel metadata that could live naturally in cuda-core as a validator module. That would let us avoid the extra crate and keep the dependency graph simple:

cutile-compiler
├── cuda-tile-rs
└── cuda-core

cuda-async
└── cuda-core

cuda-core
└── cuda-bindings

What do you think?

2. cfg-gated codegen as an alternative to compile_only

I was thinking: instead of the compile_only = true attribute on #[cutile::module], the macro could always emit launcher code wrapped in #[cfg(feature = "cuda")]. That way the same module definition would work on both GPU and non-GPU builds without any special annotation, and we'd avoid the extra code path in the macro (a rough sketch of the idea is below, after this list). It might be simpler long-term, but I'm curious if you see a reason to prefer the attribute approach.

3. candle-core in cutile

I noticed candle-core moved into cutile's dependencies (behind the cuda feature). Since it's mainly used by the examples for reference computations, it might be better to keep it in cutile-examples so we don't add it to the core crate's dependency surface. Happy to discuss if there's a reason it needs to be there though.
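
A rough, hypothetical sketch of the cfg-gating idea from point 2 (the cuda feature name and all identifiers below are placeholders, not actual cutile macro output):

// Illustrative only: roughly what the macro-emitted code could look like with cfg-gating.

// IR generation has no driver dependency, so the macro can always emit it.
pub fn tile_math_ir() -> String {
    // ... build and return the MLIR text for the kernel ...
    String::new()
}

// Launcher code that links against the driver is only emitted when the `cuda`
// feature is enabled, so the same #[cutile::module] definition builds on both
// GPU and GPU-less hosts without a compile_only attribute.
#[cfg(feature = "cuda")]
pub fn tile_math_launch() {
    // ... set up the stream, load the compiled module, and launch ...
}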

Everything else looks good!

drbh (Contributor Author) commented Apr 7, 2026

Hey @elibol, thanks for the comments!

Moving cuda-kernel-interface into cuda-core

Originally I avoided moving the changes into cuda-core because of its dependency on cuda-bindings; however, I've now moved the validator changes into core and removed the cuda-kernel-interface crate. I've opted to feature-flag the bindings in cuda-core, which feels like a better approach.
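
For illustration only, the gating inside cuda-core could look roughly like this (the bindings feature name and the items shown are assumptions, not the actual code in this PR):

// Hypothetical sketch of cuda-core with its driver-dependent pieces behind a feature.

// Validator metadata is plain data, so it stays unconditionally available.
pub mod validator {
    pub struct ScalarParam {
        pub name: String,
    }
}

// Anything that touches libcuda is only compiled when the `bindings` feature is on,
// so --no-default-features builds never need the driver at link time.
#[cfg(feature = "bindings")]
pub mod driver {
    pub fn init() {
        // ... call into cuda-bindings here ...
    }
}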

cfg-gated codegen as an alternative to compile_only

I like this idea too, thanks for the suggestion! The latest changes use a feature flag instead of the attribute.

candle-core in cutile

I agree that moving candle-core into the examples makes more sense; that said, I believe candle-core was previously a non-optional dependency of cutile. I've opted to move candle into cutile-examples and drop the dependency from cutile.

Updated dep tree:

wtree
cuda-core
└── cuda-bindings

cutile-compiler
├── cuda-core
└── cuda-tile-rs

cuda-async
└── cuda-core
    └── cuda-bindings

cutile-examples
├── cuda-async
│   └── cuda-core
│       └── cuda-bindings
├── cuda-core
│   └── cuda-bindings
└── cutile
    ├── cuda-async
    │   └── cuda-core
    │       └── cuda-bindings
    ├── cuda-core
    │   └── cuda-bindings
    ├── cutile-compiler
    │   ├── cuda-core
    │   │   └── cuda-bindings
    │   └── cuda-tile-rs
    └── cutile-macro
        └── cutile-compiler
            ├── cuda-core
            └── cuda-tile-rs

and

wtree --no-default-features
cuda-core

cutile-compiler
├── cuda-core
└── cuda-tile-rs

cuda-async
└── cuda-core
    └── cuda-bindings

cutile-examples
└── cutile
    ├── cutile-compiler
    │   ├── cuda-core
    │   └── cuda-tile-rs
    └── cutile-macro
        └── cutile-compiler
            ├── cuda-core
            └── cuda-tile-rs

Thanks again for the suggestions! Please let me know if the PR needs any more changes.

elibol (Collaborator) commented Apr 8, 2026

Thanks for driving this — your work here helped clarify quite a bit.

For the CUDA build-time dependency: what do you think about dynamic loading instead of feature flags? The root cause is cuda-bindings/build.rs emitting cargo:rustc-link-lib=dylib=cuda, which forces the linker to find libcuda.so at build time. Switching to libloading would eliminate this: the same code compiles everywhere, and CUDA availability becomes a runtime check.

The appeal over feature flags is that "can I use a GPU" becomes a runtime property rather than a compile-time one, which avoids the forwarding burden on downstream crates.

Since we require CUDA 13.2+, the version story would be straightforward: Generate bindings against 13.2 headers at build time (headers don't require a GPU), load the driver dynamically at runtime, and fail with a clear error if the runtime driver is too old.
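
A minimal sketch of that runtime-loading idea, using the libloading crate and the driver entry points cuInit and cuDriverGetVersion (error handling and symbol coverage are simplified here; this is not the actual cuda-bindings code):

use libloading::{Library, Symbol};

// Try to load the CUDA driver at runtime instead of linking libcuda at build time.
// Returns the driver version, or a descriptive error if no usable driver is present.
fn try_load_cuda_driver() -> Result<i32, Box<dyn std::error::Error>> {
    // On Linux the driver library is libcuda.so.1; other platforms need their own names.
    let lib = unsafe { Library::new("libcuda.so.1")? };
    unsafe {
        let cu_init: Symbol<unsafe extern "C" fn(u32) -> i32> = lib.get(b"cuInit")?;
        let cu_driver_get_version: Symbol<unsafe extern "C" fn(*mut i32) -> i32> =
            lib.get(b"cuDriverGetVersion")?;

        if cu_init(0) != 0 {
            return Err("cuInit failed: no usable CUDA driver on this machine".into());
        }
        let mut version = 0;
        cu_driver_get_version(&mut version);
        // A real implementation would reject driver versions older than the minimum supported.
        Ok(version)
    }
}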

I'm thinking this direction would pair well with a compile-only API. Something roughly like this:

// Normal path: compile + launch (needs GPU)
let output = gemm_kernel(z, x, y)
    .generics(generics)
    .grid((m_tiles, n_tiles, 1))
    .sync()?;

// Compile-only path: just generics + target
let artifacts = compiler::compile(gemm_kernel)
    .generics(generics)
    .grid((m_tiles, n_tiles, 1))
    .compile("sm_80")?;

artifacts.ir_text();       // tile IR for debugging
artifacts.bytecode();      // .bc bytes
artifacts.cubin_bytes();   // compiled cubin (requires tileiras)

Thoughts? If that makes sense, I can begin looking into the cuda-bindings changes. If the compile API is something you're interested in, you're welcome to look into it (I am also happy to pick it up if there's something else you'd like to work on / focus on).

drbh (Contributor Author) commented Apr 17, 2026

Hey @elibol!

Apologies for the delay on this PR (I was traveling and out of routine). I think the dynamic loading route is a great idea and will make this much simpler and cleaner; I'm happy to look into libloading and drop the feature flags.

I'm going to spend a bit of time on this today/this weekend and will open a new PR in place of this one soon. I also need to catch up on the recent changes; it looks like there have been a lot of improvements to the repo!

elibol (Collaborator) commented Apr 17, 2026

Sounds good, and welcome back! Yes, lots of improvements :) There are a few more breaking changes I'd like to get in for 0.0.2, and then I have this work plus other major PRs planned for 0.0.3.

drbh (Contributor Author) commented Apr 23, 2026

Closing this PR in favor of #114.

drbh closed this on Apr 23, 2026.