A Julia package for writing GPU kernels using NVIDIA's tile-based programming model.
This package is in beta. Most Tile IR features are implemented and the package has been verified on the benchmarks and tests included in the repository. Interfaces and APIs may still change without notice.
Install just like any other Julia package:
julia> using Pkg
julia> Pkg.add("cuTile")
Execution of cuTile kernels requires CUDA.jl to be installed and imported. cuTile generates kernels based on Tile IR, which requires an NVIDIA Driver that supports CUDA 13 (580 or later). CUDA.jl automatically downloads the appropriate CUDA toolkit artifacts, so no manual CUDA installation is needed. Only Ampere, Ada, and Blackwell GPUs are supported at this time, with Hopper support expected in a future release of CUDA.
A simple vector addition kernel using cuTile looks like this:
using CUDA
import cuTile as ct
# Define kernel
function vadd(a, b, c, tile_size::Int)
pid = ct.bid(1)
tile_a = ct.load(a; index=pid, shape=(tile_size,))
tile_b = ct.load(b; index=pid, shape=(tile_size,))
ct.store(c; index=pid, tile=tile_a + tile_b)
return
end
# Launch
vector_size = 2^20
tile_size = 16
blocks = cld(vector_size, tile_size)
grid = (blocks, 1, 1)
a, b = CUDA.rand(Float32, vector_size), CUDA.rand(Float32, vector_size)
c = CUDA.zeros(Float32, vector_size)
ct.launch(vadd, grid, a, b, c, ct.Constant(tile_size))
@assert c == a .+ bThe generated Tile IR can be inspected using the code_tiled function:
ct.code_tiled(vadd, Tuple{ct.TileArray{Float32, 1, ct.ArraySpec{1}(128, true, (0,), (32,))},
ct.TileArray{Float32, 1, ct.ArraySpec{1}(128, true, (0,), (32,))},
ct.TileArray{Float32, 1, ct.ArraySpec{1}(128, true, (0,), (32,))},
ct.Constant{Int64, 16}})Since these types can be verbose, and are derived from the runtime properties
of arrays, it's often easier to use the @code_tiled macro instead:
julia> ct.@code_tiled ct.launch(vadd, (cld(vector_size, tile_size), 1, 1), a, b, c, ct.Constant(tile_size))
// vadd(cuTile.TileArray{Float32, 1, cuTile.ArraySpec{1}(128, true, (0,), (32,))}, cuTile.TileArray{Float32, 1, cuTile.ArraySpec{1}(128, true, (0,), (32,))}, cuTile.TileArray{Float32, 1, cuTile.ArraySpec{1}(128, true, (0,), (32,))}, cuTile.Constant{Int64, 16})
cuda_tile.module @kernels {
entry @vadd(...) {
...
return
}
}The former form can be useful on systems without a GPU, since it does not require CUDA.jl,
while the latter needs valid CuArrays to be passed to the kernel.
Run benchmarks with:
julia --project=examples examples/benchmarks.jl # Julia
uv run python examples/benchmarks.py # Python (for comparison)Benchmarks comparing cuTile.jl against cuTile Python on an RTX 5080 (tileiras 13.2.51,
20 runs, 5 warmup, min time reported):
| Kernel | Size | Julia | Python | Status |
|---|---|---|---|---|
| Vector Addition | 2^27 f32 | 841 GB/s | 845 GB/s | OK (=) |
| Matrix Transpose | 8192² f32 | 805 GB/s | 811 GB/s | OK (-1%) |
| Layer Norm fwd | 4096² f32 | 925 GB/s | 722 GB/s | +28%* |
| Layer Norm bwd | 4096² f32 | 243 GB/s | 251 GB/s | -3% |
| Matrix Multiplication | 4096³ f32 | 46.9 TFLOPS | 43.4 TFLOPS | +8%** |
| Batch Matrix Multiply | 1024×512×2048 ×8 f32 | 33.6 TFLOPS | 30.9 TFLOPS | +9%** |
| FFT (3-stage Cooley-Tukey) | 1024-pt ×64 c64 | 3263 μs | 3127 μs | -4% |
| Mixture of Experts | 256tok 1024h 32e 2048i f16 | 19.3 TFLOPS | 20.3 TFLOPS | -5% |
| Attention (FMHA) | 8×16×1024² ×64 f16 causal | 88.5 TFLOPS | 61.6 TFLOPS | +44%*** |
* The pow(x, 2) → mulf(x, x) strength reduction eliminates the expensive
transcendental in the variance computation. Python still emits pow.
** Likely because Julia's for loop guards give tileiras a guarantee that the
loop body executes at least once, enabling more aggressive warp scheduling.
*** Likely due to Python's compiler splitting the causal masking loop into two loops, duplicating the loop body. Julia emits a single loop with a conditional.
cuTile.jl aims to expose as much functionality as possible through Julia-native constructs
(+, sum, reshape, broadcast, etc.) rather than cuTile-specific functions. Operations
prefixed with ct. are cuTile intrinsics with no direct Julia equivalent; everything else
uses standard Julia syntax and is overlaid on Base.
Integers: Int8, UInt8, Int16, UInt16, Int32, UInt32, Int64, UInt64
Floats: Float16, BFloat16, Float32, Float64, TFloat32
Boolean: Bool
TFloat32 is a 32-bit floating-point type with reduced mantissa precision (10 bits),
optimized for tensor core operations.
| Operation | Description |
|---|---|
ct.load(arr; index, shape, ...) |
Load a tile from array |
ct.store(arr; index, tile, ...) |
Store a tile to array |
ct.gather(arr, indices; ...) |
Gather elements by index tile |
ct.scatter(arr, indices, tile; ...) |
Scatter elements by index tile |
load and store accept keyword arguments order, padding_mode, latency, and allow_tma.
gather accepts mask, padding_value, check_bounds, and latency.
scatter accepts mask, check_bounds, and latency.
# Gather with user mask and custom padding for masked-out elements
tile = ct.gather(arr, indices; mask=valid_mask, padding_value=-1.0f0)
# Scatter with mask (only write where mask is true)
ct.scatter(arr, indices, tile; mask=active_mask)| Operation | Description |
|---|---|
ct.bid(axis) |
Block ID (1=x, 2=y, 3=z) |
ct.num_blocks(axis) |
Grid size along axis |
ct.num_tiles(arr, axis, shape) |
Number of tiles along axis |
| Construct | Description |
|---|---|
if/elseif/else |
Conditional branching |
for i in start:stop |
Counted loops (compiled to Tile IR ForOp) |
for i in start:step:stop |
Stepped loops |
while cond ... end |
While loops |
ifelse.(cond, x, y) |
Element-wise conditional selection |
Standard Julia control flow works inside kernels and is compiled to structured Tile IR operations.
| Operation | Description |
|---|---|
+, - |
Element-wise (same shape only) |
tile * scalar, tile / scalar |
Scalar multiply/divide |
.+, .-, .*, ./, .^ |
Broadcasting element-wise |
| Operation | Description |
|---|---|
zeros(T, dims...) |
Zero-filled tile (Base overlay) |
ones(T, dims...) |
One-filled tile (Base overlay) |
fill(value, dims...) |
Constant-filled tile (Base overlay) |
ct.arange(shape, T) / ct.arange(n, T) |
Sequence [1, 2, 3, ..., n] |
| Operation | Description |
|---|---|
ct.broadcast_to(tile, shape) |
Broadcast to target shape |
transpose(tile) |
Transpose 2D tile |
reshape(tile, shape) |
Reshape (same element count) |
permutedims(tile, perm) |
Permute dimensions |
ct.extract(tile, index, shape) |
Extract sub-tile |
ct.cat((a, b), axis) |
Concatenate tiles |
dropdims(tile; dims) |
Remove singleton dimensions |
| Operation | Description |
|---|---|
a * b |
Matrix multiplication: a @ b |
muladd(a, b, acc) |
Matrix multiply-accumulate: a * b + acc |
| Operation | Description |
|---|---|
map(f, tiles...) |
Apply function element-wise (same shape) |
f.(tiles...), broadcast(f, tiles...) |
Apply function with shape broadcasting |
reduce(f, tile; dims, init) |
Reduction with arbitrary function |
mapreduce(f, op, tile; dims, init) |
Map then reduce |
accumulate(f, tile; dims, init, rev) |
Scan/prefix-sum with arbitrary function |
| Operation | Description |
|---|---|
sum(tile; dims) |
Sum along axis |
prod(tile; dims) |
Product along axis |
maximum(tile; dims) |
Maximum along axis |
minimum(tile; dims) |
Minimum along axis |
any(tile; dims) |
Logical OR along axis |
all(tile; dims) |
Logical AND along axis |
count(tile; dims) |
Count true elements along axis |
argmax(tile; dims) |
1-based index of maximum along axis |
argmin(tile; dims) |
1-based index of minimum along axis |
cumsum(tile; dims, rev) |
Cumulative sum |
cumprod(tile; dims, rev) |
Cumulative product |
| Operation | Description |
|---|---|
sqrt.(tile) |
Square root |
rsqrt.(tile) |
Reciprocal square root |
exp.(tile) |
Natural exponential |
exp2.(tile) |
Base-2 exponential |
log.(tile) |
Natural logarithm |
log2.(tile) |
Base-2 logarithm |
sin.(tile), cos.(tile), tan.(tile) |
Trigonometric functions |
sinh.(tile), cosh.(tile), tanh.(tile) |
Hyperbolic functions |
fma.(a, b, c) |
Fused multiply-add |
abs.(tile) |
Absolute value |
isnan.(tile) |
NaN test |
max(a, b), min(a, b) |
Maximum/minimum (scalars) |
ceil.(tile), floor.(tile) |
Rounding |
ct.@fpmode rounding_mode=ct.Rounding.Approx flush_to_zero=true begin ... end |
Scoped FP rounding mode and flush-to-zero |
| Operation | Description |
|---|---|
.<, .>, .<=, .>= |
Element-wise comparisons (return Tile{Bool}) |
.==, .!= |
Element-wise equality |
ifelse.(cond, x, y) |
Conditional selection |
| Operation | Description |
|---|---|
convert(Tile{T}, tile) |
Convert element type |
T.(tile) |
Broadcasting conversion (e.g. Float16.(tile)) |
| Operation | Description |
|---|---|
cld(a, b) |
Ceiling division |
fld(a, b) |
Floor division |
div(a, b) |
Truncating division |
mul_hi.(tile_a, tile_b), mul_hi(x, y) |
High bits of integer multiply (use Base.mul_hi on Julia 1.13+) |
| Operation | Description |
|---|---|
arr[i, j, ...] |
Load scalar element from TileArray |
arr[i, j, ...] = val |
Store scalar element to TileArray |
tile[i, j, ...] |
Extract scalar from Tile |
setindex(tile, val, i, j, ...) |
Return new Tile with element replaced |
| Operation | Description |
|---|---|
ct.atomic_cas(arr, idx, expected, desired; ...) |
Compare-and-swap |
ct.atomic_xchg(arr, idx, val; ...) |
Exchange |
ct.atomic_add(arr, idx, val; ...) |
Atomic add |
ct.atomic_max(arr, idx, val; ...) |
Atomic max |
ct.atomic_min(arr, idx, val; ...) |
Atomic min |
ct.atomic_or(arr, idx, val; ...) |
Atomic bitwise OR |
ct.atomic_and(arr, idx, val; ...) |
Atomic bitwise AND |
ct.atomic_xor(arr, idx, val; ...) |
Atomic bitwise XOR |
All atomics accept memory_order (default: ct.MemoryOrder.AcqRel) and
memory_scope (default: ct.MemScope.Device) keyword arguments.
ct.@compiler_options sets optimization hints inside a kernel function body:
function matmul(A, B, C, ...)
ct.@compiler_options num_ctas=ct.ByTarget(v"10.0" => 2) occupancy=8
...
end| Option | Description | Valid values |
|---|---|---|
num_ctas |
Number of CTAs in a CGA | Powers of 2 |
occupancy |
Target concurrent CTAs per SM | 1–32 |
opt_level |
Optimization level | 0–3 |
Values can be plain scalars or ct.ByTarget(...) for per-architecture dispatch.
ByTarget maps compute capabilities to values, with an optional default:
ct.@compiler_options num_ctas=ct.ByTarget(v"10.0" => 4, v"12.0" => 2; default=1)Hints can also be passed as keyword arguments to ct.launch or ct.code_tiled,
which take precedence over @compiler_options.
ct.load and ct.store accept optional keyword arguments that influence memory
traffic scheduling:
| Hint | Description |
|---|---|
latency |
DRAM traffic weight hint, integer 1 (low) to 10 (high). Default: compiler-inferred. |
allow_tma |
Whether to allow Tensor Memory Accelerator lowering. Default: allowed. |
| Operation | Description |
|---|---|
print(args...) |
Print values (Base overlay) |
println(args...) |
Print values with newline (Base overlay) |
ct.@assert cond [msg] |
Abort kernel if condition is false |
Standard Julia print/println work inside kernels. String constants and tiles
can be mixed freely; format specifiers are inferred from element types at compile
time. String interpolation is supported.
println("Block ", ct.bid(1), ": tile=", tile)
println("result=$result") # string interpolation
ct.@assert idx <= n "index out of bounds"These are debugging aids and are not optimized for performance.
Beyond ct.code_tiled and ct.@code_tiled shown above, cuTile.jl provides
@device_code_* macros that intercept compilation during ct.launch:
ct.@device_code_tiled ct.launch(vadd, grid, a, b, c, ct.Constant(16))
ct.@device_code_typed ct.launch(vadd, grid, a, b, c, ct.Constant(16))
ct.@device_code_structured ct.launch(vadd, grid, a, b, c, ct.Constant(16))| Macro | Output |
|---|---|
ct.@device_code_tiled |
Final Tile IR (MLIR textual format) |
ct.@device_code_typed |
Typed Julia IR after overlay resolution |
ct.@device_code_structured |
Structured IR (after control-flow structurization) |
cuTile.jl follows Julia conventions, which differ from the Python API in several ways:
Kernels don't need a decorator, but do have to return nothing:
# Python
@ct.kernel
def vadd(a, b, c):
pid = ct.bid(0)
a_tile = ct.load(a, index=(pid,), shape=(16,))
b_tile = ct.load(b, index=(pid,), shape=(16,))
result = a_tile + b_tile
ct.store(c, index=(pid, ), tile=result)# Julia
function vadd(a, b, c)
pid = ct.bid(1)
a_tile = ct.load(a; index=pid, shape=(16,))
b_tile = ct.load(b; index=pid, shape=(16,))
result = a_tile + b_tile
ct.store(c; index=pid, tile=result)
return
endPython passes optimization hints as @ct.kernel decorator arguments. Julia uses
ct.@compiler_options inside the function body (like @inline). See
Performance Tuning for full details.
# Python
@ct.kernel(num_ctas=ct.ByTarget(sm_100=2), occupancy=8)
def matmul(A, B, C, ...):
...# Julia
function matmul(A, B, C, ...)
ct.@compiler_options num_ctas=ct.ByTarget(v"10.0" => 2) occupancy=8
...
endcuTile.jl implicitly uses the current task-bound stream from CUDA.jl:
# Python
import cupy as cp
ct.launch(cp.cuda.get_current_stream(), grid, vadd, (a, b, c))# Julia
ct.launch(vadd, grid, a, b, c)All index-based operations use Julia's 1-based convention:
# Python
bid_x = ct.bid(0)
bid_y = ct.bid(1)
permutedims(tile, (2, 0, 1))# Julia
bid_x = ct.bid(1)
bid_y = ct.bid(2)
permutedims(tile, (3, 1, 2))This applies to bid, num_blocks, permutedims, reshape, dimension arguments, etc.
Python annotates constant parameters in the kernel signature and passes plain values at launch. Julia is the reverse: kernel signatures use plain types, and constants are wrapped at launch:
# Python
@ct.kernel
def kernel(a, b, tile_size: ct.Constant[int]):
tile = ct.load(a, index=(0,), shape=(tile_size,))
ct.launch(stream, grid, kernel, (a, b, 16))# Julia
function kernel(a, b, tile_size::Int)
tile = ct.load(a; index=1, shape=(tile_size,))
end
ct.launch(kernel, grid, a, b, ct.Constant(16))ct.Constant arguments generate no kernel parameter; the value is embedded directly in
the compiled code. Different constant values produce different kernel specializations.
Python's operators and math functions work directly on tiles with automatic broadcasting.
Julia cuTile follows standard Julia conventions: operators and math functions apply to
scalars, while element-wise application requires broadcast syntax (.+, exp.(...), etc).
map(f, tiles...) applies an arbitrary function element-wise to tiles of the same shape.
Broadcast syntax (.+, f.(x, y), etc.) combines map with automatic shape broadcasting,
so any function that works on scalars "just works" when broadcast over tiles.
Some non-broadcast shortcuts:
- Scaling operations (
*and/) can be applied directly to tiles and scalars. - Addition and subtraction can be applied directly to tiles with matching shapes.
# Python
a + b # Automatically broadcasts (16,) + (1, 16) → (1, 16)
a * b # Element-wise multiply
result = ct.exp(tile)# Julia
a + b # Same shape only
a .+ b # Broadcasts different shapes
a .* b # Element-wise multiply (broadcast)
a * b # Matrix multiplication
tile * 2.0f0 # Scalar multiply
result = exp.(tile)
map(x -> x * x, tile) # map with arbitrary lambdaPython reductions (ct.sum, ct.max, etc.) drop the reduced dimension by default (keepdims=False). Julia reductions (sum, maximum, etc.) always keep it as size 1 (matching Base semantics). Use dropdims to remove singleton dims afterward.
# Python
result = ct.sum(tile, axis=1) # (M, N) → (M,)
result = ct.sum(tile, axis=1, keepdims=True) # (M, N) → (M, 1)# Julia
result = sum(tile; dims=2) # (M, N) → (M, 1)
result = dropdims(sum(tile; dims=2); dims=2) # (M, N) → (M,)ct.load and ct.store automatically match the tile rank to that of the target:
- Lower rank: trailing
1s are appended. Loading(M, N)from a 4D array internally uses(M, N, 1, 1). Storing a scalar tile into a 2D array pads to(1, 1). - Higher rank: trailing
1s are stripped. Storing(M, 1)into a 1D array reshapes to(M,). Non-trailing singletons (e.g., fromsum(tile; dims=1)) require explicitdropdims.
cuTile.jl uses Julia's standard left-aligned broadcast shape rules: dimensions are matched starting from the first (leftmost) dimension. cuTile Python uses NumPy-style right-aligned rules, where dimensions are matched from the last (rightmost) dimension.
This means a 1D (N,) tile cannot broadcast with a 2D (M, N) tile in Julia, because
dimension 1 has size N vs M. In NumPy/Python, (N,) would be right-aligned to (1, N)
and broadcast to (M, N).
Use reshape to get the desired alignment, just as with regular Julia arrays:
# Julia: explicitly reshape to align dimensions
a = ct.load(...) # (N,)
b = ct.load(...) # (M, N)
result = reshape(a, (1, N)) .+ b # (1, N) .+ (M, N) → (M, N)cuTile Python represents single-element loads as 0-D tiles (shape=()), which can be used
directly as indices. cuTile.jl uses Julia's standard indexing syntax instead — getindex
returns a scalar T and setindex! stores a scalar:
# Python
expert_id = ct.load(ids, index=bid_m, shape=())
b = ct.load(B, (expert_id, k, bid_n), shape=(1, TILE_K, TILE_N))# Julia
expert_id = ids[bid_m]
b = ct.load(B; index=(expert_id, k, bid_n), shape=(1, TILE_K, TILE_N))cuTile kernels cannot throw Julia exceptions. Operations that would throw in standard Julia silently produce truncated or wrapped results instead:
- Float-to-integer conversions:
Int32(x),trunc(Int32, x), andround(Int32, x, RoundToZero)silently truncate toward zero rather than throwingInexactErrorfor non-integer or out-of-range values. Useunsafe_truncfor the explicit non-throwing primitive.
Use ct.@assert to add runtime checks in kernels (see Debugging above).
cuTile.jl also provides a limited set of host-level APIs to use cuTile without
writing custom kernels. For example, for element-wise operations on CuArrays,
cuTile can automatically generate and launch a fused kernel using Julia's
broadcast machinery:
using CUDA
import cuTile as ct
A = CUDA.rand(Float32, 1024)
B = CUDA.rand(Float32, 1024)
C = CUDA.zeros(Float32, 1024)
# Wrap arrays in Tiled() to route through cuTile
ct.Tiled(C) .= ct.Tiled(A) .+ ct.Tiled(B)
# Or use the @. macro for convenience
ct.@. C = A + sin(B)
# Allocating form (returns a new CuArray)
D = ct.@. A + BThe entire broadcast expression is fused into a single cuTile kernel. Tile sizes are automatically chosen based on array dimensions (power-of-2, budget-based). Works with 1D through N-dimensional arrays.
cuTile.jl is inspired by cuTile-Python, licensed under Apache 2.0 by NVIDIA Corporation & Affiliates.
The IRStructurizer component is based on SPIRV.jl by Cédric Belmant.