cuTile.jl

A Julia package for writing GPU kernels using NVIDIA's tile-based programming model.

This package is in beta. Most Tile IR features are implemented and the package has been verified on the benchmarks and tests included in the repository. Interfaces and APIs may still change without notice.

Installation

Install just like any other Julia package:

julia> using Pkg
julia> Pkg.add("cuTile")

Execution of cuTile kernels requires CUDA.jl to be installed and imported. cuTile generates kernels based on Tile IR, which requires an NVIDIA Driver that supports CUDA 13 (580 or later). CUDA.jl automatically downloads the appropriate CUDA toolkit artifacts, so no manual CUDA installation is needed. Only Ampere, Ada, and Blackwell GPUs are supported at this time, with Hopper support expected in a future release of CUDA.

Quick Start

A simple vector addition kernel using cuTile looks like this:

using CUDA
import cuTile as ct

# Define kernel
function vadd(a, b, c, tile_size::Int)
    pid = ct.bid(1)
    tile_a = ct.load(a; index=pid, shape=(tile_size,))
    tile_b = ct.load(b; index=pid, shape=(tile_size,))
    ct.store(c; index=pid, tile=tile_a + tile_b)
    return
end

# Launch
vector_size = 2^20
tile_size = 16

blocks = cld(vector_size, tile_size)
grid = (blocks, 1, 1)

a, b = CUDA.rand(Float32, vector_size), CUDA.rand(Float32, vector_size)
c = CUDA.zeros(Float32, vector_size)

ct.launch(vadd, grid, a, b, c, ct.Constant(tile_size))

@assert c == a .+ b

Inspecting Generated Tile IR

The generated Tile IR can be inspected using the code_tiled function:

ct.code_tiled(vadd, Tuple{ct.TileArray{Float32, 1, ct.ArraySpec{1}(128, true, (0,), (32,))},
                          ct.TileArray{Float32, 1, ct.ArraySpec{1}(128, true, (0,), (32,))},
                          ct.TileArray{Float32, 1, ct.ArraySpec{1}(128, true, (0,), (32,))},
                          ct.Constant{Int64, 16}})

Since these types can be verbose, and are derived from the runtime properties of arrays, it's often easier to use the @code_tiled macro instead:

julia> ct.@code_tiled ct.launch(vadd, (cld(vector_size, tile_size), 1, 1), a, b, c, ct.Constant(tile_size))
// vadd(cuTile.TileArray{Float32, 1, cuTile.ArraySpec{1}(128, true, (0,), (32,))}, cuTile.TileArray{Float32, 1, cuTile.ArraySpec{1}(128, true, (0,), (32,))}, cuTile.TileArray{Float32, 1, cuTile.ArraySpec{1}(128, true, (0,), (32,))}, cuTile.Constant{Int64, 16})

cuda_tile.module @kernels {
  entry @vadd(...) {
    ...
    return
  }
}

The former form can be useful on systems without a GPU, since it does not require CUDA.jl, while the latter needs valid CuArrays to be passed to the kernel.

Performance

Run benchmarks with:

julia --project=examples examples/benchmarks.jl  # Julia
uv run python examples/benchmarks.py             # Python (for comparison)

Benchmarks comparing cuTile.jl against cuTile Python on an RTX 5080 (tileiras 13.2.51, 20 runs, 5 warmup, min time reported):

Kernel	Size	Julia	Python	Status
Vector Addition	2^27 f32	841 GB/s	845 GB/s	OK (=)
Matrix Transpose	8192² f32	805 GB/s	811 GB/s	OK (-1%)
Layer Norm fwd	4096² f32	925 GB/s	722 GB/s	+28%*
Layer Norm bwd	4096² f32	243 GB/s	251 GB/s	-3%
Matrix Multiplication	4096³ f32	46.9 TFLOPS	43.4 TFLOPS	+8%**
Batch Matrix Multiply	1024×512×2048 ×8 f32	33.6 TFLOPS	30.9 TFLOPS	+9%**
FFT (3-stage Cooley-Tukey)	1024-pt ×64 c64	3263 μs	3127 μs	-4%
Mixture of Experts	256tok 1024h 32e 2048i f16	19.3 TFLOPS	20.3 TFLOPS	-5%
Attention (FMHA)	8×16×1024² ×64 f16 causal	88.5 TFLOPS	61.6 TFLOPS	+44%***

* The pow(x, 2) → mulf(x, x) strength reduction eliminates the expensive transcendental in the variance computation. Python still emits pow.

** Likely because Julia's for loop guards give tileiras a guarantee that the loop body executes at least once, enabling more aggressive warp scheduling.

*** Likely due to Python's compiler splitting the causal masking loop into two loops, duplicating the loop body. Julia emits a single loop with a conditional.

Supported Operations

cuTile.jl aims to expose as much functionality as possible through Julia-native constructs (+, sum, reshape, broadcast, etc.) rather than cuTile-specific functions. Operations prefixed with ct. are cuTile intrinsics with no direct Julia equivalent; everything else uses standard Julia syntax and is overlaid on Base.

Supported Types

Integers: Int8, UInt8, Int16, UInt16, Int32, UInt32, Int64, UInt64 Floats: Float16, BFloat16, Float32, Float64, TFloat32 Boolean: Bool

TFloat32 is a 32-bit floating-point type with reduced mantissa precision (10 bits), optimized for tensor core operations.

Memory

Operation	Description
`ct.load(arr; index, shape, ...)`	Load a tile from array
`ct.store(arr; index, tile, ...)`	Store a tile to array
`ct.gather(arr, indices; ...)`	Gather elements by index tile
`ct.scatter(arr, indices, tile; ...)`	Scatter elements by index tile

load and store accept keyword arguments order, padding_mode, latency, and allow_tma. gather accepts mask, padding_value, check_bounds, and latency. scatter accepts mask, check_bounds, and latency.

# Gather with user mask and custom padding for masked-out elements
tile = ct.gather(arr, indices; mask=valid_mask, padding_value=-1.0f0)

# Scatter with mask (only write where mask is true)
ct.scatter(arr, indices, tile; mask=active_mask)

Grid

Operation	Description
`ct.bid(axis)`	Block ID (1=x, 2=y, 3=z)
`ct.num_blocks(axis)`	Grid size along axis
`ct.num_tiles(arr, axis, shape)`	Number of tiles along axis

Control Flow

Construct	Description
`if`/`elseif`/`else`	Conditional branching
`for i in start:stop`	Counted loops (compiled to Tile IR ForOp)
`for i in start:step:stop`	Stepped loops
`while cond ... end`	While loops
`ifelse.(cond, x, y)`	Element-wise conditional selection

Standard Julia control flow works inside kernels and is compiled to structured Tile IR operations.

Arithmetic

Operation	Description
`+`, `-`	Element-wise (same shape only)
`tile * scalar`, `tile / scalar`	Scalar multiply/divide
`.+`, `.-`, `.*`, `./`, `.^`	Broadcasting element-wise

Construction

Operation	Description
`zeros(T, dims...)`	Zero-filled tile (Base overlay)
`ones(T, dims...)`	One-filled tile (Base overlay)
`fill(value, dims...)`	Constant-filled tile (Base overlay)
`ct.arange(shape, T)` / `ct.arange(n, T)`	Sequence `[1, 2, 3, ..., n]`

Shape

Operation	Description
`ct.broadcast_to(tile, shape)`	Broadcast to target shape
`transpose(tile)`	Transpose 2D tile
`reshape(tile, shape)`	Reshape (same element count)
`permutedims(tile, perm)`	Permute dimensions
`ct.extract(tile, index, shape)`	Extract sub-tile
`ct.cat((a, b), axis)`	Concatenate tiles
`dropdims(tile; dims)`	Remove singleton dimensions

Matrix

Operation	Description
`a * b`	Matrix multiplication: `a @ b`
`muladd(a, b, acc)`	Matrix multiply-accumulate: `a * b + acc`

Higher-Order Functions

Operation	Description
`map(f, tiles...)`	Apply function element-wise (same shape)
`f.(tiles...)`, `broadcast(f, tiles...)`	Apply function with shape broadcasting
`reduce(f, tile; dims, init)`	Reduction with arbitrary function
`mapreduce(f, op, tile; dims, init)`	Map then reduce
`accumulate(f, tile; dims, init, rev)`	Scan/prefix-sum with arbitrary function

Reductions

Operation	Description
`sum(tile; dims)`	Sum along axis
`prod(tile; dims)`	Product along axis
`maximum(tile; dims)`	Maximum along axis
`minimum(tile; dims)`	Minimum along axis
`any(tile; dims)`	Logical OR along axis
`all(tile; dims)`	Logical AND along axis
`count(tile; dims)`	Count `true` elements along axis
`argmax(tile; dims)`	1-based index of maximum along axis
`argmin(tile; dims)`	1-based index of minimum along axis
`cumsum(tile; dims, rev)`	Cumulative sum
`cumprod(tile; dims, rev)`	Cumulative product

Math

Operation	Description
`sqrt.(tile)`	Square root
`rsqrt.(tile)`	Reciprocal square root
`exp.(tile)`	Natural exponential
`exp2.(tile)`	Base-2 exponential
`log.(tile)`	Natural logarithm
`log2.(tile)`	Base-2 logarithm
`sin.(tile)`, `cos.(tile)`, `tan.(tile)`	Trigonometric functions
`sinh.(tile)`, `cosh.(tile)`, `tanh.(tile)`	Hyperbolic functions
`fma.(a, b, c)`	Fused multiply-add
`abs.(tile)`	Absolute value
`isnan.(tile)`	NaN test
`max(a, b)`, `min(a, b)`	Maximum/minimum (scalars)
`ceil.(tile)`, `floor.(tile)`	Rounding
`ct.@fpmode rounding_mode=ct.Rounding.Approx flush_to_zero=true begin ... end`	Scoped FP rounding mode and flush-to-zero

Comparison

Operation	Description
`.<`, `.>`, `.<=`, `.>=`	Element-wise comparisons (return `Tile{Bool}`)
`.==`, `.!=`	Element-wise equality
`ifelse.(cond, x, y)`	Conditional selection

Type Conversion

Operation	Description
`convert(Tile{T}, tile)`	Convert element type
`T.(tile)`	Broadcasting conversion (e.g. `Float16.(tile)`)

Integer Arithmetic

Operation	Description
`cld(a, b)`	Ceiling division
`fld(a, b)`	Floor division
`div(a, b)`	Truncating division
`mul_hi.(tile_a, tile_b)`, `mul_hi(x, y)`	High bits of integer multiply (use `Base.mul_hi` on Julia 1.13+)

Indexing

Operation	Description
`arr[i, j, ...]`	Load scalar element from `TileArray`
`arr[i, j, ...] = val`	Store scalar element to `TileArray`
`tile[i, j, ...]`	Extract scalar from `Tile`
`setindex(tile, val, i, j, ...)`	Return new `Tile` with element replaced

Atomics

Operation	Description
`ct.atomic_cas(arr, idx, expected, desired; ...)`	Compare-and-swap
`ct.atomic_xchg(arr, idx, val; ...)`	Exchange
`ct.atomic_add(arr, idx, val; ...)`	Atomic add
`ct.atomic_max(arr, idx, val; ...)`	Atomic max
`ct.atomic_min(arr, idx, val; ...)`	Atomic min
`ct.atomic_or(arr, idx, val; ...)`	Atomic bitwise OR
`ct.atomic_and(arr, idx, val; ...)`	Atomic bitwise AND
`ct.atomic_xor(arr, idx, val; ...)`	Atomic bitwise XOR

All atomics accept memory_order (default: ct.MemoryOrder.AcqRel) and memory_scope (default: ct.MemScope.Device) keyword arguments.

Performance Tuning

Kernel configuration

ct.@compiler_options sets optimization hints inside a kernel function body:

function matmul(A, B, C, ...)
    ct.@compiler_options num_ctas=ct.ByTarget(v"10.0" => 2) occupancy=8
    ...
end

Option	Description	Valid values
`num_ctas`	Number of CTAs in a CGA	Powers of 2
`occupancy`	Target concurrent CTAs per SM	1–32
`opt_level`	Optimization level	0–3

Values can be plain scalars or ct.ByTarget(...) for per-architecture dispatch. ByTarget maps compute capabilities to values, with an optional default:

ct.@compiler_options num_ctas=ct.ByTarget(v"10.0" => 4, v"12.0" => 2; default=1)

Hints can also be passed as keyword arguments to ct.launch or ct.code_tiled, which take precedence over @compiler_options.

Load/store hints

ct.load and ct.store accept optional keyword arguments that influence memory traffic scheduling:

Hint	Description
`latency`	DRAM traffic weight hint, integer 1 (low) to 10 (high). Default: compiler-inferred.
`allow_tma`	Whether to allow Tensor Memory Accelerator lowering. Default: allowed.

Debugging

Operation	Description
`print(args...)`	Print values (Base overlay)
`println(args...)`	Print values with newline (Base overlay)
`ct.@assert cond [msg]`	Abort kernel if condition is false

Standard Julia print/println work inside kernels. String constants and tiles can be mixed freely; format specifiers are inferred from element types at compile time. String interpolation is supported.

println("Block ", ct.bid(1), ": tile=", tile)
println("result=$result")  # string interpolation
ct.@assert idx <= n "index out of bounds"

These are debugging aids and are not optimized for performance.

Code Inspection

Beyond ct.code_tiled and ct.@code_tiled shown above, cuTile.jl provides @device_code_* macros that intercept compilation during ct.launch:

ct.@device_code_tiled ct.launch(vadd, grid, a, b, c, ct.Constant(16))
ct.@device_code_typed ct.launch(vadd, grid, a, b, c, ct.Constant(16))
ct.@device_code_structured ct.launch(vadd, grid, a, b, c, ct.Constant(16))

Macro	Output
`ct.@device_code_tiled`	Final Tile IR (MLIR textual format)
`ct.@device_code_typed`	Typed Julia IR after overlay resolution
`ct.@device_code_structured`	Structured IR (after control-flow structurization)

Differences from cuTile Python

cuTile.jl follows Julia conventions, which differ from the Python API in several ways:

Kernel definition syntax

Kernels don't need a decorator, but do have to return nothing:

# Python
@ct.kernel
def vadd(a, b, c):
    pid = ct.bid(0)

    a_tile = ct.load(a, index=(pid,), shape=(16,))
    b_tile = ct.load(b, index=(pid,), shape=(16,))
    result = a_tile + b_tile
    ct.store(c, index=(pid, ), tile=result)

# Julia
function vadd(a, b, c)
    pid = ct.bid(1)

    a_tile = ct.load(a; index=pid, shape=(16,))
    b_tile = ct.load(b; index=pid, shape=(16,))
    result = a_tile + b_tile
    ct.store(c; index=pid, tile=result)

    return
end

Optimization hints

Python passes optimization hints as @ct.kernel decorator arguments. Julia uses ct.@compiler_options inside the function body (like @inline). See Performance Tuning for full details.

# Python
@ct.kernel(num_ctas=ct.ByTarget(sm_100=2), occupancy=8)
def matmul(A, B, C, ...):
    ...

# Julia
function matmul(A, B, C, ...)
    ct.@compiler_options num_ctas=ct.ByTarget(v"10.0" => 2) occupancy=8
    ...
end

Launch Syntax

cuTile.jl implicitly uses the current task-bound stream from CUDA.jl:

# Python
import cupy as cp
ct.launch(cp.cuda.get_current_stream(), grid, vadd, (a, b, c))

# Julia
ct.launch(vadd, grid, a, b, c)

1-Based Indexing

All index-based operations use Julia's 1-based convention:

# Python
bid_x = ct.bid(0)
bid_y = ct.bid(1)
permutedims(tile, (2, 0, 1))

# Julia
bid_x = ct.bid(1)
bid_y = ct.bid(2)
permutedims(tile, (3, 1, 2))

This applies to bid, num_blocks, permutedims, reshape, dimension arguments, etc.

Compile-time constants

Python annotates constant parameters in the kernel signature and passes plain values at launch. Julia is the reverse: kernel signatures use plain types, and constants are wrapped at launch:

# Python
@ct.kernel
def kernel(a, b, tile_size: ct.Constant[int]):
    tile = ct.load(a, index=(0,), shape=(tile_size,))

ct.launch(stream, grid, kernel, (a, b, 16))

# Julia
function kernel(a, b, tile_size::Int)
    tile = ct.load(a; index=1, shape=(tile_size,))
end

ct.launch(kernel, grid, a, b, ct.Constant(16))

ct.Constant arguments generate no kernel parameter; the value is embedded directly in the compiled code. Different constant values produce different kernel specializations.

Broadcasting and Math Functions

Python's operators and math functions work directly on tiles with automatic broadcasting. Julia cuTile follows standard Julia conventions: operators and math functions apply to scalars, while element-wise application requires broadcast syntax (.+, exp.(...), etc).

map(f, tiles...) applies an arbitrary function element-wise to tiles of the same shape. Broadcast syntax (.+, f.(x, y), etc.) combines map with automatic shape broadcasting, so any function that works on scalars "just works" when broadcast over tiles.

Some non-broadcast shortcuts:

Scaling operations (* and /) can be applied directly to tiles and scalars.
Addition and subtraction can be applied directly to tiles with matching shapes.

# Python
a + b              # Automatically broadcasts (16,) + (1, 16) → (1, 16)
a * b              # Element-wise multiply
result = ct.exp(tile)

# Julia
a + b              # Same shape only
a .+ b             # Broadcasts different shapes
a .* b             # Element-wise multiply (broadcast)
a * b              # Matrix multiplication
tile * 2.0f0       # Scalar multiply
result = exp.(tile)
map(x -> x * x, tile)  # map with arbitrary lambda

Reductions

Python reductions (ct.sum, ct.max, etc.) drop the reduced dimension by default (keepdims=False). Julia reductions (sum, maximum, etc.) always keep it as size 1 (matching Base semantics). Use dropdims to remove singleton dims afterward.

# Python
result = ct.sum(tile, axis=1)           # (M, N) → (M,)
result = ct.sum(tile, axis=1, keepdims=True)  # (M, N) → (M, 1)

# Julia
result = sum(tile; dims=2)              # (M, N) → (M, 1)
result = dropdims(sum(tile; dims=2); dims=2)  # (M, N) → (M,)

Automatic rank matching

ct.load and ct.store automatically match the tile rank to that of the target:

Lower rank: trailing 1s are appended. Loading (M, N) from a 4D array internally uses (M, N, 1, 1). Storing a scalar tile into a 2D array pads to (1, 1).
Higher rank: trailing 1s are stripped. Storing (M, 1) into a 1D array reshapes to (M,). Non-trailing singletons (e.g., from sum(tile; dims=1)) require explicit dropdims.

Broadcasting shape alignment

cuTile.jl uses Julia's standard left-aligned broadcast shape rules: dimensions are matched starting from the first (leftmost) dimension. cuTile Python uses NumPy-style right-aligned rules, where dimensions are matched from the last (rightmost) dimension.

This means a 1D (N,) tile cannot broadcast with a 2D (M, N) tile in Julia, because dimension 1 has size N vs M. In NumPy/Python, (N,) would be right-aligned to (1, N) and broadcast to (M, N).

Use reshape to get the desired alignment, just as with regular Julia arrays:

# Julia: explicitly reshape to align dimensions
a = ct.load(...)              # (N,)
b = ct.load(...)              # (M, N)
result = reshape(a, (1, N)) .+ b  # (1, N) .+ (M, N) → (M, N)

Scalar access and 0-D tiles

cuTile Python represents single-element loads as 0-D tiles (shape=()), which can be used directly as indices. cuTile.jl uses Julia's standard indexing syntax instead — getindex returns a scalar T and setindex! stores a scalar:

# Python
expert_id = ct.load(ids, index=bid_m, shape=())
b = ct.load(B, (expert_id, k, bid_n), shape=(1, TILE_K, TILE_N))

# Julia
expert_id = ids[bid_m]
b = ct.load(B; index=(expert_id, k, bid_n), shape=(1, TILE_K, TILE_N))

Differences from Julia

Some operations are non-throwing

cuTile kernels cannot throw Julia exceptions. Operations that would throw in standard Julia silently produce truncated or wrapped results instead:

Float-to-integer conversions: Int32(x), trunc(Int32, x), and round(Int32, x, RoundToZero) silently truncate toward zero rather than throwing InexactError for non-integer or out-of-range values. Use unsafe_trunc for the explicit non-throwing primitive.

Use ct.@assert to add runtime checks in kernels (see Debugging above).

Host-level operations

cuTile.jl also provides a limited set of host-level APIs to use cuTile without writing custom kernels. For example, for element-wise operations on CuArrays, cuTile can automatically generate and launch a fused kernel using Julia's broadcast machinery:

using CUDA
import cuTile as ct

A = CUDA.rand(Float32, 1024)
B = CUDA.rand(Float32, 1024)
C = CUDA.zeros(Float32, 1024)

# Wrap arrays in Tiled() to route through cuTile
ct.Tiled(C) .= ct.Tiled(A) .+ ct.Tiled(B)

# Or use the @. macro for convenience
ct.@. C = A + sin(B)

# Allocating form (returns a new CuArray)
D = ct.@. A + B

The entire broadcast expression is fused into a single cuTile kernel. Tile sizes are automatically chosen based on array dimensions (power-of-2, budget-based). Works with 1D through N-dimensional arrays.

Acknowledgments

cuTile.jl is inspired by cuTile-Python, licensed under Apache 2.0 by NVIDIA Corporation & Affiliates.

The IRStructurizer component is based on SPIRV.jl by Cédric Belmant.

Name		Name	Last commit message	Last commit date
Latest commit History 490 Commits
.buildkite		.buildkite
.github/workflows		.github/workflows
examples		examples
ext		ext
src		src
test		test
.gitignore		.gitignore
LICENSE.md		LICENSE.md
Project.toml		Project.toml
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

cuTile.jl

Installation

Quick Start

Inspecting Generated Tile IR

Performance

Supported Operations

Supported Types

Memory

Grid

Control Flow

Arithmetic

Construction

Shape

Matrix

Higher-Order Functions

Reductions

Math

Comparison

Type Conversion

Integer Arithmetic

Indexing

Atomics

Performance Tuning

Kernel configuration

Load/store hints

Debugging

Code Inspection

Differences from cuTile Python

Kernel definition syntax

Optimization hints

Launch Syntax

1-Based Indexing

Compile-time constants

Broadcasting and Math Functions

Reductions

Automatic rank matching

Broadcasting shape alignment

Scalar access and 0-D tiles

Differences from Julia

Some operations are non-throwing

Host-level operations

Acknowledgments

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Contributors

Uh oh!

Languages