
Add space-based CUDA kernel fusion for copyto #2482

Draft
petebachant wants to merge 25 commits into main from pb/fieldname-set

Conversation

@petebachant (Member)

This is a 13% bump in SYPD for the prog EDMF 1M Atmos config (no land).

Kernel analysis:

(image: kernel analysis screenshot)

@petebachant petebachant moved this to In review in Performance Apr 6, 2026
@petebachant petebachant moved this from In review to In progress in Performance Apr 6, 2026
@petebachant petebachant marked this pull request as draft April 6, 2026 17:02
@petebachant (Member, Author)

Interestingly, this shows no improvement when running the prog EDMF 1M config in the coupler: https://buildkite.com/clima/climacore-end-to-end-performance/builds/142/steps/canvas?sid=019d6395-829b-4900-93c5-cea1aca53baf&tab=output

@petebachant (Member, Author)

@dennisYatunin this may be interesting to you. When running ClimaAtmos on its own, this change speeds things up significantly, but when run in the coupler with the exact same Atmos config, we hit the fallback condition, i.e., the compiler fails to fuse the kernels.

@dennisYatunin (Member)

dennisYatunin commented Apr 10, 2026

Wow, yeah, that is very interesting! Running the same function from ClimaCoupler should only add a couple of stack frames on top of running it in ClimaAtmos, but I suppose this shows that your example is right on the edge of triggering a compiler heuristic and de-optimizing.

I think the simplest way to avoid this would be to directly call the ClimaAtmos implicit solver from the AMIP driver, forcing it to be compiled efficiently before it gets called inside ClimaCoupler's deeper stacktrace. That's definitely not a pattern we want to use frequently, and understanding how to avoid these compiler heuristics altogether would be a much more sustainable solution. But at least while we're still figuring things out you can use it, especially if that leads to such a big performance improvement.
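The warm-up pattern suggested above could look roughly like the following. This is only an illustration of the call-site idea: the stub functions stand in for the real ClimaAtmos/ClimaCoupler entry points, and the names `implicit_solve!`, `run_coupled_step!`, and `run_amip!` are hypothetical, not the actual API.

```julia
# Hypothetical sketch of the "call it from a shallow frame first" workaround.
# The stubs below stand in for the real ClimaAtmos implicit solver and the
# ClimaCoupler call path; only the calling pattern is the point here.
implicit_solve!(state) = (state[:calls] = get(state, :calls, 0) + 1; state)

# Deep coupled path: in the real code, many stack frames sit between the
# driver and the solver, which is what trips the compiler heuristic.
run_coupled_step!(state) = implicit_solve!(state)

function run_amip!(state)
    # Calling the solver once directly from the driver compiles its method
    # instance (and its GPU kernels) from a shallow stack; the coupled loop
    # then reuses the already-compiled code.
    implicit_solve!(state)
    run_coupled_step!(state)
    return state
end
```

This relies on Julia caching the compiled specialization of `implicit_solve!`, so the later, deeper call hits the same method instance rather than recompiling under the heuristic.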

@petebachant petebachant self-assigned this Apr 21, 2026
petebachant and others added 9 commits April 29, 2026 07:36
Also ignore materialize! for kernel renaming
…ldname-set

# Conflicts:
#	.buildkite/pipeline.yml
#	ext/cuda/cuda_utils.jl
#	test/gpu/latency_benchmarks.jl
Note that the launch configuration is not ideal at the moment: it basically needs the number of vertical levels to be a multiple of 32 and has an upper limit of 1024 threads per block.
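The constraint described in this commit message can be sketched numerically. This is an illustrative helper, not the actual implementation; `WARP_SIZE`, `MAX_THREADS_PER_BLOCK`, and `pcr_launch_threads` are assumed names.

```julia
# Illustrative sketch of the launch-configuration constraint: one thread
# per vertical level, rounded up to a multiple of the warp size (32),
# capped at the CUDA limit of 1024 threads per block.
const WARP_SIZE = 32
const MAX_THREADS_PER_BLOCK = 1024

function pcr_launch_threads(Nv::Int)
    threads = cld(Nv, WARP_SIZE) * WARP_SIZE  # round Nv up to a multiple of 32
    # A launch with more than 1024 threads per block is invalid, so signal
    # that the caller should fall back to another solver.
    threads <= MAX_THREADS_PER_BLOCK ? threads : nothing
end

# pcr_launch_threads(63)   == 64       (63 levels run on 64 threads)
# pcr_launch_threads(2048) === nothing (too many levels for one block)
```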

Note that since the solver may work with matrices that have AxisTensor
elements (with potentially multiple components), we need to use boxed
operators.

Co-authored-by: petebachant <[email protected]>
Co-authored-by: sjavis <[email protected]>
To capture the majority of tridiagonal solvers, we need to split the
`multiple_field_solver` if any tridiagonal matrix is present. Otherwise
the tridiagonal case may be hidden in a single kernel with
non-tridiagonal cases. This seems to happen for most instances of the
`multiple_field_solver` in the AMIP case, hence we effectively remove
this optimisation.

The new PCR solver may fail to launch the kernel if Nv (and hence the
number of threads per block) is too large. Fall back to the local-memory
solver to have stable (albeit less performant) behaviour for large
numbers of vertical levels.
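A minimal sketch of that fallback dispatch, with hypothetical stand-in names (the real solvers are GPU kernels, not these stubs):

```julia
# Illustrative fallback dispatch: use the fast PCR kernel when the column
# fits in one thread block, otherwise the slower but size-independent
# local-memory solver. Stubs stand in for the real kernels.
const MAX_THREADS_PER_BLOCK = 1024

pcr_solve!(x) = :pcr              # stand-in for the parallel cyclic reduction kernel
local_memory_solve!(x) = :local   # stand-in for the local-memory solver

solve_column!(x, Nv) =
    Nv <= MAX_THREADS_PER_BLOCK ? pcr_solve!(x) : local_memory_solve!(x)
```

Keeping the dispatch on the host side like this means an oversized Nv degrades performance gracefully instead of failing the kernel launch.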
The docstring of BlockDiagonalSolve is where the original solver was
described.
The boxed operators have been removed in the recent RecursiveApply
refactor. Instead, ordinary arithmetic operators can be used.
