
Add space-based CUDA kernel fusion for copyto #2482

Draft
petebachant wants to merge 25 commits into main from pb/fieldname-set

Conversation

@petebachant (Member)

This is a 13% bump in SYPD for the prog EDMF 1M Atmos config (no land).

Kernel analysis:

(image: kernel analysis screenshot)

@petebachant petebachant moved this to In review in Performance Apr 6, 2026
@petebachant petebachant moved this from In review to In progress in Performance Apr 6, 2026
@petebachant petebachant marked this pull request as draft April 6, 2026 17:02
@petebachant (Member, Author)

Interestingly, this shows no improvement when running the prog EDMF 1M config in the coupler: https://buildkite.com/clima/climacore-end-to-end-performance/builds/142/steps/canvas?sid=019d6395-829b-4900-93c5-cea1aca53baf&tab=output

@petebachant (Member, Author)

@dennisYatunin this may be interesting to you. When running ClimaAtmos on its own, this change speeds things up significantly, but when run in the coupler with the exact same Atmos config, we hit the fallback condition, i.e., the compiler fails to fuse the kernels.

@dennisYatunin (Member)

dennisYatunin commented Apr 10, 2026

Wow, yeah, that is very interesting! Running the same function from ClimaCoupler should only add a couple of stack frames on top of running it in ClimaAtmos, but I suppose this shows that your example is right on the edge of triggering a compiler heuristic and de-optimizing.

I think the simplest way to avoid this would be to directly call the ClimaAtmos implicit solver from the AMIP driver, forcing it to be compiled efficiently before it gets called inside ClimaCoupler's deeper stacktrace. That's definitely not a pattern we want to use frequently, and understanding how to avoid these compiler heuristics altogether would be a much more sustainable solution. But at least while we're still figuring things out you can use it, especially if that leads to such a big performance improvement.
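The warm-up pattern suggested above could look roughly like the following. This is only an illustration of the call-site idea: the stub functions stand in for the real ClimaAtmos/ClimaCoupler entry points, and the names `implicit_solve!`, `run_coupled_step!`, and `run_amip!` are hypothetical, not the actual API.

```julia
# Hypothetical sketch of the "call it from a shallow frame first" workaround.
# The stubs below stand in for the real ClimaAtmos implicit solver and the
# ClimaCoupler call path; only the calling pattern is the point here.
implicit_solve!(state) = (state[:calls] = get(state, :calls, 0) + 1; state)

# Deep coupled path: in the real code, many stack frames sit between the
# driver and the solver, which is what trips the compiler heuristic.
run_coupled_step!(state) = implicit_solve!(state)

function run_amip!(state)
    # Calling the solver once directly from the driver compiles its method
    # instance (and its GPU kernels) from a shallow stack; the coupled loop
    # then reuses the already-compiled code.
    implicit_solve!(state)
    run_coupled_step!(state)
    return state
end
```

This relies on Julia caching the compiled specialization of `implicit_solve!`, so the later, deeper call hits the same method instance rather than recompiling under the heuristic.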

@petebachant petebachant self-assigned this Apr 21, 2026
petebachant and others added 9 commits April 29, 2026 07:36
Also ignore materialize! for kernel renaming
…ldname-set

# Conflicts:
#	.buildkite/pipeline.yml
#	ext/cuda/cuda_utils.jl
#	test/gpu/latency_benchmarks.jl
Note that the launch configuration is not ideal at the moment: it basically needs the number of vertical levels to be a multiple of 32 and has an upper limit of 1024 threads per block.
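The constraint described in this commit message can be sketched numerically. This is an illustrative helper, not the actual implementation; `WARP_SIZE`, `MAX_THREADS_PER_BLOCK`, and `pcr_launch_threads` are assumed names.

```julia
# Illustrative sketch of the launch-configuration constraint: one thread
# per vertical level, rounded up to a multiple of the warp size (32),
# capped at the CUDA limit of 1024 threads per block.
const WARP_SIZE = 32
const MAX_THREADS_PER_BLOCK = 1024

function pcr_launch_threads(Nv::Int)
    threads = cld(Nv, WARP_SIZE) * WARP_SIZE  # round Nv up to a multiple of 32
    # A launch with more than 1024 threads per block is invalid, so signal
    # that the caller should fall back to another solver.
    threads <= MAX_THREADS_PER_BLOCK ? threads : nothing
end

# pcr_launch_threads(63)   == 64       (63 levels run on 64 threads)
# pcr_launch_threads(2048) === nothing (too many levels for one block)
```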

Note that since the solver may work with matrices that have AxisTensor
elements (with potentially multiple components), we need to use boxed
operators.

Co-authored-by: petebachant <[email protected]>
Co-authored-by: sjavis <[email protected]>
To capture the majority of tridiagonal solvers, we need to split the
`multiple_field_solver` if any tridiagonal matrix is present. Otherwise
the tridiagonal case may be hidden in a single kernel with
non-tridiagonal cases. This seems to happen for most instances of the
`multiple_field_solver` in the AMIP case, hence we effectively remove
this optimisation.

The new PCR solver may fail to launch the kernel if Nv (and hence the
number of threads per block) is too large. Fall back to the local-memory
solver to have stable (albeit less performant) behaviour for large
numbers of vertical levels.
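A minimal sketch of that fallback dispatch, with hypothetical stand-in names (the real solvers are GPU kernels, not these stubs):

```julia
# Illustrative fallback dispatch: use the fast PCR kernel when the column
# fits in one thread block, otherwise the slower but size-independent
# local-memory solver. Stubs stand in for the real kernels.
const MAX_THREADS_PER_BLOCK = 1024

pcr_solve!(x) = :pcr              # stand-in for the parallel cyclic reduction kernel
local_memory_solve!(x) = :local   # stand-in for the local-memory solver

solve_column!(x, Nv) =
    Nv <= MAX_THREADS_PER_BLOCK ? pcr_solve!(x) : local_memory_solve!(x)
```

Keeping the dispatch on the host side like this means an oversized Nv degrades performance gracefully instead of failing the kernel launch.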
The docstring of BlockDiagonalSolve is where the original solver was
described.
The boxed operators have been removed in the recent RecursiveApply
refactor. Instead, ordinary arithmetic operators can be used.
