Add space-based CUDA kernel fusion for copyto #2482
petebachant wants to merge 25 commits into main

Conversation
Interestingly this shows no improvement when running the prog EDMF 1M config in the coupler: https://buildkite.com/clima/climacore-end-to-end-performance/builds/142/steps/canvas?sid=019d6395-829b-4900-93c5-cea1aca53baf&tab=output
@dennisYatunin this may be interesting to you. When running ClimaAtmos on its own, this change speeds things up significantly, but when run with the coupler with the same exact Atmos config, we hit the fallback condition, i.e., the compiler fails to fuse the kernels.
Wow, yeah, that is very interesting! Running the same function from ClimaCoupler should only add a couple of stack frames on top of running it in ClimaAtmos, but I suppose this shows that your example is right on the edge of triggering a compiler heuristic and de-optimizing. I think the simplest way to avoid this would be to directly call the ClimaAtmos implicit solver from the AMIP driver, forcing it to be compiled efficiently before it gets called inside ClimaCoupler's deeper stacktrace. That's definitely not a pattern we want to use frequently, and understanding how to avoid these compiler heuristics altogether would be a much more sustainable solution. But at least while we're still figuring things out you can use it, especially if that leads to such a big performance improvement.
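The workaround suggested above can be sketched as the following call pattern. Every name here (`implicit_solve!`, `coupled_step!`) is an illustrative stand-in, not a real ClimaAtmos or ClimaCoupler function; the point is only the shape of the fix, i.e., one direct shallow call before the deep coupled call path.

```julia
# A minimal sketch of the warm-up pattern described above. All names are
# illustrative stand-ins, not real ClimaAtmos/ClimaCoupler functions.
implicit_solve!(x) = (x .*= 2; x)      # stand-in for the Atmos implicit solver

coupled_step!(x) = implicit_solve!(x)  # stand-in for ClimaCoupler's deeper stack

x = ones(4)
implicit_solve!(x)   # direct shallow call first, forcing compilation here
coupled_step!(x)     # later deep calls reuse the already-compiled method
```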
Also ignore materialize! for kernel renaming
…ldname-set

# Conflicts:
#	.buildkite/pipeline.yml
#	ext/cuda/cuda_utils.jl
#	test/gpu/latency_benchmarks.jl
Note that the launch configuration is not ideal at the moment: it requires the number of vertical levels to be a multiple of 32, with an upper limit of 1024 (the maximum number of threads per block). Note also that since the solver may work with matrices that have AxisTensor elements (with potentially multiple components), we need to use boxed operators. Co-authored-by: petebachant <[email protected]> Co-authored-by: sjavis <[email protected]>
To capture the majority of tridiagonal solvers, we need to split the `multiple_field_solver` if any tridiagonal matrix is present. Otherwise the tridiagonal case may be hidden in a single kernel with non-tridiagonal cases. This seems to happen for most instances of the `multiple_field_solver` in the AMIP case, hence we effectively remove this optimization. The new PCR solver may fail to launch the kernel if Nv (and hence the number of threads per block) is too large. Fall back on the local-memory solver to have stable (albeit less performant) behavior for large numbers of vertical levels.
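The split described above amounts to partitioning the per-field systems by matrix structure so the tridiagonal-specialized kernel gets its own launch. A minimal sketch, assuming tridiagonal matrices can be identified by bandwidths `(-1, 1)`; `is_tridiagonal` and `split_by_tridiagonal` are hypothetical names, not the actual ClimaCore implementation.

```julia
# Illustrative predicate: a tridiagonal matrix has one sub- and one
# super-diagonal, i.e., bandwidths (-1, 1).
is_tridiagonal(bandwidths) = bandwidths == (-1, 1)

# Partition the per-field matrices so tridiagonal systems get their own
# kernel launch instead of being hidden inside a mixed one.
function split_by_tridiagonal(bandwidths_list)
    tri   = filter(is_tridiagonal, bandwidths_list)
    other = filter(!is_tridiagonal, bandwidths_list)
    return tri, other
end

tri, other = split_by_tridiagonal([(-1, 1), (-2, 2), (-1, 1)])
```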
The docstring of `BlockDiagonalSolve` is where the original solver was described.
The boxed operators have been removed in the recent `RecursiveApply` refactor. Instead, ordinary arithmetic operators can be used.
…me-set

# Conflicts:
#	ext/cuda/matrix_fields_single_field_solve.jl
This is a 13% bump in SYPD for the prog EDMF 1M Atmos config (no land).
Kernel analysis: