More efficient solver for TridiagonalMatrixFields #2486
Mikolaj-A-Kowalski merged 4 commits into main
Conversation
The following contributor(s) must sign the CLA before this PR can be merged:
Please visit https://ecodesign.clima.caltech.edu/cla/ to review and sign the CLA.
How to sign: Authenticate with GitHub then click the "I agree" button.
Once completed, re-run the checks on this PR.
Looks like the e2e performance pipeline is failing. Do you have materials committed to your dev repo with all the scripts, configs, etc., you used to test this during development and to generate that attached report? Maybe this only happens when running with the Coupler? I've seen strange idiosyncrasies where kernels sped up for Atmos don't improve when run with the Coupler. For this PR we should definitely have a unit test for the new solver, ideally using the same case(s) as the existing solver, to check that the numerical values are similar enough.
Sadly it happens only during the AMIP run. I was able to reproduce that locally on our side, but I am still struggling through compilation times and Julia crashes. We have been iterating on the solver in a testbed.
It seems that the crash is caused by a Field with elements of a particular type. EDIT: So I was not quite 100% correct.
If you aren't already, I think it would be helpful to use |
It was a good shout. The problem seems to be that we did not properly handle the cases where the components of the matrix and vectors are tensors and co/contra-variant vectors. I am not fully certain that I interpret things correctly, but this is why we should have used the 'boxed' operators.
Force-pushed from b2dbb21 to bb6ad40
Boxing the operators seems to solve the problem with the failing end-to-end run. Basically we missed that not only the elements of the field, but also the entries of the matrix, may be tensor-valued. Now the CI seems to run happily and the AMIP run is reporting a healthy speedup.

I do, however, have a couple of worries to resolve. The first is that there are a lot of tests failing in the CI (although the failures seem to be ignored when determining the status of the overall job). I assume these are tests known to fail?

Below I have attached the Nsys reports from before and after this PR. However, in the 'after' report I cannot seem to find the matrix solver kernels, which is quite scary. The unit tests do pass, and locally I have verified that the solver runs correctly. Also, I presume that something would have failed if, e.g. due to some bug, the solver was not called at all? Hence I guess it is something happening with the kernel renaming scheme?
Ok. I have verified in the REPL that the solver does get run. The only suspicious thing is that the performance improvement is much, much better than we could have expected.
imreddyTeja
left a comment
This looks mostly good to me. I'm having trouble opening the summary.ods file you attached, but I will run some quick benchmarks locally to verify the speedup. It would be good to run the nightly pipeline, or at least a simulation from it, with this before merging.
b_data = Fields.field_values(b)

# Solve
threads_per_block = min(Nv, 256)
Bad things...
This is a good catch! This line stayed there by accident. It should be removed.
Ultimately, the only problem that may happen is that the vertical size may exceed the maximum number of threads per block that can be supported (for a given register usage), and then the kernel will fail to launch.
If we want proper error handling for that (with CUDA.maxthreads), the annoying part is that to call it we need access to the compiled kernel, so it may need to happen inside auto_launch!.
I have removed the line in 3aea99f.
If the size is too large to handle, CUDA.jl will emit its standard error that the number of threads is too large.
If we want, I can also add a check inside auto_launch! so ClimaCore.jl emits a message that may be made more descriptive.
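The proposed guard is simple enough to sketch. Below is a hypothetical illustration in Python (the actual code would be Julia inside auto_launch!; the function name, threshold argument, and message are all invented for illustration) of what a more descriptive pre-launch check could look like:

```python
def check_launch(nv: int, max_threads_per_block: int) -> None:
    """Hypothetical pre-launch guard: fail with a descriptive message
    instead of CUDA's generic 'too many threads' launch error."""
    if nv > max_threads_per_block:
        raise ValueError(
            f"number of vertical levels Nv={nv} exceeds the maximum "
            f"threads per block ({max_threads_per_block}) supported by "
            f"the tridiagonal PCR kernel"
        )

check_launch(64, 512)  # within the limit: no error raised
```

The point of the check is purely diagnostic; the behaviour for valid sizes is unchanged.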
auto_launch!(
    tridiag_pcr_kernel!,
    args;
    threads_s = (threads_per_block,),
This is probably not a big issue because most long simulations run with 64 vertical faces, but wouldn't this result in a theoretical occupancy of 50% when Nv==32?
Exactly! This is the biggest drawback of the kernel as it stands. When Nv is 33 we basically end up with a 50% effective occupancy, which is the worst case scenario. But I have to admit that it is annoyingly difficult to avoid...

As I mentioned in the description, I have tried to improve it by 'concatenating' multiple tridiagonal systems into a bigger system (for a test I basically loaded all matrices in a single horizontal element at once, over the i,j dimensions) and solving that instead, using the same solver. This would avoid the "trailing masked threads" problem basically by averaging. But in the end it did not work. (You can see the attempt here.)

Basically what happens is that, since more threads participate in the synchronisation, they spend more time waiting for their brethren to catch up. In addition there are extra PCR steps due to the larger system. For the worst case scenario of Nv=33 the concatenated solver was as fast as the one in this PR; in general it was significantly slower (I looked in a standalone "testbed" though, not in AMIP).

What we still need to try is Thomas with a single matrix per thread, but in shared memory.

However, there is also a possibility that switching to shared memory decreased the fraction of the time spent in the solver to a point where we perhaps don't need to worry about the occupancy. Basically, the change from Nv of 32 to Nv of 33 would increase the runtime of the solver by a factor of about 2, but if it is a small enough fraction of the time it will have a small overall effect on performance. I still need to test that though (the nsys results indicate that the tridiagonal solutions are a very small fraction of the time, but I am not sure I count them all due to the kernel renaming; I have made nsys and ncu runs in AMIP without the renaming to verify, but have not processed the results yet).
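The "Nv=33 costs roughly a factor of two" claim above is easy to check with a back-of-the-envelope calculation. This is a sketch, not ClimaCore code: with one thread per vertical level and a warp size of 32, the fraction of launched threads doing useful work drops sharply whenever Nv sits just above a multiple of 32.

```python
WARP = 32  # threads per warp on NVIDIA GPUs

def useful_thread_fraction(nv: int) -> float:
    """Fraction of launched threads that map to a vertical level when a
    block launches one thread per level, padded to whole warps."""
    warps = -(-nv // WARP)  # ceil(nv / 32): warps the block actually occupies
    return nv / (warps * WARP)

for nv in (32, 33, 64):
    print(nv, useful_thread_fraction(nv))
# 32 -> 1.0, 33 -> ~0.52 (the worst case discussed above), 64 -> 1.0
```

For Nv=33 only 33 of 64 launched threads are useful, which matches the roughly 50% effective occupancy mentioned above.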
To update on the above: I had a look at the nsys run without the kernel renaming and it seems that the GPU time in the tridiagonal solver is still rather large (3.4%): no_rename.tar.gz. This implies that something has gone wrong with the kernel renaming in the original report...
3.4% is a large number, sufficiently large to worry about the loss of occupancy.
The only thing I am thinking is that it may not be trivial to change, so perhaps it would be better to address it in a separate, follow-up PR.
I have a fix for that. I will try to open a PR for it by the end of the day (PST).
function tridiag_pcr_kernel!(
    x, a, b, c, d, ::Val{n}, ::Val{n_iter},
Suggested change:
-    x, a, b, c, d, ::Val{n}, ::Val{n_iter},
+    x, a, b, c, d, ::Val{Nv}, ::Val{n_iter},
I think it would be good to follow the pattern of using Nv to refer to the number of vertical levels.
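For reference, here is a serial Python emulation of parallel cyclic reduction, the algorithm behind tridiag_pcr_kernel!. This is a sketch, not the actual kernel: in the CUDA version each index i of the inner loop would be one thread, with a block-wide synchronisation between reduction steps, and n_iter = ceil(log2(Nv)) reduction steps decouple every equation (which is presumably why the kernel takes ::Val{n_iter} as a compile-time parameter). The boundary convention here (out-of-range neighbours contribute nothing) is illustrative.

```python
import math

def pcr_solve(a, b, c, d):
    """Solve a tridiagonal system by parallel cyclic reduction (serial sketch).
    Row i is: a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], with a[0] = c[-1] = 0."""
    n = len(b)
    n_iter = max(1, math.ceil(math.log2(n)))
    a, b, c, d = list(a), list(b), list(c), list(d)
    s = 1  # stride between coupled equations, doubles every step
    for _ in range(n_iter):
        na, nb, nc, nd = a[:], b[:], c[:], d[:]
        for i in range(n):  # in the kernel, each i is one thread
            lo, hi = i - s, i + s
            alpha = -a[i] / b[lo] if lo >= 0 else 0.0
            beta = -c[i] / b[hi] if hi < n else 0.0
            # Eliminate x[lo] and x[hi]; row i now couples to x[i +/- 2s].
            na[i] = alpha * a[lo] if lo >= 0 else 0.0
            nc[i] = beta * c[hi] if hi < n else 0.0
            nb[i] = (b[i]
                     + (alpha * c[lo] if lo >= 0 else 0.0)
                     + (beta * a[hi] if hi < n else 0.0))
            nd[i] = (d[i]
                     + (alpha * d[lo] if lo >= 0 else 0.0)
                     + (beta * d[hi] if hi < n else 0.0))
        a, b, c, d = na, nb, nc, nd  # in the kernel: __syncthreads() here
        s *= 2
    # After ceil(log2(n)) steps every equation is decoupled: b[i]*x[i] = d[i].
    return [d[i] / b[i] for i in range(n)]
```

Each reduction step roughly halves the coupling, which is also why concatenating systems (as discussed above) adds extra PCR steps: the step count grows with log2 of the total system size.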
I did some quick local testing with the solver.
Is it something I can do by pushing a commit with the right annotation, or by clicking something somewhere on Buildkite? Edit: or by opening a new branch in the downstream repo?
Downstream test in ClimaAtmos here.
I think this PR is now ready for a proper review. There are a couple of issues I would like some guidance on, though:
The commit history needs to be cleaned up. I plan to do it after we finish iterating on the PR (I don't want GitHub to invalidate the conversations as a result of force pushes).
Updating online documentation: I think it would be a good idea to put this in the docs.
Integration tests: I think the test coverage for the solver is sufficient.
imreddyTeja
left a comment
After updating the docs, I think this is ready to go. Could you also add an entry to NEWS.md?
function single_field_solve!(device::ClimaComms.CUDADevice, cache, x, A, b)

# Tridiagonal solvers are handled by a special implementation
if eltype(A) <: MatrixFields.TridiagonalMatrixRow
Should we also check that Nv <= 256 here, and fall back to the previous implementation otherwise? This would keep support for large Nv.
I have added the redirect in d6c2910.
I used the threshold of 512, since at the moment the kernel consumes only 30 registers, so it supports a large number of threads per block.
I have added extra documentation in c7ce231.
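The resulting dispatch is roughly as follows. This is a Python sketch of the logic only; the threshold of 512 comes from the comment above, while the function and solver names are invented for illustration:

```python
MAX_PCR_NV = 512  # kernel uses ~30 registers, so 512 threads/block fit

def choose_solver(is_tridiagonal: bool, nv: int) -> str:
    """Route a column solve to the specialised PCR kernel only when the
    matrix is tridiagonal and the column fits in one thread block;
    otherwise fall back to the previous general solver."""
    if is_tridiagonal and nv <= MAX_PCR_NV:
        return "pcr"           # one block per column, one thread per level
    return "local_memory"      # previous solver: slower but size-agnostic

print(choose_solver(True, 64))    # pcr
print(choose_solver(True, 1024))  # local_memory (large Nv keeps working)
print(choose_solver(False, 64))   # local_memory
```

This keeps the behaviour stable for arbitrarily tall columns while capturing the common AMIP-like cases with the fast kernel.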
Added in c50f780 (although it causes a merge conflict). If we are broadly happy with the changes, I will proceed to rebase on main.
Yes please. This looks great!
Force-pushed from c7ce231 to 937ee08
All the conflicts should be resolved, and it is ready to go when the tests pass ;-)
imreddyTeja
left a comment
This should be a nice speedup. Thank you!
@dennisYatunin Do you expect there to be any changes needed because #2417 was merged in? |
Note that the launch configuration is not ideal at the moment. Basically it needs the number of vertical levels to be a multiple of 32, and it has an upper limit of 1024 (the maximum number of threads per block).

Note that since the solver may work with matrices that have AxisTensor elements (with potentially multiple components), we need to use the boxed operators.

Co-authored-by: petebachant <petebachant@gmail.com>
Co-authored-by: sjavis <sa2329@cam.ac.uk>
To capture the majority of tridiagonal solvers we need to split the `multiple_field_solver` if any tridiagonal matrix is present. Otherwise the tridiagonal case may be hidden in a single kernel together with non-tridiagonal cases. This seems to happen for most instances of the `multiple_field_solver` in the AMIP case, hence we would effectively lose this optimisation.

The new PCR solver may fail to launch the kernel if Nv (and hence the number of threads per block) is too large. Fall back to the local-memory solver to have stable (albeit not as performant) behaviour for large numbers of vertical levels.
The docstring of BlockDiagonalSolve is where the original solver was described.
The boxed operators have been removed in the recent RecursiveApply refactor. Instead, ordinary arithmetic operators can be used.
Force-pushed from 937ee08 to 86ea65b
I have rebased on the current main. I have also triggered another end-to-end performance check to see if there is any difference after #2417.
I checked in with him briefly yesterday. All the changes in #2417 should be backwards compatible. I believe you are correct that we should 'unbox' the operators, as they are now deprecated.
Also, are the failures in the downstream tests expected?
Those allocations are caused by #2417. The downstream tests have been failing on main with the same number of allocations since that PR was merged. I believe all the tests that we run through GitHub Actions are CPU-only.
If you think this is good to go, I think this should be merged.
Force-pushed from 86ea65b to 5977c4e
I think it is. I assume I should just squash and merge?
Yes, either squash and merge or use the rebase option on the web interface. There isn't a preferred way as far as I know.
Replaces #2484 (since we need a PR from inside the repo to run the CI perf jobs; the description is copied verbatim). Closes #2401
@petebachant @imreddyTeja @sjavis @AdelekeBankole
In this PR we aim to contribute a more efficient implementation of the tridiagonal matrix solver. To that effect we:
- split multiple_field_solve! to isolate tridiagonal cases
At the moment the specialised solver is significantly faster (e.g. when tested on an L40, a single solve reduces from ~200ms to ~90ms: compere.tar.gz). Once we trigger the [perf] tag we will see how it performs in the complete simulation 🤞
This is still a draft, since the solver is not good enough from the point of view of the launch configuration. We launch a block per column and a thread for each vertical level. This works well if the number of levels is close to a multiple of 32 (like AMIP), but it is not stable enough. In this PR we will try to provide and test some alternatives. At the moment I see two options we can try:
Stitch Matrices together on load to shared memory
This would basically use the existing (quite fast) shmem solver, but, to balance the load, load multiple columns into shmem and solve them as a single matrix (as far as I know, rows with 0 off-diagonals at the matrix 'stitches' should not cause numerical problems...). The problem here is that the first (or last) element of the off-diagonal would need to be patched to 0 on load to shared memory, since it may not be zero, currently having "don't care" status.
Since the loading of data currently accounts for at least 50% of the solver runtime, extra logic on load may not be performant. We shall see.
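The patching step described above is small. Here is a Python sketch of the 'stitching' idea, under the assumption that each system is given as (a, b, c, d) arrays whose first sub-diagonal and last super-diagonal entries have "don't care" status; the function name is illustrative:

```python
def stitch(systems):
    """Concatenate tridiagonal systems (a, b, c, d) into one big system.
    Patching the off-diagonals to zero at each seam keeps the sub-systems
    decoupled, so one solve of the big system yields all the individual
    solutions back to back."""
    A, B, C, D = [], [], [], []
    for a, b, c, d in systems:
        A += [0.0] + list(a[1:])   # first sub-diagonal entry patched to 0
        C += list(c[:-1]) + [0.0]  # last super-diagonal entry patched to 0
        B += list(b)
        D += list(d)
    return A, B, C, D
```

Because the seam entries are zero, the big matrix is block-diagonal with tridiagonal blocks, so any tridiagonal solver applied to it reproduces the per-column solutions; the cost is exactly the extra branching on load discussed above.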
Shared Memory Parallel Thomas
Since we expect the vertical tridiagonal matrices to be small, we can solve them in batches of 32: have a single warp per block and let each of the threads solve a full matrix with the parallel Thomas.
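For reference, this is the (serial) Thomas algorithm each thread would run on its own column, sketched in Python; in the shared-memory variant proposed above, the per-column arrays would live in shmem and one warp would process 32 columns concurrently:

```python
def thomas_solve(a, b, c, d):
    """Classic Thomas algorithm: O(n) forward elimination followed by
    back substitution. Row i is a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],
    with a[0] = c[-1] = 0."""
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):  # forward sweep: eliminate the sub-diagonal
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Unlike PCR, this does O(n) sequential work per column with no synchronisation, which is why one-matrix-per-thread batching is attractive when many small columns are available.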
TODO
When the PR is ready, go through the checklist: