More efficient solver for TridiagonalMatrixFields#2486

Merged
Mikolaj-A-Kowalski merged 4 commits into main from iccs/pcr-solver on May 6, 2026

Conversation

@Mikolaj-A-Kowalski
Contributor

@Mikolaj-A-Kowalski Mikolaj-A-Kowalski commented Apr 9, 2026

Replaces #2484 (we need a PR from inside the repo to run the CI perf jobs; the description is copied verbatim).

Closes #2401

@petebachant @imreddyTeja @sjavis @AdelekeBankole

In this PR we aim to contribute a more efficient implementation of the tridiagonal matrix solver. To that effect we:

  • split the multiple_field_solve! to isolate tridiagonal cases
  • redirect tridiagonal cases to a specialised solver

At the moment the specialised solver is significantly faster (e.g. when tested on L40, a single solve drops from ~200 ms to 90 ms; see compere.tar.gz). Once we trigger the [perf] tag we will see how it performs in the complete simulation 🤞
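For context, the specialised solver is a parallel cyclic reduction (PCR) kernel (`tridiag_pcr_kernel!`), which decouples the rows of a tridiagonal system in roughly ⌈log₂(Nv)⌉ steps so that one thread per row can work fully in parallel. The sketch below is only a plain-Julia, single-threaded illustration of the reduction, assuming the sub- and super-diagonals `a` and `c` have zero first and last entries respectively; it is not the kernel in this PR.

```julia
# Plain-Julia illustration of parallel cyclic reduction (PCR) for one
# tridiagonal system (sub-diagonal a, diagonal b, super-diagonal c, RHS d).
# Assumes a[1] == 0 and c[end] == 0. On a GPU, each row i would be owned
# by one thread and every pass of the inner loop would run in parallel.
function pcr_solve(a, b, c, d)
    n = length(d)
    a, b, c, d = copy(a), copy(b), copy(c), copy(d)
    s = 1
    while s < n
        a2, b2, c2, d2 = copy(a), copy(b), copy(c), copy(d)
        for i in 1:n  # fully parallel on the GPU
            α = i - s ≥ 1 ? -a[i] / b[i - s] : zero(eltype(b))
            β = i + s ≤ n ? -c[i] / b[i + s] : zero(eltype(b))
            a2[i] = i - s ≥ 1 ? α * a[i - s] : zero(eltype(a))
            c2[i] = i + s ≤ n ? β * c[i + s] : zero(eltype(c))
            b2[i] = b[i] + (i - s ≥ 1 ? α * c[i - s] : zero(eltype(b))) +
                           (i + s ≤ n ? β * a[i + s] : zero(eltype(b)))
            d2[i] = d[i] + (i - s ≥ 1 ? α * d[i - s] : zero(eltype(d))) +
                           (i + s ≤ n ? β * d[i + s] : zero(eltype(d)))
        end
        a, b, c, d = a2, b2, c2, d2
        s *= 2
    end
    return d ./ b  # all rows are now decoupled
end
```

For a well-posed system this returns the same solution as the Thomas algorithm, but with O(log n) parallel depth per system instead of an O(n) serial sweep.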

This is still a draft, since the solver is not yet good enough from the point of view of the launch configuration. We launch one block per column and one thread per vertical level. This works well if the number of levels is close to a multiple of 32 (as in AMIP), but it is not robust enough in general. In this PR we will try to provide and test some alternatives; at the moment I see two options we can try:

Stitch Matrices together on load to shared memory
This would basically reuse the existing (quite fast) shmem solver but, to balance the load, load multiple columns into shmem and solve them as a single matrix (as far as I know, rows with zero off-diagonals at the matrix 'stitches' should not cause numerical problems...). The catch is that the first (or last) element of each column's off-diagonal would need to be patched to 0 on load to shared memory, since it may not be zero (it currently has "don't care" status).

Since loading the data currently accounts for at least 50% of the solver runtime, extra logic on load may not be performant. We shall see.
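As a small illustration of the patch that would be needed on load (illustrative only, with a made-up array layout; not code from this PR):

```julia
# Illustrative only: when concatenating several columns of length Nv into
# one long tridiagonal system, the sub-diagonal entry of each column's
# first row and the super-diagonal entry of its last row must be forced
# to zero on load, since they currently have "don't care" status and
# would otherwise couple neighbouring columns across the stitch.
function patched_offdiagonals(a, c, col, row, Nv)
    lo = a[row, col]                 # sub-diagonal entry being loaded
    hi = c[row, col]                 # super-diagonal entry being loaded
    lo = row == 1  ? zero(lo) : lo
    hi = row == Nv ? zero(hi) : hi
    return lo, hi
end
```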

Shared Memory Parallel Thomas
Since we expect the vertical tridiagonal matrices to be small, we can solve them in batches of 32: have a single warp per block and let each thread solve a full matrix with the Thomas algorithm.
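For reference, the per-system work each thread would do under this option is the standard Thomas forward-elimination / back-substitution pass; this is a generic serial sketch (with sub-, main- and super-diagonals `a`, `b`, `c` and right-hand side `d`), not an implementation from this PR.

```julia
# Serial Thomas algorithm for one tridiagonal system. Under the
# "one warp per block, one system per thread" option, each of the 32
# threads would run this on its own system, ideally out of shared memory.
function thomas_solve!(x, a, b, c, d)
    n = length(d)
    cp = similar(c)
    dp = similar(d)
    cp[1] = c[1] / b[1]
    dp[1] = d[1] / b[1]
    for i in 2:n  # forward elimination
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    end
    x[n] = dp[n]
    for i in (n - 1):-1:1  # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    end
    return x
end
```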

TODO

When the PR is ready, go through the checklist:

  • Code follows the style guidelines.
  • Unit tests are included. (new solver is covered by existing unit tests)
  • Code is exercised in an integration test.
  • Documentation has been added/updated.


@github-actions github-actions Bot left a comment


⚠️ Contributor License Agreement required

The following contributor(s) must sign the CLA before this PR can be merged:

Please visit https://ecodesign.clima.caltech.edu/cla/ to review and sign the CLA.

How to sign: Authenticate with GitHub then click the "I agree" button.

Once completed, re-run the checks on this PR.

@petebachant
Member

Looks like the e2e performance pipeline is failing:

ERROR: LoadError: InvalidIRError: compiling MethodInstance for ClimaCoreCUDAExt.tridiag_pcr_kernel!(::ClimaCore.DataLayouts.VIJFH{…}, ::ClimaCore.DataLayouts.VIJFH{…}, ::ClimaCore.DataLayouts.VIJFH{…}, ::ClimaCore.DataLayouts.VIJFH{…}, ::ClimaCore.DataLayouts.VIJFH{…}, ::Val{…}, ::Val{…}) resulted in invalid LLVM IR

Do you have materials committed to your dev repo with all the scripts, configs, etc., you used to test this during development to generate that attached report? Maybe this only happens when running with the Coupler? I've seen strange idiosyncrasies where kernels sped up for Atmos don't improve when run with the Coupler.

For this PR we should definitely have a unit test for the new solver, ideally using the same case(s) as the existing solver to check numerical values are similar enough.

@Mikolaj-A-Kowalski
Contributor Author

> Do you have materials committed to your dev repo with all the scripts, configs, etc., you used to test this during development to generate that attached report? Maybe this only happens when running with the Coupler? I've seen strange idiosyncrasies where kernels sped up for Atmos don't improve when run with the Coupler.

Sadly it happens only during the AMIP run. I was able to reproduce it locally on our side, but I am still struggling through compilation times and Julia crashes to get an Infiltrator into the function and see what types it is given that make it fail. I suspect it is something simple, such as a MatrixField with some odd eltype being passed to the solver.

We have been iterating the solver on a testbed.

@Mikolaj-A-Kowalski
Contributor Author

Mikolaj-A-Kowalski commented Apr 9, 2026

It seems that the crash is caused by a Field with elements of ClimaCore.Geometry.Covariant3Vector being fed to the solver. I am trying to determine whether this is a valid use case (probably not...?) or some bug in the redirect we are doing.

EDIT: I was not quite 100% correct. The ClimaCore.Geometry.Covariant3Vector is provided for the RHS (and the result x), not the tridiagonal matrix itself. Its underlying scalar type is still a plain Float32.

@imreddyTeja
Member

> > Do you have materials committed to your dev repo with all the scripts, configs, etc., you used to test this during development to generate that attached report? Maybe this only happens when running with the Coupler? I've seen strange idiosyncrasies where kernels sped up for Atmos don't improve when run with the Coupler.
>
> Sadly it happens only during the AMIP run. I was able to reproduce it locally on our side, but I am still struggling through compilation times and Julia crashes to get an Infiltrator into the function and see what types it is given that make it fail. I suspect it is something simple, such as a MatrixField with some odd eltype being passed to the solver.
>
> We have been iterating the solver on a testbed.

If you aren't already, I think it would be helpful to use test/MatrixFields/field_matrix_solvers.jl. It doesn't cover every use case, but it should replicate a few common atmos setups.

@Mikolaj-A-Kowalski
Contributor Author

> If you aren't already, I think it would be helpful to use test/MatrixFields/field_matrix_solvers.jl. It doesn't cover every use case, but it should replicate a few common atmos setups.

It was a good shout.

The problem seems to be that we did not properly handle the cases where the components of the matrix and vectors are tensors and co-/contravariant vectors. I am not fully certain I am interpreting things correctly, but is this why we should have used the 'boxed' operators and the RecursiveApply functionality?

@Mikolaj-A-Kowalski
Contributor Author

Boxing the operators seems to solve the problem with the failing end-to-end run.

Basically we missed that not only may the elements of the field be AxisTensors, but the RHS b may also be a tuple, in which case the solve needs to be applied element-wise.
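As a generic illustration of that element-wise handling (not ClimaCore's actual RecursiveApply implementation), a recursive map lets scalar arithmetic fall through to each component of a tuple-valued right-hand side:

```julia
# Illustration only: apply a scalar operation recursively so that it also
# works when the RHS elements are tuples of components rather than numbers.
recursive_map(f, x::Number) = f(x)
recursive_map(f, x::Tuple) = map(y -> recursive_map(f, y), x)

# e.g. dividing a tuple-valued RHS entry by a scalar diagonal entry:
rhs_entry = (1.0, 2.0, 3.0)
diag_entry = 4.0
recursive_map(v -> v / diag_entry, rhs_entry)  # (0.25, 0.5, 0.75)
```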

Now the CI seems to run happily and the AMIP run is reporting a healthy speedup. I do, however, have a couple of worries to resolve.

The first is that there are a lot of tests failing in the CI (although the failures seem to be ignored when determining the status of the overall job). I assume these are tests known to fail?

Below I have attached the Nsight Systems reports from before and after this PR. However, in the 'after' report I cannot seem to find the matrix solver kernels, which is quite scary. The unit tests do pass, and locally I have verified that the solver runs correctly. I also presume that something would have failed if, due to some bug, the solver were not called at all? So I guess something is happening with the kernel renaming scheme?

reports.tar.gz

@Mikolaj-A-Kowalski
Contributor Author

Mikolaj-A-Kowalski commented Apr 20, 2026

OK, I have verified in the REPL that the solver does get run. I used @debug logging to check which kernel name it should be assigned, and it should be run_field_matrix_solver__FILE_ClimaCore_jl_src_MatrixFields_field_matrix_solver_jl_LXXX.

Looking into the kernel summary in a spreadsheet, the kernels are still there.

The only suspicious thing is that the performance improvement is much, much better than we could have expected from the ncu results. It feels too good to be true: we drop from over 100 ms to ~1 ms. It feels odd.

summary.ods

Member

@imreddyTeja imreddyTeja left a comment


This looks mostly good to me. I'm having trouble opening the summary.ods file you attached, but I will run some quick benchmarks locally to verify the speedup. It would be good to run the nightly pipeline, or at least a simulation from it, with this before merging.

Comment thread on ext/cuda/matrix_fields_single_field_solve.jl (outdated)
b_data = Fields.field_values(b)

# Solve
threads_per_block = min(Nv, 256)
Member


What happens if Nv > 256?

Contributor Author


Bad things...

This is a good catch! This line stayed there by accident. It should be removed.

The only problem that may happen is that the vertical size may exceed the maximum number of threads per block that can be supported (for the kernel's register usage), and then the kernel will fail to launch.

Ultimately, if we want proper error handling for that (with CUDA.maxthreads), the annoying part is that calling it requires access to the compiled kernel, so the check may need to happen inside auto_launch!.

Contributor Author


I have removed the line in 3aea99f

If the size is too large to handle, CUDA.jl will emit its standard error that the number of threads is too large.

If we want, I can also add a check inside auto_launch! so that ClimaCore.jl emits a message that can be made more descriptive.
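A rough sketch of what such a check could look like (hypothetical; this is not `auto_launch!` itself, the `n_iter` choice is an assumption, and the error text is made up), using CUDA.jl's ability to compile a kernel without launching it:

```julia
using CUDA

# Hypothetical sketch: compile the kernel without launching it, query the
# per-kernel thread limit (which depends on its register usage), and raise
# a descriptive error before the launch itself can fail.
function checked_tridiag_launch!(x, a, b, c, d, Nv, nblocks)
    n_iter = ceil(Int, log2(Nv))  # assumed iteration count; not taken from the PR
    kernel = @cuda launch=false tridiag_pcr_kernel!(x, a, b, c, d, Val(Nv), Val(n_iter))
    max_threads = CUDA.maxthreads(kernel)
    Nv <= max_threads || error(
        "tridiagonal solver requested $Nv threads per block, ",
        "but the compiled kernel supports at most $max_threads",
    )
    kernel(x, a, b, c, d, Val(Nv), Val(n_iter); threads = Nv, blocks = nblocks)
end
```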

auto_launch!(
tridiag_pcr_kernel!,
args;
threads_s = (threads_per_block,),
Member


This is probably not a big issue because most long simulations run with 64 vertical faces, but wouldn't this result in a theoretical occupancy of 50% when Nv==32?

Contributor Author


Exactly! This is the biggest drawback of the kernel as it stands. When Nv is 33 we basically end up with ~50% effective occupancy, which is the worst-case scenario. But I have to admit that it is annoyingly difficult to avoid...
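For concreteness (an illustration, not from the PR), the fraction of launched threads doing useful work when a block of `Nv` threads is padded up to whole warps of 32 is just `Nv` divided by the next multiple of 32:

```julia
# Fraction of launched threads that do useful work when a block of Nv
# threads is padded up to whole warps of 32.
warp_utilisation(Nv) = Nv / (32 * cld(Nv, 32))

warp_utilisation(32)  # 1.0   - perfect fit
warp_utilisation(33)  # ≈0.52 - worst case: a second, almost idle warp
warp_utilisation(64)  # 1.0
```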

As I mentioned in the description, I tried to improve it by 'concatenating' multiple tridiagonal systems into a bigger system (for a test I basically loaded all the matrices in a single horizontal element, i.e. the i,j dimensions, at once) and solving that with the same solver. This would avoid the "trailing masked threads" problem basically by averaging. But in the end it did not work. (You can see the attempt here.)

Basically what happens is that since more threads participate in the synchronisation, they spend more time waiting for their brethren to catch up. In addition, there are extra PCR steps due to the larger system. For the worst-case scenario of Nv=33 the concatenated solver was about as fast as the one in the PR; in general it was significantly slower (I only looked at a standalone "testbed", though, not at AMIP).

What we still need to try is Thomas with a single matrix per thread, but in shared memory.

However, there is also a possibility that switching to the shared memory decreased the fraction of time spent in the solver to the point where we perhaps don't need to worry about the occupancy. Basically, going from Nv=32 to Nv=33 would increase the runtime of the solver by a factor of about 2, but if it is a small enough fraction of the total time it will have a small overall effect on performance. I still need to test that, though (the nsys results indicate that the tridiagonal solves are a very small fraction of the time, but I am not sure I am counting them all due to the kernel renaming; I have made nsys and ncu runs in AMIP without the renaming to verify, but have not processed the results yet).

Contributor Author


To update on the above: I had a look at the nsys run without the kernel renaming, and it seems that the GPU time in the tridiagonal solver is still rather large (3.4%; see no_rename.tar.gz). This implies that something has gone wrong with the kernel renaming in the original report...

3.4% is a large number, sufficiently large to worry about the loss of occupancy.

The only thing is that it may not be trivial to change, so perhaps it would be better to address it in a separate follow-up PR.

Member


I have a fix for that. I will try to open a PR for it by the end of the day (PST).



function tridiag_pcr_kernel!(
x, a, b, c, d, ::Val{n}, ::Val{n_iter},
Member


Suggested change:
- x, a, b, c, d, ::Val{n}, ::Val{n_iter},
+ x, a, b, c, d, ::Val{Nv}, ::Val{n_iter},

I think it would be good to follow the pattern of using Nv to refer to the number of vertical levels.

Contributor Author


Should be addressed in 3aea99f

@imreddyTeja
Member

I did some quick local testing with the moist_dycore_prognostic_edmf_prognostic matrix solver test, run with 64 vertical faces and a higher horizontal resolution. The ldiv! time was 28 ms before and is 11 ms after 🚀

@Mikolaj-A-Kowalski
Contributor Author

Mikolaj-A-Kowalski commented Apr 27, 2026

> It would be good to run the nightly pipeline, or at least a simulation from it, with this before merging.

Is that something I can do by pushing a commit with the right annotation, or by clicking something somewhere on Buildkite?

Edit: or by opening a new branch in ClimaCoupler.jl?

@imreddyTeja
Member

imreddyTeja commented Apr 27, 2026

Downstream test in ClimaAtmos here

@Mikolaj-A-Kowalski
Contributor Author

Mikolaj-A-Kowalski commented Apr 28, 2026

I think this PR is now ready for proper review. There are a couple of issues I would like some guidance on, though:

  • Updating online documentation: should we mention the default use of the PCR solver on the CUDA back-end in the online docs (docs/src/matrix_fields.md)? On one hand it may be a good idea to bring attention to the sub-optimal performance for Nv that is not a multiple of 32. On the other hand it may be too much of an implementation detail.
  • Integration tests: the checklist mentions verifying that integration tests are in place. The solver is run as part of the unit tests in test/MatrixFields/field_matrix_solvers.jl and inside AMIP. Is this sufficient, or should I add anything?

The commit history needs to be cleaned up. I plan to do it after we finish iterating on the PR (I don't want GitHub to invalidate the conversations as a result of force pushes).

@imreddyTeja
Member

Updating online documentation: I think it would be a good idea to put this in the docs. Maybe [ClimaCore.MatrixFields.BlockDiagonalSolve](https://clima.github.io/ClimaCore.jl/dev/matrix_fields/#ClimaCore.MatrixFields.BlockDiagonalSolve) would be an appropriate place? We should definitely mention that different algorithms are used for solving the tridiagonal case on GPU vs CPU. I don't think we need to put the "Nv should be a multiple of 32" advice in the docstring, but it's up to you. At the very least, it would be good to mention it in a comment. I was planning on documenting the general advice for choosing (Nv, Ni, Nj, Nh) values when I add docs for the eager stencil evaluation.

Integration tests: I think the test coverage for the solver is sufficient.

Member

@imreddyTeja imreddyTeja left a comment


After updating the docs, I think this is ready to go. Could you also add an entry to NEWS.md?

function single_field_solve!(device::ClimaComms.CUDADevice, cache, x, A, b)

# Tridiagonal solvers are handled by special implementation
if eltype(A) <: MatrixFields.TridiagonalMatrixRow
Member


Should we also check that Nv <= 256 here, and fall back to the previous implementation otherwise? This would keep support for large Nv.

Contributor Author


I have added the redirect in d6c2910

I used a threshold of 512, since at the moment the kernel consumes only 30 registers and so supports a large number of threads per block.
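A minimal sketch of what the redirect could look like (the signature and eltype check are from the diff above; the threshold constant, the level-count helper, and the called solver names are hypothetical placeholders, and the actual change is in d6c2910):

```julia
const MAX_PCR_THREADS_PER_BLOCK = 512  # ~30 registers/thread leaves room for 512 threads/block

function single_field_solve!(device::ClimaComms.CUDADevice, cache, x, A, b)
    Nv = number_of_vertical_levels(x)  # hypothetical helper
    if eltype(A) <: MatrixFields.TridiagonalMatrixRow && Nv <= MAX_PCR_THREADS_PER_BLOCK
        # specialised PCR path: one block per column, one thread per level
        tridiag_pcr_solve!(device, cache, x, A, b)
    else
        # previous local-memory implementation, kept as a fallback for large Nv
        generic_single_field_solve!(device, cache, x, A, b)
    end
end
```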

@Mikolaj-A-Kowalski
Contributor Author

> Updating online documentation: I think it would be a good idea to put this in the docs. Maybe [ClimaCore.MatrixFields.BlockDiagonalSolve](https://clima.github.io/ClimaCore.jl/dev/matrix_fields/#ClimaCore.MatrixFields.BlockDiagonalSolve) would be an appropriate place? We should definitely mention that different algorithms are used for solving the tridiagonal case on GPU vs CPU. I don't think we need to put the "Nv should be a multiple of 32" advice in the docstring, but it's up to you. At the very least, it would be good to mention it in a comment. I was planning on documenting the general advice for choosing (Nv, Ni, Nj, Nh) values when I add docs for the eager stencil evaluation.

I have added extra documentation in c7ce231

> After updating the docs, I think this is ready to go. Could you also add an entry to NEWS.md?

Added in c50f780 (although it causes a merge conflict)

If we are broadly happy with the changes, I will proceed to rebase on main and clean up the history.

@imreddyTeja
Member

> > Updating online documentation: I think it would be a good idea to put this in the docs. Maybe [ClimaCore.MatrixFields.BlockDiagonalSolve](https://clima.github.io/ClimaCore.jl/dev/matrix_fields/#ClimaCore.MatrixFields.BlockDiagonalSolve) would be an appropriate place? We should definitely mention that different algorithms are used for solving the tridiagonal case on GPU vs CPU. I don't think we need to put the "Nv should be a multiple of 32" advice in the docstring, but it's up to you. At the very least, it would be good to mention it in a comment. I was planning on documenting the general advice for choosing (Nv, Ni, Nj, Nh) values when I add docs for the eager stencil evaluation.
>
> I have added extra documentation in c7ce231
>
> > After updating the docs, I think this is ready to go. Could you also add an entry to NEWS.md?
>
> Added in c50f780 (although it causes a merge conflict)
>
> If we are broadly happy with the changes, I will proceed to rebase on main and clean up the history.

Yes please. This looks great!

@Mikolaj-A-Kowalski
Contributor Author

All the conflicts should be resolved, and it is ready to go once the tests pass ;-)

Member

@imreddyTeja imreddyTeja left a comment


This should be a nice speed-up. Thank you!

@imreddyTeja
Member

@dennisYatunin Do you expect there to be any changes needed because #2417 was merged in?

Note that the launch configuration is not ideal at the moment. It basically
needs the number of vertical levels to be a multiple of 32 and has an upper
limit of 1024 (the maximum number of threads per block).

Note that since the solver may work with matrices that have AxisTensor
elements (with potentially multiple components), we need to use the boxed
operators.

Co-authored-by: petebachant <petebachant@gmail.com>
Co-authored-by: sjavis <sa2329@cam.ac.uk>
To capture the majority of tridiagonal solvers we need to split the
`multiple_field_solver` if any tridiagonal matrix is present. Otherwise a
tridiagonal case may be hidden in a single kernel together with
non-tridiagonal cases. This seems to happen for most instances of the
`multiple_field_solver` in the AMIP case, hence we effectively remove this
optimisation.

The new PCR solver may fail to launch the kernel if Nv (and hence the
number of threads per block) is too large. Fall back on the local-memory
solver to get stable (albeit not as performant) behaviour for a large
number of vertical levels.
The docstring of BlockDiagonalSolve is where the original solver was
described.
The boxed operators have been removed in the recent RecursiveApply
refactor; ordinary arithmetic operators can be used instead.
@Mikolaj-A-Kowalski
Contributor Author

Mikolaj-A-Kowalski commented May 5, 2026

> @dennisYatunin Do you expect there to be any changes needed because #2417 was merged in?

I have rebased on the current main, and the major change needed to get the tests to pass was to 'unbox' all the operators again. My understanding is that they have been deprecated?

I have also triggered another end-to-end performance check to see if there is any difference after #2417.

@imreddyTeja
Member

> > @dennisYatunin Do you expect there to be any changes needed because #2417 was merged in?
>
> I have rebased on the current main, and the major change needed to get the tests to pass was to 'unbox' all the operators again. My understanding is that they have been deprecated?
>
> I have also triggered another end-to-end performance check to see if there is any difference after #2417.

I checked in with him briefly yesterday. All the changes in #2417 should be backwards compatible. I believe you are correct that we should 'unbox' the operators, as they are now deprecated.

@Mikolaj-A-Kowalski
Contributor Author

Also, are the failures in KinematicDriver.jl caused by this PR? I had a quick glance, and they seem to complain about additional host allocations, but it is not obvious to me whether the functions that get called include any MatrixField solvers, or whether they run on CPU or GPU.

@imreddyTeja
Member

> Also, are the failures in KinematicDriver.jl caused by this PR? I had a quick glance, and they seem to complain about additional host allocations, but it is not obvious to me whether the functions that get called include any MatrixField solvers, or whether they run on CPU or GPU.

Those allocations are caused by #2417. The downstream tests have been failing on main with the same number of allocations since that PR was merged. I believe all the tests we run through GitHub Actions are CPU-only.

@imreddyTeja
Member

If you think this is good to go, I think this should be merged

@Mikolaj-A-Kowalski
Contributor Author

> If you think this is good to go, I think this should be merged

I think it is. I assume I should just squash and merge?

@imreddyTeja
Member

> > If you think this is good to go, I think this should be merged
>
> I think it is. I assume I should just squash and merge?

Yes. Either squash and merge or use the rebase option on the web interface. There isn't a preferred way as far as I know.

@Mikolaj-A-Kowalski Mikolaj-A-Kowalski merged commit 9a599da into main May 6, 2026
32 of 34 checks passed
@Mikolaj-A-Kowalski Mikolaj-A-Kowalski deleted the iccs/pcr-solver branch May 6, 2026 14:48

Development

Successfully merging this pull request may close these issues.

Attempt to optimize run_field_matrix_solver! for CUDA with a parallel algorithm
