More efficient solver for TridiagonalMatrixFields #2486
Mikolaj-A-Kowalski merged 4 commits into main
Conversation
The following contributor(s) must sign the CLA before this PR can be merged:
Please visit https://ecodesign.clima.caltech.edu/cla/ to review and sign the CLA.
How to sign: Authenticate with GitHub then click the "I agree" button.
Once completed, re-run the checks on this PR.
Looks like the e2e performance pipeline is failing. Do you have materials committed to your dev repo with all the scripts, configs, etc., you used to test this during development and to generate that attached report? Maybe this only happens when running with the Coupler? I've seen strange idiosyncrasies where kernels sped up for Atmos don't improve when run with the Coupler. For this PR we should definitely have a unit test for the new solver, ideally using the same case(s) as the existing solver, to check that the numerical values are similar enough.
Sadly it happens only during the AMIP run. I was able to reproduce that locally on our side, but I am still struggling through compilation times and Julia crashes. We have been iterating on the solver in a testbed.
It seems that the crash is caused by a Field with elements of a particular type. EDIT: So I was not quite 100% correct.
If you aren't already, I think it would be helpful to use |
It was a good shout. The problem seems to be that we did not properly handle the cases where the components of the matrix and vectors are tensors and co/contra-variant vectors. I am not fully certain that I interpret things correctly, but this is why we should have used the 'boxed' operators.
Force-pushed from b2dbb21 to bb6ad40
Boxing the operators seems to solve the problem with the failing end-to-end run. Basically we missed that not only the elements of the field, but also the entries of the matrix, may be tensor-valued. Now the CI seems to run happily and the AMIP run is reporting a healthy speedup.

I do, however, have a couple of worries to resolve. The first is that there are a lot of tests failing in the CI (although the failures seem to be ignored when determining the status of the overall job). I assume these are tests known to fail?

Below I have attached the Nsys reports from before and after this PR. However, in the 'after' report I cannot seem to find the matrix solver kernels, which is quite scary. The unit tests do pass, and locally I have verified that the solver runs correctly. Also, I presume that something would have failed if, e.g. due to some bug, the solver was not called at all? Hence I guess it is something happening with the kernel renaming scheme?
Ok. I have verified in the REPL that the solver does get run. The only suspicious thing is that the performance improvement is much, much better than we could have expected.
imreddyTeja
left a comment
This looks mostly good to me. I'm having trouble opening the summary.ods file you attached, but I will run some quick benchmarks locally to verify the speedup. It would be good to run the nightly pipeline, or at least a simulation from it, with this before merging.
b_data = Fields.field_values(b)

# Solve
threads_per_block = min(Nv, 256)
Bad things...
This is a good catch! This line stayed there by accident. It should be removed.
Ultimately, the only problem that may happen is that the vertical size may exceed the maximum number of threads per block that can be supported (for a given register usage), and then the kernel will fail to launch.
If we want proper error handling for that (with CUDA.maxthreads), the annoying part is that to call it we need access to the compiled kernel, so it may need to happen inside auto_launch!.
I have removed the line in 3aea99f.
If the size is too large to handle, CUDA.jl will emit its standard error that the number of threads is too large.
If we want, I can also add a check inside auto_launch! so ClimaCore.jl emits a message that may be made more descriptive.
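The proposed guard is simple enough to sketch. Below is a hypothetical illustration in Python (the actual code would be Julia inside auto_launch!; the function name, threshold argument, and message are all invented for illustration) of what a more descriptive pre-launch check could look like:

```python
def check_launch(nv: int, max_threads_per_block: int) -> None:
    """Hypothetical pre-launch guard: fail with a descriptive message
    instead of CUDA's generic 'too many threads' launch error."""
    if nv > max_threads_per_block:
        raise ValueError(
            f"number of vertical levels Nv={nv} exceeds the maximum "
            f"threads per block ({max_threads_per_block}) supported by "
            f"the tridiagonal PCR kernel"
        )

check_launch(64, 512)  # within the limit: no error raised
```

The point of the check is purely diagnostic; the behaviour for valid sizes is unchanged.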
auto_launch!(
    tridiag_pcr_kernel!,
    args;
    threads_s = (threads_per_block,),
This is probably not a big issue because most long simulations run with 64 vertical faces, but wouldn't this result in a theoretical occupancy of 50% when Nv==32?
Exactly! This is the biggest drawback of the kernel as it stands. When Nv is 33 we basically end up with a 50% effective occupancy, which is the worst case scenario. But I have to admit that it is annoyingly difficult to avoid...

As I mentioned in the description, I have tried to improve it by 'concatenating' multiple tridiagonal systems into a bigger system (for a test I basically loaded all matrices in a single horizontal element at once, over the i,j dimensions) and solving that instead, using the same solver. This would avoid the "trailing masked threads" problem basically by averaging. But in the end it did not work. (You can see the attempt here.)

Basically what happens is that, since more threads participate in the synchronisation, they spend more time waiting for their brethren to catch up. In addition there are extra PCR steps due to the larger system. For the worst case scenario of Nv=33 the concatenated solver was as fast as the one in this PR; in general it was significantly slower (I looked in a standalone "testbed" though, not in AMIP).

What we still need to try is Thomas with a single matrix per thread, but in shared memory.

However, there is also a possibility that switching to shared memory decreased the fraction of the time spent in the solver to a point where we perhaps don't need to worry about the occupancy. Basically, the change from Nv of 32 to Nv of 33 would increase the runtime of the solver by a factor of about 2, but if it is a small enough fraction of the time it will have a small overall effect on performance. I still need to test that though (the nsys results indicate that the tridiagonal solutions are a very small fraction of the time, but I am not sure I count them all due to the kernel renaming; I have made nsys and ncu runs in AMIP without the renaming to verify, but have not processed the results yet).
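The "Nv=33 costs roughly a factor of two" claim above is easy to check with a back-of-the-envelope calculation. This is a sketch, not ClimaCore code: with one thread per vertical level and a warp size of 32, the fraction of launched threads doing useful work drops sharply whenever Nv sits just above a multiple of 32.

```python
WARP = 32  # threads per warp on NVIDIA GPUs

def useful_thread_fraction(nv: int) -> float:
    """Fraction of launched threads that map to a vertical level when a
    block launches one thread per level, padded to whole warps."""
    warps = -(-nv // WARP)  # ceil(nv / 32): warps the block actually occupies
    return nv / (warps * WARP)

for nv in (32, 33, 64):
    print(nv, useful_thread_fraction(nv))
# 32 -> 1.0, 33 -> ~0.52 (the worst case discussed above), 64 -> 1.0
```

For Nv=33 only 33 of 64 launched threads are useful, which matches the roughly 50% effective occupancy mentioned above.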
To update on the above: I had a look at the nsys run without the kernel renaming and it seems that the GPU time in the tridiagonal solver is still rather large (3.4%): no_rename.tar.gz. This implies that something has gone wrong with the kernel renaming in the original report...
3.4% is a large number, sufficiently large to worry about the loss of occupancy.
The only thing I am thinking is that it may not be trivial to change, so perhaps it would be better to address it in a separate, follow-up PR.
I have a fix for that. I will try to open a PR for it by the end of the day (PST).
function tridiag_pcr_kernel!(
    x, a, b, c, d, ::Val{n}, ::Val{n_iter},
Suggested change:
-    x, a, b, c, d, ::Val{n}, ::Val{n_iter},
+    x, a, b, c, d, ::Val{Nv}, ::Val{n_iter},
I think it would be good to follow the pattern of using Nv to refer to the number of vertical levels.
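For reference, here is a serial Python emulation of parallel cyclic reduction, the algorithm behind tridiag_pcr_kernel!. This is a sketch, not the actual kernel: in the CUDA version each index i of the inner loop would be one thread, with a block-wide synchronisation between reduction steps, and n_iter = ceil(log2(Nv)) reduction steps decouple every equation (which is presumably why the kernel takes ::Val{n_iter} as a compile-time parameter). The boundary convention here (out-of-range neighbours contribute nothing) is illustrative.

```python
import math

def pcr_solve(a, b, c, d):
    """Solve a tridiagonal system by parallel cyclic reduction (serial sketch).
    Row i is: a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], with a[0] = c[-1] = 0."""
    n = len(b)
    n_iter = max(1, math.ceil(math.log2(n)))
    a, b, c, d = list(a), list(b), list(c), list(d)
    s = 1  # stride between coupled equations, doubles every step
    for _ in range(n_iter):
        na, nb, nc, nd = a[:], b[:], c[:], d[:]
        for i in range(n):  # in the kernel, each i is one thread
            lo, hi = i - s, i + s
            alpha = -a[i] / b[lo] if lo >= 0 else 0.0
            beta = -c[i] / b[hi] if hi < n else 0.0
            # Eliminate x[lo] and x[hi]; row i now couples to x[i +/- 2s].
            na[i] = alpha * a[lo] if lo >= 0 else 0.0
            nc[i] = beta * c[hi] if hi < n else 0.0
            nb[i] = (b[i]
                     + (alpha * c[lo] if lo >= 0 else 0.0)
                     + (beta * a[hi] if hi < n else 0.0))
            nd[i] = (d[i]
                     + (alpha * d[lo] if lo >= 0 else 0.0)
                     + (beta * d[hi] if hi < n else 0.0))
        a, b, c, d = na, nb, nc, nd  # in the kernel: __syncthreads() here
        s *= 2
    # After ceil(log2(n)) steps every equation is decoupled: b[i]*x[i] = d[i].
    return [d[i] / b[i] for i in range(n)]
```

Each reduction step roughly halves the coupling, which is also why concatenating systems (as discussed above) adds extra PCR steps: the step count grows with log2 of the total system size.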
I did some quick local testing with the solver.
Is it something I can do by pushing a commit with the right annotation, or by clicking something somewhere on Buildkite? Edit: or by opening a new branch in the downstream repo?
Downstream test in ClimaAtmos here.
I think this PR is now ready for a proper review. There are a couple of issues I would like some guidance on, though:
The commit history needs to be cleaned up. I plan to do it after we finish iterating on the PR (I don't want GitHub to invalidate the conversations as a result of force pushes).
Updating online documentation: I think it would be a good idea to put this in the docs.
Integration tests: I think the test coverage for the solver is sufficient.
imreddyTeja
left a comment
After updating the docs, I think this is ready to go. Could you also add an entry to NEWS.md?
function single_field_solve!(device::ClimaComms.CUDADevice, cache, x, A, b)

# Tridiagonal solvers are handled by a special implementation
if eltype(A) <: MatrixFields.TridiagonalMatrixRow
Should we also check that Nv <= 256 here, and fall back to the previous implementation otherwise? This would keep support for large Nv.
I have added the redirect in d6c2910.
I used the threshold of 512, since at the moment the kernel consumes only 30 registers, so it supports a large number of threads per block.
I have added extra documentation in c7ce231.
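The resulting dispatch is roughly as follows. This is a Python sketch of the logic only; the threshold of 512 comes from the comment above, while the function and solver names are invented for illustration:

```python
MAX_PCR_NV = 512  # kernel uses ~30 registers, so 512 threads/block fit

def choose_solver(is_tridiagonal: bool, nv: int) -> str:
    """Route a column solve to the specialised PCR kernel only when the
    matrix is tridiagonal and the column fits in one thread block;
    otherwise fall back to the previous general solver."""
    if is_tridiagonal and nv <= MAX_PCR_NV:
        return "pcr"           # one block per column, one thread per level
    return "local_memory"      # previous solver: slower but size-agnostic

print(choose_solver(True, 64))    # pcr
print(choose_solver(True, 1024))  # local_memory (large Nv keeps working)
print(choose_solver(False, 64))   # local_memory
```

This keeps the behaviour stable for arbitrarily tall columns while capturing the common AMIP-like cases with the fast kernel.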
Added in c50f780 (although it causes a merge conflict). If we are broadly happy with the changes, I will proceed to rebase on main.
Yes please. This looks great!
Force-pushed from c7ce231 to 937ee08
All the conflicts should be resolved, and it is ready to go when the tests pass ;-)
imreddyTeja
left a comment
This should be a nice speedup. Thank you!
@dennisYatunin Do you expect there to be any changes needed because #2417 was merged in? |
Note that the launch configuration is not ideal at the moment. Basically it needs the number of vertical levels to be a multiple of 32, and it has an upper limit of 1024 (the maximum number of threads per block).

Note that since the solver may work with matrices that have AxisTensor elements (with potentially multiple components), we need to use the boxed operators.

Co-authored-by: petebachant <petebachant@gmail.com>
Co-authored-by: sjavis <sa2329@cam.ac.uk>
To capture the majority of tridiagonal solvers we need to split the `multiple_field_solver` if any tridiagonal matrix is present. Otherwise the tridiagonal case may be hidden in a single kernel together with non-tridiagonal cases. This seems to happen for most instances of the `multiple_field_solver` in the AMIP case, hence we would effectively lose this optimisation.

The new PCR solver may fail to launch the kernel if Nv (and hence the number of threads per block) is too large. Fall back to the local-memory solver to have stable (albeit not as performant) behaviour for large numbers of vertical levels.
The docstring of BlockDiagonalSolve is where the original solver was described.
The boxed operators have been removed in the recent RecursiveApply refactor. Instead, ordinary arithmetic operators can be used.
Force-pushed from 937ee08 to 86ea65b
I have rebased on the current main. I have also triggered another end-to-end performance check to see if there is any difference after #2417.
I checked in with him briefly yesterday. All the changes in #2417 should be backwards compatible. I believe you are correct that we should 'unbox' the operators, as they are now deprecated.
Also, are the failures in the downstream tests expected?
Those allocations are caused by #2417. The downstream tests have been failing on main with the same number of allocations since that PR was merged. I believe all the tests that we run through GitHub Actions are CPU-only.
If you think this is good to go, I think this should be merged.
Force-pushed from 86ea65b to 5977c4e
I think it is. I assume I should just squash and merge?
Yes, either squash and merge or use the rebase option on the web interface. There isn't a preferred way as far as I know.
Replaces #2484 (since we need a PR from inside the repo to run the CI perf jobs; the description is copied verbatim). Closes #2401
@petebachant @imreddyTeja @sjavis @AdelekeBankole
In this PR we aim to contribute a more efficient implementation of the tridiagonal matrix solver. To that effect we:
- split multiple_field_solve! to isolate tridiagonal cases
At the moment the specialised solver is significantly faster (e.g. when tested on an L40, a single solve reduces from ~200ms to ~90ms: compere.tar.gz). Once we trigger the [perf] tag we will see how it performs in the complete simulation 🤞
This is still a draft, since the solver is not good enough from the point of view of the launch configuration. We launch a block per column and a thread for each vertical level. This works well if the number of levels is close to a multiple of 32 (like AMIP), but it is not stable enough. In this PR we will try to provide and test some alternatives. At the moment I see two options we can try:
Stitch Matrices together on load to shared memory
This would basically use the existing (quite fast) shmem solver, but, to balance the load, load multiple columns into shmem and solve them as a single matrix (as far as I know, rows with 0 off-diagonals at the matrix 'stitches' should not cause numerical problems...). The problem here is that the first (or last) element of the off-diagonal would need to be patched to 0 on load to shared memory, since it may not be zero, currently having "don't care" status.
Since the loading of data currently accounts for at least 50% of the solver runtime, extra logic on load may not be performant. We shall see.
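The patching step described above is small. Here is a Python sketch of the 'stitching' idea, under the assumption that each system is given as (a, b, c, d) arrays whose first sub-diagonal and last super-diagonal entries have "don't care" status; the function name is illustrative:

```python
def stitch(systems):
    """Concatenate tridiagonal systems (a, b, c, d) into one big system.
    Patching the off-diagonals to zero at each seam keeps the sub-systems
    decoupled, so one solve of the big system yields all the individual
    solutions back to back."""
    A, B, C, D = [], [], [], []
    for a, b, c, d in systems:
        A += [0.0] + list(a[1:])   # first sub-diagonal entry patched to 0
        C += list(c[:-1]) + [0.0]  # last super-diagonal entry patched to 0
        B += list(b)
        D += list(d)
    return A, B, C, D
```

Because the seam entries are zero, the big matrix is block-diagonal with tridiagonal blocks, so any tridiagonal solver applied to it reproduces the per-column solutions; the cost is exactly the extra branching on load discussed above.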
Shared Memory Parallel Thomas
Since we expect the vertical tridiagonal matrices to be small, we can solve them in batches of 32: have a single warp per block and let each of the threads solve a full matrix with the parallel Thomas.
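For reference, this is the (serial) Thomas algorithm each thread would run on its own column, sketched in Python; in the shared-memory variant proposed above, the per-column arrays would live in shmem and one warp would process 32 columns concurrently:

```python
def thomas_solve(a, b, c, d):
    """Classic Thomas algorithm: O(n) forward elimination followed by
    back substitution. Row i is a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],
    with a[0] = c[-1] = 0."""
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):  # forward sweep: eliminate the sub-diagonal
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Unlike PCR, this does O(n) sequential work per column with no synchronisation, which is why one-matrix-per-thread batching is attractive when many small columns are available.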
TODO
When the PR is ready, go through the checklist: