[REPLACED] More efficient solver for TridiagonalMatrixFields #2484
Closed
Mikolaj-A-Kowalski wants to merge 3 commits into CliMA:main from
Conversation
Note that the launch configuration is not ideal at the moment. It basically needs the number of vertical levels to be a multiple of 32 and has an upper limit of 1024 (the number of threads per block). Co-authored-by: petebachant <[email protected]> Co-authored-by: sjavis <[email protected]>
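The constraint above can be illustrated with a small hypothetical helper (a pure-Python sketch; the names `launch_config`, `WARP`, and `MAX_THREADS_PER_BLOCK` are illustrative, not from this PR): one block per column, one thread per vertical level, with the thread count rounded up to a whole number of warps and capped at 1024.

```python
WARP = 32
MAX_THREADS_PER_BLOCK = 1024

def launch_config(nlevels, ncolumns):
    """Hypothetical sketch of the current scheme: one block per column,
    one thread per vertical level, padded up to a multiple of the warp
    size. Levels far from a multiple of 32 waste the padding threads."""
    threads = -(-nlevels // WARP) * WARP  # ceil to a multiple of 32
    if threads > MAX_THREADS_PER_BLOCK:
        raise ValueError("too many vertical levels for a single block")
    return ncolumns, threads  # (blocks, threads per block)

print(launch_config(63, 100))  # 63 levels pad up to 64 threads
```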
To capture the majority of tridiagonal solves we need to split the `multiple_field_solver` if any tridiagonal matrix is present. Otherwise the tridiagonal case may be hidden in a single kernel together with non-tridiagonal cases. This seems to happen for most instances of the `multiple_field_solver` in the AMIP case, hence we effectively remove this optimisation.
Contributor
The following contributor(s) must sign the CLA before this PR can be merged:
Please visit https://ecodesign.clima.caltech.edu/cla/ to review and sign the CLA.
How to sign: Authenticate with GitHub then click the "I agree" button.
Once completed, re-run the checks on this PR.
Contributor
Author
Replaced by #2486.
@petebachant @imreddyTeja @sjavis @AdelekeBankole
In this PR we aim to contribute a more efficient implementation of the tridiagonal matrix solver. To that effect we:
- split `multiple_field_solve!` to isolate tridiagonal cases

At the moment the specialised solver is significantly faster: e.g. when tested on L40, a single solve drops from ~200ms to ~90ms (compere.tar.gz). Once we trigger the [perf] tag we will see how it performs in the complete simulation 🤞

This is still a draft since the solver is not good enough from the point of view of the launch configuration. We launch a block per column and a thread for each vertical level. This works well if the number of levels is close to a multiple of 32 (like AMIP), but it is not stable enough. In this PR we will try to provide and test some alternatives. At the moment I see two options we can try:

Stitch matrices together on load to shared memory
This would basically reuse the existing (quite fast) shmem solver but, to balance the load, load multiple columns into shmem and solve them as a single matrix (as far as I know, rows with 0 off-diagonals at the matrix 'stitches' should not cause numerical problems...). The problem here is that the first (or last) element of each column's off-diagonal would need to be patched to 0 on load to shared memory, since it may not be zero (it currently has "don't care" status). Since the loading of data currently accounts for at least 50% of the solver runtime, extra logic on load may not be performant. We shall see.
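The stitching idea can be checked numerically in a few lines. This is a pure-Python sketch, not the shmem kernel, and `thomas_solve` is an illustrative serial solver rather than a function from this repo: two independent tridiagonal systems are concatenated, the off-diagonal entries at the seam are patched to zero, and a single solve of the stitched system reproduces both independent solutions.

```python
def thomas_solve(a, b, c, d):
    """Serial Thomas algorithm. a: sub-diagonal (a[0] ignored),
    b: diagonal, c: super-diagonal (c[-1] ignored), d: RHS."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Two independent 3x3 systems (solutions [1,1,1] and [1,2,3]).
a1, b1, c1, d1 = [0, 1, 1], [2, 2, 2], [1, 1, 0], [3, 4, 3]
a2, b2, c2, d2 = [0, 1, 1], [3, 3, 3], [1, 1, 0], [5, 10, 11]

# Stitch into one 6x6 system; patch the seam off-diagonals to 0
# (in the real solver these entries have "don't care" status and
# may hold garbage), which decouples the two blocks.
a, b, c, d = a1 + a2, b1 + b2, c1 + c2, d1 + d2
a[len(a1)] = 0.0      # seam: sub-diagonal entering the second block
c[len(c1) - 1] = 0.0  # seam: super-diagonal leaving the first block

x = thomas_solve(a, b, c, d)
x1 = thomas_solve(a1, b1, c1, d1)
x2 = thomas_solve(a2, b2, c2, d2)
assert all(abs(u - v) < 1e-12 for u, v in zip(x, x1 + x2))
```

The zeroed seam entries make the forward sweep restart cleanly at the second block, so the stitched solve is exactly two independent solves back to back.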
Shared Memory Parallel Thomas
Since we expect the vertical tridiagonal matrices to be small, we can solve them in batches of 32: have a single warp per block and let each of the threads solve a full matrix with the Parallel Thomas.
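The batched scheme above can be mocked up as follows (an illustrative sketch under stated assumptions: `thomas_solve` and `solve_batch` are hypothetical names, and the real implementation would map each column to a thread of the warp rather than a loop iteration):

```python
def thomas_solve(a, b, c, d):
    """Serial Thomas algorithm for one column's tridiagonal system.
    a: sub-diagonal (a[0] ignored), b: diagonal,
    c: super-diagonal (c[-1] ignored), d: RHS."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def solve_batch(columns):
    """Each entry of `columns` is one column's (a, b, c, d);
    on the GPU each of the 32 threads of the warp would take one
    column, here a plain loop stands in for the warp."""
    return [thomas_solve(a, b, c, d) for (a, b, c, d) in columns]

# A warp-sized batch of identical columns with solution [1, 1, 1].
cols = [([0, 1, 1], [2, 2, 2], [1, 1, 0], [3, 4, 3])] * 32
xs = solve_batch(cols)
assert all(all(abs(v - 1) < 1e-12 for v in x) for x in xs)
```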
TODO
When the PR is ready, go through the checklist: