-
Notifications
You must be signed in to change notification settings - Fork 18
More efficient solver for TridiagonalMatrixFields #2486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
8af6743
feat: add PCR based single field tridiagonal solver
Mikolaj-A-Kowalski e06f442
refactor: redirect TridiagonalMatrixField to special solver
Mikolaj-A-Kowalski 18b453e
doc: note the new PCR solver in relevant documentation
Mikolaj-A-Kowalski 5977c4e
refactor: unbox the operators to conform to new interface
Mikolaj-A-Kowalski File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably not a big issue because most long simulations run with 64 vertical faces, but wouldn't this result in a theoretical occupancy of 50% when Nv==32?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly! This is a biggest drawback of the kernel as it is. When the
Nvis 33 basically we end up with a 50% effective occupancy. This is the worst case scenario. But I have to admit that it is annoyingly difficult to avoid...As I mentioned in the description I have tried to improve it, by 'concatenating' multiple tridiagonal systems into a bigger system (for a test basically loaded all matrixes in a single horizontal element at once (i.j- dimensions)) and solve it instead using the same solver. This would avoid the "trailing masked threads" problem basically by averaging. But in the end it did not work. (You can see the attempt here)
Basically what happens is that since more threads participate in the synchronisation they spend more time for their brethren to catch up. In addition there are extra PCR steps due to a larger system. For the worst case scenario of
Nv=33the concatenated solver was as fast like the one in the PR. In general it was significantly slower ( I looked in a standalone "testbed" though. Not in AMIP)What we still need to try is to do Thomas with single matrix per thread but in shared memory.
However there is also a possibility that switching to the shared memory decreased the fraction of the time spent in the solver to a point where we perhaps don't need to worry about the occupancy. Basically the change from Nv of 32 to Nv 33 would increase the runtime of the solver by a factor of about 2, but if is is small enough fraction of time it will have small overall effect on performance. I still need to test that though (the
nsysresults indicate that tridiagonal solutions are a very small fraction of the time, but I am not sure I count them all due to the kernel renaming, I have madensysandncuruns in AMIP without the renaming to verify but did not process the results yet).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To update on the above. I had a look at the nsys run without the kernel renaming and it seems that the GPU time in the tridiagonal solver is still rather large (3.4% percent) no_rename.tar.gz. This implies that something have gone wrong in the kernel renaming in the original report...
3.4% is a large number, sufficiently large to worry about the loss of the occupancy.
The only thing I am thinking is that it may not be trivial to change it so perhaps it would be better to address it in a separate, follow-up PR>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a fix for that. I will try to open a PR for it by end of the PST day