Set per-device kernel shared memory on every GPU #806
Open
tvogels wants to merge 1 commit into
On a machine with more than one GPU, direct-SCF, gradient, and Hessian builds crash for any basis containing d or f shells, with errors like `CUDA Error in MD_build_j: invalid argument` and `RuntimeError: MD_build_j kernel for (dp|dp) failed`. A single GPU always works, and p-only bases work even on multiple GPUs.

The cause is that CUDA kernels needing more than 48 KB of dynamic shared memory require a per-device opt-in via `cudaFuncSetAttribute`, but the Python `*_init` wrappers were called only once, at module import, on device 0. Work is then distributed across all GPUs through `multi_gpu.run`, so any kernel launched on device 1 and beyond never received the opt-in and failed.

This was introduced in v1.5.0 and is still present through the current v1.7.4. Commit 6af29a4 (pyscf#547) moved `init_mdj_constant` out of the per-device `proc` to module level, and a396d48 (pyscf#505) added the `RYS_build_jk`/`RYS_build_k` builders with module-level init from the start. The later gradient, Hessian, and periodic kernels followed the same broken pattern. CI does not cover this condition because the runners have only a single GPU, so the tests pass there and the regression went unnoticed.

The fix restores the original pattern: each `*_init(SHM_SIZE)` is called inside the per-device `proc` with an error check, and the module-level call is demoted to a `.restype` declaration. This touches ten `proc` sites in six files (scf/j_engine.py, scf/jk.py, grad/rhf.py, hessian/rhf.py, pbc/scf/j_engine.py, pbc/scf/rsjk.py) and mirrors the rysj path, which was always correct. Single-GPU behaviour does not change.

I verified it on four A100s: the SCF, gradient, Hessian, and periodic test suites that failed before now pass (j_engine 6, jk 12, grad 21, hessian 7, PBC j_engine+jk 36, PBC scf/stress 27), and single-GPU runs are unchanged.
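For reviewers without a multi-GPU machine, the failure mode can be illustrated with a small pure-Python simulation. This is only a sketch: `FakeDevice`, `kernel_init`, `launch`, `run_on_all`, and the 96 KB request are hypothetical stand-ins invented for illustration, not the actual gpu4pyscf or CUDA API. The only behaviour it models is the one described above: the opt-in for more than 48 KB of dynamic shared memory is stored per device, so an init performed once on device 0 does not carry over to the other devices.

```python
from concurrent.futures import ThreadPoolExecutor

DEFAULT_LIMIT = 48 * 1024  # dynamic shared memory usable without an opt-in
SHM_SIZE = 96 * 1024       # hypothetical request above the default limit


class FakeDevice:
    """Stand-in for one GPU; the shared-memory opt-in is stored per device."""

    def __init__(self, device_id):
        self.device_id = device_id
        self.max_dynamic_shm = DEFAULT_LIMIT

    def kernel_init(self, shm_size):
        # Plays the role of a *_init wrapper around cudaFuncSetAttribute.
        self.max_dynamic_shm = max(self.max_dynamic_shm, shm_size)
        return 0  # 0 means success

    def launch(self, shm_size):
        # A launch that asks for more than the device allows is rejected.
        if shm_size > self.max_dynamic_shm:
            raise RuntimeError(f"device {self.device_id}: invalid argument")


def broken_proc(dev):
    # Relies on an init that only ever ran on device 0.
    dev.launch(SHM_SIZE)


def fixed_proc(dev):
    # Opt in on the device that will actually run the kernel, and check it.
    err = dev.kernel_init(SHM_SIZE)
    if err != 0:
        raise RuntimeError(f"device {dev.device_id}: init failed")
    dev.launch(SHM_SIZE)


def run_on_all(devices, proc):
    """Hypothetical analogue of multi_gpu.run: one worker per device.

    Returns a list of booleans, True where the launch succeeded.
    """
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        futures = [pool.submit(proc, dev) for dev in devices]
    outcomes = []
    for future in futures:
        try:
            future.result()
            outcomes.append(True)
        except RuntimeError:
            outcomes.append(False)
    return outcomes


if __name__ == "__main__":
    devices = [FakeDevice(i) for i in range(4)]
    devices[0].kernel_init(SHM_SIZE)  # "module-level" init on device 0 only
    print("module-level init:", run_on_all(devices, broken_proc))

    devices = [FakeDevice(i) for i in range(4)]
    print("per-device init:  ", run_on_all(devices, fixed_proc))
```

In this model the module-level variant succeeds only on device 0, while the per-device variant succeeds everywhere, which matches the observation that single-GPU runs never showed the problem.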
This fixes #623.