Set per-device kernel shared memory on every GPU #806

Merged
sunqm merged 1 commit into pyscf:master from tvogels:fix-initialization
Jun 29, 2026

tvogels commented Jun 26, 2026

On a machine with more than one GPU, direct-SCF, gradient, and Hessian builds crash for any basis containing d or f shells, with errors such as `CUDA Error in MD_build_j: invalid argument` and `RuntimeError: MD_build_j kernel for (dp|dp) failed`. A single GPU always works, and p-only bases work even on multiple GPUs.

The cause: CUDA kernels that need more than 48 KB of dynamic shared memory require a per-device opt-in via `cudaFuncSetAttribute`. The Python `*_init` wrappers were called only once, at module import, on device 0, whereas the work is distributed across all GPUs through `multi_gpu.run`. Any kernel launched on device 1 or higher therefore never received the opt-in and failed.
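To make the failure mode concrete, here is a toy model in pure Python. It is not the CUDA runtime and not the project's code: `FakeDevice`, `set_max_dynamic_shm`, and `launch` are hypothetical stand-ins, and the 64 KB value for `SHM_SIZE` is an assumption chosen only because it exceeds the 48 KB default. It shows how an opt-in applied once on device 0 leaves the other devices at the default limit.

```python
# Toy model only: each simulated device keeps its own record of which
# kernels have been opted in to a larger dynamic shared memory limit.
DEFAULT_LIMIT = 48 * 1024  # bytes available without an opt-in
SHM_SIZE = 64 * 1024       # assumed value, chosen to exceed the default


class FakeDevice:
    """Hypothetical stand-in for one GPU's per-kernel attribute state."""

    def __init__(self, device_id):
        self.device_id = device_id
        self.max_dynamic_shm = {}  # kernel name -> opted-in limit in bytes

    def set_max_dynamic_shm(self, kernel, nbytes):
        # Plays the role of cudaFuncSetAttribute on this device only.
        self.max_dynamic_shm[kernel] = nbytes

    def launch(self, kernel, shm_bytes):
        # A launch asking for more than this device's limit is rejected.
        limit = self.max_dynamic_shm.get(kernel, DEFAULT_LIMIT)
        return "ok" if shm_bytes <= limit else "invalid argument"


devices = [FakeDevice(i) for i in range(4)]

# Module-level init: executed once, so only device 0 receives the opt-in.
devices[0].set_max_dynamic_shm("MD_build_j", SHM_SIZE)

large = [d.launch("MD_build_j", SHM_SIZE) for d in devices]
small = [d.launch("MD_build_j", 16 * 1024) for d in devices]

print(large)  # only device 0 accepts the large request
print(small)  # requests under the default limit succeed everywhere
```

In this model, launches that stay under the default limit succeed on every device, while the large request succeeds only where the opt-in was applied.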

This was introduced in v1.5.0 and is still present in the current v1.7.4. Commit 6af29a4 (#547) moved `init_mdj_constant` out of the per-device `proc` to module level, and a396d48 (#505) added the `RYS_build_jk`/`RYS_build_k` builders with module-level init from the start. The later gradient, Hessian, and periodic kernels followed the same broken pattern.

CI does not cover this condition because its runners have only a single GPU; the tests pass there, so the regression went unnoticed.

The fix restores the original pattern: each `*_init(SHM_SIZE)` is called inside the per-device `proc` with an error check, and the module-level call is demoted to a `.restype` declaration. This touches ten `proc` sites in six files (`scf/j_engine.py`, `scf/jk.py`, `grad/rhf.py`, `hessian/rhf.py`, `pbc/scf/j_engine.py`, `pbc/scf/rsjk.py`), mirrors the rysj path, which was always correct, and does not change single-GPU behaviour.
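The shape of the fixed pattern can be sketched in pure Python, under stated assumptions. Everything here is hypothetical and simulated: `md_j_init`, `launch`, and `run_on_all_devices` are stand-ins rather than the project's API, the device is passed explicitly instead of being the current CUDA device, and `SHM_SIZE` is an assumed 64 KB. The point is only that the init call and its error check live inside `proc`, so they run once per device.

```python
from concurrent.futures import ThreadPoolExecutor

DEFAULT_LIMIT = 48 * 1024  # bytes available without an opt-in
SHM_SIZE = 64 * 1024       # assumed value, chosen to exceed the default

# Simulated per-device state: device id -> opted-in limit in bytes.
opted_in = {}


def md_j_init(device_id, shm_size):
    """Hypothetical stand-in for a *_init wrapper: 0 on success, else non-zero."""
    if shm_size <= 0:
        return 1
    opted_in[device_id] = shm_size
    return 0


def launch(device_id, shm_bytes):
    """Simulated kernel launch that fails without a sufficient opt-in."""
    if shm_bytes > opted_in.get(device_id, DEFAULT_LIMIT):
        raise RuntimeError(f"kernel failed on device {device_id}: invalid argument")
    return device_id


def proc(device_id):
    # Per-device init with an error check, performed before any launch.
    err = md_j_init(device_id, SHM_SIZE)
    if err != 0:
        raise RuntimeError(f"init failed on device {device_id}")
    return launch(device_id, SHM_SIZE)


def run_on_all_devices(func, n_devices):
    """Hypothetical dispatcher: one worker thread per simulated device."""
    with ThreadPoolExecutor(max_workers=n_devices) as pool:
        return list(pool.map(func, range(n_devices)))


completed = run_on_all_devices(proc, 4)
print(completed)  # ids of the devices whose launch succeeded
```

Because the init is repeated in every `proc`, a single-device run takes the same path as before, just with the call moved from import time to the start of the work.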

I verified it on four A100s: the SCF, gradient, Hessian, and periodic test suites that failed before now pass (j_engine 6, jk 12, grad 21, hessian 7, PBC j_engine+jk 36, PBC scf/stress 27), and single-GPU runs are unchanged.

This fixes #623.

tvogels force-pushed the fix-initialization branch from b22fa00 to bbea591 on June 29, 2026 06:29

tvogels commented Jun 29, 2026

@sunqm I adapted the PR based on a failing test. Please run the CI again.

sunqm merged commit 8c86782 into pyscf:master on Jun 29, 2026
4 checks passed

Development

Successfully merging this pull request may close these issues.

Direct SCF fails with multiple GPUs