Set per-device kernel shared memory on every GPU by tvogels · Pull Request #806 · pyscf/gpu4pyscf

tvogels · 2026-06-26T08:13:55Z

On a machine with more than one GPU, direct-SCF, gradient, and Hessian builds crash for any basis containing d or f shells, with errors such as `CUDA Error in MD_build_j: invalid argument` and `RuntimeError: MD_build_j kernel for (dp|dp) failed`. A single GPU always works, and p-only bases work even on multiple GPUs.

The cause is that CUDA kernels needing more than 48 KB of dynamic shared memory require a per-device opt-in via `cudaFuncSetAttribute`. The Python `*_init` wrappers were called only once, at module import, on device 0, whereas the work is then distributed across all GPUs through `multi_gpu.run`. Any kernel launched on device 1 or beyond therefore never received the opt-in and failed.
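The failure mode described above can be illustrated with a small toy model. This is only a sketch in plain Python with no CUDA involved: `FakeDevice`, `try_launch`, and the 64 KB value of `SHM_SIZE` are hypothetical names and numbers chosen for illustration, not code from gpu4pyscf. It assumes that p-only kernels stay under the 48 KB default, which would explain why they work everywhere.

```python
# Toy model of per-device kernel attributes; nothing here touches a real GPU.
DEFAULT_LIMIT = 48 * 1024  # default dynamic shared memory limit per kernel, in bytes
SHM_SIZE = 64 * 1024       # hypothetical request above the default limit


class FakeDevice:
    """Hypothetical stand-in for one GPU's per-kernel attribute table."""

    def __init__(self, index):
        self.index = index
        self.max_dynamic_smem = {}  # kernel name -> opted-in limit in bytes

    def set_max_dynamic_smem(self, kernel, nbytes):
        # Models the opt-in made through cudaFuncSetAttribute with
        # cudaFuncAttributeMaxDynamicSharedMemorySize; it only affects this device.
        self.max_dynamic_smem[kernel] = nbytes

    def launch(self, kernel, nbytes):
        limit = self.max_dynamic_smem.get(kernel, DEFAULT_LIMIT)
        if nbytes > limit:
            raise RuntimeError(f"{kernel} on device {self.index}: invalid argument")
        return "ok"


def try_launch(device, nbytes):
    try:
        return device.launch("MD_build_j", nbytes)
    except RuntimeError:
        return "failed"


devices = [FakeDevice(i) for i in range(4)]

# Module-import-time init: runs once, and only device 0 is current at that point.
devices[0].set_max_dynamic_smem("MD_build_j", SHM_SIZE)

large = [try_launch(d, SHM_SIZE) for d in devices]   # d/f-shell-like request
small = [try_launch(d, 16 * 1024) for d in devices]  # request under the default limit
print(large)
print(small)
```

In this model only device 0 accepts the large request, while the small request succeeds on every device, matching the symptoms reported for multi-GPU runs.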

This was introduced in v1.5.0 and is still present through the current v1.7.4. Commit 6af29a4 (#547) moved `init_mdj_constant` out of the per-device `proc` to module level, and a396d48 (#505) added the `RYS_build_jk`/`RYS_build_k` builders with module-level init from the start. The later gradient, Hessian, and periodic kernels followed the same broken pattern.

CI does not cover this condition because the runners have only a single GPU, so the tests pass there and the regression went unnoticed.

The fix restores the original pattern: each `*_init(SHM_SIZE)` is called inside the per-device `proc` with an error check, and the module-level call is demoted to a `.restype` declaration. This touches ten `proc` sites in six files (`scf/j_engine.py`, `scf/jk.py`, `grad/rhf.py`, `hessian/rhf.py`, `pbc/scf/j_engine.py`, `pbc/scf/rsjk.py`), mirrors the rysj path, which was always correct, and does not change single-GPU behaviour.
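The shape of the fix might look roughly like the following sketch. It is not the code from this PR: `FakeLib` is a hypothetical stand-in for the compiled library, `MD_build_j_init` is an assumed name following the `*_init` convention, the zero-on-success return value is an assumed convention, and a plain loop stands in for `multi_gpu.run`, whose real interface is not reproduced here.

```python
SHM_SIZE = 64 * 1024  # hypothetical shared memory request in bytes


class FakeLib:
    """Hypothetical stand-in for the compiled library, tracking per-device init."""

    def __init__(self):
        self.current_device = 0
        self.initialised = set()

    def MD_build_j_init(self, shm_size):
        # Assumed convention: returns 0 on success, non-zero on error.
        self.initialised.add(self.current_device)
        return 0

    def MD_build_j(self):
        # The kernel only succeeds on devices that received the opt-in.
        return 0 if self.current_device in self.initialised else 1


lib = FakeLib()


def proc(device_id):
    """Per-device worker: opt in first, check the error code, then launch."""
    lib.current_device = device_id  # stands in for selecting the GPU
    err = lib.MD_build_j_init(SHM_SIZE)
    if err != 0:
        raise RuntimeError(f"MD_build_j_init failed on device {device_id}")
    err = lib.MD_build_j()
    if err != 0:
        raise RuntimeError(f"MD_build_j kernel failed on device {device_id}")
    return device_id


# A sequential loop in place of the real multi-GPU dispatcher.
results = [proc(i) for i in range(4)]
print(results)
```

Because the init call now runs inside `proc`, every device that executes work has also made the opt-in, at the cost of repeating a cheap call per invocation.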

I verified it on four A100s: the SCF, gradient, Hessian, and periodic test suites that failed before now pass (j_engine 6, jk 12, grad 21, hessian 7, PBC j_engine+jk 36, PBC scf/stress 27), and single-GPU runs are unchanged.

This fixes #623.

tvogels · 2026-06-29T06:33:41Z

@sunqm I adapted the PR based on a failing test. Please run the CI again.

tvogels force-pushed the fix-initialization branch from b22fa00 to bbea591 on June 29, 2026 06:29

sunqm approved these changes Jun 29, 2026

sunqm merged commit 8c86782 into pyscf:master Jun 29, 2026
4 checks passed
