Feature: DeltaSpin for LCAO and PW base and DFTU for PW, both collinear and noncollinear spin (code change and UTs)#7384
Conversation
…ing develop) Direct sync of source/ directory from feat/dftu-pw-port-v2 which has been merged with origin/develop. 101 files changed, +10659/-1262 lines.
Reverted infrastructure files to origin/develop state to clean up the PR. Files reverted: - source/source_cell/read_atoms_helper.cpp - source/source_base/parallel_global.cpp - source/source_base/timer.cpp - source/source_base/kernels/cuda/math_kernel_op.cu - source/source_base/module_device/device_check.h - source/source_base/module_container/base/macros/cuda.h - source/source_base/module_container/base/third_party/cusolver.h - source/source_esolver/lcao_others.cpp - source/source_hsolver/kernels/cuda/diag_cusolvermp.cu - source/source_io/module_unk/berryphase.cpp - source/source_lcao/module_rt/solve_propagation.cpp - source/source_main/main.cpp - source/source_lcao/module_ri/RPA_LRI.hpp Kept: - DFT+U/DeltaSpin core functionality (esolver, kernels with npol, dftu modules) - Input parameters - Build/Test system updates
Restored the following files from port-v2 as they contain essential functional changes and bug fixes: - source/source_esolver/lcao_others.cpp: Passes sc_direction_only parameter to init_deltaspin_lcao. - source/source_base/parallel_global.cpp: Fixes MPI_Comm_free logic to avoid freeing MPI_COMM_WORLD. - source/source_main/main.cpp: Reorders fftw_cleanup_threads before MPI_Finalize to avoid segfault. - source/source_base/kernels/cuda/math_kernel_op.cu: Adds cublas_handle null check. Other infrastructure files (timer.cpp, device_check.h, berryphase.cpp, etc.) remain reverted to develop.
Restoring all code-related changes as previous partial revert caused compilation errors in CI.
Reverted 15 files with modifications that are completely unrelated to the DFT+U and DeltaSpin functionality: - Pure comment/formatting changes: timer.cpp, diago_david.cpp, esolver_sdft_pw.cpp, hsolver_lcao.cpp, berryphase.cpp, RPA_LRI.hpp, solve_propagation.cpp, diag_cusolvermp.cu, read_sep_test.cpp, cusolver.h (3 comment-only lines restored, CUDA version guards kept) - Unrelated code refactor: elecstate_pw.cpp (vkb.nc -> vkbnc rename) - Unused variable: forces.cpp (forcepaw declaration) - Unrelated test: nscf_utils_test.cpp (deleted) and its CMakeLists entry - operator_lcao.cpp: restored original comments and removed extra blank lines
dzzz2001
left a comment
There was a problem hiding this comment.
Detailed inline review covering blockers and should-fix items, organized by sub-area (DeltaSpin core, DFTU core, PW kernels + onsite, infra/wiring, tests). Each inline comment is anchored to the most representative diff line. Happy to discuss any of these.
| } | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
[Blocker] The CPU onsite_ps_op DFT+U overload in kernels/onsite_op.cpp gained an if(npol == 2) {…} else {…} branch (npol==1 path), but the matching CUDA kernel — the second __global__ void onsite_op(…) defined in this file (around line 48 of the new file, the variant that takes vu_iat, vu_begin_iat, orb_l_iat, ip_m) — was not patched. Only the DeltaSpin variant above (this hunk) gained the npol==1 else branch.
When PARAM.inp.device == "gpu" for DFT+U with nspin=1 or nspin=2 (the headline feature this PR is adding for PW), the DFT+U GPU kernel still hardcodes
ps[psind] += vu_iat[index_mm] * becp[becpind] + vu_iat[index_mm + tlp1_2*2] * becp[becpind + tnp];
ps[psind + 1] += vu_iat[index_mm + tlp1_2*1] * becp[…] + vu_iat[index_mm + tlp1_2*3] * becp[…];For npol==1 this reads past becp (into the next atom/projector), writes ps[psind+1] (into the next ip's row), and indexes off-diagonal vu entries that don't exist (vu for nspin=1/2 is sized tlp1^2, not tlp1^2 * 4). The same applies to the ROCm mirror in kernels/rocm/onsite_op.hip.cu.
| const int& nbands, | ||
| const int& ik, | ||
| const int& nkb, | ||
| const int& npol, |
There was a problem hiding this comment.
[Blocker] Only the function signature gained const int& npol — the device kernels in this file (cal_force_npw_op / cal_force_nl_op at lines 225, 267, 339, 405 of the new file) still hardcode ib * 2, (ib2+1) * nkb, and ipol * nbands * 2 * nkb + …. When invoked with npol == 1 (collinear DFT+U / DeltaSpin force in PW), these will read OOB on dbecp/becp and add the spin-down contribution to a non-existent row.
The CPU side (kernels/force_op.cpp) was correctly split into if(npol==2) { … } else if (npol==1) { … } branches. The GPU kernels need the same. The ROCm mirror (kernels/rocm/force_op.hip.cu, not in this diff) doesn't appear to have the new npol parameter in its signature at all, which will cause a link failure on ROCm builds.
| const int& ntype, | ||
| const int& wg_nc, | ||
| const int& ik, | ||
| const int& npol, |
There was a problem hiding this comment.
[Blocker] Same issue as cuda/force_op.cu: only the signature gained const int& npol; the kernels at lines 317, 946, 1008 of the new file still hardcode ib * 2 and (ib2+1) * nkb. For npol==1 (collinear PW DFT+U/DeltaSpin stress) this will OOB-read becp/dbecp and double-count. Same for kernels/rocm/stress_op.hip.cu.
| static inline void trtri(cusolverDnHandle_t& cusolver_handle, const char& uplo, const char& diag, const int& n, float* A, const int& lda) | ||
| { | ||
| int lwork = 0; | ||
| CHECK_CUSOLVER(cusolverDnStrtri_bufferSize(cusolver_handle, cublas_fill_mode(uplo), cublas_diag_type(diag), n, A, lda, &lwork)); |
There was a problem hiding this comment.
[Blocker] The #if CUDA_VERSION < 11000 fallback calls cusolverDnStrtri_bufferSize / cusolverDnStrtri (and D/C/Z variants below). These symbols do not exist in any cuSOLVER release — only the generic cusolverDnXtrtri (CUDA 11.0+, used by the #if … >= 11000 branch above) and the batched cublas{S,D,C,Z}trtri_batched exist. Any build with CUDA < 11 will fail to link with undefined references to cusolverDn{S,D,C,Z}trtri.
Options:
- Drop the fallback entirely and require CUDA ≥ 11 — consistent with
module_container/base/macros/cuda.h:25-31which already guardsGetTypeCuda<int64_t>onCUDA_VERSION >= 11000. - Route the legacy branch to
cublas{S,D,C,Z}trtri_batched(withbatchCount=1).
| @@ -398,25 +411,43 @@ struct cal_force_nl_op<FPTYPE, base_device::DEVICE_CPU> | |||
| const std::complex<FPTYPE> coefficients3(-1 * lambda[iat*3+2], 0.0); | |||
There was a problem hiding this comment.
[Blocker] This npol==1 branch:
local_force[ipol] -= fac * lambda[iat*3+2] * dbb;omits the spin sign that is correctly applied in the Hamiltonian application at op_pw_proj.cpp:1448-1456:
int spin_sign = 1;
if (PARAM.inp.nspin == 2) {
spin_sign = (this->isk[this->ik] == 0) ? 1 : -1;
}
tmp_lambda_coeff[iat] = std::complex<double>(lambda[iat][2] * spin_sign, 0.0);Without the sign, the force from spin-down k-points (isk==1) is added with the wrong sign and the force will not equal -dE/dR. Same issue in kernels/stress_op.cpp npol==1 branch.
| CHECK_CUDA(cudaFree(d_info)); | ||
| } | ||
| static inline void trtri(cusolverDnHandle_t& cusolver_handle, const char& uplo, const char& diag, const int& n, double* A, const int& lda) | ||
| { |
There was a problem hiding this comment.
[Should-fix] Even after fixing the non-existent symbols flagged above, the new functions allocate d_work / d_info via cudaMalloc and free them at the end. There's no RAII guard, so a CHECK_CUSOLVER failure between the alloc and free leaks both. Consider a small scope guard, or check the status code explicitly and free before throwing.
| { | ||
| syncmem_complex_d2h_op()(h_becp, this->becp, this->size_becp); | ||
| } | ||
| this->becp_ready_ = true; |
There was a problem hiding this comment.
[Should-fix] The new cache (becp_ready_ = true; ik_becp_ = ik_; here and bool is_becp_ready(int ik) const { return becp_ready_ && ik_becp_ == ik; } in the header) is set/invalidated but never read anywhere I can see. force_onsite / stress_onsite still recompute becp unconditionally. Either wire the consumer to skip the recomputation (cheapest perf win) or remove the dead state.
| @@ -1,7 +1,8 @@ | |||
| #ifndef W_ABACUS_DEVELOP_ABACUS_DEVELOP_SOURCE_source_pw_HAMILT_PWDFT_KERNELS_FORCE_OP_H | |||
| #define W_ABACUS_DEVELOP_ABACUS_DEVELOP_SOURCE_source_pw_HAMILT_PWDFT_KERNELS_FORCE_OP_H | |||
| #include "source_io/module_parameter/parameter.h" | |||
There was a problem hiding this comment.
[Should-fix] Pulling source_io/module_parameter/parameter.h into a kernel-op header means every TU that includes this header now transitively includes the whole parameter parser — meaningful compile-time bloat. The parameter values needed in the kernel implementation can be passed as function arguments (as npol already is). Please move the include into .cpp files only, or use a forward-declared opaque accessor.
| @@ -1,7 +1,7 @@ | |||
| #ifndef SRC_PW_STRESS_MULTI_DEVICE_H | |||
| #define SRC_PW_STRESS_MULTI_DEVICE_H | |||
| #include "source_io/module_parameter/parameter.h" | |||
There was a problem hiding this comment.
[Should-fix] Same as force_op.h above: please move #include "source_io/module_parameter/parameter.h" out of this header. Header pulls in the full parameter parser; kernel-op headers should stay light.
|
|
||
| #include "elecstate.h" | ||
| #include "source_estate/module_dm/density_matrix.h" | ||
| #include "source_basis/module_ao/parallel_orbitals.h" |
There was a problem hiding this comment.
[Should-fix] Two new includes here (parallel_orbitals.h, klist.h) but I don't see any new use of those types in the header itself — only the .cpp. Please move them into the implementation file to avoid compile-time bloat.
Reminder
Linked Issue
Fix #...
Unit Tests and/or Case Tests for my changes
What's changed?
Any changes of core modules? (ignore if not applicable)