Feature: DeltaSpin for LCAO and PW base and DFTU for PW, both collinear and noncollinear spin (code change and UTs) by dyzheng · Pull Request #7384 · deepmodeling/abacus-develop

dyzheng · 2026-05-26T09:42:20Z

Reminder

Have you linked an issue with this pull request?
Have you added adequate unit tests and/or case tests for your pull request?
Have you noticed possible changes of behavior below or in the linked issue?
Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

Fix #...

Unit Tests and/or Case Tests for my changes

A unit test is added for each new feature or bug fix.

What's changed?

Example: My changes might affect the performance of the application under certain conditions, and I have tested the impact on various scenarios...

Any changes of core modules? (ignore if not applicable)

Example: I have added a new virtual function in the esolver base class in order to ...

…ing develop) Direct sync of source/ directory from feat/dftu-pw-port-v2 which has been merged with origin/develop. 101 files changed, +10659/-1262 lines.

Reverted infrastructure files to origin/develop state to clean up the PR. Files reverted: - source/source_cell/read_atoms_helper.cpp - source/source_base/parallel_global.cpp - source/source_base/timer.cpp - source/source_base/kernels/cuda/math_kernel_op.cu - source/source_base/module_device/device_check.h - source/source_base/module_container/base/macros/cuda.h - source/source_base/module_container/base/third_party/cusolver.h - source/source_esolver/lcao_others.cpp - source/source_hsolver/kernels/cuda/diag_cusolvermp.cu - source/source_io/module_unk/berryphase.cpp - source/source_lcao/module_rt/solve_propagation.cpp - source/source_main/main.cpp - source/source_lcao/module_ri/RPA_LRI.hpp Kept: - DFT+U/DeltaSpin core functionality (esolver, kernels with npol, dftu modules) - Input parameters - Build/Test system updates

Restored the following files from port-v2 as they contain essential functional changes and bug fixes: - source/source_esolver/lcao_others.cpp: Passes sc_direction_only parameter to init_deltaspin_lcao. - source/source_base/parallel_global.cpp: Fixes MPI_Comm_free logic to avoid freeing MPI_COMM_WORLD. - source/source_main/main.cpp: Reorders fftw_cleanup_threads before MPI_Finalize to avoid segfault. - source/source_base/kernels/cuda/math_kernel_op.cu: Adds cublas_handle null check. Other infrastructure files (timer.cpp, device_check.h, berryphase.cpp, etc.) remain reverted to develop.

Restoring all code-related changes as previous partial revert caused compilation errors in CI.

Reverted 15 files with modifications that are completely unrelated to the DFT+U and DeltaSpin functionality: - Pure comment/formatting changes: timer.cpp, diago_david.cpp, esolver_sdft_pw.cpp, hsolver_lcao.cpp, berryphase.cpp, RPA_LRI.hpp, solve_propagation.cpp, diag_cusolvermp.cu, read_sep_test.cpp, cusolver.h (3 comment-only lines restored, CUDA version guards kept) - Unrelated code refactor: elecstate_pw.cpp (vkb.nc -> vkbnc rename) - Unused variable: forces.cpp (forcepaw declaration) - Unrelated test: nscf_utils_test.cpp (deleted) and its CMakeLists entry - operator_lcao.cpp: restored original comments and removed extra blank lines

dzzz2001

Detailed inline review covering blockers and should-fix items, organized by sub-area (DeltaSpin core, DFTU core, PW kernels + onsite, infra/wiring, tests). Each inline comment is anchored to the most representative diff line. Happy to discuss any of these.

dzzz2001 · 2026-05-28T09:12:02Z

+        }
    }
 }



[Blocker] The CPU onsite_ps_op DFT+U overload in kernels/onsite_op.cpp gained an if(npol == 2) {…} else {…} branch (npol==1 path), but the matching CUDA kernel — the second __global__ void onsite_op(…) defined in this file (around line 48 of the new file, the variant that takes vu_iat, vu_begin_iat, orb_l_iat, ip_m) — was not patched. Only the DeltaSpin variant above (this hunk) gained the npol==1 else branch.

When PARAM.inp.device == "gpu" for DFT+U with nspin=1 or nspin=2 (the headline feature this PR is adding for PW), the DFT+U GPU kernel still hardcodes

ps[psind] += vu_iat[index_mm] * becp[becpind] + vu_iat[index_mm + tlp1_2*2] * becp[becpind + tnp]; ps[psind + 1] += vu_iat[index_mm + tlp1_2*1] * becp[…] + vu_iat[index_mm + tlp1_2*3] * becp[…];

For npol==1 this reads past becp (into the next atom/projector), writes ps[psind+1] (into the next ip's row), and indexes off-diagonal vu entries that don't exist (vu for nspin=1/2 is sized tlp1^2, not tlp1^2 * 4). The same applies to the ROCm mirror in kernels/rocm/onsite_op.hip.cu.

dzzz2001 · 2026-05-28T09:12:02Z

                                                                  const int& nbands,
                                                                  const int& ik,
                                                                  const int& nkb,
+                                                                  const int& npol,


[Blocker] Only the function signature gained const int& npol — the device kernels in this file (cal_force_npw_op / cal_force_nl_op at lines 225, 267, 339, 405 of the new file) still hardcode ib * 2, (ib2+1) * nkb, and ipol * nbands * 2 * nkb + …. When invoked with npol == 1 (collinear DFT+U / DeltaSpin force in PW), these will read OOB on dbecp/becp and add the spin-down contribution to a non-existent row.

The CPU side (kernels/force_op.cpp) was correctly split into if(npol==2) { … } else if (npol==1) { … } branches. The GPU kernels need the same. The ROCm mirror (kernels/rocm/force_op.hip.cu, not in this diff) doesn't appear to have the new npol parameter in its signature at all, which will cause a link failure on ROCm builds.

dzzz2001 · 2026-05-28T09:12:02Z

                    const int& ntype,
                    const int& wg_nc,
                    const int& ik,
+                    const int& npol,


[Blocker] Same issue as cuda/force_op.cu: only the signature gained const int& npol; the kernels at lines 317, 946, 1008 of the new file still hardcode ib * 2 and (ib2+1) * nkb. For npol==1 (collinear PW DFT+U/DeltaSpin stress) this will OOB-read becp/dbecp and double-count. Same for kernels/rocm/stress_op.hip.cu.

dzzz2001 · 2026-05-28T09:12:02Z

+static inline void trtri(cusolverDnHandle_t& cusolver_handle, const char& uplo, const char& diag, const int& n, float* A, const int& lda)
+{
+    int lwork = 0;
+    CHECK_CUSOLVER(cusolverDnStrtri_bufferSize(cusolver_handle, cublas_fill_mode(uplo), cublas_diag_type(diag), n, A, lda, &lwork));


[Blocker] The #if CUDA_VERSION < 11000 fallback calls cusolverDnStrtri_bufferSize / cusolverDnStrtri (and D/C/Z variants below). These symbols do not exist in any cuSOLVER release — only the generic cusolverDnXtrtri (CUDA 11.0+, used by the #if … >= 11000 branch above) and the batched cublas{S,D,C,Z}trtri_batched exist. Any build with CUDA < 11 will fail to link with undefined references to cusolverDn{S,D,C,Z}trtri.

Options:

Drop the fallback entirely and require CUDA ≥ 11 — consistent with module_container/base/macros/cuda.h:25-31 which already guards GetTypeCuda<int64_t> on CUDA_VERSION >= 11000.

Route the legacy branch to cublas{S,D,C,Z}trtri_batched (with batchCount=1).

dzzz2001 · 2026-05-28T09:12:02Z

@@ -398,25 +411,43 @@ struct cal_force_nl_op<FPTYPE, base_device::DEVICE_CPU>
                const std::complex<FPTYPE> coefficients3(-1 * lambda[iat*3+2], 0.0);


[Blocker] This npol==1 branch:

local_force[ipol] -= fac * lambda[iat*3+2] * dbb;

omits the spin sign that is correctly applied in the Hamiltonian application at op_pw_proj.cpp:1448-1456:

int spin_sign = 1; if (PARAM.inp.nspin == 2) { spin_sign = (this->isk[this->ik] == 0) ? 1 : -1; } tmp_lambda_coeff[iat] = std::complex<double>(lambda[iat][2] * spin_sign, 0.0);

Without the sign, the force from spin-down k-points (isk==1) is added with the wrong sign and the force will not equal -dE/dR. Same issue in kernels/stress_op.cpp npol==1 branch.

dzzz2001 · 2026-05-28T09:12:03Z

+    CHECK_CUDA(cudaFree(d_info));
+}
+static inline void trtri(cusolverDnHandle_t& cusolver_handle, const char& uplo, const char& diag, const int& n, double* A, const int& lda)
+{


[Should-fix] Even after fixing the non-existent symbols flagged above, the new functions allocate d_work / d_info via cudaMalloc and free them at the end. There's no RAII guard, so a CHECK_CUSOLVER failure between the alloc and free leaks both. Consider a small scope guard, or check the status code explicitly and free before throwing.

dzzz2001 · 2026-05-28T09:12:03Z

    {
        syncmem_complex_d2h_op()(h_becp, this->becp, this->size_becp);
    }
+    this->becp_ready_ = true;


[Should-fix] The new cache (becp_ready_ = true; ik_becp_ = ik_; here and bool is_becp_ready(int ik) const { return becp_ready_ && ik_becp_ == ik; } in the header) is set/invalidated but never read anywhere I can see. force_onsite / stress_onsite still recompute becp unconditionally. Either wire the consumer to skip the recomputation (cheapest perf win) or remove the dead state.

dzzz2001 · 2026-05-28T09:12:03Z

@@ -1,7 +1,8 @@
 #ifndef W_ABACUS_DEVELOP_ABACUS_DEVELOP_SOURCE_source_pw_HAMILT_PWDFT_KERNELS_FORCE_OP_H
 #define W_ABACUS_DEVELOP_ABACUS_DEVELOP_SOURCE_source_pw_HAMILT_PWDFT_KERNELS_FORCE_OP_H
+#include "source_io/module_parameter/parameter.h"


[Should-fix] Pulling source_io/module_parameter/parameter.h into a kernel-op header means every TU that includes this header now transitively includes the whole parameter parser — meaningful compile-time bloat. The parameter values needed in the kernel implementation can be passed as function arguments (as npol already is). Please move the include into .cpp files only, or use a forward-declared opaque accessor.

dzzz2001 · 2026-05-28T09:12:03Z

@@ -1,7 +1,7 @@
 #ifndef SRC_PW_STRESS_MULTI_DEVICE_H
 #define SRC_PW_STRESS_MULTI_DEVICE_H
+#include "source_io/module_parameter/parameter.h"


[Should-fix] Same as force_op.h above: please move #include "source_io/module_parameter/parameter.h" out of this header. Header pulls in the full parameter parser; kernel-op headers should stay light.

dzzz2001 · 2026-05-28T09:12:03Z


 #include "elecstate.h"
 #include "source_estate/module_dm/density_matrix.h"
+#include "source_basis/module_ao/parallel_orbitals.h"


[Should-fix] Two new includes here (parallel_orbitals.h, klist.h) but I don't see any new use of those types in the header itself — only the .cpp. Please move them into the implementation file to avoid compile-time bloat.

dyzheng and others added 7 commits May 26, 2026 16:02

feat: sync DFT+U/DeltaSpin code from feat/dftu-pw-port-v2 (after merg…

c582c1a

…ing develop) Direct sync of source/ directory from feat/dftu-pw-port-v2 which has been merged with origin/develop. 101 files changed, +10659/-1262 lines.

feat: restore full source/ content from feat/dftu-pw-port-v2

f0a892a

Restoring all code-related changes as previous partial revert caused compilation errors in CI.

test: enable 17_DS_DFTU test suite in CMakeLists

ac27932

Merge branch 'develop' into feat/port-dftu-ds-code-v4

4ef2f30

mohanchen requested a review from dzzz2001 May 28, 2026 08:41

dzzz2001 reviewed May 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: DeltaSpin for LCAO and PW base and DFTU for PW, both collinear and noncollinear spin (code change and UTs)#7384

Feature: DeltaSpin for LCAO and PW base and DFTU for PW, both collinear and noncollinear spin (code change and UTs)#7384
dyzheng wants to merge 7 commits into
deepmodeling:developfrom
dyzheng:feat/port-dftu-ds-code-v4

dyzheng commented May 26, 2026

Uh oh!

dzzz2001 left a comment

Uh oh!

dzzz2001 May 28, 2026

Uh oh!

dzzz2001 May 28, 2026

Uh oh!

dzzz2001 May 28, 2026

Uh oh!

dzzz2001 May 28, 2026

Uh oh!

dzzz2001 May 28, 2026

Uh oh!

dzzz2001 May 28, 2026

Uh oh!

dzzz2001 May 28, 2026

Uh oh!

dzzz2001 May 28, 2026

Uh oh!

dzzz2001 May 28, 2026

Uh oh!

dzzz2001 May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		@@ -398,25 +411,43 @@ struct cal_force_nl_op<FPTYPE, base_device::DEVICE_CPU>
		const std::complex<FPTYPE> coefficients3(-1 * lambda[iat*3+2], 0.0);

Conversation

dyzheng commented May 26, 2026

Reminder

Linked Issue

Unit Tests and/or Case Tests for my changes

What's changed?

Any changes of core modules? (ignore if not applicable)

Uh oh!

dzzz2001 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants