Fixes the ABI breakage introduced by CPU SPMD algorithms in PRs #3507 and #3585. Makes the binding type of the destructors (~compute_input(), ~partial_compute_input(), ~train_input(), etc.) stably "WEAK" and independent of the compiler's logic by adding non-default implementations for them. Also fixes the `set_responses` method in KNN.

(cherry picked from commit 0a5f4cb)
Co-authored-by: Victoriya Fedotova <victoriya.s.fedotova@intel.com>
Pull request overview
Adds CPU-side distributed (SPMD) PCA training support (covariance method) and enables the existing SPMD PCA integration test on CPU, alongside new MPI/oneCCL sample programs demonstrating the feature.
Changes:
- Enable PCA SPMD integration testing on CPU by removing the CPU skip in the SPMD test.
- Route PCA CPU training dispatch through a universal (single-node + SPMD) kernel path.
- Implement CPU SPMD covariance-path aggregation (allreduce of partials) and add MPI/CCL PCA distributed samples.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| samples/oneapi/cpp/mpi/sources/pca_distr_mpi.cpp | New MPI sample demonstrating distributed PCA training on CPU. |
| samples/oneapi/cpp/ccl/sources/pca_distr_ccl.cpp | New oneCCL sample demonstrating distributed PCA training on CPU. |
| cpp/oneapi/dal/algo/pca/test/spmd.cpp | Enables running PCA SPMD integration test on CPU. |
| cpp/oneapi/dal/algo/pca/detail/train_ops.cpp | Switches PCA CPU dispatch to a universal SPMD-capable kernel dispatcher. |
| cpp/oneapi/dal/algo/pca/backend/cpu/train_kernel_cov.cpp | Adds CPU SPMD aggregation logic for covariance-based PCA training. |
| cpp/oneapi/dal/algo/pca/backend/cpu/finalize_train_kernel_cov.cpp | Adjusts covariance/correlation finalize behavior and variance extraction logic. |
```cpp
/// Use a local (non-SPMD) context so the finalize kernel does NOT
/// dispatch to the SPMD path — the data is already aggregated.
const context_cpu local_ctx;
return pca::backend::finalize_train_kernel_cpu<Float, method::cov, task_t>{}(local_ctx,
```
`context_cpu local_ctx;` drops the CPU-extension configuration carried by the incoming `ctx` (e.g., from `spmd_host_policy::get_local()`), so CPU dispatch in `finalize_train_kernel_cpu` may use different ISA settings than the rest of the training call. Consider calling the finalize kernel with the original `ctx` (it does not need additional collectives), or otherwise ensure the same enabled CPU extensions are preserved when creating a local context.
Suggested change:

```diff
-/// Use a local (non-SPMD) context so the finalize kernel does NOT
-/// dispatch to the SPMD path — the data is already aggregated.
-const context_cpu local_ctx;
-return pca::backend::finalize_train_kernel_cpu<Float, method::cov, task_t>{}(local_ctx,
+/// Finalize on the already aggregated local data while preserving
+/// the CPU-extension configuration carried by the original context.
+return pca::backend::finalize_train_kernel_cpu<Float, method::cov, task_t>{}(ctx,
```
```cpp
const auto cp =
    row_accessor<const Float>(input.get_partial_crossproduct()).pull({ 0, -1 });
const Float inv_nm1 = Float(1) / (row_count - 1);
Float* vars = arr_vars.get_mutable_data();
```
In the zscore branch, `inv_nm1` is computed as `1 / (row_count - 1)`. For `row_count == 1` this will divide by zero and propagate inf/NaN into the variances. Please add an explicit guard (e.g., require `row_count > 1` for z-score normalization, or define the intended behavior for the single-row case and handle it here).
```cpp
const auto pca_desc = pca::descriptor<float>{};
// ...
const auto result = dal::preview::train(comm, pca_desc, input_vec[rank_id]);
```
`input_vec[rank_id]` uses unchecked indexing; other distributed samples in this directory use `.at(rank_id)` for bounds-checked access. Using `.at()` here would keep behavior consistent and avoid undefined behavior if `rank_id` is ever out of range due to unexpected communicator state.
Suggested change:

```diff
-const auto result = dal::preview::train(comm, pca_desc, input_vec[rank_id]);
+const auto result = dal::preview::train(comm, pca_desc, input_vec.at(rank_id));
```
```cpp
const auto pca_desc = pca::descriptor<float>{};
// ...
const auto result = dal::preview::train(comm, pca_desc, input_vec[rank_id]);
```
`input_vec[rank_id]` uses unchecked indexing; other distributed samples in this directory use `.at(rank_id)` for bounds-checked access. Using `.at()` here would keep behavior consistent and avoid undefined behavior if `rank_id` is ever out of range due to unexpected communicator state.
Suggested change:

```diff
-const auto result = dal::preview::train(comm, pca_desc, input_vec[rank_id]);
+const auto result = dal::preview::train(comm, pca_desc, input_vec.at(rank_id));
```
Description
This PR adds distributed (SPMD) training support for PCA on CPU using the covariance method. Previously, the PCA SPMD path was only available on GPU (via SYCL/DPC++); this change enables it for CPU-only environments using MPI or oneCCL communicators.
Checklist:
- Completeness and readability
- Testing
- Performance