ENH: PCA CPU SPMD support #3585

Open

DDJHB wants to merge 4 commits into uxlfoundation:main from DDJHB:dev_pca_cpu_spmd

Conversation

DDJHB (Contributor) commented Mar 29, 2026

Description

This PR adds distributed (SPMD) training support for PCA on CPU using the covariance method. Previously, the PCA SPMD path was only available on GPU (via SYCL/DPC++); this change enables it for CPU-only environments using MPI or oneCCL communicators.


Checklist:

Completeness and readability

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with updates and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended testing suite if new functionality was introduced in this PR.

Performance

  • I have measured performance for affected algorithms using scikit-learn_bench and provided at least a summary table with measured data, if performance change is expected.
  • I have provided justification why performance and/or quality metrics have changed or why changes are not expected.
  • I have extended the benchmarking suite and provided a corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

DDJHB (Contributor, Author) commented Mar 29, 2026

/intelci: run

Vika-F added a commit that referenced this pull request Apr 9, 2026
Fixes the ABI breakage introduced by CPU SPMD algorithms in the PRs #3507, #3585
Make the binding type of the destructors (~compute_input(), ~partial_compute_input(), ~train_input(), etc.) stably "WEAK" and not dependent on the compiler's logic by adding their non-default implementations.

Also fixes `set_responses` method in KNN.
mergify bot pushed a commit that referenced this pull request Apr 9, 2026
Fixes the ABI breakage introduced by CPU SPMD algorithms in the PRs #3507, #3585
Make the binding type of the destructors (~compute_input(), ~partial_compute_input(), ~train_input(), etc.) stably "WEAK" and not dependent on the compiler's logic by adding their non-default implementations.

Also fixes `set_responses` method in KNN.

(cherry picked from commit 0a5f4cb)
maria-Petrova pushed a commit that referenced this pull request Apr 9, 2026
Fixes the ABI breakage introduced by CPU SPMD algorithms in the PRs #3507, #3585
Make the binding type of the destructors (~compute_input(), ~partial_compute_input(), ~train_input(), etc.) stably "WEAK" and not dependent on the compiler's logic by adding their non-default implementations.

Also fixes `set_responses` method in KNN.

(cherry picked from commit 0a5f4cb)

Co-authored-by: Victoriya Fedotova <victoriya.s.fedotova@intel.com>
Vika-F (Contributor) commented Apr 9, 2026

/intelci: run

@Vika-F Vika-F marked this pull request as ready for review April 10, 2026 12:33
Copilot AI review requested due to automatic review settings April 10, 2026 12:33
Copilot AI (Contributor) left a comment

Pull request overview

Adds CPU-side distributed (SPMD) PCA training support (covariance method) and enables the existing SPMD PCA integration test on CPU, alongside new MPI/oneCCL sample programs demonstrating the feature.

Changes:

  • Enable PCA SPMD integration testing on CPU by removing the CPU skip in the SPMD test.
  • Route PCA CPU training dispatch through a universal (single-node + SPMD) kernel path.
  • Implement CPU SPMD covariance-path aggregation (allreduce of partials) and add MPI/CCL PCA distributed samples.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Summary per file:

  • samples/oneapi/cpp/mpi/sources/pca_distr_mpi.cpp — New MPI sample demonstrating distributed PCA training on CPU.
  • samples/oneapi/cpp/ccl/sources/pca_distr_ccl.cpp — New oneCCL sample demonstrating distributed PCA training on CPU.
  • cpp/oneapi/dal/algo/pca/test/spmd.cpp — Enables running the PCA SPMD integration test on CPU.
  • cpp/oneapi/dal/algo/pca/detail/train_ops.cpp — Switches PCA CPU dispatch to a universal SPMD-capable kernel dispatcher.
  • cpp/oneapi/dal/algo/pca/backend/cpu/train_kernel_cov.cpp — Adds CPU SPMD aggregation logic for covariance-based PCA training.
  • cpp/oneapi/dal/algo/pca/backend/cpu/finalize_train_kernel_cov.cpp — Adjusts covariance/correlation finalize behavior and variance extraction logic.

Comment on lines +231 to +234
/// Use a local (non-SPMD) context so the finalize kernel does NOT
/// dispatch to the SPMD path — the data is already aggregated.
const context_cpu local_ctx;
return pca::backend::finalize_train_kernel_cpu<Float, method::cov, task_t>{}(local_ctx,

Copilot AI Apr 10, 2026


context_cpu local_ctx; drops the CPU-extension configuration carried by the incoming ctx (e.g., from spmd_host_policy::get_local()), so CPU dispatch in finalize_train_kernel_cpu may use different ISA settings than the rest of the training call. Consider calling the finalize kernel with the original ctx (it does not need additional collectives), or otherwise ensure the same enabled CPU extensions are preserved when creating a local context.

Suggested change
/// Use a local (non-SPMD) context so the finalize kernel does NOT
/// dispatch to the SPMD path — the data is already aggregated.
const context_cpu local_ctx;
return pca::backend::finalize_train_kernel_cpu<Float, method::cov, task_t>{}(local_ctx,
/// Finalize on the already aggregated local data while preserving
/// the CPU-extension configuration carried by the original context.
return pca::backend::finalize_train_kernel_cpu<Float, method::cov, task_t>{}(ctx,

Comment on lines +124 to +127
const auto cp =
row_accessor<const Float>(input.get_partial_crossproduct()).pull({ 0, -1 });
const Float inv_nm1 = Float(1) / (row_count - 1);
Float* vars = arr_vars.get_mutable_data();

Copilot AI Apr 10, 2026


In the zscore branch, inv_nm1 is computed as 1 / (row_count - 1). For row_count == 1 this will divide by zero and propagate inf/NaN into variances. Please add an explicit guard (e.g., require row_count > 1 for z-score normalization, or define the intended behavior for the single-row case and handle it here).


const auto pca_desc = pca::descriptor<float>{};

const auto result = dal::preview::train(comm, pca_desc, input_vec[rank_id]);

Copilot AI Apr 10, 2026


input_vec[rank_id] uses unchecked indexing; other distributed samples in this directory use .at(rank_id) for bounds-checked access. Using .at() here would keep behavior consistent and avoid undefined behavior if rank_id is ever out of range due to unexpected communicator state.

Suggested change
const auto result = dal::preview::train(comm, pca_desc, input_vec[rank_id]);
const auto result = dal::preview::train(comm, pca_desc, input_vec.at(rank_id));


const auto pca_desc = pca::descriptor<float>{};

const auto result = dal::preview::train(comm, pca_desc, input_vec[rank_id]);

Copilot AI Apr 10, 2026


input_vec[rank_id] uses unchecked indexing; other distributed samples in this directory use .at(rank_id) for bounds-checked access. Using .at() here would keep behavior consistent and avoid undefined behavior if rank_id is ever out of range due to unexpected communicator state.

Suggested change
const auto result = dal::preview::train(comm, pca_desc, input_vec[rank_id]);
const auto result = dal::preview::train(comm, pca_desc, input_vec.at(rank_id));
