perf(retrieval): fix NDCG GPU performance regression by replacing torch.unique by rclough · Pull Request #3350 · Lightning-AI/torchmetrics

rclough · 2026-03-31T17:48:46Z

Summary

Fixes #2287: nDCG was running up to 2.65x slower on GPU than on CPU because torch.unique is ~15x slower on GPU than on CPU.

The fix replaces the torch.unique-based tie-averaging in _tie_average_dcg with a vectorized argsort → diff → scatter_add_ approach that is efficient on both CPU and GPU.
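As a rough illustration of the argsort → diff → scatter_add_ idea for a single 1-D query, here is a minimal sketch. It is not the code from this PR: the function name `tie_averaged_dcg_sketch` is hypothetical, and the log2 discount and float64 accumulation are assumptions chosen to mirror the usual nDCG definition.

```python
import torch


def tie_averaged_dcg_sketch(preds: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Tie-averaged DCG for one query (1-D inputs) without torch.unique.

    Hypothetical helper for illustration only, not the torchmetrics implementation.
    """
    n = preds.numel()
    device = preds.device

    # Sort by predicted score, highest first, and reorder the gains to match.
    order = torch.argsort(preds, descending=True)
    sorted_preds = preds[order]
    gains = target[order].to(torch.float64)

    # Assumed discount: 1 / log2(rank + 1) for ranks 1..n.
    ranks = torch.arange(1, n + 1, dtype=torch.float64, device=device)
    discounts = 1.0 / torch.log2(ranks + 1.0)

    # A new tie group starts wherever the sorted score changes.
    is_new_group = torch.ones_like(sorted_preds, dtype=torch.bool)
    is_new_group[1:] = torch.diff(sorted_preds) != 0
    group_ids = torch.cumsum(is_new_group, dim=0) - 1  # int64, values in [0, n)

    # Buffers of size n (the maximum possible number of groups), so no
    # device-to-host sync is needed to learn the actual group count.
    zeros = torch.zeros(n, dtype=torch.float64, device=device)
    gain_sum = zeros.clone().scatter_add_(0, group_ids, gains)
    discount_sum = zeros.clone().scatter_add_(0, group_ids, discounts)
    counts = zeros.clone().scatter_add_(0, group_ids, torch.ones_like(gains))

    # Each group contributes (mean gain) * (sum of its discounts).
    # Unused slots have zero sums; clamp avoids dividing by zero there.
    mean_gain = gain_sum / counts.clamp(min=1)
    return (mean_gain * discount_sum).sum()


# Two items tie at 0.5, so they share the average of their gains.
print(tie_averaged_dcg_sketch(torch.tensor([0.5, 0.5, 0.2]), torch.tensor([1, 0, 1])))  # roughly 1.3155
```

Sizing the buffers to the input length trades a little memory for not having to read the group count back from the device.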

Changes:

- _tie_average_dcg: replaces torch.unique with diff + scatter_add_ for group detection and accumulation; supports both 1-D (single query) and 2-D (batched) inputs
- _dcg_sample_scores: uses gather instead of fancy indexing so it works correctly for both 1-D and 2-D inputs
- retrieval_normalized_dcg: no API changes, so existing callers are unaffected; batched 2-D input now also works correctly
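To show why gather works for both input shapes where fancy indexing does not, here is a small sketch. The helper name `reorder_by_rank` is hypothetical and is not taken from the PR.

```python
import torch


def reorder_by_rank(target: torch.Tensor, preds: torch.Tensor) -> torch.Tensor:
    """Reorder each query's targets by descending predicted score.

    Hypothetical helper: works for 1-D (single query) and 2-D (batch, length) inputs.
    """
    order = torch.argsort(preds, dim=-1, descending=True)
    # gather picks target[..., order[..., j]] independently for every row.
    # By contrast, target[order] on 2-D input would index along dim 0
    # (whole rows), which is not a per-row reordering.
    return torch.gather(target, dim=-1, index=order)


single = reorder_by_rank(torch.tensor([0, 1, 2]), torch.tensor([0.1, 0.9, 0.5]))
batched = reorder_by_rank(
    torch.tensor([[0, 1, 2], [3, 4, 5]]),
    torch.tensor([[0.1, 0.9, 0.5], [0.7, 0.2, 0.3]]),
)
print(single)   # tensor([1, 2, 0])
print(batched)  # tensor([[1, 2, 0], [3, 5, 4]])
```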

Test plan

- All existing TestNDCG tests and test_corner_case_with_tied_scores pass (correctness preserved)
- New test_batched_input_matches_per_query verifies that batched 2-D input gives the same result as averaging individual 1-D calls, and that each query matches sklearn
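The batched-versus-per-query property can be sketched as below. This is not the test from the PR: `simple_ndcg` is a hypothetical, tie-unaware stand-in so the check has something to run against, and `check_batched_matches_per_query` is likewise an illustrative name. The stand-in assumes every query has at least one relevant item.

```python
import torch


def simple_ndcg(preds: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Tie-unaware nDCG, averaged over queries. Hypothetical stand-in.

    Accepts 1-D (one query) or 2-D (batch, length) inputs and assumes each
    query has at least one non-zero target, so the ideal DCG is never zero.
    """
    preds2d = preds.reshape(-1, preds.shape[-1])
    target2d = target.reshape(-1, target.shape[-1]).to(torch.float64)
    n = preds2d.shape[-1]
    discounts = 1.0 / torch.log2(torch.arange(2, n + 2, dtype=torch.float64, device=preds.device))
    order = torch.argsort(preds2d, dim=-1, descending=True)
    dcg = (torch.gather(target2d, -1, order) * discounts).sum(-1)
    ideal = (torch.sort(target2d, dim=-1, descending=True).values * discounts).sum(-1)
    return (dcg / ideal).mean()


def check_batched_matches_per_query(ndcg_fn, preds, target, atol: float = 1e-6) -> bool:
    """Return True if one batched call equals the mean of per-row 1-D calls."""
    batched = ndcg_fn(preds, target)
    per_query = torch.stack([ndcg_fn(p, t) for p, t in zip(preds, target)]).mean()
    return torch.allclose(batched, per_query, atol=atol)


torch.manual_seed(0)
preds = torch.rand(4, 6)
target = torch.randint(1, 4, (4, 6))  # relevance in {1, 2, 3}, never all zeros
print(check_batched_matches_per_query(simple_ndcg, preds, target))
```

Passing the metric in as `ndcg_fn` means the same check could be pointed at another implementation that takes `(preds, target)`.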

🤖 Generated with Claude Code

📚 Documentation preview 📚: https://torchmetrics--3350.org.readthedocs.build/en/3350/

…n tie averaging

torch.unique is ~15x slower on GPU than CPU, causing nDCG to run up to 2.65x slower on GPU than CPU. Replace the torch.unique-based tie-averaging approach in _tie_average_dcg with a diff + scatter_add_ strategy that is efficient on both CPU and GPU.

The refactored _dcg_sample_scores also uses gather so that both 1-D (single query) and 2-D (batched queries) inputs are handled correctly, making retrieval_normalized_dcg usable with batched inputs directly.

Fixes: Lightning-AI#2287

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

rclough · 2026-03-31T17:51:32Z

Sorry, I'm iterating on this PR with Claude Code. I'm just trying to get the implementation I shared in #2287 months ago into a PR.

…n tie averaging

torch.unique is ~15x slower on GPU than CPU, causing nDCG to run up to 2.65x slower on GPU than CPU. Replace with a diff + scatter_add_ strategy that is efficient on both CPU and GPU.

Key changes to the algorithm (based on the optimized implementation proposed in Lightning-AI#2287):
- _tie_average_dcg: takes pre-sorted inputs, uses diff + scatter_add_ instead of torch.unique; float64 accumulation for numerical parity with sklearn; int32 group counts; valid-group masking before scatter
- _dcg_sample_scores: handles sorting (with topk fast-path when k < L), gather, and discount creation; delegates tie averaging to the above
- retrieval_normalized_dcg: unchanged public API; now correctly handles both 1-D (single query) and 2-D (batched) inputs

Tests added:
- test_accuracy_vs_sklearn: parametrized across 8 (batch, length, top_k) configs, tolerance 1e-4 matching reference implementation parity
- test_batched_input_matches_per_query: 2-D result == mean of 1-D calls
- test_tie_handling_explicit: explicit tie configurations vs sklearn
- test_all_zeros_target: all-irrelevant queries return 0.0, not NaN
- test_perfect_ranking: ideal predictions return nDCG == 1.0
- test_top_k_valid_range: results in [0, 1] for all top_k values

Fixes: Lightning-AI#2287

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
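Two details from this commit message, the topk fast-path when k < L and returning 0.0 rather than NaN for all-irrelevant queries, can be sketched as follows. Both helper names are hypothetical and the code is an assumption about how such logic might look, not the PR's implementation. It also does not address tie groups that straddle the top-k cutoff.

```python
import torch


def top_k_sorted(scores: torch.Tensor, k: int):
    """Return the k highest scores per row (descending) and their indices."""
    n = scores.shape[-1]
    if k < n:
        # Fast path: topk avoids fully sorting the row when only k items matter.
        values, indices = torch.topk(scores, k, dim=-1, largest=True, sorted=True)
    else:
        values, indices = torch.sort(scores, dim=-1, descending=True)
    return values, indices


def safe_ndcg(dcg: torch.Tensor, ideal_dcg: torch.Tensor) -> torch.Tensor:
    """Divide DCG by ideal DCG, returning 0.0 where the ideal DCG is zero."""
    has_relevant = ideal_dcg > 0
    # Substitute 1 in the denominator so 0/0 never occurs, then mask the result.
    safe_ideal = torch.where(has_relevant, ideal_dcg, torch.ones_like(ideal_dcg))
    return torch.where(has_relevant, dcg / safe_ideal, torch.zeros_like(dcg))


values, indices = top_k_sorted(torch.tensor([[0.1, 0.9, 0.5]]), k=2)
print(indices)  # tensor([[1, 2]])
print(safe_ndcg(torch.tensor([0.0, 1.0]), torch.tensor([0.0, 2.0])))  # tensor([0.0000, 0.5000])
```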

rclough requested review from SkafteNicki and justusschock as code owners March 31, 2026 17:48

github-actions Bot added the topic: Retrieval label Mar 31, 2026

rclough marked this pull request as draft March 31, 2026 17:50

rclough force-pushed the fix/ndcg-gpu-performance branch from 447869c to 1ec556c Compare March 31, 2026 17:57

style: fix ruff N806 (uppercase vars) and formatting in ndcg

41c14b2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

rclough force-pushed the fix/ndcg-gpu-performance branch from 270b2c3 to 41c14b2 Compare March 31, 2026 18:14

[pre-commit.ci] auto fixes from pre-commit.com hooks

0cc4c42

for more information, see https://pre-commit.ci

rclough wants to merge 4 commits into Lightning-AI:master from rclough:fix/ndcg-gpu-performance