perf(retrieval): fix NDCG GPU performance regression by replacing torch.unique #3350

Draft
rclough wants to merge 4 commits into Lightning-AI:master from rclough:fix/ndcg-gpu-performance

@rclough rclough commented Mar 31, 2026

Summary

Fixes #2287 — nDCG was running up to 2.65x slower on GPU than CPU because torch.unique is ~15x slower on GPU than CPU.

The fix replaces the torch.unique-based tie-averaging in _tie_average_dcg with a vectorized argsort + diff + scatter_add_ approach that is efficient on both CPU and GPU.

Changes:

  • _tie_average_dcg: replaces torch.unique with diff + scatter_add_ for group detection and accumulation; supports both 1-D (single query) and 2-D (batched) inputs
  • _dcg_sample_scores: uses gather instead of fancy indexing so it works correctly for both 1-D and 2-D inputs
  • retrieval_normalized_dcg: no API changes — existing callers are unaffected; batched 2-D input now also works correctly
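For readers who want to see the idea in isolation, here is a minimal sketch of tie-averaged DCG for a single query that avoids torch.unique. It is not the code in this PR: the function name `tie_averaged_dcg_sketch` is hypothetical, it handles only 1-D input, and it assumes the usual `1 / log2(rank + 2)` discount. Tie groups are found by comparing adjacent sorted scores with `torch.diff`, and per-group sums are accumulated with `scatter_add_`.

```python
import torch


def tie_averaged_dcg_sketch(preds: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """DCG for one query, averaging gains over tied predictions (illustrative only)."""
    n = preds.numel()
    device = preds.device

    # Sort by predicted score, highest first, and reorder the gains to match.
    order = torch.argsort(preds, descending=True)
    sorted_preds = preds[order]
    gains = target[order].to(torch.float64)

    # Standard logarithmic discount for ranks 0..n-1.
    positions = torch.arange(n, dtype=torch.float64, device=device)
    discounts = 1.0 / torch.log2(positions + 2.0)

    # A new tie group starts wherever two adjacent sorted scores differ.
    is_boundary = (torch.diff(sorted_preds) != 0).to(torch.long)
    group_id = torch.zeros(n, dtype=torch.long, device=device)
    group_id[1:] = torch.cumsum(is_boundary, dim=0)

    # Buffers of size n avoid asking the device how many groups exist.
    zeros = torch.zeros(n, dtype=torch.float64, device=device)
    gain_sum = zeros.clone().scatter_add_(0, group_id, gains)
    discount_sum = zeros.clone().scatter_add_(0, group_id, discounts)
    count = zeros.clone().scatter_add_(0, group_id, torch.ones_like(gains))

    # Unused buffer slots have count 0 and are masked out.
    valid = count > 0
    mean_gain = torch.where(valid, gain_sum / count.clamp(min=1.0), zeros)
    return (mean_gain * discount_sum).sum()
```

Sizing the buffers by `n` rather than by the number of groups is a deliberate choice in this sketch: it keeps every step on the device at the cost of some unused slots.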

Test plan

  • All existing TestNDCG tests and test_corner_case_with_tied_scores pass (correctness preserved)
  • New test_batched_input_matches_per_query verifies batched 2-D input gives the same result as averaging individual 1-D calls, and matches sklearn for each query
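The batched-versus-per-query check can be illustrated with a simplified stand-in. The helpers `dcg_sketch` and `ndcg_sketch` below are hypothetical, ignore tie averaging, and return one score per query instead of the batch mean, so this only shows the shape of the check and why `gather` along the last dimension works for both 1-D and 2-D inputs.

```python
import torch


def dcg_sketch(preds: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Plain DCG along the last dimension, no tie averaging (illustrative only)."""
    order = torch.argsort(preds, dim=-1, descending=True)
    # gather picks elements along the last dim for 1-D and 2-D inputs alike,
    # whereas target[order] on a 2-D target would index whole rows.
    ranked = torch.gather(target.to(torch.float64), -1, order)
    positions = torch.arange(preds.shape[-1], dtype=torch.float64, device=preds.device)
    discounts = 1.0 / torch.log2(positions + 2.0)
    return (ranked * discounts).sum(dim=-1)


def ndcg_sketch(preds: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Per-query nDCG; queries with no relevant items score 0.0 rather than NaN."""
    ideal = dcg_sketch(target.to(torch.float64), target)
    actual = dcg_sketch(preds, target)
    return torch.where(ideal > 0, actual / ideal.clamp(min=1e-12), torch.zeros_like(ideal))


torch.manual_seed(0)
preds = torch.rand(4, 6)
target = torch.randint(0, 4, (4, 6))

batched = ndcg_sketch(preds, target)  # one call on the 2-D batch
per_query = torch.stack([ndcg_sketch(p, t) for p, t in zip(preds, target)])  # four 1-D calls
assert torch.allclose(batched, per_query)
```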

🤖 Generated with Claude Code


📚 Documentation preview 📚: https://torchmetrics--3350.org.readthedocs.build/en/3350/

…n tie averaging

torch.unique is ~15x slower on GPU than CPU, causing nDCG to run up to
2.65x slower on GPU than CPU. Replace the torch.unique-based tie-averaging
approach in _tie_average_dcg with a diff + scatter_add_ strategy that is
efficient on both CPU and GPU.

The refactored _dcg_sample_scores also uses gather so that both 1-D
(single query) and 2-D (batched queries) inputs are handled correctly,
making retrieval_normalized_dcg usable with batched inputs directly.

Fixes: Lightning-AI#2287

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rclough rclough marked this pull request as draft March 31, 2026 17:50

rclough commented Mar 31, 2026

Sorry, I'm still iterating on this PR with Claude Code. I'm just trying to get the implementation I shared in #2287 months ago into a PR.

…n tie averaging

torch.unique is ~15x slower on GPU than CPU, causing nDCG to run up to
2.65x slower on GPU than CPU. Replace with a diff + scatter_add_ strategy
that is efficient on both CPU and GPU.

Key changes to the algorithm (based on the optimized implementation
proposed in Lightning-AI#2287):
- _tie_average_dcg: takes pre-sorted inputs, uses diff + scatter_add_
  instead of torch.unique; float64 accumulation for numerical parity
  with sklearn; int32 group counts; valid-group masking before scatter
- _dcg_sample_scores: handles sorting (with topk fast-path when k < L),
  gather, and discount creation; delegates tie averaging to the above
- retrieval_normalized_dcg: unchanged public API; now correctly handles
  both 1-D (single query) and 2-D (batched) inputs
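
The topk fast-path mentioned above could look roughly like the sketch below. The helper name `top_k_sorted` is hypothetical and this is not the code in the commit; it only shows that `torch.topk` returns the k largest scores in descending order, so a full sort can be skipped when k < L. One caveat this sketch ignores: if a tie group straddles the cutoff, the tied items beyond position k are dropped, which matters for tie averaging.

```python
import torch


def top_k_sorted(preds: torch.Tensor, target: torch.Tensor, k: int):
    """Return the top-k scores (descending) and their targets (illustrative only)."""
    length = preds.shape[-1]
    if k < length:
        # topk returns values sorted in descending order by default.
        top_preds, idx = torch.topk(preds, k, dim=-1)
    else:
        top_preds, idx = torch.sort(preds, dim=-1, descending=True)
    # gather works along the last dim for both 1-D and 2-D inputs.
    return top_preds, torch.gather(target, -1, idx)
```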

Tests added:
- test_accuracy_vs_sklearn: parametrized across 8 (batch, length, top_k)
  configs, tolerance 1e-4 matching reference implementation parity
- test_batched_input_matches_per_query: 2-D result == mean of 1-D calls
- test_tie_handling_explicit: explicit tie configurations vs sklearn
- test_all_zeros_target: all-irrelevant queries return 0.0, not NaN
- test_perfect_ranking: ideal predictions return nDCG == 1.0
- test_top_k_valid_range: results in [0, 1] for all top_k values

Fixes: Lightning-AI#2287

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rclough rclough force-pushed the fix/ndcg-gpu-performance branch from 447869c to 1ec556c Compare March 31, 2026 17:57
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rclough rclough force-pushed the fix/ndcg-gpu-performance branch from 270b2c3 to 41c14b2 Compare March 31, 2026 18:14
Development

Successfully merging this pull request may close these issues.

Calculations nDCG using GPU are 2x slower than CPU

1 participant