perf(retrieval): fix NDCG GPU performance regression by replacing torch.unique #3350
Draft
rclough wants to merge 4 commits into Lightning-AI:master
…n tie averaging

torch.unique is ~15x slower on GPU than CPU, causing nDCG to run up to 2.65x slower on GPU than CPU. Replace the torch.unique-based tie-averaging approach in _tie_average_dcg with a diff + scatter_add_ strategy that is efficient on both CPU and GPU.

The refactored _dcg_sample_scores also uses gather so that both 1-D (single query) and 2-D (batched queries) inputs are handled correctly, making retrieval_normalized_dcg usable with batched inputs directly.

Fixes: Lightning-AI#2287

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
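To illustrate the gather point in the commit message above, here is a minimal sketch of reordering targets by descending prediction in a way that works for both 1-D and 2-D inputs. The helper name `sort_target_by_preds` is hypothetical and is not taken from the PR. Plain fancy indexing (`target[order]`) would select whole rows when `order` is 2-D, which is presumably why the commit switches to `gather`.

```python
import torch


def sort_target_by_preds(preds, target):
    """Reorder target by descending preds along the last dimension.

    Hypothetical helper for illustration only, not the PR's code.
    """
    # Indices that sort each query's predictions from highest to lowest.
    order = torch.argsort(preds, dim=-1, descending=True)
    # gather applies the per-row permutation, for 1-D and 2-D inputs alike.
    return torch.gather(target, -1, order)


single = sort_target_by_preds(
    torch.tensor([0.1, 0.9, 0.5]),
    torch.tensor([0, 2, 1]),
)
batched = sort_target_by_preds(
    torch.tensor([[0.1, 0.9, 0.5], [0.7, 0.2, 0.3]]),
    torch.tensor([[0, 2, 1], [1, 0, 3]]),
)
```

The same call handles a single query and a batch because `dim=-1` always refers to the document axis.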
Author
Sorry, iterating on PR with Claude Code - just trying to get the implementation I shared in #2287 months ago available in a PR
…n tie averaging

torch.unique is ~15x slower on GPU than CPU, causing nDCG to run up to 2.65x slower on GPU than CPU. Replace with a diff + scatter_add_ strategy that is efficient on both CPU and GPU.

Key changes to the algorithm (based on the optimized implementation proposed in Lightning-AI#2287):
- _tie_average_dcg: takes pre-sorted inputs, uses diff + scatter_add_ instead of torch.unique; float64 accumulation for numerical parity with sklearn; int32 group counts; valid-group masking before scatter
- _dcg_sample_scores: handles sorting (with topk fast-path when k < L), gather, and discount creation; delegates tie averaging to the above
- retrieval_normalized_dcg: unchanged public API; now correctly handles both 1-D (single query) and 2-D (batched) inputs

Tests added:
- test_accuracy_vs_sklearn: parametrized across 8 (batch, length, top_k) configs, tolerance 1e-4 matching reference implementation parity
- test_batched_input_matches_per_query: 2-D result == mean of 1-D calls
- test_tie_handling_explicit: explicit tie configurations vs sklearn
- test_all_zeros_target: all-irrelevant queries return 0.0, not NaN
- test_perfect_ranking: ideal predictions return nDCG == 1.0
- test_top_k_valid_range: results in [0, 1] for all top_k values

Fixes: Lightning-AI#2287

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
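The diff + scatter_add_ idea described above can be sketched roughly as follows. This is not the PR's implementation: the function name `tie_average_dcg_sketch`, the float64 counts, and the use of the raw relevance value as the gain are all assumptions made for illustration, and the PR's int32 counts, valid-group masking, and top-k handling are left out.

```python
import torch


def tie_average_dcg_sketch(sorted_preds, sorted_target, discount):
    """Tie-averaged DCG for inputs already sorted by descending prediction.

    sorted_preds, sorted_target: shape (B, L); discount: shape (L,).
    Hypothetical helper for illustration only, not the PR's code.
    """
    b, n = sorted_preds.shape
    device = sorted_preds.device
    # A new tie group starts wherever the sorted score changes.
    new_group = torch.ones(b, n, dtype=torch.bool, device=device)
    new_group[:, 1:] = sorted_preds.diff(dim=1) != 0
    # Running count of group starts gives a 0-based group id per position.
    group_id = new_group.to(torch.int64).cumsum(dim=1) - 1

    gains = sorted_target.to(torch.float64)
    disc = discount.to(torch.float64).expand(b, n)
    zeros = torch.zeros(b, n, dtype=torch.float64, device=device)
    # Per-group sums of gains, discounts, and member counts.
    gain_sum = zeros.clone().scatter_add_(1, group_id, gains)
    disc_sum = zeros.clone().scatter_add_(1, group_id, disc)
    count = zeros.clone().scatter_add_(1, group_id, torch.ones_like(gains))
    # Each group contributes mean(gain) * sum(discount); unused slots add 0.
    return (gain_sum / count.clamp(min=1) * disc_sum).sum(dim=1)


preds = torch.tensor([[0.9, 0.5, 0.5, 0.1], [0.9, 0.8, 0.7, 0.6]])
target = torch.tensor([[3, 2, 0, 1], [1, 0, 2, 0]])
discount = 1.0 / torch.log2(torch.arange(4, dtype=torch.float64) + 2.0)
dcg = tie_average_dcg_sketch(preds, target, discount)
```

In the first query the two documents scored 0.5 form one group, so each is credited with their mean relevance of 1.0 regardless of which one happened to be sorted first. Every step is a dense tensor operation, so nothing here depends on torch.unique.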
447869c to 1ec556c
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
270b2c3 to 41c14b2
for more information, see https://pre-commit.ci
Summary
Fixes #2287. nDCG was running up to 2.65x slower on GPU than CPU because torch.unique is ~15x slower on GPU than CPU.

The fix replaces the torch.unique-based tie-averaging in _tie_average_dcg with a vectorized argsort → diff → scatter_add_ approach that is efficient on both CPU and GPU.

Changes:
- _tie_average_dcg: replaces torch.unique with diff + scatter_add_ for group detection and accumulation; supports both 1-D (single query) and 2-D (batched) inputs
- _dcg_sample_scores: uses gather instead of fancy indexing so it works correctly for both 1-D and 2-D inputs
- retrieval_normalized_dcg: no API changes, so existing callers are unaffected; batched 2-D input now also works correctly

Test plan
- TestNDCG tests and test_corner_case_with_tied_scores pass (correctness preserved)
- test_batched_input_matches_per_query verifies batched 2-D input gives the same result as averaging individual 1-D calls, and matches sklearn for each query

🤖 Generated with Claude Code
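For spot-checking a vectorized implementation on tiny inputs, a plain-Python reference for tie-averaged nDCG can be useful. The sketch below is an assumption about the intended semantics (linear gain, log2 discount, zero discount past top_k, and 0.0 when no document is relevant), written to mirror the behaviour described in this PR. The function names are hypothetical and it is not code from torchmetrics or sklearn.

```python
import math
from itertools import groupby


def dcg_tie_averaged(preds, target, top_k=None):
    """Reference DCG where documents with equal scores share their mean gain."""
    k = len(preds) if top_k is None else min(top_k, len(preds))
    # Sort (score, gain) pairs by descending score.
    ranked = sorted(zip(preds, target), key=lambda pair: -pair[0])
    dcg, rank = 0.0, 0
    for _, group in groupby(ranked, key=lambda pair: pair[0]):
        gains = [gain for _, gain in group]
        mean_gain = sum(gains) / len(gains)
        # Only positions before the cutoff k carry a discount.
        for position in range(rank, rank + len(gains)):
            if position < k:
                dcg += mean_gain / math.log2(position + 2)
        rank += len(gains)
    return dcg


def ndcg_tie_averaged(preds, target, top_k=None):
    # The ideal DCG ranks documents by their true relevance.
    ideal = dcg_tie_averaged(target, target, top_k)
    return 0.0 if ideal == 0 else dcg_tie_averaged(preds, target, top_k) / ideal


tied = dcg_tie_averaged([0.9, 0.5, 0.5, 0.1], [3, 2, 0, 1])
all_zero = ndcg_tie_averaged([0.3, 0.2, 0.1], [0, 0, 0])
perfect = ndcg_tie_averaged([3, 2, 1, 0], [3, 2, 1, 0])
```

A loop like this is far too slow for real workloads, but it makes the expected values for tied scores, all-zero targets, and perfect rankings easy to reason about by hand.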
📚 Documentation preview 📚: https://torchmetrics--3350.org.readthedocs.build/en/3350/