
feat(metrics): add Perplexity metric to ignite.metrics.nlp #3743

Open

steaphenai wants to merge 13 commits into pytorch:master from steaphenai:feat/perplexity-metric-pr

Conversation

@steaphenai
Contributor

Closes #3742

Summary

  • Add a new Perplexity metric implementation in ignite.metrics.nlp.perplexity.
  • Export Perplexity from both ignite.metrics.nlp and top-level ignite.metrics.
  • Add dedicated tests for correctness, accumulation behavior, reset behavior, return type, and invalid inputs.

Test plan

  • python -m pytest tests/ignite/metrics/nlp/test_perplexity.py -v
  • Smoke test:
    • python -c "from ignite.metrics.nlp import Perplexity; import torch; ppl = Perplexity(); ppl.reset(); ppl.update((torch.randn(2,5,3), torch.randint(0,5,(2,3)))); print('PPL =', ppl.compute())"
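For readability, here is the same smoke test expanded into a short script. Reading the shapes as (batch=2, vocab=5, seq_len=3) is my inference from the target shape, following the torch cross-entropy convention:

import torch
from ignite.metrics.nlp import Perplexity

# Logits (2, 5, 3) and integer targets (2, 3), i.e. (batch, vocab, seq_len)
# and (batch, seq_len) in cross-entropy convention.
logits = torch.randn(2, 5, 3)
targets = torch.randint(0, 5, (2, 3))

ppl = Perplexity()
ppl.reset()
ppl.update((logits, targets))
print("PPL =", ppl.compute())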

Files changed

  • ignite/metrics/nlp/perplexity.py
  • ignite/metrics/nlp/__init__.py
  • ignite/metrics/__init__.py
  • tests/ignite/metrics/nlp/test_perplexity.py

@github-actions bot added the module: metrics label on Apr 20, 2026
Expose a new token-level Perplexity metric in ignite.metrics.nlp and top-level ignite.metrics, with dedicated unit tests to validate correctness and behavior.
@steaphenai steaphenai force-pushed the feat/perplexity-metric-pr branch from 8453d4e to fa394ca on April 20, 2026 10:33
@Prathamesh8989
Contributor

Nice addition! Perplexity is definitely a useful metric for language modeling and it fits well under ignite.metrics.nlp.

The test coverage looks solid — especially the token-weighted accumulation test, which ensures correctness across batches with different sequence lengths.

One small suggestion: it might be useful to add a GPU test to ensure the metric behaves correctly when tensors are on CUDA devices, since many language modeling workloads run on GPU.

Something like:

import pytest
import torch

def test_gpu_support():
    if not torch.cuda.is_available():
        pytest.skip("CUDA not available")
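A fuller version of that skeleton could also assert CPU/GPU agreement. The test name here is hypothetical, and the device= constructor argument is assumed from Ignite's standard Metric signature:

import pytest
import torch
from ignite.metrics.nlp import Perplexity

def test_gpu_matches_cpu():
    if not torch.cuda.is_available():
        pytest.skip("CUDA not available")
    logits = torch.randn(2, 5, 3)
    targets = torch.randint(0, 5, (2, 3))

    # The same data through a CPU metric and a CUDA metric should agree.
    ppl_cpu = Perplexity()
    ppl_cpu.update((logits, targets))

    ppl_gpu = Perplexity(device="cuda")  # device= assumed from the Metric base class
    ppl_gpu.update((logits.cuda(), targets.cuda()))

    assert ppl_gpu.compute() == pytest.approx(ppl_cpu.compute(), rel=1e-5)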

Overall the implementation and tests look clean and consistent with existing Ignite metrics.

@steaphenai
Contributor Author

Good point, thanks. I’d like to keep this PR scoped to the Perplexity implementation and core correctness tests. We can add a dedicated CUDA test if maintainers want explicit GPU coverage.
@vfdev-5 thoughts?

@vfdev-5
Collaborator

vfdev-5 left a comment

@steaphenai thanks for the PR, I made a quick pass and left a few comments.

The tests look shallow, and there is no reference implementation that we test against.
I suggest checking what we can use as a reference implementation.
For testing on accelerators, check other tests like test_accuracy.py for inspiration.
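For context, a minimal reference implementation of token-level perplexity (a sketch, not code from this PR) can be built directly on torch.nn.functional.cross_entropy, since perplexity is the exponential of the mean per-token negative log-likelihood:

import torch
import torch.nn.functional as F

def reference_perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # logits: (batch, vocab, seq_len); targets: (batch, seq_len).
    # Perplexity = exp(mean per-token NLL).
    nll = F.cross_entropy(logits, targets, reduction="mean")
    return torch.exp(nll).item()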

[Review thread: ignite/metrics/nlp/perplexity.py]
[Review thread: tests/ignite/metrics/nlp/test_perplexity.py (outdated)]
@steaphenai
Contributor Author

steaphenai commented Apr 21, 2026

> @steaphenai thanks for the PR, I made a quick pass and left a few comments.
>
> The tests look shallow, and there is no reference implementation that we test against. I suggest checking what we can use as a reference implementation. For testing on accelerators, check other tests like test_accuracy.py for inspiration.

Thanks for the quick review, @vfdev-5.
I addressed the two code comments in the latest push:

  • detached tensors from the grad graph in Perplexity.update()
  • removed test_returns_float

I’ll also use existing metric tests (e.g., test_accuracy.py) as a reference for accelerator-oriented test patterns as needed.
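For illustration, the detach-plus-accumulate pattern described above looks roughly like this. The accumulator names _sum_of_nll and _num_tokens appear later in this thread; the rest is a sketch, not the PR's exact code:

import torch
import torch.nn.functional as F

class PerplexitySketch:
    # Minimal sketch of the streaming pattern; the real metric derives from
    # ignite.metrics.Metric and handles devices and distributed reduction.

    def reset(self) -> None:
        self._sum_of_nll = torch.tensor(0.0)
        self._num_tokens = torch.tensor(0)

    def update(self, output) -> None:
        y_pred, y = output
        # Detach so metric updates never keep the autograd graph alive.
        y_pred, y = y_pred.detach(), y.detach()
        self._sum_of_nll += F.cross_entropy(y_pred, y, reduction="sum")
        self._num_tokens += y.numel()

    def compute(self) -> float:
        # Token-weighted: total NLL over total token count, then exp.
        return torch.exp(self._sum_of_nll / self._num_tokens).item()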

@steaphenai
Contributor Author

steaphenai commented Apr 21, 2026

I added an explicit reference implementation check (_reference_perplexity) and validated both single-batch and multi-batch token-weighted accumulation against it. I also aligned the test structure with existing Ignite metric patterns (as in test_accuracy.py): available_device parametrization, device assertions, and distributed-marked test layout.
Local run: python -m pytest tests/ignite/metrics/nlp/test_perplexity.py -m "not distributed" -v -> 10 passed, 12 skipped (CUDA/MPS tests skipped when unavailable).
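For reference, the available_device pattern mentioned above might look roughly like this (assuming Ignite's available_device test fixture as used in test_accuracy.py, and the _device attribute from the Metric base class):

import pytest
import torch
import torch.nn.functional as F
from ignite.metrics.nlp import Perplexity

def test_compute(available_device):
    # available_device is assumed to be the Ignite test fixture
    # yielding "cpu", "cuda", or "mps" depending on the machine.
    ppl = Perplexity(device=available_device)
    assert ppl._device == torch.device(available_device)

    logits = torch.randn(2, 5, 3)
    targets = torch.randint(0, 5, (2, 3))
    ppl.update((logits, targets))

    # For a single batch, token-weighted perplexity reduces to exp(mean NLL).
    expected = torch.exp(F.cross_entropy(logits, targets, reduction="mean")).item()
    assert ppl.compute() == pytest.approx(expected, rel=1e-4)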


[Review thread: tests/ignite/metrics/nlp/test_perplexity.py (outdated)]
@vfdev-5 vfdev-5 marked this pull request as draft April 21, 2026 12:37
@steaphenai steaphenai marked this pull request as ready for review April 21, 2026 14:25
@vfdev-5 vfdev-5 marked this pull request as draft April 21, 2026 15:38
[Review thread: tests/ignite/metrics/nlp/test_perplexity.py (outdated)]
@steaphenai steaphenai marked this pull request as ready for review April 22, 2026 06:27
[Review thread: tests/ignite/metrics/nlp/test_perplexity.py (outdated)]
Co-authored-by: vfdev <vfdev.5@gmail.com>
[Review thread: tests/ignite/metrics/nlp/test_perplexity.py (outdated)]
[Review thread: ignite/metrics/nlp/__init__.py]
[Review thread: ignite/metrics/nlp/perplexity.py]
@github-actions bot added the docs label on Apr 23, 2026
@vfdev-5
Collaborator

vfdev-5 commented Apr 23, 2026

@steaphenai
Contributor Author

@vfdev-5 Could you approve the workflows to run? The required checks are pending approval.

@vfdev-5
Collaborator

vfdev-5 commented Apr 24, 2026

@steaphenai this failure is real: https://github.com/pytorch/ignite/actions/runs/24847148226/job/72879291595?pr=3743
Check how other metrics handle the double dtype.

@steaphenai
Contributor Author

@vfdev-5 I checked the other metrics in ignite.metrics.nlp and found that BLEU and ROUGE don't force dtype=torch.double on their accumulators. I've removed the explicit dtype=torch.double and dtype=torch.long from _sum_of_nll and _num_tokens in reset(), and removed dtype=torch.double from the .to() call in update(), to match that pattern and fix MPS compatibility.
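For illustration, reset() after that change might look like this (a sketch; self._device coming from the Metric base class is an assumption):

def reset(self) -> None:
    # Default dtypes: torch.tensor(0.0) is float32 and torch.tensor(0) is int64.
    # Avoiding an explicit float64 keeps the metric MPS-compatible,
    # matching how BLEU and ROUGE handle their accumulators.
    self._sum_of_nll = torch.tensor(0.0, device=self._device)
    self._num_tokens = torch.tensor(0, device=self._device)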


Labels

module: metrics, docs

Development

Successfully merging this pull request may close these issues.

Feature: Add Perplexity metric to ignite.metrics.nlp

3 participants