Fix off-by-one in CodeCopyrightTagger._score span length by Chessing234 · Pull Request #283 · allenai/dolma

Chessing234 · 2026-04-12T05:05:26Z

Summary

CodeCopyrightTagger._score computes copyright coverage as:

score = (span.end - span.start + 1) * 1.0 / len(text)

The + 1 is incorrect. Span uses Python's exclusive-end convention: Span.__len__ returns self.end - self.start, and all other span-length calculations in the codebase use end - start without + 1. The extra + 1 adds one character to each span's length, so each span's score is too high by 1 / len(text) and copyright coverage is systematically overstated.

Fix: Remove the + 1 to match the Span convention.
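To make the effect concrete, here is a minimal sketch rather than the dolma source. The `Span` class is a hypothetical stand-in that only mirrors the exclusive-end behaviour described above, and `score_before` / `score_after` are illustrative names wrapping the expression from the summary with and without the + 1.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for dolma's Span; `end` is exclusive."""

    start: int
    end: int

    def __len__(self) -> int:
        return self.end - self.start


def score_before(span: Span, text: str) -> float:
    # Expression quoted in the summary, including the extra + 1.
    return (span.end - span.start + 1) * 1.0 / len(text)


def score_after(span: Span, text: str) -> float:
    # Same expression with the + 1 removed, matching Span.__len__.
    return (span.end - span.start) * 1.0 / len(text)


text = "# Copyright 2024 Example Corp\n"
span = Span(start=0, end=len(text))  # the span covers the whole text

assert text[span.start:span.end] == text  # exclusive end slices cleanly
assert score_after(span, text) == 1.0     # full coverage is exactly 100%
assert score_before(span, text) > 1.0     # the + 1 pushes coverage above 100%
```

A span that covers the entire text is the easiest case to reason about: with the + 1, the reported coverage exceeds 1.0, which cannot be a valid fraction of the text.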

Test plan

Verified Span.__len__ returns end - start (exclusive end, no + 1)
Existing tests should continue to pass
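A regression check along the following lines could pin the convention down. This is a sketch under stated assumptions, not a test from the repository: `coverage` is a hypothetical helper that works on plain `(start, end)` pairs and sums per-span lengths, which may not match how the tagger aggregates its scores.

```python
def coverage(spans, text, extra=0):
    """Fraction of `text` covered by (start, end) pairs with exclusive end.

    `extra` is a hypothetical knob for illustration only:
    0 mirrors the fixed expression, 1 mirrors the old one.
    """
    return sum(end - start + extra for start, end in spans) / len(text)


def test_adjacent_spans_tile_the_text():
    text = "a" * 100
    # Four adjacent, non-overlapping spans that together cover all 100 characters.
    spans = [(0, 25), (25, 50), (50, 75), (75, 100)]
    assert coverage(spans, text) == 1.0
    # With the old + 1, each span contributes one extra character: 104 / 100.
    assert coverage(spans, text, extra=1) == 1.04


test_adjacent_spans_tile_the_text()
```

Tiling the text with adjacent spans shows the "one character per span" inflation directly: the error grows with the number of spans, not with their length.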

🤖 Generated with Claude Code

Span uses exclusive-end convention (Span.__len__ returns end - start). The + 1 inflates the copyright coverage score by one character per span, making it systematically too high.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Chessing234 wants to merge 1 commit into allenai:main from Chessing234:fix/copyright-score-off-by-one
