Fix off-by-one in CodeCopyrightTagger._score span length #283

Open
Chessing234 wants to merge 1 commit into allenai:main from Chessing234:fix/copyright-score-off-by-one
@Chessing234
Summary

CodeCopyrightTagger._score computes copyright coverage as:

score = (span.end - span.start + 1) * 1.0 / len(text)

The + 1 is incorrect. Span uses Python's exclusive-end convention: Span.__len__ returns self.end - self.start, and all other span-length calculations in the codebase use end - start without + 1. The extra + 1 counts one character too many per span, which adds 1 / len(text) to the score and makes copyright coverage systematically too high.

Fix: Remove the + 1 to match the Span convention.
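
To illustrate the effect, here is a minimal sketch rather than the repository's actual code. The Span class below is a hypothetical stand-in that only mirrors the exclusive-end behaviour described above, and score_with_plus_one / score_fixed are illustrative names. With a span covering the whole document, the old formula reports coverage above 1.0, while the fixed one reports exactly 1.0.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for the codebase's Span (exclusive end)."""

    start: int
    end: int

    def __len__(self) -> int:
        # Exclusive-end convention: the character at index `end` is not included.
        return self.end - self.start


def score_with_plus_one(span: Span, text: str) -> float:
    # Formula before the fix: counts one character too many.
    return (span.end - span.start + 1) * 1.0 / len(text)


def score_fixed(span: Span, text: str) -> float:
    # Formula after the fix: agrees with Span.__len__.
    return (span.end - span.start) * 1.0 / len(text)


text = "# Copyright 2024\nx = 1\n"
span = Span(start=0, end=len(text))  # span covering the entire text

old = score_with_plus_one(span, text)  # slightly above 1.0
new = score_fixed(span, text)          # exactly 1.0
```

The difference between the two values is always 1 / len(text), so the inflation is most visible on short documents.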

Test plan

  • Verified Span.__len__ returns end - start (exclusive end, no + 1)
  • Existing tests should continue to pass
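
The exclusive-end check in the test plan can also be reproduced without the library. The following is a sketch using plain string slicing under the assumption that a span (start, end) selects text[start:end]; covered_length and the sample text are hypothetical.

```python
def covered_length(text: str, start: int, end: int) -> int:
    """Number of characters in text[start:end], i.e. an exclusive-end span."""
    return len(text[start:end])


text = "# Copyright 2024 Example Corp\nprint('hello')\n"
start, end = 0, text.index("\n")  # span over the first line, newline excluded

exclusive_length = end - start            # matches the sliced text
inclusive_style_length = end - start + 1  # one character too many
```

Here covered_length(text, start, end) equals exclusive_length, which is why end - start is the length that belongs in the score.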

🤖 Generated with Claude Code

Span uses exclusive-end convention (Span.__len__ returns end - start).
The + 1 inflates the copyright coverage score by one character per
span, making it systematically too high.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>