Skip to content

Fix off-by-one in CodeCopyrightTagger end position (missing newline)#282

Open
Chessing234 wants to merge 1 commit intoallenai:mainfrom
Chessing234:fix/code-copyright-tagger-newline-offset
Open

Fix off-by-one in CodeCopyrightTagger end position (missing newline)#282
Chessing234 wants to merge 1 commit intoallenai:mainfrom
Chessing234:fix/code-copyright-tagger-newline-offset

Conversation

@Chessing234
Copy link
Copy Markdown

Bug

CodeCopyrightTagger._extract_copyright_spans in python/dolma/taggers/code/code_taggers.py produces comment_block spans whose end is short by one per non-empty comment line. The last comment line of a leading copyright block ends up truncated mid-line.

Root cause

```python
lines = text.split("\n")
...
end = 0
for line in lines:
if line.startswith("//") or line.startswith("#") or line.startswith("--") or not line:
skip = skip + 1
if not line:
end += 1
else:
end += len(line)
else:
break
```

str.split(\"\\n\") drops the newline separators, so each matched line occupied len(line) + 1 bytes in the original text (its content plus the trailing \"\\n\"). The empty-line branch correctly adds 1 (empty content + the newline), but the non-empty branch adds only len(line), forgetting the newline.

Worked example

`text = "// a\n// b\nother"`:

  • lines = [\"// a\", \"// b\", \"other\"]
  • Current: end = 4 + 4 = 8. The emitted span is [0, 8) = \"// a\\n// \", cutting off \"b\\n\".
  • Expected: end = 5 + 5 = 10, which is exactly where \"other\" begins.

This downstream feeds Span(start=0, end=end, type=\"comment_block\", ...) and _score, so filters and downstream taggers see a slightly-too-small copyright region.

Fix

Collapse both branches to end += len(line) + 1. For empty lines this is 0 + 1 = 1, matching the prior behavior. For non-empty comment lines it now correctly includes the dropped newline. Two lines added, four removed; the if not line / else split is no longer needed.

```diff

  •            if not line:
    
  •                end += 1
    
  •            else:
    
  •                end += len(line)
    
  •            # +1 accounts for the \"\\n\" separator that str.split dropped
    
  •            end += len(line) + 1
    

```

`CodeCopyrightTagger._extract_copyright_spans` splits the text on
`"\n"` and walks the resulting lines to find a leading comment
block. To reconstruct the end offset of that block in the original
text, it accumulates the per-line length, but the split dropped the
`"\n"` separators, so each non-empty comment line was counted as
`len(line)` instead of `len(line) + 1`. The empty-line branch
already added 1 (for the newline), but the non-empty branch forgot
the separator, producing spans that end in the middle of the last
comment line.

Worked example — `text = "// a\n// b\nother"`:

- `lines = ["// a", "// b", "other"]`
- Current code: `end = 4 + 4 = 8`, so the span is `[0, 8) = "// a\n// "`,
  truncating mid-line and missing the final `"b\n"`.
- Correct: `end = 5 + 5 = 10`, matching the start of `"other"`.

Collapse the branches to `end += len(line) + 1`, which is equivalent
to the empty-line case's `+= 1` and correct for non-empty lines.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant