Fix off-by-one in CodeCopyrightTagger end position (missing newline) by Chessing234 · Pull Request #282 · allenai/dolma

Chessing234 · 2026-04-11T08:25:11Z

Bug

CodeCopyrightTagger._extract_copyright_spans in python/dolma/taggers/code/code_taggers.py produces comment_block spans whose end is short by one per non-empty comment line. The last comment line of a leading copyright block ends up truncated mid-line.

Root cause

```python
lines = text.split("\n")
...
end = 0
for line in lines:
if line.startswith("//") or line.startswith("#") or line.startswith("--") or not line:
skip = skip + 1
if not line:
end += 1
else:
end += len(line)
else:
break
```

str.split(\"\\n\") drops the newline separators, so each matched line occupied len(line) + 1 bytes in the original text (its content plus the trailing \"\\n\"). The empty-line branch correctly adds 1 (empty content + the newline), but the non-empty branch adds only len(line), forgetting the newline.

Worked example

`text = "// a\n// b\nother"`:

lines = [\"// a\", \"// b\", \"other\"]
Current: end = 4 + 4 = 8. The emitted span is [0, 8) = \"// a\\n// \", cutting off \"b\\n\".
Expected: end = 5 + 5 = 10, which is exactly where \"other\" begins.

This downstream feeds Span(start=0, end=end, type=\"comment_block\", ...) and _score, so filters and downstream taggers see a slightly-too-small copyright region.

Fix

Collapse both branches to end += len(line) + 1. For empty lines this is 0 + 1 = 1, matching the prior behavior. For non-empty comment lines it now correctly includes the dropped newline. Two lines added, four removed; the if not line / else split is no longer needed.

```diff

```
           if not line:
```
```
               end += 1
```
```
           else:
```
```
               end += len(line)
```

           # +1 accounts for the \"\\n\" separator that str.split dropped

```
           end += len(line) + 1
```

```

`CodeCopyrightTagger._extract_copyright_spans` splits the text on `"\n"` and walks the resulting lines to find a leading comment block. To reconstruct the end offset of that block in the original text, it accumulates the per-line length, but the split dropped the `"\n"` separators, so each non-empty comment line was counted as `len(line)` instead of `len(line) + 1`. The empty-line branch already added 1 (for the newline), but the non-empty branch forgot the separator, producing spans that end in the middle of the last comment line. Worked example — `text = "// a\n// b\nother"`: - `lines = ["// a", "// b", "other"]` - Current code: `end = 4 + 4 = 8`, so the span is `[0, 8) = "// a\n// "`, truncating mid-line and missing the final `"b\n"`. - Correct: `end = 5 + 5 = 10`, matching the start of `"other"`. Collapse the branches to `end += len(line) + 1`, which is equivalent to the empty-line case's `+= 1` and correct for non-empty lines.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix off-by-one in CodeCopyrightTagger end position (missing newline)#282

Fix off-by-one in CodeCopyrightTagger end position (missing newline)#282
Chessing234 wants to merge 1 commit intoallenai:mainfrom
Chessing234:fix/code-copyright-tagger-newline-offset

Chessing234 commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Chessing234 commented Apr 11, 2026

Bug

Root cause

Worked example

Fix

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant