Fix find_offset treating exact start-boundary as 'not found' by Chessing234 · Pull Request #290 · allenai/dolma

Chessing234 · 2026-04-20T05:17:46Z

Bug

scripts/find_offset.py aborts with ValueError: Offset N not found in ... whenever the requested offset lands exactly on a document's starting byte — even though the offset is valid and the surrounding code has already chosen the right row:

loc = bisect(data["start"].values, offset)
if data["start"][loc] > offset:
    # in case of not exact match
    loc -= 1

if data.iloc[loc].start >= offset or data.iloc[loc].end <= offset:
    raise ValueError(f"Offset {offset} not found in {opts.file}.")

Root cause

Each row in the metadata CSV describes a document occupying the half-open byte interval [start, end). After the bisect + "not exact match" decrement, loc is the row whose start <= offset. So to decide "not found" we need offset < start (impossible here) or offset >= end, i.e.

start > offset  or  end <= offset

The existing check uses start >= offset, which is True whenever offset == start — the boundary case. That fires the ValueError for a perfectly valid offset sitting at the first byte of a document.

Why the fix is correct

Changing >= to > aligns the check with the half-open interval semantics the rest of the script already uses:

offset == start: still inside [start, end) → no longer raises (start > offset is False, end <= offset is also False because end > start == offset).
offset == end: already correctly flagged as not found, unchanged (end <= offset is True).
Any offset strictly between start and end: unchanged.
Any offset outside, after the decrement logic: unchanged.

One-character edit (>= → >) on the single offending line; no change to the bisect / decrement logic above it or the downstream locs.append(loc) path.

scripts/find_offset.py treats each metadata row as the half-open interval [start, end). After the bisect + 'not exact match' decrement, loc points at the row whose start is <= offset, so the validity check is: if data.iloc[loc].start >= offset or data.iloc[loc].end <= offset: raise ValueError(f"Offset {offset} not found in {opts.file}.") The left side uses >= instead of >. When the requested offset lands exactly on a row boundary (offset == start) that row is actually the correct one -- the first byte of the document begins at start -- but the check raises ValueError and aborts before locs.append(loc). Using > matches the half-open semantics (offset in [start, end)) and keeps the 'offset >= end' side unchanged, so the only behaviour change is that offsets exactly on a start boundary are now correctly reported as found instead of failing.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix find_offset treating exact start-boundary as 'not found'#290

Fix find_offset treating exact start-boundary as 'not found'#290
Chessing234 wants to merge 1 commit intoallenai:mainfrom
Chessing234:fix/find-offset-exact-match-spurious-valueerror

Chessing234 commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Chessing234 commented Apr 20, 2026

Bug

Root cause

Why the fix is correct

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant