Skip to content

Fix find_offset treating exact start-boundary as 'not found'#290

Open
Chessing234 wants to merge 1 commit intoallenai:mainfrom
Chessing234:fix/find-offset-exact-match-spurious-valueerror
Open

Fix find_offset treating exact start-boundary as 'not found'#290
Chessing234 wants to merge 1 commit intoallenai:mainfrom
Chessing234:fix/find-offset-exact-match-spurious-valueerror

Conversation

@Chessing234
Copy link
Copy Markdown

Bug

scripts/find_offset.py aborts with ValueError: Offset N not found in ... whenever the requested offset lands exactly on a document's starting byte — even though the offset is valid and the surrounding code has already chosen the right row:

loc = bisect(data["start"].values, offset)
if data["start"][loc] > offset:
    # in case of not exact match
    loc -= 1

if data.iloc[loc].start >= offset or data.iloc[loc].end <= offset:
    raise ValueError(f"Offset {offset} not found in {opts.file}.")

Root cause

Each row in the metadata CSV describes a document occupying the half-open byte interval [start, end). After the bisect + "not exact match" decrement, loc is the row whose start <= offset. So to decide "not found" we need offset < start (impossible here) or offset >= end, i.e.

start > offset  or  end <= offset

The existing check uses start >= offset, which is True whenever offset == start — the boundary case. That fires the ValueError for a perfectly valid offset sitting at the first byte of a document.

Why the fix is correct

Changing >= to > aligns the check with the half-open interval semantics the rest of the script already uses:

  • offset == start: still inside [start, end) → no longer raises (start > offset is False, end <= offset is also False because end > start == offset).
  • offset == end: already correctly flagged as not found, unchanged (end <= offset is True).
  • Any offset strictly between start and end: unchanged.
  • Any offset outside, after the decrement logic: unchanged.

One-character edit (>=>) on the single offending line; no change to the bisect / decrement logic above it or the downstream locs.append(loc) path.

scripts/find_offset.py treats each metadata row as the half-open
interval [start, end). After the bisect + 'not exact match' decrement,
loc points at the row whose start is <= offset, so the validity check is:

    if data.iloc[loc].start >= offset or data.iloc[loc].end <= offset:
        raise ValueError(f"Offset {offset} not found in {opts.file}.")

The left side uses >= instead of >. When the requested offset lands
exactly on a row boundary (offset == start) that row is actually the
correct one -- the first byte of the document begins at start -- but
the check raises ValueError and aborts before locs.append(loc).

Using > matches the half-open semantics (offset in [start, end)) and
keeps the 'offset >= end' side unchanged, so the only behaviour change
is that offsets exactly on a start boundary are now correctly reported
as found instead of failing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant