Skip to content

Fix BaseUrlTagger.IGNORE_IP_REGEX_START missing f-string prefix#289

Open
Chessing234 wants to merge 1 commit into
allenai:mainfrom
Chessing234:fix/ignore-ip-regex-start-f-string
Open

Fix BaseUrlTagger.IGNORE_IP_REGEX_START missing f-string prefix#289
Chessing234 wants to merge 1 commit into
allenai:mainfrom
Chessing234:fix/ignore-ip-regex-start-f-string

Conversation

@Chessing234
Copy link
Copy Markdown

Bug

`python/dolma/taggers/url.py:BaseUrlTagger` defines `IGNORE_IP_REGEX_START` as an anchored version of `IGNORE_IP_REGEX`, but the raw-string prefix (`r"..."`) prevents the embedded `{IGNORE_IP_REGEX.pattern}` placeholder from being interpolated:

```python
IGNORE_IP_REGEX = re.compile(r"(127.0.0.1|0.0.0.0|::1)")
IGNORE_IP_REGEX_START = re.compile(r"^{IGNORE_IP_REGEX.pattern}")
URL_REGEX = re.compile(r"(([a-z0-9-_]+.?){2,}|localhost|localdomain)")
ONLY_URL_REGEX = re.compile(f"^{URL_REGEX.pattern}")
ADP_FORMAT_REGEX = re.compile(f"\\|+{URL_REGEX.pattern}\\^")
```

So the compiled pattern becomes the literal six-character run `^{IGNORE_IP_REGEX.pattern}` (the `{...}` is not a valid `re` quantifier, so it's matched as literal characters), which cannot match any IP string.

Root cause

The two sibling anchored regexes immediately below (`ONLY_URL_REGEX`, `ADP_FORMAT_REGEX`) use `f"^...{OTHER.pattern}..."` to splice an existing pattern — `IGNORE_IP_REGEX_START` is written with `r"..."` instead of `f"..."`, so the substitution that was clearly intended never happens.

Because `IGNORE_IP_REGEX_START.match(maybe_ipv6_or_ipv4)` in `parse_line` always returns `None`, the guard

```python
if not self.IGNORE_IP_REGEX_START.match(maybe_ipv6_or_ipv4):
# do not yield the IP if it a localhost
yield maybe_ipv6_or_ipv4
```

always fires, so `127.0.0.1`, `0.0.0.0`, and `::1` lines from hosts-style blocklists are pushed into `self.blocklist` instead of being filtered out as the comment promises.

Why the fix is correct

Switching the prefix from `r` to `f` interpolates `IGNORE_IP_REGEX.pattern` into the anchored regex exactly like the adjacent `ONLY_URL_REGEX` / `ADP_FORMAT_REGEX` patterns do. The resulting regex is `^(127.0.0.1|0.0.0.0|::1)`, which is the same alternation already validated by `IGNORE_IP_REGEX` but anchored to the start — restoring the intended "drop localhost / reserved IPs" behavior in `parse_line`.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant