Fix catastrophic backtracking in not_alphanum_paragraph_v1 regex #284

Open
Chessing234 wants to merge 1 commit into allenai:main from Chessing234:fix/regex-catastrophic-backtracking-issue-123


Bug

The not_alphanum_paragraph_v1 tagger's regex contains a nested quantifier: [emoji-ranges]+ sits inside (...)+, which is the classic (X+)+ catastrophic backtracking pattern. As reported in #123, an input of ~44 emoji characters takes about 64 seconds to process, and ~68 emoji characters make the tagger hang indefinitely.

Root cause

Line 21 of python/dolma/taggers/punctuation.py: the inner + on the emoji character class is redundant because the outer group )+ already handles matching multiple characters. Together they form a nested quantifier that causes the regex engine to explore exponentially many partitions.
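To make "exponentially many partitions" concrete, here is a small sketch that counts the ways a run of n characters can be split into ordered, non-empty chunks, which is the set of splits a backtracking engine may end up trying when an inner `+` and an outer `+` can both consume the same run. The function name `count_partitions` is hypothetical, and the count is only a model of the search space: the number of steps a real engine takes depends on its implementation.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_partitions(n: int) -> int:
    """Count ordered splits of a run of n characters into non-empty chunks.

    Each chunk models one iteration of the outer group, with the inner
    `+` deciding how many characters that iteration consumes.
    """
    if n == 0:
        return 1  # nothing left to split: exactly one (empty) way
    # The first chunk takes k characters; the rest is split recursively.
    return sum(count_partitions(n - k) for k in range(1, n + 1))


if __name__ == "__main__":
    for n in (4, 8, 16, 44):
        # The count doubles with every extra character: 2 ** (n - 1).
        print(n, count_partitions(n))
```

For a run of 44 characters this gives 2 ** 43 = 8796093022208 possible splits, which is consistent in spirit with the runtimes reported in #123, although the exact timings there come from the real pattern and engine rather than from this model.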

Fix

Remove the inner + (1-character deletion). The outer )+ still matches one-or-more of any character in the alternation (punctuation, whitespace, or emoji), preserving the regex's matching semantics while eliminating exponential backtracking.
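The claim that dropping the inner `+` keeps the same match results can be spot-checked with a simplified stand-in. The patterns below are assumptions made for illustration: they are not the regex from punctuation.py, the emoji range U+1F600 to U+1F64F and the punctuation range `!-/` are arbitrary, and the standard-library `re` module is used here regardless of which regex module the tagger relies on. The names `NESTED`, `FLAT`, `same_verdict`, and `time_match` are hypothetical.

```python
import re
import time

EMOJI = "\U0001F600-\U0001F64F"  # illustrative emoji range only

# Stand-in for the old shape: `[...]+` nested inside `(...)+`.
NESTED = re.compile(rf"^(?:[!-/\s]|[{EMOJI}]+)+$")
# Stand-in for the fixed shape: the inner `+` is removed.
FLAT = re.compile(rf"^(?:[!-/\s]|[{EMOJI}])+$")


def same_verdict(text: str) -> bool:
    """Return True if both patterns agree on whether `text` matches."""
    return bool(NESTED.match(text)) == bool(FLAT.match(text))


def time_match(pattern: re.Pattern, text: str) -> float:
    """Return the seconds spent on a single match attempt."""
    start = time.perf_counter()
    pattern.match(text)
    return time.perf_counter() - start


if __name__ == "__main__":
    samples = ["\U0001F600\U0001F601!! \U0001F602", "!!!", "abc", ""]
    print([same_verdict(s) for s in samples])

    # A run of emoji followed by a letter forces the match to fail,
    # which is when the nested form has to backtrack. The run is kept
    # short so the demonstration finishes quickly.
    failing = "\U0001F600" * 18 + "a"
    print("nested:", time_match(NESTED, failing))
    print("flat:  ", time_match(FLAT, failing))
```

Both patterns accept and reject the same strings in these samples; only the time spent on failing inputs differs. Note that this compares match verdicts only, which is what a tagger that tests whether a paragraph matches would depend on.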

Closes #123

Remove nested quantifier `[...]+` inside `(...)+` that causes
exponential backtracking on long emoji sequences. The outer `)+`
already handles repetition.

Fixes allenai#123

Development

Successfully merging this pull request may close these issues.

not_alphanum_paragraph_v1 tagger takes forever to run on certain inputs.
