Fix catastrophic regex backtracking in NotAlphanumParagraphV1 tagger #278

Open
Chessing234 wants to merge 1 commit into allenai:main from Chessing234:fix/regex-backtracking

@Chessing234

Summary

  • Remove redundant + quantifier from the emoji character class in NotAlphanumParagraphV1's regex pattern (python/dolma/taggers/punctuation.py)
  • The pattern has the form ([...]+)+$, a quantified character class inside a quantified group. These nested quantifiers cause exponential backtracking on long emoji sequences when the match fails. The outer ()+ already handles repetition, so the inner + is both redundant and harmful
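
The effect of the nested quantifier can be sketched with Python's standard `re` module. The patterns below are simplified, hypothetical stand-ins (whitespace plus a single emoji range), not the actual pattern from `python/dolma/taggers/punctuation.py`; the names `NESTED`, `FIXED`, and `time_match` are introduced here for illustration only.

```python
import re
import time

# Hypothetical, simplified stand-ins for the tagger's pattern; the real
# pattern in python/dolma/taggers/punctuation.py is not reproduced here.
EMOJI = "\U0001f300-\U0001f64f"
NESTED = re.compile(rf"^(\s|[{EMOJI}]+)+$")  # inner + inside the outer ()+
FIXED = re.compile(rf"^(\s|[{EMOJI}])+$")    # the outer ()+ alone repeats


def time_match(pattern, text):
    """Return (matched, seconds) for a single match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


# Both patterns accept and reject the same strings.
samples = ["\U0001f600" * 5, "\U0001f600 \U0001f601", "\U0001f600a", "abc"]
agree = all(
    (NESTED.match(s) is None) == (FIXED.match(s) is None) for s in samples
)

# A short emoji run followed by a letter forces a failed match. The nested
# form then has to try every way of splitting the run between the inner and
# outer quantifier; the fixed form has only one way to consume it.
bad_input = "\U0001f600" * 18 + "a"
nested_result = time_match(NESTED, bad_input)
fixed_result = time_match(FIXED, bad_input)
print("nested:", nested_result)
print("fixed: ", fixed_result)
```

The run is kept to 18 characters on purpose: each extra character roughly doubles the work for the nested form, so much longer inputs may not finish in practical time.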

Fixes #123

Test plan

  • Run existing tests for the punctuation tagger
  • Verify the tagger completes in reasonable time on inputs with long emoji sequences (e.g., 1000+ consecutive emoji characters)
  • Confirm the tagger still correctly identifies paragraphs that are entirely punctuation/emoji
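
The second and third checks could be approximated along these lines. This is only a sketch against a hypothetical stand-in pattern (`FIXED`), not the tagger itself; the helper name `check_long_emoji_run` and the one-second budget are assumptions chosen for illustration.

```python
import re
import time

# Hypothetical stand-in for the fixed pattern; swap in the tagger's real
# compiled pattern when adapting this check.
FIXED = re.compile("^(\\s|[\U0001f300-\U0001f64f])+$")


def check_long_emoji_run(pattern, length=1000, budget_seconds=1.0):
    """Match `length` emoji, with and without a trailing letter, under a time budget."""
    run = "\U0001f600" * length
    start = time.perf_counter()
    all_emoji = pattern.match(run) is not None           # expected: accepted
    with_letter = pattern.match(run + "a") is not None   # expected: rejected
    elapsed = time.perf_counter() - start
    return all_emoji, with_letter, elapsed < budget_seconds


result = check_long_emoji_run(FIXED)
print(result)
```

The input with a trailing letter matters most here, because a failed match is the case that triggers the backtracking described in the summary.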

🤖 Generated with Claude Code

Remove redundant `+` quantifier from the emoji character class in the
regex pattern. The outer `()+` group already handles repetition, so the
inner `]+` creates nested quantifiers that cause exponential backtracking
on long emoji sequences.

Fixes allenai#123

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

not_alphanum_paragraph_v1 tagger takes forever to run on certain inputs.