fix: prevent MarkdownHeaderLevelInferrer regex from consuming newlines #481
Open
JSap0914 wants to merge 1 commit into
The header pattern used `\s+` between the `#` characters and the heading text, which matches any whitespace including newlines. When a header line contained *only* trailing whitespace (e.g. `## `) the regex engine could consume the trailing space *and* the following newline in one `\s+` match, then lazily expand `(.+?)` over the next header line. This caused a whitespace-only pseudo-header and the following real header to be treated as a single match, rewriting `## \n## Section\nContent` to the invalid `# ## Section\nContent` (a hash character embedded inside the heading text). Fix: replace `\s+` / `(?:\s*)` with `[ \t]+` / `(?:[ \t]*)` so that the pattern is restricted to horizontal whitespace only and cannot span line boundaries. All 14 existing tests continue to pass; two new regression tests cover the fixed cases.
Related Issues
No existing issue — discovered during code review.
Proposed Changes
The header-matching regex in `MarkdownHeaderLevelInferrer` used `\s+` between the `#` characters and the heading text. `\s` matches any whitespace, including newline characters. When a header line contained only trailing whitespace (e.g. `## `), the regex engine could consume the trailing space and the following newline in one `\s+` match, then lazily extend `(.+?)` over the content of the next header line. This caused a whitespace-only pseudo-header and the following real header to be matched as a single span, rewriting `## \n## Section\nContent` to the invalid output `# ## Section\nContent` (a literal `#` character embedded inside the heading text).

Fix: replace `\s+` / `(?:\s*)` with `[ \t]+` / `(?:[ \t]*)` so the pattern is restricted to horizontal whitespace and cannot span line boundaries.
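The behavior described above can be reproduced with a small standalone sketch. The two patterns below are simplified, hypothetical stand-ins built from the fragments quoted in this description (`\s+`, `(.+?)`, `(?:\s*)`). They are not the component's exact regex, and rewriting every match to a level-1 header is an assumption made only to mirror the example output.

```python
import re

# Hypothetical, simplified patterns assembled from the fragments quoted in
# the PR description; the real regex in MarkdownHeaderLevelInferrer may differ.
OLD_PATTERN = re.compile(r"^(#{1,6})\s+(.+?)(?:\s*)$", re.MULTILINE)
NEW_PATTERN = re.compile(r"^(#{1,6})[ \t]+(.+?)(?:[ \t]*)$", re.MULTILINE)

text = "## \n## Section\nContent"

# Collect (hashes, heading text) for every header each pattern finds.
old_matches = [(m.group(1), m.group(2)) for m in OLD_PATTERN.finditer(text)]
new_matches = [(m.group(1), m.group(2)) for m in NEW_PATTERN.finditer(text)]


def to_level_one(match: re.Match) -> str:
    """Illustrative rewrite only: emit every matched header at level 1."""
    return "# " + match.group(2)


old_rewritten = OLD_PATTERN.sub(to_level_one, text)
new_rewritten = NEW_PATTERN.sub(to_level_one, text)

print(old_matches)          # [('##', '## Section')]
print(new_matches)          # [('##', 'Section')]
print(repr(old_rewritten))  # '# ## Section\nContent'
print(repr(new_rewritten))  # '## \n# Section\nContent'
```

With the old pattern, `\s+` swallows both the trailing space and the newline, so the captured heading text starts with the next line's `##`. With `[ \t]+`, the whitespace-only line cannot match at all, because `.` does not match a newline, and the real header is matched on its own.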
How did you test it?
`test/components/preprocessors/test_markdown_header_level_inferrer.py`.
Notes for the reviewer
The change is confined to a single compiled regex literal in `__init__`; no surrounding logic changes.
Checklist
fix: