fix: prevent MarkdownHeaderLevelInferrer regex from consuming newlines #481

Open
JSap0914 wants to merge 1 commit into deepset-ai:main from JSap0914:fix/md-inferrer-skipped-header-level

@JSap0914

Related Issues

No existing issue — discovered during code review.

Proposed Changes

The header-matching regex in `MarkdownHeaderLevelInferrer` used `\s+` between the
`#` characters and the heading text. `\s` matches any whitespace, including
newline characters. When a header line contained only trailing whitespace (e.g.
`## `), the regex engine could consume the trailing space and the following
newline in one `\s+` match, then lazily extend `(.+?)` over the content of the
next header line. This caused a whitespace-only pseudo-header and the following
real header to be matched as a single span, rewriting:

`## \n## Section\nContent`

to the invalid output:

`# ## Section\nContent`

(a literal `#` character embedded inside the heading text).

Fix: replace `\s+` / `(?:\s*)` with `[ \t]+` / `(?:[ \t]*)` so the
pattern is restricted to horizontal whitespace and cannot span line boundaries.
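
For reviewers who want to reproduce the behavior without the component, here is a minimal standalone sketch using only the standard `re` module. The two pattern literals are assumptions reconstructed from the description above (the multiline flag and the `#{1,6}` prefix are guesses, not copied from the source), and `rewrite_as_level_one` is a hypothetical stand-in for the component's rewrite step, not its actual logic.

```python
import re

# Assumed pattern shapes, reconstructed from the PR description.
# The real literal lives in MarkdownHeaderLevelInferrer.__init__ and may differ.
OLD_PATTERN = re.compile(r"(?m)^(#{1,6})\s+(.+?)(?:\s*)$")
NEW_PATTERN = re.compile(r"(?m)^(#{1,6})[ \t]+(.+?)(?:[ \t]*)$")


def find_headers(pattern, text):
    """Return (hashes, heading_text) pairs for every header match."""
    return [(m.group(1), m.group(2)) for m in pattern.finditer(text)]


def rewrite_as_level_one(pattern, text):
    """Hypothetical simplification: rewrite every matched header to level 1."""
    return pattern.sub(lambda m: "# " + m.group(2), text)


text = "## \n## Section\nContent"

# With \s+, the space and the newline after the first "##" are consumed
# together, so the second header line ends up inside the captured text.
old_matches = find_headers(OLD_PATTERN, text)
old_output = rewrite_as_level_one(OLD_PATTERN, text)

# With [ \t]+, the whitespace-only line cannot match at all, and the real
# header on the second line is matched on its own.
new_matches = find_headers(NEW_PATTERN, text)
new_output = rewrite_as_level_one(NEW_PATTERN, text)

print(old_matches, repr(old_output))
print(new_matches, repr(new_output))
```

Under these assumptions the old pattern captures `## Section` as the heading text, which is what yields the embedded `#` in the rewritten output, while the new pattern captures only `Section` and leaves the whitespace-only line untouched.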

How did you test it?

  • Two new regression tests added to test/components/preprocessors/test_markdown_header_level_inferrer.py.
  • All 14 pre-existing tests continue to pass.
  • Manual verification:
    python -m pytest test/components/preprocessors/test_markdown_header_level_inferrer.py -v
    # 16 passed
    

Notes for the reviewer

The change is confined to a single compiled regex literal in `__init__`; no
surrounding logic changes.

Checklist

This PR was fully generated with an AI assistant. I have reviewed the changes and run the relevant tests.

The header pattern used `\s+` between the `#` characters and the heading
text, which matches any whitespace including newlines.  When a header line
contained *only* trailing whitespace (e.g. `## `) the regex engine could
consume the trailing space *and* the following newline in one `\s+` match,
then lazily expand `(.+?)` over the next header line.  This caused a
whitespace-only pseudo-header and the following real header to be treated as
a single match, rewriting `## \n## Section\nContent` to the invalid
`# ## Section\nContent` (a hash character embedded inside the heading text).

Fix: replace `\s+` / `(?:\s*)` with `[ \t]+` / `(?:[ \t]*)` so that
the pattern is restricted to horizontal whitespace only and cannot span line
boundaries.  All 14 existing tests continue to pass; two new regression tests
cover the fixed cases.
@JSap0914 JSap0914 requested a review from a team as a code owner June 16, 2026 09:49
@JSap0914 JSap0914 requested review from anakin87 and Copilot and removed request for a team June 16, 2026 09:49

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.
