fix: prevent MarkdownHeaderLevelInferrer regex from consuming newlines by JSap0914 · Pull Request #481 · deepset-ai/haystack-experimental

JSap0914 · 2026-06-16T09:49:23Z

Related Issues

No existing issue — discovered during code review.

Proposed Changes

The header-matching regex in `MarkdownHeaderLevelInferrer` used `\s+` between the
`#` characters and the heading text. `\s` matches any whitespace, including
newline characters. When a header line contained only trailing whitespace (e.g.
`## `), the regex engine could consume the trailing space and the following
newline in one `\s+` match, then lazily extend `(.+?)` over the content of the
next header line. This caused a whitespace-only pseudo-header and the following
real header to be matched as a single span, rewriting:

## \n## Section\nContent

to the invalid output:

# ## Section\nContent

(a literal `#` character embedded inside the heading text).

Fix: replace `\s+` / `(?:\s*)` with `[ \t]+` / `(?:[ \t]*)` so the
pattern is restricted to horizontal whitespace and cannot span line boundaries.
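
The failure mode can be reproduced outside the component with a small script. This is a minimal sketch: `BUGGY` and `FIXED` are hypothetical stand-ins that only approximate the pattern described above (the real literal in `MarkdownHeaderLevelInferrer` may differ), and `force_level_one` is an illustrative helper that replaces the component's actual level-inference logic with "rewrite every match as a level-1 header".

```python
import re

# Hypothetical stand-ins for the pattern before and after the fix; the real
# literal in MarkdownHeaderLevelInferrer may differ in its details.
BUGGY = re.compile(r"^(#{1,6})\s+(.+?)(?:\s*)$", re.MULTILINE)
FIXED = re.compile(r"^(#{1,6})[ \t]+(.+?)(?:[ \t]*)$", re.MULTILINE)

text = "## \n## Section\nContent"


def force_level_one(pattern, source):
    # Simplified rewrite step (an assumption, not the component's logic):
    # emit every matched header as a level-1 header.
    return pattern.sub(lambda m: "# " + m.group(2), source)


buggy_matches = BUGGY.findall(text)
fixed_matches = FIXED.findall(text)
buggy_output = force_level_one(BUGGY, text)
fixed_output = force_level_one(FIXED, text)

print(buggy_matches)       # [('##', '## Section')]
print(fixed_matches)       # [('##', 'Section')]
print(repr(buggy_output))  # '# ## Section\nContent'
print(repr(fixed_output))  # '## \n# Section\nContent'
```

With the `\s+` variant, the separator swallows the space and the newline, so the captured heading text begins with the next line's `##`. With `[ \t]+`, the whitespace-only line no longer matches at all, because `.` cannot match the newline that follows the space, and the real header is matched on its own line.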

How did you test it?

Two new regression tests added to `test/components/preprocessors/test_markdown_header_level_inferrer.py`.
All 14 pre-existing tests continue to pass.
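
For illustration only, regression tests of this kind might look roughly like the sketch below. It does not reproduce the tests in this PR: `HEADER` is a hypothetical stand-in for the fixed pattern, `find_headers` is an invented helper, and the test names and inputs are assumptions chosen to exercise the line-boundary behaviour described above.

```python
import re

# Hypothetical stand-in for the fixed header pattern (not the library's literal).
HEADER = re.compile(r"^(#{1,6})[ \t]+(.+?)(?:[ \t]*)$", re.MULTILINE)


def find_headers(text):
    """Return (level, title) pairs for every header the pattern matches."""
    return [(len(m.group(1)), m.group(2)) for m in HEADER.finditer(text)]


def test_whitespace_only_header_is_not_merged_with_next_line():
    # The whitespace-only "## " line must not absorb the following header.
    assert find_headers("## \n## Section\nContent") == [(2, "Section")]


def test_trailing_horizontal_whitespace_is_excluded_from_title():
    # Trailing spaces and tabs are dropped; the match stops at the line end.
    assert find_headers("# Title \t\n\nBody") == [(1, "Title")]
```

Both functions follow pytest's `test_` naming convention, so a file containing them would be collected by the `pytest` command without extra configuration.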

Manual verification:

python -m pytest test/components/preprocessors/test_markdown_header_level_inferrer.py -v
# 16 passed

Notes for the reviewer

The change is confined to a single compiled regex literal in `__init__`; no
surrounding logic changes.

Checklist

I have read the contributors guidelines and the code of conduct
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:
I documented my code
I ran pre-commit hooks and fixed any issue

This PR was fully generated with an AI assistant. I have reviewed the changes and run the relevant tests.

The header pattern used `\s+` between the `#` characters and the heading text, which matches any whitespace including newlines. When a header line contained *only* trailing whitespace (e.g. `## `) the regex engine could consume the trailing space *and* the following newline in one `\s+` match, then lazily expand `(.+?)` over the next header line. This caused a whitespace-only pseudo-header and the following real header to be treated as a single match, rewriting `## \n## Section\nContent` to the invalid `# ## Section\nContent` (a hash character embedded inside the heading text). Fix: replace `\s+` / `(?:\s*)` with `[ \t]+` / `(?:[ \t]*)` so that the pattern is restricted to horizontal whitespace only and cannot span line boundaries. All 14 existing tests continue to pass; two new regression tests cover the fixed cases.

Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

JSap0914 requested a review from a team as a code owner June 16, 2026 09:49

JSap0914 requested review from anakin87 and Copilot and removed request for a team June 16, 2026 09:49

Copilot AI reviewed Jun 16, 2026

JSap0914 wants to merge 1 commit into deepset-ai:main from JSap0914:fix/md-inferrer-skipped-header-level
