
Fix: MinerU parsed images are obscured when using MinerU as a parser.… #14219

Open
web-dev0521 wants to merge 14 commits into infiniflow:main from web-dev0521:fix/mineru-image-obscured-14197

Conversation

@web-dev0521
Contributor

What problem does this PR solve?

When using MinerU as a document parser, parsed images appeared obscured (darkened) in the RAGFlow UI. The root cause was in the crop method of deepdoc/parser/mineru_parser.py: a semi-transparent black overlay (intended only for surrounding context strips) was incorrectly applied to actual content images when the top context strip crop was skipped due to zero-height regions.

The fix tracks the is_context flag alongside each crop at generation time (imgs_with_flags), ensuring the overlay is only applied to context strips — never to the actual parsed content.
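
The failure mode described above can be reproduced without any imaging library. The following is a minimal sketch, not the code from `mineru_parser.py`: strings stand in for cropped images, a height of 0 models a skipped context strip, and the helper names (`build_crops`, `darkened_by_position`, `darkened_by_flag`) are hypothetical.

```python
# Hypothetical stand-ins: strings play the role of PIL images, and a
# height of 0 models a context strip that the crop loop skips.

def build_crops(segments):
    """Return (crop, is_context) pairs, skipping zero-height segments."""
    last = len(segments) - 1
    crops = []
    for i, (name, height) in enumerate(segments):
        if height <= 0:
            continue  # nothing to crop, e.g. content starts at the page top
        # The flag is decided from the segment index, not the output index.
        crops.append((name, i == 0 or i == last))
    return crops


def darkened_by_position(crops):
    """Old behaviour: darken the first and last surviving crops."""
    n = len(crops)
    return [name for i, (name, _) in enumerate(crops) if i == 0 or i + 1 == n]


def darkened_by_flag(crops):
    """Fixed behaviour: darken only crops flagged as context."""
    return [name for name, is_context in crops if is_context]


segments = [("top_context", 0), ("content", 120), ("bottom_context", 40)]
crops = build_crops(segments)
print(darkened_by_position(crops))  # ['content', 'bottom_context']
print(darkened_by_flag(crops))      # ['bottom_context']
```

Because the top strip is skipped, the content crop lands at output index 0 and the positional check darkens it, while the flag-based check leaves it untouched.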

Fixes #14197

Type of change

  • Bug Fix (non-breaking change which fixes an issue)

@dosubot dosubot Bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Apr 20, 2026
@dosubot dosubot Bot added the 🐞 bug Something isn't working, pull request that fix bug. label Apr 20, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

📝 Walkthrough


The MinerU parser's crop() now tracks head and tail context boundaries (head_ctx_end, tail_ctx_start) during the cropping loop and uses these ranges to decide overlay/context masking instead of relying on positional checks against the final imgs list.

Changes

| Cohort / File(s) | Summary |
|---|---|
| MinerU parser crop logic: deepdoc/parser/mineru_parser.py | Replaced positional overlay checks with explicit context boundary tracking: added head_ctx_end and tail_ctx_start, initialized and updated them per-segment inside the cropping loop, and switched overlay/context masking from ii == 0 or ii + 1 == len(imgs) to ii < head_ctx_end or ii >= tail_ctx_start. Minor loop initialization/boundary updates included. |
| Unit tests for crop behavior: test/unit_test/deepdoc/parser/test_mineru_crop.py | New test module exercising MinerUParser.crop() overlay/darkening behavior across top/bottom context cases, single-image and multi-page scenarios; includes helpers for synthetic images and center-pixel assertions to prevent regressions. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰
I hopped the loop and marked each end,
Head and tail now meet my friend.
No more shadows on the core—
Bright red center, ask for more. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies a bug fix for MinerU parser images appearing obscured, accurately reflecting the main purpose of the changeset. |
| Description check | ✅ Passed | The description includes the required template sections: problem statement with root cause, fix approach, issue reference, and type of change selection. |
| Linked Issues check | ✅ Passed | The PR directly addresses issue #14197 by fixing the overlay logic in crop() to apply darkening only to context strips, preventing content images from being obscured. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to fixing the reported issue: core fix in mineru_parser.py and comprehensive unit tests for regression coverage. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
deepdoc/parser/mineru_parser.py (1)

423-450: ⚠️ Potential issue | 🟡 Minor

Add logging for the new context/content branch behavior.

This introduces a new rendering flow (is_context) but doesn’t add traceability. A debug log here would make future visual regressions much easier to diagnose.

Proposed change
         for ii, (pns, left, right, top, bottom) in enumerate(poss):
             is_context = ii == 0 or ii == poss_last_idx
+            self.logger.debug(
+                "[MinerU] crop segment idx=%s pages=%s is_context=%s",
+                ii, pns, is_context
+            )
             right = left + max_width

As per coding guidelines, **/*.py: Add logging for new flows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepdoc/parser/mineru_parser.py` around lines 423 - 450, Add a debug log
right before appending the cropped image to imgs_with_flags in the MinerU
parsing flow to record the new context/content branch; log the base page index
pns[0], the computed crop coordinates (x0,y0,x1,y1), the is_context boolean, and
the page_count (use self.page_images and self.logger). Place this
self.logger.debug call immediately before the line with
imgs_with_flags.append((crop0, is_context)) in mineru_parser.py so each crop
produced by the new branch is traceable for visual regression debugging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4a081063-5a9d-4a56-a47d-e17abc9347e4

📥 Commits

Reviewing files that changed from the base of the PR and between 84b6069 and 7e13446.

📒 Files selected for processing (1)
  • deepdoc/parser/mineru_parser.py

@web-dev0521
Contributor Author

Hi @yingfeng,
This is my first contribution to the infiniflow/ragflow repository. I’ve put significant effort into this PR and aimed to follow the project’s conventions and standards.
I would appreciate your review whenever you have time, and I welcome any feedback or suggestions for improvement.
Thank you.

@yingfeng yingfeng added the ci Continue Integration label Apr 20, 2026
@yingfeng yingfeng marked this pull request as draft April 20, 2026 04:17
@yingfeng yingfeng marked this pull request as ready for review April 20, 2026 04:17
@yingfeng yingfeng requested a review from Copilot April 20, 2026 04:19
Contributor

Copilot AI left a comment


Pull request overview

Fixes MinerU image crops being darkened/obscured in the UI by ensuring the semi-transparent overlay is applied only to “context strip” crops, not to actual content crops, even when some context crops are skipped.

Changes:

  • Track (cropped_image, is_context) tuples during crop generation to decouple overlay logic from the final concatenated image index.
  • Apply the overlay based on is_context rather than “first/last image in the output list”.
  • Add debug logging for crop coordinates and context flag.

Comment thread deepdoc/parser/mineru_parser.py Outdated
Comment on lines +418 to +421
positions = []
# Each entry is (PIL.Image, is_context) so overlay logic is decoupled from final index.
imgs_with_flags = []
poss_last_idx = len(poss) - 1
Contributor Author


Good catch — thank you for pointing that out.

The imgs = [] line has already been removed in the latest commit. The unused variable was identified and cleaned up to address the Ruff F841 warning.

CI should now pass cleanly, but please let me know if anything else needs attention.

@codecov

codecov Bot commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.66%. Comparing base (9399336) to head (c101966).

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #14219   +/-   ##
=======================================
  Coverage   96.66%   96.66%           
=======================================
  Files          10       10           
  Lines         690      690           
  Branches      108      108           
=======================================
  Hits          667      667           
  Misses          8        8           
  Partials       15       15           


@web-dev0521
Contributor Author

Hi @yingfeng ,
Hope you had a great weekend.
I’ve addressed all the issues raised. Thank you very much for your time and valuable feedback—I truly appreciate it.
Please let me know if there’s anything else I should update.

@yingfeng yingfeng requested a review from xugangqiang April 20, 2026 04:58
@web-dev0521
Contributor Author

Hi @yingfeng, thank you for your attention and guidance.
Hi @xugangqiang, could you please take a moment to review my PR when convenient? I would greatly appreciate your feedback.
Thank you both for your time.

@6ba3i 6ba3i assigned 6ba3i and unassigned 6ba3i Apr 20, 2026
@6ba3i 6ba3i self-requested a review April 20, 2026 05:48
Contributor

@6ba3i 6ba3i left a comment


Thanks for the fix here. I think this PR is addressing the right bug in the right place.
That said, I think this can be made smaller.

The current PR keeps extra state for every cropped image with imgs_with_flags, and also adds debug logging and some helper/comment churn. For this specific bug, I do not think we need all of that. A smaller fix is enough:

  • keep imgs as plain images
  • record where the leading synthetic context segment ends
  • record where the trailing synthetic context segment begins
  • use those two boundaries in the final overlay loop instead of ii == 0 / ii + 1 == len(imgs)

Suggested change:

diff --git a/deepdoc/parser/mineru_parser.py b/deepdoc/parser/mineru_parser.py
@@
-        positions = []
+        positions = []
+        head_ctx_end = 0
+        tail_ctx_start = 0
         for ii, (pns, left, right, top, bottom) in enumerate(poss):
+            if ii + 1 == len(poss):
+                tail_ctx_start = len(imgs)
             right = left + max_width
@@
                 cimgp = page.crop((x0, y0, x1, y1))
                 imgs.append(cimgp)
                 if 0 < ii < len(poss) - 1:
                     positions.append((pn + self.page_from, x0, x1, y0, y1))
                 bottom -= page.size[1]
+            if ii == 0:
+                head_ctx_end = len(imgs)
@@
         height = 0
         for ii, img in enumerate(imgs):
-            if ii == 0 or ii + 1 == len(imgs):
+            if ii < head_ctx_end or ii >= tail_ctx_start:
                 img = img.convert("RGBA")
                 overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
                 overlay.putalpha(128)

It fixes the same regression with less churn and a smaller code change. Could you give it a try? Thanks again!
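
The boundary-index idea in the suggested diff can be illustrated in isolation. This is a sketch under stated assumptions rather than the parser's code: each segment is modeled as the list of crops it produces (possibly empty when a strip is skipped, possibly several for a multi-page segment), the first and last segments are treated as context, and `context_indices` is a hypothetical helper name.

```python
def context_indices(crops_per_segment):
    """Return (imgs, overlay_indices) for a list of per-segment crop lists.

    The first and last segments are context strips. Boundaries are recorded
    while crops are appended, so skipped (empty) segments cannot shift the
    overlay onto content crops.
    """
    imgs = []
    head_ctx_end = 0    # crops before this index come from the leading context
    tail_ctx_start = 0  # crops from this index on come from the trailing context
    last = len(crops_per_segment) - 1
    for ii, crops in enumerate(crops_per_segment):
        if ii == last:
            tail_ctx_start = len(imgs)
        imgs.extend(crops)
        if ii == 0:
            head_ctx_end = len(imgs)
    overlay = [
        ii for ii in range(len(imgs))
        if ii < head_ctx_end or ii >= tail_ctx_start
    ]
    return imgs, overlay


# Top strip skipped: only the bottom strip (index 1) is darkened.
print(context_indices([[], ["content"], ["bottom"]])[1])        # [1]
# Both strips skipped: the single content crop is left alone.
print(context_indices([[], ["content"], []])[1])                # []
# Multi-page content between two strips: indices 0 and 3 are darkened.
print(context_indices([["top"], ["c1", "c2"], ["bottom"]])[1])  # [0, 3]
```

Compared with carrying a flag per crop, this keeps `imgs` as a plain list and adds only two integers of state, which is the trade-off the review comment argues for.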

@web-dev0521
Contributor Author

Hi @6ba3i,
Thank you for your attention.
I'll update it as soon as possible.

@xugangqiang
Contributor

@web-dev0521
Has this been properly tested, for example with a regression test?
If so, please upload the test evidence so that we can confirm the change is tested and will not cause regressions:

  1. Screenshots of the UI before the fix
  2. Screenshots, logs, or similar evidence after the fix

We appreciate your contribution and look forward to more.

@web-dev0521
Contributor Author

Hi @6ba3i,
I’ve updated the code based on your recommendations. I would appreciate it if you could take a moment to review the changes when convenient.
Thank you for your time and guidance.

@web-dev0521 web-dev0521 requested a review from 6ba3i April 20, 2026 06:16
Contributor

@6ba3i 6ba3i left a comment


The code looks good to me, @web-dev0521! Would you mind replying to @xugangqiang's requests so that I can formally approve this PR?

@web-dev0521
Contributor Author

Hi @xugangqiang,
Thank you for the thorough review and for the helpful suggestion regarding the boundary-index approach - I have incorporated it as recommended.

To ensure accuracy and provide proper regression coverage, I’ve added unit tests in test/unit_test/deepdoc/parser/test_mineru_crop.py that reproduce the reported edge cases.

The tests cover:

  • Content image at top = 0 (zero-height top context strip, which is skipped)
  • Content image within the GAP of the page top
  • The core scenario identified in the discussion where both context strips are skipped simultaneously (len(imgs) == 1)
  • Verification that context strips are still correctly darkened (ensuring no regression)

All four tests are passing and confirm that the fix behaves as expected. These tests should also provide reliable regression coverage in CI to prevent similar issues in the future.
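
For readers wondering how a pixel check can tell a darkened strip from untouched content, here is a worked arithmetic sketch. It assumes the overlay is a black layer with alpha 128 blended with the standard "over" operator onto an opaque pixel, and that the synthetic test image is pure red; `composite_over` is a hypothetical helper, not a function from the test module or from Pillow.

```python
def composite_over(dst, src_rgb, src_alpha):
    """Blend an overlay colour onto an opaque pixel with the 'over' operator.

    dst and src_rgb are (r, g, b) tuples in 0..255; src_alpha is 0..255.
    """
    a = src_alpha / 255
    return tuple(round(s * a + d * (1 - a)) for s, d in zip(src_rgb, dst))


red = (255, 0, 0)

# A context strip: black overlay at alpha 128 roughly halves the red channel.
darkened = composite_over(red, (0, 0, 0), 128)
print(darkened)  # (127, 0, 0)

# A content crop: no overlay applied, so the pixel stays fully red.
untouched = composite_over(red, (0, 0, 0), 0)
print(untouched)  # (255, 0, 0)
```

Under these assumptions a threshold such as `r < 200` separates the two cases with a wide margin: a darkened red pixel sits near 127, while an untouched one stays at 255.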

Thank you again for your guidance and support. I look forward to your feedback.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
test/unit_test/deepdoc/parser/test_mineru_crop.py (1)

78-91: Make the “both context strips” assertion match the test intent.

The docstring says both top and bottom context strips are darkened, but Lines 88-90 only verify the top strip. Add a bottom-strip pixel check to fully lock the behavior.

Suggested assertion extension
     def test_context_strips_are_darkened(self):
         """Context strips above and below content must receive the overlay."""
@@
         # Top-most pixel row should be from the darkened top context strip
         r, g, b = result.getpixel((result.size[0] // 2, 0))
         assert r < 200, f"Top context strip not darkened (r={r})"
+        # Bottom-most pixel row should be from the darkened bottom context strip
+        bottom_r, _, _ = result.getpixel((result.size[0] // 2, result.size[1] - 1))
+        assert bottom_r < 200, f"Bottom context strip not darkened (r={bottom_r})"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/unit_test/deepdoc/parser/test_mineru_crop.py` around lines 78 - 91, The
test test_context_strips_are_darkened only checks the top context strip; update
it to also verify the bottom strip is darkened by sampling a pixel from the
bottom-most row of the cropped result (use result.getpixel((result.size[0] // 2,
result.size[1] - 1))) and assert its red channel is below the same threshold
(e.g., r < 200), keeping the existing top-strip check and using parser.crop and
result.getpixel to locate the pixels.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/unit_test/deepdoc/parser/test_mineru_crop.py`:
- Around line 23-27: This test module lacks pytest priority markers; import
pytest at the top and annotate the pytest test functions in this file (the
test_* functions defined around line ~47 in test_mineru_crop.py) with the
appropriate priority markers such as `@pytest.mark.p1` / `@pytest.mark.p2` /
`@pytest.mark.p3` according to the test severity; ensure the import of pytest is
added and each test function (e.g., the test functions in MinerUParser-related
tests) gets the correct `@pytest.mark`.<priority> decorator so the repository’s
test-level categorization is enforced.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4c3177be-37d8-4fb7-90b7-075b8ce517a1

📥 Commits

Reviewing files that changed from the base of the PR and between 937a76c and b6eb172.

📒 Files selected for processing (1)
  • test/unit_test/deepdoc/parser/test_mineru_crop.py

Comment thread test/unit_test/deepdoc/parser/test_mineru_crop.py
@web-dev0521
Contributor Author

Hello, @yingfeng!
I have completed all the actions regarding the points you pointed out.
Thank you so much for taking the time to provide feedback—and thanks to everyone else who offered their valuable input as well. I am truly grateful.
Please feel free to let me know anytime if there are any further updates I should make.

@yingfeng
Member

Hello, @yingfeng! I have completed all the actions regarding the points you pointed out. Thank you so much for taking the time to provide feedback—and thanks to everyone else who offered their valuable input as well. I am truly grateful. Please feel free to let me know anytime if there are any further updates I should make.

Can you reply to @xugangqiang's requests?


Labels

🐞 bug Something isn't working, pull request that fix bug. ci Continue Integration size:S This PR changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Why are the parsed images obscured when I use MinerU as a parser?

5 participants