Fix: MinerU parsed images are obscured when using MinerU as a parser.… by web-dev0521 · Pull Request #14219 · infiniflow/ragflow

web-dev0521 · 2026-04-20T02:23:10Z

What problem does this PR solve?

When using MinerU as a document parser, parsed images appeared obscured (darkened) in the RAGFlow UI. The root cause was in the crop method of deepdoc/parser/mineru_parser.py: a semi-transparent black overlay (intended only for surrounding context strips) was incorrectly applied to actual content images when the top context strip crop was skipped due to zero-height regions.

The fix tracks the is_context flag alongside each crop at generation time (imgs_with_flags), ensuring the overlay is only applied to context strips — never to the actual parsed content.
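
To illustrate the failure mode described above, here is a minimal sketch; it is not the actual `crop` code. The helper names and the string placeholders standing in for cropped images are hypothetical. Only the first/last positional rule and the `is_context` flag come from this PR.

```python
def overlay_indices_positional(crops):
    """Old rule: darken whatever ends up first and last in the final list."""
    n = len(crops)
    return [i for i in range(n) if i == 0 or i + 1 == n]


def overlay_indices_flagged(crops_with_flags):
    """Flag-based rule: darken only entries recorded as context strips."""
    return [i for i, (_, is_context) in enumerate(crops_with_flags) if is_context]


# Assumed scenario: the top context strip had zero height and was skipped,
# so the content crop is now the first entry in the list.
crops_with_flags = [("content", False), ("bottom_ctx", True)]
crops = [name for name, _ in crops_with_flags]

print(overlay_indices_positional(crops))          # [0, 1] -> content is darkened too
print(overlay_indices_flagged(crops_with_flags))  # [1]    -> only the context strip
```

With the positional rule, index 0 is always treated as context, which is exactly what goes wrong once the real top strip is missing.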

Fixes #14197

Type of change

Bug Fix (non-breaking change which fixes an issue)

coderabbitai · 2026-04-20T02:23:30Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

The MinerU parser's crop() now tracks head and tail context boundaries (head_ctx_end, tail_ctx_start) during the cropping loop and uses these ranges to decide overlay/context masking instead of relying on positional checks against the final imgs list.

Changes

Cohort / File(s)	Summary
MinerU parser crop logic `deepdoc/parser/mineru_parser.py`	Replaced positional overlay checks with explicit context boundary tracking: added `head_ctx_end` and `tail_ctx_start`, initialized and updated them per-segment inside the cropping loop, and switched overlay/context masking from `ii == 0 or ii + 1 == len(imgs)` to `ii < head_ctx_end or ii >= tail_ctx_start`. Minor loop initialization/boundary updates included.
Unit tests for crop behavior `test/unit_test/deepdoc/parser/test_mineru_crop.py`	New test module exercising `MinerUParser.crop()` overlay/darkening behavior across top/bottom context cases, single-image and multi-page scenarios; includes helpers for synthetic images and center-pixel assertions to prevent regressions.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰
I hopped the loop and marked each end,
Head and tail now meet my friend.
No more shadows on the core—
Bright red center, ask for more. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly identifies a bug fix for MinerU parser images appearing obscured, accurately reflecting the main purpose of the changeset.
Description check	✅ Passed	The description includes the required template sections: problem statement with root cause, fix approach, issue reference, and type of change selection.
Linked Issues check	✅ Passed	The PR directly addresses issue `#14197` by fixing the overlay logic in crop() to apply darkening only to context strips, preventing content images from being obscured.
Out of Scope Changes check	✅ Passed	All changes are scoped to fixing the reported issue: core fix in mineru_parser.py and comprehensive unit tests for regression coverage.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

deepdoc/parser/mineru_parser.py (1)

423-450: ⚠️ Potential issue | 🟡 Minor

Add logging for the new context/content branch behavior.

This introduces a new rendering flow (is_context) but doesn’t add traceability. A debug log here would make future visual regressions much easier to diagnose.

Proposed change

         for ii, (pns, left, right, top, bottom) in enumerate(poss):
             is_context = ii == 0 or ii == poss_last_idx
+            self.logger.debug(
+                "[MinerU] crop segment idx=%s pages=%s is_context=%s",
+                ii, pns, is_context
+            )
             right = left + max_width

As per coding guidelines, **/*.py: Add logging for new flows.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@deepdoc/parser/mineru_parser.py` around lines 423 - 450, Add a debug log
right before appending the cropped image to imgs_with_flags in the MinerU
parsing flow to record the new context/content branch; log the base page index
pns[0], the computed crop coordinates (x0,y0,x1,y1), the is_context boolean, and
the page_count (use self.page_images and self.logger). Place this
self.logger.debug call immediately before the line with
imgs_with_flags.append((crop0, is_context)) in mineru_parser.py so each crop
produced by the new branch is traceable for visual regression debugging.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4a081063-5a9d-4a56-a47d-e17abc9347e4

📥 Commits

Reviewing files that changed from the base of the PR and between 84b6069 and 7e13446.

📒 Files selected for processing (1)

deepdoc/parser/mineru_parser.py

web-dev0521 · 2026-04-20T02:30:55Z

Hi @yingfeng,
This is my first contribution to the infiniflow/ragflow repository. I’ve put significant effort into this PR and aimed to follow the project’s conventions and standards.
I would appreciate your review whenever you have time, and I welcome any feedback or suggestions for improvement.
Thank you.

Copilot

Pull request overview

Fixes MinerU image crops being darkened/obscured in the UI by ensuring the semi-transparent overlay is applied only to “context strip” crops, not to actual content crops, even when some context crops are skipped.

Changes:

Track (cropped_image, is_context) tuples during crop generation to decouple overlay logic from the final concatenated image index.
Apply the overlay based on is_context rather than “first/last image in the output list”.
Add debug logging for crop coordinates and context flag.

web-dev0521 · 2026-04-20T04:25:20Z

        positions = []
+        # Each entry is (PIL.Image, is_context) so overlay logic is decoupled from final index.
+        imgs_with_flags = []
+        poss_last_idx = len(poss) - 1


Good catch — thank you for pointing that out.

The imgs = [] line has already been removed in the latest commit. The unused variable was identified and cleaned up to address the Ruff F841 warning.

CI should now pass cleanly, but please let me know if anything else needs attention.

codecov · 2026-04-20T04:36:52Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.66%. Comparing base (9399336) to head (c101966).

Additional details and impacted files

@@           Coverage Diff           @@
##             main   #14219   +/-   ##
=======================================
  Coverage   96.66%   96.66%           
=======================================
  Files          10       10           
  Lines         690      690           
  Branches      108      108           
=======================================
  Hits          667      667           
  Misses          8        8           
  Partials       15       15

web-dev0521 · 2026-04-20T04:51:39Z

Hi @yingfeng ,
Hope you had a great weekend.
I’ve addressed all the issues raised. Thank you very much for your time and valuable feedback—I truly appreciate it.
Please let me know if there’s anything else I should update.

web-dev0521 · 2026-04-20T05:23:39Z

Hi @yingfeng, thank you for your attention and guidance.
Hi @xugangqiang, could you please take a moment to review my PR when convenient? I would greatly appreciate your feedback.
Thank you both for your time.

6ba3i

Thanks for the fix here. I think this PR is addressing the right bug in the right place.
That said, I think this can be made smaller.

The current PR keeps extra state for every cropped image with imgs_with_flags, and also adds debug logging and some helper/comment churn. For this specific bug, I do not think we need all of that. A smaller fix is enough:

keep imgs as plain images
record where the leading synthetic context segment ends
record where the trailing synthetic context segment begins
use those two boundaries in the final overlay loop instead of ii == 0 / ii + 1 == len(imgs)

Suggested change:

diff --git a/deepdoc/parser/mineru_parser.py b/deepdoc/parser/mineru_parser.py
@@
-        positions = []
+        positions = []
+        head_ctx_end = 0
+        tail_ctx_start = 0
         for ii, (pns, left, right, top, bottom) in enumerate(poss):
+            if ii + 1 == len(poss):
+                tail_ctx_start = len(imgs)
             right = left + max_width
@@
                 cimgp = page.crop((x0, y0, x1, y1))
                 imgs.append(cimgp)
                 if 0 < ii < len(poss) - 1:
                     positions.append((pn + self.page_from, x0, x1, y0, y1))
                 bottom -= page.size[1]
+            if ii == 0:
+                head_ctx_end = len(imgs)
@@
         height = 0
         for ii, img in enumerate(imgs):
-            if ii == 0 or ii + 1 == len(imgs):
+            if ii < head_ctx_end or ii >= tail_ctx_start:
                 img = img.convert("RGBA")
                 overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
                 overlay.putalpha(128)

It fixes the same regression with less churn and a smaller code change. Could you give it a try? Thanks again!
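
The boundary-index idea can be sketched in isolation, assuming each entry of `poss` yields zero or more crops and that the first and last entries are the synthetic context segments. The function name `context_mask` and its input format are hypothetical; `head_ctx_end` and `tail_ctx_start` follow the suggested diff.

```python
def context_mask(crops_per_segment):
    """Return one bool per produced crop: True if it belongs to a context segment.

    crops_per_segment[i] is how many crops segment i produced (0 if skipped).
    """
    imgs = []
    head_ctx_end = 0
    tail_ctx_start = 0
    last = len(crops_per_segment) - 1
    for ii, count in enumerate(crops_per_segment):
        if ii == last:
            # Everything appended from here on is trailing context.
            tail_ctx_start = len(imgs)
        imgs.extend([ii] * count)  # stand-in for appending cropped images
        if ii == 0:
            # Everything appended so far is leading context.
            head_ctx_end = len(imgs)
    return [i < head_ctx_end or i >= tail_ctx_start for i in range(len(imgs))]


print(context_mask([1, 2, 1]))  # [True, False, False, True]
print(context_mask([0, 1, 1]))  # [False, True]  (top strip skipped)
print(context_mask([0, 1, 0]))  # [False]        (both strips skipped)
```

Because the boundaries are recorded while the list is being built, a skipped context segment simply produces an empty range instead of shifting the overlay onto a content crop.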

web-dev0521 · 2026-04-20T05:57:17Z

Hi, @6ba3i !
Thank you for your attention.
I'll update it asap.

xugangqiang · 2026-04-20T06:07:49Z

@web-dev0521
May I know whether proper testing has been conducted, such as a regression test?
If so, please upload the testing evidence so that we know the change has been tested and will not cause regressions.

What did the UI look like before the fix (screenshots)?
What does it look like after the fix (screenshots, logs, etc.)?

We appreciate your contribution and look forward to more.

web-dev0521 · 2026-04-20T06:08:41Z

Hi @6ba3i,
I’ve updated the code based on your recommendations. I would appreciate it if you could take a moment to review the changes when convenient.
Thank you for your time and guidance.

6ba3i

The code looks good to me, @web-dev0521! Would you mind replying to @xugangqiang's requests so that I can formally approve this PR?

web-dev0521 · 2026-04-20T06:39:45Z

Hi, @xugangqiang !
Thank you for the thorough review and for the helpful suggestion regarding the boundary-index approach - I have incorporated it as recommended.

To ensure accuracy and provide proper regression coverage, I’ve added unit tests in test/unit_test/deepdoc/parser/test_mineru_crop.py that reproduce the reported edge cases.

The tests cover:

Content image at top = 0 (zero-height top context strip, which is skipped)
Content image within the GAP of the page top
The core scenario identified in the discussion where both context strips are skipped simultaneously (len(imgs) == 1)
Verification that context strips are still correctly darkened (ensuring no regression)

All four tests are passing and confirm that the fix behaves as expected. These tests should also provide reliable regression coverage in CI to prevent similar issues in the future.
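
As a rough illustration of what such pixel-level checks verify, the sketch below models the overlay as blending each pixel with black at alpha 128 (the `putalpha(128)` value from the diff). It is not the actual test module: `darken`, `render`, the solid red strips, and the brightness threshold of 200 are assumptions made for this example, and the arithmetic only approximates a real image composite.

```python
def darken(pixel, alpha=128):
    """Blend an opaque RGB pixel with black at the given alpha (0-255)."""
    keep = 255 - alpha
    return tuple(channel * keep // 255 for channel in pixel)


def render(strips):
    """strips: list of (rgb, is_context). Context strips get the overlay."""
    return [darken(rgb) if is_context else rgb for rgb, is_context in strips]


RED = (255, 0, 0)
THRESHOLD = 200  # assumed cut-off between "bright" and "darkened"

# Context strips above and below the content are darkened; content is not.
top, content, bottom = render([(RED, True), (RED, False), (RED, True)])
assert top[0] < THRESHOLD
assert content[0] >= THRESHOLD
assert bottom[0] < THRESHOLD

# Both context strips skipped: the single remaining crop must stay untouched.
(only,) = render([(RED, False)])
assert only == RED
```

The last case mirrors the scenario where only one image remains, which a first/last positional rule would have darkened.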

Thank you again for your guidance and support. I look forward to your feedback.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

test/unit_test/deepdoc/parser/test_mineru_crop.py (1)

78-91: Make the “both context strips” assertion match the test intent.

The docstring says both top and bottom context strips are darkened, but Lines 88-90 only verify the top strip. Add a bottom-strip pixel check to fully lock the behavior.

Suggested assertion extension

     def test_context_strips_are_darkened(self):
         """Context strips above and below content must receive the overlay."""
@@
         # Top-most pixel row should be from the darkened top context strip
         r, g, b = result.getpixel((result.size[0] // 2, 0))
         assert r < 200, f"Top context strip not darkened (r={r})"
+        # Bottom-most pixel row should be from the darkened bottom context strip
+        bottom_r, _, _ = result.getpixel((result.size[0] // 2, result.size[1] - 1))
+        assert bottom_r < 200, f"Bottom context strip not darkened (r={bottom_r})"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/unit_test/deepdoc/parser/test_mineru_crop.py` around lines 78 - 91, The
test test_context_strips_are_darkened only checks the top context strip; update
it to also verify the bottom strip is darkened by sampling a pixel from the
bottom-most row of the cropped result (use result.getpixel((result.size[0] // 2,
result.size[1] - 1))) and assert its red channel is below the same threshold
(e.g., r < 200), keeping the existing top-strip check and using parser.crop and
result.getpixel to locate the pixels.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/unit_test/deepdoc/parser/test_mineru_crop.py`:
- Around line 23-27: This test module lacks pytest priority markers; import
pytest at the top and annotate the pytest test functions in this file (the
test_* functions defined around line ~47 in test_mineru_crop.py) with the
appropriate priority markers such as `@pytest.mark.p1` / `@pytest.mark.p2` /
`@pytest.mark.p3` according to the test severity; ensure the import of pytest is
added and each test function (e.g., the test functions in MinerUParser-related
tests) gets the correct `@pytest.mark`.<priority> decorator so the repository’s
test-level categorization is enforced.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4c3177be-37d8-4fb7-90b7-075b8ce517a1

📥 Commits

Reviewing files that changed from the base of the PR and between 937a76c and b6eb172.

📒 Files selected for processing (1)

test/unit_test/deepdoc/parser/test_mineru_crop.py

web-dev0521 · 2026-04-20T07:45:38Z

Hello, @yingfeng!
I have completed all the actions regarding the points you pointed out.
Thank you so much for taking the time to provide feedback—and thanks to everyone else who offered their valuable input as well. I am truly grateful.
Please feel free to let me know anytime if there are any further updates I should make.

yingfeng · 2026-04-20T11:27:12Z

Hello, @yingfeng! I have completed all the actions regarding the points you pointed out. Thank you so much for taking the time to provide feedback—and thanks to everyone else who offered their valuable input as well. I am truly grateful. Please feel free to let me know anytime if there are any further updates I should make.

Can you reply to @xugangqiang regarding his requests?

Fix: MinerU parsed images are obscured when using MinerU as a parser. (…

2f81f98

…infiniflow#14197)

dosubot Bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Apr 20, 2026

Merge branch 'main' into fix/mineru-image-obscured-14197

7e13446

dosubot Bot added the 🐞 bug Something isn't working, pull request that fix bug. label Apr 20, 2026

coderabbitai Bot reviewed Apr 20, 2026

web-dev0521 added 4 commits April 19, 2026 19:39

Fix: MinerU parsed images are obscured when using MinerU as a parser. (…

11fd383

…infiniflow#14197)

Merge branch 'main' into fix/mineru-image-obscured-14197

d3e8148

Merge branch 'main' into fix/mineru-image-obscured-14197

1f68f68

Merge branch 'main' into fix/mineru-image-obscured-14197

fdbf6f7

yingfeng added the ci Continue Integration label Apr 20, 2026

yingfeng marked this pull request as draft April 20, 2026 04:17

yingfeng marked this pull request as ready for review April 20, 2026 04:17

yingfeng requested a review from Copilot April 20, 2026 04:19

Copilot started reviewing on behalf of yingfeng April 20, 2026 04:19

Copilot AI reviewed Apr 20, 2026

Fix: remove unused imgs variable flagged by ruff

53b81e7

Merge branch 'main' into fix/mineru-image-obscured-14197

88bfd7c

yingfeng requested a review from xugangqiang April 20, 2026 04:58

6ba3i assigned 6ba3i and unassigned 6ba3i Apr 20, 2026

6ba3i self-requested a review April 20, 2026 05:48

6ba3i reviewed Apr 20, 2026

Fix: simplify overlay logic using boundary indices per reviewer sugge…

937a76c

…stion

web-dev0521 requested a review from 6ba3i April 20, 2026 06:16

6ba3i reviewed Apr 20, 2026

Test: add unit tests covering both-strips-skipped scenario for infini…

0ea4150

…flow#14197

web-dev0521 added 2 commits April 19, 2026 23:40

Merge branch 'main' into fix/mineru-image-obscured-14197

0577650

Fix: remove unused imports flagged by ruff in test_mineru_crop.py

b6eb172

coderabbitai Bot reviewed Apr 20, 2026

Comment thread test/unit_test/deepdoc/parser/test_mineru_crop.py

web-dev0521 added 2 commits April 20, 2026 00:05

Fix: add pytest priority markers to test_mineru_crop.py

e62b21d

Merge branch 'main' into fix/mineru-image-obscured-14197

c101966
