
feat: route docling parsing through native chunking endpoints#14218

Merged
wangq8 merged 8 commits into infiniflow:main from ParasSondhi:feat/docling-native-chunking
Apr 24, 2026

Conversation

@ParasSondhi
Contributor

@ParasSondhi ParasSondhi commented Apr 19, 2026

Resolves #14211

Background: Currently, RAGFlow routes all Docling parsing through the standard /convert/source endpoint. For large documents, this returns massive, unchunked text that exceeds RAGFlow's internal embedding model context limits, causing pipeline failures.

Solution:
This PR updates the _parse_pdf_remote ingestion logic in docling_parser.py to prioritize docling-serve's native chunking endpoints (/v1/chunk/source and /v1alpha/chunk/source).

  • By receiving pre-sliced chunk objects directly from Docling, RAGFlow avoids token-limit overflows entirely.
  • Includes a graceful fallback to the standard /convert/source endpoints to maintain backwards compatibility with older Docling server versions that return 404s on the new routes (a sketch of the flow follows below).
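For illustration, a minimal sketch of that request flow in Python (the endpoint paths come from this description; the function name, the single shared payload, and the error handling are simplifications of the actual patch, which builds separate chunking and conversion payloads):

import requests

def post_with_chunking_fallback(server_url: str, payload: dict, timeout: int = 600):
    """Try docling-serve's native chunking routes first, then plain conversion."""
    endpoints = [
        ("/v1/chunk/source", True),         # native chunking (stable)
        ("/v1alpha/chunk/source", True),    # native chunking (alpha)
        ("/v1/convert/source", False),      # legacy conversion fallback
        ("/v1alpha/convert/source", False),
    ]
    errors = []
    for endpoint, is_chunking in endpoints:
        try:
            resp = requests.post(f"{server_url}{endpoint}", json=payload, timeout=timeout)
        except requests.RequestException as exc:
            errors.append(f"{endpoint}: {exc}")
            continue
        if resp.status_code < 300:
            # Older docling-serve versions 404 on the chunk routes, so landing on
            # a convert endpoint here is the backwards-compatibility path.
            return resp.json(), is_chunking
        errors.append(f"{endpoint}: HTTP {resp.status_code}")
    raise RuntimeError("All Docling endpoints failed: " + "; ".join(errors))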

Type of change

  • New Feature (non-breaking change which adds functionality)

@dosubot dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Apr 19, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 19, 2026

📝 Walkthrough


The parser now attempts native chunking by sending a chunking-enabled payload to Docling endpoints first, handles chunked responses (list or results), converts chunks to sections and returns with empty tables, and falls back to non-chunking conversion requests while logging chunking outcomes and request errors.

Changes

Cohort / File(s) Summary
Docling chunking & fallback
deepdoc/parser/docling_parser.py
Adds separate chunking-enabled and non-chunking payloads, tries chunking-enabled requests against Docling endpoints first, logs chunking success/failure and per-endpoint exceptions, treats chunked responses (raw list or response["results"]) by extracting chunk_data["text"] or chunk_data["chunk"]["text"], converts chunks via _sections_from_remote_text, optionally invokes callback with chunk count, returns immediately with tables=[]; falls back to prior convert/document-based parsing when chunking not used.

Sequence Diagram(s)

sequenceDiagram
    participant Parser as RAGFlow Parser
    participant Docling as Docling-serve
    participant Local as _sections_from_remote_text

    Parser->>Docling: POST /v1/convert/source (payload with options.do_chunking + chunking_options)
    alt chunking 200 (list or response.results)
        Docling-->>Parser: chunked response
        Parser->>Local: _sections_from_remote_text(chunk.text / chunk.chunk.text)
        Local-->>Parser: sections
        Parser-->>Caller: return (sections, tables=[])
    else chunking non-2xx (or rejected)
        Parser->>Docling: POST /v1alpha/convert/source (chunking payload)
        alt alpha chunking 200
            Docling-->>Parser: chunked response
            Parser->>Local: _sections_from_remote_text(...)
            Local-->>Parser: sections
            Parser-->>Caller: return (sections, tables=[])
        else chunking failed or endpoints return non-chunked
            Parser->>Docling: POST /v1/convert/source (non-chunking payload)
            Docling-->>Parser: conversion documents (md/text/json)
            Parser->>Local: extract docs -> _sections_from_remote_text
            Local-->>Parser: sections, tables
            Parser-->>Caller: return (sections, tables)
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped to Docling with a curious twitch,

Sent tiny chunking seeds and waited for a stitch,
When chunks came back tidy I counted each part,
If not, I fallback-fast — clever from the start,
Now big texts bound lightly through the park.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and clearly describes the main change: routing Docling parsing through native chunking endpoints.
Linked Issues check ✅ Passed The PR implementation aligns with #14211 objectives: prefers chunking endpoints, handles chunked responses, maintains backward compatibility with fallback, and converts chunks to RAGFlow sections.
Out of Scope Changes check ✅ Passed All changes are scoped to implementing native chunking support in _parse_pdf_remote; no extraneous modifications detected beyond the PR objectives.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description check ✅ Passed PR description includes problem statement, solution overview, and type of change checkbox marked. However, the description refers to the endpoints as '/v1/chunk/source' while the raw summary shows '/v1/convert/source'.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
deepdoc/parser/docling_parser.py (2)

429-444: Add logging for the new chunking flow.

Per coding guidelines, new flows should include logging. The chunking endpoint path only uses callback but lacks self.logger calls for debugging/observability. Consider adding debug logging when entering the chunking flow and when chunks are processed.

♻️ Proposed refactor: Add logging statements
         # --- NEW: Handle Native Chunked Response ---
         if is_chunked_response:
+            self.logger.info("[Docling] Using native chunking endpoint response")
             # The chunking endpoint returns an array of chunk items
             chunks = response_json if isinstance(response_json, list) else response_json.get("results", [])
             for chunk_data in chunks:
                 # Depending on the exact docling-serve spec, the text might be nested
                 chunk_text = chunk_data.get("text", "")
                 if not chunk_text and "chunk" in chunk_data:
                     chunk_text = chunk_data["chunk"].get("text", "")
                 
                 if chunk_text.strip():
                     # Feed the pre-sliced chunks directly into RAGFlow's expected format
                     sections.extend(self._sections_from_remote_text(chunk_text, parse_method=parse_method))
                     
             if callback:
                 callback(0.95, f"[Docling] Native chunks received: {len(sections)}")
+            self.logger.debug(f"[Docling] Processed {len(chunks)} chunks into {len(sections)} sections")
             return sections, tables

As per coding guidelines: "**/*.py: Add logging for new flows"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepdoc/parser/docling_parser.py` around lines 429 - 444, The chunking branch
triggered when is_chunked_response lacks any logger calls; add debug/info logs
using self.logger at the start of the branch (mentioning that chunked response
handling has started and include any identifying request id if available), log
the number of raw chunks found (len(chunks)) before the loop, and log progress
when processing chunks (e.g., per-chunk index or every N chunks) and a final log
after building sections and tables (include len(sections) and len(tables)); keep
the existing callback usage but add these self.logger.debug/info calls around
the code that uses chunks, chunk_text, and _sections_from_remote_text to aid
observability.

389-395: Consider making chunking parameters configurable.

The max_tokens and overlap_tokens values are hardcoded. These parameters directly impact chunking behavior and may need tuning based on different embedding models' context limits. Consider exposing these as method parameters or environment variables for flexibility.

♻️ Proposed refactor: Add parameters with defaults
 def _parse_pdf_remote(
     self,
     filepath: str | PathLike[str],
     binary: BytesIO | bytes | None = None,
     callback: Optional[Callable] = None,
     *,
     parse_method: str = "raw",
     docling_server_url: Optional[str] = None,
     request_timeout: Optional[int] = None,
+    chunk_max_tokens: int = 512,
+    chunk_overlap_tokens: int = 50,
 ):
     # ...
     chunking_payload = {
         "sources": [{"kind": "file", "filename": filename, "base64_string": b64}],
         "options": {
-            "max_tokens": 512, # A safe default to prevent embedding overflows
-            "overlap_tokens": 50
+            "max_tokens": chunk_max_tokens,
+            "overlap_tokens": chunk_overlap_tokens
         }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepdoc/parser/docling_parser.py` around lines 389 - 395, The chunking
parameters are hardcoded in the chunking_payload block; make max_tokens and
overlap_tokens configurable by adding parameters (e.g., max_chunk_tokens and
overlap_chunk_tokens) to the surrounding function that builds chunking_payload
in docling_parser.py, defaulting to values read from environment variables
(e.g., via os.getenv with int conversion) and falling back to 512 and 50; update
chunking_payload to use these parameters and adjust the function signature and
docstring accordingly so callers can tune embedding chunk sizes per model.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deepdoc/parser/docling_parser.py`:
- Around line 44-45: The conditional import placed inside the except block
causes extract_pdf_outlines to be undefined when RAGFlowPdfParser imports
successfully; move the import statement for extract_pdf_outlines out of the
try/except so it is always executed (ensure extract_pdf_outlines is imported at
module scope alongside or above the RAGFlowPdfParser import), keeping the
existing try/except only around RAGFlowPdfParser so that extract_pdf_outlines is
always available for the code that calls it (e.g., where extract_pdf_outlines is
referenced at line ~483).
- Around line 432-440: The loop over chunks in the parser (the for chunk_data in
chunks block in docling_parser.py) assumes each chunk_data is a dict and calls
chunk_data.get(), which can raise if malformed; add a defensive type guard:
check isinstance(chunk_data, dict) before using .get(), and if chunk_data is a
str or bytes treat it as the chunk text (assign chunk_text = chunk_data), or if
it's None/other skip the item, then proceed to call
_sections_from_remote_text(chunk_text, parse_method=parse_method) only when
chunk_text is a non-empty string; update the for-loop handling accordingly.
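A sketch of the restructure the first inline comment asks for (the module path is an assumption for illustration; only the try/except shape matters):

# Import the outline helper at module scope so it is always defined,
# whether or not the guarded RAGFlowPdfParser import succeeds.
from deepdoc.parser.pdf_parser import extract_pdf_outlines  # assumed location

try:
    from deepdoc.parser.pdf_parser import RAGFlowPdfParser
except ImportError:
    RAGFlowPdfParser = object  # minimal stand-in when the full parser is unavailable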


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3c44dc51-c795-4518-b1ea-62e03c87a03c

📥 Commits

Reviewing files that changed from the base of the PR and between 6712b50 and ac0ab99.

📒 Files selected for processing (1)
  • deepdoc/parser/docling_parser.py

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
deepdoc/parser/docling_parser.py (2)

389-395: Consider making chunking parameters configurable.

The max_tokens and overlap_tokens values are hardcoded. Different embedding models have varying context limits (512, 1024, 8192 tokens, etc.), and users may need to tune these for their specific use case.

♻️ Proposed refactor: Add parameters to method signature
 def _parse_pdf_remote(
     self,
     filepath: str | PathLike[str],
     binary: BytesIO | bytes | None = None,
     callback: Optional[Callable] = None,
     *,
     parse_method: str = "raw",
     docling_server_url: Optional[str] = None,
     request_timeout: Optional[int] = None,
+    max_tokens: int = 512,
+    overlap_tokens: int = 50,
 ):

Then use the parameters in the payload:

         chunking_payload = {
             "sources": [{"kind": "file", "filename": filename, "base64_string": b64}],
             "options": {
-                "max_tokens": 512, # A safe default to prevent embedding overflows
-                "overlap_tokens": 50
+                "max_tokens": max_tokens,
+                "overlap_tokens": overlap_tokens
             }
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepdoc/parser/docling_parser.py` around lines 389 - 395, The chunking
parameters are hardcoded in the chunking_payload dict; make them configurable by
adding parameters (e.g., max_tokens=512, overlap_tokens=50) to the enclosing
function/method signature in docling_parser.py and use those parameters when
building chunking_payload (replace the literal 512 and 50). Ensure the defaults
match current values, propagate the new args through any callers if necessary,
and validate or clamp values as needed before inserting them into the "options"
-> "max_tokens" and "overlap_tokens" fields.

401-420: Add logging for the new chunking flow.

Per coding guidelines, new flows should include logging. Consider adding a debug/info log when attempting chunking endpoints and when falling back to conversion endpoints, which would aid troubleshooting.

♻️ Proposed addition
         # We prioritize the new chunking endpoints first!
         for endpoint, payload, chunk_flag in (
             ("/v1/chunk/source", chunking_payload, True),          # New stable chunking
             ("/v1alpha/chunk/source", chunking_payload, True),     # New alpha chunking
             ("/v1/convert/source", v1_payload, False),             # Fallback to standard
             ("/v1alpha/convert/source", v1alpha_payload, False),   # Fallback to alpha standard
         ):
             try:
+                self.logger.debug(f"[Docling] Trying endpoint: {endpoint} (chunked={chunk_flag})")
                 resp = requests.post(
                     f"{server_url}{endpoint}",
                     json=payload,
                     timeout=timeout,
                 )
                 if resp.status_code < 300:
                     response_json = resp.json()
                     is_chunked_response = chunk_flag
+                    self.logger.info(f"[Docling] Success with endpoint: {endpoint}")
                     break
                 errors.append(f"{endpoint}: HTTP {resp.status_code} {resp.text[:300]}")
             except Exception as exc:
                 errors.append(f"{endpoint}: {exc}")

As per coding guidelines: "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepdoc/parser/docling_parser.py` around lines 401 - 420, The loop that tries
chunking and conversion endpoints (the block iterating over endpoint, payload,
chunk_flag using server_url, chunking_payload, v1_payload, v1alpha_payload,
timeout, resp, response_json, is_chunked_response, errors) needs logging: add a
debug/info log before each requests.post indicating which endpoint and whether
it is a chunking attempt (chunk_flag True) or a conversion fallback, log the
response status and short body when resp.status_code < 300 or when appending to
errors, and log exceptions inside the except block; use the module or class
logger already in this file (or create one if missing) and ensure logs clearly
indicate endpoint, chunk_flag and outcome so the new chunking flow is traceable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deepdoc/parser/docling_parser.py`:
- Around line 436-438: The code assumes chunk_data["chunk"] is a dict when
computing chunk_text; add a defensive type check before calling .get on
chunk_data["chunk"] in the chunk_text assignment: if chunk_data.get("chunk") is
a dict, use its .get("text", "") value, otherwise handle other types (e.g., if
it's a str use it directly, if None or unexpected type treat as empty string).
Update the logic around the chunk_text and chunk_data variables (the existing
chunk_text = chunk_data.get("text", "") / chunk_data["chunk"].get("text", ""))
to first inspect isinstance(chunk_data.get("chunk"), dict) and branch
accordingly so no AttributeError is raised.
- Around line 425-446: The chunked-response branch currently only extracts text
into sections and leaves tables empty; update the block guarded by
is_chunked_response so that for each chunk object in response_json you also
extract any table metadata (e.g., chunk.get("tables") or chunk.get("table") /
chunk.get("do_table_structure") variants) and pass that chunk-level table
payload into the existing _transfer_to_tables(...) helper to convert/append
results into the local tables list (merging with any existing tables) before
returning; continue to call _sections_from_remote_text(...) for text, and keep
the callback usage (callback(0.95,...)) but include the final counts for both
sections and tables when reporting.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bdfe475d-da31-43d4-a230-bfd7669e37f4

📥 Commits

Reviewing files that changed from the base of the PR and between ac0ab99 and 520304b.

📒 Files selected for processing (1)
  • deepdoc/parser/docling_parser.py

Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
deepdoc/parser/docling_parser.py (1)

401-420: ⚠️ Potential issue | 🟠 Major

Only downgrade to /convert when chunking is actually unsupported.

This loop falls through to the old convert path after any non-2xx from the chunk routes. That means payload bugs, auth failures, or transient 5xxs silently disable native chunking and can reintroduce the oversized unchunked-text failure this PR is fixing. Please keep the fallback for “endpoint missing/unsupported” responses and surface a real log when chunking is skipped.

As per coding guidelines, **/*.py: Add logging for new flows.
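A hedged sketch of that gating (the status-code set mirrors the prompt below; the logger wiring is illustrative, not the merged code):

import logging

logger = logging.getLogger(__name__)

# Codes that plausibly mean "this route does not exist on this server version".
UNSUPPORTED_STATUSES = {404, 405, 410, 501}

def chunking_should_fall_back(endpoint: str, status_code: int) -> bool:
    """Decide whether a failed chunk request justifies downgrading to /convert."""
    if status_code in UNSUPPORTED_STATUSES:
        logger.warning("[Docling] %s unsupported (HTTP %d); falling back to /convert",
                       endpoint, status_code)
        return True
    # Auth failures, payload bugs, and 5xx responses are real errors; surface them
    # instead of silently reintroducing the oversized unchunked-text failure.
    logger.error("[Docling] chunking failed hard on %s (HTTP %d)", endpoint, status_code)
    return False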

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepdoc/parser/docling_parser.py` around lines 401 - 420, The loop currently
falls back to /convert for any non-2xx from chunking endpoints; change it so
when chunk_flag is True you only treat the endpoint as "unsupported" and proceed
to the fallback if the response indicates the route is missing/unsupported
(e.g., 404/405/410/501), but for other non-2xx codes (auth failures, payload
bugs, 5xx) do not silently downgrade—append to errors, log a clear message, and
surface/raise the failure instead of continuing; specifically update the logic
around requests.post/resp handling (symbols: server_url, endpoint, chunk_flag,
resp, response_json, is_chunked_response, errors, timeout) to check status
codes, use the module logger (getLogger) to log a warning/error when chunking is
skipped or when non-supported vs real failures occur, and only set
is_chunked_response and break for successful 2xx or explicit "unsupported"
detection.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 27e437af-4d25-4d15-85a6-9051c904988e

📥 Commits

Reviewing files that changed from the base of the PR and between 520304b and 54b5640.

📒 Files selected for processing (1)
  • deepdoc/parser/docling_parser.py

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
deepdoc/parser/docling_parser.py (2)

389-395: Consider making chunking parameters configurable.

The max_tokens and overlap_tokens values are hardcoded. Different embedding models have varying context limits, and users may want to tune these for their specific use cases.

♻️ Proposed approach: Add parameters to method signature
     def _parse_pdf_remote(
         self,
         filepath: str | PathLike[str],
         binary: BytesIO | bytes | None = None,
         callback: Optional[Callable] = None,
         *,
         parse_method: str = "raw",
         docling_server_url: Optional[str] = None,
         request_timeout: Optional[int] = None,
+        max_tokens: int = 512,
+        overlap_tokens: int = 50,
     ):

Then use these parameters in the chunking payload:

         chunking_payload = {
             "sources": [{"kind": "file", "filename": filename, "base64_string": b64}],
             "options": {
-                "max_tokens": 512,
-                "overlap_tokens": 50
+                "max_tokens": max_tokens,
+                "overlap_tokens": overlap_tokens
             }
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepdoc/parser/docling_parser.py` around lines 389 - 395, The hardcoded
chunking params (max_tokens=512, overlap_tokens=50) should be made configurable:
add optional parameters (e.g., max_tokens and overlap_tokens with the same
defaults) to the public method that builds chunking_payload (or to the
DoclingParser class constructor if you prefer instance-wide defaults), then
replace the literal values in chunking_payload with those parameters; also
propagate these values when callers invoke the method or when constructing the
parser so users can tune per-model settings.

418-427: Add INFO-level logging when falling back from chunking to conversion endpoints.

The commit message mentions "add logging for that fallback path," but when chunking endpoints return 404 (indicating they're unavailable), there's no log message indicating the fallback is occurring. This would help with debugging and operational visibility.

📝 Proposed fix: Add fallback logging
                 # --- UPDATED LOGIC ---
                 if chunk_flag and resp.status_code != 404:
                     self.logger.error(f"[Docling] Chunking failed with {resp.status_code}: {resp.text[:300]}")
                     errors.append(f"{endpoint}: HTTP {resp.status_code}")
                     break  # <-- THE MAGIC FIX: This instantly kills the loop so it cannot fallback
                 # ---------------------
+                
+                if chunk_flag and resp.status_code == 404:
+                    self.logger.info(f"[Docling] Chunking endpoint {endpoint} not available (404), falling back to conversion endpoints")

                 errors.append(f"{endpoint}: HTTP {resp.status_code} {resp.text[:300]}")

As per coding guidelines: "Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepdoc/parser/docling_parser.py` around lines 418 - 427, When chunking
endpoints return 404 you should log an INFO message to indicate you're falling
back to conversion endpoints: inside the same block that checks chunk_flag and
examines resp.status_code (the if handling chunk_flag and resp.status_code !=
404), add an else branch for the 404 case that calls self.logger.info with a
message like "[Docling] Chunking endpoint returned 404, falling back to
conversion for {endpoint}" (use endpoint and resp.status_code/resp.text for
context), then append the error to errors as currently done and continue so the
loop can try conversion endpoints; target the method containing chunk_flag,
resp, errors and endpoint in docling_parser.py.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deepdoc/parser/docling_parser.py`:
- Around line 448-452: The code reads chunk_text from chunk_data and then calls
chunk_text.strip() without ensuring it's a string; update the block around the
chunk_text assignment and the subsequent if check (the chunk_text/ chunk_data
handling and the "if chunk_text.strip():" condition) to first validate or coerce
the value: e.g., check isinstance(chunk_text, str) and only call .strip() when
True, or else set chunk_text = "" (or coerce with str(chunk_text) if non-None
values should be preserved) before using .strip(); ensure both branches (from
chunk_data.get("text") and chunk_data["chunk"].get("text")) follow this guard so
AttributeError cannot occur.
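A sketch of the defensive extraction this comment describes (field names follow the response shapes quoted in earlier reviews; the guard logic is illustrative):

def extract_chunk_text(chunk_data) -> str:
    """Pull the text out of one docling-serve chunk item, tolerating odd shapes."""
    if isinstance(chunk_data, str):
        return chunk_data.strip()
    if isinstance(chunk_data, dict):
        text = chunk_data.get("text", "")
        if not text and isinstance(chunk_data.get("chunk"), dict):
            text = chunk_data["chunk"].get("text", "")
        return text.strip() if isinstance(text, str) else ""
    return ""  # None or unexpected types yield empty text, which callers skip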


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 60dfdb52-e85e-48f7-9c10-928c73c3b3e2

📥 Commits

Reviewing files that changed from the base of the PR and between 54b5640 and ae9fbe2.

📒 Files selected for processing (1)
  • deepdoc/parser/docling_parser.py

Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
deepdoc/parser/docling_parser.py (2)

401-458: ⚠️ Potential issue | 🟡 Minor

Add normal logs for the new chunking/fallback path.

This flow only surfaces path selection through callback(...), so the runtime logs still won't show whether native chunking was used or whether we fell back after 404s. Please add self.logger.info(...) / self.logger.warning(...) around chunk-endpoint success and convert fallback. As per coding guidelines: **/*.py: Add logging for new flows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepdoc/parser/docling_parser.py` around lines 401 - 458, Add
informational/warning logs to surface which path was taken: when a chunking
endpoint succeeds (inside the loop where resp.status_code < 300 and chunk_flag
is True) call self.logger.info with endpoint and that native chunking will be
used (and later after building sections log count via self.logger.info before
returning); when a chunking endpoint returns 404 (the code path that currently
falls through to convert endpoints) call self.logger.warning indicating endpoint
404 and that the code will fall back to standard convert endpoints; also add a
self.logger.warning when a network exception occurs and chunk_flag is True (the
except block that currently breaks) to indicate a hard failure on chunking and
that fallback/abort is happening; reference variables/methods: endpoint,
chunk_flag, is_chunked_response, response_json, callback, and
_sections_from_remote_text to locate where to insert these logs.

402-432: ⚠️ Potential issue | 🟠 Major

Try the second chunk endpoint before aborting.

Line 423 and Line 432 stop the loop on the first non-404 chunk failure, so /v1alpha/chunk/source is never attempted if /v1/chunk/source rejects the request. That turns the second preferred chunk route into dead code for some compatibility cases and can fail the parse even though another chunk endpoint would succeed.
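One way to express the requested control flow, sketched under the variable names the prompt uses (the merged code may differ):

import requests

def try_docling_endpoints(server_url, attempts, timeout):
    """attempts: ordered (endpoint, payload, chunk_flag) tuples, chunk routes first."""
    errors = []
    last_chunk_route = [e for e, _, flag in attempts if flag][-1]
    for endpoint, payload, chunk_flag in attempts:
        try:
            resp = requests.post(f"{server_url}{endpoint}", json=payload, timeout=timeout)
        except requests.RequestException as exc:
            errors.append(f"{endpoint}: {exc}")
            continue  # let /v1alpha/chunk/source run even if /v1/chunk/source errored
        if resp.status_code < 300:
            return resp.json(), chunk_flag, errors
        errors.append(f"{endpoint}: HTTP {resp.status_code}")
        if chunk_flag and resp.status_code != 404 and endpoint == last_chunk_route:
            # Abort only after every chunk route has rejected the request outright.
            raise RuntimeError("; ".join(errors))
    return None, False, errors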

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepdoc/parser/docling_parser.py` around lines 402 - 432, The loop currently
aborts on the first chunk endpoint error because both the HTTP-error branch and
the except block use break when chunk_flag is True; change those breaks so chunk
endpoints don't kill the entire try-order. Specifically, in the for-loop
handling (the tuple with "/v1/chunk/source" and "/v1alpha/chunk/source") remove
or replace the break statements that occur when chunk_flag is True (both inside
the "if chunk_flag and resp.status_code != 404:" branch and the "except
Exception as exc:" block where "if chunk_flag: break") so that
errors.append(...) is kept and the loop continues to try the next chunk
endpoint; keep the break only for non-chunk fallbacks or when a successful
resp.status_code < 300 is seen (response_json/is_chunked_response logic).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4e86b55c-7e58-491e-80fa-91f465bd5a4b

📥 Commits

Reviewing files that changed from the base of the PR and between ae9fbe2 and 4e7d1c4.

📒 Files selected for processing (1)
  • deepdoc/parser/docling_parser.py

@ParasSondhi
Contributor Author

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


📝 Walkthrough


The PR enhances _parse_pdf_remote to prefer Docling-serve chunking endpoints (/v1/chunk/source, /v1alpha/chunk/source) using a chunking_payload (max_tokens, overlap_tokens), detect and handle chunked responses (raw list or response_json["results"]) by converting chunks to sections and returning early; it falls back to existing conversion endpoints when chunking is unavailable and changes error-handling to abort on chunking-specific non-404 failures.

Changes

Cohort / File(s) Summary
Docling chunking & fallback
deepdoc/parser/docling_parser.py Prefer chunking endpoints first with a new chunking_payload (max_tokens, overlap_tokens); iterate endpoints (/v1/chunk/source, /v1alpha/chunk/source, then conversion endpoints) and set is_chunked_response when chunking succeeds. On chunking non-404 errors/exceptions, abort loop; otherwise allow fallback to convert endpoints. When chunked, extract chunk text from chunk_data["text"] or chunk_data["chunk"]["text"], convert via _sections_from_remote_text, invoke callback with chunk count, return early with tables empty. Minor payload formatting and loop structure changes; moved an import into try/except.
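To make the early-return branch concrete, a compact sketch (sections_from_text stands in for a bound _sections_from_remote_text; the callback signature is taken from the diffs above):

def handle_chunked_response(response_json, sections_from_text, callback=None):
    """Turn a docling-serve chunked response into RAGFlow sections (sketch)."""
    chunks = response_json if isinstance(response_json, list) else response_json.get("results", [])
    sections = []
    for chunk_data in chunks:
        text = ""
        if isinstance(chunk_data, dict):
            text = chunk_data.get("text", "")
            if not text and isinstance(chunk_data.get("chunk"), dict):
                text = chunk_data["chunk"].get("text", "")
        if isinstance(text, str) and text.strip():
            sections.extend(sections_from_text(text))
    if callback:
        callback(0.95, f"[Docling] Native chunks received: {len(sections)}")
    return sections, []  # remote table extraction is intentionally unimplemented here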

Sequence Diagram(s)

sequenceDiagram
    participant Client as RAGFlow
    participant Docling as Docling-serve
    participant Parser as Local Parser

    Client->>Docling: POST /v1/chunk/source (chunking_payload)
    alt chunking success (200)
        Docling-->>Client: chunks (list or response.results)
        Client->>Parser: _sections_from_remote_text(chunks)
        Parser-->>Client: sections
        Client-->>Caller: return (sections, tables=[])
    else chunking 404
        Client->>Docling: POST /v1alpha/chunk/source (retry)
        alt alpha chunking success
            Docling-->>Client: chunks
            Client->>Parser: _sections_from_remote_text(chunks)
            Client-->>Caller: return (sections, tables=[])
        else chunking endpoints failed or non-chunking response
            Client->>Docling: POST /v1/convert/source (fallback)
            Docling-->>Client: document entries (md_content/text_content)
            Client->>Parser: _extract_remote_document_entries -> build sections/tables
            Client-->>Caller: return (sections, tables)
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped to Docling with a curious sniff,

Chunks in my pouch—neat, split, and swift,
I stitched them to sections, counted each part,
Kept a fallback trail close to my heart,
Now pages hop forward, fresh as a lark.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the primary change: routing Docling parsing through native chunking endpoints instead of standard conversion endpoints.
Description check ✅ Passed The PR description fully addresses the template requirements with clear problem statement, solution details, backward compatibility approach, and proper type selection.
Linked Issues check ✅ Passed The PR successfully implements all coding objectives from #14211: prioritizes native chunking endpoints (/v1/chunk/source and /v1alpha/chunk/source), processes pre-chunked text to avoid context overflows, and includes graceful fallback to /convert/source for older Docling versions.
Out of Scope Changes check ✅ Passed All changes in deepdoc/parser/docling_parser.py are directly scoped to the stated objective of routing PDF parsing through Docling's native chunking endpoints with fallback support.
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai resume

@coderabbitai
Contributor

coderabbitai Bot commented Apr 19, 2026

@ParasSondhi: Resuming reviews.


🧠 Learnings used
Learnt from: ParasSondhi
Repo: infiniflow/ragflow PR: 14218
File: deepdoc/parser/docling_parser.py:425-446
Timestamp: 2026-04-19T15:57:58.362Z
Learning: In `deepdoc/parser/docling_parser.py`, the `_parse_pdf_remote` method has never implemented remote table extraction — both the standard fallback path and the new native-chunking path intentionally return an empty `tables = []` list. Remote table extraction is a separate, not-yet-implemented feature. Do not flag the empty `tables` list in `_parse_pdf_remote` as a bug.
✅ Actions performed

Reviews resumed.

@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Apr 19, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
deepdoc/parser/docling_parser.py (1)

389-395: Consider making chunking parameters configurable.

The max_tokens=512 and overlap_tokens=50 values are hardcoded. While 512 is a safe conservative default, different embedding models have varying context windows (e.g., 8192 for some models). Making these configurable via method parameters or environment variables would allow optimization for specific deployments.

♻️ Suggested approach
     def _parse_pdf_remote(
         self,
         filepath: str | PathLike[str],
         binary: BytesIO | bytes | None = None,
         callback: Optional[Callable] = None,
         *,
         parse_method: str = "raw",
         docling_server_url: Optional[str] = None,
         request_timeout: Optional[int] = None,
+        chunk_max_tokens: Optional[int] = None,
+        chunk_overlap_tokens: Optional[int] = None,
     ):
         ...
+        max_tokens = chunk_max_tokens or int(os.environ.get("DOCLING_CHUNK_MAX_TOKENS", 512))
+        overlap_tokens = chunk_overlap_tokens or int(os.environ.get("DOCLING_CHUNK_OVERLAP_TOKENS", 50))
+
         chunking_payload = {
             "sources": [{"kind": "file", "filename": filename, "base64_string": b64}],
             "options": {
-                "max_tokens": 512,
-                "overlap_tokens": 50
+                "max_tokens": max_tokens,
+                "overlap_tokens": overlap_tokens
             }
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepdoc/parser/docling_parser.py` around lines 389 - 395, The hardcoded
chunking parameters in chunking_payload (max_tokens: 512 and overlap_tokens: 50)
should be made configurable: add optional parameters (e.g., max_tokens and
overlap_tokens) to the surrounding function in docling_parser.py (or read from
environment variables as fallbacks), defaulting to 512 and 50 to preserve
current behavior, and use those parameters when building chunking_payload;
update any call sites and unit tests to pass or mock the new options and
document the new configuration options.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a86c716e-92f9-46fb-80a0-e7e4e92c5eb4

📥 Commits

Reviewing files that changed from the base of the PR and between ae9fbe2 and f0add43.

📒 Files selected for processing (1)
  • deepdoc/parser/docling_parser.py

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deepdoc/parser/docling_parser.py`:
- Around line 394-402: The chunking_payload structure is incorrect for Docling
Serve: update the payload created where chunking_payload is defined to use the
/v1/convert/source contract by setting options.do_chunking = True and moving
token settings into options.chunking_options with keys max_tokens, overlap
(rename overlap_tokens -> overlap) and tokenizer; replace calls that hit
/v1/chunk/source or /v1alpha/chunk/source to call /v1/convert/source instead;
make max_tokens, overlap, and tokenizer configurable (e.g., parameters or config
values passed into the function that builds chunking_payload) so they are not
hardcoded.
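Reading that comment back as code, the suggested payload would look roughly like this (every field name is taken from the comment and the later review that shows tokenizer "sentencepiece"; whether it matches a given docling-serve version is an assumption):

def build_chunking_payload(filename: str, b64: str, *, max_tokens: int = 512,
                           overlap: int = 50, tokenizer: str = "sentencepiece") -> dict:
    """Payload for POST /v1/convert/source with native chunking enabled (sketch)."""
    return {
        "sources": [{"kind": "file", "filename": filename, "base64_string": b64}],
        "options": {
            "do_chunking": True,
            "chunking_options": {
                "max_tokens": max_tokens,  # keyword arguments keep these configurable
                "overlap": overlap,        # note: "overlap", not "overlap_tokens"
                "tokenizer": tokenizer,
            },
        },
    }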

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2df3a4ad-b7f0-48d2-b658-051ee335bcee

📥 Commits

Reviewing files that changed from the base of the PR and between f0add43 and 469bc07.

📒 Files selected for processing (1)
  • deepdoc/parser/docling_parser.py

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
deepdoc/parser/docling_parser.py (2)

396-405: Consider making chunking parameters configurable.

The chunking options (max_tokens: 512, overlap: 50, tokenizer: "sentencepiece") are hardcoded. Different embedding models and use cases may require different token limits. Consider exposing these as constructor parameters or environment variables.

♻️ Proposed refactor to add configurability
 class DoclingParser(RAGFlowPdfParser):
-    def __init__(self, docling_server_url: str = "", request_timeout: int = 600):
+    def __init__(
+        self,
+        docling_server_url: str = "",
+        request_timeout: int = 600,
+        chunk_max_tokens: int = 512,
+        chunk_overlap: int = 50,
+        chunk_tokenizer: str = "sentencepiece",
+    ):
         self.logger = logging.getLogger(self.__class__.__name__)
         ...
+        self.chunk_max_tokens = chunk_max_tokens
+        self.chunk_overlap = chunk_overlap
+        self.chunk_tokenizer = chunk_tokenizer

Then in _parse_pdf_remote:

         chunking_opts = {
             "from_formats": ["pdf"], 
             "to_formats": ["json", "md", "text"],
             "do_chunking": True,
             "chunking_options": {
-                "max_tokens": 512,
-                "overlap": 50,
-                "tokenizer": "sentencepiece"
+                "max_tokens": self.chunk_max_tokens,
+                "overlap": self.chunk_overlap,
+                "tokenizer": self.chunk_tokenizer,
             }
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepdoc/parser/docling_parser.py` around lines 396 - 405, The chunking
options are hardcoded in chunking_opts inside _parse_pdf_remote (max_tokens:
512, overlap: 50, tokenizer: "sentencepiece"); make these configurable by adding
constructor parameters (or reading env vars) to the parser class (e.g.,
DoclingParser.__init__) such as chunk_max_tokens, chunk_overlap,
chunk_tokenizer, defaulting to the current values, then replace the literal
values in chunking_opts with those instance attributes so callers can override
them.

354-360: Docstring references outdated endpoint paths.

The docstring mentions /v1/chunk/source and /v1alpha/chunk/source, but the implementation actually uses /v1/convert/source and /v1alpha/convert/source with do_chunking: true. Consider updating the docstring to accurately reflect the implementation.

📝 Suggested docstring update
         """
         Parses a PDF document using a remote Docling server.
         
-        Prioritizes native chunking endpoints (/v1/chunk/source, /v1alpha/chunk/source) 
-        to prevent token overflow, with a graceful fallback to standard conversion 
-        endpoints if chunking is unavailable.
+        Prioritizes native chunking via /v1/convert/source with do_chunking=true
+        to prevent token overflow, with a graceful fallback to standard conversion 
+        when chunking is unavailable.
         """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepdoc/parser/docling_parser.py` around lines 354 - 360, The docstring
incorrectly states native chunking endpoints; update the docstring in the
function containing "Parses a PDF document using a remote Docling server." to
reflect the actual endpoints used (/v1/convert/source and
/v1alpha/convert/source) and note that chunking is achieved by sending
do_chunking: true rather than calling /chunk/* endpoints (reference
docling_parser module and the function whose docstring contains the quoted
text).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f1c32db9-e401-468c-a53e-76b355780a43

📥 Commits

Reviewing files that changed from the base of the PR and between 469bc07 and 4ede31e.

📒 Files selected for processing (1)
  • deepdoc/parser/docling_parser.py

@yingfeng yingfeng added the ci Continue Integration label Apr 20, 2026
@yingfeng yingfeng marked this pull request as draft April 20, 2026 01:37
@yingfeng yingfeng marked this pull request as ready for review April 20, 2026 01:37
@dosubot dosubot Bot added the 🌈 python Pull requests that update Python code label Apr 20, 2026
@codecov

codecov Bot commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.66%. Comparing base (6712b50) to head (4ede31e).
⚠️ Report is 78 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14218      +/-   ##
==========================================
- Coverage   98.11%   96.66%   -1.45%     
==========================================
  Files          10       10              
  Lines         690      690              
  Branches      108      108              
==========================================
- Hits          677      667      -10     
- Misses          4        8       +4     
- Partials        9       15       +6     

☔ View full report in Codecov by Sentry.

@yingfeng yingfeng requested a review from Magicbook1108 April 20, 2026 09:07
Contributor

@Magicbook1108 Magicbook1108 left a comment


LGTM

@wangq8 wangq8 merged commit eeb89d6 into infiniflow:main Apr 24, 2026
3 checks passed
Oldcircle added a commit to Oldcircle/ragflow that referenced this pull request Apr 24, 2026
Sync with infiniflow/ragflow main (v0.25.0 tag + subsequent commits).

Main upstream changes:
- Major REST API refactor: user/tenant/stats/plugin/chunk/document/file2document/
  connector/langfuse/mcp/system/chat/search all moved into api/apps/restful_apis/
- feat: Agent API (infiniflow#14157) / deepseek v4 (infiniflow#14346) / docling native chunking
  (infiniflow#14218) / Astraflow provider (infiniflow#14270) / title chunk optimization (infiniflow#14325)
- refactor: doc metadata update / agent webhook → REST / removed legacy MCP web API
- Go side: retrieval_test in GO / stream think chat / zhipu thinking /
  minimax provider / balance command
- Assorted fixes: think tags normalization / upload truncation / audio-video pipeline
- Renamed 7 ingestion_pipeline templates (Book→book, etc.)

Conflict resolution (2 files):
- CLAUDE.md: kept the fork-specific block intact (fork info / ports / macOS local setup /
  active docs / first-run notes / Task Executor) and appended the 8 new upstream
  Working Style items as a separate section
- web/src/pages/dataset/testing/testing-result.tsx: kept the HEAD version
  (the P1.7-C4 design-aligned version already contains the RAGFlowPagination and
  isRuned empty-state copy that upstream adds, so the upstream version is a
  subset of HEAD)

Verification:
- `pytest test/agent_v2/` → 301 passed / 8 skipped (RAGFLOW_TEST_DB gate)
- Remaining conflicted files auto-merged cleanly: dialog_service.py / docker/.env /
  pyproject.toml / uv.lock / locales/en.ts / dataset-table.tsx / testing-form.tsx

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@ParasSondhi ParasSondhi deleted the feat/docling-native-chunking branch April 24, 2026 12:20
xugangqiang pushed a commit to xugangqiang/ragflow that referenced this pull request Apr 24, 2026
…flow#14218)

Resolves infiniflow#14211

**Background:** Currently, RAGFlow routes all Docling parsing through
the standard `/convert/source` endpoint. For large documents, this
returns massive, unchunked text that exceeds RAGFlow's internal
embedding model context limits, causing pipeline failures.

**Solution:**
This PR updates the `_parse_pdf_remote` ingestion logic in
`docling_parser.py` to prioritize `docling-serve`'s native chunking
endpoints (`/v1/chunk/source` and `/v1alpha/chunk/source`).
- By receiving pre-sliced chunk objects directly from Docling, RAGFlow
natively bypasses token limit overflows.
- Included a graceful fallback mechanism to the standard
`/convert/source` endpoints to maintain backwards compatibility for
users running older versions of the Docling server that return 404s on
the new routes.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

Labels

ci Continue Integration 🌈 python Pull requests that update Python code size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature Request] Use Docling’s chunking endpoint to handle documents exceeding embedding model context limits

4 participants