feat: route docling parsing through native chunking endpoints #14218
Conversation
📝 Walkthrough

The parser now attempts native chunking by sending a chunking-enabled payload to Docling endpoints first, handles chunked responses (list or results), converts chunks to sections and returns them with empty tables, and falls back to non-chunking conversion requests while logging chunking outcomes and request errors.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Parser as RAGFlow Parser
    participant Docling as Docling-serve
    participant Local as _sections_from_remote_text

    Parser->>Docling: POST /v1/convert/source (payload with options.do_chunking + chunking_options)
    alt chunking 200 (list or response.results)
        Docling-->>Parser: chunked response
        Parser->>Local: _sections_from_remote_text(chunk.text / chunk.chunk.text)
        Local-->>Parser: sections
        Parser-->>Caller: return (sections, tables=[])
    else chunking non-2xx (or rejected)
        Parser->>Docling: POST /v1alpha/convert/source (chunking payload)
        alt alpha chunking 200
            Docling-->>Parser: chunked response
            Parser->>Local: _sections_from_remote_text(...)
            Local-->>Parser: sections
            Parser-->>Caller: return (sections, tables=[])
        else chunking failed or endpoints return non-chunked
            Parser->>Docling: POST /v1/convert/source (non-chunking payload)
            Docling-->>Parser: conversion documents (md/text/json)
            Parser->>Local: extract docs -> _sections_from_remote_text
            Local-->>Parser: sections, tables
            Parser-->>Caller: return (sections, tables)
        end
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 2
🧹 Nitpick comments (2)
deepdoc/parser/docling_parser.py (2)
429-444: Add logging for the new chunking flow.

Per coding guidelines, new flows should include logging. The chunking endpoint path only uses `callback` but lacks `self.logger` calls for debugging/observability. Consider adding debug logging when entering the chunking flow and when chunks are processed.

♻️ Proposed refactor: Add logging statements

```diff
 # --- NEW: Handle Native Chunked Response ---
 if is_chunked_response:
+    self.logger.info("[Docling] Using native chunking endpoint response")
     # The chunking endpoint returns an array of chunk items
     chunks = response_json if isinstance(response_json, list) else response_json.get("results", [])
     for chunk_data in chunks:
         # Depending on the exact docling-serve spec, the text might be nested
         chunk_text = chunk_data.get("text", "")
         if not chunk_text and "chunk" in chunk_data:
             chunk_text = chunk_data["chunk"].get("text", "")
         if chunk_text.strip():
             # Feed the pre-sliced chunks directly into RAGFlow's expected format
             sections.extend(self._sections_from_remote_text(chunk_text, parse_method=parse_method))
     if callback:
         callback(0.95, f"[Docling] Native chunks received: {len(sections)}")
+    self.logger.debug(f"[Docling] Processed {len(chunks)} chunks into {len(sections)} sections")
     return sections, tables
```

As per coding guidelines: "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepdoc/parser/docling_parser.py` around lines 429 - 444, The chunking branch triggered when is_chunked_response lacks any logger calls; add debug/info logs using self.logger at the start of the branch (mentioning that chunked response handling has started and include any identifying request id if available), log the number of raw chunks found (len(chunks)) before the loop, and log progress when processing chunks (e.g., per-chunk index or every N chunks) and a final log after building sections and tables (include len(sections) and len(tables)); keep the existing callback usage but add these self.logger.debug/info calls around the code that uses chunks, chunk_text, and _sections_from_remote_text to aid observability.
389-395: Consider making chunking parameters configurable.

The `max_tokens` and `overlap_tokens` values are hardcoded. These parameters directly impact chunking behavior and may need tuning based on different embedding models' context limits. Consider exposing these as method parameters or environment variables for flexibility.

♻️ Proposed refactor: Add parameters with defaults

```diff
 def _parse_pdf_remote(
     self,
     filepath: str | PathLike[str],
     binary: BytesIO | bytes | None = None,
     callback: Optional[Callable] = None,
     *,
     parse_method: str = "raw",
     docling_server_url: Optional[str] = None,
     request_timeout: Optional[int] = None,
+    chunk_max_tokens: int = 512,
+    chunk_overlap_tokens: int = 50,
 ):
     # ...
     chunking_payload = {
         "sources": [{"kind": "file", "filename": filename, "base64_string": b64}],
         "options": {
-            "max_tokens": 512,  # A safe default to prevent embedding overflows
-            "overlap_tokens": 50
+            "max_tokens": chunk_max_tokens,
+            "overlap_tokens": chunk_overlap_tokens
         }
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepdoc/parser/docling_parser.py` around lines 389 - 395, The chunking parameters are hardcoded in the chunking_payload block; make max_tokens and overlap_tokens configurable by adding parameters (e.g., max_chunk_tokens and overlap_chunk_tokens) to the surrounding function that builds chunking_payload in docling_parser.py, defaulting to values read from environment variables (e.g., via os.getenv with int conversion) and falling back to 512 and 50; update chunking_payload to use these parameters and adjust the function signature and docstring accordingly so callers can tune embedding chunk sizes per model.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deepdoc/parser/docling_parser.py`:
- Around line 44-45: The conditional import placed inside the except block
causes extract_pdf_outlines to be undefined when RAGFlowPdfParser imports
successfully; move the import statement for extract_pdf_outlines out of the
try/except so it is always executed (ensure extract_pdf_outlines is imported at
module scope alongside or above the RAGFlowPdfParser import), keeping the
existing try/except only around RAGFlowPdfParser so that extract_pdf_outlines is
always available for the code that calls it (e.g., where extract_pdf_outlines is
referenced at line ~483).
- Around line 432-440: The loop over chunks in the parser (the for chunk_data in
chunks block in docling_parser.py) assumes each chunk_data is a dict and calls
chunk_data.get(), which can raise if malformed; add a defensive type guard:
check isinstance(chunk_data, dict) before using .get(), and if chunk_data is a
str or bytes treat it as the chunk text (assign chunk_text = chunk_data), or if
it's None/other skip the item, then proceed to call
_sections_from_remote_text(chunk_text, parse_method=parse_method) only when
chunk_text is a non-empty string; update the for-loop handling accordingly.
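Taken together, the defensive handling the prompt above asks for could look like this standalone helper (the name `extract_chunk_text` is hypothetical, not from the PR; this is a sketch of the guard, not the actual parser code):

```python
def extract_chunk_text(chunk_data):
    """Defensively pull the text out of one chunk item, whatever its shape."""
    # Bytes: decode before treating as text.
    if isinstance(chunk_data, bytes):
        chunk_data = chunk_data.decode("utf-8", errors="replace")
    # Plain strings: the item itself is the chunk text.
    if isinstance(chunk_data, str):
        return chunk_data.strip()
    # Dicts: prefer top-level "text", then a nested "chunk" object.
    if isinstance(chunk_data, dict):
        text = chunk_data.get("text", "")
        if not text and isinstance(chunk_data.get("chunk"), dict):
            text = chunk_data["chunk"].get("text", "")
        return text.strip() if isinstance(text, str) else ""
    # None or any other type: skip the item by returning an empty string.
    return ""
```

The caller then only feeds non-empty results into `_sections_from_remote_text(...)`, so a malformed item in the response can never raise an `AttributeError` mid-loop.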
---
Nitpick comments:
In `@deepdoc/parser/docling_parser.py`:
- Around line 429-444: The chunking branch triggered when is_chunked_response
lacks any logger calls; add debug/info logs using self.logger at the start of
the branch (mentioning that chunked response handling has started and include
any identifying request id if available), log the number of raw chunks found
(len(chunks)) before the loop, and log progress when processing chunks (e.g.,
per-chunk index or every N chunks) and a final log after building sections and
tables (include len(sections) and len(tables)); keep the existing callback usage
but add these self.logger.debug/info calls around the code that uses chunks,
chunk_text, and _sections_from_remote_text to aid observability.
- Around line 389-395: The chunking parameters are hardcoded in the
chunking_payload block; make max_tokens and overlap_tokens configurable by
adding parameters (e.g., max_chunk_tokens and overlap_chunk_tokens) to the
surrounding function that builds chunking_payload in docling_parser.py,
defaulting to values read from environment variables (e.g., via os.getenv with
int conversion) and falling back to 512 and 50; update chunking_payload to use
these parameters and adjust the function signature and docstring accordingly so
callers can tune embedding chunk sizes per model.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3c44dc51-c795-4518-b1ea-62e03c87a03c
📒 Files selected for processing (1)
deepdoc/parser/docling_parser.py
Actionable comments posted: 2
🧹 Nitpick comments (2)
deepdoc/parser/docling_parser.py (2)
389-395: Consider making chunking parameters configurable.

The `max_tokens` and `overlap_tokens` values are hardcoded. Different embedding models have varying context limits (512, 1024, 8192 tokens, etc.), and users may need to tune these for their specific use case.

♻️ Proposed refactor: Add parameters to method signature

```diff
 def _parse_pdf_remote(
     self,
     filepath: str | PathLike[str],
     binary: BytesIO | bytes | None = None,
     callback: Optional[Callable] = None,
     *,
     parse_method: str = "raw",
     docling_server_url: Optional[str] = None,
     request_timeout: Optional[int] = None,
+    max_tokens: int = 512,
+    overlap_tokens: int = 50,
 ):
```

Then use the parameters in the payload:

```diff
 chunking_payload = {
     "sources": [{"kind": "file", "filename": filename, "base64_string": b64}],
     "options": {
-        "max_tokens": 512,  # A safe default to prevent embedding overflows
-        "overlap_tokens": 50
+        "max_tokens": max_tokens,
+        "overlap_tokens": overlap_tokens
     }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepdoc/parser/docling_parser.py` around lines 389 - 395, The chunking parameters are hardcoded in the chunking_payload dict; make them configurable by adding parameters (e.g., max_tokens=512, overlap_tokens=50) to the enclosing function/method signature in docling_parser.py and use those parameters when building chunking_payload (replace the literal 512 and 50). Ensure the defaults match current values, propagate the new args through any callers if necessary, and validate or clamp values as needed before inserting them into the "options" -> "max_tokens" and "overlap_tokens" fields.
401-420: Add logging for the new chunking flow.

Per coding guidelines, new flows should include logging. Consider adding a debug/info log when attempting chunking endpoints and when falling back to conversion endpoints, which would aid troubleshooting.

♻️ Proposed addition

```diff
 # We prioritize the new chunking endpoints first!
 for endpoint, payload, chunk_flag in (
     ("/v1/chunk/source", chunking_payload, True),        # New stable chunking
     ("/v1alpha/chunk/source", chunking_payload, True),   # New alpha chunking
     ("/v1/convert/source", v1_payload, False),           # Fallback to standard
     ("/v1alpha/convert/source", v1alpha_payload, False), # Fallback to alpha standard
 ):
     try:
+        self.logger.debug(f"[Docling] Trying endpoint: {endpoint} (chunked={chunk_flag})")
         resp = requests.post(
             f"{server_url}{endpoint}",
             json=payload,
             timeout=timeout,
         )
         if resp.status_code < 300:
             response_json = resp.json()
             is_chunked_response = chunk_flag
+            self.logger.info(f"[Docling] Success with endpoint: {endpoint}")
             break
         errors.append(f"{endpoint}: HTTP {resp.status_code} {resp.text[:300]}")
     except Exception as exc:
         errors.append(f"{endpoint}: {exc}")
```

As per coding guidelines: "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepdoc/parser/docling_parser.py` around lines 401 - 420, The loop that tries chunking and conversion endpoints (the block iterating over endpoint, payload, chunk_flag using server_url, chunking_payload, v1_payload, v1alpha_payload, timeout, resp, response_json, is_chunked_response, errors) needs logging: add a debug/info log before each requests.post indicating which endpoint and whether it is a chunking attempt (chunk_flag True) or a conversion fallback, log the response status and short body when resp.status_code < 300 or when appending to errors, and log exceptions inside the except block; use the module or class logger already in this file (or create one if missing) and ensure logs clearly indicate endpoint, chunk_flag and outcome so the new chunking flow is traceable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deepdoc/parser/docling_parser.py`:
- Around line 436-438: The code assumes chunk_data["chunk"] is a dict when
computing chunk_text; add a defensive type check before calling .get on
chunk_data["chunk"] in the chunk_text assignment: if chunk_data.get("chunk") is
a dict, use its .get("text", "") value, otherwise handle other types (e.g., if
it's a str use it directly, if None or unexpected type treat as empty string).
Update the logic around the chunk_text and chunk_data variables (the existing
chunk_text = chunk_data.get("text", "") / chunk_data["chunk"].get("text", ""))
to first inspect isinstance(chunk_data.get("chunk"), dict) and branch
accordingly so no AttributeError is raised.
- Around line 425-446: The chunked-response branch currently only extracts text
into sections and leaves tables empty; update the block guarded by
is_chunked_response so that for each chunk object in response_json you also
extract any table metadata (e.g., chunk.get("tables") or chunk.get("table") /
chunk.get("do_table_structure") variants) and pass that chunk-level table
payload into the existing _transfer_to_tables(...) helper to convert/append
results into the local tables list (merging with any existing tables) before
returning; continue to call _sections_from_remote_text(...) for text, and keep
the callback usage (callback(0.95,...)) but include the final counts for both
sections and tables when reporting.
---
Nitpick comments:
In `@deepdoc/parser/docling_parser.py`:
- Around line 389-395: The chunking parameters are hardcoded in the
chunking_payload dict; make them configurable by adding parameters (e.g.,
max_tokens=512, overlap_tokens=50) to the enclosing function/method signature in
docling_parser.py and use those parameters when building chunking_payload
(replace the literal 512 and 50). Ensure the defaults match current values,
propagate the new args through any callers if necessary, and validate or clamp
values as needed before inserting them into the "options" -> "max_tokens" and
"overlap_tokens" fields.
- Around line 401-420: The loop that tries chunking and conversion endpoints
(the block iterating over endpoint, payload, chunk_flag using server_url,
chunking_payload, v1_payload, v1alpha_payload, timeout, resp, response_json,
is_chunked_response, errors) needs logging: add a debug/info log before each
requests.post indicating which endpoint and whether it is a chunking attempt
(chunk_flag True) or a conversion fallback, log the response status and short
body when resp.status_code < 300 or when appending to errors, and log exceptions
inside the except block; use the module or class logger already in this file (or
create one if missing) and ensure logs clearly indicate endpoint, chunk_flag and
outcome so the new chunking flow is traceable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bdfe475d-da31-43d4-a230-bfd7669e37f4
📒 Files selected for processing (1)
deepdoc/parser/docling_parser.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
deepdoc/parser/docling_parser.py (1)
401-420: ⚠️ Potential issue | 🟠 Major

Only downgrade to `/convert` when chunking is actually unsupported.

This loop falls through to the old convert path after any non-2xx from the chunk routes. That means payload bugs, auth failures, or transient 5xxs silently disable native chunking and can reintroduce the oversized unchunked-text failure this PR is fixing. Please keep the fallback for "endpoint missing/unsupported" responses and surface a real log when chunking is skipped.

As per coding guidelines: "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepdoc/parser/docling_parser.py` around lines 401 - 420, The loop currently falls back to /convert for any non-2xx from chunking endpoints; change it so when chunk_flag is True you only treat the endpoint as "unsupported" and proceed to the fallback if the response indicates the route is missing/unsupported (e.g., 404/405/410/501), but for other non-2xx codes (auth failures, payload bugs, 5xx) do not silently downgrade—append to errors, log a clear message, and surface/raise the failure instead of continuing; specifically update the logic around requests.post/resp handling (symbols: server_url, endpoint, chunk_flag, resp, response_json, is_chunked_response, errors, timeout) to check status codes, use the module logger (getLogger) to log a warning/error when chunking is skipped or when non-supported vs real failures occur, and only set is_chunked_response and break for successful 2xx or explicit "unsupported" detection.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@deepdoc/parser/docling_parser.py`:
- Around line 401-420: The loop currently falls back to /convert for any non-2xx
from chunking endpoints; change it so when chunk_flag is True you only treat the
endpoint as "unsupported" and proceed to the fallback if the response indicates
the route is missing/unsupported (e.g., 404/405/410/501), but for other non-2xx
codes (auth failures, payload bugs, 5xx) do not silently downgrade—append to
errors, log a clear message, and surface/raise the failure instead of
continuing; specifically update the logic around requests.post/resp handling
(symbols: server_url, endpoint, chunk_flag, resp, response_json,
is_chunked_response, errors, timeout) to check status codes, use the module
logger (getLogger) to log a warning/error when chunking is skipped or when
non-supported vs real failures occur, and only set is_chunked_response and break
for successful 2xx or explicit "unsupported" detection.
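The status-code split this comment asks for can be isolated into a tiny decision function (a sketch; the name `chunk_failure_action` and the exact status set are illustrative, chosen from the codes the comment lists):

```python
# Statuses that mean "this route does not exist / is not supported here".
UNSUPPORTED_ROUTE_STATUSES = {404, 405, 410, 501}

def chunk_failure_action(status_code):
    """Decide what a non-2xx response from a chunk endpoint should trigger.

    "fallback": the route is simply missing, so quietly try /convert next.
    "surface":  auth failures, payload bugs, or 5xxs; log loudly and raise
                instead of silently downgrading to unchunked conversion.
    """
    if status_code in UNSUPPORTED_ROUTE_STATUSES:
        return "fallback"
    return "surface"
```

Keeping the classification in one place makes the "only downgrade when unsupported" rule testable on its own, independent of the request loop.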
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 27e437af-4d25-4d15-85a6-9051c904988e
📒 Files selected for processing (1)
deepdoc/parser/docling_parser.py
Actionable comments posted: 1
🧹 Nitpick comments (2)
deepdoc/parser/docling_parser.py (2)
389-395: Consider making chunking parameters configurable.

The `max_tokens` and `overlap_tokens` values are hardcoded. Different embedding models have varying context limits, and users may want to tune these for their specific use cases.

♻️ Proposed approach: Add parameters to method signature

```diff
 def _parse_pdf_remote(
     self,
     filepath: str | PathLike[str],
     binary: BytesIO | bytes | None = None,
     callback: Optional[Callable] = None,
     *,
     parse_method: str = "raw",
     docling_server_url: Optional[str] = None,
     request_timeout: Optional[int] = None,
+    max_tokens: int = 512,
+    overlap_tokens: int = 50,
 ):
```

Then use these parameters in the chunking payload:

```diff
 chunking_payload = {
     "sources": [{"kind": "file", "filename": filename, "base64_string": b64}],
     "options": {
-        "max_tokens": 512,
-        "overlap_tokens": 50
+        "max_tokens": max_tokens,
+        "overlap_tokens": overlap_tokens
     }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepdoc/parser/docling_parser.py` around lines 389 - 395, The hardcoded chunking params (max_tokens=512, overlap_tokens=50) should be made configurable: add optional parameters (e.g., max_tokens and overlap_tokens with the same defaults) to the public method that builds chunking_payload (or to the DoclingParser class constructor if you prefer instance-wide defaults), then replace the literal values in chunking_payload with those parameters; also propagate these values when callers invoke the method or when constructing the parser so users can tune per-model settings.
418-427: Add INFO-level logging when falling back from chunking to conversion endpoints.

The commit message mentions "add logging for that fallback path," but when chunking endpoints return 404 (indicating they're unavailable), there's no log message indicating the fallback is occurring. This would help with debugging and operational visibility.

📝 Proposed fix: Add fallback logging

```diff
 # --- UPDATED LOGIC ---
 if chunk_flag and resp.status_code != 404:
     self.logger.error(f"[Docling] Chunking failed with {resp.status_code}: {resp.text[:300]}")
     errors.append(f"{endpoint}: HTTP {resp.status_code}")
     break  # <-- THE MAGIC FIX: This instantly kills the loop so it cannot fallback
 # ---------------------
+
+if chunk_flag and resp.status_code == 404:
+    self.logger.info(f"[Docling] Chunking endpoint {endpoint} not available (404), falling back to conversion endpoints")
 errors.append(f"{endpoint}: HTTP {resp.status_code} {resp.text[:300]}")
```

As per coding guidelines: "Add logging for new flows".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepdoc/parser/docling_parser.py` around lines 418 - 427, When chunking endpoints return 404 you should log an INFO message to indicate you're falling back to conversion endpoints: inside the same block that checks chunk_flag and examines resp.status_code (the if handling chunk_flag and resp.status_code != 404), add an else branch for the 404 case that calls self.logger.info with a message like "[Docling] Chunking endpoint returned 404, falling back to conversion for {endpoint}" (use endpoint and resp.status_code/resp.text for context), then append the error to errors as currently done and continue so the loop can try conversion endpoints; target the method containing chunk_flag, resp, errors and endpoint in docling_parser.py.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deepdoc/parser/docling_parser.py`:
- Around line 448-452: The code reads chunk_text from chunk_data and then calls
chunk_text.strip() without ensuring it's a string; update the block around the
chunk_text assignment and the subsequent if check (the chunk_text/ chunk_data
handling and the "if chunk_text.strip():" condition) to first validate or coerce
the value: e.g., check isinstance(chunk_text, str) and only call .strip() when
True, or else set chunk_text = "" (or coerce with str(chunk_text) if non-None
values should be preserved) before using .strip(); ensure both branches (from
chunk_data.get("text") and chunk_data["chunk"].get("text")) follow this guard so
AttributeError cannot occur.
---
Nitpick comments:
In `@deepdoc/parser/docling_parser.py`:
- Around line 389-395: The hardcoded chunking params (max_tokens=512,
overlap_tokens=50) should be made configurable: add optional parameters (e.g.,
max_tokens and overlap_tokens with the same defaults) to the public method that
builds chunking_payload (or to the DoclingParser class constructor if you prefer
instance-wide defaults), then replace the literal values in chunking_payload
with those parameters; also propagate these values when callers invoke the
method or when constructing the parser so users can tune per-model settings.
- Around line 418-427: When chunking endpoints return 404 you should log an INFO
message to indicate you're falling back to conversion endpoints: inside the same
block that checks chunk_flag and examines resp.status_code (the if handling
chunk_flag and resp.status_code != 404), add an else branch for the 404 case
that calls self.logger.info with a message like "[Docling] Chunking endpoint
returned 404, falling back to conversion for {endpoint}" (use endpoint and
resp.status_code/resp.text for context), then append the error to errors as
currently done and continue so the loop can try conversion endpoints; target the
method containing chunk_flag, resp, errors and endpoint in docling_parser.py.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 60dfdb52-e85e-48f7-9c10-928c73c3b3e2
📒 Files selected for processing (1)
deepdoc/parser/docling_parser.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
deepdoc/parser/docling_parser.py (2)
401-458: ⚠️ Potential issue | 🟡 Minor

Add normal logs for the new chunking/fallback path.

This flow only surfaces path selection through `callback(...)`, so the runtime logs still won't show whether native chunking was used or whether we fell back after 404s. Please add `self.logger.info(...)` / `self.logger.warning(...)` around chunk-endpoint success and convert fallback. As per coding guidelines: "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepdoc/parser/docling_parser.py` around lines 401 - 458, Add informational/warning logs to surface which path was taken: when a chunking endpoint succeeds (inside the loop where resp.status_code < 300 and chunk_flag is True) call self.logger.info with endpoint and that native chunking will be used (and later after building sections log count via self.logger.info before returning); when a chunking endpoint returns 404 (the code path that currently falls through to convert endpoints) call self.logger.warning indicating endpoint 404 and that the code will fall back to standard convert endpoints; also add a self.logger.warning when a network exception occurs and chunk_flag is True (the except block that currently breaks) to indicate a hard failure on chunking and that fallback/abort is happening; reference variables/methods: endpoint, chunk_flag, is_chunked_response, response_json, callback, and _sections_from_remote_text to locate where to insert these logs.
402-432: ⚠️ Potential issue | 🟠 Major

Try the second chunk endpoint before aborting.

Line 423 and line 432 stop the loop on the first non-404 chunk failure, so `/v1alpha/chunk/source` is never attempted if `/v1/chunk/source` rejects the request. That turns the second preferred chunk route into dead code for some compatibility cases and can fail the parse even though another chunk endpoint would succeed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepdoc/parser/docling_parser.py` around lines 402 - 432, The loop currently aborts on the first chunk endpoint error because both the HTTP-error branch and the except block use break when chunk_flag is True; change those breaks so chunk endpoints don't kill the entire try-order. Specifically, in the for-loop handling (the tuple with "/v1/chunk/source" and "/v1alpha/chunk/source") remove or replace the break statements that occur when chunk_flag is True (both inside the "if chunk_flag and resp.status_code != 404:" branch and the "except Exception as exc:" block where "if chunk_flag: break") so that errors.append(...) is kept and the loop continues to try the next chunk endpoint; keep the break only for non-chunk fallbacks or when a successful resp.status_code < 300 is seen (response_json/is_chunked_response logic).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@deepdoc/parser/docling_parser.py`:
- Around line 401-458: Add informational/warning logs to surface which path was
taken: when a chunking endpoint succeeds (inside the loop where resp.status_code
< 300 and chunk_flag is True) call self.logger.info with endpoint and that
native chunking will be used (and later after building sections log count via
self.logger.info before returning); when a chunking endpoint returns 404 (the
code path that currently falls through to convert endpoints) call
self.logger.warning indicating endpoint 404 and that the code will fall back to
standard convert endpoints; also add a self.logger.warning when a network
exception occurs and chunk_flag is True (the except block that currently breaks)
to indicate a hard failure on chunking and that fallback/abort is happening;
reference variables/methods: endpoint, chunk_flag, is_chunked_response,
response_json, callback, and _sections_from_remote_text to locate where to
insert these logs.
- Around line 402-432: The loop currently aborts on the first chunk endpoint
error because both the HTTP-error branch and the except block use break when
chunk_flag is True; change those breaks so chunk endpoints don't kill the entire
try-order. Specifically, in the for-loop handling (the tuple with
"/v1/chunk/source" and "/v1alpha/chunk/source") remove or replace the break
statements that occur when chunk_flag is True (both inside the "if chunk_flag
and resp.status_code != 404:" branch and the "except Exception as exc:" block
where "if chunk_flag: break") so that errors.append(...) is kept and the loop
continues to try the next chunk endpoint; keep the break only for non-chunk
fallbacks or when a successful resp.status_code < 300 is seen
(response_json/is_chunked_response logic).
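The continue-on-error try-order described above can be reduced to a small loop skeleton (a sketch only; `try_endpoints` and the injected `post` callable are illustrative stand-ins for the parser's `requests.post` loop):

```python
def try_endpoints(post, endpoints):
    """Walk the preferred-endpoint order, collecting errors instead of aborting.

    post(endpoint) -> (status_code, body); endpoints is [(path, chunk_flag), ...].
    A failing chunk route appends to errors and continues, so it can never
    hide the next candidate; only a 2xx breaks out of the loop.
    """
    errors = []
    for endpoint, chunk_flag in endpoints:
        try:
            status, body = post(endpoint)
        except Exception as exc:
            errors.append(f"{endpoint}: {exc}")
            continue
        if status < 300:
            return endpoint, body, chunk_flag
        errors.append(f"{endpoint}: HTTP {status}")
    raise RuntimeError("all endpoints failed: " + "; ".join(errors))
```

With this shape, a 500 from `/v1/chunk/source` still lets `/v1alpha/chunk/source` run, which is exactly the dead-code case the comment flags.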
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4e86b55c-7e58-491e-80fa-91f465bd5a4b
📒 Files selected for processing (1)
deepdoc/parser/docling_parser.py
@coderabbitai resume

🧠 Learnings used

✅ Actions performed: Reviews resumed.
🧹 Nitpick comments (1)
deepdoc/parser/docling_parser.py (1)
389-395: Consider making chunking parameters configurable.

The `max_tokens=512` and `overlap_tokens=50` values are hardcoded. While 512 is a safe conservative default, different embedding models have varying context windows (e.g., 8192 for some models). Making these configurable via method parameters or environment variables would allow optimization for specific deployments.

♻️ Suggested approach

```diff
 def _parse_pdf_remote(
     self,
     filepath: str | PathLike[str],
     binary: BytesIO | bytes | None = None,
     callback: Optional[Callable] = None,
     *,
     parse_method: str = "raw",
     docling_server_url: Optional[str] = None,
     request_timeout: Optional[int] = None,
+    chunk_max_tokens: Optional[int] = None,
+    chunk_overlap_tokens: Optional[int] = None,
 ):
     ...
+    max_tokens = chunk_max_tokens or int(os.environ.get("DOCLING_CHUNK_MAX_TOKENS", 512))
+    overlap_tokens = chunk_overlap_tokens or int(os.environ.get("DOCLING_CHUNK_OVERLAP_TOKENS", 50))
+
     chunking_payload = {
         "sources": [{"kind": "file", "filename": filename, "base64_string": b64}],
         "options": {
-            "max_tokens": 512,
-            "overlap_tokens": 50
+            "max_tokens": max_tokens,
+            "overlap_tokens": overlap_tokens
         }
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepdoc/parser/docling_parser.py` around lines 389 - 395, The hardcoded chunking parameters in chunking_payload (max_tokens: 512 and overlap_tokens: 50) should be made configurable: add optional parameters (e.g., max_tokens and overlap_tokens) to the surrounding function in docling_parser.py (or read from environment variables as fallbacks), defaulting to 512 and 50 to preserve current behavior, and use those parameters when building chunking_payload; update any call sites and unit tests to pass or mock the new options and document the new configuration options.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@deepdoc/parser/docling_parser.py`:
- Around line 389-395: The hardcoded chunking parameters in chunking_payload
(max_tokens: 512 and overlap_tokens: 50) should be made configurable: add
optional parameters (e.g., max_tokens and overlap_tokens) to the surrounding
function in docling_parser.py (or read from environment variables as fallbacks),
defaulting to 512 and 50 to preserve current behavior, and use those parameters
when building chunking_payload; update any call sites and unit tests to pass or
mock the new options and document the new configuration options.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deepdoc/parser/docling_parser.py`:
- Around line 394-402: The chunking_payload structure is incorrect for Docling
Serve: update the payload created where chunking_payload is defined to use the
/v1/convert/source contract by setting options.do_chunking = True and moving
token settings into options.chunking_options with keys max_tokens, overlap
(rename overlap_tokens -> overlap) and tokenizer; replace calls that hit
/v1/chunk/source or /v1alpha/chunk/source to call /v1/convert/source instead;
make max_tokens, overlap, and tokenizer configurable (e.g., parameters or config
values passed into the function that builds chunking_payload) so they are not
hardcoded.
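A minimal rendering of the payload shape this comment describes may make the contract easier to see. The field names (`do_chunking`, `chunking_options`, `overlap`, `tokenizer`) follow the review text and are assumptions, not verified against docling-serve's published schema:

```python
def build_convert_payload(filename, b64, max_tokens=512, overlap=50,
                          tokenizer="sentencepiece"):
    """Chunking-enabled payload for /v1/convert/source (shape per the review)."""
    return {
        "sources": [{"kind": "file", "filename": filename, "base64_string": b64}],
        "options": {
            "from_formats": ["pdf"],
            "to_formats": ["json", "md", "text"],
            "do_chunking": True,
            "chunking_options": {
                "max_tokens": max_tokens,
                "overlap": overlap,  # note: "overlap", not "overlap_tokens"
                "tokenizer": tokenizer,
            },
        },
    }
```

Keeping the builder as a small function also makes the later configurability suggestion trivial to apply.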
🧹 Nitpick comments (2)
deepdoc/parser/docling_parser.py (2)
396-405: Consider making chunking parameters configurable.

The chunking options (`max_tokens: 512`, `overlap: 50`, `tokenizer: "sentencepiece"`) are hardcoded. Different embedding models and use cases may require different token limits. Consider exposing these as constructor parameters or environment variables.

♻️ Proposed refactor to add configurability

```diff
 class DoclingParser(RAGFlowPdfParser):
-    def __init__(self, docling_server_url: str = "", request_timeout: int = 600):
+    def __init__(
+        self,
+        docling_server_url: str = "",
+        request_timeout: int = 600,
+        chunk_max_tokens: int = 512,
+        chunk_overlap: int = 50,
+        chunk_tokenizer: str = "sentencepiece",
+    ):
         self.logger = logging.getLogger(self.__class__.__name__)
         ...
+        self.chunk_max_tokens = chunk_max_tokens
+        self.chunk_overlap = chunk_overlap
+        self.chunk_tokenizer = chunk_tokenizer
```

Then in `_parse_pdf_remote`:

```diff
 chunking_opts = {
     "from_formats": ["pdf"],
     "to_formats": ["json", "md", "text"],
     "do_chunking": True,
     "chunking_options": {
-        "max_tokens": 512,
-        "overlap": 50,
-        "tokenizer": "sentencepiece"
+        "max_tokens": self.chunk_max_tokens,
+        "overlap": self.chunk_overlap,
+        "tokenizer": self.chunk_tokenizer,
     }
 }
```
Verify each finding against the current code and only fix it if needed. In `@deepdoc/parser/docling_parser.py` around lines 396 - 405, The chunking options are hardcoded in chunking_opts inside _parse_pdf_remote (max_tokens: 512, overlap: 50, tokenizer: "sentencepiece"); make these configurable by adding constructor parameters (or reading env vars) to the parser class (e.g., DoclingParser.__init__) such as chunk_max_tokens, chunk_overlap, chunk_tokenizer, defaulting to the current values, then replace the literal values in chunking_opts with those instance attributes so callers can override them.
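As one way to realize the "or environment variables" half of this suggestion, a small helper could read the defaults from the environment. The variable names below are hypothetical, chosen only for illustration:

```python
import os

def chunking_options_from_env():
    """Chunking defaults, overridable via (hypothetical) env vars."""
    return {
        "max_tokens": int(os.environ.get("DOCLING_CHUNK_MAX_TOKENS", "512")),
        "overlap": int(os.environ.get("DOCLING_CHUNK_OVERLAP", "50")),
        "tokenizer": os.environ.get("DOCLING_CHUNK_TOKENIZER", "sentencepiece"),
    }
```

An env-var approach keeps the parser's constructor signature stable, which matters if callers instantiate `DoclingParser` in several places.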
354-360: Docstring references outdated endpoint paths.

The docstring mentions `/v1/chunk/source` and `/v1alpha/chunk/source`, but the implementation actually uses `/v1/convert/source` and `/v1alpha/convert/source` with `do_chunking: true`. Consider updating the docstring to accurately reflect the implementation.

📝 Suggested docstring update

```diff
 """
 Parses a PDF document using a remote Docling server.
-Prioritizes native chunking endpoints (/v1/chunk/source, /v1alpha/chunk/source)
-to prevent token overflow, with a graceful fallback to standard conversion
-endpoints if chunking is unavailable.
+Prioritizes native chunking via /v1/convert/source with do_chunking=true
+to prevent token overflow, with a graceful fallback to standard conversion
+when chunking is unavailable.
 """
```
Verify each finding against the current code and only fix it if needed. In `@deepdoc/parser/docling_parser.py` around lines 354 - 360, The docstring incorrectly states native chunking endpoints; update the docstring in the function containing "Parses a PDF document using a remote Docling server." to reflect the actual endpoints used (/v1/convert/source and /v1alpha/convert/source) and note that chunking is achieved by sending do_chunking: true rather than calling /chunk/* endpoints (reference docling_parser module and the function whose docstring contains the quoted text).
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main   #14218      +/-   ##
==========================================
- Coverage   98.11%   96.66%   -1.45%
==========================================
  Files          10       10
  Lines         690      690
  Branches      108      108
==========================================
- Hits          677      667      -10
- Misses          4        8       +4
- Partials        9       15       +6
```

☔ View full report in Codecov by Sentry.
Sync with infiniflow/ragflow main (v0.25.0 tag + subsequent commits).

Main upstream changes:
- Major REST API refactor: user/tenant/stats/plugin/chunk/document/file2document/connector/langfuse/mcp/system/chat/search all moved to api/apps/restful_apis/
- feat: Agent API (infiniflow#14157) / deepseek v4 (infiniflow#14346) / docling native chunking (infiniflow#14218) / Astraflow provider (infiniflow#14270) / title chunk optimization (infiniflow#14325)
- refactor: doc metadata update / agent webhook → REST / removed legacy MCP web API
- Go side: retrieval_test in Go / stream think chat / zhipu thinking / minimax provider / balance command
- Assorted fixes: think tags normalization / upload truncation / audio-video pipeline
- 7 ingestion_pipeline templates renamed (Book→book, etc.)

Conflict resolution (2 files):
- CLAUDE.md: kept the fork-specific block (fork info / ports / macOS local / active docs / first-run notes / Task Executor) and appended the 8 new upstream Working Style items as a separate section
- web/src/pages/dataset/testing/testing-result.tsx: kept the HEAD version (the P1.7-C4 design-aligned version already contains the RAGFlowPagination + isRuned empty-state copy that upstream adds; the upstream version is a subset of HEAD)

Verification:
- `pytest test/agent_v2/` → 301 passed / 8 skipped (RAGFLOW_TEST_DB gate)
- The other conflicted files auto-merged successfully: dialog_service.py / docker/.env / pyproject.toml / uv.lock / locales/en.ts / dataset-table.tsx / testing-form.tsx

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…flow#14218)

Resolves infiniflow#14211

**Background:** Currently, RAGFlow routes all Docling parsing through the standard `/convert/source` endpoint. For large documents, this returns massive, unchunked text that exceeds RAGFlow's internal embedding model context limits, causing pipeline failures.

**Solution:** This PR updates the `_parse_pdf_remote` ingestion logic in `docling_parser.py` to prioritize `docling-serve`'s native chunking endpoints (`/v1/chunk/source` and `/v1alpha/chunk/source`).
- By receiving pre-sliced chunk objects directly from Docling, RAGFlow natively bypasses token limit overflows.
- Included a graceful fallback mechanism to the standard `/convert/source` endpoints to maintain backwards compatibility for users running older versions of the Docling server that return 404s on the new routes.

### Type of change
- [x] New Feature (non-breaking change which adds functionality)
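The chunk-first flow with graceful fallback described above can be sketched roughly as follows. The endpoint paths, payload option names, and the two chunked-response shapes (a bare list, or a dict with a `chunks` key) are assumptions drawn from this PR's discussion, not a verified docling-serve contract:

```python
import json
import urllib.error
import urllib.request

def sections_from_chunked(resp):
    """Normalize a chunked response (bare list or {'chunks': [...]}) into sections."""
    chunks = resp if isinstance(resp, list) else resp.get("chunks", [])
    sections = []
    for ch in chunks:
        text = ch.get("text") or ch.get("chunk", {}).get("text", "")
        if text:
            sections.append((text, ""))  # (text, position) pairs; tables stay empty
    return sections

def _post_json(url, payload, timeout):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

def parse_remote(base_url, payload, timeout=600):
    """Try native chunking on /v1, then /v1alpha; fall back to plain conversion."""
    for ver in ("v1", "v1alpha"):
        try:
            chunked = _post_json(f"{base_url}/{ver}/convert/source", payload, timeout)
            return sections_from_chunked(chunked), []  # chunked path: no tables
        except urllib.error.URLError:
            continue  # older servers 404 or reject the chunking options
    # Fallback: strip chunking options and use the standard conversion route.
    plain_opts = {k: v for k, v in payload.get("options", {}).items()
                  if k not in ("do_chunking", "chunking_options")}
    plain = {**payload, "options": plain_opts}
    return _post_json(f"{base_url}/v1/convert/source", plain, timeout), []
```

Catching `URLError` (which `HTTPError` subclasses) covers both connection failures and 404s from older servers, mirroring the backwards-compatibility fallback the PR describes.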