Skip to content

Refactor: migrate doc upload info used in chat#14359

Merged
yingfeng merged 7 commits into
infiniflow:mainfrom
xugangqiang:doc-upload-info
Apr 27, 2026
Merged

Refactor: migrate doc upload info used in chat#14359
yingfeng merged 7 commits into
infiniflow:mainfrom
xugangqiang:doc-upload-info

Conversation

@xugangqiang
Copy link
Copy Markdown
Contributor

@xugangqiang xugangqiang commented Apr 24, 2026

What problem does this PR solve?

Before migration: POST /v1/document/upload_info/
After migration: POST /api/v1/documentss/upload/

Type of change

  • Refactoring

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Replaces the legacy POST /upload_info route with a new POST /documents/upload REST endpoint; moves upload handling to api/apps/restful_apis/document_api.py, updates frontend API/service/hook to call the new route, and adjusts tests and test helpers accordingly.

Changes

Cohort / File(s) Summary
Backend - Removed legacy route
api/apps/document_app.py
Deleted the old POST /upload_info handler and its validation/multipart/url handling.
Backend - New REST endpoint
api/apps/restful_apis/document_api.py
Added POST /documents/upload (upload_info(tenant_id)): mutually-exclusive multipart file(s) or url input, SSRF check via assert_url_is_safe for URLs, delegates to FileService.upload_info, wraps with get_result, and maps unexpected errors to server_error_response.
Frontend - API mapping
web/src/utils/api.ts
Renamed uploadAndParsedocumentInfoUpload; endpoint URL changed from ${webAPI}/document/upload_info to ${restAPIv1}/documents/upload.
Frontend - Service & Hook updates
web/src/services/next-chat-service.ts, web/src/hooks/use-chat-request.ts
Replaced uploadAndParse usage with documentInfoUpload; upload still uses FormData, forwards abort signal, reports progress, and handles callbacks.
Tests - client helper & integration tests
test/testcases/test_web_api/test_common.py, test/testcases/test_web_api/test_document_app/test_upload_info_unit.py
Added upload_info test helper that posts multipart files or url, ensures file descriptors are closed, and replaced unit mocks with E2E-style tests asserting numeric response codes and behavior for single/multiple files and mixed/empty inputs.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client (Browser)
    participant Frontend as Frontend (hook/service)
    participant API as API Server (/documents/upload)
    participant FileSvc as FileService

    Client->>Frontend: Trigger upload (files or ?url=...)
    Frontend->>API: POST /api/v1/documents/upload (FormData or query `url`)
    API->>API: assert_url_is_safe(url)  %% for URL path
    API->>FileSvc: FileService.upload_info(tenant_id, files or url)
    FileSvc-->>API: upload metadata/result
    API-->>Frontend: JSON result (get_result)
    Frontend-->>Client: onSuccess / onError handlers
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

☯️ refactor, 🧪 test, 🐖api

Suggested reviewers

  • JinHai-CN
  • yingfeng
  • yuzhichang

Poem

🐰 I hopped from old routes to brand-new ground,
Files in my paws and URLs all safe and sound.
FormData packed snug, SSRF held at bay,
New endpoint awaits — hop, upload, hooray!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 46.15% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: migrating the document upload info endpoint used in chat from the old path to a new one.
Description check ✅ Passed The description provides the required template sections with the problem statement and change type identified, though the endpoint path contains a typo ('documentss' instead of 'documents').
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@xugangqiang xugangqiang marked this pull request as ready for review April 24, 2026 10:49
@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Apr 24, 2026
@xugangqiang xugangqiang added the ci Continue Integration label Apr 24, 2026
@xugangqiang xugangqiang self-assigned this Apr 24, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
web/src/hooks/use-chat-request.ts (1)

491-505: ⚠️ Potential issue | 🟠 Major

Remove conversation_id from FormData or use a conversation-scoped upload endpoint instead.

The POST /documents/upload endpoint is a generic document info extraction service that only accepts file and url parameters. It calls FileService.upload_info() which returns basic file metadata without any conversation or dataset association. The conversation_id sent in FormData (line 493) is ignored.

If chat-scoped document linking is required, this implementation is incomplete. A separate conversation-scoped upload flow exists in document_app.py that calls doc_upload_and_parse(conversation_id, ...) and properly resolves the conversation to its associated dataset before uploading. The frontend should either:

  1. Use that conversation-scoped endpoint instead, or
  2. Remove conversation_id from this endpoint and handle dataset routing differently in the chat context
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@web/src/hooks/use-chat-request.ts` around lines 491 - 505, The FormData
currently includes a conversation_id which is ignored by the generic upload
endpoint; update the frontend to either remove the conversation_id from the
FormData when calling chatService.documentInfoUpload (api.documentInfoUpload) so
only file (and url if used) are sent, or switch to the conversation-scoped
upload endpoint used by document_app.py that invokes
doc_upload_and_parse(conversation_id, ...) to correctly resolve dataset linking;
locate the upload logic in use-chat-request.ts (the FormData append calls and
the chatService.documentInfoUpload invocation) and implement one of these two
fixes so the backend FileService.upload_info() receives only supported fields or
use the proper conversation-scoped flow.
🧹 Nitpick comments (2)
api/apps/restful_apis/document_api.py (1)

74-75: Minor: redundant files.get("file") guard.

MultiDict.getlist("file") already returns [] when the key is absent, so files and files.get("file") is redundant and slightly misleading (it would also short-circuit to [] if a caller sent file="", which is arguably fine but handled elsewhere by an explicit empty-name check).

-    files = await request.files
-    file_objs = files.getlist("file") if files and files.get("file") else []
+    files = await request.files
+    file_objs = files.getlist("file") if files else []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 74 - 75, The guard
checking files and files.get("file") is redundant; change the assignment so
file_objs is obtained directly from the MultiDict using files.getlist("file")
(i.e., replace the conditional expression with a direct call to
files.getlist("file")), leaving the preceding files = await request.files and
any existing explicit empty-name checks intact; this simplifies the logic and
relies on MultiDict.getlist to return [] when the key is absent.
web/src/hooks/use-chat-request.ts (1)

47-47: Consider renaming stale UploadAndParse / uploadAndParseFile identifiers for consistency.

The endpoint and service mapping were renamed to documentInfoUpload, but the ChatApiAction.UploadAndParse = 'upload_and_parse' enum (Line 47), the mutation key, and the exported uploadAndParseFile (Line 528) still carry the old name. Callers like web/src/pages/next-chats/hooks/use-upload-file.ts continue to import useUploadAndParseFile / destructure uploadAndParseFile. This is functionally fine but will drift further as the migration continues.

If you want a clean rename in this PR, update the enum, mutation key, return property, hook name, and the two call sites in use-upload-file.ts and use-send-shared-message (if applicable) in one pass. Otherwise, fine to defer.

Also applies to: 528-528

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@web/src/hooks/use-chat-request.ts` at line 47, Rename the stale
UploadAndParse identifiers to documentInfoUpload for consistency: update the
enum entry ChatApiAction.UploadAndParse = 'upload_and_parse' to
ChatApiAction.DocumentInfoUpload = 'document_info_upload', change the mutation
key and the exported function name uploadAndParseFile to documentInfoUploadFile
(and the hook useUploadAndParseFile to useDocumentInfoUploadFile), and update
all call sites mentioned (use-upload-file.ts and use-send-shared-message
imports/destructuring) to use the new names so names match the endpoint/service
mapping documentInfoUpload.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 84-94: The handler currently calls the blocking
FileService.upload_info synchronously (and in the URL case triggers asyncio.run
inside an existing event loop), so change all branches (url, single file,
multiple files) to offload FileService.upload_info to the worker thread using
the existing thread_pool_exec helper (same pattern as upload_document): call
await thread_pool_exec(partial(FileService.upload_info, tenant_id,
file_obj_or_None, url_or_None)) for each invocation, collect results for the
list branch, and then return get_result(data=results) (or single result) so the
Quart event loop is not blocked and asyncio.run inside upload_info executes in a
fresh loop in the worker thread.
- Around line 42-94: The upload_info handler is missing logging and filename
validation: add debug/info logs for start, success and error paths in
upload_info (similar to upload_document) and validate each uploaded file's
filename against empty/None and FILE_NAME_LEN_LIMIT before calling
FileService.upload_info; on invalid names return get_error_argument_result with
a 400-like message and log the validation failure, and ensure exceptions still
use server_error_response while logging the exception.

---

Outside diff comments:
In `@web/src/hooks/use-chat-request.ts`:
- Around line 491-505: The FormData currently includes a conversation_id which
is ignored by the generic upload endpoint; update the frontend to either remove
the conversation_id from the FormData when calling
chatService.documentInfoUpload (api.documentInfoUpload) so only file (and url if
used) are sent, or switch to the conversation-scoped upload endpoint used by
document_app.py that invokes doc_upload_and_parse(conversation_id, ...) to
correctly resolve dataset linking; locate the upload logic in
use-chat-request.ts (the FormData append calls and the
chatService.documentInfoUpload invocation) and implement one of these two fixes
so the backend FileService.upload_info() receives only supported fields or use
the proper conversation-scoped flow.

---

Nitpick comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 74-75: The guard checking files and files.get("file") is
redundant; change the assignment so file_objs is obtained directly from the
MultiDict using files.getlist("file") (i.e., replace the conditional expression
with a direct call to files.getlist("file")), leaving the preceding files =
await request.files and any existing explicit empty-name checks intact; this
simplifies the logic and relies on MultiDict.getlist to return [] when the key
is absent.

In `@web/src/hooks/use-chat-request.ts`:
- Line 47: Rename the stale UploadAndParse identifiers to documentInfoUpload for
consistency: update the enum entry ChatApiAction.UploadAndParse =
'upload_and_parse' to ChatApiAction.DocumentInfoUpload = 'document_info_upload',
change the mutation key and the exported function name uploadAndParseFile to
documentInfoUploadFile (and the hook useUploadAndParseFile to
useDocumentInfoUploadFile), and update all call sites mentioned
(use-upload-file.ts and use-send-shared-message imports/destructuring) to use
the new names so names match the endpoint/service mapping documentInfoUpload.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 187f308b-195a-404e-ac16-31c2dbe7b0cc

📥 Commits

Reviewing files that changed from the base of the PR and between beb2406 and 0a9d55f.

📒 Files selected for processing (5)
  • api/apps/document_app.py
  • api/apps/restful_apis/document_api.py
  • web/src/hooks/use-chat-request.ts
  • web/src/services/next-chat-service.ts
  • web/src/utils/api.ts
💤 Files with no reviewable changes (1)
  • api/apps/document_app.py

Comment thread api/apps/restful_apis/document_api.py
Comment thread api/apps/restful_apis/document_api.py
@xugangqiang xugangqiang changed the title WIP Refactor: migrate doc upload info used in chat Refactor: migrate doc upload info used in chat Apr 24, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.30%. Comparing base (6a23dfe) to head (5262741).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #14359   +/-   ##
=======================================
  Coverage   95.30%   95.30%           
=======================================
  Files          10       10           
  Lines         703      703           
  Branches      112      112           
=======================================
  Hits          670      670           
  Misses         16       16           
  Partials       17       17           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
api/apps/restful_apis/document_api.py (1)

47-100: ⚠️ Potential issue | 🔴 Critical

Past-review issues still unresolved on the new upload_info handler.

Two previously raised concerns are still present in this revision:

  1. Blocking sync I/O from async handler + RuntimeError on URL branch (Lines 89‑100). FileService.upload_info is synchronous and, on the URL branch, calls asyncio.run(adownload()) internally. Invoking it directly from this async def Quart handler will (a) block the event loop for file uploads and (b) raise RuntimeError: asyncio.run() cannot be called from a running event loop for any ?url=... request. The sibling handler upload_document already uses thread_pool_exec for the same service layer (Line 441) — this endpoint must do the same.
  2. Missing logging and filename validation (Lines 79‑100). No logging calls on success/error paths (per **/*.py: Add logging for new flows), and no empty-filename / FILE_NAME_LEN_LIMIT validation on file_objs (see the established pattern at Lines 420‑427). A file with filename == "" or None will reach FileService.upload_infoDocumentService.check_doc_health(user_id, file.filename) and surface as a 500 instead of a 400.
🛠️ Proposed fix
+    from api.constants import FILE_NAME_LEN_LIMIT
+
     files = await request.files
     file_objs = files.getlist("file") if files and files.get("file") else []
     url = request.args.get("url")
 
     if file_objs and url:
+        logging.error("upload_info: both file and url provided")
         return get_error_argument_result("Provide either multipart file(s) or ?url=..., not both.")
 
     if not file_objs and not url:
+        logging.error("upload_info: missing input")
         return get_error_argument_result("Missing input: provide multipart file(s) or url")
 
+    for f in file_objs:
+        if f is None or not f.filename:
+            return get_error_argument_result("No file selected!")
+        if len(f.filename.encode("utf-8")) > FILE_NAME_LEN_LIMIT:
+            return get_error_argument_result(
+                f"File name must be {FILE_NAME_LEN_LIMIT} bytes or less."
+            )
+
     try:
         if url and not file_objs:
             assert_url_is_safe(url)
-            return get_result(data=FileService.upload_info(tenant_id, None, url))
+            data = await thread_pool_exec(FileService.upload_info, tenant_id, None, url)
+            return get_result(data=data)
 
         if len(file_objs) == 1:
-            return get_result(data=FileService.upload_info(tenant_id, file_objs[0], None))
+            data = await thread_pool_exec(FileService.upload_info, tenant_id, file_objs[0], None)
+            return get_result(data=data)
 
-        results = [FileService.upload_info(tenant_id, f, None) for f in file_objs]
+        results = [
+            await thread_pool_exec(FileService.upload_info, tenant_id, f, None)
+            for f in file_objs
+        ]
         return get_result(data=results)
     except Exception as e:
+        logging.exception("upload_info failed")
         return server_error_response(e)

As per coding guidelines: **/*.py: Add logging for new flows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 47 - 100, The upload_info
Quart handler is calling the synchronous FileService.upload_info directly from
an async handler (and will hit asyncio.run() in the URL branch), and it lacks
logging and filename validation; update upload_info to run
FileService.upload_info calls inside thread_pool_exec (reuse the same pattern as
the upload_document handler) for the URL branch and for each file upload to
avoid blocking/RuntimeError, add logging calls (info on success, error on
exceptions) using the module logger, and validate each file in file_objs for a
non-empty filename and length <= FILE_NAME_LEN_LIMIT (return a 400 error via
get_error_argument_result and log the validation failure) before invoking
FileService.upload_info; reference functions/variables upload_info (handler),
FileService.upload_info, thread_pool_exec, upload_document (sibling),
FILE_NAME_LEN_LIMIT, get_error_argument_result, and server_error_response to
implement these changes.
🧹 Nitpick comments (3)
test/testcases/test_web_api/test_common.py (1)

343-370: Use proper query-string encoding for the url parameter.

Line 357 concatenates url into the request URL without encoding. Any url containing &, ?, =, spaces, or non-ASCII characters will corrupt the request and cause silently misleading test failures. Pass it through requests' params= instead so it’s properly percent-encoded.

♻️ Proposed refactor
-    url_endpoint = f"{HOST_ADDRESS}/api/{VERSION}/documents/upload"
+    url_endpoint = f"{HOST_ADDRESS}/api/{VERSION}/documents/upload"
+    params = {"url": url} if url else None
 
     fields = []
     file_objects = []
     try:
         if files_path:
             for fp in files_path:
                 p = Path(fp)
                 f = p.open("rb")
                 fields.append(("file", (p.name, f)))
                 file_objects.append(f)
 
-        # Add url as query parameter if provided
-        if url:
-            url_endpoint = f"{url_endpoint}?url={url}"
-
         # Handle empty fields (no files) - create empty MultipartEncoder
         if not fields:
             fields = [("empty", ("", ""))]
 
         m = MultipartEncoder(fields=fields)
 
         res = requests.post(
             url=url_endpoint,
             headers={"Content-Type": m.content_type},
             auth=auth,
             data=m,
+            params=params,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/testcases/test_web_api/test_common.py` around lines 343 - 370, The test
constructs url_endpoint by concatenating the unencoded url string into the URL;
instead, build a params dict (e.g., params = {"url": url} when url is truthy)
and pass it to requests.post via the params= parameter so the value is properly
percent-encoded; keep creating the MultipartEncoder from fields and set
headers={"Content-Type": m.content_type} and data=m when calling requests.post
(use requests.post(url=url_endpoint, params=params, headers=..., data=m,
auth=auth)), and ensure you only add the params entry when url is provided.
test/testcases/test_web_api/test_document_app/test_upload_info_unit.py (2)

90-110: Test name overstates coverage; URL branch is never exercised.

The method name claims to cover "url, single, and multiple files", but only file uploads are tested — the URL path (which is what triggered the SSRF guard and assert_url_is_safe addition in this PR) is left as a comment. Consider either renaming the test to reflect what's actually covered, or adding a small URL-branch test using an HTTP fixture / responses-style mock so the new SSRF gate gets exercised.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/testcases/test_web_api/test_document_app/test_upload_info_unit.py`
around lines 90 - 110, The test
test_upload_info_supports_url_single_and_multiple_files_e2e currently never
exercises the URL branch; either rename it to reflect only file uploads or add a
real URL-path assertion: use a local HTTP fixture (e.g., responses or an
httpserver fixture) to mock a reachable URL and call upload_info(WebApiAuth,
url="http://...") so the SSRF gate and assert_url_is_safe path are exercised,
then assert res["code"] and res["data"] (and for multiple file case keep the
existing file assertions); ensure you reference the upload_info call and that
assert_url_is_safe is hit by the mocked URL rather than leaving the URL as a
commented note.

17-61: Remove leftover unit-test scaffolding.

_AwaitableValue, _DummyFiles, _DummyFile, _DummyRequest, _run, and the asyncio import are leftovers from the old mocked unit-test approach and are no longer referenced after the switch to TestUploadInfoE2E. Drop them to avoid confusion.

♻️ Proposed cleanup
-import asyncio
-
 import pytest
 from test_common import upload_info
 from configs import INVALID_API_TOKEN
 from libs.auth import RAGFlowWebApiAuth
 from utils.file_utils import create_txt_file


-class _AwaitableValue:
-    def __init__(self, value):
-        self._value = value
-
-    def __await__(self):
-        async def _co():
-            return self._value
-
-        return _co().__await__()
-
-
-class _DummyFiles(dict):
-    def getlist(self, key):
-        value = self.get(key, [])
-        if isinstance(value, list):
-            return value
-        return [value]
-
-
-class _DummyFile:
-    def __init__(self, filename):
-        self.filename = filename
-
-
-class _DummyRequest:
-    def __init__(self, *, files=None, args=None):
-        self._files = files or _DummyFiles()
-        self.args = args or {}
-
-    `@property`
-    def files(self):
-        return _AwaitableValue(self._files)
-
-
-def _run(coro):
-    return asyncio.run(coro)
-
-
 # ============================================================================
 # End-to-End Tests
 # ============================================================================
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/testcases/test_web_api/test_document_app/test_upload_info_unit.py`
around lines 17 - 61, Remove the leftover mocked unit-test scaffolding: delete
the asyncio import plus the helper class/function definitions _AwaitableValue,
_DummyFiles, _DummyFile, _DummyRequest and the _run function from the test file,
and keep only the real test imports (e.g. pytest, upload_info,
INVALID_API_TOKEN, RAGFlowWebApiAuth, create_txt_file) used by
TestUploadInfoE2E; ensure no remaining references to those removed symbols
remain in the file and adjust imports if any become unused.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 47-100: The upload_info Quart handler is calling the synchronous
FileService.upload_info directly from an async handler (and will hit
asyncio.run() in the URL branch), and it lacks logging and filename validation;
update upload_info to run FileService.upload_info calls inside thread_pool_exec
(reuse the same pattern as the upload_document handler) for the URL branch and
for each file upload to avoid blocking/RuntimeError, add logging calls (info on
success, error on exceptions) using the module logger, and validate each file in
file_objs for a non-empty filename and length <= FILE_NAME_LEN_LIMIT (return a
400 error via get_error_argument_result and log the validation failure) before
invoking FileService.upload_info; reference functions/variables upload_info
(handler), FileService.upload_info, thread_pool_exec, upload_document (sibling),
FILE_NAME_LEN_LIMIT, get_error_argument_result, and server_error_response to
implement these changes.

---

Nitpick comments:
In `@test/testcases/test_web_api/test_common.py`:
- Around line 343-370: The test constructs url_endpoint by concatenating the
unencoded url string into the URL; instead, build a params dict (e.g., params =
{"url": url} when url is truthy) and pass it to requests.post via the params=
parameter so the value is properly percent-encoded; keep creating the
MultipartEncoder from fields and set headers={"Content-Type": m.content_type}
and data=m when calling requests.post (use requests.post(url=url_endpoint,
params=params, headers=..., data=m, auth=auth)), and ensure you only add the
params entry when url is provided.

In `@test/testcases/test_web_api/test_document_app/test_upload_info_unit.py`:
- Around line 90-110: The test
test_upload_info_supports_url_single_and_multiple_files_e2e currently never
exercises the URL branch; either rename it to reflect only file uploads or add a
real URL-path assertion: use a local HTTP fixture (e.g., responses or an
httpserver fixture) to mock a reachable URL and call upload_info(WebApiAuth,
url="http://...") so the SSRF gate and assert_url_is_safe path are exercised,
then assert res["code"] and res["data"] (and for multiple file case keep the
existing file assertions); ensure you reference the upload_info call and that
assert_url_is_safe is hit by the mocked URL rather than leaving the URL as a
commented note.
- Around line 17-61: Remove the leftover mocked unit-test scaffolding: delete
the asyncio import plus the helper class/function definitions _AwaitableValue,
_DummyFiles, _DummyFile, _DummyRequest and the _run function from the test file,
and keep only the real test imports (e.g. pytest, upload_info,
INVALID_API_TOKEN, RAGFlowWebApiAuth, create_txt_file) used by
TestUploadInfoE2E; ensure no remaining references to those removed symbols
remain in the file and adjust imports if any become unused.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: aa210061-2c4f-40db-853b-74862c5172fd

📥 Commits

Reviewing files that changed from the base of the PR and between bd52e7b and 00072ea.

📒 Files selected for processing (4)
  • api/apps/restful_apis/document_api.py
  • test/testcases/test_web_api/test_common.py
  • test/testcases/test_web_api/test_document_app/test_upload_info_unit.py
  • web/src/utils/api.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • web/src/utils/api.ts

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
api/apps/restful_apis/document_api.py (1)

94-105: ⚠️ Potential issue | 🔴 Critical

Blocking sync I/O on async handler — URL branch will still raise RuntimeError at runtime.

This is the same defect raised in the prior review and not yet fixed. FileService.upload_info is a synchronous function performing blocking I/O (put_blob, file read()), and in the URL branch it eventually calls asyncio.run(adownload()) (see api/db/services/file_service.py upload_info → crawl path). Calling asyncio.run() from inside this already-running Quart event loop raises RuntimeError: asyncio.run() cannot be called from a running event loop, so any ?url=… request 500s. The file branches "work" but block the loop for the entire upload.

The sibling _upload_local_documents at Lines 565-568 already uses thread_pool_exec — the new endpoint should follow the same pattern. Offloading also gives the URL branch its own fresh loop in the worker thread, which makes asyncio.run inside upload_info safe.

Also folding in the still-missing logging hooks per the repo guideline (the prior "addressed" marker doesn't appear to match the current source).

🛠️ Proposed fix
     try:
         if url and not file_objs:
             assert_url_is_safe(url)
-            return get_result(data=FileService.upload_info(tenant_id, None, url))
+            data = await thread_pool_exec(FileService.upload_info, tenant_id, None, url)
+            return get_result(data=data)
 
         if len(file_objs) == 1:
-            return get_result(data=FileService.upload_info(tenant_id, file_objs[0], None))
+            data = await thread_pool_exec(FileService.upload_info, tenant_id, file_objs[0], None)
+            return get_result(data=data)
 
-        results = [FileService.upload_info(tenant_id, f, None) for f in file_objs]
+        results = [
+            await thread_pool_exec(FileService.upload_info, tenant_id, f, None)
+            for f in file_objs
+        ]
         return get_result(data=results)
     except Exception as e:
+        logging.exception("upload_info failed")
         return server_error_response(e)

As per coding guidelines: "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 94 - 105, The URL branch
in document_api.py calls the synchronous FileService.upload_info (which may
invoke asyncio.run), causing RuntimeError when invoked inside the running Quart
event loop; refactor the URL branch (and the other file-processing branches) to
offload blocking work to the thread pool like the sibling
_upload_local_documents uses (call thread_pool_exec to run
FileService.upload_info in a worker thread) so the URL path gets a fresh event
loop and doesn't block the main loop; also add logging calls (using the same
logger pattern used elsewhere) around the start/finish/error of the upload flow
and keep existing guards like assert_url_is_safe and the existing
get_result/server_error_response logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 94-97: The current try/except around the URL-upload branch
swallows ValueError from assert_url_is_safe and turns SSRF rejections into 500s;
update the exception handling in the block that checks "if url and not
file_objs" (where assert_url_is_safe and FileService.upload_info are called) to
explicitly catch ValueError and return an argument/bad-request response (use the
existing argument_error_response or equivalent) instead of falling through to
the generic exception handler, while leaving other exceptions to be handled as
before.

---

Duplicate comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 94-105: The URL branch in document_api.py calls the synchronous
FileService.upload_info (which may invoke asyncio.run), causing RuntimeError
when invoked inside the running Quart event loop; refactor the URL branch (and
the other file-processing branches) to offload blocking work to the thread pool
like the sibling _upload_local_documents uses (call thread_pool_exec to run
FileService.upload_info in a worker thread) so the URL path gets a fresh event
loop and doesn't block the main loop; also add logging calls (using the same
logger pattern used elsewhere) around the start/finish/error of the upload flow
and keep existing guards like assert_url_is_safe and the existing
get_result/server_error_response logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ebaf1743-1380-4287-944d-b4af0aace4b5

📥 Commits

Reviewing files that changed from the base of the PR and between 00072ea and ba86b52.

📒 Files selected for processing (4)
  • api/apps/document_app.py
  • api/apps/restful_apis/document_api.py
  • test/testcases/test_web_api/test_common.py
  • web/src/utils/api.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • api/apps/document_app.py
  • test/testcases/test_web_api/test_common.py

Comment thread api/apps/restful_apis/document_api.py Outdated
@yingfeng yingfeng merged commit 61a24a2 into infiniflow:main Apr 27, 2026
2 checks passed
@xugangqiang xugangqiang deleted the doc-upload-info branch May 15, 2026 06:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci Continue Integration size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants