Refa: unify document create flows under REST documents API #14345
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough
Consolidates document creation into a dataset-scoped REST endpoint.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor Client
    participant API as REST API\n/datasets/{id}/documents
    participant Validator as Validation & Storage
    participant DB as Document DB
    rect rgba(100, 200, 100, 0.5)
        note over Client,DB: Mode = "local" (multipart file upload)
        Client->>API: POST ?type=local (multipart file)
        API->>Validator: Validate file, check duplicates
        Validator->>Validator: Infer FileType, choose storage key
        Validator->>Validator: Persist blob, generate thumbnail
        API->>DB: Insert document record
        API->>Client: Return mapped document
    end
    rect rgba(100, 150, 200, 0.5)
        note over Client,DB: Mode = "web" (HTML page crawl)
        Client->>API: POST ?type=web (form: name, url)
        API->>Validator: Validate name & URL, check duplicates
        Validator->>Validator: Download HTML, convert to PDF blob
        Validator->>Validator: Persist PDF blob, generate thumbnail
        API->>DB: Insert document record
        API->>Client: Return mapped document
    end
    rect rgba(200, 150, 100, 0.5)
        note over Client,DB: Mode = "empty" (virtual document)
        Client->>API: POST ?type=empty (JSON: name)
        API->>Validator: Validate name, check duplicates
        API->>DB: Create virtual document record (FileType.VIRTUAL, empty location)
        API->>Client: Return virtual document
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
web/src/hooks/use-document-request.ts (1)
520-544: ⚠️ Potential issue | 🟠 Major: Refresh the document list after a successful web crawl.
This mutation now creates a document through the unified API, but unlike `useCreateDocument()` it only shows a toast. The new document won't become visible until a manual refresh, and if the user is not on page 1 it should also reset pagination so the freshly created row can appear.
Suggested fix
```diff
 export const useNextWebCrawl = () => {
   const { knowledgeId } = useGetKnowledgeSearchParams();
+  const { setPaginationParams, page } = useSetPaginationParams();
+  const queryClient = useQueryClient();
   const {
     data,
     isPending: loading,
     mutateAsync,
@@
       const ret = await webCrawlDocument(knowledgeId, formData);
       const code = get(ret, 'code');
       if (code === 0) {
+        if (page === 1) {
+          queryClient.invalidateQueries({
+            queryKey: [DocumentApiAction.FetchDocumentList],
+          });
+        } else {
+          setPaginationParams();
+        }
         message.success(i18n.t('message.uploaded'));
       }
       return code;
     },
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@web/src/hooks/use-document-request.ts` around lines 520 - 544, The web crawl mutation in useNextWebCrawl (mutationFn using webCrawlDocument) currently only shows a toast and returns a code, so newly created documents don't appear until a manual refresh; update the success path to trigger the same post-create behavior as useCreateDocument: after detecting code === 0, call the documents list refresh and reset pagination to page 1 (e.g., invoke the documents-list hook's mutate/refresh method or the same refresh handler used by useCreateDocument and set currentPage = 1) so the new row becomes visible immediately; ensure you reference useNextWebCrawl, webCrawlDocument, and the documents list refresh/mutate function from the documents hook when implementing this.
docs/references/http_api_reference.md (1)
1376-1439: ⚠️ Potential issue | 🟠 Major: Document the non-list response shapes for `type=web` and `type=empty`.
This section adds two new creation modes, but the only success shape documented below is still the `local` upload array. The backend returns a single document object for both `type=web` and `type=empty`, so API consumers following this page will parse the response incorrectly for those modes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/references/http_api_reference.md` around lines 1376 - 1439, The docs only show the multipart/local upload response (an array) but omit that POST /api/v1/datasets/{dataset_id}/documents with query param `type=web` and `type=empty` returns a single document object; update the "Request" and response examples/parameters to explicitly document the non-list response shape for `type=web` and `type=empty` (single JSON document object), adjust the example curl outputs and the "Request parameters" response description accordingly, and include a brief example JSON response for both `web` and `empty` modes so consumers parse them as an object instead of an array.
🧹 Nitpick comments (4)
api/apps/restful_apis/document_api.py (3)
336-339: Dead code: parser-override regexes can never match in the `web` branch.
In this branch the stored filename is always `f"{name}.pdf"` (line 308) before `duplicate_name` is applied, so the suffixes `\.(ppt|pptx|pages)$` and `\.(eml)$` will never match. These two `re.search` blocks appear to have been copied from the `local` upload path, where arbitrary suffixes can appear. They can be dropped from the web branch.
♻️ Proposed diff
```diff
 if doc["type"] == FileType.VISUAL:
     doc["parser_id"] = ParserType.PICTURE.value
 if doc["type"] == FileType.AURAL:
     doc["parser_id"] = ParserType.AUDIO.value
-if re.search(r"\.(ppt|pptx|pages)$", filename):
-    doc["parser_id"] = ParserType.PRESENTATION.value
-if re.search(r"\.(eml)$", filename):
-    doc["parser_id"] = ParserType.EMAIL.value
```
Similarly, the `VISUAL`/`AURAL` overrides are unreachable for PDF output; you can arguably drop those too. Left conservative here in case `html2pdf` can emit non-PDF in some edge case.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/apps/restful_apis/document_api.py` around lines 336 - 339, In document_api.py, remove the unreachable parser-override regex checks that set doc["parser_id"] for presentations and emails (the re.search(r"\.(ppt|pptx|pages)$", filename) and re.search(r"\.(eml)$", filename) blocks) from the web upload branch where filename is forced to f"{name}.pdf"; these conditions can never match in the web path so delete those two if-blocks (leave other parser_id/override logic intact).
281-387: Add info-level logging for the new `web` and `empty` flows.
Both new branches only log on error paths (`logging.error`). Per the "`**/*.py`: Add logging for new flows" guideline, consider adding a couple of info logs at key success milestones (e.g., dataset + mode on entry, and a success log before `get_result(...)`) to aid operational debugging of the unified endpoint.
As per coding guidelines: "`**/*.py`: Add logging for new flows".
♻️ Suggested additions
```diff
 if upload_type == "web":
+    logging.info(f"upload_document web-crawl requested: dataset_id={dataset_id}")
     form = await request.form
     ...
-    DocumentService.insert(doc)
+    DocumentService.insert(doc)
+    logging.info(f"Web-crawled document created: dataset_id={dataset_id}, doc_id={doc['id']}, name=(unknown)")
     FileService.add_file_from_kb(doc, kb_folder["id"], kb.tenant_id)
 if upload_type == "empty":
+    logging.info(f"upload_document empty requested: dataset_id={dataset_id}")
     req = await get_request_json()
     ...
-    FileService.add_file_from_kb(doc.to_dict(), kb_folder["id"], kb.tenant_id)
+    FileService.add_file_from_kb(doc.to_dict(), kb_folder["id"], kb.tenant_id)
+    logging.info(f"Empty document created: dataset_id={dataset_id}, doc_id={doc.id}, name={name}")
 return get_result(data=map_doc_keys(doc))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/apps/restful_apis/document_api.py` around lines 281 - 387, Add info-level logs for both the "web" and "empty" upload branches: log dataset id and upload_type on entry (inside the upload_type == "web" and upload_type == "empty" blocks) and log a success message just before the successful get_result returns (before the return that calls get_result in the "web" branch and before the get_result in the "empty" branch). Use the existing logger (e.g., logging.info or the module logger) and include identifying fields like dataset_id, kb.id or current_user.id and the resulting document id/name (from doc or filename) to aid tracing; place logs near the symbols shown (upload_type == "web", blob handling and the return get_result(data=map_doc_keys_with_run_status(...)) in the web flow, and around DocumentService.insert(...) and return get_result(data=map_doc_keys(...)) in the empty flow).
298-305: Consider offloading blocking I/O to thread pool to prevent event loop stalls.
`html2pdf(url)` at line 298 performs a synchronous HTTP fetch via Selenium (`driver.get`) that can block indefinitely on an unresponsive site; the `timeout` parameter (default=2) only applies to DOM staleness after the page loads, not to the HTTP connection itself. The subsequent `FileService.*` calls execute synchronous database queries via the Peewee ORM, further stalling the async handler. Since this is now the unified front door for document creation and will receive higher traffic, run both the PDF conversion and the file service operations on the thread pool, mirroring the pattern already established at line 411 with `await thread_pool_exec(FileService.upload_document, ...)`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/apps/restful_apis/document_api.py` around lines 298 - 305, The synchronous html2pdf(url) call and subsequent synchronous Peewee FileService calls (FileService.get_root_folder, FileService.init_knowledgebase_docs, FileService.get_kb_folder, FileService.new_a_file_from_kb) must be executed off the event loop using the thread pool; replace the direct calls with await thread_pool_exec(html2pdf, url) for the PDF conversion and await thread_pool_exec(FileService.get_root_folder, current_user.id), await thread_pool_exec(FileService.init_knowledgebase_docs, root_id, current_user.id), await thread_pool_exec(FileService.get_kb_folder, current_user.id), and await thread_pool_exec(FileService.new_a_file_from_kb, kb.tenant_id, kb.name, kb_root_id) (matching the existing pattern used at the other call site) so blocking I/O runs in threads and does not stall the async handler.
test/testcases/test_web_api/test_document_app/conftest.py (1)
133-190: Extract shared fixture setup to avoid ~60 lines of duplication.
The `document_rest_api_module` fixture duplicates almost all of `document_app_module` (stubs for `common`, `deepdoc.parser.*`, `xgboost`, `api.apps`, dummy manager). Only the final `module_path` differs. Consider factoring the stub setup into a helper so future changes (adding a new stub, fixing a path) don't need to be mirrored in two places.
♻️ Proposed refactor sketch
```python
def _install_document_api_stubs(monkeypatch, repo_root):
    common_pkg = ModuleType("common")
    common_pkg.__path__ = [str(repo_root / "common")]
    monkeypatch.setitem(sys.modules, "common", common_pkg)

    deepdoc_pkg = ModuleType("deepdoc")
    deepdoc_parser_pkg = ModuleType("deepdoc.parser")
    deepdoc_parser_pkg.__path__ = []

    class _Stub:
        pass

    deepdoc_parser_pkg.PdfParser = _Stub
    deepdoc_pkg.parser = deepdoc_parser_pkg
    monkeypatch.setitem(sys.modules, "deepdoc", deepdoc_pkg)
    monkeypatch.setitem(sys.modules, "deepdoc.parser", deepdoc_parser_pkg)

    for modname, attr in [
        ("deepdoc.parser.excel_parser", "RAGFlowExcelParser"),
        ("deepdoc.parser.html_parser", "RAGFlowHtmlParser"),
        ("deepdoc.parser.mineru_parser", "MinerUParser"),
        ("deepdoc.parser.paddleocr_parser", "PaddleOCRParser"),
    ]:
        m = ModuleType(modname)
        setattr(m, attr, _Stub)
        monkeypatch.setitem(sys.modules, modname, m)

    monkeypatch.setitem(sys.modules, "xgboost", ModuleType("xgboost"))

    stub_apps = ModuleType("api.apps")
    stub_apps.current_user = SimpleNamespace(id="user-1")
    stub_apps.login_required = lambda func: func
    monkeypatch.setitem(sys.modules, "api.apps", stub_apps)


def _load_module(name, module_path):
    spec = importlib.util.spec_from_file_location(name, module_path)
    module = importlib.util.module_from_spec(spec)
    module.manager = _DummyManager()
    spec.loader.exec_module(module)
    return module


@pytest.fixture()
def document_app_module(monkeypatch):
    repo_root = Path(__file__).resolve().parents[4]
    _install_document_api_stubs(monkeypatch, repo_root)
    return _load_module("test_document_app_unit", repo_root / "api" / "apps" / "document_app.py")


@pytest.fixture()
def document_rest_api_module(monkeypatch):
    repo_root = Path(__file__).resolve().parents[4]
    _install_document_api_stubs(monkeypatch, repo_root)
    return _load_module(
        "test_document_api_unit",
        repo_root / "api" / "apps" / "restful_apis" / "document_api.py",
    )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/testcases/test_web_api/test_document_app/conftest.py` around lines 133 - 190, The document_rest_api_module fixture duplicates almost all setup from document_app_module; extract the shared monkeypatch/stub logic into a helper (e.g., _install_document_api_stubs) that creates the ModuleType stubs (common, deepdoc.parser.*, xgboost, api.apps) and a loader helper (e.g., _load_module) that creates a spec, sets module.manager = _DummyManager and execs the module; then simplify document_app_module and document_rest_api_module to call _install_document_api_stubs(monkeypatch, repo_root) and return _load_module(...) for their respective module paths, keeping the existing symbols ModuleType, _DummyManager, document_app_module and document_rest_api_module names so tests continue to find the same fixtures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/testcases/test_web_api/test_document_app/test_create_document.py`:
- Around line 139-140: The assertion in test_create_document.py currently
expects "Can't find this dataset!" but the API handler (in
api/apps/restful_apis/document_api.py) returns a message formatted as "Can't
find the dataset with ID {dataset_id}!", so update the test in
test_create_document.py (the failing assertions for res["code"] and
res["message"]) to either assert the exact message using the dataset_id used in
the test (match "Can't find the dataset with ID {dataset_id}!") or assert using
a substring check (e.g., verify "Can't find the dataset" in res["message"])
similar to the pattern used in test_upload_documents.py::test_invalid_kb_id.
- Around line 104-127: The tests fail because upload_document calls
KnowledgebaseService.get_by_id and check_kb_team_permission before filename
validation; add the same mocks used in other tests so the handler proceeds to
name checks: in both test_filename_too_long and test_filename_whitespace
monkeypatch module.KnowledgebaseService.get_by_id to return a successful (True,
<kb object>) response and monkeypatch module.check_kb_team_permission to return
True (or the expected success tuple), mirroring the stubs used in
test_duplicate_name/test_success so the filename validation in upload_document
runs and returns code 101.
In `@test/testcases/test_web_api/test_document_app/test_upload_documents.py`:
- Around line 451-460: The test_invalid_url currently only monkeypatches request
so the real KnowledgebaseService.get_by_id and check_kb_team_permission run and
block the test; update the test to also stub KnowledgebaseService.get_by_id
(return a dummy kb object for "kb1") and check_kb_team_permission (return True
or a permitted response) on the module under test before calling
module.upload_document(dataset_id="kb1") so execution reaches the upload_type ==
"web" branch and triggers is_valid_url error handling; reference the test
function name test_invalid_url and the target function upload_document to locate
where to add these monkeypatches.
---
Outside diff comments:
In `@docs/references/http_api_reference.md`:
- Around line 1376-1439: The docs only show the multipart/local upload response
(an array) but omit that POST /api/v1/datasets/{dataset_id}/documents with query
param `type=web` and `type=empty` returns a single document object; update the
"Request" and response examples/parameters to explicitly document the non-list
response shape for `type=web` and `type=empty` (single JSON document object),
adjust the example curl outputs and the "Request parameters" response
description accordingly, and include a brief example JSON response for both
`web` and `empty` modes so consumers parse them as an object instead of an
array.
In `@web/src/hooks/use-document-request.ts`:
- Around line 520-544: The web crawl mutation in useNextWebCrawl (mutationFn
using webCrawlDocument) currently only shows a toast and returns a code, so
newly created documents don't appear until a manual refresh; update the success
path to trigger the same post-create behavior as useCreateDocument: after
detecting code === 0, call the documents list refresh and reset pagination to
page 1 (e.g., invoke the documents-list hook's mutate/refresh method or the same
refresh handler used by useCreateDocument and set currentPage = 1) so the new
row becomes visible immediately; ensure you reference useNextWebCrawl,
webCrawlDocument, and the documents list refresh/mutate function from the
documents hook when implementing this.
---
Nitpick comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 336-339: In document_api.py, remove the unreachable
parser-override regex checks that set doc["parser_id"] for presentations and
emails (the re.search(r"\.(ppt|pptx|pages)$", filename) and
re.search(r"\.(eml)$", filename) blocks) from the web upload branch where
filename is forced to f"{name}.pdf"; these conditions can never match in the web
path so delete those two if-blocks (leave other parser_id/override logic
intact).
- Around line 281-387: Add info-level logs for both the "web" and "empty" upload
branches: log dataset id and upload_type on entry (inside the upload_type ==
"web" and upload_type == "empty" blocks) and log a success message just before
the successful get_result returns (before the return that calls get_result in
the "web" branch and before the get_result in the "empty" branch). Use the
existing logger (e.g., logging.info or the module logger) and include
identifying fields like dataset_id, kb.id or current_user.id and the resulting
document id/name (from doc or filename) to aid tracing; place logs near the
symbols shown (upload_type == "web", blob handling and the return
get_result(data=map_doc_keys_with_run_status(...)) in the web flow, and around
DocumentService.insert(...) and return get_result(data=map_doc_keys(...)) in the
empty flow).
- Around line 298-305: The synchronous html2pdf(url) call and subsequent
synchronous Peewee FileService calls (FileService.get_root_folder,
FileService.init_knowledgebase_docs, FileService.get_kb_folder,
FileService.new_a_file_from_kb) must be executed off the event loop using the
thread pool; replace the direct calls with await thread_pool_exec(html2pdf, url)
for the PDF conversion and await thread_pool_exec(FileService.get_root_folder,
current_user.id), await thread_pool_exec(FileService.init_knowledgebase_docs,
root_id, current_user.id), await thread_pool_exec(FileService.get_kb_folder,
current_user.id), and await thread_pool_exec(FileService.new_a_file_from_kb,
kb.tenant_id, kb.name, kb_root_id) (matching the existing pattern used at the
other call site) so blocking I/O runs in threads and does not stall the async
handler.
In `@test/testcases/test_web_api/test_document_app/conftest.py`:
- Around line 133-190: The document_rest_api_module fixture duplicates almost
all setup from document_app_module; extract the shared monkeypatch/stub logic
into a helper (e.g., _install_document_api_stubs) that creates the ModuleType
stubs (common, deepdoc.parser.*, xgboost, api.apps) and a loader helper (e.g.,
_load_module) that creates a spec, sets module.manager = _DummyManager and execs
the module; then simplify document_app_module and document_rest_api_module to
call _install_document_api_stubs(monkeypatch, repo_root) and return
_load_module(...) for their respective module paths, keeping the existing
symbols ModuleType, _DummyManager, document_app_module and
document_rest_api_module names so tests continue to find the same fixtures.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 480ce29e-8d2c-4a4d-adc7-e73d3a96fe9f
📒 Files selected for processing (10)
- api/apps/document_app.py
- api/apps/restful_apis/document_api.py
- docs/references/http_api_reference.md
- test/testcases/test_web_api/test_common.py
- test/testcases/test_web_api/test_document_app/conftest.py
- test/testcases/test_web_api/test_document_app/test_create_document.py
- test/testcases/test_web_api/test_document_app/test_upload_documents.py
- web/src/hooks/use-document-request.ts
- web/src/services/knowledge-service.ts
- web/src/utils/api.ts
```python
def test_invalid_url(self, document_rest_api_module, monkeypatch):
    module = document_rest_api_module
    monkeypatch.setattr(
        module,
        "request",
        _DummyRequest(form={"name": "doc", "url": "not-a-url"}, args={"type": "web"}),
    )
    res = _run(module.upload_document(dataset_id="kb1"))
    assert res["code"] == 101
    assert res["message"] == "The URL format is invalid"
```
test_invalid_url unlikely to reach URL validation — KB/permission gate now sits in front of it.
In the unified handler (see context snippet 1 at document_api.py:271-279), KnowledgebaseService.get_by_id(dataset_id) and check_kb_team_permission(...) execute before the if upload_type == "web": branch that houses the is_valid_url check. This test only monkeypatches request, so the real KnowledgebaseService.get_by_id("kb1") runs and will typically short-circuit with "Can't find the dataset with ID kb1!", not "The URL format is invalid".
Other web-mode tests in this class (test_invalid_kb_id, test_no_permission, test_download_failure, …) correctly stub both. Please align this one.
🐛 Suggested fix
```diff
 def test_invalid_url(self, document_rest_api_module, monkeypatch):
     module = document_rest_api_module
+    kb = SimpleNamespace(id="kb1", tenant_id="tenant1", name="kb", parser_id="parser", pipeline_id="pipe", parser_config={})
+    monkeypatch.setattr(module.KnowledgebaseService, "get_by_id", lambda _kb_id: (True, kb))
+    monkeypatch.setattr(module, "check_kb_team_permission", lambda *_a, **_k: True)
     monkeypatch.setattr(
         module,
         "request",
         _DummyRequest(form={"name": "doc", "url": "not-a-url"}, args={"type": "web"}),
     )
     res = _run(module.upload_document(dataset_id="kb1"))
     assert res["code"] == 101
     assert res["message"] == "The URL format is invalid"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/testcases/test_web_api/test_document_app/test_upload_documents.py`
around lines 451 - 460, The test_invalid_url currently only monkeypatches
request so the real KnowledgebaseService.get_by_id and check_kb_team_permission
run and block the test; update the test to also stub
KnowledgebaseService.get_by_id (return a dummy kb object for "kb1") and
check_kb_team_permission (return True or a permitted response) on the module
under test before calling module.upload_document(dataset_id="kb1") so execution
reaches the upload_type == "web" branch and triggers is_valid_url error
handling; reference the test function name test_invalid_url and the target
function upload_document to locate where to add these monkeypatches.
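The gate-before-validation problem above is easiest to see with a standalone sketch. The handler, service, and permission check below are hypothetical stand-ins (not the real RAGFlow code) that mirror the shapes in the suggested fix; swapping the gate functions for permissive stubs is exactly what `monkeypatch.setattr` does in the test:

```python
from types import SimpleNamespace

# Hypothetical stand-in for the module under test.
def _real_get_by_id(kb_id):
    return (False, None)  # a real lookup would miss for an unknown "kb1"

module = SimpleNamespace(
    KnowledgebaseService=SimpleNamespace(get_by_id=_real_get_by_id),
    check_kb_team_permission=lambda kb, user_id: False,
)

def upload_document(dataset_id, url):
    """Simplified handler: the KB/permission gate runs before URL validation."""
    ok, kb = module.KnowledgebaseService.get_by_id(dataset_id)
    if not ok:
        return {"code": 102, "message": f"Can't find the dataset with ID {dataset_id}!"}
    if not module.check_kb_team_permission(kb, "user-1"):
        return {"code": 102, "message": "No permission."}
    if not url.startswith(("http://", "https://")):
        return {"code": 101, "message": "The URL format is invalid"}
    return {"code": 0, "message": ""}

# Without stubbing, the gate short-circuits before the URL check:
assert upload_document("kb1", "not-a-url")["message"].startswith("Can't find")

# Stub both gates (the monkeypatch.setattr calls in the suggested fix)
# so execution actually reaches URL validation:
module.KnowledgebaseService.get_by_id = lambda _kb_id: (True, SimpleNamespace(id="kb1"))
module.check_kb_team_permission = lambda *_a, **_k: True
assert upload_document("kb1", "not-a-url") == {"code": 101, "message": "The URL format is invalid"}
```

The takeaway for the test file: every stub a handler's earlier branches depend on must be in place, or the assertion exercises the wrong branch.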
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files

```
@@ Coverage Diff @@
##             main   #14345   +/-   ##
=======================================
  Coverage   95.30%   95.30%
=======================================
  Files          10       10
  Lines         703      703
  Branches      112      112
=======================================
  Hits          670      670
  Misses         16       16
  Partials       17       17
```

☔ View full report in Codecov by Sentry.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/testcases/test_web_api/test_document_app/test_create_document.py (1)
16-26: ⚠️ Potential issue | 🔴 Critical: Missing `SimpleNamespace` import — all unit tests will fail with `NameError`.
`SimpleNamespace` is referenced on lines 119, 131, 144, 159, 175, 192, and 222 but is never imported. Every test in `TestDocumentCreateUnit` will raise `NameError: name 'SimpleNamespace' is not defined` at collection/execution time.
🐛 Proposed fix
```diff
 import asyncio
 import string
 from concurrent.futures import ThreadPoolExecutor, as_completed
+from types import SimpleNamespace

 import pytest
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/testcases/test_web_api/test_document_app/test_create_document.py` around lines 16 - 26, The tests fail because SimpleNamespace is used in TestDocumentCreateUnit but not imported; add the missing import by importing SimpleNamespace from the types module (e.g., add "from types import SimpleNamespace") alongside the other top-level imports so references in TestDocumentCreateUnit (lines where SimpleNamespace is used) resolve correctly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@test/testcases/test_web_api/test_document_app/test_create_document.py`:
- Around line 16-26: The tests fail because SimpleNamespace is used in
TestDocumentCreateUnit but not imported; add the missing import by importing
SimpleNamespace from the types module (e.g., add "from types import
SimpleNamespace") alongside the other top-level imports so references in
TestDocumentCreateUnit (lines where SimpleNamespace is used) resolve correctly.
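For context on why the missing import is fatal everywhere: `types.SimpleNamespace` is the stdlib attribute-bag the tests use for fake KB records, and every reference raises `NameError` without the import. A minimal illustration:

```python
from types import SimpleNamespace

# An ad-hoc record like the fake KB objects in the tests.
kb = SimpleNamespace(id="kb1", tenant_id="tenant1", parser_config={})
assert kb.id == "kb1"
assert kb.tenant_id == "tenant1"

# Attributes can be added or replaced freely, unlike a namedtuple.
kb.name = "my-kb"
assert kb.name == "my-kb"
```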
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 085daaee-dac7-400f-9651-15e1d9dbbff1
📒 Files selected for processing (2)
- test/testcases/test_web_api/test_document_app/conftest.py
- test/testcases/test_web_api/test_document_app/test_create_document.py
🚧 Files skipped from review as they are similar to previous changes (1)
- test/testcases/test_web_api/test_document_app/conftest.py
Actionable comments posted: 5
🧹 Nitpick comments (1)
api/apps/restful_apis/document_api.py (1)
296-403: Add logging to the new web/empty upload flows.
The new helpers have no `logging.*` calls, while `_upload_local_documents` logs validation failures (`No file part!`, `No file selected!`, etc.) and the existing `upload_document` dispatcher logs its own errors. Please add parity logging for validation errors (missing `name`/`url`, length overflow, invalid URL, duplicate name) and for successful creations so these flows are observable.
As per coding guidelines, "Add logging for new flows" for `**/*.py`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/apps/restful_apis/document_api.py` around lines 296 - 403, Add logging calls to the new upload flows: in _upload_web_document and _upload_empty_document, log validation failures (missing name, missing url, name too long, invalid URL, duplicate name) at warning level and log successful document creation (after DocumentService.insert / FileService.add_file_from_kb completes) at info level including the doc id/name/kb id; use the module logger (e.g., logging.getLogger(__name__)) consistent with other handlers and place logs immediately where current get_error_data_result or return get_result calls happen to mirror _upload_local_documents behavior.
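As a rough sketch of the parity logging this comment asks for (the handler body, return codes, and field names are placeholders, not the actual RAGFlow helpers), validation failures go to `warning` and successful creations to `info` on the module logger:

```python
import logging

logger = logging.getLogger(__name__)

def upload_empty_document(dataset_id, name):
    """Illustrative only: shows where the review suggests placing logs."""
    if not name:
        # Validation failure: log at warning with identifying fields.
        logger.warning("empty-document create rejected: missing name (dataset_id=%s)", dataset_id)
        return {"code": 101, "message": "Document name can't be empty."}
    doc_id = "doc-123"  # placeholder for the inserted record's id
    # Success milestone: log at info just before returning the result.
    logger.info("empty document created: dataset_id=%s, doc_id=%s, name=%s", dataset_id, doc_id, name)
    return {"code": 0, "data": {"id": doc_id, "name": name}}
```

The `%s`-style lazy formatting defers string building until a handler actually needs the message, which keeps hot paths cheap when the level is disabled.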
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 322-360: The blob can be orphaned if STORAGE_IMPL.put succeeds but
later DocumentService.insert or FileService.add_file_from_kb fails, and
unsupported-type RuntimeError is being turned into a 500 by the broad except; to
fix: (1) change the flow in the function so you either insert the DB record
(DocumentService.insert) before calling STORAGE_IMPL.put, or keep the current
order but wrap the DB calls in a try/except that on any failure calls
settings.STORAGE_IMPL.rm(dataset_id, location) to delete the uploaded blob; (2)
replace the thrown RuntimeError for unsupported file types with an explicit
early return using get_error_data_result(..., code=RetCode.ARGUMENT_ERROR) (or
catch RuntimeError specifically and map it to get_error_data_result) so clients
receive an argument error instead of server_error_response; update references to
STORAGE_IMPL.put, settings.STORAGE_IMPL.rm, DocumentService.insert,
FileService.add_file_from_kb, RuntimeError, server_error_response and
get_error_data_result/RetCode.ARGUMENT_ERROR accordingly.
- Around line 313-315: The handler currently calls html2pdf(url) synchronously
which blocks the async event loop; update the async handler to run html2pdf via
the existing thread_pool_exec helper (same pattern as _upload_local_documents)
and await its result, and likewise offload the blocking STORAGE_IMPL.obj_exist
and STORAGE_IMPL.put calls to thread_pool_exec so all CPU/IO-heavy work is
executed on the thread pool; locate the synchronous calls to html2pdf,
STORAGE_IMPL.obj_exist and STORAGE_IMPL.put in the async function and wrap them
with thread_pool_exec (await the returned futures) to avoid blocking the Quart
event loop.
- Around line 317-320: The KB folder resolution is inconsistent:
_upload_web_document calls FileService.get_root_folder,
FileService.init_knowledgebase_docs, and FileService.get_kb_folder with
current_user.id but then calls FileService.new_a_file_from_kb with kb.tenant_id,
causing wrong folder linkage for team members; update _upload_web_document to
use kb.tenant_id for all folder-resolution calls (replace current_user.id with
kb.tenant_id when invoking get_root_folder, init_knowledgebase_docs, and
get_kb_folder) so its behavior matches _upload_empty_document and ensures the
created file is placed under the KB owner’s folders.
- Around line 385-399: The DocumentService.insert call is storing the enum
member FileType.VIRTUAL instead of its string value; update the inserted dict so
the "type" field uses FileType.VIRTUAL.value to match the rest of the codebase
(see usage patterns in document_service and file_service) so comparisons and
stored values remain consistent.
- Around line 347-354: The conditional block that sets doc["parser_id"] for
FileType.VISUAL, FileType.AURAL and regex checks for .ppt/.pptx/.pages/.eml is
dead because filename is always constructed as f"{name}.pdf" and doc["type"]
will always be FileType.PDF; remove this entire block (the if branches
referencing doc["type"], FileType.VISUAL/FileType.AURAL, and the re.search
checks) so the default kb.parser_id remains in effect (see references to
filename construction and doc["type"] and parser_id). If you actually need to
handle presentations/emails for crawled pages, change the logic to recover the
original source filename from the URL and run the type checks against that
source filename instead of the generated f"{name}.pdf".
---
Nitpick comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 296-403: Add logging calls to the new upload flows: in
_upload_web_document and _upload_empty_document, log validation failures
(missing name, missing url, name too long, invalid URL, duplicate name) at
warning level and log successful document creation (after DocumentService.insert
/ FileService.add_file_from_kb completes) at info level including the doc
id/name/kb id; use the module logger (e.g., logging.getLogger(__name__))
consistent with other handlers and place logs immediately where current
get_error_data_result or return get_result calls happen to mirror
_upload_local_documents behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d3d4fdd0-a072-45cc-a2a4-7de059e28fef
📒 Files selected for processing (1)
api/apps/restful_apis/document_api.py
```python
        blob = html2pdf(url)
        if not blob:
            return server_error_response(ValueError("Download failure."))
```
Blocking Selenium call on the async event loop.
`html2pdf(url)` launches headless Chrome via Selenium (see `api/utils/web_utils.py:157-188`) and can take many seconds. Running it synchronously inside this async handler blocks the Quart event loop for every concurrent request. `_upload_local_documents` already offloads heavy work via `thread_pool_exec`; do the same here (and ideally also for the subsequent `STORAGE_IMPL.obj_exist` / `STORAGE_IMPL.put` calls, which are blocking I/O).
🛠️ Suggested change

```diff
-        blob = html2pdf(url)
+        blob = await thread_pool_exec(html2pdf, url)
         if not blob:
             return server_error_response(ValueError("Download failure."))
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        blob = await thread_pool_exec(html2pdf, url)
        if not blob:
            return server_error_response(ValueError("Download failure."))
```
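As a generic illustration of the offloading pattern, here is a minimal self-contained sketch using `asyncio.to_thread` in place of the project's `thread_pool_exec` helper; the `html2pdf_stub` function is a hypothetical stand-in for the real Selenium-backed converter:

```python
import asyncio
import time

def html2pdf_stub(url: str) -> bytes:
    """Stand-in for the Selenium-backed html2pdf; blocks its calling thread."""
    time.sleep(0.05)  # simulate headless-Chrome latency
    return b"%PDF-1.4 stub for " + url.encode()

async def upload_web_document(url: str) -> bytes:
    # Offload the blocking call so the event loop stays free for other requests;
    # asyncio.to_thread plays the role of the project's thread_pool_exec helper.
    blob = await asyncio.to_thread(html2pdf_stub, url)
    if not blob:
        raise ValueError("Download failure.")
    return blob

blob = asyncio.run(upload_web_document("https://example.com"))
print(blob[:8])  # b'%PDF-1.4'
```

The same wrapper works for any blocking storage call: `await asyncio.to_thread(storage.put, bucket, key, blob)` keeps the handler responsive while the I/O runs on a worker thread.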
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/apps/restful_apis/document_api.py` around lines 313 - 315, The handler
currently calls html2pdf(url) synchronously which blocks the async event loop;
update the async handler to run html2pdf via the existing thread_pool_exec
helper (same pattern as _upload_local_documents) and await its result, and
likewise offload the blocking STORAGE_IMPL.obj_exist and STORAGE_IMPL.put calls
to thread_pool_exec so all CPU/IO-heavy work is executed on the thread pool;
locate the synchronous calls to html2pdf, STORAGE_IMPL.obj_exist and
STORAGE_IMPL.put in the async function and wrap them with thread_pool_exec
(await the returned futures) to avoid blocking the Quart event loop.
```python
    try:
        filename = duplicate_name(DocumentService.query, name=f"{name}.pdf", kb_id=kb.id)
        filetype = filename_type(filename)
        if filetype == FileType.OTHER.value:
            raise RuntimeError("This type of file has not been supported yet!")

        location = filename
        while settings.STORAGE_IMPL.obj_exist(dataset_id, location):
            location += "_"
        settings.STORAGE_IMPL.put(dataset_id, location, blob)

        doc = {
            "id": get_uuid(),
            "kb_id": kb.id,
            "parser_id": kb.parser_id,
            "pipeline_id": kb.pipeline_id,
            "parser_config": kb.parser_config,
            "created_by": current_user.id,
            "type": filetype,
            "name": filename,
            "location": location,
            "size": len(blob),
            "thumbnail": thumbnail(filename, blob),
            "suffix": Path(filename).suffix.lstrip("."),
        }
        if doc["type"] == FileType.VISUAL:
            doc["parser_id"] = ParserType.PICTURE.value
        if doc["type"] == FileType.AURAL:
            doc["parser_id"] = ParserType.AUDIO.value
        if re.search(r"\.(ppt|pptx|pages)$", filename):
            doc["parser_id"] = ParserType.PRESENTATION.value
        if re.search(r"\.(eml)$", filename):
            doc["parser_id"] = ParserType.EMAIL.value

        DocumentService.insert(doc)
        FileService.add_file_from_kb(doc, kb_folder["id"], kb.tenant_id)
        return get_result(data=map_doc_keys_with_run_status(doc, run_status="0"))
    except Exception as e:
        return server_error_response(e)
```
Orphaned storage blob on partial failure + wrong error code for unsupported type.
Two smaller correctness/UX points inside the try block:
- If `STORAGE_IMPL.put` (line 331) succeeds but `DocumentService.insert` / `add_file_from_kb` later fails, the blob stays in object storage with no DB record referencing it. Consider ordering the DB insert before the storage write, or adding a compensating `STORAGE_IMPL.rm(dataset_id, location)` in an exception path.
- `RuntimeError("This type of file has not been supported yet!")` (line 326) is a client-input error, but the bare `except` at line 359 maps it to `server_error_response` (HTTP 500). Return a `get_error_data_result(..., code=RetCode.ARGUMENT_ERROR)` for that specific case so clients can distinguish it from server faults. (This case becomes unreachable if you drop the dead parser_id block suggested above; in that case this is moot.)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/apps/restful_apis/document_api.py` around lines 322 - 360, The blob can
be orphaned if STORAGE_IMPL.put succeeds but later DocumentService.insert or
FileService.add_file_from_kb fails, and unsupported-type RuntimeError is being
turned into a 500 by the broad except; to fix: (1) change the flow in the
function so you either insert the DB record (DocumentService.insert) before
calling STORAGE_IMPL.put, or keep the current order but wrap the DB calls in a
try/except that on any failure calls settings.STORAGE_IMPL.rm(dataset_id,
location) to delete the uploaded blob; (2) replace the thrown RuntimeError for
unsupported file types with an explicit early return using
get_error_data_result(..., code=RetCode.ARGUMENT_ERROR) (or catch RuntimeError
specifically and map it to get_error_data_result) so clients receive an argument
error instead of server_error_response; update references to STORAGE_IMPL.put,
settings.STORAGE_IMPL.rm, DocumentService.insert, FileService.add_file_from_kb,
RuntimeError, server_error_response and
get_error_data_result/RetCode.ARGUMENT_ERROR accordingly.
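A minimal sketch of the compensating-delete option, with an in-memory stand-in for `STORAGE_IMPL`; all class and function names here are illustrative, not the project's actual API:

```python
class FakeStorage:
    """In-memory stand-in for settings.STORAGE_IMPL."""
    def __init__(self):
        self.blobs = {}
    def put(self, bucket: str, key: str, blob: bytes) -> None:
        self.blobs[(bucket, key)] = blob
    def rm(self, bucket: str, key: str) -> None:
        self.blobs.pop((bucket, key), None)

def create_document(storage, bucket, key, blob, insert_record):
    """Write the blob, then the DB record; roll back the blob if the record fails."""
    storage.put(bucket, key, blob)
    try:
        insert_record()  # stand-in for DocumentService.insert + add_file_from_kb
    except Exception:
        storage.rm(bucket, key)  # compensating delete: no orphaned blob remains
        raise

storage = FakeStorage()
def failing_insert():
    raise RuntimeError("db down")

try:
    create_document(storage, "ds1", "page.pdf", b"%PDF", failing_insert)
except RuntimeError:
    pass
print(len(storage.blobs))  # 0: the blob was cleaned up
```

Inserting the DB record first avoids the compensating delete entirely, at the cost of a short window where a record references a not-yet-written blob; either ordering is defensible as long as one side cleans up the other on failure.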
```python
        if doc["type"] == FileType.VISUAL:
            doc["parser_id"] = ParserType.PICTURE.value
        if doc["type"] == FileType.AURAL:
            doc["parser_id"] = ParserType.AUDIO.value
        if re.search(r"\.(ppt|pptx|pages)$", filename):
            doc["parser_id"] = ParserType.PRESENTATION.value
        if re.search(r"\.(eml)$", filename):
            doc["parser_id"] = ParserType.EMAIL.value
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Inspect FileType enum definition and filename_type return type
ast-grep --pattern $'class FileType($_):
  $$$'
rg -nP --type=py -C3 '\bdef\s+filename_type\s*\('
```

Repository: infiniflow/ragflow
🏁 Script executed:

```shell
cd api/apps/restful_apis && sed -n '320,360p' document_api.py | cat -n
```

🏁 Script executed:

```shell
rg -n "def filename_type" api/utils/file_utils.py -A 15
```

🏁 Script executed:

```shell
grep -n "FileType" api/apps/restful_apis/document_api.py | head -20
```

🏁 Script executed:

```shell
sed -n '58,85p' api/utils/file_utils.py | cat -n
```

🏁 Script executed:

```shell
rg "new_kb_from_web_page" api/apps/restful_apis/document_api.py -B 5
```
Remove this dead code block: `filename` is always `.pdf`, making these branches unreachable.
The filename is constructed as `f"{name}.pdf"` (line 323), so `filename_type(filename)` always returns `FileType.PDF.value`. This means:

- `doc["type"]` is always `"pdf"` and can never equal `FileType.VISUAL` or `FileType.AURAL`
- The regex patterns checking for `.ppt`/`.pptx`/`.pages` and `.eml` can never match a `.pdf` filename

These overrides are dead code and should be removed entirely. The default `kb.parser_id` set on line 336 is what actually applies here.
If presentation/email handling is needed for crawled pages, recover the original source filename from the URL and run the type checks against that instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/apps/restful_apis/document_api.py` around lines 347 - 354, The
conditional block that sets doc["parser_id"] for FileType.VISUAL, FileType.AURAL
and regex checks for .ppt/.pptx/.pages/.eml is dead because filename is always
constructed as f"{name}.pdf" and doc["type"] will always be FileType.PDF; remove
this entire block (the if branches referencing doc["type"],
FileType.VISUAL/FileType.AURAL, and the re.search checks) so the default
kb.parser_id remains in effect (see references to filename construction and
doc["type"] and parser_id). If you actually need to handle presentations/emails
for crawled pages, change the logic to recover the original source filename from
the URL and run the type checks against that source filename instead of the
generated f"{name}.pdf".
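If the original source filename is ever needed, a sketch of recovering a suffix from the URL path; the `source_suffix` helper name is hypothetical:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def source_suffix(url: str) -> str:
    """Best-effort suffix of the resource a URL points at ('' if none)."""
    path = urlparse(url).path          # drops query string and fragment
    return PurePosixPath(path).suffix.lstrip(".").lower()

print(source_suffix("https://example.com/slides/q3-review.pptx"))  # pptx
print(source_suffix("https://example.com/"))                       # (empty)
```

Parser selection could then test `source_suffix(url) in ("ppt", "pptx", "pages")` instead of matching against the generated `.pdf` filename.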
```python
        doc = DocumentService.insert(
            {
                "id": get_uuid(),
                "kb_id": kb.id,
                "parser_id": kb.parser_id,
                "pipeline_id": kb.pipeline_id,
                "parser_config": kb.parser_config,
                "created_by": current_user.id,
                "type": FileType.VIRTUAL,
                "name": name,
                "suffix": Path(name).suffix.lstrip("."),
                "location": "",
                "size": 0,
            }
        )
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Confirm how `type` is persisted / compared for documents across the codebase
ast-grep --pattern $'class FileType($_):
  $$$'
rg -nP --type=py -C2 '"type"\s*:\s*FileType\.[A-Z_]+\b'
rg -nP --type=py -C2 '\btype\s*==\s*FileType\.[A-Z_]+\.value\b'
```
Store `FileType.VIRTUAL.value`, not the enum member.
`FileType` is a `StrEnum`, so enum members are strings internally. However, throughout the codebase, in file inserts (`user_account_service.py:88`, `file_service.py:205`, `file_api_service.py:326`) and all type comparisons (`document_service.py:256/532/553`, `file_service.py:396`, `file2document_api.py:103`, `file_api_service.py:236/255/313`, `file_api.py:294`, `document_app.py:227`), the pattern is consistently to use `.value`. This location is the only exception and should align with the standard pattern.
Suggested change

```diff
-                "type": FileType.VIRTUAL,
+                "type": FileType.VIRTUAL.value,
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        doc = DocumentService.insert(
            {
                "id": get_uuid(),
                "kb_id": kb.id,
                "parser_id": kb.parser_id,
                "pipeline_id": kb.pipeline_id,
                "parser_config": kb.parser_config,
                "created_by": current_user.id,
                "type": FileType.VIRTUAL.value,
                "name": name,
                "suffix": Path(name).suffix.lstrip("."),
                "location": "",
                "size": 0,
            }
        )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/apps/restful_apis/document_api.py` around lines 385 - 399, The
DocumentService.insert call is storing the enum member FileType.VIRTUAL instead
of its string value; update the inserted dict so the "type" field uses
FileType.VIRTUAL.value to match the rest of the codebase (see usage patterns in
document_service and file_service) so comparisons and stored values remain
consistent.
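To illustrate why the distinction matters, a small self-contained demo; the `str`-backed enum here stands in for the project's `FileType` (member values are assumed for illustration):

```python
from enum import Enum

class FileType(str, Enum):  # str-backed enum, like the project's FileType (values assumed)
    PDF = "pdf"
    VIRTUAL = "virtual"

# The member compares equal to its value and is an instance of str...
assert FileType.VIRTUAL == "virtual"
assert isinstance(FileType.VIRTUAL, str)

# ...but it is still the enum member: repr() shows the member, and serializers
# or ORMs that stringify fields may persist "FileType.VIRTUAL" rather than "virtual".
print(repr(FileType.VIRTUAL))        # <FileType.VIRTUAL: 'virtual'>
print(repr(FileType.VIRTUAL.value))  # 'virtual'
```

Passing `.value` explicitly removes any dependence on how a given serializer treats enum members, which is presumably why the rest of the codebase does so.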
LGTM
What problem does this PR solve?
Unify document create flows under the REST documents API.
Type of change