Refactor: migrate document change status API #14300

xugangqiang wants to merge 11 commits into infiniflow:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review; this behavior can be configured in the review settings.
📝 Walkthrough: Moves document status updates from a global POST endpoint to a dataset-scoped batch endpoint.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Frontend as Frontend (hooks/service)
    participant API as Server API
    participant DB as Database
    participant DocStore as DocStore
    Client->>Frontend: request batch-update-status (datasetId, doc_ids, status)
    Frontend->>API: POST /datasets/{datasetId}/documents/batch-update-status (doc_ids, status)
    API->>DB: verify dataset ownership
    DB-->>API: ownership OK / error
    loop per doc_id
        API->>DB: fetch document record
        DB-->>API: document or missing
        alt missing
            API-->>Frontend: include per-document missing error
        else exists & status differs
            API->>DB: update document.status
            DB-->>API: update ack
            alt doc.chunk_num > 0
                API->>DocStore: update availability (tenant index, kb_id)
                DocStore-->>API: ack / error
            end
        else status already matches
            API-->>Frontend: include per-document no-op result
        end
    end
    API-->>Frontend: aggregated per-document results (success or "Partial failure")
    Frontend-->>Client: API response
```
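The per-document loop in the diagram (fetch, skip no-ops, record missing-document errors, aggregate results) can be sketched in plain Python. `batch_update_status` and its dict-based document store are hypothetical stand-ins for the service layer, not the PR's actual code:

```python
def batch_update_status(documents, doc_ids, new_status):
    """Aggregate per-document results for a batch status update.

    `documents` is a hypothetical in-memory stand-in for the DB:
    a dict mapping doc_id -> {"status": ..., "chunk_num": ...}.
    Returns (result, has_error), mirroring the aggregated response
    in the sequence diagram.
    """
    result = {}
    has_error = False
    for doc_id in doc_ids:
        doc = documents.get(doc_id)
        if doc is None:
            # Missing document: record a per-document error and keep going.
            result[doc_id] = {"error": "Document not found"}
            has_error = True
            continue
        if doc["status"] == new_status:
            # Status already matches: per-document no-op result.
            result[doc_id] = {"status": new_status, "updated": False}
            continue
        doc["status"] = new_status
        # In the real handler, docs with chunk_num > 0 also get their
        # doc-store availability updated at this point.
        result[doc_id] = {"status": new_status, "updated": True}
    return result, has_error
```

The key property is that one bad `doc_id` never aborts the batch; it only flips `has_error` and shows up in the aggregated result.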
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
Actionable comments posted: 3
🧹 Nitpick comments (2)
web/src/pages/dataset/dataset/use-dataset-table-columns.tsx (1)
168-178: Optional: source `datasetId` from the row itself.
`row.original.dataset_id` is already available on `IDocumentInfo` and is used elsewhere in this same column set (line 99, `navigateToChunkParsedResult(..., row.original.dataset_id)`). Reusing it here would avoid threading an extra optional `datasetId` prop through `DatasetTable` → `useDatasetTableColumns` and eliminates the `?: string` uncertainty in the column signature.

♻️ Proposed refactor

```diff
 cell: ({ row }) => {
   const id = row.original.id;
   return (
     <Switch
       checked={row.getValue('status') === '1'}
       onCheckedChange={(e) => {
-        setDocumentStatus({ status: e, documentId: id, datasetId });
+        setDocumentStatus({
+          status: e,
+          documentId: id,
+          datasetId: row.original.dataset_id,
+        });
       }}
     />
   );
 },
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@web/src/pages/dataset/dataset/use-dataset-table-columns.tsx` around lines 168 - 178, The Switch cell currently uses an external datasetId prop; change it to read datasetId from the row itself (row.original.dataset_id) inside useDatasetTableColumns so the Switch onCheckedChange calls setDocumentStatus({ status: e, documentId: id, datasetId: row.original.dataset_id }) instead of using the optional external prop; update any type assumptions in useDatasetTableColumns and the column signature to remove the optional datasetId threading and rely on row.original.dataset_id (refer to the Switch cell, setDocumentStatus function call, and row.original.dataset_id).

test/testcases/test_web_api/test_document_app/test_document_metadata.py (1)
141-197: Add a negative test for cross-dataset `doc_ids`.
The new dataset-scoped batch endpoint accepts a list of `doc_ids` without (currently) validating that those docs belong to `dataset_id`. Regardless of how the handler ends up enforcing that, a test like the following would guard against regressions:

```python
@pytest.mark.p2
def test_change_status_foreign_doc_id_rejected(self, WebApiAuth, add_document_func, add_dataset_func):
    _, doc_id = add_document_func  # doc belongs to dataset A
    other_dataset_id = add_dataset_func  # dataset B, does not own doc
    res = document_change_status(WebApiAuth, other_dataset_id, {"doc_ids": [doc_id], "status": "1"})
    # Expect an error for this doc, not a silent success
    assert res["data"][doc_id].get("error"), res
```

Based on learnings (Applies to `tests/**/*.py`: Add/adjust tests for behavior changes).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py` around lines 141 - 197, Add a negative test to verify cross-dataset doc_id rejection: create a new test method (e.g., test_change_status_foreign_doc_id_rejected) in the TestDocumentMetadataNegative class that uses add_document_func to create a doc in dataset A, uses add_dataset_func to create dataset B, then calls document_change_status(WebApiAuth, other_dataset_id, {"doc_ids": [doc_id], "status": "1"}) and assert the response indicates an error for that doc (e.g., res["data"][doc_id].get("error") is truthy) rather than silently succeeding; reference the existing test_change_status for call shape and reuse WebApiAuth, add_document_func, add_dataset_func, and document_change_status symbols.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 930-932: The partial-failure branch currently returns a
server-side code (RetCode.SERVER_ERROR) which misclassifies client/document
errors; update the logic in the block that checks has_error (where
get_json_result is called) to return an appropriate client-side code such as
RetCode.ARGUMENT_ERROR or RetCode.DATA_ERROR instead, or add finer distinctions
(e.g., separate conditions for "all failed" vs "partial failure" vs "all
succeeded") so get_json_result(data=result, message="Partial failure", code=...)
uses a non-5xx RetCode and callers can handle partial vs total failures
correctly.
- Around line 870-928: Validate per-document authorization and add logging: for
each doc_id retrieved by DocumentService.get_by_id, ensure the document belongs
to the requested dataset by either calling DocumentService.accessible(doc_id,
tenant_id) (or checking doc.kb_id == dataset_id) before performing updates; if
access is denied, set result[doc_id] = {"error": "No authorization."} and
logging.error that denial. Add logging.error/exception calls on the initial
request validation failure, per-doc lookup/update errors, and in the except
block around settings.docStoreConn.update to record exception details. Also stop
double-stringifying status when calling DocumentService.update_by_id (pass an
int or a single-string value consistently) so update_by_id receives the correct
type.
In `@web/src/pages/dataset/dataset/use-bulk-operate-dataset.tsx`:
- Around line 88-97: The onChangeStatus callback can call setDocumentStatus with
datasetId undefined if knowledgeBase?.id isn't ready; update onChangeStatus
(and/or the caller) to guard for knowledgeBase?.id before invoking
setDocumentStatus — e.g., early-return if !knowledgeBase?.id or disable the bulk
action UI when knowledgeBase?.id is falsy — and ensure the payload passed to
setDocumentStatus (used here) always contains a valid datasetId (reference:
onChangeStatus, setDocumentStatus, knowledgeBase?.id, selectedRowKeys).
---
Nitpick comments:
In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`:
- Around line 141-197: Add a negative test to verify cross-dataset doc_id
rejection: create a new test method (e.g.,
test_change_status_foreign_doc_id_rejected) in the TestDocumentMetadataNegative
class that uses add_document_func to create a doc in dataset A, uses
add_dataset_func to create dataset B, then calls
document_change_status(WebApiAuth, other_dataset_id, {"doc_ids": [doc_id],
"status": "1"}) and assert the response indicates an error for that doc (e.g.,
res["data"][doc_id].get("error") is truthy) rather than silently succeeding;
reference the existing test_change_status for call shape and reuse WebApiAuth,
add_document_func, add_dataset_func, and document_change_status symbols.
In `@web/src/pages/dataset/dataset/use-dataset-table-columns.tsx`:
- Around line 168-178: The Switch cell currently uses an external datasetId
prop; change it to read datasetId from the row itself (row.original.dataset_id)
inside useDatasetTableColumns so the Switch onCheckedChange calls
setDocumentStatus({ status: e, documentId: id, datasetId:
row.original.dataset_id }) instead of using the optional external prop; update
any type assumptions in useDatasetTableColumns and the column signature to
remove the optional datasetId threading and rely on row.original.dataset_id
(refer to the Switch cell, setDocumentStatus function call, and
row.original.dataset_id).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8fb69f16-1896-4361-a929-24ab076b93d7
📒 Files selected for processing (10)
- api/apps/document_app.py
- api/apps/restful_apis/document_api.py
- test/testcases/test_web_api/test_common.py
- test/testcases/test_web_api/test_document_app/test_document_metadata.py
- web/src/hooks/use-document-request.ts
- web/src/pages/dataset/dataset/dataset-table.tsx
- web/src/pages/dataset/dataset/use-bulk-operate-dataset.tsx
- web/src/pages/dataset/dataset/use-dataset-table-columns.tsx
- web/src/services/knowledge-service.ts
- web/src/utils/api.ts
💤 Files with no reviewable changes (1)
- api/apps/document_app.py
```python
if has_error:
    return get_json_result(data=result, message="Partial failure", code=RetCode.SERVER_ERROR)
return get_json_result(data=result)
```
Partial-failure response should not use SERVER_ERROR.
When has_error is due solely to client-side/per-document issues (e.g., Document not found, status mismatch), returning RetCode.SERVER_ERROR misrepresents the failure class and will trigger 5xx error handling / alerts on the client. Consider returning a client-side code (e.g., RetCode.ARGUMENT_ERROR or RetCode.DATA_ERROR), or distinguishing between "all failed", "partial failure", and "all succeeded" so callers can handle each.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/apps/restful_apis/document_api.py` around lines 930 - 932, The
partial-failure branch currently returns a server-side code
(RetCode.SERVER_ERROR) which misclassifies client/document errors; update the
logic in the block that checks has_error (where get_json_result is called) to
return an appropriate client-side code such as RetCode.ARGUMENT_ERROR or
RetCode.DATA_ERROR instead, or add finer distinctions (e.g., separate conditions
for "all failed" vs "partial failure" vs "all succeeded") so
get_json_result(data=result, message="Partial failure", code=...) uses a non-5xx
RetCode and callers can handle partial vs total failures correctly.
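One way to draw the distinction this comment asks for (all succeeded vs. partial failure vs. all failed) is a small classifier over the aggregated per-document results. The `RetCode` mappings in the comments are illustrative placeholders, not the project's actual values:

```python
def classify_batch(result):
    """Classify an aggregated batch result as 'success', 'partial', or 'failed'.

    `result` maps doc_id -> per-document outcome; an outcome containing an
    "error" key counts as a failure. An empty batch is treated as success.
    """
    failures = sum(1 for outcome in result.values() if "error" in outcome)
    if failures == 0:
        return "success"   # e.g. RetCode.SUCCESS
    if failures == len(result):
        return "failed"    # e.g. a client-side code such as RetCode.DATA_ERROR
    return "partial"       # e.g. RetCode.DATA_ERROR plus per-document detail
```

The handler could then map each class to a non-5xx code, reserving server-error codes for genuine backend exceptions.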
```diff
 const onChangeStatus = useCallback(
   (enabled: boolean) => {
-    setDocumentStatus({ status: enabled, documentId: selectedRowKeys });
+    setDocumentStatus({
+      status: enabled,
+      documentId: selectedRowKeys,
+      datasetId: knowledgeBase?.id,
+    });
   },
-  [selectedRowKeys, setDocumentStatus],
+  [selectedRowKeys, setDocumentStatus, knowledgeBase],
 );
```
Consider guarding against undefined knowledgeBase?.id.
setDocumentStatus will receive datasetId: undefined if a user triggers enable/disable before the KB context finishes loading, producing a request to /datasets/undefined/documents/batch-update-status. In practice the bulk actions are only reachable when rows are selected (which requires documents to have loaded alongside the KB), so this is mostly theoretical — but an early-return or disabled-state on !knowledgeBase?.id would be safer.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@web/src/pages/dataset/dataset/use-bulk-operate-dataset.tsx` around lines 88 -
97, The onChangeStatus callback can call setDocumentStatus with datasetId
undefined if knowledgeBase?.id isn't ready; update onChangeStatus (and/or the
caller) to guard for knowledgeBase?.id before invoking setDocumentStatus — e.g.,
early-return if !knowledgeBase?.id or disable the bulk action UI when
knowledgeBase?.id is falsy — and ensure the payload passed to setDocumentStatus
(used here) always contains a valid datasetId (reference: onChangeStatus,
setDocumentStatus, knowledgeBase?.id, selectedRowKeys).
Codecov Report: ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@ Coverage Diff @@
##             main   #14300   +/-   ##
=======================================
  Coverage   95.97%   95.97%
=======================================
  Files          10       10
  Lines         695      695
  Branches      111      111
=======================================
  Hits          667      667
  Misses         11       11
  Partials       17       17
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 2
🧹 Nitpick comments (1)
test/testcases/test_web_api/test_document_app/test_document_metadata.py (1)
370-386: Consider removing the duplicate invalid-status case.
This duplicates `TestDocumentMetadataNegative.test_change_status_invalid_status` at lines 195-200 with the same setup, payload, and assertions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py` around lines 370 - 386, The test_change_status_invalid_status function duplicates an existing negative test (TestDocumentMetadataNegative.test_change_status_invalid_status) that uses the same setup, payload and assertions; remove this duplicate from test_document_metadata.py (or consolidate by reusing the existing TestDocumentMetadataNegative case or converting both to a shared parametrized test) so only one test asserts the invalid status behavior, referencing the test function name test_change_status_invalid_status and the existing TestDocumentMetadataNegative suite to locate and eliminate the redundant case.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`:
- Around line 388-414: The test currently asserts the document already has the
target status ("1") before calling document_change_status, so the update could
be a no-op; update test_change_status_all_success to first assert the initial
status is the opposite (e.g., "0") or at minimum assert initial_status != "1"
using document_infos, then call document_change_status to set status "1" and
keep the existing assertions that response code is 0 and that both the response
payload (res["data"][doc_id]["status"]) and a subsequent document_infos call
show the new status "1"; references: test_change_status_all_success,
document_infos, document_change_status.
- Around line 330-368: The test test_change_status_partial_failure_matrix
currently only uses valid IDs and can leak docs if bulk_upload_documents returns
a partial list; change it to (1) capture the exact list returned by
bulk_upload_documents into doc_ids, assert it's non-empty but not require exact
length, (2) append a known-invalid id (e.g., "nonexistent-id-123") to doc_ids to
force a partial-failure scenario before calling document_change_status, (3)
assert the response from document_change_status contains both success entries
for the real IDs and failure info for the invalid id, and verify via
document_infos that only the real docs' statuses changed, and (4) make the
finally block robust by only calling delete_document when doc_ids_real (the
original returned real IDs) is non-empty and wrapping delete_document in a
try/except to ensure cleanup never raises; reference the functions/bindings
bulk_upload_documents, document_change_status, document_infos, delete_document
and the test function name to locate edits.
---
Nitpick comments:
In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`:
- Around line 370-386: The test_change_status_invalid_status function duplicates
an existing negative test
(TestDocumentMetadataNegative.test_change_status_invalid_status) that uses the
same setup, payload and assertions; remove this duplicate from
test_document_metadata.py (or consolidate by reusing the existing
TestDocumentMetadataNegative case or converting both to a shared parametrized
test) so only one test asserts the invalid status behavior, referencing the test
function name test_change_status_invalid_status and the existing
TestDocumentMetadataNegative suite to locate and eliminate the redundant case.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f412c0be-0031-4531-9ba8-7b2f71e8592b
📒 Files selected for processing (1)
test/testcases/test_web_api/test_document_app/test_document_metadata.py
```python
@pytest.mark.p2
def test_change_status_all_success(self, WebApiAuth, add_document_func):
    """
    E2E test for successful batch document status change.

    This test verifies that all documents are successfully updated
    when valid status values are provided.
    """
    dataset_id, doc_id = add_document_func

    # Verify initial status is "0" (unprocessed)
    info_res = document_infos(WebApiAuth, dataset_id, {"ids": [doc_id]})
    assert info_res["code"] == 0, info_res
    assert info_res["data"]["docs"][0]["status"] == "1", "Initial status should be 1"

    # Update status to "1" (processed)
    res = document_change_status(WebApiAuth, dataset_id, {"doc_ids": [doc_id], "status": "1"})

    # Verify success
    assert res["code"] == 0, f"Expected success code 0, got {res}"
    assert res["data"][doc_id]["status"] == "1", "Document status should be 1"

    # Verify the status was actually updated in the database
    info_res = document_infos(WebApiAuth, dataset_id, {"ids": [doc_id]})
    assert info_res["code"] == 0, info_res
    assert info_res["data"]["docs"][0]["status"] == "1", "Document status should be 1 in database"
```
Verify a real status transition.
Line 402 asserts the document already has the target status, then Lines 405-414 update it to the same value. This can pass even if the endpoint is a no-op.
Proposed fix

```diff
-    # Verify initial status is "0" (unprocessed)
+    # Capture initial status and switch to the opposite valid status.
     info_res = document_infos(WebApiAuth, dataset_id, {"ids": [doc_id]})
     assert info_res["code"] == 0, info_res
-    assert info_res["data"]["docs"][0]["status"] == "1", "Initial status should be 1"
+    initial_status = info_res["data"]["docs"][0]["status"]
+    target_status = "0" if initial_status == "1" else "1"

-    # Update status to "1" (processed)
-    res = document_change_status(WebApiAuth, dataset_id, {"doc_ids": [doc_id], "status": "1"})
+    # Update status to the opposite valid value.
+    res = document_change_status(WebApiAuth, dataset_id, {"doc_ids": [doc_id], "status": target_status})

     # Verify success
     assert res["code"] == 0, f"Expected success code 0, got {res}"
-    assert res["data"][doc_id]["status"] == "1", "Document status should be 1"
+    assert res["data"][doc_id]["status"] == target_status, f"Document status should be {target_status}"

     # Verify the status was actually updated in the database
     info_res = document_infos(WebApiAuth, dataset_id, {"ids": [doc_id]})
     assert info_res["code"] == 0, info_res
-    assert info_res["data"]["docs"][0]["status"] == "1", "Document status should be 1 in database"
+    assert info_res["data"]["docs"][0]["status"] == target_status, f"Document status should be {target_status} in database"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`
around lines 388 - 414, The test currently asserts the document already has the
target status ("1") before calling document_change_status, so the update could
be a no-op; update test_change_status_all_success to first assert the initial
status is the opposite (e.g., "0") or at minimum assert initial_status != "1"
using document_infos, then call document_change_status to set status "1" and
keep the existing assertions that response code is 0 and that both the response
payload (res["data"][doc_id]["status"]) and a subsequent document_infos call
show the new status "1"; references: test_change_status_all_success,
document_infos, document_change_status.
♻️ Duplicate comments (4)
api/apps/restful_apis/document_api.py (4)
943-959: ⚠️ Potential issue | 🟡 Minor: Reject empty or non-list `doc_ids` instead of returning success.
The schema marks `doc_ids` as required, but the handler defaults missing input to `[]`; an empty or malformed batch currently returns `200` with `{}`. Validate `doc_ids` before the loop.

🛡️ Proposed validation

```diff
 doc_ids = req.get("doc_ids", [])
 status = str(req.get("status", -1))

+if not isinstance(doc_ids, list) or not doc_ids:
+    return get_error_argument_result(message='"doc_ids" must be a non-empty list.')
+
 if status not in ["0", "1"]:
     return get_error_argument_result(message=f'"Status" must be either 0 or 1:{status}!')
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/apps/restful_apis/document_api.py` around lines 943 - 959, The handler currently defaults doc_ids to [] and proceeds, causing an empty/malformed batch to return success; validate that req["doc_ids"] is present and is a non-empty list before the loop (reject non-list or empty list with get_error_argument_result), e.g. check the variable doc_ids immediately after it's assigned and before querying KnowledgebaseService/get_by_id or iterating over for doc_id in doc_ids; return a clear error when doc_ids is missing, not a list, or empty.
1002-1003: ⚠️ Potential issue | 🟡 Minor: Don't classify client-side partial failures as server errors.
`has_error` can be caused by missing/foreign documents or invalid per-document state, so returning `RetCode.SERVER_ERROR` makes these look like 5xx incidents. Use a data/argument error for client-side partials, or track server failures separately.

♻️ Minimal safer response code

```diff
 if has_error:
-    return get_json_result(data=result, message="Partial failure", code=RetCode.SERVER_ERROR)
+    return get_json_result(data=result, message="Partial failure", code=RetCode.DATA_ERROR)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/apps/restful_apis/document_api.py` around lines 1002 - 1003, The current branch returns a 5xx RetCode.SERVER_ERROR when has_error is true; instead treat partial failures from client-side issues as a data/argument error and not a server error. Update the return to use an appropriate client-side code (e.g., RetCode.DATA_ERROR or RetCode.INVALID_ARGUMENT) in the get_json_result call, and if you need to distinguish true server-side failures from client-side partials, add a separate flag/counter (e.g., server_failures) and only return SERVER_ERROR when that flag is non-zero; update the code around has_error/result/get_json_result to reflect this change.
949-984: ⚠️ Potential issue | 🔴 Critical: Scope every `doc_id` to the path dataset before updating.
The dataset ownership check only proves access to `dataset_id`; Line 961 can still load a document from another dataset, and Lines 971-984 then update its DB status and doc-store availability. Reject documents where `doc.kb_id != dataset_id` before any write.

🛡️ Proposed fix

```diff
 e, doc = DocumentService.get_by_id(doc_id)
-if not e:
-    result[doc_id] = {"error": "Document not found"}
+if not e or doc.kb_id != dataset_id:
+    logging.warning(
+        "batch_update_document_status: document %s not found in dataset %s",
+        doc_id,
+        dataset_id,
+    )
+    result[doc_id] = {"error": "Document not found in this dataset."}
     has_error = True
     continue
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/apps/restful_apis/document_api.py` around lines 949 - 984, The loop updates documents without ensuring each document belongs to the requested dataset; before any write or doc-store update, verify doc.kb_id == dataset_id and reject otherwise. In the for-doc_id loop (where DocumentService.get_by_id(doc_id) returns doc), add a check that compares getattr(doc, "kb_id", None) or doc.kb_id to dataset_id and, if mismatched, set result[doc_id] = {"error":"Document does not belong to dataset"} and continue so neither DocumentService.update_by_id nor settings.docStoreConn.update runs on cross-dataset docs.
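The per-document guard this fix proposes can be exercised in isolation. Here a plain dict stands in for the ORM document (the real handler reads `doc.kb_id`), and the helper name `check_doc_access` is hypothetical:

```python
import logging

def check_doc_access(doc_id, doc, dataset_id):
    """Reject missing documents and documents whose kb_id does not match
    the path dataset, logging the denial.

    `doc` is None when the lookup failed, else a dict with a "kb_id" key
    (a stand-in for the ORM model attribute). Returns an error dict to
    store in result[doc_id], or None when the document may be updated.
    """
    if doc is None or doc.get("kb_id") != dataset_id:
        logging.warning(
            "batch_update_document_status: document %s not found in dataset %s",
            doc_id, dataset_id,
        )
        return {"error": "Document not found in this dataset."}
    return None
```

Treating "missing" and "belongs to another dataset" identically avoids leaking whether a foreign document ID exists.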
942-1000: ⚠️ Potential issue | 🟠 Major: Add logging on the new batch-update failure paths.
This new flow returns errors for invalid status, unauthorized dataset access, missing documents, DB update failures, and doc-store failures without logging most of them. Add `logging.warning`/`error`/`exception` at those branches so production failures are diagnosable.

As per coding guidelines: `**/*.py`: Add logging for new flows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/apps/restful_apis/document_api.py` around lines 942 - 1000, Add logging calls on each failure path so production issues are diagnosable: log a warning/error with context (dataset_id, tenant_id, status, doc_id as relevant) when the status is invalid (before returning get_error_argument_result), when KnowledgebaseService.query denies access, when KnowledgebaseService.get_by_id returns no dataset, when DocumentService.get_by_id returns no document, when DocumentService.update_by_id returns False, when settings.docStoreConn.update raises an exception (use logging.exception to capture stack trace) or returns falsy, and in the outer except block (use logging.exception) to record unexpected errors; place these logs adjacent to the existing return/result assignments and include identifiable values like dataset_id, tenant_id, doc_id and the exception message.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 943-959: The handler currently defaults doc_ids to [] and
proceeds, causing an empty/malformed batch to return success; validate that
req["doc_ids"] is present and is a non-empty list before the loop (reject
non-list or empty list with get_error_argument_result), e.g. check the variable
doc_ids immediately after it's assigned and before querying
KnowledgebaseService/get_by_id or iterating over for doc_id in doc_ids; return a
clear error when doc_ids is missing, not a list, or empty.
- Around line 1002-1003: The current branch returns a 5xx RetCode.SERVER_ERROR
when has_error is true; instead treat partial failures from client-side issues
as a data/argument error and not a server error. Update the return to use an
appropriate client-side code (e.g., RetCode.DATA_ERROR or
RetCode.INVALID_ARGUMENT) in the get_json_result call, and if you need to
distinguish true server-side failures from client-side partials, add a separate
flag/counter (e.g., server_failures) and only return SERVER_ERROR when that flag
is non-zero; update the code around has_error/result/get_json_result to reflect
this change.
- Around line 949-984: The loop updates documents without ensuring each document
belongs to the requested dataset; before any write or doc-store update, verify
doc.kb_id == dataset_id and reject otherwise. In the for-doc_id loop (where
DocumentService.get_by_id(doc_id) returns doc), add a check that compares
getattr(doc, "kb_id", None) or doc.kb_id to dataset_id and, if mismatched, set
result[doc_id] = {"error":"Document does not belong to dataset"} and continue so
neither DocumentService.update_by_id nor settings.docStoreConn.update runs on
cross-dataset docs.
- Around line 942-1000: Add logging calls on each failure path so production
issues are diagnosable: log a warning/error with context (dataset_id, tenant_id,
status, doc_id as relevant) when the status is invalid (before returning
get_error_argument_result), when KnowledgebaseService.query denies access, when
KnowledgebaseService.get_by_id returns no dataset, when
DocumentService.get_by_id returns no document, when DocumentService.update_by_id
returns False, when settings.docStoreConn.update raises an exception (use
logging.exception to capture stack trace) or returns falsy, and in the outer
except block (use logging.exception) to record unexpected errors; place these
logs adjacent to the existing return/result assignments and include identifiable
values like dataset_id, tenant_id, doc_id and the exception message.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7231b76e-0102-474f-9e35-8b45184b678b
📒 Files selected for processing (8)
- api/apps/document_app.py
- api/apps/restful_apis/document_api.py
- test/testcases/test_web_api/test_common.py
- test/testcases/test_web_api/test_document_app/test_document_metadata.py
- web/src/pages/user-setting/data-source/data-source-detail-page/index.tsx
- web/src/pages/user-setting/data-source/hooks.ts
- web/src/services/knowledge-service.ts
- web/src/utils/api.ts
💤 Files with no reviewable changes (1)
- api/apps/document_app.py
✅ Files skipped from review due to trivial changes (2)
- web/src/pages/user-setting/data-source/hooks.ts
- web/src/pages/user-setting/data-source/data-source-detail-page/index.tsx
🚧 Files skipped from review as they are similar to previous changes (3)
- test/testcases/test_web_api/test_common.py
- test/testcases/test_web_api/test_document_app/test_document_metadata.py
- web/src/services/knowledge-service.ts
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`:
- Around line 356-371: This test is a duplicate of
TestDocumentMetadataNegative.test_change_status_invalid_status; remove this
redundant test function test_change_status_invalid_status (which uses
add_document_func and the payload {"doc_ids": [doc_id], "status": "2"}) or
replace it with distinct negative cases (e.g., status as empty string, status as
numeric 2, status as None, or omit the status key) and adjust assertions
accordingly so it no longer mirrors the original at lines 208-213.
- Around line 315-399: The three E2E tests
(test_change_status_partial_failure_matrix, test_change_status_invalid_status,
test_change_status_all_success) are placed inside TestDocumentMetadataUnit but
use WebApiAuth and live API fixtures; move them out of the
TestDocumentMetadataUnit class into an appropriate E2E test class (either
existing TestDocumentMetadata / TestDocumentMetadataNegative where similar tests
live, or create a new TestDocumentBatchChangeStatus class) so unit-suite markers
and monkeypatched unit tests remain isolated; update imports/fixtures on the
moved tests as needed and remove them from TestDocumentMetadataUnit to keep
marker-based selection correct.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 50dcbef3-6b3a-4a23-b779-8672687c2b10
📒 Files selected for processing (1)
test/testcases/test_web_api/test_document_app/test_document_metadata.py
Actionable comments posted: 1
♻️ Duplicate comments (1)
api/apps/restful_apis/document_api.py (1)
942-947: ⚠️ Potential issue | 🟡 Minor: Validate `doc_ids` is a non-empty list.
`req.get("doc_ids", [])` accepts anything the client sends — `null`, a string, a dict, or an empty list. With `"doc_ids": null` the subsequent `for doc_id in doc_ids` raises `TypeError` and surfaces as an unhandled 500 (there's no outer `try`). With an empty list or missing key, the handler silently returns `{"data": {}}` as success even though no work was performed, which is misleading for a batch mutation. A short up-front check aligns this with `delete_documents` (lines 734-740).
🛡️ Proposed fix

```diff
     req = await get_request_json()
     doc_ids = req.get("doc_ids", [])
     status = str(req.get("status", -1))
     if status not in ["0", "1"]:
         return get_error_argument_result(message=f'"Status" must be either 0 or 1:{status}!')
+
+    if not isinstance(doc_ids, list) or not doc_ids:
+        return get_error_argument_result(message="'doc_ids' must be a non-empty list.")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@api/apps/restful_apis/document_api.py` around lines 942 - 947, The handler currently assigns doc_ids = req.get("doc_ids", []) without validating type or emptiness, which allows null or non-list values to cause a TypeError or silent no-op; update the handler (after calling get_request_json()) to verify that doc_ids is a list and not empty (e.g., isinstance(doc_ids, list) and len(doc_ids) > 0) and return a clear argument error (similar to delete_documents) when validation fails; reference the doc_ids variable and the get_request_json() call when adding this check so the request is rejected early with an informative message instead of proceeding to the for doc_id in doc_ids loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 959-968: The loop currently returns early when a document's kb_id
!= dataset_id, which aborts the batch and exposes the foreign kb_id; instead, in
the DocumentService.get_by_id handling replace the early return with the same
per-doc error pattern used elsewhere: set result[doc_id] = {"error": "Document
not in dataset"}, set has_error = True, and continue the loop. Use doc_id (not
doc.kb_id) in the error message to avoid leaking tenant identifiers and do not
call get_error_data_result at this point.
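The per-document error pattern this comment describes can be sketched as follows. A minimal illustration only: the `result`/`has_error` accumulators follow the comment's wording, and the injected `get_by_id` callable is a stub standing in for `DocumentService.get_by_id`:

```python
def batch_change_status(dataset_id, doc_ids, status, get_by_id):
    """Accumulate per-document errors instead of aborting the whole batch."""
    result, has_error = {}, False
    for doc_id in doc_ids:
        ok, doc = get_by_id(doc_id)
        if not ok:
            result[doc_id] = {"error": "Document not found"}
            has_error = True
            continue
        if doc["kb_id"] != dataset_id:
            # Report per document and keep going; use doc_id (not doc["kb_id"])
            # in the message so no foreign tenant identifier leaks out.
            result[doc_id] = {"error": "Document not in dataset"}
            has_error = True
            continue
        doc["status"] = status
        result[doc_id] = {"status": status}
    return result, has_error
```

With this shape, one document belonging to another dataset no longer short-circuits the rest of the batch; the caller can still signal "Partial failure" from `has_error`.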
---
Duplicate comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 942-947: The handler currently assigns doc_ids =
req.get("doc_ids", []) without validating type or emptiness, which allows null
or non-list values to cause a TypeError or silent no-op; update the handler
(after calling get_request_json()) to verify that doc_ids is a list and not
empty (e.g., isinstance(doc_ids, list) and len(doc_ids) > 0) and return a clear
argument error (similar to delete_documents) when validation fails; reference
the doc_ids variable and the get_request_json() call when adding this check so
the request is rejected early with an informative message instead of proceeding
to the for doc_id in doc_ids loop.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0cec966a-d66c-43a3-b54b-5e13898d66f6
📒 Files selected for processing (2)
api/apps/restful_apis/document_api.py
test/testcases/test_web_api/test_document_app/test_document_metadata.py
🚧 Files skipped from review as they are similar to previous changes (1)
- test/testcases/test_web_api/test_document_app/test_document_metadata.py
What problem does this PR solve?
Before migration (Web API):
POST /v1/document/change_status
After migration (consolidated RESTful API):
POST /api/v1/datasets/<dataset_id>/documents/batch-update-status
Type of change
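A minimal sketch of what a client call against the consolidated endpoint could look like. The URL comes from the PR description and the payload fields follow the review comments; the dataset and document IDs are made-up placeholders, and the guard mirrors the server-side validation suggested in review:

```python
DATASET_ID = "ds_123"  # assumption: a dataset owned by the caller

payload = {
    "doc_ids": ["doc_a", "doc_b"],  # must be a non-empty list
    "status": "1",                  # must be the string "0" or "1"
}
url = f"/api/v1/datasets/{DATASET_ID}/documents/batch-update-status"

# Client-side guard mirroring the server-side checks from the review:
assert isinstance(payload["doc_ids"], list) and payload["doc_ids"]
assert payload["status"] in ("0", "1")
```

Unlike the old global `POST /v1/document/change_status`, the dataset is named in the path, so the server can verify dataset ownership before touching any document.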