
Refactor: migrate document change status API#14300

Closed
xugangqiang wants to merge 11 commits into infiniflow:main from xugangqiang:doc-batch-change-status

Conversation

@xugangqiang
Contributor

@xugangqiang xugangqiang commented Apr 22, 2026

What problem does this PR solve?

Before migration (Web API):
POST /v1/document/change_status

After consolidation (RESTful API):
POST /api/v1/datasets/<dataset_id>/documents/batch-update-status

Type of change

  • Refactoring
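
For reference, the consolidated endpoint can be exercised like this. A minimal sketch: the base URL and the helper name are placeholders, and only the route shape and the payload keys (doc_ids, status) come from the PR description.

```python
import json
from urllib.parse import quote


def build_batch_status_request(base_url: str, dataset_id: str, doc_ids: list, status: str):
    """Build the URL and JSON body for the dataset-scoped endpoint.

    Hypothetical helper for illustration; a real client would POST the
    body to the URL with its usual Authorization header.
    """
    url = f"{base_url}/api/v1/datasets/{quote(dataset_id)}/documents/batch-update-status"
    payload = {"doc_ids": doc_ids, "status": status}
    return url, json.dumps(payload)


url, body = build_batch_status_request(
    "http://localhost:9380", "ds-1", ["doc-a", "doc-b"], "1"
)
```

The response maps each doc_id to its individual result, so callers can inspect partial failures per document.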

@coderabbitai
Contributor

coderabbitai Bot commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough


Moves document status updates from a global POST /change_status to a dataset-scoped POST /datasets/<dataset_id>/documents/batch-update-status, updating backend route and logic, frontend API/hooks/components to pass datasetId, and tests to exercise the new dataset-scoped flow.

Changes

  • Backend — route removal & addition (api/apps/document_app.py, api/apps/restful_apis/document_api.py): Removed the global POST /change_status handler from document_app.py; added POST /datasets/<dataset_id>/documents/batch-update-status in restful_apis/document_api.py, which validates the status value, enforces dataset ownership, updates each document's status, updates doc-store availability when needed, and returns per-document results (partial-failure semantics preserved).
  • Frontend — API surface & service (web/src/utils/api.ts, web/src/services/knowledge-service.ts): Changed documentChangeStatus from a fixed URL to a datasetId => .../datasets/${datasetId}/documents/batch-update-status builder; removed the kbService.documentChangeStatus mapping and added an exported changeDocumentsStatus({ kb_id, doc_ids, status }).
  • Frontend — Hooks & UI wiring (web/src/hooks/use-document-request.ts, web/src/pages/dataset/dataset/use-bulk-operate-dataset.tsx, web/src/pages/dataset/dataset/use-dataset-table-columns.tsx, web/src/pages/dataset/dataset/dataset-table.tsx): Threaded datasetId (knowledgeBase.id) through hooks, table columns, and bulk operations; updated mutation signatures and call sites to include datasetId when performing batch status updates.
  • Tests (test/testcases/test_web_api/test_common.py, test/testcases/test_web_api/test_document_app/test_document_metadata.py): Updated the test helper signature to accept dataset_id; replaced unit tests that targeted the removed endpoint with dataset-scoped E2E-style tests covering multi-document updates, partial failures, persistence checks, and cleanup.
  • Minor / formatting (web/src/pages/user-setting/..., web/src/pages/user-setting/data-source/hooks.ts): Whitespace/import/formatting tweaks and minor signature formatting changes; no behavioral changes.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Frontend as Frontend (hooks/service)
    participant API as Server API
    participant DB as Database
    participant DocStore as DocStore

    Client->>Frontend: request batch-update-status (datasetId, doc_ids, status)
    Frontend->>API: POST /datasets/{datasetId}/documents/batch-update-status (doc_ids, status)
    API->>DB: verify dataset ownership
    DB-->>API: ownership OK / error
    loop per doc_id
        API->>DB: fetch document record
        DB-->>API: document or missing
        alt missing
            API-->>Frontend: include per-document missing error
        else exists & status differs
            API->>DB: update document.status
            DB-->>API: update ack
            alt doc.chunk_num > 0
                API->>DocStore: update availability (tenant index, kb_id)
                DocStore-->>API: ack / error
            end
        else status already matches
            API-->>Frontend: include per-document no-op result
        end
    end
    API-->>Frontend: aggregated per-document results (success or "Partial failure")
    Frontend-->>Client: API response
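
The per-document loop in the diagram can be condensed into plain Python. This is an illustrative stand-in: the `DOCS` dict and the helper below are invented for the sketch and are not the actual service layer.

```python
# In-memory stand-in for the document table.
DOCS = {
    "doc-a": {"status": "0", "chunk_num": 3},
    "doc-b": {"status": "1", "chunk_num": 0},
}


def batch_update_status(doc_ids, status):
    """Apply a status to each doc, collecting per-document results.

    Mirrors the partial-failure semantics described above: one bad id
    does not abort the batch, it just flags has_error.
    """
    result, has_error = {}, False
    for doc_id in doc_ids:
        doc = DOCS.get(doc_id)
        if doc is None:
            result[doc_id] = {"error": "Document not found"}
            has_error = True
            continue
        if doc["status"] != status:
            doc["status"] = status
            # A real handler would also update doc-store availability
            # here when doc["chunk_num"] > 0.
        result[doc_id] = {"status": status}
    return result, has_error
```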

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

☯️ refactor, 🧪 test, 🐖 api

Suggested reviewers

  • JinHai-CN
  • yuzhichang
  • yingfeng

Poem

🐰 I hopped from a global trail to a dataset lane,

Doc flags now travel with their dataset name.
Rows flip softly, indexes hum in tune,
Partial mishaps listed under the moon.
A tiny rabbit cheers — scoped hops make work less vain!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 50.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Description check — ❓ Inconclusive: The description identifies the refactoring type and shows the before/after endpoints, but does not explain why the migration is necessary. Resolution: expand the "What problem does this PR solve?" section with the motivation for consolidating the endpoints and the benefits of the RESTful approach.

✅ Passed checks (3 passed)

  • Title check — ✅ Passed: The title accurately reflects the main change: migrating the document change-status API from one endpoint to another.
  • Linked Issues check — ✅ Passed: Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: Skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@xugangqiang xugangqiang added the ci Continue Integration label Apr 22, 2026
@xugangqiang xugangqiang marked this pull request as ready for review April 22, 2026 11:31
@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Apr 22, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
web/src/pages/dataset/dataset/use-dataset-table-columns.tsx (1)

168-178: Optional: source datasetId from the row itself.

row.original.dataset_id is already available on IDocumentInfo and is used elsewhere in this same column set (line 99, navigateToChunkParsedResult(..., row.original.dataset_id)). Reusing it here would avoid threading an extra optional datasetId prop through DatasetTable → useDatasetTableColumns and would eliminate the ?: string uncertainty in the column signature.

♻️ Proposed refactor
 cell: ({ row }) => {
   const id = row.original.id;
   return (
     <Switch
       checked={row.getValue('status') === '1'}
       onCheckedChange={(e) => {
-        setDocumentStatus({ status: e, documentId: id, datasetId });
+        setDocumentStatus({
+          status: e,
+          documentId: id,
+          datasetId: row.original.dataset_id,
+        });
       }}
     />
   );
 },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@web/src/pages/dataset/dataset/use-dataset-table-columns.tsx` around lines 168
- 178, The Switch cell currently uses an external datasetId prop; change it to
read datasetId from the row itself (row.original.dataset_id) inside
useDatasetTableColumns so the Switch onCheckedChange calls setDocumentStatus({
status: e, documentId: id, datasetId: row.original.dataset_id }) instead of
using the optional external prop; update any type assumptions in
useDatasetTableColumns and the column signature to remove the optional datasetId
threading and rely on row.original.dataset_id (refer to the Switch cell,
setDocumentStatus function call, and row.original.dataset_id).
test/testcases/test_web_api/test_document_app/test_document_metadata.py (1)

141-197: Add a negative test for cross-dataset doc_ids.

The new dataset-scoped batch endpoint accepts a list of doc_ids without (currently) validating that those docs belong to dataset_id. Regardless of how the handler ends up enforcing that, a test like the following would guard against regressions:

`@pytest.mark.p2`
def test_change_status_foreign_doc_id_rejected(self, WebApiAuth, add_document_func, add_dataset_func):
    _, doc_id = add_document_func          # doc belongs to dataset A
    other_dataset_id = add_dataset_func    # dataset B, does not own doc
    res = document_change_status(WebApiAuth, other_dataset_id, {"doc_ids": [doc_id], "status": "1"})
    # Expect an error for this doc, not a silent success
    assert res["data"][doc_id].get("error"), res

Based on learnings (applies to tests/**/*.py: add/adjust tests for behavior changes).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`
around lines 141 - 197, Add a negative test to verify cross-dataset doc_id
rejection: create a new test method (e.g.,
test_change_status_foreign_doc_id_rejected) in the TestDocumentMetadataNegative
class that uses add_document_func to create a doc in dataset A, uses
add_dataset_func to create dataset B, then calls
document_change_status(WebApiAuth, other_dataset_id, {"doc_ids": [doc_id],
"status": "1"}) and assert the response indicates an error for that doc (e.g.,
res["data"][doc_id].get("error") is truthy) rather than silently succeeding;
reference the existing test_change_status for call shape and reuse WebApiAuth,
add_document_func, add_dataset_func, and document_change_status symbols.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 930-932: The partial-failure branch currently returns a
server-side code (RetCode.SERVER_ERROR) which misclassifies client/document
errors; update the logic in the block that checks has_error (where
get_json_result is called) to return an appropriate client-side code such as
RetCode.ARGUMENT_ERROR or RetCode.DATA_ERROR instead, or add finer distinctions
(e.g., separate conditions for "all failed" vs "partial failure" vs "all
succeeded") so get_json_result(data=result, message="Partial failure", code=...)
uses a non-5xx RetCode and callers can handle partial vs total failures
correctly.
- Around line 870-928: Validate per-document authorization and add logging: for
each doc_id retrieved by DocumentService.get_by_id, ensure the document belongs
to the requested dataset by either calling DocumentService.accessible(doc_id,
tenant_id) (or checking doc.kb_id == dataset_id) before performing updates; if
access is denied, set result[doc_id] = {"error": "No authorization."} and
logging.error that denial. Add logging.error/exception calls on the initial
request validation failure, per-doc lookup/update errors, and in the except
block around settings.docStoreConn.update to record exception details. Also stop
double-stringifying status when calling DocumentService.update_by_id (pass an
int or a single-string value consistently) so update_by_id receives the correct
type.

In `@web/src/pages/dataset/dataset/use-bulk-operate-dataset.tsx`:
- Around line 88-97: The onChangeStatus callback can call setDocumentStatus with
datasetId undefined if knowledgeBase?.id isn't ready; update onChangeStatus
(and/or the caller) to guard for knowledgeBase?.id before invoking
setDocumentStatus — e.g., early-return if !knowledgeBase?.id or disable the bulk
action UI when knowledgeBase?.id is falsy — and ensure the payload passed to
setDocumentStatus (used here) always contains a valid datasetId (reference:
onChangeStatus, setDocumentStatus, knowledgeBase?.id, selectedRowKeys).

---

Nitpick comments:
In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`:
- Around line 141-197: Add a negative test to verify cross-dataset doc_id
rejection: create a new test method (e.g.,
test_change_status_foreign_doc_id_rejected) in the TestDocumentMetadataNegative
class that uses add_document_func to create a doc in dataset A, uses
add_dataset_func to create dataset B, then calls
document_change_status(WebApiAuth, other_dataset_id, {"doc_ids": [doc_id],
"status": "1"}) and assert the response indicates an error for that doc (e.g.,
res["data"][doc_id].get("error") is truthy) rather than silently succeeding;
reference the existing test_change_status for call shape and reuse WebApiAuth,
add_document_func, add_dataset_func, and document_change_status symbols.

In `@web/src/pages/dataset/dataset/use-dataset-table-columns.tsx`:
- Around line 168-178: The Switch cell currently uses an external datasetId
prop; change it to read datasetId from the row itself (row.original.dataset_id)
inside useDatasetTableColumns so the Switch onCheckedChange calls
setDocumentStatus({ status: e, documentId: id, datasetId:
row.original.dataset_id }) instead of using the optional external prop; update
any type assumptions in useDatasetTableColumns and the column signature to
remove the optional datasetId threading and rely on row.original.dataset_id
(refer to the Switch cell, setDocumentStatus function call, and
row.original.dataset_id).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8fb69f16-1896-4361-a929-24ab076b93d7

📥 Commits

Reviewing files that changed from the base of the PR and between b52c518 and 39d348e.

📒 Files selected for processing (10)
  • api/apps/document_app.py
  • api/apps/restful_apis/document_api.py
  • test/testcases/test_web_api/test_common.py
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py
  • web/src/hooks/use-document-request.ts
  • web/src/pages/dataset/dataset/dataset-table.tsx
  • web/src/pages/dataset/dataset/use-bulk-operate-dataset.tsx
  • web/src/pages/dataset/dataset/use-dataset-table-columns.tsx
  • web/src/services/knowledge-service.ts
  • web/src/utils/api.ts
💤 Files with no reviewable changes (1)
  • api/apps/document_app.py

Comment thread api/apps/restful_apis/document_api.py
Comment on lines +930 to +932
if has_error:
return get_json_result(data=result, message="Partial failure", code=RetCode.SERVER_ERROR)
return get_json_result(data=result)
Contributor


⚠️ Potential issue | 🟡 Minor

Partial-failure response should not use SERVER_ERROR.

When has_error is due solely to client-side/per-document issues (e.g., Document not found, status mismatch), returning RetCode.SERVER_ERROR misrepresents the failure class and will trigger 5xx error handling / alerts on the client. Consider returning a client-side code (e.g., RetCode.ARGUMENT_ERROR or RetCode.DATA_ERROR), or distinguishing between "all failed", "partial failure", and "all succeeded" so callers can handle each.
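
A sketch of the three-way classification suggested here; the RetCode values below are illustrative stand-ins, not the project's actual constants.

```python
from enum import IntEnum


class RetCode(IntEnum):
    """Illustrative subset; real values live in the project's RetCode enum."""
    SUCCESS = 0
    DATA_ERROR = 102
    SERVER_ERROR = 500


def classify_batch(result, server_failures=0):
    """Pick a return code from per-document results.

    Client-side problems (missing docs, bad status) stay non-5xx;
    only genuine server failures escalate to SERVER_ERROR.
    """
    if server_failures:
        return RetCode.SERVER_ERROR
    failed = sum(1 for r in result.values() if "error" in r)
    if failed == 0:
        return RetCode.SUCCESS
    # "all failed" and "partial failure" are both data problems here;
    # callers can distinguish them from the per-document payload.
    return RetCode.DATA_ERROR
```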

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 930 - 932, The
partial-failure branch currently returns a server-side code
(RetCode.SERVER_ERROR) which misclassifies client/document errors; update the
logic in the block that checks has_error (where get_json_result is called) to
return an appropriate client-side code such as RetCode.ARGUMENT_ERROR or
RetCode.DATA_ERROR instead, or add finer distinctions (e.g., separate conditions
for "all failed" vs "partial failure" vs "all succeeded") so
get_json_result(data=result, message="Partial failure", code=...) uses a non-5xx
RetCode and callers can handle partial vs total failures correctly.

Comment on lines 88 to 97
  const onChangeStatus = useCallback(
    (enabled: boolean) => {
-     setDocumentStatus({ status: enabled, documentId: selectedRowKeys });
+     setDocumentStatus({
+       status: enabled,
+       documentId: selectedRowKeys,
+       datasetId: knowledgeBase?.id,
+     });
    },
-   [selectedRowKeys, setDocumentStatus],
+   [selectedRowKeys, setDocumentStatus, knowledgeBase],
  );
Contributor


⚠️ Potential issue | 🟡 Minor

Consider guarding against undefined knowledgeBase?.id.

setDocumentStatus will receive datasetId: undefined if a user triggers enable/disable before the KB context finishes loading, producing a request to /datasets/undefined/documents/batch-update-status. In practice the bulk actions are only reachable when rows are selected (which requires documents to have loaded alongside the KB), so this is mostly theoretical — but an early-return or disabled-state on !knowledgeBase?.id would be safer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@web/src/pages/dataset/dataset/use-bulk-operate-dataset.tsx` around lines 88 -
97, The onChangeStatus callback can call setDocumentStatus with datasetId
undefined if knowledgeBase?.id isn't ready; update onChangeStatus (and/or the
caller) to guard for knowledgeBase?.id before invoking setDocumentStatus — e.g.,
early-return if !knowledgeBase?.id or disable the bulk action UI when
knowledgeBase?.id is falsy — and ensure the payload passed to setDocumentStatus
(used here) always contains a valid datasetId (reference: onChangeStatus,
setDocumentStatus, knowledgeBase?.id, selectedRowKeys).

@xugangqiang xugangqiang changed the title WIP migrate document change status Refactor: migrate document change status API Apr 22, 2026
@codecov

codecov Bot commented Apr 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.97%. Comparing base (2b02988) to head (4e9e9b2).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #14300   +/-   ##
=======================================
  Coverage   95.97%   95.97%           
=======================================
  Files          10       10           
  Lines         695      695           
  Branches      111      111           
=======================================
  Hits          667      667           
  Misses         11       11           
  Partials       17       17           

☔ View full report in Codecov by Sentry.

@xugangqiang xugangqiang self-assigned this Apr 23, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
test/testcases/test_web_api/test_document_app/test_document_metadata.py (1)

370-386: Consider removing the duplicate invalid-status case.

This duplicates TestDocumentMetadataNegative.test_change_status_invalid_status at Lines 195-200 with the same setup, payload, and assertions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`
around lines 370 - 386, The test_change_status_invalid_status function
duplicates an existing negative test
(TestDocumentMetadataNegative.test_change_status_invalid_status) that uses the
same setup, payload and assertions; remove this duplicate from
test_document_metadata.py (or consolidate by reusing the existing
TestDocumentMetadataNegative case or converting both to a shared parametrized
test) so only one test asserts the invalid status behavior, referencing the test
function name test_change_status_invalid_status and the existing
TestDocumentMetadataNegative suite to locate and eliminate the redundant case.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`:
- Around line 388-414: The test currently asserts the document already has the
target status ("1") before calling document_change_status, so the update could
be a no-op; update test_change_status_all_success to first assert the initial
status is the opposite (e.g., "0") or at minimum assert initial_status != "1"
using document_infos, then call document_change_status to set status "1" and
keep the existing assertions that response code is 0 and that both the response
payload (res["data"][doc_id]["status"]) and a subsequent document_infos call
show the new status "1"; references: test_change_status_all_success,
document_infos, document_change_status.
- Around line 330-368: The test test_change_status_partial_failure_matrix
currently only uses valid IDs and can leak docs if bulk_upload_documents returns
a partial list; change it to (1) capture the exact list returned by
bulk_upload_documents into doc_ids, assert it's non-empty but not require exact
length, (2) append a known-invalid id (e.g., "nonexistent-id-123") to doc_ids to
force a partial-failure scenario before calling document_change_status, (3)
assert the response from document_change_status contains both success entries
for the real IDs and failure info for the invalid id, and verify via
document_infos that only the real docs' statuses changed, and (4) make the
finally block robust by only calling delete_document when doc_ids_real (the
original returned real IDs) is non-empty and wrapping delete_document in a
try/except to ensure cleanup never raises; reference the functions/bindings
bulk_upload_documents, document_change_status, document_infos, delete_document
and the test function name to locate edits.

---

Nitpick comments:
In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`:
- Around line 370-386: The test_change_status_invalid_status function duplicates
an existing negative test
(TestDocumentMetadataNegative.test_change_status_invalid_status) that uses the
same setup, payload and assertions; remove this duplicate from
test_document_metadata.py (or consolidate by reusing the existing
TestDocumentMetadataNegative case or converting both to a shared parametrized
test) so only one test asserts the invalid status behavior, referencing the test
function name test_change_status_invalid_status and the existing
TestDocumentMetadataNegative suite to locate and eliminate the redundant case.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f412c0be-0031-4531-9ba8-7b2f71e8592b

📥 Commits

Reviewing files that changed from the base of the PR and between 39d348e and b6a00e8.

📒 Files selected for processing (1)
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py

Comment thread test/testcases/test_web_api/test_document_app/test_document_metadata.py Outdated
Comment on lines +388 to +414
@pytest.mark.p2
def test_change_status_all_success(self, WebApiAuth, add_document_func):
"""
E2E test for successful batch document status change.

This test verifies that all documents are successfully updated
when valid status values are provided.
"""

dataset_id, doc_id = add_document_func

# Verify initial status is "0" (unprocessed)
info_res = document_infos(WebApiAuth, dataset_id, {"ids": [doc_id]})
assert info_res["code"] == 0, info_res
assert info_res["data"]["docs"][0]["status"] == "1", "Initial status should be 1"

# Update status to "1" (processed)
res = document_change_status(WebApiAuth, dataset_id, {"doc_ids": [doc_id], "status": "1"})

# Verify success
assert res["code"] == 0, f"Expected success code 0, got {res}"
assert res["data"][doc_id]["status"] == "1", "Document status should be 1"

# Verify the status was actually updated in the database
info_res = document_infos(WebApiAuth, dataset_id, {"ids": [doc_id]})
assert info_res["code"] == 0, info_res
assert info_res["data"]["docs"][0]["status"] == "1", "Document status should be 1 in database"
Contributor


⚠️ Potential issue | 🟡 Minor

Verify a real status transition.

Line 402 asserts the document already has the target status, then Lines 405-414 update it to the same value. This can pass even if the endpoint is a no-op.

Proposed fix
-        # Verify initial status is "0" (unprocessed)
+        # Capture initial status and switch to the opposite valid status.
         info_res = document_infos(WebApiAuth, dataset_id, {"ids": [doc_id]})
         assert info_res["code"] == 0, info_res
-        assert info_res["data"]["docs"][0]["status"] == "1", "Initial status should be 1"
+        initial_status = info_res["data"]["docs"][0]["status"]
+        target_status = "0" if initial_status == "1" else "1"
         
-        # Update status to "1" (processed)
-        res = document_change_status(WebApiAuth, dataset_id, {"doc_ids": [doc_id], "status": "1"})
+        # Update status to the opposite valid value.
+        res = document_change_status(WebApiAuth, dataset_id, {"doc_ids": [doc_id], "status": target_status})
         
         # Verify success
         assert res["code"] == 0, f"Expected success code 0, got {res}"
-        assert res["data"][doc_id]["status"] == "1", "Document status should be 1"
+        assert res["data"][doc_id]["status"] == target_status, f"Document status should be {target_status}"
         
         # Verify the status was actually updated in the database
         info_res = document_infos(WebApiAuth, dataset_id, {"ids": [doc_id]})
         assert info_res["code"] == 0, info_res
-        assert info_res["data"]["docs"][0]["status"] == "1", "Document status should be 1 in database"
+        assert info_res["data"]["docs"][0]["status"] == target_status, f"Document status should be {target_status} in database"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`
around lines 388 - 414, The test currently asserts the document already has the
target status ("1") before calling document_change_status, so the update could
be a no-op; update test_change_status_all_success to first assert the initial
status is the opposite (e.g., "0") or at minimum assert initial_status != "1"
using document_infos, then call document_change_status to set status "1" and
keep the existing assertions that response code is 0 and that both the response
payload (res["data"][doc_id]["status"]) and a subsequent document_infos call
show the new status "1"; references: test_change_status_all_success,
document_infos, document_change_status.

Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (4)
api/apps/restful_apis/document_api.py (4)

943-959: ⚠️ Potential issue | 🟡 Minor

Reject empty or non-list doc_ids instead of returning success.

The schema marks doc_ids as required, but the handler defaults missing input to []; an empty or malformed batch currently returns 200 with {}. Validate doc_ids before the loop.

🛡️ Proposed validation
     doc_ids = req.get("doc_ids", [])
     status = str(req.get("status", -1))
 
+    if not isinstance(doc_ids, list) or not doc_ids:
+        return get_error_argument_result(message='"doc_ids" must be a non-empty list.')
+
     if status not in ["0", "1"]:
         return get_error_argument_result(message=f'"Status" must be either 0 or 1:{status}!')
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 943 - 959, The handler
currently defaults doc_ids to [] and proceeds, causing an empty/malformed batch
to return success; validate that req["doc_ids"] is present and is a non-empty
list before the loop (reject non-list or empty list with
get_error_argument_result), e.g. check the variable doc_ids immediately after
it's assigned and before querying KnowledgebaseService/get_by_id or iterating
over for doc_id in doc_ids; return a clear error when doc_ids is missing, not a
list, or empty.

1002-1003: ⚠️ Potential issue | 🟡 Minor

Don’t classify client-side partial failures as server errors.

has_error can be caused by missing/foreign documents or invalid per-document state, so returning RetCode.SERVER_ERROR makes these look like 5xx incidents. Use a data/argument error for client-side partials, or track server failures separately.

♻️ Minimal safer response code
     if has_error:
-        return get_json_result(data=result, message="Partial failure", code=RetCode.SERVER_ERROR)
+        return get_json_result(data=result, message="Partial failure", code=RetCode.DATA_ERROR)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 1002 - 1003, The current
branch returns a 5xx RetCode.SERVER_ERROR when has_error is true; instead treat
partial failures from client-side issues as a data/argument error and not a
server error. Update the return to use an appropriate client-side code (e.g.,
RetCode.DATA_ERROR or RetCode.INVALID_ARGUMENT) in the get_json_result call, and
if you need to distinguish true server-side failures from client-side partials,
add a separate flag/counter (e.g., server_failures) and only return SERVER_ERROR
when that flag is non-zero; update the code around
has_error/result/get_json_result to reflect this change.

949-984: ⚠️ Potential issue | 🔴 Critical

Scope every doc_id to the path dataset before updating.

The dataset ownership check only proves access to dataset_id; Line 961 can still load a document from another dataset, and Lines 971-984 then update its DB status and doc-store availability. Reject documents where doc.kb_id != dataset_id before any write.

🛡️ Proposed fix
             e, doc = DocumentService.get_by_id(doc_id)
-            if not e:
-                result[doc_id] = {"error": "Document not found"}
+            if not e or doc.kb_id != dataset_id:
+                logging.warning(
+                    "batch_update_document_status: document %s not found in dataset %s",
+                    doc_id,
+                    dataset_id,
+                )
+                result[doc_id] = {"error": "Document not found in this dataset."}
                 has_error = True
                 continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 949 - 984, The loop
updates documents without ensuring each document belongs to the requested
dataset; before any write or doc-store update, verify doc.kb_id == dataset_id
and reject otherwise. In the for-doc_id loop (where
DocumentService.get_by_id(doc_id) returns doc), add a check that compares
getattr(doc, "kb_id", None) or doc.kb_id to dataset_id and, if mismatched, set
result[doc_id] = {"error":"Document does not belong to dataset"} and continue so
neither DocumentService.update_by_id nor settings.docStoreConn.update runs on
cross-dataset docs.
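The per-document ownership guard can be sketched in isolation. The Doc dataclass and the DOCS lookup table below stand in for DocumentService.get_by_id; the real DB and doc-store writes are omitted.

```python
# Reject documents whose kb_id differs from the path dataset_id before
# any write, reporting a per-item error instead of aborting the batch.
from dataclasses import dataclass

@dataclass
class Doc:
    id: str
    kb_id: str

# Stand-in for DocumentService.get_by_id.
DOCS = {"d1": Doc("d1", "kb-A"), "d2": Doc("d2", "kb-B")}

def batch_update(dataset_id: str, doc_ids: list[str]) -> tuple[dict, bool]:
    result, has_error = {}, False
    for doc_id in doc_ids:
        doc = DOCS.get(doc_id)
        if doc is None or doc.kb_id != dataset_id:
            # Same message for "missing" and "foreign" so the foreign
            # kb_id is never leaked to the caller.
            result[doc_id] = {"error": "Document not found in this dataset."}
            has_error = True
            continue
        result[doc_id] = {"status": "updated"}  # real handler updates DB + doc store here
    return result, has_error

res, err = batch_update("kb-A", ["d1", "d2", "d3"])
```

Here d1 succeeds while d2 (wrong dataset) and d3 (nonexistent) get identical per-item errors.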

942-1000: ⚠️ Potential issue | 🟠 Major

Add logging on the new batch-update failure paths.

This new flow returns errors for invalid status, unauthorized dataset access, missing documents, DB update failures, and doc-store failures without logging most of them. Add logging.warning/error/exception at those branches so production failures are diagnosable.

As per coding guidelines: **/*.py: Add logging for new flows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 942 - 1000, Add logging
calls on each failure path so production issues are diagnosable: log a
warning/error with context (dataset_id, tenant_id, status, doc_id as relevant)
when the status is invalid (before returning get_error_argument_result), when
KnowledgebaseService.query denies access, when KnowledgebaseService.get_by_id
returns no dataset, when DocumentService.get_by_id returns no document, when
DocumentService.update_by_id returns False, when settings.docStoreConn.update
raises an exception (use logging.exception to capture stack trace) or returns
falsy, and in the outer except block (use logging.exception) to record
unexpected errors; place these logs adjacent to the existing return/result
assignments and include identifiable values like dataset_id, tenant_id, doc_id
and the exception message.
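A minimal sketch of the logging pattern the guideline calls for, with the doc-store call stubbed to raise so the except branch runs. The function names mirror the handler, but none of the real services are used.

```python
# Log a warning with identifying context on expected failure branches,
# and use logging.exception inside except blocks so the traceback is
# captured alongside the message.
import logging

logger = logging.getLogger("document_api")

def update_doc_store(doc_id: str, dataset_id: str, available: bool) -> bool:
    # Stand-in for settings.docStoreConn.update, stubbed to fail.
    raise RuntimeError("doc store unreachable")

def set_status(doc_id: str, dataset_id: str, status: str) -> dict:
    try:
        ok = update_doc_store(doc_id, dataset_id, status == "1")
        if not ok:
            logger.warning("doc-store update returned falsy for doc=%s dataset=%s",
                           doc_id, dataset_id)
            return {"error": "doc-store update failed"}
        return {"status": status}
    except Exception:
        # logging.exception records the stack trace of the active exception.
        logger.exception("doc-store update raised for doc=%s dataset=%s",
                         doc_id, dataset_id)
        return {"error": "doc-store update failed"}

print(set_status("d1", "kb-A", "1"))  # {'error': 'doc-store update failed'}
```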
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 943-959: The handler currently defaults doc_ids to [] and
proceeds, causing an empty/malformed batch to return success; validate that
req["doc_ids"] is present and is a non-empty list before the loop (reject
non-list or empty list with get_error_argument_result), e.g. check the variable
doc_ids immediately after it's assigned and before querying
KnowledgebaseService/get_by_id or iterating over for doc_id in doc_ids; return a
clear error when doc_ids is missing, not a list, or empty.
- Around line 1002-1003: The current branch returns a 5xx RetCode.SERVER_ERROR
when has_error is true; instead treat partial failures from client-side issues
as a data/argument error and not a server error. Update the return to use an
appropriate client-side code (e.g., RetCode.DATA_ERROR or
RetCode.INVALID_ARGUMENT) in the get_json_result call, and if you need to
distinguish true server-side failures from client-side partials, add a separate
flag/counter (e.g., server_failures) and only return SERVER_ERROR when that flag
is non-zero; update the code around has_error/result/get_json_result to reflect
this change.
- Around line 949-984: The loop updates documents without ensuring each document
belongs to the requested dataset; before any write or doc-store update, verify
doc.kb_id == dataset_id and reject otherwise. In the for-doc_id loop (where
DocumentService.get_by_id(doc_id) returns doc), add a check that compares
getattr(doc, "kb_id", None) or doc.kb_id to dataset_id and, if mismatched, set
result[doc_id] = {"error":"Document does not belong to dataset"} and continue so
neither DocumentService.update_by_id nor settings.docStoreConn.update runs on
cross-dataset docs.
- Around line 942-1000: Add logging calls on each failure path so production
issues are diagnosable: log a warning/error with context (dataset_id, tenant_id,
status, doc_id as relevant) when the status is invalid (before returning
get_error_argument_result), when KnowledgebaseService.query denies access, when
KnowledgebaseService.get_by_id returns no dataset, when
DocumentService.get_by_id returns no document, when DocumentService.update_by_id
returns False, when settings.docStoreConn.update raises an exception (use
logging.exception to capture stack trace) or returns falsy, and in the outer
except block (use logging.exception) to record unexpected errors; place these
logs adjacent to the existing return/result assignments and include identifiable
values like dataset_id, tenant_id, doc_id and the exception message.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7231b76e-0102-474f-9e35-8b45184b678b

📥 Commits

Reviewing files that changed from the base of the PR and between b6a00e8 and 0db5a62.

📒 Files selected for processing (8)
  • api/apps/document_app.py
  • api/apps/restful_apis/document_api.py
  • test/testcases/test_web_api/test_common.py
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py
  • web/src/pages/user-setting/data-source/data-source-detail-page/index.tsx
  • web/src/pages/user-setting/data-source/hooks.ts
  • web/src/services/knowledge-service.ts
  • web/src/utils/api.ts
💤 Files with no reviewable changes (1)
  • api/apps/document_app.py
✅ Files skipped from review due to trivial changes (2)
  • web/src/pages/user-setting/data-source/hooks.ts
  • web/src/pages/user-setting/data-source/data-source-detail-page/index.tsx
🚧 Files skipped from review as they are similar to previous changes (3)
  • test/testcases/test_web_api/test_common.py
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py
  • web/src/services/knowledge-service.ts

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`:
- Around line 356-371: This test is a duplicate of
TestDocumentMetadataNegative.test_change_status_invalid_status; remove this
redundant test function test_change_status_invalid_status (which uses
add_document_func and the payload {"doc_ids": [doc_id], "status": "2"}) or
replace it with distinct negative cases (e.g., status as empty string, status as
numeric 2, status as None, or omit the status key) and adjust assertions
accordingly so it no longer mirrors the original at lines 208-213.
- Around line 315-399: The three E2E tests
(test_change_status_partial_failure_matrix, test_change_status_invalid_status,
test_change_status_all_success) are placed inside TestDocumentMetadataUnit but
use WebApiAuth and live API fixtures; move them out of the
TestDocumentMetadataUnit class into an appropriate E2E test class (either
existing TestDocumentMetadata / TestDocumentMetadataNegative where similar tests
live, or create a new TestDocumentBatchChangeStatus class) so unit-suite markers
and monkeypatched unit tests remain isolated; update imports/fixtures on the
moved tests as needed and remove them from TestDocumentMetadataUnit to keep
marker-based selection correct.
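The distinct negative cases suggested above can be sketched against a stand-in for the endpoint's status check; the real test fixtures (WebApiAuth, add_document_func) and HTTP layer are not reproduced here.

```python
# Distinct negative payloads replacing the duplicated test: empty string,
# numeric out-of-range, explicit null, and omitted key. status_is_valid
# models the handler's str(req.get("status", -1)) in ("0", "1") check.

def status_is_valid(payload: dict) -> bool:
    return str(payload.get("status", -1)) in ("0", "1")

negative_payloads = [
    {"doc_ids": ["d1"], "status": ""},    # empty string
    {"doc_ids": ["d1"], "status": 2},     # numeric, out of range
    {"doc_ids": ["d1"], "status": None},  # explicit null
    {"doc_ids": ["d1"]},                  # key omitted
]

assert not any(status_is_valid(p) for p in negative_payloads)
assert status_is_valid({"doc_ids": ["d1"], "status": "1"})
```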

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 50dcbef3-6b3a-4a23-b779-8672687c2b10

📥 Commits

Reviewing files that changed from the base of the PR and between 0db5a62 and b7e07ca.

📒 Files selected for processing (1)
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
api/apps/restful_apis/document_api.py (1)

942-947: ⚠️ Potential issue | 🟡 Minor

Validate doc_ids is a non-empty list.

req.get("doc_ids", []) accepts anything the client sends — null, a string, a dict, or an empty list. With "doc_ids": null the subsequent for doc_id in doc_ids raises TypeError and surfaces as an unhandled 500 (there's no outer try). With an empty list or missing key, the handler silently returns {"data": {}} as success even though no work was performed, which is misleading for a batch mutation. A short up-front check aligns this with delete_documents (lines 734-740).

🛡️ Proposed fix
     req = await get_request_json()
     doc_ids = req.get("doc_ids", [])
     status = str(req.get("status", -1))
 
     if status not in ["0", "1"]:
         return get_error_argument_result(message=f'"Status" must be either 0 or 1:{status}!')
+
+    if not isinstance(doc_ids, list) or not doc_ids:
+        return get_error_argument_result(message="'doc_ids' must be a non-empty list.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 942 - 947, The handler
currently assigns doc_ids = req.get("doc_ids", []) without validating type or
emptiness, which allows null or non-list values to cause a TypeError or silent
no-op; update the handler (after calling get_request_json()) to verify that
doc_ids is a list and not empty (e.g., isinstance(doc_ids, list) and
len(doc_ids) > 0) and return a clear argument error (similar to
delete_documents) when validation fails; reference the doc_ids variable and the
get_request_json() call when adding this check so the request is rejected early
with an informative message instead of proceeding to the for doc_id in doc_ids
loop.
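The up-front check can be sketched as a standalone validator; the return dict is a simplified stand-in for get_error_argument_result, and None means the request is valid.

```python
# Reject missing, non-list, or empty doc_ids and an out-of-range status
# before touching any service, mirroring the suggested fix.

def validate_request(req: dict):
    doc_ids = req.get("doc_ids")
    status = str(req.get("status", -1))
    if status not in ("0", "1"):
        return {"code": "ARGUMENT_ERROR",
                "message": f'"status" must be either 0 or 1, got {status!r}'}
    if not isinstance(doc_ids, list) or not doc_ids:
        return {"code": "ARGUMENT_ERROR",
                "message": "'doc_ids' must be a non-empty list."}
    return None  # valid: safe to iterate doc_ids

assert validate_request({"doc_ids": ["d1"], "status": "1"}) is None
print(validate_request({"doc_ids": None, "status": "0"}))
```

With this in place, `"doc_ids": null` is rejected with a 4xx-style argument error instead of raising TypeError, and an empty list no longer returns a misleading success.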
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 959-968: The loop currently returns early when a document's kb_id
!= dataset_id, which aborts the batch and exposes the foreign kb_id; instead, in
the DocumentService.get_by_id handling replace the early return with the same
per-doc error pattern used elsewhere: set result[doc_id] = {"error": "Document
not in dataset"}, set has_error = True, and continue the loop. Use doc_id (not
doc.kb_id) in the error message to avoid leaking tenant identifiers and do not
call get_error_data_result at this point.

---

Duplicate comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 942-947: The handler currently assigns doc_ids =
req.get("doc_ids", []) without validating type or emptiness, which allows null
or non-list values to cause a TypeError or silent no-op; update the handler
(after calling get_request_json()) to verify that doc_ids is a list and not
empty (e.g., isinstance(doc_ids, list) and len(doc_ids) > 0) and return a clear
argument error (similar to delete_documents) when validation fails; reference
the doc_ids variable and the get_request_json() call when adding this check so
the request is rejected early with an informative message instead of proceeding
to the for doc_id in doc_ids loop.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0cec966a-d66c-43a3-b54b-5e13898d66f6

📥 Commits

Reviewing files that changed from the base of the PR and between b7e07ca and 996de89.

📒 Files selected for processing (2)
  • api/apps/restful_apis/document_api.py
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py

@xugangqiang xugangqiang marked this pull request as draft April 23, 2026 14:32
@xugangqiang xugangqiang marked this pull request as ready for review April 23, 2026 14:33
@xugangqiang xugangqiang removed the ci Continue Integration label Apr 23, 2026
@xugangqiang xugangqiang marked this pull request as draft April 23, 2026 14:41
@xugangqiang xugangqiang added the ci Continue Integration label Apr 23, 2026
@xugangqiang xugangqiang marked this pull request as ready for review April 23, 2026 14:42

Labels

ci Continue Integration size:L This PR changes 100-499 lines, ignoring generated files.
