Skip to content

Refactor: migrate document thumbnails API#14344

Merged
yingfeng merged 14 commits into
infiniflow:mainfrom
xugangqiang:doc-thumbnails
Apr 27, 2026
Merged

Refactor: migrate document thumbnails API#14344
yingfeng merged 14 commits into
infiniflow:mainfrom
xugangqiang:doc-thumbnails

Conversation

@xugangqiang
Copy link
Copy Markdown
Contributor

@xugangqiang xugangqiang commented Apr 24, 2026

What problem does this PR solve?

Before migration: GET /v1/document/thumbnails
After migration: GET /api/v1/thumbnails

Type of change

  • Refactoring

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This pull request refactors thumbnail and image serving by migrating HTTP handlers from the main document app to the RESTful API layer, implementing new endpoints with URL rewriting logic, and introducing comprehensive test coverage for the relocated functionality.

Changes

Cohort / File(s) Summary
API Endpoint Migration
api/apps/document_app.py, api/apps/restful_apis/document_api.py
Removed /thumbnails and /image/<image_id> handlers from document app; implemented new /thumbnails and /documents/images/<image_id> endpoints in RESTful API layer with URL rewriting, image validation, and JPEG byte serving via storage backend.
Frontend Configuration
web/src/utils/api.ts
Updated documentThumbnails API route from ${webAPI}/document/thumbnails to ${restAPIv1}/thumbnails.
Test Coverage
test/testcases/test_web_api/test_common.py, test/testcases/test_web_api/test_document_app/test_document_metadata.py
Added document_thumbnails test helper function and new comprehensive test module covering auth failures, document filtering, metadata updates, status changes, and endpoint validation with mocked service/storage dependencies.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

🧪 test, 🐖api, ☯️ refactor

Suggested reviewers

  • JinHai-CN
  • yuzhichang
  • yingfeng

Poem

🐰 Thumbnails hoppity-hoppity go,
From old routes to new paths below,
RESTful and tidy, with tests shining bright,
Image bytes served in the API's light! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.93% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: migrating the document thumbnails API from one endpoint to another, which matches the primary objective of the pull request.
Description check ✅ Passed The description includes the required template sections and explains the endpoint migration, but lacks detail about the refactoring scope and impact.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@xugangqiang xugangqiang added the ci Continue Integration label Apr 24, 2026
@xugangqiang xugangqiang marked this pull request as ready for review April 24, 2026 03:10
@dosubot dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Apr 24, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
api/apps/restful_apis/document_api.py (2)

915-916: Redundant local imports.

IMG_BASE64_PREFIX and DocumentService are already imported at the top of this module (lines 26 and 29). The function-scoped re-imports are dead weight.

♻️ Proposed cleanup
 def list_thumbnails():
     ...
-    from api.constants import IMG_BASE64_PREFIX
-    from api.db.services.document_service import DocumentService
-
     doc_ids = request.args.getlist("doc_ids")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 915 - 916, Remove the
redundant function-scoped imports of IMG_BASE64_PREFIX and DocumentService
inside the function where they were re-imported; these symbols are already
imported at module top, so delete the lines "from api.constants import
IMG_BASE64_PREFIX" and "from api.db.services.document_service import
DocumentService" in that scope and rely on the module-level imports to avoid
duplicate imports and dead weight.

896-931: No logging on the new flow.

The error/success paths here have no logging.* calls, unlike neighboring handlers (upload_document, list_docs, update_metadata_config) which log validation failures and exceptions. Add at minimum an error log for the missing-IDs branch and an logging.exception in the except branch.

As per coding guidelines: **/*.py: Add logging for new flows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 896 - 931, The
list_thumbnails handler lacks logging: import the logging module and add a
logging.error when doc_ids is empty in list_thumbnails with context (e.g.,
include request info or the missing parameter name and any user/kb context), and
replace/augment the generic except return by calling logging.exception inside
the except block (include doc_ids and any relevant variables) before returning
server_error_response; ensure the log messages reference list_thumbnails and
DocumentService.get_thumbnails/IMG_BASE64_PREFIX so it's clear which flow
failed.
test/testcases/test_web_api/test_document_app/test_document_metadata.py (1)

166-184: Thumbnail assertion for a .txt upload is fairly weak.

A plain .txt has no renderable thumbnail, so thumbnail() in the upload path returns "". The endpoint then produces "/v1/document/image/{kb_id}-" (trailing dash + empty value) or an empty string, and the test only asserts key presence. Consider either:

  • uploading an image file so you can assert the value starts with IMG_BASE64_PREFIX, or
  • at minimum asserting the mapped value is a string (catches None/type regressions):
         res = document_thumbnails(WebApiAuth, params={"doc_ids": [doc_id]})
 
         assert res["code"] == 0, res
         assert doc_id in res["data"], res
+        assert isinstance(res["data"][doc_id], str), res

Not blocking — the happy-path key check is still useful coverage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`
around lines 166 - 184, The test test_thumbnails_with_doc_ids only checks that
doc_id is present but doesn't validate the thumbnail value (upload_documents
uses thumbnail() which returns "" for .txt), causing weak coverage; update the
test to either upload an image file instead of a .txt so you can assert the
returned thumbnail string starts with IMG_BASE64_PREFIX (or whatever
base64/image prefix constant you use), or at minimum assert that
res["data"][doc_id] is an instance of str to catch None/type regressions;
reference the existing helpers upload_documents and document_thumbnails and the
test name test_thumbnails_with_doc_ids when making the change.
test/testcases/test_web_api/test_common.py (1)

432-440: Minor consistency nit: prefer composing the URL with VERSION.

The rest of this module uses f"/api/{VERSION}/..." (or named URL constants) rather than a hardcoded /api/v1/.... For consistency and to keep future VERSION bumps painless:

♻️ Proposed tweak
-    res = requests.get(url=f"{HOST_ADDRESS}/api/v1/thumbnails", params=params, headers=headers, auth=auth, data=data)
+    res = requests.get(url=f"{HOST_ADDRESS}/api/{VERSION}/thumbnails", params=params, headers=headers, auth=auth, data=data)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/testcases/test_web_api/test_common.py` around lines 432 - 440, In
document_thumbnails replace the hardcoded "/api/v1/thumbnails" URL with the
module's VERSION usage (e.g., build the URL using VERSION) so it follows the
same pattern as other helpers; update the f-string that currently uses
f"{HOST_ADDRESS}/api/v1/thumbnails" to use
f"{HOST_ADDRESS}/api/{VERSION}/thumbnails" (or the shared named URL constant if
present) so future VERSION bumps remain consistent with the rest of the module.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 896-931: Add authentication, tenant scoping, and logging to the
list_thumbnails handler: decorate list_thumbnails with `@login_required` and
`@add_tenant_id_to_kwargs`, accept tenant_id from kwargs, log the missing doc_ids
case before returning the ARGUMENT_ERROR, call
DocumentService.get_thumbnails(doc_ids) as before but then filter the returned
docs by keeping only those where
KnowledgebaseService.accessible(kb_id=d["kb_id"], user_id=tenant_id) is True
(import KnowledgebaseService), and add logging in the exception handler before
returning server_error_response(e) so failures are recorded; keep the existing
thumbnail prefix logic and produced response shape.

---

Nitpick comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 915-916: Remove the redundant function-scoped imports of
IMG_BASE64_PREFIX and DocumentService inside the function where they were
re-imported; these symbols are already imported at module top, so delete the
lines "from api.constants import IMG_BASE64_PREFIX" and "from
api.db.services.document_service import DocumentService" in that scope and rely
on the module-level imports to avoid duplicate imports and dead weight.
- Around line 896-931: The list_thumbnails handler lacks logging: import the
logging module and add a logging.error when doc_ids is empty in list_thumbnails
with context (e.g., include request info or the missing parameter name and any
user/kb context), and replace/augment the generic except return by calling
logging.exception inside the except block (include doc_ids and any relevant
variables) before returning server_error_response; ensure the log messages
reference list_thumbnails and DocumentService.get_thumbnails/IMG_BASE64_PREFIX
so it's clear which flow failed.

In `@test/testcases/test_web_api/test_common.py`:
- Around line 432-440: In document_thumbnails replace the hardcoded
"/api/v1/thumbnails" URL with the module's VERSION usage (e.g., build the URL
using VERSION) so it follows the same pattern as other helpers; update the
f-string that currently uses f"{HOST_ADDRESS}/api/v1/thumbnails" to use
f"{HOST_ADDRESS}/api/{VERSION}/thumbnails" (or the shared named URL constant if
present) so future VERSION bumps remain consistent with the rest of the module.

In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`:
- Around line 166-184: The test test_thumbnails_with_doc_ids only checks that
doc_id is present but doesn't validate the thumbnail value (upload_documents
uses thumbnail() which returns "" for .txt), causing weak coverage; update the
test to either upload an image file instead of a .txt so you can assert the
returned thumbnail string starts with IMG_BASE64_PREFIX (or whatever
base64/image prefix constant you use), or at minimum assert that
res["data"][doc_id] is an instance of str to catch None/type regressions;
reference the existing helpers upload_documents and document_thumbnails and the
test name test_thumbnails_with_doc_ids when making the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0df7d55e-0ea7-4207-91e1-ded64c9c5e90

📥 Commits

Reviewing files that changed from the base of the PR and between 199fbce and 0418bce.

📒 Files selected for processing (5)
  • api/apps/document_app.py
  • api/apps/restful_apis/document_api.py
  • test/testcases/test_web_api/test_common.py
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py
  • web/src/utils/api.ts

Comment on lines +896 to +931
@manager.route("/thumbnails", methods=["GET"]) # noqa: F821
def list_thumbnails():
"""
Get thumbnails for documents.
---
tags:
- Documents
parameters:
- in: query
name: doc_ids
type: array
required: true
description: List of document IDs to get thumbnails for.
responses:
200:
description: Successfully retrieved thumbnails
400:
description: Missing document IDs
"""
from api.constants import IMG_BASE64_PREFIX
from api.db.services.document_service import DocumentService

doc_ids = request.args.getlist("doc_ids")
if not doc_ids:
return get_json_result(data=False, message='Lack of "Document ID"', code=RetCode.ARGUMENT_ERROR)

try:
docs = DocumentService.get_thumbnails(doc_ids)

for doc_item in docs:
if doc_item["thumbnail"] and not doc_item["thumbnail"].startswith(IMG_BASE64_PREFIX):
doc_item["thumbnail"] = f"/v1/document/image/{doc_item['kb_id']}-{doc_item['thumbnail']}"

return get_json_result(data={d["id"]: d["thumbnail"] for d in docs})
except Exception as e:
return server_error_response(e)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect the handler as it existed just before this PR's changes on the target branch.
git fetch origin main --depth=1 >/dev/null 2>&1 || true
git show origin/main:api/apps/document_app.py 2>/dev/null \
  | awk '/@manager\.route\("\/thumbnails"/{flag=1} flag{print; if(/^def |^async def /){c++}; if(c>=1 && /^$/){exit}}'

Repository: infiniflow/ragflow

Length of output: 331


Add authentication and authorization to the /thumbnails endpoint.

The list_thumbnails handler is missing @login_required and tenant/KB scoping, unlike all other handlers in this module:

  1. Unauthenticated public access. Any caller can retrieve thumbnail data and leak document kb_id values. While the old handler was also undecorated, this endpoint must enforce authentication.
  2. Cross-tenant enumeration. No permission check scopes DocumentService.get_thumbnails(doc_ids) to the caller's tenant, allowing an authenticated user to enumerate documents across other tenants via direct-object-reference.
  3. Missing logging. No logging for invalid arguments or exceptions, violating the coding guideline to add logging for new flows.

Add @login_required, @add_tenant_id_to_kwargs, and filter results by calling KnowledgebaseService.accessible(kb_id=d["kb_id"], user_id=tenant_id) before returning. Also add logging for the missing doc_ids case and exception handler. Sync the implementation with pattern used in list_docs and other authenticated endpoints.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 896 - 931, Add
authentication, tenant scoping, and logging to the list_thumbnails handler:
decorate list_thumbnails with `@login_required` and `@add_tenant_id_to_kwargs`,
accept tenant_id from kwargs, log the missing doc_ids case before returning the
ARGUMENT_ERROR, call DocumentService.get_thumbnails(doc_ids) as before but then
filter the returned docs by keeping only those where
KnowledgebaseService.accessible(kb_id=d["kb_id"], user_id=tenant_id) is True
(import KnowledgebaseService), and add logging in the exception handler before
returning server_error_response(e) so failures are recorded; keep the existing
thumbnail prefix logic and produced response shape.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.30%. Comparing base (965717c) to head (566ca80).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #14344   +/-   ##
=======================================
  Coverage   95.30%   95.30%           
=======================================
  Files          10       10           
  Lines         703      703           
  Branches      112      112           
=======================================
  Hits          670      670           
  Misses         16       16           
  Partials       17       17           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Apr 24, 2026
@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label Apr 24, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
api/apps/restful_apis/document_api.py (1)

982-1017: ⚠️ Potential issue | 🟠 Major

Auth, tenant scoping, and logging still missing on list_thumbnails.

This handler still has no @login_required / @add_tenant_id_to_kwargs, no permission check against the caller's tenant before returning thumbnails, and no logging on the argument-error or exception paths (the latter violates the repo's "Add logging for new flows" guideline). As-is, an unauthenticated caller can fetch thumbnails for arbitrary doc_ids and enumerate kb_id values across tenants via IDOR. Please align this with the patterns used in list_docs / other endpoints in this module: decorate with @login_required + @add_tenant_id_to_kwargs, and filter results by KnowledgebaseService.accessible(kb_id=d["kb_id"], user_id=tenant_id) before returning.

Also note that IMG_BASE64_PREFIX (line 1001) and DocumentService (line 1002) are already imported at the top of the file (lines 26 and 30), so the local re-imports can be removed.

As per coding guidelines: "Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 982 - 1017, Add auth,
tenant scoping, permission checks, logging, and remove duplicate imports in
list_thumbnails: decorate the handler with `@login_required` and
`@add_tenant_id_to_kwargs`, remove the local imports of IMG_BASE64_PREFIX and
DocumentService, and call DocumentService.get_thumbnails(doc_ids) using the
injected tenant_id; after fetching docs filter them with
KnowledgebaseService.accessible(kb_id=d["kb_id"], user_id=tenant_id) (only
include accessible docs), keep the existing base64-prefix logic using the
top-of-file IMG_BASE64_PREFIX, and on argument errors and in the except block
emit logs (use the module logger) before returning get_json_result(...) or
server_error_response(e). Ensure you reference list_thumbnails,
KnowledgebaseService.accessible, DocumentService.get_thumbnails,
`@login_required`, and `@add_tenant_id_to_kwargs` in the changes.
🧹 Nitpick comments (1)
api/apps/restful_apis/document_api.py (1)

1386-1390: Drop the redundant in-function imports.

settings, thread_pool_exec, get_data_error_result, and server_error_response are all already imported at the top of this module (lines 35–36, 40, 43), so re-importing them inside the handler just adds noise and makes the call path slightly slower on each request. Only make_response actually needs a local import here (top-level from quart import request doesn't pull it in) — and even that could be promoted to the module-level quart import for consistency with the rest of the file.

♻️ Proposed cleanup
-    try:
-        from quart import make_response
-
-        from common import settings
-        from common.misc_utils import thread_pool_exec
-        from api.utils.api_utils import get_data_error_result, server_error_response
-
-        arr = image_id.split("-")
+    try:
+        from quart import make_response
+
+        arr = image_id.split("-")

Or move make_response up next to from quart import request at line 19.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 1386 - 1390, Remove the
redundant in-function imports of settings, thread_pool_exec,
get_data_error_result, and server_error_response inside the handler; these are
already imported at module level, so delete those local import statements and
only keep make_response if needed, or better yet move make_response up to the
module-level import alongside the existing "from quart import request" to keep
imports consistent; update the handler to use the already-imported symbols
(settings, thread_pool_exec, get_data_error_result, server_error_response) and
ensure only make_response is imported once at module scope.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 1362-1401: The get_document_image handler is unauthenticated and
unlogged; protect it by adding the authentication/authorization check (e.g.,
apply `@login_required` and verify the caller has access to the target tenant/KB
before calling settings.STORAGE_IMPL.get) in get_document_image, and log
failures: when image_id parsing fails or get_data_error_result is returned, log
the bad image_id and reason; in the exception handler log the image_id and the
caught exception (stack/trace) before returning server_error_response(e). Ensure
imports for the auth decorator/current_user/authorization helper and the module
logger (or use process_logger) are added and reference get_document_image,
settings.STORAGE_IMPL.get, get_data_error_result, and server_error_response so
the checks and logs are co-located with the storage fetch.

---

Duplicate comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 982-1017: Add auth, tenant scoping, permission checks, logging,
and remove duplicate imports in list_thumbnails: decorate the handler with
`@login_required` and `@add_tenant_id_to_kwargs`, remove the local imports of
IMG_BASE64_PREFIX and DocumentService, and call
DocumentService.get_thumbnails(doc_ids) using the injected tenant_id; after
fetching docs filter them with KnowledgebaseService.accessible(kb_id=d["kb_id"],
user_id=tenant_id) (only include accessible docs), keep the existing
base64-prefix logic using the top-of-file IMG_BASE64_PREFIX, and on argument
errors and in the except block emit logs (use the module logger) before
returning get_json_result(...) or server_error_response(e). Ensure you reference
list_thumbnails, KnowledgebaseService.accessible,
DocumentService.get_thumbnails, `@login_required`, and `@add_tenant_id_to_kwargs` in
the changes.

---

Nitpick comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 1386-1390: Remove the redundant in-function imports of settings,
thread_pool_exec, get_data_error_result, and server_error_response inside the
handler; these are already imported at module level, so delete those local
import statements and only keep make_response if needed, or better yet move
make_response up to the module-level import alongside the existing "from quart
import request" to keep imports consistent; update the handler to use the
already-imported symbols (settings, thread_pool_exec, get_data_error_result,
server_error_response) and ensure only make_response is imported once at module
scope.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 97420aee-1998-4a51-ba4e-2442001d622d

📥 Commits

Reviewing files that changed from the base of the PR and between 431a399 and 42608a6.

📒 Files selected for processing (6)
  • api/apps/document_app.py
  • api/apps/restful_apis/document_api.py
  • test/testcases/test_web_api/test_common.py
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py
  • web/src/locales/zh.ts
  • web/src/utils/api.ts
✅ Files skipped from review due to trivial changes (2)
  • web/src/utils/api.ts
  • web/src/locales/zh.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/testcases/test_web_api/test_common.py
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py

Comment thread api/apps/restful_apis/document_api.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
api/apps/restful_apis/document_api.py (2)

1362-1395: ⚠️ Potential issue | 🟠 Major

get_document_image still has no authentication and no logging.

Already raised previously and not yet addressed. Without @login_required and a KB-access check on bkt (which is a kb_id), any caller who can guess or read a thumbnail URL can pull bytes directly out of settings.STORAGE_IMPL across tenants (IDOR). Both the get_data_error_result(...) branch and the except Exception arm should also log the offending image_id and the failure cause; right now failures are silent.

As per coding guidelines: "Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 1362 - 1395, The endpoint
get_document_image is missing authentication and tenant/Kb access checks and
lacks logging; add the `@login_required` decorator to the get_document_image
route, extract the bucket part (kb_id) from image_id and call the project's KB
access check helper (e.g., check_kb_access(kb_id) or
ensure_user_has_access_to_kb(kb_id)) before calling settings.STORAGE_IMPL.get
via thread_pool_exec, and if the image_id format is invalid or access is denied
return the appropriate get_data_error_result/403 while logging the offending
image_id; also wrap exceptions to log image_id and the caught exception before
returning server_error_response(e) so all failure paths log both image_id and
error context.

982-1017: ⚠️ Potential issue | 🔴 Critical

list_thumbnails still lacks authentication, tenant scoping, and logging.

Already raised in a previous review and not addressed: handler is undecorated (no @login_required / @add_tenant_id_to_kwargs), DocumentService.get_thumbnails(doc_ids) is not scoped to the caller's tenant/KB (cross-tenant enumeration), and there is no logging on the missing-doc_ids path or the except branch — violating the repo guideline to add logging for new flows.

As per coding guidelines: "Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 982 - 1017, The
list_thumbnails handler lacks authentication/tenant scoping and logging: add the
appropriate decorators (e.g., `@login_required` and `@add_tenant_id_to_kwargs`) to
list_thumbnails, extract the tenant/kb id from the handler kwargs and pass it
into DocumentService.get_thumbnails (e.g., call
DocumentService.get_thumbnails(doc_ids, tenant_id=tenant_id) or the KB-specific
parameter expected by that method) to prevent cross-tenant enumeration, and add
logging statements for the missing doc_ids branch and inside the except branch
(include context like doc_ids and tenant_id); preserve the existing thumbnail
prefix logic that uses IMG_BASE64_PREFIX and return behavior.
🧹 Nitpick comments (1)
api/apps/restful_apis/document_api.py (1)

1386-1389: Reuse arr instead of splitting twice.

image_id.split("-") is called on line 1386 and again on line 1389; the second call is redundant and risks drifting from the validation above. Also, since image_id is f"{kb_id}-{thumbnail}", this only works because kb_id is currently a hex UUID with no -. Worth a brief comment or rsplit("-", 1) to make the contract explicit and tolerate future ID format changes.

♻️ Proposed tweak
-        arr = image_id.split("-")
-        if len(arr) != 2:
+        arr = image_id.split("-", 1)
+        if len(arr) != 2:
             return get_data_error_result(message="Image not found.")
-        bkt, nm = image_id.split("-")
+        bkt, nm = arr
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 1386 - 1389, Replace the
redundant second split by reusing the earlier result: after computing arr =
image_id.split("-") and validating len(arr) == 2, assign bkt, nm = arr (or use
arr[0], arr[1]) instead of calling image_id.split("-") again; alternatively use
image_id.rsplit("-", 1) once and destructure into bkt, nm to make the contract
explicit for future ID formats, and keep the existing
get_data_error_result(message="Image not found.") validation intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 1011-1015: The thumbnail URLs returned by list_thumbnails are
built with the wrong path ("/api/v1/documents/image/...") and thus 404 because
the handler get_document_image is mounted at "/documents/images/<image_id>";
update the URL rewrite inside the loop in list_thumbnails (the block iterating
docs where you set doc_item["thumbnail"]) to use the plural "images" path (e.g.,
"/api/v1/documents/images/{kb_id}-{thumbnail}") so it matches get_document_image
(also ensure consistency with list_docs which already uses "images").

---

Duplicate comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 1362-1395: The endpoint get_document_image is missing
authentication and tenant/Kb access checks and lacks logging; add the
`@login_required` decorator to the get_document_image route, extract the bucket
part (kb_id) from image_id and call the project's KB access check helper (e.g.,
check_kb_access(kb_id) or ensure_user_has_access_to_kb(kb_id)) before calling
settings.STORAGE_IMPL.get via thread_pool_exec, and if the image_id format is
invalid or access is denied return the appropriate get_data_error_result/403
while logging the offending image_id; also wrap exceptions to log image_id and
the caught exception before returning server_error_response(e) so all failure
paths log both image_id and error context.
- Around line 982-1017: The list_thumbnails handler lacks authentication/tenant
scoping and logging: add the appropriate decorators (e.g., `@login_required` and
`@add_tenant_id_to_kwargs`) to list_thumbnails, extract the tenant/kb id from the
handler kwargs and pass it into DocumentService.get_thumbnails (e.g., call
DocumentService.get_thumbnails(doc_ids, tenant_id=tenant_id) or the KB-specific
parameter expected by that method) to prevent cross-tenant enumeration, and add
logging statements for the missing doc_ids branch and inside the except branch
(include context like doc_ids and tenant_id); preserve the existing thumbnail
prefix logic that uses IMG_BASE64_PREFIX and return behavior.

---

Nitpick comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 1386-1389: Replace the redundant second split by reusing the
earlier result: after computing arr = image_id.split("-") and validating
len(arr) == 2, assign bkt, nm = arr (or use arr[0], arr[1]) instead of calling
image_id.split("-") again; alternatively use image_id.rsplit("-", 1) once and
destructure into bkt, nm to make the contract explicit for future ID formats,
and keep the existing get_data_error_result(message="Image not found.")
validation intact.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 94db55f8-6859-45dc-8994-f512056dc652

📥 Commits

Reviewing files that changed from the base of the PR and between 42608a6 and 6f3bbba.

📒 Files selected for processing (2)
  • api/apps/document_app.py
  • api/apps/restful_apis/document_api.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • api/apps/document_app.py

Comment thread api/apps/restful_apis/document_api.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (3)
api/apps/restful_apis/document_api.py (3)

1133-1135: ⚠️ Potential issue | 🔴 Critical

URL still uses singular image while the handler is mounted at plural images — thumbnails will 404.

Line 1135 rewrites thumbnails to /api/v1/documents/image/..., but get_document_image at line 1484 is registered as /documents/images/<image_id>. list_docs on line 655 already correctly uses the plural form. Every URL returned by this endpoint will 404, defeating the whole purpose of the migration.

🔧 Proposed fix
         for doc_item in docs:
             if doc_item["thumbnail"] and not doc_item["thumbnail"].startswith(IMG_BASE64_PREFIX):
-                doc_item["thumbnail"] = f"/api/v1/documents/image/{doc_item['kb_id']}-{doc_item['thumbnail']}"
+                doc_item["thumbnail"] = f"/api/v1/documents/images/{doc_item['kb_id']}-{doc_item['thumbnail']}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 1133 - 1135, The URL
rewrite for thumbnails uses the wrong singular path and will 404; update the
loop that rewrites doc_item["thumbnail"] to use the plural route segment so it
matches the registered handler (change "/api/v1/documents/image/..." to
"/api/v1/documents/images/..."); ensure you only rewrite when thumbnail is not
base64 (retain the IMG_BASE64_PREFIX check) and keep the same id construction
using doc_item["kb_id"] and doc_item["thumbnail"] so the result matches what
get_document_image expects (and aligns with list_docs behavior).

1484-1517: ⚠️ Potential issue | 🟠 Major

get_document_image is still unauthenticated, unlogged, and double-splits image_id.

This was previously flagged and marked as resolved in commit 6f3bbba, but the handler in the current state still has no @login_required, no tenant/KB authorization, and no logging on the bad-input or exception paths. Because image_id is {kb_id}-{filename}, any caller who can read a thumbnail URL (now emitted unauthenticated by list_thumbnails) can pull bytes directly out of settings.STORAGE_IMPL for arbitrary buckets/objects — same IDOR risk as before.

Additionally, line 1511 calls image_id.split("-") a second time after arr was already produced on line 1508; just unpack arr.

🔒 Suggested change
-@manager.route("/documents/images/<image_id>", methods=["GET"])  # noqa: F821
-async def get_document_image(image_id):
+@manager.route("/documents/images/<image_id>", methods=["GET"])  # noqa: F821
+@login_required
+@add_tenant_id_to_kwargs
+async def get_document_image(tenant_id, image_id):
     ...
     try:
         arr = image_id.split("-")
         if len(arr) != 2:
+            logging.error(f"Invalid image_id format: {image_id}")
             return get_data_error_result(message="Image not found.")
-        bkt, nm = image_id.split("-")
+        bkt, nm = arr
+        if not KnowledgebaseService.accessible(kb_id=bkt, user_id=tenant_id):
+            return get_error_data_result(message="No authorization.", code=RetCode.AUTHENTICATION_ERROR)
         data = await thread_pool_exec(settings.STORAGE_IMPL.get, bkt, nm)
         response = await make_response(data)
         response.headers.set("Content-Type", "image/JPEG")
         return response
     except Exception as e:
+        logging.exception(f"get_document_image failed for {image_id}")
         return server_error_response(e)

As per coding guidelines: "Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 1484 - 1517, The
get_document_image handler is unauthenticated, lacks tenant/KB authorization,
logs no failures, and double-splits image_id; fix it by adding the same
`@login_required` decorator and the tenant/KB authorization check used by
list_thumbnails (reject if caller not authorized for the kb), replace the second
split by unpacking arr into bkt,nm (arr = image_id.split("-"); if len(arr)!=2 ->
log warning and return get_data_error_result), add logging for the bad-input
path and for exceptions (use the module logger to log image_id and exception
details before returning server_error_response), and keep using
thread_pool_exec(settings.STORAGE_IMPL.get, bkt, nm) and make_response as
before.

1104-1139: ⚠️ Potential issue | 🔴 Critical

list_thumbnails is still unauthenticated and unlogged — gate access and add logging.

The handler still has no @login_required, no @add_tenant_id_to_kwargs, no tenant scoping on DocumentService.get_thumbnails(doc_ids), and no logging on the bad-input or exception paths. As-is, any caller can enumerate thumbnails (and leak kb_id values) for arbitrary doc_ids across tenants by direct-object-reference, which is a regression on the rest of this module's auth posture and is the same defect previously raised on this segment.

Per repo guidelines: "Add logging for new flows".

🔒 Suggested change
-@manager.route("/thumbnails", methods=["GET"])  # noqa: F821
-def list_thumbnails():
+@manager.route("/thumbnails", methods=["GET"])  # noqa: F821
+@login_required
+@add_tenant_id_to_kwargs
+def list_thumbnails(tenant_id):
     ...
-    from api.constants import IMG_BASE64_PREFIX
-    from api.db.services.document_service import DocumentService
-
     doc_ids = request.args.getlist("doc_ids")
     if not doc_ids:
+        logging.warning("list_thumbnails called without doc_ids")
         return get_json_result(data=False, message='Lack of "Document ID"', code=RetCode.ARGUMENT_ERROR)

     try:
         docs = DocumentService.get_thumbnails(doc_ids)
-
-        for doc_item in docs:
+        docs = [d for d in docs if KnowledgebaseService.accessible(kb_id=d["kb_id"], user_id=tenant_id)]
+        for doc_item in docs:
             if doc_item["thumbnail"] and not doc_item["thumbnail"].startswith(IMG_BASE64_PREFIX):
                 doc_item["thumbnail"] = f"/api/v1/documents/images/{doc_item['kb_id']}-{doc_item['thumbnail']}"

         return get_json_result(data={d["id"]: d["thumbnail"] for d in docs})
     except Exception as e:
+        logging.exception("list_thumbnails failed")
         return server_error_response(e)

Note: the local re-imports of IMG_BASE64_PREFIX and DocumentService on lines 1123–1124 are redundant — they're already imported at the top of the file.

As per coding guidelines: "Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/restful_apis/document_api.py` around lines 1104 - 1139, The
list_thumbnails handler is missing authentication, tenant scoping, and logging;
add the `@login_required` and `@add_tenant_id_to_kwargs` decorators to
list_thumbnails, ensure you call DocumentService.get_thumbnails with the tenant
context (e.g. pass tenant_id from kwargs or call a tenant-scoped method) instead
of the current unauthenticated call, add logging calls (use the module logger)
for the missing doc_ids branch and inside the exception handler before returning
server_error_response, and remove the redundant local imports of
IMG_BASE64_PREFIX and DocumentService since they exist at the top of the file;
keep existing return shapes (get_json_result/RetCode) and reference the same
symbols (list_thumbnails, DocumentService.get_thumbnails, IMG_BASE64_PREFIX,
server_error_response) when implementing these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 1133-1135: The URL rewrite for thumbnails uses the wrong singular
path and will 404; update the loop that rewrites doc_item["thumbnail"] to use
the plural route segment so it matches the registered handler (change
"/api/v1/documents/image/..." to "/api/v1/documents/images/..."); ensure you
only rewrite when thumbnail is not base64 (retain the IMG_BASE64_PREFIX check)
and keep the same id construction using doc_item["kb_id"] and
doc_item["thumbnail"] so the result matches what get_document_image expects (and
aligns with list_docs behavior).
- Around line 1484-1517: The get_document_image handler is unauthenticated,
lacks tenant/KB authorization, logs no failures, and double-splits image_id; fix
it by adding the same `@login_required` decorator and the tenant/KB authorization
check used by list_thumbnails (reject if caller not authorized for the kb),
replace the second split by unpacking arr into bkt,nm (arr =
image_id.split("-"); if len(arr)!=2 -> log warning and return
get_data_error_result), add logging for the bad-input path and for exceptions
(use the module logger to log image_id and exception details before returning
server_error_response), and keep using
thread_pool_exec(settings.STORAGE_IMPL.get, bkt, nm) and make_response as
before.
- Around line 1104-1139: The list_thumbnails handler is missing authentication,
tenant scoping, and logging; add the `@login_required` and
`@add_tenant_id_to_kwargs` decorators to list_thumbnails, ensure you call
DocumentService.get_thumbnails with the tenant context (e.g. pass tenant_id from
kwargs or call a tenant-scoped method) instead of the current unauthenticated
call, add logging calls (use the module logger) for the missing doc_ids branch
and inside the exception handler before returning server_error_response, and
remove the redundant local imports of IMG_BASE64_PREFIX and DocumentService
since they exist at the top of the file; keep existing return shapes
(get_json_result/RetCode) and reference the same symbols (list_thumbnails,
DocumentService.get_thumbnails, IMG_BASE64_PREFIX, server_error_response) when
implementing these changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f4527cc9-97de-4c78-b758-ab6d06b058c9

📥 Commits

Reviewing files that changed from the base of the PR and between 6f3bbba and 5760685.

📒 Files selected for processing (4)
  • api/apps/document_app.py
  • api/apps/restful_apis/document_api.py
  • test/testcases/test_web_api/test_common.py
  • web/src/utils/api.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/testcases/test_web_api/test_common.py
  • api/apps/document_app.py

@yingfeng yingfeng merged commit c5116b9 into infiniflow:main Apr 27, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci Continue Integration lgtm This PR has been approved by a maintainer size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants