fix: secure document thumbnail image access #14766

Open
Parvezkhan0 wants to merge 2 commits into infiniflow:main from Parvezkhan0:fix/document-thumbnail-auth-14763

Conversation

@Parvezkhan0

@Parvezkhan0 Parvezkhan0 commented May 11, 2026

What problem does this PR solve?

Fixes #14763
This PR fixes an authorization bypass in document image serving.
Previously, GET /api/v1/documents/images/<image_id> had no authentication and accepted a user-controlled storage key, which allowed unauthenticated and cross-tenant access to document thumbnails and even arbitrary objects stored in the same bucket. This change moves document thumbnails to an authenticated document-scoped endpoint, validates access with DocumentService.accessible(...), updates thumbnail URL generation, and restricts the legacy image route to authorized chunk-image access only.
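
The difference between the two access patterns described above can be modelled with a minimal, self-contained sketch. Everything in it is hypothetical (the in-memory `STORAGE`, `DOCS`, and `ACCESS` tables, the bucket names, and both function names); it is not the PR's code, only an illustration of why deriving the storage key from an authorized document is safer than accepting the key from the caller.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical in-memory stand-ins for object storage, the document table,
# and the access-control data. None of these names come from the PR.
STORAGE = {
    ("kba", "thumb-a.png"): b"tenant-a-thumbnail",
    ("kbb", "secret.pdf"): b"tenant-b-secret",
}


@dataclass
class Doc:
    id: str
    kb_id: str
    thumbnail: str


DOCS = {"doc-a": Doc(id="doc-a", kb_id="kba", thumbnail="thumb-a.png")}
ACCESS = {("doc-a", "user-a")}  # (doc_id, user_id) pairs that are allowed


def legacy_get_image(image_id: str) -> bytes | None:
    """Vulnerable pattern: the caller chooses the bucket and object name."""
    bucket, _, name = image_id.partition("-")
    return STORAGE.get((bucket, name))


def get_thumbnail(doc_id: str, user_id: str | None) -> bytes | None:
    """Document-scoped pattern: authenticate, authorize, then derive the key."""
    if user_id is None:  # not logged in
        return None
    if (doc_id, user_id) not in ACCESS:  # no access to this document
        return None
    doc = DOCS.get(doc_id)
    if doc is None or not doc.thumbnail:
        return None
    # The storage key comes from the document record, never from the request.
    return STORAGE.get((doc.kb_id, doc.thumbnail))
```

In this model, `legacy_get_image("kbb-secret.pdf")` returns another tenant's object to any caller, while `get_thumbnail` only ever reads the thumbnail recorded on a document the caller can access.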

Type of change

  • Bug Fix (non-breaking change which fixes an issue)

Copilot AI review requested due to automatic review settings May 11, 2026 05:30
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. 🐞 bug Something isn't working, pull request that fix bug. 🧪 test Pull requests that update test cases. labels May 11, 2026
@coderabbitai
Contributor

coderabbitai Bot commented May 11, 2026

Walkthrough

Adds authenticated GET /documents/<doc_id>/thumbnail, helpers to map image IDs to accessible documents and to build safe thumbnail URLs/headers, rewrites listing/thumbnails URLs to the per-document endpoint, tightens GET /documents/images/<image_id> with access checks, and updates test fixtures and tests.

Changes

Authenticated Thumbnail Endpoints with Access Control

| Layer | File(s) | Summary |
|---|---|---|
| Required Imports | api/apps/restful_apis/document_api.py | Adds OrderByExpr import used for image-to-document resolution. |
| Document Listing Thumbnail URL Rewrite | api/apps/restful_apis/document_api.py | Rewrites dataset document listing non-base64 thumbnail URLs to /api/v1/documents/<doc_id>/thumbnail. |
| Thumbnails Auth & Filtering | api/apps/restful_apis/document_api.py | Adds @login_required to /thumbnails, filters requested doc_ids to those accessible to current_user, and rewrites returned thumbnail URLs to the per-document endpoint. |
| URL & Access Resolution Helpers | api/apps/restful_apis/document_api.py | Adds helpers: thumbnail URL generator, safe response header/content-type applier, and _get_accessible_chunk_image_doc_id(), which queries docStore with OrderByExpr() and verifies DocumentService.accessible(...). |
| New Document Thumbnail Endpoint | api/apps/restful_apis/document_api.py | Adds GET /documents/<doc_id>/thumbnail: validates doc existence and KB access, rejects missing/non-binary thumbnails, fetches bytes from settings.STORAGE_IMPL, and sets Content-Type/headers from filename extension. |
| Image Endpoint Access Control | api/apps/restful_apis/document_api.py | Updates GET /documents/images/<image_id> to validate image_id format, resolve the owning accessible doc via helper, parse bucket/object, fetch bytes, and set Content-Type based on object name extension. |
| Test Fixtures and Dependency Stubs | test/testcases/test_web_api/test_document_app/conftest.py | Centralizes sys.modules stubs and loader helpers, deterministic rag.nlp.search.index_name, parser stubs, DB/service stubs, and utility helpers used when loading document_api.py in tests. |
| Test API Helper | test/testcases/test_web_api/test_common.py | Adds document_thumbnail() helper to call the new thumbnail endpoint and return the raw response. |
| Test Cases and Authorization | test/testcases/test_web_api/test_document_app/test_document_metadata.py | Adds test_thumbnail_auth_invalid and test_get_document_thumbnail_success_and_exception_unit; updates image tests to exercise accessible-image branches and content-type expectations. |

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant API as DocumentAPI
  participant Store as DocStore
  participant DocSvc as DocumentService
  participant Storage
  Client->>API: GET /documents/images/<image_id> or /documents/<doc_id>/thumbnail
  API->>Store: query image owner via OrderByExpr (per-tenant index)
  Store-->>API: doc_id (or not found)
  API->>DocSvc: DocumentService.accessible(doc_id, current_user.id)
  DocSvc-->>API: accessible? (True/False)
  API->>Storage: STORAGE_IMPL.get(bucket, object_name)  (if accessible)
  Storage-->>API: bytes
  API-->>Client: HTTP response with safe headers and Content-Type from filename extension

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • infiniflow/ragflow#14344: Also modifies document thumbnail and image endpoints in api/apps/restful_apis/document_api.py.

Suggested labels

lgtm

Suggested reviewers

  • wangq8

Poem

🐰 A rabbit hops through thumbnails new,
Per-document endpoints guard the view.
Resolving image owners, access in sight,
Storage returns bytes only when right.
Hooray — small paws keep data tight!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 19.44% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'fix: secure document thumbnail image access' accurately summarizes the main change—adding authentication and authorization to document thumbnail endpoints to prevent cross-tenant data exfiltration. |
| Description check | ✅ Passed | The PR description addresses both required template sections: it explains the authorization bypass being fixed (issue #14763) and correctly marks the change type as a bug fix. |
| Linked Issues check | ✅ Passed | The code changes comprehensively address all objectives from issue #14763: authentication is enforced via @login_required, authorization uses DocumentService.accessible(...), thumbnails are served only by document lookup, URL builders are updated, and tests cover anonymous/cross-tenant denial. |
| Out of Scope Changes check | ✅ Passed | All changes are narrowly scoped to fixing the authorization bypass: new authenticated thumbnail endpoint, updated URL generation, hardened legacy image route, test infrastructure refactoring, and related test coverage—no unrelated modifications. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
test/testcases/test_web_api/test_document_app/test_document_metadata.py (1)

534-543: ⚡ Quick win

Please add a direct unit test for _get_accessible_chunk_image_doc_id().

This test replaces the new authorization helper with a lambda, so it never exercises the tenant/index lookup that actually closes the vulnerability. Given that this PR is a security fix, I’d keep route coverage here but add a focused unit test for the helper itself.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`
around lines 534 - 543, Add a focused unit test that directly exercises
_get_accessible_chunk_image_doc_id rather than stubbing it in the route test:
call module._get_accessible_chunk_image_doc_id with image IDs representing (a) a
chunk image that is accessible (mock the authorization/index/tenant lookup
functions it uses to return a valid doc id) and assert it returns the expected
"doc-..." id, and (b) a chunk image that is not accessible (mock the lookups to
deny access) and assert it returns None; use monkeypatch to replace the specific
helper functions or data sources the helper calls (not the helper itself) so the
tenant/index resolution logic runs, and keep assertions on returned values only.
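
A rough sketch of what such a focused test could look like follows. It does not import the real module: `find_accessible_doc_id` is a simplified, dependency-injected stand-in for `_get_accessible_chunk_image_doc_id`, and `make_store` with its `lookup` method is a hypothetical fake that collapses the doc store's search and field-extraction steps into one call. The `ragflow_<tenant_id>` index naming mirrors the test stub shown elsewhere in this PR. In the actual test suite the same idea would be applied with monkeypatch on the module's collaborators rather than with injected arguments.

```python
from types import SimpleNamespace


def find_accessible_doc_id(image_id, kb_ids_by_tenant, doc_store, accessible, index_name):
    """Simplified stand-in for the helper: search each accessible KB's index
    for the image, then confirm access to the owning document."""
    for tenant_id, kb_ids in kb_ids_by_tenant.items():
        index = index_name(tenant_id)
        for kb_id in kb_ids:
            if not doc_store.index_exist(index, kb_id):
                continue
            fields = doc_store.lookup(index, kb_id, image_id)
            if not fields:
                continue
            doc_id = next(iter(fields.values())).get("doc_id")
            if doc_id and accessible(doc_id):
                return doc_id
    return None


def make_store(chunks):
    """Hypothetical fake doc store; chunks maps (index, kb_id, img_id) -> doc_id."""
    return SimpleNamespace(
        index_exist=lambda index, kb_id: any(k[:2] == (index, kb_id) for k in chunks),
        lookup=lambda index, kb_id, img_id: (
            {"chunk-1": {"doc_id": chunks[(index, kb_id, img_id)]}}
            if (index, kb_id, img_id) in chunks
            else {}
        ),
    )


def index_for(tenant_id):
    return f"ragflow_{tenant_id}"


def test_returns_doc_id_when_accessible():
    store = make_store({("ragflow_t1", "kb1", "img-1"): "doc-1"})
    result = find_accessible_doc_id("img-1", {"t1": ["kb1"]}, store, lambda _d: True, index_for)
    assert result == "doc-1"


def test_returns_none_when_access_denied():
    store = make_store({("ragflow_t1", "kb1", "img-1"): "doc-1"})
    result = find_accessible_doc_id("img-1", {"t1": ["kb1"]}, store, lambda _d: False, index_for)
    assert result is None


def test_returns_none_for_image_in_foreign_tenant():
    # The image exists, but only in an index the caller's tenants do not cover.
    store = make_store({("ragflow_t2", "kb2", "img-1"): "doc-9"})
    result = find_accessible_doc_id("img-1", {"t1": ["kb1"]}, store, lambda _d: True, index_for)
    assert result is None
```

The third case is the one that matters most for this fix: the lookup runs only against indexes belonging to the caller's tenants, so an image ID from another tenant never resolves to a document.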
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 1708-1712: The image response builds a Response and sets
Content-Type directly, but it must reuse apply_safe_file_response_headers to
enforce safe headers (prevent serving active types like SVG). Replace the direct
response.headers.set(...) call and the return with a call to
apply_safe_file_response_headers(response, filename=doc.thumbnail) (or the
helper's actual signature) after make_response, or call the helper first then
set any fallback Content-Type with CONTENT_TYPE_MAP if needed; ensure you still
await thread_pool_exec(STORAGE_IMPL.get, doc.kb_id, doc.thumbnail) and use
make_response(response_body) before invoking apply_safe_file_response_headers so
the thumbnail and content type are derived via the helper instead of setting
headers manually.

In `@test/testcases/test_web_api/test_document_app/conftest.py`:
- Line 243: Replace the no-op stub with one that returns the same tuple shape as
the real helper: implement api_utils_mod.check_duplicate_ids as a small stub
that accepts *args, **kwargs and returns (unique_ids, duplicate_messages) — e.g.
return (args[0] if args else kwargs.get('ids'), []) so callers unpacking
(unique_ids, duplicate_messages) get valid values; apply the same change to the
other fixture instance.
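
The safe-header concern raised for `document_api.py` above can be illustrated with a small sketch. It assumes an allowlist approach; `SAFE_IMAGE_TYPES` and `safe_image_headers` are hypothetical names, and the PR's actual `apply_safe_file_response_headers` and `CONTENT_TYPE_MAP` may behave differently.

```python
import os

# Hypothetical allowlist of passive raster image types. Active formats such
# as SVG are deliberately absent so they are never served inline.
SAFE_IMAGE_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def safe_image_headers(filename: str) -> dict[str, str]:
    """Build response headers for a stored image based on its extension."""
    ext = os.path.splitext(filename)[1].lower()
    # Tell browsers not to guess a different type from the bytes.
    headers = {"X-Content-Type-Options": "nosniff"}
    content_type = SAFE_IMAGE_TYPES.get(ext)
    if content_type is None:
        # Unknown or active type: force a download instead of rendering.
        headers["Content-Type"] = "application/octet-stream"
        headers["Content-Disposition"] = "attachment"
    else:
        headers["Content-Type"] = content_type
    return headers
```

Routing both the thumbnail and the legacy image handlers through one function of this kind keeps the two endpoints from drifting apart in how they treat risky file types.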

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bb6daffb-5261-4c30-b498-d5836a8fb6d6

📥 Commits

Reviewing files that changed from the base of the PR and between 46897d6 and b42a0e4.

📒 Files selected for processing (4)
  • api/apps/restful_apis/document_api.py
  • test/testcases/test_web_api/test_common.py
  • test/testcases/test_web_api/test_document_app/conftest.py
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py

Comment thread api/apps/restful_apis/document_api.py
Comment thread test/testcases/test_web_api/test_document_app/conftest.py Outdated
Contributor

Copilot AI left a comment

Pull request overview

This PR addresses an authorization bypass in document image serving by moving document thumbnails to an authenticated, document-scoped endpoint and restricting the legacy image route to authorized access patterns.

Changes:

  • Added authenticated GET /api/v1/documents/<doc_id>/thumbnail and updated thumbnail URL generation to use it.
  • Secured the legacy GET /api/v1/documents/images/<image_id> route with authentication plus doc-store-backed authorization checks.
  • Added/updated web API and unit tests to cover invalid-auth and new thumbnail/image behaviors.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| api/apps/restful_apis/document_api.py | Adds authenticated thumbnail endpoint, tightens legacy image access, and updates thumbnail URL generation. |
| test/testcases/test_web_api/test_common.py | Adds a client helper for the new document thumbnail endpoint. |
| test/testcases/test_web_api/test_document_app/test_document_metadata.py | Adds invalid-auth coverage for thumbnails and new unit tests for the new/secured handlers. |
| test/testcases/test_web_api/test_document_app/conftest.py | Extends module stubs to support the new imports/authorization logic in unit tests. |
Comments suppressed due to low confidence (2)

api/apps/restful_apis/document_api.py:1236

  • list_thumbnails now requires authentication, but the OpenAPI docstring no longer includes the security section like other authenticated endpoints in this module. This can cause generated API docs/clients to treat it as public. Please update the docstring to reflect that it requires ApiKeyAuth/auth.
@manager.route("/thumbnails", methods=["GET"])  # noqa: F821
@login_required
def list_thumbnails():
    """
    Get thumbnails for documents.
    ---
    tags:
      - Documents
    parameters:
      - in: query
        name: doc_ids
        type: array
        required: true
        description: List of document IDs to get thumbnails for.
    responses:
      200:
        description: Successfully retrieved thumbnails

api/apps/restful_apis/document_api.py:1733

  • get_document_image is now @login_required, but its OpenAPI docstring does not declare any security requirement. Please add the security stanza (consistent with other routes) so the API schema/docs match runtime behavior.
@manager.route("/documents/images/<image_id>", methods=["GET"])  # noqa: F821
@login_required
async def get_document_image(image_id):
    """
    Get a document image by ID.
    ---
    tags:
      - Documents
    parameters:
      - name: image_id
        in: path
        required: true
        schema:
          type: string
        description: The image ID (format: bucket-name-image-name)
    responses:
      200:

Comment thread api/apps/restful_apis/document_api.py Outdated
Comment on lines +1700 to +1705
    if not DocumentService.accessible(doc_id, current_user.id):
        logging.warning("get_document_thumbnail: access denied for doc_id=%s user_id=%s", doc_id, current_user.id)
        return get_data_error_result(message="Document not found!")

    e, doc = DocumentService.get_by_id(doc_id)
    if not e or not doc.thumbnail or doc.thumbnail.startswith(IMG_BASE64_PREFIX):
Comment on lines +844 to +888
def _get_accessible_chunk_image_doc_id(image_id: str) -> str | None:
    tenants = UserTenantService.query(user_id=current_user.id)
    if not tenants:
        return None

    accessible_kbs, _ = KnowledgebaseService.get_by_tenant_ids(
        [tenant.tenant_id for tenant in tenants],
        current_user.id,
        0,
        0,
        "update_time",
        True,
        "",
    )

    kb_ids_by_tenant = {}
    for kb in accessible_kbs:
        kb_ids_by_tenant.setdefault(kb["tenant_id"], []).append(kb["id"])

    for tenant_id, kb_ids in kb_ids_by_tenant.items():
        index_name = search.index_name(tenant_id)
        for kb_id in kb_ids:
            if not settings.docStoreConn.index_exist(index_name, kb_id):
                continue

            result = settings.docStoreConn.search(
                ["doc_id"],
                [],
                {"img_id": image_id},
                [],
                OrderByExpr(),
                0,
                1,
                index_name,
                [kb_id],
            )
            fields = settings.docStoreConn.get_fields(result, ["doc_id"])
            if not fields:
                continue

            doc_id = next(iter(fields.values())).get("doc_id")
            if doc_id and DocumentService.accessible(doc_id, current_user.id):
                return doc_id

    return None
Comment on lines +88 to +106
    common_settings_mod = ModuleType("common.settings")
    common_settings_mod.STORAGE_IMPL = SimpleNamespace(get=lambda *_args, **_kwargs: b"", obj_exist=lambda *_args, **_kwargs: False)
    common_settings_mod.docStoreConn = SimpleNamespace(
        index_exist=lambda *_args, **_kwargs: False,
        search=lambda *_args, **_kwargs: {},
        get_fields=lambda *_args, **_kwargs: {},
    )
    monkeypatch.setitem(sys.modules, "common.settings", common_settings_mod)

    metadata_utils_mod = ModuleType("common.metadata_utils")
    metadata_utils_mod.convert_conditions = lambda *_args, **_kwargs: {}
    metadata_utils_mod.meta_filter = lambda *_args, **_kwargs: True
    metadata_utils_mod.turn2jsonschema = lambda value: value
    monkeypatch.setitem(sys.modules, "common.metadata_utils", metadata_utils_mod)

    rag_nlp_mod = ModuleType("rag.nlp")
    rag_nlp_mod.search = SimpleNamespace(index_name=lambda tenant_id: f"ragflow_{tenant_id}")
    monkeypatch.setitem(sys.modules, "rag.nlp", rag_nlp_mod)

Comment on lines +232 to +234
def _pass_through_decorator(func):
    return func

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/testcases/test_web_api/test_document_app/conftest.py`:
- Around line 81-82: The stub _check_duplicate_ids currently returns
deduplicated ids using list(set(ids)) which drops insertion order; change it to
preserve original order by using an ordered-dedup approach (e.g.
list(dict.fromkeys(ids))) so unique_ids maintain the input sequence; keep the
second return value as the empty list and do this replacement inside the
_check_duplicate_ids function.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 12a306fd-ced0-43a3-a322-668cb73a4c88

📥 Commits

Reviewing files that changed from the base of the PR and between b42a0e4 and ba0a0f0.

📒 Files selected for processing (3)
  • api/apps/restful_apis/document_api.py
  • test/testcases/test_web_api/test_document_app/conftest.py
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py
  • api/apps/restful_apis/document_api.py

Comment on lines +81 to +82
def _check_duplicate_ids(ids, *_args, **_kwargs):
    return list(set(ids)), []
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve insertion order in _check_duplicate_ids stub.

list(set(ids)) deduplicates but does not preserve the input order, so any test that asserts on positional ordering of unique_ids (or iterates through them deterministically) may become flaky. The real helper conventionally preserves order while deduplicating; using dict.fromkeys matches that behavior with negligible cost.

♻️ Proposed fix
 def _check_duplicate_ids(ids, *_args, **_kwargs):
-    return list(set(ids)), []
+    return list(dict.fromkeys(ids)), []
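
The ordering difference behind this suggestion is easy to see in isolation. The two functions below are illustrative copies of the stub before and after the proposed change (the names `dedupe_with_set` and `dedupe_ordered` are made up for this sketch); both return the `(unique_ids, duplicate_messages)` tuple shape that callers unpack.

```python
def dedupe_with_set(ids):
    # Removes duplicates, but the resulting order is not guaranteed.
    return list(set(ids)), []


def dedupe_ordered(ids):
    # dict preserves insertion order, so the first occurrence of each id
    # keeps its position.
    return list(dict.fromkeys(ids)), []


ids = ["doc-3", "doc-1", "doc-3", "doc-2", "doc-1"]

unique_ids, duplicate_messages = dedupe_ordered(ids)
print(unique_ids)  # ['doc-3', 'doc-1', 'doc-2']

unordered_ids, _ = dedupe_with_set(ids)
# Same members either way; only the ordered variant is safe to compare
# positionally in a test.
assert sorted(unordered_ids) == sorted(unique_ids)
```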

Labels

🐞 bug Something isn't working, pull request that fix bug. size:L This PR changes 100-499 lines, ignoring generated files. 🧪 test Pull requests that update test cases.

Development

Successfully merging this pull request may close these issues.

[Bug]: Unauthenticated cross-tenant data exfiltration via GET /api/v1/documents/images/<image_id> (missing @login_required, confused-deputy MinIO fetch)

2 participants