fix: authenticate /documents/images endpoint and restrict to chunk im…#14765

Open
hunnyboy1217 wants to merge 2 commits into
infiniflow:mainfrom
hunnyboy1217:fix/14763-auth-document-images-endpoint
Conversation

@hunnyboy1217
Contributor

What problem does this PR solve?

Closes #14763.

GET /api/v1/documents/images/<image_id> had no @login_required and split the user-supplied path on - into a (bucket, key) pair passed straight to STORAGE_IMPL.get. Because thumbnails and raw documents share the same bucket (kb.id), an unauthenticated caller could fetch arbitrary objects — including raw PDFs/Word docs — by reconstructing keys that the application itself embedded into authenticated list responses (document_api.py:736, :1198).
Same threat model as #14625, but worse: no auth required and not restricted to image-typed files.

This PR:

  1. /documents/images/<image_id> — adds @login_required, validates KB access via KnowledgebaseService.accessible, and restricts the storage key to the chunk-image shape (^[0-9a-f]{16}$, the xxhash64 digest produced in rag/svr/task_executor.py). Thumbnail filenames, raw doc filenames, or any other key shape are rejected before the access check, closing the confused-deputy primitive even for authorized callers. The endpoint now serves only chunk reference images — its only legitimate frontend use.

  2. New /documents/<doc_id>/thumbnail — authenticated, gated by DocumentService.accessible, derives the storage key (thumbnail_{doc.id}.png) server-side from the document record. Returns Content-Type: image/png (the previous handler hardcoded image/JPEG even for PNG thumbnails).

  3. URL builders at document_api.py:736 and :1198 now emit /api/v1/documents/{doc_id}/thumbnail instead of the leakable kb_id-key URL. Side benefit: the /thumbnails JSON response no longer exposes kb_id.

  4. Tests — re-enables the previously skipped unit test (renamed to match the new handler name) and adds coverage for: cross-tenant denial on both endpoints, confused-deputy key shapes (thumbnail_*.png, report.pdf, non-hex, 17-hex) rejected before access check, server-derived storage key on the thumbnail endpoint, and exception paths.
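
The validation order described in item 1 can be sketched as a framework-free function. This is a minimal illustration, not the handler from the PR: `resolve_chunk_image` and the injected `kb_accessible` callable are hypothetical names, and the real code calls `KnowledgebaseService.accessible` and returns `get_data_error_result(...)` responses instead of tuples.

```python
import re

# Key shape taken from the PR description: 16 lowercase hex characters
# (the xxhash64 digest). fullmatch is used so a trailing newline cannot
# slip past a "$" anchor.
CHUNK_IMAGE_KEY_RE = re.compile(r"[0-9a-f]{16}")


def resolve_chunk_image(image_id, user_id, kb_accessible):
    """Return ((kb_id, key), None) on success or (None, error_message).

    kb_accessible is a hypothetical stand-in for the KB access check.
    """
    parts = image_id.split("-")
    if len(parts) != 2:
        return None, "Image not found."
    kb_id, key = parts
    # Reject thumbnail names, raw document names, and any other key shape
    # before the access check is consulted.
    if not CHUNK_IMAGE_KEY_RE.fullmatch(key):
        return None, "Image not found."
    if not kb_accessible(kb_id, user_id):
        return None, "No authorization."
    return (kb_id, key), None
```

Because the key-shape check runs first, a caller with legitimate KB access still cannot use the endpoint to read `thumbnail_*.png` or raw document objects from the same bucket.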

Frontend not changed. The two <img src> constructors flagged in the issue (web/src/components/image/index.tsx, web/src/components/next-message-item/reference-image-list.tsx) consume chunk.image_id (kb_id-xxhash16), not document thumbnail URLs — they keep working against the now-authenticated /documents/images/<image_id> endpoint.
Document thumbnails in the frontend already flow through useFetchDocumentThumbnailsByIds/thumbnails, which now returns the new authenticated URL automatically.

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. 🐞 bug Something isn't working, pull request that fix bug. 🧪 test Pull requests that update test cases. labels May 11, 2026
@coderabbitai
Contributor

coderabbitai Bot commented May 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e7dfe869-4a34-441e-8ed3-0cd3f46efed6

📥 Commits

Reviewing files that changed from the base of the PR and between 84a5e10 and 6f1e8a7.

📒 Files selected for processing (2)
  • api/apps/restful_apis/document_api.py
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • api/apps/restful_apis/document_api.py

📝 Walkthrough

Adds authentication and strict id validation to chunk-image serving; introduces an authenticated per-document thumbnail route that derives storage keys; rewrites thumbnail URLs in listings and /thumbnails; and adds unit and E2E tests plus a test helper covering validation, authorization, success, missing, and exception paths.

Changes

Document Image Security Hardening

| Layer | File(s) | Summary |
|---|---|---|
| Thumbnail URL rewrite | api/apps/restful_apis/document_api.py | Document list responses now rewrite non-base64 thumbnail fields to /api/v1/documents/<doc_id>/thumbnail. |
| Authenticated /thumbnails endpoint | api/apps/restful_apis/document_api.py | GET /thumbnails now requires login, filters requested doc_ids to those accessible by the caller, returns an empty result when none are accessible, and rewrites thumbnail references to per-document thumbnail URLs. |
| Chunk-image route signature & validation | api/apps/restful_apis/document_api.py | Module-level regex added to enforce 16-hex key shape and route declaration updated to require authentication and kb_id-<16hex> image_id format. |
| Hardened chunk-image + per-document thumbnail | api/apps/restful_apis/document_api.py | GET /documents/images/<image_id> now authenticates, validates kb_id-key with a strict 16-hex key, checks KB accessibility, fetches bytes from storage (kb_id, key), and serves image/JPEG. New GET /documents/<doc_id>/thumbnail authorizes via document access, derives thumbnail_<doc_id>.png storage key, serves image/png when present, or returns Thumbnail not found.; storage exceptions map to error responses. |
| Test helper & tests | test/testcases/test_web_api/test_common.py, test/testcases/test_web_api/test_document_app/test_document_metadata.py | Adds document_thumbnail(...) helper and tests: TestAuthorization.test_thumbnail_auth_invalid, test_get_document_image_authz_and_validation_unit, test_get_document_thumbnail_authz_and_success_unit, and /thumbnails unit tests verifying filtering and missing input handling. |
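
The thumbnail URL rewrite summarized above might look roughly like the following. This is a sketch under assumptions: `rewrite_thumbnails` is a hypothetical helper, documents are modeled as plain dicts with `id` and `thumbnail` fields, and the `data:image/` prefix used to detect inline base64 thumbnails is a guess at how the "non-base64" case is recognized.

```python
# Assumption: inline (base64) thumbnails are data URIs and are left untouched.
INLINE_IMAGE_PREFIX = "data:image/"


def rewrite_thumbnails(docs):
    """Return copies of docs whose stored-thumbnail references point at the
    per-document endpoint instead of a kb_id-key URL."""
    rewritten = []
    for doc in docs:
        doc = dict(doc)  # copy so the caller's records are not mutated
        thumb = doc.get("thumbnail")
        if thumb and not thumb.startswith(INLINE_IMAGE_PREFIX):
            doc["thumbnail"] = f"/api/v1/documents/{doc['id']}/thumbnail"
        rewritten.append(doc)
    return rewritten
```

The emitted URL carries only the document id, so neither the bucket (kb_id) nor the storage key appears in list responses.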

Sequence Diagram

sequenceDiagram
  participant Client
  participant API as API Endpoint
  participant Auth as Authentication
  participant Validate as Validation
  participant DocSvc as DocumentService
  participant KB as KB access check
  participant Storage as MinIO/Storage

  Client->>API: GET /documents/<doc_id>/thumbnail (authenticated)
  API->>Auth: verify session
  Auth->>DocSvc: accessible(doc_id, user_id)?
  alt Unauthorized or Not found
    DocSvc->>Client: 403/404
  else Authorized
    API->>Storage: get(kb_id, thumbnail_<doc_id>.png)
    Storage->>Client: 200 image/png + bytes or 404 -> "Thumbnail not found."
  end

  Client->>API: GET /documents/images/<image_id> (authenticated)
  API->>Auth: verify session
  API->>Validate: parse image_id -> kb_id, key (16-hex)
  alt Invalid format
    Validate->>Client: 400 Image not found
  else Valid format
    API->>KB: check KB accessible to user?
    alt KB not accessible
      KB->>Client: 403 No authorization
    else Authorized
      API->>Storage: get(kb_id, key)
      Storage->>Client: 200 image/JPEG + bytes
    end
  end
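
The thumbnail branch of the diagram can be sketched without the web framework. Everything here is illustrative: `fetch_thumbnail`, `Doc`, and the injected callables are hypothetical, and the numeric statuses simply mirror the diagram's labels rather than the handler's actual response format.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    id: str
    kb_id: str


def fetch_thumbnail(doc_id, user_id, accessible, get_doc, storage_get):
    """Authorize first, then derive the storage key from the document record.

    accessible, get_doc and storage_get are hypothetical stand-ins for the
    document access check, the document lookup and the storage backend.
    """
    if not accessible(doc_id, user_id):
        return 403, "No authorization."
    doc = get_doc(doc_id)
    if doc is None:
        return 404, "Document not found."
    # The key is built from the stored record, never from caller-supplied text.
    key = f"thumbnail_{doc.id}.png"
    data = storage_get(doc.kb_id, key)
    if not data:
        return 404, "Thumbnail not found."
    return 200, data
```

Since the bucket and key both come from the document record, the only value a caller controls is which document id is looked up, and that lookup sits behind the access check.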

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • wangq8

🐰 I hopped along the API trail,
I checked each key and sealed the mail,
Thumbnails now only show what's right,
I kept the bytes snug and out of sight,
A rabbit's fix — short, safe, and hale.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 35.71% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The PR title is truncated (70 chars, ending with '…') and appears incomplete; it does not fully convey the scope of changes including the new thumbnail endpoint and URL rewriting. | Expand the title to describe the complete fix: 'fix: authenticate /documents/images endpoint, restrict to chunk images, add /documents/<doc_id>/thumbnail endpoint'. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | The PR description follows the required template and provides comprehensive detail: it clearly states the problem solved (closes #14763), specifies the bug-fix type, and explains all four key changes (auth, KB access validation, new endpoint, URL builders, and tests). |
| Linked Issues check | ✅ Passed | The PR fully addresses all requirements from issue #14763: adds @login_required and KB access checks to /documents/images, restricts keys to ^[0-9a-f]{16}$, creates /documents/<doc_id>/thumbnail endpoint with server-derived keys, updates URL builders, and provides comprehensive tests for anonymous/cross-tenant denial and key-shape rejection. |
| Out of Scope Changes check | ✅ Passed | All changes are tightly scoped to the security fix: document_api.py adds authentication/validation logic and new thumbnail endpoint, tests add corresponding coverage, and test_common.py adds the required API helper function—no extraneous modifications. |


@hunnyboy1217
Contributor Author

Hi @wangq8,
Could you please review my PR?

Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
api/apps/restful_apis/document_api.py (1)

1625-1665: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add audit logging for authentication/validation failures in the security-hardened image endpoint.

This endpoint was hardened in response to a cross-tenant exfiltration issue, but failure paths (malformed image_id, invalid key shape, KB access denied) silently return generic error responses. Without logging, suspicious probing (e.g., confused-deputy attempts with thumbnail_* keys) won't be observable in production. Per the coding guidelines, new flows should include logging.

📝 Proposed addition of audit logs
     try:
         parts = image_id.split("-")
         if len(parts) != 2:
+            logging.warning("get_document_image: malformed image_id=%r", image_id)
             return get_data_error_result(message="Image not found.")
         kb_id, key = parts
         if not _CHUNK_IMAGE_KEY_RE.match(key):
+            logging.warning("get_document_image: invalid key shape kb_id=%s key=%r", kb_id, key)
             return get_data_error_result(message="Image not found.")
         if not KnowledgebaseService.accessible(kb_id, current_user.id):
+            logging.warning("get_document_image: cross-tenant denial kb_id=%s user=%s", kb_id, current_user.id)
             return get_data_error_result(message="No authorization.")
         data = await thread_pool_exec(settings.STORAGE_IMPL.get, kb_id, key)

As per coding guidelines: "Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/restful_apis/document_api.py` around lines 1625 - 1665, In
get_document_image add audit logging on each validation/auth failure: log an
info/warn entry including current_user.id and the provided image_id (and
optionally request.remote_addr if available) when parts length != 2 (malformed
ID), when _CHUNK_IMAGE_KEY_RE.match(key) fails (invalid key shape), and when
KnowledgebaseService.accessible(kb_id, current_user.id) returns False (access
denied); keep the existing generic responses but ensure logs include the failure
reason (malformed, invalid_key, access_denied) and the kb_id/key when parsable
to aid incident investigation, and use the same logger used elsewhere in the
module for consistency.
🧹 Nitpick comments (3)
test/testcases/test_web_api/test_document_app/test_document_metadata.py (1)

520-606: 💤 Low value

LGTM — thumbnail tests pin the server-derived key shape.

Coverage of cross-tenant denial, missing document, empty-storage, happy path, and exception path is appropriate. The storage_calls == [("kb1", "thumbnail_doc1.png")] assertion on line 596 is especially valuable: it locks in that the storage key is derived from doc.id server-side and not influenced by the URL path parameter, which is the core of the confused-deputy fix.

One minor note: fake_thread_pool_exec_ok on lines 587–588 only forwards positional args (func(*args)), not **_kwargs. That's fine for current call sites in get_document_thumbnail, but if the production handler ever passes kwargs to thread_pool_exec, this stub will silently drop them. Consider return func(*args, **_kwargs) for forward-compatibility.

♻️ Optional tweak
-        async def fake_thread_pool_exec_ok(func, *args, **_kwargs):
-            return func(*args)
+        async def fake_thread_pool_exec_ok(func, *args, **kwargs):
+            return func(*args, **kwargs)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`
around lines 520 - 606, The test stub fake_thread_pool_exec_ok used in
test_get_document_thumbnail_authz_and_success_unit only forwards positional args
(it calls func(*args)) which will drop any keyword arguments passed through
thread_pool_exec; update the stub to forward both positional and keyword
arguments so it mirrors thread_pool_exec's signature (e.g., return func(*args,
**kwargs)), locating and editing the fake_thread_pool_exec_ok definition in this
test to include **_kwargs when invoking func.
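
The effect of a stub that drops keyword arguments can be shown with a small standalone example. The names below (`stub_positional_only`, `stub_forwarding`, `fake_get` and its `suffix` parameter) are invented for illustration and are not taken from the test file or from `thread_pool_exec`.

```python
import asyncio


async def stub_positional_only(func, *args, **_kwargs):
    # Mirrors the current stub: keyword arguments are accepted but discarded.
    return func(*args)


async def stub_forwarding(func, *args, **kwargs):
    # Mirrors the suggested tweak: keyword arguments reach the wrapped callable.
    return func(*args, **kwargs)


def fake_get(bucket, key, suffix=""):
    # Hypothetical storage call with one optional keyword parameter.
    return f"{bucket}/{key}{suffix}"


dropped = asyncio.run(
    stub_positional_only(fake_get, "kb1", "thumbnail_doc1.png", suffix="?v=2")
)
forwarded = asyncio.run(
    stub_forwarding(fake_get, "kb1", "thumbnail_doc1.png", suffix="?v=2")
)
```

With the positional-only stub the `suffix` value never reaches `fake_get`, so a test built on it would keep passing even if the handler started relying on a keyword argument.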
api/apps/restful_apis/document_api.py (2)

1668-1708: ⚡ Quick win

Add logging for the new thumbnail endpoint flow.

This is a new authenticated flow guarding cross-tenant access. Denials and not-found paths should be logged for audit/troubleshooting, consistent with other handlers in this file (e.g., upload_info, upload_document that use logging.error/logging.exception).

📝 Proposed addition of logging
     try:
         if not DocumentService.accessible(doc_id, current_user.id):
+            logging.warning("get_document_thumbnail: cross-tenant denial doc_id=%s user=%s", doc_id, current_user.id)
             return get_data_error_result(message="No authorization.")
         e, doc = DocumentService.get_by_id(doc_id)
         if not e:
             return get_data_error_result(message="Document not found.")
         thumbnail_key = f"thumbnail_{doc.id}.png"
         data = await thread_pool_exec(settings.STORAGE_IMPL.get, doc.kb_id, thumbnail_key)
         if not data:
+            logging.info("get_document_thumbnail: missing thumbnail doc_id=%s key=%s", doc.id, thumbnail_key)
             return get_data_error_result(message="Thumbnail not found.")
         response = await make_response(data)
         response.headers.set("Content-Type", "image/png")
         return response
     except Exception as e:
+        logging.exception("get_document_thumbnail failed doc_id=%s", doc_id)
         return server_error_response(e)

As per coding guidelines: "Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/restful_apis/document_api.py` around lines 1668 - 1708, The
get_document_thumbnail handler lacks logging for authorization denials, missing
documents/thumbnails and exceptions; add logging calls: use logging.error when
DocumentService.accessible(doc_id, current_user.id) returns False (include
doc_id and current_user.id), logging.error when
DocumentService.get_by_id(doc_id) indicates not found (include doc_id),
logging.error when storage returns no data for the thumbnail_key (include
thumbnail_key and doc.kb_id), and logging.exception in the except block before
returning server_error_response(e); place these logs inside
get_document_thumbnail around the calls to DocumentService.accessible,
DocumentService.get_by_id, settings.STORAGE_IMPL.get, and the except handler to
match patterns used in upload_info/upload_document.

1651-1660: ⚡ Quick win

Image ID parsing is fragile if kb_id contains a hyphen.

Internal kb_id values generated by get_uuid() are hex strings without hyphens, but the current parsing logic image_id.split("-") with len(parts) != 2 assumes exactly one hyphen delimiter. If external ingestion or data migration introduces hyphenated kb_id values (e.g., RFC4122 UUIDs), legitimate lookups will incorrectly 404. Use rpartition("-") to split on the rightmost hyphen, anchoring the parse on the strict key suffix.

♻️ Proposed fix
-        parts = image_id.split("-")
-        if len(parts) != 2:
+        kb_id, sep, key = image_id.rpartition("-")
+        if not sep or not kb_id:
             return get_data_error_result(message="Image not found.")
-        kb_id, key = parts
         if not _CHUNK_IMAGE_KEY_RE.match(key):
             return get_data_error_result(message="Image not found.")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/restful_apis/document_api.py` around lines 1651 - 1660, The image_id
parsing is fragile because image_id.split("-") assumes exactly one hyphen;
change parsing to use a rightmost split (image_id.rpartition("-")) so the suffix
key is extracted reliably even if kb_id contains hyphens: ensure you handle the
case where no hyphen exists (treat as not found), assign kb_id from the left
part and key from the right part of rpartition, keep the existing
_CHUNK_IMAGE_KEY_RE.match(key) validation and the
KnowledgebaseService.accessible(kb_id, current_user.id) check, and then call
thread_pool_exec(settings.STORAGE_IMPL.get, kb_id, key) as before.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 765299a7-e55f-4151-bf49-f8a80b0f6965

📥 Commits

Reviewing files that changed from the base of the PR and between 3838770 and c5f9d64.

📒 Files selected for processing (2)
  • api/apps/restful_apis/document_api.py
  • test/testcases/test_web_api/test_document_app/test_document_metadata.py

@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels May 11, 2026
@hunnyboy1217 hunnyboy1217 force-pushed the fix/14763-auth-document-images-endpoint branch from 686bf5d to c5f9d64 Compare May 11, 2026 05:31
@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels May 11, 2026
@hunnyboy1217 hunnyboy1217 force-pushed the fix/14763-auth-document-images-endpoint branch from c5f9d64 to 84a5e10 Compare May 11, 2026 05:42


@manager.route("/thumbnails", methods=["GET"]) # noqa: F821
@login_required
Collaborator

@KevinHuSh KevinHuSh May 12, 2026

With this change, image loading will fail for all embedded <iframe> dialogs. (The dialogs can be embedded in an iframe.)

@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels May 12, 2026
@hunnyboy1217
Contributor Author

Hi @KevinHuSh,
Thanks for your review.
I have updated the changes based on your feedback.
Could you please review it again?

@hunnyboy1217 hunnyboy1217 requested a review from KevinHuSh May 12, 2026 09:23

Labels

🐞 bug Something isn't working, pull request that fix bug. size:L This PR changes 100-499 lines, ignoring generated files. 🧪 test Pull requests that update test cases.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Unauthenticated cross-tenant data exfiltration via GET /api/v1/documents/images/<image_id> (missing @login_required, confused-deputy MinIO fetch)

2 participants