fix: authenticate /documents/images endpoint and restrict to chunk im…#14765
fix: authenticate /documents/images endpoint and restrict to chunk im…#14765hunnyboy1217 wants to merge 2 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds authentication and strict id validation to chunk-image serving; introduces an authenticated per-document thumbnail route that derives storage keys; rewrites thumbnail URLs in listings and /thumbnails; and adds unit and E2E tests plus a test helper covering validation, authorization, success, missing, and exception paths. ChangesDocument Image Security Hardening
Sequence DiagramsequenceDiagram
participant Client
participant API as API Endpoint
participant Auth as Authentication
participant Validate as Validation
participant DocSvc as DocumentService
participant KB as KB access check
participant Storage as MinIO/Storage
Client->>API: GET /documents/<doc_id>/thumbnail (authenticated)
API->>Auth: verify session
Auth->>DocSvc: accessible(doc_id, user_id)?
alt Unauthorized or Not found
DocSvc->>Client: 403/404
else Authorized
API->>Storage: get(kb_id, thumbnail_<doc_id>.png)
Storage->>Client: 200 image/png + bytes or 404 -> "Thumbnail not found."
end
Client->>API: GET /documents/images/<image_id> (authenticated)
API->>Auth: verify session
API->>Validate: parse image_id -> kb_id, key (16-hex)
alt Invalid format
Validate->>Client: 400 Image not found
else Valid format
API->>KB: check KB accessible to user?
alt KB not accessible
KB->>Client: 403 No authorization
else Authorized
API->>Storage: get(kb_id, key)
Storage->>Client: 200 image/JPEG + bytes
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Hi, @wangq8 , |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
api/apps/restful_apis/document_api.py (1)
1625-1665:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd audit logging for authentication/validation failures in the security-hardened image endpoint.
This endpoint was hardened in response to a cross-tenant exfiltration issue, but failure paths (malformed
image_id, invalid key shape, KB access denied) silently return generic error responses. Without logging, suspicious probing (e.g., confused-deputy attempts withthumbnail_*keys) won't be observable in production. Per the coding guidelines, new flows should include logging.📝 Proposed addition of audit logs
try: parts = image_id.split("-") if len(parts) != 2: + logging.warning("get_document_image: malformed image_id=%r", image_id) return get_data_error_result(message="Image not found.") kb_id, key = parts if not _CHUNK_IMAGE_KEY_RE.match(key): + logging.warning("get_document_image: invalid key shape kb_id=%s key=%r", kb_id, key) return get_data_error_result(message="Image not found.") if not KnowledgebaseService.accessible(kb_id, current_user.id): + logging.warning("get_document_image: cross-tenant denial kb_id=%s user=%s", kb_id, current_user.id) return get_data_error_result(message="No authorization.") data = await thread_pool_exec(settings.STORAGE_IMPL.get, kb_id, key)As per coding guidelines: "Add logging for new flows".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/apps/restful_apis/document_api.py` around lines 1625 - 1665, In get_document_image add audit logging on each validation/auth failure: log an info/warn entry including current_user.id and the provided image_id (and optionally request.remote_addr if available) when parts length != 2 (malformed ID), when _CHUNK_IMAGE_KEY_RE.match(key) fails (invalid key shape), and when KnowledgebaseService.accessible(kb_id, current_user.id) returns False (access denied); keep the existing generic responses but ensure logs include the failure reason (malformed, invalid_key, access_denied) and the kb_id/key when parsable to aid incident investigation, and use the same logger used elsewhere in the module for consistency.
🧹 Nitpick comments (3)
test/testcases/test_web_api/test_document_app/test_document_metadata.py (1)
520-606: 💤 Low valueLGTM — thumbnail tests pin the server-derived key shape.
Coverage of cross-tenant denial, missing document, empty-storage, happy path, and exception path is appropriate. The
storage_calls == [("kb1", "thumbnail_doc1.png")]assertion on line 596 is especially valuable: it locks in that the storage key is derived fromdoc.idserver-side and not influenced by the URL path parameter, which is the core of the confused-deputy fix.One minor note:
fake_thread_pool_exec_okon lines 587–588 only forwards positional args (func(*args)), not**_kwargs. That's fine for current call sites inget_document_thumbnail, but if the production handler ever passes kwargs tothread_pool_exec, this stub will silently drop them. Considerreturn func(*args, **_kwargs)for forward-compatibility.♻️ Optional tweak
- async def fake_thread_pool_exec_ok(func, *args, **_kwargs): - return func(*args) + async def fake_thread_pool_exec_ok(func, *args, **kwargs): + return func(*args, **kwargs)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py` around lines 520 - 606, The test stub fake_thread_pool_exec_ok used in test_get_document_thumbnail_authz_and_success_unit only forwards positional args (it calls func(*args)) which will drop any keyword arguments passed through thread_pool_exec; update the stub to forward both positional and keyword arguments so it mirrors thread_pool_exec's signature (e.g., return func(*args, **kwargs)), locating and editing the fake_thread_pool_exec_ok definition in this test to include **_kwargs when invoking func.api/apps/restful_apis/document_api.py (2)
1668-1708: ⚡ Quick winAdd logging for the new thumbnail endpoint flow.
This is a new authenticated flow guarding cross-tenant access. Denials and not-found paths should be logged for audit/troubleshooting, consistent with other handlers in this file (e.g.,
upload_info,upload_documentthat uselogging.error/logging.exception).📝 Proposed addition of logging
try: if not DocumentService.accessible(doc_id, current_user.id): + logging.warning("get_document_thumbnail: cross-tenant denial doc_id=%s user=%s", doc_id, current_user.id) return get_data_error_result(message="No authorization.") e, doc = DocumentService.get_by_id(doc_id) if not e: return get_data_error_result(message="Document not found.") thumbnail_key = f"thumbnail_{doc.id}.png" data = await thread_pool_exec(settings.STORAGE_IMPL.get, doc.kb_id, thumbnail_key) if not data: + logging.info("get_document_thumbnail: missing thumbnail doc_id=%s key=%s", doc.id, thumbnail_key) return get_data_error_result(message="Thumbnail not found.") response = await make_response(data) response.headers.set("Content-Type", "image/png") return response except Exception as e: + logging.exception("get_document_thumbnail failed doc_id=%s", doc_id) return server_error_response(e)As per coding guidelines: "Add logging for new flows".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/apps/restful_apis/document_api.py` around lines 1668 - 1708, The get_document_thumbnail handler lacks logging for authorization denials, missing documents/thumbnails and exceptions; add logging calls: use logging.error when DocumentService.accessible(doc_id, current_user.id) returns False (include doc_id and current_user.id), logging.error when DocumentService.get_by_id(doc_id) indicates not found (include doc_id), logging.error when storage returns no data for the thumbnail_key (include thumbnail_key and doc.kb_id), and logging.exception in the except block before returning server_error_response(e); place these logs inside get_document_thumbnail around the calls to DocumentService.accessible, DocumentService.get_by_id, settings.STORAGE_IMPL.get, and the except handler to match patterns used in upload_info/upload_document.
1651-1660: ⚡ Quick winImage ID parsing is fragile if
kb_idcontains a hyphen.Internal
kb_idvalues generated byget_uuid()are hex strings without hyphens, but the current parsing logicimage_id.split("-")withlen(parts) != 2assumes exactly one hyphen delimiter. If external ingestion or data migration introduces hyphenatedkb_idvalues (e.g., RFC4122 UUIDs), legitimate lookups will incorrectly 404. Userpartition("-")to split on the rightmost hyphen, anchoring the parse on the strict key suffix.♻️ Proposed fix
- parts = image_id.split("-") - if len(parts) != 2: + kb_id, sep, key = image_id.rpartition("-") + if not sep or not kb_id: return get_data_error_result(message="Image not found.") - kb_id, key = parts if not _CHUNK_IMAGE_KEY_RE.match(key): return get_data_error_result(message="Image not found.")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/apps/restful_apis/document_api.py` around lines 1651 - 1660, The image_id parsing is fragile because image_id.split("-") assumes exactly one hyphen; change parsing to use a rightmost split (image_id.rpartition("-")) so the suffix key is extracted reliably even if kb_id contains hyphens: ensure you handle the case where no hyphen exists (treat as not found), assign kb_id from the left part and key from the right part of rpartition, keep the existing _CHUNK_IMAGE_KEY_RE.match(key) validation and the KnowledgebaseService.accessible(kb_id, current_user.id) check, and then call thread_pool_exec(settings.STORAGE_IMPL.get, kb_id, key) as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 1625-1665: In get_document_image add audit logging on each
validation/auth failure: log an info/warn entry including current_user.id and
the provided image_id (and optionally request.remote_addr if available) when
parts length != 2 (malformed ID), when _CHUNK_IMAGE_KEY_RE.match(key) fails
(invalid key shape), and when KnowledgebaseService.accessible(kb_id,
current_user.id) returns False (access denied); keep the existing generic
responses but ensure logs include the failure reason (malformed, invalid_key,
access_denied) and the kb_id/key when parsable to aid incident investigation,
and use the same logger used elsewhere in the module for consistency.
---
Nitpick comments:
In `@api/apps/restful_apis/document_api.py`:
- Around line 1668-1708: The get_document_thumbnail handler lacks logging for
authorization denials, missing documents/thumbnails and exceptions; add logging
calls: use logging.error when DocumentService.accessible(doc_id,
current_user.id) returns False (include doc_id and current_user.id),
logging.error when DocumentService.get_by_id(doc_id) indicates not found
(include doc_id), logging.error when storage returns no data for the
thumbnail_key (include thumbnail_key and doc.kb_id), and logging.exception in
the except block before returning server_error_response(e); place these logs
inside get_document_thumbnail around the calls to DocumentService.accessible,
DocumentService.get_by_id, settings.STORAGE_IMPL.get, and the except handler to
match patterns used in upload_info/upload_document.
- Around line 1651-1660: The image_id parsing is fragile because
image_id.split("-") assumes exactly one hyphen; change parsing to use a
rightmost split (image_id.rpartition("-")) so the suffix key is extracted
reliably even if kb_id contains hyphens: ensure you handle the case where no
hyphen exists (treat as not found), assign kb_id from the left part and key from
the right part of rpartition, keep the existing _CHUNK_IMAGE_KEY_RE.match(key)
validation and the KnowledgebaseService.accessible(kb_id, current_user.id)
check, and then call thread_pool_exec(settings.STORAGE_IMPL.get, kb_id, key) as
before.
In `@test/testcases/test_web_api/test_document_app/test_document_metadata.py`:
- Around line 520-606: The test stub fake_thread_pool_exec_ok used in
test_get_document_thumbnail_authz_and_success_unit only forwards positional args
(it calls func(*args)) which will drop any keyword arguments passed through
thread_pool_exec; update the stub to forward both positional and keyword
arguments so it mirrors thread_pool_exec's signature (e.g., return func(*args,
**kwargs)), locating and editing the fake_thread_pool_exec_ok definition in this
test to include **_kwargs when invoking func.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 765299a7-e55f-4151-bf49-f8a80b0f6965
📒 Files selected for processing (2)
api/apps/restful_apis/document_api.pytest/testcases/test_web_api/test_document_app/test_document_metadata.py
686bf5d to
c5f9d64
Compare
c5f9d64 to
84a5e10
Compare
|
|
||
|
|
||
| @manager.route("/thumbnails", methods=["GET"]) # noqa: F821 | ||
| @login_required |
There was a problem hiding this comment.
In this way, images loadding will all fail for embeded <iframe> dialogs. (The dialogs can be embeded into iframe)
|
Hi, @KevinHuSh , |
What problem does this PR solve?
Closes #14763.
GET /api/v1/documents/images/<image_id>had no@login_requiredand split the user-supplied path on-into a(bucket, key)pair passed straight toSTORAGE_IMPL.get. Because thumbnails and raw documents share the same bucket (kb.id), an unauthenticated caller could fetch arbitrary objects — including raw PDFs/Word docs — by reconstructing keys that the application itself embedded into authenticated list responses (document_api.py:736,:1198).Same threat model as #14625, but worse: no auth required and not restricted to image-typed files.
This PR:
/documents/images/<image_id>— adds@login_required, validates KB access viaKnowledgebaseService.accessible, and restricts the storage key to the chunk-image shape (^[0-9a-f]{16}$, the xxhash64 digest produced inrag/svr/task_executor.py). Thumbnail filenames, raw doc filenames, or any other key shape are rejected before the access check, closing the confused-deputy primitive even for authorized callers. The endpoint now serves only chunk reference images — its only legitimate frontend use.New
/documents/<doc_id>/thumbnail— authenticated, gated byDocumentService.accessible, derives the storage key(
thumbnail_{doc.id}.png) server-side from the document record. ReturnsContent-Type: image/png(the previous handler hardcodedimage/JPEGeven for PNG thumbnails).URL builders at
document_api.py:736and:1198now emit/api/v1/documents/{doc_id}/thumbnailinstead of the leakablekb_id-keyURL. Side benefit: the/thumbnailsJSON response no longer exposeskb_id.Tests — re-enables the previously skipped unit test (renamed to match the new handler name) and adds coverage for: cross-tenant denial on both endpoints, confused-deputy key shapes (
thumbnail_*.png,report.pdf, non-hex, 17-hex) rejected before access check, server-derived storage key on the thumbnail endpoint, and exception paths.Frontend not changed. The two
<img src>constructors flagged in the issue (web/src/components/image/index.tsx,web/src/components/next-message-item/reference-image-list.tsx) consumechunk.image_id(kb_id-xxhash16), not document thumbnail URLs — they keep working against the now-authenticated/documents/images/<image_id>endpoint.Document thumbnails in the frontend already flow through
useFetchDocumentThumbnailsByIds→/thumbnails, which now returns the new authenticated URL automatically.Type of change