fix(api): close private dataset doc auth bypass#14749
Conversation
Co-authored-by: Copilot <[email protected]>
📝 WalkthroughWalkthroughThe PR adds authenticated user identification to dataset authorization checks in SDK document endpoints. Four route handlers now accept an optional ChangesUser-Based Authorization for SDK Routes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
api/apps/sdk/doc.py (1)
53-55: ⚡ Quick winAdd a docstring to explain the access control logic.
The function handles important authorization semantics (JWT-authenticated user vs API-key tenant fallback) without documentation. The suggested docstring clarifies when each path is used:
📝 Suggested docstring
def _dataset_access_actor_id(tenant_id: str, authenticated_user_id: str | None = None) -> str: + """ + Determine the actor ID for dataset access authorization. + + Returns authenticated_user_id when available (JWT/login-token flow), + otherwise falls back to tenant_id (API-key flow or unauthenticated). + """ return authenticated_user_id or tenant_id🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/apps/sdk/doc.py` around lines 53 - 55, Add a clear docstring to _dataset_access_actor_id describing the access-control semantics: explain that when a JWT-authenticated user ID is present the function returns that user (used for per-user authorization/audit), and when authenticated_user_id is None it falls back to the tenant_id (used for API-key based requests or tenant-scoped actions); mention expected parameter types and the function's return value and include a short example of both paths for clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@api/apps/sdk/doc.py`:
- Around line 208-211: The authorization check using
KnowledgebaseService.accessible(kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) lacks
logging; add logging similar to the download flow: log an info/debug message
before the check indicating the dataset_id and actor_id being validated and log
a warning/error when access is denied (including dataset_id and actor_id) so
operators can trace authorization failures; update the same function that
performs this check to call the logger and follow the existing logging format
used by the download function.
- Around line 302-305: The authorization check using
KnowledgebaseService.accessible(kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) is missing
logging; add the same logging pattern used in the download/parse flows to record
the check input (dataset_id, tenant_id, authenticated_user_id) and the
authorization outcome. Specifically, before/around the if, emit a log entry
(matching the existing logger level/format used in download/parse) that includes
dataset_id, tenant_id, actor id from _dataset_access_actor_id(...), and whether
access was granted/denied so the decision is auditable and consistent with other
flows.
- Around line 432-435: The dataset access check loop lacks logging; add
structured logs using the existing actor id and kb_ids: log a single INFO-level
message before the loop mentioning kb_ids and actor_id (from
_dataset_access_actor_id), and inside the loop log a WARN/ERROR when
KnowledgebaseService.accessible(kb_id=id, user_id=actor_id) returns False
including the denied kb id and actor_id (and authenticated_user_id if available)
before returning get_error_data_result so denied access events are recorded for
debugging and audit.
- Around line 97-100: Add structured logging around the
KnowledgebaseService.accessible check: log the attempted access with dataset_id,
the actor returned by _dataset_access_actor_id(tenant_id,
authenticated_user_id), and the authorization result (allowed/denied). Use the
module logger (e.g., logger or security logger) and emit a warning or info-level
entry when access is denied (include tenant_id and authenticated_user_id as
context), so the authorization decision for KnowledgebaseService.accessible is
recorded for audit and debugging.
In `@api/utils/api_utils.py`:
- Around line 339-340: The code path that injects authenticated_user_id into
kwargs (the block checking accepts_authenticated_user_id and setting
kwargs["authenticated_user_id"] = user[0].id) needs an audit log entry; add a
concise log statement (using the module logger or existing logger variable)
immediately before or after the assignment that records the action and key
context such as the injected user id and the target dataset/request identifier
(if available) and use an appropriate level (info/debug) while avoiding
sensitive data; update any import or logger initialization (e.g., logger =
logging.getLogger(__name__)) if not already present.
---
Nitpick comments:
In `@api/apps/sdk/doc.py`:
- Around line 53-55: Add a clear docstring to _dataset_access_actor_id
describing the access-control semantics: explain that when a JWT-authenticated
user ID is present the function returns that user (used for per-user
authorization/audit), and when authenticated_user_id is None it falls back to
the tenant_id (used for API-key based requests or tenant-scoped actions);
mention expected parameter types and the function's return value and include a
short example of both paths for clarity.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 272a4cf4-6bed-4bf2-a7f7-55ab6fde80c1
📒 Files selected for processing (4)
api/apps/sdk/doc.pyapi/utils/api_utils.pytest/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.pytest/unit_test/api/utils/test_api_utils_token_required.py
| if not KnowledgebaseService.accessible( | ||
| kb_id=dataset_id, | ||
| user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id), | ||
| ): |
There was a problem hiding this comment.
Add logging for dataset access authorization checks.
This security-critical authorization check determines access to private datasets but lacks logging. Adding a log statement would improve security audit trails and debugging.
🔒 Proposed logging addition
+ actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
+ logging.debug("Checking dataset access: dataset_id=%s actor_id=%s (authenticated_user_id=%s)", dataset_id, actor_id, authenticated_user_id)
if not KnowledgebaseService.accessible(
kb_id=dataset_id,
- user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id),
+ user_id=actor_id,
):
+ logging.warning("Dataset access denied: dataset_id=%s actor_id=%s", dataset_id, actor_id)
return get_error_data_result(message=f"You do not own the dataset {dataset_id}.")As per coding guidelines, "**/*.py: Add logging for new flows".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@api/apps/sdk/doc.py` around lines 97 - 100, Add structured logging around the
KnowledgebaseService.accessible check: log the attempted access with dataset_id,
the actor returned by _dataset_access_actor_id(tenant_id,
authenticated_user_id), and the authorization result (allowed/denied). Use the
module logger (e.g., logger or security logger) and emit a warning or info-level
entry when access is denied (include tenant_id and authenticated_user_id as
context), so the authorization decision for KnowledgebaseService.accessible is
recorded for audit and debugging.
| if not KnowledgebaseService.accessible( | ||
| kb_id=dataset_id, | ||
| user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id), | ||
| ): |
There was a problem hiding this comment.
Add logging for dataset access authorization checks.
Similar to the download function, this authorization check lacks logging. Consider adding the same logging pattern here for consistency.
As per coding guidelines, "**/*.py: Add logging for new flows".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@api/apps/sdk/doc.py` around lines 208 - 211, The authorization check using
KnowledgebaseService.accessible(kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) lacks
logging; add logging similar to the download flow: log an info/debug message
before the check indicating the dataset_id and actor_id being validated and log
a warning/error when access is denied (including dataset_id and actor_id) so
operators can trace authorization failures; update the same function that
performs this check to call the logger and follow the existing logging format
used by the download function.
| if not KnowledgebaseService.accessible( | ||
| kb_id=dataset_id, | ||
| user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id), | ||
| ): |
There was a problem hiding this comment.
Add logging for dataset access authorization checks.
Similar to the download and parse functions, this authorization check lacks logging. Consider adding the same logging pattern for consistency.
As per coding guidelines, "**/*.py: Add logging for new flows".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@api/apps/sdk/doc.py` around lines 302 - 305, The authorization check using
KnowledgebaseService.accessible(kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) is missing
logging; add the same logging pattern used in the download/parse flows to record
the check input (dataset_id, tenant_id, authenticated_user_id) and the
authorization outcome. Specifically, before/around the if, emit a log entry
(matching the existing logger level/format used in download/parse) that includes
dataset_id, tenant_id, actor id from _dataset_access_actor_id(...), and whether
access was granted/denied so the decision is auditable and consistent with other
flows.
| actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id) | ||
| for id in kb_ids: | ||
| if not KnowledgebaseService.accessible(kb_id=id, user_id=tenant_id): | ||
| if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id): | ||
| return get_error_data_result(f"You don't own the dataset {id}.") |
There was a problem hiding this comment.
Add logging for dataset access authorization checks.
This function checks access to potentially multiple datasets but lacks logging. Since this loops over dataset IDs, consider logging once before the loop (with the full list) and/or within the loop for denied access.
🔍 Proposed logging addition
actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
+ logging.debug("Checking access to datasets: dataset_ids=%s actor_id=%s (authenticated_user_id=%s)", kb_ids, actor_id, authenticated_user_id)
for id in kb_ids:
if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id):
+ logging.warning("Dataset access denied: dataset_id=%s actor_id=%s", id, actor_id)
return get_error_data_result(f"You don't own the dataset {id}.")As per coding guidelines, "**/*.py: Add logging for new flows".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id) | |
| for id in kb_ids: | |
| if not KnowledgebaseService.accessible(kb_id=id, user_id=tenant_id): | |
| if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id): | |
| return get_error_data_result(f"You don't own the dataset {id}.") | |
| actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id) | |
| logging.debug("Checking access to datasets: dataset_ids=%s actor_id=%s (authenticated_user_id=%s)", kb_ids, actor_id, authenticated_user_id) | |
| for id in kb_ids: | |
| if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id): | |
| logging.warning("Dataset access denied: dataset_id=%s actor_id=%s", id, actor_id) | |
| return get_error_data_result(f"You don't own the dataset {id}.") |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@api/apps/sdk/doc.py` around lines 432 - 435, The dataset access check loop
lacks logging; add structured logs using the existing actor id and kb_ids: log a
single INFO-level message before the loop mentioning kb_ids and actor_id (from
_dataset_access_actor_id), and inside the loop log a WARN/ERROR when
KnowledgebaseService.accessible(kb_id=id, user_id=actor_id) returns False
including the denied kb id and actor_id (and authenticated_user_id if available)
before returning get_error_data_result so denied access events are recorded for
debugging and audit.
| if accepts_authenticated_user_id: | ||
| kwargs["authenticated_user_id"] = user[0].id |
There was a problem hiding this comment.
Add logging for the authenticated user injection flow.
This security-relevant code path injects the authenticated user ID for dataset authorization checks but lacks logging. Adding a log statement would improve observability and audit trails for private dataset access.
📊 Proposed logging addition
kwargs["tenant_id"] = tenants[0].tenant_id
if accepts_authenticated_user_id:
+ logging.debug("JWT authentication: injecting authenticated_user_id=%s for tenant_id=%s", user[0].id, tenants[0].tenant_id)
kwargs["authenticated_user_id"] = user[0].idAs per coding guidelines, "**/*.py: Add logging for new flows".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@api/utils/api_utils.py` around lines 339 - 340, The code path that injects
authenticated_user_id into kwargs (the block checking
accepts_authenticated_user_id and setting kwargs["authenticated_user_id"] =
user[0].id) needs an audit log entry; add a concise log statement (using the
module logger or existing logger variable) immediately before or after the
assignment that records the action and key context such as the injected user id
and the target dataset/request identifier (if available) and use an appropriate
level (info/debug) while avoiding sensitive data; update any import or logger
initialization (e.g., logger = logging.getLogger(__name__)) if not already
present.
There was a problem hiding this comment.
Pull request overview
This PR fixes an authorization bypass in SDK document endpoints when callers authenticate using a login token: the SDK auth layer now forwards the authenticated user id (in addition to tenant/workspace id), and the SDK document routes use that user id for dataset authorization so permission = me datasets remain owner-only.
Changes:
- Update
token_requiredto optionally injectauthenticated_user_idfor login-token authentication when the wrapped handler accepts it. - Update SDK document routes to authorize dataset access using the authenticated user id (falling back to tenant id when not available).
- Add/adjust unit tests to verify correct user-id propagation and that SDK document routes pass the correct actor id into
KnowledgebaseService.accessible().
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
api/utils/api_utils.py |
Enhances token_required to conditionally pass authenticated_user_id for login-token flows. |
api/apps/sdk/doc.py |
Uses authenticated user id for dataset access checks in SDK document endpoints (download/parse/stop/retrieval). |
test/unit_test/api/utils/test_api_utils_token_required.py |
New unit test validating token_required injects authenticated user id for login tokens. |
test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py |
Updates existing unit tests and adds coverage ensuring SDK routes call dataset access checks with user_id=authenticated_user_id. |
What problem does this PR solve?
Fixes #14659
This PR fixes an authorization bypass in the token-authenticated document SDK APIs for private datasets (permission = me).
The affected routes validated dataset access with KnowledgebaseService.accessible(), but when a login token was used through the SDK token flow, the check received the caller's tenant/workspace id instead of the authenticated user id. That let another member of the same tenant pass access checks for a private dataset if they knew the dataset_id, exposing document listings and allowing document operations that should have remained owner-only.
This change forwards the authenticated user id through the token-auth flow and uses it for dataset authorization in the affected SDK document endpoints, so private datasets remain accessible only to their owner while team datasets continue to work as before.
Type of change