Skip to content

fix(api): close private dataset doc auth bypass#14749

Open
Parvezkhan0 wants to merge 1 commit into
infiniflow:mainfrom
Parvezkhan0:fix/private-dataset-doc-auth-14659
Open

fix(api): close private dataset doc auth bypass#14749
Parvezkhan0 wants to merge 1 commit into
infiniflow:mainfrom
Parvezkhan0:fix/private-dataset-doc-auth-14659

Conversation

@Parvezkhan0
Copy link
Copy Markdown

@Parvezkhan0 Parvezkhan0 commented May 10, 2026

What problem does this PR solve?

Fixes #14659

This PR fixes an authorization bypass in the token-authenticated document SDK APIs for private datasets (permission = me).
The affected routes validated dataset access with KnowledgebaseService.accessible(), but when a login token was used through the SDK token flow, the check received the caller's tenant/workspace id instead of the authenticated user id. That let another member of the same tenant pass access checks for a private dataset if they knew the dataset_id, exposing document listings and allowing document operations that should have remained owner-only.

This change forwards the authenticated user id through the token-auth flow and uses it for dataset authorization in the affected SDK document endpoints, so private datasets remain accessible only to their owner while team datasets continue to work as before.

Type of change

  • Bug Fix (non-breaking change which fixes an issue)

Copilot AI review requested due to automatic review settings May 10, 2026 13:31
@dosubot dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label May 10, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 10, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

The PR adds authenticated user identification to dataset authorization checks in SDK document endpoints. Four route handlers now accept an optional authenticated_user_id parameter and pass it to KnowledgebaseService.accessible() instead of using only tenant_id. The token_required decorator conditionally injects this parameter based on handler signatures during JWT authentication.

Changes

User-Based Authorization for SDK Routes

Layer / File(s) Summary
Authorization Identity Contract
api/apps/sdk/doc.py
Helper _dataset_access_actor_id selects authorized actor as authenticated_user_id when present, otherwise tenant_id. Route signatures updated: download, parse, stop_parsing, retrieval_test now accept authenticated_user_id: str | None = None.
Route Authorization Implementation
api/apps/sdk/doc.py
download, parse, stop_parsing, and retrieval_test endpoints compute actor via helper and pass it to KnowledgebaseService.accessible(...) for authorization gates instead of tenant_id.
Authentication Decorator
api/utils/api_utils.py
token_required decorator inspects handler signature to detect authenticated_user_id parameter; during JWT login-token flow, conditionally injects resolved user id into kwargs when handler accepts it.
Tests & Verification
test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py, test/unit_test/api/utils/test_api_utils_token_required.py
Existing test_download_and_download_doc_errors updated to verify "do not own dataset" denial. New test_sdk_routes_use_authenticated_user_for_dataset_access asserts routes pass authenticated_user_id to accessible() with correct dataset and user ids. New test module verifies token_required injects parameter into decorated handlers.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • #14659: Bug report describing TDocument APIs leaking private datasets to tenant members; this PR adds the user-based authorization mechanism to address the underlying issue where only tenant membership was checked instead of dataset permission and owner identity.

Possibly related PRs

  • infiniflow/ragflow#14645: Rewrites KnowledgebaseService.accessible() and DocumentService.accessible() to enforce per-dataset permission rules (me vs team); directly complements this PR which provides the authenticated user identity to those authorization functions.

Suggested labels

🐞 bug, 🧪 test

Poem

🐰 A rabbit hops through authorization,
Checking not just tenant, but true identification!
When authenticated_user_id flows through the request,
Private datasets finally rest, properly blessed.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 21.05% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix(api): close private dataset doc auth bypass' directly and clearly summarizes the main change: fixing an authorization bypass in document APIs for private datasets.
Linked Issues check ✅ Passed The code changes fully address issue #14659 by forwarding authenticated_user_id through token-auth flow and using it for dataset authorization in affected SDK document endpoints, ensuring private datasets remain owner-only accessible [#14659].
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the authorization bypass: updates to download, parse, stop_parsing, and retrieval_test endpoints; token decorator enhancement; and corresponding unit tests. No unrelated changes detected.
Description check ✅ Passed The pull request description clearly states the problem being solved with a reference to the issue, explains the authorization bypass in detail, and identifies the type of change as a bug fix matching the template requirements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@dosubot dosubot Bot added 🐖api The modified files are located under directory 'api/apps/sdk' 🐞 bug Something isn't working, pull request that fix bug. 🧪 test Pull requests that update test cases. labels May 10, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
api/apps/sdk/doc.py (1)

53-55: ⚡ Quick win

Add a docstring to explain the access control logic.

The function handles important authorization semantics (JWT-authenticated user vs API-key tenant fallback) without documentation. The suggested docstring clarifies when each path is used:

📝 Suggested docstring
 def _dataset_access_actor_id(tenant_id: str, authenticated_user_id: str | None = None) -> str:
+    """
+    Determine the actor ID for dataset access authorization.
+    
+    Returns authenticated_user_id when available (JWT/login-token flow),
+    otherwise falls back to tenant_id (API-key flow or unauthenticated).
+    """
     return authenticated_user_id or tenant_id
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/sdk/doc.py` around lines 53 - 55, Add a clear docstring to
_dataset_access_actor_id describing the access-control semantics: explain that
when a JWT-authenticated user ID is present the function returns that user (used
for per-user authorization/audit), and when authenticated_user_id is None it
falls back to the tenant_id (used for API-key based requests or tenant-scoped
actions); mention expected parameter types and the function's return value and
include a short example of both paths for clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api/apps/sdk/doc.py`:
- Around line 208-211: The authorization check using
KnowledgebaseService.accessible(kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) lacks
logging; add logging similar to the download flow: log an info/debug message
before the check indicating the dataset_id and actor_id being validated and log
a warning/error when access is denied (including dataset_id and actor_id) so
operators can trace authorization failures; update the same function that
performs this check to call the logger and follow the existing logging format
used by the download function.
- Around line 302-305: The authorization check using
KnowledgebaseService.accessible(kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) is missing
logging; add the same logging pattern used in the download/parse flows to record
the check input (dataset_id, tenant_id, authenticated_user_id) and the
authorization outcome. Specifically, before/around the if, emit a log entry
(matching the existing logger level/format used in download/parse) that includes
dataset_id, tenant_id, actor id from _dataset_access_actor_id(...), and whether
access was granted/denied so the decision is auditable and consistent with other
flows.
- Around line 432-435: The dataset access check loop lacks logging; add
structured logs using the existing actor id and kb_ids: log a single INFO-level
message before the loop mentioning kb_ids and actor_id (from
_dataset_access_actor_id), and inside the loop log a WARN/ERROR when
KnowledgebaseService.accessible(kb_id=id, user_id=actor_id) returns False
including the denied kb id and actor_id (and authenticated_user_id if available)
before returning get_error_data_result so denied access events are recorded for
debugging and audit.
- Around line 97-100: Add structured logging around the
KnowledgebaseService.accessible check: log the attempted access with dataset_id,
the actor returned by _dataset_access_actor_id(tenant_id,
authenticated_user_id), and the authorization result (allowed/denied). Use the
module logger (e.g., logger or security logger) and emit a warning or info-level
entry when access is denied (include tenant_id and authenticated_user_id as
context), so the authorization decision for KnowledgebaseService.accessible is
recorded for audit and debugging.

In `@api/utils/api_utils.py`:
- Around line 339-340: The code path that injects authenticated_user_id into
kwargs (the block checking accepts_authenticated_user_id and setting
kwargs["authenticated_user_id"] = user[0].id) needs an audit log entry; add a
concise log statement (using the module logger or existing logger variable)
immediately before or after the assignment that records the action and key
context such as the injected user id and the target dataset/request identifier
(if available) and use an appropriate level (info/debug) while avoiding
sensitive data; update any import or logger initialization (e.g., logger =
logging.getLogger(__name__)) if not already present.

---

Nitpick comments:
In `@api/apps/sdk/doc.py`:
- Around line 53-55: Add a clear docstring to _dataset_access_actor_id
describing the access-control semantics: explain that when a JWT-authenticated
user ID is present the function returns that user (used for per-user
authorization/audit), and when authenticated_user_id is None it falls back to
the tenant_id (used for API-key based requests or tenant-scoped actions);
mention expected parameter types and the function's return value and include a
short example of both paths for clarity.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 272a4cf4-6bed-4bf2-a7f7-55ab6fde80c1

📥 Commits

Reviewing files that changed from the base of the PR and between 6bfe0f9 and 60e606a.

📒 Files selected for processing (4)
  • api/apps/sdk/doc.py
  • api/utils/api_utils.py
  • test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py
  • test/unit_test/api/utils/test_api_utils_token_required.py

Comment thread api/apps/sdk/doc.py
Comment on lines +97 to +100
if not KnowledgebaseService.accessible(
kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id),
):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for dataset access authorization checks.

This security-critical authorization check determines access to private datasets but lacks logging. Adding a log statement would improve security audit trails and debugging.

🔒 Proposed logging addition
+    actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
+    logging.debug("Checking dataset access: dataset_id=%s actor_id=%s (authenticated_user_id=%s)", dataset_id, actor_id, authenticated_user_id)
     if not KnowledgebaseService.accessible(
         kb_id=dataset_id,
-        user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id),
+        user_id=actor_id,
     ):
+        logging.warning("Dataset access denied: dataset_id=%s actor_id=%s", dataset_id, actor_id)
         return get_error_data_result(message=f"You do not own the dataset {dataset_id}.")

As per coding guidelines, "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/sdk/doc.py` around lines 97 - 100, Add structured logging around the
KnowledgebaseService.accessible check: log the attempted access with dataset_id,
the actor returned by _dataset_access_actor_id(tenant_id,
authenticated_user_id), and the authorization result (allowed/denied). Use the
module logger (e.g., logger or security logger) and emit a warning or info-level
entry when access is denied (include tenant_id and authenticated_user_id as
context), so the authorization decision for KnowledgebaseService.accessible is
recorded for audit and debugging.

Comment thread api/apps/sdk/doc.py
Comment on lines +208 to +211
if not KnowledgebaseService.accessible(
kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id),
):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for dataset access authorization checks.

Similar to the download function, this authorization check lacks logging. Consider adding the same logging pattern here for consistency.

As per coding guidelines, "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/sdk/doc.py` around lines 208 - 211, The authorization check using
KnowledgebaseService.accessible(kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) lacks
logging; add logging similar to the download flow: log an info/debug message
before the check indicating the dataset_id and actor_id being validated and log
a warning/error when access is denied (including dataset_id and actor_id) so
operators can trace authorization failures; update the same function that
performs this check to call the logger and follow the existing logging format
used by the download function.

Comment thread api/apps/sdk/doc.py
Comment on lines +302 to +305
if not KnowledgebaseService.accessible(
kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id),
):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for dataset access authorization checks.

Similar to the download and parse functions, this authorization check lacks logging. Consider adding the same logging pattern for consistency.

As per coding guidelines, "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/sdk/doc.py` around lines 302 - 305, The authorization check using
KnowledgebaseService.accessible(kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) is missing
logging; add the same logging pattern used in the download/parse flows to record
the check input (dataset_id, tenant_id, authenticated_user_id) and the
authorization outcome. Specifically, before/around the if, emit a log entry
(matching the existing logger level/format used in download/parse) that includes
dataset_id, tenant_id, actor id from _dataset_access_actor_id(...), and whether
access was granted/denied so the decision is auditable and consistent with other
flows.

Comment thread api/apps/sdk/doc.py
Comment on lines +432 to 435
actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
for id in kb_ids:
if not KnowledgebaseService.accessible(kb_id=id, user_id=tenant_id):
if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id):
return get_error_data_result(f"You don't own the dataset {id}.")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for dataset access authorization checks.

This function checks access to potentially multiple datasets but lacks logging. Since this loops over dataset IDs, consider logging once before the loop (with the full list) and/or within the loop for denied access.

🔍 Proposed logging addition
     actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
+    logging.debug("Checking access to datasets: dataset_ids=%s actor_id=%s (authenticated_user_id=%s)", kb_ids, actor_id, authenticated_user_id)
     for id in kb_ids:
         if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id):
+            logging.warning("Dataset access denied: dataset_id=%s actor_id=%s", id, actor_id)
             return get_error_data_result(f"You don't own the dataset {id}.")

As per coding guidelines, "**/*.py: Add logging for new flows".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
for id in kb_ids:
if not KnowledgebaseService.accessible(kb_id=id, user_id=tenant_id):
if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id):
return get_error_data_result(f"You don't own the dataset {id}.")
actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
logging.debug("Checking access to datasets: dataset_ids=%s actor_id=%s (authenticated_user_id=%s)", kb_ids, actor_id, authenticated_user_id)
for id in kb_ids:
if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id):
logging.warning("Dataset access denied: dataset_id=%s actor_id=%s", id, actor_id)
return get_error_data_result(f"You don't own the dataset {id}.")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/sdk/doc.py` around lines 432 - 435, The dataset access check loop
lacks logging; add structured logs using the existing actor id and kb_ids: log a
single INFO-level message before the loop mentioning kb_ids and actor_id (from
_dataset_access_actor_id), and inside the loop log a WARN/ERROR when
KnowledgebaseService.accessible(kb_id=id, user_id=actor_id) returns False
including the denied kb id and actor_id (and authenticated_user_id if available)
before returning get_error_data_result so denied access events are recorded for
debugging and audit.

Comment thread api/utils/api_utils.py
Comment on lines +339 to +340
if accepts_authenticated_user_id:
kwargs["authenticated_user_id"] = user[0].id
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for the authenticated user injection flow.

This security-relevant code path injects the authenticated user ID for dataset authorization checks but lacks logging. Adding a log statement would improve observability and audit trails for private dataset access.

📊 Proposed logging addition
                    kwargs["tenant_id"] = tenants[0].tenant_id
                    if accepts_authenticated_user_id:
+                       logging.debug("JWT authentication: injecting authenticated_user_id=%s for tenant_id=%s", user[0].id, tenants[0].tenant_id)
                        kwargs["authenticated_user_id"] = user[0].id

As per coding guidelines, "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/utils/api_utils.py` around lines 339 - 340, The code path that injects
authenticated_user_id into kwargs (the block checking
accepts_authenticated_user_id and setting kwargs["authenticated_user_id"] =
user[0].id) needs an audit log entry; add a concise log statement (using the
module logger or existing logger variable) immediately before or after the
assignment that records the action and key context such as the injected user id
and the target dataset/request identifier (if available) and use an appropriate
level (info/debug) while avoiding sensitive data; update any import or logger
initialization (e.g., logger = logging.getLogger(__name__)) if not already
present.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an authorization bypass in SDK document endpoints when callers authenticate using a login token: the SDK auth layer now forwards the authenticated user id (in addition to tenant/workspace id), and the SDK document routes use that user id for dataset authorization so permission = me datasets remain owner-only.

Changes:

  • Update token_required to optionally inject authenticated_user_id for login-token authentication when the wrapped handler accepts it.
  • Update SDK document routes to authorize dataset access using the authenticated user id (falling back to tenant id when not available).
  • Add/adjust unit tests to verify correct user-id propagation and that SDK document routes pass the correct actor id into KnowledgebaseService.accessible().

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
api/utils/api_utils.py Enhances token_required to conditionally pass authenticated_user_id for login-token flows.
api/apps/sdk/doc.py Uses authenticated user id for dataset access checks in SDK document endpoints (download/parse/stop/retrieval).
test/unit_test/api/utils/test_api_utils_token_required.py New unit test validating token_required injects authenticated user id for login tokens.
test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py Updates existing unit tests and adds coverage ensuring SDK routes call dataset access checks with user_id=authenticated_user_id.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

🐖api The modified files are located under directory 'api/apps/sdk' 🐞 bug Something isn't working, pull request that fix bug. size:M This PR changes 30-99 lines, ignoring generated files. 🧪 test Pull requests that update test cases.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: TDocument APIs leak private datasets to tenant members

2 participants