fix(api): close private dataset doc auth bypass by Parvezkhan0 · Pull Request #14749 · infiniflow/ragflow

Parvezkhan0 · 2026-05-10T13:31:31Z

What problem does this PR solve?

This PR fixes an authorization bypass in the token-authenticated document SDK APIs for private datasets (permission = me).
The affected routes validated dataset access with KnowledgebaseService.accessible(), but when a login token was used through the SDK token flow, the check received the caller's tenant/workspace id instead of the authenticated user id. That let another member of the same tenant pass access checks for a private dataset if they knew the dataset_id, exposing document listings and allowing document operations that should have remained owner-only.

This change forwards the authenticated user id through the token-auth flow and uses it for dataset authorization in the affected SDK document endpoints, so private datasets remain accessible only to their owner while team datasets continue to work as before.
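
The identity mix-up described above can be illustrated with a toy model. This is a minimal sketch, not RAGFlow's actual `KnowledgebaseService.accessible()` logic: the `Dataset` class, the `MEMBERS` table, and the toy `accessible()` function are all hypothetical, and the sketch assumes a workspace id that coincides with its owner's user id.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    id: str
    owner_id: str    # user who created the dataset
    tenant_id: str   # workspace the dataset belongs to
    permission: str  # "me" (owner only) or "team" (any workspace member)


# Hypothetical membership table: workspace id -> member user ids.
MEMBERS = {"alice": {"alice", "bob"}}


def accessible(ds: Dataset, actor_id: str) -> bool:
    """Toy check: the owner always passes; team datasets admit workspace members."""
    if actor_id == ds.owner_id:
        return True
    if ds.permission == "team":
        return actor_id in MEMBERS.get(ds.tenant_id, set())
    return False


private_ds = Dataset(id="ds1", owner_id="alice", tenant_id="alice", permission="me")
team_ds = Dataset(id="ds2", owner_id="alice", tenant_id="alice", permission="team")

# Bob authenticates with a login token; the auth layer resolves two ids for him.
bob_user_id = "bob"
bob_workspace_id = "alice"  # assumption: workspace id equals its owner's user id

leaked = accessible(private_ds, bob_workspace_id)  # pre-fix: workspace id used as actor
denied = accessible(private_ds, bob_user_id)       # post-fix: user id used as actor
team_ok = accessible(team_ds, bob_user_id)         # team datasets keep working
```

In this model the private dataset is exposed only because the wrong identifier reaches the check; the check itself is unchanged between the two calls.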

Type of change

Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Copilot <[email protected]>

coderabbitai · 2026-05-10T13:31:48Z

📝 Walkthrough

The PR adds authenticated user identification to dataset authorization checks in SDK document endpoints. Four route handlers now accept an optional authenticated_user_id parameter and pass it to KnowledgebaseService.accessible() instead of using only tenant_id. The token_required decorator conditionally injects this parameter based on handler signatures during JWT authentication.

Changes

User-Based Authorization for SDK Routes

| Layer / File(s) | Summary |
|---|---|
| Authorization Identity Contract `api/apps/sdk/doc.py` | Helper `_dataset_access_actor_id` selects authorized actor as `authenticated_user_id` when present, otherwise `tenant_id`. Route signatures updated: `download`, `parse`, `stop_parsing`, `retrieval_test` now accept `authenticated_user_id: str \| None = None`. |
| Route Authorization Implementation `api/apps/sdk/doc.py` | `download`, `parse`, `stop_parsing`, and `retrieval_test` endpoints compute actor via helper and pass it to `KnowledgebaseService.accessible(...)` for authorization gates instead of `tenant_id`. |
| Authentication Decorator `api/utils/api_utils.py` | `token_required` decorator inspects handler signature to detect `authenticated_user_id` parameter; during JWT login-token flow, conditionally injects resolved user id into kwargs when handler accepts it. |
| Tests & Verification `test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py`, `test/unit_test/api/utils/test_api_utils_token_required.py` | Existing `test_download_and_download_doc_errors` updated to verify "do not own dataset" denial. New `test_sdk_routes_use_authenticated_user_for_dataset_access` asserts routes pass `authenticated_user_id` to `accessible()` with correct dataset and user ids. New test module verifies `token_required` injects parameter into decorated handlers. |
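
The signature-based injection summarized in the table can be sketched as follows. This is a simplified, synchronous stand-in rather than the real `token_required`: the names `token_required_sketch` and `resolve_identity` are hypothetical, and the token is passed as a plain argument instead of being read from a request header.

```python
import functools
import inspect


def token_required_sketch(resolve_identity):
    """Hypothetical decorator factory.

    resolve_identity(token) must return (tenant_id, user_id_or_None);
    user_id is None for flows that carry no user identity (e.g. API keys).
    """
    def decorator(func):
        # Inspect once, at decoration time, whether the handler opts in.
        accepts_authenticated_user_id = (
            "authenticated_user_id" in inspect.signature(func).parameters
        )

        @functools.wraps(func)
        def wrapper(token, *args, **kwargs):
            tenant_id, user_id = resolve_identity(token)
            kwargs["tenant_id"] = tenant_id
            # Inject only when the handler declares the parameter, so handlers
            # with the old signature keep working unchanged.
            if accepts_authenticated_user_id and user_id is not None:
                kwargs["authenticated_user_id"] = user_id
            return func(*args, **kwargs)

        return wrapper
    return decorator


def fake_resolve(token):
    # Stand-in for token lookup: only the "login" token maps to a user.
    return ("tenant-1", "user-9") if token == "login" else ("tenant-1", None)


@token_required_sketch(fake_resolve)
def legacy_handler(tenant_id):
    return (tenant_id,)


@token_required_sketch(fake_resolve)
def updated_handler(tenant_id, authenticated_user_id=None):
    return (tenant_id, authenticated_user_id)
```

Opting in through the signature means a handler that was not updated never receives an unexpected keyword argument, which is presumably why the change can be described as non-breaking.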

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

#14659: Bug report describing TDocument APIs leaking private datasets to tenant members; this PR adds the user-based authorization mechanism to address the underlying issue where only tenant membership was checked instead of dataset permission and owner identity.

Possibly related PRs

infiniflow/ragflow#14645: Rewrites KnowledgebaseService.accessible() and DocumentService.accessible() to enforce per-dataset permission rules (me vs team); directly complements this PR which provides the authenticated user identity to those authorization functions.

Suggested labels

🐞 bug, 🧪 test

Poem

🐰 A rabbit hops through authorization,
Checking not just tenant, but true identification!
When authenticated_user_id flows through the request,
Private datasets finally rest, properly blessed.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 21.05% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'fix(api): close private dataset doc auth bypass' directly and clearly summarizes the main change: fixing an authorization bypass in document APIs for private datasets. |
| Linked Issues check | ✅ Passed | The code changes fully address issue `#14659` by forwarding authenticated_user_id through token-auth flow and using it for dataset authorization in affected SDK document endpoints, ensuring private datasets remain owner-only accessible [`#14659`]. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing the authorization bypass: updates to `download`, `parse`, `stop_parsing`, and `retrieval_test` endpoints; token decorator enhancement; and corresponding unit tests. No unrelated changes detected. |
| Description check | ✅ Passed | The pull request description clearly states the problem being solved with a reference to the issue, explains the authorization bypass in detail, and identifies the type of change as a bug fix matching the template requirements. |

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (1)

api/apps/sdk/doc.py (1)

53-55: ⚡ Quick win

Add a docstring to explain the access control logic.

The function handles important authorization semantics (JWT-authenticated user vs API-key tenant fallback) without documentation. The suggested docstring clarifies when each path is used:

📝 Suggested docstring

 def _dataset_access_actor_id(tenant_id: str, authenticated_user_id: str | None = None) -> str:
+    """
+    Determine the actor ID for dataset access authorization.
+    
+    Returns authenticated_user_id when available (JWT/login-token flow),
+    otherwise falls back to tenant_id (API-key flow or unauthenticated).
+    """
     return authenticated_user_id or tenant_id
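
Both paths of the helper can be exercised directly. The sketch below copies the one-line body from the suggestion above (using `Optional[str]` only so it also runs on older Python versions); the example ids are made up. Note that `or` treats an empty string the same as `None`; whether that case can occur depends on the auth layer, which this PR page does not show.

```python
from typing import Optional


def _dataset_access_actor_id(tenant_id: str, authenticated_user_id: Optional[str] = None) -> str:
    """Return the user id when present (login-token flow), else the tenant id."""
    return authenticated_user_id or tenant_id


# Login-token flow: the decorator injected a user id, so it becomes the actor.
jwt_actor = _dataset_access_actor_id("tenant-1", "user-9")

# API-key flow: no user id was injected, so the tenant id is used.
key_actor = _dataset_access_actor_id("tenant-1")

# Edge case: an empty string is falsy, so this also falls back to the tenant id.
empty_actor = _dataset_access_actor_id("tenant-1", "")
```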

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/sdk/doc.py` around lines 53 - 55, Add a clear docstring to
_dataset_access_actor_id describing the access-control semantics: explain that
when a JWT-authenticated user ID is present the function returns that user (used
for per-user authorization/audit), and when authenticated_user_id is None it
falls back to the tenant_id (used for API-key based requests or tenant-scoped
actions); mention expected parameter types and the function's return value and
include a short example of both paths for clarity.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api/apps/sdk/doc.py`:
- Around line 208-211: The authorization check using
KnowledgebaseService.accessible(kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) lacks
logging; add logging similar to the download flow: log an info/debug message
before the check indicating the dataset_id and actor_id being validated and log
a warning/error when access is denied (including dataset_id and actor_id) so
operators can trace authorization failures; update the same function that
performs this check to call the logger and follow the existing logging format
used by the download function.
- Around line 302-305: The authorization check using
KnowledgebaseService.accessible(kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) is missing
logging; add the same logging pattern used in the download/parse flows to record
the check input (dataset_id, tenant_id, authenticated_user_id) and the
authorization outcome. Specifically, before/around the if, emit a log entry
(matching the existing logger level/format used in download/parse) that includes
dataset_id, tenant_id, actor id from _dataset_access_actor_id(...), and whether
access was granted/denied so the decision is auditable and consistent with other
flows.
- Around line 432-435: The dataset access check loop lacks logging; add
structured logs using the existing actor id and kb_ids: log a single INFO-level
message before the loop mentioning kb_ids and actor_id (from
_dataset_access_actor_id), and inside the loop log a WARN/ERROR when
KnowledgebaseService.accessible(kb_id=id, user_id=actor_id) returns False
including the denied kb id and actor_id (and authenticated_user_id if available)
before returning get_error_data_result so denied access events are recorded for
debugging and audit.
- Around line 97-100: Add structured logging around the
KnowledgebaseService.accessible check: log the attempted access with dataset_id,
the actor returned by _dataset_access_actor_id(tenant_id,
authenticated_user_id), and the authorization result (allowed/denied). Use the
module logger (e.g., logger or security logger) and emit a warning or info-level
entry when access is denied (include tenant_id and authenticated_user_id as
context), so the authorization decision for KnowledgebaseService.accessible is
recorded for audit and debugging.

In `@api/utils/api_utils.py`:
- Around line 339-340: The code path that injects authenticated_user_id into
kwargs (the block checking accepts_authenticated_user_id and setting
kwargs["authenticated_user_id"] = user[0].id) needs an audit log entry; add a
concise log statement (using the module logger or existing logger variable)
immediately before or after the assignment that records the action and key
context such as the injected user id and the target dataset/request identifier
(if available) and use an appropriate level (info/debug) while avoiding
sensitive data; update any import or logger initialization (e.g., logger =
logging.getLogger(__name__)) if not already present.

---

Nitpick comments:
In `@api/apps/sdk/doc.py`:
- Around line 53-55: Add a clear docstring to _dataset_access_actor_id
describing the access-control semantics: explain that when a JWT-authenticated
user ID is present the function returns that user (used for per-user
authorization/audit), and when authenticated_user_id is None it falls back to
the tenant_id (used for API-key based requests or tenant-scoped actions);
mention expected parameter types and the function's return value and include a
short example of both paths for clarity.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 272a4cf4-6bed-4bf2-a7f7-55ab6fde80c1

📥 Commits

Reviewing files that changed from the base of the PR and between 6bfe0f9 and 60e606a.

📒 Files selected for processing (4)

api/apps/sdk/doc.py
api/utils/api_utils.py
test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py
test/unit_test/api/utils/test_api_utils_token_required.py

coderabbitai · 2026-05-10T13:36:16Z

+    if not KnowledgebaseService.accessible(
+        kb_id=dataset_id,
+        user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id),
+    ):


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for dataset access authorization checks.

This security-critical authorization check determines access to private datasets but lacks logging. Adding a log statement would improve security audit trails and debugging.

🔒 Proposed logging addition

+    actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
+    logging.debug("Checking dataset access: dataset_id=%s actor_id=%s (authenticated_user_id=%s)", dataset_id, actor_id, authenticated_user_id)
     if not KnowledgebaseService.accessible(
         kb_id=dataset_id,
-        user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id),
+        user_id=actor_id,
     ):
+        logging.warning("Dataset access denied: dataset_id=%s actor_id=%s", dataset_id, actor_id)
         return get_error_data_result(message=f"You do not own the dataset {dataset_id}.")

As per coding guidelines, "**/*.py: Add logging for new flows".
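
The logging pattern proposed in this comment can be tried in isolation. The following is a minimal sketch under stated assumptions: `check_dataset_access` and the logger name are hypothetical, and the `accessible` callable stands in for `KnowledgebaseService.accessible`, whose real implementation is not shown here.

```python
import logging

logger = logging.getLogger("sdk.doc.sketch")  # hypothetical logger name


def check_dataset_access(accessible, dataset_id, tenant_id, authenticated_user_id=None):
    """Return True when access is allowed; log the attempt and any denial."""
    actor_id = authenticated_user_id or tenant_id
    # Lazy %-style arguments: the message is only formatted if the level is enabled.
    logger.debug("Checking dataset access: dataset_id=%s actor_id=%s", dataset_id, actor_id)
    if not accessible(kb_id=dataset_id, user_id=actor_id):
        logger.warning("Dataset access denied: dataset_id=%s actor_id=%s", dataset_id, actor_id)
        return False
    return True


class ListHandler(logging.Handler):
    """Collects log records in memory so the behaviour can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)
```

Computing `actor_id` once and reusing it keeps the logged value and the value actually checked from drifting apart.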

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/apps/sdk/doc.py` around lines 97 - 100, Add structured logging around the KnowledgebaseService.accessible check: log the attempted access with dataset_id, the actor returned by _dataset_access_actor_id(tenant_id, authenticated_user_id), and the authorization result (allowed/denied). Use the module logger (e.g., logger or security logger) and emit a warning or info-level entry when access is denied (include tenant_id and authenticated_user_id as context), so the authorization decision for KnowledgebaseService.accessible is recorded for audit and debugging.

coderabbitai · 2026-05-10T13:36:16Z

+    if not KnowledgebaseService.accessible(
+        kb_id=dataset_id,
+        user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id),
+    ):


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for dataset access authorization checks.

Similar to the download function, this authorization check lacks logging. Consider adding the same logging pattern here for consistency.

As per coding guidelines, "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/apps/sdk/doc.py` around lines 208 - 211, The authorization check using KnowledgebaseService.accessible(kb_id=dataset_id, user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) lacks logging; add logging similar to the download flow: log an info/debug message before the check indicating the dataset_id and actor_id being validated and log a warning/error when access is denied (including dataset_id and actor_id) so operators can trace authorization failures; update the same function that performs this check to call the logger and follow the existing logging format used by the download function.

coderabbitai · 2026-05-10T13:36:16Z

+    if not KnowledgebaseService.accessible(
+        kb_id=dataset_id,
+        user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id),
+    ):


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for dataset access authorization checks.

Similar to the download and parse functions, this authorization check lacks logging. Consider adding the same logging pattern for consistency.

As per coding guidelines, "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/apps/sdk/doc.py` around lines 302 - 305, The authorization check using KnowledgebaseService.accessible(kb_id=dataset_id, user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) is missing logging; add the same logging pattern used in the download/parse flows to record the check input (dataset_id, tenant_id, authenticated_user_id) and the authorization outcome. Specifically, before/around the if, emit a log entry (matching the existing logger level/format used in download/parse) that includes dataset_id, tenant_id, actor id from _dataset_access_actor_id(...), and whether access was granted/denied so the decision is auditable and consistent with other flows.

coderabbitai · 2026-05-10T13:36:16Z

+    actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
    for id in kb_ids:
-        if not KnowledgebaseService.accessible(kb_id=id, user_id=tenant_id):
+        if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id):
            return get_error_data_result(f"You don't own the dataset {id}.")


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for dataset access authorization checks.

This function checks access to potentially multiple datasets but lacks logging. Since this loops over dataset IDs, consider logging once before the loop (with the full list) and/or within the loop for denied access.

🔍 Proposed logging addition

     actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
+    logging.debug("Checking access to datasets: dataset_ids=%s actor_id=%s (authenticated_user_id=%s)", kb_ids, actor_id, authenticated_user_id)
     for id in kb_ids:
         if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id):
+            logging.warning("Dataset access denied: dataset_id=%s actor_id=%s", id, actor_id)
             return get_error_data_result(f"You don't own the dataset {id}.")

As per coding guidelines, "**/*.py: Add logging for new flows".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

+    actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
     for id in kb_ids:
-        if not KnowledgebaseService.accessible(kb_id=id, user_id=tenant_id):
+        if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id):
             return get_error_data_result(f"You don't own the dataset {id}.")

    actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
    logging.debug("Checking access to datasets: dataset_ids=%s actor_id=%s (authenticated_user_id=%s)", kb_ids, actor_id, authenticated_user_id)
    for id in kb_ids:
        if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id):
            logging.warning("Dataset access denied: dataset_id=%s actor_id=%s", id, actor_id)
            return get_error_data_result(f"You don't own the dataset {id}.")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/apps/sdk/doc.py` around lines 432 - 435, The dataset access check loop lacks logging; add structured logs using the existing actor id and kb_ids: log a single INFO-level message before the loop mentioning kb_ids and actor_id (from _dataset_access_actor_id), and inside the loop log a WARN/ERROR when KnowledgebaseService.accessible(kb_id=id, user_id=actor_id) returns False including the denied kb id and actor_id (and authenticated_user_id if available) before returning get_error_data_result so denied access events are recorded for debugging and audit.
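
The multi-dataset variant of the check, with one log line before the loop and one per denial, might look like the sketch below. The function `first_denied_dataset` and the logger name are hypothetical, and `accessible` again stands in for `KnowledgebaseService.accessible`. The loop variable is named `kb_id` here rather than `id` only to avoid shadowing the Python builtin; that is a stylistic choice, not something the PR requires.

```python
import logging
from typing import Callable, Iterable, Optional

logger = logging.getLogger("sdk.retrieval.sketch")  # hypothetical logger name


def first_denied_dataset(
    kb_ids: Iterable[str],
    actor_id: str,
    accessible: Callable[..., bool],
) -> Optional[str]:
    """Return the first dataset id the actor may not access, or None if all pass."""
    kb_ids = list(kb_ids)
    # One summary line for the whole request.
    logger.debug("Checking access to datasets: dataset_ids=%s actor_id=%s", kb_ids, actor_id)
    for kb_id in kb_ids:
        if not accessible(kb_id=kb_id, user_id=actor_id):
            # One line per denial, so the failing dataset is identifiable.
            logger.warning("Dataset access denied: dataset_id=%s actor_id=%s", kb_id, actor_id)
            return kb_id
    return None


# Toy permission table: the actor owns "a" and "b" only.
owned = {"a", "b"}
toy_accessible = lambda kb_id, user_id: kb_id in owned

denied_id = first_denied_dataset(["a", "c", "b"], "user-9", toy_accessible)
all_ok = first_denied_dataset(["a", "b"], "user-9", toy_accessible)
```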

coderabbitai · 2026-05-10T13:36:16Z

+                    if accepts_authenticated_user_id:
+                        kwargs["authenticated_user_id"] = user[0].id


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for the authenticated user injection flow.

This security-relevant code path injects the authenticated user ID for dataset authorization checks but lacks logging. Adding a log statement would improve observability and audit trails for private dataset access.

📊 Proposed logging addition

                     kwargs["tenant_id"] = tenants[0].tenant_id
                     if accepts_authenticated_user_id:
+                        logging.debug("JWT authentication: injecting authenticated_user_id=%s for tenant_id=%s", user[0].id, tenants[0].tenant_id)
                         kwargs["authenticated_user_id"] = user[0].id

As per coding guidelines, "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/utils/api_utils.py` around lines 339 - 340, The code path that injects authenticated_user_id into kwargs (the block checking accepts_authenticated_user_id and setting kwargs["authenticated_user_id"] = user[0].id) needs an audit log entry; add a concise log statement (using the module logger or existing logger variable) immediately before or after the assignment that records the action and key context such as the injected user id and the target dataset/request identifier (if available) and use an appropriate level (info/debug) while avoiding sensitive data; update any import or logger initialization (e.g., logger = logging.getLogger(__name__)) if not already present.

Copilot

Pull request overview

This PR fixes an authorization bypass in SDK document endpoints when callers authenticate using a login token: the SDK auth layer now forwards the authenticated user id (in addition to tenant/workspace id), and the SDK document routes use that user id for dataset authorization so permission = me datasets remain owner-only.

Changes:

Update token_required to optionally inject authenticated_user_id for login-token authentication when the wrapped handler accepts it.
Update SDK document routes to authorize dataset access using the authenticated user id (falling back to tenant id when not available).
Add/adjust unit tests to verify correct user-id propagation and that SDK document routes pass the correct actor id into KnowledgebaseService.accessible().
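
A test in the spirit of the last bullet can be sketched with `unittest.mock`. This does not reproduce the PR's actual tests: `parse_route_sketch` is a hypothetical, heavily simplified route body, the returned dictionary shape is invented for illustration, and the `Mock` stands in for `KnowledgebaseService`.

```python
from unittest.mock import Mock


def parse_route_sketch(service, dataset_id, tenant_id, authenticated_user_id=None):
    """Hypothetical route body: authorize first, then do the work."""
    actor_id = authenticated_user_id or tenant_id
    if not service.accessible(kb_id=dataset_id, user_id=actor_id):
        return {"ok": False, "message": f"You do not own the dataset {dataset_id}."}
    return {"ok": True}


# Login-token flow: the user id, not the tenant id, must reach the access check.
service = Mock()
service.accessible.return_value = False
denied = parse_route_sketch(
    service, "ds1", tenant_id="tenant-1", authenticated_user_id="user-9"
)
service.accessible.assert_called_once_with(kb_id="ds1", user_id="user-9")

# API-key flow: without a user id, the tenant id is used as the actor.
key_service = Mock()
key_service.accessible.return_value = True
allowed = parse_route_sketch(key_service, "ds1", tenant_id="tenant-1")
key_service.accessible.assert_called_once_with(kb_id="ds1", user_id="tenant-1")
```

Asserting on the arguments passed to `accessible()` pins down which identity is used, independently of how the permission rules themselves are implemented.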

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| `api/utils/api_utils.py` | Enhances `token_required` to conditionally pass `authenticated_user_id` for login-token flows. |
| `api/apps/sdk/doc.py` | Uses authenticated user id for dataset access checks in SDK document endpoints (download/parse/stop/retrieval). |
| `test/unit_test/api/utils/test_api_utils_token_required.py` | New unit test validating `token_required` injects authenticated user id for login tokens. |
| `test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py` | Updates existing unit tests and adds coverage ensuring SDK routes call dataset access checks with `user_id=authenticated_user_id`. |

fix(api): close private dataset doc auth bypass

60e606a

Co-authored-by: Copilot <[email protected]>

Copilot AI review requested due to automatic review settings May 10, 2026 13:31

dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label May 10, 2026

dosubot Bot added 🐖api The modified files are located under directory 'api/apps/sdk' 🐞 bug Something isn't working, pull request that fix bug. 🧪 test Pull requests that update test cases. labels May 10, 2026

Copilot started reviewing on behalf of Parvezkhan0 May 10, 2026 13:32

coderabbitai Bot reviewed May 10, 2026

Copilot AI reviewed May 10, 2026

fix(api): close private dataset doc auth bypass #14749
Parvezkhan0 wants to merge 1 commit into infiniflow:main from Parvezkhan0:fix/private-dataset-doc-auth-14659

Parvezkhan0 commented May 10, 2026 • edited by JinHai-CN

2 participants
