
fix: prepend bucket prefix to Azure SPN and SAS storage paths#14185

Open
voidborne-d wants to merge 1 commit into infiniflow:main from voidborne-d:fix/azure-storage-bucket-prefix

Conversation

@voidborne-d

Summary

Fixes #14159 — files from different datasets can overwrite each other in Azure Blob storage.

Problem

Both azure_spn_conn.py and azure_sas_conn.py ignore the bucket parameter in all storage operations (put, get, rm, obj_exist, get_presigned_url). Files are stored flat under the bare filename, so when two datasets contain a file with the same name, one silently overwrites the other.

The MinIO and S3 implementations correctly use the bucket (typically the knowledge base ID) as a path prefix to create logical folder isolation:

  • MinIO: uses use_prefix_path decorator → {orig_bucket}/{fnm}
  • S3: uses use_prefix_path decorator → {prefix_path}/{bucket}/{fnm}

Fix

Prepend {bucket}/ to the file path in all 5 operations across both Azure connector files:

| File | Methods fixed |
| --- | --- |
| `azure_spn_conn.py` | put, get, rm, obj_exist, get_presigned_url |
| `azure_sas_conn.py` | put, get, rm, obj_exist, get_presigned_url |

This matches the existing convention where bucket is the knowledge base ID used as a directory prefix.
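The change can be illustrated with a minimal sketch (the helper name `blob_path` is hypothetical; the actual connectors build the string inline):

```python
def blob_path(bucket: str, fnm: str) -> str:
    """Build the bucket-qualified storage path used after the fix."""
    return f"{bucket}/{fnm}"

# Before the fix, both connectors addressed blobs by the bare filename,
# so "report.pdf" in datasets kb_a and kb_b mapped to the same blob.
# After the fix the paths are distinct:
print(blob_path("kb_a", "report.pdf"))  # kb_a/report.pdf
print(blob_path("kb_b", "report.pdf"))  # kb_b/report.pdf
```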

⚠️ Migration Note

Existing Azure SPN/SAS deployments have files stored without the bucket prefix. After this fix, new files will be stored under {bucket}/{fnm} while existing files remain at {fnm}. A one-time migration script or manual file move may be needed for existing deployments. New deployments are unaffected.
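Such a migration could be planned roughly as follows. This is a sketch, not part of the PR; `owner_bucket` is a hypothetical callable that a real script would back with the application's document metadata to find which knowledge base owns each file:

```python
def migration_plan(flat_blob_names, owner_bucket):
    """Map legacy un-prefixed blob names to their bucket-qualified targets.

    flat_blob_names: names as currently stored (no bucket prefix).
    owner_bucket: hypothetical lookup from filename to the knowledge-base
    ID that owns it; a real migration would resolve this from metadata.
    """
    plan = {}
    for name in flat_blob_names:
        if "/" in name:
            continue  # already bucket-prefixed; leave in place
        plan[name] = f"{owner_bucket(name)}/{name}"
    return plan

# A server-side copy plus delete per entry would then perform the move
# (e.g. BlobClient.start_copy_from_url followed by delete_blob).
```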

Testing

  • Verified the fix is consistent across all 5 methods in both files
  • The health() method is intentionally left unchanged as it uses a hardcoded test filename without bucket semantics
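The isolation property can also be checked without Azure at all, using an in-memory stand-in for the client (`FakeAzureStore` is invented for illustration):

```python
class FakeAzureStore:
    """In-memory stand-in for the Azure connector, for illustration only."""

    def __init__(self):
        self.blobs = {}

    def put(self, bucket, fnm, data):
        self.blobs[f"{bucket}/{fnm}"] = data  # bucket-prefixed, as in the fix

    def get(self, bucket, fnm):
        return self.blobs[f"{bucket}/{fnm}"]

store = FakeAzureStore()
store.put("kb_a", "doc.pdf", b"A")
store.put("kb_b", "doc.pdf", b"B")  # same filename, different dataset
assert store.get("kb_a", "doc.pdf") == b"A"  # no overwrite
```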

Files from different datasets can overwrite each other in Azure Blob
storage because both azure_spn_conn.py and azure_sas_conn.py ignore
the bucket parameter and store files using only the filename.

Prepend bucket (typically the knowledge base ID) as a path prefix to
all storage operations (put, get, rm, obj_exist, get_presigned_url),
matching the behavior of MinIO and S3 implementations which use the
bucket parameter for logical folder isolation.

Fixes infiniflow#14159
@dosubot dosubot Bot added size:S This PR changes 10-29 lines, ignoring generated files. 🐞 bug Something isn't working, pull request that fix bug. labels Apr 17, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 17, 2026

📝 Walkthrough

Both Azure blob storage implementations (SAS and SPN authentication) were updated to consistently include the bucket prefix in file paths across put, rm, get, obj_exist, and get_presigned_url methods, preventing filename collisions between different datasets.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Azure SAS Blob Storage**<br>`rag/utils/azure_sas_conn.py` | Updated RAGFlowAzureSasBlob to prefix blob names with bucket in all operations: put(), rm(), get(), obj_exist(), and get_presigned_url() now use "{bucket}/{fnm}" instead of fnm. |
| **Azure SPN Data Lake Storage**<br>`rag/utils/azure_spn_conn.py` | Updated RAGFlowAzureSpnConn to prefix file paths with bucket in all operations: put(), rm(), get(), obj_exist(), and get_presigned_url() now construct fully-qualified paths as "{bucket}/{fnm}". |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A bucket of joy, at last we've found!
No more files lost in storage ground,
With prefixes placed both firm and neat,
Each dataset's home, so safe and sweet. 🏠
Collisions banished, filenames true,
The blob realm harmonizes anew! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: prepending the bucket prefix to Azure SPN and SAS storage paths, which is the primary objective of this PR. |
| Description check | ✅ Passed | The description follows the template with a clear problem statement, detailed fix explanation, migration notes, and testing verification. All required sections are present and well-filled. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #14159 by prepending the bucket prefix to all five required methods (put, get, rm, obj_exist, get_presigned_url) in both azure_spn_conn.py and azure_sas_conn.py. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing the Azure bucket prefix issue. The health() method is intentionally left unchanged as documented, and no unrelated modifications are present. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@rag/utils/azure_sas_conn.py`:
- Around line 90-93: The call to non-existent ContainerClient.get_presigned_url
should be replaced with Azure SDK SAS generation: use generate_container_sas (or
generate_account_sas) plus BlobSasPermissions to build a SAS token and then
construct the full URL for blob_name. In the method that currently builds
blob_name and loops (the code using self.conn.get_presigned_url), import and
call generate_container_sas with the account name and key (or account-level
SAS), set permissions to read/GET and expiry to the existing expires value, then
return the URL formed as
"https://{account}.blob.core.windows.net/{bucket}/{fnm}?{sas_token}". Keep the
existing retry loop and error handling, and ensure you reference the same
variables blob_name, bucket, fnm, expires and any account/key config from self
(e.g., self.account_name, self.account_key) when generating the SAS.

In `@rag/utils/azure_spn_conn.py`:
- Around line 113-116: The code currently calls self.conn.get_presigned_url
which doesn't exist on azure.storage.filedatalake FileSystemClient; replace this
call by using the Azure SDK SAS generators (e.g., generate_file_sas or
generate_file_system_sas as appropriate) to create a SAS token and then
construct the presigned URL by appending the token to the file resource URL.
Specifically, where get_presigned_url is used (refer to self.conn and the
f_path/fnm/f_path construction), call generate_file_sas(...) with the same
account_name, file_system_name (bucket), path (f_path), permission and expiry
(use the expires variable), then build the final URL as the file client or
filesystem endpoint + "/" + f_path + "?" + sas_token and return that; ensure you
import generate_file_sas and set correct permissions and protocol when
generating the token.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 72011659-46b5-4be0-96d2-01892dfb596c

📥 Commits

Reviewing files that changed from the base of the PR and between 797aa60 and dbbed3f.

📒 Files selected for processing (2)
  • rag/utils/azure_sas_conn.py
  • rag/utils/azure_spn_conn.py

Comment on lines +90 to +93

```diff
+        blob_name = f"{bucket}/{fnm}"
         for _ in range(10):
             try:
-                return self.conn.get_presigned_url("GET", bucket, fnm, expires)
+                return self.conn.get_presigned_url("GET", bucket, blob_name, expires)
```
Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# 1) Locate local get_presigned_url wrappers/implementations
ast-grep --pattern $'def get_presigned_url($_, $_, $_, $_, $_):\n  $$$'

# 2) Inspect container client usage and call signatures in repo
rg -n -C3 'ContainerClient|get_presigned_url\s*\('

# 3) Check declared Azure dependency versions
rg -n -C2 'azure-storage-blob|azure-storage-file-datalake|azure-identity'
```

Repository: infiniflow/ragflow

Length of output: 11943


Remove or replace invalid ContainerClient.get_presigned_url() call with Azure SDK–supported SAS generation.

Line 93 calls a non-existent method. The Azure Storage Blob SDK (12.28.0) does not expose get_presigned_url() on ContainerClient. Use generate_container_sas() or generate_account_sas() from azure.storage.blob instead, or implement SAS token generation with proper Azure SDK methods.
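The suggested replacement boils down to generating a SAS token with generate_blob_sas (or generate_container_sas) and assembling the URL yourself. Token generation needs real credentials, but the URL assembly can be sketched on its own (`presigned_blob_url` is a hypothetical helper, not existing code):

```python
from urllib.parse import quote

def presigned_blob_url(account: str, bucket: str, fnm: str, sas_token: str) -> str:
    """Assemble the final presigned URL for a bucket-qualified blob path.

    sas_token would come from azure.storage.blob.generate_blob_sas,
    called with BlobSasPermissions(read=True) and the desired expiry.
    """
    return f"https://{account}.blob.core.windows.net/{bucket}/{quote(fnm)}?{sas_token}"

print(presigned_blob_url("myacct", "kb1", "report.pdf", "sv=...&sig=..."))
```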


Comment on lines +113 to +116

```diff
+        f_path = f"{bucket}/{fnm}"
         for _ in range(10):
             try:
-                return self.conn.get_presigned_url("GET", bucket, fnm, expires)
+                return self.conn.get_presigned_url("GET", bucket, f_path, expires)
```
Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# 1) Find local wrappers/definitions that may provide get_presigned_url
ast-grep --pattern $'def get_presigned_url($_, $_, $_, $_, $_):\n  $$$'

# 2) Inspect call sites and azure client usage context
rg -n -C3 'FileSystemClient|get_presigned_url\s*\('

# 3) Check declared Azure package versions in manifests
rg -n -C2 'azure-storage-file-datalake|azure-storage-blob|azure-identity'
```

Repository: infiniflow/ragflow

Length of output: 12360


🏁 Script executed:

```shell
# Check if there are any tests or usage examples of azure_spn_conn
find . -name "*.py" -type f | xargs grep -l "azure_spn_conn\|AzureSpnConnection" | head -20

# Look for any comments or documentation about this method
rg -n "get_presigned_url" rag/utils/azure_spn_conn.py -B5 -A10

# Check if there are any error messages or logging that might hint at the actual issue
rg -n "fail get\|presigned" rag/utils/azure_spn_conn.py -B2 -A2
```

Repository: infiniflow/ragflow

Length of output: 675


🌐 Web query:

FileSystemClient azure-storage-file-datalake get_presigned_url signature documentation

💡 Result:

The azure-storage-file-datalake Python SDK's FileSystemClient class does not have a get_presigned_url or generate_sas_url method. Unlike the JavaScript/TS and .NET SDKs, which provide generateSasUrl methods on their FileSystemClient equivalents, the Python SDK uses standalone SAS generation functions. To generate a presigned URL (SAS URL) for a FileSystemClient:

1. Use generate_file_system_sas from azure.storage.filedatalake:

```python
from azure.storage.filedatalake import generate_file_system_sas, FileSystemSasPermissions
from datetime import datetime, timedelta

sas_token = generate_file_system_sas(
    account_name="youraccount",
    file_system_name="yourfilesystem",
    credential="youraccountkey",  # or a UserDelegationKey
    permission=FileSystemSasPermissions(read=True),  # adjust permissions
    expiry=datetime.utcnow() + timedelta(hours=1),
)
```

2. Construct the URL manually:

```python
filesystem_url = f"https://{account_name}.dfs.core.windows.net/{file_system_name}?{sas_token}"
```

This URL provides time-limited access to the filesystem according to the SAS permissions. For sub-resources (directories/files), use generate_directory_sas or generate_file_sas and append their paths to the URL.

Example full usage:

```python
from azure.storage.filedatalake import generate_file_system_sas, FileSystemSasPermissions
from datetime import datetime, timedelta

account_name = "youraccount"
file_system_name = "yourfilesystem"
account_key = "yourkey"

sas_token = generate_file_system_sas(
    account_name=account_name,
    file_system_name=file_system_name,
    credential=account_key,
    permission=FileSystemSasPermissions(read=True, list=True),
    expiry=datetime.utcnow() + timedelta(hours=1),
)
presigned_url = f"https://{account_name}.dfs.core.windows.net/{file_system_name}?{sas_token}"
print(presigned_url)
```

Official docs confirm FileSystemClient methods focus on CRUD operations like create_file_system, get_paths, etc., with no SAS generation on the client itself. The SAS functions live in the _shared_access_signature module.



Remove get_presigned_url call; use Azure SDK SAS generation functions instead.

Line 116 calls self.conn.get_presigned_url(), but FileSystemClient (azure-storage-file-datalake 12.16.0) does not have this method. The Azure SDK provides standalone SAS generation functions: generate_file_system_sas(), generate_directory_sas(), or generate_file_sas(). You must replace this with the appropriate function and manually construct the presigned URL using the returned SAS token.
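Analogously for Data Lake, generate_file_sas from azure.storage.filedatalake would produce the token, and the caller assembles the dfs endpoint URL. The assembly step can be sketched on its own (`datalake_file_url` is a hypothetical helper, not existing code):

```python
from urllib.parse import quote

def datalake_file_url(account: str, filesystem: str, path: str, sas_token: str) -> str:
    """Assemble a presigned URL for a Data Lake file.

    sas_token would come from azure.storage.filedatalake.generate_file_sas,
    called with the same account name, file system (bucket), and file path.
    """
    return f"https://{account}.dfs.core.windows.net/{filesystem}/{quote(path)}?{sas_token}"
```

Note that quote() leaves "/" intact by default, so a nested path such as "dir/doc.pdf" keeps its structure in the URL.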


@voidborne-d
Author

@coderabbitai — thanks for the review. The self.conn.get_presigned_url() calls flagged in both files are pre-existing code; this PR only adds the {bucket}/ path prefix to align with the fix applied to the other four operations (put, get, rm, obj_exist). The validity of the underlying get_presigned_url method on the Azure SDK client is a separate concern that existed before this change.

This PR intentionally keeps a narrow scope: fix the missing bucket prefix across all operations so that files from different datasets are isolated correctly. Replacing the pre-existing get_presigned_url wrapper with proper Azure SDK SAS generation would be a good follow-up but is out of scope here.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 17, 2026

Tip

For best results, initiate chat on the files or code changes.

@voidborne-d, understood — thank you for the clarification. You're right that the self.conn.get_presigned_url() concern is pre-existing and entirely separate from the bucket-prefix fix this PR addresses. Keeping the scope narrow here is the right call.

Would you like me to open a follow-up GitHub issue to track replacing the get_presigned_url wrapper in both azure_spn_conn.py and azure_sas_conn.py with proper Azure SDK SAS token generation? That way it won't get lost.

( ˶ᵔ ᵕ ᵔ˶ ) 🐇


✏️ Learnings added
Learnt from: voidborne-d
URL: https://github.com/infiniflow/ragflow/pull/14185

Timestamp: 2026-04-17T09:07:06.978Z
Learning: In the RAGFlow repository (infiniflow/ragflow), the `self.conn.get_presigned_url()` calls in `rag/utils/azure_spn_conn.py` and `rag/utils/azure_sas_conn.py` are pre-existing code that predates PR `#14185`. The validity/correctness of the underlying Azure SDK `get_presigned_url` method is a separate concern from the bucket-prefix fix introduced in PR `#14185`. A future follow-up should replace these wrappers with proper Azure SDK SAS token generation.


@JinHai-CN JinHai-CN added the ci Continue Integration label Apr 17, 2026

Labels

  • 🐞 bug — Something isn't working, pull request that fix bug.
  • ci — Continue Integration
  • size:S — This PR changes 10-29 lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

Azure Blob SPN storage: files from different datasets can overwrite each other due to missing bucket prefix
