fix: prepend bucket prefix to Azure SPN and SAS storage paths#14185
fix: prepend bucket prefix to Azure SPN and SAS storage paths#14185voidborne-d wants to merge 1 commit intoinfiniflow:mainfrom
Conversation
Files from different datasets can overwrite each other in Azure Blob storage because both azure_spn_conn.py and azure_sas_conn.py ignore the bucket parameter and store files using only the filename. Prepend bucket (typically the knowledge base ID) as a path prefix to all storage operations (put, get, rm, obj_exist, get_presigned_url), matching the behavior of MinIO and S3 implementations which use the bucket parameter for logical folder isolation. Fixes infiniflow#14159
📝 WalkthroughWalkthroughBoth Azure blob storage implementations (SAS and SPN authentication) were updated to consistently include the bucket prefix in file paths across put, rm, get, obj_exist, and get_presigned_url methods, preventing filename collisions between different datasets. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@rag/utils/azure_sas_conn.py`:
- Around line 90-93: The call to non-existent ContainerClient.get_presigned_url
should be replaced with Azure SDK SAS generation: use generate_container_sas (or
generate_account_sas) plus BlobSasPermissions to build a SAS token and then
construct the full URL for blob_name. In the method that currently builds
blob_name and loops (the code using self.conn.get_presigned_url), import and
call generate_container_sas with the account name and key (or account-level
SAS), set permissions to read/GET and expiry to the existing expires value, then
return the URL formed as
"https://{account}.blob.core.windows.net/{bucket}/{fnm}?{sas_token}". Keep the
existing retry loop and error handling, and ensure you reference the same
variables blob_name, bucket, fnm, expires and any account/key config from self
(e.g., self.account_name, self.account_key) when generating the SAS.
In `@rag/utils/azure_spn_conn.py`:
- Around line 113-116: The code currently calls self.conn.get_presigned_url
which doesn't exist on azure.storage.filedatalake FileSystemClient; replace this
call by using the Azure SDK SAS generators (e.g., generate_file_sas or
generate_file_system_sas as appropriate) to create a SAS token and then
construct the presigned URL by appending the token to the file resource URL.
Specifically, where get_presigned_url is used (refer to self.conn and the
f_path/fnm/f_path construction), call generate_file_sas(...) with the same
account_name, file_system_name (bucket), path (f_path), permission and expiry
(use the expires variable), then build the final URL as the file client or
filesystem endpoint + "/" + f_path + "?" + sas_token and return that; ensure you
import generate_file_sas and set correct permissions and protocol when
generating the token.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 72011659-46b5-4be0-96d2-01892dfb596c
📒 Files selected for processing (2)
rag/utils/azure_sas_conn.pyrag/utils/azure_spn_conn.py
| blob_name = f"{bucket}/{fnm}" | ||
| for _ in range(10): | ||
| try: | ||
| return self.conn.get_presigned_url("GET", bucket, fnm, expires) | ||
| return self.conn.get_presigned_url("GET", bucket, blob_name, expires) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# 1) Locate local get_presigned_url wrappers/implementations
ast-grep --pattern $'def get_presigned_url($_, $_, $_, $_, $_):\n $$$'
# 2) Inspect container client usage and call signatures in repo
rg -n -C3 'ContainerClient|get_presigned_url\s*\('
# 3) Check declared Azure dependency versions
rg -n -C2 'azure-storage-blob|azure-storage-file-datalake|azure-identity'Repository: infiniflow/ragflow
Length of output: 11943
Remove or replace invalid ContainerClient.get_presigned_url() call with Azure SDK–supported SAS generation.
Line 93 calls a non-existent method. The Azure Storage Blob SDK (12.28.0) does not expose get_presigned_url() on ContainerClient. Use generate_container_sas() or generate_account_sas() from azure.storage.blob instead, or implement SAS token generation with proper Azure SDK methods.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag/utils/azure_sas_conn.py` around lines 90 - 93, The call to non-existent
ContainerClient.get_presigned_url should be replaced with Azure SDK SAS
generation: use generate_container_sas (or generate_account_sas) plus
BlobSasPermissions to build a SAS token and then construct the full URL for
blob_name. In the method that currently builds blob_name and loops (the code
using self.conn.get_presigned_url), import and call generate_container_sas with
the account name and key (or account-level SAS), set permissions to read/GET and
expiry to the existing expires value, then return the URL formed as
"https://{account}.blob.core.windows.net/{bucket}/{fnm}?{sas_token}". Keep the
existing retry loop and error handling, and ensure you reference the same
variables blob_name, bucket, fnm, expires and any account/key config from self
(e.g., self.account_name, self.account_key) when generating the SAS.
| f_path = f"{bucket}/{fnm}" | ||
| for _ in range(10): | ||
| try: | ||
| return self.conn.get_presigned_url("GET", bucket, fnm, expires) | ||
| return self.conn.get_presigned_url("GET", bucket, f_path, expires) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# 1) Find local wrappers/definitions that may provide get_presigned_url
ast-grep --pattern $'def get_presigned_url($_, $_, $_, $_, $_):\n $$$'
# 2) Inspect call sites and azure client usage context
rg -n -C3 'FileSystemClient|get_presigned_url\s*\('
# 3) Check declared Azure package versions in manifests
rg -n -C2 'azure-storage-file-datalake|azure-storage-blob|azure-identity'Repository: infiniflow/ragflow
Length of output: 12360
🏁 Script executed:
# Check if there are any tests or usage examples of azure_spn_conn
find . -name "*.py" -type f | xargs grep -l "azure_spn_conn\|AzureSpnConnection" | head -20
# Look for any comments or documentation about this method
rg -n "get_presigned_url" rag/utils/azure_spn_conn.py -B5 -A10
# Check if there are any error messages or logging that might hint at the actual issue
rg -n "fail get\|presigned" rag/utils/azure_spn_conn.py -B2 -A2Repository: infiniflow/ragflow
Length of output: 675
🌐 Web query:
FileSystemClient azure-storage-file-datalake get_presigned_url signature documentation
💡 Result:
The azure-storage-file-datalake Python SDK's FileSystemClient class does not have a get_presigned_url or generate_sas_url method. Unlike JavaScript/TS and .NET SDKs which provide generateSasUrl methods on FileSystemClient equivalents, the Python SDK uses standalone SAS generation functions. To generate a presigned URL (SAS URL) for a FileSystemClient: 1. Use generate_file_system_sas from azure.storage.filedatalake: from azure.storage.filedatalake import generate_file_system_sas, FileSystemSasPermissions from datetime import datetime, timedelta sas_token = generate_file_system_sas( account_name="youraccount", file_system_name="yourfilesystem", credential="youraccountkey", # or UserDelegationKey permission=FileSystemSasPermissions(read=True), # adjust permissions expiry=datetime.utcnow + timedelta(hours=1) ) 2. Construct the URL manually: filesystem_url = f"https://{account_name}.dfs.core.windows.net/{file_system_name}?{sas_token}" This URL provides time-limited access to the filesystem according to the SAS permissions. For sub-resources (directories/files), use generate_directory_sas or generate_file_sas and append their paths to the URL. Example full usage: python from azure.storage.filedatalake import generate_file_system_sas, FileSystemSasPermissions from datetime import datetime, timedelta account_name = "youraccount" file_system_name = "yourfilesystem" account_key = "yourkey" sas_token = generate_file_system_sas( account_name=account_name, file_system_name=file_system_name, credential=account_key, permission=FileSystemSasPermissions(read=True, list=True), expiry=datetime.utcnow + timedelta(hours=1) ) presigned_url = f"https://{account_name}.dfs.core.windows.net/{file_system_name}?{sas_token}" print(presigned_url) Official docs confirm FileSystemClient methods focus on CRUD operations like create_file_system, get_paths, etc., with no SAS generation on the client itself. SAS functions are in _shared_access_signature module.
Citations:
- 1: https://learn.microsoft.com/en-us/python/api/azure-storage-file-datalake/azure.storage.filedatalake.filesystemclient?view=azure-python
- 2: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/storage/azure-storage-file-datalake/azure/storage/filedatalake/_file_system_client.py
- 3: https://azuresdkdocs.z19.web.core.windows.net/python/azure-storage-file-datalake/12.21.0/_modules/azure/storage/filedatalake/_shared_access_signature.html
- 4: https://stackoverflow.com/questions/76430246/generating-sas-url-for-azure-data-lake-storage
- 5: https://learn.microsoft.com/en-us/python/api/overview/azure/storage-file-datalake-readme?view=azure-python
- 6: https://learn.microsoft.com/en-us/python/api/azure-storage-file-datalake/azure.storage.filedatalake.datalakeserviceclient?view=azure-python
Remove get_presigned_url call; use Azure SDK SAS generation functions instead.
Line 116 calls self.conn.get_presigned_url(), but FileSystemClient (azure-storage-file-datalake 12.16.0) does not have this method. The Azure SDK provides standalone SAS generation functions: generate_file_system_sas(), generate_directory_sas(), or generate_file_sas(). You must replace this with the appropriate function and manually construct the presigned URL using the returned SAS token.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag/utils/azure_spn_conn.py` around lines 113 - 116, The code currently calls
self.conn.get_presigned_url which doesn't exist on azure.storage.filedatalake
FileSystemClient; replace this call by using the Azure SDK SAS generators (e.g.,
generate_file_sas or generate_file_system_sas as appropriate) to create a SAS
token and then construct the presigned URL by appending the token to the file
resource URL. Specifically, where get_presigned_url is used (refer to self.conn
and the f_path/fnm/f_path construction), call generate_file_sas(...) with the
same account_name, file_system_name (bucket), path (f_path), permission and
expiry (use the expires variable), then build the final URL as the file client
or filesystem endpoint + "/" + f_path + "?" + sas_token and return that; ensure
you import generate_file_sas and set correct permissions and protocol when
generating the token.
|
@coderabbitai — thanks for the review. The This PR intentionally keeps a narrow scope: fix the missing bucket prefix across all operations so that files from different datasets are isolated correctly. Replacing the pre-existing |
|
Tip For best results, initiate chat on the files or code changes.
Would you like me to open a follow-up GitHub issue to track replacing the ( ˶ᵔ ᵕ ᵔ˶ ) 🐇 ✏️ Learnings added
|
Summary
Fixes #14159 — files from different datasets can overwrite each other in Azure Blob storage.
Problem
Both
azure_spn_conn.pyandazure_sas_conn.pyignore thebucketparameter in all storage operations (put,get,rm,obj_exist,get_presigned_url). Files are stored flat using only the filename, so two datasets containing a file with the same name will overwrite each other.The MinIO and S3 implementations correctly use the bucket (typically the knowledge base ID) as a path prefix to create logical folder isolation:
use_prefix_pathdecorator →{orig_bucket}/{fnm}use_prefix_pathdecorator →{prefix_path}/{bucket}/{fnm}Fix
Prepend
{bucket}/to the file path in all 5 operations across both Azure connector files:azure_spn_conn.pyput,get,rm,obj_exist,get_presigned_urlazure_sas_conn.pyput,get,rm,obj_exist,get_presigned_urlThis matches the existing convention where
bucketis the knowledge base ID used as a directory prefix.Existing Azure SPN/SAS deployments have files stored without the bucket prefix. After this fix, new files will be stored under
{bucket}/{filename}while existing files remain at{filename}. A one-time migration script or manual file move may be needed for existing deployments. New deployments are unaffected.Testing
health()method is intentionally left unchanged as it uses a hardcoded test filename without bucket semantics