fix: prepend bucket prefix in Azure Blob (SAS/SPN) to prevent cross-dataset file overwrites#14174
Conversation
…ile overwrites (fixes infiniflow#14159)
📝 WalkthroughWalkthroughBoth Azure authentication modules (SAS and SPN) now prefix the Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
rag/utils/azure_spn_conn.py (2)
110-113:⚠️ Potential issue | 🟠 MajorPresigned URL pathing is not updated to the new
bucket/filenamekey format.Line 113 still passes
fnmdirectly, so URL-based access may bypass the isolation fix.Suggested fix
def get_presigned_url(self, bucket, fnm, expires): for _ in range(10): try: - return self.conn.get_presigned_url("GET", bucket, fnm, expires) + return self.conn.get_presigned_url("GET", bucket, f"{bucket}/{fnm}", expires) except Exception: logging.exception(f"fail get {bucket}/{fnm}") self.__open__() time.sleep(1) return None🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag/utils/azure_spn_conn.py` around lines 110 - 113, In get_presigned_url, the code currently passes fnm directly to self.conn.get_presigned_url which bypasses the new isolation key format; modify get_presigned_url in the class to build the object key as bucket + "/" + fnm (e.g., key = f"{bucket}/{fnm}") and pass that key to self.conn.get_presigned_url while preserving the existing retry loop and expires parameter so URLs follow the new bucket/filename pathing.
72-82:⚠️ Potential issue | 🟠 MajorRetry loop in
put()is broken by early return in the exception block.Line 81 exits on the first failure, so retries at Line 72 never happen.
Suggested fix
def put(self, bucket, fnm, binary): for _ in range(3): try: f = self.conn.create_file(f"{bucket}/{fnm}") f.append_data(binary, offset=0, length=len(binary)) return f.flush_data(len(binary)) except Exception: logging.exception(f"Fail put {bucket}/{fnm}") self.__open__() time.sleep(1) - return None return None🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag/utils/azure_spn_conn.py` around lines 72 - 82, The retry loop in put() is aborted by the early "return None" inside the except block; remove that return so exceptions cause the loop to retry up to 3 times (call self.__open__(), sleep, and continue), and only return None after the for-loop if all attempts fail; keep the logging.exception(f"Fail put {bucket}/{fnm}") and the self.__open__(), time.sleep(1) calls in the except handler, and ensure successful path still returns f.flush_data(len(binary)) after f.append_data.rag/utils/azure_sas_conn.py (1)
87-90:⚠️ Potential issue | 🟠 MajorPresigned URL generation is still missing bucket-prefixed object pathing.
Line 90 still uses
fnmdirectly. This leaves the issue only partially fixed for URL-based access paths.Suggested fix
def get_presigned_url(self, bucket, fnm, expires): for _ in range(10): try: - return self.conn.get_presigned_url("GET", bucket, fnm, expires) + return self.conn.get_presigned_url("GET", bucket, f"{bucket}/{fnm}", expires) except Exception: logging.exception(f"fail get {bucket}/{fnm}") self.__open__() time.sleep(1) return🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag/utils/azure_sas_conn.py` around lines 87 - 90, The get_presigned_url method currently passes fnm directly to self.conn.get_presigned_url, which misses bucket-prefixed object pathing; update get_presigned_url to build the object path (e.g., object_path = f"{bucket}/{fnm}" unless fnm already starts with the bucket) and call self.conn.get_presigned_url("GET", bucket, object_path, expires) so the returned SAS URL uses the bucket-prefixed object path; adjust any retry/exception handling in get_presigned_url to use this new object_path variable.
🧹 Nitpick comments (2)
rag/utils/azure_sas_conn.py (1)
54-83: Add explicit logs for the new bucket-prefixed flow.The new namespace flow is critical for debugging migration/compat incidents; add at least debug-level logs on put/get/rm/obj_exist path selection.
As per coding guidelines,**/*.py: Add logging for new flows.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag/utils/azure_sas_conn.py` around lines 54 - 83, The put/get/rm/obj_exist methods lack debug logging for the new bucket-prefixed namespace flow; add debug-level logs in each method (put, get, rm, obj_exist) that record which path/name is being used (e.g., the f"{bucket}/{fnm}" key) before calling self.conn.upload_blob / download_blob / delete_blob / get_blob_client.exists(), and log the selection of the bucket-prefixed flow when falling back/reopening (inside the except blocks where self.__open__() is called) so callers can trace migration/compat decisions and retries; also include the same debug info in the exception handlers (alongside the existing logging.exception) to make the attempted key explicit.rag/utils/azure_spn_conn.py (1)
71-108: Please add explicit logging for the new bucket-prefixed namespace flow.Given this is a behavior-defining storage path change, debug logs for resolved remote path (or migration fallback path) will help production diagnosis.
As per coding guidelines,**/*.py: Add logging for new flows.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag/utils/azure_spn_conn.py` around lines 71 - 108, Add explicit debug logging of the resolved remote path for the bucket-prefixed namespace flow in the Azure SPN connection methods so production diagnostics can see exactly what path is used (and whether a fallback/migration path was chosen). Specifically, in put, get, rm, and obj_exist (methods of the azure_spn_conn class) compute the remote_path once (e.g. remote_path = f"{bucket}/{fnm}"), log a debug message like "resolved remote_path=<remote_path> (bucket_prefixed_namespace)" or include fallback info before calling self.conn.create_file/get_file_client/delete_file, and keep existing exception logging; ensure the same remote_path is used for the subsequent operation and in the exception messages so logs match the attempted action.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@rag/utils/azure_sas_conn.py`:
- Around line 69-73: The get method in rag/utils/azure_sas_conn.py currently
only tries self.conn.download_blob(f"{bucket}/{fnm}") which breaks legacy
unprefixed blobs; change get (and the corresponding existence check method that
mirrors lines ~80-83) to first attempt the "{bucket}/{fnm}" path and if that
fails (catch the not-found/read error) fall back to trying the raw "fnm" path
before giving up, returning the blob bytes on success; do the same for the
exists/existence helper so callers like image_utils.py and pdf_chunk_metadata.py
remain compatible with legacy keys.
In `@rag/utils/azure_spn_conn.py`:
- Around line 90-95: The get method (and corresponding existence check)
currently only tries the prefixed path f"{bucket}/{fnm}", which will make legacy
unprefixed objects inaccessible; update the get and exists logic in class
methods (e.g., get and exists using self.conn.get_file_client and
client.download_file) to attempt the prefixed path first and, on a NotFound/404
or other read failure, fall back to the legacy unprefixed path (just f"{fnm}")
before returning failure — catch the download/get exceptions, try the alternate
key, and return the result from whichever succeeds (or propagate a clear error
if both fail).
---
Outside diff comments:
In `@rag/utils/azure_sas_conn.py`:
- Around line 87-90: The get_presigned_url method currently passes fnm directly
to self.conn.get_presigned_url, which misses bucket-prefixed object pathing;
update get_presigned_url to build the object path (e.g., object_path =
f"{bucket}/{fnm}" unless fnm already starts with the bucket) and call
self.conn.get_presigned_url("GET", bucket, object_path, expires) so the returned
SAS URL uses the bucket-prefixed object path; adjust any retry/exception
handling in get_presigned_url to use this new object_path variable.
In `@rag/utils/azure_spn_conn.py`:
- Around line 110-113: In get_presigned_url, the code currently passes fnm
directly to self.conn.get_presigned_url which bypasses the new isolation key
format; modify get_presigned_url in the class to build the object key as bucket
+ "/" + fnm (e.g., key = f"{bucket}/{fnm}") and pass that key to
self.conn.get_presigned_url while preserving the existing retry loop and expires
parameter so URLs follow the new bucket/filename pathing.
- Around line 72-82: The retry loop in put() is aborted by the early "return
None" inside the except block; remove that return so exceptions cause the loop
to retry up to 3 times (call self.__open__(), sleep, and continue), and only
return None after the for-loop if all attempts fail; keep the
logging.exception(f"Fail put {bucket}/{fnm}") and the self.__open__(),
time.sleep(1) calls in the except handler, and ensure successful path still
returns f.flush_data(len(binary)) after f.append_data.
---
Nitpick comments:
In `@rag/utils/azure_sas_conn.py`:
- Around line 54-83: The put/get/rm/obj_exist methods lack debug logging for the
new bucket-prefixed namespace flow; add debug-level logs in each method (put,
get, rm, obj_exist) that record which path/name is being used (e.g., the
f"{bucket}/{fnm}" key) before calling self.conn.upload_blob / download_blob /
delete_blob / get_blob_client.exists(), and log the selection of the
bucket-prefixed flow when falling back/reopening (inside the except blocks where
self.__open__() is called) so callers can trace migration/compat decisions and
retries; also include the same debug info in the exception handlers (alongside
the existing logging.exception) to make the attempted key explicit.
In `@rag/utils/azure_spn_conn.py`:
- Around line 71-108: Add explicit debug logging of the resolved remote path for
the bucket-prefixed namespace flow in the Azure SPN connection methods so
production diagnostics can see exactly what path is used (and whether a
fallback/migration path was chosen). Specifically, in put, get, rm, and
obj_exist (methods of the azure_spn_conn class) compute the remote_path once
(e.g. remote_path = f"{bucket}/{fnm}"), log a debug message like "resolved
remote_path=<remote_path> (bucket_prefixed_namespace)" or include fallback info
before calling self.conn.create_file/get_file_client/delete_file, and keep
existing exception logging; ensure the same remote_path is used for the
subsequent operation and in the exception messages so logs match the attempted
action.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 57b8fe6c-badc-467c-b40e-4580c5c1a024
📒 Files selected for processing (2)
rag/utils/azure_sas_conn.pyrag/utils/azure_spn_conn.py
| def get(self, bucket, fnm): | ||
| for _ in range(1): | ||
| try: | ||
| r = self.conn.download_blob(fnm) | ||
| r = self.conn.download_blob(f"{bucket}/{fnm}") | ||
| return r.read() |
There was a problem hiding this comment.
Legacy blobs can become unreadable after this path migration.
Line 72 and Line 82 now only target "{bucket}/{fnm}". Existing unprefixed blobs (fnm) can no longer be found/read, which can break current read paths (e.g., api/utils/image_utils.py:23-27, rag/flow/parser/pdf_chunk_metadata.py:202-206).
Suggested compatibility fallback (read/exist)
def get(self, bucket, fnm):
- for _ in range(1):
- try:
- r = self.conn.download_blob(f"{bucket}/{fnm}")
- return r.read()
- except Exception:
- logging.exception(f"fail get {bucket}/{fnm}")
- self.__open__()
- time.sleep(1)
+ candidates = [f"{bucket}/{fnm}", fnm]
+ for name in candidates:
+ try:
+ r = self.conn.download_blob(name)
+ return r.read()
+ except Exception:
+ continue
+ logging.exception(f"fail get {bucket}/{fnm}")
+ self.__open__()
+ time.sleep(1)
return
def obj_exist(self, bucket, fnm):
try:
- return self.conn.get_blob_client(f"{bucket}/{fnm}").exists()
+ return (
+ self.conn.get_blob_client(f"{bucket}/{fnm}").exists()
+ or self.conn.get_blob_client(fnm).exists()
+ )
except Exception:
logging.exception(f"Fail put {bucket}/{fnm}")
return FalseAlso applies to: 80-83
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag/utils/azure_sas_conn.py` around lines 69 - 73, The get method in
rag/utils/azure_sas_conn.py currently only tries
self.conn.download_blob(f"{bucket}/{fnm}") which breaks legacy unprefixed blobs;
change get (and the corresponding existence check method that mirrors lines
~80-83) to first attempt the "{bucket}/{fnm}" path and if that fails (catch the
not-found/read error) fall back to trying the raw "fnm" path before giving up,
returning the blob bytes on success; do the same for the exists/existence helper
so callers like image_utils.py and pdf_chunk_metadata.py remain compatible with
legacy keys.
| def get(self, bucket, fnm): | ||
| for _ in range(1): | ||
| try: | ||
| client = self.conn.get_file_client(fnm) | ||
| client = self.conn.get_file_client(f"{bucket}/{fnm}") | ||
| r = client.download_file() | ||
| return r.read() |
There was a problem hiding this comment.
Read/existence paths need a legacy fallback to avoid data invisibility.
Line 93 and Line 104 now only use "{bucket}/{fnm}". Older unprefixed objects can become unreadable/uncheckable immediately after deploy.
Also applies to: 102-105
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag/utils/azure_spn_conn.py` around lines 90 - 95, The get method (and
corresponding existence check) currently only tries the prefixed path
f"{bucket}/{fnm}", which will make legacy unprefixed objects inaccessible;
update the get and exists logic in class methods (e.g., get and exists using
self.conn.get_file_client and client.download_file) to attempt the prefixed path
first and, on a NotFound/404 or other read failure, fall back to the legacy
unprefixed path (just f"{fnm}") before returning failure — catch the
download/get exceptions, try the alternate key, and return the result from
whichever succeeds (or propagate a clear error if both fail).
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #14174 +/- ##
=======================================
Coverage 94.16% 94.16%
=======================================
Files 10 10
Lines 703 703
Branches 112 112
=======================================
Hits 662 662
Misses 25 25
Partials 16 16 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Fixes #14159
Problem
The
put(),get(),rm(), andobj_exist()methods in bothazure_spn_conn.pyandazure_sas_conn.pyignore thebucketparameter entirely, storing all files flat using only the filename. This causes files from different datasets to overwrite each other when they share the same filename.By contrast, the MinIO and S3 implementations correctly use the bucket (typically the knowledge base ID) as a path prefix, creating logical folder isolation like
{kb_id}/{filename}.Solution
Prepend the
bucketparameter as a path prefix to all file operations in both Azure storage implementations:azure_spn_conn.py:create_file,delete_file,get_file_clientnow usef"{bucket}/{fnm}"azure_sas_conn.py:upload_blob,delete_blob,download_blob,get_blob_clientnow usef"{bucket}/{fnm}"This matches the behavior of all other storage backends (MinIO, S3) and prevents filename collisions across knowledge bases.
Testing
health()method is left unchanged as it uses a fixed test path for connectivity checks only