
fix: prepend bucket prefix in Azure Blob (SAS/SPN) to prevent cross-dataset file overwrites#14174

Merged
JinHai-CN merged 2 commits into infiniflow:main from octo-patch:fix/issue-14159-azure-blob-bucket-prefix
May 7, 2026

Conversation

@octo-patch
Contributor

Fixes #14159

Problem

The put(), get(), rm(), and obj_exist() methods in both azure_spn_conn.py and azure_sas_conn.py ignore the bucket parameter entirely, storing all files flat using only the filename. This causes files from different datasets to overwrite each other when they share the same filename.

By contrast, the MinIO and S3 implementations correctly use the bucket (typically the knowledge base ID) as a path prefix, creating logical folder isolation like {kb_id}/{filename}.

Solution

Prepend the bucket parameter as a path prefix to all file operations in both Azure storage implementations:

  • azure_spn_conn.py: create_file, delete_file, get_file_client now use f"{bucket}/{fnm}"
  • azure_sas_conn.py: upload_blob, delete_blob, download_blob, get_blob_client now use f"{bucket}/{fnm}"

This matches the behavior of all other storage backends (MinIO, S3) and prevents filename collisions across knowledge bases.
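
The isolation pattern is simple enough to sketch. The `object_key` helper below is hypothetical — the PR inlines `f"{bucket}/{fnm}"` at each call site rather than factoring it out — but it shows why the prefix prevents collisions:

```python
def object_key(bucket: str, fnm: str) -> str:
    """Build the bucket-prefixed key used by put/get/rm/obj_exist.

    Mirrors the f"{bucket}/{fnm}" construction from the PR; `bucket`
    is typically the knowledge base ID, so each dataset gets its own
    logical folder inside the container.
    """
    return f"{bucket}/{fnm}"

# Two datasets can now store a file with the same name without colliding.
key_a = object_key("kb_alpha", "report.pdf")
key_b = object_key("kb_beta", "report.pdf")
assert key_a != key_b
assert key_a == "kb_alpha/report.pdf"
```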

Testing

  • Verified the fix aligns with how MinIO/S3 connectors handle the bucket parameter
  • The health() method is left unchanged as it uses a fixed test path for connectivity checks only

@dosubot dosubot Bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Apr 17, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 17, 2026

📝 Walkthrough

Walkthrough

Both Azure authentication modules (SAS and SPN) now prefix the bucket parameter to file paths in core blob operations (put, get, rm, obj_exist), ensuring files from different datasets maintain isolation and prevent overwrites.

Changes

  • Azure SAS Blob Storage — rag/utils/azure_sas_conn.py: updated path construction so that put(), get(), rm(), and obj_exist() prefix blob names with {bucket}/ to isolate files by dataset.
  • Azure Data Lake SPN Storage — rag/utils/azure_spn_conn.py: updated path construction so that put(), get(), rm(), and obj_exist() prefix file paths with {bucket}/ to prevent cross-dataset filename collisions.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 A rabbit hops through storage bins,
With buckets now in every path,
No more files where chaos spins—
Each dataset gets its own half bath!
Files safe from collisions bright,
Prefixes make the world feel right. 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check — ✅ Passed: the title clearly and concisely describes the main change: prepending a bucket prefix to Azure Blob operations to prevent cross-dataset file overwrites.
  • Description check — ✅ Passed: the description follows the template with a clear problem statement, solution details, and testing notes, covering all essential sections.
  • Linked Issues check — ✅ Passed: the PR addresses all coding requirements from #14159: the bucket prefix is prepended in both azure_spn_conn.py and azure_sas_conn.py for the put/get/rm/obj_exist operations.
  • Out of Scope Changes check — ✅ Passed: all changes are directly scoped to fixing the bucket prefix issue in the Azure storage implementations; no unrelated modifications are present.


@dosubot dosubot Bot added the 🐞 bug Something isn't working, pull request that fix bug. label Apr 17, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
rag/utils/azure_spn_conn.py (2)

110-113: ⚠️ Potential issue | 🟠 Major

Presigned URL pathing is not updated to the new bucket/filename key format.

Line 113 still passes fnm directly, so URL-based access may bypass the isolation fix.

Suggested fix
 def get_presigned_url(self, bucket, fnm, expires):
     for _ in range(10):
         try:
-            return self.conn.get_presigned_url("GET", bucket, fnm, expires)
+            return self.conn.get_presigned_url("GET", bucket, f"{bucket}/{fnm}", expires)
         except Exception:
             logging.exception(f"fail get {bucket}/{fnm}")
             self.__open__()
             time.sleep(1)
     return None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag/utils/azure_spn_conn.py` around lines 110 - 113, In get_presigned_url,
the code currently passes fnm directly to self.conn.get_presigned_url which
bypasses the new isolation key format; modify get_presigned_url in the class to
build the object key as bucket + "/" + fnm (e.g., key = f"{bucket}/{fnm}") and
pass that key to self.conn.get_presigned_url while preserving the existing retry
loop and expires parameter so URLs follow the new bucket/filename pathing.

72-82: ⚠️ Potential issue | 🟠 Major

Retry loop in put() is broken by early return in the exception block.

Line 81 exits on the first failure, so retries at Line 72 never happen.

Suggested fix
 def put(self, bucket, fnm, binary):
     for _ in range(3):
         try:
             f = self.conn.create_file(f"{bucket}/{fnm}")
             f.append_data(binary, offset=0, length=len(binary))
             return f.flush_data(len(binary))
         except Exception:
             logging.exception(f"Fail put {bucket}/{fnm}")
             self.__open__()
             time.sleep(1)
-            return None
     return None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag/utils/azure_spn_conn.py` around lines 72 - 82, The retry loop in put() is
aborted by the early "return None" inside the except block; remove that return
so exceptions cause the loop to retry up to 3 times (call self.__open__(),
sleep, and continue), and only return None after the for-loop if all attempts
fail; keep the logging.exception(f"Fail put {bucket}/{fnm}") and the
self.__open__(), time.sleep(1) calls in the except handler, and ensure
successful path still returns f.flush_data(len(binary)) after f.append_data.
rag/utils/azure_sas_conn.py (1)

87-90: ⚠️ Potential issue | 🟠 Major

Presigned URL generation is still missing bucket-prefixed object pathing.

Line 90 still uses fnm directly. This leaves the issue only partially fixed for URL-based access paths.

Suggested fix
 def get_presigned_url(self, bucket, fnm, expires):
     for _ in range(10):
         try:
-            return self.conn.get_presigned_url("GET", bucket, fnm, expires)
+            return self.conn.get_presigned_url("GET", bucket, f"{bucket}/{fnm}", expires)
         except Exception:
             logging.exception(f"fail get {bucket}/{fnm}")
             self.__open__()
             time.sleep(1)
     return
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag/utils/azure_sas_conn.py` around lines 87 - 90, The get_presigned_url
method currently passes fnm directly to self.conn.get_presigned_url, which
misses bucket-prefixed object pathing; update get_presigned_url to build the
object path (e.g., object_path = f"{bucket}/{fnm}" unless fnm already starts
with the bucket) and call self.conn.get_presigned_url("GET", bucket,
object_path, expires) so the returned SAS URL uses the bucket-prefixed object
path; adjust any retry/exception handling in get_presigned_url to use this new
object_path variable.
🧹 Nitpick comments (2)
rag/utils/azure_sas_conn.py (1)

54-83: Add explicit logs for the new bucket-prefixed flow.

The new namespace flow is critical for debugging migration/compat incidents; add at least debug-level logs on put/get/rm/obj_exist path selection.
As per coding guidelines, **/*.py: Add logging for new flows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag/utils/azure_sas_conn.py` around lines 54 - 83, The put/get/rm/obj_exist
methods lack debug logging for the new bucket-prefixed namespace flow; add
debug-level logs in each method (put, get, rm, obj_exist) that record which
path/name is being used (e.g., the f"{bucket}/{fnm}" key) before calling
self.conn.upload_blob / download_blob / delete_blob / get_blob_client.exists(),
and log the selection of the bucket-prefixed flow when falling back/reopening
(inside the except blocks where self.__open__() is called) so callers can trace
migration/compat decisions and retries; also include the same debug info in the
exception handlers (alongside the existing logging.exception) to make the
attempted key explicit.
rag/utils/azure_spn_conn.py (1)

71-108: Please add explicit logging for the new bucket-prefixed namespace flow.

Given this is a behavior-defining storage path change, debug logs for resolved remote path (or migration fallback path) will help production diagnosis.
As per coding guidelines, **/*.py: Add logging for new flows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag/utils/azure_spn_conn.py` around lines 71 - 108, Add explicit debug
logging of the resolved remote path for the bucket-prefixed namespace flow in
the Azure SPN connection methods so production diagnostics can see exactly what
path is used (and whether a fallback/migration path was chosen). Specifically,
in put, get, rm, and obj_exist (methods of the azure_spn_conn class) compute the
remote_path once (e.g. remote_path = f"{bucket}/{fnm}"), log a debug message
like "resolved remote_path=<remote_path> (bucket_prefixed_namespace)" or include
fallback info before calling self.conn.create_file/get_file_client/delete_file,
and keep existing exception logging; ensure the same remote_path is used for the
subsequent operation and in the exception messages so logs match the attempted
action.
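
A minimal sketch of the logging both nitpicks ask for. `resolve_remote_path` is a hypothetical helper name (the connectors currently inline the f-string); the point is computing the key once and emitting a debug line before the storage call:

```python
import logging

logger = logging.getLogger(__name__)

def resolve_remote_path(bucket: str, fnm: str) -> str:
    """Compute the bucket-prefixed key once and log it at debug level,
    so retries and fallback decisions are traceable in production."""
    remote_path = f"{bucket}/{fnm}"
    # Lazy %-formatting avoids string interpolation when DEBUG is off.
    logger.debug("resolved remote_path=%s (bucket_prefixed_namespace)", remote_path)
    return remote_path

assert resolve_remote_path("kb1", "a.pdf") == "kb1/a.pdf"
```

Reusing the same `remote_path` variable in the operation and in the exception message keeps the logs consistent with what was actually attempted.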
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 57b8fe6c-badc-467c-b40e-4580c5c1a024

📥 Commits

Reviewing files that changed from the base of the PR and between e03212f and 9d3e63a.

📒 Files selected for processing (2)
  • rag/utils/azure_sas_conn.py
  • rag/utils/azure_spn_conn.py

Comment on lines 69 to 73

 def get(self, bucket, fnm):
     for _ in range(1):
         try:
-            r = self.conn.download_blob(fnm)
+            r = self.conn.download_blob(f"{bucket}/{fnm}")
             return r.read()

⚠️ Potential issue | 🟠 Major

Legacy blobs can become unreadable after this path migration.

Line 72 and Line 82 now only target "{bucket}/{fnm}". Existing unprefixed blobs (fnm) can no longer be found/read, which can break current read paths (e.g., api/utils/image_utils.py:23-27, rag/flow/parser/pdf_chunk_metadata.py:202-206).

Suggested compatibility fallback (read/exist)
 def get(self, bucket, fnm):
-    for _ in range(1):
-        try:
-            r = self.conn.download_blob(f"{bucket}/{fnm}")
-            return r.read()
-        except Exception:
-            logging.exception(f"fail get {bucket}/{fnm}")
-            self.__open__()
-            time.sleep(1)
+    candidates = [f"{bucket}/{fnm}", fnm]
+    for name in candidates:
+        try:
+            r = self.conn.download_blob(name)
+            return r.read()
+        except Exception:
+            continue
+    logging.exception(f"fail get {bucket}/{fnm}")
+    self.__open__()
+    time.sleep(1)
     return

 def obj_exist(self, bucket, fnm):
     try:
-        return self.conn.get_blob_client(f"{bucket}/{fnm}").exists()
+        return (
+            self.conn.get_blob_client(f"{bucket}/{fnm}").exists()
+            or self.conn.get_blob_client(fnm).exists()
+        )
     except Exception:
         logging.exception(f"Fail put {bucket}/{fnm}")
     return False

Also applies to: 80-83

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag/utils/azure_sas_conn.py` around lines 69 - 73, The get method in
rag/utils/azure_sas_conn.py currently only tries
self.conn.download_blob(f"{bucket}/{fnm}") which breaks legacy unprefixed blobs;
change get (and the corresponding existence check method that mirrors lines
~80-83) to first attempt the "{bucket}/{fnm}" path and if that fails (catch the
not-found/read error) fall back to trying the raw "fnm" path before giving up,
returning the blob bytes on success; do the same for the exists/existence helper
so callers like image_utils.py and pdf_chunk_metadata.py remain compatible with
legacy keys.

Comment on lines 90 to 95

 def get(self, bucket, fnm):
     for _ in range(1):
         try:
-            client = self.conn.get_file_client(fnm)
+            client = self.conn.get_file_client(f"{bucket}/{fnm}")
             r = client.download_file()
             return r.read()

⚠️ Potential issue | 🟠 Major

Read/existence paths need a legacy fallback to avoid data invisibility.

Line 93 and Line 104 now only use "{bucket}/{fnm}". Older unprefixed objects can become unreadable/uncheckable immediately after deploy.

Also applies to: 102-105

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag/utils/azure_spn_conn.py` around lines 90 - 95, The get method (and
corresponding existence check) currently only tries the prefixed path
f"{bucket}/{fnm}", which will make legacy unprefixed objects inaccessible;
update the get and exists logic in class methods (e.g., get and exists using
self.conn.get_file_client and client.download_file) to attempt the prefixed path
first and, on a NotFound/404 or other read failure, fall back to the legacy
unprefixed path (just f"{fnm}") before returning failure — catch the
download/get exceptions, try the alternate key, and return the result from
whichever succeeds (or propagate a clear error if both fail).
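
The SAS-side comment above includes a concrete suggested diff; for the SPN side, the same fallback idea can be sketched generically. `get_with_fallback` and the dict-backed store are illustrative only, not from the codebase:

```python
def get_with_fallback(download, bucket, fnm):
    """Try the new bucket-prefixed key first, then the legacy
    unprefixed key, as the review suggests for backward compatibility.

    `download` is any callable that returns bytes for an existing key
    and raises for a missing one (e.g. a wrapper around
    get_file_client(...).download_file().read()).
    """
    for key in (f"{bucket}/{fnm}", fnm):
        try:
            return download(key)
        except Exception:
            continue  # prefixed key missing or unreadable; try legacy key
    raise FileNotFoundError(f"{bucket}/{fnm}")

# Simulated store: one legacy (unprefixed) object, one migrated object.
store = {"old.pdf": b"legacy", "kb1/new.pdf": b"migrated"}
fetch = store.__getitem__  # raises KeyError for missing keys

assert get_with_fallback(fetch, "kb1", "new.pdf") == b"migrated"
assert get_with_fallback(fetch, "kb1", "old.pdf") == b"legacy"
```

Trying the prefixed key first means newly written objects never pay the extra round trip; only legacy reads fall through to the old key.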

@JinHai-CN JinHai-CN added the ci Continue Integration label May 7, 2026
@JinHai-CN JinHai-CN marked this pull request as draft May 7, 2026 07:34
@JinHai-CN JinHai-CN marked this pull request as ready for review May 7, 2026 07:34
@codecov

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.16%. Comparing base (c50028b) to head (b12eb75).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #14174   +/-   ##
=======================================
  Coverage   94.16%   94.16%           
=======================================
  Files          10       10           
  Lines         703      703           
  Branches      112      112           
=======================================
  Hits          662      662           
  Misses         25       25           
  Partials       16       16           

☔ View full report in Codecov by Sentry.

@JinHai-CN JinHai-CN merged commit 5c9124c into infiniflow:main May 7, 2026
2 checks passed

Labels

🐞 bug Something isn't working, pull request that fix bug. ci Continue Integration size:S This PR changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Azure Blob SPN storage: files from different datasets can overwrite each other due to missing bucket prefix

2 participants