
Fix: filter api in dataset document #14728

Merged
wangq8 merged 1 commit into infiniflow:main from Magicbook1108:main
May 9, 2026

Conversation

@Magicbook1108 (Contributor)

What problem does this PR solve?

Fix: filter api in dataset document

Type of change

  • Bug Fix (non-breaking change which fixes an issue)

@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 9, 2026
@Magicbook1108 Magicbook1108 added the ci Continue Integration label May 9, 2026
@dosubot dosubot Bot added the 🐞 bug Something isn't working, pull request that fix bug. label May 9, 2026

coderabbitai Bot commented May 9, 2026

📝 Walkthrough

The list_docs endpoint's filter response logic is refactored to delegate filter aggregation to a new request-validation helper. Instead of aggregating filters from fetched documents, a new _get_doc_filters_with_request function validates query parameters and retrieves aggregated filter data via DocumentService.get_filter_by_kb_id. The old client-side aggregation helper is removed.
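For context, the removed client-side aggregation can be pictured roughly like the sketch below. This is a hypothetical reconstruction, not the original _aggregate_filters: the field names (suffix, type, run) and the returned shape are assumptions based on the filter parameters described in this walkthrough.

```python
from collections import Counter

def aggregate_filters(docs):
    """Hypothetical reconstruction of removed client-side aggregation:
    tally suffix/type/run counts over the documents already fetched."""
    suffixes = Counter(d.get("suffix", "") for d in docs)
    types = Counter(d.get("type", "") for d in docs)
    runs = Counter(str(d.get("run", "")) for d in docs)
    return {"suffix": dict(suffixes), "type": dict(types), "run": dict(runs)}
```

The refactor moves this tallying into DocumentService.get_filter_by_kb_id, so the counts reflect the whole knowledge base rather than only the page of documents that happened to be fetched.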

Changes

Document Filter Path Refactoring (all in api/apps/restful_apis/document_api.py):

  • New filter retrieval helper: _get_doc_filters_with_request is added to parse the filter query parameters (keywords, suffix, types, run) with validation and type conversion, then fetch aggregated filter data via DocumentService.get_filter_by_kb_id.
  • List docs filter integration: list_docs branches early for type=filter requests and uses the new helper instead of aggregating from the returned documents; the default path continues enriching documents with thumbnail URLs, source_type normalization, and schema conversion.
  • Old aggregation removal: the _aggregate_filters(docs) function is removed; filter aggregation now happens server-side via DocumentService.get_filter_by_kb_id.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • infiniflow/ragflow#14230: Modifies the same document filter path in list_docs and _aggregate_filters, introducing local filter aggregation that this PR later replaces with server-side delegation.
  • infiniflow/ragflow#14248: Also modifies the filter-aggregation helper and list_docs type=filter handling in the same file.

Suggested labels

size:S

Suggested reviewers

  • yingfeng
  • yuzhichang
  • JinHai-CN

Poem

🐰 Filters now flow through the service gate,
No more docs in loops to aggregate!
A helper validates each query with care,
Server-side wisdom floats through the air. ✨

🚥 Pre-merge checks: 4 passed, 1 warning

❌ Failed checks (1 warning)

  • Description check (⚠️ Warning): The description includes only the problem statement and type of change, but lacks details on the solution, affected files, testing, or any additional context expected in the template. Resolution: add details on what the fix actually does, how it was tested, and any breaking changes or side effects to consider.

✅ Passed checks (4 passed)

  • Title check: The title 'Fix: filter api in dataset document' directly relates to the main change, which refactors filter handling in the document API by replacing the filter aggregation logic.
  • Docstring Coverage: Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Linked Issues check: Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: Skipped because no linked issues were found for this pull request.


@Magicbook1108 Magicbook1108 marked this pull request as draft May 9, 2026 06:04
@Magicbook1108 Magicbook1108 marked this pull request as ready for review May 9, 2026 06:04
@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label May 9, 2026
@wangq8 (Collaborator)

wangq8 commented May 9, 2026

Fix #14634

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8cd704f7-eebd-49e5-aaff-50f11643c4a0

📥 Commits

Reviewing files that changed from the base of the PR and between ee0de58 and 64f830f.

📒 Files selected for processing (1)
  • api/apps/restful_apis/document_api.py

Comment on lines +723 to +727

    if request.args.get("type") == "filter":
        err_code, err_msg, payload, total = _get_doc_filters_with_request(request, dataset_id)
        if err_code != RetCode.SUCCESS:
            return get_data_error_result(code=err_code, message=err_msg)
        return get_json_result(data={"total": total, "filter": payload})

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logs for the new type=filter flow and validation exits.

Lines 723-727 and lines 847-860 introduce a new request flow and early-return branches without any logging. Please add at least a debug log at entry and warning/error logs on validation failures for traceability.

Suggested patch
@@
 def list_docs(dataset_id, tenant_id):
@@
-    if request.args.get("type") == "filter":
+    if request.args.get("type") == "filter":
+        logging.debug("list_docs filter mode: dataset_id=%s, args=%s", dataset_id, dict(request.args))
         err_code, err_msg, payload, total = _get_doc_filters_with_request(request, dataset_id)
         if err_code != RetCode.SUCCESS:
+            logging.warning("list_docs filter mode validation failed: dataset_id=%s, error=%s", dataset_id, err_msg)
             return get_data_error_result(code=err_code, message=err_msg)
         return get_json_result(data={"total": total, "filter": payload})
@@
 def _get_doc_filters_with_request(req, dataset_id: str):
@@
     types = q.getlist("types")
     if types:
         invalid_types = {t for t in types if t not in VALID_FILE_TYPES}
         if invalid_types:
             msg = f"Invalid filter conditions: {', '.join(invalid_types)} type{'s' if len(invalid_types) > 1 else ''}"
+            logging.warning("invalid type filters: dataset_id=%s, invalid_types=%s", dataset_id, sorted(invalid_types))
             return RetCode.DATA_ERROR, msg, {}, 0
@@
     if run_status_converted:
         invalid_status = {s for s in run_status_converted if s not in run_status_text_to_numeric.values()}
         if invalid_status:
             msg = f"Invalid filter run status conditions: {', '.join(invalid_status)}"
+            logging.warning("invalid run filters: dataset_id=%s, invalid_status=%s", dataset_id, sorted(invalid_status))
             return RetCode.DATA_ERROR, msg, {}, 0

As per coding guidelines, **/*.py: Add logging for new flows.

Also applies to: 847-860

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/restful_apis/document_api.py` around lines 723 - 727, Add logging
for the new "type=filter" request flow: at entry to the branch that checks
request.args.get("type") == "filter" emit a debug log (including dataset_id and
request.args) and when calling _get_doc_filters_with_request log before/after
the call; if err_code != RetCode.SUCCESS emit a warning or error log that
includes err_code and err_msg before returning get_data_error_result; also add
analogous debug and validation-failure logs around the other early-return branch
referenced (lines ~847-860) so all validation exits are logged for traceability.
Ensure logs reference the request flow and include identifiers like dataset_id
and any relevant request params, and use the existing logger used elsewhere in
document_api.py.

Comment on lines +838 to +869

def _get_doc_filters_with_request(req, dataset_id: str):
    """Get aggregated document filters with request parameters from a dataset."""
    q = req.args

    keywords = q.get("keywords", "")

    suffix = q.getlist("suffix")

    types = q.getlist("types")
    if types:
        invalid_types = {t for t in types if t not in VALID_FILE_TYPES}
        if invalid_types:
            msg = f"Invalid filter conditions: {', '.join(invalid_types)} type{'s' if len(invalid_types) > 1 else ''}"
            return RetCode.DATA_ERROR, msg, {}, 0

    run_status = q.getlist("run")
    run_status_text_to_numeric = {"UNSTART": "0", "RUNNING": "1", "CANCEL": "2", "DONE": "3", "FAIL": "4"}
    run_status_converted = [run_status_text_to_numeric.get(v, v) for v in run_status]
    if run_status_converted:
        invalid_status = {s for s in run_status_converted if s not in run_status_text_to_numeric.values()}
        if invalid_status:
            msg = f"Invalid filter run status conditions: {', '.join(invalid_status)}"
            return RetCode.DATA_ERROR, msg, {}, 0

    docs_filter, total = DocumentService.get_filter_by_kb_id(
        dataset_id,
        keywords,
        run_status_converted,
        types,
        suffix,
    )
    return RetCode.SUCCESS, "", docs_filter, total
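The run-status normalization in this helper can be exercised standalone. The sketch below copies the text-to-numeric mapping from the snippet above; the helper name and the (converted, invalid) return shape are our own framing, not part of the PR.

```python
# Mapping copied from _get_doc_filters_with_request in the snippet above.
RUN_STATUS_TEXT_TO_NUMERIC = {"UNSTART": "0", "RUNNING": "1", "CANCEL": "2", "DONE": "3", "FAIL": "4"}

def normalize_run_status(values):
    """Map textual run statuses to their numeric codes, pass numeric
    strings through unchanged, and collect values matching neither form."""
    converted = [RUN_STATUS_TEXT_TO_NUMERIC.get(v, v) for v in values]
    invalid = {s for s in converted if s not in RUN_STATUS_TEXT_TO_NUMERIC.values()}
    return converted, invalid
```

Note that because unknown values pass through the mapping untouched, they only surface in the second membership check; this is why the endpoint can accept both ?run=DONE and ?run=3 while still rejecting typos.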

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "1) Inspect filter service API surface"
rg -n --type=py -C3 '\bdef\s+get_filter_by_kb_id\s*\('

echo
echo "2) Compare list-docs parsing logic vs filter parsing logic"
rg -n --type=py -C2 '_get_docs_with_request|_get_doc_filters_with_request|create_time_from|metadata_condition|return_empty_metadata|doc_ids|name=doc_name'

echo
echo "3) Check tests/usages that combine type=filter with metadata/time/id constraints"
rg -n -C2 'type=filter|metadata_condition|create_time_from|create_time_to|\\bids\\b|\\bid\\b'

Repository: infiniflow/ragflow

Length of output: 50375


🏁 Script executed:

# Find where _get_doc_filters_with_request is called from (likely in routes)
rg -n --type=py '_get_doc_filters_with_request' -B 5 -A 10

# Find the endpoint/route definition that calls this function
rg -n --type=py 'def.*filter.*\(|@.*\.route.*filter' -A 5 api/apps/restful_apis/

# Check what parameters the API endpoint actually accepts from request
rg -n --type=py 'q\.get|q\.getlist' api/apps/restful_apis/document_api.py | head -40

Repository: infiniflow/ragflow

Length of output: 5846


🏁 Script executed:

# Check if metadata_condition is parsed in _get_doc_filters_with_request
sed -n '838,870p' api/apps/restful_apis/document_api.py

# Check what _parse_doc_id_filter_with_metadata does and if it's used
rg -n '_parse_doc_id_filter_with_metadata' api/apps/restful_apis/document_api.py -A 30 | head -50

# Check tests that combine type=filter with metadata or time params
rg -n 'type=filter.*create_time|type=filter.*metadata|type=filter.*\bid\b|type=filter.*\bids\b' test/ sdk/ -r 2>/dev/null | head -20

Repository: infiniflow/ragflow

Length of output: 4298


🏁 Script executed:

# Check the full implementation of _parse_doc_id_filter_with_metadata to understand metadata handling
sed -n '871,950p' api/apps/restful_apis/document_api.py

# Check DocumentService.get_filter_by_kb_id full implementation to see if it supports additional params
sed -n '185,250p' api/db/services/document_service.py

# Check if there are any tests that use type=filter with metadata/time/id params
rg -n 'type.*filter' test/testcases/ sdk/python/test/ -A 5 -B 2 | grep -E '(type|filter|metadata|create_time|\.get|\.post)' | head -40

Repository: infiniflow/ragflow

Length of output: 7736


🏁 Script executed:

# Verify if the filter endpoint query parameters are actually accepted/documented
# by checking the route definition for the documents endpoint
rg -n '@.*\.route.*documents' api/apps/restful_apis/document_api.py -B 5 -A 20 | head -80

# Check if there are any tests that combine type=filter with metadata or time constraints
rg -n 'document_filter.*metadata|document_filter.*create_time|document_filter.*\bid\b|document_filter.*\bids\b' test/ -r -A 5 -B 5

Repository: infiniflow/ragflow

Length of output: 3198


🏁 Script executed:

# Check the list_documents endpoint signature and documentation
sed -n '703,750p' api/apps/restful_apis/document_api.py

# Search more carefully for tests using document_filter
rg -n 'document_filter' test/ -A 3 -B 3 | grep -E '(metadata|create_time|id|ids|name)' | head -30

# Check the route documentation/swagger for the documents endpoint
rg -n 'parameters:' api/apps/restful_apis/document_api.py -A 100 | grep -A 100 'def list_documents' | head -80

Repository: infiniflow/ragflow

Length of output: 5877


🏁 Script executed:

# Look for any test that calls document_filter with parameters besides the basic ones
rg -n 'document_filter.*\{' test/ -A 2 -B 1 | head -50

# Check the route documentation to see what parameters are documented as supported
sed -n '650,750p' api/apps/restful_apis/document_api.py

Repository: infiniflow/ragflow

Length of output: 5530


🏁 Script executed:

# Check if _get_doc_filters_with_request is newly added or if it's part of old code
git log -p --all -S '_get_doc_filters_with_request' -- api/apps/restful_apis/document_api.py 2>/dev/null | head -50 || echo "Git history not available"

# Check logging in the file
rg -n 'logging\.' api/apps/restful_apis/document_api.py | wc -l

# Check if there's logging in _get_doc_filters_with_request
sed -n '838,870p' api/apps/restful_apis/document_api.py | grep -i log

Repository: infiniflow/ragflow

Length of output: 65


🏁 Script executed:

# Check if there's logging specifically in _get_doc_filters_with_request and _get_docs_with_request
echo "=== Logging in _get_docs_with_request ==="
sed -n '774,835p' api/apps/restful_apis/document_api.py | grep -i log

echo
echo "=== Logging in _get_doc_filters_with_request ==="
sed -n '838,870p' api/apps/restful_apis/document_api.py | grep -i log

# Check if there's ANY logging about which path is taken
sed -n '720,730p' api/apps/restful_apis/document_api.py

Repository: infiniflow/ragflow

Length of output: 759


🏁 Script executed:

# Verify the test actually passes ids to the filter endpoint and check if it expects them to be processed
sed -n '213,220p' test/testcases/test_web_api/test_document_app/test_document_metadata.py

# Check if there are any comments in the code explaining why filter endpoint doesn't support id/ids filtering
sed -n '720,870p' api/apps/restful_apis/document_api.py | grep -i 'filter\|ignore\|note\|todo\|fixme' -A 2 -B 2

# Verify the guidelines requirement about logging
grep -r 'Add logging' .

Repository: infiniflow/ragflow

Length of output: 4908


type=filter endpoint silently ignores metadata_condition, id/ids, name, and create_time_* query parameters.

The _get_doc_filters_with_request function (lines 838–869) only processes keywords, suffix, types, and run_status, then calls DocumentService.get_filter_by_kb_id() with only these parameters. In contrast, _get_docs_with_request (lines 774–835) respects metadata filtering, document ID constraints, time range filtering, and name filters via _parse_doc_id_filter_with_metadata and direct parameter parsing. This means when a caller passes ?type=filter&ids=... or ?type=filter&create_time_from=X, those constraints are silently dropped and the filter payload returns dataset-wide aggregations instead of the scoped counts—a behavioral inconsistency with the document listing path.

Additionally, this new flow lacks logging as required by coding guidelines (**/*.py: Add logging for new flows).
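One of the dropped constraints, the create_time_from/create_time_to range, could be parsed along the lines below. This is a hedged sketch only: the parameter names mirror the listing path described in the finding, while the helper name and error handling are assumptions, and the values would still need to be threaded into DocumentService.get_filter_by_kb_id.

```python
def parse_create_time_range(q):
    """Parse optional epoch-millisecond bounds from query args (any
    dict-like with .get); returns (from_ts, to_ts), raising ValueError
    on non-numeric input so the caller can return a DATA_ERROR."""
    def _to_int(name):
        raw = q.get(name)
        if raw in (None, ""):
            return None  # absent or empty means "no bound on this side"
        try:
            return int(raw)
        except (TypeError, ValueError):
            raise ValueError(f"Invalid {name}: {raw!r}")
    return _to_int("create_time_from"), _to_int("create_time_to")
```

Parsing the same parameters in both code paths (or sharing one parser) would keep the filter payload's counts consistent with what the listing path actually returns.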

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/restful_apis/document_api.py` around lines 838 - 869,
_get_doc_filters_with_request currently only reads keywords, suffix, types, and
run status and calls DocumentService.get_filter_by_kb_id, which causes it to
ignore metadata_condition, id/ids, name, and create_time_* filters and drop
scoping used by the document listing flow; update _get_doc_filters_with_request
to parse the same filters as _get_docs_with_request (reuse
_parse_doc_id_filter_with_metadata and the create_time/name parsing logic) and
pass the resulting id list, metadata_condition, name, and time-range parameters
into DocumentService.get_filter_by_kb_id (or extend that method if needed) so
the filter aggregation respects the caller’s constraints, and add appropriate
logging statements for the new flow consistent with the project logging
guidelines.

@codecov
codecov Bot commented May 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.16%. Comparing base (4f3711d) to head (64f830f).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #14728   +/-   ##
=======================================
  Coverage   94.16%   94.16%           
=======================================
  Files          10       10           
  Lines         703      703           
  Branches      112      112           
=======================================
  Hits          662      662           
  Misses         25       25           
  Partials       16       16           

☔ View full report in Codecov by Sentry.

@wangq8 wangq8 merged commit f7e8c39 into infiniflow:main May 9, 2026
2 checks passed

Labels

  • 🐞 bug: Something isn't working, pull request that fix bug.
  • ci: Continue Integration
  • lgtm: This PR has been approved by a maintainer
  • size:L: This PR changes 100-499 lines, ignoring generated files.

2 participants