
Fix: filter api in dataset document #14728

Merged
wangq8 merged 1 commit into infiniflow:main from Magicbook1108:main
May 9, 2026

Conversation

@Magicbook1108 (Contributor)

What problem does this PR solve?

Fix: filter api in dataset document

Type of change

  • Bug Fix (non-breaking change which fixes an issue)

@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 9, 2026
@Magicbook1108 Magicbook1108 added the ci Continue Integration label May 9, 2026
@dosubot dosubot Bot added the 🐞 bug Something isn't working, pull request that fix bug. label May 9, 2026

coderabbitai Bot commented May 9, 2026

📝 Walkthrough

The list_docs endpoint's filter response logic is refactored to delegate filter aggregation to a new request-validation helper. Instead of aggregating filters from fetched documents, a new _get_doc_filters_with_request function validates query parameters and retrieves aggregated filter data via DocumentService.get_filter_by_kb_id. The old client-side aggregation helper is removed.
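For context, the removed client-side aggregation can be pictured roughly like the sketch below. This is a hypothetical reconstruction, not the original _aggregate_filters: the field names (suffix, type, run) and the returned shape are assumptions based on the filter parameters described in this walkthrough.

```python
from collections import Counter

def aggregate_filters(docs):
    """Hypothetical reconstruction of removed client-side aggregation:
    tally suffix/type/run counts over the documents already fetched."""
    suffixes = Counter(d.get("suffix", "") for d in docs)
    types = Counter(d.get("type", "") for d in docs)
    runs = Counter(str(d.get("run", "")) for d in docs)
    return {"suffix": dict(suffixes), "type": dict(types), "run": dict(runs)}
```

The refactor moves this tallying into DocumentService.get_filter_by_kb_id, so the counts reflect the whole knowledge base rather than only the page of documents that happened to be fetched.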

Changes

Document Filter Path Refactoring (all in api/apps/restful_apis/document_api.py):

  • New filter retrieval helper: _get_doc_filters_with_request is added to parse the filter query parameters (keywords, suffix, types, run) with validation and type conversion, then fetch aggregated filter data via DocumentService.get_filter_by_kb_id.
  • List docs filter integration: list_docs branches early for type=filter requests and uses the new helper instead of aggregating from the returned documents; the default path continues enriching documents with thumbnail URLs, source_type normalization, and schema conversion.
  • Old aggregation removal: the _aggregate_filters(docs) function is removed; filter aggregation now happens server-side via DocumentService.get_filter_by_kb_id.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • infiniflow/ragflow#14230: Modifies the same document filter path in list_docs and _aggregate_filters, introducing local filter aggregation that this PR later replaces with server-side delegation.
  • infiniflow/ragflow#14248: Also modifies the filter-aggregation helper and list_docs type=filter handling in the same file.

Suggested labels

size:S

Suggested reviewers

  • yingfeng
  • yuzhichang
  • JinHai-CN

Poem

🐰 Filters now flow through the service gate,
No more docs in loops to aggregate!
A helper validates each query with care,
Server-side wisdom floats through the air. ✨

🚥 Pre-merge checks: 4 passed, 1 warning

❌ Failed checks (1 warning)

  • Description check (⚠️ Warning): The description includes only the problem statement and type of change, but lacks details on the solution, affected files, testing, or any additional context expected in the template. Resolution: add details on what the fix actually does, how it was tested, and any breaking changes or side effects to consider.

✅ Passed checks (4 passed)

  • Title check: The title 'Fix: filter api in dataset document' directly relates to the main change, which refactors filter handling in the document API by replacing the filter aggregation logic.
  • Docstring Coverage: Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Linked Issues check: Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: Skipped because no linked issues were found for this pull request.


@Magicbook1108 Magicbook1108 marked this pull request as draft May 9, 2026 06:04
@Magicbook1108 Magicbook1108 marked this pull request as ready for review May 9, 2026 06:04
@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label May 9, 2026
@wangq8 (Collaborator)

wangq8 commented May 9, 2026

Fix #14634

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8cd704f7-eebd-49e5-aaff-50f11643c4a0

📥 Commits

Reviewing files that changed from the base of the PR and between ee0de58 and 64f830f.

📒 Files selected for processing (1)
  • api/apps/restful_apis/document_api.py

Comment on lines +723 to +727

    if request.args.get("type") == "filter":
        err_code, err_msg, payload, total = _get_doc_filters_with_request(request, dataset_id)
        if err_code != RetCode.SUCCESS:
            return get_data_error_result(code=err_code, message=err_msg)
        return get_json_result(data={"total": total, "filter": payload})

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logs for the new type=filter flow and validation exits.

Lines 723-727 and lines 847-860 introduce a new request flow and early-return branches without any logging. Please add at least a debug log at entry and warning/error logs on validation failures for traceability.

Suggested patch
@@
 def list_docs(dataset_id, tenant_id):
@@
-    if request.args.get("type") == "filter":
+    if request.args.get("type") == "filter":
+        logging.debug("list_docs filter mode: dataset_id=%s, args=%s", dataset_id, dict(request.args))
         err_code, err_msg, payload, total = _get_doc_filters_with_request(request, dataset_id)
         if err_code != RetCode.SUCCESS:
+            logging.warning("list_docs filter mode validation failed: dataset_id=%s, error=%s", dataset_id, err_msg)
             return get_data_error_result(code=err_code, message=err_msg)
         return get_json_result(data={"total": total, "filter": payload})
@@
 def _get_doc_filters_with_request(req, dataset_id: str):
@@
     types = q.getlist("types")
     if types:
         invalid_types = {t for t in types if t not in VALID_FILE_TYPES}
         if invalid_types:
             msg = f"Invalid filter conditions: {', '.join(invalid_types)} type{'s' if len(invalid_types) > 1 else ''}"
+            logging.warning("invalid type filters: dataset_id=%s, invalid_types=%s", dataset_id, sorted(invalid_types))
             return RetCode.DATA_ERROR, msg, {}, 0
@@
     if run_status_converted:
         invalid_status = {s for s in run_status_converted if s not in run_status_text_to_numeric.values()}
         if invalid_status:
             msg = f"Invalid filter run status conditions: {', '.join(invalid_status)}"
+            logging.warning("invalid run filters: dataset_id=%s, invalid_status=%s", dataset_id, sorted(invalid_status))
             return RetCode.DATA_ERROR, msg, {}, 0

As per coding guidelines, **/*.py: Add logging for new flows.

Also applies to: 847-860

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/restful_apis/document_api.py` around lines 723 - 727, Add logging
for the new "type=filter" request flow: at entry to the branch that checks
request.args.get("type") == "filter" emit a debug log (including dataset_id and
request.args) and when calling _get_doc_filters_with_request log before/after
the call; if err_code != RetCode.SUCCESS emit a warning or error log that
includes err_code and err_msg before returning get_data_error_result; also add
analogous debug and validation-failure logs around the other early-return branch
referenced (lines ~847-860) so all validation exits are logged for traceability.
Ensure logs reference the request flow and include identifiers like dataset_id
and any relevant request params, and use the existing logger used elsewhere in
document_api.py.

Comment on lines +838 to +869

def _get_doc_filters_with_request(req, dataset_id: str):
    """Get aggregated document filters with request parameters from a dataset."""
    q = req.args

    keywords = q.get("keywords", "")

    suffix = q.getlist("suffix")

    types = q.getlist("types")
    if types:
        invalid_types = {t for t in types if t not in VALID_FILE_TYPES}
        if invalid_types:
            msg = f"Invalid filter conditions: {', '.join(invalid_types)} type{'s' if len(invalid_types) > 1 else ''}"
            return RetCode.DATA_ERROR, msg, {}, 0

    run_status = q.getlist("run")
    run_status_text_to_numeric = {"UNSTART": "0", "RUNNING": "1", "CANCEL": "2", "DONE": "3", "FAIL": "4"}
    run_status_converted = [run_status_text_to_numeric.get(v, v) for v in run_status]
    if run_status_converted:
        invalid_status = {s for s in run_status_converted if s not in run_status_text_to_numeric.values()}
        if invalid_status:
            msg = f"Invalid filter run status conditions: {', '.join(invalid_status)}"
            return RetCode.DATA_ERROR, msg, {}, 0

    docs_filter, total = DocumentService.get_filter_by_kb_id(
        dataset_id,
        keywords,
        run_status_converted,
        types,
        suffix,
    )
    return RetCode.SUCCESS, "", docs_filter, total
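The run-status normalization in this helper can be exercised standalone. The sketch below copies the text-to-numeric mapping from the snippet above; the helper name and the (converted, invalid) return shape are our own framing, not part of the PR.

```python
# Mapping copied from _get_doc_filters_with_request in the snippet above.
RUN_STATUS_TEXT_TO_NUMERIC = {"UNSTART": "0", "RUNNING": "1", "CANCEL": "2", "DONE": "3", "FAIL": "4"}

def normalize_run_status(values):
    """Map textual run statuses to their numeric codes, pass numeric
    strings through unchanged, and collect values matching neither form."""
    converted = [RUN_STATUS_TEXT_TO_NUMERIC.get(v, v) for v in values]
    invalid = {s for s in converted if s not in RUN_STATUS_TEXT_TO_NUMERIC.values()}
    return converted, invalid
```

Note that because unknown values pass through the mapping untouched, they only surface in the second membership check; this is why the endpoint can accept both ?run=DONE and ?run=3 while still rejecting typos.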

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "1) Inspect filter service API surface"
rg -n --type=py -C3 '\bdef\s+get_filter_by_kb_id\s*\('

echo
echo "2) Compare list-docs parsing logic vs filter parsing logic"
rg -n --type=py -C2 '_get_docs_with_request|_get_doc_filters_with_request|create_time_from|metadata_condition|return_empty_metadata|doc_ids|name=doc_name'

echo
echo "3) Check tests/usages that combine type=filter with metadata/time/id constraints"
rg -n -C2 'type=filter|metadata_condition|create_time_from|create_time_to|\\bids\\b|\\bid\\b'

Repository: infiniflow/ragflow

Length of output: 50375


🏁 Script executed:

# Find where _get_doc_filters_with_request is called from (likely in routes)
rg -n --type=py '_get_doc_filters_with_request' -B 5 -A 10

# Find the endpoint/route definition that calls this function
rg -n --type=py 'def.*filter.*\(|@.*\.route.*filter' -A 5 api/apps/restful_apis/

# Check what parameters the API endpoint actually accepts from request
rg -n --type=py 'q\.get|q\.getlist' api/apps/restful_apis/document_api.py | head -40

Repository: infiniflow/ragflow

Length of output: 5846


🏁 Script executed:

# Check if metadata_condition is parsed in _get_doc_filters_with_request
sed -n '838,870p' api/apps/restful_apis/document_api.py

# Check what _parse_doc_id_filter_with_metadata does and if it's used
rg -n '_parse_doc_id_filter_with_metadata' api/apps/restful_apis/document_api.py -A 30 | head -50

# Check tests that combine type=filter with metadata or time params
rg -n 'type=filter.*create_time|type=filter.*metadata|type=filter.*\bid\b|type=filter.*\bids\b' test/ sdk/ -r 2>/dev/null | head -20

Repository: infiniflow/ragflow

Length of output: 4298


🏁 Script executed:

# Check the full implementation of _parse_doc_id_filter_with_metadata to understand metadata handling
sed -n '871,950p' api/apps/restful_apis/document_api.py

# Check DocumentService.get_filter_by_kb_id full implementation to see if it supports additional params
sed -n '185,250p' api/db/services/document_service.py

# Check if there are any tests that use type=filter with metadata/time/id params
rg -n 'type.*filter' test/testcases/ sdk/python/test/ -A 5 -B 2 | grep -E '(type|filter|metadata|create_time|\.get|\.post)' | head -40

Repository: infiniflow/ragflow

Length of output: 7736


🏁 Script executed:

# Verify if the filter endpoint query parameters are actually accepted/documented
# by checking the route definition for the documents endpoint
rg -n '@.*\.route.*documents' api/apps/restful_apis/document_api.py -B 5 -A 20 | head -80

# Check if there are any tests that combine type=filter with metadata or time constraints
rg -n 'document_filter.*metadata|document_filter.*create_time|document_filter.*\bid\b|document_filter.*\bids\b' test/ -r -A 5 -B 5

Repository: infiniflow/ragflow

Length of output: 3198


🏁 Script executed:

# Check the list_documents endpoint signature and documentation
sed -n '703,750p' api/apps/restful_apis/document_api.py

# Search more carefully for tests using document_filter
rg -n 'document_filter' test/ -A 3 -B 3 | grep -E '(metadata|create_time|id|ids|name)' | head -30

# Check the route documentation/swagger for the documents endpoint
rg -n 'parameters:' api/apps/restful_apis/document_api.py -A 100 | grep -A 100 'def list_documents' | head -80

Repository: infiniflow/ragflow

Length of output: 5877


🏁 Script executed:

# Look for any test that calls document_filter with parameters besides the basic ones
rg -n 'document_filter.*\{' test/ -A 2 -B 1 | head -50

# Check the route documentation to see what parameters are documented as supported
sed -n '650,750p' api/apps/restful_apis/document_api.py

Repository: infiniflow/ragflow

Length of output: 5530


🏁 Script executed:

# Check if _get_doc_filters_with_request is newly added or if it's part of old code
git log -p --all -S '_get_doc_filters_with_request' -- api/apps/restful_apis/document_api.py 2>/dev/null | head -50 || echo "Git history not available"

# Check logging in the file
rg -n 'logging\.' api/apps/restful_apis/document_api.py | wc -l

# Check if there's logging in _get_doc_filters_with_request
sed -n '838,870p' api/apps/restful_apis/document_api.py | grep -i log

Repository: infiniflow/ragflow

Length of output: 65


🏁 Script executed:

# Check if there's logging specifically in _get_doc_filters_with_request and _get_docs_with_request
echo "=== Logging in _get_docs_with_request ==="
sed -n '774,835p' api/apps/restful_apis/document_api.py | grep -i log

echo
echo "=== Logging in _get_doc_filters_with_request ==="
sed -n '838,870p' api/apps/restful_apis/document_api.py | grep -i log

# Check if there's ANY logging about which path is taken
sed -n '720,730p' api/apps/restful_apis/document_api.py

Repository: infiniflow/ragflow

Length of output: 759


🏁 Script executed:

# Verify the test actually passes ids to the filter endpoint and check if it expects them to be processed
sed -n '213,220p' test/testcases/test_web_api/test_document_app/test_document_metadata.py

# Check if there are any comments in the code explaining why filter endpoint doesn't support id/ids filtering
sed -n '720,870p' api/apps/restful_apis/document_api.py | grep -i 'filter\|ignore\|note\|todo\|fixme' -A 2 -B 2

# Verify the guidelines requirement about logging
grep -r 'Add logging' .

Repository: infiniflow/ragflow

Length of output: 4908


type=filter endpoint silently ignores metadata_condition, id/ids, name, and create_time_* query parameters.

The _get_doc_filters_with_request function (lines 838–869) only processes keywords, suffix, types, and run_status, then calls DocumentService.get_filter_by_kb_id() with only these parameters. In contrast, _get_docs_with_request (lines 774–835) respects metadata filtering, document ID constraints, time range filtering, and name filters via _parse_doc_id_filter_with_metadata and direct parameter parsing. This means when a caller passes ?type=filter&ids=... or ?type=filter&create_time_from=X, those constraints are silently dropped and the filter payload returns dataset-wide aggregations instead of the scoped counts—a behavioral inconsistency with the document listing path.

Additionally, this new flow lacks logging as required by coding guidelines (**/*.py: Add logging for new flows).
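One of the dropped constraints, the create_time_from/create_time_to range, could be parsed along the lines below. This is a hedged sketch only: the parameter names mirror the listing path described in the finding, while the helper name and error handling are assumptions, and the values would still need to be threaded into DocumentService.get_filter_by_kb_id.

```python
def parse_create_time_range(q):
    """Parse optional epoch-millisecond bounds from query args (any
    dict-like with .get); returns (from_ts, to_ts), raising ValueError
    on non-numeric input so the caller can return a DATA_ERROR."""
    def _to_int(name):
        raw = q.get(name)
        if raw in (None, ""):
            return None  # absent or empty means "no bound on this side"
        try:
            return int(raw)
        except (TypeError, ValueError):
            raise ValueError(f"Invalid {name}: {raw!r}")
    return _to_int("create_time_from"), _to_int("create_time_to")
```

Parsing the same parameters in both code paths (or sharing one parser) would keep the filter payload's counts consistent with what the listing path actually returns.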

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/restful_apis/document_api.py` around lines 838 - 869,
_get_doc_filters_with_request currently only reads keywords, suffix, types, and
run status and calls DocumentService.get_filter_by_kb_id, which causes it to
ignore metadata_condition, id/ids, name, and create_time_* filters and drop
scoping used by the document listing flow; update _get_doc_filters_with_request
to parse the same filters as _get_docs_with_request (reuse
_parse_doc_id_filter_with_metadata and the create_time/name parsing logic) and
pass the resulting id list, metadata_condition, name, and time-range parameters
into DocumentService.get_filter_by_kb_id (or extend that method if needed) so
the filter aggregation respects the caller’s constraints, and add appropriate
logging statements for the new flow consistent with the project logging
guidelines.

@codecov
codecov Bot commented May 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.16%. Comparing base (4f3711d) to head (64f830f).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #14728   +/-   ##
=======================================
  Coverage   94.16%   94.16%           
=======================================
  Files          10       10           
  Lines         703      703           
  Branches      112      112           
=======================================
  Hits          662      662           
  Misses         25       25           
  Partials       16       16           

☔ View full report in Codecov by Sentry.

@wangq8 wangq8 merged commit f7e8c39 into infiniflow:main May 9, 2026
2 checks passed

Labels

  • 🐞 bug: Something isn't working, pull request that fix bug.
  • ci: Continue Integration
  • lgtm: This PR has been approved by a maintainer
  • size:L: This PR changes 100-499 lines, ignoring generated files.

2 participants