feat(grep): integrate VikingDB bm25 keyword search for grep engine by ByteDanceLiuYang · Pull Request #2144 · volcengine/OpenViking

ByteDanceLiuYang · 2026-05-20T08:36:14Z

Summary

The existing grep path performs filesystem traversal: walking the directory tree, reading candidate files, and applying regex matching line by line. On large resource trees, this can become prohibitively slow because the cost grows with the number of files that must be scanned.

This PR introduces an adaptive two-phase grep strategy for VikingDB / Volcengine vector-store backends:

Use VikingDB FullText / BM25 keyword search as a coarse phase-1 recall step to narrow down candidate files.
Run the existing local filesystem regex matching on the recalled candidate files to preserve exact grep semantics for returned matches.

The public grep API remains stable. HTTP API, Python SDK, Go SDK, and Rust CLI do not expose extra per-request engine or BM25 tuning parameters. The acceleration is controlled by server-side ov.conf, and engine=auto falls back to filesystem grep whenever VikingDB BM25 is unavailable or not suitable.

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring (no functional changes)
Performance improvement
Test update

Feature Usage

Public API / SDK / CLI

The public grep interface remains unchanged:

HTTP API: POST /api/v1/search/grep
Python SDK: client.grep(uri, pattern, case_insensitive=..., node_limit=..., exclude_uri=..., level_limit=...)
Go SDK: client.Grep(ctx, uri, pattern, &openviking.GrepOptions{...})
Rust CLI: ov grep "pattern" --uri viking://resources/...

No public remote_return_limit parameter is exposed.

Server-side Configuration (`ov.conf`)

VikingDB BM25 acceleration is controlled by grep config in ov.conf:

{
  "grep": {
    "engine": "auto",
    "switch_to_remote_threshold": 10000
  }
}

| Parameter | Type | Default | Constraints | Description |
| --- | --- | --- | --- | --- |
| `engine` | `str` | `"auto"` | `"auto"` or `"fs"` | Search engine mode. `"auto"` uses VikingDB BM25 recall when available and falls back to filesystem grep. `"fs"` forces filesystem grep only. |
| `switch_to_remote_threshold` | `int` | `10000` | `>= 0` | L2 record count threshold for switching to VikingDB BM25 phase-1 recall. If the number of files under the search scope reaches this threshold, auto mode uses VikingDB BM25 when available. Set to `0` to always try VikingDB BM25 when available. |
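
The defaults and constraints in the table can be checked with a small validator. This is a minimal sketch, not the PR's `GrepConfig`: the names `GrepConfigSketch` and `load_grep_config` are hypothetical, and the sketch assumes the config text is JSON shaped like the example above.

```python
import json
from dataclasses import dataclass

ALLOWED_ENGINES = {"auto", "fs"}


@dataclass(frozen=True)
class GrepConfigSketch:
    """Hypothetical stand-in that mirrors the documented defaults and constraints."""

    engine: str = "auto"
    switch_to_remote_threshold: int = 10000

    def __post_init__(self):
        # Reject values outside the documented constraints.
        if self.engine not in ALLOWED_ENGINES:
            raise ValueError(f"engine must be one of {sorted(ALLOWED_ENGINES)}")
        if self.switch_to_remote_threshold < 0:
            raise ValueError("switch_to_remote_threshold must be >= 0")


def load_grep_config(conf_text: str) -> GrepConfigSketch:
    """Read the optional "grep" section; missing keys fall back to the defaults."""
    section = json.loads(conf_text).get("grep", {})
    return GrepConfigSketch(**section)
```

With this sketch, a config that omits the `grep` section resolves to `engine="auto"` and a threshold of `10000`.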

Internal BM25 Recall Limit

remote_return_limit is intentionally internal only. It is derived from the user-facing node_limit:

If node_limit is set: min(node_limit * 5, 100000)
If node_limit is unset: 100000

This avoids exposing another tuning knob while still giving the second-stage regex matcher enough BM25 candidates to reduce false truncation.
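
The derivation rule above can be written as a few lines of Python. This is an illustrative sketch only: the function name and constants are hypothetical, and it treats "unset" as `None`.

```python
from typing import Optional

RECALL_MULTIPLIER = 5   # candidates requested per wanted result node
RECALL_CAP = 100_000    # upper bound on phase-1 candidates


def derive_remote_return_limit(node_limit: Optional[int]) -> int:
    """Return the phase-1 BM25 candidate limit for a given node_limit."""
    if node_limit is None:
        # No user limit: request as many candidates as the cap allows.
        return RECALL_CAP
    return min(node_limit * RECALL_MULTIPLIER, RECALL_CAP)
```

For the benchmark setting `node_limit=256`, this gives a recall limit of 1280 candidates.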

Usage Examples

1. Configure VikingDB / Volcengine backend

The storage.vectordb section must use volcengine or vikingdb backend to enable BM25 recall. Example:

{
  "storage": {
    "vectordb": {
      "backend": "volcengine",
      "volcengine": {
        "ak": "YOUR_AK",
        "sk": "YOUR_SK",
        "region": "cn-beijing"
      },
      "name": "my_collection_for_ov",
      "index_name": "my_index_1"
    }
  },
  "grep": {
    "engine": "auto",
    "switch_to_remote_threshold": 10000
  }
}

Note: The collection must include the content text field and FullText config. Older existing collections that do not have this schema automatically fall back to filesystem grep in auto mode. To enable VikingDB-based grep for those collections, recreate or rebuild the collection with the new schema.

2. Basic grep, auto mode

ov --account default --user default grep --uri viking://resources/code 'VikingDB'

In default engine=auto mode, OpenViking checks:

whether the vector store is available;
whether the backend supports VikingDB / Volcengine keyword search;
whether the collection has the content field and FullText config;
whether the L2 file count under the search scope reaches switch_to_remote_threshold.

If all checks pass, grep uses VikingDB BM25 phase-1 recall plus local filesystem regex matching. Otherwise it uses filesystem grep.

3. Force filesystem grep

Set engine to fs in ov.conf:

{
  "grep": {
    "engine": "fs",
    "switch_to_remote_threshold": 10000
  }
}

Then use the same grep command:

ov --account default --user default grep --uri viking://resources/code 'VikingDB'

4. Always try VikingDB BM25 when available

Set switch_to_remote_threshold to 0 in ov.conf:

{
  "grep": {
    "engine": "auto",
    "switch_to_remote_threshold": 0
  }
}

Changes Made

1. Grep Engine Dispatch

| Mode | Behavior |
| --- | --- |
| `auto` | Adaptive mode. Checks vector-store availability, backend type, collection FullText capability, and data-volume threshold. Uses VikingDB BM25 + filesystem regex matching only when all checks pass. |
| `fs` | Forces the original filesystem grep path. |

Auto mode decision chain:

vector store unavailable → filesystem grep
backend is not volcengine / vikingdb → filesystem grep
collection lacks content field or FullText config → filesystem grep
switch_to_remote_threshold == 0 → VikingDB BM25 + filesystem regex matching
file count under scope < switch_to_remote_threshold → filesystem grep
otherwise → VikingDB BM25 + filesystem regex matching
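
The decision chain above maps directly onto a sequence of early returns. The sketch below is an assumption about shape, not the PR's code: `ScopeState` and `choose_grep_path` are hypothetical names, and the checks are passed in as plain values rather than queried from a real vector store.

```python
from dataclasses import dataclass


@dataclass
class ScopeState:
    """Hypothetical snapshot of the facts auto mode needs to decide."""

    vector_store_available: bool
    backend: str
    has_fulltext_content: bool
    file_count: int


def choose_grep_path(engine: str, threshold: int, state: ScopeState) -> str:
    """Return "fs" or "bm25+fs" following the documented decision chain."""
    if engine == "fs":
        return "fs"
    if not state.vector_store_available:
        return "fs"
    if state.backend not in ("volcengine", "vikingdb"):
        return "fs"
    if not state.has_fulltext_content:
        return "fs"
    if threshold == 0:
        # A threshold of 0 means: always try BM25 when it is available.
        return "bm25+fs"
    if state.file_count < threshold:
        return "fs"
    return "bm25+fs"
```

Ordering the capability checks before the threshold check keeps the `0` special case from bypassing the fallback for unsupported backends or old collections.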

2. VikingDB BM25 + Filesystem Regex Pipeline

_grep_vikingdb_then_fs() performs phase-1 keyword recall through vector_store.search_by_keywords().
BM25 recall returns candidate file URIs only.
The existing local regex matcher then reads recalled files and applies the exact regex pattern.
Final returned matches still come from local regex matching, not directly from BM25.
If the VikingDB recall step raises an error, grep falls back to filesystem grep.
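
The two-phase flow can be illustrated with an in-memory stand-in. This is a simplified sketch under stated assumptions, not `_grep_vikingdb_then_fs()` itself: files are modelled as a dict of URI to text, the `recall` callable is a hypothetical replacement for `vector_store.search_by_keywords()`, and the fallback is a full scan of the dict.

```python
import re
from typing import Callable, Dict, Iterable, List, Tuple

Match = Tuple[str, int, str]  # (uri, 1-based line number, line text)


def regex_grep(files: Dict[str, str], uris: Iterable[str], pattern: str) -> List[Match]:
    """Phase 2: apply the exact regex line by line to the given files."""
    compiled = re.compile(pattern)
    matches = []
    for uri in uris:
        for lineno, line in enumerate(files[uri].splitlines(), start=1):
            if compiled.search(line):
                matches.append((uri, lineno, line))
    return matches


def grep_recall_then_regex(
    files: Dict[str, str],
    pattern: str,
    recall: Callable[[str], List[str]],
) -> List[Match]:
    """Phase 1 narrows the candidates; a recall error falls back to a full scan."""
    try:
        candidates = [uri for uri in recall(pattern) if uri in files]
    except Exception:
        # Fallback: behave like filesystem grep and scan everything.
        candidates = list(files)
    return regex_grep(files, candidates, pattern)
```

Because every returned match comes from `regex_grep`, a recall step that returns extra files costs only time, while a recall step that misses a file loses its matches.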

3. Public Interface Kept Stable

Removed external remote_return_limit exposure from HTTP API, Python clients, Rust CLI, and Go SDK.
Removed per-request engine / switch_to_remote_threshold from grep API and CLI.
Kept server-side engine and switch_to_remote_threshold in ov.conf.
Kept remote_return_limit as an internal implementation detail derived from node_limit.

4. Schema & FullText Compatibility

Adds / validates content text field for FullText indexing.
Adds FullText config for the content field.
Checks collection metadata before using BM25 grep.
Existing collections without content FullText support automatically use filesystem grep in auto mode.

5. Data Pipeline for FullText Content

File vectorization keeps raw file content available for BM25 indexing separately from embedding-truncated text.
Unknown-suffix text-like files are best-effort decoded for FullText indexing.
Known text files reuse existing reads where possible to avoid unnecessary extra IO.
VikingDB write payload truncates indexed content at the backend boundary to respect storage limits without mutating the source data.

6. Configuration and Documentation Consistency

switch_to_remote_threshold default is now consistently 10000 in:
- code default path;
- GrepConfig default;
- example ov.conf;
- English / Chinese configuration docs.
Grep API docs now clarify the level_limit default difference:
- Python SDK: 5
- HTTP API / CLI / Go SDK: 10
CLI examples were corrected to pass pattern as the positional argument and URI via --uri.

7. Backend / Adapter Support

Collection keyword search path supports BM25 / FullText recall.
Vector index backend exposes keyword-search and collection-metadata methods used by grep auto mode.
VikingDB HTTP requests include User-Agent: openviking/{version} for server-side troubleshooting and traffic attribution.

Testing

Latest local targeted validation:

python -m ruff check openviking/storage/viking_fs.py tests/storage/test_viking_fs_grep.py
cargo check -p ov_cli
python -m pytest -o addopts='' tests/storage/test_viking_fs_grep.py

Result:

ruff check: passed
cargo check -p ov_cli: passed
tests/storage/test_viking_fs_grep.py: 12 passed

Notable test coverage added / updated in this PR:

Grep config default switch_to_remote_threshold is 10000.
No-config grep path uses the documented remote threshold behavior.
BM25 internal recall limit auto-adapts from node_limit and caps at 100000.
Grep preserves DFS order and node_limit semantics.
Grep respects level_limit in the VikingDB phase-1 path scope.
Fallback behavior is covered when VikingDB keyword search fails.
Collection schema tests validate content field and FullText config.
Vectorization tests cover raw content preservation for FullText indexing.
I have added tests that prove my fix is effective or that my feature works
New and existing targeted tests pass locally with my changes
I have tested this on the following platforms:
- Linux
- macOS
- Windows

Benchmark

Benchmarks were run on Debian 10, 12 CPU, 24 GB memory using the scripts under benchmark/retrieval/grep/vikingdb_bm25/.

Performance

Dataset and setup:

Synthetic dataset generated by performance/step0_prepare_data.py
200,000 files, about 13 GB total data
node_limit=256
15 valid keywords across 0.01%, 0.05%, 0.1%, and 1% frequency tiers, plus one no-match keyword
Each query was measured with RUNS=3, WARMUP=1
Compared engine=fs against engine=auto with VikingDB BM25 recall enabled

Key result:

Returned match counts were identical for all tested keywords: auto.matches == fs.matches.
Across the 15 valid keywords, auto averaged 1,517.8 ms (median 1,491.1 ms).
fs averaged 32,064.8 ms (median 33,459.2 ms).
Overall speedup: 21.4x average, 22.4x median.
No-match query: auto=1,435.2 ms, fs=32,914.9 ms, 22.9x faster.

| Frequency | Samples | auto avg | fs avg | Avg speedup | Matches |
| --- | --- | --- | --- | --- | --- |
| 0.010% | 3 | 1,455.1 ms | 33,561.0 ms | 23.0x | 19 |
| 0.050% | 3 | 1,484.9 ms | 32,877.7 ms | 22.2x | 75-100 |
| 0.100% | 6 | 1,496.8 ms | 38,953.7 ms | 26.0x | 176-215 |
| 1.000% | 3 | 1,655.5 ms | 15,978.1 ms | 9.6x | 256 |
| No match | 1 | 1,435.2 ms | 32,914.9 ms | 22.9x | 0 |

Effectiveness

Dataset and setup:

Real code repository imported by effectiveness/step1_add_resource.py
Ground truth generated with engine=fs
Evaluated engine=auto using VikingDB BM25 recall + local regex matching
14 grep patterns, including English identifiers, Chinese terms, alternation, and no-match query

Key result:

Weighted recall: 96.3%.
8 / 14 patterns reached 100% recall.
The previous “independent token exists but BM25 missed it” indexing issue was not reproduced.
Remaining false negatives are concentrated in two known categories:
- non-ASCII / Japanese URI write or indexing issue;
- BM25 token semantics vs filesystem regex substring semantics, e.g. embedding vs embeddings, grep vs ripgrep, reindex vs requireindex.

Representative results:

| Pattern | Truth | Found | Recall | Note |
| --- | --- | --- | --- | --- |
| `build_index` | 14 | 14 | 100.0% | stable full recall |
| `SyncHTTPClient` | 22 | 22 | 100.0% | stable full recall |
| `MarkdownParser` | 18 | 18 | 100.0% | stable full recall |
| `检索` | 23 | 23 | 100.0% | stable full recall |
| `向量数据库` | 4 | 4 | 100.0% | stable full recall |
| `vikingdb` | 97 | 96 | 99.0% | only remaining miss is a Japanese URI case |
| `add_resource` | 78 | 77 | 98.7% | only remaining miss is a Japanese URI case |
| `embedding` | 163 | 155 | 95.1% | URI issue + token/sub-string semantic gap |
| `grep` | 77 | 72 | 93.5% | URI issue + `pgrep`/`ripgrep` semantic gap |
| `reindex` | 26 | 24 | 92.3% | `requireindex` vs `reindex`, expected token-semantic gap |
In short, BM25 grep provides a large and stable latency improvement on large datasets, while the observed recall gaps are now mostly explainable by URI indexing issues or the expected difference between token recall and regex substring matching.

Notes and Caveats

1. BM25 recall is keyword based, not regex based

The VikingDB phase-1 recall step uses keyword search. It does not execute regex. The regex is still applied in phase 2 on locally read candidate files.

Simple alternation such as error|warning|fail can work well because the implementation can turn it into keyword-like recall terms. Complex regex patterns may not recall the same candidate set as a full filesystem scan, for example:

character classes: [Ee]rror
quantifiers: test\d+
anchors: ^import
wildcard-heavy patterns: .*config.*

For complex regex workloads that require exhaustive regex semantics, configure grep.engine = "fs".
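
One way to separate keyword-friendly patterns from the complex cases listed above is a conservative check: accept only patterns made of literal alternatives joined by `|`, and treat everything else as needing a full scan. This is a hypothetical heuristic for illustration, not the splitting logic used in this PR; the function name and the term limit parameter are assumptions.

```python
from typing import List, Optional

# Characters that carry regex meaning besides a top-level "|".
_REGEX_META = set(".^$*+?{}[]()\\")


def plain_alternatives(pattern: str, max_terms: int = 10) -> Optional[List[str]]:
    """Return the literal alternatives, or None if the pattern is not keyword-like."""
    terms = pattern.split("|")
    if len(terms) > max_terms:
        # Too many terms to send as keywords; let the caller pick another path.
        return None
    for term in terms:
        if not term or _REGEX_META & set(term):
            # Empty alternative or regex syntax: not safe to treat as a keyword.
            return None
    return terms
```

Returning `None` instead of a truncated or partially parsed keyword list keeps the caller from running phase 1 with terms that do not cover the regex.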

2. Tokenizer behavior can affect recall

VikingDB FullText uses tokenizer-based indexing. This can differ from substring regex matching:

CJK token boundaries may not align with arbitrary substrings.
Singular / plural or other inflected forms may be treated as different tokens.
Stop words or punctuation-heavy queries may not recall all files that a raw regex scan would match.

Phase 2 guarantees precision for recalled files, but phase 1 recall may still miss files if the tokenizer does not produce matching tokens.
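
The gap between token recall and substring matching can be reproduced with a toy tokenizer. The `\w+` split below is only a stand-in; the actual VikingDB FullText tokenizer is not described here and may behave differently.

```python
import re


def token_hit(text: str, keyword: str) -> bool:
    """Approximate tokenizer-based recall: the keyword must equal a whole token."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return keyword.lower() in tokens


def substring_hit(text: str, pattern: str) -> bool:
    """Filesystem grep semantics: the regex may match anywhere inside a word."""
    return re.search(pattern, text) is not None


for text, keyword in [
    ("install ripgrep first", "grep"),
    ("call requireindex before use", "reindex"),
    ("store the embeddings", "embedding"),
]:
    print(keyword, token_hit(text, keyword), substring_hit(text, keyword))
```

In all three cases the regex matches while the whole-token check does not, which is the shape of the `grep`, `reindex`, and `embedding` misses reported in the effectiveness results.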

3. Case handling may over-recall, which is safe

The tokenizer may normalize case, so BM25 can recall more files than a case-sensitive regex would match. This is safe because phase 2 applies the exact regex with the requested case_insensitive behavior and filters false positives.

Compatibility

Existing public grep callers continue to work without code changes.
Existing collections without FullText-compatible schema continue to work through filesystem grep fallback.
To benefit from BM25 acceleration, collections must have the content text field and FullText config.

github-actions · 2026-05-20T08:38:14Z

PR Reviewer Guide 🔍

(Review updated until commit `fbf4fea`)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 80
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Grep Engine Integration (Core + CLI + Tests)
Relevant files:
- openviking/storage/viking_fs.py
- openviking/storage/viking_vector_index_backend.py
- openviking/storage/collection_schemas.py
- openviking/server/routers/search.py
- openviking/service/fs_service.py
- openviking/async_client.py
- openviking/client/local.py
- openviking/sync_client.py
- openviking_cli/client/sync_http.py
- openviking_cli/client/http.py
- openviking_cli/client/base.py
- openviking/storage/queuefs/embedding_msg_converter.py
- openviking/storage/ovpack/vectors.py
- crates/ov_cli/src/main.rs
- crates/ov_cli/src/client.rs
- crates/ov_cli/src/handlers.rs
- crates/ov_cli/src/commands/search.rs
- tests/storage/test_rebuild_schema.py
- tests/storage/test_collection_schemas.py

Sub-PR theme: Grep BM25 Benchmark Scripts
Relevant files:
- benchmark/retrieval/grep/vikingdb_bm25/step1_generate.py
- benchmark/retrieval/grep/vikingdb_bm25/step2_quick_add_resource.py
- benchmark/retrieval/grep/vikingdb_bm25/step3_build_index.py
- benchmark/retrieval/grep/vikingdb_bm25/step4_benchmark.py

Sub-PR theme: Vector DB Search by Keywords Extensions
Relevant files:
- openviking/storage/vectordb/collection/http_collection.py
- openviking/storage/vectordb/collection/collection.py
- openviking/storage/vectordb/collection/volcengine_collection.py
- openviking/storage/vectordb/collection/vikingdb_collection.py
- openviking/storage/vectordb/collection/volcengine_api_key_collection.py
- openviking/storage/vectordb/collection/local_collection.py
- openviking/storage/vectordb/collection/volcengine_clients.py
- openviking/storage/vectordb/collection/vikingdb_clients.py
- openviking/storage/vectordb_adapters/base.py
- openviking/storage/vectordb/utils/validation.py
- openviking/storage/vectordb/service/app_models.py
- tests/storage/mock_backend.py
⚡ Recommended focus areas for review

Naive Regex Splitting for BM25 Keywords
The code splits the regex pattern on '\|' to extract keywords for BM25 search. This will fail for complex regex patterns with groups, character classes, or escaped '\|' (e.g., 'error\|(warning\|fail)', 'a\|b', '[a\|b]'), leading to incorrect keywords and potential fallback to fs mode unnecessarily.

    # for bm25 search. Limit to 10 keywords per VikingDB API constraint.
    keywords = [kw.strip() for kw in pattern.split("\|") if kw.strip()][:10]

Count Cache Eviction Not Atomic
The _count_cache eviction logic (checking size then deleting keys) is not atomic in an async context. Multiple concurrent calls could lead to unexpected cache behavior, though the impact is low since it's a cache.

    if len(self._count_cache) >= self._count_cache_max_size:
        oldest_keys = sorted(self._count_cache, key=lambda k: self._count_cache[k][1])
        for k in oldest_keys[:len(oldest_keys) // 2]:
            del self._count_cache[k]
    self._count_cache[cache_key] = (count, now)

Broad Exception Swallowing in search_by_keywords
The search_by_keywords method catches all exceptions and returns an empty list, which could hide errors. Consider logging the exception and re-raising or falling back more explicitly.

    except Exception as e:
        logger.error("Error searching by keywords: %s", e)
        return []

github-actions · 2026-05-20T08:41:02Z

PR Code Suggestions ✨

No code suggestions found for the PR.

github-actions · 2026-05-25T04:03:27Z

Persistent review updated to latest commit fbf4fea

github-actions · 2026-05-25T04:05:43Z

PR Code Suggestions ✨

No code suggestions found for the PR.

MaojiaSheng · 2026-05-25T13:03:26Z

    node_limit: i32,
    level_limit: i32,
+    engine: Option<String>,
+    switch_to_remote_threshold: Option<i32>,


These parameters are rarely used, so consider putting them in ovcli.conf instead of exposing them as flags.

ByteDanceLiuYang (May 26, 2026): OK, adjusted.

MaojiaSheng · 2026-05-25T13:04:20Z

        offset: int = 0,
        filters: Optional[Dict[str, Any]] = None,
        output_fields: Optional[List[str]] = None,
+        mode: Optional[str] = None,


The design of these two parameters still needs more thought.

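ByteDanceLiuYang (May 26, 2026): OK, I have removed them for now. In OpenViking these two parameters do not actually need to be passed at present; the default behavior applies.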

github-project-automation Bot added this to OpenViking project May 20, 2026

github-project-automation Bot moved this to Backlog in OpenViking project May 20, 2026

github-actions Bot added the Review effort 3/5 label May 20, 2026

qin-ctx requested a review from zhoujh01 May 20, 2026 08:57

ByteDanceLiuYang marked this pull request as draft May 20, 2026 09:36

ByteDanceLiuYang force-pushed the grep_vikingdb branch 8 times, most recently from 92329f5 to fbf4fea on May 23, 2026 15:18

ByteDanceLiuYang marked this pull request as ready for review May 25, 2026 04:01

github-actions Bot added Review effort 4/5 and removed Review effort 3/5 labels May 25, 2026

ByteDanceLiuYang force-pushed the grep_vikingdb branch 6 times, most recently from 0672236 to 4d21a96 on May 25, 2026 07:59

ByteDanceLiuYang marked this pull request as draft May 25, 2026 12:51

MaojiaSheng reviewed May 25, 2026

ByteDanceLiuYang force-pushed the grep_vikingdb branch from faaaf83 to b80738d on May 26, 2026 09:35

ByteDanceLiuYang added 3 commits May 28, 2026 11:01

optimize: auto adapt remote_return_limit by agg API; rm unnecessary params in keywords search

3ea7b3f

fix: adjust benchmark scripts

1c65d64

fix(grep): store full content for BM25; use PathScope depth; reduce redundant API calls

f9b4065

ByteDanceLiuYang force-pushed the grep_vikingdb branch from 91916a6 to f9b4065 on May 28, 2026 03:08

ByteDanceLiuYang added 3 commits May 28, 2026 20:38

refactor: new benchmark

240fd27

fix: step1 add resource by real code data

7337653

feat(benchmark): split grep benchmark into effectiveness/performance suites with async reindex

a03c61b

ByteDanceLiuYang force-pushed the grep_vikingdb branch from b4b38c4 to a03c61b on May 29, 2026 12:33

optimize (benchmark): adjust keywords and ground truth for testing

6a80745

ByteDanceLiuYang force-pushed the grep_vikingdb branch from 5c96d24 to 6a80745 on June 1, 2026 08:04

ByteDanceLiuYang added 6 commits June 1, 2026 21:18

fix: truncate 64KB for content field

599ae64

Merge branch 'main' into grep_vikingdb

ae9c078

optimize: effectiveness add resource plainly

f13617b

optimize: change param use of SearchByKeywords from "keywords" to "query"

0fe3c1e

optimize(benchmark): refactor effectiveness scripts

02d18d8

Merge branch 'main' into grep_vikingdb

a6aefef

ByteDanceLiuYang force-pushed the grep_vikingdb branch from c5c029e to a6aefef on June 10, 2026 12:05

ByteDanceLiuYang added 8 commits June 11, 2026 12:12

optimize: ensure raw data for content field

0c8eef0

Merge branch 'main' into grep_vikingdb

a570241

optimize: fulltext analyzer's stop-words only use symbols

4b8481a

fix: adapt to new ov cli for benchmark

3ae5dbf

optimize: reuse file content to avoid re-read AGFS file

211d04b

Merge branch 'main' into grep_vikingdb

8f0a7dc

optimize: tune grep vikingdb defaults and refresh bm25 benchmark scripts

b132249

optimize: benchmark client timeout

f135133

ByteDanceLiuYang force-pushed the grep_vikingdb branch from a34e7b4 to f135133 on June 17, 2026 12:05

ByteDanceLiuYang added 3 commits June 17, 2026 21:22

update README

4ec3ec9

Merge branch 'main' into grep_vikingdb

75a5483

fix: rm unused param

98424e1

ByteDanceLiuYang force-pushed the grep_vikingdb branch from 2759a87 to 98424e1 on June 23, 2026 10:12

ByteDanceLiuYang wants to merge 33 commits into volcengine:main from ByteDanceLiuYang:grep_vikingdb
