feat(grep): integrate VikingDB bm25 keyword search for grep engine #2144
ByteDanceLiuYang wants to merge 33 commits
    node_limit: i32,
    level_limit: i32,
    engine: Option<String>,
    switch_to_remote_threshold: Option<i32>,

Since these parameters are rarely used, consider putting them in ovcli.conf rather than exposing them through flags.
    offset: int = 0,
    filters: Optional[Dict[str, Any]] = None,
    output_fields: Optional[List[str]] = None,
    mode: Optional[str] = None,

OK, I have removed them for now. In OpenViking these two parameters do not actually need to be passed at the moment; omitting them simply uses the default behavior.
Summary
The existing `grep` path performs filesystem traversal: it walks the directory tree, reads candidate files, and applies regex matching line by line. On large resource trees this can become prohibitively slow, because the cost grows with the number of files that must be scanned.

This PR introduces an adaptive two-phase grep strategy for VikingDB / Volcengine vector-store backends: VikingDB BM25 keyword recall (phase 1), followed by regex matching on the locally read candidate files (phase 2).

The public grep API remains stable. The HTTP API, Python SDK, Go SDK, and Rust CLI do not expose extra per-request engine or BM25 tuning parameters. The acceleration is controlled by the server-side `ov.conf`, and `engine=auto` falls back to filesystem grep whenever VikingDB BM25 is unavailable or not suitable.

Type of Change
Feature Usage
Public API / SDK / CLI
The public grep interface remains unchanged:

- HTTP API: `POST /api/v1/search/grep`
- Python SDK: `client.grep(uri, pattern, case_insensitive=..., node_limit=..., exclude_uri=..., level_limit=...)`
- Go SDK: `client.Grep(ctx, uri, pattern, &openviking.GrepOptions{...})`
- Rust CLI: `ov grep "pattern" --uri viking://resources/...`

No public `remote_return_limit` parameter is exposed.

Server-side Configuration (`ov.conf`)

VikingDB BM25 acceleration is controlled by the `grep` config in `ov.conf`:

    {
      "grep": {
        "engine": "auto",
        "switch_to_remote_threshold": 10000
      }
    }

| Field | Type | Default | Valid values | Description |
| --- | --- | --- | --- | --- |
| `engine` | str | `"auto"` | `"auto"` or `"fs"` | `"auto"` uses VikingDB BM25 recall when available and falls back to filesystem grep. `"fs"` forces filesystem grep only. |
| `switch_to_remote_threshold` | int | `10000` | `>= 0` | Set to `0` to always try VikingDB BM25 when available. |

Internal BM25 Recall Limit

`remote_return_limit` is intentionally internal only. It is derived from the user-facing `node_limit`:

- `node_limit` is set: `min(node_limit * 5, 100000)`
- `node_limit` is unset: `100000`

This avoids exposing another tuning knob while still giving the second-stage regex matcher enough BM25 candidates to reduce false truncation.
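The derivation is small enough to sketch directly. In this minimal sketch the helper name `derive_remote_return_limit` and the constant names are hypothetical and chosen for illustration; only the formula and the two numeric values come from this description. Treating `None` as the only "unset" value is also an assumption.

```python
from typing import Optional

# Values taken from the PR description; the names are illustrative only.
RECALL_MULTIPLIER = 5
RECALL_CAP = 100_000


def derive_remote_return_limit(node_limit: Optional[int]) -> int:
    """Return the internal BM25 candidate limit for a given node_limit."""
    if node_limit is None:
        # node_limit unset: request the maximum number of candidates.
        return RECALL_CAP
    # node_limit set: over-fetch 5x so the phase-2 regex filter still has
    # enough candidates, but never exceed the cap.
    return min(node_limit * RECALL_MULTIPLIER, RECALL_CAP)
```

Over-fetching matters because phase 1 returns files that merely contain the keywords; some of them will be rejected by the regex in phase 2, so requesting exactly `node_limit` candidates could truncate the final result.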
Usage Examples
1. Configure VikingDB / Volcengine backend
The `storage.vectordb` section must use the `volcengine` or `vikingdb` backend to enable BM25 recall. Example:

    {
      "storage": {
        "vectordb": {
          "backend": "volcengine",
          "volcengine": {
            "ak": "YOUR_AK",
            "sk": "YOUR_SK",
            "region": "cn-beijing"
          },
          "name": "my_collection_for_ov",
          "index_name": "my_index_1"
        }
      },
      "grep": {
        "engine": "auto",
        "switch_to_remote_threshold": 10000
      }
    }

2. Basic grep, auto mode

    ov --account default --user default grep --uri viking://resources/code 'VikingDB'

In the default `engine=auto` mode, OpenViking checks:

- the vector-store backend (`volcengine` / `vikingdb`);
- the collection's `content` field and FullText config;
- `switch_to_remote_threshold`.

If all checks pass, grep uses VikingDB BM25 phase-1 recall plus local filesystem regex matching. Otherwise it uses filesystem grep.
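A minimal sketch of how these checks might combine into one decision, following the auto-mode decision chain described in this PR. Everything named here is hypothetical: `GrepConfigSketch` is not the project's `GrepConfig`, `scope_size` stands in for whatever quantity is compared against the threshold (this description does not name it), and the final "at or above the threshold uses BM25" branch is an assumption.

```python
from dataclasses import dataclass

# Backends that this PR says can serve BM25 recall.
BM25_BACKENDS = {"volcengine", "vikingdb"}


@dataclass
class GrepConfigSketch:
    """Illustrative stand-in for the `grep` section of ov.conf."""
    engine: str = "auto"                     # "auto" or "fs"
    switch_to_remote_threshold: int = 10000  # must be >= 0

    def __post_init__(self) -> None:
        if self.engine not in ("auto", "fs"):
            raise ValueError(f"unsupported engine: {self.engine!r}")
        if self.switch_to_remote_threshold < 0:
            raise ValueError("switch_to_remote_threshold must be >= 0")


def choose_engine(config: GrepConfigSketch, backend: str,
                  has_fulltext_content: bool, scope_size: int) -> str:
    """Return "fs" or "vikingdb_bm25" for one grep request."""
    if config.engine == "fs":
        return "fs"                  # filesystem grep is forced
    if backend not in BM25_BACKENDS:
        return "fs"                  # backend cannot do BM25 recall
    if not has_fulltext_content:
        return "fs"                  # no content field / FullText config
    if config.switch_to_remote_threshold == 0:
        return "vikingdb_bm25"       # 0 means always try BM25
    if scope_size < config.switch_to_remote_threshold:
        return "fs"                  # small scope: a local scan is fine
    return "vikingdb_bm25"           # assumed: large scope uses BM25
```

Ordering the cheap capability checks first means the threshold comparison only runs when BM25 is actually usable.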
3. Force filesystem grep
Set `engine` to `fs` in `ov.conf`:

    {
      "grep": {
        "engine": "fs",
        "switch_to_remote_threshold": 10000
      }
    }

Then use the same grep command:

    ov --account default --user default grep --uri viking://resources/code 'VikingDB'

4. Always try VikingDB BM25 when available

Set `switch_to_remote_threshold` to `0` in `ov.conf`:

    {
      "grep": {
        "engine": "auto",
        "switch_to_remote_threshold": 0
      }
    }

Changes Made
1. Grep Engine Dispatch
Supported engines: `auto` and `fs`.

Auto mode decision chain:

- Backend is not `volcengine` / `vikingdb` → filesystem grep
- Collection lacks the `content` field or FullText config → filesystem grep
- `switch_to_remote_threshold == 0` → VikingDB BM25 + filesystem regex matching
- Below the threshold (`< switch_to_remote_threshold`) → filesystem grep

2. VikingDB BM25 + Filesystem Regex Pipeline

- `_grep_vikingdb_then_fs()` performs phase-1 keyword recall through `vector_store.search_by_keywords()`.

3. Public Interface Kept Stable

- Removed `remote_return_limit` exposure from the HTTP API, Python clients, Rust CLI, and Go SDK.
- Removed `engine` / `switch_to_remote_threshold` from the grep API and CLI.
- `engine` and `switch_to_remote_threshold` are configured in `ov.conf`.
- `remote_return_limit` is kept as an internal implementation detail derived from `node_limit`.

4. Schema & FullText Compatibility

- `content` text field for FullText indexing.
- `content` field.
- Collections without `content` FullText support automatically use filesystem grep in auto mode.

5. Data Pipeline for FullText Content

6. Configuration and Documentation Consistency

- `switch_to_remote_threshold` default is now consistently `10000` in:
  - `GrepConfig` default;
  - `ov.conf`;
- `level_limit` default difference: `5` vs `10`.
- `--uri`.

7. Backend / Adapter Support

- `User-Agent: openviking/{version}` for server-side troubleshooting and traffic attribution.

Testing
Latest local targeted validation:
    python -m ruff check openviking/storage/viking_fs.py tests/storage/test_viking_fs_grep.py
    cargo check -p ov_cli
    python -m pytest -o addopts='' tests/storage/test_viking_fs_grep.py

Result:

- `ruff check`: passed
- `cargo check -p ov_cli`: passed
- `tests/storage/test_viking_fs_grep.py`: 12 passed

Notable test coverage added / updated in this PR:

- Grep config default `switch_to_remote_threshold` is `10000`.
- No-config grep path uses the documented remote threshold behavior.
- BM25 internal recall limit auto-adapts from `node_limit` and caps at `100000`.
- Grep preserves DFS order and `node_limit` semantics.
- Grep respects `level_limit` in the VikingDB phase-1 path scope.
- Fallback behavior is covered when VikingDB keyword search fails.
- Collection schema tests validate the `content` field and FullText config.
- Vectorization tests cover raw content preservation for FullText indexing.
I have added tests that prove my fix is effective or that my feature works
New and existing targeted tests pass locally with my changes
I have tested this on the following platforms:
Benchmark
Benchmarks were run on Debian 10, 12 CPU, 24 GB memory using the scripts under `benchmark/retrieval/grep/vikingdb_bm25/`.

Performance

Dataset and setup:

- `performance/step0_prepare_data.py`
- `node_limit=256`
- `RUNS=3`, `WARMUP=1`
- `engine=fs` against `engine=auto` with VikingDB BM25 recall enabled

Key result:

- `auto.matches == fs.matches`.
- `auto` averaged 1,517.8 ms and median 1,491.1 ms.
- `fs` averaged 32,064.8 ms and median 33,459.2 ms.
- `auto=1,435.2 ms`, `fs=32,914.9 ms`, 22.9x faster.

Effectiveness

Dataset and setup:

- `effectiveness/step1_add_resource.py`
- `engine=fs`
- `engine=auto` using VikingDB BM25 recall + local regex matching

Key result:

- `embedding` vs `embeddings`, `grep` vs `ripgrep`, `reindex` vs `requireindex`.

Representative results (query patterns): `build_index`, `SyncHTTPClient`, `MarkdownParser`, `检索向量数据库`, `vikingdb`, `add_resource`, `embedding`, `grep` (`pgrep` / `ripgrep` semantic gap), `reindex` (`requireindex` vs `reindex`, expected token-semantic gap).

In short, BM25 grep provides a large and stable latency improvement on large datasets, while the observed recall gaps are now mostly explainable by URI indexing issues or the expected difference between token recall and regex substring matching.
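The size of the latency improvement can be checked from the figures quoted in the Performance results. This is plain arithmetic on the reported numbers; the helper name `speedup` is illustrative.

```python
def speedup(fs_ms: float, auto_ms: float) -> float:
    """How many times faster `auto` is than `fs` for one pair of timings."""
    return fs_ms / auto_ms


# Latency pairs (fs, auto) in milliseconds, copied from the reported results.
average = speedup(32064.8, 1517.8)
median = speedup(33459.2, 1491.1)
quoted_pair = speedup(32914.9, 1435.2)  # the pair reported as 22.9x faster

for label, ratio in [("average", average), ("median", median), ("quoted", quoted_pair)]:
    print(f"{label}: {ratio:.1f}x")
```

All three ratios land in the same range, which is consistent with describing the improvement as stable.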
Notes and Caveats
1. BM25 recall is keyword based, not regex based
The VikingDB phase-1 recall step uses keyword search. It does not execute regex. The regex is still applied in phase 2 on locally read candidate files.
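The split between the two phases can be illustrated with a toy, in-memory version. This is a sketch under stated assumptions, not the PR's `_grep_vikingdb_then_fs()`: the `tokenize`, `recall_candidates`, and `regex_filter` helpers are hypothetical, a plain dict stands in for both the FullText index and the filesystem, and the tokenizer is a simple lowercase word splitter rather than VikingDB's.

```python
import re
from typing import Dict, Iterable, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    # Toy tokenizer: lowercase runs of word characters.
    return set(re.findall(r"\w+", text.lower()))


def recall_candidates(files: Dict[str, str], keywords: Iterable[str]) -> List[str]:
    """Phase 1 stand-in: keep files whose tokens contain any keyword."""
    wanted = {k.lower() for k in keywords}
    return [path for path, text in files.items() if tokenize(text) & wanted]


def regex_filter(files: Dict[str, str], candidates: List[str], pattern: str,
                 case_insensitive: bool = False) -> List[Tuple[str, int, str]]:
    """Phase 2: run the real regex line by line, on recalled files only."""
    compiled = re.compile(pattern, re.IGNORECASE if case_insensitive else 0)
    hits = []
    for path in candidates:
        for lineno, line in enumerate(files[path].splitlines(), start=1):
            if compiled.search(line):
                hits.append((path, lineno, line))
    return hits


files = {
    "a.py": "import os\n# Error handling\n",
    "b.py": "log.error('boom')\n",
    "c.py": "print('ok')\n",
}
candidates = recall_candidates(files, ["error"])   # keyword recall, no regex
hits = regex_filter(files, candidates, r"error\(")  # exact regex semantics
```

Here phase 1 recalls both `a.py` and `b.py` because the toy tokenizer lowercases `Error`, and phase 2 keeps only the line in `b.py` that actually matches the case-sensitive regex. Phase 1 only narrows the set of files to read; it never decides what counts as a match.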
Simple alternation such as `error|warning|fail` can work well because the implementation can turn it into keyword-like recall terms. Complex regex patterns may not recall the same candidate set as a full filesystem scan, for example:

- `[Ee]rror`
- `test\d+`
- `^import.*config.*`

For complex regex workloads that require exhaustive regex semantics, configure `grep.engine = "fs"`.

2. Tokenizer behavior can affect recall
VikingDB FullText uses tokenizer-based indexing, which can differ from substring regex matching.
Phase 2 guarantees precision for recalled files, but phase 1 recall may still miss files if the tokenizer does not produce matching tokens.
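A small illustration of how token matching and substring matching can disagree. The tokenizer below is a toy word splitter and an assumption of this sketch; VikingDB's actual tokenizer may normalize text differently (for example through stemming), so this shows the general failure mode rather than VikingDB's exact behavior. The `embedding` / `embeddings` pair is borrowed from the benchmark notes purely as an example.

```python
import re


def toy_tokens(text: str) -> set:
    # Toy tokenizer: lowercase runs of word characters, no stemming.
    return set(re.findall(r"\w+", text.lower()))


doc = "compute embeddings for each chunk"
keyword = "embedding"

# Token recall: the keyword must equal a whole token.
token_hit = keyword in toy_tokens(doc)

# Regex grep: the keyword only needs to appear as a substring.
regex_hit = re.search(keyword, doc) is not None
```

With this toy tokenizer, `token_hit` is false while `regex_hit` is true: the document would never reach phase 2, so the exact regex has no chance to match it.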
3. Case handling may over-recall, which is safe
The tokenizer may normalize case, so BM25 can recall more files than a case-sensitive regex would match. This is safe because phase 2 applies the exact regex with the requested `case_insensitive` behavior and filters false positives.

Compatibility

- `content` text field and FullText config.