feat(grep): integrate VikingDB bm25 keyword search for grep engine#2144

Draft
ByteDanceLiuYang wants to merge 33 commits into
volcengine:mainfrom
ByteDanceLiuYang:grep_vikingdb

@ByteDanceLiuYang ByteDanceLiuYang commented May 20, 2026

Contributor

Summary

The existing grep path performs filesystem traversal: walking the directory tree, reading candidate files, and applying regex matching line by line. On large resource trees, this can become prohibitively slow because the cost grows with the number of files that must be scanned.

This PR introduces an adaptive two-phase grep strategy for VikingDB / Volcengine vector-store backends:

  1. Use VikingDB FullText / BM25 keyword search as a coarse phase-1 recall step to narrow down candidate files.
  2. Run the existing local filesystem regex matching on the recalled candidate files to preserve exact grep semantics for returned matches.

The public grep API remains stable. HTTP API, Python SDK, Go SDK, and Rust CLI do not expose extra per-request engine or BM25 tuning parameters. The acceleration is controlled by server-side ov.conf, and engine=auto falls back to filesystem grep whenever VikingDB BM25 is unavailable or not suitable.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Feature Usage

Public API / SDK / CLI

The public grep interface remains unchanged:

  • HTTP API: POST /api/v1/search/grep
  • Python SDK: client.grep(uri, pattern, case_insensitive=..., node_limit=..., exclude_uri=..., level_limit=...)
  • Go SDK: client.Grep(ctx, uri, pattern, &openviking.GrepOptions{...})
  • Rust CLI: ov grep "pattern" --uri viking://resources/...

No public remote_return_limit parameter is exposed.

Server-side Configuration (ov.conf)

VikingDB BM25 acceleration is controlled by grep config in ov.conf:

{
  "grep": {
    "engine": "auto",
    "switch_to_remote_threshold": 10000
  }
}

| Parameter | Type | Default | Constraints | Description |
| --- | --- | --- | --- | --- |
| engine | str | "auto" | "auto" or "fs" | Search engine mode. "auto" uses VikingDB BM25 recall when available and falls back to filesystem grep. "fs" forces filesystem grep only. |
| switch_to_remote_threshold | int | 10000 | >= 0 | L2 record count threshold for switching to VikingDB BM25 phase-1 recall. If the number of files under the search scope reaches this threshold, auto mode uses VikingDB BM25 when available. Set to 0 to always try VikingDB BM25 when available. |

Internal BM25 Recall Limit

remote_return_limit is intentionally internal only. It is derived from the user-facing node_limit:

  • If node_limit is set: min(node_limit * 5, 100000)
  • If node_limit is unset: 100000

This avoids exposing another tuning knob while still giving the second-stage regex matcher enough BM25 candidates to reduce false truncation.
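
The derivation rule above can be sketched as a small helper. This is a minimal illustration, not the PR's code: the function name `derive_remote_return_limit` and the constant names are hypothetical, and it assumes an unset node_limit is represented as `None`.

```python
from typing import Optional

# Both values come from the PR description; the names are hypothetical.
MAX_REMOTE_RETURN_LIMIT = 100_000
RECALL_MULTIPLIER = 5


def derive_remote_return_limit(node_limit: Optional[int]) -> int:
    """Derive the internal BM25 recall limit from the user-facing node_limit.

    Assumption: an unset node_limit is passed as None.
    """
    if node_limit is None:
        return MAX_REMOTE_RETURN_LIMIT
    # Over-fetch candidates so phase-2 regex filtering is less likely to
    # leave fewer than node_limit results, but never exceed the cap.
    return min(node_limit * RECALL_MULTIPLIER, MAX_REMOTE_RETURN_LIMIT)
```

With the benchmark setting node_limit=256, this yields a recall limit of 1280 candidates.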

Usage Examples

1. Configure VikingDB / Volcengine backend

The storage.vectordb section must use volcengine or vikingdb backend to enable BM25 recall. Example:

{
  "storage": {
    "vectordb": {
      "backend": "volcengine",
      "volcengine": {
        "ak": "YOUR_AK",
        "sk": "YOUR_SK",
        "region": "cn-beijing"
      },
      "name": "my_collection_for_ov",
      "index_name": "my_index_1"
    }
  },
  "grep": {
    "engine": "auto",
    "switch_to_remote_threshold": 10000
  }
}

Note: The collection must include the content text field and a FullText config. Existing collections created without this schema automatically fall back to filesystem grep in auto mode. To enable VikingDB-based grep for them, recreate or rebuild the collection with the new schema.

2. Basic grep, auto mode

ov --account default --user default grep --uri viking://resources/code 'VikingDB'

In default engine=auto mode, OpenViking checks:

  1. whether the vector store is available;
  2. whether the backend supports VikingDB / Volcengine keyword search;
  3. whether the collection has content field and FullText config;
  4. whether the L2 file count under the search scope reaches switch_to_remote_threshold.

If all checks pass, grep uses VikingDB BM25 phase-1 recall plus local filesystem regex matching. Otherwise it uses filesystem grep.

3. Force filesystem grep

Set engine to fs in ov.conf:

{
  "grep": {
    "engine": "fs",
    "switch_to_remote_threshold": 10000
  }
}

Then use the same grep command:

ov --account default --user default grep --uri viking://resources/code 'VikingDB'

4. Always try VikingDB BM25 when available

Set switch_to_remote_threshold to 0 in ov.conf:

{
  "grep": {
    "engine": "auto",
    "switch_to_remote_threshold": 0
  }
}

Changes Made

1. Grep Engine Dispatch

| Mode | Behavior |
| --- | --- |
| auto | Adaptive mode. Checks vector-store availability, backend type, collection FullText capability, and data-volume threshold. Uses VikingDB BM25 + filesystem regex matching only when all checks pass. |
| fs | Forces the original filesystem grep path. |

Auto mode decision chain:

  1. vector store unavailable → filesystem grep
  2. backend is not volcengine / vikingdb → filesystem grep
  3. collection lacks content field or FullText config → filesystem grep
  4. switch_to_remote_threshold == 0 → VikingDB BM25 + filesystem regex matching
  5. file count under scope < switch_to_remote_threshold → filesystem grep
  6. otherwise → VikingDB BM25 + filesystem regex matching
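
The decision chain above can be read as a pure function. The following is a sketch under stated assumptions: `GrepContext`, `choose_engine`, and the return labels `"fs"` and `"bm25+fs"` are hypothetical names introduced for illustration, and the real dispatch code in the PR may gather these inputs differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GrepContext:
    """Hypothetical snapshot of the facts the auto mode inspects."""
    vector_store_available: bool
    backend: Optional[str]          # e.g. "volcengine", "vikingdb", or another backend
    has_fulltext_content: bool      # content field + FullText config present
    file_count: int                 # files under the search scope


def choose_engine(ctx: GrepContext, engine: str = "auto",
                  switch_to_remote_threshold: int = 10000) -> str:
    """Return "fs" or "bm25+fs", following the six-step chain in order."""
    if engine == "fs":
        return "fs"                              # forced filesystem grep
    if not ctx.vector_store_available:
        return "fs"                              # step 1
    if ctx.backend not in ("volcengine", "vikingdb"):
        return "fs"                              # step 2
    if not ctx.has_fulltext_content:
        return "fs"                              # step 3
    if switch_to_remote_threshold == 0:
        return "bm25+fs"                         # step 4: always try BM25
    if ctx.file_count < switch_to_remote_threshold:
        return "fs"                              # step 5: scope is small
    return "bm25+fs"                             # step 6
```

Keeping the checks ordered from cheapest to most expensive means the file count only needs to be computed when the earlier capability checks have already passed.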

2. VikingDB BM25 + Filesystem Regex Pipeline

  • _grep_vikingdb_then_fs() performs phase-1 keyword recall through vector_store.search_by_keywords().
  • BM25 recall returns candidate file URIs only.
  • The existing local regex matcher then reads recalled files and applies the exact regex pattern.
  • Final returned matches still come from local regex matching, not directly from BM25.
  • If the VikingDB recall step raises an error, grep falls back to filesystem grep.
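
The pipeline described in these bullets can be illustrated with a self-contained toy. This is a sketch, not a reproduction of `_grep_vikingdb_then_fs()`: the names `grep_two_phase` and `regex_scan`, the in-memory `files` dict standing in for the filesystem, and the `recall` callable standing in for `vector_store.search_by_keywords()` are all assumptions made for the example.

```python
import re
from typing import Callable, Dict, Iterable, List, Tuple

Match = Tuple[str, int, str]  # (uri, 1-based line number, line text)


def regex_scan(files: Dict[str, str], uris: Iterable[str], pattern: str,
               case_insensitive: bool = False) -> List[Match]:
    """Phase 2: exact regex matching, line by line, over the given files."""
    rx = re.compile(pattern, re.IGNORECASE if case_insensitive else 0)
    matches: List[Match] = []
    for uri in uris:
        for lineno, line in enumerate(files[uri].splitlines(), start=1):
            if rx.search(line):
                matches.append((uri, lineno, line))
    return matches


def grep_two_phase(files: Dict[str, str], pattern: str,
                   recall: Callable[[str], List[str]],
                   case_insensitive: bool = False) -> List[Match]:
    """Phase 1 narrows the candidate set; phase 2 decides the real matches."""
    try:
        candidates = [uri for uri in recall(pattern) if uri in files]
    except Exception:
        # Recall failed: fall back to scanning every file.
        candidates = list(files)
    return regex_scan(files, candidates, pattern, case_insensitive)
```

Because every returned match comes from `regex_scan`, a recall step that returns too many files only costs time, while a recall step that returns too few files loses matches. That asymmetry is why the caveats later in this description focus on recall rather than precision.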

3. Public Interface Kept Stable

  • Removed external remote_return_limit exposure from HTTP API, Python clients, Rust CLI, and Go SDK.
  • Removed per-request engine / switch_to_remote_threshold from grep API and CLI.
  • Kept server-side engine and switch_to_remote_threshold in ov.conf.
  • Kept remote_return_limit as an internal implementation detail derived from node_limit.

4. Schema & FullText Compatibility

  • Adds / validates content text field for FullText indexing.
  • Adds FullText config for the content field.
  • Checks collection metadata before using BM25 grep.
  • Existing collections without content FullText support automatically use filesystem grep in auto mode.

5. Data Pipeline for FullText Content

  • File vectorization keeps raw file content available for BM25 indexing separately from embedding-truncated text.
  • Unknown-suffix text-like files are best-effort decoded for FullText indexing.
  • Known text files reuse existing reads where possible to avoid unnecessary extra IO.
  • VikingDB write payload truncates indexed content at the backend boundary to respect storage limits without mutating the source data.
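
The last bullet, truncating at the backend boundary without mutating the source, can be sketched as follows. The helper name `build_write_payload` and the `max_content_chars` parameter are hypothetical; the PR description does not state the actual storage limit or whether it is measured in characters or bytes.

```python
from typing import Any, Dict


def build_write_payload(record: Dict[str, Any], max_content_chars: int) -> Dict[str, Any]:
    """Return a copy of record whose content fits the (assumed) backend limit.

    The input record is left untouched so other consumers still see the
    full raw content.
    """
    payload = dict(record)  # shallow copy; only the content key is replaced
    content = payload.get("content")
    if isinstance(content, str) and len(content) > max_content_chars:
        payload["content"] = content[:max_content_chars]
    return payload
```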

6. Configuration and Documentation Consistency

  • switch_to_remote_threshold default is now consistently 10000 in:
    • code default path;
    • GrepConfig default;
    • example ov.conf;
    • English / Chinese configuration docs.
  • Grep API docs now clarify the level_limit default difference:
    • Python SDK: 5
    • HTTP API / CLI / Go SDK: 10
  • CLI examples were corrected to pass pattern as the positional argument and URI via --uri.

7. Backend / Adapter Support

  • Collection keyword search path supports BM25 / FullText recall.
  • Vector index backend exposes keyword-search and collection-metadata methods used by grep auto mode.
  • VikingDB HTTP requests include User-Agent: openviking/{version} for server-side troubleshooting and traffic attribution.

Testing

Latest local targeted validation:

python -m ruff check openviking/storage/viking_fs.py tests/storage/test_viking_fs_grep.py
cargo check -p ov_cli
python -m pytest -o addopts='' tests/storage/test_viking_fs_grep.py

Result:

  • ruff check: passed
  • cargo check -p ov_cli: passed
  • tests/storage/test_viking_fs_grep.py: 12 passed

Notable test coverage added / updated in this PR:

  • Grep config default switch_to_remote_threshold is 10000.

  • No-config grep path uses the documented remote threshold behavior.

  • BM25 internal recall limit auto-adapts from node_limit and caps at 100000.

  • Grep preserves DFS order and node_limit semantics.

  • Grep respects level_limit in the VikingDB phase-1 path scope.

  • Fallback behavior is covered when VikingDB keyword search fails.

  • Collection schema tests validate content field and FullText config.

  • Vectorization tests cover raw content preservation for FullText indexing.

  • I have added tests that prove my fix is effective or that my feature works

  • New and existing targeted tests pass locally with my changes

  • I have tested this on the following platforms:

    • Linux
    • macOS
    • Windows

Benchmark

Benchmarks were run on Debian 10, 12 CPU, 24 GB memory using the scripts under benchmark/retrieval/grep/vikingdb_bm25/.

Performance

Dataset and setup:

  • Synthetic dataset generated by performance/step0_prepare_data.py
  • 200,000 files, about 13 GB total data
  • node_limit=256
  • 15 valid keywords across 0.01%, 0.05%, 0.1%, and 1% frequency tiers, plus one no-match keyword
  • Each query was measured with RUNS=3, WARMUP=1
  • Compared engine=fs against engine=auto with VikingDB BM25 recall enabled

Key result:

  • Returned match counts were identical for all tested keywords: auto.matches == fs.matches.
  • Across the 15 valid keywords, auto had a mean latency of 1,517.8 ms and a median of 1,491.1 ms.
  • fs had a mean latency of 32,064.8 ms and a median of 33,459.2 ms.
  • Overall speedup: 21.4x average, 22.4x median.
  • No-match query: auto=1,435.2 ms, fs=32,914.9 ms, 22.9x faster.

| Frequency | Samples | auto avg | fs avg | Avg speedup | Matches |
| --- | --- | --- | --- | --- | --- |
| 0.010% | 3 | 1,455.1 ms | 33,561.0 ms | 23.0x | 19 |
| 0.050% | 3 | 1,484.9 ms | 32,877.7 ms | 22.2x | 75-100 |
| 0.100% | 6 | 1,496.8 ms | 38,953.7 ms | 26.0x | 176-215 |
| 1.000% | 3 | 1,655.5 ms | 15,978.1 ms | 9.6x | 256 |
| No match | 1 | 1,435.2 ms | 32,914.9 ms | 22.9x | 0 |
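
As a cross-check, the overall averages can be recomputed from the per-tier rows of the table. The snippet below only re-derives numbers already reported above; the variable names are illustrative, and the weighting by sample count is an assumption about how the tiers were combined.

```python
# (frequency tier, samples, auto avg ms, fs avg ms, reported avg speedup)
rows = [
    ("0.010%", 3, 1455.1, 33561.0, 23.0),
    ("0.050%", 3, 1484.9, 32877.7, 22.2),
    ("0.100%", 6, 1496.8, 38953.7, 26.0),
    ("1.000%", 3, 1655.5, 15978.1, 9.6),
]

total = sum(n for _, n, _, _, _ in rows)  # 15 valid keywords

# Sample-weighted means of the per-tier latencies.
auto_avg = sum(n * auto for _, n, auto, _, _ in rows) / total
fs_avg = sum(n * fs for _, n, _, fs, _ in rows) / total

# Two different ways to summarize the speedup.
ratio_of_means = fs_avg / auto_avg
mean_of_tier_speedups = sum(n * s for _, n, _, _, s in rows) / total

print(f"auto avg: {auto_avg:.1f} ms, fs avg: {fs_avg:.1f} ms")
print(f"ratio of means: {ratio_of_means:.1f}x")
print(f"mean of tier speedups: {mean_of_tier_speedups:.1f}x")
```

The weighted latencies reproduce the reported 1,517.8 ms and 32,064.8 ms. The ratio of those two means is about 21.1x, while the sample-weighted mean of the tier speedups is about 21.4x, so the reported 21.4x average appears to be a mean of per-keyword speedups rather than a ratio of mean latencies.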

Effectiveness

Dataset and setup:

  • Real code repository imported by effectiveness/step1_add_resource.py
  • Ground truth generated with engine=fs
  • Evaluated engine=auto using VikingDB BM25 recall + local regex matching
  • 14 grep patterns, including English identifiers, Chinese terms, alternation, and no-match query

Key result:

  • Weighted recall: 96.3%.
  • 8 / 14 patterns reached 100% recall.
  • The previous “independent token exists but BM25 missed it” indexing issue was not reproduced.
  • Remaining false negatives are concentrated in two known categories:
    • non-ASCII / Japanese URI write or indexing issue;
    • BM25 token semantics vs filesystem regex substring semantics, e.g. embedding vs embeddings, grep vs ripgrep, reindex vs requireindex.

Representative results:

| Pattern | Truth | Found | Recall | Note |
| --- | --- | --- | --- | --- |
| build_index | 14 | 14 | 100.0% | stable full recall |
| SyncHTTPClient | 22 | 22 | 100.0% | stable full recall |
| MarkdownParser | 18 | 18 | 100.0% | stable full recall |
| 检索 | 23 | 23 | 100.0% | stable full recall |
| 向量数据库 | 4 | 4 | 100.0% | stable full recall |
| vikingdb | 97 | 96 | 99.0% | only remaining miss is a Japanese URI case |
| add_resource | 78 | 77 | 98.7% | only remaining miss is a Japanese URI case |
| embedding | 163 | 155 | 95.1% | URI issue + token/sub-string semantic gap |
| grep | 77 | 72 | 93.5% | URI issue + pgrep/ripgrep semantic gap |
| reindex | 26 | 24 | 92.3% | requireindex vs reindex, expected token-semantic gap |

In short, BM25 grep provides a large and stable latency improvement on large datasets, while the observed recall gaps are now mostly explainable by URI indexing issues or the expected difference between token recall and regex substring matching.

Notes and Caveats

1. BM25 recall is keyword based, not regex based

The VikingDB phase-1 recall step uses keyword search. It does not execute regex. The regex is still applied in phase 2 on locally read candidate files.

Simple alternation such as error|warning|fail can work well because the implementation can turn it into keyword-like recall terms. Complex regex patterns may not recall the same candidate set as a full filesystem scan, for example:

  • character classes: [Ee]rror
  • quantifiers: test\d+
  • anchors: ^import
  • wildcard-heavy patterns: .*config.*

For complex regex workloads that require exhaustive regex semantics, configure grep.engine = "fs".

2. Tokenizer behavior can affect recall

VikingDB FullText uses tokenizer-based indexing. This can differ from substring regex matching:

  • CJK token boundaries may not align with arbitrary substrings.
  • Singular / plural or other inflected forms may be treated as different tokens.
  • Stop words or punctuation-heavy queries may not recall all files that a raw regex scan would match.

Phase 2 guarantees precision for recalled files, but phase 1 recall may still miss files if the tokenizer does not produce matching tokens.
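
The gap between token recall and substring matching can be shown with a toy example. The tokenizer below is a deliberately simple word splitter and is not VikingDB's actual tokenizer; `tokenize`, `token_recall`, `regex_hits`, and the sample documents are all made up for illustration.

```python
import re
from typing import Dict, List, Set

# Made-up documents mirroring the semantic gaps reported in the benchmark.
docs: Dict[str, str] = {
    "a.md": "uses an embedding model",
    "b.md": "stores embeddings on disk",
    "c.md": "run ripgrep here",
}


def tokenize(text: str) -> Set[str]:
    """Toy tokenizer: lowercase runs of letters, digits, and underscores."""
    return {tok.lower() for tok in re.findall(r"[A-Za-z0-9_]+", text)}


def token_recall(keyword: str) -> List[str]:
    """Phase-1 stand-in: a file is recalled only if a whole token matches."""
    return [uri for uri, text in docs.items() if keyword.lower() in tokenize(text)]


def regex_hits(pattern: str) -> List[str]:
    """Filesystem-grep stand-in: any substring match counts."""
    return [uri for uri, text in docs.items() if re.search(pattern, text)]


print(token_recall("embedding"), regex_hits("embedding"))
print(token_recall("grep"), regex_hits("grep"))
```

In this toy, the token-based step recalls only a.md for "embedding" and nothing for "grep", while the substring scan also finds b.md ("embeddings") and c.md ("ripgrep"). Files missed in phase 1 never reach phase 2, which is the source of the false negatives listed in the Effectiveness results.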

3. Case handling may over-recall, which is safe

The tokenizer may normalize case, so BM25 can recall more files than a case-sensitive regex would match. This is safe because phase 2 applies the exact regex with the requested case_insensitive behavior and filters false positives.

Compatibility

  • Existing public grep callers continue to work without code changes.
  • Existing collections without FullText-compatible schema continue to work through filesystem grep fallback.
  • To benefit from BM25 acceleration, collections must have the content text field and FullText config.

github-actions Bot commented May 20, 2026


PR Reviewer Guide 🔍

(Review updated until commit fbf4fea)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 80
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Grep Engine Integration (Core + CLI + Tests)

Relevant files:

  • openviking/storage/viking_fs.py
  • openviking/storage/viking_vector_index_backend.py
  • openviking/storage/collection_schemas.py
  • openviking/server/routers/search.py
  • openviking/service/fs_service.py
  • openviking/async_client.py
  • openviking/client/local.py
  • openviking/sync_client.py
  • openviking_cli/client/sync_http.py
  • openviking_cli/client/http.py
  • openviking_cli/client/base.py
  • openviking/storage/queuefs/embedding_msg_converter.py
  • openviking/storage/ovpack/vectors.py
  • crates/ov_cli/src/main.rs
  • crates/ov_cli/src/client.rs
  • crates/ov_cli/src/handlers.rs
  • crates/ov_cli/src/commands/search.rs
  • tests/storage/test_rebuild_schema.py
  • tests/storage/test_collection_schemas.py

Sub-PR theme: Grep BM25 Benchmark Scripts

Relevant files:

  • benchmark/retrieval/grep/vikingdb_bm25/step1_generate.py
  • benchmark/retrieval/grep/vikingdb_bm25/step2_quick_add_resource.py
  • benchmark/retrieval/grep/vikingdb_bm25/step3_build_index.py
  • benchmark/retrieval/grep/vikingdb_bm25/step4_benchmark.py

Sub-PR theme: Vector DB Search by Keywords Extensions

Relevant files:

  • openviking/storage/vectordb/collection/http_collection.py
  • openviking/storage/vectordb/collection/collection.py
  • openviking/storage/vectordb/collection/volcengine_collection.py
  • openviking/storage/vectordb/collection/vikingdb_collection.py
  • openviking/storage/vectordb/collection/volcengine_api_key_collection.py
  • openviking/storage/vectordb/collection/local_collection.py
  • openviking/storage/vectordb/collection/volcengine_clients.py
  • openviking/storage/vectordb/collection/vikingdb_clients.py
  • openviking/storage/vectordb_adapters/base.py
  • openviking/storage/vectordb/utils/validation.py
  • openviking/storage/vectordb/service/app_models.py
  • tests/storage/mock_backend.py

⚡ Recommended focus areas for review

Naive Regex Splitting for BM25 Keywords

The code splits the regex pattern on '|' to extract keywords for BM25 search. This breaks for patterns that contain groups, character classes, or an escaped '|' (e.g., 'error|(warning|fail)', 'a\|b', '[a|b]'): the split produces incorrect keywords and can trigger an unnecessary fallback to fs mode.

# for bm25 search. Limit to 10 keywords per VikingDB API constraint.
keywords = [kw.strip() for kw in pattern.split("|") if kw.strip()][:10]
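
One conservative alternative to a plain split is to accept only patterns that are a top-level alternation of literals and to signal a fallback for everything else. The sketch below is not the PR's implementation; `extract_literal_keywords` is a hypothetical helper, and returning `None` to mean "use filesystem grep" is an assumed convention.

```python
from typing import List, Optional

# Unescaped characters that make a pattern more than a literal alternation.
_META = set("()[]{}.*+?^$")


def extract_literal_keywords(pattern: str, max_keywords: int = 10) -> Optional[List[str]]:
    """Return literal keywords for BM25 recall, or None if unsafe to split."""
    branches: List[str] = []
    current: List[str] = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # Escaped punctuation (e.g. \|) is a literal character; escaped
            # letters or digits (e.g. \d, \b, \1) are classes, anchors, or
            # backreferences and cannot be turned into a keyword.
            if i + 1 >= len(pattern) or pattern[i + 1].isalnum():
                return None
            current.append(pattern[i + 1])
            i += 2
            continue
        if ch in _META:
            return None
        if ch == "|":
            branches.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    branches.append("".join(current))

    keywords = [b.strip() for b in branches]
    # An empty branch matches every line, so keyword recall cannot cover it.
    # Too many branches would force truncation and silently drop recall.
    if any(not k for k in keywords) or len(keywords) > max_keywords:
        return None
    return keywords
```

Unlike truncating to the first 10 keywords, this version refuses to use BM25 when the alternation has more branches than the limit, trading some acceleration for not silently missing the later branches.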
Count Cache Eviction Not Atomic

The _count_cache eviction logic (checking size then deleting keys) is not atomic in an async context. Multiple concurrent calls could lead to unexpected cache behavior, though the impact is low since it's a cache.

if len(self._count_cache) >= self._count_cache_max_size:
    oldest_keys = sorted(self._count_cache, key=lambda k: self._count_cache[k][1])
    for k in oldest_keys[:len(oldest_keys) // 2]:
        del self._count_cache[k]
self._count_cache[cache_key] = (count, now)
Broad Exception Swallowing in search_by_keywords

The search_by_keywords method catches all exceptions and returns an empty list, which could hide errors. Consider logging the exception and re-raising or falling back more explicitly.

except Exception as e:
    logger.error("Error searching by keywords: %s", e)
    return []
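
Returning an empty list on failure makes "the backend failed" indistinguishable from "no file matched", which matters if the caller is expected to fall back to filesystem grep on errors. A minimal sketch of one way to keep the two cases apart is shown below; `KeywordSearchError`, `recall_or_none`, and the `backend_call` parameter are hypothetical names, not part of the PR.

```python
import logging
from typing import Callable, List, Optional

logger = logging.getLogger(__name__)


class KeywordSearchError(RuntimeError):
    """Hypothetical error type: the backend keyword search itself failed."""


def search_by_keywords(backend_call: Callable[[List[str]], List[str]],
                       keywords: List[str]) -> List[str]:
    """Log the failure with its traceback, then re-raise a typed error."""
    try:
        return backend_call(keywords)
    except Exception as exc:
        logger.exception("Error searching by keywords")
        raise KeywordSearchError(str(exc)) from exc


def recall_or_none(backend_call: Callable[[List[str]], List[str]],
                   keywords: List[str]) -> Optional[List[str]]:
    """Return candidates, or None to tell the caller to use filesystem grep."""
    try:
        return search_by_keywords(backend_call, keywords)
    except KeywordSearchError:
        return None
```

With this shape, an empty list means BM25 ran and found nothing, while `None` means the caller should run the full filesystem scan.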

@github-actions


PR Code Suggestions ✨

No code suggestions found for the PR.

@qin-ctx qin-ctx requested a review from zhoujh01 May 20, 2026 08:57
@ByteDanceLiuYang ByteDanceLiuYang marked this pull request as draft May 20, 2026 09:36
@ByteDanceLiuYang ByteDanceLiuYang force-pushed the grep_vikingdb branch 8 times, most recently from 92329f5 to fbf4fea May 23, 2026 15:18
@ByteDanceLiuYang ByteDanceLiuYang marked this pull request as ready for review May 25, 2026 04:01
@github-actions


Persistent review updated to latest commit fbf4fea

@ByteDanceLiuYang ByteDanceLiuYang force-pushed the grep_vikingdb branch 6 times, most recently from 0672236 to 4d21a96 May 25, 2026 07:59
@ByteDanceLiuYang ByteDanceLiuYang marked this pull request as draft May 25, 2026 12:51
Comment thread crates/ov_cli/src/commands/search.rs Outdated
node_limit: i32,
level_limit: i32,
engine: Option<String>,
switch_to_remote_threshold: Option<i32>,

Collaborator

These parameters are rarely used; consider moving them into ovcli.conf instead of exposing them as flags.

Contributor Author

OK, adjusted.

offset: int = 0,
filters: Optional[Dict[str, Any]] = None,
output_fields: Optional[List[str]] = None,
mode: Optional[str] = None,

Collaborator

The design of these two parameters still needs more thought.

Contributor Author

OK, I have removed them for now. In OpenViking these 2 parameters can currently be left unset, in which case the default behavior applies.

Projects

Status: Backlog

2 participants