volcengine · ByteDanceLiuYang · May 20, 2026 · May 21, 2026
diff --git a/benchmark/locomo/vikingbot/preflight_eval_config.py b/benchmark/locomo/vikingbot/preflight_eval_config.py
@@ -64,7 +64,9 @@ def _resolve_ov_conf_path() -> Path:
         return Path(configured_path).expanduser()
 
     resolved = resolve_config_path(None, OPENVIKING_CONFIG_ENV, DEFAULT_OV_CONF)
-    default_path = str(resolved) if resolved is not None else str(Path.home() / ".openviking" / "ov.conf")
+    default_path = (
+        str(resolved) if resolved is not None else str(Path.home() / ".openviking" / "ov.conf")
+    )
 
     if _is_interactive():
         _log_info(f"OpenViking 配置默认路径: {default_path}")

diff --git a/benchmark/retrieval/grep/vikingdb_bm25/README.md b/benchmark/retrieval/grep/vikingdb_bm25/README.md
@@ -0,0 +1,118 @@
+# VikingDB BM25 Grep Benchmark
+
+Benchmark suite for evaluating OpenViking's grep retrieval with the VikingDB BM25 engine.
+
+## Directory Structure
+
+```
+vikingdb_bm25/
+├── ai_wiki.txt              # Source text for synthetic data generation
+├── effectiveness/            # Retrieval effectiveness (recall/precision/F1)
+│   ├── step1_add_resource.py
+│   └── step2_quality.py
+└── performance/              # Retrieval performance (latency + returned match count at scale)
+    ├── step0_prepare_data.py
+    ├── step1_add_resource.py
+    ├── step2_reindex.py
+    └── step3_benchmark.py
+```
+
+## Effectiveness — Retrieval Quality
+
+Tests whether grep can find **all** matching files in real code repositories.
+
+**Data source:** Real code repos (download manually, place under `~/.openviking/data/benchmark/`).
+
+| Step | Script | Description |
+|------|--------|-------------|
+| 1 | `step1_add_resource.py` | Import code repos (with indexing, single import) |
+| 2 | `step2_quality.py` | Compare grep results vs ground truth (fs engine, cached) |
+
+### Usage
+
+```bash
+# Step 1: Import repos (with VLM/embedding, single import)
+cd effectiveness/
+python3 step1_add_resource.py --source ~/.openviking/data/benchmark/OpenViking-main
+
+# Step 2: Evaluate retrieval quality
+#   The first run MUST use engine=fs in ov.conf to generate the ground-truth cache:
+#     1. Set ov.conf: "grep": {"engine": "fs"}
+#     2. Restart server
+python3 step2_quality.py --keywords grep reindex SyncHTTPClient
+
+#   Subsequent runs can use any engine (ground truth is read from cache):
+#     1. Set ov.conf: "grep": {"engine": "auto", "switch_to_remote_threshold": 0}
+#     2. Restart server
+python3 step2_quality.py --keywords grep reindex SyncHTTPClient
+
+# Optional: --regenerate-ground-truth  (force recompute, requires engine=fs)
+```
+
+## Performance — Latency at Scale
+
+Tests grep speed and returned match count on a large synthetic dataset (default: 200K files).
+
+**Data source:** Generated from `ai_wiki.txt` with target words injected at known probabilities.
+
+| Step | Script | Description |
+|------|--------|-------------|
+| 0 | `step0_prepare_data.py` | Generate synthetic dataset (dir_xxx/wiki_xxx.txt) |
+| 1 | `step1_add_resource.py` | Import data (no VLM/embedding, fast) |
+| 2 | `step2_reindex.py` | Async reindex via openviking-server (concurrency=16, polling) |
+| 3 | `step3_benchmark.py` | Measure latency and returned match count with `node_limit=256` |
+
+### Target Words
+
+15 words in 5 groups of three, spread over 4 distinct injection probabilities (two groups share 0.1%).
+
+These word groups are defined in `performance/step0_prepare_data.py` and reused by `performance/step3_benchmark.py`.
+
+| Probability | Words | Expected hits (per 200K files) |
+|-------------|-------|-------------------------------|
+| 1% | heliofract, prismcache, fluxkernel | ~2,000 |
+| 0.1% | auroracode, kiteshade, glyphvector | ~200 |
+| 0.1% | cortexmint, latticewave, spiralsync | ~200 |
+| 0.05% | ripplehash, embertrace, novaframe | ~100 |
+| 0.01% | zephyrloom, quartzrelay, nebulaindex | ~20 |
+
+### Usage
+
+```bash
+cd performance/
+
+# Step 0: Generate data (default: 200 dirs x 1000 files = 200K files)
+python3 step0_prepare_data.py
+
+# Optional: append more data for scale-out (choose a --start-dir beyond your existing dirs so they are not overwritten)
+python3 step0_prepare_data.py --start-dir 100 --num-dirs 100
+
+# Step 1: Import without indexing (fast)
+python3 step1_add_resource.py
+
+# Step 2: Build vector indexes (requires openviking-server running)
+python3 step2_reindex.py
+# Optional: --concurrency N  (default: 16)
+
+# Step 3: Benchmark — run with different engine configs
+#   Run A: fs engine
+#     1. Set ov.conf: "grep": {"engine": "fs"}
+#     2. Restart server
+python3 step3_benchmark.py --engine-label fs
+
+#   Run B: auto engine (bm25)
+#     1. Set ov.conf: "grep": {"engine": "auto", "switch_to_remote_threshold": 0}
+#     2. Restart server
+python3 step3_benchmark.py --engine-label auto --compare step3_result_fs.json
+```
+
+## Key Concepts
+
+- **Effectiveness** tests compare grep results against ground truth from fs-engine grep (cached locally)
+- **Performance** tests compare grep latency and returned match counts between engine configs; no ground truth is generated
+- **Effectiveness** imports real repos with indexing in a single step, then evaluates quality
+- **Performance** imports synthetic data without indexing, builds vector indexes asynchronously, then benchmarks latency
+- **Performance** import/reindex steps support resumable execution via progress files
+- Change the grep engine via `ov.conf` and restart the server between benchmark runs
+- To scale the synthetic dataset horizontally, run Step 0 again with a new `--start-dir`,
+  then rerun Step 1 and Step 2
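
The effectiveness benchmark is described in terms of recall, precision, and F1 against a cached fs-engine ground truth, but the scoring code in `step2_quality.py` is not shown in this diff. As a rough illustration only, per-keyword scores could be computed from two sets of file paths as below. The function name `precision_recall_f1` and the sample paths are hypothetical, and returning 0.0 for empty sets is a convention chosen for this sketch, not necessarily what the script does.

```python
from typing import Iterable, Tuple


def precision_recall_f1(found: Iterable[str], truth: Iterable[str]) -> Tuple[float, float, float]:
    """Score one keyword: `found` is what grep returned, `truth` is the cached ground truth."""
    found_set, truth_set = set(found), set(truth)
    true_positives = len(found_set & truth_set)

    # Assumed convention for this sketch: an empty denominator yields 0.0.
    precision = true_positives / len(found_set) if found_set else 0.0
    recall = true_positives / len(truth_set) if truth_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical result sets for a single keyword.
found_files = {"a.py", "b.py", "c.py"}
truth_files = {"a.py", "b.py", "d.py", "e.py"}
p, r, f = precision_recall_f1(found_files, truth_files)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
# prints: precision=0.667 recall=0.500 f1=0.571
```

Because the README asks whether grep finds **all** matching files, recall is the number to watch first: a BM25-backed engine that drops matches shows up as recall below 1.0 even when precision stays perfect.
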
diff --git a/benchmark/retrieval/grep/vikingdb_bm25/README_CN.md b/benchmark/retrieval/grep/vikingdb_bm25/README_CN.md
@@ -0,0 +1,117 @@
+# VikingDB BM25 Grep Benchmark
+
+A benchmark suite for evaluating OpenViking grep retrieval with the VikingDB BM25 engine.
+
+## Directory Structure
+
+```
+vikingdb_bm25/
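+├── ai_wiki.txt              # Source text for synthetic data generation
+├── effectiveness/            # Retrieval effectiveness tests (recall/precision/F1)
+│   ├── step1_add_resource.py
+│   └── step2_quality.py
+└── performance/              # Retrieval performance tests (latency + returned match count at scale)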
+    ├── step0_prepare_data.py
+    ├── step1_add_resource.py
+    ├── step2_reindex.py
+    └── step3_benchmark.py
+```
+
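+## Effectiveness: Retrieval Quality
+
+Tests whether grep can find **all** matching files in real code repositories.
+
+**Data source:** Real code repositories (download manually and place under `~/.openviking/data/benchmark/`).
+
+| Step | Script | Description |
+|------|--------|-------------|
+| 1 | `step1_add_resource.py` | Import code repos (with indexing, in a single import) |
+| 2 | `step2_quality.py` | Compare SDK grep results against fs-engine ground truth (cached) |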
+
+### Usage
+
+```bash
+# Step 1: Import code repos (with indexing, in a single import)
+cd effectiveness/
+python3 step1_add_resource.py --source ~/.openviking/data/benchmark/OpenViking-main
+
+# Step 2: Evaluate retrieval quality
+#   The first run must use engine=fs to generate the ground-truth cache:
+#     1. Set ov.conf: "grep": {"engine": "fs"}
+#     2. Restart the server
+python3 step2_quality.py --keywords grep reindex SyncHTTPClient
+
+#   Later runs can use any engine (ground truth is read from the cache):
+#     1. Set ov.conf: "grep": {"engine": "auto", "switch_to_remote_threshold": 0}
+#     2. Restart the server
+python3 step2_quality.py --keywords grep reindex SyncHTTPClient
+
+# Optional flag: --regenerate-ground-truth  (force recomputation; requires engine=fs)
+```
+
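+## Performance: Retrieval Latency
+
+Tests grep speed and returned match count on a large synthetic dataset (default: 200K files).
+
+**Data source:** Generated from `ai_wiki.txt`, with target words injected at known probabilities.
+
+| Step | Script | Description |
+|------|--------|-------------|
+| 0 | `step0_prepare_data.py` | Generate the synthetic dataset (dir_xxx/wiki_xxx.txt) |
+| 1 | `step1_add_resource.py` | Import data (no indexing, fast) |
+| 2 | `step2_reindex.py` | Build indexes asynchronously via openviking-server (concurrency=16, polling) |
+| 3 | `step3_benchmark.py` | Measure latency and returned match count with `node_limit=256` |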
+
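+### Target Words
+
+15 words in 5 groups of three, spread over 4 distinct injection probabilities (two groups share 0.1%).
+
+These word groups are defined in `performance/step0_prepare_data.py` and reused by `performance/step3_benchmark.py`.
+
+| Probability | Words | Expected hits (per 200K files) |
+|-------------|-------|-------------------------------|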
+| 1% | heliofract, prismcache, fluxkernel | ~2,000 |
+| 0.1% | auroracode, kiteshade, glyphvector | ~200 |
+| 0.1% | cortexmint, latticewave, spiralsync | ~200 |
+| 0.05% | ripplehash, embertrace, novaframe | ~100 |
+| 0.01% | zephyrloom, quartzrelay, nebulaindex | ~20 |
+
+### Usage
+
+```bash
+cd performance/
+
+# Step 0: Generate data (default: 200 dirs x 1000 files = 200K files)
+python3 step0_prepare_data.py
+
+# Optional: append more data for scale-out (choose a --start-dir beyond your existing dirs so they are not overwritten)
+python3 step0_prepare_data.py --start-dir 100 --num-dirs 100
+
+# Step 1: Import data (no indexing, fast)
+python3 step1_add_resource.py
+
+# Step 2: Build vector indexes (requires a running openviking-server)
+python3 step2_reindex.py
+# Optional flag: --concurrency N  (default: 16)
+
+# Step 3: Benchmark; run once per engine config
+#   Run A: fs engine
+#     1. Set ov.conf: "grep": {"engine": "fs"}
+#     2. Restart the server
+python3 step3_benchmark.py --engine-label fs
+
+#   Run B: auto engine (bm25)
+#     1. Set ov.conf: "grep": {"engine": "auto", "switch_to_remote_threshold": 0}
+#     2. Restart the server
+python3 step3_benchmark.py --engine-label auto --compare step3_result_fs.json
+```
+
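+## Key Concepts
+
+- **Effectiveness** tests compare grep results against fs-engine ground truth (cached locally)
+- **Performance** tests compare latency and returned match counts across engines; no ground truth is generated
+- **Effectiveness** imports real code repos with indexing in a single step, then runs the quality evaluation
+- **Performance** first imports synthetic data (without indexing), then builds vector indexes asynchronously, and finally runs the latency benchmark
+- **Performance** import and reindex steps support **resumable execution** (each has its own progress file)
+- To switch grep engines, edit `ov.conf` and restart the server, then compare results across runs
+- To scale the synthetic dataset horizontally, run Step 0 again with a new `--start-dir`, then rerun Step 1 and Step 2.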
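
The "Expected hits" column in the Target Words table is the file count multiplied by the injection probability. The sketch below reproduces that arithmetic and adds a rough spread, assuming each file independently contains the word with the listed probability (a binomial model). That independence assumption, the `TIERS` mapping, and the `expected_hits` helper are illustrative only; the actual sampling in `step0_prepare_data.py` is not shown in this diff.

```python
import math
from typing import Dict, Tuple

# Distinct probabilities taken from the Target Words table (two groups share 0.1%).
TIERS: Dict[str, float] = {
    "1%": 0.01,
    "0.1%": 0.001,
    "0.05%": 0.0005,
    "0.01%": 0.0001,
}


def expected_hits(num_files: int, probability: float) -> Tuple[float, float]:
    """Return (mean, standard deviation) of matching files under a binomial model."""
    mean = num_files * probability
    std = math.sqrt(num_files * probability * (1.0 - probability))
    return mean, std


NUM_FILES = 200_000  # default dataset size: 200 dirs x 1000 files
for label, probability in TIERS.items():
    mean, std = expected_hits(NUM_FILES, probability)
    print(f"{label:>6}: about {mean:.0f} files, give or take {std:.1f}")
```

Under this model the 1% tier has a mean of 2,000 matching files with a standard deviation of about 44.5, so modest deviations from the table's approximate figures are expected. Note also that Step 3 runs with `node_limit=256`, so the returned match count for the higher-probability tiers may be capped well below the number of files that actually contain the word.
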