feat(graphrag): fix merge concurrency and add resume-from-checkpoint #14238
Conversation
GraphRAG: fix merge concurrency bug and add resume-from-checkpoint support
This PR addresses four related GraphRAG reliability issues that together
allow long-running GraphRAG tasks (10+ hours of LLM extraction) to be
resumed after a crash or pause without redoing completed work.
## 1. Fix concurrent merge crash (priority bug)
Long GraphRAG runs would crash near the end with:
```
RuntimeError: dictionary keys changed during iteration
```
in `Extractor._merge_graph_nodes` during entity resolution. Two changes:
- `rag/graphrag/general/extractor.py`: snapshot `graph.neighbors(node1)`
via `list(...)` before iterating, so concurrent `add_edge` /
`remove_node` mutations on the shared `nx.Graph` cannot invalidate the
iterator.
- `rag/graphrag/entity_resolution.py`: serialize the merge step with a
dedicated `asyncio.Semaphore(1)`. `nx.Graph` is not thread-safe and
concurrent merges on overlapping neighbourhoods can produce incorrect
results even with the snapshot fix.
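The combination of the two fixes can be sketched as follows. This is a minimal standalone sketch with simplified attribute handling; the real `Extractor` also merges descriptions and source IDs, and `merge_graph_nodes` / `merge_serialized` are illustrative names, not the actual method signatures.

```python
import asyncio

import networkx as nx

# Serialize graph mutation: nx.Graph is not thread-safe, and overlapping
# merges must not interleave even with the snapshot below.
merge_semaphore = asyncio.Semaphore(1)


def merge_graph_nodes(graph: nx.Graph, node0: str, node1: str) -> None:
    # Snapshot the neighbor view: iterating graph.neighbors(node1) directly
    # raises "dictionary keys changed during iteration" if the graph is
    # mutated while the iterator is live.
    for neighbor in list(graph.neighbors(node1)):
        if neighbor == node0:
            continue
        edge_attrs = dict(graph.get_edge_data(node1, neighbor) or {})
        if graph.has_edge(node0, neighbor):
            # Edge-merge path: accumulate weight instead of overwriting.
            graph[node0][neighbor]["weight"] = (
                graph[node0][neighbor].get("weight", 0) + edge_attrs.get("weight", 0)
            )
        else:
            graph.add_edge(node0, neighbor, **edge_attrs)
    graph.remove_node(node1)


async def merge_serialized(graph: nx.Graph, node0: str, node1: str) -> None:
    # Only one merge mutates the shared graph at a time.
    async with merge_semaphore:
        merge_graph_nodes(graph, node0, node1)
```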
## 2. Don't wipe partial graph on pause
Previously the pause / cancel UI path called
`settings.docStoreConn.delete({"knowledge_graph_kwd": [...]}, ...)`,
destroying every subgraph, entity, relation, and graph row. Re-triggering
then started GraphRAG from scratch even though PR infiniflow#14096 had already added
`load_subgraph_from_store`.
- `api/apps/kb_app.py`: `unbind_task` now accepts `wipe` (default
`true` to preserve historical behaviour). When `wipe=false` the task is
cancelled but doc-store rows are kept.
- `web/...`: the GraphRAG pause action now passes `wipe=false`; raptor is
unchanged.
## 3. Phase-completion markers (`rag/graphrag/phase_markers.py`)
A small Redis-backed marker layer at
`graphrag:phase:{kb_id}:{resolution_done|community_done}` (7-day TTL).
`run_graphrag_for_kb` consults the markers on entry and skips phases that
already completed in a prior run; markers are cleared automatically when new
docs are merged into the graph (which invalidates prior resolution and
community results) and when `unbind_task` wipes the graph. Redis failures
never block a run -- markers are an optimization, not a gate.
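A minimal sketch of the marker layer; the `redis_conn` client parameter and these function shapes are illustrative, not the exact module API.

```python
import logging

PHASE_TTL_SECONDS = 7 * 24 * 3600  # 7-day TTL, matching the scheme above


def _phase_key(kb_id: str, phase: str) -> str:
    return f"graphrag:phase:{kb_id}:{phase}"


def set_phase_marker(redis_conn, kb_id: str, phase: str) -> None:
    try:
        redis_conn.set(_phase_key(kb_id, phase), "1", ex=PHASE_TTL_SECONDS)
    except Exception:
        # Markers are an optimization, not a gate: never fail the run.
        logging.exception("phase marker set failed for %s", kb_id)


def has_phase_marker(redis_conn, kb_id: str, phase: str) -> bool:
    try:
        return bool(redis_conn.get(_phase_key(kb_id, phase)))
    except Exception:
        return False  # on Redis failure, just re-run the phase


def clear_phase_markers(redis_conn, kb_id: str) -> None:
    for phase in ("resolution_done", "community_done"):
        try:
            redis_conn.delete(_phase_key(kb_id, phase))
        except Exception:
            logging.exception("phase marker clear failed for %s", kb_id)
```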
## 4. Idempotent community detection
`extract_community` previously did `delete-then-insert` on
`community_report` rows; a crash mid-insert left the dataset with no
reports. Now report IDs are derived deterministically from
`(kb_id, community.title)`, the existing report IDs are snapshotted before
insert, new rows are written, then only stale rows are pruned. A failure at
any step leaves either the prior or the new report set intact -- never a
partial mix.
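The snapshot-then-prune scheme can be sketched as below. This is a simplified sketch: `store` stands in for the doc-store, and `report_id` is an illustrative hash-based derivation rather than the exact ID function.

```python
import hashlib


def report_id(kb_id: str, title: str) -> str:
    # Deterministic: the same (kb_id, title) always maps to the same ID,
    # so a re-run after a crash overwrites rather than duplicates.
    return hashlib.sha256(f"{kb_id}\x00{title}".encode()).hexdigest()


def upsert_reports(store: dict, kb_id: str, reports: list[dict]) -> None:
    old_ids = set(store)                      # snapshot existing IDs first
    new_ids = set()
    for rep in reports:
        rid = report_id(kb_id, rep["title"])
        store[rid] = rep                      # write new rows before pruning
        new_ids.add(rid)
    for stale in old_ids - new_ids:           # prune stale rows last, so a
        del store[stale]                      # crash leaves old or new set
```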
## Tests
- `test/unit_test/rag/graphrag/test_merge_graph_nodes.py` (3 tests):
dense neighbourhood merge, neighbour-snapshot regression, concurrent
serialized merges.
- `test/unit_test/rag/graphrag/test_phase_markers.py` (4 tests): set/has
round-trip, kb-scoped clear, no-op on empty input, graceful Redis failure.
- `test/testcases/test_web_api/test_kb_app/test_kb_routes_unit.py`:
`test_unbind_task_wipe_flag` covers `wipe=false` for both GraphRAG and
raptor and confirms the default still wipes.
📝 Walkthrough

Fixes a concurrency bug in GraphRAG entity resolution where concurrent graph mutations caused "dictionary keys changed during iteration" errors. Introduces phase-marker-based checkpointing to enable resume capability, adds deterministic persistence for community reports, and provides control over partial-progress preservation during task pause operations.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Client/UI
    participant BuildOne as build_one<br/>(GraphRAG)
    participant Redis as Redis<br/>(Phase Markers)
    participant Extractor as Extractor<br/>(Merge/Extract)
    participant KB as Knowledge Base<br/>(Documents)
    Client->>BuildOne: build_one(kb_id, ...)
    BuildOne->>Redis: has_phase_marker(kb_id, PHASE_RESOLUTION)
    alt Resolution pending
        Redis-->>BuildOne: False
        BuildOne->>Extractor: resolve_entities()
        Extractor->>Extractor: _merge_graph_nodes(snapshot neighbors)
        Extractor->>KB: update merged nodes/edges
        BuildOne->>Redis: set_phase_marker(kb_id, PHASE_RESOLUTION)
        Redis-->>BuildOne: True
    else Resolution complete
        Redis-->>BuildOne: True
        BuildOne->>BuildOne: log "already completed"
    end
    BuildOne->>Redis: has_phase_marker(kb_id, PHASE_COMMUNITY)
    alt Community pending
        Redis-->>BuildOne: False
        BuildOne->>Extractor: extract_community()
        Extractor->>KB: insert community_report<br/>(deterministic IDs)
        Extractor->>KB: prune stale reports
        BuildOne->>Redis: set_phase_marker(kb_id, PHASE_COMMUNITY)
        Redis-->>BuildOne: True
    else Community complete
        Redis-->>BuildOne: True
        BuildOne->>BuildOne: log "already completed"
    end
    BuildOne-->>Client: completion/resume result
```
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
rag/graphrag/general/extractor.py (1)
313-340: ⚠️ Potential issue | 🔴 Critical: Refresh `node0` adjacency after redirecting a new neighbor.

Line 313 snapshots `node0_neighbors` once and never updates it. If two merged nodes both connect to the same external neighbor that `nodes[0]` did not originally have, the second pass falls into the `else` branch on lines 339-340 and overwrites the first redirected edge instead of merging weight/description/source IDs. That silently drops relationship data during multi-node merges.

🐛 Minimal fix

```diff
 node0_attrs = graph.nodes[nodes[0]]
 node0_neighbors = set(graph.neighbors(nodes[0]))
 for node1 in nodes[1:]:
@@
         else:
             graph.add_edge(nodes[0], neighbor, **edge1_attrs)
+            node0_neighbors.add(neighbor)
     graph.remove_node(node1)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@rag/graphrag/general/extractor.py` around lines 313-340: the snapshot node0_neighbors is taken once and never updated, so when redirecting neighbors from node1 to nodes[0] the first redirected edge can be overwritten by a later redirect because nodes[0] is not seen as having that neighbor; to fix, after creating or merging an edge toward a neighbor update node0_neighbors (e.g., node0_neighbors.add(neighbor)) so subsequent iterations treat it as an existing neighbor and perform the edge-merge path (adjust weight/description/source_id) rather than the else branch that blindly adds/overwrites via graph.add_edge; locate the loop handling graph.neighbors(node1) and the branches that call graph.add_edge and self._handle_entity_relation_summary to apply this change.

rag/graphrag/general/index.py (1)
682-706: ⚠️ Potential issue | 🟠 Major: Community report IDs can collide when different communities receive identical titles from the LLM.

The ID is deterministically derived from `kb_id` and `stru['title']` (line 688). Since the title comes from the LLM with no uniqueness constraint, two different communities within the same KB can receive the same title and thus the same ID. When chunks with identical IDs are inserted into the database, one silently overwrites the other with no collision detection or warning. This is a data loss issue: consider two overlapping communities that the LLM names identically; only one community report will persist in the index.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@rag/graphrag/general/index.py` around lines 682-706: the deterministic ID derivation using chunk_payload_for_id with only kb_id and stru['title'] can cause collisions when the LLM emits identical titles; update the payload used by chunk_id to include a secondary uniqueness factor (e.g., a fingerprint of the community membership or source ids) so different communities with the same title produce different ids. Concretely, modify chunk_payload_for_id (used by chunk_id) to include either a unique community identifier if present (e.g., stru['community_id']) or a stable hash/fingerprint of the sorted list of source ids/doc_ids (or sorted members) so chunk creation (chunk and chunks.append) yields collision-resistant ids while preserving deterministic behavior.
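A sketch of the suggested fix, hedged as an illustration: `members` stands in for whatever membership data the community carries (entity names or source IDs), and the function name is hypothetical.

```python
import hashlib


def collision_resistant_report_id(kb_id: str, title: str, members: list[str]) -> str:
    # Include a fingerprint of the sorted membership so two communities that
    # happen to receive the same LLM-generated title still get distinct,
    # deterministic IDs. Sorting makes the fingerprint order-independent.
    fingerprint = hashlib.sha256("\x00".join(sorted(members)).encode()).hexdigest()
    payload = f"{kb_id}\x00{title}\x00{fingerprint}"
    return hashlib.sha256(payload.encode()).hexdigest()
```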
🧹 Nitpick comments (2)
api/apps/kb_app.py (1)
820-843: Log the new wipe vs preserve path.

This route now has two very different outcomes, but nothing records which branch ran or whether GraphRAG artifacts/markers were cleared. That will be painful to debug once pause/resume issues hit production.

🪵 Small observability patch

```diff
 wipe_arg = (request.args.get("wipe", "true") or "true").strip().lower()
 wipe = wipe_arg not in ("false", "0", "no", "off")
+logging.info("Unbinding %s task for kb %s (wipe=%s)", pipeline_task_type, kb_id, wipe)
@@
 if wipe:
     settings.docStoreConn.delete({"knowledge_graph_kwd": ["graph", "subgraph", "entity", "relation", "community_report"]}, search.index_name(kb.tenant_id), kb_id)
     # Wiping the graph invalidates any phase-completion markers
     # for resolution / community detection.
     from rag.graphrag.phase_markers import clear_phase_markers
     clear_phase_markers(kb_id)
+    logging.info("Cleared GraphRAG artifacts and phase markers for kb %s", kb_id)
```

As per coding guidelines, `**/*.py`: Add logging for new flows.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@api/apps/kb_app.py` around lines 820-843: add logging to record whether the handler took the wipe (delete) or preserve path and what side effects occurred: log when cancel_task(...) is invoked (include task_id and kb_task_id_field), log the value of wipe, and if wipe is true log that settings.docStoreConn.delete(...) was executed (include kb_id and kb.tenant_id) and that clear_phase_markers(kb_id) was called; place these logs in the PipelineTaskType.GRAPH_RAG branch immediately before/after calling cancel_task, before calling settings.docStoreConn.delete, and after calling clear_phase_markers; use the module logger/processLogger already used in the file so messages include context for debugging.

test/testcases/test_web_api/test_kb_app/test_kb_routes_unit.py (1)
781-850: Assert the phase-marker cleanup here too.

This test only checks `docStoreConn.delete`. If `clear_phase_markers()` stops being called on the default GraphRAG wipe path, the test still passes even though resume behavior breaks. Stubbing that import and asserting 0 calls for `wipe=false` and 1 call for the default wipe path would cover the new behavior and keep the test isolated.

🧪 Targeted test extension

```diff
 def test_unbind_task_wipe_flag(monkeypatch):
@@
     cancelled = []
     deleted = []
     update_payloads = []
+    cleared_phase_markers = []
     monkeypatch.setattr(module.REDIS_CONN, "set", lambda key, value: cancelled.append((key, value)))
     monkeypatch.setattr(module.search, "index_name", lambda _tenant_id: "idx")
     monkeypatch.setattr(module.settings, "docStoreConn", SimpleNamespace(delete=lambda *args, **_kwargs: deleted.append(args)))
+    phase_markers_module = ModuleType("rag.graphrag.phase_markers")
+    phase_markers_module.clear_phase_markers = lambda dataset_id: cleared_phase_markers.append(dataset_id)
+    monkeypatch.setitem(sys.modules, "rag.graphrag.phase_markers", phase_markers_module)
@@
     assert res["code"] == module.RetCode.SUCCESS, res
     assert ("graph-task-cancel", "x") in cancelled, cancelled
     assert deleted == [], f"docStore.delete must not be called when wipe=false: {deleted}"
+    assert cleared_phase_markers == [], cleared_phase_markers
@@
     res = route()
     assert res["code"] == module.RetCode.SUCCESS, res
     assert len(deleted) == 1, f"default wipe must call docStore.delete once: {deleted}"
+    assert cleared_phase_markers == ["kb-1"], cleared_phase_markers
```

Based on learnings, applies to `tests/**/*.py`: Add/adjust tests for behavior changes.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test/testcases/test_web_api/test_kb_app/test_kb_routes_unit.py` around lines 781-850: the test needs to assert calls to clear_phase_markers invoked by delete_kb_task: stub module.clear_phase_markers (e.g. replace with a function that appends to a phase_calls list) before calling route(), then after each scenario assert phase_calls length is 0 for wipe="false" and wipe="0" (graph and raptor cases) and is 1 for the default GraphRAG case; reference the delete_kb_task route under test and the clear_phase_markers import on the module to locate where to monkeypatch and add assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@rag/graphrag/entity_resolution.py`:
- Around line 162-170: The merge semaphore currently wraps the entire awaited
call to _merge_graph_nodes (in limited_merge_nodes), which blocks during
network-bound work like _handle_entity_relation_summary; instead restrict the
critical section to only the graph-mutating steps: refactor _merge_graph_nodes
into a prepare/commit split (e.g., _merge_graph_nodes_prepare that does
summaries/IO and returns a concrete merge plan, and _merge_graph_nodes_commit
that applies add_edge/remove_node to the nx.Graph), then in limited_merge_nodes
await the prepare phase without the semaphore and only async with
merge_semaphore while calling the commit phase to perform the actual graph
reads/writes; alternatively compute a complete merge plan before acquiring
merge_semaphore and apply it inside the semaphore.
📒 Files selected for processing (12)

- api/apps/kb_app.py
- api/apps/services/dataset_api_service.py
- rag/graphrag/entity_resolution.py
- rag/graphrag/general/extractor.py
- rag/graphrag/general/index.py
- rag/graphrag/phase_markers.py
- test/testcases/test_web_api/test_kb_app/test_kb_routes_unit.py
- test/unit_test/rag/graphrag/test_merge_graph_nodes.py
- test/unit_test/rag/graphrag/test_phase_markers.py
- web/src/pages/dataset/dataset/generate-button/hook.ts
- web/src/services/knowledge-service.ts
- web/src/utils/api.ts
…e-marker cleanup

- extractor: track each redirected neighbor in node0_neighbors so a later merged node sharing the same external neighbor takes the edge-merge branch instead of overwriting via add_edge. Prevents silent loss of weight/description/source_id when multiple nodes are merged into one.
- kb_app.unbind_task: log kb_id, task type, wipe flag, and whether GraphRAG artefacts/phase markers were cleared, per repo logging guideline.
- test_unbind_task_wipe_flag: stub rag.graphrag.phase_markers and assert clear_phase_markers is called exactly once on default wipe and not at all when wipe=false (covers both GraphRAG and raptor paths).

Addresses CodeRabbit review on infiniflow#14238.
Thanks @coderabbitai. Pushed 33004a8 addressing the actionable items; a few suggestions were intentionally not changed.
The deferrals are reasonable. The PR looks good to me. 🐇✨
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/testcases/test_web_api/test_kb_app/test_kb_routes_unit.py`:
- Around line 827-852: Initialize the missing tracker list cleared_phase_markers
before it's asserted and ensure the mocked phase-marker-clearing logic appends
to it; specifically, add cleared_phase_markers = [] prior to setting up the
request/mocks used by route() (the same place where deleted is defined) and
update the mock/stub that simulates clearing phase markers to append the kb_id
(e.g., "kb-1") to cleared_phase_markers when invoked so the later assertions
against cleared_phase_markers pass.
📒 Files selected for processing (3)

- api/apps/kb_app.py
- rag/graphrag/general/extractor.py
- test/testcases/test_web_api/test_kb_app/test_kb_routes_unit.py
🚧 Files skipped from review as they are similar to previous changes (2)
- rag/graphrag/general/extractor.py
- api/apps/kb_app.py
…sert tuning

Main was merged in and PR infiniflow#14394 deleted api/apps/kb_app.py, taking unbind_task with it. The pause-without-wipe behaviour is restored on the new REST surface (DELETE /v1/datasets/<id>/<index_type>):

- api/apps/services/dataset_api_service.py: delete_index gains a wipe: bool = True parameter; when False the docStoreConn.delete call and clear_phase_markers() are skipped. Default preserves existing behaviour. Logs which branch ran.
- api/apps/restful_apis/dataset_api.py: parses ?wipe=false|0|no|off from request.args and forwards it to the service.
- web/src/utils/api.ts + web/src/services/knowledge-service.ts: unbindPipelineTask appends ?wipe=false when explicitly false. The GraphRAG pause path in generate-button/hook.ts already passes wipe: false for KnowledgeGraph; raptor and the explicit Delete action keep the historical wipe-true behaviour.
- test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py: new test_delete_index_wipe_flag_unit covers wipe=false (graph + raptor) and the default wipe path, asserting clear_phase_markers is invoked exactly when the docStore is wiped.

GraphRAG insert pipeline is also tunable now (was hardcoded es_bulk_size=4, sequential):

- rag/graphrag/utils.py: new insert_chunks_bounded() helper using a bounded asyncio.Semaphore. Defaults to 64 docs/batch and 4 batches in flight; override with GRAPHRAG_INSERT_BULK_SIZE and GRAPHRAG_INSERT_CONCURRENCY. Same retry/timeout semantics as the prior loop. set_graph now uses it.
- rag/graphrag/general/index.py: extract_community's community_report insert also uses insert_chunks_bounded.
Pushed 1889930 and updated the PR description. After merging main into this branch I noticed #14394 deleted `api/apps/kb_app.py`, so the pause-without-wipe behaviour was rebuilt on the new REST surface (details in the updated description).

Also added in this commit (called out as section 5 in the description): the GraphRAG insert pipeline is now tunable; the hardcoded sequential `es_bulk_size = 4` loop is replaced by `insert_chunks_bounded()`. All 7 GraphRAG unit tests plus the new dataset-API test pass locally.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main   #14238   +/-   ##
=======================================
  Coverage   94.16%   94.16%
=======================================
  Files          10       10
  Lines         703      703
  Branches      112      112
=======================================
  Hits          662      662
  Misses         25       25
  Partials       16       16
```

☔ View full report in Codecov by Sentry.
Hi @prpercival, the PR looks good to me. Please merge main and resolve the conflicts so that we can merge.
…n getKbDetail

The new /api/v1/datasets/<id> endpoint returns chunk_count/document_count, but legacy consumers (e.g. the GraphRAG/Raptor 'magic wand' enable check in dataset/index.tsx) read chunk_num/doc_num. Without this normalization the wand button stays disabled even when the dataset has chunks. Mirrors the existing mapDocumentToLegacy treatment for documents.
…checkpoint

```
# Conflicts:
#	api/apps/services/dataset_api_service.py
#	web/src/utils/api.ts
```
Just resolved, thanks for reviewing.
This PR addresses several related GraphRAG reliability issues that together allow long-running GraphRAG tasks (10+ hours of LLM extraction) to be resumed after a crash or pause without redoing completed work. It builds on #14096 (per-doc subgraph cache) and extends the same idea to the resolution and community-detection phases.
Fixes #14236.
1. Fix concurrent merge crash
Long GraphRAG runs would crash near the end of entity resolution with `RuntimeError: dictionary keys changed during iteration` in `Extractor._merge_graph_nodes`. Two changes:

- `rag/graphrag/general/extractor.py`: snapshot `graph.neighbors(node1)` via `list(...)` before iterating, so concurrent `add_edge` / `remove_node` mutations on the shared `nx.Graph` cannot invalidate the iterator. Also tracks each redirected neighbour in `node0_neighbors` so a later merged node sharing the same external neighbour takes the edge-merge branch instead of overwriting via `add_edge`.
- `rag/graphrag/entity_resolution.py`: serialize the merge step with a dedicated `asyncio.Semaphore(1)`. `nx.Graph` is not thread-safe and concurrent merges on overlapping neighbourhoods can produce incorrect results even with the snapshot fix.

2. Don't wipe partial graph on pause
Previously the pause / cancel UI path called `settings.docStoreConn.delete({"knowledge_graph_kwd": [...]}, ...)`, destroying every subgraph, entity, relation, and graph row. Re-triggering then started GraphRAG from scratch even though #14096 had already added `load_subgraph_from_store`.

After main was merged in (which deleted `api/apps/kb_app.py` per #14394), the pause path now lives on the new REST surface `DELETE /v1/datasets/<id>/<index_type>`:

- `api/apps/services/dataset_api_service.py`: `delete_index` accepts a `wipe: bool = True` parameter. When `False` the doc-store rows and GraphRAG phase markers are left intact and only the running task is cancelled. Default preserves historical behaviour.
- `api/apps/restful_apis/dataset_api.py`: parses `?wipe=false|0|no|off` from the query string and forwards it.
- `web/src/utils/api.ts` + `web/src/services/knowledge-service.ts`: `unbindPipelineTask` appends `?wipe=false` when explicitly false.
- `web/src/pages/dataset/dataset/generate-button/hook.ts` passes `wipe: false` for `KnowledgeGraph`; raptor is unchanged.

UX impact: the pause icon next to a running GraphRAG task no longer wipes graph data. The only path that still wipes is the explicit Delete action in `GenerateLogButton` (trash icon behind a confirmation modal).

3. Phase-completion markers (`rag/graphrag/phase_markers.py`)

A small Redis-backed marker layer at `graphrag:phase:{kb_id}:{resolution_done|community_done}` (7-day TTL). `run_graphrag_for_kb` consults the markers on entry and skips phases that already completed in a prior run. Markers are cleared automatically when new docs are merged into the graph, when `delete_index` wipes the graph, or when `delete_knowledge_graph` is called. Redis failures never block a run -- markers are an optimization, not a gate.
4. Idempotent community detection
`extract_community` previously did delete-then-insert on `community_report` rows; a crash mid-insert left the dataset with no reports. Now report IDs are derived deterministically from `(kb_id, community.title)`, the existing report IDs are snapshotted before insert, new rows are written, then only stale rows are pruned. A failure at any step leaves either the prior or the new report set intact -- never a partial mix.

5. Tunable doc-store insert pipeline
The GraphRAG insert loop in `rag/graphrag/utils.py` and the `community_report` insert in `rag/graphrag/general/index.py` were both hardcoded to `es_bulk_size = 4` and ran strictly sequentially. On a real KB this meant 1077 chunks took ~21 minutes for a 100-chunk slice -- pure round-trip overhead.

- New `insert_chunks_bounded()` helper in `rag/graphrag/utils.py` batches inserts via a bounded `asyncio.Semaphore`. Same retry / timeout semantics as the prior loop.
- Defaults to 64 docs/batch and 4 batches in flight (cf. `document_service.py`). Tunable per-deployment via `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`.
- `set_graph` and `extract_community` now use the helper.
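The bounded-concurrency helper can be sketched as below. This is a simplified sketch: `insert_batch` stands in for the real doc-store call and its retry/timeout wrapper, and the parameter defaults mirror the documented ones.

```python
import asyncio


async def insert_chunks_bounded(chunks, insert_batch, bulk_size=64, concurrency=4):
    # At most `concurrency` batches are in flight at once, so total in-flight
    # docs stay bounded by bulk_size * concurrency.
    semaphore = asyncio.Semaphore(concurrency)

    async def run(batch):
        async with semaphore:
            await insert_batch(batch)

    batches = [chunks[i:i + bulk_size] for i in range(0, len(chunks), bulk_size)]
    await asyncio.gather(*(run(b) for b in batches))
```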
BULK_SIZE × CONCURRENCY= 256 by default).Tests
- `test/unit_test/rag/graphrag/test_merge_graph_nodes.py` (3 tests): dense neighbourhood merge, neighbour-snapshot regression, concurrent serialized merges.
- `test/unit_test/rag/graphrag/test_phase_markers.py` (4 tests): set/has round-trip, kb-scoped clear, no-op on empty input, graceful Redis failure.
- `test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py`: new `test_delete_index_wipe_flag_unit` covers `wipe=false` for both GraphRAG and raptor on the new REST route, and confirms the default still wipes and clears phase markers.

Compatibility
- Default behaviour is unchanged (`wipe=true`, no markers expected).
- New optional query parameter `wipe` on `DELETE /v1/datasets/<id>/<index_type>`.
- New env vars `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`; defaults preserve safe behaviour.

Example of resume
Screenshot below shows a test resuming knowledge graph generation after applying the concurrency fix and re-deploying.