
feat(graphrag): fix merge concurrency and add resume-from-checkpoint#14238

Merged
wangq8 merged 9 commits into infiniflow:main from prpercival:feat/graphrag-resume-checkpoint on May 6, 2026
Conversation


@prpercival prpercival commented Apr 20, 2026

This PR addresses five related GraphRAG reliability and performance issues that together allow long-running GraphRAG tasks (10+ hours of LLM extraction) to be resumed after a crash or pause without re-doing completed work. It builds on #14096 (per-doc subgraph cache) and extends the same idea to the resolution and community-detection phases.

Fixes #14236.

1. Fix concurrent merge crash

Long GraphRAG runs would crash near the end of entity resolution with:

RuntimeError: dictionary keys changed during iteration

in Extractor._merge_graph_nodes. Two changes:

  • rag/graphrag/general/extractor.py: snapshot graph.neighbors(node1) via list(...) before iterating, so concurrent add_edge / remove_node mutations on the shared nx.Graph cannot invalidate the iterator. Also tracks each redirected neighbour in node0_neighbors so a later merged node sharing the same external neighbour takes the edge-merge branch instead of overwriting via add_edge.
  • rag/graphrag/entity_resolution.py: serialize the merge step with a dedicated asyncio.Semaphore(1). nx.Graph is not thread-safe and concurrent merges on overlapping neighbourhoods can produce incorrect results even with the snapshot fix.
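The two fixes above can be sketched with a toy model (plain dicts standing in for nx.Graph, and illustrative names throughout -- this is not the RAGFlow code):

```python
import asyncio

# A dict-backed adjacency, which is what nx.Graph's neighbors() view wraps.
neighbors = {"ENT_B": 1.0, "ENT_C": 1.0}

# The bug: a concurrent merge mutating the graph invalidates a live iterator.
crashed = False
try:
    for nb in neighbors:            # live view over the dict's keys
        neighbors["ENT_D"] = 1.0    # what a concurrent add_edge amounts to
except RuntimeError:
    crashed = True                  # same family of error as the PR's crash

# Fix 1: snapshot the neighbours first, as in list(graph.neighbors(node1)).
for nb in list(neighbors):          # snapshot -- mutation is now safe
    neighbors.setdefault("ENT_E", 1.0)

# Fix 2: serialize merges outright. asyncio.Semaphore(1) acts as an async
# mutex around the whole merge step on the shared graph.
merge_semaphore = asyncio.Semaphore(1)

async def merge_nodes(graph, keep, drop):
    async with merge_semaphore:     # one merge at a time
        graph[keep] = graph.get(keep, 0) + graph.pop(drop, 0)

async def main():
    graph = {"a": 1, "b": 2, "c": 3}
    await asyncio.gather(merge_nodes(graph, "a", "b"),
                         merge_nodes(graph, "a", "c"))
    return graph

merged = asyncio.run(main())        # {"a": 6}
```

The snapshot alone prevents the iterator crash; the semaphore additionally prevents two merges with overlapping neighbourhoods from interleaving their reads and writes.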

2. Don't wipe partial graph on pause

Previously the pause / cancel UI path called settings.docStoreConn.delete({"knowledge_graph_kwd": [...]}, ...), destroying every subgraph, entity, relation, and graph row. Re-triggering then started GraphRAG from scratch even though #14096 had already added load_subgraph_from_store.

After main was merged in (which deleted api/apps/kb_app.py per #14394), the pause path now lives on the new REST surface DELETE /v1/datasets/<id>/<index_type>:

  • api/apps/services/dataset_api_service.py: delete_index accepts a wipe: bool = True parameter. When False the doc-store rows and GraphRAG phase markers are left intact and only the running task is cancelled. Default preserves historical behaviour.
  • api/apps/restful_apis/dataset_api.py: parses ?wipe=false|0|no|off from the query string and forwards it.
  • web/src/utils/api.ts + web/src/services/knowledge-service.ts: unbindPipelineTask appends ?wipe=false when explicitly false.
  • The GraphRAG pause action in web/src/pages/dataset/dataset/generate-button/hook.ts passes wipe: false for KnowledgeGraph; raptor is unchanged.
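The query-string handling described above boils down to a small truthiness helper; a minimal sketch (the function name is illustrative):

```python
def parse_wipe(raw):
    """Parse the optional ?wipe= query value.

    The default ("true") preserves the historical wipe-on-pause behaviour;
    only explicit false-y spellings disable the wipe.
    """
    return (raw or "true").strip().lower() not in ("false", "0", "no", "off")

print(parse_wipe(None))      # True: no param keeps the historical default
print(parse_wipe("false"))   # False: pause without wiping graph data
```

Because absence of the parameter maps to `True`, clients queued before this change see identical behaviour.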

UX impact: the pause icon next to a running GraphRAG task no longer wipes graph data. The only path that still wipes is the explicit Delete action in GenerateLogButton (trash icon behind a confirmation modal).

3. Phase-completion markers (rag/graphrag/phase_markers.py)

A small Redis-backed marker layer at graphrag:phase:{kb_id}:{resolution_done|community_done} (7-day TTL). run_graphrag_for_kb consults the markers on entry and skips phases that already completed in a prior run. Markers are cleared automatically when:

  • new docs are merged into the graph (which invalidates prior resolution and community results),
  • delete_index wipes the graph, or
  • delete_knowledge_graph is called.

Redis failures never block a run -- markers are an optimization, not a gate.
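A minimal sketch of that marker contract (an in-memory dict stands in for Redis here; the real module is rag/graphrag/phase_markers.py and its exact API may differ):

```python
import time

PHASE_TTL = 7 * 24 * 3600          # 7-day TTL, as described above
_store = {}                        # key -> (value, expires_at); Redis stand-in

def _key(kb_id, phase):
    # graphrag:phase:{kb_id}:{resolution_done|community_done}
    return f"graphrag:phase:{kb_id}:{phase}"

def set_phase_marker(kb_id, phase):
    try:
        _store[_key(kb_id, phase)] = ("1", time.time() + PHASE_TTL)
    except Exception:
        pass                        # markers are an optimization, not a gate

def has_phase_marker(kb_id, phase):
    try:
        entry = _store.get(_key(kb_id, phase))
        return entry is not None and entry[1] > time.time()
    except Exception:
        return False                # on Redis failure, just redo the phase

def clear_phase_markers(kb_id):
    try:
        prefix = f"graphrag:phase:{kb_id}:"
        for k in [k for k in _store if k.startswith(prefix)]:
            del _store[k]
    except Exception:
        pass

set_phase_marker("kb-1", "resolution_done")   # after resolution completes
print(has_phase_marker("kb-1", "resolution_done"))  # True
```

A resumed run checks `has_phase_marker` on entry and skips the phase if it returns True; every failure path degrades to "redo the phase", never to blocking the run.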

4. Idempotent community detection

extract_community previously did delete-then-insert on community_report rows; a crash mid-insert left the dataset with no reports. Now report IDs are derived deterministically from (kb_id, community.title), the existing report IDs are snapshotted before insert, new rows are written, then only stale rows are pruned. A failure at any step leaves either the prior or the new report set intact -- never a partial mix.
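The snapshot-then-prune idea can be sketched as follows (illustrative names; the real code lives in rag/graphrag/general/index.py, and a dict stands in for the doc store):

```python
import hashlib

def report_id(kb_id, title):
    # Deterministic ID from (kb_id, community title): re-running a build
    # targets the same rows instead of minting random IDs.
    return hashlib.sha256(f"{kb_id}:{title}".encode("utf-8")).hexdigest()

def persist_reports(store, kb_id, reports):
    existing = set(store)                    # 1. snapshot prior report IDs
    new_ids = set()
    for title, body in reports:              # 2. upsert the new rows
        rid = report_id(kb_id, title)
        store[rid] = body
        new_ids.add(rid)
    for stale in existing - new_ids:         # 3. prune only stale rows, last
        del store[stale]

store = {}
persist_reports(store, "kb-1", [("Community A", "v1"), ("Community B", "v1")])
persist_reports(store, "kb-1", [("Community A", "v2")])
# "Community B" is pruned, "Community A" is updated in place
```

One caveat the review raises below: an ID derived from the title alone collides if the LLM names two distinct communities identically, so a real implementation may want a secondary uniqueness factor.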

5. Tunable doc-store insert pipeline

The GraphRAG insert loop in rag/graphrag/utils.py and the community_report insert in rag/graphrag/general/index.py were both hardcoded to es_bulk_size = 4 and ran strictly sequentially. On a real KB, a 1077-chunk insert took ~21 minutes -- pure round-trip overhead.

  • New insert_chunks_bounded() helper in rag/graphrag/utils.py batches inserts via a bounded asyncio.Semaphore. Same retry / timeout semantics as the prior loop.
  • Defaults: 64 docs per batch, 4 batches in flight (matches the regular ingest pipeline in document_service.py). Tunable per-deployment via GRAPHRAG_INSERT_BULK_SIZE and GRAPHRAG_INSERT_CONCURRENCY.
  • Both set_graph and extract_community now use the helper.

This dropped the same 1077-chunk insert from minutes to seconds in local testing without measurable extra pressure on Infinity (total in-flight docs ≤ BULK_SIZE × CONCURRENCY = 256 by default).
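The bounded pipeline can be sketched like this (a hedged illustration of the insert_chunks_bounded() idea; the real helper's signature and retry/timeout handling live in rag/graphrag/utils.py, and fake_insert stands in for the doc-store client):

```python
import asyncio

BULK_SIZE = 64      # docs per batch (GRAPHRAG_INSERT_BULK_SIZE)
CONCURRENCY = 4     # batches in flight (GRAPHRAG_INSERT_CONCURRENCY)

async def insert_chunks_bounded(chunks, insert_batch,
                                bulk_size=BULK_SIZE, concurrency=CONCURRENCY):
    # Slice into batches, then cap how many are awaited at once.
    sem = asyncio.Semaphore(concurrency)
    batches = [chunks[i:i + bulk_size] for i in range(0, len(chunks), bulk_size)]

    async def run(batch):
        async with sem:              # in-flight docs <= bulk_size * concurrency
            await insert_batch(batch)

    await asyncio.gather(*(run(b) for b in batches))

async def demo():
    inserted = []
    async def fake_insert(batch):    # stand-in for the doc-store round trip
        await asyncio.sleep(0)
        inserted.extend(batch)
    await insert_chunks_bounded(list(range(1077)), fake_insert)
    return inserted

inserted = asyncio.run(demo())       # all 1077 chunks, 17 batches of <= 64
```

With the defaults above, the semaphore bounds total in-flight docs at 64 × 4 = 256, matching the pressure ceiling stated earlier.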

Tests

  • test/unit_test/rag/graphrag/test_merge_graph_nodes.py (3 tests): dense neighbourhood merge, neighbour-snapshot regression, concurrent serialized merges.
  • test/unit_test/rag/graphrag/test_phase_markers.py (4 tests): set/has round-trip, kb-scoped clear, no-op on empty input, graceful Redis failure.
  • test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py: new test_delete_index_wipe_flag_unit covers wipe=false for both GraphRAG and raptor on the new REST route, and confirms the default still wipes and clears phase markers.

Compatibility

  • Backward compatible: tasks queued before this change behave identically (default wipe=true, no markers expected).
  • No schema/migration changes; all new state lives in Redis.
  • New optional REST query param wipe on DELETE /v1/datasets/<id>/<index_type>.
  • New optional env vars GRAPHRAG_INSERT_BULK_SIZE and GRAPHRAG_INSERT_CONCURRENCY; defaults preserve safe behaviour.

Example of resume

Screenshot below shows a test resuming knowledge graph generation after applying the concurrency fix and re-deploying.


Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

GraphRAG: fix merge concurrency bug and add resume-from-checkpoint support

This PR addresses four related GraphRAG reliability issues that together
allow long-running GraphRAG tasks (10+ hours of LLM extraction) to be
resumed after a crash or pause without re-doing completed work.

## 1. Fix concurrent merge crash (priority bug)

Long GraphRAG runs would crash near the end with:
```
RuntimeError: dictionary keys changed during iteration
```
in `Extractor._merge_graph_nodes` during entity resolution. Two changes:

- `rag/graphrag/general/extractor.py`: snapshot `graph.neighbors(node1)`
  via `list(...)` before iterating, so concurrent `add_edge` /
  `remove_node` mutations on the shared `nx.Graph` cannot invalidate the
  iterator.
- `rag/graphrag/entity_resolution.py`: serialize the merge step with a
  dedicated `asyncio.Semaphore(1)`. `nx.Graph` is not thread-safe and
  concurrent merges on overlapping neighbourhoods can produce incorrect
  results even with the snapshot fix.

## 2. Don't wipe partial graph on pause

Previously the pause / cancel UI path called
`settings.docStoreConn.delete({"knowledge_graph_kwd": [...]}, ...)`,
destroying every subgraph, entity, relation, and graph row. Re-triggering
then started GraphRAG from scratch even though PR infiniflow#14096 had already added
`load_subgraph_from_store`.

- `api/apps/kb_app.py`: `unbind_task` now accepts `wipe` (default
  `true` to preserve historical behaviour). When `wipe=false` the task is
  cancelled but doc-store rows are kept.
- `web/...`: the GraphRAG pause action now passes `wipe=false`; raptor is
  unchanged.

## 3. Phase-completion markers (`rag/graphrag/phase_markers.py`)

A small Redis-backed marker layer at
`graphrag:phase:{kb_id}:{resolution_done|community_done}` (7-day TTL).
`run_graphrag_for_kb` consults the markers on entry and skips phases that
already completed in a prior run; markers are cleared automatically when new
docs are merged into the graph (which invalidates prior resolution and
community results) and when `unbind_task` wipes the graph. Redis failures
never block a run -- markers are an optimization, not a gate.

## 4. Idempotent community detection

`extract_community` previously did `delete-then-insert` on
`community_report` rows; a crash mid-insert left the dataset with no
reports. Now report IDs are derived deterministically from
`(kb_id, community.title)`, the existing report IDs are snapshotted before
insert, new rows are written, then only stale rows are pruned. A failure at
any step leaves either the prior or the new report set intact -- never a
partial mix.

## Tests

- `test/unit_test/rag/graphrag/test_merge_graph_nodes.py` (3 tests):
  dense neighbourhood merge, neighbour-snapshot regression, concurrent
  serialized merges.
- `test/unit_test/rag/graphrag/test_phase_markers.py` (4 tests): set/has
  round-trip, kb-scoped clear, no-op on empty input, graceful Redis failure.
- `test/testcases/test_web_api/test_kb_app/test_kb_routes_unit.py`:
  `test_unbind_task_wipe_flag` covers `wipe=false` for both GraphRAG and
  raptor and confirms the default still wipes.
@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Apr 20, 2026
Contributor

coderabbitai Bot commented Apr 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Fixes a concurrency bug in GraphRAG entity resolution where concurrent graph mutations caused "dictionary keys changed during iteration" errors. Introduces phase-marker-based checkpointing to enable resume capability, adds deterministic persistence for community reports, and provides control over partial-progress preservation during task pause operations.

Changes

Cohort / File(s) / Summary

  • Graph Concurrency Fix (rag/graphrag/general/extractor.py, test/unit_test/rag/graphrag/test_merge_graph_nodes.py): fixes the concurrent mutation bug by materializing list(graph.neighbors(node1)) before iteration and tracking newly redirected neighbors to prevent iteration-time adjacency errors. Includes regression tests validating merge behavior under concurrent/overlapping operations.
  • Phase-Marker Checkpointing System (rag/graphrag/phase_markers.py, test/unit_test/rag/graphrag/test_phase_markers.py): new Redis-backed module enabling KB-scoped phase-completion markers (resolution_done, community_done) with helper functions has_phase_marker, set_phase_marker, clear_phase_markers (all fail-safe, non-blocking). Includes unit tests covering round-trip semantics, cleanup, and Redis failure injection.
  • KB Processing Control Flow (rag/graphrag/general/index.py): refactors build_one to use phase markers for conditional execution: checks/skips already-completed phases on resume, clears markers when new content merges, sets markers after phase completion. Changes community_report persistence from random to deterministic chunk IDs and implements a snapshot-prune strategy to replace the delete-all behavior.
  • Knowledge Graph Cleanup (api/apps/services/dataset_api_service.py): extends delete_knowledge_graph to delete community_report documents alongside existing graph types and invokes clear_phase_markers to remove stale phase-completion markers.
  • Pause Behavior & Wipe Control (web/src/pages/dataset/dataset/generate-button/hook.ts, web/src/services/knowledge-service.ts): adds an optional wipe?: boolean parameter to deletePipelineTask and the pause flow; the KnowledgeGraph type passes wipe: false to preserve partial progress, other types pass wipe: undefined to retain the default behavior.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Client/UI
    participant BuildOne as build_one<br/>(GraphRAG)
    participant Redis as Redis<br/>(Phase Markers)
    participant Extractor as Extractor<br/>(Merge/Extract)
    participant KB as Knowledge Base<br/>(Documents)

    Client->>BuildOne: build_one(kb_id, ...)
    BuildOne->>Redis: has_phase_marker(kb_id, PHASE_RESOLUTION)
    alt Resolution pending
        Redis-->>BuildOne: False
        BuildOne->>Extractor: resolve_entities()
        Extractor->>Extractor: _merge_graph_nodes(snapshot neighbors)
        Extractor->>KB: update merged nodes/edges
        BuildOne->>Redis: set_phase_marker(kb_id, PHASE_RESOLUTION)
        Redis-->>BuildOne: True
    else Resolution complete
        Redis-->>BuildOne: True
        BuildOne->>BuildOne: log "already completed"
    end

    BuildOne->>Redis: has_phase_marker(kb_id, PHASE_COMMUNITY)
    alt Community pending
        Redis-->>BuildOne: False
        BuildOne->>Extractor: extract_community()
        Extractor->>KB: insert community_report<br/>(deterministic IDs)
        Extractor->>KB: prune stale reports
        BuildOne->>Redis: set_phase_marker(kb_id, PHASE_COMMUNITY)
        Redis-->>BuildOne: True
    else Community complete
        Redis-->>BuildOne: True
        BuildOne->>BuildOne: log "already completed"
    end

    BuildOne-->>Client: completion/resume result
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

💞 feature, ☯️ refactor, 🌈 python

Suggested reviewers

  • wangq8
  • yuzhichang
  • 6ba3i

Poem

🐰 A rabbit's ode to concurrency fixed:

Snapshot the neighbors, they'd vanish mid-flight,
Phase markers persistent make checkpoints just right,
Community reports now deterministic and sure,
Dictionary woes? Now we pause them with care!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage -- ⚠️ Warning: docstring coverage is 30.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check -- ✅ Passed: the title accurately summarizes the main changes: fixing merge concurrency issues and adding resume-from-checkpoint functionality for GraphRAG.
  • Linked Issues check -- ✅ Passed: the PR fully addresses issue #14236: fixes the concurrent merge crash via neighbor snapshotting and serialization, preserves the partial graph on pause, adds phase markers for resumption, and implements idempotent community detection.
  • Out of Scope Changes check -- ✅ Passed: all changes directly support the stated objectives: merge concurrency fix, pause/resume preservation, phase markers, and idempotent community detection, with no unrelated modifications detected.
  • Description check -- ✅ Passed: the PR description comprehensively documents all changes, includes a detailed problem statement, and provides context with numbered sections covering each major fix.


@dosubot dosubot Bot added 🐞 bug Something isn't working, pull request that fix bug. 🧪 test Pull requests that update test cases. labels Apr 20, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
rag/graphrag/general/extractor.py (1)

313-340: ⚠️ Potential issue | 🔴 Critical

Refresh node0 adjacency after redirecting a new neighbor.

Line 313 snapshots node0_neighbors once and never updates it. If two merged nodes both connect to the same external neighbor that nodes[0] did not originally have, the second pass falls into the else branch on Lines 339-340 and overwrites the first redirected edge instead of merging weight/description/source IDs. That silently drops relationship data during multi-node merges.

🐛 Minimal fix
```diff
         node0_attrs = graph.nodes[nodes[0]]
         node0_neighbors = set(graph.neighbors(nodes[0]))
         for node1 in nodes[1:]:
@@
                     else:
                         graph.add_edge(nodes[0], neighbor, **edge1_attrs)
+                        node0_neighbors.add(neighbor)
             graph.remove_node(node1)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag/graphrag/general/extractor.py` around lines 313 - 340, The snapshot
node0_neighbors is taken once and never updated, so when redirecting neighbors
from node1 to nodes[0] the first redirected edge can be overwritten by a later
redirect because nodes[0] is not seen as having that neighbor; to fix, after
creating or merging an edge toward a neighbor update node0_neighbors (e.g.,
node0_neighbors.add(neighbor)) so subsequent iterations treat it as an existing
neighbor and perform the edge-merge path (adjust weight/description/source_id)
rather than the else branch that blindly adds/overwrites via graph.add_edge;
locate the loop handling graph.neighbors(node1) and the branches that call
graph.add_edge and self._handle_entity_relation_summary to apply this change.
rag/graphrag/general/index.py (1)

682-706: ⚠️ Potential issue | 🟠 Major

Community report IDs can collide when different communities receive identical titles from the LLM.

The ID is deterministically derived from kb_id and stru['title'] (line 688). Since the title comes from the LLM with no uniqueness constraint, two different communities within the same KB can receive the same title and thus the same ID. When chunks with identical IDs are inserted into the database, one silently overwrites the other with no collision detection or warning.

This is a data loss issue: consider two overlapping communities that the LLM names identically—only one community report will persist in the index.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag/graphrag/general/index.py` around lines 682 - 706, The deterministic ID
derivation using chunk_payload_for_id with only kb_id and stru['title'] can
cause collisions when the LLM emits identical titles; update the payload used by
chunk_id to include a secondary uniqueness factor (e.g., a fingerprint of the
community membership or source ids) so different communities with the same title
produce different ids. Concretely, modify chunk_payload_for_id (used by
chunk_id) to include either a unique community identifier if present (e.g.,
stru['community_id']) or a stable hash/fingerprint of the sorted list of source
ids/doc_ids (or sorted members) so chunk creation (chunk and chunks.append)
yields collision-resistant ids while preserving deterministic behavior.
🧹 Nitpick comments (2)
api/apps/kb_app.py (1)

820-843: Log the new wipe vs preserve path.

This route now has two very different outcomes, but nothing records which branch ran or whether GraphRAG artifacts/markers were cleared. That will be painful to debug once pause/resume issues hit production.

🪵 Small observability patch
```diff
     wipe_arg = (request.args.get("wipe", "true") or "true").strip().lower()
     wipe = wipe_arg not in ("false", "0", "no", "off")
+    logging.info("Unbinding %s task for kb %s (wipe=%s)", pipeline_task_type, kb_id, wipe)
@@
             if wipe:
                 settings.docStoreConn.delete({"knowledge_graph_kwd": ["graph", "subgraph", "entity", "relation", "community_report"]}, search.index_name(kb.tenant_id), kb_id)
                 # Wiping the graph invalidates any phase-completion markers
                 # for resolution / community detection.
                 from rag.graphrag.phase_markers import clear_phase_markers
                 clear_phase_markers(kb_id)
+                logging.info("Cleared GraphRAG artifacts and phase markers for kb %s", kb_id)
```

As per coding guidelines, **/*.py: Add logging for new flows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/apps/kb_app.py` around lines 820 - 843, Add logging to record whether the
handler took the wipe (delete) or preserve path and what side effects occurred:
log when cancel_task(...) is invoked (include task_id and kb_task_id_field), log
the value of wipe, and if wipe is true log that
settings.docStoreConn.delete(...) was executed (include kb_id and kb.tenant_id)
and that clear_phase_markers(kb_id) was called; place these logs in the
PipelineTaskType.GRAPH_RAG branch immediately before/after calling cancel_task,
before calling settings.docStoreConn.delete, and after calling
clear_phase_markers; use the module logger/processLogger already used in the
file so messages include context for debugging.
test/testcases/test_web_api/test_kb_app/test_kb_routes_unit.py (1)

781-850: Assert the phase-marker cleanup here too.

This test only checks docStoreConn.delete. If clear_phase_markers() stops being called on the default GraphRAG wipe path, the test still passes even though resume behavior breaks. Stubbing that import and asserting 0 calls for wipe=false and 1 call for the default wipe path would cover the new behavior and keep the test isolated.

🧪 Targeted test extension
```diff
 def test_unbind_task_wipe_flag(monkeypatch):
@@
     cancelled = []
     deleted = []
     update_payloads = []
+    cleared_phase_markers = []
     monkeypatch.setattr(module.REDIS_CONN, "set", lambda key, value: cancelled.append((key, value)))
     monkeypatch.setattr(module.search, "index_name", lambda _tenant_id: "idx")
     monkeypatch.setattr(module.settings, "docStoreConn", SimpleNamespace(delete=lambda *args, **_kwargs: deleted.append(args)))
+    phase_markers_module = ModuleType("rag.graphrag.phase_markers")
+    phase_markers_module.clear_phase_markers = lambda dataset_id: cleared_phase_markers.append(dataset_id)
+    monkeypatch.setitem(sys.modules, "rag.graphrag.phase_markers", phase_markers_module)
@@
     assert res["code"] == module.RetCode.SUCCESS, res
     assert ("graph-task-cancel", "x") in cancelled, cancelled
     assert deleted == [], f"docStore.delete must not be called when wipe=false: {deleted}"
+    assert cleared_phase_markers == [], cleared_phase_markers
@@
     res = route()
     assert res["code"] == module.RetCode.SUCCESS, res
     assert len(deleted) == 1, f"default wipe must call docStore.delete once: {deleted}"
+    assert cleared_phase_markers == ["kb-1"], cleared_phase_markers
```

Based on learnings, Applies to tests/**/*.py : Add/adjust tests for behavior changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/testcases/test_web_api/test_kb_app/test_kb_routes_unit.py` around lines
781 - 850, The test needs to assert calls to clear_phase_markers invoked by
delete_kb_task: stub module.clear_phase_markers (e.g. replace with a function
that appends to a phase_calls list) before calling route(), then after each
scenario assert phase_calls length is 0 for wipe="false" and wipe="0" (graph and
raptor cases) and is 1 for the default GraphRAG case; reference the
delete_kb_task route under test and the clear_phase_markers import on the module
to locate where to monkeypatch and add assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@rag/graphrag/entity_resolution.py`:
- Around line 162-170: The merge semaphore currently wraps the entire awaited
call to _merge_graph_nodes (in limited_merge_nodes), which blocks during
network-bound work like _handle_entity_relation_summary; instead restrict the
critical section to only the graph-mutating steps: refactor _merge_graph_nodes
into a prepare/commit split (e.g., _merge_graph_nodes_prepare that does
summaries/IO and returns a concrete merge plan, and _merge_graph_nodes_commit
that applies add_edge/remove_node to the nx.Graph), then in limited_merge_nodes
await the prepare phase without the semaphore and only async with
merge_semaphore while calling the commit phase to perform the actual graph
reads/writes; alternatively compute a complete merge plan before acquiring
merge_semaphore and apply it inside the semaphore.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 19b7a4f0-e738-4d2a-95ae-58008a95b8bb

📥 Commits

Reviewing files that changed from the base of the PR and between f269ee9 and afca146.

📒 Files selected for processing (12)
  • api/apps/kb_app.py
  • api/apps/services/dataset_api_service.py
  • rag/graphrag/entity_resolution.py
  • rag/graphrag/general/extractor.py
  • rag/graphrag/general/index.py
  • rag/graphrag/phase_markers.py
  • test/testcases/test_web_api/test_kb_app/test_kb_routes_unit.py
  • test/unit_test/rag/graphrag/test_merge_graph_nodes.py
  • test/unit_test/rag/graphrag/test_phase_markers.py
  • web/src/pages/dataset/dataset/generate-button/hook.ts
  • web/src/services/knowledge-service.ts
  • web/src/utils/api.ts

…e-marker cleanup

* extractor: track each redirected neighbor in node0_neighbors so a later
  merged node sharing the same external neighbor takes the edge-merge
  branch instead of overwriting via add_edge. Prevents silent loss of
  weight/description/source_id when multiple nodes are merged into one.
* kb_app.unbind_task: log kb_id, task type, wipe flag, and whether
  GraphRAG artefacts/phase markers were cleared, per repo logging guideline.
* test_unbind_task_wipe_flag: stub rag.graphrag.phase_markers and assert
  clear_phase_markers is called exactly once on default wipe and not at
  all when wipe=false (covers both GraphRAG and raptor paths).

Addresses CodeRabbit review on infiniflow#14238.
@prpercival
Author

Thanks @coderabbitai. Pushed 33004a8 addressing the actionable items:

1. _merge_graph_nodes neighbor tracking (Critical) - Fixed. When the else branch redirects an edge from node1 to nodes[0], neighbor is now added to node0_neighbors so a subsequent node1 merged in the same call takes the edge-merge branch instead of overwriting. (rag/graphrag/general/extractor.py)

2. unbind_task observability - Added logging.info for the entry point (kb_id, task type, wipe flag) and for the GraphRAG wipe path that clears doc-store rows and phase markers. (api/apps/kb_app.py)

3. test_unbind_task_wipe_flag phase-marker assertion - Stubbed rag.graphrag.phase_markers via sys.modules and added assertions: zero calls on wipe=false for both GraphRAG and raptor, exactly one call on the default GraphRAG wipe path. (test/testcases/test_web_api/test_kb_app/test_kb_routes_unit.py)

Not changed (intentionally):

  • Community report ID collision on identical LLM titles - Real edge case but not a regression: the prior delete-then-insert was also keyed on title. Picking the right tiebreaker (community members fingerprint? source_id set?) is a design decision I'd rather hand to a maintainer than guess at in this PR. Happy to split out a follow-up if you want a specific fingerprint.
  • Splitting the merge semaphore into prepare/commit phases - The current Semaphore(1) around the whole merge is conservative but correct. Refactoring _merge_graph_nodes into a plan/apply split is a sensible optimization but outside the scope of this fix; _handle_entity_relation_summary LLM calls during merge are rare enough that I'd rather land the correctness fix first and revisit throughput separately.

Contributor

coderabbitai Bot commented Apr 20, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Check the changes in commit 33004a87a
git show 33004a87a --stat
echo "---"
# Show the specific changes to the three files mentioned
git show 33004a87a -- rag/graphrag/general/extractor.py
echo "==="
git show 33004a87a -- api/apps/kb_app.py
echo "==="
git show 33004a87a -- test/testcases/test_web_api/test_kb_app/test_kb_routes_unit.py
```

Length of output: 7156


@prpercival All three fixes check out cleanly in commit 33004a87a.

_merge_graph_nodes neighbor tracking — the node0_neighbors.add(neighbor) placement in the else branch is exactly right. It keeps the in-flight set consistent so any subsequent node1 that shares that neighbor goes through the weight/description/source_id merge path rather than a silent add_edge overwrite. The commit message captures the why well too.

unbind_task observability — both log lines are well-scoped: one at the entry gate (kb, type, wipe flag) and one at the outcome (artefacts + markers cleared). No noise for the wipe=false path where nothing is deleted, which is the right balance.

test_unbind_task_wipe_flag phase-marker assertions — the two new assertions are crisp. cleared_phase_markers == [] for both wipe=false paths and == ["kb-1"] for the default wipe give precise negative and positive coverage without over-specifying call counts beyond what the implementation guarantees.

The deferrals are reasonable:

  • Community ID collision — agreed it's a pre-existing design question, not a regression introduced here. A fingerprint based on sorted community member IDs would be a natural tiebreaker if a maintainer wants to pursue it in a follow-up.
  • Semaphore split — Semaphore(1) around the full merge is conservative-but-correct and the right call to land first. The plan/apply refactor can be a performance follow-up once this is stable in production.

The PR looks good to me. 🐇✨

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/testcases/test_web_api/test_kb_app/test_kb_routes_unit.py`:
- Around line 827-852: Initialize the missing tracker list cleared_phase_markers
before it's asserted and ensure the mocked phase-marker-clearing logic appends
to it; specifically, add cleared_phase_markers = [] prior to setting up the
request/mocks used by route() (the same place where deleted is defined) and
update the mock/stub that simulates clearing phase markers to append the kb_id
(e.g., "kb-1") to cleared_phase_markers when invoked so the later assertions
against cleared_phase_markers pass.
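A minimal sketch of what this comment asks for might look like the following. The stubbed module path matches the PR, but the helper name and the surrounding test plumbing are illustrative only.

```python
import sys
import types

def install_phase_marker_stub():
    """Register a stub phase_markers module; return its call tracker."""
    cleared_phase_markers = []   # the tracker the assertions rely on
    stub = types.ModuleType("rag.graphrag.phase_markers")
    # The stub records each clear instead of touching any real store.
    stub.clear_phase_markers = lambda kb_id: cleared_phase_markers.append(kb_id)
    sys.modules["rag.graphrag.phase_markers"] = stub
    return cleared_phase_markers

# In the test: install the stub before route() runs, then assert the
# tracker is [] on wipe=false and ["kb-1"] on the default wipe path.
cleared_phase_markers = install_phase_marker_stub()
sys.modules["rag.graphrag.phase_markers"].clear_phase_markers("kb-1")
```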

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c3297e4e-983b-47bd-97c5-4b575d42063a

📥 Commits

Reviewing files that changed from the base of the PR and between afca146 and 33004a8.

📒 Files selected for processing (3)
  • api/apps/kb_app.py
  • rag/graphrag/general/extractor.py
  • test/testcases/test_web_api/test_kb_app/test_kb_routes_unit.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • rag/graphrag/general/extractor.py
  • api/apps/kb_app.py

@yingfeng yingfeng requested a review from wangq8 April 21, 2026 04:32
@wangq8 wangq8 added the ci Continue Integration label Apr 27, 2026
…sert tuning

Main was merged in and PR infiniflow#14394 deleted api/apps/kb_app.py, taking
unbind_task with it.  The pause-without-wipe behaviour is restored on
the new REST surface (DELETE /v1/datasets/<id>/<index_type>):

* api/apps/services/dataset_api_service.py: delete_index gains a
  wipe: bool = True parameter; when False the docStoreConn.delete call
  and clear_phase_markers() are skipped.  Default preserves existing
  behaviour.  Logs which branch ran.
* api/apps/restful_apis/dataset_api.py: parses ?wipe=false|0|no|off
  from request.args and forwards it to the service.
* web/src/utils/api.ts + web/src/services/knowledge-service.ts:
  unbindPipelineTask appends ?wipe=false when explicitly false.  The
  GraphRAG pause path in generate-button/hook.ts already passes
  wipe: false for KnowledgeGraph; raptor and the explicit Delete
  action keep the historical wipe-true behaviour.
* test/testcases/test_web_api/test_dataset_management/
  test_dataset_sdk_routes_unit.py: new
  test_delete_index_wipe_flag_unit covers wipe=false (graph + raptor)
  and the default wipe path, asserting clear_phase_markers is invoked
  exactly when the docStore is wiped.

GraphRAG insert pipeline is also tunable now (was hardcoded
es_bulk_size=4, sequential):

* rag/graphrag/utils.py: new insert_chunks_bounded() helper using a
  bounded asyncio.Semaphore.  Defaults to 64 docs/batch and 4 batches
  in flight; override with GRAPHRAG_INSERT_BULK_SIZE and
  GRAPHRAG_INSERT_CONCURRENCY.  Same retry/timeout semantics as the
  prior loop.  set_graph now uses it.
* rag/graphrag/general/index.py: extract_community's community_report
  insert also uses insert_chunks_bounded.
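The helper this commit describes can be sketched roughly as follows. This is a simplified illustration that keeps the env-var defaults but omits the retry/timeout semantics of the real insert_chunks_bounded in rag/graphrag/utils.py; the fake insert callback stands in for the doc-store call.

```python
import asyncio
import os

BULK_SIZE = int(os.environ.get("GRAPHRAG_INSERT_BULK_SIZE", "64"))
CONCURRENCY = int(os.environ.get("GRAPHRAG_INSERT_CONCURRENCY", "4"))

async def insert_chunks_bounded(chunks, insert_batch):
    """Insert `chunks` in BULK_SIZE batches, at most CONCURRENCY in flight."""
    sem = asyncio.Semaphore(CONCURRENCY)

    async def run_one(batch):
        async with sem:
            await insert_batch(batch)

    batches = [chunks[i:i + BULK_SIZE] for i in range(0, len(chunks), BULK_SIZE)]
    await asyncio.gather(*(run_one(b) for b in batches))

async def _demo():
    sizes = []
    async def fake_insert(batch):
        await asyncio.sleep(0)       # stands in for the real doc-store call
        sizes.append(len(batch))
    await insert_chunks_bounded(list(range(130)), fake_insert)
    return sizes

batch_sizes = asyncio.run(_demo())   # 130 chunks split into 64 + 64 + 2
```

Compared with the prior hardcoded es_bulk_size=4 sequential loop, the bounded gather keeps several batches in flight without letting a huge graph fan out into unbounded concurrent inserts.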
@dosubot dosubot Bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Apr 28, 2026
@prpercival
Author

Pushed 1889930 and updated the PR description.

After merging main into this branch I noticed #14394 deleted api/apps/kb_app.py, which removed the unbind_task route this PR originally targeted. The pause-without-wipe behaviour is now restored on the new REST surface DELETE /v1/datasets/<id>/<index_type>:

  • api/apps/services/dataset_api_service.py: delete_index gains wipe: bool = True; when False neither the doc-store rows nor the GraphRAG phase markers are touched.
  • api/apps/restful_apis/dataset_api.py: parses ?wipe=false|0|no|off and forwards it.
  • Frontend (api.ts + knowledge-service.ts): appends ?wipe=false when explicitly false. The pause path in generate-button/hook.ts already passes wipe: false for KnowledgeGraph.
  • Test: test_delete_index_wipe_flag_unit (new, in test_dataset_sdk_routes_unit.py) replaces the deleted test_unbind_task_wipe_flag.
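The query-string handling in the second bullet can be sketched as a small predicate. This is an assumption about the shape of the parsing, not the actual dataset_api.py code; `args` stands in for Flask's request.args.

```python
def parse_wipe_flag(args: dict) -> bool:
    """?wipe=false|0|no|off disables the wipe; anything else keeps the default."""
    return str(args.get("wipe", "true")).strip().lower() not in {"false", "0", "no", "off"}
```

Accepting several falsy spellings keeps the flag forgiving for hand-written requests while the default (wipe=true) preserves the pre-existing delete behaviour.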

Also added in this commit (called out as section 5 in the description): the GraphRAG insert pipeline is now tunable. The hardcoded es_bulk_size = 4 in both rag/graphrag/utils.py::set_graph and rag/graphrag/general/index.py::extract_community is replaced by a shared insert_chunks_bounded() helper using a bounded asyncio.Semaphore. Defaults: 64 docs/batch, 4 batches in flight; tunable via GRAPHRAG_INSERT_BULK_SIZE / GRAPHRAG_INSERT_CONCURRENCY. This dropped the same 1077-chunk insert from minutes to seconds locally.

All 7 GraphRAG unit tests + the new dataset-API test pass under uv run pytest.

@codecov

codecov Bot commented Apr 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.16%. Comparing base (12af73f) to head (797fa53).

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #14238   +/-   ##
=======================================
  Coverage   94.16%   94.16%           
=======================================
  Files          10       10           
  Lines         703      703           
  Branches      112      112           
=======================================
  Hits          662      662           
  Misses         25       25           
  Partials       16       16           

☔ View full report in Codecov by Sentry.

@wangq8
Collaborator

wangq8 commented Apr 30, 2026

Hi @prpercival, the PR looks good to me, please merge main and resolve conflict so that we can merge.

@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label Apr 30, 2026
@wangq8 wangq8 self-requested a review April 30, 2026 07:53
…n getKbDetail

The new /api/v1/datasets/<id> endpoint returns chunk_count/document_count,
but legacy consumers (e.g. the GraphRAG/Raptor 'magic wand' enable check in
dataset/index.tsx) read chunk_num/doc_num. Without this normalization the wand
button stays disabled even when the dataset has chunks. Mirrors the existing
mapDocumentToLegacy treatment for documents.
…checkpoint

# Conflicts:
#	api/apps/services/dataset_api_service.py
#	web/src/utils/api.ts
@prpercival
Author

> Hi @prpercival, the PR looks good to me, please merge main and resolve conflict so that we can merge.

Just resolved, thanks for reviewing.

@wangq8 wangq8 merged commit e8f19aa into infiniflow:main May 6, 2026
2 checks passed

Labels

  • 🐞 bug Something isn't working, pull request that fix bug.
  • ci Continue Integration
  • lgtm This PR has been approved by a maintainer
  • size:XL This PR changes 500-999 lines, ignoring generated files.
  • 🧪 test Pull requests that update test cases.


Development

Successfully merging this pull request may close these issues.

[Bug]: Graph Generation Fails at Community Resolution due to Dictionary Concurrency Bug

2 participants