Skip to content

Fix: serialize GraphRAG entity resolution merges to avoid graph mutation races #14237

Merged
yingfeng merged 2 commits intoinfiniflow:mainfrom
spider-yamet:fix/graphrag-entity-resolution-merge-race
Apr 22, 2026
Merged

Fix: serialize GraphRAG entity resolution merges to avoid graph mutation races #14237
yingfeng merged 2 commits intoinfiniflow:mainfrom
spider-yamet:fix/graphrag-entity-resolution-merge-race

Conversation

@spider-yamet
Copy link
Copy Markdown
Contributor

@spider-yamet spider-yamet commented Apr 20, 2026

What problem does this PR solve?

This PR fixes the merge-phase crash reported in #14236 during GraphRAG entity resolution.

The issue happens after candidate pair resolution completes, when multiple merge coroutines mutate the same shared networkx graph concurrently. In _merge_graph_nodes, the code iterates over graph.neighbors(node1) and also awaits during edge/description merging. That allows another coroutine to modify the graph adjacency structure in between, which can trigger RuntimeError: dictionary keys changed during iteration and can also lead to unsafe shared-graph mutation.

This change keeps the PR scoped to that single issue by:

  • serializing merge-time graph mutations with a dedicated merge lock
  • snapshotting graph.neighbors(node1) with list(...) before iteration

Together, these changes prevent concurrent mutation of the shared graph during the merge phase and make the merge loop safe against live-view invalidation.

Fixes #14236

Type of change

  • Bug Fix (non-breaking change which fixes an issue)

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 20, 2026

📝 Walkthrough

Walkthrough

Two files were modified to fix concurrent graph mutation errors during entity resolution's merge phase. The entity resolution module now serializes merge operations using a dedicated asyncio lock, while the extractor module materializes a graph neighbors iterator into a list before iteration to ensure stable iteration semantics when the graph is modified.

Changes

Cohort / File(s) Summary
Concurrency Serialization
rag/graphrag/entity_resolution.py
Replaced asyncio.Semaphore(5) with dedicated merge_lock = asyncio.Lock() to serialize calls to _merge_graph_nodes, ensuring only one merge operation executes at a time. Reformatted task creation syntax.
Graph Iteration Safety
rag/graphrag/general/extractor.py
Materialized graph.neighbors(node1) iterator into a list before iteration loop, preventing "dictionary keys changed during iteration" errors when the graph is modified (edges removed/added, nodes deleted) during in-loop operations.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 The race to merge was swift but broke,
When graphs spoke to each other—what a joke!
A lock, a list, and all is mended,
The wild concurrent chaos—now suspended!
Graphs stay steady, no more fright,
The resolution dances through the night. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: serializing GraphRAG entity resolution merges to fix race conditions caused by concurrent graph mutations.
Linked Issues check ✅ Passed The PR implements the core fixes from issue #14236: serializes merge mutations with a dedicated lock and snapshots graph neighbors before iteration to prevent concurrent graph mutation races.
Out of Scope Changes check ✅ Passed All changes are scoped to fixing the merge-phase race condition: a dedicated merge lock in entity_resolution.py and neighbor snapshotting in general/extractor.py, with no unrelated modifications.
Description check ✅ Passed The PR description comprehensively covers the problem, root cause, solution approach, and references the related issue #14236 with explicit type classification.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@dosubot dosubot Bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Apr 20, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
rag/graphrag/general/extractor.py (1)

322-337: ⚠️ Potential issue | 🟠 Major

Update node0_neighbors after adding a new edge.

The list(...) snapshot fixes the iterator crash, but node0_neighbors is still stale after the first graph.add_edge(nodes[0], neighbor, **edge1_attrs). If two merged nodes both point to the same external neighbor that nodes[0] did not originally have, the later one will hit the else branch again and overwrite the existing edge attrs instead of aggregating them.

💡 Proposed fix
         for neighbor in list(graph.neighbors(node1)):
             change.removed_edges.add(get_from_to(node1, neighbor))
             if neighbor not in nodes_set:
                 edge1_attrs = graph.get_edge_data(node1, neighbor)
                 if neighbor in node0_neighbors:
                     # Merge two edges
                     change.added_updated_edges.add(get_from_to(nodes[0], neighbor))
                     edge0_attrs = graph.get_edge_data(nodes[0], neighbor)
                     edge0_attrs["weight"] += edge1_attrs["weight"]
                     edge0_attrs["description"] += f"{GRAPH_FIELD_SEP}{edge1_attrs['description']}"
                     for attr in ["keywords", "source_id"]:
                         edge0_attrs[attr] = sorted(set(edge0_attrs[attr] + edge1_attrs[attr]))
                     edge0_attrs["description"] = await self._handle_entity_relation_summary(f"({nodes[0]}, {neighbor})", edge0_attrs["description"], task_id=task_id)
                     graph.add_edge(nodes[0], neighbor, **edge0_attrs)
                 else:
                     graph.add_edge(nodes[0], neighbor, **edge1_attrs)
+                    node0_neighbors.add(neighbor)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag/graphrag/general/extractor.py` around lines 322 - 337, The loop creates
new edges to nodes[0] but never updates the node0_neighbors set, so when a later
neighbor also connects to the same external node you end up overwriting instead
of merging; after any graph.add_edge(nodes[0], neighbor, ...) (both in the merge
and in the else branch where edge1_attrs is copied) update node0_neighbors
(e.g., node0_neighbors.add(neighbor) or re-fetch neighbors via
set(graph.neighbors(nodes[0]))) so subsequent iterations see the newly-added
neighbor and go through the merge logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@rag/graphrag/general/extractor.py`:
- Around line 322-337: The loop creates new edges to nodes[0] but never updates
the node0_neighbors set, so when a later neighbor also connects to the same
external node you end up overwriting instead of merging; after any
graph.add_edge(nodes[0], neighbor, ...) (both in the merge and in the else
branch where edge1_attrs is copied) update node0_neighbors (e.g.,
node0_neighbors.add(neighbor) or re-fetch neighbors via
set(graph.neighbors(nodes[0]))) so subsequent iterations see the newly-added
neighbor and go through the merge logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c24eaddb-739a-4c04-89a2-b685f23fdf1b

📥 Commits

Reviewing files that changed from the base of the PR and between f269ee9 and 7ac36fc.

📒 Files selected for processing (2)
  • rag/graphrag/entity_resolution.py
  • rag/graphrag/general/extractor.py

@6ba3i 6ba3i self-requested a review April 21, 2026 11:27
Copy link
Copy Markdown
Contributor

@6ba3i 6ba3i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! I traced the path for #14236 and I agree with the main diagnosis: the crash happens in the merge phase, where multiple merge coroutines can mutate the same shared networkx graph at the same time.

Because of that, I think the fix here is the merge serialization in rag/graphrag/entity_resolution.py.

I would recommend keeping:

  • merge_lock = asyncio.Lock()
  • async with merge_lock: in limited_merge_nodes(...)

and dropping the list(graph.neighbors(node1)) change in rag/graphrag/general/extractor.py.

My reasoning is that once merge coroutines are serialized, that snapshot is no longer needed for the reported crash, so removing it keeps the patch smaller and more tightly scoped to the actual issue.

Suggested minimal diff:

diff --git a/rag/graphrag/entity_resolution.py b/rag/graphrag/entity_resolution.py
index 6c3c48aeb..d75898ae2 100644
--- a/rag/graphrag/entity_resolution.py
+++ b/rag/graphrag/entity_resolution.py
@@ -158,9 +158,10 @@ class EntityResolution(Extractor):
         change = GraphChange()
         connect_graph = nx.Graph()
         connect_graph.add_edges_from(resolution_result)
+        merge_lock = asyncio.Lock()
 
         async def limited_merge_nodes(graph, nodes, change):
-            async with semaphore:
+            async with merge_lock:
                 await self._merge_graph_nodes(graph, nodes, change, task_id)

Thanks again for your contribution !

@spider-yamet
Copy link
Copy Markdown
Contributor Author

@6ba3i Thanks for the review. I updated the fix to follow your suggestion and kept the patch minimal.

The current change keeps only the merge serialization in rag/graphrag/entity_resolution.py:

merge_lock = asyncio.Lock()
async with merge_lock in limited_merge_nodes(...)
I removed the list(graph.neighbors(node1)) change from rag/graphrag/general/extractor.py, so the PR now stays tightly scoped to the reported merge-phase concurrency crash.

I would appreciate your feedback. Thanks :)

Copy link
Copy Markdown
Contributor

@6ba3i 6ba3i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for the modifications ! seems to be correct but do you mind sharing some logs of pre and post fix ? following that i'll be ready to recommend a merge.
Thank you @spider-yamet

@spider-yamet
Copy link
Copy Markdown
Contributor Author

spider-yamet commented Apr 22, 2026

@6ba3i Thanks for the follow-up. I ran a focused local repro of the merge-phase race and got the following before/after behavior.
Seems working fine.

image

@spider-yamet spider-yamet requested a review from 6ba3i April 22, 2026 06:23
Copy link
Copy Markdown
Contributor

@6ba3i 6ba3i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! @yingfeng

@yingfeng yingfeng added the ci Continue Integration label Apr 22, 2026
@yingfeng yingfeng marked this pull request as draft April 22, 2026 06:50
@yingfeng yingfeng marked this pull request as ready for review April 22, 2026 06:50
@dosubot dosubot Bot added 🌈 python Pull requests that update Python code 🐞 bug Something isn't working, pull request that fix bug. labels Apr 22, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.97%. Comparing base (f269ee9) to head (75eadba).
⚠️ Report is 30 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14237      +/-   ##
==========================================
- Coverage   98.11%   95.97%   -2.15%     
==========================================
  Files          10       10              
  Lines         690      695       +5     
  Branches      108      111       +3     
==========================================
- Hits          677      667      -10     
- Misses          4       11       +7     
- Partials        9       17       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@yingfeng yingfeng merged commit 38e45a1 into infiniflow:main Apr 22, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

🐞 bug Something isn't working, pull request that fix bug. ci Continue Integration 🌈 python Pull requests that update Python code size:XS This PR changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Graph Generation Fails at Community Resolution due to Dictionary Concurrency Bug

3 participants