⚡ Bolt: Graph Entity Similarity Performance#483
Conversation
DESCRIPTION: Rewritten similar entities logic to use a SQL JOIN instead of an N+1 Query. IMPACT: Huge speedup, 0.75s to 0.005s. Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
📝 WalkthroughWalkthrough
ChangesGraphManager SQL Refactor and find_similar_entities N+1 Fix
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/ippoc/mnemosyne/graph/manager.py (2)
75-86:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
get_or_createhas a TOCTOU race for entity creation.Two concurrent requests can both miss the
SELECTand attemptINSERT; one will hit unique-name constraints (or duplicate if constraints drift). This should be made atomic (DB upsert / insert-on-conflict, then select id).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/ippoc/mnemosyne/graph/manager.py` around lines 75 - 86, The get_or_create function has a race condition where two concurrent requests can both execute the SELECT query, both find no existing entity, and then both attempt to INSERT, causing a conflict. Replace the separate SELECT followed by INSERT logic with a single atomic database operation using an upsert pattern (INSERT ... ON CONFLICT or equivalent database-specific syntax). This should insert the new Entity if it does not exist based on the unique name constraint, then select and return the id of the entity (whether newly created or pre-existing) in one atomic transaction to prevent concurrent duplicate creation attempts.
93-103:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftRelation insert path is also non-atomic (check-then-insert).
SELECT ... IF NOT EXISTS -> INSERTis race-prone and can create duplicate edges under concurrency unless enforced with a unique constraint + upsert semantics.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/ippoc/mnemosyne/graph/manager.py` around lines 93 - 103, The current implementation in the relation insertion logic uses a non-atomic check-then-insert pattern where a SELECT query determines if a Relation exists before calling session.add(rel) and session.commit(). This is vulnerable to race conditions under concurrent access. Replace the SELECT check and conditional insert with either a database-level upsert operation using INSERT ... ON CONFLICT (or equivalent database syntax) on a unique constraint covering source_id, target_id, and relation columns, or wrap the insert operation in a try-catch block to handle IntegrityError exceptions that would be raised by a unique constraint violation. Ensure the kg_relations table has a unique constraint defined on the combination of source_id, target_id, and relation if using the constraint approach.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/ippoc/mnemosyne/graph/manager.py`:
- Around line 61-74: The add_triple method doesn't initialize the database
schema before performing write operations, which causes failures when this
method is called first in a fresh environment. Add an await self.init_db() call
at the beginning of the add_triple method, right after the docstring and before
the async with self.Session() block, to ensure the database tables are created
before any write operations are attempted.
---
Outside diff comments:
In `@src/ippoc/mnemosyne/graph/manager.py`:
- Around line 75-86: The get_or_create function has a race condition where two
concurrent requests can both execute the SELECT query, both find no existing
entity, and then both attempt to INSERT, causing a conflict. Replace the
separate SELECT followed by INSERT logic with a single atomic database operation
using an upsert pattern (INSERT ... ON CONFLICT or equivalent database-specific
syntax). This should insert the new Entity if it does not exist based on the
unique name constraint, then select and return the id of the entity (whether
newly created or pre-existing) in one atomic transaction to prevent concurrent
duplicate creation attempts.
- Around line 93-103: The current implementation in the relation insertion logic
uses a non-atomic check-then-insert pattern where a SELECT query determines if a
Relation exists before calling session.add(rel) and session.commit(). This is
vulnerable to race conditions under concurrent access. Replace the SELECT check
and conditional insert with either a database-level upsert operation using
INSERT ... ON CONFLICT (or equivalent database syntax) on a unique constraint
covering source_id, target_id, and relation columns, or wrap the insert
operation in a try-catch block to handle IntegrityError exceptions that would be
raised by a unique constraint violation. Ensure the kg_relations table has a
unique constraint defined on the combination of source_id, target_id, and
relation if using the constraint approach.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e58f1d93-9674-434d-84e5-b34496b97b2e
📒 Files selected for processing (2)
.jules/bolt.mdsrc/ippoc/mnemosyne/graph/manager.py
| async def add_triple( | ||
| self, | ||
| source: str, | ||
| relation: str, | ||
| target: str, | ||
| source_type="Concept", | ||
| target_type="Concept", | ||
| ): | ||
| """ | ||
| Adds (Source) -> [Relation] -> (Target) to the graph. | ||
| Idempotent (get_or_create). | ||
| """ | ||
| async with self.Session() as session: | ||
| # Helper to get/create entity |
There was a problem hiding this comment.
Initialize schema before writes in add_triple.
add_triple can run before tables are created because it doesn’t call await self.init_db(), unlike the other public read/write methods. This can fail on first use in a fresh environment.
Suggested fix
async def add_triple(
@@
):
@@
- async with self.Session() as session:
+ await self.init_db()
+ async with self.Session() as session:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/ippoc/mnemosyne/graph/manager.py` around lines 61 - 74, The add_triple
method doesn't initialize the database schema before performing write
operations, which causes failures when this method is called first in a fresh
environment. Add an await self.init_db() call at the beginning of the add_triple
method, right after the docstring and before the async with self.Session()
block, to ensure the database tables are created before any write operations are
attempted.
💡 What: Rewrote the logic for finding similar entities to use a SQL JOIN with Common Table Expressions (CTEs) rather than iterating through entities with N+1 queries in Python.
🎯 Why: The original unoptimized code was performing individual queries for every entity, fetching all relationships to manually calculate Jaccard similarity in the app layer.
📊 Impact: The execution time dropped significantly.
🔬 Measurement: The local benchmark running against 200 entities decreased from ~0.75s to ~0.005s.
PR created automatically by Jules for task 14235096974391323292 started by @Theory903
Summary by CodeRabbit
Bug Fixes
Performance