Skip to content

⚡ Bolt: Graph Entity Similarity Performance#483

Open
Theory903 wants to merge 1 commit into
mainfrom
jules-14235096974391323292-cc823f58
Open

⚡ Bolt: Graph Entity Similarity Performance#483
Theory903 wants to merge 1 commit into
mainfrom
jules-14235096974391323292-cc823f58

Conversation

@Theory903

@Theory903 Theory903 commented Jun 21, 2026

Copy link
Copy Markdown
Owner

💡 What: Rewrote the logic for finding similar entities to use a SQL JOIN with Common Table Expressions (CTEs) rather than iterating through entities with N+1 queries in Python.
🎯 Why: The original unoptimized code was performing individual queries for every entity, fetching all relationships to manually calculate Jaccard similarity in the app layer.
📊 Impact: The execution time dropped significantly.
🔬 Measurement: The local benchmark running against 200 entities decreased from ~0.75s to ~0.005s.


PR created automatically by Jules for task 14235096974391323292 started by @Theory903

Summary by CodeRabbit

  • Bug Fixes

    • Improved error messaging when entity context retrieval fails for missing entities.
    • Enhanced metadata parsing robustness to handle malformed data gracefully.
  • Performance

    • Optimized entity similarity queries for faster computation.
    • Streamlined relationship pathfinding operations for improved efficiency.

DESCRIPTION: Rewritten similar entities logic to use a SQL JOIN instead of an N+1 Query.

IMPACT: Huge speedup, 0.75s to 0.005s.

Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
@google-labs-jules

Copy link
Copy Markdown
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@coderabbitai

coderabbitai Bot commented Jun 21, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

GraphManager in manager.py is refactored to use multiline parameterized SQL across all async methods. get_entity_context gains an early-return error dict for missing entities and guarded JSON parsing. find_similar_entities replaces N+1 per-entity relation checks with a single SQL JOIN. A new .jules/bolt.md log entry records the CTE optimization guideline.

Changes

GraphManager SQL Refactor and find_similar_entities N+1 Fix

Layer / File(s) Summary
Imports, ORM models, and init cleanup
src/ippoc/mnemosyne/graph/manager.py
Removes unused Tuple, Optional, DateTime, and relationship imports; switches to parenthesized SQLAlchemy imports; reformats Entity/Relation model definitions and GraphManager.__init__ without changing persisted fields.
add_triple and get_neighbors multiline SQL
src/ippoc/mnemosyne/graph/manager.py
Reformats add_triple's nested get_or_create helper and relation-exists check, and get_neighbors' entity-id lookup, to multiline parameterized session.execute calls; semantics are unchanged.
find_relationship_path and _find_paths_cte reformatting
src/ippoc/mnemosyne/graph/manager.py
Expands method signatures, parameterized SQL dicts, and row-parsing logic to multiline form; recursive CTE execution, cycle check, and path-object construction are functionally unchanged.
get_entity_context error handling and find_similar_entities SQL JOIN
src/ippoc/mnemosyne/graph/manager.py
get_entity_context returns an error dict immediately when the entity row is missing, and wraps metadata_ JSON parsing in try/except to yield {} on malformed input. find_similar_entities replaces N+1 per-entity relationship checks with a single SQL JOIN; result shape (entity, similarity, shared_relations) is preserved.
delete_entity multiline SQL reformatting
src/ippoc/mnemosyne/graph/manager.py
Reformats entity-id SELECT and incident-relations DELETE to multiline parameterized calls; splits logger.info string; return value and error handling are unchanged.
Jules bolt.md optimization log
.jules/bolt.md
Adds a new entry documenting the N+1-to-CTE optimization learning and action guideline for SQL batching over per-entity queries.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • Theory903/open-ippoc#34: Directly overlaps with this PR's find_similar_entities refactor — both replace N+1 per-entity checks with an aggregate/JOIN SQL query for similarity computation.
  • Theory903/open-ippoc#37: Both PRs modify GraphManager.find_similar_entities to switch similarity computation to a SQL CTE-based intersection approach instead of per-row Python-side relationship checks.
  • Theory903/open-ippoc#154: Both PRs modify the CTE-based similarity candidate selection logic in find_similar_entities, with the retrieved PR adding early HAVING pruning to the candidates CTE.

Poem

🐇 Hop, hop, no more loops in sight,
One JOIN to rule the SQL night!
N+1 queries? Gone away,
A single batch to seize the day.
The graph now hops with SQL might! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description provides clear context (What/Why/Impact/Measurement) but does not follow the mandatory IPPOC-FS template structure, missing all required sections including Intent Declaration, Canon Compliance, Scope Control, Contract Compliance, and Boundary Declaration. Fill out all mandatory sections of the IPPOC-FS template, including organ declaration, change type, compliance checklist, scope summary, and safety assertions.
Docstring Coverage ⚠️ Warning Docstring coverage is 70.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title refers to a real aspect of the changeset (graph entity similarity performance optimization), but it is overly broad and vague. The emoji and generic terms like 'Bolt' don't clearly convey what was specifically changed or optimized. Consider revising the title to be more specific, e.g., 'Optimize GraphManager.find_similar_entities with SQL CTEs' or 'Replace N+1 queries with CTE-based similarity computation'.
✅ Passed checks (2 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch jules-14235096974391323292-cc823f58

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/ippoc/mnemosyne/graph/manager.py (2)

75-86: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

get_or_create has a TOCTOU race for entity creation.

Two concurrent requests can both miss the SELECT and attempt INSERT; one will hit unique-name constraints (or duplicate if constraints drift). This should be made atomic (DB upsert / insert-on-conflict, then select id).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/ippoc/mnemosyne/graph/manager.py` around lines 75 - 86, The get_or_create
function has a race condition where two concurrent requests can both execute the
SELECT query, both find no existing entity, and then both attempt to INSERT,
causing a conflict. Replace the separate SELECT followed by INSERT logic with a
single atomic database operation using an upsert pattern (INSERT ... ON CONFLICT
or equivalent database-specific syntax). This should insert the new Entity if it
does not exist based on the unique name constraint, then select and return the
id of the entity (whether newly created or pre-existing) in one atomic
transaction to prevent concurrent duplicate creation attempts.

93-103: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Relation insert path is also non-atomic (check-then-insert).

SELECT ... IF NOT EXISTS -> INSERT is race-prone and can create duplicate edges under concurrency unless enforced with a unique constraint + upsert semantics.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/ippoc/mnemosyne/graph/manager.py` around lines 93 - 103, The current
implementation in the relation insertion logic uses a non-atomic
check-then-insert pattern where a SELECT query determines if a Relation exists
before calling session.add(rel) and session.commit(). This is vulnerable to race
conditions under concurrent access. Replace the SELECT check and conditional
insert with either a database-level upsert operation using INSERT ... ON
CONFLICT (or equivalent database syntax) on a unique constraint covering
source_id, target_id, and relation columns, or wrap the insert operation in a
try-catch block to handle IntegrityError exceptions that would be raised by a
unique constraint violation. Ensure the kg_relations table has a unique
constraint defined on the combination of source_id, target_id, and relation if
using the constraint approach.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/ippoc/mnemosyne/graph/manager.py`:
- Around line 61-74: The add_triple method doesn't initialize the database
schema before performing write operations, which causes failures when this
method is called first in a fresh environment. Add an await self.init_db() call
at the beginning of the add_triple method, right after the docstring and before
the async with self.Session() block, to ensure the database tables are created
before any write operations are attempted.

---

Outside diff comments:
In `@src/ippoc/mnemosyne/graph/manager.py`:
- Around line 75-86: The get_or_create function has a race condition where two
concurrent requests can both execute the SELECT query, both find no existing
entity, and then both attempt to INSERT, causing a conflict. Replace the
separate SELECT followed by INSERT logic with a single atomic database operation
using an upsert pattern (INSERT ... ON CONFLICT or equivalent database-specific
syntax). This should insert the new Entity if it does not exist based on the
unique name constraint, then select and return the id of the entity (whether
newly created or pre-existing) in one atomic transaction to prevent concurrent
duplicate creation attempts.
- Around line 93-103: The current implementation in the relation insertion logic
uses a non-atomic check-then-insert pattern where a SELECT query determines if a
Relation exists before calling session.add(rel) and session.commit(). This is
vulnerable to race conditions under concurrent access. Replace the SELECT check
and conditional insert with either a database-level upsert operation using
INSERT ... ON CONFLICT (or equivalent database syntax) on a unique constraint
covering source_id, target_id, and relation columns, or wrap the insert
operation in a try-catch block to handle IntegrityError exceptions that would be
raised by a unique constraint violation. Ensure the kg_relations table has a
unique constraint defined on the combination of source_id, target_id, and
relation if using the constraint approach.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e58f1d93-9674-434d-84e5-b34496b97b2e

📥 Commits

Reviewing files that changed from the base of the PR and between 29c13ba and c7dc8df.

📒 Files selected for processing (2)
  • .jules/bolt.md
  • src/ippoc/mnemosyne/graph/manager.py

Comment on lines +61 to 74
async def add_triple(
self,
source: str,
relation: str,
target: str,
source_type="Concept",
target_type="Concept",
):
"""
Adds (Source) -> [Relation] -> (Target) to the graph.
Idempotent (get_or_create).
"""
async with self.Session() as session:
# Helper to get/create entity

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Initialize schema before writes in add_triple.

add_triple can run before tables are created because it doesn’t call await self.init_db(), unlike the other public read/write methods. This can fail on first use in a fresh environment.

Suggested fix
 async def add_triple(
@@
     ):
@@
-        async with self.Session() as session:
+        await self.init_db()
+        async with self.Session() as session:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/ippoc/mnemosyne/graph/manager.py` around lines 61 - 74, The add_triple
method doesn't initialize the database schema before performing write
operations, which causes failures when this method is called first in a fresh
environment. Add an await self.init_db() call at the beginning of the add_triple
method, right after the docstring and before the async with self.Session()
block, to ensure the database tables are created before any write operations are
attempted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant