⚡ Bolt: Graph Entity Similarity Performance by Theory903 · Pull Request #483 · Theory903/open-ippoc

Theory903 · 2026-06-21T20:22:55Z

💡 What: Rewrote the logic for finding similar entities to use a SQL JOIN with Common Table Expressions (CTEs) rather than iterating through entities with N+1 queries in Python.
🎯 Why: The original unoptimized code was performing individual queries for every entity, fetching all relationships to manually calculate Jaccard similarity in the app layer.
📊 Impact: Execution time for the similarity lookup dropped by roughly two orders of magnitude (about 150x in the local benchmark).
🔬 Measurement: A local benchmark against 200 entities went from ~0.75s to ~0.005s.
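
For readers who want to see the shape of the change, here is a minimal sketch of a set-based Jaccard query using CTEs and a JOIN. It is not the code from this PR: the `kg_entities` table, the choice of `(relation, target_id)` pairs as the compared set, and the SQLite backend are all assumptions made for illustration. Only the `kg_relations` column names and the result keys (`entity`, `similarity`, `shared_relations`) come from the review text.

```python
import sqlite3

# Hypothetical schema: kg_entities and the UNIQUE constraint are assumptions.
SCHEMA = """
CREATE TABLE kg_entities (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE kg_relations (
    source_id INTEGER NOT NULL REFERENCES kg_entities(id),
    relation  TEXT NOT NULL,
    target_id INTEGER NOT NULL REFERENCES kg_entities(id),
    UNIQUE (source_id, relation, target_id)
);
"""

# One round trip: count shared outgoing edges per candidate, then divide by
# the size of the union (|A| + |B| - |A intersect B|) to get Jaccard similarity.
SIMILAR_SQL = """
WITH anchor_edges AS (
    SELECT r.relation, r.target_id
    FROM kg_relations AS r
    JOIN kg_entities AS a ON a.id = r.source_id
    WHERE a.name = :name
),
shared AS (
    SELECT r.source_id AS candidate_id, COUNT(*) AS shared_count
    FROM kg_relations AS r
    JOIN anchor_edges AS ae
      ON ae.relation = r.relation AND ae.target_id = r.target_id
    JOIN kg_entities AS c ON c.id = r.source_id
    WHERE c.name <> :name
    GROUP BY r.source_id
),
totals AS (
    SELECT source_id, COUNT(*) AS edge_count
    FROM kg_relations
    GROUP BY source_id
)
SELECT e.name AS entity,
       s.shared_count AS shared_relations,
       CAST(s.shared_count AS REAL)
           / ((SELECT COUNT(*) FROM anchor_edges) + t.edge_count - s.shared_count)
           AS similarity
FROM shared AS s
JOIN totals AS t ON t.source_id = s.candidate_id
JOIN kg_entities AS e ON e.id = s.candidate_id
ORDER BY similarity DESC, e.name
LIMIT :limit
"""


def find_similar_entities(conn, name, limit=5):
    """Return candidates ranked by Jaccard similarity of their outgoing edges."""
    rows = conn.execute(SIMILAR_SQL, {"name": name, "limit": limit}).fetchall()
    return [
        {"entity": r[0], "shared_relations": r[1], "similarity": r[2]} for r in rows
    ]


def build_demo_graph():
    """Tiny in-memory graph used only to exercise the query."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO kg_entities (id, name) VALUES (?, ?)",
        [(1, "cat"), (2, "dog"), (3, "car"), (4, "mammal"), (5, "pet"), (6, "vehicle")],
    )
    conn.executemany(
        "INSERT INTO kg_relations (source_id, relation, target_id) VALUES (?, ?, ?)",
        [(1, "is_a", 4), (1, "is_a", 5), (2, "is_a", 4), (2, "is_a", 5),
         (2, "chases", 1), (3, "is_a", 6)],
    )
    conn.commit()
    return conn
```

Entities that share no edge with the anchor never appear in the `shared` CTE, so they are filtered out without a per-entity query, which is the property the N+1 loop lacked.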

PR created automatically by Jules for task 14235096974391323292 started by @Theory903

Summary by CodeRabbit

Bug Fixes
- Improved error messaging when entity context retrieval fails for missing entities.
- Enhanced metadata parsing robustness to handle malformed data gracefully.
Performance
- Optimized entity similarity queries for faster computation.
- Streamlined relationship pathfinding operations for improved efficiency.

DESCRIPTION: Rewritten similar entities logic to use a SQL JOIN instead of an N+1 Query. IMPACT: Huge speedup, 0.75s to 0.005s. Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>

google-labs-jules · 2026-06-21T20:22:56Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

coderabbitai · 2026-06-21T20:23:07Z

📝 Walkthrough

GraphManager in manager.py is refactored to use multiline parameterized SQL across all async methods. get_entity_context gains an early-return error dict for missing entities and guarded JSON parsing. find_similar_entities replaces N+1 per-entity relation checks with a single SQL JOIN. A new .jules/bolt.md log entry records the CTE optimization guideline.
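
The early-return and guarded-parsing behavior described above could look roughly like the sketch below. The helper names `parse_metadata` and `entity_context`, the row layout, and the exact wording of the error dict are assumptions for illustration; the actual `get_entity_context` implementation is not shown in this thread.

```python
import json
from typing import Any, Dict, Optional


def parse_metadata(raw: Optional[str]) -> Dict[str, Any]:
    """Parse a JSON metadata column, yielding {} on missing or malformed input."""
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return {}
    # Assumption: only JSON objects count as valid metadata.
    return parsed if isinstance(parsed, dict) else {}


def entity_context(row: Optional[Dict[str, Any]], name: str) -> Dict[str, Any]:
    """Return an error dict for a missing row instead of failing later."""
    if row is None:
        return {"error": f"Entity '{name}' not found"}  # hypothetical message
    return {"entity": row["name"], "metadata": parse_metadata(row.get("metadata_"))}
```

Returning early keeps the missing-entity case from reaching code that would otherwise index into a `None` row.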

Changes

GraphManager SQL Refactor and find_similar_entities N+1 Fix

| Layer / File(s) | Summary |
| --- | --- |
| Imports, ORM models, and init cleanup `src/ippoc/mnemosyne/graph/manager.py` | Removes unused `Tuple`, `Optional`, `DateTime`, and `relationship` imports; switches to parenthesized SQLAlchemy imports; reformats `Entity`/`Relation` model definitions and `GraphManager.__init__` without changing persisted fields. |
| `add_triple` and `get_neighbors` multiline SQL `src/ippoc/mnemosyne/graph/manager.py` | Reformats `add_triple`'s nested `get_or_create` helper and relation-exists check, and `get_neighbors`' entity-id lookup, to multiline parameterized `session.execute` calls; semantics are unchanged. |
| `find_relationship_path` and `_find_paths_cte` reformatting `src/ippoc/mnemosyne/graph/manager.py` | Expands method signatures, parameterized SQL dicts, and row-parsing logic to multiline form; recursive CTE execution, cycle check, and path-object construction are functionally unchanged. |
| `get_entity_context` error handling and `find_similar_entities` SQL JOIN `src/ippoc/mnemosyne/graph/manager.py` | `get_entity_context` returns an error dict immediately when the entity row is missing, and wraps `metadata_` JSON parsing in `try/except` to yield `{}` on malformed input. `find_similar_entities` replaces N+1 per-entity relationship checks with a single SQL JOIN; result shape (`entity`, `similarity`, `shared_relations`) is preserved. |
| `delete_entity` multiline SQL reformatting `src/ippoc/mnemosyne/graph/manager.py` | Reformats entity-id SELECT and incident-relations DELETE to multiline parameterized calls; splits `logger.info` string; return value and error handling are unchanged. |
| Jules bolt.md optimization log `.jules/bolt.md` | Adds a new entry documenting the N+1-to-CTE optimization learning and action guideline for SQL batching over per-entity queries. |
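
The table mentions a recursive CTE with a cycle check for path finding. As a rough illustration of that technique (not the repository's `_find_paths_cte`), the sketch below walks `kg_relations` in SQLite and skips any node already present in the accumulated path. The function name `find_paths`, the comma-delimited path encoding, and the `max_depth` default are assumptions.

```python
import sqlite3

# Each row carries the visited node ids as ',1,2,3,' so that instr() can
# detect a revisit; expansion stops at the end node and at max_depth.
PATHS_SQL = """
WITH RECURSIVE paths(node_id, path, depth) AS (
    SELECT :start_id, ',' || :start_id || ',', 0
    UNION ALL
    SELECT r.target_id, p.path || r.target_id || ',', p.depth + 1
    FROM paths AS p
    JOIN kg_relations AS r ON r.source_id = p.node_id
    WHERE p.depth < :max_depth
      AND p.node_id <> :end_id
      AND instr(p.path, ',' || r.target_id || ',') = 0
)
SELECT path FROM paths WHERE node_id = :end_id ORDER BY depth, path
"""


def find_paths(conn, start_id, end_id, max_depth=4):
    """Return every simple path from start_id to end_id as a list of node ids."""
    params = {"start_id": start_id, "end_id": end_id, "max_depth": max_depth}
    rows = conn.execute(PATHS_SQL, params).fetchall()
    return [[int(part) for part in row[0].strip(",").split(",")] for row in rows]


def build_demo_edges():
    """In-memory edge table with a 1 -> 2 -> 1 cycle to exercise the check."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE kg_relations (source_id INTEGER, relation TEXT, target_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO kg_relations VALUES (?, 'linked', ?)",
        [(1, 2), (2, 3), (1, 3), (2, 1)],
    )
    conn.commit()
    return conn
```

Encoding the path as delimited text is a simple way to express the cycle check in plain SQL; databases with array types may offer a cleaner option.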

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Theory903/open-ippoc#34: Directly overlaps with this PR's find_similar_entities refactor — both replace N+1 per-entity checks with an aggregate/JOIN SQL query for similarity computation.
Theory903/open-ippoc#37: Both PRs modify GraphManager.find_similar_entities to switch similarity computation to a SQL CTE-based intersection approach instead of per-row Python-side relationship checks.
Theory903/open-ippoc#154: Both PRs modify the CTE-based similarity candidate selection logic in find_similar_entities, with the retrieved PR adding early HAVING pruning to the candidates CTE.

Poem

🐇 Hop, hop, no more loops in sight,
One JOIN to rule the SQL night!
N+1 queries? Gone away,
A single batch to seize the day.
The graph now hops with SQL might! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description provides clear context (What/Why/Impact/Measurement) but does not follow the mandatory IPPOC-FS template structure, missing all required sections including Intent Declaration, Canon Compliance, Scope Control, Contract Compliance, and Boundary Declaration. | Fill out all mandatory sections of the IPPOC-FS template, including organ declaration, change type, compliance checklist, scope summary, and safety assertions. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 70.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title refers to a real aspect of the changeset (graph entity similarity performance optimization), but it is overly broad and vague. The emoji and generic terms like 'Bolt' don't clearly convey what was specifically changed or optimized. | Consider revising the title to be more specific, e.g., 'Optimize GraphManager.find_similar_entities with SQL CTEs' or 'Replace N+1 queries with CTE-based similarity computation'. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/ippoc/mnemosyne/graph/manager.py (2)

75-86: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

get_or_create has a TOCTOU race for entity creation.

Two concurrent requests can both miss the SELECT and attempt INSERT; one will hit unique-name constraints (or duplicate if constraints drift). This should be made atomic (DB upsert / insert-on-conflict, then select id).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/ippoc/mnemosyne/graph/manager.py` around lines 75 - 86, The get_or_create
function has a race condition where two concurrent requests can both execute the
SELECT query, both find no existing entity, and then both attempt to INSERT,
causing a conflict. Replace the separate SELECT followed by INSERT logic with a
single atomic database operation using an upsert pattern (INSERT ... ON CONFLICT
or equivalent database-specific syntax). This should insert the new Entity if it
does not exist based on the unique name constraint, then select and return the
id of the entity (whether newly created or pre-existing) in one atomic
transaction to prevent concurrent duplicate creation attempts.
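
A minimal sketch of the upsert pattern this comment asks for, written against SQLite's `sqlite3` module rather than the project's async SQLAlchemy session. The `kg_entities` table, its `type` column, and the function name `get_or_create_entity` are assumptions; the `ON CONFLICT ... DO NOTHING` clause requires a unique constraint on `name` and a database that supports this syntax.

```python
import sqlite3


def get_or_create_entity(conn, name, entity_type="Concept"):
    """Insert the entity if absent, then return its id, in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO kg_entities (name, type) VALUES (?, ?) "
            "ON CONFLICT(name) DO NOTHING",
            (name, entity_type),
        )
        row = conn.execute(
            "SELECT id FROM kg_entities WHERE name = ?", (name,)
        ).fetchone()
    return row[0]


def build_entity_table():
    """In-memory table with the unique name constraint the upsert relies on."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE kg_entities ("
        "id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL, type TEXT)"
    )
    return conn
```

Because the database resolves the conflict, a second writer that loses the race simply falls through to the SELECT instead of raising.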

93-103: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Relation insert path is also non-atomic (check-then-insert).

SELECT ... IF NOT EXISTS -> INSERT is race-prone and can create duplicate edges under concurrency unless enforced with a unique constraint + upsert semantics.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/ippoc/mnemosyne/graph/manager.py` around lines 93 - 103, The current
implementation in the relation insertion logic uses a non-atomic
check-then-insert pattern where a SELECT query determines if a Relation exists
before calling session.add(rel) and session.commit(). This is vulnerable to race
conditions under concurrent access. Replace the SELECT check and conditional
insert with either a database-level upsert operation using INSERT ... ON
CONFLICT (or equivalent database syntax) on a unique constraint covering
source_id, target_id, and relation columns, or wrap the insert operation in a
try-catch block to handle IntegrityError exceptions that would be raised by a
unique constraint violation. Ensure the kg_relations table has a unique
constraint defined on the combination of source_id, target_id, and relation if
using the constraint approach.
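
The second option in this prompt (unique constraint plus handling the integrity error) might be sketched as follows. This uses the standard-library `sqlite3` module for a self-contained example; with SQLAlchemy the exception to catch would be its own `IntegrityError`. The function name `add_relation` and the boolean return value are assumptions.

```python
import sqlite3


def add_relation(conn, source_id, relation, target_id):
    """Insert an edge; return False if the unique constraint says it exists."""
    try:
        with conn:  # rolls back the failed insert before re-raising
            conn.execute(
                "INSERT INTO kg_relations (source_id, relation, target_id) "
                "VALUES (?, ?, ?)",
                (source_id, relation, target_id),
            )
    except sqlite3.IntegrityError:
        return False
    return True


def build_relation_table():
    """In-memory table with the unique constraint on the edge triple."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE kg_relations ("
        "source_id INTEGER NOT NULL, relation TEXT NOT NULL, "
        "target_id INTEGER NOT NULL, "
        "UNIQUE (source_id, target_id, relation))"
    )
    return conn
```

The constraint is what makes the operation safe under concurrency; the `try/except` only turns the resulting error into an idempotent no-op.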

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e58f1d93-9674-434d-84e5-b34496b97b2e

📥 Commits

Reviewing files that changed from the base of the PR and between 29c13ba and c7dc8df.

📒 Files selected for processing (2)

.jules/bolt.md
src/ippoc/mnemosyne/graph/manager.py

coderabbitai · 2026-06-21T20:26:06Z

+    async def add_triple(
+        self,
+        source: str,
+        relation: str,
+        target: str,
+        source_type="Concept",
+        target_type="Concept",
+    ):
        """
        Adds (Source) -> [Relation] -> (Target) to the graph.
        Idempotent (get_or_create).
        """
        async with self.Session() as session:
            # Helper to get/create entity


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Initialize schema before writes in add_triple.

add_triple can run before tables are created because it doesn’t call await self.init_db(), unlike the other public read/write methods. This can fail on first use in a fresh environment.

Suggested fix

 async def add_triple(
@@
 ):
@@
-        async with self.Session() as session:
+        await self.init_db()
+        async with self.Session() as session:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/ippoc/mnemosyne/graph/manager.py` around lines 61 - 74, The add_triple method doesn't initialize the database schema before performing write operations, which causes failures when this method is called first in a fresh environment. Add an await self.init_db() call at the beginning of the add_triple method, right after the docstring and before the async with self.Session() block, to ensure the database tables are created before any write operations are attempted.
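
Calling `init_db()` at the top of every write is only cheap if initialization is idempotent. Whether the real `init_db` already guards against repeated or concurrent calls is not visible in this thread, so the sketch below shows one way such a guard could work. `DemoGraphManager`, `_create_tables`, and the in-memory `triples` list are stand-ins, not the project's code.

```python
import asyncio


class DemoGraphManager:
    """Stand-in manager showing a one-time, concurrency-safe schema init."""

    def __init__(self):
        self._initialized = False
        self._init_lock = asyncio.Lock()
        self.create_calls = 0  # counts how often tables were "created"
        self.triples = []

    async def _create_tables(self):
        # Placeholder for the real DDL; yields control like real I/O would.
        self.create_calls += 1
        await asyncio.sleep(0)

    async def init_db(self):
        if self._initialized:
            return
        async with self._init_lock:
            # Re-check inside the lock: another task may have finished first.
            if not self._initialized:
                await self._create_tables()
                self._initialized = True

    async def add_triple(self, source, relation, target):
        await self.init_db()  # ensure the schema exists before writing
        self.triples.append((source, relation, target))


async def demo():
    manager = DemoGraphManager()
    await asyncio.gather(
        manager.add_triple("cat", "is_a", "pet"),
        manager.add_triple("dog", "is_a", "pet"),
    )
    return manager
```

Running `asyncio.run(demo())` issues two concurrent writes against a fresh manager while the table-creation step runs once.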

brain/perf: improve similar entities query using SQL CTEs

c7dc8df

Theory903 wants to merge 1 commit into main from jules-14235096974391323292-cc823f58
