
Omni lite #1875

Closed

tomer-levin-nv wants to merge 78 commits into NVIDIA:main from ftatiana-nv:omni_lite

Conversation

@tomer-levin-nv
Collaborator

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

polazilber and others added 30 commits April 9, 2026 11:16
- Delete sql_create_temp_table_parser.py
- Remove add_temp_schema_to_graph from queries.py, queries_parser.py, and sql_create_table_parser.py
- Remove TEMP_TABLE/TEMP_COLUMN labels from model/query.py, queries_comparison.py, queries_dal.py, and sql_select_parser.py
- Fix shared.graph import paths to use nemo_retriever.tabular_data.ingestion.graph throughout
- Replace flatten_list with flat_list_recursive from utils.py
- Add sqloxide dependency to pyproject.toml
- Move populate_queries/populate_views to services/queries.py; update write_to_graph.py imports

Made-with: Cursor
…s constants; add parser tests

- Fix shared.graph import paths in sql_select_parser, sql_insert_into_parser,
  sql_update_table_parser, sql_merge_table_parser, and sql_view_parser
- Add SQLType, SQL, Parser, Views, Labels.ALIAS/OPERATOR/COMMAND/FUNCTION/CONSTANT,
  ArgsForSQLFunctions, SQLFunctions, JoinNodes, SQLFunctionsWithConsantArg stubs
  to reserved_words.py
- Add test_queries_parser.py: 25 unit tests covering pre_process, dispatch_sqls
  routing, and parse_single (dialect fallback, error cases, schema propagation)

Made-with: Cursor
…queries

- Simplify expand_info Cypher: drop document, zone, and user_participants
  filtering from n:term, n:attribute, n:analysis, and n:metric cases
- Remove certified_for_zones and zones fields from returned sql objects
- Remove user_participants from query params dict
- Fix imports in CandidateRetrievalAgent and inline get_question_for_processing
- Remove account_id and user_participants from CandidateRetrievalAgent.run
  and extract_candidates call

Made-with: Cursor
…lot extractor

- Delete old graph/parsers/sql/ per-statement parsers (select, insert, update,
  merge, create-table, view) and replace with a single sqlglot_extractor.py
  that resolves tables/columns for any SQL dialect without a full AST
- Rewrite services/queries.py: parse_query_slim + chunked populate_queries;
  remove populate_views, filter_and_sort_queries_df and related dead code
- Move parsers/sql/ and sqlglot_extractor.py out of graph/ sub-package into
  the canonical ingestion/parsers/ directory; delete empty graph/ folder
- Fix all .graph.* imports across dal/, model/, services/, parsers/
- Add services/schema.get_account_schemas(); fix Node→Neo4jNode throughout
- Remove dead queries_comparison.py and test_queries_parser.py

Made-with: Cursor
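
For context on the extractor approach described in the commit above, here is a minimal, hypothetical sketch of resolving tables and columns with sqlglot; the function name and the exact key format are illustrative assumptions, not the PR's actual sqlglot_extractor.py code.

```python
# Illustrative sketch only, not the PR's sqlglot_extractor.py implementation.
from sqlglot import exp, parse_one


def extract_tables_and_columns(sql: str, dialect: str = "duckdb"):
    """Collect table and column references from a single SQL statement."""
    tree = parse_one(sql, read=dialect)
    tables = {
        # Keep schema-qualified keys (e.g. "sales.orders") so same-named tables
        # in different schemas stay distinct; unqualified tables use the bare name.
        f"{t.db}.{t.name}" if t.db else t.name
        for t in tree.find_all(exp.Table)
    }
    columns = {(c.table, c.name) for c in tree.find_all(exp.Column)}
    return tables, columns
```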
…and tests

- Replace sqllineage with pure sqlglot for column attribution; remove
  sqllineage dependency from pyproject.toml
- Fix join_keys extraction: use select.args["from_"] (not "from") and
  fall back to ON-equality detection when USING is converted at parse time
- Support schema-qualified tables (schema.table): qualified keys in
  source_table_names and alias_map keep same-named tables from different
  schemas distinct; _build_schema_dict stays flat for qualify() compat
- Remove lazy get_account_schemas() fallback; all_schemas is always
  caller-provided (default {})
- Add test_sqlglot_extractor.py covering the two duckdb_connector queries
  (RFM segmentation, CLV) plus edge-case tests, all without Neo4j

Made-with: Cursor
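
As a rough illustration of the ON-equality fallback mentioned above, the sketch below shows the general sqlglot pattern; it is not the PR's exact join_keys code, and the names are assumptions.

```python
# Hypothetical sketch of ON-equality join-key detection with sqlglot.
from sqlglot import exp, parse_one


def join_keys(sql: str, dialect: str = "duckdb"):
    """Return (left, right) column pairs from equality conditions in JOIN ... ON."""
    tree = parse_one(sql, read=dialect)
    pairs = []
    for join in tree.find_all(exp.Join):
        on = join.args.get("on")
        if on is None:
            continue  # e.g. CROSS JOIN, or a USING clause not rewritten to ON
        for eq in on.find_all(exp.EQ):
            left, right = eq.this, eq.expression
            if isinstance(left, exp.Column) and isinstance(right, exp.Column):
                pairs.append((left.sql(), right.sql()))
    return pairs
```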
- Remove dead functions from queries_dal.py (get_sql_by_id, etc.)
- Delete unused reserved_words classes (Views, SQL, RelTypes, SQLType, Parser)
- Remove utterance and default_schema params from Query
- Simplify parse_queries_df to return failed_queries instead of taking it as input
- Optimise table_to_schema lookup in sqlglot_extractor (use table_exists, two-pass strategy)
- Rename get_account_schemas → get_schemas; drop multi-DB loop (single DB only)
- Delete dead sql/queries_parser.py and its parsers/sql package
- Update test mock to implement table_exists

Made-with: Cursor
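
The two-pass table-to-schema lookup mentioned above can be pictured roughly as follows. This is a hypothetical sketch; the real extractor's data structures and signatures may differ, and it only assumes each schema object exposes the table_exists check named in the commit.

```python
# Hypothetical illustration of a two-pass table -> schema lookup.
def table_to_schema(table: str, schemas: dict, default_schema: str | None = None):
    """Resolve which schema owns `table`, preferring the default schema."""
    # Pass 1: check the default schema first (cheapest, most common case).
    if default_schema is not None:
        schema = schemas.get(default_schema)
        if schema is not None and schema.table_exists(table):
            return default_schema
    # Pass 2: fall back to scanning the remaining schemas.
    for name, schema in schemas.items():
        if name != default_schema and schema.table_exists(table):
            return name
    return None
```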
- Remove sql_node, utterance, default_schema, query_tag params from Query
- Drop reached_columns_ids field and its getter/setter (unused)
- Add parse_query_single for parsing a bare SQL string
- parse_queries_df returns failed_queries instead of taking it as input

Made-with: Cursor
polazilber and others added 23 commits April 15, 2026 10:38
Fixes flake8 F401; tests do not use pytest directly.

Made-with: Cursor
Drop the extra top-level definition; the same implementation remains
later in the module. Fixes flake8 F811.

Made-with: Cursor
data_for_populate_tabular includes a queries DataFrame; extend
EXPECTED_DATA_KEYS and assertions. Document extract_tabular_db_data return keys.

Made-with: Cursor
…a extract

- Add dialect attribute (default "sqlite") to SQLDatabase ABC so every
  connector advertises its SQL dialect.
- DuckDB connector already sets self.dialect = "duckdb".
- Fix TabularSchemaExtractOp.process() to read dialect from the data
  argument instead of the undefined tabular_params name.
- Pass dialect through extract_data so store_relational_db_in_neo4j
  receives the correct value.

Made-with: Cursor
Pass dialect=duckdb to match store_relational_db_in_neo4j signature; include
queries in dummy_data for extract payload shape.

Made-with: Cursor
…DuckDB dev connector

Remove class-level dialect default and dialect kwarg from SQLDatabase __init__.
Expose dialect as a property; DuckDB connector returns "duckdb".

Made-with: Cursor
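
A minimal sketch of the property-based dialect exposure described above; the connector class name and imports are assumptions for illustration, not the PR's actual code.

```python
# Sketch only: actual class and module names in the PR may differ.
from abc import ABC, abstractmethod


class SQLDatabase(ABC):
    @property
    @abstractmethod
    def dialect(self) -> str:
        """SQL dialect advertised by the connector (e.g. used when parsing queries)."""


class DuckDBConnector(SQLDatabase):  # hypothetical name for the dev connector
    @property
    def dialect(self) -> str:
        return "duckdb"
```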
The retriever instance (and its embedding model weights) was being
created from scratch on every call to search_lancedb_semantic_index.
Cache it at module level keyed by (uri, table_name) so weights load once.

Made-with: Cursor
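
A minimal sketch of the module-level cache described in the commit above, assuming a Retriever class whose constructor takes uri and table_name (the real signature may differ).

```python
# Hypothetical sketch: cache retriever instances so embedding weights load once.
# Retriever is assumed to be imported elsewhere in the module.
_RETRIEVER_CACHE: dict[tuple[str, str], "Retriever"] = {}


def _get_retriever(uri: str, table_name: str) -> "Retriever":
    key = (uri, table_name)
    if key not in _RETRIEVER_CACHE:
        # Constructor arguments are assumed for illustration.
        _RETRIEVER_CACHE[key] = Retriever(uri=uri, table_name=table_name)
    return _RETRIEVER_CACHE[key]
```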
Made-with: Cursor

# Conflicts:
#	nemo_retriever/pyproject.toml
#	nemo_retriever/src/nemo_retriever/tabular_data/ingestion/dal/queries_dal.py
#	nemo_retriever/src/nemo_retriever/tabular_data/ingestion/embeddings.py
#	nemo_retriever/src/nemo_retriever/tabular_data/ingestion/model/query.py
#	nemo_retriever/src/nemo_retriever/tabular_data/ingestion/parsers/sqlglot_extractor.py
#	nemo_retriever/src/nemo_retriever/tabular_data/ingestion/services/queries.py
#	nemo_retriever/src/nemo_retriever/tabular_data/ingestion/write_to_graph.py
#	nemo_retriever/src/nemo_retriever/tabular_data/sql_database.py
#	nemo_retriever/tabular-dev-tools/duckdb_connector.py
#	nemo_retriever/tests/test_tabular_pipeline.py
#	nemo_retriever/uv.lock
@tomer-levin-nv tomer-levin-nv requested review from a team as code owners April 20, 2026 13:36
@tomer-levin-nv tomer-levin-nv requested a review from jdye64 April 20, 2026 13:36
@copy-pr-bot

copy-pr-bot Bot commented Apr 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 20, 2026

Greptile Summary

This PR introduces the "Omni lite" text-to-SQL pipeline: a LangGraph-based multi-agent system that translates natural language questions into SQL, executes them via an injected connector, and formats the result. The implementation replaces prior AGENTS.md/SKILL.md scaffolding with concrete Python agent classes built on a new BaseAgent abstraction.

There are three P0/P1 defects that should be fixed before merging:

  • _run_sql returns None when no connector — causes an AttributeError crash on response_from_db.error in every run without an injected connector (sql_execution.py).
  • Off-by-one in retry guard: if attempt < RETRY_MAX_ATTEMPTS is always True in the loop, so safe_invoke_with_structured_output never raises on exhausted retries and silently returns None instead (llm_invoke.py).
  • Module-level side effects: conn = get_neo4j_conn() in utils.py and the _make_llm() / create_graph() / graph.compile() calls in main.py all execute at import time, violating the lazy-initialization policy and hard-failing when env vars are absent.

Confidence Score: 3/5

Not safe to merge — a crash on every no-connector execution, a silent retry swallow, and module-level LLM/DB init need to be addressed first.

Three P1 findings (one borderline P0) affect the primary execution path: null-deref crash, silent failure swallowing, and import-time side effects that break test isolation and deployment. These are present defects in the changed code, not speculative concerns.

sql_execution.py, llm_invoke.py, utils.py, and main.py each contain a confirmed defect on the primary execution path.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_retriever/src/nemo_retriever/tabular_data/retrieval/text_to_sql/agents/sql_execution.py | New SQL execution agent — _run_sql implicitly returns None when no connector is provided, causing an AttributeError crash on every invocation without a connector. |
| nemo_retriever/src/nemo_retriever/tabular_data/retrieval/text_to_sql/llm_invoke.py | New LLM invocation helpers — off-by-one in retry guard means safe_invoke_with_structured_output never raises on exhausted retries, silently returning None. |
| nemo_retriever/src/nemo_retriever/tabular_data/retrieval/text_to_sql/utils.py | Large utility module — module-level conn = get_neo4j_conn() establishes a Neo4j connection at import time, violating lazy-loading policy; also has unvalidated f-string interpolation in Cypher queries. |
| nemo_retriever/src/nemo_retriever/tabular_data/retrieval/text_to_sql/main.py | Entry point — _make_llm(), create_graph(), and graph.compile() all execute at module scope, making import fail hard if credentials are missing. |
| nemo_retriever/src/nemo_retriever/tabular_data/retrieval/text_to_sql/text_to_sql_graph.py | LangGraph definition — routing function route_sql_validation mutates shared state, which is unsafe in LangGraph's replay/checkpoint model; otherwise graph structure is clean. |
| nemo_retriever/src/nemo_retriever/tabular_data/retrieval/text_to_sql/base.py | New BaseAgent and agent_wrapper — well-structured with good error handling and logging; missing SPDX header. |
| nemo_retriever/src/nemo_retriever/tabular_data/retrieval/text_to_sql/state.py | State TypedDicts — clean definitions; contains commented-out code with a stale TODO that should be removed. |
| nemo_retriever/src/nemo_retriever/tabular_data/retrieval/text_to_sql/retrieval_override.py | LanceDB label-filter extension of Retriever — correctly handles SQL escaping via _sql_in_literals; has SPDX header. |
| nemo_retriever/src/nemo_retriever/tabular_data/retrieval/text_to_sql/agents/candidates_preparation.py | Candidate preparation agent — solid implementation; missing SPDX header like other agent files. |
| nemo_retriever/src/nemo_retriever/tabular_data/ingestion/embeddings.py | New Neo4j→embedding dataframe helper — clean, has SPDX header, queries are read-only with no injection risk. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[entities_extraction] --> B[retrieve_candidates]
    B --> C[prepare_candidates]
    C --> D[construct_sql_from_candidates]
    D -->|constructable| E[validate_sql_query]
    D -->|unconstructable| K[unconstructable_sql_response]
    E -->|valid_sql| F[validate_intent]
    E -->|skip_intent_validation| G[execute_sql_query]
    E -->|invalid_sql| H[reconstruct_sql]
    E -->|fallback| I[construct_sql_not_from_snippets]
    E -->|unconstructable| K
    F -->|valid_sql| G
    F -->|invalid_sql| H
    G -->|valid_sql| J[format_and_respond]
    G -->|invalid_sql| H
    H --> E
    I --> E
    J --> END([END])
    K --> END

Reviews (1): Last reviewed commit: "regenerate uv.lock after merge from main"

Comment on lines +24 to +34 in nemo_retriever/src/nemo_retriever/tabular_data/retrieval/text_to_sql/agents/sql_execution.py
def _run_sql(sql: str, state: AgentState) -> QueryResponse:
    """Execute SQL via the injected ``connector``."""
    connector = state.get("connector")
    if connector is not None:
        try:
            df = connector.execute(sql)
        except Exception as e:
            logger.exception("SQL execution failed (injected connector)")
            return QueryResponse(result=None, sliced=False, error=str(e))
        payload = df.to_json(orient="records", default_handler=str) if len(df) else "[]"
        return QueryResponse(result=[payload], sliced=False, error=None)

P0 _run_sql returns None when no connector is provided

When connector is None, the function falls off the end and implicitly returns None. The caller unconditionally accesses response_from_db.error on the return value, causing AttributeError: 'NoneType' object has no attribute 'error' whenever the graph is invoked without a connector.

def _run_sql(sql: str, state: AgentState) -> QueryResponse:
    """Execute SQL via the injected ``connector``."""
    connector = state.get("connector")
    if connector is not None:
        try:
            df = connector.execute(sql)
        except Exception as e:
            logger.exception("SQL execution failed (injected connector)")
            return QueryResponse(result=None, sliced=False, error=str(e))
        payload = df.to_json(orient="records", default_handler=str) if len(df) else "[]"
        return QueryResponse(result=[payload], sliced=False, error=None)
    return QueryResponse(result=None, sliced=False, error="No database connector provided")

Comment on lines +29 to +44 in nemo_retriever/src/nemo_retriever/tabular_data/retrieval/text_to_sql/llm_invoke.py
        except ValidationError as e:
            if attempt < RETRY_MAX_ATTEMPTS:
                # Explain what's missing/invalid and ask the model to fix it
                current_messages.append(
                    SystemMessage(
                        content=(
                            "Your previous output did not validate. "
                            f"Validation errors:\n{str(e)}\n"
                            "Please return a **fully valid** object that satisfies the schema. "
                            "Do not omit required fields. Do not include extra keys."
                        )
                    )
                )
            else:
                logger.error(f"Validation failed after {RETRY_MAX_ATTEMPTS} attempts for {schema_name}")
                raise  # If still failing after max tries, raise

P1 Off-by-one makes the final-failure raise unreachable

The loop runs for attempt in range(RETRY_MAX_ATTEMPTS) (values 0, 1, 2). The guard if attempt < RETRY_MAX_ATTEMPTS is always True (max loop value is 2, which is always < 3), so the else: raise branch is never entered. On the last attempt's ValidationError, the error message is appended to current_messages, the loop exits cleanly, and safe_invoke_with_structured_output returns None implicitly instead of raising — silently swallowing the failure.

        except ValidationError as e:
            if attempt < RETRY_MAX_ATTEMPTS - 1:
                current_messages.append(
                    SystemMessage(
                        content=(
                            "Your previous output did not validate. "
                            f"Validation errors:\n{str(e)}\n"
                            "Please return a **fully valid** object that satisfies the schema. "
                            "Do not omit required fields. Do not include extra keys."
                        )
                    )
                )
            else:
                logger.error(f"Validation failed after {RETRY_MAX_ATTEMPTS} attempts for {schema_name}")
                raise  # If still failing after max tries, raise

Comment on line 34 in nemo_retriever/src/nemo_retriever/tabular_data/retrieval/text_to_sql/utils.py

except ImportError:
    pass

conn = get_neo4j_conn()

P1 Module-level Neo4j connection established at import time

conn = get_neo4j_conn() executes at module scope, so importing this module (or anything that transitively imports it) immediately opens a Neo4j connection. This violates the lazy-loading policy, breaks unit tests that run without Neo4j, and makes any import failure (missing env vars, unreachable host) a hard crash rather than a deferred error.

Move conn into each function that needs it, or use a module-level cached accessor:

_conn = None

def _get_conn():
    global _conn
    if _conn is None:
        _conn = get_neo4j_conn()
    return _conn

Then replace every bare conn. reference with a _get_conn() call.

Rule Used: Models must be loaded lazily (on first use or expl... (source)


Comment on lines +16 to +23 in nemo_retriever/src/nemo_retriever/tabular_data/retrieval/text_to_sql/main.py
try:
    llm_client = _make_llm()
except ValueError as e:
    logger.error("Failed to initialize LLM client: %s", e)
    llm_client = None

graph = create_graph()
app = graph.compile()

P1 LLM client and graph compiled at import time

_make_llm(), create_graph(), and graph.compile() all run at module scope. Importing main immediately makes LLM API calls (or at least validates credentials) and allocates graph objects. This violates the lazy-initialization policy and means any missing env vars (LLM_INVOKE_URL, LLM_API_KEY) produce a hard failure on import, not on first use.

Wrap these in a lazy accessor or a factory function that callers invoke explicitly, similar to the pattern already used in utils.py for _make_llm.

Rule Used: Models must be loaded lazily (on first use or expl... (source)
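
One possible shape of the suggested lazy accessor, as a sketch: get_llm_client, get_app, and the module-level caches are hypothetical names; _make_llm and create_graph are the functions referenced above.

```python
# Hypothetical sketch of deferring LLM/graph construction to first use.
_llm_client = None
_app = None


def get_llm_client():
    global _llm_client
    if _llm_client is None:
        _llm_client = _make_llm()  # env vars only required on first use
    return _llm_client


def get_app():
    global _app
    if _app is None:
        graph = create_graph()
        _app = graph.compile()
    return _app
```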


Comment on lines +42 to +53 in nemo_retriever/src/nemo_retriever/tabular_data/retrieval/text_to_sql/text_to_sql_graph.py
if state["decision"] == "invalid_sql":
attempts = state["path_state"].get("sql_attempts", 0)
logger.info(f"Construct sql attempt: {attempts}")
state["path_state"]["sql_attempts"] = attempts + 1
if attempts == 4:
logger.info("Can not construct sql from snippets, try from relevant tables. Fallback.")
return "fallback" # try constructing from tables, not only snippets
elif attempts < 8:
return "invalid_sql"
elif attempts == 8:
logger.error("SQL construction failed after 8 attempts")
return "unconstructable"

P1 Routing function mutates shared state

route_sql_validation writes to state["path_state"]["sql_attempts"] in-place. LangGraph router functions are expected to be pure read-only predicates; side effects here can cause double-counting if the router is replayed (e.g., due to checkpointing or streaming). Move the counter increment into the agent's execute method (e.g., SQLReconstructionAgent) and make the router only read sql_attempts for its decision.
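
A sketch of the suggested split, in which the router becomes read-only and the agent owns the counter (signatures and surrounding branches are simplified assumptions, not the PR's code):

```python
# Hypothetical sketch: keep route_sql_validation free of side effects.
def route_sql_validation(state) -> str:
    if state["decision"] != "invalid_sql":
        return state["decision"]
    attempts = state["path_state"].get("sql_attempts", 0)  # read-only
    if attempts == 4:
        return "fallback"
    if attempts < 8:
        return "invalid_sql"
    return "unconstructable"


# ...while the retrying agent increments the counter inside its execute step.
class SQLReconstructionAgent:
    def execute(self, state):
        state["path_state"]["sql_attempts"] = state["path_state"].get("sql_attempts", 0) + 1
        # ...reconstruct the SQL and update state as before...
        return state
```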


Comment on lines +116 to +126 in nemo_retriever/src/nemo_retriever/tabular_data/retrieval/text_to_sql/utils.py
def get_stored_usage_percentiles(percentiles_type_name: str):
    query = f"""
    MATCH (n:Database)
    RETURN n.{f"{percentiles_type_name}_25"} as usage_percentile_25,
           n.{f"{percentiles_type_name}_75"} as usage_percentile_75
    """
    results = conn.query_read(
        query=query,
        parameters={},
    )
    return results

P2 Property name interpolated directly into Cypher query

get_stored_usage_percentiles embeds percentiles_type_name directly into the Cypher string via an f-string. Cypher does not support parameterized property names, so this cannot be fully parameterized, but the caller must ensure only trusted constants are passed. Consider asserting/validating percentiles_type_name against an allowlist of known constants (QUERIES_USAGE_PERCENTILE, TABLES_USAGE_PERCENTILE, COLUMNS_USAGE_PERCENTILE) before interpolation to make the constraint explicit.

_ALLOWED_PERCENTILE_NAMES = {QUERIES_USAGE_PERCENTILE, TABLES_USAGE_PERCENTILE, COLUMNS_USAGE_PERCENTILE}

def get_stored_usage_percentiles(percentiles_type_name: str):
    if percentiles_type_name not in _ALLOWED_PERCENTILE_NAMES:
        raise ValueError(f"Unknown percentile type: {percentiles_type_name!r}")
    ...

Rule Used: Always use parameterized queries when interacting ... (source)


Comment on lines +45 to +47 in nemo_retriever/src/nemo_retriever/tabular_data/retrieval/text_to_sql/state.py
# normalized_question = path_state.get("normalized_question")
# TODO remove normalized question, for question : give me all student from seattle , seattle was removed
normalized_question = path_state.get("initial_question")

P2 Commented-out logic with stale TODO

The commented-out normalized_question path and TODO have been superseded by the current implementation. The comment contradicts the behaviour of the function (it now reads path_state["initial_question"], not initial_question from state). Remove the dead code and clarify the intent so future readers understand why path_state["initial_question"] takes precedence.


Comment on line 1 in nemo_retriever/src/nemo_retriever/tabular_data/retrieval/text_to_sql/base.py

@@ -0,0 +1,205 @@
"""

P2 Missing SPDX license header

This file and the other new Python files added in this PR (main.py, utils.py, state.py, models.py, llm_invoke.py, text_to_sql_graph.py, prompts.py, and all agents under agents/) are missing the required SPDX header. The embeddings.py change already has it correctly applied. Add the following to the top of each affected file:

# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.
# All rights reserved.
# SPDX-License-Identifier: Apache-2.0

Rule Used: Python files added in this PR must include the SPD... (source)

