[KYUUBI #7379][2b/4] Data Agent Engine: agent runtime, middleware stack, and OpenAI provider by wangzhigang1999 · Pull Request #7417 · apache/kyuubi

wangzhigang1999 · 2026-04-22T13:07:34Z

Why are the changes needed?

Part 2b of 4 for the Data Agent Engine (umbrella, KPIP-7373).

This PR adds the ReAct agent runtime that drives the LLM <-> tool loop, a composable middleware stack around it, and a production OpenAiProvider. It sits on top of the tool system and data source abstraction introduced in PR 2a, and is consumed by the REST layer in PR 3.

Changes include:

ReactAgent — ReAct loop with streaming, tool-call dispatch, turn budget, malformed-tool-call recovery
ConversationMemory — message history with cumulative prompt-token tracking
AgentRunContext / AgentInvocation / ApprovalMode — per-run state plumbing
ToolOutputStore — size-gated tool-output offload, keyed by session+call-id, with ReadToolOutputTool / GrepToolOutputTool for LLM-driven retrieval
AgentMiddleware interface with onRegister hook for tool wiring, plus four middlewares:
- LoggingMiddleware — structured request/response logging
- ApprovalMiddleware — risk-level-based approval gate
- CompactionMiddleware — token-threshold-driven history summarization keyed by session
- ToolResultOffloadMiddleware — transparently owns the ToolOutputStore and registers retrieval tools
OpenAiProvider — OpenAI-compatible chat completions with streaming and tool calls
ExecuteStatement.scala — SSE encoding extended to emit Compaction events
Dialects moved under datasource.dialect package for organization
New kyuubi.engine.data.agent.compaction.trigger.tokens configuration entry
MockLlmProvider — deterministic mock for middleware and runtime tests
mysql-connector-j moved to test scope (GPL-licensed, so it cannot be bundled in an Apache binary release; addresses review feedback on #7417)
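As an illustration of the size-gated offload idea behind ToolOutputStore and its two retrieval tools, here is a minimal sketch. It is not the PR's implementation: the class name `OffloadSketch`, the character-based thresholds, the preview format, and the in-memory map (standing in for the on-disk store) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of size-gated tool-output offload; not the PR's ToolOutputStore. */
final class OffloadSketch {
  private final int maxInlineChars; // outputs up to this size stay inline
  private final int previewChars;   // how much of a large output the LLM still sees
  // In-memory stand-in for the real store, keyed by "sessionId/callId".
  private final Map<String, String> store = new ConcurrentHashMap<>();

  OffloadSketch(int maxInlineChars, int previewChars) {
    if (previewChars < 0 || previewChars > maxInlineChars) {
      throw new IllegalArgumentException("previewChars must be in [0, maxInlineChars]");
    }
    this.maxInlineChars = maxInlineChars;
    this.previewChars = previewChars;
  }

  /** Returns small outputs unchanged; stores large ones and returns a truncated preview. */
  String offloadIfLarge(String sessionId, String callId, String output) {
    if (output.length() <= maxInlineChars) {
      return output;
    }
    store.put(sessionId + "/" + callId, output);
    return output.substring(0, previewChars)
        + "\n[truncated " + output.length() + " chars; use read_tool_output or"
        + " grep_tool_output with call_id=" + callId + "]";
  }

  /** Stand-in for read_tool_output: a character slice of the stored output, or null. */
  String read(String sessionId, String callId, int offset, int length) {
    String full = store.get(sessionId + "/" + callId);
    if (full == null) {
      return null;
    }
    int start = Math.max(0, Math.min(offset, full.length()));
    int end = Math.max(start, Math.min(full.length(), start + length));
    return full.substring(start, end);
  }

  /** Stand-in for grep_tool_output: stored lines that contain the literal needle. */
  List<String> grep(String sessionId, String callId, String needle) {
    List<String> hits = new ArrayList<>();
    String full = store.get(sessionId + "/" + callId);
    if (full != null) {
      for (String line : full.split("\n")) {
        if (line.contains(needle)) {
          hits.add(line);
        }
      }
    }
    return hits;
  }
}
```

Keying by session plus call id means the preview only needs to carry the call id for the model to retrieve the rest later.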

How was this patch tested?

Unit tests (Java): ConversationMemoryTest, ToolOutputStoreTest, ApprovalMiddlewareTest, CompactionMiddlewareTest, ToolResultOffloadMiddlewareTest, event/EventTest, plus updates to ToolRegistryThreadSafetyTest / ToolTest / RunSelectQueryToolTest / RunMutationQueryToolTest / JdbcDialectTest / MySQLDialectTest
Live LLM tests (opt-in, require DATA_AGENT_LLM_API_KEY / DATA_AGENT_LLM_API_URL / DATA_AGENT_LLM_MODEL): ReactAgentLiveTest, CompactionMiddlewareLiveTest — exercise the full loop against a real OpenAI-compatible endpoint
E2E (Scala): DataAgentE2ESuite extended with OpenAI-provider paths; new DataAgentCompactionE2ESuite observes compaction via JDBC
Existing unit + MySQL Testcontainers tests from PR 2a remain green

Was this patch authored or co-authored using generative AI tooling?

Partially assisted by Claude Code (Claude Opus 4.7) for test generation, code review, and PR formatting. Core design and implementation are human-authored.

…re stack, OpenAI provider, and live E2E tests

This PR delivers the runtime layer of the Data Agent Engine on top of the tool system and data source plumbing from 2a/4:

- ReactAgent: ReAct-style loop with streaming LLM responses, per-step tool dispatch, and AgentRunContext tracking token usage, iterations, and session.
- Middleware stack (AgentMiddleware + ReactAgent.Builder):
  * LoggingMiddleware -- structured per-step/LLM/tool/finish logs with MDC.
  * ApprovalMiddleware -- CompletableFuture-based resolve for DESTRUCTIVE tools; modes NORMAL / STRICT / AUTO_APPROVE.
  * CompactionMiddleware -- token-threshold-triggered history summarization with KEEP_RECENT_TURNS=4, emits a Compaction AgentEvent so clients can observe the mechanism firing.
  * ToolResultOffloadMiddleware -- spills large tool outputs to disk and surfaces `read_tool_output` / `grep_tool_output` companion tools for the LLM to re-query truncated previews.
- OpenAiProvider: single shared ReactAgent, per-session ConversationMemory, streaming chat completions, Hikari-pooled JDBC data source; reads model and thresholds from KyuubiConf.
- ExecuteStatement (Scala): encodes all AgentEvents (including compaction and approval_request) as SSE JSON rows streamed through the JDBC reply column.
- KyuubiConf: new keys for LLM provider/api-url/model/api-key, approval mode, compaction trigger tokens, offload root/thresholds, max iterations, etc.
- Tests:
  * Unit tests for runtime, middlewares, offload store, and event shapes.
  * Live tests gated on DATA_AGENT_LLM_API_KEY covering full LLM round-trips: ReactAgentLiveTest (offload+grep, approval approve/deny), DataAgentE2ESuite and DataAgentApprovalE2ESuite (JDBC layer), DataAgentCompactionE2ESuite (JDBC-observable compaction event + post-compaction recovery), CompactionMiddlewareLiveTest.
  * Compatibility verified against qwen3.6-plus, glm-5, and kimi-k2.5 via per-call `model=` logging in ReactAgent.
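To make the compaction rule concrete, the following is a rough sketch of token-threshold-triggered summarization that keeps the most recent entries verbatim. Only the constant KEEP_RECENT_TURNS=4 comes from the commit message; the class name `CompactionSketch`, the treatment of one message string as one "turn", and the `summarizer` callback (which would be an LLM call in practice) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of threshold-triggered history compaction. */
final class CompactionSketch {
  // Value quoted in the commit message; everything else here is illustrative.
  static final int KEEP_RECENT_TURNS = 4;

  /**
   * Returns the history unchanged while promptTokens is below triggerTokens.
   * Otherwise replaces everything except the last KEEP_RECENT_TURNS entries
   * with a single summary entry produced by the summarizer.
   */
  static List<String> compactIfNeeded(
      List<String> history,
      long promptTokens,
      long triggerTokens,
      Function<List<String>, String> summarizer) {
    if (promptTokens < triggerTokens || history.size() <= KEEP_RECENT_TURNS) {
      return history;
    }
    int cut = history.size() - KEEP_RECENT_TURNS;
    List<String> compacted = new ArrayList<>();
    // Older entries collapse into one summary message.
    compacted.add("[summary] " + summarizer.apply(history.subList(0, cut)));
    // Recent entries are carried over verbatim.
    compacted.addAll(history.subList(cut, history.size()));
    return compacted;
  }
}
```

Keeping a fixed tail of recent turns verbatim means the model still sees its latest tool results in full after a compaction.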

MySQL Connector/J is GPL-licensed and cannot be bundled in an Apache binary release. Users who need the MySQL/StarRocks datasource at runtime should provide the driver jar themselves on the engine classpath. Addresses review feedback on apache#7417.

wangzhigang1999 · 2026-04-23T03:30:51Z

Evidence: runtime under real workload

Supplementary evidence for this PR. The benchmark harness itself is kept on a local branch (pr2c/...) and is not part of this PR — the numbers below are cited only to show that the runtime shipped here behaves correctly under a non-trivial text-to-SQL workload.

TL;DR

End-to-end robustness. 500/500 BIRD mini-dev questions completed, gen_ok=100%, zero framework-level failures (no crashes, no stuck sessions, no rate-limit fallout at concurrency=8).
Accuracy (EX=64.80%, F1=68.72% with kimi-k2.5) is included as context — this PR is not a model or accuracy deliverable.

Setup


Runtime under test	`ReactAgent` + `ToolResultOffloadMiddleware` + `CompactionMiddleware` (this PR)
Tools	`submit_sql`, `run_select_query`; `max_iterations=30`
Model	`kimi-k2.5` via dashscope OpenAI-compatible endpoint
Workload	BIRD mini-dev, 500 questions / 11 SQLite databases
Harness	internal, on local branch `pr2c/...` (not submitted)
Concurrency	8
Run	`benchmark-out/20260423-012636/`

Results

Overall: 500 questions, gen_ok=100%, EX=64.80%, F1=68.72%, avg 8.9 steps / 43.6k tokens / 42s per question.

By difficulty:

Difficulty	N	EX	F1	avg steps	avg tokens
simple	148	75.00%	78.33%	7.6	33k
moderate	250	62.80%	66.99%	9.1	46k
challenging	102	54.90%	59.01%	10.2	53k
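The per-difficulty rows can be cross-checked against the overall figures with a sample-weighted mean. The snippet below is only a consistency check on the numbers in the table above; the class and method names are made up for the example.

```java
/** Consistency check: overall metrics as sample-weighted means of the difficulty buckets. */
final class BirdAggregateCheck {
  /** Returns sum(counts[i] * values[i]) / sum(counts[i]). */
  static double weightedMean(int[] counts, double[] values) {
    double weighted = 0.0;
    int total = 0;
    for (int i = 0; i < counts.length; i++) {
      weighted += counts[i] * values[i];
      total += counts[i];
    }
    return weighted / total;
  }

  public static void main(String[] args) {
    int[] n = {148, 250, 102}; // simple, moderate, challenging
    System.out.printf("EX    %.2f%n", weightedMean(n, new double[] {75.00, 62.80, 54.90}));
    System.out.printf("F1    %.2f%n", weightedMean(n, new double[] {78.33, 66.99, 59.01}));
    System.out.printf("steps %.1f%n", weightedMean(n, new double[] {7.6, 9.1, 10.2}));
  }
}
```

The weighted means come out at about 64.80 (EX), 68.72 (F1), and 8.9 (steps), matching the overall line.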

By database (sorted by EX):

DB	N	EX	F1
superhero	52	84.62%	89.35%
european_football_2	51	76.47%	79.48%
student_club	48	72.92%	74.44%
toxicology	40	72.50%	73.68%
codebase_community	49	69.39%	74.17%
card_games	52	63.46%	58.27%
debit_card_specializing	30	60.00%	62.78%
formula_1	66	54.55%	62.13%
thrombosis_prediction	50	54.00%	59.60%
financial	32	50.00%	58.79%
california_schools	30	43.33%	54.29%

Cost: ~45 min wall time at concurrency=8, ~21M tokens total.

What this evidence supports

The runtime completes 500 non-trivial ReAct loops end-to-end without framework-level errors, at concurrency=8, against a live OpenAI-compatible provider. That's the load-bearing claim for this PR.
Difficulty scaling is monotonic (75% → 63% → 55%) — no pathological inversion that would suggest the runtime mishandles longer loops.

Scope disclaimers

Not a model benchmark. Numbers are bound to kimi-k2.5; other models will move them.
Not a leaderboard submission. Vanilla ReAct with the stock prompt; no schema-ranking, few-shot, or self-consistency. This is a framework viability floor, not SOTA.
Not a regression gate. LLM non-determinism is ±1pp run-to-run.
Harness not shipping with this PR. If you want to reproduce, ping me — the branch can be shared on request.

Follow-up: Spark backend run

Same harness, same 500 BIRD questions, but the agent targets a real EMR Kyuubi + Spark 3.5.3 cluster via jdbc:hive2://...:10009 instead of local SQLite. BIRD data was loaded from the SQLite files into Spark managed tables (Parquet on OSS-DLS) under 11 databases named bird_<db_id>.

Metric	SQLite (v2, middleware on)	Spark (v3)
Overall EX	64.80%	60.80%
F1	68.72%	59.70%
simple	75.00%	71.62%
moderate	62.80%	60.00%
challenging	54.90%	47.06%
gen_ok	100%	100%
BadRequestException	0	0
avg steps / q	8.9	11.9
avg tokens / q	44k	62k
wall-clock	~45 min	~82 min

This setup is materially stricter than the official BIRD evaluation. BIRD pins the target db_id per example and only measures SQL generation quality against a known schema. Here the agent is not told which database to target. More importantly, the Hive metastore on the cluster holds 23 databases (11 BIRD-loaded plus 12 unrelated production schemas: tpcds_*, parquet_db_*, orc_db_*, kyuubi_test, doctor_test, etc.). The agent has to pick the right bird_<db_id> out of that mixed list for every question, which BIRD explicitly does not measure. The 4pp EX drop (64.80% → 60.80%) is largely attributable to this multi-database disambiguation task. One observed failure mode: for a question about EUR/CZK currencies, the agent picked bird_financial instead of the correct bird_debit_card_specializing because the former name "looked more relevant" to the question's domain.

What this run confirms for the PR:

Runtime is datasource-agnostic. No changes to ReactAgent, no changes to any middleware, no changes to the provider — only the harness swapped JDBC URLs and plumbed a dialectName through the prompt builder.
Middleware scales to heavier prompts. Spark's richer schema-discovery round-trips push avg per-question prompt size from 44k to 62k tokens, yet zero BadRequest / zero framework errors, same as the SQLite run. The 128k compaction threshold still provides the load-bearing guard even though it happens not to trigger on any single question at this size.
Difficulty scaling stays monotonic (72% → 60% → 47%), same qualitative shape as SQLite — the runtime does not degrade non-linearly on harder questions when swapping backends.

wangzhigang1999 · 2026-04-23T05:43:13Z

ReactAgent Execution Flow

ReactAgent.run(request, memory, eventConsumer)
│
├─ memory.addUserMessage(userInput)
├─ dispatchAgentStart  /  emit(AgentStart)
│
├─ for step in 1..maxIterations:
│    ├─ emit(StepStart)
│    │
│    ├─ messages = memory.buildLlmMessages()
│    ├─ messages = middleware.beforeLlmCall(messages)      ← may rewrite or abort the call
│    │
│    ├─ streamLlmResponse(messages)                        ← streaming + chunk accumulation
│    │     └─ emit(ContentDelta)*   (one per token)
│    │
│    ├─ emit(ContentComplete)
│    ├─ memory.addAssistantMessage(...)
│    ├─ middleware.afterLlmCall(...)
│    │
│    ├─ if no toolCalls → emit(StepEnd) + emit(AgentFinish) + return   ✅ normal termination
│    │
│    └─ executeToolCalls (3-phase pipeline):
│         ├─ Phase 1 (serial)   : parse args → beforeToolCall → approval gate
│         ├─ Phase 2 (parallel) : toolRegistry.submitTool(...) → futures
│         └─ Phase 3 (serial)   : future.join()
│                                 → afterToolCall (may rewrite result, e.g. offload)
│                                 → memory.addToolResult(...)
│                                 → emit(ToolResult)
│
├─ exceeded maxIterations → emit(AgentError) + emit(AgentFinish)
├─ exception thrown       → emit(AgentError) + emit(AgentFinish)
└─ finally: dispatchAgentFinish   ← guarantees middleware cleanup
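The flow above can be approximated in a few dozen lines. This is a deliberately reduced sketch, not the PR's ReactAgent: streaming, middleware hooks, and Phase 1 (argument parsing and the approval gate) are left out, events are plain strings, and `ReactLoopSketch`, `LlmReply`, and the `llm` / `tool` callbacks are hypothetical names. The StepEnd emitted after tool results is also an assumption, since the diagram only shows StepEnd on the terminating step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hypothetical, heavily simplified ReAct loop; names do not mirror the PR's classes. */
final class ReactLoopSketch {
  /** One LLM reply: text plus zero or more requested tool calls (inputs as plain strings). */
  static final class LlmReply {
    final String content;
    final List<String> toolCalls;

    LlmReply(String content, List<String> toolCalls) {
      this.content = content;
      this.toolCalls = toolCalls;
    }
  }

  /** Runs the loop and returns the emitted event names in order. */
  static List<String> run(
      String userInput,
      int maxIterations,
      Function<List<String>, LlmReply> llm,
      Function<String, String> tool) {
    List<String> events = new ArrayList<>();
    List<String> memory = new ArrayList<>();
    memory.add("user: " + userInput);
    events.add("AgentStart");
    try {
      for (int step = 1; step <= maxIterations; step++) {
        events.add("StepStart");
        LlmReply reply = llm.apply(memory);
        events.add("ContentComplete");
        memory.add("assistant: " + reply.content);
        if (reply.toolCalls.isEmpty()) {
          events.add("StepEnd");
          return events; // normal termination; finally still appends AgentFinish
        }
        // Phase 2 (parallel): submit every tool call before waiting on any of them.
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (String call : reply.toolCalls) {
          futures.add(CompletableFuture.supplyAsync(() -> tool.apply(call)));
        }
        // Phase 3 (serial): join in request order so memory stays deterministic.
        for (CompletableFuture<String> future : futures) {
          memory.add("tool: " + future.join());
          events.add("ToolResult");
        }
        events.add("StepEnd"); // assumed; not shown in the diagram for non-final steps
      }
      events.add("AgentError"); // turn budget exceeded
    } catch (RuntimeException e) {
      events.add("AgentError");
    } finally {
      events.add("AgentFinish");
    }
    return events;
  }
}
```

Submitting in parallel but joining serially lets slow tools overlap while keeping the order of results written to memory stable.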

wangzhigang1999 · 2026-04-23T06:38:39Z

Hi @pan3793, when you have time, could I ask for a review on this one? 🙏

Third PR of the Data Agent Engine series (umbrella #7379, labeled 2b/4) — adds the ReactAgent runtime, middleware stack (logging / approval / compaction / tool-output offload), and OpenAiProvider. Sits on top of 2a (#7400) and is consumed by the final REST-layer PR.

It's on the larger side (~5.3k lines, almost all under externals/kyuubi-data-agent-engine/), so a high-level pass on the agent/middleware shape and session lifecycle is more than enough — happy to iterate on line-level feedback after. ReactAgentLiveTest exercises the full loop end-to-end if easier to poke at locally (needs DATA_AGENT_LLM_API_KEY / DATA_AGENT_LLM_API_URL / DATA_AGENT_LLM_MODEL).

No rush — thanks!

… drop SQLite/PostgreSQL bundle, pin kotlin/okhttp/okio

Two pom-level cleanups requested in apache#7417 review:

1. Drop sqlite-jdbc and postgresql JDBC drivers from the binary bundle. sqlite-jdbc moves to test scope (still needed for unit tests); postgresql is no longer declared. Users targeting those databases provide the driver jar on the engine classpath the same way they do for any other JDBC source. Trims ~14 MB from the bundled tgz.
2. Pin the kotlin runtime and okhttp/okio versions transitively introduced by openai-java in the data-agent module's pom, so any drift across openai-java upgrades becomes a deliberate change rather than a silent transitive shift. Versions pinned at the values openai-java currently resolves to (kotlin-stdlib* 1.8.0, kotlin-reflect 2.0.21, okhttp 4.12.0, okio 3.6.0); the dependency tree is identical to before.

Addresses review feedback on apache#7417.

…ename OpenAiProvider to ChatCompletionProvider

The previous config namespace and provider class name were ambiguous: both implied an LLM/vendor identity (OpenAI) when in practice they denote the OpenAI-compatible chat-completion protocol that virtually every modern LLM endpoint speaks. Reviewers asked for vendor-neutral naming aligned with Trino's ai.* function configuration style.

Config keys:
  kyuubi.engine.data.agent.llm.api.key -> openai.api.key
  kyuubi.engine.data.agent.llm.api.url -> openai.endpoint
  kyuubi.engine.data.agent.llm.model -> model

Provider class:
  org.apache.kyuubi.engine.dataagent.provider.openai.OpenAiProvider
    -> org.apache.kyuubi.engine.dataagent.provider.chatcompletion.ChatCompletionProvider

Env vars consumed by tests/E2E suites and the regenerated settings.md follow the same renames.

Addresses review feedback on apache#7417.

…ialect class names

Reviewer asked for proper acronym casing in class names. Rename:
  SqliteDialect -> SQLiteDialect
  MysqlDialect -> MySQLDialect
and update test method names that embed the same tokens (testSqlite*, testMysql*, testDatasourceSqlite, testDatasourceMysql).

Addresses review feedback on apache#7417.

…it sentinel actions in AgentMiddleware

Reviewer pushed back on null propagation in AgentMiddleware return types. Apply the same sealed-style pattern uniformly across the three hooks that historically used null to mean "do nothing":

  beforeLlmCall  -> LlmCallAction    { LlmNoopAction | LlmSkip | LlmModifyMessages }
  beforeToolCall -> ToolCallAction   { ToolCallApproval | ToolCallDenial }
  afterToolCall  -> ToolResultAction { ToolResultUnchanged | ToolResultReplace }

Each base type is non-instantiable, the no-op subtype is a singleton (*.INSTANCE), and the active subtype carries its payload. Defaults and all built-in middleware (Logging, Approval, Compaction, ToolResultOffload) return the appropriate sentinel; the ReactAgent dispatchers switch from null checks to instanceof checks. Tests assert on the singleton or on instanceof + cast to read the payload. No behavior change; the goal is just to remove null from the contract.

Addresses review feedback on apache#7417.

…ter in AgentTool.execute

Reviewer asked for the per-invocation context to come first so the parameter list reads context-then-payload, matching the conventional shape of "function(context, args)" used elsewhere in the codebase. Update the AgentTool interface signature, all production tool implementations (ReadToolOutputTool, GrepToolOutputTool, RunSelectQueryTool, RunMutationQueryTool), the ToolRegistry call site, and every test/test-helper call site that exercises tool.execute(...).

Addresses review feedback on apache#7417.

…n types under Decision<T>

Collapse the three sealed action hierarchies (LlmCallAction, ToolCallAction, ToolResultAction) plus nullable onEvent into a single generic Decision<T> with proceed / replace / abort. Pack tool-call (id, name, args) into ToolInvocation so beforeToolCall can rewrite args (e.g. inject SQL LIMIT, redact params), and align afterLlmCall by moving its dispatch ahead of the memory write so replace actually rewrites what enters memory and tool-call extraction.
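A three-way decision type of this kind might look roughly like the sketch below. The commit message only names Decision<T> and its proceed / replace / abort outcomes; the class name `DecisionSketch`, the field layout, the `reason` string, and the `fold` helper are assumptions for illustration, with hooks modeled as plain functions.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical three-way decision; only the proceed/replace/abort idea is from the PR. */
final class DecisionSketch<T> {
  enum Kind { PROCEED, REPLACE, ABORT }

  final Kind kind;
  final T value;       // set only for REPLACE
  final String reason; // set only for ABORT

  private DecisionSketch(Kind kind, T value, String reason) {
    this.kind = kind;
    this.value = value;
    this.reason = reason;
  }

  static <T> DecisionSketch<T> proceed() {
    return new DecisionSketch<>(Kind.PROCEED, null, null);
  }

  static <T> DecisionSketch<T> replace(T value) {
    return new DecisionSketch<>(Kind.REPLACE, value, null);
  }

  static <T> DecisionSketch<T> abort(String reason) {
    return new DecisionSketch<>(Kind.ABORT, null, reason);
  }

  /**
   * Folds hooks in order: PROCEED keeps the current value, REPLACE swaps it for
   * later hooks, and ABORT short-circuits the remaining hooks.
   */
  static <T> DecisionSketch<T> fold(T input, List<Function<T, DecisionSketch<T>>> hooks) {
    T current = input;
    boolean replaced = false;
    for (Function<T, DecisionSketch<T>> hook : hooks) {
      DecisionSketch<T> decision = hook.apply(current);
      if (decision.kind == Kind.ABORT) {
        return decision;
      }
      if (decision.kind == Kind.REPLACE) {
        current = decision.value;
        replaced = true;
      }
    }
    return replaced ? replace(current) : DecisionSketch.<T>proceed();
  }
}
```

With this shape, a hook that injects a SQL LIMIT returns `replace(sql + " LIMIT 100")`, a pass-through hook returns `proceed()`, and neither needs null to signal "do nothing".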

…ATE in approval live test

testApprovalApproveFlow asked the model to increment a counter and return the new value, but UPDATE returns no value, so weaker models (e.g. kimi-k2.5) hallucinated "0" instead of running a follow-up SELECT. Make the instruction explicit so behavior converges across models.

…lient + composite MiddlewareDispatcher

ReactAgent had grown to mix three concerns: the ReAct control loop, OpenAI streaming/chunk assembly, and middleware fold logic. Extract:

- LlmStreamClient: owns one streaming chat completion call, accumulates content + tool-call deltas, and exposes StreamResult.toAssistantMessage for SDK message construction. Depends only on the OpenAI SDK and AgentRunContext (emits ContentDelta via ctx.emit, no dispatcher reference).
- MiddlewareDispatcher: implements AgentMiddleware as a composite over the configured list. ReactAgent calls onAgentStart / onEvent / beforeLlmCall etc. on the composite the same way it would call any middleware; resolveApproval stays as a non-interface accessor for the approval flow's special case.

Also: afterToolCall now returns Decision<String> for symmetry with the other interceptor hooks; ABORT marks ToolResult.isError=true so listeners can distinguish a middleware-vetoed result from a successful one.

The emit-then-forward step splits cleanly: the composite runs onEvent, and ReactAgent's ctx.setEventEmitter lambda forwards the filtered event to the user's raw consumer. ReactAgent's run() drops the eventConsumer parameter threading through internal helpers; everywhere downstream uses ctx.emit().

wangzhigang1999 · 2026-05-02T07:25:59Z

Thanks @pan3793 — pushed followup commits addressing all the feedback (deps & bundling, Trino-style config keys, ChatCompletionProvider rename, SQLite/MySQL capitalization, null-as-noop → Decision sentinel, ToolContext first parameter).

Also folded in some internal cleanup: split ReactAgent into LlmStreamClient + a composite MiddlewareDispatcher. The dispatcher itself implements AgentMiddleware, so the agent treats the whole pipeline as one middleware. 1aac6aa

                 ┌──────────────────────────────────────┐
                 │            ReactAgent                │
                 │      (ReAct control loop only)       │
                 └──────────────┬───────────────────────┘
                                │
              ┌─────────────────┴─────────────────┐
              ▼                                   ▼
    ┌──────────────────┐          ┌────────────────────────────┐
    │ LlmStreamClient  │          │   MiddlewareDispatcher     │
    │ one streaming    │          │   implements AgentMiddleware
    │ chat completion  │          │   (Composite)              │
    │ + chunk assembly │          └─────────────┬──────────────┘
    └──────────────────┘                        │
                                  ┌─────────────┼─────────────┐
                                  ▼             ▼             ▼
                              Logging       Approval     Compaction  ...
                              Middleware    Middleware   Middleware
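The composite arrangement in the diagram can be sketched as follows. This is an illustration under assumptions, not the PR's MiddlewareDispatcher: the two-hook `MiddlewareSketch` interface, the string-typed prompt, and the `TagMiddleware` helper are invented for the example, and the real AgentMiddleware has more hooks.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical two-hook middleware contract; the PR's AgentMiddleware has more hooks. */
interface MiddlewareSketch {
  default void onAgentStart(List<String> trace) {}

  default String beforeLlmCall(String prompt) {
    return prompt;
  }
}

/** Composite: implements the same interface and fans out to its delegates in order. */
final class CompositeMiddlewareSketch implements MiddlewareSketch {
  private final List<MiddlewareSketch> delegates;

  CompositeMiddlewareSketch(List<MiddlewareSketch> delegates) {
    this.delegates = new ArrayList<>(delegates);
  }

  @Override
  public void onAgentStart(List<String> trace) {
    for (MiddlewareSketch delegate : delegates) {
      delegate.onAgentStart(trace);
    }
  }

  @Override
  public String beforeLlmCall(String prompt) {
    // Each delegate sees the output of the previous one.
    String current = prompt;
    for (MiddlewareSketch delegate : delegates) {
      current = delegate.beforeLlmCall(current);
    }
    return current;
  }
}

/** Toy middleware used only to make the call order visible. */
final class TagMiddleware implements MiddlewareSketch {
  private final String tag;

  TagMiddleware(String tag) {
    this.tag = tag;
  }

  @Override
  public void onAgentStart(List<String> trace) {
    trace.add("start:" + tag);
  }

  @Override
  public String beforeLlmCall(String prompt) {
    return prompt + "|" + tag;
  }
}
```

Because the composite is itself a middleware, the caller holds a single `MiddlewareSketch` reference and never iterates the list, and composites can nest without the caller noticing.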

pan3793 · 2026-05-04T13:55:03Z

thanks, merged to master

github-actions bot added the kind:documentation, module:common, and kind:build labels on Apr 22, 2026

wangzhigang1999 force-pushed the pr2b/data-agent-runtime branch from c534fd7 to 3011909 on April 22, 2026 16:15

pan3793 reviewed Apr 23, 2026

Comment thread externals/kyuubi-data-agent-engine/pom.xml

wangzhigang1999 force-pushed the pr2b/data-agent-runtime branch from 3011909 to ce4eecc on April 23, 2026 02:45

wangzhigang1999 marked this pull request as ready for review April 23, 2026 06:21