feat(agent): detect availability via session/new probe and assistant-first identity#500
Open
kaizhou-lab wants to merge 139 commits into
Open
feat(agent): detect availability via session/new probe and assistant-first identity#500kaizhou-lab wants to merge 139 commits into
kaizhou-lab wants to merge 139 commits into
Conversation
added 30 commits
June 15, 2026 17:57
The availability scheduler runs `try_connect_custom_agent` every 5 minutes for every agent, spawning a CLI subprocess and tearing it down once the ACP handshake completes (or fails). For wrapper CLIs that fork a long-lived grandchild — `npm exec openclaw --acp` is the production case — cleanup was leaking the grandchild because: * `kill_on_drop(true)` on the tokio Command only signals the direct child (the npm exec wrapper), not its grandchild. * The probe relied on `drop(protocol)` for the success path and on no explicit cleanup for the handshake-fail path, so `proc.kill` was never called. * `CliAgentProcess::kill` itself short-circuited and returned Ok the moment the leader exited within the grace period — so even when callers did invoke it, no group-wide SIGKILL was sent. Result: dozens of zombie `openclaw-acp` processes accumulated per day under the 5-minute scheduler. Fix: 1. `CliAgentProcess::kill` now always issues a group-wide SIGKILL after the grace period, even when the leader has already exited. `force_kill` already maps ESRCH to success, so the sweep is idempotent for already-reaped trees. 2. `try_connect_custom_agent` calls `proc.kill` on every outcome (success, ACP failure, handshake timeout) by hoisting the spawn out of the inner future and running cleanup unconditionally after the timeout race resolves. 3. New regression test `probe_kills_grandchild_left_behind_by_wrapper` exercises the exact wrapper-grandchild shape from production and asserts the grandchild is reaped before the probe returns.
added 30 commits
June 18, 2026 16:55
list_management_rows now sets env to Vec::new() instead of copying meta.env, which contained merged override secrets. The management row already exposes has_command_override + env_override_key_count; the UI does not need plaintext values. The e2e test now asserts the management row's env is empty AND that the secret value "sk-x" does not appear anywhere in the management response body.
Added record_session_success to AgentAvailabilityFeedbackPort trait and AgentAvailabilityService. It persists a session-kind available snapshot, which clears needs_auth and updates last_success_at. Turn orchestrator now calls record_agent_session_success when send_message succeeds, mirroring the existing session failure path. Test: record_session_success_clears_needs_auth verifies that a needs_auth state set via session failure is cleared by session success.
Bare assistants for single-engine agents (e.g. Aion CLI) were generated with an empty agent_backend because their engine identity lives in agent_type, not the ACP-vendor backend column. An empty preset_agent_type made the frontend route aionrs assistants as ACP, dropping the top-level model and persisting a NULL model that later failed warmup with "Provider '' not found". Reconcile now falls back to agent_type when backend is empty. Also fix pre-existing test breakage left from the assistant-first migration: - add missing agent_status/team_selectable/deletable to the AssistantResponse camelCase rejection fixture - update channel default-settings e2e to expect the generated aionrs bare assistant binding instead of a null assistant - drop the obsolete agent.select persist integration test (direct agent selection is no longer supported; covered by the unknown-action unit test) - add missing last_check_error_details to the bare assistant test row
…/offline Align the backend agent status model with the simplified frontend semantics: a probe only verifies an ACP handshake (reachability), not authorization, so the misleading "available"/"needs_auth" verdicts are dropped in favor of plain online/offline. - agent_discovery: rename AgentManagementStatus and AgentSnapshotCheckStatus variants Available/Unavailable/NeedsAuth -> Online/Offline (serde snake_case). - availability: simplify snapshot persistence — record_session_failure always yields offline; drop the success-recording path and auth detection. - registry: parse "online"/"offline"; derive management status as Offline -> Offline else Online. - turn_orchestrator: always report session_send_failed on send error; no success recording. - assistant/service + tests: migrate status values to online/offline.
…iders Deepen the agent health probe from `initialize` to `session/new` so it reflects real usability, not just protocol reachability — `initialize` returns authMethods even for authorized agents and cannot tell apart "reachable but not signed in". - custom_agent_probe: after `initialize`, open a throwaway `session/new` (no prompt); classify the outcome as Ok / Auth (ACP auth_required, JSON-RPC -32000) / Fail. Applies to both the custom and builtin-managed probe paths. - api-types: add `TryConnectCustomAgentResponse::FailAuth` (tag `fail_auth`). - availability: map FailAuth → offline + `auth_required` code; gate aionrs (built-in agent, no external CLI) availability on having at least one enabled model provider, mirroring AssistantService::resolve_default_agent_type — offline + `no_provider` otherwise. - custom: accept test-on-save when the agent is reachable but auth_required (a valid agent the user just hasn't logged into yet). - registry: add guidance for auth_required and no_provider error codes. - The background scheduler shares run_probe, so periodic checks reflect the same session/new-based status.
…ficeAI/AionCore into feat/agent-connection-testing-phase2 * 'feat/agent-connection-testing-phase2' of github.com:iOfficeAI/AionCore:
…ackend
An assistant's agent_status was matched to its agent row by `backend`
only. aionrs (the built-in Rust agent) has a NULL `backend` and is keyed
by `agent_type` ("aionrs"), so every aionrs-backed assistant failed to
resolve a row and was mislabelled Missing/unavailable.
Match the agent row on `backend == effective_backend` OR
`agent_type.serde_name() == effective_backend`, so aionrs assistants
resolve to the real aionrs row and reflect its actual status.
Add a regression test covering an aionrs assistant (row with NULL backend,
agent_type Aionrs, Online) resolving to Online instead of Missing.
…-testing-phase2 # Conflicts: # crates/aionui-conversation/src/service.rs
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds accurate agent availability detection and completes the assistant-first identity migration across the agent, team, conversation, channel, and cron modules.
Agent availability detection
session/newto verify an agent can establish a session, instead of inferring reachability.needs_auth/ auth-required from offline/unreachable, and clearneeds_authon session success.online/offlinereporting plus an explicit auth-required signal.aionrsavailability on the resolved provider configuration rather than backend identity alone.agent_statusbyagent_type(with backend fallback) so bare/built-in assistants report the correct status.agent_typewhen backend is null.Assistant-first identity migration
TeamRunfor assistant-first creation and close the assistant-first guide handoff; reword guide prompts and MCP tools around assistants.Fixes
Test
just pushgreen: 6475 tests passed, lint/clippy and fmt clean.Closes #499