Skip to content

feat(agent): detect availability via session/new probe and assistant-first identity#500

Open
kaizhou-lab wants to merge 139 commits into
mainfrom
feat/agent-connection-testing-phase2
Open

feat(agent): detect availability via session/new probe and assistant-first identity#500
kaizhou-lab wants to merge 139 commits into
mainfrom
feat/agent-connection-testing-phase2

Conversation

@kaizhou-lab

Copy link
Copy Markdown
Contributor

Summary

Adds accurate agent availability detection and completes the assistant-first identity migration across the agent, team, conversation, channel, and cron modules.

Agent availability detection

  • Actively probe session/new to verify an agent can establish a session, instead of inferring reachability.
  • Distinguish needs_auth / auth-required from offline/unreachable, and clear needs_auth on session success.
  • Collapse the prior available/unavailable/needs_auth model into clearer online/offline reporting plus an explicit auth-required signal.
  • Gate aionrs availability on the resolved provider configuration rather than backend identity alone.
  • Resolve assistant agent_status by agent_type (with backend fallback) so bare/built-in assistants report the correct status.
  • Reap the probe process tree / kill the probe process group to stop orphaned wrapper grandchild leaks.
  • Expose a backend logo catalog endpoint and key the logo catalog by agent_type when backend is null.

Assistant-first identity migration

  • Persist and resolve assistant identity across team, conversation, channel, and cron flows; derive backends, avatars, names, and runtime types from assistants.
  • Reject legacy backend/custom-agent/preset id fields in public write DTOs; canonicalize legacy ids on read.
  • Bootstrap TeamRun for assistant-first creation and close the assistant-first guide handoff; reword guide prompts and MCP tools around assistants.
  • Prefer assistant runtime seeds and persisted ACP session identity when creating conversations.

Fixes

  • Resolve cron/assistant backends from snapshots and prefer runtime backend over stale extras.
  • Structure assistant-first errors for i18n.

Test

just push green: 6475 tests passed, lint/clippy and fmt clean.

Closes #499

zk added 30 commits June 15, 2026 17:57
The availability scheduler runs `try_connect_custom_agent` every 5
minutes for every agent, spawning a CLI subprocess and tearing it
down once the ACP handshake completes (or fails). For wrapper CLIs
that fork a long-lived grandchild — `npm exec openclaw --acp` is the
production case — cleanup was leaking the grandchild because:

* `kill_on_drop(true)` on the tokio Command only signals the direct
  child (the npm exec wrapper), not its grandchild.
* The probe relied on `drop(protocol)` for the success path and on
  no explicit cleanup for the handshake-fail path, so `proc.kill`
  was never called.
* `CliAgentProcess::kill` itself short-circuited and returned Ok the
  moment the leader exited within the grace period — so even when
  callers did invoke it, no group-wide SIGKILL was sent.

Result: dozens of zombie `openclaw-acp` processes accumulated per
day under the 5-minute scheduler.

Fix:

1. `CliAgentProcess::kill` now always issues a group-wide SIGKILL
   after the grace period, even when the leader has already exited.
   `force_kill` already maps ESRCH to success, so the sweep is
   idempotent for already-reaped trees.
2. `try_connect_custom_agent` calls `proc.kill` on every outcome
   (success, ACP failure, handshake timeout) by hoisting the spawn
   out of the inner future and running cleanup unconditionally
   after the timeout race resolves.
3. New regression test `probe_kills_grandchild_left_behind_by_wrapper`
   exercises the exact wrapper-grandchild shape from production and
   asserts the grandchild is reaped before the probe returns.
zk added 30 commits June 18, 2026 16:55
list_management_rows now sets env to Vec::new() instead of copying
meta.env, which contained merged override secrets. The management
row already exposes has_command_override + env_override_key_count;
the UI does not need plaintext values.

The e2e test now asserts the management row's env is empty AND
that the secret value "sk-x" does not appear anywhere in the
management response body.
Added record_session_success to AgentAvailabilityFeedbackPort trait
and AgentAvailabilityService. It persists a session-kind available
snapshot, which clears needs_auth and updates last_success_at.

Turn orchestrator now calls record_agent_session_success when
send_message succeeds, mirroring the existing session failure path.

Test: record_session_success_clears_needs_auth verifies that a
needs_auth state set via session failure is cleared by session success.
Bare assistants for single-engine agents (e.g. Aion CLI) were generated
with an empty agent_backend because their engine identity lives in
agent_type, not the ACP-vendor backend column. An empty preset_agent_type
made the frontend route aionrs assistants as ACP, dropping the top-level
model and persisting a NULL model that later failed warmup with
"Provider '' not found". Reconcile now falls back to agent_type when
backend is empty.

Also fix pre-existing test breakage left from the assistant-first
migration:
- add missing agent_status/team_selectable/deletable to the
  AssistantResponse camelCase rejection fixture
- update channel default-settings e2e to expect the generated aionrs
  bare assistant binding instead of a null assistant
- drop the obsolete agent.select persist integration test (direct agent
  selection is no longer supported; covered by the unknown-action unit test)
- add missing last_check_error_details to the bare assistant test row
…/offline

Align the backend agent status model with the simplified frontend semantics:
a probe only verifies an ACP handshake (reachability), not authorization, so
the misleading "available"/"needs_auth" verdicts are dropped in favor of plain
online/offline.

- agent_discovery: rename AgentManagementStatus and AgentSnapshotCheckStatus
  variants Available/Unavailable/NeedsAuth -> Online/Offline (serde snake_case).
- availability: simplify snapshot persistence — record_session_failure always
  yields offline; drop the success-recording path and auth detection.
- registry: parse "online"/"offline"; derive management status as Offline ->
  Offline else Online.
- turn_orchestrator: always report session_send_failed on send error; no
  success recording.
- assistant/service + tests: migrate status values to online/offline.
…iders

Deepen the agent health probe from `initialize` to `session/new` so it
reflects real usability, not just protocol reachability — `initialize`
returns authMethods even for authorized agents and cannot tell apart
"reachable but not signed in".

- custom_agent_probe: after `initialize`, open a throwaway `session/new`
  (no prompt); classify the outcome as Ok / Auth (ACP auth_required,
  JSON-RPC -32000) / Fail. Applies to both the custom and builtin-managed
  probe paths.
- api-types: add `TryConnectCustomAgentResponse::FailAuth` (tag `fail_auth`).
- availability: map FailAuth → offline + `auth_required` code; gate aionrs
  (built-in agent, no external CLI) availability on having at least one
  enabled model provider, mirroring AssistantService::resolve_default_agent_type
  — offline + `no_provider` otherwise.
- custom: accept test-on-save when the agent is reachable but auth_required
  (a valid agent the user just hasn't logged into yet).
- registry: add guidance for auth_required and no_provider error codes.
- The background scheduler shares run_probe, so periodic checks reflect the
  same session/new-based status.
…ficeAI/AionCore into feat/agent-connection-testing-phase2

* 'feat/agent-connection-testing-phase2' of github.com:iOfficeAI/AionCore:
…ackend

An assistant's agent_status was matched to its agent row by `backend`
only. aionrs (the built-in Rust agent) has a NULL `backend` and is keyed
by `agent_type` ("aionrs"), so every aionrs-backed assistant failed to
resolve a row and was mislabelled Missing/unavailable.

Match the agent row on `backend == effective_backend` OR
`agent_type.serde_name() == effective_backend`, so aionrs assistants
resolve to the real aionrs row and reflect its actual status.

Add a regression test covering an aionrs assistant (row with NULL backend,
agent_type Aionrs, Online) resolving to Online instead of Missing.
…-testing-phase2

# Conflicts:
#	crates/aionui-conversation/src/service.rs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Detect agent availability: probe session/new, recognize auth_required, gate aionrs by provider

1 participant